Communication server device, communication equipment and its operation method

By constructing a text database and vector space model, and utilizing similarity metrics and corpus frequencies, the accuracy problem in text unit decoding is solved, improving the accuracy and efficiency of text data processing. This approach is suitable for text decoding and translation applications on handheld communication devices.

CN113826102B · Active · Publication Date: 2026-06-30 · GRABTAXI HOLDINGS PTE LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GRABTAXI HOLDINGS PTE LTD
Filing Date
2019-05-15
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies have low accuracy when processing abbreviated units in text data and are prone to false positives and false negatives, especially in text-based communication, which reduces the efficiency and accuracy of subsequent data processing.

Method used

By constructing a text database and a vector space model, and using similarity metrics and heuristic models, abbreviated text units are compared with candidate text units. Candidate text units that have an ordered relationship with the abbreviated text unit are selected as de-abbreviated units. Combining corpus frequency and orthographic similarity improves decoding accuracy.

Benefits of technology

It achieves high-accuracy decoding of text units, reduces false positives and false negatives, improves the efficiency and accuracy of subsequent data processing, and supports applications such as automatic translation and text analysis.

✦ Generated by Eureka AI based on patent content.

Abstract

This disclosure relates to a communication server apparatus, a communication device, and a method of operating the same. A communication server apparatus (100) is configured to receive (202) text data comprising at least one text data element associated with an abbreviated text unit. The text data element is compared (204) with a plurality of candidate text data elements from a given text database, each candidate text data element being associated with a corresponding candidate text unit in the database. A similarity metric between the at least one text data element and these candidate text data elements is determined (206), and the candidate text data elements are processed (208) to select candidate text data elements that have an ordered relationship with the abbreviated text unit. These similarity metric values and the selection of these candidate text data elements are used (210) to designate the associated candidate text unit as the de-abbreviated text unit of the abbreviated text unit.

Description

Technical Field

[0001] This invention generally relates to the field of communications. One aspect of the invention relates to a communication server apparatus for processing text data to de-abbreviate text units. Another aspect of the invention relates to a communication device and system for processing text data to de-abbreviate text units. Further aspects of the invention relate to a method for processing text data to de-abbreviate text units, and a computer program and computer program product including instructions for implementing the method.

Background Technology

[0002] Data processing in communication systems is well-known in the art. One example of data processing used in communication systems is the processing of data and information that facilitates text-based communication rather than audio-based communication. Previously considered techniques have addressed the problem of processing text data to enable communication systems to operate as efficiently as possible while minimizing bandwidth usage and computational processing.

[0003] Some of these techniques address the problem of text units (such as words) in text-based communication by processing data elements of text data. For example, some previously considered text data processing techniques have attempted to determine whether text units appearing in text-based communication conform to the canonical forms of text corpora, databases, or dictionaries. Other techniques have addressed the problem of determining whether non-canonical text units can be converted into canonical forms.

[0004] However, these previously considered methods typically either employ rudimentary techniques to compare data from non-canonical and canonical text units, or provide techniques that are highly complex yet prone to false positives and false negatives. The problem is particularly challenging when dealing with abbreviated text units, such as the abbreviations of commonly used words in text-based communication.

Summary of the Invention

[0005] The various aspects of the invention are set forth in the independent claims. Some optional features are defined in the dependent claims.

[0006] The implementation of the techniques disclosed herein can provide significant technical advantages. For example, much higher accuracy can be achieved when decoding abbreviated text units or determining their correct or canonical text units or words in text data from text-based communications.

[0007] In at least some embodiments, the techniques disclosed herein allow text units that would otherwise be undecipherable, or at least too difficult to handle with previously considered techniques, to be decoded or de-abbreviated without unacceptable levels of false positives / false negatives. Furthermore, these techniques enable greater accuracy and efficiency in any subsequent data processing, such as text analysis to enhance user interface features or other features of communication devices, compression or grouping of communications, text translation, etc.

[0008] In exemplary embodiments, the functionality of the techniques disclosed herein can be implemented in software running on a handheld communication device (such as a mobile phone). The software implementing this functionality can be contained in an "app" (a computer program or computer program product) that a user has downloaded from an online store. When running, for example, on a user's mobile phone, the app can use the hardware features of the phone to implement the functionality: for example, using the phone's transceiver components to establish a secure communication channel for receiving text-based communications, and using the phone's processor(s) to determine candidate text for abbreviated text units in text data.

Brief Description of the Drawings

[0009] The invention will now be described by way of example only, with reference to the accompanying drawings, in which:

[0010] Figure 1 is a schematic block diagram illustrating a first exemplary communication system for processing text data to de-abbreviate text units;

[0011] Figure 2 is a flowchart illustrating the steps of an exemplary method for processing text data;

[0012] Figure 3 is a schematic diagram illustrating the processing of text data elements and their associated text units;

[0013] Figure 4 is a schematic diagram illustrating text data records and examples of processing these records; and

[0014] Figure 5 is a flowchart illustrating the steps of an exemplary method for processing text data.

Detailed Description

[0015] Referring first to Figure 1, a communication system 100 is illustrated. The communication system 100 includes a communication server device 102, a service provider communication device 104, and a user communication device 106. These devices are connected to a communication network 108 (e.g., the Internet) via corresponding communication links 110, 112, and 114 implementing, for example, Internet communication protocols. Communication devices 104 and 106 are capable of communicating via other communication networks (such as the public switched telephone network (PSTN), including mobile cellular communication networks), but for clarity these communication networks are omitted from Figure 1.

[0016] The communication server device 102 may be a single server as illustrated in Figure 1, or may have the functionality performed by the server device 102 distributed across multiple server components. In the example of Figure 1, the communication server device 102 may include multiple individual components, including but not limited to: one or more microprocessors 116, and memory 118 (e.g., volatile memory, such as RAM) for loading executable instructions 120 that define the functions performed by the server device 102 under the control of the processor 116. The communication server device 102 also includes an input / output module 122 that allows the server to communicate via the communication network 108. A user interface 124 is provided for user control and may include, for example, conventional peripheral computing devices such as a display monitor, computer keyboard, etc. The server device 102 also includes a database 126, the purpose of which will become more apparent from the following discussion.

[0017] Service provider communication device 104 may include multiple individual components, including but not limited to: one or more microprocessors 128, and memory 130 (e.g., volatile memory, such as RAM) for loading executable instructions 132 that define the functions performed by service provider communication device 104 under the control of processor 128. Service provider communication device 104 also includes input / output modules 134 that allow service provider communication device 104 to communicate via communication network 108. User interface 136 is provided for user control. If service provider communication device 104 is, for example, a smartphone or tablet device, then user interface 136 will have a touch panel display, which is common in many smartphones and other handheld devices. Alternatively, if service provider communication device is, for example, a conventional desktop or laptop computer, then user interface may have, for example, conventional peripheral computing devices, such as a display monitor, computer keyboard, etc. Service provider communication device may be, for example, a device managed by a text data processing service provider.

[0018] User communication device 106 may be, for example, a smartphone or tablet device with the same or similar hardware architecture as service provider communication device 104.

[0019] Figure 2 is a flowchart illustrating a method for processing text data to de-abbreviate text units. Figure 1, Figure 2 and the preceding description illustrate and describe a communication server apparatus 102 for processing text data to de-abbreviate text units. This communication server apparatus includes a processor 116 and a memory 118. The communication server apparatus 102 is configured, under the control of the processor 116, to execute instructions 120 stored in the memory 118 to: receive (202) text data comprising at least one text data element associated with an abbreviated text unit; compare (204) the at least one text data element with a plurality of candidate text data elements from a representation of a given text database, each candidate text data element being associated with a corresponding candidate text unit in the database; determine (206) the value of a similarity metric between the at least one text data element and these candidate text data elements; process (208) the candidate text data elements to select candidate text data elements that have an ordered relationship with the abbreviated text unit; and use (210) these similarity metric values and the selection of these candidate text data elements to designate the associated candidate text unit as the de-abbreviated text unit of the abbreviated text unit.

[0020] Furthermore, a method is provided for execution in a communication server apparatus 102 for processing text data to de-abbreviate text units, the method comprising, under the control of a processor 116 of the server apparatus, performing the following operations: receiving (202) text data including at least one text data element associated with an abbreviated text unit; comparing (204) the at least one text data element with a plurality of candidate text data elements from a given text database, each candidate text data element being associated with a corresponding candidate text unit in the database; determining (206) a value of a similarity metric between the at least one text data element and the candidate text data elements; processing (208) the candidate text data elements to select candidate text data elements with respect to associated candidate text units having an ordered relationship with the abbreviated text unit; and using (210) the similarity metric values and the selection of the candidate text elements to designate the associated candidate text unit as the de-abbreviated text unit of the abbreviated text unit.

[0021] Furthermore, a communication system for processing text data to de-abbreviate text units is also provided. This communication system includes a communication server device (102), at least one user communication device (106), and a communication network device (104, 108). The communication network device is operable to enable the communication server device and the at least one user communication device to establish communication with each other through it. The at least one user communication device (104, 106) includes a first processor and a first memory. The at least one user communication device is configured to execute first instructions stored in the first memory under the control of the first processor to: receive text data including at least one text data element associated with an abbreviated text unit. The communication server device (102) includes a second processor and a second memory. The communication server apparatus is configured to execute second instructions stored in the second memory under the control of the second processor, such that: (204) the at least one text data element is compared with a plurality of candidate text data elements from a given text database, each candidate text data element being associated with a corresponding candidate text unit in the database; (206) the value of a similarity metric between the at least one text data element and the candidate text data elements is determined; (208) the candidate text data elements are processed to select candidate text data elements that have an ordered relationship with the abbreviated text unit; and (210) the similarity metric values and the selection of the candidate text data elements are used to designate the associated candidate text unit as the de-abbreviated text unit of the abbreviated text unit.

[0022] As described above, the techniques described herein relate to processing text data to decode or de-abbreviate abbreviated text units (e.g., words) appearing in text-based communications or messages. De-abbreviation allows, for example, units or words to be interpreted through further processing steps (e.g., text analysis or translation), or to be displayed, for example, in interpreted, non-abbreviated form to a user of the communication device receiving the communication or message via a display device of the communication device.

[0023] Abbreviations of words or text units in communication messages appear in various contexts and media, but are particularly prevalent in text-based communication between users of computers and electronic devices, such as emails, text or SMS messages, and messages via social media platforms. For example, when typing a short message to send to a recipient on a handheld electronic device, users often type a simplified version of a word or phrase because they believe the simplified form will be clear to the recipient. For example:

[0024] • Officially recognized acronyms (e.g., UN = United Nations, USA = United States of America).

[0025] • Unofficially recognized but very common slang terms (e.g., 'lol' means 'laugh out loud'; 'how r u' means 'how are you?').

[0026] • Temporary abbreviations, in which the writer may not assume that the reader has seen the exact abbreviation, but rather that the reader will correctly reconstruct the original meaning anyway ('thks', 'thx', and 'thnks' are all recognizable versions of 'thanks').

[0027] Temporary abbreviations are particularly common in some languages. Typical patterns involve omitting certain characters, such as vowels: for example, the other forms of 'thanks' described above; in Indonesian, 'sy sdh smp' means 'saya sudah sampai' (I have arrived). Other patterns may include omitting diacritics in languages that use them: in Vietnamese, '5phut' means '5 phút' (five minutes).
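
The two omission patterns in the preceding paragraph can be illustrated with a minimal Python sketch. The helper names `strip_vowels` and `strip_diacritics` are hypothetical and do not appear in the disclosure, and treating only a, e, i, o, u as vowels is an assumption made for illustration.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: a simple Latin vowel set


def strip_vowels(word: str) -> str:
    """Drop vowels, mimicking abbreviations such as 'sudah' -> 'sdh'."""
    return "".join(ch for ch in word if ch.lower() not in VOWELS)


def strip_diacritics(word: str) -> str:
    """Drop combining marks, mimicking 'phút' -> 'phut'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print([strip_vowels(w) for w in "saya sudah sampai".split()])  # ['sy', 'sdh', 'smp']
print(strip_diacritics("phút"))  # phut
```

De-abbreviation is the inverse problem: many canonical words can collapse to the same abbreviated form, which is why the later stages rank and filter candidates.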

[0028] The techniques described herein involve processing text data and / or text units (words, syllables, morphemes, etc.) representing such messages or the data that forms their basis, in order to, for example, convert abbreviated forms of words to their unabbreviated forms, thereby mapping abbreviated or non-canonical input text to the correct canonical form.

[0029] As mentioned above, the techniques described herein offer technical advantages in the fields of data processing and communication, such as improved efficiency and higher accuracy for subsequent text data processing applications. They also allow users to interpret messages more easily. Other potential applications of these techniques include:

[0030] • Supporting automatic translation of text conversations between parties speaking different languages. For example, this can be used in ride-hailing apps between passengers and drivers speaking different languages to preprocess the input text and then pass the correctly transcribed form of the input text to a translation service such as Google Translate, so that the translation result can be sent to the recipient.

[0031] • Automatic correction within handheld electronic devices to display the correct, canonical form of the entered text. The entered text doesn't necessarily have to be transmitted forward to the recipient. For example, it can be used for personal notes.

[0032] In broad form, an example technique uses a combination of two or more of the following:

[0033] a. Compare abbreviated words or text units with similar words, such as words that may appear in similar contexts and / or words that are similar in vocabulary or orthography;

[0034] b. Identify potential candidates for the correct canonical word by finding those that have an ordered relationship with, or otherwise match, the abbreviation; and

[0035] c. Compare the abbreviation with reference data derived from a large number of reference works to identify commonly used words in the references as candidates.

[0036] For example, in its simplest form, step c. can be done by selecting candidate words / phrases that appear most frequently in a large number of reference works: for example, the words / phrases with the most instances in Wikipedia.

[0037] For step a., one option is to train a heuristic model on a text database and compare the abbreviated input text unit to the text database by comparing it with the modeling data. For example, in a vector space model of the text database (described in more detail below), the vector found for the input text unit can be compared with the neighbor vectors of canonical words in the modeling database, and a score can be derived for each candidate word / phrase. The most likely canonical form of the input text is then the word / phrase with the highest score.

[0038] An example of the similarity measurement in step a is as follows.

[0039] Suppose that the character substitutions of a text unit (a word in this case) W give candidate targets {W1, W2, ..., Wn}. That is, every Wi can be converted to W by deleting characters (e.g., we can specify that the deleted characters are vowels) or diacritics, so that after the deletion, comparing Wi with W returns a direct match. The similarity score sim(A, B) can then be used to select which Wi is most similar to the source word W. In other words, for each Wi, we compute sim(W, Wi) and select the Wi with the highest similarity score.

[0040] In one technique, multiple similar candidates can be selected for further processing (such as steps b. and / or c.). For example, candidates can be categorized or ranked by similarity score and processed in order of ranking, or only those candidates with a similarity score above a given threshold can be processed.
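
The selection just described (score each Wi, then keep the best, the top few, or those above a threshold) might be sketched as follows. The function names are hypothetical, and `placeholder_sim` uses Python's `difflib.SequenceMatcher` purely as a stand-in; the disclosure does not prescribe this similarity measure.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


def placeholder_sim(source: str, candidate: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not the patent's metric."""
    return SequenceMatcher(None, source, candidate).ratio()


def rank_candidates(
    source: str,
    candidates: List[str],
    sim: Callable[[str, str], float] = placeholder_sim,
    threshold: float = 0.0,
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score candidates, keep those at or above the threshold, best first."""
    scored = [(c, sim(source, c)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Candidate words are invented for illustration.
ranked = rank_candidates("thx", ["thanks", "taxi", "the"], top_k=2)
print(ranked)
```

Passing `sim` as a parameter keeps the ranking logic independent of whichever similarity measure (vector-based, orthographic, or a combination) is actually used.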

[0041] It should be noted that many such text similarity measures are known in this field. Some work by comparing the similarity of word distributions in a given text corpus or database.

[0042] One approach to this is to construct a vector space model of the text corpus. It is well known in the art that this can be done by representing the text corpus in a multidimensional space and counting the frequency of each word in the corpus to give vector values, where each word or text unit has a separate dimension. For any input text unit, a corresponding vector can be found in the vector space, and then a similarity metric between that vector and its neighboring vectors can be computed. For example, the cosine similarity (a measure of the angle between two vectors in the vector space) can be calculated.

[0043] Therefore, if a word frequently appears with another word in a given corpus (such as an abbreviation of that word with letters or diacritics removed), the corresponding vectors of the two words will have a high cosine similarity value in the vector space model.
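
As a minimal sketch of this idea, the following Python builds sparse count vectors from a toy corpus and compares them by cosine similarity. The corpus is invented, the function names are hypothetical, and counting co-occurrence within a message is only one simple way to populate the vectors; a production system would more likely use trained word embeddings.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_vectors(messages: List[str]) -> Dict[str, Counter]:
    """Map each word to counts of the other words sharing a message with it."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for message in messages:
        tokens = message.lower().split()
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, invented for illustration only.
corpus = [
    "thx for the ride",
    "thanks for the ride",
    "thx driver",
    "thanks driver",
    "where is the car",
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["thx"], vecs["thanks"]))  # identical contexts, so close to 1
print(cosine(vecs["thx"], vecs["car"]))     # little shared context, so lower
```

In this toy model an abbreviation and its canonical form score highly because they are used in the same surroundings, even though the two strings never need to be compared character by character.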

[0044] Another similarity measure can be to calculate simple orthographic or lexical similarity between text units; for example, whether the text units are similar in length, whether they have the same number of vowels and consonants, and so on.
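
A simple orthographic measure of this kind could look like the sketch below, which averages how well two words agree in length, vowel count and consonant count. The equal weighting, the vowel set and the function names are all assumptions made for illustration.

```python
from typing import Tuple

VOWELS = set("aeiou")  # assumption: a simple Latin vowel set


def letter_counts(word: str) -> Tuple[int, int, int]:
    """Return (length, vowel count, consonant count) over alphabetic characters."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    return len(letters), vowels, len(letters) - vowels


def agreement(a: int, b: int) -> float:
    """Agreement of two counts in [0, 1]; 1.0 when they are equal."""
    return 1.0 if a == b else min(a, b) / max(a, b)


def orthographic_similarity(a: str, b: str) -> float:
    """Unweighted mean of length, vowel-count and consonant-count agreement."""
    pairs = zip(letter_counts(a), letter_counts(b))
    return sum(agreement(x, y) for x, y in pairs) / 3


print(orthographic_similarity("thnks", "thanks"))
print(orthographic_similarity("thnks", "a"))
```

Note that this measure is symmetric and looks only at surface form, so on its own it cannot tell which of two words is the abbreviation.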

[0045] Figure 3 is a schematic diagram illustrating text data elements and their associated text units. Text-based message 302 contains the text string "Pls pickup…". To de-abbreviate text units or words in this text-based message, the message text (e.g., as displayed on the GUI of the user's device) can be converted into text data elements Ei (304). For example, this could be converting each text unit into a representative vector in a vector space model. The data element could also be a representation of the text unit as a lexical basis for orthographic comparison. Encoding the text unit into packet data for transmission could also provide suitable text data elements for comparison. The data elements associated with a given text unit can, of course, include more than one of the above; for example, the underlying text data element(s) processed for a given text unit could include representative vectors and data representing the text unit in compressed, encoded, transmitted, or other software element formats.

[0046] Within data element Ei, there exist one or more text data elements Ex (306) associated with the text unit "ppl" (308) from the message ("2ppl,2luggage…"). Therefore, data element Ei can be used in text data processing to find the de-abbreviated form of the text unit "ppl". In one example, the processing steps would involve finding the vector associated with "ppl" in a vector space model (trained on corpus text) and finding the vector's neighbors using cosine similarity.

[0047] For the technique in the example of step c. above, the frequency of occurrence of candidate text units in the text database is determined and that frequency is used to specify the associated candidate text units. For example, a Wikipedia corpus can be used, and the frequency of the input text unit in that corpus can be used to help determine which of several candidates (e.g., those recommended by the cosine similarity of the neighboring vectors of the vectors associated with the input text unit) is the best choice.
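
The frequency lookup can be sketched with a simple word counter over a reference corpus. The two-sentence corpus below is an invented stand-in for something like a Wikipedia dump, and the function name and tokenization rule are assumptions.

```python
import re
from collections import Counter
from typing import Iterable


def word_frequencies(documents: Iterable[str]) -> Counter:
    """Count lowercase word tokens across a reference corpus."""
    freq: Counter = Counter()
    for doc in documents:
        freq.update(re.findall(r"\w+", doc.lower()))
    return freq


# Tiny stand-in for a reference corpus; the text is invented.
reference = ["People like people.", "A pupil met the people."]
freq = word_frequencies(reference)
candidates = ["people", "pupil", "papal"]
print({c: freq[c] for c in candidates})  # {'people': 3, 'pupil': 1, 'papal': 0}
```

Because a `Counter` returns zero for unseen words, candidates that never occur in the reference corpus are handled without special cases.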

[0048] For vector space models, it can be advantageous to train or generate the model from a corpus related to the text input that may require de-abbreviation. For example, a Wikipedia corpus would contain very few abbreviations, such as "thx"; however, if the corpus is application-specific (where text-based messages will be interpreted), such as using a corpus of text-based messages as training data, there may be similar groups of abbreviations. Furthermore, if the corpus is relevant—for example, if a vector space model is trained using a set of driver messages for later analysis of driver messages—the results should be further improved.

[0049] However, for frequency-based corpora, a preferred corpus might be a standardized set, making standard words more likely to appear in relevant contexts. Therefore, in one technique, the text database used to determine the frequency of occurrence of associated candidate text units is a different text database than the one used for vector space models. This has another advantage: such a standardized corpus may be publicly available.

[0050] In the technique of step b above, a candidate text unit can be selected if it has an ordered relationship with the abbreviated text unit; for example, if the characters of the abbreviated text unit are a partially ordered set of the characters of the candidate text unit (or for those characters), or if the characters of the abbreviated text unit have a similar order to the characters of the candidate text unit, or if the consonants of the abbreviated text unit are the same or similar to the consonants of the candidate text unit, or have the same or similar order.
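
One possible reading of the ordered relationship is a subsequence test: every character of the abbreviation occurs in the candidate, in the same order. The sketch below implements only that reading; the function name is hypothetical and the disclosure allows other variants (for example, comparing consonants only).

```python
def is_ordered_subsequence(abbreviation: str, candidate: str) -> bool:
    """True if the abbreviation's characters occur in the candidate in order."""
    remaining = iter(candidate.lower())
    # Each membership test consumes the iterator up to the match, so later
    # characters can only match later positions in the candidate.
    return all(ch in remaining for ch in abbreviation.lower())


print(is_ordered_subsequence("ppl", "people"))  # True
print(is_ordered_subsequence("brp", "berapa"))  # True
print(is_ordered_subsequence("plp", "people"))  # False
```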

[0051] One difference between these techniques and previous ones is that most similarity measures are symmetric, i.e., sim(a,b) = sim(b,a). This property is generally undesirable for text normalization because (for example) we always want to map “dmn” to “dimana”, but we never want to map “dimana” to “dmn”. So we want sim(“dmn”,“dimana”) to be high, but sim(“dimana”,“dmn”) to be low. This can be achieved by considering only those pairs where adding vowels back would map the source to the target, which can be implemented as a filter. So, for example, “dimana” can be converted to “dmn” by removing letters (vowels in this case), and therefore “dimana” is considered a potential substitute for “dmn”. The reverse is not true, and therefore “dmn” cannot be considered a potential substitute for “dimana”.
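
The asymmetry can be obtained by wrapping any symmetric score in a directional filter, as in the sketch below. It assumes the abbreviation contains no vowels at all, uses a fixed Latin vowel set, and borrows `difflib.SequenceMatcher` as a placeholder base score; none of these choices comes from the disclosure, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

VOWELS = set("aeiou")  # assumption: a simple Latin vowel set


def consonant_skeleton(word: str) -> str:
    """Remove vowels, leaving the consonants in their original order."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def directed_sim(source: str, target: str) -> float:
    """Nonzero only if removing vowels from target yields source exactly."""
    s, t = source.lower(), target.lower()
    if s == t or consonant_skeleton(t) != s:
        return 0.0
    # Placeholder symmetric base score; an assumption, not the patent's metric.
    return SequenceMatcher(None, s, t).ratio()


print(directed_sim("dmn", "dimana") > 0)  # True: vowels can be added back
print(directed_sim("dimana", "dmn"))      # 0.0: the reverse mapping is rejected
```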

[0052] Even when the simplified form of the input word differs from the target word by many characters, this combination of similarity measurement, filtering by ordered relationship, and optional discrimination by frequency in (different) corpora provides accurate results. Previous techniques struggled to detect these situations: with those using only vector neighbors, some words ranked higher than the correct form, whereas consonant filtering and corpus frequency weighting can confirm the correct word. For some candidates that are in fact the correct de-abbreviation, the orthographic distance alone might be high. The technique described herein allows such dissimilar words to be selected as candidates, provided they also pass the ordered-relationship and frequency comparison stages.

[0053] Figure 4 is a schematic diagram illustrating a text data record 400 and an example of processing such records. A text data record or packet has a header 402 and auxiliary message components 406. The record contains multiple text data components, which may include text data, text data elements, compressed text data, etc. Here, the text message of Figure 3 is carried in multiple text data components, including those for the abbreviated text unit "ppl". This record or packet can be received by the user's communication equipment, communication device, or service provider's communication equipment.

[0054] The payload data components of a data record can be processed in the manner described herein to find the de-abbreviation of “ppl”, and either the payload is edited or a new data record (422, 426) is formed that now includes the data component (424) of the de-abbreviated text unit “people”.
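
The record handling might be modelled as in the following sketch, which forms a new record rather than editing the original. The `TextRecord` layout, its field names, the tokenized payload and the lookup table are all hypothetical; the figure's actual record format is not specified here.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical record layout: header, payload components, auxiliary part."""
    header: str
    payload: Tuple[str, ...]
    auxiliary: str = ""


def de_abbreviate_record(record: TextRecord, expand: Callable[[str], str]) -> TextRecord:
    """Return a new record whose payload components were passed through expand."""
    return replace(record, payload=tuple(expand(unit) for unit in record.payload))


# Hypothetical lookup standing in for the full de-abbreviation pipeline.
lookup = {"ppl": "people", "pls": "please"}
original = TextRecord(header="msg-1", payload=("pls", "pickup", "2", "ppl"))
updated = de_abbreviate_record(original, lambda unit: lookup.get(unit, unit))
print(updated.payload)  # ('please', 'pickup', '2', 'people')
```

Keeping the record immutable means the received message stays available unchanged, which suits uses such as displaying both the original and the interpreted text.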

[0055] Figure 5 is a flowchart illustrating the steps of an exemplary method for processing text data. In this exemplary technique, the canonical word to be found is "berapa", which means 'how much / many' in Indonesian, and the input abbreviation is "brp", a common abbreviation in text-based messaging. The text-based message analyzed in this example is a message between a driver and a passenger in a travel environment.

[0056] The input word 'brp' is received (502). The first stage is the vector space model or word embedding similarity step. Here, the corpus used to train or generate the vector space model is a set of reviews for a travel company called "Grab". The reviews are likely to be written in a dialect similar to that of the received messages to be de-abbreviated.

[0057] The words in the Grab reviews 504 (a corpus of words from user reviews of their journeys, drivers, etc.) are pre-mapped into n-dimensional vectors 506. This is done as part of a preprocessing stage (as are the word counts used for the Wikipedia comparison in 516 and 518; see below).

[0058] In an alternative, the vector model can be trained on a combination of the Grab reviews and a Wikipedia corpus. This gives a combination of the expected dialect in the message and the range of the Wikipedia corpus, in case some words are missing from either corpus.

[0059] Identify the nearest neighbors from the vector model (508). Scoring is done via cosine similarity. The next stage (510) lists the candidates, and then a cutoff threshold can be set to give, for example, the 10 closest hits. This gives a list of the closest neighbors and their respective similarity scores (similarity to the input text).

[0060] At this stage, orthographic similarity scores can optionally be calculated and used in parallel, or a combined score with cosine similarity can be used. Orthographic similarity can be used to compare abbreviations of words with corresponding words in Wikipedia and the Grab reviews, even if those abbreviations do not appear in the Grab reviews. Using orthographic similarity at this stage can improve the overall effectiveness of the processing technique by reducing the complexity of the subsequent filtering stage 512 (e.g., by reducing the number of candidates passed to the filter).

[0061] The results from these similarity scores are then filtered by ordered relationship; for example, keeping only those candidates that can be obtained from the input text by adding something, such as characters (consonants, vowels) or diacritics. In this example (512), the filter keeps targets that have the same consonants in the same order as the input, with one or more vowels added. The result (514) is therefore reduced again.
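
Applied to a scored neighbor list, the consonant filter of stage 512 might look like the sketch below. The neighbor words and scores are invented, the function names are hypothetical, and the length check is a simplification that is exact only when the abbreviation itself contains no vowels.

```python
from typing import List, Tuple

VOWELS = set("aeiou")  # assumption: a simple Latin vowel set


def consonants(word: str) -> str:
    """Return the consonants of a word in their original order."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def filter_by_consonants(
    abbreviation: str, neighbours: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Keep neighbours with the same consonants in order, plus added vowels."""
    target = consonants(abbreviation)
    return [
        (word, score)
        for word, score in neighbours
        if consonants(word) == target and len(word) > len(abbreviation)
    ]


# Neighbour words and scores are invented for illustration.
neighbours = [
    ("berapa", 0.71),
    ("brapa", 0.80),
    ("berapakah", 0.65),
    ("bro", 0.60),
    ("brp", 1.0),
]
print(filter_by_consonants("brp", neighbours))
# [('berapa', 0.71), ('brapa', 0.8)]
```

The filter leaves the similarity scores untouched, so the surviving candidates can still be re-weighted by corpus frequency in the next stage.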

[0062] Next, the Indonesian Wikipedia corpus (516) is used to find word frequency counts (518), and the similarity score is multiplied by some function of the counts from Wikipedia. In this example (520), the similarity score of each word pair (input word, each nearest neighbor) is multiplied by the logarithm of 2 plus the number of times that word appears in the Wikipedia corpus, i.e., log(2 + count). There are several reasons for using log(2 + count): a word that appears 10 times more often than another word is clearly more important, but not necessarily 10 times more important; adding a constant avoids useless results for zero instances, since log(0) is undefined; and it avoids the steep part of the log curve at low input values, since log(1) is zero and every argument now starts from at least 2. Words in the Grab reviews may have zero instances in the Wikipedia corpus, which is why 2 is added to the count before taking the log. Alternatively, other types of weighting can be used, such as a square root or another monotonically increasing function with a positive y-intercept.
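
The weighting of stage 520 can be sketched in a few lines. The scores and counts below are invented, the function name is hypothetical, and the natural logarithm is an assumption because the text does not state the base.

```python
import math
from typing import Dict, List, Tuple


def weighted_scores(
    neighbours: List[Tuple[str, float]], counts: Dict[str, int]
) -> List[Tuple[str, float]]:
    """Multiply each similarity by log(2 + corpus count) and sort best first."""
    weighted = [
        (word, score * math.log(2 + counts.get(word, 0)))
        for word, score in neighbours
    ]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)


# Similarity scores and counts are invented for illustration.
neighbours = [("brapa", 0.80), ("berapa", 0.71)]
counts = {"berapa": 5000}  # "brapa" is absent from the reference corpus
best_word, best_score = weighted_scores(neighbours, counts)[0]
print(best_word)  # berapa
```

In this toy case the non-canonical spelling has the higher raw similarity but no presence in the reference corpus, so its weight is only log(2) and the canonical form wins after weighting.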

[0063] The highest-scoring result (522) from this final stage is considered the most likely canonical form and is therefore designated as the de-abbreviated text unit.

[0064] It should be understood that the invention has been described by way of example only. Various modifications may be made to the technology described herein without departing from the spirit and scope of the appended claims. The disclosed technology includes technologies that can be provided independently or in combination with each other. Thus, features described with respect to one technology may also be presented in combination with another technology.

Claims

1. A communication server apparatus for processing text data to de-abbreviate text units, the communication server apparatus comprising a processor and a memory, the communication server apparatus being configured to execute instructions stored in the memory under the control of the processor, such that: Receive text data including at least one text data element associated with an abbreviated text unit; The at least one text data element is compared with a plurality of candidate text data elements of a representation from a given text database, each candidate text data element being associated with a corresponding candidate text unit in the database; Determine the value of the similarity metric between the at least one text data element and these candidate text data elements; Process candidate text data elements to select candidate text data elements that have an ordered relationship with the abbreviated text unit; and Using these similarity metrics and these candidate text data elements, associated candidate text units are selected as the de-abbreviated text units of the abbreviated text unit, wherein the representation of the text database is a vector space model trained on the given text database, and wherein the text data elements include vectors from the model, each vector of the vector space model being associated with a corresponding candidate text unit. The abbreviated text units are converted into representative vectors in the vector space model. Furthermore, the device is configured to compare the vector of the abbreviated text unit with a plurality of candidate text data element vectors.

2. The communication server device as described in claim 1, wherein the device is configured to, after determining the value of the similarity metric: classify candidate text data elements based on these similarity metrics; and process the classified candidate text data elements to select those candidate text data elements that have an ordered relationship with the abbreviated text unit.

3. The communication server device as described in claim 2, wherein, The device is configured to classify these candidate text data elements using a minimum similarity metric based on a threshold.

4. The communication server device as described in claim 1 or claim 2, wherein the apparatus is configured to: determine the frequency of occurrence of an associated candidate text unit in a text database; and use the determined frequency of occurrence in designating the associated candidate text unit.

5. The communication server apparatus as described in claim 4, wherein, The text database used to determine the frequency of occurrence of the associated candidate text unit is a secondary text database.

6. The communication server apparatus as described in claim 1 or claim 2, wherein, For the step of selecting candidate text data elements of associated candidate text units that have an ordered relationship with the abbreviated text unit, the apparatus is configured to: determine, for the candidate text unit and the abbreviated text unit, whether the characters of the abbreviated text unit are of, or a partially ordered set of, the characters of the candidate text unit.

7. The communication server apparatus as described in claim 1 or claim 2, wherein, For the step of selecting candidate text data elements of associated candidate text units that have an ordered relationship with the abbreviated text unit, the apparatus is configured to: determine, for the candidate text unit and the abbreviated text unit, whether the characters of the abbreviated text unit have a similar order to the characters of the candidate text unit.

8. The communication server apparatus as described in claim 1 or claim 2, wherein, For the step of selecting candidate text data elements of associated candidate text units that have an ordered relationship with the abbreviated text unit, the apparatus is configured to: determine, for the candidate text unit and the abbreviated text unit, whether the characters of the abbreviated text unit are the same as or similar to the consonants of the candidate text unit.

9. The communication server apparatus as claimed in claim 1 or claim 2, wherein the similarity metric includes a cosine similarity metric.

10. The communication server apparatus as claimed in claim 1 or claim 2, wherein the similarity metric includes orthographic similarity.

11. A communication device for processing text data to de-abbreviate text units, the communication device comprising a processor and a memory, the communication device being configured to execute instructions stored in the memory under the control of the processor, such that: Receive text data including at least one text data element associated with an abbreviated text unit; The at least one text data element is compared with a plurality of candidate text data elements of a representation from a given text database, each candidate text data element being associated with a corresponding candidate text unit in the database; Determine the value of the similarity metric between the at least one text data element and these candidate text data elements; Process candidate text data elements to select candidate text data elements that have an ordered relationship with the abbreviated text unit; and Using these similarity metrics and these candidate text data elements, we select associated candidate text units as the de-abbreviated text units for the abbreviated text unit. The text database is represented by a vector space model trained on the given text database, and the text data elements include vectors from the model, each vector of the vector space model being associated with a corresponding candidate text unit. The abbreviated text units are converted into representative vectors in the vector space model. Furthermore, the device is configured to compare the vector of the abbreviated text unit with multiple candidate text data element vectors.

12. A system for processing text data to de-abbreviate text units, the system comprising a communication server device, at least one user communication device, and a communication network device, the communication network device being operable to enable the communication server device and the at least one user communication device to establish communication with each other through it, wherein, The at least one user communication device includes a first processor and a first memory, and the at least one user communication device is configured to execute a first instruction stored in the first memory under the control of the first processor, so as to: Receive text data including at least one text data element associated with an abbreviated text unit, wherein: The communication server device includes a second processor and a second memory. The communication server device is configured to execute second instructions stored in the second memory under the control of the second processor, so as to: The at least one text data element is compared with a plurality of candidate text data elements of a representation from a given text database, each candidate text data element being associated with a corresponding candidate text unit in the database; Determine the value of the similarity metric between the at least one text data element and these candidate text data elements; Process candidate text data elements to select candidate text data elements that have an ordered relationship with the abbreviated text unit; and Using these similarity metrics and these candidate text data elements, we select associated candidate text units as the de-abbreviated text units for the abbreviated text unit. The text database is represented by a vector space model trained on the given text database, and the text data elements include vectors from the model, each vector of the vector space model being associated with a corresponding candidate text unit. The abbreviated text units are converted into representative vectors in the vector space model. Furthermore, the communication server device is configured to compare the vector of the abbreviated text unit with multiple candidate text data element vectors.

13. A method executed in a communication server apparatus for processing text data to de-abbreviate text units, the method comprising performing the following operations under the control of a processor of the server apparatus: Receive text data including at least one text data element associated with an abbreviated text unit; The at least one text data element is compared with a plurality of candidate text data elements of a representation from a given text database, each candidate text data element being associated with a corresponding candidate text unit in the database; Determine the value of the similarity metric between the at least one text data element and these candidate text data elements; Process candidate text data elements to select candidate text data elements that have an ordered relationship with the abbreviated text unit; and Using these similarity metrics and these candidate text data elements, we select associated candidate text units as the de-abbreviated text units for the abbreviated text unit. The text database is represented by a vector space model trained on the given text database, and the text data elements include vectors from the model, each vector of the vector space model being associated with a corresponding candidate text unit. The abbreviated text units are converted into representative vectors in the vector space model. Furthermore, the method includes comparing the vector of the abbreviated text unit with a plurality of candidate text data element vectors.

14. A computer program product comprising instructions for implementing the method of claim 13.

15. A non-transitory storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of claim 13.