An abnormal transaction identification method and device
By calculating the word segmentation similarity and cluster feature weights of merchant names, abnormal transactions are identified, solving the problem of inaccurate user identification of transaction types in existing technologies and improving the accuracy and security of the transaction system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WEBANK (CHINA)
- Filing Date
- 2021-12-08
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies cannot accurately identify users with different transaction types, resulting in insufficient accuracy in transaction screening results.
By calculating the word segmentation similarity of merchant names, clustering feature vectors are determined. Based on the weight and distance of clustering features, the degree of abnormality of merchant names is identified. Combined with blacklists and whitelists for preliminary screening, the clustering results are optimized to improve accuracy.
It improves the accuracy of user filtering results for different transaction types, reduces the waste of computing resources, and enhances the security of the transaction system.
Figure CN114138975B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of network technology, and in particular to a method and apparatus for identifying abnormal transactions.
Background Technology
[0002] In recent years, with the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). However, due to the security and real-time requirements of the financial industry, higher demands are being placed on technology. For example, to ensure the legality of transactions, existing technologies generally verify the legality of transaction information. Only after verification is the transaction processed, thereby ensuring the security of transaction processing.
[0003] In existing technologies, different transaction types target different users. For example, futures, stocks, and bonds are available to businesses with tangible production equipment, materials, or physical assets, such as manufacturing companies, raw material processing companies, mining companies, metal smelting companies, and oil extraction companies, as well as individual savings customers. However, these types of transactions are not open to real estate merchants, lottery merchants, or insurance merchants. Therefore, after receiving transaction information, the system needs to identify the merchant names in the transaction information based on a merchant blacklist (e.g., including real estate merchants, lottery merchants, and insurance merchants). If a merchant is not on the blacklist, the transaction can be processed; otherwise, it is not. While this method can filter users based on different transaction types and ensure the smooth operation of transactions, it still cannot achieve more accurate identification and filtering.
[0004] Therefore, there is an urgent need for an abnormal transaction identification method and device to improve the accuracy of user screening results for different transaction types.
Summary of the Invention
[0005] This invention provides a method and apparatus for identifying abnormal transactions, which improves the accuracy of user screening results for different transaction types.
[0006] In a first aspect, embodiments of the present invention provide an abnormal transaction identification method, the method comprising:
[0007] Determining, based on the similarity corresponding to each word segment, a first clustering feature vector of the merchant name under each clustering feature, wherein each clustering feature is a feature under which the variation among the sample merchant names is greater than a preset threshold; determining, based on a first weight of each clustering feature, a clustering result of the first clustering feature vector and a second clustering feature vector of each sample merchant name, wherein the first weight is determined based on the degree of dispersion of the sample merchant names under each clustering feature; and determining an identification result of the merchant name based on the identification result of each sample merchant name in the clustering result, wherein the identification result indicates whether the merchant corresponding to the merchant name is an abnormal merchant.
[0008] In the above method, the similarity between each word segment of the merchant name in the transaction to be processed and each abnormal keyword is calculated to obtain the similarity corresponding to each word segment, which quantifies the degree of abnormality of the merchant name. Based on these similarities, the first clustering feature vector of the merchant name under each clustering feature is determined. Then, based on the first weight of each clustering feature, the clustering result of the first clustering feature vector and the second clustering feature vectors of the sample merchant names, each labeled as a normal or an abnormal merchant name, is determined. This shows whether the merchant name is more similar to the normal merchant names or to the abnormal ones. Different clustering features represent the degree of abnormality of a merchant name to different extents, and the same clustering feature also varies to different extents across the sample merchant names. Applying the first weight of each clustering feature when clustering therefore lets the clustering result reflect both factors, which improves the accuracy of the clustering result.
[0009] Optionally, the identification result of the merchant name is determined based on the identification result of each sample merchant name in the clustering result, including:
[0010] Based on the second weight of each cluster distance in the clustering results, the weighted cluster distance of each sample merchant in the clustering results is determined; in the second weight, the closer the distance, the higher the weight value.
[0011] The identification result of the sample merchant name with the minimum weighted clustering distance is taken as the identification result of the merchant name.
[0012] The above method not only obtains the clustering results of the first clustering feature vector and the second clustering feature vector of each sample merchant name based on the first weight corresponding to each clustering feature, but also integrates two factors into the clustering results: the representation of the degree of anomaly of merchant names by different clustering features and the degree of difference of the same clustering feature in the sample merchant names, thereby improving the accuracy of the clustering results.
[0013] Furthermore, clustering distance ranges can be defined around the merchant name, and a second weight assigned to each range. Based on the second weights, a first identification weight for the sample merchant names identified as normal and a second identification weight for those identified as abnormal are determined from the clustering result. The identification result with the higher of the two identification weights is taken as the identification result of the merchant name. In this way, the second weights corresponding to different clustering distances further optimize the clustering result: even if the numbers of abnormal and normal sample merchant names are unbalanced, weighting by the distance between each sample merchant name and the merchant name in the pending transaction offsets the effect of the imbalance.
[0014] Optionally, calculating the similarity between each word segment of the merchant name in the pending transaction and each abnormal keyword to obtain the similarity corresponding to each word segment includes: obtaining the merchant name in the pending transaction; obtaining the word segments of the merchant name using the Jieba word segmentation tool; converting each word segment into a word segment vector using a word-to-vector conversion tool; and, for each word segment vector of the merchant name, calculating the similarity between that vector and each abnormal keyword vector to obtain the similarity corresponding to the word segment.
[0015] In the above method, a clustering feature vector representing the degree of anomaly in the merchant name is determined by the similarity between the word segmentation vector of the merchant name and the vectors of each abnormal keyword. Thus, by converting word segmentation into vectors, the degree of anomaly can be quantified by converting text into numbers, improving the accuracy of abnormal transaction identification results.
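As a minimal sketch of the similarity step described above, the following Python computes a similarity value for each word segment against a set of abnormal keywords. The source names neither the similarity measure nor the vector tool, so cosine similarity, the toy `TOY_VECTORS` table, the English stand-in segments, and the choice to keep the highest keyword similarity per segment are all assumptions made for illustration. The input is taken to be already segmented (for example by Jieba).

```python
import math

# Hypothetical toy embeddings; a real system would look these up in a
# trained word-vector model rather than a hand-written table.
TOY_VECTORS = {
    "lucky": [0.9, 0.1, 0.0],
    "lottery": [1.0, 0.0, 0.0],
    "steel": [0.0, 1.0, 0.2],
    "insurance": [0.1, 0.0, 1.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def segment_similarities(segments, abnormal_keywords, vectors):
    """Return one similarity per segment.

    Assumption: the similarity "corresponding to" a segment is its highest
    similarity to any abnormal keyword.
    """
    result = []
    for segment in segments:
        per_keyword = [
            cosine_similarity(vectors[segment], vectors[keyword])
            for keyword in abnormal_keywords
        ]
        result.append(max(per_keyword))
    return result
```

For instance, `segment_similarities(["lucky", "steel"], ["lottery", "insurance"], TOY_VECTORS)` scores "lucky" well above "steel" with these toy vectors, because "lucky" points in nearly the same direction as "lottery".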
[0016] Optionally, each clustering feature is a feature in which the difference between sample merchant names under each feature is greater than a set threshold, including: for any sample merchant name, determining the feature value of the sample merchant name under each feature based on the similarity corresponding to each word segment of the sample merchant name; determining the variance of each feature based on the feature value of each sample merchant name under each feature; and determining the features with variance not less than a preset threshold as clustering features.
[0017] In the above method, after the feature values of the sample merchant names under each feature are determined from the similarities corresponding to the word segments, variance-based feature selection is used to obtain the clustering features and their preset threshold. The preset threshold is then used to filter the features of the merchant name to be identified. Features that cannot represent the degree of abnormality of a merchant name, or represent it only weakly, are removed, which reduces the computational load and saves computing resources. For example, the lowest similarity can be removed because it cannot represent the degree of abnormality of the merchant name.
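The variance-based selection above can be sketched as follows. This is only an illustration: the feature names, the sample values, the threshold, and the use of population variance (rather than sample variance) are assumptions, since the source does not fix them.

```python
from statistics import pvariance


def select_clustering_features(feature_table, threshold):
    """Keep the features whose variance across samples is not less than threshold.

    feature_table maps a feature name to a list of feature values,
    one value per sample merchant name.
    """
    return [
        name
        for name, values in feature_table.items()
        if pvariance(values) >= threshold
    ]


# Hypothetical feature values for four sample merchant names.
EXAMPLE_TABLE = {
    "max_similarity": [0.9, 0.1, 0.8, 0.2],
    "min_similarity": [0.05, 0.05, 0.06, 0.05],
}
```

With these made-up numbers and a threshold of 0.01, only `max_similarity` survives, which mirrors the source's remark that the lowest similarity carries little information about abnormality.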
[0018] Optionally, the clustering features include the highest similarity and the average similarity among the similarities corresponding to the word segments, and the number of word segments in the merchant name.
[0019] In the above method, the highest similarity reflects the most pronounced anomaly in the merchant name, the average similarity reflects the average degree of anomaly across the word segments of the merchant name, and the number of word segments reflects the length of the merchant name and the amount of data on which the other features are based. The clustering feature vector thus covers at least three main factors: the pronounced anomaly feature, the general anomaly feature, and the data basis of the features, which helps ensure the accuracy of the clustering results.
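A minimal sketch of building the three-component clustering feature vector from per-segment similarities is given below. The function name and the ordering of the components are assumptions chosen for illustration.

```python
def clustering_feature_vector(segment_sims):
    """Return [highest similarity, average similarity, number of word segments]."""
    if not segment_sims:
        raise ValueError("a merchant name must have at least one word segment")
    highest = max(segment_sims)
    average = sum(segment_sims) / len(segment_sims)
    return [highest, average, float(len(segment_sims))]
```

For a name with two segments scoring 0.9 and 0.3, this yields a highest similarity of 0.9, an average of about 0.6, and a segment count of 2.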
[0020] Optionally, the first weight is determined based on the dispersion of each sample merchant name under each cluster feature, including: for each abnormal sample merchant name, determining the feature value of the abnormal sample merchant name under each cluster feature; for each cluster feature, determining the coefficient of variation of the cluster feature based on the feature value of each abnormal sample merchant name under the cluster feature; and determining the first weight of each cluster feature based on the coefficient of variation of each cluster feature.
[0021] In the above method, the feature values of sample merchant names under each cluster feature are determined. For each cluster feature, the coefficient of variation of the cluster feature is determined based on the feature values of each sample merchant name under that cluster feature. In other words, the dispersion of each cluster feature is obtained based on the sample merchant names. A higher dispersion generally indicates a greater difference between the feature values of normal and abnormal merchant names under that cluster feature. Correspondingly, the cluster feature is more accurate in representing the identification result of whether a merchant name is normal or abnormal. Thus, based on the coefficient of variation of each cluster feature, the first weight of each cluster feature is determined, improving the accuracy of the identification results.
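The following sketch shows one way the coefficient of variation could be turned into first weights. The source only says the first weight is determined "based on" the coefficient of variation; normalizing the coefficients so they sum to 1, the use of the population standard deviation, and the feature names are assumptions.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the absolute mean of the values."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / abs(m)


def first_weights(feature_table):
    """Map each clustering feature to a weight proportional to its dispersion.

    Assumption: weights are the coefficients of variation normalized to sum to 1.
    """
    cvs = {name: coefficient_of_variation(v) for name, v in feature_table.items()}
    total = sum(cvs.values())
    if total == 0:
        raise ValueError("all features are constant; no weights can be derived")
    return {name: cv / total for name, cv in cvs.items()}
```

With hypothetical values `{"a": [1.0, 3.0], "b": [4.0, 6.0]}`, feature `a` has a coefficient of variation of 0.5 and `b` of 0.2, so `a` receives the larger weight (5/7 versus 2/7).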
[0022] Optionally, before obtaining the similarity corresponding to each word segment, the method further includes: determining that the merchant name is in neither the merchant blacklist nor the merchant whitelist; and after determining the identification result of the merchant name, the method further includes: when the identification result indicates an abnormal merchant, determining whether to update the abnormal keywords based on the merchant name.
[0023] In the above method, before calculating similarity, a preliminary screening is performed using merchant names in blacklists and whitelists. If the merchant name in the transaction information to be processed is in the blacklist or whitelist, further identification is unnecessary, saving computational resources. Furthermore, after identifying abnormal merchant names, abnormal keywords are updated to further improve subsequent anomaly identification results.
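The preliminary screening can be sketched as a simple lookup that short-circuits the more expensive similarity and clustering steps. The function name, the string labels, and the rule that the blacklist takes precedence when a name appears in both lists are assumptions for illustration.

```python
def prescreen(merchant_name, blacklist, whitelist):
    """Return "abnormal" or "normal" if a list decides the case, else None.

    A None result means the name is in neither list and must go through
    the similarity and clustering steps.
    """
    if merchant_name in blacklist:  # assumption: blacklist wins on conflict
        return "abnormal"
    if merchant_name in whitelist:
        return "normal"
    return None
```

Using sets for `blacklist` and `whitelist` keeps each lookup constant time on average, which fits the stated goal of saving computing resources.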
[0024] Optionally, the second weight includes a first value and a second value, including: taking a distance threshold as the boundary, the second weight corresponding to a distance less than the distance threshold is set as the first value; the second weight corresponding to a distance not less than the distance threshold is set as the second value; the first value is higher than the second value.
[0025] In the above method, clustering distance ranges are set around the merchant name, and the closer a range is to the merchant name, the higher its second weight. Based on the second weights corresponding to the different clustering distances, a first identification weight for sample merchants identified as normal and a second identification weight for sample merchants identified as abnormal are determined, and from these the identification result of the merchant name is determined. Because sample merchant names closer to the merchant name contribute a higher identification weight, the second weights can offset an imbalance between the numbers of abnormal and normal sample merchant names and improve the accuracy of the identification result. For example, suppose only 1/4 of the sample merchant names are abnormal and the rest are normal, and the sample merchant name closest to the merchant name in the pending transaction is a single abnormal merchant name, while the next closest are three normal sample merchant names. If the merchant name were identified from this clustering result by count alone, it would very likely be identified as normal when it is actually abnormal. With the second weight of the clustering distance, the identification result of the closest, abnormal sample merchant name is weighted more heavily, which corrects the identification result. This example only illustrates how the second weight of the clustering distance offsets sample imbalance; practical applications are not limited to it and are generally much more complex.
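The two-valued second weight and the example with one close abnormal sample and three farther normal samples can be sketched as below. Summing the second weights per label to obtain the identification weights, the concrete values 5.0 and 1.0, the distances, and the tie-breaking rule are assumptions; the source does not specify them.

```python
def second_weight(distance, threshold, near_value, far_value):
    """Two-valued second weight: near_value below the threshold, far_value otherwise."""
    if near_value <= far_value:
        raise ValueError("the first (near) value must be higher than the second (far) value")
    return near_value if distance < threshold else far_value


def identify_by_weighted_vote(neighbours, threshold, near_value=5.0, far_value=1.0):
    """neighbours is a list of (clustering distance, label) pairs.

    Assumption: each identification weight is the sum of the second weights
    of the samples carrying that label; ties resolve to "normal".
    """
    totals = {"normal": 0.0, "abnormal": 0.0}
    for distance, label in neighbours:
        totals[label] += second_weight(distance, threshold, near_value, far_value)
    return "abnormal" if totals["abnormal"] > totals["normal"] else "normal"


# One abnormal sample very close, three normal samples farther away.
EXAMPLE_NEIGHBOURS = [
    (0.1, "abnormal"),
    (0.5, "normal"),
    (0.6, "normal"),
    (0.7, "normal"),
]
```

With a threshold of 0.3, the abnormal sample contributes 5.0 and the three normal samples contribute 3.0 in total, so the result is "abnormal"; a plain majority count over the same neighbours would have returned "normal".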
[0026] In a second aspect, embodiments of the present invention provide an abnormal transaction identification device, the device comprising:
[0027] The processing module is used to determine the first clustering feature vector of the merchant name under each clustering feature based on the similarity corresponding to each word segmentation, wherein each clustering feature is a feature in which the difference of the sample merchant name under each feature is greater than a preset threshold.
[0028] The processing module is further configured to determine the clustering result of the first clustering feature vector and the second clustering feature vector of each sample merchant name based on the first weight of each clustering feature, wherein the first weight is determined based on the degree of dispersion of each sample merchant name under each clustering feature.
[0029] The identification module is used to determine the identification result of the merchant name based on the identification result of each sample merchant name in the clustering result; the identification result is used to indicate whether the merchant corresponding to the merchant name is an abnormal merchant.
[0030] In a third aspect, embodiments of this application also provide a computing device, including: a memory for storing a program; and a processor for calling the program stored in the memory and executing, according to the obtained program, the method described in the various possible designs of the first aspect.
[0031] In a fourth aspect, embodiments of this application also provide a computer-readable non-volatile storage medium including a computer-readable program that, when read and executed by a computer, causes the computer to perform the method described in the various possible designs of the first aspect.
[0032] These or other implementations of this application will become clearer and easier to understand in the following description of the embodiments.
Brief Description of the Drawings
[0033] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0034] Figure 1 is a schematic diagram of an abnormal transaction identification system architecture provided by an embodiment of the present invention;
[0035] Figure 2 is a schematic diagram of an abnormal transaction identification system architecture provided by an embodiment of the present invention;
[0036] Figure 3 is a flowchart of an abnormal transaction identification method provided by an embodiment of the present invention;
[0037] Figure 4 is a flowchart of an abnormal transaction identification method provided by an embodiment of the present invention;
[0038] Figure 5 is a schematic diagram of an abnormal transaction identification device provided by an embodiment of the present invention.
Detailed Description
[0039] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.
[0040] Figure 1 shows a system architecture for abnormal transaction identification provided by an embodiment of this invention. In this system, the transaction system 101 generates pending transactions based on transaction requests submitted by users. Before a pending transaction is processed, the abnormal transaction identification system 102 determines whether it was submitted by a legitimate merchant, and the transaction system 101 processes the pending transaction only if it was. This ensures the security of transaction processing by the transaction system 101.
[0041] Based on the above system architecture, this embodiment of the invention provides a system architecture for abnormal transaction identification, as shown in Figure 2, including:
[0042] In the abnormal transaction identification system 102, the word segmentation tool 201 segments the merchant name in the transaction to be processed, obtains each word of the merchant name, and sends each word of the merchant name to the feature calculation module 202.
[0043] The feature calculation module 202 converts each word segment of the merchant name into a vector, and then calculates the similarity between each word segment vector and the vector of each preset keyword to obtain the similarity between each word segment and each preset keyword. Based on the similarities of the word segments, the feature calculation module 202 obtains multiple features of the merchant name, determines the clustering features among them, and obtains the first clustering feature vector of the merchant name. The feature calculation module 202 then sends the first clustering feature vector of the merchant name to the identification module 203.
[0044] The identification module 203 receives the first clustering feature vector of the merchant name. Based on the first weight corresponding to each clustering feature and a clustering algorithm, it calculates the clustering result of the first clustering feature vector of the merchant name and the second clustering feature vectors of the sample merchant names. The first weights assign different weight values to different clustering features according to how strongly each indicates that a merchant name is abnormal, so that the weighted clustering better reflects whether the merchant name is normal or abnormal and the identification result is more accurate. Then, based on the second weight of each cluster distance in the clustering result, the identification module 203 determines the first identification weight for merchants identified as normal and the second identification weight for merchants identified as abnormal. In the second weight, the closer the distance, the higher the weight value, which offsets the effect of an imbalance between the numbers of normal and abnormal sample merchant names on the identification result. Finally, the identification result with the higher of the first and second identification weights is output as the identification result of the merchant name.
[0045] If the identification result determines that the merchant name in the transaction to be processed is a normal merchant name, the transaction system 101 can process the transaction to be processed.
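To show how the modules described above could fit together, here is a compact end-to-end sketch that starts from per-segment similarities and ends with the decision on whether the transaction system may process the transaction. All function names, the two-valued second weight (5.0 and 1.0), the distance threshold, and the sample data are assumptions for illustration; segmentation and vectorization are taken as already done.

```python
import math


def feature_vector(segment_sims):
    """[highest similarity, average similarity, number of word segments]."""
    return [max(segment_sims), sum(segment_sims) / len(segment_sims), float(len(segment_sims))]


def weighted_distance(z, a, first_weights):
    """Distance in which each squared difference is divided by its first weight."""
    return math.sqrt(sum((zi - ai) ** 2 / wi for zi, ai, wi in zip(z, a, first_weights)))


def identify(segment_sims, samples, first_weights, threshold, near_value=5.0, far_value=1.0):
    """samples is a list of (second clustering feature vector, label) pairs."""
    z = feature_vector(segment_sims)
    totals = {"normal": 0.0, "abnormal": 0.0}
    for vector, label in samples:
        d = weighted_distance(z, vector, first_weights)
        # Second weight: closer samples count more (assumed two-valued form).
        totals[label] += near_value if d < threshold else far_value
    return "abnormal" if totals["abnormal"] > totals["normal"] else "normal"


def may_process(segment_sims, samples, first_weights, threshold):
    """The transaction is processed only when the merchant name is identified as normal."""
    return identify(segment_sims, samples, first_weights, threshold) == "normal"


# Hypothetical labeled sample merchant names.
EXAMPLE_SAMPLES = [
    ([0.95, 0.6, 2.0], "abnormal"),
    ([0.2, 0.1, 2.0], "normal"),
    ([0.25, 0.15, 3.0], "normal"),
]
```

With these samples, unit first weights, and a threshold of 0.3, a name whose segments score 0.9 and 0.3 lands next to the abnormal sample and is blocked, while one scoring 0.2 and 0.05 lands next to a normal sample and is allowed through.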
[0046] Furthermore, in one example of the above system architecture, the parameters of the models and algorithms in each module of the abnormal transaction identification system can be obtained by training on the names of each sample merchant in the sample set.
[0047] The word segmentation tool 201 can be a Jieba word segmentation tool. Before performing word segmentation, a large amount of data is pre-set to form a corpus, so that the word segmentation tool 201 can perform word segmentation on the merchant name based on the corpus. The data in the corpus can be obtained by extracting and processing historical transaction data, or it can be obtained from merchant images, etc. There are no restrictions on the specific method of obtaining the data. For example, the data in the corpus can be selected from the existing stock of 930,000 merchants as the training data for the word segmentation tool 201.
[0048] The clustering features in the feature calculation module 202 can be determined based on how much each feature varies across the sample merchant names. For example, for each feature, the variance of the feature values of the sample merchant names in the sample set is calculated. If the variance exceeds a preset threshold, the feature is used as a clustering feature; otherwise, it is discarded and not calculated subsequently. In this way, features whose values differ little between normal and abnormal merchant names, and are therefore unsuitable for anomaly identification, are removed, while features whose values differ significantly and are suitable for distinguishing normal from abnormal merchant names are kept. Using the latter as clustering features improves the accuracy of the identification results and saves subsequent computing resources.
[0049] The first weight corresponding to each clustering feature in the feature calculation module 202 can be obtained by training: the dispersion of each clustering feature is calculated from the feature values of the sample merchant names in the sample set under that clustering feature, and the first weight of each clustering feature is then derived from its dispersion.
[0050] The second weights corresponding to the cluster distances in the feature calculation module 202 can be obtained by training based on the merchant names of each sample in the sample set. For example, for any merchant name in the sample set, the recognition result can be obtained based on the other remaining merchant names in the sample set. The second weights are then adjusted accordingly based on the accuracy and error rate of the recognition results. The method for obtaining the second weights here is just an example, and no restrictions are placed on the specific method.
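The leave-one-out idea in this paragraph can be sketched as follows: each sample merchant name is identified from the remaining samples, and the candidate second weight that gives the best accuracy is kept. The one-dimensional feature vectors, the plain Euclidean distance, the two-valued second weight, and the grid of candidate values are assumptions; the source explicitly leaves the tuning method open.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def vote(query, others, threshold, near_value, far_value=1.0):
    """Sum second weights per label and return the label with the higher total."""
    totals = {"normal": 0.0, "abnormal": 0.0}
    for vector, label in others:
        totals[label] += near_value if euclidean(query, vector) < threshold else far_value
    return "abnormal" if totals["abnormal"] > totals["normal"] else "normal"


def leave_one_out_accuracy(samples, threshold, near_value):
    """Identify every sample from the remaining samples and report the hit rate."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        if vote(vector, others, threshold, near_value) == label:
            correct += 1
    return correct / len(samples)


def pick_near_value(samples, threshold, candidates):
    """Return the candidate near value with the best leave-one-out accuracy."""
    return max(candidates, key=lambda c: leave_one_out_accuracy(samples, threshold, c))


# Hypothetical, deliberately unbalanced sample set (2 abnormal, 4 normal).
EXAMPLE_SAMPLES = [
    ([0.9], "abnormal"), ([0.85], "abnormal"),
    ([0.1], "normal"), ([0.15], "normal"), ([0.2], "normal"), ([0.25], "normal"),
]
```

On this toy set with a threshold of 0.2, a near value of 1.0 (no distance weighting) misidentifies both abnormal samples, while a near value of 5.0 identifies all six correctly, so `pick_near_value` selects 5.0 from the candidates `[1.0, 5.0]`.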
[0051] Based on this, embodiments of this application provide a flowchart of an abnormal transaction identification method, as shown in Figure 3, including:
[0052] Step 301: Determine the first clustering feature vector of the merchant name under each clustering feature based on the similarity corresponding to each word segment, wherein each clustering feature is a feature in which the difference of the sample merchant name under each feature is greater than a preset threshold.
[0053] Step 302: Determine the clustering result of the first clustering feature vector and the second clustering feature vector of each sample merchant name according to the first weight of each clustering feature. The first weight is determined according to the degree of dispersion of each sample merchant name under each clustering feature.
[0054] In one example, the three clustering features of merchant name Z in the transaction to be processed are z1, z2, and z3, and the first clustering feature vector of merchant name Z is [z1, z2, z3]. The first weights corresponding to these three clustering features are w1, w2, and w3, respectively. The three clustering features corresponding to sample merchant name A are a1, a2, and a3, and the second clustering feature vector of sample merchant name A is [a1, a2, a3]. Then the clustering distance between merchant name Z and sample merchant name A is $D_{az} = \left[ (z_1 - a_1)^2 / w_1 + (z_2 - a_2)^2 / w_2 + (z_3 - a_3)^2 / w_3 \right]^{1/2}$. Furthermore, the clustering algorithm used to obtain the clustering result here can be the nearest neighbor algorithm, the K-means algorithm, or another algorithm, without any specific restriction.
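The clustering distance in this example can be written as a short function. The sketch follows the formula as stated, dividing each squared difference by the corresponding first weight; the function name and the sample numbers are assumptions.

```python
import math


def weighted_distance(z, a, w):
    """Square root of the sum of (z_i - a_i)^2 / w_i over all clustering features."""
    if not (len(z) == len(a) == len(w)):
        raise ValueError("z, a and w must have the same length")
    return math.sqrt(sum((zi - ai) ** 2 / wi for zi, ai, wi in zip(z, a, w)))
```

For `z = [1, 2, 3]`, `a = [0, 0, 3]`, and `w = [1, 4, 1]`, the terms are 1/1, 4/4, and 0/1, so the distance is the square root of 2. When every weight is 1 the function reduces to the ordinary Euclidean distance.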
[0055] Step 303: Determine the identification result of the merchant name based on the identification result of each sample merchant name in the clustering result; the identification result is used to indicate whether the merchant corresponding to the merchant name is an abnormal merchant.
[0056] In the above method, the similarity between each word segment of the merchant name in the transaction to be processed and each abnormal keyword is calculated to obtain the similarity corresponding to each word segment, which quantifies the degree of abnormality of the merchant name. Based on these similarities, the first clustering feature vector of the merchant name under each clustering feature is determined. Then, based on the first weight of each clustering feature, the clustering result of the first clustering feature vector and the second clustering feature vectors of the sample merchant names, each labeled as a normal or an abnormal merchant name, is determined. This shows whether the merchant name is more similar to the normal merchant names or to the abnormal ones. Different clustering features represent the degree of abnormality of a merchant name to different extents, and the same clustering feature also varies to different extents across the sample merchant names. Applying the first weight of each clustering feature when clustering therefore lets the clustering result reflect both factors, which improves the accuracy of the clustering result.
[0057] This application provides a method for calculating identification results, which determines the identification result of a merchant name based on the identification result of each sample merchant name in the clustering result, including:
[0058] Based on the second weight of each cluster distance in the clustering results, the weighted cluster distance of each sample merchant in the clustering results is determined; in the second weight, the closer the distance, the higher the weight value.
[0059] The identification result (normal or abnormal label) of the sample merchant name with the minimum weighted clustering distance is taken as the identification result of the merchant name. Continuing the example above, the first clustering feature vector of merchant name Z in the transaction to be processed is [z1, z2, z3], and the first weights corresponding to these three clustering features are w1, w2, and w3, respectively. The second clustering feature vector of sample merchant name A is [a1, a2, a3], that of sample merchant name B is [b1, b2, b3], and that of sample merchant name C is [c1, c2, c3]. The clustering distance between merchant name Z and sample merchant name A is $D_{az} = \left[ (z_1 - a_1)^2 / w_1 + (z_2 - a_2)^2 / w_2 + (z_3 - a_3)^2 / w_3 \right]^{1/2}$; the clustering distance between merchant name Z and sample merchant name B is $D_{bz} = \left[ (z_1 - b_1)^2 / w_1 + (z_2 - b_2)^2 / w_2 + (z_3 - b_3)^2 / w_3 \right]^{1/2}$; and the clustering distance between merchant name Z and sample merchant name C is $D_{cz} = \left[ (z_1 - c_1)^2 / w_1 + (z_2 - c_2)^2 / w_2 + (z_3 - c_3)^2 / w_3 \right]^{1/2}$.
[0060] Suppose the second weight corresponding to the first cluster distance (which can be considered the cluster distance range closest to merchant name Z) is u1, and sample merchant name A is within the first cluster distance range from merchant name Z;
[0061] The second weight corresponding to the second cluster distance (which can be considered the second-closest cluster distance range to merchant name Z) is u2, and sample merchant name B is within the second cluster distance range from merchant name Z;
[0062] The second weight corresponding to the third cluster distance (which can be considered the third-closest cluster distance range to merchant name Z) is u3, and sample merchant name C is within the third cluster distance range from merchant name Z;
[0063] Accordingly, the normal or abnormal label of the sample merchant name corresponding to the minimum value among Daz*u1, Dbz*u2, and Dcz*u3 is used to determine whether the merchant name in the transaction to be processed is a normal merchant name or an abnormal merchant name.
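The distance formula and the minimum-selection rule in the example above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the sample vectors, and the values of the second weights in `u` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def weighted_distance(z, s, w):
    """Feature-weighted distance: sqrt(sum((z_i - s_i)^2 / w_i))."""
    return math.sqrt(sum((zi - si) ** 2 / wi for zi, si, wi in zip(z, s, w)))

def identify(z, samples, w, u):
    """Return the label of the sample whose distance times second weight is smallest.

    samples: list of (name, feature_vector, label) tuples
    u: dict mapping a sample name to its second weight (hypothetical values)
    """
    scored = [(weighted_distance(z, vec, w) * u[name], label)
              for name, vec, label in samples]
    return min(scored, key=lambda pair: pair[0])[1]

# Hypothetical inputs: Z, first weights w1..w3, and three labelled samples.
z = [1.0, 2.0, 3.0]
w = [0.5, 0.25, 0.25]
samples = [
    ("A", [1.0, 2.0, 4.0], "abnormal"),
    ("B", [2.0, 2.0, 3.0], "normal"),
    ("C", [1.0, 4.0, 3.0], "normal"),
]
u = {"A": 0.5, "B": 1.0, "C": 1.0}  # illustrative second weights u1, u2, u3

result = identify(z, samples, w, u)
```

Because each squared difference is divided by its first weight, a feature with a larger first weight contributes less to the distance for the same raw difference.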
[0064] Alternatively, the identification result can be determined by applying only the second weight of the cluster distance. That is, based on the example above, the first cluster feature vector of merchant name Z in the transaction to be processed is [z1, z2, z3], the second cluster feature vector of sample merchant name A is [a1, a2, a3], that of sample merchant name B is [b1, b2, b3], and that of sample merchant name C is [c1, c2, c3]. The cluster distance between merchant name Z and sample merchant name A is $D_{az} = \left[ (z_1 - a_1)^2 + (z_2 - a_2)^2 + (z_3 - a_3)^2 \right]^{1/2}$. The cluster distance between merchant name Z and sample merchant name B is $D_{bz} = \left[ (z_1 - b_1)^2 + (z_2 - b_2)^2 + (z_3 - b_3)^2 \right]^{1/2}$. The cluster distance between merchant name Z and sample merchant name C is $D_{cz} = \left[ (z_1 - c_1)^2 + (z_2 - c_2)^2 + (z_3 - c_3)^2 \right]^{1/2}$.
[0065] Suppose the second weight corresponding to the first cluster distance (the cluster distance range closest to merchant name Z) is u1, and sample merchant name A is within the first cluster distance range from merchant name Z;
[0066] The second weight corresponding to the second cluster distance (the second closest cluster distance range to merchant name Z) is u2, and sample merchant name B is within the second cluster distance range from merchant name Z;
[0067] The second weight corresponding to the third cluster distance (the third closest cluster distance range to merchant name Z) is u3, and sample merchant name C is within the third cluster distance range from merchant name Z;
[0068] Accordingly, the normal or abnormal label of the sample merchant name corresponding to the minimum value among Daz*u1, Dbz*u2, and Dcz*u3 is used to determine whether the merchant name in the transaction to be processed is a normal merchant name or an abnormal merchant name.
[0069] This application provides a word segmentation similarity method, which calculates the similarity between each word segment of the merchant name in the transaction to be processed and each abnormal keyword, to obtain the similarity corresponding to each word segment. The method includes: obtaining the merchant name in the transaction to be processed; obtaining each word segment of the merchant name through Jieba word segmentation technology; obtaining each word segmentation vector through a word segmentation conversion vector tool; and calculating the similarity between each word segmentation vector and each abnormal keyword vector for each word segmentation vector of the merchant name, to obtain the similarity corresponding to the word segmentation.
[0070] In one example, the merchant name in the transaction to be processed is retrieved from the transaction system. The merchant name is then segmented using the `jieba.cut` function, yielding n word segments. For example, the merchant name "Shanghai ** Investment Management Co., Ltd." would be segmented into ["Shanghai", "**", "investment", "management", "co. Ltd"]. Next, the `Word2Vec` model in the Gensim module can be used to convert each word segment into a word segment vector. Finally, for each word segment, the similarity function is used to calculate its similarity to the M preset keywords:
[0071] S1 = similarity[segmentation(1), preset keyword(1)] + similarity[segmentation(1), preset keyword(2)] + ... similarity[segmentation(1), preset keyword(M)];
[0072] S2 = similarity[segmentation(2), preset keyword(1)] + similarity[segmentation(2), preset keyword(2)] + ... similarity[segmentation(2), preset keyword(M)];
[0073] ...
[0074] Sn = similarity[segmentation(n), preset keyword(1)] + similarity[segmentation(n), preset keyword(2)] + ... similarity[segmentation(n), preset keyword(M)].
[0075] The similarity scores of the word segments (1)-(n) with the M preset keywords are calculated to obtain the similarity scores S1-Sn corresponding to each word segment (1)-(n). It should be noted that the algorithm for calculating merchant similarity is not limited to the similarity function; any method that can obtain the similarity between vectors is acceptable, such as cosine similarity, Euclidean distance, etc. The similarity calculation for each word segment and the M preset keywords can also be performed by multiplying the similarity scores of each word segment with each preset keyword sequentially, etc. No specific restrictions are imposed here.
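The sums S1 to Sn can be sketched as below. This is only an illustration under stated assumptions: the vectors are hand-written toy values standing in for the output of the word-segment-to-vector tool, and cosine similarity (one of the options the paragraph above allows) stands in for the similarity function. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segment_scores(segment_vectors, keyword_vectors):
    """Return [S1, ..., Sn]: for each word segment, the sum of its
    similarities to all M preset keyword vectors."""
    return [sum(cosine(seg, kw) for kw in keyword_vectors)
            for seg in segment_vectors]

# Toy 2-dimensional vectors (assumed values, not real embeddings).
segments = [[1.0, 0.0], [0.0, 1.0]]   # n = 2 word segments
keywords = [[1.0, 0.0], [2.0, 0.0]]   # M = 2 preset keywords

scores = segment_scores(segments, keywords)
```

Here the first segment points in the same direction as both keywords and the second is orthogonal to both, so the first score is the larger one.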
[0076] This application provides a method for determining clustering features. Each clustering feature is a feature under which the difference among sample merchant names is greater than a set threshold. The method includes: for any sample merchant name, determining the feature value of the sample merchant name under each feature based on the similarity of each word segment of the sample merchant name; determining the variance of each feature based on the feature values of the sample merchant names under that feature; and determining the features whose variance is not less than a preset threshold as clustering features. In other words, multiple feature values can be obtained for any sample merchant name. For each feature in the sample set, the variance of the feature values of the sample merchant names under that feature is calculated, and features whose variance is not less than the preset threshold are determined as clustering features. If the variance is less than the preset threshold, the feature takes very similar values for normal and abnormal merchant names and cannot be used to distinguish whether a merchant name is normal or abnormal. Such features are therefore excluded from the calculation, which saves computational resources. For example, the features of a sample merchant name can include highest similarity, lowest similarity, average similarity, and word segmentation count. Under the variance selection method, the lowest similarity is found to be very similar for normal and abnormal merchant names (generally because most merchants in production are normal, so the variance of the lowest similarity across sample merchant names is relatively small), so the lowest similarity feature is removed. This is merely an example of using the variance selection method for feature selection and does not impose restrictions on the determination of clustering features.
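A minimal sketch of the variance selection step follows. The feature rows and per-feature thresholds are made-up values, and the use of the population variance (`statistics.pvariance`) is an assumption, since the text does not say which variance estimator is used.

```python
from statistics import pvariance

def select_cluster_features(rows, names, thresholds):
    """Keep the features whose variance across all sample merchant names
    is not less than that feature's preset threshold."""
    kept = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        if pvariance(column) >= thresholds[name]:
            kept.append(name)
    return kept

names = ["highest", "lowest", "average", "count"]
# One row per sample merchant name (hypothetical feature values).
rows = [
    [0.9, 0.10, 0.5, 4],
    [0.3, 0.11, 0.2, 6],
    [0.6, 0.10, 0.4, 5],
]
# Hypothetical preset thresholds, one per feature.
thresholds = {"highest": 0.01, "lowest": 0.01, "average": 0.01, "count": 0.1}

cluster_features = select_cluster_features(rows, names, thresholds)
```

With these values the lowest similarity barely varies across samples, so it is the only feature dropped.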
[0077] This application provides clustering features, wherein the clustering features include the highest similarity and the average similarity among the similarities corresponding to the word segments, and the number of word segments in the merchant name. In the above example, based on the variance selection method, if the variances of the highest similarity, the average similarity, and the number of word segments in the merchant name are greater than the preset thresholds corresponding to each feature, then the highest similarity, the average similarity, and the number of word segments in the merchant name are determined as clustering features. In the above example, for each merchant name,
[0078] This application provides a first weight determination method, wherein the first weight is determined based on the dispersion of each sample merchant name under each cluster feature, including:
[0079] For each abnormal sample merchant name, determine the feature value of the abnormal sample merchant name under each cluster feature;
[0080] For each cluster feature, the coefficient of variation of the cluster feature is determined based on the feature value of each abnormal sample merchant name under the cluster feature;
[0081] The first weight of each cluster feature is determined based on the coefficient of variation of each cluster feature.
[0082] In the example above, for each abnormal sample merchant name, the feature values of the abnormal sample merchant name under each cluster feature are determined, as shown in Table 1 below:
[0083]
Merchant   Highest similarity   Average similarity   Word segmentation count
A          0.624                0.572                4
B          0.454                0.382                4
C          0.925                0.742                4
D          0.854                0.624                5
E          0.754                0.682                5
F          0.654                0.495                6
[0084] Table 1
[0085] Furthermore, based on the feature values of the abnormal sample merchant names under each cluster feature in Table 1, the mean, standard deviation, and coefficient of variation of each cluster feature are calculated to obtain the first weight corresponding to each cluster feature,
[0086] as shown in Table 2:
[0087]
Index                      Highest similarity   Average similarity   Word segmentation count
Mean                       0.711                0.583                4.667
Standard deviation         0.17                 0.13                 0.816
Coefficient of variation   0.239                0.224                0.175
Weight                     0.375                0.35                 0.274
[0088] Table 2
[0089] The coefficient of variation (CV) is an important measure of dispersion, reflecting how much the feature values of each cluster feature differ and fluctuate across abnormal merchant names. Numerically, the CV equals the standard deviation divided by the mean. The first weight of a cluster feature equals the ratio of its CV to the sum of the CVs of all cluster features. For example, the sum of the CVs in Table 2 is 0.638, so the first weights for the highest similarity, average similarity, and word segmentation count are respectively:
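[0090] The first weight of the highest similarity:
[0091] $0.239 / 0.638 \approx 0.375$
[0092] The first weight of the average similarity:
[0093] $0.224 / 0.638 \approx 0.35$
[0094] The first weight of the word segmentation count:
[0095] $0.175 / 0.638 \approx 0.274$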
[0096] If the feature values corresponding to a certain clustering feature differ greatly, it indicates that this clustering feature is difficult to implement in production. Therefore, it is a key clustering feature that reflects whether a merchant name is normal or abnormal, so it should be assigned a higher first weight. In one example, the first weight of the highest similarity is greater than the first weight of the average similarity, which is greater than the first weight of the number of word segments, and the first weight of the highest similarity + the first weight of the average similarity + the first weight of the number of word segments = 1. In this way, by calculating the first weight of each clustering feature using the coefficient of variation method, the sensitivity of each clustering feature to the degree of abnormality of the merchant name is improved, thereby improving the accuracy of abnormal transaction identification results.
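The coefficient-of-variation weighting can be reproduced from the Table 1 values with a short sketch. The function name and dictionary keys are hypothetical; the sample standard deviation (n - 1 denominator, `statistics.stdev`) is used here because it matches the standard deviations listed in Table 2.

```python
from statistics import mean, stdev

# Feature values of the six abnormal sample merchant names (Table 1).
TABLE_1 = {
    "highest_similarity": [0.624, 0.454, 0.925, 0.854, 0.754, 0.654],
    "average_similarity": [0.572, 0.382, 0.742, 0.624, 0.682, 0.495],
    "segment_count": [4, 4, 4, 5, 5, 6],
}

def first_weights(features):
    """First weight of each cluster feature: its coefficient of variation
    (standard deviation / mean) divided by the sum of all coefficients."""
    cv = {name: stdev(values) / mean(values) for name, values in features.items()}
    total = sum(cv.values())
    return {name: c / total for name, c in cv.items()}

weights = first_weights(TABLE_1)
for name, value in weights.items():
    print(f"{name}: {value:.3f}")
```

Working from unrounded intermediate values, the weights agree with Table 2 to within rounding and sum to 1.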
[0097] This application provides a method for pre-screening merchant names, which, before obtaining the similarity corresponding to each word segmentation, further includes: determining that the merchant name is not in a merchant blacklist or whitelist;
[0098] After determining the identification result of the merchant name, the process further includes: when the merchant's identification result is an abnormal merchant, determining whether the abnormal keyword has been added to the merchant name. That is, before obtaining the similarity scores corresponding to each word segment of the merchant name to be processed, the merchant name is compared with merchant names in the blacklist and whitelist. If it can be determined at this point that the merchant name is in the blacklist, an alarm can be generated directly, and the transaction to be processed will not be processed; if it can be determined at this point that the merchant name is in the whitelist, the transaction to be processed for that merchant name can be processed directly. This prevents the waste of computational resources caused by identifying merchant names whose identification results can be directly determined from the blacklist and whitelist through the abnormal transaction identification system, and also speeds up the identification process.
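The pre-screening logic described above can be sketched as a simple lookup. The function name, the returned action strings, and the example merchant names are all hypothetical.

```python
def prescreen(merchant_name, blacklist, whitelist):
    """Decide what to do with a merchant name before any similarity is computed.

    Returns "alarm" for a blacklisted name (the transaction is not processed),
    "process" for a whitelisted name (the transaction is processed directly),
    and "identify" when the name is in neither list and must go through the
    full identification flow.
    """
    if merchant_name in blacklist:
        return "alarm"
    if merchant_name in whitelist:
        return "process"
    return "identify"

# Hypothetical lists; sets give constant-time membership checks.
blacklist = {"Merchant X"}
whitelist = {"Merchant Y"}

actions = [prescreen(name, blacklist, whitelist)
           for name in ("Merchant X", "Merchant Y", "Merchant Z")]
```

Only names that fall through to "identify" incur the cost of segmentation, vectorization, and clustering.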
[0099] This application provides a second weighting method, whereby the second weight includes a first value and a second value, comprising: setting the second weight corresponding to a distance less than a distance threshold as the first value; setting the second weight corresponding to a distance not less than the distance threshold as the second value; and ensuring that the first value is higher than the second value. In other words, the closer a sample merchant's name is to the name of the merchant to be processed within a defined clustering distance range, the higher the second weight value. Specifically, the closer a sample merchant's name is to the name of the merchant to be processed, the higher its similarity to that name.
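A two-valued second weight of this kind can be sketched as below. The default values 0.9 and 0.1 and the function name are assumptions made for illustration; the text only requires that the first value be higher than the second.

```python
def second_weight(distance, threshold, first_value=0.9, second_value=0.1):
    """Return first_value for distances below the threshold and
    second_value otherwise; first_value must be the higher of the two."""
    if first_value <= second_value:
        raise ValueError("first_value must be higher than second_value")
    return first_value if distance < threshold else second_value

near = second_weight(5.0, threshold=20.0)       # distance below the threshold
boundary = second_weight(20.0, threshold=20.0)  # "not less than" the threshold
```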
[0100] If the clustering algorithm used is KNN nearest neighbor clustering, the sample set contains the second cluster feature vectors of sample merchant names with known identification results. The first cluster feature vector of the merchant name to be processed is clustered with the second cluster feature vectors of each sample merchant name in the sample set. Each feature of the first cluster feature vector is compared with the corresponding feature in the second cluster feature vector of the sample merchant name. Then, the K most similar (closest) second cluster feature vectors to the first cluster feature vector in the sample set are extracted. The frequency of each type of label (normal sample merchant name and abnormal sample merchant name) in these K second cluster feature vectors is counted. The label with the most frequent occurrence is the classification label of the merchant name to be processed.
[0101] In most cases, there is an imbalance between the number of normal and abnormal merchant names. For example, a subset of normal merchant names may have a large sample size, while other subsets may have a small sample size. This can lead to a situation where, when a merchant name to be processed is input, the majority of its K nearest neighbors are normal merchant names. However, it is possible that these normal merchant names are not actually close to the merchant name to be processed, while a small number of abnormal merchant names are very close. This can easily result in inaccurate identification. To eliminate the problem of inaccurate identification caused by the imbalance between the number of normal and abnormal merchant names, a second weight is set. The training algorithm assigns a higher weight to merchant names that are closer to the merchant name to be processed, thereby improving the accuracy of the identification results.
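The imbalance problem described above can be seen in a plain majority-vote KNN sketch. The distances and labels are invented for illustration, and the helper name is hypothetical.

```python
from collections import Counter

def knn_majority(neighbours, k):
    """Unweighted KNN vote: take the k closest (distance, label) pairs
    and return the most frequent label among them."""
    nearest = sorted(neighbours, key=lambda pair: pair[0])[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# One abnormal sample is very close; the normal samples are far but numerous.
neighbours = [
    (1.0, "abnormal"),
    (8.0, "normal"),
    (9.0, "normal"),
    (9.5, "normal"),
    (20.0, "normal"),
]

label = knn_majority(neighbours, k=3)
```

With k = 3 the vote returns "normal" even though the closest sample is abnormal, which is the situation the second weight is intended to correct.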
[0102] Based on the above method flow, embodiments of this application provide a flow of an abnormal transaction identification method, as shown in Figure 4, which includes:
[0103] Step 401: Obtain the corpus.
[0104] Here, the corpus can be built from merchant names in historical production data and / or from merchant names in registration certificates, etc.
[0105] Step 402: Obtain the sample set and preset keywords. The sample set contains the names of normal sample merchants and abnormal sample merchants with known recognition results.
[0106] Step 403: Train the algorithm / model for the abnormal transaction identification system based on the merchant names in the sample set. Obtain the first weight, the second weight, and the clustering features.
[0107] Step 404: Obtain pending transactions and determine the merchant name of the pending transaction.
[0108] Step 405: Use the Jieba word segmentation tool to segment the merchant name of the transaction to be processed and obtain each word of the merchant name.
[0109] Step 406: Convert each word into a vector using the word segmentation to vector conversion tool.
[0110] Step 407: Calculate the similarity between each word segment and each preset keyword to obtain the similarity of each word segment.
[0111] Step 408: Determine the cluster feature value of the merchant name based on the similarity of each word segment.
[0112] Step 409: Determine the first cluster feature vector of the merchant name based on the feature values of the cluster features of the merchant name.
[0113] Step 410: Apply the first weight of each cluster feature to the corresponding feature value in the first cluster feature vector of the merchant name to obtain the first cluster feature vector after the first weight is applied.
[0114] Step 411: Cluster the first cluster feature vector of the merchant name after the first weight is applied with the second cluster feature vector of each sample merchant name in the sample set to obtain the clustering result.
[0115] Step 412: For each second cluster feature vector in the clustering result, determine the second weight of the clustering distance between that second cluster feature vector and the first cluster feature vector after the first weight is applied.
[0116] Step 413: Multiply each clustering distance (between a second cluster feature vector and the first cluster feature vector after the first weight is applied) by its corresponding second weight to obtain the clustering result after the second weight is applied.
[0117] Step 414: Determine the identification result of the merchant name to be processed based on the clustering result after calculating the second weight.
[0118] For example, suppose four sample merchant names close to the merchant name to be processed are selected from the sample set. Before the second weight is applied, the clustering distance between abnormal sample merchant name 1 and the merchant name to be processed is 10, the clustering distance between normal sample merchant name 1 and the merchant name to be processed is 9, the clustering distance between normal sample merchant name 2 and the merchant name to be processed is 30, and the clustering distance between normal sample merchant name 3 and the merchant name to be processed is 30. The second weight corresponding to a clustering distance in [0, 20] is 0.1, and the second weight corresponding to a clustering distance in [21, 40] is 0.9. After the second weight is applied, the clustering distance between abnormal sample merchant name 1 and the merchant name to be processed is 10 × 0.1 = 1, the clustering distance between normal sample merchant name 1 and the merchant name to be processed is 9 × 0.1 = 0.9, and the clustering distances between normal sample merchant names 2 and 3 and the merchant name to be processed are each 30 × 0.9 = 27. The final result is as follows: the clustering distance between the abnormal sample merchant names and the merchant name to be processed is 1, and the clustering distance between the normal sample merchant names and the merchant name to be processed is (27 + 27 + 0.9) / 3 = 18.3. Since 1 is less than 18.3, the merchant name to be processed is identified as abnormal.
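The arithmetic of this worked example can be checked with a short sketch. It applies the stated multipliers (0.1 for distances in [0, 20], 0.9 for distances in [21, 40]) and averages the weighted distances per label. Treating every distance above 20 as belonging to the second range is an assumption, since the text leaves the gap between 20 and 21 undefined; the function names are hypothetical.

```python
from statistics import mean

def example_multiplier(distance):
    """Multiplier from the worked example: 0.1 up to 20, otherwise 0.9
    (assumption: anything above 20 falls in the second range)."""
    return 0.1 if distance <= 20 else 0.9

def classify(neighbours):
    """Average the weighted distances per label and return the label
    with the smallest mean, together with the per-label means."""
    grouped = {}
    for distance, label in neighbours:
        grouped.setdefault(label, []).append(distance * example_multiplier(distance))
    means = {label: mean(values) for label, values in grouped.items()}
    return min(means, key=means.get), means

# Distances before the second weight is applied, as given in the example.
neighbours = [(10, "abnormal"), (9, "normal"), (30, "normal"), (30, "normal")]

label, means = classify(neighbours)
```

The abnormal group ends up with the smaller mean weighted distance, so the merchant name is labelled abnormal.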
[0119] It should be noted that the above process steps are not unique. For example, steps 401, 402, and 403 can be executed separately, and steps 401, 402, and 403 can be executed according to their respective update cycles during the operation of the abnormal transaction identification system.
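Steps 408 and 409 of the flow can be sketched as follows, assuming the three clustering features named earlier (highest similarity, average similarity, and number of word segments). The function name and the similarity values are hypothetical.

```python
from statistics import mean

def cluster_feature_vector(segment_similarities):
    """Build [highest similarity, average similarity, number of word segments]
    from the per-segment similarity scores of one merchant name."""
    if not segment_similarities:
        raise ValueError("a merchant name must have at least one word segment")
    return [
        max(segment_similarities),
        mean(segment_similarities),
        len(segment_similarities),
    ]

# Hypothetical similarity scores for a merchant name with three word segments.
vector = cluster_feature_vector([0.2, 0.8, 0.5])
```

The same function would be applied to each sample merchant name to obtain the second cluster feature vectors, so that both sides of the distance calculation use the same feature order.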
[0120] Based on the same concept, embodiments of the present invention provide an abnormal transaction identification device. Figure 5 is a schematic diagram of an abnormal transaction identification device provided in an embodiment of this application. As shown in Figure 5, the device includes:
[0121] Processing module 501 is used to determine the first clustering feature vector of the merchant name under each clustering feature based on the similarity corresponding to each word segmentation, wherein each clustering feature is a feature in which the difference of the sample merchant name under each feature is greater than a preset threshold.
[0122] The processing module 501 is further configured to determine the clustering result of the first clustering feature vector and the second clustering feature vector of each sample merchant name according to the first weight of each clustering feature, wherein the first weight is determined according to the degree of dispersion of each sample merchant name under each clustering feature.
[0123] The identification module 502 is used to determine the identification result of the merchant name based on the identification result of each sample merchant name in the clustering result; the identification result is used to indicate whether the merchant corresponding to the merchant name is an abnormal merchant.
[0124] Optionally, the identification module 502 is specifically used to determine the weighted clustering distance of each sample merchant in the clustering result according to the second weight of each clustering distance in the clustering result; the closer the distance in the second weight, the higher the weight value; and determine the identification result based on the sample merchant name with the minimum value in the weighted clustering distance, and determine it as the identification result of the merchant name.
[0125] Optionally, the processing module 501 is specifically used to: obtain the merchant name in the transaction to be processed; obtain each word of the merchant name through Jieba word segmentation technology; obtain each word vector of each word through a word segmentation conversion vector tool; and calculate the similarity between each word vector and each abnormal keyword vector for each word vector of the merchant name, thereby obtaining the similarity corresponding to the word segmentation.
[0126] Optionally, the processing module 501 is specifically used to: for any sample merchant name, determine the feature value of the sample merchant name under each feature based on the similarity of each word segment of the sample merchant name; determine the variance of each feature based on the feature value of each sample merchant name under each feature; and determine the features with variance not less than a preset threshold as clustering features.
[0127] Optionally, each clustering feature includes the highest similarity, the average similarity, and the number of segments in the merchant name among the similarities corresponding to each word segment.
[0128] Optionally, the processing module 501 is specifically used to: determine the feature value of each abnormal sample merchant name under each cluster feature; determine the coefficient of variation of each cluster feature based on the feature value of each abnormal sample merchant name under the cluster feature; and determine the first weight of each cluster feature based on the coefficient of variation of each cluster feature.
[0129] Optionally, the processing module 501 is further configured to determine that the merchant name is not in the merchant blacklist or whitelist; the processing module 501 is further configured to determine whether the abnormal keyword is added to the merchant name when the merchant identification result is an abnormal merchant.
[0130] Optionally, using a distance threshold as a boundary, the second weight corresponding to a distance less than the distance threshold is set as the first value; the second weight corresponding to a distance not less than the distance threshold is set as the second value; and the first value is higher than the second value.
[0131] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0132] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to this application. It should be understood that each process and / or block of the flowchart illustrations and / or block diagrams, and combinations of processes and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0133] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0134] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0135] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A method for identifying abnormal transactions, characterized in that it comprises: obtaining the merchant name in the transaction to be processed, and obtaining each word segment of the merchant name through Jieba word segmentation technology; obtaining the word segment vector of each word segment through a word-segment-to-vector conversion tool; for each word segment vector of the merchant name, calculating the similarity between the word segment vector and each abnormal keyword vector to obtain the similarity corresponding to the word segment; determining the first cluster feature vector of the merchant name under each cluster feature based on the similarity corresponding to each word segment, wherein each cluster feature is a feature under which the difference among sample merchant names is greater than a preset threshold; determining the clustering result of the first cluster feature vector and the second cluster feature vector of each sample merchant name based on the first weight of each cluster feature, wherein the first weight is determined based on the degree of dispersion of the sample merchant names under each cluster feature; determining the weighted clustering distance of each sample merchant in the clustering result based on the second weight of each clustering distance in the clustering result, wherein, in the second weight, the closer the distance, the higher the weight value; and determining the identification result of the sample merchant name with the minimum weighted clustering distance as the identification result of the merchant name, wherein the identification result is used to indicate whether the merchant corresponding to the merchant name is an abnormal merchant; wherein determining the first weight based on the degree of dispersion of the sample merchant names under each cluster feature comprises: for each abnormal sample merchant name, determining the feature value of the abnormal sample merchant name under each cluster feature; for each cluster feature, determining the coefficient of variation of the cluster feature based on the feature value of each abnormal sample merchant name under the cluster feature; and determining the first weight of each cluster feature based on the coefficient of variation of each cluster feature.
2. The method as described in claim 1, characterized in that each cluster feature being a feature under which the difference among sample merchant names is greater than a set threshold comprises: for any sample merchant name, determining the feature value of the sample merchant name under each feature based on the similarity of each word segment of the sample merchant name; determining the variance of each feature based on the feature values of the sample merchant names under that feature; and determining features whose variance is not less than a preset threshold as clustering features.
3. The method as described in claim 1, characterized in that the clustering features include the highest similarity and the average similarity among the similarities corresponding to the word segments, and the number of word segments in the merchant name.
4. The method as described in claim 1, characterized in that, before obtaining the similarity corresponding to each word segment, the method further comprises: determining that the merchant name is not in the merchant blacklist or whitelist; and after determining the identification result of the merchant name, the method further comprises: when the identification result is an abnormal merchant, determining whether abnormal keywords have been added to the merchant name.
5. The method as described in claim 4, characterized in that the second weight includes a first value and a second value, comprising: using a distance threshold as the boundary, setting the second weight corresponding to a distance less than the distance threshold to the first value; and setting the second weight corresponding to a distance not less than the distance threshold to the second value; wherein the first value is higher than the second value.
6. An abnormal transaction identification device, characterized in that it comprises: a processing module, used to obtain the merchant name in the transaction to be processed and obtain each word segment of the merchant name through Jieba word segmentation technology; obtain the word segment vector of each word segment through a word-segment-to-vector conversion tool; for each word segment vector of the merchant name, calculate the similarity between the word segment vector and each abnormal keyword vector to obtain the similarity corresponding to the word segment; and determine the first cluster feature vector of the merchant name under each cluster feature based on the similarity corresponding to each word segment, wherein each cluster feature is a feature under which the difference among sample merchant names is greater than a preset threshold; the processing module being further configured to determine the clustering result of the first cluster feature vector and the second cluster feature vector of each sample merchant name based on the first weight of each cluster feature, wherein the first weight is determined based on the degree of dispersion of the sample merchant names under each cluster feature; and an identification module, used to determine the weighted clustering distance of each sample merchant in the clustering result based on the second weight of each clustering distance in the clustering result, wherein, in the second weight, the closer the distance, the higher the weight value; and to determine the identification result of the sample merchant name with the minimum weighted clustering distance as the identification result of the merchant name, wherein the identification result is used to indicate whether the merchant corresponding to the merchant name is an abnormal merchant; wherein the processing module is specifically used to: for each abnormal sample merchant name, determine the feature value of the abnormal sample merchant name under each cluster feature; for each cluster feature, determine the coefficient of variation of the cluster feature based on the feature value of each abnormal sample merchant name under the cluster feature; and determine the first weight of each cluster feature based on the coefficient of variation of each cluster feature.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program that, when run on a computer, causes the computer to perform the method of any one of claims 1 to 5.
8. A computer device, characterized in that it comprises: a memory, used to store a computer program; and a processor, configured to invoke the computer program stored in the memory and execute the method as described in any one of claims 1 to 5 according to the obtained program.