Fraud phone recognition method, system and equipment based on multi-source features
A method for identifying fraudulent calls, applied in the information field, that addresses problems such as the inconvenience fraudulent calls cause in daily life and their negative effect on user experience
Active Publication Date: 2021-05-28
XI AN JIAOTONG UNIV
AI-Extracted Technical Summary
Problems solved by technology
[0002] With the continuous development of the communication industry, more and more users enjoy the convenience that communication brings to daily life. At the same time, fraudulent calls are increasingly common: large numbers of groups and individuals use attacks and other methods to harass target groups and defraud people of money. Such scam calls appear constantly, seriously degrading the user experience and causing great inconvenience in users' daily lives.
[0003] At present, methods for identifying fraudulent calls extract some basic features and apply machine learning or deep learning for identification, for example: [A fraudulent application detection method based on deep learning], [An analysis method for fraudulent calls based on multidimensional time series], [method and system for identifying fraudulent ...
Method used
In the fraudulent-call identification method based on multi-source features provided by the present invention, the selected users include normal numbers, sales promotion numbers, and fraud numbers, which yields a user classification closer to reality. Based on the selected users' second-degree call signaling, portrait data, location data, and Internet access data over a period of time, multi-source feature indicators are constructed, including the users' basic call features, portrait features, location features, and Internet access features, and the graph-structure-similarity-based graph network m...
Abstract
The invention discloses a fraudulent-call identification method, system, and device based on multi-source features. In the method, the selected users include normal numbers, promotion numbers, and fraud numbers, which yields a more realistic user classification. Multi-source feature indicators are constructed for the selected users, including basic call features, portrait features, location features, and Internet access features. Structural features of each user's second-degree network are extracted with the Struct2Vec graph network model, which is based on graph-structure similarity, and fraud-pattern structures such as multi-point, one-line are identified. The users' second-degree call data is converted into call time-series data, from which time-series-based feature combinations are extracted. On the basis of the constructed multi-source features, the sample data set is balanced with the oversampling method Borderline-SMOTE, and a classification model that identifies normal, fraud, and promotion numbers is finally constructed. The model is trained and makes predictions using several different ensemble learning combinations and, combined with a black and white list filtering mechanism, achieves accurate and effective identification of fraudulent calls.
Application Domain
Character and pattern recognition; Supervisory/monitoring/testing arrangements
Technology Topic
Machine learning; Whitelist
Example Embodiment
[0052] The present invention provides a fraudulent-call identification method based on multi-source features. In this method, the selected users include normal numbers, sales numbers, and fraud numbers, which yields a user classification closer to reality. Based on the selected users' second-degree call signaling, portrait data, location data, and Internet access data, multi-source feature indicators are constructed, including the users' basic call features, portrait features, location features, and Internet access features. The graph network model Struct2Vec, which is based on graph-structure similarity, extracts the structural features of each user's second-degree network and identifies fraud-pattern structures such as multi-point, one-line. The users' second-degree call data is converted into call time-series data, from which time-series-based feature combinations are extracted. On the basis of the constructed multi-source features, the improved SMOTE oversampling method Borderline-SMOTE balances the sample data set, and a classification model that identifies normal, fraud, and sales numbers is finally built. The model is trained and makes predictions using several different ensemble learning combinations and, combined with a black and white list filtering mechanism, achieves accurate and effective identification of fraudulent calls; see Figure 1.
[0053] Specifically, the method comprises the following steps:
[0054] Step 1. Construct three types of sample data covering normal numbers, sales numbers, and fraud numbers. The sample data includes each user's second-degree call data, location data, and Internet access data. From this data, extract the basic features of the selected users: basic call features, portrait features, location features, and Internet access features.
[0055] Step 2. Based on the users' second-degree call data, construct the graph network model Struct2Vec, extract the structural features of each user's second-degree network, and identify structural features such as multi-point, one-line.
[0056] Step 3. Convert the users' call data into call time-series data and construct a combination of time-series-based call features.
[0057] Step 4. Balance the sample data set with Borderline-SMOTE, an improved SMOTE oversampling method.
[0058] Step 5. Based on the feature data, construct several different ensemble learning combinations, including Boosting and Bagging, assign weights to them, and build a classification model that identifies normal, fraud, and sales numbers. Combined with the black and white list mechanism, the model accurately identifies fraud numbers.
[0059] In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
[0060] From the operator data, samples of fraud numbers, sales numbers, and randomly sampled numbers are collected, together with their call data, portrait data, location data, and Internet access data. This lays the data foundation for the subsequent steps.
[0061] Step 1. Construct three types of sample data covering normal numbers, sales numbers, and fraud numbers. The sample data includes each user's second-degree call data, location data, and Internet access data. From this data, extract the basic features of the selected users: basic call features, portrait features, location features, and Internet access features.
[0062] The call features include: the total number of calls as calling party and as called party; the mean, variance, and standard deviation of call duration as calling and called party; the ratio of outgoing to incoming calls on working days; the number of calls to the same number and the contact interval for the same number; the average number of calls; the proportion of calls to numbers in the same number segment and to consecutive numbers within a segment; the number of calls per unit of time (hour, day, week, month); the average duration of consecutive calls and the probability of consecutive calls; the number of called numbers that were called only once; whether the called party dials 110, 12321, or a similar number within a certain time after being called; whether the called party is a contact or a stranger; the number of distinct numbers called; the proportion of calls with a duration of 0; the caller hang-up rate; the mean, variance, and standard deviation of ring duration; and the same call features computed for the peer user.
[0063] The location features include: the number of locations from which outgoing calls are made; the distribution of the proportion of outgoing calls made from the same location; the entropy of the distribution over all locations; as calling party, the number of different locations, days, and cities, with their counts and entropy values; the number of called numbers from other regions; the maximum distance; the average number of visits to base stations and the corresponding entropy; and the number of calls, days, and records occurring at the workplace and at home, and the like.
[0064] The portrait features include: user gender, age, account opening time, accumulated active days, recent call-cost distribution, whether the number is on an external network, registration method, time of the last service suspension, number of suspensions, number of exchanges, whether the number belongs to a virtual operator, and the like.
[0065] The Internet access features include: traffic statistics, such as total time, mean, variance, and usage trend; URL access statistics, such as access trends and counts for malicious websites, adult websites, gambling websites, and the like; and application statistics, such as the usage and proportion of common apps and usage statistics for special apps.
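As a rough illustration of how a few of the basic features listed above might be computed, the following sketch derives an outgoing-call ratio, a mean call duration, and the entropy of the location distribution from a list of call records. The `CallRecord` layout, its field names, and the function names are assumptions made for this example; the source does not specify a data schema.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class CallRecord:
    # Hypothetical record layout; field names are not taken from the source.
    direction: str     # "out" when the user is the calling party, "in" when called
    duration_s: float  # call duration in seconds
    cell_id: str       # base station where the call occurred


def distribution_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def basic_features(records):
    """Compute a small subset of call and location features for one user."""
    n = len(records)
    if n == 0:
        return {"total_calls": 0, "outgoing_ratio": 0.0,
                "mean_duration_s": 0.0, "location_entropy": 0.0}
    n_out = sum(1 for r in records if r.direction == "out")
    return {
        "total_calls": n,
        "outgoing_ratio": n_out / n,
        "mean_duration_s": sum(r.duration_s for r in records) / n,
        "location_entropy": distribution_entropy(r.cell_id for r in records),
    }


records = [
    CallRecord("out", 30.0, "A"),
    CallRecord("out", 10.0, "A"),
    CallRecord("in", 20.0, "B"),
    CallRecord("out", 60.0, "B"),
]
features = basic_features(records)
```

A user who calls from many locations with equal frequency gets a high location entropy, while a user who always calls from one base station gets an entropy of zero.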
[0066] Step 2. Based on the users' second-degree call data, construct the graph network model Struct2Vec, extract the structural features of each user's second-degree network, and identify structural features such as multi-point, one-line. This step is divided into the following smaller steps:
[0067] Step 1: Construct a graph from the second-degree call network and, for each vertex, obtain the vertices in each layer. A layer is defined by distance from that vertex: the first-degree network is the first layer, the second-degree network is the second layer, and so on.
[0068] Vertex-pair distance formula:
[0069] $$f_k(u, v) = f_{k-1}(u, v) + g\big(s(R_k(u)),\, s(R_k(v))\big), \quad k \ge 0 \ \text{and} \ |R_k(u)|,\, |R_k(v)| > 0$$
[0070] Here, $R_k(u)$ denotes the set of vertices at distance $k$ from vertex $u$, and $R_k(v)$ denotes the set of vertices at distance $k$ from vertex $v$. $s(R_k(u))$ denotes the ordered degree sequence of the vertex set $R_k(u)$: the vertices at distance $k$ from $u$, arranged in order of vertex degree.
[0071] $g(D_1, D_2) \ge 0$ is a function that measures the distance between the ordered degree sequences $D_1$ and $D_2$. Because $s(R_k(u))$ and $s(R_k(v))$ may differ in length and may contain repeated elements, Dynamic Time Warping (DTW) is used to measure the distance between the two ordered sequences; DTW can compare sequences of different lengths that contain repeated elements. Based on DTW, a distance function between sequence elements is defined; this definition penalizes a given difference more heavily when the degrees of the two vertices are relatively small.
[0072] $f_k(u, v)$ denotes the structural distance between vertices $u$ and $v$. The distance at level $k$ effectively takes into account the set of all vertices at distance less than or equal to $k$, because $f_{k-1}(u, v)$ is added at every iteration. This is the vertex-pair distance function.
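The vertex-pair distance described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the graph is an adjacency dictionary, every vertex has degree of at least 1, and the element distance `max(a, b) / min(a, b) - 1` is borrowed from the published struc2vec formulation because the formula itself does not appear in the text above. All function names are hypothetical.

```python
from collections import deque


def rings(adj, u, max_k):
    """R_k(u) for k = 0..max_k: vertices at exactly k hops from u (BFS)."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == max_k:
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    out = [[] for _ in range(max_k + 1)]
    for vertex, d in dist.items():
        out[d].append(vertex)
    return out


def degree_sequence(adj, vertices):
    """s(R): degrees of the vertices in R, sorted in ascending order."""
    return sorted(len(adj[v]) for v in vertices)


def elem_dist(a, b):
    # Assumed element distance (struc2vec paper); requires degrees >= 1.
    return max(a, b) / min(a, b) - 1.0


def dtw(s, t):
    """Dynamic time warping cost between two non-empty sequences."""
    inf = float("inf")
    cost = [[inf] * (len(t) + 1) for _ in range(len(s) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = elem_dist(s[i - 1], t[j - 1]) + step
    return cost[len(s)][len(t)]


def structural_distance(adj, u, v, max_k):
    """Return [f_0(u, v), f_1(u, v), ...] while both rings are non-empty."""
    ru, rv = rings(adj, u, max_k), rings(adj, v, max_k)
    f, total = [], 0.0
    for k in range(max_k + 1):
        if not ru[k] or not rv[k]:
            break
        total += dtw(degree_sequence(adj, ru[k]), degree_sequence(adj, rv[k]))
        f.append(total)
    return f


# Two hubs with identical surroundings, joined through the vertex "m".
adj = {
    "hub1": ["a", "b", "m"], "hub2": ["c", "d", "m"],
    "a": ["hub1"], "b": ["hub1"], "c": ["hub2"], "d": ["hub2"],
    "m": ["hub1", "hub2"],
}
same = structural_distance(adj, "hub1", "hub2", 2)
diff = structural_distance(adj, "hub1", "a", 1)
```

In this toy graph the two hubs are structurally identical, so their distance is zero at every level, while a hub and one of its leaves are already far apart at level 0 because their degrees differ.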
[0073] Step 2: Construct a hierarchical graph from the vertex-pair distances.
[0074] A distance between two vertices can be computed for each $k$. The ordered-degree-sequence distances between vertices obtained above are used mainly to construct a hierarchical, weighted graph for the subsequent random walk.
[0075] The weight of the edge between two vertices in layer $k$ is defined so that it is never greater than 1 and equals 1 when the distance is 0.
[0076] Copies of the same vertex that belong to different layers are connected by directed edges: each vertex is connected to its corresponding vertex in the layer above and to its corresponding vertex in the layer below.
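One weight definition with the properties stated above (at most 1, and exactly 1 at distance 0) is the exponential form used in the published struc2vec formulation, $w_k(u, v) = e^{-f_k(u, v)}$. The sketch below assumes that form; the source text does not reproduce its own formula, so this choice and the function names are assumptions.

```python
import math


def layer_edge_weight(f_k):
    """Assumed within-layer edge weight for a structural distance f_k(u, v) >= 0."""
    return math.exp(-f_k)


def layer_weights(distances):
    """Map {(u, v): f_k(u, v)} to {(u, v): weight} for a single layer k."""
    return {pair: layer_edge_weight(d) for pair, d in distances.items()}


weights = layer_weights({("u", "v"): 0.0, ("u", "w"): 2.0})
```

Structurally identical vertices receive the maximum weight, so a random walk on the layer is most likely to move between vertices whose neighborhoods look alike.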
[0077] Step 3: Sample vertices by random walk in the hierarchical graph.
[0078] Vertex sequences are sampled from the hierarchical graph by random walk. First, any vertex is selected as the starting point, and a random walk from it yields a sequence of vertices. This sequence is then treated as a sentence from which vertex representations are learned. Random walks over the vertices of the hierarchical graph thus yield the embedded feature vectors of all vertices.
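The sampling step can be illustrated with a simplified weighted random walk. This sketch walks within a single layer only and omits the moves between layers that the hierarchical graph allows; the weight dictionary, function names, and parameters are assumptions for illustration. The resulting walks would then be passed to a sequence-embedding model, which is not shown here.

```python
import random


def weighted_walk(weights, start, length, rng):
    """Walk up to `length` steps; the next vertex is drawn in proportion to edge weight."""
    walk = [start]
    for _ in range(length):
        nbrs = weights.get(walk[-1])
        if not nbrs:
            break
        vertices = list(nbrs)
        probs = [nbrs[v] for v in vertices]
        walk.append(rng.choices(vertices, weights=probs, k=1)[0])
    return walk


def sample_walks(weights, walks_per_vertex, length, seed=0):
    """Start several walks from every vertex; each walk acts as one 'sentence'."""
    rng = random.Random(seed)
    return [weighted_walk(weights, v, length, rng)
            for v in weights for _ in range(walks_per_vertex)]


# Hypothetical single-layer weights: "u" and "v" are structurally close.
layer = {
    "u": {"v": 1.0, "w": 0.1},
    "v": {"u": 1.0, "w": 0.1},
    "w": {"u": 0.1, "v": 0.1},
}
walks = sample_walks(layer, walks_per_vertex=2, length=5)
```

Because structurally similar vertices co-occur in many walks, an embedding model trained on these sequences tends to place them close together.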
[0079] Step 3. Convert the users' call data into call time-series data and construct a combination of time-series-based call features. This step includes the following smaller steps:
[0080] Step 1: Construct each user's call time-series data from the number and duration of the user's calls, the number of calls per day, and the call intervals occurring within each time period.
[0081] Step 2: Using the open-source Python package tsfresh, input the constructed time-series data and output the time-series features. The features include, but are not limited to: the sum of squares and absolute-value statistics of the series; the approximate entropy of the series (used to measure the periodicity, unpredictability, and volatility of a time series); the length, maximum, minimum, and number of non-repeated values of the series; and the number of values greater (smaller) than the mean. In total, 64 features based on time-series variation are extracted and used as input feature values for the model.
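To make the time-series step concrete, the sketch below builds a daily call-count series and computes a handful of statistics by hand. The feature names mirror calculators that tsfresh also provides (`abs_energy`, `absolute_sum_of_changes`, `count_above_mean`, `count_below_mean`), but the code is a plain-Python stand-in for illustration, not a tsfresh call, and the input layout is an assumption.

```python
from collections import Counter
from datetime import date, timedelta


def daily_call_counts(call_dates, start, days):
    """Turn a list of call dates into a fixed-length series of calls per day."""
    counts = Counter(call_dates)
    return [counts.get(start + timedelta(days=i), 0) for i in range(days)]


def timing_features(series):
    """Hand-computed versions of a few common time-series statistics."""
    mean = sum(series) / len(series)
    return {
        "abs_energy": sum(x * x for x in series),
        "absolute_sum_of_changes": sum(abs(b - a) for a, b in zip(series, series[1:])),
        "maximum": max(series),
        "minimum": min(series),
        "count_above_mean": sum(1 for x in series if x > mean),
        "count_below_mean": sum(1 for x in series if x < mean),
    }


start = date(2021, 1, 1)
call_dates = [date(2021, 1, 1)] * 3 + [date(2021, 1, 2)] + [date(2021, 1, 4)] * 4
series = daily_call_counts(call_dates, start, days=4)
features = timing_features(series)
```

For the full feature set, tsfresh's `extract_features` function accepts a long-format table with an id column and a sort column and returns one row of features per id.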
[0082] Step 4. Balance the sample data set with Borderline-SMOTE, an improved SMOTE oversampling method. This step includes the following smaller steps:
[0083] Step 1: Split the prepared feature data into a training set and a test set according to a preset ratio, with a fixed random seed. Keep the test set unchanged so that effects can be compared.
[0084] Step 2: Apply Borderline-SMOTE oversampling to the training set. Divide the fraud samples in the training set into three categories: Safe, Danger, and Noise. A Safe sample is one for which more than half of the surrounding samples belong to the minority class. A Danger sample is one for which more than half of the surrounding samples belong to the majority class; it is considered to lie on the class boundary. A Noise sample is one for which all surrounding samples belong to the majority class; it is treated as noise. Only samples of the Danger class are used for oversampling.
[0085] Step 3: Oversample the minority-class samples of the Danger class: use the k-nearest-neighbor method to randomly select minority-class neighbors and generate new minority-class samples.
[0086] Step 4: Randomly change the fixed test set, compare the effects, and select an appropriate expansion ratio and parameters.
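The Safe, Danger, and Noise labeling and the interpolation step can be sketched in plain Python as follows. This is a simplified illustration with assumed names and toy data, not the patented procedure; in practice a library implementation such as `BorderlineSMOTE` from imbalanced-learn would normally be used.

```python
import math
import random


def k_nearest(point, pool, k):
    """Positions of the k points in pool closest to point (Euclidean distance)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def label_minority(X, y, minority, k):
    """Tag each minority sample as 'safe', 'danger', or 'noise' by its k neighbors."""
    tags = {}
    for i, label in enumerate(y):
        if label != minority:
            continue
        others = [j for j in range(len(X)) if j != i]
        nn = k_nearest(X[i], [X[j] for j in others], k)
        n_major = sum(1 for pos in nn if y[others[pos]] != minority)
        if n_major == k:
            tags[i] = "noise"
        elif n_major >= k / 2:
            tags[i] = "danger"
        else:
            tags[i] = "safe"
    return tags


def oversample_danger(X, y, minority, k, n_new, seed=0):
    """Create n_new synthetic points between danger samples and minority neighbors."""
    rng = random.Random(seed)
    tags = label_minority(X, y, minority, k)
    danger = [i for i, t in tags.items() if t == "danger"]
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    synthetic = []
    if not danger or len(minority_idx) < 2:
        return synthetic
    for _ in range(n_new):
        i = rng.choice(danger)
        mates = [j for j in minority_idx if j != i]
        nn = k_nearest(X[i], [X[j] for j in mates], min(k, len(mates)))
        j = mates[rng.choice(nn)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic


# Toy data: a fraud cluster, one borderline fraud point, one isolated fraud point.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.5, 0.5), (5.0, 5.0),
     (0.6, 0.6), (0.7, 0.5), (1.5, 1.5), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0)]
y = ["fraud"] * 6 + ["normal"] * 6
tags = label_minority(X, y, "fraud", k=3)
new_points = oversample_danger(X, y, "fraud", k=3, n_new=4)
```

Only the borderline point at (0.5, 0.5) is labeled Danger here, so every synthetic sample lies on a segment between it and one of its nearest fraud neighbors.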
[0087] Step 5. Based on the feature data, construct several different ensemble learning combinations, including Boosting and Bagging, assign weights to them, and build a classification model that identifies normal, fraud, and sales numbers. Combined with the black and white list mechanism, the model accurately identifies fraud numbers. The details are as follows:
[0088] Step 1: Build binary classification models for normal versus fraud, normal versus sales, and sales versus fraud. Each binary model uses Boosting and Bagging ensemble learning algorithms, such as random forest for Bagging and XGBoost, LightGBM, GBDT, and AdaBoost for Boosting, and the two kinds of ensemble learning are combined. During training, the final model results are output as probabilities.
[0089] Step 2: Combine the probabilities, using grid search to select appropriate weights, and construct a three-class model. For example, let X, Y, and Z denote the outputs of the selected models and a, b, and c denote the assigned weights, with a + b + c = 1. Grid search selects the best weight allocation, and the outputs of the binary classifiers are combined to construct a three-class identification model: W = aX + bY + cZ, with a + b + c = 1.
[0090] Step 3: Perform whitelist filtering and blacklist matching against the black and white lists, then use the three-class identification model to identify the remaining numbers that the lists do not resolve.
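The weight selection in Step 2, W = aX + bY + cZ with a + b + c = 1, can be illustrated with a small grid search. In this sketch X, Y, and Z are assumed to be per-class probability vectors from three models and the weights are chosen to maximize accuracy on labeled validation samples; the source does not state the scoring criterion or how the binary outputs are mapped to three classes, so those choices and all names are assumptions.

```python
def combine(x, y, z, a, b, c):
    """W = aX + bY + cZ, applied element-wise to class-probability vectors."""
    return [a * xi + b * yi + c * zi for xi, yi, zi in zip(x, y, z)]


def accuracy(weights, outputs, labels):
    """Share of samples whose highest combined score matches the true label."""
    a, b, c = weights
    hits = 0
    for (x, y, z), label in zip(outputs, labels):
        w = combine(x, y, z, a, b, c)
        predicted = max(range(len(w)), key=w.__getitem__)
        hits += predicted == label
    return hits / len(labels)


def grid_search(outputs, labels, steps=10):
    """Try every (a, b, c) on a grid with a + b + c = 1 and keep the best."""
    best, best_acc = None, -1.0
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            weights = (i / steps, j / steps, (steps - i - j) / steps)
            acc = accuracy(weights, outputs, labels)
            if acc > best_acc:
                best, best_acc = weights, acc
    return best, best_acc


# Classes: 0 = normal, 1 = fraud, 2 = sales. One (X, Y, Z) triple per sample.
outputs = [
    ([0.8, 0.1, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]),
    ([0.1, 0.7, 0.2], [0.2, 0.6, 0.2], [0.5, 0.3, 0.2]),
    ([0.2, 0.2, 0.6], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3]),
]
labels = [0, 1, 2]
best_weights, best_accuracy = grid_search(outputs, labels)
```

A finer grid (larger `steps`) gives a more precise weight allocation at the cost of more evaluations; the number of candidates grows quadratically with `steps`.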
[0091] The black and white lists include: fraudulent call numbers and sales numbers, as well as reliably labeled numbers such as takeaway delivery, taxis, ride-hailing (DiDi) drivers, and registered company phones. The latter numbers show high-volume, high-frequency calling, but their life cycle is long; once labeled, they can be filtered effectively through the black and white lists. Sales calls resemble fraudulent calls in having a high outgoing-call frequency and a short life cycle, so the present invention treats them as a separate class to be identified.
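The order of operations described above (whitelist filtering, then blacklist matching, then the three-class model for whatever remains) can be sketched as follows. The list contents, labels, and the stand-in classifier are hypothetical.

```python
def identify(number, whitelist, blacklist, classify):
    """Whitelist filter first, then blacklist match, then fall back to the model."""
    if number in whitelist:
        return "normal"
    if number in blacklist:
        return blacklist[number]  # stored label, e.g. "fraud" or "sales"
    return classify(number)


# Hypothetical lists and a stand-in for the trained three-class model.
whitelist = {"1001", "1002"}                    # e.g. labeled delivery or taxi numbers
blacklist = {"2001": "fraud", "2002": "sales"}  # previously labeled numbers


def stub_model(number):
    # Placeholder rule used only for this example; a real model would use features.
    return "fraud" if number.endswith("9") else "normal"


results = {n: identify(n, whitelist, blacklist, stub_model)
           for n in ["1001", "2001", "2002", "3009", "3000"]}
```

Checking the lists first means that long-lived, already labeled numbers never reach the model, which leaves the classifier to handle only unseen numbers.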
[0092] Alternatively, the present invention also provides a computer device including, but not limited to, one or more processors and a memory. The memory stores a computer-executable program. The processor reads part or all of the computer-executable program from the memory and executes it, and in doing so can perform some or all of the steps of the fraudulent-call identification method of the present invention.
[0093] The fraudulent-call identification device can be a laptop, a tablet, a desktop computer, or a workstation.
[0094] The processor can be a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).
[0095] The memory of the present invention can be an internal storage unit of a laptop, a tablet, a desktop computer, a mobile phone, or a workstation, such as memory or a hard disk, or an external storage unit, such as a portable hard disk or a flash card.
[0096] Computer-readable storage media can include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media can include read-only memory (ROM), random access memory (RAM), solid-state drives (SSD), or optical discs. The random access memory can include resistive random access memory (ReRAM).