Communication identification device, method, and program

The communication identification device addresses the challenge of identifying steady-state communication in dynamic network environments by classifying and clustering flow data, masking the mismatched portion of the destination connection information, and using the fluctuation components of communication volume and communication frequency to identify steady-state communication between a communication source and its destinations.

JP2026100482A · Pending · Publication Date: 2026-06-19 · KDDI CORP +2

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
KDDI CORP
Filing Date
2024-12-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing methods for identifying steady communication fail in network environments where connection information of the communication destination changes dynamically due to Dynamic DNS, as they rely on fixed IP addresses and port numbers.

Method used

A communication identification device that collects flow data, classifies it into sets based on source and destination connection information, calculates communication fluctuations, clusters the data, masks mismatched destination information, and identifies steady communication using fluctuation components and index values.

Benefits of technology

Identifies steady-state communication even in dynamic network environments by quantifying the fluctuation of unit communication volume and communication frequency, so that steady-state communication from a given source can still be recognized when the destination connection information changes.


Abstract

[Problem] To identify steady-state communication in a network environment where the connection information of the communication destination changes dynamically. [Solution] The data collection unit 10 collects flow data from communications on the network. The data classification unit 20 classifies the collected flow data into first sets, each unique to a combination of source connection information and destination connection information. The mask unit 30 calculates a first index value for each first set based on the unit communication volume and communication frequency, clusters all first sets based on the first index value, and, for the first sets classified into the same cluster, masks the mismatched portion of the destination connection information in combinations where the source connection information matches but the destination connection information does not. The index value calculation unit 40 calculates a second index value, based on the unit communication volume and communication frequency, for each first set and for each second set of combinations in which the mismatched portion of the destination connection information has been masked. The communication identification unit 50 identifies steady-state communication for each first and second set based on the second index value.

Description

【Technical Field】 【0001】 The present invention relates to an apparatus, method, and program for analyzing communication on a network to identify steady communication, and particularly to a communication identification apparatus, method, and program for identifying steady communication with a communication source in a network environment where connection information of a communication destination changes dynamically. 【Background Art】 【0002】 Techniques for identifying foreground communication and background communication in a communication terminal and improved inventions thereof are disclosed in Patent Documents 1-4. 【0003】 Patent Document 1 discloses a technique for identifying, for a mobile terminal or the like, foreground communication generated by a user's operation and background communication in which an application communicates independently of the user's operation. Here, the steady communication to be identified by the present invention corresponds to a part of the background communication. 【0004】 In Patent Document 1, communication is uniquely identified by a combination of a communication source IP address, a communication destination IP address, and a communication destination port number (SD group), and each communication is classified as foreground communication or background communication based on autocorrelation and cross-correlation regarding the occurrence timing of the communication. 【0005】 In Patent Document 1, it is determined that communication with a high autocorrelation coefficient is likely to be background communication that is automatically and mechanically executed by the OS or an application independently of the user's operation. Also, communication with a high cross-correlation coefficient with communication already classified as background communication is similarly likely to be background communication. 
On the other hand, even when cross-correlation is recognized, if the occurrence timing of the communication is less than a predetermined threshold, it may be user-initiated communication, so it is classified as foreground communication. 【0006】 Patent Document 2 discloses a method for reducing computational complexity based on the method described in Patent Document 1. The time axis is discretized into bins of width Δt, and the timing of communication occurrences is vectorized by setting a 1 in the nth digit of the bit sequence T when communication occurs in the nth bin. By representing each digit with a 1 as the distance from the starting bin, the original sparse vector can be compressed. By calculating the confidence level from the compressed vector, foreground and background communication can be distinguished with less computation. 【0007】 Patent Document 3 discloses an invention that distinguishes between foreground and background communication even when the periodicity of communication occurrence times is disrupted due to network conditions or other reasons, in addition to the method described in Patent Document 2. Improvements have been made to the invention by using statistical quantities such as the amount of communication per Δt, the number of packets, and throughput characteristics as communication characteristics, in addition to the time at which communication occurs, so that even when the periodicity is considered to be disrupted, if the time-series fluctuation of the communication characteristics is small, it can be determined to be background communication. 【0008】 Patent Document 4 discloses an invention for identifying background communication in cases where, in addition to the method described in Patent Document 3, content servers are distributed in cloud services, etc., and even if the source is the same and the session is related to the same service or application, the destination may differ. 
Specifically, by aggregating SD groups with the same source information but different destination information based on the correlation of the communication characteristics of each session, communications with different destinations are identified as background communication. [Prior Art Documents] [Patent Documents] 【0009】 [Patent Document 1] Japanese Patent Publication No. 2015-162810 [Patent Document 2] Japanese Patent Publication No. 2016-082518 [Patent Document 3] Japanese Patent Publication No. 2016-116026 [Patent Document 4] Japanese Patent Publication No. 2016-127361 [Summary of the Invention] [Problems to be solved by the invention] 【0010】 In the field of communication security, the increasing speed and volume of communication data make packet-level processing difficult, which has led to the use of lighter flow data such as IPFIX. 【0011】 Communication data can be broadly classified into steady-state and non-steady-state communication. Steady-state communication, such as Keep Alive, is communication whose unit communication volume and communication frequency are constant, and it commonly appears in many kinds of communication data. 【0012】 On the other hand, in tasks such as abnormal communication detection and application identification, the distinguishing characteristics often appear in the non-steady-state communication, so handling steady-state and non-steady-state communication separately is expected to improve the accuracy of those tasks. 【0013】 Patent Documents 1-4 disclose methods for distinguishing between foreground and background communication based on information about the source and destination of communication, as well as communication characteristics such as communication volume and packet size. These methods are based on autocorrelation of the timing of communication occurrence; in other words, if communication occurs steadily from the start of the session, it is determined to be background communication. 
Furthermore, Patent Document 3 takes into account fluctuations in communication characteristics such as communication volume and packet size in addition to the timing of communication occurrence. 【0014】 However, in recent applications that use cloud computing or Dynamic DNS, connection information such as the IP address and port number of the communication destination may change dynamically for purposes such as load balancing. When the connection information of the communication destination changes dynamically, the methods described in Patent Documents 1-4 cannot be used as they are to identify steady-state communication. 【0015】 The objective of the present invention is to solve the above technical problem and to identify steady-state communication in a network environment where the connection information of the communication destination changes dynamically, by masking a portion of the connection information of the communication destination, quantifying the stationarity of the unit communication volume and of the communication frequency as index values, and combining the two index values. 
[Means for solving the problem] 【0016】 To achieve the above objectives, the present invention comprises means for collecting flow data from communication data on a network; means for classifying the collected flow data into a first set unique to each combination of source connection information x and destination connection information y; means for calculating communication fluctuations based on the flow data for each first set; means for clustering all first sets based on communication fluctuations; means for masking the mismatched portion of destination connection information y for combinations in which source connection information x matches but destination connection information y does not match for each first set classified into the same cluster; means for calculating the fluctuation components of unit communication amount and communication frequency for each first set and each second set of combinations in which the mismatched portion of destination connection information y has been masked; and means for identifying steady communication for each first and second set based on each fluctuation component. [Effects of the Invention] 【0017】 According to the present invention, even in network environments where connection information of the communication destination changes dynamically due to Dynamic DNS or the like, it becomes possible to accurately identify steady-state communication between the communication source and the communication destination. [Brief explanation of the drawing] 【0018】 [Figure 1] This figure shows the configuration of a network to which the communication identification device of the present invention is applied. [Figure 2] This is a functional block diagram showing the configuration of a communication identification device to which the present invention is applied. [Figure 3] This diagram schematically illustrates a method for classifying flow data. 
[Figure 4] This diagram schematically illustrates a method for masking part of the communication destination connection information. [Figure 5] This is a flowchart illustrating the method for identifying regular communications using a communication identification device. [Figure 6] This diagram illustrates the method for calculating communication frequency. [Modes for carrying out the invention] 【0019】 Embodiments of the present invention will now be described in detail with reference to the drawings. Figure 1 is a schematic diagram showing the configuration of a network to which the communication identification device of the present invention is applied, in which multiple clients as source terminals and multiple application (app) servers as destination terminals are interconnected via a reverse proxy. Each source terminal receives desired services from multiple application servers that are distributed amongst themselves by accessing the reverse proxy. 【0020】 In this embodiment, the reverse proxy is a Dynamic DNS reverse proxy that implements Dynamic DNS, enabling a connection from the source terminal to a destination terminal whose IP address and port number change dynamically, using a fixed hostname. 【0021】 Figure 2 is a functional block diagram showing the configuration of the main parts of the communication identification device 1 to which the present invention is applied. It is implemented in or connected to the Dynamic DNS reverse proxy and identifies steady-state communications based on flow data obtained by analyzing communication data. The communication identification device 1 mainly consists of a data collection unit 10, a data classification unit 20, a mask unit 30, an index value calculation unit 40, and an identification unit 50. 
【0022】 The data collection unit 10 captures all communication data relayed by the Dynamic DNS reverse proxy and, using a network traffic monitoring and analysis method such as IPFIX, collects as flow data connection information such as the IP addresses and port numbers of the source and destination terminals, together with statistics such as communication volume, number of packets, and communication start time. 【0023】 As schematically shown in Figure 3, the data classification unit 20 takes the entire set of collected flow data as D, forms a record for each flow data item in D by combining the connection information unique to the source terminal (source connection information) x with the connection information unique to the destination terminal (destination connection information) y, and classifies records having the same combination into the corresponding first set D_{x,y}. 【0024】 The mask unit 30 includes a communication fluctuation calculation unit 301, a clustering unit 302, and a connection information mask unit 303. As described in detail later with reference to Figure 4, the mask unit 30 performs a masking process that replaces a portion of the destination connection information y with a MASK token in records where the source connection information x matches but a portion of the destination connection information y does not. 【0025】 The communication fluctuation calculation unit 301 calculates the communication fluctuation for each first set D_{x,y} based on its flow data. In this embodiment, the coefficient of variation of the unit communication volume and the coefficient of variation of the communication frequency are calculated for each first set D_{x,y} as the communication fluctuation. The clustering unit 302 clusters the first sets D_{x,y} based on the coefficients of variation of the unit communication volume and the communication frequency. 
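The classification performed by the data classification unit 20 can be illustrated with a minimal Python sketch. It is not taken from the patent: the FlowRecord type, its field names, and the sample addresses are hypothetical, and it assumes, as in this embodiment, that x is the source IP address and y is the pair of destination IP address and destination port number.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical flow record; field names are illustrative only."""
    src_ip: str        # source connection information x
    dst_ip: str        # part of destination connection information y
    dst_port: int      # part of destination connection information y
    packets: int       # number of packets in the flow
    octets: int        # communication volume of the flow
    start_time: float  # communication start time


def classify_first_sets(flows):
    """Group flow records into first sets D_{x,y}, keyed by (x, y)."""
    first_sets = defaultdict(list)
    for rec in flows:
        x = rec.src_ip
        y = (rec.dst_ip, rec.dst_port)
        first_sets[(x, y)].append(rec)
    return dict(first_sets)


flows = [
    FlowRecord("10.0.0.5", "203.0.113.10", 443, 4, 400, 0.0),
    FlowRecord("10.0.0.5", "203.0.113.10", 443, 4, 404, 60.0),
    FlowRecord("10.0.0.5", "203.0.113.27", 443, 4, 396, 120.0),
]
first_sets = classify_first_sets(flows)
```

In this sample the third record lands in a separate first set only because its destination address differs, which is the situation the masking process is meant to handle.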
【0026】 The connection information mask unit 303 masks, for each cluster, a portion of the destination IP address or the destination port number in the destination connection information y. Records whose combination of source connection information x and destination connection information y becomes identical after the masking process are classified into a second set D_{x,masked_y}. 【0027】 The index value calculation unit 40 includes a unit communication volume calculation unit 401 and a communication frequency calculation unit 402, and calculates index values for steady-state identification based on the unit communication volume and the communication frequency for each first set D_{x,y} and each second set D_{x,masked_y}. 【0028】 Based on the index values, the communication identification unit 50 identifies steady-state communication for each set in the union D'_{x,y} of the first sets D_{x,y} and the second sets D_{x,masked_y}, as shown in the following formula (1). 【0029】 【Equation 1】 【0030】 Such a communication identification device 1 can be configured by installing an application (program) that realizes each of the functions described in detail below on a general-purpose computer, server, smartphone, or tablet terminal equipped with a CPU, ROM, RAM, bus, interfaces, and the like. Alternatively, it can be configured as a dedicated or single-function machine in which part of the application is implemented in hardware or in software. 【0031】 Figure 5 is a flowchart showing the communication identification procedure performed by the communication identification device 1. In step S1, the data collection unit 10 captures the communication data to be identified without omission, and collects as flow data connection information such as the IP addresses and port numbers of the source terminal and the destination terminal, together with statistics such as communication volume, number of packets, and communication start time. 
【0032】 In step S2, the data classification unit 20 forms a record for each flow data item in the collected flow data set D by combining the corresponding source connection information x and destination connection information y, and classifies records with the same combination into the first set D_{x,y} unique to that combination. 【0033】 For each combination of connection information x and y, the IP addresses of the source and destination may be used alone, or a combination of IP address and port number may be used. Alternatively, the destination information may be a functional category obtained by classifying communication destinations by function using a method such as clustering. In this embodiment, the combination of the IP address of the source terminal and the IP address and port number of the destination terminal is used. 【0034】 In this embodiment, because the destination connection information y changes dynamically, flow data (records) may at this point be classified into several different first sets D_{x,y} even when the combination of source terminal and destination terminal is the same. 【0035】 In steps S3-1 to S3-4, the mask unit 30 performs a masking process in which, for each combination of source connection information x and destination connection information y, a portion of the destination connection information y is replaced with a MASK token, according to the following procedure. 【0036】 In step S3-1, the coefficient of variation CV^{volume}_{x,y} of the unit communication volume is calculated for each first set D_{x,y}. In this embodiment, the unit communication volume is calculated per packet: for a flow record k ∈ D_{x,y} with number of packets p_k and communication volume q_k, the communication volume per packet v_k is obtained by the following equation (2). 
【0037】 【Equation 2】 【0038】 The coefficient of variation CV^{volume}_{x,y} of the communication volume per packet v_k (k ∈ D_{x,y}) is calculated by the following equation (3) and is used as one of the indices for evaluating stationarity. 【0039】 【Equation 3】 【0040】 In step S3-2, the coefficient of variation CV^{frequency}_{x,y} of the communication frequency is calculated for each first set D_{x,y}. In this embodiment, let t_k be the communication start time of flow record k ∈ D_{x,y}, and let s_t and e_t be the start time and end time of the entire period over which stationarity is to be determined. With the time points indexed by i ∈ {1, 2, ..., |D_{x,y}|}, the set of timestamps T_{x,y} combining these is obtained by the following equation (4). 【0041】 【Equation 4】 【0042】 In this embodiment, adding the start time s_t and end time e_t of the entire period over which stationarity is to be determined makes it possible to exclude sets of flow records that communicate periodically during only part of the period, as shown in Figure 6. As a result, only truly steady-state communication that is stationary over the entire period is identified. 【0043】 In this embodiment, the communication frequency f_i in T_{x,y} is expressed by the following equation (5), and the coefficient of variation CV^{frequency}_{x,y} obtained by the following equation (6) is used as the other index for evaluating stationarity. 【0044】 【Equation 5】 【0045】 【Equation 6】 【0046】 In step S3-3, all first sets D_{x,y} are classified into multiple clusters (clustering) based on their coefficients of variation CV^{volume}_{x,y} and CV^{frequency}_{x,y}. In this embodiment, CV^{volume}_{x,y} and CV^{frequency}_{x,y} have been calculated for each D_{x,y} ⊂ D, so they are collected as the set of coefficients of variation CV in the following equation (7), where C denotes the total number of first sets D_{x,y}. 【0047】 【Equation 7】 【0048】 Next, a density-based clustering method such as DBSCAN is applied to CV. 
Density-based clustering is a clustering method that groups points lying within a specified distance parameter φ of one another into the same cluster. 【0049】 Therefore, first sets D_{x,y} classified into the same cluster have a similar degree of stationarity within the parameter φ; in other words, they can be presumed to be sets whose source and destination are substantially the same but which were classified into different first sets D_{x,y} because the destination connection information y changes dynamically. 【0050】 In step S3-4, for each cluster, a portion of the destination IP address or the destination port number in the destination connection information y is masked. Specifically, as shown by way of example in Figure 4, part of the destination IP address or the destination port number of D_{x,y} ⊂ D is replaced with a MASK token to generate a second set D_{x,masked_y} containing the masked destination masked_y. The portion to be masked is determined by the following policies (1) to (5). 【0051】 (1) For combinations in which only the upper octets of the destination IP address are common, the lower octets are masked while the common part of the destination IP address is retained. 【0052】 (2) If network address information is given in advance, the part of the destination IP address corresponding to the host address is masked. 【0053】 (3) Masking follows a user-defined rule, for example when sufficient knowledge of the communication destinations is available and there is a rule that efficiently masks the destinations to be grouped together. 【0054】 (4) For combinations in which the destination IP address is the same but the port number differs, the entire port number is masked. 【0055】 (5) If none of the above applies, no masking is performed. 
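Masking policies (1), (4), and (5) above can be sketched in Python as follows. This is an illustrative sketch rather than the patented implementation: the function name mask_destinations is hypothetical, it assumes dotted-decimal IPv4 addresses, it takes as input the destinations of first sets that share one source x within one cluster, and it leaves out policies (2) and (3), which depend on externally supplied network information or rules.

```python
MASK = "MASK"  # token that replaces the mismatched portion


def mask_destinations(dests):
    """Mask the mismatched part of (dst_ip, dst_port) pairs.

    dests: destinations y of first sets with the same source x
    in the same cluster (assumed IPv4, dotted decimal).
    """
    if len(set(dests)) <= 1:
        return list(dests)  # nothing is mismatched
    ips = {ip for ip, _ in dests}
    if len(ips) == 1:
        # Policy (4): same IP address, different ports -> mask the port.
        return [(ip, MASK) for ip, _ in dests]
    # Policy (1): count the leading octets shared by every address.
    octet_lists = [ip.split(".") for ip in ips]
    common = 0
    for position in zip(*octet_lists):
        if len(set(position)) != 1:
            break
        common += 1
    if common == 0:
        return list(dests)  # Policy (5): no shared prefix, no masking
    masked = []
    for ip, port in dests:
        parts = ip.split(".")
        kept = parts[:common] + [MASK] * (len(parts) - common)
        masked.append((".".join(kept), port))
    return masked


same_prefix = mask_destinations([("203.0.113.10", 443), ("203.0.113.27", 443)])
same_ip = mask_destinations([("198.51.100.7", 8080), ("198.51.100.7", 8081)])
unrelated = mask_destinations([("203.0.113.10", 443), ("198.51.100.7", 443)])
```

After masking, destinations that have become identical can be merged under one key to form the second set D_{x,masked_y}. How destinations that differ in both address and port should be treated is not specified in the text; this sketch masks only the address octets in that case.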
【0056】 Returning to Figure 5, once the masking process is complete, the process proceeds to step S4, where the index value calculation unit 40 calculates the coefficient of variation CV^{volume}_{x,y} of the unit communication volume and the coefficient of variation CV^{frequency}_{x,y} of the communication frequency for each set in the union D'_{x,y} of the first sets D_{x,y}, which are the sets of flow data in each cluster whose source connection information x and destination connection information y match even without masking, and the second sets D_{x,masked_y}, whose connection information matches once masked. 【0057】 In step S5, the communication identification unit 50 identifies steady-state communication for each set D_{x,y}, D_{x,masked_y} in the union D'_{x,y}, based on the coefficients of variation CV^{volume}_{x,y} and CV^{frequency}_{x,y} of the unit communication volume and the communication frequency. 【0058】 For each set, the smaller CV^{volume}_{x,y} is, the more stationary the communication can be judged to be. Therefore, when a threshold ε_volume is set, a set D_{x,y}, D_{x,masked_y} satisfying the following condition (8) is judged to be stationary with respect to the unit communication volume. 【0059】 【Equation 8】 【0060】 When a flow record is bidirectional, the number of packets and the communication volume each have statistics for the inbound (destination → source) and outbound (source → destination) directions. In this case, the sum of the inbound and outbound values can be used, and the number of packets p and the communication volume q are calculated as shown in equations (9) and (10) below. 
【0061】 【Equation 9】 【0062】 【Equation 10】 【0063】 For each set, the smaller CV^{frequency}_{x,y} is, the more stationary the communication frequency can be judged to be. Therefore, when a threshold ε_frequency is set, a set satisfying equation (11) is judged to be stationary with respect to communication frequency. 【0064】 【Equation 11】 【0065】 Then, using the stationarity index CV^{volume}_{x,y} for the unit communication volume and the stationarity index CV^{frequency}_{x,y} for the communication frequency, sets for which both the unit communication volume and the communication frequency are judged to be stationary, as shown in equation (12), are identified as steady-state communication. 【0066】 【Equation 12】 【0067】 In the above embodiment, fixed thresholds are set for the coefficients of variation CV^{volume}_{x,y} and CV^{frequency}_{x,y}, and steady-state communication is identified for each set by comparison with the fixed thresholds. However, the present invention is not limited to this, and the thresholds ε_volume and ε_frequency may instead be set dynamically. 【0068】 That is, when the stationarity index of the unit communication volume and that of the communication frequency are calculated for all sets, a series like the following equation (13) is obtained. Here, X represents the set of all communication sources, and Y represents the set of all communication destinations. 【0069】 【Equation 13】 【0070】 Because smaller values of the coefficients of variation CV^{volume} and CV^{frequency} indicate more stationary communication, plotting the above series on a two-dimensional plane with the two coefficients of variation on the vertical and horizontal axes is expected to show the sets representing steady-state communication clustered near the origin. 
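The two stationarity indices and the fixed-threshold test of steps S3-1, S3-2, and S5 can be sketched as below. Because the equations themselves are not reproduced in this text, several points are assumptions rather than the patent's definitions: the coefficient of variation is taken as population standard deviation divided by mean, the communication frequency f_i is taken to be the interval between consecutive timestamps in T_{x,y} (including s_t and e_t), and the comparison with the thresholds is taken to be strict. All function names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by mean (population form, an assumption)."""
    m = mean(values)
    return pstdev(values) / m if m else float("inf")


def cv_volume(records):
    """records: list of (packets p_k, volume q_k, start time t_k)."""
    # Communication volume per packet, v_k = q_k / p_k.
    return coefficient_of_variation([q / p for p, q, _ in records])


def cv_frequency(records, s_t, e_t):
    """CV of the intervals between timestamps, bracketed by s_t and e_t."""
    stamps = sorted([s_t, e_t] + [t for _, _, t in records])
    # Assumption: f_i is the gap between consecutive timestamps.
    intervals = [b - a for a, b in zip(stamps, stamps[1:])]
    return coefficient_of_variation(intervals)


def is_steady(records, s_t, e_t, eps_volume, eps_frequency):
    """Steady only if both indices fall below their thresholds."""
    return (cv_volume(records) < eps_volume
            and cv_frequency(records, s_t, e_t) < eps_frequency)


# Keep-alive-like flows every 60 s across the whole period [0, 600].
whole_period = [(4, 400, t) for t in range(60, 600, 60)]
# Flows that are periodic only during the first part of the period.
partial_period = [(4, 400, t) for t in (60, 120, 180)]

steady_whole = is_steady(whole_period, 0, 600, 0.1, 0.1)
steady_partial = is_steady(partial_period, 0, 600, 0.1, 0.1)
```

Including s_t and e_t in the timestamp list is what penalizes the second sample: the long gap from the last flow to e_t inflates the interval variation, so flows that are periodic during only part of the period are not treated as steady. The threshold value 0.1 is an arbitrary illustration.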
【0071】 Here, the above series is classified into two clusters using a clustering method that takes the number of clusters as a hyperparameter, such as k-means, or a binary classification method such as One-class SVM. The initial values of the cluster centers are given as {(0, 0), (a, b)} (0 ≪ a, 0 ≪ b). The cluster whose learning starts from (0, 0) is expected to come to contain the steady-state communication, and the cluster whose learning starts from (a, b) is expected to come to contain the non-steady-state communication. 【0072】 After the clusters have been learned, a boundary value that separates the steady-state communication cluster from the non-steady-state communication cluster is set as the threshold. For example, when the steady-state communication cluster is c0, the thresholds can be determined by taking the maximum values as shown in the following equations (14) and (15). 【0073】 【Equation 14】 【0074】 【Equation 15】 【0075】 Alternatively, instead of determining ε_volume and ε_frequency independently, a function F that separates the steady-state communication cluster from the non-steady-state communication cluster may be used in place of the thresholds. An example of such a function F is one obtained by a method such as SVM. 【0076】 According to each of the above embodiments, even when the connection information of the communication destination changes dynamically, it becomes possible to accurately identify steady-state communication between the communication source and the communication destination. This makes it possible to contribute to Goal 9, "Build resilient infrastructure, promote inclusive and sustainable industrialization" and Goal 11, "Make cities inclusive, safe, resilient and sustainable" of the Sustainable Development Goals (SDGs) led by the United Nations. 
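The dynamic threshold setting of paragraphs 0071 and 0072 can be sketched with a small two-cluster k-means written in plain Python. This is a hedged illustration, not the patent's implementation: the function names, the choice (a, b) = (1, 1), the iteration cap, and the sample points are all assumptions, and the thresholds are taken as the per-axis maxima over cluster c0 on the basis of the wording of paragraph 0072.

```python
import math


def two_means(points, init=((0.0, 0.0), (1.0, 1.0)), iters=50):
    """Two-cluster k-means seeded at (0, 0) and (a, b); returns labels."""
    centers = [tuple(c) for c in init]
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center (0 or 1).
        labels = [min((0, 1), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's seed
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


def dynamic_thresholds(points, init=((0.0, 0.0), (1.0, 1.0))):
    """Return (eps_volume, eps_frequency) as the maxima over cluster c0."""
    labels = two_means(points, init)
    c0 = [p for p, lab in zip(points, labels) if lab == 0]
    if not c0:
        raise ValueError("no point was assigned to the cluster seeded at (0, 0)")
    return max(p[0] for p in c0), max(p[1] for p in c0)


# Each point is (CV_volume, CV_frequency) for one first or second set.
cv_points = [(0.02, 0.05), (0.04, 0.01), (0.03, 0.03),
             (0.9, 1.2), (1.1, 0.8), (1.4, 1.0)]
eps_volume, eps_frequency = dynamic_thresholds(cv_points)
```

Seeding one center at the origin reflects the expectation stated above that steady-state sets gather near (0, 0). A library implementation such as scikit-learn's KMeans with explicit initial centers could be used instead; the plain-Python version is used here only to keep the sketch self-contained.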
【Explanation of Reference Signs】 【0077】 1...Communication identification device, 10...Data collection unit, 20...Data classification unit, 30...Mask unit, 40...Index value calculation unit, 50...Communication identification unit, 301...Communication fluctuation calculation unit, 302...Clustering unit, 303...Connection information mask unit, 401...Unit communication volume calculation unit, 402...Communication frequency calculation unit

Claims

[Claim 1] A communication identification device that identifies communications on a network, characterized by comprising: means for collecting flow data from communication data on the network; means for classifying the collected flow data into first sets, each unique to a combination of source connection information x and destination connection information y; means for calculating a communication fluctuation for each first set based on its flow data; means for clustering all first sets based on the communication fluctuation; means for masking, for the first sets classified into the same cluster, the mismatched portion of the destination connection information y in combinations where the source connection information x matches but the destination connection information y does not match; means for calculating fluctuation components of the unit communication volume and the communication frequency for each first set and for each second set of combinations in which the mismatched portion of the destination connection information y has been masked; and means for identifying steady-state communication for each of the first and second sets based on the fluctuation components.
[Claim 2] The communication identification device according to claim 1, characterized in that the means for calculating the communication fluctuation calculates coefficients of variation of the unit communication volume and the communication frequency for each first set based on its flow data, and the clustering means clusters all first sets based on the coefficients of variation.
[Claim 3] The communication identification device according to claim 1, characterized in that the identification means identifies a communication as steady-state communication when the coefficients of variation of the unit communication volume and the communication frequency for each of the first and second sets fall below a predetermined threshold.
[Claim 4] The communication identification device according to claim 1, characterized in that the identification means comprises: means for dynamically setting, as a threshold for the unit communication volume, the boundary obtained when the coefficients of variation of the unit communication volume of the first and second sets are statistically classified into two groups; and means for dynamically setting, as a threshold for the communication frequency, the boundary obtained when the coefficients of variation of the communication frequency of the first and second sets are statistically classified into two groups, and identifies a communication as steady-state communication when both the unit communication volume and the communication frequency fall below the corresponding dynamically set thresholds.
[Claim 5] The communication identification device according to any one of claims 1 to 4, characterized in that the communication frequency is the communication frequency during the period from the start time to the end time of communication.
[Claim 6] A communication identification method in which a computer identifies communications on a network, characterized by: collecting flow data from communication data on the network; classifying the collected flow data into first sets, each unique to a combination of source connection information x and destination connection information y; calculating a communication fluctuation for each first set based on its flow data; clustering all first sets based on the communication fluctuation; masking, for the first sets classified into the same cluster, the mismatched portion of the destination connection information y in combinations where the source connection information x matches but the destination connection information y does not match; calculating fluctuation components of the unit communication volume and the communication frequency for each first set and for each second set of combinations in which the mismatched portion of the destination connection information y has been masked; and identifying steady-state communication for each of the first and second sets based on the fluctuation components.
[Claim 7] A communication identification program that identifies communications on a network, characterized by causing a computer to execute: a procedure for collecting flow data from communication data on the network; a procedure for classifying the collected flow data into first sets, each unique to a combination of source connection information x and destination connection information y; a procedure for calculating a communication fluctuation for each first set based on its flow data; a procedure for clustering all first sets based on the communication fluctuation; a procedure for masking, for the first sets classified into the same cluster, the mismatched portion of the destination connection information y in combinations where the source connection information x matches but the destination connection information y does not match; a procedure for calculating fluctuation components of the unit communication volume and the communication frequency for each first set and for each second set of combinations in which the mismatched portion of the destination connection information y has been masked; and a procedure for identifying steady-state communication for each of the first and second sets based on the fluctuation components.

Citation Information

Patent Citations

  • Communication identification method and device

    JP2015162810A

  • Communication identification method and device

    JP2016082518A

  • Communication identification method and device

    JP2016116026A

  • Communication discrimination method and device

    JP2016127361A