Systems and methods for detecting a phishing domain in a domain name system (DNS) record set

By combining generative adversarial neural networks and bidirectional encoder representations with image recognition technology, the problem of detecting phishing domains in existing technologies has been solved. This enables efficient identification and early prevention of phishing domains using homographs and misspelled characters, thus improving detection efficiency and accuracy.

CN115314236B · Active · Publication Date: 2026-06-23 · ENSIGN INFOSECURITY PTE LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ENSIGN INFOSECURITY PTE LTD
Filing Date
2022-05-07
Publication Date
2026-06-23


Abstract

This document describes a system and method for detecting phishing domains used by cyber attackers to conduct phishing attacks in a Domain Name System (DNS) record set, the system including a homograph phishing domain detection module, a typosquatting phishing domain detection module, a general phishing domain detection module, and an alert module. These modules are configured to use a combination of homograph, typosquatting, and general phishing domain techniques to collaboratively detect and identify phishing domains from a DNS record set. Subsequently, the alert module can be used to correlate alerts from the various phishing detection modules to discover phishing activity occurring in DNS network data.

Description

Technical Field

[0001] This invention relates to a system and method for detecting phishing domains used by network attackers to carry out phishing attacks using Domain Name System (DNS) records. The system includes a homograph phishing domain detection module, a misspelling-based phishing domain detection module, a general phishing domain detection module, and an alerting module. These modules are configured to collaboratively detect and identify phishing domains occurring within a given DNS record set using a combination of homograph, misspelling-based, and general phishing domain techniques.

Background Technology

[0002] Generally, three main types of phishing domains are commonly used for phishing attacks: homograph phishing domains that exploit visual similarities, misspelling phishing domains that exploit typos, and other general phishing domains, such as "win-free-iPhone[.]com", that exploit human desires. It is worth noting that in a phishing attack the attacker attempts to (1) gather knowledge of the victim by obtaining credentials, or (2) gain initial access by tricking people into downloading and executing malicious payloads.

[0003] Misspelling phishing domains are a common type of phishing attack that involves the repetition of the same Latin characters. Specifically, misspelling phishing domains involve registering domains that exploit typos unsuspecting users might make when typing the desired domain. For example, in 2006, a misspelled variant of "google[.]com" – "goggle[.]com" – was abused by malicious actors, and even more recently, a misspelled variant of "youtube[.]com" – "yuube[.]com" – was used to host malware.

[0004] To address this type of attack, the currently accepted approach in the industry is to calculate the edit distance between strings. For example, equation (1) below shows two example domains that are 1 Levenshtein edit distance (LD) from the “real” domain (“facebook.com”). In this known approach, it should be noted that a lower LD value indicates a domain that is extremely similar to the “real” domain, thus increasing the likelihood of a phishing attempt.

[0005] ld("facebook.com", "face4book.com") = 1 … Equation (1)

[0006] ld("facebook.com", "faceb0ok.com") = 1

[0007] However, this method does not consider keyboard distance, only heuristics. For example, in equation (1), “faceb0ok.com” is more likely a misspelled variant of “facebook.com” because “o” and “0” are very close on the keyboard, while “p” and “c” are very far apart. Despite this obvious difference, the LD algorithm assigns both domains the same edit distance from “facebook.com”. Nevertheless, LD still forms the basis of many modern spell-checking systems, which makes it suitable for detecting misspelled phishing domains that exploit typographical or spelling errors.
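As an illustration of the edit-distance baseline discussed above, the following is a minimal sketch of the standard dynamic-programming Levenshtein computation. The function name `levenshtein` is chosen here for illustration and does not come from the patent.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


# The two example domains from equation (1).
print(levenshtein("facebook.com", "face4book.com"))
print(levenshtein("facebook.com", "faceb0ok.com"))
```

Both calls return the same value, which is exactly the limitation described: the metric cannot tell a plausible slip of the finger from an implausible one.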

[0008] In recent years, those skilled in the art have recognized the importance of keyboard distance and have increasingly sought ways to account for it in better spell-checking algorithms. One such method compares two strings by returning the sum of the keyboard distances between corresponding characters. If one string is longer than the other, each remaining character is assigned the maximum distance. While this method considers keyboard distance, it does not account for the transposition of adjacent characters.
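A rough sketch of the position-wise keyboard-distance comparison described in this paragraph follows. The QWERTY coordinates, the row stagger values, the handling of characters that are not on the modelled keys, and the example string "fapebook.com" are all assumptions made for illustration; the cited method may define them differently.

```python
import math
from itertools import combinations

# Assumed QWERTY layout: each row is shifted right by an assumed stagger.
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSETS = [0.0, 0.5, 0.75, 1.25]

KEY_POS = {
    ch: (col + ROW_OFFSETS[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}

# Largest distance between any two modelled keys.
MAX_DISTANCE = max(math.dist(p, q) for p, q in combinations(KEY_POS.values(), 2))


def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys; unknown keys get the maximum."""
    if a == b:
        return 0.0
    if a not in KEY_POS or b not in KEY_POS:
        return MAX_DISTANCE
    return math.dist(KEY_POS[a], KEY_POS[b])


def keyboard_distance(s: str, t: str) -> float:
    """Sum of per-position key distances; extra characters cost MAX_DISTANCE."""
    s, t = s.lower(), t.lower()
    total = sum(key_distance(a, b) for a, b in zip(s, t))
    return total + abs(len(s) - len(t)) * MAX_DISTANCE


print(keyboard_distance("facebook.com", "faceb0ok.com"))  # o -> 0, adjacent keys
print(keyboard_distance("facebook.com", "fapebook.com"))  # c -> p, distant keys
print(keyboard_distance("google.com", "googel.com"))      # adjacent transposition
```

Because characters are compared strictly position by position, the transposed pair in the last call is charged as two unrelated substitutions, which is the weakness noted above.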

[0009] Furthermore, the Levenshtein edit distance (LD) method described above fails when attackers mount homograph attacks using characters that do not belong to the Latin character set (a subset of such look-alike attacks). For example, the domains “fácebook.com” and “facebooZ.com” are both at an LD of 1 from “facebook.com”, but they have very different visual characteristics. In addition, since most modern browsers support displaying internationalized domain names (IDNs), domain names containing numbers and other special characters can be registered. Such IDNs are usually converted to an ASCII-compatible form known as Punycode. While this is very useful for supporting domain names in various languages, it opens up the possibility of network attacks, especially homograph-type domain attacks. The homograph attack vector comes into play when mixed characters look like their corresponding Latin characters. As shown in Table 1 below, it is not easy to distinguish homograph-type domains from their original domains.

[0010]

[0011] Table 1
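To make the Punycode conversion mentioned in paragraph [0009] concrete, the sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules. The helper `has_punycode_label` is a hypothetical name introduced here; it simply checks for the `xn--` prefix that marks an encoded label.

```python
def has_punycode_label(name: str) -> bool:
    """True if any dot-separated label carries the 'xn--' Punycode prefix."""
    return any(label.startswith("xn--") for label in name.lower().split("."))


domain = "fácebook.com"               # example domain from paragraph [0009]
ascii_form = domain.encode("idna")    # ASCII form as it would appear in DNS data

print(ascii_form)
print(ascii_form.decode("idna"))      # back to the Unicode form a browser displays
print(has_punycode_label(ascii_form.decode("ascii")))
print(has_punycode_label("facebook.com"))
```

A check of this kind is one plausible way to pre-select candidate homograph domains from DNS records before any image-based comparison is attempted.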

[0012] With the rise of homograph phishing attacks, those skilled in the art have proposed numerous techniques to detect such attacks. One approach studies the similarity between individual characters and evaluates their pairwise similarity based on mean squared error. The result is then fine-tuned by having people evaluate the similarity of these character pairs. A major drawback of this approach is that it does not account for strings, as the similarity comparison is performed at the character level. Furthermore, this method assumes that combinations of homographs may affect the confounding power of homograph strings. This approach also lacks a sufficiently large training dataset, thus requiring human labelers for fine-tuning.

[0013] In another approach, those skilled in the art have demonstrated that Siamese convolutional neural networks (CNNs) can detect and classify homograph attacks. This method uses a dataset containing pairs of real and spoofed domains rendered in the Arial font. While highly useful for training machine learning (ML) algorithms, a major drawback of this approach is its inherent bias towards the Arial font: Punycode domains rendered in other fonts are not considered, so deep learning models trained on such datasets will be biased towards Arial. Furthermore, creating large, curated datasets for multiple fonts would be extremely tedious and potentially inefficient, as the result would again be biased towards only those specific fonts.

[0014] Besides homograph and misspelled domains, there are also phishing domains that exploit basic human desires. These are the most difficult to identify and detect; therefore, no widely accepted method has yet been developed to address them. Techniques proposed by those skilled in the art for identifying such malicious domains utilize other characteristics, such as domain lifetime (i.e., how long the domain has been queried before people stop querying it), number of IP addresses, etc., to determine whether a domain is malicious. However, results show that such models typically end up capturing malware domains and other domains associated with later stages of the attack framework, particularly C2 communications, DNS leaks, etc., and are practically unable to detect any phishing domains. This suggests that such techniques are better suited for detecting malware and malicious domains, but not for phishing domains, which typically occur in the early stages of a cyberattack.

[0015] For the reasons stated above, those skilled in the art have been working to develop a system and method capable of detecting phishing domains from a set of DNS records, which may include homograph phishing domains, misspelling phishing domains, and / or generic phishing domains.

Summary of the Invention

[0016] The systems and methods provided according to embodiments of the present invention solve the above and other problems and make progress in the field.

[0017] A first advantage of embodiments of the system and method according to the invention is that the invention is able to detect and block early phishing attacks based solely on network traffic data such as received DNS records, thereby preventing protected systems from being compromised by later attacks such as data breaches or credential collection.

[0018] A second advantage of embodiments of the system and method according to the invention is that, in addition to detecting such domains, the invention can enrich alerts related to the identification of such domains, so that other organizations also attacked by such phishing domains can be alerted.

[0019] A third advantage of embodiments of the system and method according to the present invention is that the invention is able to identify phishing domains using three types of phishing domain detection techniques.

[0020] A fourth advantage of embodiments of the system and method according to the invention is that the invention utilizes a generative adversarial neural network (GAN), which is trained using a unique loss function to generate a virtually infinite number of glyphs, not limited to any particular font type, to train a separate homograph encoder to help detect and identify homograph phishing domains.

[0021] A fifth advantage of embodiments of the system and method according to the invention is that by transforming the string comparison problem into an image recognition problem, batch processing can be performed, thereby allowing for the rapid processing of more strings and thus enabling the processing of large-scale network traffic data.

[0022] A sixth advantage of embodiments of the system and method according to the invention is that keyboard distance is inherently taken into account when the detection of misspelled phishing domains is transformed into an image recognition problem using the Swype image renderer.

[0023] A seventh advantage of embodiments of the system and method according to the invention is that a Bidirectional Encoder Representations from Transformers (BERT) neural network is used to apply natural language understanding to determine whether a domain is a possible general phishing domain.

[0024] An eighth advantage of embodiments of the system and method according to the invention is that the resolved IP of the domain marked as a possible generic phishing domain is also used as an additional feature for determining whether the domain should be classified as a generic phishing domain, thereby significantly reducing the false positive rate of generic phishing domain detection.

[0025] A ninth advantage of embodiments of the system and method according to the invention is that time and frequency analysis is performed on each IP address that is a victim of phishing to determine whether the attack has progressed to a later stage.

[0026] The above advantages are provided by embodiments of the method according to the invention, which operate in the following manner.

[0027] According to a first aspect of the present invention, a system for detecting phishing domains in a Domain Name System (DNS) record set is disclosed, the system comprising: a homograph phishing domain detection module including a trained homograph encoder E_H and a reference database D_H, the module being configured to: receive string images I_HT_N rendered from homograph domains, the homograph domains consisting of domains that each include Punycode; encode each string image I_HT_N into its associated encoding e_HT_N using the trained homograph encoder E_H; and classify each homograph domain as a homograph phishing domain when the Euclidean distance between its associated encoding e_HT_N and any encoding in the reference database D_H is below a predetermined threshold; a misspelling-based phishing domain detection module including a trained misspelling-based encoder E_S and a reference database D_S, the module being configured to: receive Swype-like images I_ST_N of domains rendered from the DNS record set; encode each Swype-like image I_ST_N into its associated encoding e_ST_N using the trained misspelling-based encoder E_S; and classify each domain used to generate a Swype-like image I_ST_N as a misspelling phishing domain when the Euclidean distance between its associated encoding e_ST_N and any encoding in the reference database D_S is below a predetermined threshold; a general phishing domain detection module including a trained transformer-based neural network, configured to: receive domains identified from the DNS record set that have strings that at least partially match strings in a phishing reference list K; generate a probability score for each identified domain using the trained transformer-based neural network; resolve an Internet Protocol (IP) address for each identified domain with a probability score exceeding a predetermined probability threshold; for each resolved IP address, obtain all external domains associated with the resolved IP address, where external domains refer to all domains within the DNS record set that resolve to the resolved IP address; generate a probability score for each obtained external domain using the transformer-based neural network; and classify each obtained external domain with a probability score exceeding a predetermined probability threshold as a general phishing domain; and an alert module configured to generate alerts for phishing domains detected in the DNS record set based on the homograph phishing domain detection module, the misspelling phishing domain detection module, and the general phishing domain detection module.
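The threshold test that both encoder-based modules apply can be sketched as a nearest-reference lookup. The three-dimensional encodings, the threshold value, and the names `closest_reference` and `reference_db` are hypothetical placeholders; in the described system the encodings would come from the trained encoders rather than being written by hand.

```python
import math


def closest_reference(encoding, reference_db, threshold):
    """Return (is_phishing, nearest_domain) using Euclidean distance.

    A domain is flagged when its encoding lies within `threshold` of
    any encoding stored in the reference database.
    """
    best_domain, best_distance = None, float("inf")
    for domain, reference in reference_db.items():
        distance = math.dist(encoding, reference)
        if distance < best_distance:
            best_domain, best_distance = domain, distance
    return best_distance < threshold, best_domain


# Hypothetical reference encodings of popular domains.
reference_db = {
    "facebook.com": [0.90, 0.10, 0.00],
    "google.com": [0.00, 0.80, 0.60],
}

print(closest_reference([0.88, 0.12, 0.02], reference_db, threshold=0.2))
print(closest_reference([-0.50, -0.50, 0.70], reference_db, threshold=0.2))
```

Returning the nearest reference domain alongside the decision is a design choice made here so that an alert could name the brand being imitated.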

[0028] Regarding a first aspect of the invention, the alert module includes a blacklist and a rule-based filtering module, which are configured to filter out benign domains from identified phishing domains.

[0029] Regarding a first aspect of the invention, the alert module further includes: a behavior analysis module configured to: obtain unfiltered phishing domains from a blacklist and a rule-based filtering module; obtain IP addresses that query at least one unfiltered phishing domain; for each IP address that queries at least one unfiltered phishing domain, generate a count-based vector based on the number of queries for each unfiltered phishing domain; apply L2 normalization to each count-based vector; apply hierarchical clustering to the IP addresses and their associated count-based vectors to identify count-based vectors with similar characteristics; and classify the IP addresses associated with the identified count-based vectors as IP addresses that have suffered the same phishing attack activity.
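The behavior-analysis steps in this paragraph (count-based vectors, L2 normalization, hierarchical clustering) might look roughly as follows. This sketch uses single-linkage agglomerative clustering with a distance cutoff as one possible form of hierarchical clustering; the linkage choice, the cutoff, and the `.example` domains and IP addresses are assumptions, not details from the patent.

```python
import math
from collections import Counter


def count_vector(queries, domains):
    """Number of queries this IP made to each phishing domain, in fixed order."""
    counts = Counter(queries)
    return [counts[d] for d in domains]


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def single_linkage(vectors, cutoff):
    """Merge the two closest clusters until the closest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(vectors))]

    def gap(c1, c2):
        return min(math.dist(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        distance, a, b = min(
            (gap(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if distance > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


domains = ["paypa1-login.example", "win-free-iphone.example"]  # hypothetical
query_log = {
    "10.0.0.1": [domains[0]] * 8 + [domains[1]] * 1,
    "10.0.0.2": [domains[0]] * 16 + [domains[1]] * 2,
    "10.0.0.3": [domains[1]] * 5,
}
ips = list(query_log)
vectors = [l2_normalize(count_vector(query_log[ip], domains)) for ip in ips]
clusters = single_linkage(vectors, cutoff=0.5)
print([[ips[i] for i in cluster] for cluster in clusters])
```

L2 normalization is what lets the first two addresses cluster together: they query the domains in the same proportions even though one is twice as active.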

[0030] Regarding a first aspect of the invention, the alert module further includes: a behavior analysis module configured to: obtain unfiltered phishing domains from a blacklist and a rule-based filtering module; obtain IP addresses that query at least one unfiltered phishing domain; for each IP address that queries at least one unfiltered phishing domain, generate a count-based vector based on the number of queries made by the IP address for each unfiltered phishing domain; apply L2 normalization to each count-based vector; apply locality-sensitive hashing (LSH) to the IP addresses and their associated count-based vectors to identify count-based vectors with similar characteristics; and classify the IP addresses associated with the identified count-based vectors as IP addresses subjected to the same phishing attack activity.
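One common LSH family for normalized vectors is random-hyperplane hashing, which approximates cosine similarity. The sketch below uses it purely as an assumed stand-in, since the text does not say which LSH scheme is intended; the bit count, seed, and sample vectors are likewise illustrative.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Random Gaussian hyperplanes; the fixed seed keeps the hash repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def bucket_by_signature(vectors_by_ip, hyperplanes):
    """Group IP addresses whose vectors hash to the same signature."""
    buckets = {}
    for ip, vector in vectors_by_ip.items():
        buckets.setdefault(signature(vector, hyperplanes), []).append(ip)
    return buckets


planes = make_hyperplanes(dim=3, n_bits=16)
vectors_by_ip = {  # hypothetical L2-normalized count vectors
    "10.0.0.1": [0.6, 0.8, 0.0],
    "10.0.0.2": [0.6, 0.8, 0.0],
    "10.0.0.3": [0.0, 0.0, 1.0],
}
for sig, members in bucket_by_signature(vectors_by_ip, planes).items():
    print(members)
```

Compared with hierarchical clustering, bucketing of this kind avoids pairwise distance computation, which matters when the number of querying IP addresses is large.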

[0031] Regarding the first aspect of the invention, the behavior analysis module further applies a dimensionality reduction process, such as t-SNE or UMAP, prior to hierarchical clustering or LSH.

[0032] Regarding a first aspect of the invention, the alert module further includes: a behavior analysis module, further configured to: obtain unfiltered phishing domains from a blacklist and a rule-based filtering module; obtain IP addresses that query at least one unfiltered phishing domain; for each IP address that queries at least one unfiltered phishing domain, collect a sorted list of timestamps of queries against at least one unfiltered phishing domain; calculate relative times using the sorted timestamps, wherein each calculated relative time is the time elapsed relative to a first timestamp; bin the calculated relative times into desired sampling frequencies and count the number of entries in each bin to obtain an occurrence time series, wherein the occurrence time series is defined as the number of queries against at least one unfiltered phishing domain; apply a Hanning filter to the occurrence time series; and perform frequency analysis on the filtered occurrence time series to determine the presence of periodicity and its associated frequencies.
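The timing analysis in this paragraph can be sketched with a direct discrete Fourier transform. Subtracting the mean before windowing is an addition made here so that the constant component does not dominate; the bin width, the synthetic 60-second beacon, and all function names are assumptions for illustration.

```python
import cmath
import math


def occurrence_series(timestamps, bin_seconds):
    """Count queries per time bin, measured relative to the first timestamp."""
    ordered = sorted(timestamps)
    relative = [t - ordered[0] for t in ordered]
    series = [0] * (int(relative[-1] // bin_seconds) + 1)
    for r in relative:
        series[int(r // bin_seconds)] += 1
    return series


def hann_window(n):
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dominant_frequency(series, bin_seconds):
    """Frequency (Hz) of the strongest non-constant DFT component."""
    n = len(series)
    mean = sum(series) / n  # assumption: remove the constant offset first
    windowed = [(x - mean) * w for x, w in zip(series, hann_window(n))]
    best_k, best_magnitude = 0, 0.0
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)
        )
        if abs(coefficient) > best_magnitude:
            best_k, best_magnitude = k, abs(coefficient)
    return best_k / (n * bin_seconds)


# Synthetic example: one query to a phishing domain every 60 seconds.
timestamps = [1_700_000_000 + 60 * i for i in range(64)]
series = occurrence_series(timestamps, bin_seconds=10)
frequency = dominant_frequency(series, bin_seconds=10)
print(f"estimated period: {1 / frequency:.1f} s")
```

A strong, narrow peak of this kind would suggest automated beaconing rather than a person clicking a link, which is the sort of signal that indicates an attack has moved to a later stage.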

[0033] Regarding the first aspect of the invention, the homograph encoder E_H is trained by minimizing a triplet loss function L_triplet with respect to the parameters of the homograph encoder E_H, the function L_triplet being defined as

[0034]

[0035] where the positive input of the function L_triplet is provided with a set of images rendered from a collection of frequently visited popular Internet domains; the anchor input of the function L_triplet is provided with a set of outputs from a trained Phish Generative Adversarial Network (Phish-GAN), each output corresponding to a glyph version of the positive input of the current set; and the negative input of the function L_triplet is provided with a set of string images of domains sampled from the set of popular Internet domains that the encoder E_H considers similar to the anchor input of the current set, excluding the actual positive sample of that set; and where E(P) is defined as the positive encoded output, E(A) as the anchor encoded output, E(N) as the negative encoded output, and M as the margin.
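The formula itself is not reproduced in this text. For orientation only, the widely used form of triplet loss is max(d(A, P) − d(A, N) + M, 0); the sketch below assumes squared Euclidean distance for d, which may differ from the patent's exact definition.

```python
import math


def triplet_loss(anchor, positive, negative, margin):
    """Common triplet loss form (assumed, squared Euclidean distance).

    The loss reaches zero once the anchor is closer to the positive than
    to the negative by at least `margin`.
    """
    d_pos = math.dist(anchor, positive) ** 2
    d_neg = math.dist(anchor, negative) ** 2
    return max(d_pos - d_neg + margin, 0.0)


# Anchor coincides with the positive and is far from the negative: no loss.
print(triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0], margin=1.0))
# Positive and negative are equally far from the anchor: loss equals the margin.
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], margin=0.5))
```

In the training setup described, the anchor would be the encoding of a Phish-GAN glyph image, the positive the encoding of the genuine domain it imitates, and the negative the encoding of a different but similar-looking popular domain.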

[0036] Regarding the first aspect of the invention, the homograph encoder E_H is trained by minimizing a normalized temperature-scaled cross-entropy (NT-Xent) loss function L_NT-Xent, the function L_NT-Xent being defined as

[0037]

[0038] where the positive input of the function L_NT-Xent is provided with a set of images rendered from a collection of frequently visited popular Internet domains; the anchor input of the function L_NT-Xent is provided with a set of outputs from a trained Phish Generative Adversarial Network (Phish-GAN), each output corresponding to a glyph version of the positive input of the current set; and the negative input of the function L_NT-Xent is provided with a set of string images of domains derived from the set of frequently visited popular Internet domains that the encoder E_H considers similar to the anchor input of the current set, excluding the actual positive sample of that set; and where the function uses the cosine similarity between the anchor encoding and the positive sample, and the cosine similarity between the anchor encoding and each negative sample.
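The NT-Xent formula is likewise missing from this text. As a point of reference, the form commonly used in contrastive learning is shown below, with the symbols a, p, n_k and the temperature τ chosen here for illustration; the patent's own notation and exact definition may differ.

```latex
% Commonly used NT-Xent form (assumed notation, not taken from the patent):
%   a    = anchor encoding,  p = positive encoding,
%   n_k  = k-th negative encoding,  \tau = temperature.
\[
  \operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
  \qquad
  L_{\text{NT-Xent}} =
  -\log
  \frac{\exp\bigl(\operatorname{sim}(a, p) / \tau\bigr)}
       {\exp\bigl(\operatorname{sim}(a, p) / \tau\bigr)
        + \sum_{k} \exp\bigl(\operatorname{sim}(a, n_k) / \tau\bigr)}
\]
```

Minimizing this quantity raises the cosine similarity between the anchor and its positive while lowering it for the negatives, which matches the roles assigned to the two similarity measures in the paragraph above.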

[0039] Regarding a first aspect of the invention, the trained Phish-GAN includes: a trained generator G configured to receive an image x rendered from a training dataset y and randomly generated noise z to produce a glyph version G(x, z) for each received image; and a trained discriminator D configured to: receive the glyph version G(x, z) and the image x from the trained generator G; and classify the image x and G(x, z) as fake or real images using a discriminator function D(), wherein the generator G is trained using an auxiliary dot-product loss function L_dot, defined as:

[0040]

[0041] where the flatten function reshapes an image tensor into a vector for computing the dot product, and the Phish-GAN trained in this way has a generator objective function and a discriminator objective function, defined as

[0042]

[0043]

[0044] Regarding the first aspect of the invention, the misspelling-based encoder E_S is trained by minimizing a normalized temperature-scaled cross-entropy (NT-Xent) loss function L_NT-Xent, the function L_NT-Xent being defined as

[0045]

[0046] where the positive input of the function L_NT-Xent is provided with a set of Swype-like images rendered from a set of frequently visited popular Internet domains; the anchor input of the function L_NT-Xent is provided with a set of Swype-like images of misspelling phishing domains generated for the positive input of the associated set; and the negative input of the function L_NT-Xent is provided with a set of Swype-like images of domains derived from the set of frequently visited popular Internet domains that the encoder E_S considers similar to the anchor input of the current set, excluding the actual positive sample of that set; and where the function uses the cosine similarity between the anchor encoding and the positive sample, and the cosine similarity between the anchor encoding and each negative sample.

[0047] Regarding the first aspect of the invention, the misspelling-based encoder E_S is trained by minimizing a triplet loss function L_triplet, the function L_triplet being defined as

[0048]

[0049] where the positive input of the function L_triplet is provided with a set of Swype-like images rendered from a collection of frequently visited popular Internet domains; the anchor input of the function L_triplet is provided with a set of Swype-like images of misspelling phishing domains generated for the positive input of the associated set; and the negative input of the function L_triplet is provided with a set of Swype-like images of domains derived from the set of popular Internet domains that the encoder E_S considers similar to the anchor input of the current set, excluding the actual positive sample of that set; and where E(P) is defined as the positive encoded output, E(A) as the anchor encoded output, E(N) as the negative encoded output, and M as the margin.

[0050] Regarding a first aspect of the invention, the transformer-based neural network is trained using a binary cross-entropy loss function.

[0051] Regarding the first aspect of the invention, to generate a probability score for each identified domain, the general phishing domain detection module is configured to: perform sub-word tokenization on each identified domain, and generate the probability score from the sub-word tokens using the trained transformer-based neural network.
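Sub-word tokenization of a domain name could be approximated with a greedy longest-match scheme in the style of WordPiece, the tokenizer used by BERT models. The tiny vocabulary, the choice to split on dots and hyphens first, and the function names are all assumptions made for illustration; a real deployment would use the vocabulary that ships with the trained model.

```python
def wordpiece(word, vocab, unknown="[UNK]"):
    """Greedy longest-match-first split; continuation pieces start with '##'."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no vocabulary entry covers this position
            return [unknown]
        tokens.append(piece)
        start = end
    return tokens


def tokenize_domain(domain, vocab):
    """Split on '.' and '-' first (an assumption), then apply WordPiece."""
    tokens = []
    for part in domain.lower().replace(".", "-").split("-"):
        if part:
            tokens.extend(wordpiece(part, vocab))
    return tokens


vocab = {"win", "free", "iphone", "com", "##s"}  # hypothetical vocabulary
print(tokenize_domain("win-free-iPhone.com", vocab))
print(tokenize_domain("free-iphones.com", vocab))
```

Breaking the name into meaningful pieces is what allows a language model to react to the lure words themselves rather than to the raw character string.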

[0052] Regarding the first aspect of the invention, the general phishing domain detection module is further configured to: for each resolved IP address, acquire all external domains, wherein the external domains include the domains that resolve to the IP address; generate a probability score for each external domain based on sub-word tokens using the trained transformer-based neural network; and, when the percentage of acquired external domains classified as general phishing domains exceeds a percentage threshold, classify all external domains with probability scores exceeding a predetermined threshold as general phishing domains.
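The IP-based confirmation step can be sketched as below. The scoring function is a stub backed by a lookup table standing in for the trained transformer-based network, and the domains, IP addresses, and both thresholds are hypothetical values chosen for illustration.

```python
def confirm_by_ip(candidates, resolved_ip, domains_by_ip, score,
                  probability_threshold, percentage_threshold):
    """Keep high-scoring domains only when their IP hosts mostly such domains."""
    phishing = set()
    for domain in candidates:
        if score(domain) <= probability_threshold:
            continue
        ip = resolved_ip.get(domain)
        if ip is None:
            continue
        external = domains_by_ip.get(ip, [])
        flagged = [d for d in external if score(d) > probability_threshold]
        if external and len(flagged) / len(external) > percentage_threshold:
            phishing.update(flagged)
    return phishing


# Stub scores standing in for the neural network output (hypothetical).
scores = {
    "win-free-iphone.example": 0.97, "free-gift-card.example": 0.91,
    "cdn.example": 0.05, "login-bank.example": 0.93,
    "news.example": 0.02, "blog.example": 0.03, "shop.example": 0.04,
}
resolved_ip = {
    "win-free-iphone.example": "203.0.113.5",
    "login-bank.example": "198.51.100.7",
}
domains_by_ip = {
    "203.0.113.5": ["win-free-iphone.example", "free-gift-card.example", "cdn.example"],
    "198.51.100.7": ["login-bank.example", "news.example", "blog.example", "shop.example"],
}

result = confirm_by_ip(
    ["win-free-iphone.example", "login-bank.example"],
    resolved_ip, domains_by_ip, scores.get,
    probability_threshold=0.9, percentage_threshold=0.5,
)
print(sorted(result))
```

In this toy data the second candidate scores highly but shares its address with mostly benign names, so it is not confirmed; this is the mechanism by which the resolved IP is said to reduce false positives.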

[0053] Regarding a first aspect of the invention, the homograph phishing domain detection module is instead configured to classify each homograph domain as a homograph phishing domain when the cosine similarity between its associated encoding e_HT_N and any encoding in the reference database D_H exceeds a predetermined threshold; and the misspelling-based phishing domain detection module is instead configured to classify each domain used to generate a Swype-like image I_ST_N as a misspelling-based phishing domain when the cosine similarity between its associated encoding e_ST_N and any encoding in the reference database D_S exceeds a predetermined threshold.
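A minimal sketch of the cosine-similarity variant follows. Note that the comparison direction flips relative to Euclidean distance: a match is declared when similarity exceeds the threshold. The vectors, threshold, and function names are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def matches_reference(encoding, reference_encodings, threshold):
    """True when any reference encoding is more similar than the threshold."""
    return any(cosine_similarity(encoding, ref) > threshold
               for ref in reference_encodings)


references = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical reference encodings
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(matches_reference([0.9, 0.1], references, threshold=0.95))
print(matches_reference([0.7, 0.7], references, threshold=0.95))
```

Cosine similarity ignores vector length, so it is a natural fit for encoders trained with the NT-Xent loss, which is itself defined in terms of cosine similarity.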

[0054] Regarding the first aspect of the invention, the homograph phishing domain detection module is further configured to: receive string images I_HT_Nall rendered from all queried domains; encode each string image I_HT_Nall into its associated encoding e_HT_Nall using the trained homograph encoder E_H; and classify each queried domain as a similar phishing domain when the similarity metric between its associated encoding e_HT_Nall and any encoding in the reference database D_H exceeds a predetermined threshold.

[0055] According to a second aspect of the invention, a method for detecting phishing domains in a Domain Name System (DNS) record set is disclosed, the method using a homograph phishing domain detection module including a trained homograph encoder E_H and a reference database D_H, a misspelling-based phishing domain detection module including a trained misspelling-based encoder E_S and a reference database D_S, a general phishing domain detection module including a trained transformer-based neural network, and an alert module, the method comprising: receiving, using the homograph phishing domain detection module, string images I_HT_N rendered from homograph domains, the homograph domains consisting of domains that each include Punycode; encoding, using the trained homograph encoder E_H, each string image I_HT_N into its associated encoding e_HT_N; classifying, using the homograph phishing domain detection module, each homograph domain as a homograph phishing domain when the Euclidean distance between its associated encoding e_HT_N and any encoding in the reference database D_H is below a predetermined threshold; receiving, using the misspelling-based phishing domain detection module, Swype-like images I_ST_N of domains rendered from the DNS record set; encoding, using the trained misspelling-based encoder E_S, each Swype-like image I_ST_N into its associated encoding e_ST_N; classifying, using the misspelling-based phishing domain detection module, each domain used to generate a Swype-like image I_ST_N as a misspelling phishing domain when the Euclidean distance between its associated encoding e_ST_N and any encoding in the reference database D_S is below a predetermined threshold; receiving, using the general phishing domain detection module, domains identified from the DNS record set that have strings that at least partially match strings in a phishing reference list K; generating, using the trained transformer-based neural network of the general phishing domain detection module, a probability score for each identified domain; resolving, using the general phishing domain detection module, an Internet Protocol (IP) address for each identified domain with a probability score exceeding a predetermined probability threshold; for each resolved IP address, using the general phishing domain detection module to: obtain all external domains associated with the resolved IP address, where external domains refer to all domains within the DNS record set that resolve to the resolved IP address; generate a probability score for each obtained external domain using the transformer-based neural network; and classify each obtained external domain with a probability score exceeding a predetermined probability threshold as a general phishing domain; and generating, using the alert module, alerts for phishing domains detected in the DNS record set based on homograph phishing domains from the homograph phishing domain detection module, misspelling phishing domains from the misspelling phishing domain detection module, and general phishing domains from the general phishing domain detection module.

[0056] Regarding a second aspect of the invention, generating an alert using an alert module includes: using a blacklist and a rule-based filtering module to filter out benign domains from identified phishing domains.

[0057] Regarding a second aspect of the invention, generating an alert using an alert module includes: using a behavior analysis module to obtain unfiltered phishing domains from a blacklist and a rule-based filtering module; using the behavior analysis module to obtain IP addresses that query at least one unfiltered phishing domain; for each IP address that queries at least one unfiltered phishing domain, using the behavior analysis module: generating a count-based vector based on the number of queries for each unfiltered phishing domain by the IP address; applying L2 normalization to each count-based vector; using the behavior analysis module to apply hierarchical clustering to the IP addresses and their associated count-based vectors to identify count-based vectors with similar characteristics; and using the behavior analysis module to classify the IP addresses associated with the identified count-based vectors as IP addresses that have suffered the same phishing attack activity.

[0058] Regarding a second aspect of the invention, generating an alert using an alert module includes: using a behavior analysis module to obtain unfiltered phishing domains from a blacklist and a rule-based filtering module; using the behavior analysis module to obtain IP addresses that query at least one unfiltered phishing domain; for each IP address that queries at least one unfiltered phishing domain, using the behavior analysis module: generating a count-based vector based on the number of queries for each unfiltered phishing domain by the IP address; applying L2 normalization to each count-based vector; using the behavior analysis module to apply locality-sensitive hashing (LSH) to the IP addresses and their associated count-based vectors to identify count-based vectors with similar characteristics; and using the behavior analysis module to classify the IP addresses associated with the identified count-based vectors as IP addresses that have been subjected to the same phishing attack activity.

Description of the Drawings

[0059] The above and other problems are solved by the features and advantages of the system and method according to the invention, as described in the detailed description and shown in the accompanying drawings.

[0060] Figure 1 shows a block diagram of modules for implementing a method and/or system for detecting phishing domains within a Domain Name System (DNS) record set, according to an embodiment of the present invention;

[0061] Figure 2 shows a block diagram of modules, according to an embodiment of the present invention, that can be used to implement another embodiment of a method and/or system for detecting phishing domains in a Domain Name System (DNS) record set;

[0062] Figure 3 shows a block diagram representing a processing system according to an embodiment of the present invention;

[0063] Figure 4 shows a process flow for training Phish-GAN and the homograph encoder according to an embodiment of the present invention;

[0064] Figure 5 shows an exemplary output of the Phish-GAN discriminator when fake and real images are provided to the discriminator;

[0065] Figure 6 shows the training process flow of the homograph encoder according to an embodiment of the present invention;

[0066] Figure 7 shows a flowchart of the process executed by the homograph phishing domain detection module according to an embodiment of the present invention;

[0067] Figure 8 shows an example of encodings generated by a trained homograph encoder, projected onto a low-dimensional visualization plane, according to an embodiment of the present invention;

[0068] Figure 9 shows an exemplary Swype-like rendered image of a domain name according to an embodiment of the present invention;

[0069] Figure 10 shows the training process flow of the misspelling encoder according to an embodiment of the present invention;

[0070] Figure 11 shows a flowchart of the process executed by the misspelling-based phishing domain detection module according to an embodiment of the present invention;

[0071] Figure 12 illustrates clusters of domains and their variants, encoded by a trained misspelling encoder and projected onto a low-dimensional visualization plane, according to an embodiment of the present invention;

[0072] Figure 13 shows a transformer-based neural network model, according to an embodiment of the present invention, that is used by the general phishing domain detection module to identify possible general phishing domains;

[0073] Figure 14 illustrates the process flow for identifying generic phishing domains using a keyword list, a transformer-based neural network, and resolved IP addresses;

[0074] Figure 15 shows an example count-based vector for IP addresses;

[0075] Figure 16 shows an exemplary graph of the number of occurrences detected within a fixed time period;

[0076] Figure 17 shows the graph of Figure 16 after a Hamming filter has been applied;

[0077] Figure 18 shows an exemplary cluster of organizations targeted by the same malicious domain; and

[0078] Figure 19 shows an example graph illustrating the periodicity of phishing attacks.

Detailed Implementation

[0079] This invention relates to a system and method for detecting phishing domains used by network attackers to carry out phishing attacks using Domain Name System (DNS) records. The system includes a homograph phishing domain detection module, a misspelling-based phishing domain detection module, a general phishing domain detection module, and an alerting module. These modules are configured to collaboratively detect and identify phishing domains from a DNS record set using a combination of homograph, misspelling-based, and general phishing domain techniques. Specifically, these modules are configured to utilize generative adversarial neural networks (GANs), image recognition algorithms, transformer neural networks, and/or behavior-based analytics techniques to identify and detect phishing domains and potential phishing activities from a DNS record set.

[0080] The invention will now be described in detail with reference to several embodiments illustrated in the accompanying drawings. In the following description, numerous specific features are set forth to provide a thorough understanding of the embodiments of the invention. However, it will be apparent to those skilled in the art that embodiments may be implemented without some or all of these specific features. Such embodiments should also fall within the scope of the invention. Furthermore, some processing steps and / or structures described below may not be described in detail, and the reader will refer to the relevant citations to avoid unnecessarily obscuring the invention.

[0081] Furthermore, those skilled in the art will recognize that many functional units throughout this specification are labeled as modules. They will also recognize that modules can be implemented as circuits, logic chips, or any kind of discrete component. Moreover, they will recognize that modules can be implemented in software and then executed by various processors. In embodiments of the invention, modules may also include computer instructions or executable code that can instruct a computer processor to perform a series of events based on received instructions. The choice of implementation of the module is left as a design choice to those skilled in the art and does not limit the scope of the invention in any way.

[0082] Figure 1 shows a block diagram of modules for implementing a method and/or system for detecting phishing domains in a Domain Name System (DNS) record set, according to embodiments of the present invention. In general, as shown in Figure 1, DNS record set 103 is provided to string image renderer 105, Swype image renderer 115, and domain filter 120.

[0083] The string images from the string image renderer 105 are then provided to the homograph detection module 125. Module 125 then proceeds to identify possible homograph phishing domains from the received string images and provides these identified homograph phishing domains to the alert module 140.

[0084] Simultaneously, Swype-like images from Swype image renderer 115 are provided to the misspelled phishing domain detection module 130. Module 130 then proceeds to identify misspelled phishing domains from the received Swype-like images and provides these misspelled phishing domains to the alert module 140.

[0085] Simultaneously, at domain filter 120, strings that do not at least partially match the strings in the phishing reference list are filtered out. The remaining strings are then provided to the general phishing detection module 135. Module 135 then uses the remaining strings to identify general phishing domains and provides these general phishing domains to the alerting module 140.

[0086] Alert module 140 then utilizes the received homograph phishing domains, misspelled phishing domains, and generic phishing domains to identify additional anomalous behavior affecting multiple IP addresses and possible similar phishing campaigns. Phishing campaigns are essentially attack activities carried out by specific cyber attackers. Alert module 140 correlates the outputs from modules 125, 130, and 135 to identify similar phishing behaviors affecting various IP addresses within the DNS data, to identify IP addresses experiencing the same phishing campaign, and to indicate their likelihood of being targeted by the same threat.

[0087] Before the system shown in Figure 1 is used to detect phishing domains in the DNS record set, the various modules contained therein must undergo a setup phase. In an embodiment of the invention, a checklist L of commonly used domains, popular domains, and all such similar domains is first generated. A string image renderer 105 is then used to render a string image for each domain in checklist L. A trained homograph encoder E_H, provided within the homograph detection module 125, is then configured to generate an encoding for each rendered string image. The encodings generated by the trained homograph encoder E_H are then stored in a reference database D_H. In an embodiment of the present invention, reference database D_H can be stored within the homograph phishing domain detection module 125 or in an external server that communicates with module 125.

[0088] During the setup phase, checklist L is also provided to the Swype image renderer 115. The Swype image renderer 115 then renders a Swype-like image for each domain in checklist L. A trained misspelling encoder E_S, provided within the misspelling detection module 130, is then configured to generate an encoding for each rendered Swype-like image. The encodings generated by encoder E_S are then stored in a reference database D_S. In an embodiment of the present invention, reference database D_S can be stored in the misspelling detection module 130 or in an external server that communicates with module 130.

[0089] In addition, during this setup phase, a list of keywords frequently used in phishing and clickbait is obtained from external databases and records, such as the PhishTank database, and is used to populate the phishing reference list K.

[0090] Once the setup phase is complete, the system shown in Figure 1 can be deployed to detect phishing domains in DNS record set 103.

[0091] The string image renderer 105 is initially configured to render homograph domains as a set of string images I_HT_N, where homograph domains are domains whose domain strings contain Punycode, indicating the presence of at least one glyph (i.e., a non-Latin character) in the domain string. These homograph domains include domain names from DNS record set 103 that have a subdomain or domain name starting with "xn--", which indicates the presence of Punycode and therefore implies the presence of homographs. In other embodiments of the invention, string image renderer 105 may be configured to render all domain names in DNS record set 103 as the set of string images I_HT_N. This allows for the detection and classification of lookalike domains that exploit visual similarity within the Latin character set, even in the absence of homographs. Examples of such lookalike domains include those that replace "w" with "vv" and "I" with "1".
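
The "xn--" pre-filter described above can be illustrated with a short sketch. The function names are hypothetical, and decoding with Python's built-in `punycode` codec is shown only as one convenient way to recover the Unicode label for rendering; the specification does not state how the renderer decodes labels.

```python
def has_punycode_label(domain):
    """Return True if any dot-separated label of the domain starts with 'xn--'."""
    return any(label.lower().startswith("xn--") for label in domain.split("."))

def decode_label(label):
    """Decode a single 'xn--' label to Unicode; other labels are returned unchanged."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

def select_homograph_candidates(domains):
    """Keep only the domains that contain at least one Punycode label."""
    return [d for d in domains if has_punycode_label(d)]

# Hypothetical record set: only the second and third entries carry an 'xn--' label.
records = ["example.com", "xn--bcher-kva.example", "mail.xn--mnchen-3ya.example"]
candidates = select_homograph_candidates(records)
decoded = decode_label("xn--bcher-kva")  # 'bücher'
```

Only the candidates selected in this way would then be passed to the string image renderer in the embodiment that restricts rendering to homograph domains.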

[0092] The homograph phishing detection module 125, which is equipped with the trained homograph encoder E_H and reference database D_H, is then configured to receive the set of string images I_HT_N from string image renderer 105. The trained homograph encoder E_H then encodes each string image I_HT_N into its associated encoding e_HT_N, and these encodings e_HT_N are compared with the encodings contained in reference database D_H. If the Euclidean distance between an associated encoding e_HT_N and any encoding in reference database D_H is less than a predetermined threshold, module 125 classifies the domain associated with encoding e_HT_N as a homograph phishing domain. In other embodiments of the invention, measures other than Euclidean distance may also be used, such as cosine similarity or L1 distance.
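
The comparison against the reference database can be sketched as below, treating each encoding as a plain list of floats. The function names, the toy two-dimensional encodings, and the threshold value are illustrative assumptions; real encodings would come from the trained encoder.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two encodings."""
    return math.dist(a, b)

def l1_distance(a, b):
    """L1 (Manhattan) distance, one of the alternative measures mentioned above."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_homograph(query_encoding, reference_encodings, threshold):
    """Flag the query if it lies within `threshold` of any reference encoding."""
    return any(euclidean(query_encoding, ref) < threshold for ref in reference_encodings)

# Toy reference database with two encodings (values are made up for illustration).
reference = [[0.1, 1.0], [5.0, 5.0]]
close_hit = is_homograph([0.0, 1.0], reference, threshold=0.5)
far_miss = is_homograph([3.0, 3.0], reference, threshold=0.5)
```

Note that a distance is compared with "less than" while a similarity such as cosine similarity would be compared with "greater than", so the direction of the threshold test depends on the measure chosen.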

[0093] In another embodiment of the invention, the homograph phishing domain detection module 125 may be further configured to receive from the string image renderer 105 another set of string images I_HT_Nall rendered from all queried domains. The trained homograph encoder E_H then encodes each string image I_HT_Nall into its associated encoding e_HT_Nall, and these encodings e_HT_Nall are compared with the encodings contained in reference database D_H. A queried domain is then classified as a lookalike phishing domain when the similarity measure (i.e., Euclidean distance and/or cosine similarity) between its associated encoding e_HT_Nall and any encoding in reference database D_H satisfies a predetermined threshold.

[0094] Meanwhile, Swype image renderer 115 is configured to render domain names from DNS record set 103 as a set of Swype-like images I_ST_N. The misspelling detection module 130, which is equipped with a trained misspelling encoder E_S and reference database D_S, is then configured to receive the set of Swype-like images I_ST_N from Swype image renderer 115. The trained misspelling encoder E_S then encodes each Swype-like image I_ST_N into its associated encoding e_ST_N, and these encodings e_ST_N are compared with the encodings contained in reference database D_S. If the Euclidean distance between an associated encoding e_ST_N and any encoding in reference database D_S is less than a predetermined threshold or, in other embodiments, if the cosine similarity between an associated encoding e_ST_N and any encoding in reference database D_S is higher than a required matching threshold, module 130 classifies the domain associated with encoding e_ST_N as a misspelling-based phishing domain. In another embodiment of the invention, a weighted Damerau-Levenshtein distance algorithm that heuristically takes keyboard distance into account can be used to further verify the phishing domains identified by module 130.
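
One possible form of the keyboard-aware Damerau-Levenshtein check is sketched below. It uses the restricted (optimal string alignment) variant, in which adjacent transpositions count as one edit. The cost values (0.5 for a substitution between neighbouring keys, 1.0 otherwise), the unstaggered key grid, and the function names are assumptions chosen for illustration; the specification does not give the weighting.

```python
# Simplified QWERTY grid; physical key stagger is ignored (an assumption).
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

def substitution_cost(a, b):
    """0 for identical characters, 0.5 for neighbouring keys, 1.0 otherwise."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if max(abs(r1 - r2), abs(c1 - c2)) <= 1:
            return 0.5
    return 1.0

def weighted_dl_distance(s, t):
    """Optimal-string-alignment distance with keyboard-weighted substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)          # delete all of s[:i]
    for j in range(cols):
        d[0][j] = float(j)          # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                   # deletion
                d[i][j - 1] + 1.0,                                   # insertion
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[-1][-1]

near_key_typo = weighted_dl_distance("google", "goigle")   # 'o' and 'i' are neighbours
far_key_typo = weighted_dl_distance("google", "gozgle")    # 'o' and 'z' are not
```

With these assumed weights, a slip onto a neighbouring key scores lower than an arbitrary substitution, so a low distance supports the encoder's finding that a domain is a plausible typo of a popular name.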

[0095] In parallel with the detection of homograph and misspelled phishing domains, domain filter 120 is configured to filter out domains from DNS record set 103 that do not at least partially contain keywords or phrases from phishing reference list K. As a result, the remaining domains all have strings that at least partially match the strings in phishing reference list K.
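
The keyword filter can be sketched as a simple substring test. Interpreting "at least partially match" as a case-insensitive substring match is an assumption of this example, as are the function name and the sample keywords.

```python
def filter_by_keywords(domains, reference_list):
    """Keep only domains containing at least one keyword from the reference list."""
    keywords = [k.lower() for k in reference_list]
    return [d for d in domains if any(k in d.lower() for k in keywords)]

# Hypothetical reference list K and DNS domains.
K = ["login", "verify", "secure"]
dns_domains = ["account-login.example", "news.example", "Secure-Pay.example"]
remaining = filter_by_keywords(dns_domains, K)
```

Only the domains in `remaining` would be scored by the neural network in the first pass, which keeps the number of model evaluations small.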

[0096] The remaining domain names are then passed through a trained transformer-based neural network provided within the generic phishing domain detection module 135 to generate a probability score for each of these remaining domain names. In other words, for each provided domain name, the trained transformer-based neural network estimates the probability that the domain name is a generic phishing domain based on the characteristics of its string. For each domain name with a probability score exceeding a predetermined probability threshold, the domain name's Internet Protocol (IP) address is then resolved. All resolved IP addresses are then stored in a set {p} of suspicious resolved IP addresses. For each IP address included in set {p}, module 135 then obtains from the DNS record set 103 all corresponding domain names that resolve to that IP address, regardless of whether these domain names contain any keywords from the phishing reference list K. Each of these domain names is then passed through the trained transformer-based neural network to classify whether the domain name is a generic phishing domain. In other words, for each resolved IP address, module 135 obtains all domain names (referred to as external domains) within the DNS record set 103 that are associated with the resolved IP address; the transformer-based neural network is then used again to generate a probability score for each obtained external domain. External domains with a probability score exceeding a predetermined probability threshold are then classified as general phishing domains. In another embodiment of the invention, external domains with a probability score exceeding the predetermined probability threshold are classified as general phishing domains only when it is determined that at least a certain percentage of the external domains resolving to a specific IP address have been considered general phishing domains. For example, before external domains exceeding the predetermined probability threshold are classified as general phishing domains, 50% or more of all domains resolving to a specific IP address (i.e., the external domains resolving to that IP address) must already have been flagged as potential phishing domains. This can serve as an additional check to reduce the false positive rate. The rationale behind this check is that attackers tend to reuse their infrastructure, especially when attempting phishing attacks, so that they can expand their reach without incurring significant additional costs.
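
The two-pass flow described above (score keyword-matched domains, pivot on their resolved IP addresses, then re-score every domain on those IP addresses with the optional percentage check) can be sketched as follows. The `score` callable stands in for the trained neural network and is purely hypothetical, as are the record layout, parameter names, and toy scores.

```python
def detect_generic_phishing(records, keywords, score,
                            prob_threshold=0.5, min_fraction=0.5):
    """records: list of (domain, resolved_ip) pairs from the DNS record set.
    score: callable returning a phishing probability for a domain string
    (a stand-in for the trained model). Returns the set of flagged domains."""
    # Pass 1: score only keyword-matched domains and collect suspicious IPs {p}.
    suspicious_ips = {
        ip for domain, ip in records
        if any(k in domain.lower() for k in keywords) and score(domain) > prob_threshold
    }
    flagged = set()
    for ip in suspicious_ips:
        # Pass 2: every domain resolving to this IP, with or without a keyword.
        external = {d for d, resolved in records if resolved == ip}
        hits = {d for d in external if score(d) > prob_threshold}
        # Optional check from the alternative embodiment: require that enough
        # of the domains on this IP look like phishing before flagging any.
        if len(hits) / len(external) >= min_fraction:
            flagged |= hits
    return flagged

# Toy scores standing in for model output.
toy_scores = {
    "paypal-login.example": 0.95, "secure-update.example": 0.90, "blog.example": 0.10,
    "login-help.example": 0.80, "a.example": 0.10, "b.example": 0.10,
}
records = [
    ("paypal-login.example", "192.0.2.1"), ("secure-update.example", "192.0.2.1"),
    ("blog.example", "192.0.2.1"),
    ("login-help.example", "198.51.100.7"), ("a.example", "198.51.100.7"),
    ("b.example", "198.51.100.7"),
]
result = detect_generic_phishing(records, ["login"], toy_scores.get)
```

In this toy run, "secure-update.example" contains no keyword but is still flagged because it shares an IP address with a keyword-matched phishing domain, while "login-help.example" is dropped because only one of the three domains on its IP address scores above the threshold.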

[0097] Alert module 140 is then configured to further analyze the outputs of the phishing detection modules 125, 130, and 135 by identifying anomalous phishing behavior and possible phishing activities by sophisticated threat actors attempting to phish multiple IPs/organizations. Module 140 takes as input all phishing domains identified from the DNS record set, together with their victims, based on the homograph phishing domains received from homograph phishing domain detection module 125, the misspelled phishing domains received from misspelled phishing domain detection module 130, and the generic phishing domains received from generic phishing domain detection module 135. Module 140 includes a blacklist and rule-based filtering module 205 and a behavior analysis module 210. Filtering module 205 can be configured to filter out benign domains inadvertently picked up from the DNS record set 103 by phishing detection modules 125, 130, and 135. For example, any domain name ending in ".gov.sg" or ".com.sg" can be removed from the list of potential phishing domains identified by modules 125, 130, and 135, because domain names ending in these country-specific top-level domains (TLDs) must be officially registered. The likelihood of a phishing website being hosted under these TLDs is therefore extremely low. Behavior analysis module 210 is then configured to determine, based on the remaining phishing domains, whether a malicious party is conducting a phishing campaign and whether the phishing domains exhibit any unusual temporal behavior.
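
The rule-based part of filtering module 205 can be illustrated with a suffix check. The two suffixes come from the example above; the function name and the handling of case and trailing dots are assumptions of this sketch.

```python
TRUSTED_SUFFIXES = (".gov.sg", ".com.sg")

def remove_trusted_domains(domains, suffixes=TRUSTED_SUFFIXES):
    """Drop candidate phishing domains that end in a trusted registry suffix."""
    return [
        d for d in domains
        if not d.lower().rstrip(".").endswith(suffixes)
    ]

# Hypothetical candidate list produced by the detection modules.
candidate_domains = ["portal.gov.sg", "shop.com.sg", "g00gle-login.example"]
unfiltered = remove_trusted_domains(candidate_domains)
```

The domains that survive this step are the "unfiltered phishing domains" that the behavior analysis module goes on to examine.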

[0098] Each module described above will be discussed in detail in later sections.

[0099] According to an embodiment of the present invention, Figure 3 shows a block diagram representing the components of a processing system 300, which can be provided within any of the modules shown in Figure 1 or Figure 2 to implement embodiments according to the present invention. Those skilled in the art will recognize that the exact configuration of each processing system provided within these modules may differ, and the exact configuration of processing system 300 may vary. Figure 3 is provided as an example only.

[0100] In embodiments of the invention, each module may include a controller 301 and a user interface 302. The user interface 302 is arranged to enable manual interaction between a user and each of these modules as needed and, for this purpose, includes the input/output components required for the user to enter instructions to update each of these modules. Those skilled in the art will recognize that the components of the user interface 302 may vary from embodiment to embodiment, but generally include one or more of a display 340, a keyboard 335, and a touchpad 336.

[0101] The controller 301 communicates with the user interface 302 via a bus 315 and includes a memory 320, a processor 305 mounted on a circuit board that processes instructions and data for executing the methods of this embodiment, an operating system 306, an input / output (I / O) interface 330 for communicating with the user interface 302, and a communication interface (in the form of a network interface card 350 in this embodiment). For example, the network interface card 350 can be used to send data from these modules to other processing devices via wired or wireless networks, or to receive data via wired or wireless networks. Wireless networks that the network interface card 350 can use include, but are not limited to, Wi-Fi, Bluetooth, Near Field Communication (NFC), cellular networks, satellite networks, telecommunications networks, wide area networks (WANs), etc.

[0102] Memory 320 and operating system 306 communicate with CPU 305 via bus 310. The memory components include volatile and non-volatile memory, and more than one of each type, including random access memory (RAM) 320, read-only memory (ROM) 325, and mass storage device 345, the last of which includes one or more solid-state drives (SSDs). Memory 320 also includes secure memory 346 for securely storing secret keys or private keys. Those skilled in the art will recognize that the aforementioned memory components include non-transitory computer-readable media and should be considered to include all computer-readable media except for transitory propagating signals. Typically, instructions are stored in the memory components as program code, but they may also be hardwired. Memory 320 may include a kernel and/or programming modules, such as software applications that may be stored in volatile or non-volatile memory.

[0103] In this document, the term "processor" is used generically to refer to any device or component capable of processing such instructions, and may include: microprocessors, microcontrollers, programmable logic devices, or other computing devices. That is, processor 305 may be provided by any suitable logic circuitry for receiving input, processing instructions stored in memory, and producing output (e.g., to a memory component or on display 340). In this embodiment, processor 305 may be a single-core or multi-core processor with memory-addressable space. In one example, processor 305 may be multi-core, including, for example, an 8-core CPU. In another example, it may be a cluster of CPU cores running in parallel to accelerate computation.

[0104] Generally speaking, the homograph detection module 125 includes a homograph encoder E_H and a reference database D_H. Encoder E_H is trained using a conditional generative adversarial network (GAN) labeled Phish-GAN, which is configured and trained to generate an unlimited dataset of homographs based on multiple fonts, and reference database D_H is generated during the setup phase. The process flow for training Phish-GAN 405 and homograph encoder E_H (i.e., encoder 450) is shown in Figure 4.

[0105] Phish-GAN 405 is generally based on the Pix2Pix network architecture, with some changes to its loss function and network architecture. As shown in Figure 4, the main components of Phish-GAN 405 are the generator neural network 415 and the discriminator 420, D(). In the initial stage, because the generator 415 has not yet been trained, its output consists only of noise, regardless of its input. In embodiments of the invention, the generator 415 may include, but is not limited to, a UNet architecture with skip connections connecting the downsampling and upsampling layers. The discriminator 420 may include, but is not limited to, a classifier convolutional neural network configured to learn to classify whether an input image is a real image or a fake image, i.e., an image generated by Phish-GAN. Figure 5 illustrates an exemplary workflow of the discriminator 420. When a glyph image 502 is provided to the discriminator 504, the discriminator 504 should ideally produce an output indicating that the image is fake. Conversely, when a genuine image 506 is provided to the same discriminator 504, the discriminator 504 should ideally produce an output indicating that the image is genuine.

[0106] To train Phish-GAN 405, an open-source dataset y, comprising domains and their possible lookalikes, is used. The entire dataset y is fed to an image renderer 410, which is configured to render strings in various fonts into images x. The rendered images x, along with some randomly generated noise z, are then fed to the generator neural network 415. The generator 415 uses the received rendered images x and the randomly generated noise z to generate a glyph version G(x, z) of each rendered image x. The glyph versions G(x, z) and the rendered images x are then fed to the discriminator 420, which attempts to distinguish between real and fake images based on the provided data. In training step 430, the outputs produced by the discriminator 420 are checked to determine whether the discriminator 420 has successfully identified the real and fake images, and the results from step 430 are fed back to the discriminator 420 for further training. The output from the discriminator 420 is also fed back to the generator 415 to train it in an adversarial manner (i.e., if the discriminator 420 mistakenly believes that a GAN-generated input image is a real image, the training loss of the generator will be low, and vice versa).

[0107] In an embodiment of the present invention, a dot product loss function L_dot can be used as an auxiliary loss in step 440 to train generator 415. The dot product loss L_dot is defined as:

[0108]

[0109] where the "flatten" function reshapes the image tensor into a vector so that the dot product can be computed. This loss function has been found to be useful in preserving image style.

[0110] The generator objective function can then be defined as:

[0111]

[0112] where the adversarial term is a typical generator adversarial loss, which is inversely proportional to the discriminator's performance. The discriminator objective function can then be defined as:

[0113]

[0114] where the adversarial term is a typical discriminator adversarial loss, which is inversely proportional to the generator's performance.

[0115] Once Phish-GAN 405 has been trained, its output 406 can be used to train encoder 450.

[0116] In embodiments of the present invention, encoder 450 may include, but is not limited to, convolutional neural networks (CNNs). In embodiments of the present invention, a contrastive loss function, such as, but not limited to, the triplet loss technique shown in the lower half of Figure 4, can be used to train encoder 450. The triplet loss technique generally involves three inputs: a positive (P) input, an anchor (A) input, and a negative (N) input; and three outputs: a positive encoding E(P), a negative encoding E(N), and an anchor encoding E(A).

[0117] In this embodiment of the invention, to train the encoder 450, a dataset including the most popular domains is used as the positive (P) input; the output from Phish-GAN 405 is used as the anchor (A) input (the dataset including the most popular domains therefore also being provided as input to Phish-GAN 405); and random samples from the domains that the encoder 450 considers most similar to the corresponding anchor input (A), excluding the corresponding positive input (P), are provided as negative (N) inputs. In other words, the positive input (P) is the example from the dataset of most popular domains that the anchor input (A) attempts to imitate, while the negative samples are randomly sampled from a list of the domains most similar to the anchor input (A) (as judged by the encoder 450), excluding the positive input (P). The triplet loss L_triplet can be defined as:

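[0118] $$L_{triplet} = \max\big(d(E(A), E(P)) - d(E(A), E(N)) + M,\; 0\big)$$, where d(·,·) denotes the distance (for example, the Euclidean distance) between two encodings.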

[0119] That is, training minimizes the maximum of 0 and the difference between the anchor-to-positive distance and the anchor-to-negative distance plus M, where M is defined as the margin. From this definition, it is clear that minimizing the triplet loss L_triplet is equivalent to ensuring that the distance between E(A) and E(P) is smaller than the distance between E(A) and E(N) by at least the margin M. If the gap is already larger than M, the loss is zero and does not affect the training of encoder 450. In this way, encoder 450 is trained to output similar encodings for the positive and anchor inputs, while outputting encodings for the anchor and negative inputs that differ by at least the margin M. In other embodiments of the invention, other distance or similarity functions can also be used within the triplet loss. In other embodiments, NT-Xent loss can be used together with the cosine similarity function to train the encoder.
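
The behaviour of the triplet loss for a single (A, P, N) triple can be checked numerically with the sketch below. It assumes plain (unsquared) Euclidean distance and toy two-dimensional encodings; the helper that picks the negatives nearest to the anchor is likewise an illustrative assumption about how "most similar" candidates might be shortlisted before random sampling.

```python
import math

def triplet_loss(anchor, positive, negative, margin):
    """max(d(A, P) - d(A, N) + margin, 0) with Euclidean distance d (assumed)."""
    return max(math.dist(anchor, positive) - math.dist(anchor, negative) + margin, 0.0)

def nearest_negatives(anchor, encodings, positive_name, k=1):
    """Names of the k encodings closest to the anchor, excluding the positive."""
    ranked = sorted(
        (math.dist(anchor, enc), name)
        for name, enc in encodings.items()
        if name != positive_name
    )
    return [name for _, name in ranked[:k]]

anchor = [0.0, 0.0]                      # encoding of a generated homograph
encodings = {                            # toy encodings of real domains
    "google.com": [0.0, 1.0],
    "goggle.com": [0.0, 1.5],
    "apple.com": [5.0, 5.0],
}
hard_negative = nearest_negatives(anchor, encodings, "google.com", k=1)

# Negative far away: the margin is already satisfied, so the loss is zero.
easy_loss = triplet_loss(anchor, encodings["google.com"], encodings["apple.com"], margin=1.0)
# Negative close to the anchor: the loss is positive and drives training.
hard_loss = triplet_loss(anchor, encodings["google.com"], encodings["goggle.com"], margin=1.0)
```

The contrast between `easy_loss` and `hard_loss` shows why negatives are drawn from the domains the encoder currently finds most similar to the anchor: distant negatives contribute no gradient.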

[0120] Figure 6 illustrates a workflow by which encoder 450 can be trained according to an embodiment of the invention. Dataset 601 comprises the most popular domains and is provided to image renderer 605. The rendered images of the strings from dataset 601 are then provided to a trained Phish-GAN 615 and, as positive (P) inputs, to encoder 450. The glyphized outputs from the trained Phish-GAN 615 are then provided to encoder 450 as its anchor (A) input. Random samples from the domains in dataset 601 that encoder 450 considers similar to the anchor (A) (excluding the positive samples provided as positive (P) inputs) are then provided to encoder 450 as negative (N) inputs. The encodings E(P), E(A), and E(N), along with all provided inputs, are then used to train encoder 450 using a contrastive loss function.

[0121] Once trained, encoder 450 can be used to generate encodings for the queried domain and for the domains included in the popular domain dataset. If the Euclidean distance between the encoding of the queried domain and the encoding of any domain in the dataset is less than a certain threshold, the queried domain is classified as a homograph phishing domain. The "target" domain (i.e., the domain the homograph is attempting to mimic) is then determined as the domain in the dataset whose encoding has the smallest Euclidean distance to the encoding of the queried domain. This workflow is shown in Figure 7.

[0122] Checklist 702 includes a list of popular domains generated from various external databases. Image renderer 704 generates rendered images of the domains from checklist 702 and provides them to trained encoder 710. Trained encoder 710 then generates the encodings of the domains in checklist 702 and stores them in database 708.

[0123] When the queried domain is provided to image renderer 706, which is essentially the same module as image renderer 704, image renderer 706 then renders an image of the queried domain. The rendered image of the queried domain is then provided to trained encoder 710. Trained encoder 710 then generates an encoding of the queried domain. In step 712, the encoding of the queried domain is compared with the encodings in database 708 for similarity. If the Euclidean distance between the encoding of the queried domain and any encoding of a domain in the database is less than a certain threshold (theoretically, a margin M), the queried domain is classified as a homograph phishing domain, and an alert is generated with its “target” domain. The “target” domain (i.e., the domain the homograph is attempting to emulate) is the corresponding domain in database 708 whose encoding has the smallest Euclidean distance to the encoding of the queried domain.
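
The comparison in step 712 and the selection of the "target" domain can be sketched as a nearest-neighbour lookup. The dictionary layout of database 708, the function name, and the toy encodings are assumptions for illustration.

```python
import math

def find_target(query_encoding, database, threshold):
    """database maps domain name to encoding. Returns (target, distance) for the
    nearest reference domain if it lies within the threshold, otherwise None."""
    if not database:
        return None
    target, distance = min(
        ((name, math.dist(query_encoding, enc)) for name, enc in database.items()),
        key=lambda pair: pair[1],
    )
    return (target, distance) if distance < threshold else None

# Toy reference encodings standing in for the contents of database 708.
database = {"google.com": [0.0, 1.0], "apple.com": [4.0, 4.0]}
alert = find_target([0.0, 0.5], database, threshold=1.0)      # close to google.com
no_alert = find_target([10.0, 10.0], database, threshold=1.0)  # far from everything
```

Returning the nearest domain together with its distance gives the alert both the suspected target and a measure of how closely the queried domain imitates it.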

[0124] In other embodiments of the invention, a hashing algorithm, such as locality-sensitive hashing (LSH), can be applied to the encodings of the domains in the checklist and of the queried domain to quickly filter out domains whose encodings are significantly different from that of the queried domain. This can significantly speed up the search for similar domains.

[0125] Figure 8 plots exemplary dimensionality-reduced encodings of homograph phishing domains that imitate popular domains, obtained from the output of the trained encoder, along with their associated "target" domains. Specifically, the homograph phishing domain 802 is associated with its target domain 'instagram.com', the homograph phishing domain 804 is associated with its target domain 'yahoo.com', the homograph phishing domain 806 is associated with its target domain 'facebook.com', the homograph phishing domain 808 is associated with its target domain 'google.com', the homograph phishing domain 810 is associated with its target domain 'linkedin.com', the homograph phishing domain 812 is associated with its target domain 'covid19info.live', the homograph phishing domain 814 is associated with its target domain 'wikipedia.org', the homograph phishing domain 816 is associated with its target domain 'microsoft.com', the homograph phishing domain 818 is associated with its target domain 'twitter.com', the homograph phishing domain 820 is associated with its target domain 'youtube.com', and the homograph phishing domain 822 is associated with its target domain 'apple.com'.
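
One common LSH family, random-hyperplane hashing, is sketched below to show how such pre-filtering could work. The specification does not say which LSH scheme is used, so this choice, the number of hash bits, and all names are assumptions. Random-hyperplane signatures approximate angular (cosine) similarity rather than Euclidean distance.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector falls on."""
    return tuple(
        int(sum(v * h for v, h in zip(vec, plane)) >= 0) for plane in hyperplanes
    )

def build_buckets(database, hyperplanes):
    """Group reference domains by signature so lookups touch a single bucket."""
    buckets = {}
    for name, enc in database.items():
        buckets.setdefault(lsh_signature(enc, hyperplanes), []).append(name)
    return buckets

def candidate_domains(query_encoding, buckets, hyperplanes):
    """Reference domains sharing the query's signature; only these need an
    exact distance computation afterwards."""
    return buckets.get(lsh_signature(query_encoding, hyperplanes), [])

# Toy reference encodings (made-up values).
database = {"google.com": [0.2, 1.0, -0.3], "apple.com": [-1.0, 0.1, 0.8]}
planes = make_hyperplanes(dim=3, n_bits=8, seed=42)
buckets = build_buckets(database, planes)
shortlist = candidate_domains([0.2, 1.0, -0.3], buckets, planes)
```

In practice several independent hash tables are often combined so that near neighbours falling just across a hyperplane are not missed; that refinement is omitted here for brevity.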

[0126] While DNS record set 103 is being rendered by string image renderer 105, Swype image renderer 115 simultaneously renders the domain names in DNS record set 103 as Swype-like images. Swype image renderer 115 renders these Swype-like images based on the QWERTY keyboard layout, as this is the most widely adopted keyboard format. In an embodiment of the invention, this is done by mapping each character in the string to a grid position based on its physical position on the QWERTY keyboard. For example, the character “q” is mapped to grid position [1; 0] because it is located in the second row and first column of the QWERTY keyboard. To ensure that the strokes are separated and do not lie exactly on top of each other, a small amount of noise is added to the keyboard position of each character in the string. In this embodiment, random uniform noise between 0 and 0.1 is added to both axes. This corresponds to 10% of a key on the keyboard, since each key is defined as having a length and height of 1. Next, a pre-set color sequence is used to encode the order of the characters in the text string (although other methods of identifying the order could also be used). For example, the first stroke, between the 1st and 2nd characters, is always blue, the next stroke, between the 2nd and 3rd characters, is always light blue, and so on. Finally, the positions of the keys (together with the noise) on the 4 x 10 grid corresponding to the QWERTY keyboard are multiplied by 10 and rendered as a 40 x 100 image using the Python Pillow package.
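
The geometry of the Swype-like rendering can be sketched without any imaging library by computing the coloured line segments that would be drawn. The first two colours follow the example above; the rest of the colour list, the cycling of colours for long strings, the skipping of characters outside the 4 x 10 grid, and the function name are assumptions of this sketch. The resulting segments could then be drawn onto a 100-pixel-wide, 40-pixel-high image with a line-drawing routine such as Pillow's `ImageDraw.line`.

```python
import random

# 4 x 10 key grid: row index first, column index second, so "q" maps to (1, 0).
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
KEY_GRID = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

# Only the first two colours come from the text; the rest are hypothetical.
STROKE_COLORS = ["blue", "lightblue", "cyan", "green", "yellow", "orange", "red"]

def swype_strokes(domain, scale=10, noise=0.1, seed=None):
    """Return a list of ((x0, y0), (x1, y1), color) segments for the domain."""
    rng = random.Random(seed)
    points = []
    for ch in domain.lower():
        if ch not in KEY_GRID:
            continue  # characters outside the grid (e.g. "-") are skipped here
        row, col = KEY_GRID[ch]
        # Uniform noise in [0, noise] on both axes, then scale grid units to pixels.
        x = (col + rng.uniform(0.0, noise)) * scale
        y = (row + rng.uniform(0.0, noise)) * scale
        points.append((x, y))
    return [
        (points[i], points[i + 1], STROKE_COLORS[i % len(STROKE_COLORS)])
        for i in range(len(points) - 1)
    ]

strokes = swype_strokes("google", seed=1)
noiseless = swype_strokes("qa", noise=0.0)
```

Because the noise is bounded by one tenth of a key, two typo variants that differ only by neighbouring keys produce segments that differ by at most about one key width, which is what makes their rendered images look alike.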

[0127] Figure 9 shows examples of two popular domains, namely domains 902 and 906, and two possible misspelling variants, namely variants 904 and 908. It can be seen that, using the Swype image renderer 900, the Swype-like images 912 and 916 for the two domains are similar to the Swype-like images 914 and 918 for their corresponding misspelling variants, respectively. For domain 902 and its variant 904, this is because “o” and “i” are adjacent to each other on the keyboard. Therefore, the difference between the Swype-like images of “google” 912 and “goigle” 914 is almost imperceptible. Similarly, for domain 906 and its variant 908, “a” and “q” are adjacent to each other on the keyboard, so the Swype-like renderings of “facebook” 916 and “fqcebook” 918 are also almost identical.

[0128] Generally speaking, the misspelling detection module 130 includes a trained misspelling encoder $E_S$ and a reference database $D_S$ (generated as described above). Referring to Figure 10, the encoder $E_S$ 1015 can be trained using a contrastive loss function, such as, but not limited to, a triplet loss function (as described in the previous section) or a normalized temperature-scaled cross-entropy (NT-Xent) loss function, so that encoder 1015 produces similar encodings for a domain and its misspelling variants, and dissimilar encodings for a domain and its non-misspelling variants. Using the same method that may be used to train the homograph encoder $E_H$, the actual domain is provided as the positive (P) input, the misspelling variant is provided as the anchor (A) input, and a non-misspelling variant that encoder 1015 considers visually similar to the anchor (A) is provided as the negative (N) input of encoder 1015.

[0129] In this embodiment of the invention, the NT-Xent loss function is used to train the encoder 1015. The NT-Xent loss function uses the cosine similarity $s_{i,j}$ instead of the Euclidean distance, together with a cross-entropy loss, to train the encoder. One advantage of this method over the triplet loss is that, instead of sampling a single negative sample, a batch of negative samples can be used to train the model at once. In this embodiment, the 8 samples closest to the anchor (as determined by encoder 1015), excluding the positive sample, are used as negative samples. The cosine similarity $s_{i,j}$ between input vectors $z_i$ and $z_j$ can be defined as:

[0130] $$s_{i,j} = \frac{z_i^{\top} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$$

[0131] where $z_i^{\top}$ is defined as the transpose of the input vector $z_i$. The numerator is essentially a dot product, and the denominator normalizes it by the magnitudes of the two vectors, so that the output can be interpreted as the cosine of the angle between the two input vectors.

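[0132] Then, the NT-Xent loss function $L_{NT\text{-}Xent}$ can be defined as:

[0133] $$L_{NT\text{-}Xent} = -\log \frac{\exp(s_{a,p}/\tau)}{\exp(s_{a,p}/\tau) + \sum_{k} \exp(s_{a,n_k}/\tau)}$$

[0134] where $s_{a,p}$ is the cosine similarity between the anchor and the positive sample, $s_{a,n_k}$ is the cosine similarity between the anchor and a specific negative sample $n_k$, and $\tau$ is the temperature used to scale the loss.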
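As an illustration of how the anchor-positive similarity, the anchor-negative similarities and the temperature combine, the following sketch computes one common form of the NT-Xent loss for a single anchor. It is a minimal sketch under assumptions: the denominator is taken to include the positive term, the default temperature of 0.5 is arbitrary, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(anchor, positive, negatives, temperature=0.5):
    """NT-Xent loss for one anchor, one positive and a batch of negatives."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives)
    # Cross-entropy of picking the positive among the positive and all negatives.
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
good = nt_xent_loss(anchor, [1.0, 0.0], [[-1.0, 0.0]])   # positive close to anchor
bad = nt_xent_loss(anchor, [-1.0, 0.0], [[1.0, 0.0]])    # positive far from anchor
```

The loss is small when the anchor is close to the positive and far from the negatives (`good`), and large in the opposite case (`bad`).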

[0135] Specifically, encoder 1015 is trained to output meaningful embeddings that can be used to detect misspelling phishing domains. An exemplary CNN architecture for encoder 1015 is shown in Table 2 below.

[0136]

[0137] Table 2

[0138] The first two layers use "same" padding, which is crucial for preserving image edge information. The output of the final layer is L2 normalized, making the magnitude of each embedding equal to 1. This allows both the triplet loss (TL) and the NT-Xent loss (NL) to be used to train the model. Specifically, setting the L2 norm of the embeddings to 1 means that minimizing the Euclidean distance between anchor and positive sample pairs according to the TL loss formula is equivalent to maximizing their cosine similarity, since the squared Euclidean distance between two unit-norm vectors equals 2 minus twice their cosine similarity and therefore decreases as the cosine similarity increases.
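The relationship between Euclidean distance and cosine similarity for L2-normalized embeddings can be checked numerically. The sketch below uses two arbitrary example vectors (chosen for illustration only) and verifies that, after normalization, the squared Euclidean distance equals 2 minus twice the dot product, which for unit vectors is the cosine similarity.

```python
import math


def l2_normalize(v):
    """Scale v so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


a = l2_normalize([3.0, 4.0, 0.0])   # stands in for an anchor embedding
p = l2_normalize([1.0, 2.0, 2.0])   # stands in for a positive embedding

lhs = squared_euclidean(a, p)       # ||a - p||^2
rhs = 2.0 - 2.0 * dot(a, p)         # 2 - 2 * cos(a, p) for unit vectors
```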

[0139] Referring to Figure 10, domains 1001 are provided to DNSTwist 1005 and image renderer 1010b. Domains 1001 comprise the top 20,000 domains from a list of millions of domains obtained from an external database. DNSTwist 1005 then generates samples of misspelling-based phishing domains for domains 1001. Specifically, DNSTwist 1005 permutes each domain name in domains 1001 according to a predefined set of rules that attackers tend to use, to generate potential misspelling-based phishing domains. The domains generated by DNSTwist 1005 also take keyboard distance into account by providing a dictionary of permissible replacements for each key. Additionally, it should be noted that the generated dataset is biased towards small edit distances (≤2). In summary, the dataset generated by DNSTwist 1005 contains a total of approximately 2 million possible phishing domains, each of which originates from one of the domain names in domains 1001.
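The kind of keyboard-aware permutation described above can be illustrated with a small generator of edit-distance-1 variants. This is a simplified stand-in written for illustration and is not DNSTwist or its API: the adjacency rule (keys at most one row and one column apart on a letter-only grid), the choice of substitution, omission and transposition as the only operations, and the function names are all assumptions.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}


def adjacent_keys(ch):
    """Keys at most one row and one column away from ch (assumed adjacency rule)."""
    if ch not in POS:
        return []
    r, c = POS[ch]
    return sorted(
        k for k, (kr, kc) in POS.items()
        if k != ch and abs(kr - r) <= 1 and abs(kc - c) <= 1
    )


def typo_variants(domain):
    """Generate edit-distance-1 variants of the label before the first dot."""
    label, _, suffix = domain.partition(".")
    variants = set()
    for i, ch in enumerate(label):
        for k in adjacent_keys(ch):                      # adjacent-key substitution
            variants.add(label[:i] + k + label[i + 1:])
        variants.add(label[:i] + label[i + 1:])          # character omission
        if i + 1 < len(label):                           # neighbouring transposition
            variants.add(label[:i] + label[i + 1] + ch + label[i + 2:])
    variants.discard(label)   # swapping identical letters reproduces the original
    variants.discard("")
    tail = "." + suffix if suffix else ""
    return sorted(v + tail for v in variants)


google_variants = typo_variants("google.com")
facebook_variants = typo_variants("facebook.com")
```

Under this adjacency rule, "goigle.com" appears among the variants of "google.com" and "fqcebook.com" among those of "facebook.com", matching the examples of Figure 9.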

[0140] The dataset generated by DNSTwist is then provided to image renderer 1010a (essentially the same as image renderer 1010b), which renders the domains in the dataset as Swype-like images used as anchor (A) inputs to encoder 1015. Simultaneously, Swype-like images rendered by image renderer 1010b for domains 1001 are provided as positive (P) inputs to encoder 1015. Then, non-misspelling variants that encoder 1015 considers similar to the anchor, but that are not positive samples, are provided as negative (N) inputs to encoder 1015. After receiving all these inputs, encoder 1015 is trained in step 1020 using a contrastive loss function such as a triplet loss function or an NT-Xent loss function.

[0141] Once trained, encoder 1015 can be used to generate encodings for both the Swype-like image of the queried domain and the domains in a checklist, which can include the top 20k domains from a dataset of millions. If the similarity score (in this case, the cosine similarity score between the encoding of the queried domain and any encoding of a domain in the checklist) is greater than a certain threshold, the domain is classified as a misspelling phishing domain, and the target domain (i.e., the domain that the misspelling attempts to imitate) is the domain in the checklist whose encoding is most similar to the encoding of the queried domain. Figure 11 shows a flowchart illustrating this process.

[0142] Checklist 1102 includes a list of popular domains generated from various external databases. Swype-like images of the domains from checklist 1102 are rendered by image renderer 1104 and provided to trained encoder 1106. Encoder 1106 then provides the encoding of checklist 1102 to database 1112.

[0143] When the queried domain 1108 is provided to the image renderer 1110, the image renderer 1110 renders a Swype-like image of the queried domain 1108. The rendered image of the queried domain 1108 is then provided to the trained encoder 1106, which generates an encoding for the queried domain. In step 1114, the encoding of the queried domain is compared with the encodings in the database 1112 for similarity. If the cosine similarity (or Euclidean distance) between the encoding of the queried domain and any encoding of a domain in the database is greater than (or, in the case of Euclidean distance, less than) a specific threshold, the queried domain is classified as a misspelling phishing domain, and an alert is generated together with its “target” domain. The “target” domain (i.e., the domain the misspelling is attempting to imitate) is the corresponding domain in the database 1112 whose encoding has the maximum cosine similarity (or minimum Euclidean distance) to the encoding of the queried domain.
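The comparison in step 1114 can be sketched as a nearest-neighbour search over the reference encodings. The example below is a minimal sketch: the three-dimensional encodings are toy values rather than encoder outputs, the threshold of 0.9 is an arbitrary assumption, and `find_target` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def find_target(query_encoding, reference_db, threshold=0.9):
    """Return (target_domain, score) if the best match exceeds threshold, else None.

    reference_db maps each checklist domain to its encoding.
    """
    best_domain, best_score = None, -1.0
    for domain, encoding in reference_db.items():
        score = cosine_similarity(query_encoding, encoding)
        if score > best_score:
            best_domain, best_score = domain, score
    if best_domain is not None and best_score > threshold:
        return best_domain, best_score
    return None


# Toy reference database standing in for database 1112.
reference_db = {"google.com": [1.0, 0.0, 0.0], "facebook.com": [0.0, 1.0, 0.0]}

hit = find_target([0.99, 0.05, 0.0], reference_db)   # close to google.com
miss = find_target([0.5, 0.5, 0.7], reference_db)    # close to nothing
```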

[0144] Figure 12 plots exemplary reduced-dimensional encodings of misspelling phishing domains (of popular domains) obtained from the output of the trained encoder, along with their associated "target" domains. Specifically, misspelling phishing domain 1202 is associated with its target domain 'instagram.com', misspelling phishing domain 1204 with its target domain 'microsoft.com', misspelling phishing domain 1206 with its target domain 'facebook.com', misspelling phishing domain 1208 with its target domain 'youtube.com', misspelling phishing domain 1210 with its target domain 'linkedin.com', misspelling phishing domain 1212 with its target domain 'google.com', and misspelling phishing domain 1214 with its target domain 'apple.com'.

[0145] In embodiments of the invention, a weighted Damerau-Levenshtein distance (DLD) algorithm can be implemented to calculate the keyboard-distance-weighted edit distance between the queried domain and the identified target domain. This algorithm provides additional information and also serves as a verification step to check that the neural network and a conventional DLD-based algorithm give consistent results.
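A keyboard-weighted edit distance of this kind can be sketched as follows. The sketch implements the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance; the weighting scheme, in which substituting an adjacent key costs 0.5 and every other edit costs 1.0, is an assumption made for illustration and is not specified in the text.

```python
QWERTY_ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}


def substitution_cost(a, b):
    """0 for identical characters, 0.5 for adjacent keys (assumed), 1 otherwise."""
    if a == b:
        return 0.0
    if a in POS and b in POS:
        (r1, c1), (r2, c2) = POS[a], POS[b]
        if max(abs(r1 - r2), abs(c1 - c2)) == 1:
            return 0.5
    return 1.0


def weighted_dld(s, t):
    """Restricted Damerau-Levenshtein distance with keyboard-weighted substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)              # delete all of s[:i]
    for j in range(cols):
        d[0][j] = float(j)              # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                      # deletion
                d[i][j - 1] + 1.0,                                      # insertion
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)           # transposition
    return d[-1][-1]
```

With these weights, "goigle" is at distance 0.5 from "google" because "i" and "o" are adjacent keys, whereas a substitution with a distant key or a transposition costs 1.0.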

[0146] In other embodiments of the invention, a hashing algorithm such as Locality-Sensitive Hashing (LSH) can be applied to the encodings of the domains in the checklist and the encoding of the queried domain to quickly filter out domains whose encodings are significantly different from that of the queried domain. This can significantly speed up the search for similar domains. In one such experiment, the LSH algorithm improved the search speed by a factor of 12.
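One way to realize such a pre-filter is random-hyperplane LSH, which suits cosine similarity. The text does not say which LSH family is used, so the choice of random hyperplanes, the number of bits, and all function names below are assumptions. Encodings that fall on the same side of every hyperplane share a bucket, and only the domains in the query's bucket need an exact similarity comparison.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def build_index(reference_db, hyperplanes):
    """Group checklist domains into buckets keyed by signature."""
    index = {}
    for domain, encoding in reference_db.items():
        index.setdefault(lsh_signature(encoding, hyperplanes), []).append(domain)
    return index


def candidates(query_encoding, index, hyperplanes):
    """Domains sharing the query's bucket; only these need an exact comparison."""
    return index.get(lsh_signature(query_encoding, hyperplanes), [])


planes = make_hyperplanes(dim=4, n_bits=8, seed=1)
index = build_index({"google.com": [0.2, -0.7, 0.4, 0.1]}, planes)
same_direction = candidates([0.4, -1.4, 0.8, 0.2], index, planes)
```

More bits give smaller buckets and faster lookups at the cost of missing some near neighbours; practical systems usually query several independent hash tables to compensate.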

[0147] The homograph detection module 125 and the misspelling detection module 130 are configured to detect and identify domains that attempt to mimic legitimate domains. Unlike these two modules, the general phishing detection module 135 is configured to detect generic phishing domains that exploit human weaknesses. Examples of such domains include, but are not limited to, "watch-this[.]live", "get-free-airtickets[.]live", and "celeb-secret[.]online". These domains are not intended to mimic any specific domain, but rather to lure people into clicking on them, thereby facilitating phishing, malware dissemination, etc. In other words, the general phishing detection module 135 is configured to detect domains that do not resemble any specific popular domain but exploit human desires to trick people into clicking on them, thereby facilitating phishing attacks.

[0148] To achieve this, the phishing detection module 135 utilizes natural language processing (NLP) techniques, such as a transformer architecture, to detect such domains. In one embodiment of the invention, instead of training the transformer model from scratch with character-level tokenization, a pre-trained transformer neural network is used, such as, but not limited to, a Bidirectional Encoder Representations from Transformers (BERT) model (trained on the open-source Wikipedia text corpus). Transfer learning is then performed by further fine-tuning the model on a dataset consisting of domains validated by VirusTotal and PhishTank. The model architecture is shown in Figure 13.

[0149] The sub-word tokenization scheme 1302 is then used together with sub-word tokens obtained from the Wikipedia corpus. Sub-word tokenization was found to carry more semantic meaning and to better capture certain words that are typically used in domains to trick people into clicking, such as "win", "free", and "watch".

[0150] The model is then fine-tuned by taking the pooled output 1306 of the transformer 1304 and passing it through several dense layers 1308 (i.e., a multi-layer perceptron) to obtain a binary output with sigmoid activation indicating whether the domain is a possible general phishing domain.

[0151] The model is trained using a binary cross-entropy loss function. At the end of the training phase, the transformer model is able to receive an input domain name, tokenize it, and output the probability that the input domain is a possible phishing domain.
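For reference, the sigmoid activation and the binary cross-entropy loss mentioned above can be written out directly. This is a generic sketch of the two functions and does not reproduce the model or its training loop; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```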

[0152] To reduce the number of false alarms when large volumes of network traffic data are fed to the phishing detection module 135, an additional workflow is employed to leverage another feature: resolved IP addresses. When a querying IP queries a specific domain, the resolved IP address is contained in the DNS server's response to the querying IP. This is an important feature because attackers tend to reuse their infrastructure to achieve the widest impact with minimal effort. A study of our data shows that multiple phishing domains often reside on the same IP address. The workflow of this process is shown in Figure 14.

[0153] Initially, a list of keywords typically used to lure people into clicking on websites is generated. This list can be obtained from PhishTank and further enhanced with additional information to obtain a phishing reference list K. In step 1402, when the DNS record set is provided, the keyword filter will filter out records whose domain names do not contain at least one of the keywords in list K.

[0154] Then, in step 1404, the remaining domain names having at least one of the keywords in list K are provided to the neural network (e.g., a BERT-based transformer model). Then, in step 1406, using the suspicious domain names obtained from step 1404 (i.e., domain names marked by neural network 1404 as possible generic phishing domains), the corresponding resolved IP address is determined for each of these suspicious domains using the original DNS record set. This set of suspicious IP addresses is then labeled as set P.

[0155] For each resolved IP in the set P, all domains that resolve to that particular IP are then obtained, regardless of the initial keyword list used as a filter. In step 1408, each of these obtained domains is processed by the neural network to determine whether it is a suspicious domain that should be flagged.

[0156] If the proportion of domains marked as suspicious by the neural network exceeds a certain threshold (50% in our embodiment), then the domains marked by the neural network will be generated as alerts, i.e., classified as general phishing domains; otherwise, all alerts will be discarded. This occurs in step 1410.
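Steps 1402 to 1410 can be sketched end to end as follows. The neural network is replaced by a caller-supplied scoring function, so `score_fn`, the toy scorer, the sample records, the IP addresses, and the probability threshold of 0.5 are all illustrative assumptions; only the 50% proportion threshold comes from the embodiment. The proportion is evaluated per resolved IP, which is one reading of the text.

```python
def keyword_filter(domains, keywords):
    """Step 1402: keep only domains containing at least one keyword from list K."""
    return [d for d in domains if any(k in d for k in keywords)]


def detect_general_phishing(dns_records, keywords, score_fn,
                            prob_threshold=0.5, ratio_threshold=0.5):
    """dns_records is a list of (domain, resolved_ip) pairs.

    score_fn maps a domain to a phishing probability and stands in for the
    transformer-based neural network.
    """
    domains = sorted({d for d, _ in dns_records})
    filtered = keyword_filter(domains, keywords)
    # Step 1404: score the keyword-filtered domains.
    suspicious = {d for d in filtered if score_fn(d) > prob_threshold}
    # Step 1406: collect the resolved IPs of suspicious domains (set P).
    suspicious_ips = {ip for d, ip in dns_records if d in suspicious}
    alerts = set()
    for ip in suspicious_ips:
        # Step 1408: re-score every domain on this IP, ignoring the keyword list.
        hosted = {d for d, rip in dns_records if rip == ip}
        flagged = {d for d in hosted if score_fn(d) > prob_threshold}
        # Step 1410: alert only if enough of the IP's domains look suspicious.
        if len(flagged) / len(hosted) > ratio_threshold:
            alerts |= flagged
    return alerts


def toy_score(domain):
    """Hypothetical scorer used only to exercise the workflow."""
    return 0.9 if ("free" in domain or "secret" in domain) else 0.1


records = [
    ("get-free-airtickets.live", "192.0.2.1"),
    ("celeb-secret.online", "192.0.2.1"),
    ("example.org", "192.0.2.1"),
    ("free-docs.example", "198.51.100.7"),
    ("news.example", "198.51.100.7"),
    ("blog.example", "198.51.100.7"),
]
alerts = detect_general_phishing(records, ["free", "win", "watch"], toy_score)
```

In this toy run, "celeb-secret.online" is alerted even though it contains no keyword, because it shares an IP with a flagged domain, while "free-docs.example" is discarded because only one of the three domains on its IP is flagged.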

[0157] The alert module 140 includes a blacklist and rule-based filtering module 205 and a behavior analysis module 210, as previously shown in Figure 2.

[0158] In this embodiment of the invention, a blacklist of legitimate domains is generated based on past research and results obtained from the alerting module 140 over an extended time period. Exemplary rules included in module 205 may instruct the alerting module 140 not to generate an alert if the domain ends with a specific string such as, but not limited to, “.com.sg”, or if the domain ends with a trusted address such as, but not limited to, “.gov.sg”.

[0159] In addition to using such a blacklist of legitimate domains, potential alerts generated by alert module 140 can be removed when any of the following rules is not satisfied:

[0160] 1. The domain name string query associated with each DNS record must be a valid domain name. For example, the domain name string query must contain at least one '.'.

[0161] 2. The domain name string must contain a valid top-level domain (TLD).

[0162] 3. The domain name string must contain valid characters.

[0163] 4. The domain name string must contain a valid number of characters.
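The four rules above can be sketched as a single validity check. The concrete limits used here (at most 253 characters overall and 63 per label, labels made of letters, digits and hyphens) follow common DNS conventions and are assumptions as far as this document is concerned, and the TLD set is a small illustrative subset rather than a complete list.

```python
import re

# Illustrative subset only; a real deployment would load a complete TLD list.
VALID_TLDS = {"com", "org", "net", "sg", "live", "online"}

# A label: letters, digits and hyphens, 1 to 63 characters, no leading/trailing hyphen.
LABEL_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")


def is_valid_domain(name):
    """Apply rules 1 to 4 to a queried domain name string."""
    name = name.lower().rstrip(".")
    if "." not in name:                 # rule 1: must contain at least one '.'
        return False
    if len(name) > 253:                 # rule 4: overall length limit
        return False
    labels = name.split(".")
    if labels[-1] not in VALID_TLDS:    # rule 2: valid top-level domain
        return False
    # Rules 3 and 4: valid characters and length for every label.
    return all(LABEL_RE.match(label) is not None for label in labels)
```

For example, `is_valid_domain("google.com")` passes, whereas "localhost" (no dot), "exa_mple.com" (invalid character) and a name with a 64-character label are rejected.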

[0164] Furthermore, the blacklist can be periodically updated based on the output generated by the alert module 140 over an extended period of time, and can also be changed according to the requirements of those skilled in the art.

[0165] In another embodiment of the invention, the behavior analysis module 210 may be configured to use two types of behavior analysis to process the filtered alerts obtained from the blacklist and rule-based filtering module 205: (i) activity level detection, and (ii) periodic query detection.

[0166] When performing activity-level detection, IPs (and related organizations) that have experienced similar types of phishing attacks are initially identified. These IPs are used as proxies so that the system can detect phishing activity using only network traffic data.

[0167] Given a specific analysis period, a set of query IPs and their corresponding phishing domains are obtained from the domains provided to the alerting module 140 and subsequently filtered by the blacklist and rule-based filtering module 205. The complete set of flagged phishing domains is then consolidated. It should be noted that the size of this set equals the dimension of the vector generated for each query IP.

[0168] For each IP address that queried at least one suspicious phishing domain output by module 205, a vector is generated using the following method:

[0169] 1. Generate a count for each phishing domain visited by the query IP from the list of suspicious phishing domains accessed.

[0170] 2. Fill the vector with these counts.

[0171] 3. Finally, L2 normalize the vectors. This is to account for different sizes of organization.

[0172] 4. Now that each query IP has an associated vector, dimensionality reduction techniques such as Uniform Manifold Approximation and Projection (UMAP) or t-Distributed Stochastic Neighbor Embedding (t-SNE) can be applied, followed by hierarchical clustering to automatically determine the clusters.

[0173] The process up to step 3 above is illustrated in detail below:

[0174] a. In this example, suppose two IPs are querying a suspicious phishing domain, as shown below:

[0175]

[0176] b. First, merge the suspicious phishing domain sets as follows:

[0177]

[0178] c. Since there are five (5) unique suspicious phishing domains, the dimension of the output vector is 5.

[0179] d. Next, a count-based vector is generated for each IP. As shown in Figure 15, a count-based vector 1502 is generated for the first IP address "11.22.33.44", while a count-based vector 1504 is generated for the second IP address "aa.bb.cc.dd".

[0180] e. Finally, the two count-based vectors 1502 and 1504 are L2 normalized so that their magnitude is 1.
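Steps a to e can be sketched as follows. The two IP labels are the placeholders used in the example above, but the queried domains and their counts are hypothetical values invented for illustration, chosen only so that there are five unique suspicious domains; `count_vectors` is likewise a hypothetical helper name.

```python
import math
from collections import Counter


def count_vectors(queries_by_ip):
    """Build one L2-normalized count vector per querying IP.

    queries_by_ip maps a querying IP to the list of suspicious domains it queried.
    """
    # Merged, ordered set of all suspicious domains defines the vector dimension.
    vocabulary = sorted({d for ds in queries_by_ip.values() for d in ds})
    vectors = {}
    for ip, ds in queries_by_ip.items():
        counts = Counter(ds)
        raw = [float(counts[d]) for d in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[ip] = [x / norm for x in raw] if norm else raw
    return vocabulary, vectors


# Hypothetical query logs for the two placeholder IPs.
queries = {
    "11.22.33.44": ["d1.example", "d1.example", "d2.example", "d3.example"],
    "aa.bb.cc.dd": ["d3.example", "d4.example", "d5.example"],
}
vocabulary, vectors = count_vectors(queries)
```

Normalizing to unit length means that a large organization issuing many queries and a small one issuing few end up with similar vectors as long as they query the same domains in similar proportions.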

[0181] In other embodiments of the invention, Locality-Sensitive Hashing (LSH) can be used instead of hierarchical clustering to obtain vectors with similar characteristics.

[0182] In a further embodiment of the invention, the behavior analysis module 210 may be configured to analyze the timestamp of the detected phishing domain for each queried IP to create a time series.

[0183] Based on this list of timestamps, the relative time (i.e., the time difference between each record in the list and the first record in the list) is then calculated. The relative times are then binned into selected time intervals (i.e., these time interval bins may be referred to as the sampling period). In this embodiment, the sampling period is set to 1 minute (i.e., 60 seconds).

[0184] Figure 16 shows an example of such a chart 1602. As can be seen, the X-axis spans 1440 minutes, equivalent to one day. This is because, in this embodiment, the objective is to determine the frequency of queries to such suspicious phishing domains within a day. This analysis time period is arbitrary and can be increased. For each minute (in terms of relative time), the queries to identified phishing domains occurring within that minute, relative to the first record, are combined (i.e., binned) and counted. Once chart 1602 is generated, more advanced frequency analysis can be performed on it.
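The binning that produces such a chart can be sketched as below, assuming timestamps expressed in seconds. The function name is hypothetical, and dropping queries that fall outside the analysis window is an assumption.

```python
def occurrence_time_series(timestamps, sampling_period=60, n_bins=1440):
    """Count queries per bin, with time measured relative to the first record.

    timestamps are in seconds; the defaults give 1-minute bins over one day.
    """
    series = [0] * n_bins
    if not timestamps:
        return series
    ordered = sorted(timestamps)
    first = ordered[0]
    for ts in ordered:
        idx = int((ts - first) // sampling_period)   # relative time -> bin index
        if idx < n_bins:                             # assumption: drop later queries
            series[idx] += 1
    return series


# Relative times of 0 s, 30 s, 60 s and 1800 s fall into bins 0, 0, 1 and 30.
series = occurrence_time_series([1000, 1030, 1060, 2800])
```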

[0185] In another embodiment of the invention, once each timestamp list has been converted into the representation shown in Figure 17, the resulting signal is passed through the Hanning filter 1702 shown in Figure 17 to reduce "ringing" in the frequency domain. Ringing is an artifact of a signal ending abruptly.

[0186] In the example above, the signal abruptly ends at approximately 1200 minutes. This abrupt end is equivalent to applying a brick-wall filter, that is, multiplying the signal by a rect function. Since the rect function transforms to a sinc function in the frequency domain, many high sidelobes appear in the frequency domain. Therefore, in this embodiment, the signal is multiplied by a Hanning filter function, as shown in Figure 17. Note that the Hanning filter is applied from 0 to the last non-zero timestamp. In other embodiments of the invention, other filter functions, such as the Hamming or Blackman filter functions, can be used. The signal can then be converted to the frequency domain using a typical discrete Fourier transform algorithm for additional signal processing and analysis. In this embodiment, frequency determination methods are applied in the frequency domain to detect the presence of periodic signals and thereby determine whether the attack has progressed to a later stage.
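The windowing and transform steps can be sketched with a direct discrete Fourier transform. This is a minimal illustration: picking the single strongest non-DC bin is a simplistic stand-in for the frequency determination methods mentioned above, the synthetic test signal is invented, and a practical implementation would use an FFT library instead of the quadratic-time loop shown here.

```python
import cmath
import math


def hann_window(n):
    """Symmetric Hann (Hanning) window of length n."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_hann(series):
    """Window the series from index 0 to its last non-zero sample."""
    last = max((i for i, v in enumerate(series) if v), default=-1)
    if last < 0:
        return [0.0] * len(series)
    window = hann_window(last + 1)
    return [series[i] * window[i] if i <= last else 0.0 for i in range(len(series))]


def dft_magnitudes(signal):
    """Magnitudes of DFT bins 0 .. n/2 (direct O(n^2) evaluation)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def dominant_period(series):
    """Period, in bins, of the strongest non-DC frequency component."""
    magnitudes = dft_magnitudes(apply_hann(series))
    k = max(range(1, len(magnitudes)), key=lambda i: magnitudes[i])  # skip DC bin
    return len(series) / k


# Synthetic series: a burst of queries every 30 bins over 240 bins.
burst = [1, 2, 3, 4, 3, 2, 1] + [0] * 23
series = [burst[t % 30] for t in range(240)]
period = dominant_period(series)
```

For this synthetic series the strongest component lies at a period of 30 bins, which corresponds to the kind of regular querying interval the analysis is intended to reveal.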

[0187] In an exemplary embodiment of the invention, the behavior analysis module 210 is configured to process the filtered alerts obtained from the blacklist and rule-based filtering module 205 using two types of behavior analysis: (i) activity level detection, and (ii) detection of periodic queries. A complete set of query IPs that queried phishing domains is then generated and plotted as the points in Figure 18. The clusters shown in Figure 18, in particular clusters 1802, 1804, 1806, 1808, 1810, 1812, 1814, 1816, 1818, and 1820, represent various possible phishing activities targeting different IP addresses.

[0188] When the graph of Figure 18 is further analyzed using frequency analysis, it is found that a cluster of IPs was querying similar phishing domains at a clear 30-minute interval. One example is shown in Figure 19, where peaks 1902, 1904, and 1906 appear periodically. This periodicity can be used as an indication that the activity includes more sophisticated attacks.

[0189] Those skilled in the art can identify many other variations, substitutions, variants and modifications, and the present invention is intended to cover all such variations, substitutions, variants and modifications that fall within the scope of the appended claims.

Claims

1. A system for detecting phishing domains in a Domain Name System (DNS) record set, comprising: a homograph phishing domain detection module including a trained homograph encoder $E_H$ and a reference database $D_H$, the module being configured to: receive string images $I_{HT\_N}$ rendered from homograph domains, wherein the homograph domains include domains whose characters contain Punycode; encode, using the trained homograph encoder $E_H$, each of the string images $I_{HT\_N}$ into its associated encoding $e_{HT\_N}$; and classify each homograph domain as a homograph phishing domain when the Euclidean distance between its associated encoding $e_{HT\_N}$ and any encoding in the reference database $D_H$ is below a predetermined threshold; a misspelling-based phishing domain detection module including a trained misspelling-based encoder $E_S$ and a reference database $D_S$, the module being configured to: receive Swype-like images $I_{ST\_N}$ rendered from domains in the DNS record set; encode, using the trained misspelling-based encoder $E_S$, each of the Swype-like images $I_{ST\_N}$ into its associated encoding $e_{ST\_N}$; and classify each domain used to generate the Swype-like images $I_{ST\_N}$ as a misspelling-based phishing domain when the Euclidean distance between its associated encoding $e_{ST\_N}$ and any encoding in the reference database $D_S$ is less than a predetermined threshold; a general phishing domain detection module comprising a trained transformer-based neural network, the module being configured to: receive domains identified from the DNS record set as having strings that at least partially match strings in a phishing reference list K; generate, using the trained transformer-based neural network, a probability score for each identified domain; for each identified domain with a probability score exceeding a predetermined probability threshold, resolve its Internet Protocol (IP) address; for each of the resolved IP addresses, obtain all external domains associated with the resolved IP address, where external domains refer to all domains within the DNS record set that resolve to the resolved IP address; generate, using the transformer-based neural network, a probability score for each obtained external domain; and classify each obtained external domain with a probability score exceeding the predetermined probability threshold as a general phishing domain; and an alert module configured to generate an alert for phishing domains detected in the DNS record set based on the homograph phishing domains from the homograph phishing domain detection module, the misspelling-based phishing domains from the misspelling-based phishing domain detection module, and the general phishing domains from the general phishing domain detection module.

2. The system of claim 1, wherein the alert module includes a blacklist and rule-based filtering module, the blacklist and rule-based filtering module being configured to filter out benign domain names from the identified phishing domains.

3. The system according to claim 2, wherein the alert module further comprises a behavior analysis module configured to: obtain unfiltered phishing domains from the blacklist and rule-based filtering module; obtain the IP addresses that queried at least one of the unfiltered phishing domains; for each IP address that queried at least one of the unfiltered phishing domains, generate a count-based vector based on the number of queries made by the IP address to each unfiltered phishing domain; apply L2 normalization to each count-based vector; apply hierarchical clustering to the IP addresses and their associated count-based vectors to identify count-based vectors with similar characteristics; and classify the IP addresses associated with the identified count-based vectors as IP addresses that have been subjected to the same phishing attack activity.

4. The system according to claim 2, wherein the alert module further comprises a behavior analysis module configured to: obtain unfiltered phishing domains from the blacklist and rule-based filtering module; obtain the IP addresses that queried at least one of the unfiltered phishing domains; for each IP address that queried at least one of the unfiltered phishing domains, generate a count-based vector based on the number of queries made by the IP address to each unfiltered phishing domain; apply L2 normalization to each count-based vector; apply Locality-Sensitive Hashing (LSH) to the IP addresses and their associated count-based vectors to identify count-based vectors with similar characteristics; and classify the IP addresses associated with the identified count-based vectors as IP addresses that have been subjected to phishing attacks.

5. The system according to claim 3 or 4, wherein the behavior analysis module is further configured to apply a dimensionality reduction process, such as t-SNE or UMAP, prior to the hierarchical clustering or LSH.

6. The system according to claim 2, wherein the alert module further comprises a behavior analysis module configured to: obtain unfiltered phishing domains from the blacklist and rule-based filtering module; obtain the IP addresses that queried at least one of the unfiltered phishing domains; for each IP address that queried at least one unfiltered phishing domain, collect a sorted list of timestamps of its queries to the at least one unfiltered phishing domain; calculate relative times from the sorted timestamps, where each calculated relative time is the time elapsed relative to the first timestamp; bin the calculated relative times at a desired sampling frequency and count the number of entries in each bin to obtain an occurrence time series, wherein the occurrence time series is defined as the number of queries performed on the at least one unfiltered phishing domain; apply a Hanning filter to the occurrence time series; and perform frequency analysis on the filtered occurrence time series to determine the presence of periodicity and its associated frequencies.

7. The system of claim 1, wherein the homograph encoder $E_H$ is trained by minimizing a triplet loss function $L_{triplet}$ with respect to the parameters of the homograph encoder $E_H$, the function $L_{triplet}$ being defined as $L_{triplet} = \max\left(\lVert E(A) - E(P) \rVert^2 - \lVert E(A) - E(N) \rVert^2 + M,\ 0\right)$, wherein the positive input of the function $L_{triplet}$ is provided with a set of images rendered from a collection of frequently visited popular Internet domains, the anchor input of the function $L_{triplet}$ is provided with a set of outputs from a trained Phish Generative Adversarial Network (Phish-GAN), where each output corresponds to a glyph version of the positive input of the current set, and the negative input of the function $L_{triplet}$ is provided with a set of string images of domains sampled from those domains that the encoder $E_H$ considers similar to the anchor input within the collection of frequently visited popular Internet domains, excluding the actual positive sample of the current set, and wherein E(P) is defined as the positive encoded output, E(A) is defined as the anchor encoded output, E(N) is defined as the negative encoded output, and M is the margin.

8. The system according to claim 1, wherein the homograph encoder $E_H$ is trained by minimizing a normalized temperature-scaled cross-entropy (NT-Xent) loss function $L_{NT\text{-}Xent}$, the $L_{NT\text{-}Xent}$ being defined as $L_{NT\text{-}Xent} = -\log \frac{\exp(s_{a,p}/\tau)}{\exp(s_{a,p}/\tau) + \sum_{k} \exp(s_{a,n_k}/\tau)}$, wherein the positive input of the function $L_{NT\text{-}Xent}$ is provided with a set of images rendered from a collection of frequently visited popular Internet domains, the anchor input of the function $L_{NT\text{-}Xent}$ is provided with a set of outputs from a trained Phish Generative Adversarial Network (Phish-GAN), where each output corresponds to a glyph version of the positive input of the current set, and the negative input of the function $L_{NT\text{-}Xent}$ is provided with a set of string images of domains that the encoder $E_H$ considers, within the collection of frequently visited popular Internet domains, to be similar to the anchor input of the current set, excluding the actual positive sample of that set, and wherein $s_{a,p}$ is the cosine similarity measure between the encodings of the anchor and the positive sample, $s_{a,n_k}$ is the cosine similarity measure between the encodings of the anchor and a negative sample $n_k$, and $\tau$ is a temperature parameter.

9. The system according to claim 7 or 8, wherein the trained Phish-GAN comprises: a trained generator G configured to receive an image x rendered from a training dataset y and randomly generated noise z, and to produce a glyph version G(x, z) of each received image; and a trained discriminator D configured to: receive the glyph version G(x, z) from the trained generator G and the image x; and classify, using a discriminator function D(), the images x and G(x, z) as either fake or real images; wherein the generator G is trained using an auxiliary dot-product loss function $L_{dot}$, the $L_{dot}$ being defined in terms of a flatten function that reshapes an image tensor into a vector so that the dot product can be computed, and wherein the trained Phish-GAN includes a generator objective function and a discriminator objective function.

10. The system according to claim 1, wherein the misspelling-based encoder $E_S$ is trained by minimizing a normalized temperature-scaled cross-entropy (NT-Xent) loss function $L_{NT\text{-}Xent}$, the $L_{NT\text{-}Xent}$ being defined as $L_{NT\text{-}Xent} = -\log \frac{\exp(s_{a,p}/\tau)}{\exp(s_{a,p}/\tau) + \sum_{k} \exp(s_{a,n_k}/\tau)}$, wherein the positive input of the function $L_{NT\text{-}Xent}$ is provided with a set of Swype-like images rendered from a collection of frequently visited popular Internet domains, the anchor input of the function $L_{NT\text{-}Xent}$ is provided with a set of Swype-like images of misspelling phishing domains generated for the associated set of positive inputs, and the negative input of the function $L_{NT\text{-}Xent}$ is provided with a set of Swype-like images of domains that the encoder $E_S$ considers, within the collection of frequently visited popular Internet domains, to be similar to the anchor input of the current set, excluding the actual positive sample of that set, and wherein $s_{a,p}$ is the cosine similarity measure between the encodings of the anchor and the positive sample, $s_{a,n_k}$ is the cosine similarity measure between the encodings of the anchor and a negative sample $n_k$, and $\tau$ is a temperature parameter.

11. The system according to claim 1, wherein the misspelling-based encoder E_S is trained by minimizing a triplet loss function L_triplet, defined as [equation not reproduced in this text], wherein:
the positive input of the function L_triplet is provided with a set of Swype-like images rendered from a collection of frequently visited popular Internet domains;
the anchor input of the function L_triplet is provided with a set of Swype-like images of misspelling-based phishing domains generated for the positive input of the associated set; and
the negative input of the function L_triplet is provided with a set of Swype-like images of domains that the encoder E_S considers similar to the anchor input of the current set, the domains being drawn from the collection of frequently visited popular Internet domains but excluding the actual positive sample of that set;
and wherein E(P) is defined as the positive encoded output, E(A) as the anchor encoded output, E(N) as the negative encoded output, and M as the margin.
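Because the equation for L_triplet is not reproduced in this text, the sketch below assumes the widely used squared-Euclidean form max(||E(A) - E(P)||^2 - ||E(A) - E(N)||^2 + M, 0). The claimed definition may differ, for example by using unsquared distances. The function names and toy vectors are illustrative only.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(e_a, e_p, e_n, margin=1.0):
    """Assumed triplet loss: zero once the negative encoding E(N) is at least
    `margin` further (in squared distance) from the anchor E(A) than the
    positive encoding E(P) is."""
    return max(
        squared_distance(e_a, e_p) - squared_distance(e_a, e_n) + margin, 0.0
    )

# Anchor coincides with the positive and is far from the negative: no penalty.
satisfied = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0], margin=1.0)
# Roles reversed: the anchor is far from the positive, so the loss is large.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], margin=1.0)
```

Under this assumption the margin M sets how much closer the anchor must sit to its legitimate domain than to any look-alike negative before a triplet stops contributing to training.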

12. The system according to claim 1, wherein the transformer-based neural network is trained using a binary cross-entropy loss function.

13. The system according to claim 1, wherein the general phishing domain detection module, in generating the probability score for each of the identified domains, is configured to:
perform sub-word tokenization on each of the identified domains; and
generate the probability score from the sub-word tokens using the trained transformer-based neural network.
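The claim does not name a tokenization scheme. As one possible illustration, the sketch below uses greedy longest-match-first splitting in the style of WordPiece, with a tiny hypothetical vocabulary. The vocabulary, the `##` continuation prefix, the `[UNK]` token, and the choice to treat the whole domain as a single word are all assumptions made for this example.

```python
def wordpiece_tokenize(text, vocab, unk="[UNK]"):
    """Greedy longest-match-first sub-word split of a single string.
    Pieces after the first carry a '##' continuation prefix."""
    tokens = []
    start = 0
    while start < len(text):
        end = len(text)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = text[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry covers this position: give up on the string.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, for illustration only.
vocab = {"pay", "secure", "##pal", "##-", "##login", "##.", "##com"}
tokens = wordpiece_tokenize("paypal-login.com", vocab)
```

Splitting a domain this way lets a classifier see brand fragments such as "pay" and "##pal" as separate inputs, so a lookalike that embeds a known brand inside a longer string still exposes that brand to the model.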

14. The system according to claim 13, wherein the general phishing domain detection module is further configured to, for each of the resolved IP addresses:
retrieve all external domains, the external domains being the domains that resolve to the IP address;
generate, using the trained transformer-based neural network, a probability score for each external domain from its sub-word tokens; and
when the percentage of the retrieved external domains that are determined to be general phishing domains exceeds a percentage threshold, classify all external domains whose probability scores exceed a predetermined threshold as general phishing domains.
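One reading of the percentage test in claim 14 can be sketched as follows. It assumes that an external domain counts toward the percentage when its own probability score exceeds the score threshold. The function name, both default thresholds, and the example scores are hypothetical.

```python
def classify_external_domains(scores, score_threshold=0.5,
                              percentage_threshold=0.5):
    """scores maps each external domain on one resolved IP address to its
    probability score. Returns the domains classified as general phishing
    domains, or an empty list if too small a share of the IP's domains
    looks suspicious."""
    if not scores:
        return []
    flagged = [d for d, s in scores.items() if s > score_threshold]
    # Only act when the suspicious share of domains on this IP is large enough.
    if len(flagged) / len(scores) > percentage_threshold:
        return sorted(flagged)
    return []

mostly_bad = classify_external_domains(
    {"a.example": 0.9, "b.example": 0.8, "c.example": 0.1}
)
mostly_good = classify_external_domains(
    {"a.example": 0.9, "b.example": 0.1, "c.example": 0.2}
)
```

Gating on the share of suspicious domains per IP address helps avoid flagging a single high-scoring domain that happens to sit on shared hosting alongside many benign sites.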

15. The system according to claim 1, wherein:
the homograph phishing domain detection module, in classifying each homograph domain as a homograph phishing domain, is instead configured to classify each homograph domain as a homograph phishing domain when the cosine similarity between its associated encoding e_HT_N and any encoding in the reference database D_H exceeds a predetermined threshold; and
the misspelling-based phishing domain detection module, in classifying each domain used to generate a Swype-like image I_ST_N, is instead configured to classify each such domain as a misspelling-based phishing domain when the cosine similarity between its associated encoding e_ST_N and any encoding in the reference database D_S exceeds a predetermined threshold.
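The cosine-similarity test of claim 15 can be illustrated with a short sketch. The reference database is modelled here as a plain dictionary from a legitimate domain to its encoding. That layout, the function names, the 0.9 threshold, and the toy two-dimensional encodings are assumptions for the example, not details from the claims.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_reference_match(encoding, reference_db, threshold=0.9):
    """Return the reference domain whose encoding is most similar to
    `encoding`, provided the similarity exceeds `threshold`; otherwise None."""
    best_name, best_sim = None, threshold
    for name, ref in reference_db.items():
        sim = cosine_similarity(encoding, ref)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

# Toy reference database of legitimate-domain encodings.
reference_db = {"example.com": [1.0, 0.0], "example.org": [0.0, 1.0]}
lookalike = best_reference_match([0.99, 0.05], reference_db)
unrelated = best_reference_match([1.0, 1.0], reference_db)
```

A non-None result corresponds to the claimed condition being met for at least one reference encoding. Returning the matched domain as well is a convenience for alerting and is not required by the claim.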

16. The system according to claim 1, wherein the homograph phishing domain detection module is further configured to:
receive string images I_HT_Nall rendered from all queried domains;
encode, using the trained homograph encoder E_H, each of the string images I_HT_Nall into its associated encoding e_HT_Nall; and
classify each queried domain as a similar phishing domain when the similarity metric between the associated encoding e_HT_N and any encoding in the reference database D_H exceeds a predetermined threshold.

17. A method for detecting phishing domains in a Domain Name System (DNS) record set, the method using a homograph phishing domain detection module comprising a trained homograph encoder E_H and a reference database D_H, a misspelling-based phishing domain detection module comprising a trained misspelling-based encoder E_S and a reference database D_S, a general phishing domain detection module comprising a trained transformer-based neural network, and an alert module, the method comprising:
receiving, using the homograph phishing domain detection module, string images I_HT_N rendered from homograph domains, the homograph domains comprising domains that each have characters comprising Punycode;
encoding, using the trained homograph encoder E_H, each of the string images I_HT_N into its associated encoding e_HT_N;
classifying, using the homograph phishing domain detection module, each homograph domain as a homograph phishing domain when the Euclidean distance between its associated encoding e_HT_N and any encoding in the reference database D_H is less than a predetermined threshold;
receiving, using the misspelling-based phishing domain detection module, Swype-like images I_ST_N of domains rendered from the DNS record set;
encoding, using the trained misspelling-based encoder E_S of the misspelling-based phishing domain detection module, each Swype-like image I_ST_N into its associated encoding e_ST_N;
classifying, using the misspelling-based phishing domain detection module, each domain used to generate a Swype-like image I_ST_N as a misspelling-based phishing domain when the Euclidean distance between its associated encoding e_ST_N and any encoding in the reference database D_S is less than a predetermined threshold;
receiving, using the general phishing domain detection module, domains identified from the DNS record set that have strings at least partially matching strings in a phishing reference list K;
generating, using the trained transformer-based neural network of the general phishing domain detection module, a probability score for each identified domain;
resolving, using the general phishing domain detection module, an Internet Protocol (IP) address for each identified domain whose probability score exceeds a predetermined probability threshold;
for each of the resolved IP addresses, using the general phishing domain detection module to:
obtain all external domains associated with the resolved IP address, the external domains being all domains within the DNS record set that resolve to the resolved IP address;
generate, using the transformer-based neural network, a probability score for each obtained external domain; and
classify each obtained external domain whose probability score exceeds the predetermined probability threshold as a general phishing domain; and
generating, using the alert module, an alert for phishing domains detected in the DNS record set based on the homograph phishing domains from the homograph phishing domain detection module, the misspelling-based phishing domains from the misspelling-based phishing domain detection module, and the general phishing domains from the general phishing domain detection module.

18. The method according to claim 17, wherein generating an alert using the alert module comprises:
filtering out benign domains from the identified phishing domains using a blacklist and rule-based filtering module.

19. The method according to claim 18, wherein generating an alert using the alert module comprises:
obtaining, using a behavior analysis module, the unfiltered phishing domains from the blacklist and rule-based filtering module;
obtaining, using the behavior analysis module, the IP addresses that queried at least one of the unfiltered phishing domains;
for each IP address that queried at least one of the unfiltered phishing domains, using the behavior analysis module to:
generate a count-based vector from the number of queries made by the IP address to each unfiltered phishing domain; and
apply L2 normalization to the count-based vector;
applying, using the behavior analysis module, hierarchical clustering to the IP addresses and their associated count-based vectors to identify count-based vectors with similar features; and
classifying, using the behavior analysis module, the IP addresses associated with the identified count-based vectors as IP addresses that have been subjected to the same phishing attack activity.
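The steps of claim 19 can be sketched with the standard library alone. The claim does not specify a linkage criterion or a cut-off, so this example assumes single linkage with a fixed Euclidean distance threshold. The function names, the `max_distance` default, and the query log format are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def count_vectors(query_log, domains):
    """query_log is an iterable of (client_ip, queried_domain) pairs.
    Returns {client_ip: L2-normalized per-domain query counts}."""
    index = {d: i for i, d in enumerate(domains)}
    counts = {}
    for ip, domain in query_log:
        if domain in index:
            counts.setdefault(ip, [0] * len(domains))[index[domain]] += 1
    return {ip: l2_normalize(v) for ip, v in counts.items()}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def single_linkage_clusters(vectors, max_distance=0.5):
    """Merge clusters until no two clusters have members within max_distance.
    The result equals a single-linkage dendrogram cut at max_distance."""
    clusters = [[ip] for ip in sorted(vectors)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if d <= max_distance:
                    clusters[i] = sorted(clusters[i] + clusters[j])
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

L2 normalization makes two clients with the same query mix but different query volumes map to the same point, so clustering groups them by which phishing domains they contacted rather than by how often.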

20. The method according to claim 18, wherein generating an alert using the alert module comprises:
obtaining, using a behavior analysis module, the unfiltered phishing domains from the blacklist and rule-based filtering module;
obtaining, using the behavior analysis module, the IP addresses that queried at least one of the unfiltered phishing domains;
for each IP address that queried at least one of the unfiltered phishing domains, using the behavior analysis module to:
generate a count-based vector from the number of queries made by the IP address to each unfiltered phishing domain; and
apply L2 normalization to the count-based vector;
applying, using the behavior analysis module, locality-sensitive hashing (LSH) to the IP addresses and their associated count-based vectors to identify count-based vectors with similar characteristics; and
classifying, using the behavior analysis module, the IP addresses associated with the identified count-based vectors as IP addresses that have been subjected to phishing attacks.
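Claim 20 does not say which LSH family is used. The sketch below assumes random-hyperplane (sign) hashing, a common choice for vectors compared by angle. The function names, the number of bits, the fixed seed, and the toy vectors are illustrative assumptions.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, num_bits=8, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )

def lsh_buckets(vectors, hyperplanes):
    """Group IP addresses whose vectors share a signature. Only buckets with
    more than one IP address are returned as candidate groups."""
    buckets = defaultdict(list)
    for ip, vec in vectors.items():
        buckets[lsh_signature(vec, hyperplanes)].append(ip)
    return [sorted(ips) for ips in buckets.values() if len(ips) > 1]

planes = make_hyperplanes(dim=3)
vectors = {
    "10.0.0.1": [0.6, 0.8, 0.0],
    "10.0.0.2": [0.6, 0.8, 0.0],
    "10.0.0.3": [-0.6, -0.8, 0.0],  # artificial opposite direction, for contrast
}
groups = lsh_buckets(vectors, planes)
```

Vectors at a small angle to each other agree on most bits and tend to share a bucket, so candidate groups can be found without comparing every pair of IP addresses, which is the usual reason to prefer LSH over exhaustive clustering on large DNS logs.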