Anti-phishing method based on multi-factor comprehensive assessment method
A multi-factor comprehensive evaluation technique, applied in special data processing applications, instruments, and electrical digital data processing, that addresses problems such as a high false alarm rate, reliance on a single detection method, and insufficient detection power, and achieves high accuracy and comprehensive coverage.
Inactive Publication Date: 2014-12-10
XIDIAN UNIV
AI-Extracted Technical Summary
Problems solved by technology
At present, security software identifies phishing websites with a single URL-based method that does not address the essence of phishing websites. Blacklist and whitelist identification lags behind because phishing websites change URLs frequently; it is a passive anti-phishing method that sacrifices the interests of some users. The efficiency and speed of recognition bas...
Abstract
The invention relates to an anti-phishing method based on a multi-factor comprehensive assessment method. The method comprises the following steps: step a, establishing a blacklist and whitelist library of URLs (uniform resource locators), processing a target URL, and judging whether the processed URL is in the blacklist or whitelist; if so, executing step d and feeding the result back to the user directly; otherwise, executing step b to perform subsequent detection of the website; step b, detecting the website from four aspects: URL angle recognition, website behavior and detail feature recognition, server angle recognition, and crawler angle recognition; step c, summarizing the weights and determining the feedback result; and step d, displaying the result. The method assesses a website from many angles with a rigorous procedure, so its consideration is comprehensive and its accuracy is high. The suspicious points that were hit and their corresponding weights, the webpage links found, the website files, and the judgment criteria are displayed simply and clearly in a graphical interface, so the result is available to relevant professionals for examination while also being fed back to the user.
Application Domain
Transmission; Special data processing applications
Technology Topic
Assessment methods; Weight value
Example Embodiment
[0049] The above and other technical features and advantages of the present invention will be described in more detail below in conjunction with the accompanying drawings.
[0050] Figure 1 is a flowchart of the anti-phishing method based on the multi-factor comprehensive evaluation method of the present invention, wherein:
[0051] Step a, establish a blacklist and whitelist library of URLs, and process the target URL to determine whether the processed URL is in the blacklist or whitelist. If it is in the list library, execute step d and feed the result back to the user directly; if it is not in the list library, execute step b to perform subsequent detection of the website.
[0052] Step b, during detection, the URL angle recognition is performed first; because this part executes relatively quickly, it does not need the overhead of a dedicated thread. After that, three threads are used to detect the remaining three aspects. After the detection of these three parts is completed, the total weight is written to the file temp_result.dat in the agreed format to facilitate the summary and feedback of the results.
[0053] Step c, summarize the weights and feed back the result. If the sum of the total weights exceeds the agreed threshold, the user is warned of the danger of a phishing website; if it is below the threshold, the user is given a safe detection result.
[0054] The weights are recorded in a result array that is first initialized to 0; values are added to the array as results are counted, and the array is finally written to the file temp_result in the agreed format so that it can be read when the weights are summarized. For the URL, website behavior and detailed feature, and server angles, the weight of every suspicious point is first set to 1; the number of hits for each point is then tested and counted on 500 foreign phishing websites and 500 domestic phishing websites published on PhishTank, and weights are assigned to each suspicious point according to the results.
[0055] The processing of the above detection results mainly uses the linear weighting method, a threshold algorithm, and a statistical algorithm. The statistical algorithm finds the number of records that meet the set conditions within a given range: a conditional statement judges whether the current record meets the given conditions, and the count is increased by one if it does. In the first three parts, we use the linear weighting method of the multi-factor comprehensive scoring method to score the recognition results. It is implemented with two vectors, $S = \langle s_1, s_2, \ldots, s_i, \ldots \rangle$ and $W = \langle w_1, w_2, \ldots, w_i, \ldots \rangle$. In the vector S, $s_i$ is assigned 1 if the corresponding suspicious point is hit and 0 otherwise; in the vector W, $w_i$ is the weight of the corresponding $s_i$, and the values of $w_i$ are derived by the statistical algorithm above.
[0056] The total weight is calculated as $G = \sum_i s_i w_i$. If the obtained value G is greater than the upper threshold, the user is warned that the website is dangerous; if G is smaller than the lower threshold, the user is told that the website is safe; if G lies between the lower and upper thresholds, the corresponding degree of suspicion is returned to the user, prompting the user to be careful when accessing the website and recommending that the user learn how to prevent phishing attacks. The specific thresholds are also obtained by the statistical algorithm: the upper alarm threshold of the total weight is set at 70, and the lower threshold at 30.
[0057] Step d, display the result.
[0058] During the run, the program synchronously returns the target website's form response, GET request response, geographic location query, website outbound links, website files, and suspicious feature points with their corresponding weights, so that relevant professionals can understand the working principle. After the program finishes running, a different reminder window pops up for the user according to the total weight.
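The linear weighting and three-way threshold rule of paragraphs [0055] and [0056] can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, the example weights in the usage note are invented, and treating a value exactly equal to a threshold as "suspicious" is an assumption, since the text does not say how boundary values are handled.

```python
from typing import Sequence

UPPER_THRESHOLD = 70  # upper alarm threshold stated in the text
LOWER_THRESHOLD = 30  # lower threshold stated in the text


def total_weight(s: Sequence[int], w: Sequence[float]) -> float:
    """Linear weighting: G = sum of s_i * w_i, where s_i is 1 for a hit and 0 otherwise."""
    if len(s) != len(w):
        raise ValueError("S and W must have the same length")
    return sum(si * wi for si, wi in zip(s, w))


def classify(g: float) -> str:
    """Map the total weight G onto the three outcomes described in the text."""
    if g > UPPER_THRESHOLD:
        return "danger"
    if g < LOWER_THRESHOLD:
        return "safe"
    # Assumption: values equal to either threshold fall into the middle band.
    return "suspicious"
```

For instance, with hit vector `[1, 0, 1]` and illustrative weights `[40, 25, 35]`, `total_weight` gives 75, which `classify` reports as `"danger"`.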
[0059] Step b includes four inspections: URL angle identification b1, website behavior and detail feature identification b2, server angle identification b3, and crawler angle identification b4. These four checks are described below.
[0060] URL angle identification: the URL angle identification of the present invention includes blacklist and whitelist identification, a URL format check, a special-character masquerade check, a domain name level check, and a path level check.
[0061] URL identification is one of the most widely used methods at present, with the advantages of fast identification and a 100% identification rate for URLs on the blacklist or whitelist; it includes URL blacklist-based technology, machine learning-based URL detection technology, and so on. In the present invention, when the URL is found in the blacklist or whitelist, the user is prompted directly, which further improves the accuracy and speed of detection.
[0062] The URL format check determines whether the URL format is suspicious. Phishers often use an IP address in place of the full domain name in the URL of a phishing website, which effectively hides the identity of the server; in addition, this kind of URL cannot be taken down by closing a domain name. This rarely occurs with normal websites, so it can be used as a sign that the URL is suspicious.
[0063] The special-character masquerade check looks for the ways phishing websites disguise a URL other than hiding the domain name behind an IP address. Usually, hexadecimal encoding is used or special characters are added to the URL to disguise it. Some characters in a URL have specific functions, and some have specific functions only in certain positions; a character that cannot be displayed literally is sent to the web server in escaped form. The part of the URL that actually takes effect in parsing begins after the marker character, which is the principle of the deception.
[0064] The domain name level check determines whether the number of domain name levels is normal. The domain name in a normal URL simply and clearly reflects the content of the website. To convince users that the website they are visiting is a regular website, phishers on the one hand make their domain name resemble that of the regular website, and on the other hand pad their domain name with several levels of the regular website's domain name.
[0065] The path level check examines the number of path levels in the URL. A normal URL consists of a domain name, an access path, and access parameters. Phishers not only work on the domain name but also add content such as the abbreviation of the counterfeited website to the access path that follows, in order to deceive users; this often shows up as a very large number of path levels.
[0066] The process of URL angle identification is:
[0067] Step b11, standardize the URL passed in as a parameter into a form starting with http://;
[0068] Step b12, count the number of "." characters, English letters, and "/" characters in the string. If the number of "." exceeds the specified threshold, the number of domain name levels exceeds the specified value, and the corresponding weight is added at the corresponding position of the weight record array;
[0069] Step b13, if there is no English letter and the number of "." is 3 (such as 192.168.0.1), the URL is in IP form, and the corresponding weight is added;
[0070] Step b14, if the URL contains special characters, such as "" characters or an excessive number of hexadecimal escape codes (for example %XX, where X represents a digit), the URL is disguised with special characters, and the corresponding weight is added;
[0071] Step b15, if the number of "/" is too large, the path has too many levels, and the corresponding weight is added.
[0072] The weight record uses an array representing the result, which is initialized to 0 first, and then the values in the array are added when the results are counted, and finally output to the file temp_result according to the format, which is convenient for calling when summarizing the weight.
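Steps b11 to b15 can be sketched in a few lines. Everything numeric here is an assumption: the text gives no thresholds or weights, so `MAX_DOTS`, `MAX_SLASHES`, `MAX_HEX_ESCAPES`, and `WEIGHTS` are placeholders. The special character is missing from the source text, so `SPECIAL_CHARS = "@"` is only a guess, chosen because "@" separates user information from the host in a URL. Counting is done on the part after the scheme, which is also an assumption, since otherwise "http" itself would always contribute letters.

```python
import re

# Hypothetical thresholds and weights; the text does not state the actual values.
MAX_DOTS = 4
MAX_SLASHES = 5
MAX_HEX_ESCAPES = 3
SPECIAL_CHARS = "@"  # assumed; the character is not legible in the source text
WEIGHTS = {"domain_levels": 10, "ip_form": 20, "special_chars": 15, "path_levels": 10}


def normalize(url: str) -> str:
    """Step b11: bring the URL into a form starting with http://."""
    url = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url.strip())
    return "http://" + url


def url_weights(url: str) -> dict:
    """Steps b12 to b15: return the weight added for each suspicious point (0 if not hit)."""
    body = normalize(url)[len("http://"):]
    dots = body.count(".")
    letters = sum(c.isascii() and c.isalpha() for c in body)
    slashes = body.count("/")
    hex_escapes = len(re.findall(r"%[0-9A-Fa-f]{2}", body))

    hits = {name: 0 for name in WEIGHTS}  # weight record, initialized to 0
    if dots > MAX_DOTS:                   # b12: too many domain name levels
        hits["domain_levels"] = WEIGHTS["domain_levels"]
    if letters == 0 and dots == 3:        # b13: IP form such as 192.168.0.1
        hits["ip_form"] = WEIGHTS["ip_form"]
    if any(c in body for c in SPECIAL_CHARS) or hex_escapes > MAX_HEX_ESCAPES:
        hits["special_chars"] = WEIGHTS["special_chars"]  # b14
    if slashes > MAX_SLASHES:             # b15: too many path levels
        hits["path_levels"] = WEIGHTS["path_levels"]
    return hits
```

Calling `url_weights("192.168.0.1")` marks only the `ip_form` point, while a plain URL such as `http://example.com/` leaves every entry at 0.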
[0073] The website behavior and detailed feature identification includes form Action check, response analysis after submitting the form, HTML specification check, cookie setting check, and script ratio check.
[0074] On a phishing website, after a user name and password are entered at random, the website cannot know whether they are real, and makes almost the same response to every user. More than 90% of phishing websites redirect users to the regular website after obtaining their user names and passwords in order to hide themselves; others return a login-success response in order to continue with subsequent fraudulent content. Phishing websites behave this way because they have no database that can be queried for verification and only record user names and passwords; this is the most essential difference between them and regular websites in handling user submissions.
[0075] Form Action check: a form is used to collect different types of user input, and when the user clicks the confirmation button, the content of the form is transmitted to another file. The form's action attribute defines the filename of the destination file (for example "html_form_action.asp"), and the file defined by the action attribute usually processes the received input data. A phishing website often submits the login content to the official website through its form, whereas an official website rarely submits its form to another domain name, so a form whose action points to a different domain can be regarded as a suspicious feature.
[0076] Response analysis after submitting the form: after the form is submitted, a regular website queries the database and compares the user name and password, while a phishing website often takes some fixed action, such as redirecting the user to the regular website, to improve its concealment and make it difficult for users to detect. The suspicious feature is a redirect to a domain that does not belong to the original domain.
[0077] The HTML standardization check examines whether the HTML code of the website is standard. A legal, regular website should comply with current standards as far as possible, while the code of a phishing website is often written more casually and is less standardized than that of a regular website. Therefore, if a website's HTML code is found to be irregular, its suspiciousness is increased.
[0078] Cookie setting check: a cookie is data (usually encrypted) that some websites store on the user's local terminal in order to identify the user and track the session. Phishing websites generally do not need the functions that cookies provide. They are built only to extract the user's account information and similar content, and may not even want the user to visit again, since that would increase the risk of being discovered and reported.
[0079] The script ratio check uses statistics to set a threshold; if the ratio of the script length to the total page length exceeds this threshold, the page is considered suspicious.
[0080] The identification process for website behavior and detailed features is:
[0081] Step b21, import the URL to be detected, then process the URL to extract the domain name and path, perform a DNS query, and establish a connection with the target;
[0082] Step b22, send an HTTP GET request according to the extracted path to obtain the page source code for analysis. The GET request is constructed to imitate a request from the IE browser;
[0083] Step b23, analyze the received response in the following steps:
[0084] (1) Check whether a cookie is set in the message header; if not, add the corresponding weight to the global variable Weight_Sum.
[0085] (2) For all script content in the response (the content between the opening and closing script tags), compute its length and divide it by the total page length to obtain the proportion of scripts, then compare this with the lower threshold; if it is greater than the threshold, add the corresponding weight to Weight_Sum. Our team collected a large number of statistics on the proportion of scripts in regular websites and in phishing websites, and determined the lower threshold to be 0.60;
[0086] (3) Check whether the HTML code is standardized: locate the form's opening tag and its corresponding closing tag, then judge whether the case of the attributes in the tag conforms to the specification, whether the action target is enclosed in double quotes, and so on. Each time a suspicious feature is met, the coefficient by which the corresponding weight is multiplied is increased by 1.
[0087] (4) Check whether the target of the action attribute in the tag is under the same domain name as the page; if not, add the corresponding weight;
[0088] (5) When the action target is under the same domain name, analyze the GET response to extract the parameters, submit the form, and analyze the response; if there is a Location field in the message header, check whether its address is under the same domain name, and if not, add the corresponding weight.
[0089] (6) Output the result to the temp_result.dat file in the agreed form, which is convenient for calling when summarizing weights.
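Checks (1), (2), and (4) of step b23 can be sketched with Python's standard `html.parser`. This is an illustration under stated assumptions, not the patented program: `PageScanner` and `analyse_page` are hypothetical names, the cookie check simply looks for a `Set-Cookie` response header, and "same domain" is simplified to an exact host name match. Only the 0.60 script ratio threshold comes from the text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

SCRIPT_RATIO_THRESHOLD = 0.60  # lower threshold stated in the text


class PageScanner(HTMLParser):
    """Collects the total length of script text and the action targets of forms."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chars = 0
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif tag == "form":
            # A missing or empty action submits back to the page itself.
            self.form_actions.append(dict(attrs).get("action") or "")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chars += len(data)


def analyse_page(page_url: str, headers: dict, html: str) -> dict:
    """Return the raw findings for the cookie, script ratio, and form action checks."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    host = urlparse(page_url).hostname
    ratio = scanner.script_chars / len(html) if html else 0.0
    cross_domain = [
        action for action in scanner.form_actions
        if urlparse(urljoin(page_url, action)).hostname != host
    ]
    return {
        "no_cookie": not any(name.lower() == "set-cookie" for name in headers),
        "script_ratio": ratio,
        "script_heavy": ratio > SCRIPT_RATIO_THRESHOLD,
        "cross_domain_actions": cross_domain,
    }
```

The caller would then add the corresponding weight to the running total for each finding that is set; relative action targets such as `/login` are resolved against the page URL before the host names are compared.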
[0090] Identification from the server perspective includes checking the number of corresponding IPs under the domain name, checking the geographic location of the IP address, and checking the Whois information.
[0091] The number of visits to a regular website is very different from that of a phishing website, so there may be differences in server technology between the two. According to statistics, more than 90% of phishing websites are hosted overseas to escape domestic legal sanctions. In addition, if a user visits a domestic bank but the domain name resolves overseas, this is also very suspicious. Therefore, the geographical location of the IP can also be used to infer whether a website is a phishing website. Some researchers have also noted that phishing websites have a short lifespan, which is reflected in the whois information of the website's domain name.
[0092] Check of the number of IPs under the domain name: access to large websites is sometimes mapped to different IPs, because domain names with high traffic use load balancing technology. DNS load balancing configures multiple IP addresses for the same host name in the DNS server. When answering DNS queries, the DNS server returns different resolution results in sequence for each query, based on the IP addresses of the host records in the DNS file, directing different clients to different servers and thereby balancing the load. A simple phishing website, however, usually has very few visits, and its creator will not spend money on this technology, so the number of IPs can be used as a feature for judging phishing websites.
[0093] The geographical location check of the IP address determines whether the IP address is abnormal. For domestic users, the geographical location of the IP they want to visit can be checked to see whether it is in the country and whether it is in one of the most suspicious regions mentioned above.
[0094] Whois information check: according to statistics, the average survival time of a phishing website is less than one day, the domain names it uses are often relatively cheap, and the domain names are not used for long. Major regular websites, by contrast, are longer established and were registered earlier, so the difference between their expiration time and registration time is larger. According to our team's tests, the difference is more than 3 years for most regular websites and less than 3 years for most phishing websites. Therefore, it can be used as a suspicious point for detecting whether a site is a phishing website.
[0095] The process of the server angle identification is:
[0096] Step b31, process the incoming URL, extract the main domain name, perform a DNS query, and use the gethostbyname function in winsock to count the length of the h_addr_list list in the returned hostent structure. If it is greater than 1, there is more than one IP under the domain name and no weight is added; otherwise, the corresponding weight is added.
[0097] Step b32, submit the returned IP to http://www.ip138.com/ for query. According to the statistical results of the Anti-Phishing Alliance, for domestic users, if the target is in a country that hosts many phishing websites, such as the United States, the corresponding weight is added.
[0098] Step b33, submit the normalized domain name to http://whois.chinaz.com/ for query, and extract the difference between the website's expiration time and registration time from the response. If it is less than the specified value, the corresponding weight is added; otherwise no weight is added. Based on the statistics above, the specified value is provisionally set at 3 years.
[0099] Step b34, the result is again output to temp_result.dat in the agreed format.
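Steps b31 and b33 reduce to two small decisions once the lookups have been made. The sketch below uses Python's `socket.gethostbyname_ex` as a stand-in for the winsock `gethostbyname` call named in the text; the function names and the weights passed in are hypothetical, and the whois dates are assumed to have been parsed already, since the response format of the query site is not described.

```python
import socket
from datetime import date

MIN_REGISTRATION_YEARS = 3  # designated value stated in the text


def resolve_ips(host: str) -> list:
    """Return the IPv4 addresses the resolver reports for host (requires network access)."""
    _name, _aliases, addresses = socket.gethostbyname_ex(host)
    return addresses


def single_ip_weight(addresses: list, weight: float) -> float:
    """Step b31: no weight if the domain maps to more than one IP, otherwise add weight."""
    return 0.0 if len(addresses) > 1 else weight


def short_registration_weight(registered: date, expires: date, weight: float) -> float:
    """Step b33: add weight when expiry minus registration is below the designated value."""
    span_years = (expires - registered).days / 365.25
    return weight if span_years < MIN_REGISTRATION_YEARS else 0.0
```

A typical call chain would be `single_ip_weight(resolve_ips(host), w)`; keeping the network lookup separate from the decision makes the decision easy to check with fixed inputs.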
[0100] The crawler angle identification includes the detection of the number of outbound links on the webpage and the number and type of webpage files.
[0101] A web crawler (spider) searches for web pages through link addresses. It starts from a certain page of the website (usually the home page), reads the content of the page, finds the other link addresses in it, and then uses them to find the next page. This continues until all the pages of the website have been crawled. Using the crawler method, the structure, scale, and importance of the website can be analyzed.
[0102] Detection of the number of outbound links on a web page: the Internet is in essence a set of Web graphs formed by hyperlinks. When processing with a web crawler, the Web graph formed by these hyperlinks has to be held in memory. For a web page in a Web graph, the links it points to are called the "out-degree" of the page. The outbound links of the website are called its "first-level outbound links", or "first-level out-degree" for short; the outbound links of the pages reached through the first-level outbound links are called the "second-level outbound links" of the original website, or "second-level out-degree" for short. The importance of a web page is obtained by combining its two levels of out-degree links, which improves the relevance and quality of search results. The judging principle is as follows: on a regular website, the first level of a web page has many out-degree links, and the second level also has a relatively large number. Phishing websites, however, are operated by individuals or small teams and have little connection with other websites, so they rarely form a large network structure, and the number of outbound links at both levels is small.
[0103] The steps for detecting the number of outbound links of the web page are as follows: after the URL is imported, first crawl the web page to be tested with a crawler, obtain the first-level outbound links under the same parent domain name, return the results found to the graphical interface for the user to view, and record the number of links.
[0104] Since there may be many links at the first level of the website, it is not practical, considering time and efficiency, to crawl each link in turn. To solve this problem, when second-level links are selected for testing, the program randomly samples 5 of the first-level links; if there are fewer than 5 links, all of them are selected. When crawling the second-level links, only paths under the parent domain name of the original web page are searched, and their total out-degree is recorded. To avoid processing an excessive number of outbound links, a maximum threshold is set: if the number exceeds it, the website is considered not suspicious and the check exits. According to experimental statistics, the total out-degree over the two levels of a phishing website generally does not exceed 500, so the threshold is tentatively set at 500.
[0105] Finally, the results are documented in an agreed form, so that subsequent calls can be evaluated.
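The two-level out-degree count can be sketched over an in-memory link graph. In this illustration the `graph` dictionary (page URL mapped to the list of URLs it links to) stands in for a real crawl, which is an assumption made to keep the example self-contained; the function names are hypothetical. The sample size of 5 and the threshold of 500 are the values given in the text.

```python
import random
from urllib.parse import urlparse

SAMPLE_SIZE = 5       # number of first-level links sampled, per the text
MAX_OUT_DEGREE = 500  # tentative threshold stated in the text


def same_parent(url: str, parent_domain: str) -> bool:
    """True if url's host is the parent domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == parent_domain or host.endswith("." + parent_domain)


def two_level_out_degree(start: str, graph: dict, parent_domain: str, rng=None):
    """Return (total out-degree over two levels, exceeded flag).

    exceeded is True when the total passes MAX_OUT_DEGREE, in which case the
    site is treated as not suspicious and counting stops early.
    """
    rng = rng or random.Random()
    first = [u for u in graph.get(start, []) if same_parent(u, parent_domain)]
    total = len(first)
    if total > MAX_OUT_DEGREE:
        return total, True
    # Sample 5 first-level links at random; use all of them if there are fewer.
    sampled = first if len(first) <= SAMPLE_SIZE else rng.sample(first, SAMPLE_SIZE)
    for link in sampled:
        total += sum(1 for u in graph.get(link, []) if same_parent(u, parent_domain))
        if total > MAX_OUT_DEGREE:
            return total, True
    return total, False
```

Passing a seeded `random.Random` as `rng` makes the sampling repeatable, which is convenient when comparing runs on the same site.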
[0106] Detection of the number and type of web page files: the crawler is used to obtain the web page files, so that the number of files of each type can be counted. Web page files mainly include the following types: static web page text (html, htm), dynamic page files (shtml), and server script files (asp, php, etc.). They show the composition and structural layout of the corresponding website. The greater their number and variety, the deeper the website's hierarchy, the stronger the connection between the front-end server and the back-end library files, the higher the integrity of the website, and the more important the website is.
[0107] Regular websites are carefully built, with distinct file levels and a complete structure, so they have a large number of web page files and a complete set of types (divided by function). Phishing websites generally imitate regular websites with a loose overall structure and little connection with other websites, so the types and numbers of their web page texts and server scripts are very small. Generally there are only administrator logs that record login information and a few background files that support the website, such as php and asp files; this can be compared directly through the crawler's detection results.
[0108] However, some phishing website administrators imitate the structural characteristics of regular websites and add files on the server side. In that case, checking only this page would show many files and types and a reasonable structure, which lets the fake pass for the genuine. To address this, crawlers are used to obtain the web page files of the second-level outbound links: only the paths under the parent domain name are searched, each is then accessed separately, and the types and numbers of the corresponding files are recorded. Because the outbound links created at the second level of a phishing website are of low importance, the website structure there is simple and there are few web page files. Therefore, a comprehensive analysis of the two levels of crawling results can determine how suspicious the website is.
[0109] Method steps for detecting the number and type of webpage files:
[0110] First, use a crawler to crawl the web page to be tested, check the files under the web page in turn, and judge whether each ends with one of the five types html, htm, shtml, asp, and php; if so, record the file name of the first-level link. Then check whether its URL is under the original parent domain name; if so, crawl the second-level files from the link, find the files of the corresponding types and record their number, and return the output in turn to form a tree diagram in the graphical interface for the user to view. Considering time and efficiency, links are selected in the same way as in the out-degree detection above, except that the upper threshold on the total number over the two levels is changed to 200.
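Counting files by type is the core of this step and can be sketched as below. The function name is hypothetical, and the list of URLs is assumed to come from a crawler that is not shown; the five extensions are the ones named in the description (html, htm, shtml, asp, php). The extension is taken from the URL path only, so query strings do not interfere.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse

PAGE_TYPES = ("html", "htm", "shtml", "asp", "php")  # types named in the description


def count_page_files(urls) -> Counter:
    """Count crawled URLs whose path ends with one of the recognized page types."""
    counts = Counter()
    for url in urls:
        suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
        if suffix in PAGE_TYPES:
            counts[suffix] += 1
    return counts
```

Running it once on the first-level URLs and once on the second-level URLs gives the two sets of counts that the text proposes to compare.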
[0111] Since what this part obtains is a count, the weight evaluation uses partition judgment. First, the count range is divided into intervals by a statistical method; the corresponding suspiciousness S (S ∈ [0, 1]) is assigned according to the interval in which the result falls, and S is then multiplied by the overall coefficient K. The total suspiciousness weight is $N = S \cdot K$. Finally, the total weight is output to the file in the agreed format so that it can be read for feedback later.
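The partition judgment N = S · K amounts to an interval lookup followed by a multiplication. In the sketch below the interval boundaries and the suspiciousness values are invented placeholders, because the text says only that they are derived statistically; `crawler_weight` is a hypothetical name, and K is passed in by the caller.

```python
import bisect

# Hypothetical intervals and suspiciousness values; the text derives the real
# ones statistically and does not list them.
BOUNDARIES = [10, 50, 200]             # intervals: <10, 10-49, 50-199, >=200
SUSPICIOUSNESS = [1.0, 0.6, 0.3, 0.0]  # S for each interval, S in [0, 1]


def crawler_weight(count: int, k: float) -> float:
    """Look up S for the interval containing count and return N = S * K."""
    s = SUSPICIOUSNESS[bisect.bisect_right(BOUNDARIES, count)]
    return s * k
```

With these placeholder values, a site with only 5 links or files receives the full coefficient K, and a site with 200 or more receives no weight from this check.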
[0112] The above descriptions are only preferred embodiments of the present invention and are illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications, and even equivalents can be made within the spirit and scope defined by the claims of the invention, and that all of them fall within the protection scope of the present invention.