Webpage risk detection method and related device
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING HONGTENG INTELLIGENT TECH CO LTD
- Filing Date
- 2024-12-19
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies suffer from high manual review costs, high false positive rates, and poor real-time performance when detecting unknown risky websites, making it difficult to effectively deal with the continuous emergence of new malicious websites.
By crawling the webpage to be detected, key features are extracted, including target attributes, target hash, target embedding vector, and target keywords. A comprehensive evaluation is then conducted using multi-dimensional feature analysis methods, including hash matching, embedding vector comparison, and keyword recognition. This approach considers multiple aspects of the webpage and avoids misjudgments caused by a single feature.
It improves the accuracy and efficiency of risk detection, reduces false positives and false negatives, enhances the system's flexibility and robustness, and can identify malicious websites that have been modified or disguised.
Abstract
Description
Technical Field
[0001] This application relates to the fields of computer and communication technology, and more specifically, to a method and related equipment for detecting web page risks.
Background Technology
[0002] With the rapid development of the internet, the number of malicious and fraudulent websites has increased dramatically, posing a serious threat to the cybersecurity of users and businesses. Currently, some mature technologies exist for quickly detecting malicious websites, primarily relying on blacklists / whitelists, rule-based matching, and supervised learning-based machine learning and deep learning. These methods perform well when handling known samples, but their ability to detect unknown risky websites is limited, and in practice, a large number of malicious websites are missed. Specifically, blacklists / whitelists require constant updates, and the update speed often lags behind the emergence of new malicious websites; rule-based matching has limited effectiveness against complex and constantly evolving malicious tactics, and the maintenance cost of the rules is high; supervised learning-based machine learning and deep learning mainly rely on known samples, achieving automatic detection by summarizing and learning the complex feature relationships of known samples. For such models to achieve good generalization ability, a large amount of manually labeled data is required. Furthermore, malicious website creators constantly change their attack methods, causing the model's training data to quickly become outdated, posing a significant challenge to sample labelers.
[0003] Currently, the common practice for detecting unknown risky websites is to use data mining techniques to extract features from a large number of websites, perform cluster analysis to identify potentially risky websites, and then manually verify whether they are malicious. However, this method suffers from high manual review costs, high false positive rates, and poor real-time performance, making it difficult to effectively deal with the constantly emerging new malicious websites. Therefore, how to effectively improve the detection recall rate of unknown risky websites while relying less on extensive manual review has become an urgent problem to be solved.
Summary of the Invention
[0004] The embodiments of this application provide a webpage risk detection method and related equipment, which can at least to some extent overcome the problems of high manual review costs, high false alarm rates, and poor real-time performance of existing technologies for detecting unknown risk websites.
[0005] Other features and advantages of this application will become apparent from the following detailed description, or may be learned in part from practice of this application.
[0006] According to one aspect of the embodiments of this application, a webpage risk detection method is provided, comprising: crawling a webpage to be detected; extracting key features from the webpage to be detected, the key features including target attributes, target hashes, target embedding vectors, and target keywords; and determining a final risk detection result based on the key features of the webpage to be detected.
[0007] In some embodiments of this application, the step of crawling the webpage to be detected and extracting key features from the webpage to be detected specifically includes: crawling the webpage to be detected; performing feature extraction on the webpage to be detected to obtain corresponding target attributes; performing fuzzy hash calculation on the webpage to be detected to obtain corresponding target hash; performing natural language processing on the webpage to be detected to obtain corresponding target embedding vectors; and performing word segmentation on the webpage to be detected to obtain corresponding target keywords.
[0008] In some embodiments of this application, the step of performing fuzzy hash calculation on the webpage to be detected to obtain the corresponding target hash specifically includes: extracting the target webpage content of the webpage to be detected; and performing fuzzy hash calculation on the target webpage content to obtain the corresponding target hash.
[0009] In some embodiments of this application, the step of performing natural language processing on the webpage to be detected to obtain the corresponding target embedding vector specifically includes: extracting the webpage text of the webpage to be detected; performing natural language processing on the webpage text to generate the corresponding target embedding vector.
[0010] In some embodiments of this application, the step of performing word segmentation on the webpage to be detected to obtain corresponding target keywords specifically includes: extracting the webpage text of the webpage to be detected; performing word segmentation on the webpage text to generate corresponding target keywords.
[0011] In some embodiments of this application, determining the final risk detection result based on the key features of the webpage to be detected includes: determining a first risk detection result based on the target attributes of the webpage to be detected; determining a second risk detection result based on the target hash of the webpage to be detected; determining a third risk detection result based on the target embedding vector of the webpage to be detected; determining a fourth risk detection result based on the target keywords of the webpage to be detected; and determining the final risk detection result based on the first detection result, the second detection result, the third detection result, and the fourth detection result.
[0012] In some embodiments of this application, determining the first risk detection result based on the target attributes of the webpage to be detected specifically includes: determining a combination of target attributes from the target attributes of the webpage to be detected; and comparing the hash value of the combination of target attributes to determine the first risk detection result.
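As an illustrative sketch only, the attribute-combination comparison described above could look like the following. The source does not state which hash function is used or what the hash is compared against; SHA-256, the set `known_risky_hashes`, and the function names are assumptions introduced here for illustration.

```python
import hashlib


def attribute_combination_hash(attributes: dict, keys: tuple) -> str:
    """Hash a chosen combination of target attributes into one hex digest."""
    # Normalise and join the selected values in a fixed order so that the
    # same combination always yields the same digest.
    parts = [f"{k}={str(attributes.get(k, '')).strip().lower()}" for k in keys]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def first_risk_result(attributes: dict, keys: tuple, known_risky_hashes: set) -> bool:
    """Return True when the combination hash matches a known risky hash.

    `known_risky_hashes` is a hypothetical store of hashes computed the
    same way from previously confirmed risky webpages.
    """
    return attribute_combination_hash(attributes, keys) in known_risky_hashes
```

Hashing the combination rather than each attribute separately means a match requires all selected attributes to coincide, which is one plausible reading of the "combination of target attributes".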
[0013] In some embodiments of this application, determining the final risk detection result based on the first detection result, the second detection result, the third detection result, and the fourth detection result specifically includes: determining a risk proportion, where the risk proportion is the proportion of the first detection result, the second detection result, the third detection result, and the fourth detection result that determine the webpage to be detected to be risky; and determining the final risk detection result based on the risk proportion.
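A minimal sketch of this combination step, assuming each of the four detection results is a boolean and that the final decision compares the proportion against a threshold. The source does not specify a threshold; the default of 0.5 and the function name are assumptions.

```python
def final_risk_result(results: list, threshold: float = 0.5) -> tuple:
    """Combine individual detection results into (proportion, is_risky).

    `results` holds one boolean per detector, True meaning "risky".
    `threshold` is a hypothetical cut-off, not taken from the source.
    """
    if not results:
        raise ValueError("at least one detection result is required")
    # Proportion of detectors that flagged the webpage as risky.
    proportion = sum(bool(r) for r in results) / len(results)
    return proportion, proportion >= threshold
```

For example, `final_risk_result([True, False, True, False])` gives a proportion of 0.5, which meets the assumed threshold.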
[0014] According to one aspect of the embodiments of this application, a webpage risk detection device is provided, the webpage risk detection device comprising: a webpage crawling module, configured to crawl a webpage to be detected and extract key features from the webpage to be detected, the key features including target attributes, target hashes, target embedding vectors, and target keywords; and a risk detection module, configured to determine a final risk detection result based on the key features of the webpage to be detected.
[0015] In some embodiments of this application, the web page crawling module specifically includes: a web page crawling submodule for crawling web pages to be detected; a target attribute submodule for extracting features from the web pages to be detected to obtain corresponding target attributes; and a fuzzy hashing submodule for performing fuzzy hash calculations on the web pages to be detected to obtain corresponding target hashes.
[0016] The natural language submodule is used to perform natural language processing on the webpage to be detected to obtain the corresponding target embedding vector; the word segmentation submodule is used to perform word segmentation processing on the webpage to be detected to obtain the corresponding target keywords.
[0017] In some embodiments of this application, the fuzzy hashing submodule specifically includes: a content extraction unit, used to extract the target webpage content of the webpage to be detected; and a fuzzy hashing unit, used to perform fuzzy hash calculation on the target webpage content to obtain the corresponding target hash.
[0018] In some embodiments of this application, the natural language submodule specifically includes: a text extraction unit for extracting the webpage text of the webpage to be detected; and a vector embedding unit for performing natural language processing on the webpage text to generate a corresponding target embedding vector.
[0019] In some embodiments of this application, the word segmentation submodule specifically includes: a text extraction unit, used to extract the webpage text of the webpage to be detected; and a word segmentation processing unit, used to perform word segmentation processing on the webpage text to generate corresponding target keywords.
[0020] In some embodiments of this application, the risk detection module includes: a first risk detection submodule, configured to determine a first risk detection result based on the target attributes of the webpage to be detected; a second risk detection submodule, configured to determine a second risk detection result based on the target hash of the webpage to be detected; a third risk detection submodule, configured to determine a third risk detection result based on the target embedding vector of the webpage to be detected; a fourth risk detection submodule, configured to determine a fourth risk detection result based on the target keywords of the webpage to be detected; and a final risk detection submodule, configured to determine a final risk detection result based on the first detection result, the second detection result, the third detection result, and the fourth detection result.
[0021] In some embodiments of this application, the first risk detection submodule specifically includes: an attribute combination unit, used to determine a target attribute combination among the target attributes of the webpage to be detected; and an attribute hash unit, used to compare the hash value of the target attribute combination to determine the first risk detection result.
[0022] In some embodiments of this application, the final risk detection submodule specifically includes: a risk proportion unit, used to determine the risk proportion, wherein the risk proportion is the proportion of the first detection result, the second detection result, the third detection result, and the fourth detection result that determine the webpage to be detected to be risky; and a final result unit, used to determine the final risk detection result based on the risk proportion.
[0023] According to one aspect of the embodiments of this application, a computer-readable medium is provided having a computer program stored thereon, which, when executed by a processor, implements the webpage risk detection method as described in the above embodiments.
[0024] According to one aspect of the embodiments of this application, an electronic device is provided, including: one or more processors; and a storage device for storing one or more programs, which, when executed by the one or more processors, cause the one or more processors to implement the webpage risk detection method as described in the above embodiments.
[0025] A computer program product includes one or more computer programs, characterized in that, when the one or more computer programs are executed by one or more processors, they implement the steps of the webpage risk detection method as described in the above embodiments.
[0026] In some embodiments of this application, the technical solutions comprehensively extract target attributes (such as domain name and certificate), target hash, target embedding vector, and target keywords of web pages. By combining multi-dimensional features (such as hash, embedding vector, and keywords) and integrating multiple feature analysis methods (including hash matching, embedding vector comparison, and keyword recognition), a comprehensive evaluation is conducted, taking into account multiple aspects of the web page. This avoids relying on only one feature, preventing misjudgments caused by a single feature, ensuring that risk detection covers all aspects of the website, enhancing the comprehensiveness of detection, thereby improving the accuracy of risk detection and reducing false positives and false negatives. Simultaneously, through the extraction and comparison of key features, this application allows the system to quickly filter out potentially risky websites from a large number of web pages, improving the efficiency of web page detection. Furthermore, the introduction of embedding vectors can handle subtle changes in web page content, effectively identifying modified or disguised malicious websites, enhancing the system's flexibility and robustness.
[0027] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.
Attached Figure Description
[0028] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application. It is obvious that the drawings described below are merely some embodiments of this application, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort. In the drawings:
[0029] Figure 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of this application can be applied.
[0030] Figure 2 shows a flowchart of a webpage risk detection method provided in an embodiment of this application.
[0031] Figure 3 shows a flowchart of a specific implementation of step S100 in the webpage risk detection method of the embodiment corresponding to Figure 2.
[0032] Figure 4 shows a flowchart of a specific implementation of step S200 in the webpage risk detection method of the embodiment corresponding to Figure 2.
[0033] Figure 5 shows a schematic diagram of the structure of a webpage risk detection device provided in an embodiment of this application.
[0034] Figure 6 shows a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0035] Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this application more comprehensive and complete, and to fully convey the concept of the exemplary embodiments to those skilled in the art.
[0036] Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Numerous specific details are provided in the following description to give a thorough understanding of embodiments of this application. However, those skilled in the art will recognize that the technical solutions of this application can be practiced without one or more of the specific details, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring various aspects of this application.
[0037] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.
[0038] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily have to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.
[0039] Figure 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of this application can be applied.
[0040] As shown in Figure 1, the system architecture may include terminal devices (such as one or more of the smartphone 101, tablet 102, and portable computer 103 shown in Figure 1, or a desktop computer, etc.), a network 104, and a server 105. The network 104 serves as a medium for providing a communication link between the terminal devices and the server 105. The network 104 can include various connection types, such as wired communication links, wireless communication links, etc.
[0041] It should be understood that the number of terminal devices, networks, and servers shown in Figure 1 is merely illustrative. Depending on implementation needs, there can be any number of terminal devices, networks, and servers. For example, server 105 could be a server cluster composed of multiple servers.
[0042] Users can interact with server 105 via network 104 using terminal devices to receive or send messages, etc. Server 105 can be a server providing various services. For example, a user can upload a webpage to be detected to server 105 using terminal device 103 (or terminal device 101 or 102). Server 105 can extract key features from the webpage to be detected, including target attributes, target hashes, target embedding vectors, and target keywords; based on the key features of the webpage to be detected, the final risk detection result is determined.
[0043] It should be noted that the webpage risk detection method provided in this application embodiment is generally executed by server 105, and correspondingly, the webpage risk detection device is generally set in server 105. However, in other embodiments of this application, the terminal device may also have similar functions to the server, thereby executing the webpage risk detection scheme provided in this application embodiment.
[0044] The implementation details of the technical solutions in the embodiments of this application are described in detail below:
[0045] Figure 2 shows a flowchart of a webpage risk detection method according to an embodiment of this application. This webpage risk detection method can be executed by a server, which may be the server shown in Figure 1. As shown in Figure 2, the webpage risk detection method includes at least the following steps:
[0046] S100, crawl the webpage to be detected, and extract the key features in the webpage to be detected. The key features include target attributes, target hash, target embedding vector and target keywords.
[0047] S200, based on the key features of the webpage to be detected, determine the final risk detection result.
[0048] In the embodiments of this application, by comprehensively extracting the target attributes (such as domain name, certificate), target hash, target embedding vector, and target keywords of the webpage, and combining multi-dimensional features (such as hash, embedding vector, keywords, etc.), multiple feature analysis methods (including hash matching, embedding vector comparison, keyword recognition, etc.) are integrated for comprehensive evaluation. This approach comprehensively considers multiple aspects of the webpage, avoiding reliance on a single feature and preventing misjudgments caused by a single feature. It ensures that risk detection covers all aspects of the website, enhances the comprehensiveness of detection, and thus improves the accuracy of risk detection, reducing false positives and false negatives. Simultaneously, by extracting and comparing key features, this application can quickly filter out potentially risky websites from a large number of webpages, improving the efficiency of webpage detection. Furthermore, the introduction of embedding vectors can handle subtle changes in webpage content, effectively identifying malicious websites that have been modified or disguised, enhancing flexibility and robustness.
[0049] In S100, web page crawling can be achieved by using automated tools or web crawling technologies (such as Selenium, Scrapy, etc.) to crawl the web pages to be detected from the Internet, so as to obtain the content of the web pages to be detected in real time.
[0050] Key feature extraction includes extracting target attributes, target hashes, target embedding vectors, and target keywords. Target attributes (such as domain name, IP address, SSL certificate information, etc.) help identify the legitimacy of a website from its basic information, especially domain name or certificate information, which can help identify whether a website is known to be malicious. Target hashes, by extracting the hash value of webpage content or embedded resources, allow for content comparison and identify webpages similar to known risky websites. Hash value comparison can quickly identify risky webpages, playing a crucial role in similarity detection. Target embedding vectors utilize natural language processing techniques (such as Word2Vec, BERT, etc.) to convert webpage content into vectors, capturing the semantic information of the webpage content and identifying webpages similar to known risky websites at the semantic level. This is very effective for handling malicious websites with minor variations in webpage content. Target keywords help identify the core information of the website content at the text level, further aiding in the classification and judgment of whether the website contains malicious behavior or content. This feature helps quickly locate the main topics and content of a website, playing a particularly important role in identifying fraudulent websites.
[0051] Specifically, in some embodiments, for the specific implementation of step S100, refer to Figure 3. Figure 3 is a detailed description of step S100 in the webpage risk detection method of the embodiment corresponding to Figure 2. Step S100 in the webpage risk detection method may include the following steps:
[0052] S110, crawl the webpage to be detected.
[0053] S120, perform feature extraction on the webpage to be detected to obtain the corresponding target attributes.
[0054] S130, perform fuzzy hash calculation on the webpage to be detected to obtain the corresponding target hash.
[0055] S140, Perform natural language processing on the webpage to be detected to obtain the corresponding target embedding vector.
[0056] S150, perform word segmentation on the webpage to be detected to obtain the corresponding target keywords.
[0057] In this embodiment, by analyzing webpage content from multiple perspectives, multi-dimensional features including basic webpage attributes, hash values, embedding vectors, and keywords are extracted, ensuring a comprehensive evaluation of the webpage. This makes risk detection no longer limited to a single feature, but rather comprehensively considers the potential risks of a webpage through multiple methods, increasing accuracy and robustness.
[0058] Among these methods, fuzzy hashing can identify similar web pages, accurately identifying potential risks even if the content is altered or disguised. The use of natural language processing and target embedding vectors allows for a deeper understanding of the web page's semantics, rather than relying solely on surface features. This enables the method to identify potentially dangerous web pages, even those that have changed in appearance or structure. Keyword extraction, by analyzing the core content of the web page, further improves the accuracy of the judgment, playing a crucial role, especially in identifying malicious websites such as those used for fraud and phishing.
[0059] This embodiment improves overall detection efficiency by automatically extracting and calculating features such as fuzzy hashes, embedding vectors, and keywords, enabling rapid processing of large amounts of webpage information. Especially when comparing the similarity of webpage content with known risky websites, automated hash value and embedding vector calculations significantly shorten detection time. This embodiment also adapts to changes in webpage content; in particular, through embedding vectors and fuzzy hashes, it effectively handles dynamic changes to webpages (such as page content adjustments and image encryption), ensuring that even subtle differences can be detected as risky webpages.
[0060] In S110, the webpage to be detected is crawled from the network to ensure that the target webpage can be obtained from the Internet as the basis for subsequent processing.
[0061] In S120, attributes representing important information about the webpage (such as URL, domain name, IP address, etc.) are extracted from the webpage. This information helps determine the legitimacy and credibility of the webpage, for example, whether the webpage's SSL certificate information is valid and whether the domain name is a known, legitimate domain name.
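As a hedged illustration, some of the basic attributes named above can be derived from the URL alone using the Python standard library. The function name and the returned field names are assumptions; certificate validation and DNS lookups are deliberately left out because they require network access.

```python
import ipaddress
from urllib.parse import urlsplit


def basic_url_attributes(url: str) -> dict:
    """Derive simple target attributes from a URL string (no network access)."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is returned in lower case
    try:
        # A host that parses as an IP address has no domain name at all,
        # which is itself a useful attribute.
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url": url,
        "domain": host,
        "host_is_ip": host_is_ip,
        "uses_https": parts.scheme == "https",
    }
```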
[0062] In this step, through statistical analysis of numerous attribute features of massive amounts of web pages, and employing scientific feature selection methods, including algorithms such as information gain and chi-square, a number of target attributes representing important characteristics of the web pages were ultimately identified. These target attributes include domain names and IP addresses, which are important identifiers of a website's identity. For example, extracting domain aliases (CNAME records), which may point to the same server, helps in identifying associated malicious websites; extracting SSL / TLS certificate information, including the certificate authority, validity period, and certificate chain, can be used to verify the website's identity; and extracting the website's title and keywords, which reflect the website's main content and theme.
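To make the feature selection step concrete, the following sketch computes the information gain of a single candidate attribute with respect to risk labels. It is a textbook formulation under stated assumptions, not the applicant's procedure: the attribute is treated as categorical, labels are any hashable values, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels: list) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values: list, labels: list) -> float:
    """Entropy reduction obtained by splitting the labels on one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each attribute value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

Attributes with higher information gain separate risky from benign webpages more cleanly and would be the stronger candidates for target attributes.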
[0063] In step S130, the hash value of the webpage is calculated using fuzzy hashing technology. Fuzzy hashing is effective in identifying the similarity of webpage content, and it plays a crucial role, especially in detecting disguised, altered, or encrypted webpages. Even if the webpage content has changed, fuzzy hashing can still identify similarities with known risky webpages, thereby improving the accuracy of detection.
[0064] Specifically, in some embodiments, the specific implementation of step S130 can be found in the following embodiment. This embodiment is a detailed description of step S130 in the webpage risk detection method of the embodiment corresponding to Figure 3. Step S130 in the webpage risk detection method may include the following steps:
[0065] Extract the target webpage content of the webpage to be detected.
[0066] Perform a fuzzy hash calculation on the content of the target webpage to obtain the corresponding target hash.
[0067] This embodiment employs SimHash technology to perform fuzzy hash calculations on webpage content. SimHash can tolerate minor local variations, grouping similar but not identical webpages together. Key content of the webpage is extracted, including the title, keywords, and body text. This content is then preprocessed, such as removing HTML tags, word segmentation, and stop word removal. A SimHash value is calculated on the preprocessed content, generating a fixed-length hash value, which is the target hash.
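The SimHash calculation can be sketched as follows. This is a minimal 64-bit variant under several assumptions: tokens are produced by a simple regular expression rather than a word segmenter, every token occurrence carries weight 1, and BLAKE2b is used as the per-token hash. None of these choices is stated in the source.

```python
import hashlib
import re

BITS = 64  # fingerprint length assumed for this sketch


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint of preprocessed text."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two webpages would then be treated as similar when the Hamming distance between their fingerprints falls below a chosen cut-off; the cut-off is a tuning parameter that the source does not specify.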
[0068] In this embodiment, firstly, fuzzy hashing can handle minor changes in webpage content, avoiding detection failures due to adjustments or disguises in the webpage content. For example, malicious webpages may circumvent traditional content matching techniques by changing certain text, images, or structures, but fuzzy hashing can capture such similarities in webpage content, thereby effectively identifying potentially risky webpages.
[0069] Secondly, fuzzy hashing compares webpage content by calculating its "fingerprint," avoiding simple literal content matching and enabling a more accurate determination of whether a webpage is malicious. This technology improves the accuracy of webpage risk detection, especially when dealing with complex or disguised webpages, accurately identifying potential threats in the webpage content.
[0070] Furthermore, fuzzy hashing is a fast and efficient process, especially when processing a large number of web pages. It can quickly derive the hash value of the web page content and compare it with known risky web pages. This method is highly effective in large-scale web page detection, efficiently filtering out potentially risky web pages and providing timely feedback for further processing.
[0071] Finally, fuzzy hashing technology can not only identify completely identical web pages, but also malicious web pages with altered structures or content. This improves the robustness of the risk detection system, ensuring efficient and accurate detection performance even in the face of complex disguises and dynamically changing web page content.
[0072] In S140, natural language processing (NLP) technology is used to analyze the webpage content, transforming the text information into a machine-processable form. NLP techniques can capture implicit semantic information within the webpage and represent its content as an embedding vector (for example, one produced by a model such as BERT or Word2Vec), thereby achieving a deeper understanding of the webpage content and identifying potential risks.
[0073] Furthermore, by converting webpage content (such as text) into embedding vectors, the webpage can be compared with known risky or malicious websites. Through semantic similarity calculations, it can be determined whether the target webpage shares similar risk characteristics with malicious webpages, thus identifying potential risks even if the webpage content changes.
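One common way to perform such a semantic similarity calculation is cosine similarity between embedding vectors. The sketch below assumes the vectors have already been produced by some embedding model; the choice of cosine similarity, the 0.9 threshold, and the function names are assumptions rather than details from the source.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def most_similar_risky(target: list, known_risky: list, threshold: float = 0.9) -> tuple:
    """Return (best similarity, is_risky) against known risky embeddings."""
    best = max((cosine_similarity(target, v) for v in known_risky), default=0.0)
    return best, best >= threshold
```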
[0074] Specifically, in some embodiments, the specific implementation of step S140 can be found in the following embodiment. This embodiment is a detailed description of step S140 in the webpage risk detection method of the embodiment corresponding to Figure 3. Step S140 in the webpage risk detection method may include the following steps:
[0075] Extract the webpage text of the webpage to be detected.
[0076] Natural language processing is performed on the webpage text to generate the corresponding target embedding vector.
[0077] In this embodiment, the webpage text is first extracted from the webpage to be detected. Unstructured data within the webpage is processed, and irrelevant elements such as images and advertisements are removed so that the subsequent natural language processing (NLP) algorithms focus on the core information of the webpage text, which improves the processing efficiency of subsequent steps. Then, the webpage text is preprocessed, including word segmentation, stop word removal, and stemming. Finally, an embedding technique is applied to the preprocessed text to generate the target embedding vector. NLP technology can process the extracted webpage text, transforming it into a machine-understandable form. A key technology in NLP is word embedding, such as Word2Vec, GloVe, and BERT, which represents each word in the text as a dense vector of fixed dimension. These word vectors capture the semantic features of words, enabling computers to process the content and meaning of the text. That is, the embedding vectors obtained in this embodiment not only consider the semantics of individual words but also represent the overall webpage content through contextual information. Ultimately, the webpage text is mapped into a high-dimensional space, with each webpage corresponding to one vector and similar webpages mapped to nearby positions. This provides rich semantic information for subsequent risk assessment. By comparing the distance between embedding vectors, the system can determine the similarity between the webpage to be detected and known risky webpages, thereby identifying potentially malicious webpages.
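The text extraction part of this step can be sketched as follows. This is a minimal illustration using only the Python standard library; the class name `VisibleTextExtractor`, the function name `extract_webpage_text`, and the choice of skipped tags are assumptions made for the example and are not taken from this application. The embedding step itself is omitted because the application does not fix a particular model.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping the content of non-visible elements."""

    # Hypothetical choice of elements whose content is not webpage text.
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_webpage_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Images contribute no text nodes, so they drop out without special handling. Advertisement removal would need site-specific rules and is not attempted in this sketch.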
[0078] In S150, the webpage text is segmented to extract keywords. These keywords reflect the main topic, content, and potential risks of the webpage. For example, keywords related to fraud, gambling, and malware can directly help determine whether a webpage is risky.
[0079] Specifically, in some embodiments, step S150 can be implemented as described in the following embodiment, which gives a detailed description of step S150 in the webpage risk detection method according to the embodiment corresponding to Figure 3. Step S150 may include the following steps:
[0080] Extract the webpage text of the webpage to be detected.
[0081] The webpage text is segmented to generate corresponding target keywords.
[0082] In this embodiment, key information is effectively extracted from the webpage by segmenting the webpage text. This process transforms the complex text of the webpage into manageable small units, helping the system understand and identify the core content of the webpage. By selecting target keywords, the system can focus on risk-related information, thereby improving the accuracy of detection.
[0083] Specifically, this embodiment uses the search engine mode of the Jieba word segmentation tool, which attempts to find all possible word combinations, including uncommon or newly emerging words. By analyzing these combinations, potential new words or out-of-vocabulary (OOV) words can be further filtered out. Using n-gram segmentation, candidate phrases are extracted from a given text with a relatively large N (e.g., 5-gram), and the frequency of these phrases is then calculated to help determine whether they are new words. The independence and internal cohesion of a candidate word are measured using two metrics, degree of freedom and cohesion; cohesion is typically evaluated by calculating the co-occurrence probability between adjacent characters within the word. Candidates with a high degree of freedom and high cohesion are more likely to be new words or OOV words.
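The n-gram scoring described above can be sketched as follows. The application names the two metrics but does not give formulas, so this example assumes two common definitions: cohesion as the smallest ratio of a candidate's probability to the product of the probabilities of its two parts, and degree of freedom as the smaller of the left-neighbor and right-neighbor entropies. The function names are hypothetical, and Jieba is not used here so that the sketch stays self-contained.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbor characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def score_candidates(text, max_n=5):
    """Return {candidate: (frequency, cohesion, freedom)} for n-grams, 2 <= n <= max_n."""
    counts = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each n-gram
    right = defaultdict(Counter)  # characters seen immediately after each n-gram
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < len(text):
                right[gram][text[i + n]] += 1

    total_chars = len(text)

    def prob(gram):
        return counts[gram] / total_chars

    scores = {}
    for gram, freq in counts.items():
        if len(gram) < 2:
            continue
        # Assumed cohesion: weakest split point of the candidate.
        cohesion = min(
            prob(gram) / (prob(gram[:k]) * prob(gram[k:]))
            for k in range(1, len(gram))
        )
        # Assumed degree of freedom: the less varied of the two neighbor sides.
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        scores[gram] = (freq, cohesion, freedom)
    return scores
```

A caller would keep candidates whose frequency, cohesion, and freedom all exceed chosen thresholds; those thresholds are tuning parameters that this application does not specify.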
[0084] Word segmentation and target keyword extraction help the system focus on the important content of a webpage, namely potentially risky words, rather than being distracted by other information on the page. By identifying these keywords, the system can better determine the nature of the webpage and reduce false positives or false negatives caused by redundant page content or formatting changes.
[0085] Word segmentation effectively handles complex text structures on web pages and extracts risk-indicating keywords from the content, providing a more universal and efficient detection method. Regardless of changes in the web page content, as long as keywords can be extracted, corresponding detection can be performed. This characteristic makes it adaptable to various types of web page risk detection.
[0086] This embodiment utilizes word segmentation and keyword extraction techniques to extract the most representative elements from complex webpage content. This method not only enhances the understanding of webpage content but also improves adaptability to changes in different webpage formats, languages, and structures. Even if the webpage content changes, its risk nature can still be accurately determined by identifying keywords.
[0087] In S200, the extracted key features are integrated and analyzed using certain decision-making algorithms or models (such as machine learning models, rule engines, etc.) to determine whether a webpage is a risky website.
[0088] In some embodiments, a rule engine can be used to directly determine the risk level of a webpage based on specific rules (e.g., domain name rules, hash matching rules, certificate validity rules, etc.).
[0089] In other embodiments, features can be learned by training a machine learning model (e.g., a classification model) and combined with historical data to predict whether a webpage is a risky webpage.
[0090] For example, if the extracted target hash is highly similar to a known risky website, the embedding vector matches a known group of malicious websites, or the webpage contains fraudulent keywords, the model can output that the webpage is a high-risk webpage.
[0091] Through the comprehensive feature analysis described above, this step can reduce false positives and false negatives. The combination of multiple technical means ensures the accuracy and efficiency of the risk detection results. For example, if a webpage's SSL certificate information is abnormal and the embedding vector of its content is similar to that of a known malicious website, it can be quickly identified as a risky webpage.
[0092] Specifically, in some embodiments, the specific implementation of step S200 is shown in Figure 4. Figure 4 is a detailed description of step S200 in the webpage risk detection method according to the embodiment corresponding to Figure 2. Step S200 may include the following steps:
[0093] S210, determine the first risk detection result based on the target attributes of the webpage to be detected.
[0094] S220, determine the second risk detection result based on the target hash of the webpage to be detected.
[0095] S230, determine the third risk detection result based on the target embedding vector of the webpage to be detected.
[0096] S240, determine the fourth risk detection result based on the target keywords of the webpage to be detected.
[0097] S250, determine the final risk detection result based on the first risk detection result, the second risk detection result, the third risk detection result, and the fourth risk detection result.
[0098] In this embodiment, the detection results of each dimension can reflect different aspects of a webpage's characteristics. For example, target hashes can identify known malicious webpages, while target keywords and target embedding vectors can reveal potential dangers from a content perspective. By integrating these different detection results, the system can obtain a more comprehensive webpage risk assessment, thereby avoiding misjudgments caused by a single dimension. This comprehensive judgment approach means that the detection method no longer relies on a single dimension, enabling a comprehensive analysis of the webpage's risk characteristics from multiple angles. By simultaneously considering the webpage's structural information (such as target attributes), content information (such as keywords and embedding vectors), and security information (such as hash values), the system's assessment results can integrate all factors to make a more accurate and reliable risk judgment. Multi-dimensional information fusion can greatly reduce omissions and erroneous judgments. For example, some malicious webpages may exhibit a relatively normal appearance or attributes, but their actual content may conceal risks, or they may have already been spread elsewhere; hash values may help to discover this. By combining the results of multiple risk detection dimensions, the final risk assessment can more accurately reflect the true risk of the webpage. Multi-dimensional comprehensive detection also greatly enhances the system's robustness. Even if one dimension is compromised (e.g., the target attribute of a webpage is disguised or altered), other dimensions can still provide strong information support for the system, ensuring that it maintains a high level of detection capability under different circumstances. With the diversification of cybersecurity threats and webpage fraud methods, webpage risk detection also needs to address increasingly complex patterns. 
By comprehensively considering multiple factors such as target attributes, hash values, embedding vectors, and target keywords, the system can identify different types of potential risks, including known malicious websites, phishing attacks disguised as legitimate websites, and complex attack patterns such as content-based fraud.
[0099] In S210, with the diversification of cybersecurity threats and web fraud methods, web risk detection also needs to address increasingly complex patterns. By comprehensively considering multiple factors such as target attributes, hash values, embedding vectors, and target keywords, the system can identify different types of potential risks, including known malicious websites, phishing attacks disguised as legitimate websites, and complex attack patterns such as content-based fraud.
[0100] Specifically, in some embodiments, step S210 can be implemented as described in the following embodiment, which gives a detailed description of step S210 in the webpage risk detection method according to the embodiment corresponding to Figure 4. Step S210 may include the following steps:
[0101] Determine the target attribute combination from the target attributes of the webpage to be detected.
[0102] By comparing the hash values of the target attribute combinations, the first risk detection result is determined.
[0103] In this embodiment, hash values of embedded links and resource images within a webpage can be extracted. These hash values can be used to identify similar webpage content. Hash values of key text paragraphs within the webpage can also be extracted, helping to identify the main content of the webpage. Through association rule analysis, combined features can be formed, creating strong rules. By matching these target attributes, potentially unknown risky websites can be quickly located.
[0104] Specifically, the webpage to be detected is analyzed to identify several target attributes (such as webpage title, URL, HTML tags, image sources, etc.). These attributes are typically the main factors determining whether a webpage poses a risk. A combination of target attributes can be established using these attributes. This combination comprehensively reflects the core characteristics of the webpage. The selection and combination of target attributes helps improve detection accuracy because the risk of a webpage is not always manifested by a single attribute but requires the combination of information from multiple dimensions. For the identified combination of target attributes, its hash value is calculated. The hash value is a fixed-length string generated by mapping the webpage's target attribute combination using a hash algorithm. The calculation of the hash value ensures that the webpage's attribute combination is simplified into a unique identifier during detection, facilitating comparison. During the hash value comparison process, the detection system compares the webpage's hash value with a database of hash values for known risky webpages. If they match, it indicates that the webpage is similar to some known risky webpages and may pose a risk. This embodiment avoids analyzing each specific piece of content on the webpage individually by hashing the webpage's target attributes and comparing the hash values. Hash comparison is more efficient than direct comparison of webpage content, significantly reducing computation and detection time, thus improving the efficiency of webpage risk detection. Hash comparison of target attribute combinations can effectively reduce false positives or false negatives because it relies on the core features of the webpage, rather than its visual appearance or other external factors. 
Furthermore, hash comparison technology can quickly identify similarities to known malicious webpages, improving the accuracy of risk detection.
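The attribute combination hashing described above can be sketched as follows. The application does not name a hash algorithm or a normalization scheme, so SHA-256, lowercasing, and sorted keys are assumptions made for this example; the function names and attribute keys are hypothetical.

```python
import hashlib


def attribute_combination_hash(attributes, selected_keys):
    """Hash a normalized, order-independent combination of selected attributes."""
    parts = []
    for key in sorted(selected_keys):
        # Missing attributes are treated as empty strings (an assumption).
        value = str(attributes.get(key, "")).strip().lower()
        parts.append(f"{key}={value}")
    canonical = "\n".join(parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def first_risk_detection(attributes, selected_keys, known_risky_hashes):
    """Return True if the attribute combination matches a known risky combination."""
    return attribute_combination_hash(attributes, selected_keys) in known_risky_hashes
```

Because a cryptographic hash only matches exact inputs, the choice of which attributes to combine and how to normalize them determines how tolerant this check is to small page changes.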
[0105] In S220, a hash value is a unique identifier for webpage content. Hash calculations can be performed on the webpage's source code, resource files, etc., to determine if the webpage has been tampered with or matches known malicious webpages. Hash matching can help detect known malicious webpages or flagged risky websites. Compared to other methods, hash matching is a highly efficient and accurate matching technique that can quickly identify known malicious pages.
[0106] Specifically, in this step, the SimHash values of known risky websites can be stored to form a hash index library. Using this hash index library, the SimHash value of an unknown website is matched against the SimHash values of known risky websites. A similarity calculation method such as Hamming distance is used; if the similarity between the SimHash value of an unknown website and the SimHash value of a known risky website is higher than a preset threshold, the website is marked as a suspected risky website.
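SimHash matching with Hamming distance can be sketched as follows. This is a simplified, unweighted SimHash over tokens; the use of MD5 as the per-token hash, the 64-bit width, and the distance limit of 3 are assumptions for illustration. A similarity above a threshold is expressed here as a Hamming distance at or below `max_distance`, and the index is a plain list rather than an optimized index library.

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute an unweighted SimHash fingerprint from an iterable of tokens."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def second_risk_detection(target_hash, hash_index, max_distance=3):
    """Return True if any known risky fingerprint is within max_distance bits."""
    return any(
        hamming_distance(target_hash, known) <= max_distance
        for known in hash_index
    )
```

A linear scan is adequate for a small index; a large index library would typically partition fingerprints into bit blocks to avoid comparing against every entry.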
[0107] In S230, embedding vectors are typically derived from deep learning techniques. By embedding representations of webpage content, webpage text and other content can be transformed into high-dimensional vectors. These vectors can be used by machine learning models to capture semantic information of the webpage, thereby discovering potential risk patterns. Embedding vectors can identify some complex and hidden risk features, playing a particularly important role in semantic analysis of webpage content.
[0108] Specifically, in this step, the vectors of websites of known categories can be stored to form a vector index library (built with an efficient vector indexing tool such as Annoy or Faiss). Using this vector index library, websites of unknown category are matched against the vectors of websites of known categories. Common similarity calculation methods include cosine similarity and Euclidean distance. If the similarity between the vector of an unknown website and the vector of a known risky website is higher than a threshold, the website is marked as a suspected risky website.
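The similarity comparison in this step can be sketched as follows. For brevity the example uses a brute-force scan with cosine similarity in place of a vector index such as Annoy or Faiss; the function names and the threshold of 0.9 are assumptions, not values from this application.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def third_risk_detection(target_vector, risky_vectors, threshold=0.9):
    """Return True if the target is more similar than threshold to any risky vector."""
    return any(
        cosine_similarity(target_vector, known) > threshold
        for known in risky_vectors
    )
```

The scan is linear in the number of stored vectors, which is the cost that an approximate nearest-neighbor index is meant to avoid at scale.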
[0109] In S240, the target keywords extracted through word segmentation technology have been analyzed in detail in the aforementioned embodiments. These keywords help the system identify the core content of the webpage, thereby assessing whether it contains risk-related terms (such as "fraud," "phishing," "malware," etc.). The extraction and analysis of these keywords help the system more accurately understand the topic of the webpage, and thus detect potential dangers.
[0110] In this step, the websites can be submitted to a human review team for labeling and risk identification. Human review confirms whether these websites are malicious, thus enriching the machine learning sample and improving the model's generalization ability and detection accuracy.
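The keyword check in S240 can be sketched as a lookup against a lexicon of risk-related terms. The application does not describe how keywords are scored, so the set intersection, the `min_hits` parameter, and the function name are assumptions for illustration.

```python
def fourth_risk_detection(target_keywords, risk_lexicon, min_hits=1):
    """Return (is_risky, matched_terms) for keywords found in the risk lexicon."""
    keywords = {kw.lower() for kw in target_keywords}
    lexicon = {term.lower() for term in risk_lexicon}
    hits = sorted(keywords & lexicon)
    return len(hits) >= min_hits, hits
```

Returning the matched terms alongside the decision gives human reviewers the evidence behind each flag.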
[0111] In S250, by integrating multi-dimensional data, the system can reduce errors caused by misjudgments in a single dimension. For example, some abnormal web pages may conceal their target attributes through disguise, but their risks can still be detected by analyzing hash values, embedding vectors, or keywords. This multi-dimensional comprehensive analysis enhances the system's adaptability and robustness to different types of risks. Single-dimensional evaluation methods (e.g., relying solely on page hash values or keywords) may have limitations. Some web pages may not exhibit significant malicious characteristics, but through multi-dimensional analysis, the system can more accurately identify hidden risks. For instance, a web page may not have obvious problems with its target attributes, but potential malicious content can be discovered through analysis of target embedding vectors or target keywords.
[0112] Specifically, in some embodiments, step S250 can be implemented as described in the following embodiment, which gives a detailed description of step S250 in the webpage risk detection method according to the embodiment corresponding to Figure 4. Step S250 may include the following steps:
[0113] Determine the risk percentage, where the risk percentage is the proportion of the first detection result, the second detection result, the third detection result, and the fourth detection result that identify the webpage to be detected as risky.
[0114] Determine the final risk detection result based on the risk percentage.
[0115] In this embodiment, the proportion of a webpage judged to be risky across different detection dimensions is quantified by calculating the "risk percentage." The risk percentage refers to the percentage of detection results indicating a risky webpage out of the four detection dimensions. For example, if two out of the four detection dimensions indicate that the webpage is risky, the risk percentage is 50%.
[0116] Based on the calculated risk percentage, the overall risk assessment of the webpage is finally determined. If the risk percentage is high, the webpage's risk level may be rated as high; if the risk percentage is low, the webpage's risk level is low.
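The risk percentage calculation can be sketched as follows. The cut-off of 50% used to label a webpage as high risk is an assumption for illustration; the application does not state a specific threshold, and the function names are hypothetical.

```python
def risk_percentage(results):
    """Percentage of detection results that flag the webpage as risky."""
    if not results:
        raise ValueError("at least one detection result is required")
    return 100.0 * sum(1 for flagged in results if flagged) / len(results)


def final_risk_detection(results, high_risk_threshold=50.0):
    """Map the risk percentage to a coarse risk level (threshold is an assumption)."""
    if risk_percentage(results) >= high_risk_threshold:
        return "high"
    return "low"


# The four booleans stand for the first to fourth risk detection results.
print(risk_percentage([True, False, True, False]))  # 50.0
```

Every dimension counts equally here; a weighted variant would multiply each result by a per-dimension weight before normalizing.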
[0117] This embodiment avoids misjudgments caused by relying on a single detection result by integrating the results of multiple detection dimensions. Each detection result assesses the risk of a webpage from a different perspective, and the comprehensive assessment can more accurately reflect the overall security status of the webpage.
[0118] The introduction of risk percentage further quantifies the risk level of web pages, making the final risk assessment more reliable and able to give a more reasonable final conclusion based on the weight of different detection dimensions.
[0119] Using a single detection method alone may lead to false positives (incorrectly labeling safe web pages as risky ones) or false negatives (failing to identify web pages that actually pose a risk). By combining the results of multiple detections and calculating the risk percentage, the probability of false positives and false negatives can be reduced. Only when multiple detection dimensions point to the presence of risk can a web page be definitively determined to have a high risk.
[0120] The following describes an apparatus embodiment of this application, which can be used to execute the webpage risk detection method in the above embodiments of this application. For details not disclosed in the apparatus embodiments of this application, please refer to the embodiments of the webpage risk detection method described above.
[0121] Figure 5 shows a block diagram of a webpage risk detection device according to an embodiment of this application.
[0122] As shown in Figure 5, a webpage risk detection device 500 according to an embodiment of this application includes a webpage crawling module 510 and a risk detection module 520.
[0123] The webpage crawling module 510 is used to crawl the webpage to be detected and extract key features from the webpage to be detected. The key features include target attributes, target hash, target embedding vector and target keywords. The risk detection module 520 is used to determine the final risk detection result based on the key features of the webpage to be detected.
[0124] In some embodiments of this application, the web page crawling module specifically includes: a web page crawling submodule for crawling web pages to be detected; a target attribute submodule for extracting features from the web pages to be detected to obtain corresponding target attributes; and a fuzzy hashing submodule for performing fuzzy hash calculations on the web pages to be detected to obtain corresponding target hashes.
[0125] The natural language submodule is used to perform natural language processing on the webpage to be detected to obtain the corresponding target embedding vector; the word segmentation submodule is used to perform word segmentation processing on the webpage to be detected to obtain the corresponding target keywords.
[0126] In some embodiments of this application, the fuzzy hashing submodule specifically includes: a content extraction unit, used to extract the target webpage content of the webpage to be detected; and a fuzzy hashing unit, used to perform fuzzy hash calculation on the target webpage content to obtain the corresponding target hash.
[0127] In some embodiments of this application, the natural language submodule specifically includes: a text extraction unit for extracting the webpage text of the webpage to be detected; and a vector embedding unit for performing natural language processing on the webpage text to generate a corresponding target embedding vector.
[0128] In some embodiments of this application, the word segmentation processing submodule specifically includes: a text extraction unit, used to extract the webpage text of the webpage to be detected; and a word segmentation processing unit, used to perform word segmentation processing on the webpage text to generate corresponding target keywords.
[0129] In some embodiments of this application, the risk detection module includes: a first risk detection submodule, configured to determine a first risk detection result based on the target attributes of the webpage to be detected; a second risk detection submodule, configured to determine a second risk detection result based on the target hash of the webpage to be detected; a third risk detection submodule, configured to determine a third risk detection result based on the target embedding vector of the webpage to be detected; a fourth risk detection submodule, configured to determine a fourth risk detection result based on the target keywords of the webpage to be detected; and a final risk detection submodule, configured to determine a final risk detection result based on the first detection result, the second detection result, the third detection result, and the fourth detection result.
[0130] In some embodiments of this application, the first risk detection submodule specifically includes: an attribute combination unit, used to determine a target attribute combination among the target attributes of the webpage to be detected; and an attribute hash unit, used to compare the hash value of the target attribute combination to determine the first risk detection result.
[0131] In some embodiments of this application, the final risk detection submodule specifically includes: a risk proportion unit, used to determine the risk proportion, wherein the risk proportion is the proportion of the webpage to be detected that is determined to be risky among the first detection result, the second detection result, the third detection result, and the fourth detection result; and a final result unit, used to determine the final risk detection result based on the risk proportion.
[0132] Figure 6 shows a schematic diagram of the structure of a computer system suitable for implementing the electronic device of the present application.
[0133] It should be noted that the computer system of the electronic device shown in Figure 6 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.
[0134] As shown in Figure 6, the computer system includes a Central Processing Unit (CPU) 1801, which can perform various appropriate actions and processes based on programs stored in Read-Only Memory (ROM) 1802 or programs loaded from storage section 1808 into Random Access Memory (RAM) 1803, such as performing the methods described in the above embodiments. The RAM 1803 also stores various programs and data required for system operation. The CPU 1801, ROM 1802, and RAM 1803 are interconnected via a bus 1804. An Input / Output (I / O) interface 1805 is also connected to the bus 1804.
[0135] The following components are connected to I / O interface 1805: an input section 1806 including a keyboard, mouse, etc.; an output section 1807 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 1808 including a hard disk, etc.; and a communication section 1809 including a network interface card such as a LAN (Local Area Network) card, modem, etc. The communication section 1809 performs communication processing via a network such as the Internet. A drive 1810 is also connected to I / O interface 1805 as needed. Removable media 1811, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., are installed on drive 1810 as needed so that computer programs read from them can be installed into storage section 1808 as needed.
[0136] Specifically, according to embodiments of this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program including program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 1809, and / or installed from removable medium 1811. When the computer program is executed by the central processing unit (CPU) 1801, it performs the various functions defined in the system of this application.
[0137] It should be noted that the computer-readable medium shown in the embodiments of this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying a computer-readable computer program. The transmitted data signal can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination thereof.
[0138] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. Each block in a flowchart or block diagram may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0139] The units described in the embodiments of this application can be implemented in software or hardware, and the described units can also be located in a processor. The names of these units do not necessarily limit the specific unit itself.
[0140] In another aspect, this application also provides a computer-readable medium, which may be included in the electronic device described in the above embodiments; or it may exist independently and not assembled into the electronic device. The computer-readable medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to perform the methods described in the above embodiments.
[0141] This specification also provides a computer program product that stores at least one instruction, the at least one instruction being loaded by a processor to execute the method of the embodiments shown in Figures 1-4. For the detailed execution process, refer to the description of the embodiments shown in Figures 1-4, which is not repeated here.
[0142] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.
[0143] Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, touch terminal, or network device, etc.) to execute the method according to the embodiments of this application.
[0144] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein.
[0145] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
Claims
1. A webpage risk detection method, characterized in that it comprises: crawling a webpage to be detected and extracting key features from the webpage to be detected, the key features including target attributes, a target hash, a target embedding vector, and target keywords; and determining a final risk detection result based on the key features of the webpage to be detected.
2. The webpage risk detection method as described in claim 1, characterized in that the crawling of the webpage to be detected and the extracting of key features from the webpage to be detected specifically comprise: crawling the webpage to be detected; performing feature extraction on the webpage to be detected to obtain the corresponding target attributes; performing a fuzzy hash calculation on the webpage to be detected to obtain the corresponding target hash; performing natural language processing on the webpage to be detected to obtain the corresponding target embedding vector; and performing word segmentation on the webpage to be detected to obtain the corresponding target keywords.
3. The webpage risk detection method as described in claim 2, characterized in that the step of performing a fuzzy hash calculation on the webpage to be detected to obtain the corresponding target hash specifically comprises: extracting target webpage content of the webpage to be detected; and performing a fuzzy hash calculation on the target webpage content to obtain the corresponding target hash.
4. The webpage risk detection method as described in claim 2, characterized in that the step of performing natural language processing on the webpage to be detected to obtain the corresponding target embedding vector specifically comprises: extracting webpage text of the webpage to be detected; and performing natural language processing on the webpage text to generate the corresponding target embedding vector.
5. The webpage risk detection method as described in claim 2, characterized in that the step of performing word segmentation on the webpage to be detected to obtain the corresponding target keywords specifically comprises: extracting webpage text of the webpage to be detected; and performing word segmentation on the webpage text to generate the corresponding target keywords.
6. The webpage risk detection method as described in claim 1, characterized in that the step of determining the final risk detection result based on the key features of the webpage to be detected comprises: determining a first risk detection result based on the target attributes of the webpage to be detected; determining a second risk detection result based on the target hash of the webpage to be detected; determining a third risk detection result based on the target embedding vector of the webpage to be detected; determining a fourth risk detection result based on the target keywords of the webpage to be detected; and determining the final risk detection result based on the first, second, third, and fourth risk detection results.
7. A webpage risk detection device, characterized in that the webpage risk detection device comprises: a webpage crawling module, used to crawl a webpage to be detected and extract key features from the webpage to be detected, the key features including target attributes, a target hash, a target embedding vector, and target keywords; and a risk detection module, used to determine a final risk detection result based on the key features of the webpage to be detected.
8. A computer-readable medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the webpage risk detection method as described in any one of claims 1 to 6.
9. An electronic device, characterized in that it comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the webpage risk detection method as described in any one of claims 1 to 6.
10. A computer program product comprising one or more computer programs, characterized in that, when the one or more computer programs are executed by one or more processors, they implement the steps of the webpage risk detection method as described in any one of claims 1 to 6.
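Illustrative note (not part of the claims): the flow recited in claims 1, 2, and 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The claims do not name specific algorithms, so the sketch substitutes simple stand-ins: a SimHash fingerprint in place of the fuzzy hash, a hashed bag-of-words vector in place of a learned embedding, and regular-expression tokenization in place of word segmentation. The attribute `password_inputs`, the thresholds `max_hamming` and `min_cosine`, the function names, and the rule that any positive per-feature result makes the final result positive are all hypothetical.

```python
import hashlib
import math
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and counts password inputs (a hypothetical attribute)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.password_inputs = 0

    def handle_starttag(self, tag, attrs):
        if tag == "input" and dict(attrs).get("type") == "password":
            self.password_inputs += 1

    def handle_data(self, data):
        self.chunks.append(data)


def _token_hash(token):
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(tokens, bits=64):
    """SimHash fingerprint: similar token sets tend to differ in few bits."""
    weights = [0] * bits
    for tok in tokens:
        h = _token_hash(tok)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def embed(tokens, dim=32):
    """L2-normalised hashed bag-of-words vector (stand-in for a learned embedding)."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[_token_hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def extract_features(html):
    """Claim 2: derive attributes, hash, embedding vector, and keywords from one page."""
    parser = _TextExtractor()
    parser.feed(html)
    tokens = re.findall(r"\w+", " ".join(parser.chunks).lower())
    return {
        "attributes": {"password_inputs": parser.password_inputs},
        "hash": simhash(tokens),
        "embedding": embed(tokens),
        "keywords": set(tokens),
    }


def detect(features, known_bad, risky_words, max_hamming=8, min_cosine=0.9):
    """Claim 6: four per-feature results, then a final result (any-positive rule assumed)."""
    cosine = sum(a * b for a, b in zip(features["embedding"], known_bad["embedding"]))
    results = {
        "attributes": features["attributes"]["password_inputs"] > 0,
        "hash": bin(features["hash"] ^ known_bad["hash"]).count("1") <= max_hamming,
        "embedding": cosine >= min_cosine,
        "keywords": bool(features["keywords"] & risky_words),
    }
    results["final"] = any(results.values())
    return results


# Usage: compare a page against the features of one known risky page.
bad_page = "<html><body><form>Verify your bank account <input type='password'></form></body></html>"
known_bad = extract_features(bad_page)
report = detect(extract_features(bad_page), known_bad, {"verify", "bank"})
```

In practice the known risky features would come from a maintained library of samples rather than a single page, and the combination rule could be a weighted score instead of a logical OR. The claims leave both choices open.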