A method, system, and apparatus for risk identification of sensitive content on web pages.

By establishing a massive sensitive word database and an unsupervised text classification model, combined with regular expression judgment, the problems of low efficiency and high false positive rate in the identification of sensitive content on web pages in existing technologies have been solved, and accurate identification and efficient monitoring of sensitive content on web pages have been achieved.

CN117332085B · Active · Publication Date: 2026-06-30 · EASTCOM NETWORK SECURITY (SHENZHEN) TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: EASTCOM NETWORK SECURITY (SHENZHEN) TECH CO LTD
Filing Date: 2023-09-28
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies lack effective methods for quickly detecting and screening sensitive content on web pages, leading to the appearance of harmful information, affecting user experience and reducing the timeliness of detection.

Method used

A massive sensitive word database is established, and semantic analysis is performed using the AC automaton algorithm and unsupervised text classification model. Combined with regular expression judgment, sensitive words and privacy information in web pages are identified. Custom blacklists and whitelists and parameterization methods are supported to achieve accurate matching and false positive screening.

Benefits of technology

It enables accurate identification of sensitive content on web pages, improves monitoring efficiency and accuracy, meets the needs of batch monitoring, and reduces the false positive rate.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a method for risk identification of sensitive content on web pages, belonging to the technical field of network content security. This method establishes a massive sensitive word library to accurately identify sensitive words and privacy information on extracted pages, significantly improving performance to meet the needs of batch monitoring of web pages. By adjusting the scores of the sensitive word library, adding or deleting sensitive words, or fine-tuning the algorithm, the overall monitoring accuracy can be controlled. The method includes the following steps: establishing a sensitive word library; loading the sensitive word library and constructing the identification system context environment; reading the content of each valid page and formatting it before outputting the formatted page content; identifying the formatted page content against the sensitive word library to extract sensitive content metadata containing all data information containing sensitive words; and performing semantic analysis on the sensitive content metadata through unsupervised classification to obtain sensitive content results.

Description

Technical Field

[0001] This invention relates to the technical field of network content security, and more specifically, to a method for risk identification of sensitive content on web pages. The invention also relates to a system and apparatus for implementing this method.

Background Technology

[0002] A website is a host on the internet with a domain name or address that provides certain online services. It is a space for storing files, hosted on a server. People can use websites to publish information they want to share or to provide related online services. With the rapid development of the internet, the amount of data generated by websites every day is becoming increasingly massive. Most of this data is stored on servers, and this data lacks effective monitoring and auditing measures, making it vulnerable to hacking, tampering, and other malicious attacks. Currently, website developers and administrators lack effective methods to quickly detect and screen out problematic pages. When users browse problematic content, inappropriate remarks (such as politically charged or abusive language), sensitive content (such as pornography), private data, malware, and other harmful information may appear, which has a significant negative impact on operations.

[0003] Current methods to avoid this problem often rely on human review and user feedback, but these methods are not only inefficient but also significantly reduce the timeliness of discovering such information. Even with the help of artificial intelligence algorithms, a large amount of information is still missed or misjudged. Therefore, there is an urgent need to design a more effective method to identify the risks of sensitive content on web pages, in order to enhance the user browsing experience.

Summary of the Invention

[0004] The primary objective of this invention is to provide a method for risk identification of sensitive content on web pages. This method establishes a massive sensitive word database to accurately identify sensitive words and privacy information on extracted pages, and also greatly improves performance to meet the needs of batch monitoring of web pages. By adjusting the scores of the sensitive word database, adding or deleting sensitive words, or fine-tuning the algorithm, the overall monitoring accuracy can be controlled.

[0005] A second objective of this invention is to provide a risk identification system for sensitive content on web pages, which can effectively monitor sensitive words on web pages. A third objective of this invention is to provide an apparatus for implementing this risk identification system.

[0006] The first technical solution adopted in this invention is as follows:

[0007] A method for identifying the risks of sensitive content on a webpage includes the following steps:

[0008] S1. Establish a sensitive word database, load the sensitive word database, and construct the recognition system context environment;

[0009] S2. Read the content of each valid page one by one, perform formatting processing, and then output the formatted page content;

[0010] S3. Match the formatted page content obtained in step S2 against the sensitive words in the sensitive word library, then extract the sensitive content metadata, which comprises all data information relating to the sensitive words found;

[0011] S4. Obtain the sensitive content results by performing semantic analysis on the sensitive content metadata from step S3 through unsupervised text classification.

[0012] S5. Use regular expressions to identify privacy-leaking information data from the formatted page content obtained in step S2;

[0013] S6. Output the sensitive content results obtained in step S4 and the privacy leakage information data obtained in step S5.

[0014] Furthermore, the sensitive word library established in step S1 includes a sensitive word blacklist, a sensitive word whitelist, the type of each sensitive word, and the score of each sensitive word. By setting a regular expression and loading the sensitive word library, the context environment of the recognition system is constructed.

[0015] Furthermore, step S2 specifically involves: reading the page content; if the page is invalid, discarding it and reading the next page content; if the page is valid, performing formatting to obtain an html1 file containing tags and an html2 file without tags.

[0016] Furthermore, step S3 includes the following steps:

[0017] S3.1. Input all sensitive words in the sensitive word library into the AC automaton algorithm in batches, compare them with the content in the html1 file, identify all sensitive words contained in the page and the specific location coordinates of each sensitive word on the page, and temporarily store each sensitive word and its specific location coordinates on the page.

[0018] S3.2. The sensitive words in the page temporarily stored in step S3.1 are compared with the sensitive word whitelist in the sensitive word library. The sensitive words that belong to the sensitive word whitelist and their specific location coordinates are removed from the sensitive words in the page temporarily stored.

[0019] S3.3. For each sensitive word remaining on the page after step S3.2, extract the context surrounding its specific location on the page, and temporarily store that context.

[0020] S3.4. Locate the line numbers of the sensitive words obtained in step S3.2 on the original page and temporarily store the line number results. The sensitive content metadata is then composed of the temporarily stored items: each sensitive word, its specific location coordinates, its location context, and its line number on the page.

[0021] Furthermore, in step S3, after extracting the sensitive content metadata, the sensitive content metadata is updated after calculating a suspiciousness score. The specific steps are as follows:

[0022] S3.5. In the sensitive content metadata obtained in step S3.4, sensitive words whose scores reach or exceed the set threshold are selected and the sensitive content metadata corresponding to the super-score sensitive words is temporarily stored. The sensitive content metadata corresponding to the remaining sensitive words is temporarily stored.

[0023] S3.6. Classify the remaining sensitive content metadata corresponding to the sensitive words according to their types, and temporarily store the sensitive word results contained in each type.

[0024] S3.7. Perform a statistical analysis on the number of sensitive words in each category obtained in step S3.6. If the number of categories is greater than or equal to a set threshold, calculate the sensitive word score corresponding to the sensitive words in that category. If the sensitive word score of that category is greater than or equal to a set threshold, add the sensitive content metadata corresponding to the sensitive words in that category to the sensitive content metadata corresponding to the super-score sensitive words in step S3.5.

[0025] Furthermore, in step S4, the sensitive content metadata from step S3 is input one by one into the unsupervised text semantic analysis model to analyze the context information of the sensitive words. If the sensitive content metadata corresponding to the sensitive word is regular sensitive content, it is deleted. The sensitive content metadata obtained after the analysis is completed is the sensitive content result.

[0026] Furthermore, in step S5, the privacy-leaking information data is identified from the html2 file using regular expressions.

[0027] The second technical solution adopted in this invention is as follows:

[0028] A risk identification system for sensitive web page content includes:

[0029] Sensitive word database module: used to store the blacklist of sensitive words, the whitelist of sensitive words, and the score of each sensitive word;

[0030] Page extraction module: Used to extract the content of valid pages;

[0031] Page Sensitive Word Recognition Module: Used to identify sensitive content metadata from the extracted content of valid pages;

[0032] Page privacy leakage information recognition module: used to identify privacy leakage information from the extracted valid pages;

[0033] The output of the page extraction module and the output of the sensitive word database module are both connected to the input of the page sensitive word recognition module. The output of the page extraction module is also connected to the input of the page privacy leakage information recognition module.

[0034] Furthermore, it also includes:

[0035] Unsupervised content classification module: used to perform semantic analysis on the identified sensitive content metadata;

[0036] Suspiciousness Score Analysis Module: Used to calculate the suspiciousness of identified sensitive content metadata;

[0037] The input of the suspiciousness score analysis module is connected to the output of the page sensitive word recognition module, and the output of the suspiciousness score analysis module is connected to the input of the unsupervised classification module.

[0038] The third technical solution adopted in this invention is as follows:

[0039] A risk identification device for sensitive web page content includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps of any of the risk identification methods described above.

[0040] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0041] 1. This invention provides a method for risk identification of sensitive content on web pages. The method establishes a sensitive word library, loads it, and constructs the recognition system context environment. It reads the content of each valid page, formats it, and outputs the formatted page content. The formatted page content is then matched against the sensitive word library to extract sensitive content metadata covering all sensitive words found. The sensitive content metadata is then semantically analyzed through unsupervised text classification to obtain the sensitive content results. Regular expressions are used to identify privacy-leaking information in the formatted page content. The sensitive content results and privacy-leaking information are then output. Accurate matching is achieved by establishing a massive sensitive word library and supporting custom blacklists and whitelists. Parameterization is used to distinguish between uppercase and lowercase sensitive words, and matching performance and speed do not degrade as the sensitive word library grows. Furthermore, unsupervised text classification is used for false positive screening, achieving accurate monitoring while significantly improving performance and allowing batch monitoring of all web pages on a website. With this risk identification method, the overall monitoring accuracy can be controlled simply by adjusting the sensitive word library scores, adding or deleting sensitive words, or fine-tuning the suspiciousness algorithm.

[0042] 2. The risk identification system of the present invention comprises a sensitive word database module, a page extraction module, a page sensitive word identification module, and a page privacy leakage information identification module. With these modules interconnected, the system can effectively monitor sensitive words in web pages.

Attached Figure Description

[0043] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:

[0044] Figure 1 is a flowchart of the steps of this method;

[0045] Figure 2 is a schematic diagram of the system structure.

Detailed Implementation

[0046] The technical solution of the present invention will be further described in detail below with reference to specific embodiments, but this does not constitute any limitation on the present invention.

[0047] As shown in Figure 1, a method for identifying the risk of sensitive content on a webpage according to the present invention includes the following steps:

[0048] S1. Establish a sensitive word database, load the sensitive word database, and construct the recognition system context environment.

[0049] The established sensitive word database includes a blacklist of sensitive words, a whitelist of sensitive words, the type of each sensitive word, and a score for each sensitive word. The sensitive word database is loaded by setting regular expressions to construct the context environment of the recognition system. The score of a sensitive word can be set from 1 to 9 points, and the type of a sensitive word can be set, such as 1 = politically sensitive, 2 = prohibited, etc. A larger database is more beneficial to the recognition effect.
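As an illustration only, the sensitive word database described above could be held in memory roughly as follows. This is a minimal sketch under stated assumptions: the class names `SensitiveWord` and `SensitiveWordLibrary`, the field names, and the sample entries are hypothetical and do not come from the patent; only the blacklist, whitelist, type, and 1 to 9 score come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SensitiveWord:
    """One blacklist entry (hypothetical structure)."""
    word: str
    category: int  # type code, e.g. 1 = politically sensitive, 2 = prohibited
    score: int     # 1 to 9, as described in the text

    def __post_init__(self):
        if not 1 <= self.score <= 9:
            raise ValueError(f"score must be in 1..9, got {self.score}")


@dataclass
class SensitiveWordLibrary:
    """Blacklist keyed by word, plus a whitelist of exempted words."""
    blacklist: dict[str, SensitiveWord] = field(default_factory=dict)
    whitelist: set[str] = field(default_factory=set)

    def add(self, word: str, category: int, score: int) -> None:
        self.blacklist[word] = SensitiveWord(word, category, score)

    def remove(self, word: str) -> None:
        self.blacklist.pop(word, None)

    def allow(self, word: str) -> None:
        self.whitelist.add(word)


# Sample entries are placeholders, not real library content.
library = SensitiveWordLibrary()
library.add("badword", category=2, score=8)
library.add("gamble", category=2, score=3)
library.allow("gamble")
```

Keeping scores and types on each entry is what lets the monitoring accuracy be tuned later by editing the library rather than the matching code.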

[0050] S2. Read the content of each valid page one by one, format it, and then output the formatted page content.

[0051] The specific operation is as follows: Read the page content; if the page is invalid, discard it and read the next page; if the page is valid, format it to obtain an html1 file containing tags and an html2 file without tags. After formatting, restore the page to a display consistent with the browser.
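A minimal sketch of this formatting step, using only the Python standard library, might look like the following. The function name `format_page`, the treatment of a blank page as the only invalid case, and the choice to keep the raw markup as html1 are assumptions made for illustration; the text does not specify how validity is checked or how the browser-consistent rendering is produced.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def format_page(raw: str):
    """Return (html1, html2), or None when the page is blank (assumed invalid)."""
    if not raw or not raw.strip():
        return None
    html1 = raw.strip()                 # tagged version, used for position lookups
    extractor = _TextExtractor()
    extractor.feed(html1)
    extractor.close()
    html2 = "\n".join(extractor.parts)  # tag-free version, used for regex checks
    return html1, html2
```

A caller would skip pages for which `format_page` returns `None` and move on to the next page.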

[0052] S3. Match the formatted page content obtained in step S2 against the sensitive words in the sensitive word library, then extract the sensitive content metadata, which comprises all data information relating to the sensitive words found.

[0053] Specifically, the following steps are included:

[0054] S3.1. All sensitive words in the sensitive word database are batch-input into the AC (Aho-Corasick) automaton algorithm and compared with the content in the html1 file to identify all sensitive words contained on the page and their specific coordinates on the page; each sensitive word and its coordinates are temporarily stored. Using the AC automaton algorithm allows sensitive content to be extracted from the page quickly.

[0055] S3.2. Compare the sensitive words in the page temporarily stored in step S3.1 with the sensitive word whitelist in the sensitive word library, and remove the sensitive words that belong to the sensitive word whitelist and their specific location coordinates from the sensitive words in the page temporarily stored.

[0056] S3.3. For each sensitive word remaining on the page after step S3.2, extract the context surrounding its specific location on the page, and temporarily store that context.

[0057] S3.4. Locate the line numbers of the sensitive words obtained in step S3.2 on the original page and temporarily store the line number results. The sensitive content metadata is then composed of the temporarily stored items: each sensitive word, its specific location coordinates, its location context, and its line number on the page.
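Steps S3.1 to S3.4 can be sketched as below. This is an illustrative, pure-Python Aho-Corasick matcher written for this example, not the patent's implementation; the function names, the record keys, the use of character offsets as "coordinates", and the default context width of 30 characters are all assumptions.

```python
from collections import deque


def build_automaton(words):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())  # depth-1 states keep failure link 0 (root)
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter suffix matches
    return goto, fail, out


def find_matches(text, automaton):
    """Yield (start, end, word) for every occurrence in a single pass (S3.1)."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            yield i - len(word) + 1, i + 1, word


def extract_metadata(html1, blacklist, whitelist, context_chars=30):
    """Return one metadata record per non-whitelisted hit."""
    automaton = build_automaton(blacklist)
    records = []
    for start, end, word in find_matches(html1, automaton):
        if word in whitelist:  # S3.2: drop whitelisted hits
            continue
        records.append({
            "word": word,
            "start": start,
            "end": end,
            # S3.3: surrounding context, clipped at the page boundaries
            "context": html1[max(0, start - context_chars):end + context_chars],
            # S3.4: 1-based line number of the hit
            "line": html1.count("\n", 0, start) + 1,
        })
    return records
```

Because the automaton scans the page once regardless of how many words it holds, scan time depends mainly on page length and the number of hits, which is consistent with the claim that speed does not degrade as the library grows.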

[0058] After extracting the sensitive content metadata, the sensitive content metadata is updated after calculating its suspiciousness score. The specific steps are as follows:

[0059] S3.5. In the sensitive content metadata obtained in step S3.4, sensitive words whose scores reach or exceed the set threshold are selected and their corresponding sensitive content metadata is temporarily stored. The sensitive content metadata corresponding to the remaining sensitive words is also temporarily stored.

[0060] S3.6. Classify the remaining sensitive content metadata corresponding to the sensitive words according to their types, and temporarily store the sensitive word results contained in each type.

[0061] S3.7. Perform a statistical analysis of the number of sensitive words in each category obtained in step S3.6. If the number of categories is greater than or equal to a set threshold, calculate the sensitive word score for that category. If the sensitive word score for that category is greater than or equal to the set threshold, add the sensitive content metadata corresponding to that category to the sensitive content metadata corresponding to the super-score sensitive words in step S3.5. Through score design and a series of statistical analyses based on a large amount of real-world data, the final derived calculation method and judgment value are used to filter out potentially high-risk sensitive content. Using highly suspicious content as input parameters and submitting it to an unsupervised text classification model for high-accuracy verification can effectively reduce the occurrence of false positives.
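The suspiciousness screening in steps S3.5 to S3.7 can be illustrated with the sketch below. The text does not give the formula for a category's sensitive word score, so summing the member scores is an assumption, as is reading "the number of categories" as the number of hits within a category. The function name, the record keys, and the parameter names x, n, and m (borrowed from Example 1) are illustrative.

```python
from collections import defaultdict


def select_high_risk(records, x, n, m):
    """records: dicts with 'word', 'category' and 'score' keys (hypothetical)."""
    # S3.5: hits whose own score reaches the threshold x go straight to risk.
    risk = [r for r in records if r["score"] >= x]

    # S3.6: group the remaining hits by sensitive word type.
    by_category = defaultdict(list)
    for r in records:
        if r["score"] < x:
            by_category[r["category"]].append(r)

    # S3.7: a category is promoted only if it has enough hits (n) and its
    # aggregate score reaches m. The sum used here is an assumed aggregate.
    for group in by_category.values():
        if len(group) < n:
            continue
        category_score = sum(r["score"] for r in group)
        if category_score >= m:
            risk.extend(group)
    return risk
```

For example, with x=8, n=2, and m=6, a single hit scored 9 is kept on its own, two hits of the same type scored 3 and 4 are kept together, and a lone hit scored 2 is dropped.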

[0062] S4. Obtain the sensitive content results by performing semantic analysis on the sensitive content metadata from step S3 through unsupervised text classification.

[0063] The specific operation is as follows: input the sensitive content metadata in step S3 one by one into the unsupervised text semantic analysis model, analyze the context information of the sensitive words, and delete the sensitive content metadata corresponding to the sensitive words if it is regular sensitive content. The sensitive content metadata obtained after the analysis is completed is the sensitive content result.

[0064] S5. Use regular expressions to identify privacy-leaking information from the formatted page content obtained in step S2. Specifically, use regular expressions to identify privacy-leaking information from the html2 file.
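A hedged sketch of step S5 is shown below. The patent does not list its regular expressions, so the two patterns here (an email address and an 11-digit mainland China mobile number) are illustrative assumptions, as are the names `PRIVACY_PATTERNS` and `find_privacy_leaks`.

```python
import re

# Illustrative patterns only; a real deployment would define its own set.
PRIVACY_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 11 digits starting with 1 and a second digit 3-9, not embedded in a longer number
    "cn_mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
}


def find_privacy_leaks(html2):
    """Scan the tag-free page text and return one record per match."""
    leaks = []
    for kind, pattern in PRIVACY_PATTERNS.items():
        for match in pattern.finditer(html2):
            leaks.append({
                "kind": kind,
                "value": match.group(),
                "start": match.start(),
            })
    return leaks
```

Running the patterns on the tag-free html2 text rather than on html1 avoids matching values that only appear inside markup, such as attribute values.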

[0065] S6. Output the sensitive content results obtained in step S4 and the privacy leakage information data obtained in step S5.

[0066] The risk identification method of this invention achieves accurate matching by establishing a massive sensitive word database and supporting custom blacklists and whitelists. It uses parameterization to differentiate between uppercase and lowercase sensitive words, and its matching performance and speed do not degrade as the sensitive word database grows. Furthermore, it incorporates unsupervised text classification to screen for false positives, achieving accurate monitoring while significantly improving performance and enabling batch monitoring of all web pages on a website. With this method, the overall monitoring accuracy can be controlled simply by adjusting the sensitive word database scores, adding or deleting sensitive words, or fine-tuning the suspiciousness algorithm.

[0067] As shown in Figure 2, the present invention provides a risk identification system for sensitive web page content, comprising:

[0068] Sensitive word database module: used to store the blacklist of sensitive words, the whitelist of sensitive words, and the score of each sensitive word;

[0069] Page extraction module: Used to extract the content of valid pages;

[0070] Page Sensitive Word Recognition Module: Used to identify sensitive content metadata from the extracted content of valid pages;

[0071] Page privacy leakage information recognition module: used to identify privacy leakage information from the extracted valid pages;

[0072] The output terminals of the page extraction module and the sensitive word database module are both connected to the input terminal of the page sensitive word recognition module. The output terminal of the page extraction module is also connected to the input terminal of the page privacy leakage information recognition module. Through the interrelationships between these modules, the risk identification system of this invention can effectively monitor sensitive words on web pages.

[0073] Furthermore, it also includes:

[0074] Unsupervised content classification module: used to perform semantic analysis on the identified sensitive content metadata;

[0075] Suspiciousness Score Analysis Module: Used to calculate the suspiciousness of identified sensitive content metadata;

[0076] The input of the suspiciousness score analysis module is connected to the output of the page sensitive word recognition module, and the output of the suspiciousness score analysis module is connected to the input of the unsupervised classification module.
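The module connections described above can be pictured as a simple data flow. The sketch below is only a wiring illustration under stated assumptions: the class name `RiskIdentificationSystem`, the field names, and the stub callables in the usage example are hypothetical, and each module is reduced to a plain function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RiskIdentificationSystem:
    words: set                                        # sensitive word database module output
    extract_page: Callable[[str], Optional[str]]      # page extraction module
    recognize_words: Callable[[str, set], list]       # page sensitive word recognition module
    score_suspicion: Callable[[list], list]           # suspiciousness score analysis module
    classify: Callable[[list], list]                  # unsupervised content classification module
    find_privacy: Callable[[str], list]               # page privacy leakage recognition module

    def run(self, raw_page: str):
        page = self.extract_page(raw_page)
        if page is None:                              # invalid page: nothing to report
            return None
        # Extraction and database outputs both feed word recognition.
        hits = self.recognize_words(page, self.words)
        # Recognition -> suspiciousness scoring -> unsupervised classification.
        risk = self.classify(self.score_suspicion(hits))
        # Extraction output also feeds privacy leakage recognition.
        return {"risk": risk, "hide": self.find_privacy(page)}


# Usage with trivial stand-in modules (placeholders, not real detectors).
system = RiskIdentificationSystem(
    words={"bad"},
    extract_page=lambda raw: raw.strip() or None,
    recognize_words=lambda page, words: [w for w in sorted(words) if w in page],
    score_suspicion=lambda hits: hits,
    classify=lambda hits: hits,
    find_privacy=lambda page: [],
)
```

Passing the modules in as callables keeps the wiring separate from any particular matching, scoring, or classification technique.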

[0077] The present invention provides a risk identification device for sensitive web page content, comprising a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of any of the methods described herein.

[0078] Example 1

[0079] The present invention provides a method for risk identification of sensitive content on web pages. First, a sensitive word database is preset, and all sensitive words are assigned scores (e.g., 1 to 9 points) and their corresponding types (e.g., 1 = political, 2 = prohibited).

[0080] After downloading the website pages to be monitored, the page content is input into a risk identification system for sensitive web page content according to this invention for identification, and finally the identification result is output. The identification process of the risk identification system for sensitive web page content according to this invention is as follows:

[0081] S01: Load the sensitive word library (including blacklist and whitelist) and initialize the system.

[0082] S02: Read the page content and determine its validity. If the page is blank, discard it. Otherwise, format the page and output two types of formatted data: one with a browser-compatible appearance, named html1; and the other with all html tags removed, named html2.

[0083] S03: Using the AC automaton algorithm, sensitive words are input into the automaton in batches and compared with html1 to identify all sensitive words contained on the page and their coordinates on the page. The identified results are named result1 and temporarily stored.

[0084] S04: Compare result1 with the sensitive word whitelist and remove the results corresponding to sensitive words that belong to the whitelist.

[0085] S05: Extract the context of each result1 entry in html1 (e.g., the 30 characters before and after the sensitive word) and update result1 accordingly.

[0086] S06: Locate the line number in html1 where each sensitive word in result1 appears, and record that line number in result1.

[0087] S07: Screen out all sensitive words in result1 with a score greater than or equal to x, where x is a set threshold. These are all considered to be highly suspicious sensitive content. Save the result as risk and proceed to the next step of processing the remaining sensitive words.

[0088] S08: Categorize by sensitive word type.

[0089] S09: Determine the total number of all categories:

[0090] If the number of categories is greater than or equal to n, where n is a set threshold, calculate the sensitive word score corresponding to the sensitive words in that category; otherwise, skip the calculation.

[0091] If the sensitive word score for this category is greater than or equal to m, where m is a set threshold, the sensitive content metadata corresponding to the sensitive words in this category is updated in the risk; otherwise, it is discarded.

[0092] S10: Input highly suspicious sensitive content (risk) into the unsupervised text semantic analysis model. The model analyzes the contextual information of sensitive words, removes the results that are normal sensitive content, and only retains and outputs the problematic results.

[0093] S11: Use regular expressions to identify and extract private information from html2, and temporarily save it as hide.

[0094] S12: Combine the results of highly suspicious sensitive content (risk) with the privacy information (hide) in the output.

[0095] The above description is only a preferred embodiment of the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for identifying the risks of sensitive content on a webpage, characterized in that the method includes the following steps:
S1. Establish a sensitive word database, load the sensitive word database, and construct the recognition system context environment;
S2. Read the content of each valid page one by one, perform formatting processing, and then output the formatted page content;
S3. After identifying the formatted page content obtained in step S2 with the sensitive words in the sensitive word library, extract the sensitive content metadata of all data information containing sensitive words;
S4. Obtain the sensitive content results by performing semantic analysis on the sensitive content metadata from step S3 through unsupervised text classification;
S5. Use regular expressions to identify privacy-leaking information data from the formatted page content obtained in step S2;
S6. Output the sensitive content results obtained in step S4 and the privacy leakage information data obtained in step S5;
The specific operation of step S2 is as follows: read the page content; if the page is invalid, discard it directly and read the next page content; if the page is valid, perform formatting to obtain an html1 file containing tags and an html2 file without tags.
Step S3 includes the following steps:
S3.1. Input all sensitive words in the sensitive word library into the AC automaton algorithm in batches, compare them with the content in the html1 file, identify all sensitive words contained in the page and the specific location coordinates of each sensitive word on the page, and temporarily store each sensitive word and its specific location coordinates on the page.
S3.2. The sensitive words in the page temporarily stored in step S3.1 are compared with the sensitive word whitelist in the sensitive word library. The sensitive words that belong to the sensitive word whitelist and their specific location coordinates are removed from the sensitive words in the page temporarily stored.
S3.3. Extract the context of each sensitive word in the page that exists in the page after completing step S3.2, and temporarily store the context of the specific location of the sensitive word.
S3.4. Locate the line numbers of the sensitive words obtained in step S3.2 on the original page, temporarily store the line number results, and finally, the sensitive content metadata is composed of each sensitive word, specific location coordinates, specific location context results and line numbers on the temporarily stored page.

2. The method for risk identification of sensitive webpage content according to claim 1, characterized in that the sensitive word library established in step S1 includes a sensitive word blacklist, a sensitive word whitelist, the type of each sensitive word, and the score of each sensitive word. The sensitive word library is loaded after setting a regular expression to construct the recognition system context environment.

3. The method for risk identification of sensitive webpage content according to claim 1, characterized in that, in step S3, after extracting the sensitive content metadata, the sensitive content metadata is updated after calculating its suspiciousness score. The specific steps are as follows:
S3.5. In the sensitive content metadata obtained in step S3.4, sensitive words whose scores reach or exceed the set threshold are selected and the sensitive content metadata corresponding to the super-score sensitive words is temporarily stored. The sensitive content metadata corresponding to the remaining sensitive words is temporarily stored.
S3.6. Classify the remaining sensitive content metadata corresponding to the sensitive words according to their types, and temporarily store the sensitive word results contained in each type.
S3.7. Perform a statistical analysis on the number of sensitive words in each category obtained in step S3.6. If the number of categories is greater than or equal to a set threshold, calculate the sensitive word score corresponding to the sensitive words in that category. If the sensitive word score of that category is greater than or equal to a set threshold, add the sensitive content metadata corresponding to the sensitive words in that category to the sensitive content metadata corresponding to the super-score sensitive words in step S3.5.

4. The method for risk identification of sensitive webpage content according to claim 3, characterized in that, in step S4, the sensitive content metadata from step S3 is input one by one into the unsupervised text semantic analysis model to analyze the context information of the sensitive words. If the sensitive content metadata corresponding to the sensitive word is regular sensitive content, it is deleted. The sensitive content metadata obtained after the analysis is completed is the sensitive content result.

5. The method for risk identification of sensitive webpage content according to claim 1, characterized in that, in step S5, the privacy-leaking information data is identified from the html2 file using regular expressions.

6. A risk identification system for sensitive webpage content, used in the risk identification method for sensitive webpage content as described in claim 1, characterized in that it includes: Sensitive word database module: used to store the blacklist of sensitive words, the whitelist of sensitive words, and the score of each sensitive word; Page extraction module: used to extract the content of valid pages; Page sensitive word recognition module: used to identify sensitive content metadata from the extracted content of valid pages; Page privacy breach identification module: used to identify privacy breach information from the extracted valid pages; The output of the page extraction module and the output of the sensitive word database module are both connected to the input of the page sensitive word recognition module. The output of the page extraction module is also connected to the input of the page privacy leakage information recognition module.

7. A risk identification system for sensitive webpage content according to claim 6, characterized in that it also includes: Unsupervised text classification module: used to perform semantic analysis on the identified sensitive content metadata; Suspiciousness score analysis module: used to calculate the suspiciousness of identified sensitive content metadata; The input of the suspiciousness score analysis module is connected to the output of the page sensitive word recognition module, and the output of the suspiciousness score analysis module is connected to the input of the unsupervised text classification module.

8. A device for risk identification of sensitive content on a webpage, comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of any of the methods described in claims 1-5.