Cyberspace mapping method based on weakly supervised learning

By employing a cyberspace mapping method based on weakly supervised learning and utilizing a CNN+LSTM model for cyberspace asset identification, this method solves the problem of unclear asset ownership in existing technologies, achieves high-precision cyberspace mapping, and supports cybersecurity incident monitoring and emergency response.

CN117093915B · Active · Publication Date: 2026-06-26 · NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT
Filing Date
2023-09-13
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing cyberspace mapping platforms cannot accurately locate the ownership of assets, and IP classification is insufficient, making it difficult to construct a map that comprehensively describes cyberspace information.

Method used

This study employs a weakly supervised learning approach. By establishing a public cyberspace mapping IP address database, it identifies known IP information, collects information using its own basic resource data, and combines data cleaning, feature extraction, weak label generation, and classification model training to identify the association information of unknown IP addresses. A CNN+LSTM model is used for classification, and the model is optimized to improve accuracy, ultimately achieving the identification of Internet assets.

Benefits of technology

It achieves high-precision cyberspace mapping, quickly identifies the geographic location, industry classification, port service information, and similar attributes of Internet assets, produces high-precision cyberspace maps, and supports network security incident monitoring and emergency response.


Abstract

The application discloses a cyberspace mapping method based on weakly supervised learning, comprising the following steps: S1, establishing a public cyberspace mapping IP address database and identifying known IP information, using self-owned basic resource data to collect information on IPs whose unit attribution is relatively clear; S2, identifying the association information of unknown IP addresses. Using a self-developed asset identification algorithm, the application applies weakly supervised learning to extract website features, produces high-precision asset labels, and maps Internet assets. The mapping mainly covers IP street-level geographic location, industry classification, IP port service information, certificate information, and website feature information. The data is presented by combining the cyberspace map with a vector topographic map. The cyberspace map is infrastructure for digital production, life, and governance in the digital era, and is of great significance for network security event monitoring and analysis, emergency response, and attack tracing.

Description

Technical Field

[0001] This invention relates to the field of cyberspace exploration technology, and in particular to a cyberspace mapping method based on weakly supervised learning.

Background Technology

[0002] With the widespread application of computer network technology, cyberspace has become the fifth strategic space after land, sea, air, and space, and the "second space" for human production and life. Once the network is compromised, almost all social systems will be unable to function or play their role. Furthermore, cyberspace assets are vast and complex, making it difficult to ascertain the characteristics and locations of all IP assets. Therefore, there is an urgent need for efficient management, rational resource allocation, and effective security monitoring and protection of cyberspace.

[0003] Maps are an important carrier of geospatial information and a crucial tool for commanders to strategize and direct operations. However, cyberspace currently lacks a "cyber map" similar to a geospatial map that can comprehensively describe and display cyberspace information. The technology for constructing cyberspace maps is called "cyberspace mapping," which is similar to geospatial mapping.

[0004] Currently, many countries are conducting research on cyberspace asset mapping. Search-engine platforms such as Shodan and Censys abroad, and ZoomEye and Fofa in China, are researching technologies such as cyberspace resource detection and network topology measurement. However, use of these domestic and foreign Internet asset mapping platforms shows that they cannot accurately attribute the assets they detect to an owning unit, and that their IP classification also has many shortcomings.

Summary of the Invention

[0005] The purpose of this invention is to propose a network spatial mapping method based on weakly supervised learning in order to solve the above-mentioned problems.

[0006] To achieve the above objectives, the present invention adopts the following technical solution:

[0007] A network spatial mapping method based on weakly supervised learning includes the following steps:

[0008] S1. Establish a public network space mapping IP address database to identify known IP information; utilize existing basic resource data to collect information on IPs with relatively clear unit affiliations;

[0009] S2. Identify the IP address association information of unknown IP addresses;

[0010] Step S2 includes data collection and cleaning, feature extraction and representation, weak label generation, classification model establishment, model evaluation and optimization, and Internet asset identification.

[0011] Preferably, the data collection and cleaning specifically includes: establishing a private network space mapping IP address database, which stores data from the public network space mapping IP address database determined in step S2; obtaining the IP addresses that host web services from scanned banner information; using a web crawler to crawl website content, including the page title, registration (filing) information, website favicon, website content, and the domain name corresponding to the IP; and cleaning the data to remove useless information and noise.

[0012] Preferably, the feature extraction and representation specifically includes: constructing a network space mapping IP address feature analysis model from feature values such as actively probed domain names, IP addresses, echo strings, port services, registration agencies, and registration times, and extracting key feature values.

[0013] Preferably, the weak label generation specifically includes: generating weak labels manually from the feature values obtained by active detection, and constructing semantic fragments directly from the feature values, that is, combining the extracted feature values into a single string in sequence.

[0014] Preferably, the establishment of the classification model specifically includes: using a model composed of CNN+LSTM, training it in a weakly supervised end-to-end manner, and training a classification model based on the generated weak labels and feature vectors to classify unlabeled data.

[0015] Preferably, the model evaluation and optimization specifically includes: evaluating the model using cross-validation, checking the model's performance and accuracy, adjusting hyperparameters, and optimizing the model using regularization techniques to improve classification accuracy and performance.

[0016] Preferably, the Internet asset identification specifically includes: classifying unlabeled data using a trained model to identify its category, function, and organization, thereby realizing Internet asset identification; and saving feature values such as domain name, IP address, echo string, port service, registration agency, and registration time, as well as labels such as port service and unit affiliation, into a private network space mapping IP address database.

[0017] In summary, due to the adoption of the above technical solution, the beneficial effects of the present invention are:

[0018] 1. This application utilizes a self-developed asset identification algorithm and weakly supervised learning to extract website features, create high-precision asset tags, and rapidly perform spatial mapping of internet assets. The mapping primarily includes IP street-level geographic location, industry classification, IP port service information, certificate information, and website characteristic information. The data is ultimately presented by combining spatial mapping maps with vector topographic maps. As a fundamental infrastructure for realizing digital production, life, and governance in the digital age, cyberspace mapping maps are of great significance for providing network security incident monitoring and analysis, emergency response, and attack tracing.

Attached Figure Description

[0019] Figure 1 is a flowchart of a network spatial mapping method based on weakly supervised learning according to an embodiment of the present invention.

Detailed Implementation

[0020] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0021] Referring to Figure 1, the present invention provides the following technical solution:

[0022] A network spatial mapping method based on weakly supervised learning includes the following steps:

[0023] S1. Establish a public network space mapping IP address database to identify known IP information; utilize existing basic resource data to collect information on IPs with relatively clear unit affiliations;

[0024] S2. Identify the IP address association information of unknown IP addresses;

[0025] Step S2 includes data collection and cleaning, feature extraction and representation, weak label generation, classification model establishment, model evaluation and optimization, and Internet asset identification.
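The split between steps S1 and S2 can be sketched as a lookup followed by a fallback. The following is a minimal illustration only: the patent does not describe the database layout, so the CIDR-keyed dictionary `KNOWN_BLOCKS`, the owner names, and the helper names `lookup_known_owner` and `route` are all hypothetical, and the addresses are taken from reserved documentation ranges.

```python
import ipaddress

# Hypothetical stand-in for the public mapping IP address database:
# CIDR block -> owning unit. Real contents and schema are not given in the patent.
KNOWN_BLOCKS = {
    ipaddress.ip_network("192.0.2.0/24"): "Example University",
    ipaddress.ip_network("198.51.100.0/24"): "Example Hospital",
}


def lookup_known_owner(ip):
    """Return the owning unit if the IP falls in a known block, else None."""
    addr = ipaddress.ip_address(ip)
    for network, owner in KNOWN_BLOCKS.items():
        if addr in network:
            return owner
    return None


def route(ip):
    """S1 resolves known IPs directly; everything else is handed to S2."""
    owner = lookup_known_owner(ip)
    if owner is not None:
        return ("S1", owner)
    return ("S2", None)
```

Under these assumptions, `route("192.0.2.17")` resolves in S1, while an address outside every known block is passed on to the S2 pipeline described in the following paragraphs.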

[0026] Specifically, as shown in Figure 1, the data collection and cleaning process includes: establishing a private network space mapping IP address database, which stores data from the public network space mapping IP address database determined in step S2; obtaining the IP addresses that host web services from scanned banner information; using web crawlers to crawl website content, including page titles, registration information, website favicons, website content, and the domain names corresponding to the IPs; and cleaning the data to remove useless information and noise.
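As an illustration of the cleaning step, the sketch below extracts a page title and visible text from crawled HTML while discarding script and style content as noise. It uses only the Python standard library. The patent does not specify how cleaning is performed, so the class `PageTextExtractor`, the function `clean_page`, and the choice of what counts as noise are assumptions.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects the <title> text and visible text, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # treat script and style content as noise
        if self._in_title:
            self.title += data
        else:
            self.text_parts.append(data)


def clean_page(html):
    """Return the page title and whitespace-normalized visible text."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    content = " ".join(" ".join(parser.text_parts).split())
    return {"title": " ".join(parser.title.split()), "content": content}
```

A real crawler would also need to fetch the favicon, registration information, and domain name; those parts are omitted here because the patent gives no detail on them.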

[0027] Specifically, as shown in Figure 1, feature extraction and representation includes: constructing a network space mapping IP address feature analysis model from feature values such as actively probed domain names, IP addresses, echo strings, port services, registration agencies, and registration times, and extracting the key feature values shown in Table 1.

[0028] Table 1. Characteristics of extracted cyberspace assets

[0029]

| No. | Feature | Description | Data type |
|-----|---------|-------------|-----------|
| 1 | domain | Domain name | String |
| 2 | ports | Number of ports | Integer |
| 3 | cn | Country of service registration | String |
| 4 | org | Service registrant | String |
| 5 | server | Server service type | String |
| 6 | banner | Header value returned when the page is accessed | String |
| 7 | Cert_subject | Certificate subject (recipient of issuance) | String |
| 8 | Cert_issuer | Certificate issuer | String |
| 9 | Cert_time | Certificate validity period | Integer |
| 10 | jarm | JARM fingerprint | String |
| 11 | Cert_valid | Whether the certificate is self-signed (true if yes, false if no) | Boolean |
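The features in Table 1 could be held in a typed record such as the one below. This is a sketch, not the patent's data model: the class name `AssetFeatures` and the lowercase field names are illustrative, the sample values are invented, and the unit of `cert_time` is not stated in the source.

```python
from dataclasses import dataclass, asdict


@dataclass
class AssetFeatures:
    """One probed asset, with fields following Table 1 (names lowercased)."""
    domain: str        # domain name
    ports: int         # number of ports
    cn: str            # country of service registration
    org: str           # service registrant
    server: str        # server service type
    banner: str        # header value returned when the page is accessed
    cert_subject: str  # certificate subject
    cert_issuer: str   # certificate issuer
    cert_time: int     # certificate validity period (unit unspecified in the source)
    jarm: str          # JARM fingerprint
    cert_valid: bool   # per Table 1: True if the certificate is self-signed


# Invented sample values, for illustration only.
sample = AssetFeatures(
    domain="clinic.example", ports=3, cn="CN", org="Example Clinic",
    server="nginx", banner="HTTP/1.1 200 OK", cert_subject="clinic.example",
    cert_issuer="Example CA", cert_time=365, jarm="placeholder-fingerprint",
    cert_valid=False,
)
```

Calling `asdict(sample)` yields a plain dictionary in field order, which is convenient for the string-concatenation step described next.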

[0030] Weak label generation specifically includes: generating weak labels manually from the feature values obtained by active detection. Because many of the feature dimensions extracted above are string types, and in order to improve recognition accuracy, this invention constructs semantic fragments directly from the feature values, that is, it combines the extracted feature values into a single string in sequence.
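Combining the feature values into one string in sequence might look like the sketch below. The field order follows Table 1; the separator, the handling of missing values, and the names `FEATURE_ORDER` and `build_semantic_fragment` are assumptions, since the patent only states that the values are joined in order.

```python
# Field order taken from Table 1; names lowercased for illustration.
FEATURE_ORDER = [
    "domain", "ports", "cn", "org", "server", "banner",
    "cert_subject", "cert_issuer", "cert_time", "jarm", "cert_valid",
]


def build_semantic_fragment(features, sep=" "):
    """Join the available feature values, in Table 1 order, into one string."""
    parts = []
    for name in FEATURE_ORDER:
        value = features.get(name)
        if value is None or value == "":
            continue  # assumption: missing features are simply skipped
        if isinstance(value, bool):
            parts.append(str(value).lower())
        else:
            parts.append(str(value))
    return sep.join(parts)


fragment = build_semantic_fragment({
    "domain": "clinic.example",
    "ports": 2,
    "org": "Example Clinic",
    "server": "nginx",
    "cert_valid": False,
})
# fragment == "clinic.example 2 Example Clinic nginx false"
```

Keeping the order fixed means the same feature always appears at the same relative position in the fragment, which is what lets a sequence model treat position as meaningful.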

[0031] The specific steps of building a classification model include: using a model composed of CNN and LSTM, training it in a weakly supervised end-to-end manner, and training a classification model based on the generated weak labels and feature vectors to classify unlabeled data.
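One possible shape for a CNN+LSTM classifier over semantic fragments is sketched below in PyTorch. The patent names the two components but gives no layer sizes, tokenization, or training details, so everything here is an assumption: the byte-level encoding in `encode`, the hyperparameters, the number of classes, and the class name `CnnLstmClassifier`. Training on weak labels would use an ordinary classification loss such as cross-entropy and is not shown.

```python
import torch
from torch import nn


class CnnLstmClassifier(nn.Module):
    """Embedding -> 1D convolution -> LSTM -> linear head (illustrative sizes)."""

    def __init__(self, vocab_size=256, embed_dim=32, conv_channels=64,
                 hidden_size=64, num_classes=4):
        super().__init__()
        # Index 0 is reserved for padding, so byte b is stored as b + 1.
        self.embed = nn.Embedding(vocab_size + 1, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)              # (batch, seq, embed_dim)
        x = self.conv(x.transpose(1, 2))       # (batch, conv_channels, seq)
        x = torch.relu(x).transpose(1, 2)      # (batch, seq, conv_channels)
        _, (h_n, _) = self.lstm(x)             # h_n: (1, batch, hidden_size)
        return self.head(h_n[-1])              # (batch, num_classes) logits


def encode(fragment, max_len=128):
    """Byte-level encoding of a semantic fragment, truncated and zero-padded."""
    ids = [b + 1 for b in fragment.encode("utf-8")[:max_len]]
    return ids + [0] * (max_len - len(ids))


model = CnnLstmClassifier()
batch = torch.tensor([encode("clinic.example 2 nginx"), encode("uni.example 5 apache")])
logits = model(batch)  # one row of class scores per fragment
```

In this arrangement the convolution picks up local character patterns (for example, a server name or a domain suffix) and the LSTM summarizes them across the whole fragment before classification.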

[0032] Model evaluation and optimization specifically include: evaluating the model using cross-validation, checking the model's performance and accuracy, adjusting hyperparameters, and optimizing the model using regularization techniques to improve classification accuracy and performance.
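The cross-validation mentioned above can be illustrated with a plain k-fold index split. This is a generic, standard-library sketch rather than the patent's procedure: the fold count, the lack of shuffling, and the function name `k_fold_indices` are assumptions, and an established library implementation would normally be preferred in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Each sample lands in exactly one validation fold, so averaging a metric such as accuracy over the folds gives an estimate that can guide hyperparameter adjustment and the choice of regularization strength.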

[0033] Internet asset identification specifically includes: using the trained model to classify unlabeled data and identify its category, function, organization, and so on. The model outputs multi-class results, including port service and unit affiliation, to realize Internet asset identification. Feature values such as domain name, IP, echo string, port service, registration agency, and registration time, together with labels such as port service and unit affiliation, are saved into the private network space mapping IP address database.
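Saving identified assets into the private database could be sketched with SQLite as follows. The patent does not name a storage engine or schema, so the table name `asset`, the column names, and the helpers `open_private_db` and `save_asset` are all hypothetical, and the record values are invented.

```python
import sqlite3


def open_private_db(path=":memory:"):
    """Open (or create) a hypothetical private mapping database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS asset (
               ip TEXT PRIMARY KEY,
               domain TEXT,
               banner TEXT,
               port_service TEXT,
               registrar TEXT,
               registered_at TEXT,
               label_port_service TEXT,
               label_unit TEXT
           )"""
    )
    return conn


def save_asset(conn, record):
    """Insert or update one asset: its feature values plus the predicted labels."""
    conn.execute(
        """INSERT OR REPLACE INTO asset VALUES (
               :ip, :domain, :banner, :port_service,
               :registrar, :registered_at, :label_port_service, :label_unit
           )""",
        record,
    )
    conn.commit()


conn = open_private_db()
save_asset(conn, {
    "ip": "203.0.113.5", "domain": "clinic.example", "banner": "HTTP/1.1 200 OK",
    "port_service": "443/https", "registrar": "Example Registrar",
    "registered_at": "2020-01-01", "label_port_service": "web",
    "label_unit": "Example Clinic",
})
```

Keying the table on the IP address means a later scan of the same address overwrites the earlier row instead of duplicating it.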

[0034] This application presents a cyberspace mapping method based on weakly supervised learning, which profiles basic IP resources. Specific content includes host attributes, network attributes, security attributes, industry attributes, and attribution attributes. By labeling IP addresses, and using weakly supervised feature and behavior clustering, information flow, or structural association analysis to mine unlabeled data, the method identifies pattern information in unknown entity data, seeks similarities and associations with known types of entities, and determines the attribution information of unknown entities.
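The idea of seeking similarity between unknown and known entities can be illustrated with a simple token-overlap measure. This is a deliberately simplified stand-in: the patent refers to clustering and association analysis without specifying a method, so the Jaccard similarity used here, the whitespace tokenization, and the names `tokens`, `jaccard`, and `nearest_known` are assumptions chosen for brevity.

```python
def tokens(fragment):
    """Lowercased whitespace tokens of a semantic fragment."""
    return set(fragment.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_known(unknown_fragment, known):
    """Return (owner, score) for the known entity most similar to the unknown one."""
    unknown = tokens(unknown_fragment)
    best_owner, best_score = None, 0.0
    for owner, fragment in known.items():
        score = jaccard(unknown, tokens(fragment))
        if score > best_score:
            best_owner, best_score = owner, score
    return best_owner, best_score


# Invented fragments for two known entities.
known_entities = {
    "Example Clinic": "clinic.example nginx example clinic ca",
    "Example University": "uni.example apache example university ca",
}
owner, score = nearest_known("portal.clinic.example nginx example clinic ca", known_entities)
```

In practice a score threshold would be needed so that an unknown entity with only weak overlap is left unattributed instead of being assigned to the nearest known unit.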

[0035] Through a self-developed asset identification algorithm, using weakly supervised learning to extract website features, and creating high-precision asset tags, the system rapidly performs spatial mapping of internet assets. The mapping primarily includes IP street-level geographic location, industry classification, IP port service information, certificate information, and website characteristic information. The data is ultimately presented by combining spatial mapping maps with vector topographic maps. As a fundamental infrastructure for realizing digital production, life, and governance in the digital age, cyberspace mapping is of great significance for providing network security incident monitoring and analysis, emergency response, and attack tracing.

[0036] The above description of the embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A network spatial mapping method based on weakly supervised learning, characterized in that it includes the following steps:

S1. Establish a public cyberspace mapping IP address database and identify known IP information; utilize existing basic resource data to collect information on IPs with relatively clear unit affiliations;

S2. Identify the IP address association information of unknown IP addresses;

Step S2 includes data collection and cleaning, feature extraction and representation, weak label generation, classification model building, model evaluation and optimization, and Internet asset identification;

The data collection and cleaning specifically includes: establishing a private network space mapping IP address database, which stores data from the public network space mapping IP address database determined in step S2; obtaining IP addresses with web services through scanned banner information; using web crawlers to crawl website content, including page titles, registration information, website favicons, website content, and domain name information corresponding to the IP addresses; and cleaning the data to remove useless information and noise;

The feature extraction and representation specifically includes: constructing a network space mapping IP address feature analysis model from the feature values of actively probed domain names, IPs, echo strings, port services, registration agencies, and registration times, and extracting key feature values;

The weak label generation specifically includes: generating weak labels manually from the feature values obtained by active detection, and constructing semantic fragments directly from the feature values, that is, combining the extracted feature values into a string in sequence;

The establishment of the classification model specifically includes: using a model composed of CNN+LSTM, training it in a weakly supervised end-to-end manner, and training a classification model based on the generated weak labels and feature vectors to classify unlabeled data.

2. The network spatial mapping method based on weakly supervised learning according to claim 1, characterized in that the model evaluation and optimization specifically includes: evaluating the model using cross-validation, checking the model's performance and accuracy, adjusting hyperparameters, and optimizing the model using regularization techniques to improve classification accuracy and performance.

3. The network spatial mapping method based on weakly supervised learning according to claim 2, characterized in that the Internet asset identification specifically includes: using a trained model to classify unlabeled data and identify its category, function, and organizational information, thereby realizing Internet asset identification; and saving the feature values of domain name, IP address, echo string, port service, registration agency, and registration time, together with the port service and unit affiliation labels, into a private network space mapping IP address database.