A network asset identification method and system based on page text content
A deep semantic training architecture based on the Poly-encoder model, combined with an enterprise feature vector library, addresses misjudgment and missed judgment in network asset identification, enabling efficient, real-time asset identification that meets enterprise-level management needs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUNAN GUOKE YICUN INFORMATION TECH CO LTD
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
AI Technical Summary
Existing network asset identification technologies suffer from a trade-off between semantic understanding accuracy and large-scale data processing efficiency, leading to misjudgments or omissions of asset ownership and failing to meet real-time requirements.
We employ a deep semantic training architecture based on the Poly-encoder model, combined with domain fine-tuning of the pre-trained model, to extract text semantic features by parsing the page DOM tree structure, and use an enterprise feature vector library for fast similarity calculation to achieve accurate identification of implicit attribution relationships.
The method significantly reduces the false positive and false negative rates of asset identification, meets the real-time requirements of enterprise-level large-scale asset management, and reduces the time required for matching tens of millions of data points to the second level.
Figure CN122197906A_ABST
Description
Technical Field
[0001] This invention relates to the field of network security technology, specifically to a method and system for identifying network assets based on page text content.
Background Technology
[0002] With the deep integration of internet technology and digital business, various online services such as websites, mini-programs, and mobile portals have become key carriers for enterprise operations and external services. This has led to an explosive growth in the quantity and variety of online assets owned and managed by enterprises, making the asset environment increasingly complex. However, many enterprises' asset management systems often lag behind the rapid iteration and expansion of their businesses, resulting in a large number of newly added, abandoned, or changed assets not being included in the management scope in a timely and accurate manner, creating significant blind spots in security supervision. These "shadow assets," lacking effective security monitoring and maintenance, are highly susceptible to becoming entry points for cyberattacks, thus posing a serious threat to the enterprise's core data and business continuity.
[0003] Currently, mainstream automated network asset identification technologies primarily rely on the analysis of asset page text content and can be broadly categorized into two methods. The first is a mechanical comparison method based on rule-based keywords. This method requires security operations personnel to predefine a series of regular expressions or keyword libraries related to company names and product identifiers, and then match them directly against the scanned page text to determine asset ownership. However, in actual web pages, asset ownership is often not stated explicitly through the company's official full name, but is instead implied by company aliases, industry-specific terms, business scenario descriptions, or even specific cultural references. For example, a company might use project codes instead of its official name in its internal systems. This type of semantic information lacks explicit rule features, making it difficult for methods based on keyword weight calculations or fixed regular expressions to capture it accurately, which inevitably results in a high rate of false positives (mistaking third-party assets for company assets) and false negatives (failing to identify assets belonging to the company).
[0004] The second category is recognition methods based on traditional semantic models. These methods, such as early semantic matching models based on interactive encoding, attempt to understand the semantics of text through the model and can handle the aforementioned implicit relationships to some extent. However, when faced with real-time recognition of dynamic, multi-billion-level asset data at internet scale, their architectural flaws become apparent. To calculate the similarity between a page to be identified and a massive number of enterprise tags, traditional models typically need to concatenate the text to be identified with each enterprise tag, or encode the two jointly with deep interaction. This means that a single query requires tens of millions of complex model calculations. This computational mode results in enormous processing overhead and extremely high recognition latency, and cannot meet the minute-level or even second-level real-time response requirements of enterprise-level asset management.
[0005] Therefore, existing technical solutions present an irreconcilable contradiction between the accuracy of semantic understanding and the efficiency of large-scale data processing. Enterprises urgently need a network asset identification solution that can deeply integrate deep semantic understanding capabilities with an efficient retrieval architecture to accurately capture complex and implicit asset ownership relationships while achieving real-time, efficient inventory and identification of massive amounts of assets.
Summary of the Invention
[0006] The technical problem to be solved by the present invention is to provide a network asset identification method and system based on page text content, in order to solve the problems of misjudgment or omission of asset ownership due to the difficulty in capturing semantic information, and the difficulty in meeting the requirements of efficient identification due to insufficient real-time processing capability of large-scale data.
[0007] To solve the above-mentioned technical problems, the technical solution adopted by the method of the present invention includes the following steps:
S1. Obtain the page text content of the target network asset, parse it based on the page DOM tree structure to filter out noisy data, and extract the text to be analyzed, including at least the business description area and the copyright information area;
S2. Extract text semantic feature vectors from the text to be analyzed through a pre-trained network asset recognition model. The network asset recognition model is a semantic matching model based on the Poly-encoder architecture and trained with network asset domain data, used to encode semantic features representing its business affiliation information from the text;
S3. Calculate the similarity between the text semantic feature vector and the feature vectors of each enterprise stored in the preset enterprise feature vector library, wherein each enterprise feature vector in the enterprise feature vector library is associated with an enterprise tag;
S4. Based on a preset similarity threshold, select the enterprise tag with the highest similarity score that meets the threshold from all enterprise tags associated with the enterprise feature vector library, and use it as the enterprise affiliation of the target network asset.
[0008] As a further improvement to the method of the present invention, the parsing based on the page DOM tree structure in step S1 includes:
S11. Parse the page DOM tree, dynamically determine the business description area based on tag nesting depth, CSS class name characteristics and HTML5 semantic tags, and extract the text within the area;
S12. Construct a copyright information feature matrix, the features including legal keywords, symbol features and date format; based on the above features, use a pre-trained classification model to perform binary classification on each block in the DOM tree, identify the copyright information area and extract its text content;
S13. Concatenate the extracted business area text and copyright information area text to form the text to be analyzed.
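Step S11 can be illustrated with a minimal sketch built on Python's standard `html.parser`. It is not the patented implementation: it keeps only text nested inside an element that looks like a business description area, and it omits the tag-nesting-depth signal for brevity. The class-name keywords, the tag sets, and the names `BusinessAreaExtractor` and `extract_business_text` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumed hints for this sketch; a real system would tune or learn them.
CLASS_HINTS = ("main-content", "article", "business")
SEMANTIC_TAGS = {"main", "article"}
NOISE_TAGS = {"script", "style", "nav", "noscript"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BusinessAreaExtractor(HTMLParser):
    """Collects visible text located inside a likely business description area."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one (tag, is_business, is_noise) entry per open element
        self.chunks = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no closing tag, so they are not tracked
        classes = (dict(attrs).get("class") or "").lower()
        is_business = tag in SEMANTIC_TAGS or any(h in classes for h in CLASS_HINTS)
        self.stack.append((tag, is_business, tag in NOISE_TAGS))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed child elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        in_business = any(entry[1] for entry in self.stack)
        in_noise = any(entry[2] for entry in self.stack)
        if in_business and not in_noise:
            self.chunks.append(text)


def extract_business_text(html):
    """Returns the space-joined text found inside business description areas."""
    parser = BusinessAreaExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Navigation menus, scripts and footers fall outside the detected area and are dropped, which corresponds to the noise filtering described in step S1.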
[0009] As a further improvement to the method of the present invention, in step S2, the training method of the network asset identification model is as follows: A pre-trained language model whose weight parameters have been adjusted according to the network asset domain is used as the basic feature extractor, and the Poly-encoder model is trained using the network asset dataset. The network asset dataset includes positive samples and negative samples. The positive samples are association pairs between a page text and the tag of the enterprise to which it actually belongs, and the negative samples are association pairs between a page text and the "third-party enterprise" tag. Through contrastive learning on the positive and negative samples, the parameters of the network asset recognition model are optimized, enabling the model to accurately calculate the semantic similarity between page text and enterprise tags. Accuracy metrics are monitored on a validation set to produce the final network asset identification model.
[0010] As a further improvement to the method of the present invention, the pre-trained language model whose weight parameters are adjusted according to the network asset domain is obtained by the following method: Construct similar text pairs in the network asset domain and randomly divide them into a training set, a validation set and a test set according to a predetermined ratio; load a general language model and, using the training set, optimize its parameters with a contrastive learning strategy, the general language model including a BERT, ERNIE, or BGE model. During training, for the data within a batch, the model updates its weights by optimizing the following contrastive learning loss function:
$$\mathcal{L}_i = -\log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+})/\tau\right)}$$
where $h_i$ and $h_i^{+}$ are the vector representations of the anchor text and the positive sample text, respectively; $h_j^{+}$ is the vector representation of the $j$-th sample within the batch; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; $\tau$ is the temperature coefficient; and $N$ is the batch size. During the adjustment of the weight parameters, the model performance is monitored using the validation set, and model parameters are generated. The model's generalization ability is evaluated on the test set, and the model that passes the evaluation and has the best performance is determined as the final domain fine-tuned model for subsequent recognition model training.
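To make the in-batch contrastive objective concrete, the following is a minimal pure-Python sketch. It assumes the common formulation in which each anchor is scored against every positive in the batch using cosine similarity divided by a temperature, and the per-anchor losses are averaged. The function names and the default temperature of 0.05 are illustrative assumptions, not values taken from the patent.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Batch mean of -log(softmax probability of each anchor's own positive)."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        max_logit = max(logits)  # log-sum-exp shift for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_denominator - logits[i]
    return total / n
```

The loss approaches zero when every anchor is closest to its own positive, and grows when another sample in the batch is closer, which is what pushes pages of the same enterprise together in the embedding space.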
[0011] As a further improvement to the method of the present invention, the similar text pairs in the network asset domain are constructed by the following method: From the preprocessed web asset page text, all page texts with the same enterprise tag are selected and paired to construct multiple positive sample pairs. Each positive sample pair indicates that the two texts are semantically similar and belong to the same enterprise. Subsequently, all positive sample pairs are randomly divided into training set, validation set and test set according to a predetermined ratio. The preprocessing involves parsing the web asset page text based on the page DOM tree structure to filter noisy data and extract the text to be analyzed, which includes at least the business description area and the copyright information area.
[0012] As a further improvement to the method of the present invention, the method for generating the preset enterprise feature vector library is as follows: Each enterprise tag to be identified is input into the network asset identification model, mapped to a corresponding enterprise feature vector, and all enterprise feature vectors and their associated enterprise tags are stored offline in the vector database.
[0013] As a further improvement to the method of the present invention, the similarity score is obtained by calculating the cosine similarity, and the calculation formula is as follows:
$$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function, $A$ is the vector of the page text content, $B$ is the enterprise feature vector, $n$ is the vector dimension, and $A_i$ and $B_i$ are the components of the two vectors in the $i$-th dimension.
[0014] As a further improvement to the method of the present invention, in step S4, the enterprise tag with the highest similarity score that meets the threshold is selected according to the following formula:
$$\mathrm{label} = \arg\max_{k \in E} \mathrm{sim}(A, B_k)$$
[0015] Here, $\mathrm{label}$ represents the enterprise tag; $\arg\max$ returns the index position corresponding to the maximum cosine similarity, i.e., the enterprise tag number; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; $A$ is the vector of the page text content; $B_k$ is the enterprise feature vector associated with tag $k$; and $E$ is the set of all enterprise tags in the enterprise feature tower.
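The selection rule of steps S3 and S4 can be sketched as follows: score the page vector against every enterprise vector with cosine similarity, take the best-scoring tag, and accept it only if the score meets the threshold. The in-memory list, the function names and the 0.8 default threshold are assumptions for illustration; a deployment at the scale described would query a vector database rather than loop in Python.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_enterprise(text_vector, library, threshold=0.8):
    """library: list of (enterprise_tag, enterprise_vector) pairs.

    Returns (tag, score) for the best match, or (None, score) when the
    best score does not reach the threshold.
    """
    best_tag, best_score = None, -1.0
    for tag, vector in library:
        score = cosine_similarity(text_vector, vector)
        if score > best_score:
            best_tag, best_score = tag, score
    if best_score >= threshold:
        return best_tag, best_score
    return None, best_score
```

Returning no tag below the threshold is what keeps third-party pages from being forced onto the nearest enterprise.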
[0016] This invention also provides a network asset identification system based on page text content, used to implement the above-mentioned network asset identification method based on page text content, comprising:
A data preprocessing module, used to obtain the page text content of the target network asset, parse it based on the page DOM tree structure, filter out noisy data, and extract the text to be analyzed, including at least the business description area and the copyright information area;
A text feature extraction module, used to extract text semantic feature vectors from the text to be analyzed through a pre-trained network asset recognition model. The network asset recognition model is a semantic matching model based on the Poly-encoder architecture and trained with network asset domain data, which is used to encode semantic features that represent the business affiliation information of the text;
A similarity calculation module, used to calculate the similarity between the text semantic feature vector and the feature vectors of each enterprise stored in the preset enterprise feature vector library, wherein each enterprise feature vector in the enterprise feature vector library is associated with an enterprise tag;
A result selection module, used to select, based on a preset similarity threshold and from all enterprise tags associated with the enterprise feature vector library, the enterprise tag with the highest similarity score that meets the threshold, as the enterprise affiliation of the target network asset.
[0017] The present invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the above-described method for identifying network assets based on page text content.
[0018] Compared with the prior art, the advantages of the present invention are as follows: This invention utilizes a deep semantic training architecture based on the Poly-encoder model, combined with domain fine-tuning of the pre-trained model, to effectively capture deep semantic information such as business models and industry terms implicit in page text. This overcomes the shortcomings of traditional keyword matching rules in identifying implicit attribution relationships, significantly reducing the false positive and false negative rates in asset identification. Simultaneously, by constructing an offline-online dual-stage processing mechanism called the "Enterprise Feature Tower," the computationally intensive enterprise feature encoding process is completed offline and cached in a vector database. During online identification, only the vector of the text under test needs to be calculated and then matched against the vectors in the database by fast cosine similarity retrieval. This compresses the matching time of tens of millions of data points to the second level, effectively solving the problem of poor real-time performance of traditional semantic models and meeting the timeliness requirements of large-scale enterprise asset management.
Attached Figure Description
[0019] Figure 1 is a flowchart illustrating an embodiment of the present invention.
[0020] Figure 2 is a schematic diagram illustrating the process of generating a preset enterprise feature vector library in an embodiment of the present invention.
[0021] Figure 3 is a schematic diagram illustrating the training process of the network asset identification model in an embodiment of the present invention.
Detailed Implementation
[0022] The present invention will be further described below with reference to the accompanying drawings and specific preferred embodiments, but this does not limit the scope of protection of the present invention.
[0023] As shown in Figure 1, the technical solution adopted in this embodiment includes the following steps:
S1. Obtain the page text content of the target network asset, parse it based on the page DOM tree structure to filter out noisy data, and extract the text to be analyzed, including at least the business description area and the copyright information area. In practice, the page text content of the target network asset includes: web asset page text data, namely the visible text content extracted after parsing the HTML source code of the target website; and text data of mobile portal and mini-program asset pages, namely the text content obtained after page rendering by simulating mobile access.
S2. Extract text semantic feature vectors from the text to be analyzed through a pre-trained network asset recognition model. The network asset recognition model is a semantic matching model based on the Poly-encoder architecture and trained with network asset domain data, used to encode semantic features representing its business affiliation information from the text. In specific implementation, the network asset identification model is a dual-tower structure model based on Poly-encoder. The dual-tower structure includes a text tower and an enterprise tower, and the two towers share model parameters. The text tower processes page text and, through multiple encoders, captures the deep semantics of the input text simultaneously from multiple dimensions such as lexical keywords, syntactic business patterns and chapter-level scene features. After feature fusion and attention mechanism processing, the semantic representation that is strongly related to enterprise business is strengthened.
S3. Calculate the similarity between the text semantic feature vector and the feature vectors of each enterprise stored in the preset enterprise feature vector library, wherein each enterprise feature vector in the enterprise feature vector library is associated with an enterprise tag;
S4. Based on a preset similarity threshold, select the enterprise tag with the highest similarity score that meets the threshold from all enterprise tags associated with the enterprise feature vector library, and use it as the enterprise affiliation of the target network asset.
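The text does not spell out the internal layers of the text tower. For orientation only, the sketch below follows the Poly-encoder design as published in the literature, which may differ from the variant used here: a small set of learned code vectors each attend over the token vectors of the page to produce several global features, and a candidate vector then attends over those features before the final score is taken. Token vectors are assumed to come from an upstream encoder; all names and the toy dimensions are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, keys):
    """Dot-product attention: softmax-weighted average of `keys` under `query`."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def poly_encode(token_vectors, codes):
    """Stage 1, once per page: one global feature per learned code vector."""
    return [attend(code, token_vectors) for code in codes]


def poly_score(global_features, candidate_vector):
    """Stage 2, per candidate: attend over the global features, then dot product."""
    context_vector = attend(candidate_vector, global_features)
    return dot(context_vector, candidate_vector)
```

The expensive pass over the page tokens happens once in `poly_encode`; each candidate then costs only an attention over a handful of vectors, which is the property that makes this family of models attractive for large candidate sets.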
[0024] This embodiment, through a deep semantic training architecture based on the Poly-encoder model and combined with domain fine-tuning of the pre-trained model, can effectively capture deep semantic information such as business models and industry terms hidden in the page text, overcome the shortcomings of traditional keyword matching rules in being unable to identify implicit attribution relationships, and significantly reduce the false positive and false negative rates of asset identification. Meanwhile, by constructing an "enterprise feature tower" offline-online dual-stage processing mechanism, the computationally intensive enterprise feature encoding process is completed offline and cached in the vector database. During online recognition, only the vector of the text to be tested needs to be calculated and a fast cosine similarity retrieval is performed with the vector in the database. This reduces the matching time of tens of millions of data points to the second level, effectively solving the problem of poor real-time performance of traditional semantic models and meeting the timeliness requirements of large-scale enterprise asset management.
[0025] In a specific application example, the parsing based on the page DOM tree structure described in step S1 includes:
S11. Parse the page DOM tree, dynamically determine the business description area based on tag nesting depth, CSS class name characteristics and HTML5 semantic tags, and extract the text within the area; the CSS class name features include class names containing the keywords "main-content", "article", and "business"; the HTML5 semantic tags include the <main> and <article> tags;
S12. Construct a copyright information feature matrix, the features including legal keywords, symbol features and date format; based on the above features, use a pre-trained classification model to perform binary classification on each block in the DOM tree, identify the copyright information area and extract its text content; the legal keywords typically include "Copyright", "All Rights Reserved", "ICP", and "Filing Number"; the symbol features include "©", "®", and "™"; and the date format includes a year string conforming to the format "YYYY-YYYY" or "YYYY".
[0026] S13, the extracted business area text and copyright information area text are concatenated to form the text to be analyzed.
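Steps S12 and S13 can be sketched as a small feature extractor followed by concatenation. The keyword list, symbols and year formats mirror the examples given in the embodiment; the rule in `looks_like_copyright` is only a hypothetical stand-in for the pre-trained binary classifier, and every function name is an assumption made for this example.

```python
import re

LEGAL_KEYWORDS = ("copyright", "all rights reserved", "icp", "filing number")
SYMBOLS = ("©", "®", "™")
# Matches "YYYY" or "YYYY-YYYY" for years starting with 19 or 20.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}(?:\s*-\s*(?:19|20)\d{2})?\b")


def copyright_features(block_text):
    """One feature row per DOM block: [keyword hits, symbol hits, has year]."""
    lowered = block_text.lower()
    keyword_hits = sum(1 for k in LEGAL_KEYWORDS if k in lowered)
    symbol_hits = sum(1 for s in SYMBOLS if s in block_text)
    has_year = 1 if YEAR_PATTERN.search(block_text) else 0
    return [keyword_hits, symbol_hits, has_year]


def looks_like_copyright(block_text):
    """Hypothetical rule standing in for the trained binary classifier."""
    keyword_hits, symbol_hits, has_year = copyright_features(block_text)
    return (keyword_hits + symbol_hits) >= 1 and has_year == 1


def build_text_to_analyze(business_text, blocks):
    """S13: concatenate business area text with detected copyright blocks."""
    copyright_text = " ".join(b for b in blocks if looks_like_copyright(b))
    return " ".join(part for part in (business_text, copyright_text) if part)
```

Stacking the rows returned by `copyright_features` for all blocks of a page gives a feature matrix of the kind that a trained classifier would consume in place of the hand-written rule.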
[0027] In a specific application example, as shown in Figure 3, in step S2, the training method for the network asset identification model is as follows: A pre-trained language model whose weight parameters have been adjusted according to the network asset domain is used as the basic feature extractor, and the Poly-encoder model is trained using the network asset dataset. The network asset dataset includes positive samples and negative samples. Positive samples are association pairs between a page text and the tag of the enterprise to which it actually belongs, in the format (page text, enterprise tag), and are labeled as positive samples. Negative samples are association pairs between a page text and the "third-party enterprise" tag, in the format (page text, third-party enterprise), and are labeled as negative samples. Through contrastive learning on the positive and negative samples, the parameters of the network asset recognition model are optimized, enabling the model to accurately calculate the semantic similarity between page text and enterprise tags. Accuracy metrics are monitored on a validation set to produce the final network asset identification model.
[0028] In a specific application example, the pre-trained language model whose weight parameters are adjusted according to the network asset domain is obtained through the following method: Construct similar text pairs in the network asset domain and randomly divide them into a training set, a validation set and a test set according to a predetermined ratio; load a general language model and, using the training set, optimize its parameters with a contrastive learning strategy, the general language model including a BERT, ERNIE, or BGE model. During training, for the data within a batch, the model updates its weights by optimizing the following contrastive learning loss function:
$$\mathcal{L}_i = -\log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+})/\tau\right)}$$
where $h_i$ and $h_i^{+}$ are the vector representations of the anchor text and the positive sample text, respectively; $h_j^{+}$ is the vector representation of the $j$-th sample within the batch; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; $\tau$ is the temperature coefficient; and $N$ is the batch size. During the adjustment of the weight parameters, the model performance is monitored using the validation set, and model parameters are generated. The model's generalization ability is evaluated on the test set, and the model that passes the evaluation and has the best performance is determined as the final domain fine-tuned model for subsequent recognition model training.
[0029] In a specific application example, similar text pairs in the domain of network assets are constructed using the following method: From the preprocessed web asset page text, all page texts with the same enterprise tag are selected and paired to construct multiple positive sample pairs. Each positive sample pair indicates that the two texts are semantically similar and belong to the same enterprise. Subsequently, all positive sample pairs are randomly divided into training set, validation set and test set according to a predetermined ratio. The preprocessing involves parsing the web asset page text based on the page DOM tree structure to filter noisy data and extract the text to be analyzed, which includes at least the business description area and the copyright information area.
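The pair construction and random split described above can be sketched as follows. The input is assumed to be a list of (page text, enterprise tag) tuples that have already been preprocessed; the 80/10/10 ratio, the fixed seed and the function names are illustrative assumptions, since the text only specifies "a predetermined ratio".

```python
import random
from collections import defaultdict
from itertools import combinations


def build_positive_pairs(pages):
    """pages: iterable of (page_text, enterprise_tag).

    Returns a list of (text_a, text_b, tag), one entry for every pair of
    page texts that share the same enterprise tag.
    """
    by_tag = defaultdict(list)
    for text, tag in pages:
        by_tag[tag].append(text)
    pairs = []
    for tag, texts in by_tag.items():
        for text_a, text_b in combinations(texts, 2):
            pairs.append((text_a, text_b, tag))
    return pairs


def split_pairs(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly splits pairs into (train, validation, test) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

An enterprise with k pages contributes k(k-1)/2 pairs, so enterprises with many pages dominate the dataset unless pairs are subsampled; whether to do so is a design choice the text leaves open.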
[0030] In a specific application example, as shown in Figure 2, the method for generating the preset enterprise feature vector library is as follows: Each enterprise tag to be identified is input into the network asset identification model and mapped to a corresponding enterprise feature vector, and all enterprise feature vectors and their associated enterprise tags are stored offline in the vector database.
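The offline stage can be sketched as encoding each enterprise tag once and keeping the tag and its vector together. The trained model is not available here, so `toy_encode` is a purely hypothetical stand-in (hashed character trigrams, L2-normalized) used only to make the example runnable; the in-memory list likewise stands in for the vector database.

```python
import hashlib
import math


def toy_encode(text, dim=64):
    """Hypothetical stand-in for the trained model, not the real encoder."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_enterprise_library(enterprise_tags, encode=toy_encode):
    """Offline stage: encode every tag once; returns [(tag, vector), ...]."""
    return [(tag, encode(tag)) for tag in enterprise_tags]
```

Because the vectors are normalized when they are stored, cosine similarity at query time reduces to a dot product, which is one common way to keep the online stage cheap.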
[0031] As a further improvement to the method of the present invention, the similarity score is obtained by calculating the cosine similarity, and the calculation formula is as follows:
$$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function, $A$ is the vector of the page text content, $B$ is the enterprise feature vector, $n$ is the vector dimension, and $A_i$ and $B_i$ are the components of the two vectors in the $i$-th dimension.
[0032] In a specific application example, in step S4, the enterprise tag with the highest similarity score that meets the threshold is selected according to the following formula:
$$\mathrm{label} = \arg\max_{k \in E} \mathrm{sim}(A, B_k)$$
[0033] Here, $\mathrm{label}$ represents the enterprise tag; $\arg\max$ returns the index position corresponding to the maximum cosine similarity, i.e., the enterprise tag number; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; $A$ is the vector of the page text content; $B_k$ is the enterprise feature vector associated with tag $k$; and $E$ is the set of all enterprise tags in the enterprise feature tower.
[0034] This embodiment also includes a network asset identification system based on page text content, used to implement the above-described network asset identification method based on page text content, including:
A data preprocessing module, used to obtain the page text content of the target network asset, parse it based on the page DOM tree structure, filter out noisy data, and extract the text to be analyzed, including at least the business description area and the copyright information area;
A text feature extraction module, used to extract text semantic feature vectors from the text to be analyzed through a pre-trained network asset recognition model. The network asset recognition model is a semantic matching model based on the Poly-encoder architecture and trained with network asset domain data, which is used to encode semantic features that represent the business affiliation information of the text;
A similarity calculation module, used to calculate the similarity between the text semantic feature vector and the feature vectors of each enterprise stored in the preset enterprise feature vector library, wherein each enterprise feature vector in the enterprise feature vector library is associated with an enterprise tag;
A result selection module, used to select, based on a preset similarity threshold and from all enterprise tags associated with the enterprise feature vector library, the enterprise tag with the highest similarity score that meets the threshold, as the enterprise affiliation of the target network asset.
[0035] This embodiment also includes a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the above-described method for identifying network assets based on page text content.
[0036] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram. The computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0037] The above description is merely a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements and modifications made without departing from the principles of the present invention should also be considered within the scope of protection of the present invention.
Claims
1. A method for identifying network assets based on page text content, characterized in that it comprises:
S1. Obtain the page text content of the target network asset, parse it based on the page DOM tree structure to filter out noisy data, and extract the text to be analyzed, including at least the business description area and the copyright information area;
S2. Extract text semantic feature vectors from the text to be analyzed through a pre-trained network asset recognition model. The network asset recognition model is a semantic matching model based on the Poly-encoder architecture and trained with network asset domain data, used to encode semantic features representing its business affiliation information from the text;
S3. Calculate the similarity between the text semantic feature vector and the feature vectors of each enterprise stored in the preset enterprise feature vector library, wherein each enterprise feature vector in the enterprise feature vector library is associated with an enterprise tag;
S4. Based on a preset similarity threshold, select the enterprise tag with the highest similarity score that meets the threshold from all enterprise tags associated with the enterprise feature vector library, and use it as the enterprise affiliation of the target network asset.
2. The method for identifying network assets based on page text content according to claim 1, characterized in that the parsing based on the page DOM tree structure described in step S1 includes:
S11. Parse the page DOM tree, dynamically determine the business description area based on tag nesting depth, CSS class name characteristics and HTML5 semantic tags, and extract the text within the area;
S12. Construct a copyright information feature matrix, the features including legal keywords, symbol features and date format; based on the above features, use a pre-trained classification model to perform binary classification on each block in the DOM tree, identify the copyright information area and extract its text content;
S13. Concatenate the extracted business area text and copyright information area text to form the text to be analyzed.
3. The method for identifying network assets based on page text content according to claim 1, characterized in that, in step S2, the network asset recognition model is trained as follows: a pre-trained language model whose weight parameters have been adjusted for the network asset domain is used as the base feature extractor, and the Poly-encoder model is trained on a network asset dataset; the network asset dataset includes positive samples and negative samples, the positive samples being association pairs between page text and the tag of the enterprise to which it actually belongs, and the negative samples being association pairs between page text and the tag of a "third-party enterprise"; through contrastive learning on the positive and negative samples, the parameters of the network asset recognition model are optimized so that the model can accurately calculate the semantic similarity between page text and enterprise tags; accuracy metrics are monitored on a validation set to produce the final network asset recognition model.
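The claim names the Poly-encoder architecture without spelling out how it scores a page against an enterprise tag. The sketch below follows the publicly described Poly-encoder scoring scheme (learned context codes attend over token outputs, then the candidate attends over the resulting views), written in plain Python. The function names and toy inputs are assumptions for illustration, and the final dot product is the architecture's usual choice, whereas the claims compare vectors by cosine similarity.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Dot-product attention: softmax(query . key) weighted average of keys."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def poly_encoder_score(token_vecs, codes, candidate_vec):
    """Score one page (token_vecs) against one enterprise tag (candidate_vec).

    token_vecs:    per-token outputs of the page-text encoder
    codes:         the m learned context codes (trained parameters)
    candidate_vec: embedding of the enterprise tag
    """
    # 1) Each code attends over the token outputs, giving m views of the page.
    context_views = [attend(code, token_vecs) for code in codes]
    # 2) The candidate attends over the m views, giving one page vector.
    context_vec = attend(candidate_vec, context_views)
    # 3) The score is the dot product of page vector and candidate.
    return dot(context_vec, candidate_vec)
```

Because the candidate side is encoded independently of the page, enterprise tag embeddings can be computed once and cached, which is what makes an offline enterprise vector library practical.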
4. The method for identifying network assets based on page text content according to claim 3, characterized in that the pre-trained language model whose weight parameters are adjusted for the network asset domain is obtained as follows: similar text pairs in the network asset domain are constructed and randomly divided into a training set, a validation set and a test set according to a predetermined ratio; a general language model is loaded and its parameters are optimized on the training set using a contrastive learning strategy, the general language model including a BERT, ERNIE or BGE model; during training, for the data within a batch, the model updates its weights by optimizing the following contrastive learning loss function: $\mathcal{L}_i = -\log \frac{\exp(\mathrm{sim}(h_i, h_i^{+})/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(h_i, h_j^{+})/\tau)}$; where $h_i$ and $h_i^{+}$ are the vector representations of the anchor text and the positive sample text, respectively, $h_j^{+}$ ranges over the vector representations of all samples within the batch, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function, $\tau$ is the temperature coefficient and $N$ is the batch size; during the adjustment of weight parameters, model performance is monitored on the validation set and model parameters are generated; the model's generalization ability is evaluated on the test set, and the model that passes the evaluation with the best performance is taken as the final domain fine-tuned model for subsequent recognition model training.
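The in-batch contrastive loss of claim 4 can be sketched in plain Python as follows. This assumes the common formulation in which the positives of the other pairs in the batch act as negatives for each anchor; the function names and the default temperature of 0.05 are illustrative assumptions, not values from the application.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def in_batch_contrastive_loss(anchors, positives, tau=0.05):
    """Mean over the batch of -log softmax_i(sim(anchor_i, positive_j) / tau).

    anchors[i] and positives[i] form a positive pair; every positives[j]
    with j != i is treated as a negative for anchors[i] (an assumption).
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / tau for j in range(n)]
        m = max(logits)                       # log-sum-exp with max subtraction
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n
```

A smaller temperature sharpens the softmax, so hard negatives dominate the gradient; in practice it is tuned on the validation set.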
5. The method for identifying network assets based on page text content according to claim 4, characterized in that the similar text pairs in the network asset domain are constructed as follows: from the preprocessed network asset page texts, all page texts with the same enterprise tag are selected and paired to construct multiple positive sample pairs, each positive sample pair indicating that the two texts are semantically similar and belong to the same enterprise; all positive sample pairs are then randomly divided into a training set, a validation set and a test set according to a predetermined ratio; the preprocessing consists of parsing the network asset page text based on the page DOM tree structure to filter out noisy data and extract the text to be analyzed, which includes at least the business description area and the copyright information area.
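The pair construction and random split described in claim 5 can be sketched as below. The 80/10/10 default ratio and the fixed seed are assumptions, since the claim only says "a predetermined ratio"; the function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_positive_pairs(samples):
    """samples: iterable of (page_text, enterprise_tag) after preprocessing.

    Returns every unordered pair of texts that share an enterprise tag.
    """
    by_tag = defaultdict(list)
    for text, tag in samples:
        by_tag[tag].append(text)
    pairs = []
    for texts in by_tag.values():
        pairs.extend(combinations(texts, 2))
    return pairs


def split_pairs(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split into (train, validation, test); ratios are assumed."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Note that an enterprise with k pages contributes k(k-1)/2 pairs, so enterprises with many pages can dominate the dataset unless pairs are subsampled.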
6. The method for identifying network assets based on page text content according to claim 1, characterized in that the preset enterprise feature vector library is generated as follows: each enterprise tag to be identified is input into the network asset recognition model and mapped to a corresponding enterprise feature vector, and all enterprise feature vectors and their associated enterprise tags are stored offline in a vector database.
7. The method for identifying network assets based on page text content according to claim 1, characterized in that the similarity score is obtained by calculating the cosine similarity according to the following formula: $\mathrm{sim}(\mathbf{q}, \mathbf{e}) = \frac{\sum_{i=1}^{n} q_i e_i}{\sqrt{\sum_{i=1}^{n} q_i^{2}}\,\sqrt{\sum_{i=1}^{n} e_i^{2}}}$; where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function, $\mathbf{q}$ is the vector of the page text content, $\mathbf{e}$ is the enterprise feature vector, $n$ is the vector dimension, and $q_i$ and $e_i$ are the components of the two vectors on the $i$-th dimension.
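As a quick numeric check of the cosine similarity, consider made-up illustration values (not data from the application): a three-dimensional page text vector $\mathbf{q} = (1, 2, 2)$ and an enterprise feature vector $\mathbf{e} = (2, 1, 2)$.

$$\mathrm{sim}(\mathbf{q}, \mathbf{e}) = \frac{1\cdot 2 + 2\cdot 1 + 2\cdot 2}{\sqrt{1^{2}+2^{2}+2^{2}}\,\sqrt{2^{2}+1^{2}+2^{2}}} = \frac{8}{3 \cdot 3} \approx 0.889$$

Under a hypothetical similarity threshold of 0.8, this pair would qualify as a candidate match; the score depends only on the angle between the vectors, not on their lengths.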
8. The method for identifying network assets based on page text content according to claim 1, characterized in that, in step S4, the enterprise tag that meets the similarity threshold and has the highest score is calculated by the following formula: $\mathrm{label} = \underset{\mathbf{e} \in E}{\arg\max}\ \mathrm{sim}(\mathbf{q}, \mathbf{e})$; where $\mathrm{label}$ is the enterprise tag, $\arg\max$ returns the index position corresponding to the maximum cosine similarity, i.e. the enterprise tag number, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function, $\mathbf{q}$ is the vector of the page text content, $\mathbf{e}$ is an enterprise feature vector, and $E$ is the set of all enterprise labels in the enterprise feature vector library.
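The threshold-gated selection of step S4 can be sketched as a brute-force scan. This assumes the library is an in-memory dictionary from enterprise tag to vector, and the threshold of 0.8 is a placeholder; a production system holding tens of millions of vectors would use a vector database index instead.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0        # zero vectors match nothing


def select_enterprise(page_vec, library, threshold=0.8):
    """Return (tag, score) for the best match, or (None, score) if the best
    score is below the threshold.

    library: dict mapping enterprise tag -> enterprise feature vector
             (an assumed in-memory stand-in for the vector database).
    """
    best_tag, best_score = None, -1.0
    for tag, vec in library.items():
        score = cosine(page_vec, vec)
        if score > best_score:
            best_tag, best_score = tag, score
    if best_score < threshold:
        return None, best_score               # no affiliation assigned
    return best_tag, best_score
```

For example, `select_enterprise([0.9, 0.1], {"Acme": [1.0, 0.0], "Globex": [0.0, 1.0]})` picks the hypothetical tag "Acme", while a page vector equally close to both enterprises falls below the threshold and is left unassigned.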
9. A network asset identification system based on page text content, used to implement the network asset identification method based on page text content according to any one of claims 1-8, characterized in that it comprises: a data preprocessing module, used to obtain the page text content of the target network asset, parse it based on the page DOM tree structure, filter out noisy data, and extract the text to be analyzed, which includes at least the business description area and the copyright information area; a text feature extraction module, used to extract a text semantic feature vector from the text to be analyzed through a pre-trained network asset recognition model, wherein the network asset recognition model is a semantic matching model based on the Poly-encoder architecture and trained with network asset domain data, and is used to encode, from the text, semantic features representing its business affiliation information; a similarity calculation module, used to calculate the similarity between the text semantic feature vector and each enterprise feature vector stored in the preset enterprise feature vector library, wherein each enterprise feature vector in the enterprise feature vector library is associated with an enterprise tag; and a result selection module, used to select, based on a preset similarity threshold and from all enterprise tags associated with the enterprise feature vector library, the enterprise tag with the highest similarity score that meets the threshold, as the enterprise affiliation of the target network asset.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the network asset identification method based on page text content according to any one of claims 1-8.