A network information collection method and system based on real-time text analysis
By employing real-time text analysis methods, utilizing sparse matrix factorization and neural network techniques, and combining analytic hierarchy process (AHP) and entropy method to optimize the text classification model, the uncertainty problem in network information collection was solved, achieving efficient and accurate information collection and automated processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA NAT INST OF STANDARDIZATION
- Filing Date
- 2023-09-14
- Publication Date
- 2026-06-23
Figure CN117216273B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of real-time text, and in particular to a method and system for collecting network information based on real-time text analysis.
Background Technology
[0002] The application of network information collection technology in the real-time text domain is becoming increasingly widespread, helping network information collection managers to collect network information in a timely and efficient manner. Currently, network information is characterized by its massive volume of textual information, diverse data types, and high information density, leading to numerous uncertainties in network information collection methods. Although some network information collection methods and systems have been invented, they still cannot effectively solve the problem of uncertainty in network information collection methods.
Summary of the Invention
[0003] The purpose of this invention is to provide a method and system for collecting network information based on real-time text analysis.
[0004] To achieve the above objectives, the present invention is implemented according to the following technical solution:
[0005] This invention includes the following steps:
[0006] Step A: acquire real-time text data, preprocess the real-time text data, and obtain first data and second data based on the preprocessed real-time text data, wherein:
[0007] The preprocessed real-time text data is subjected to topic extraction, text entities are obtained based on the topic, and keywords are extracted from the text entities to obtain the first data;
[0008] The preprocessed real-time text data is subjected to syntactic analysis to obtain the second data.
[0009] Step B: calculate the similarity of the first data and the similarity of the second data, and then weight the two similarities to obtain the classification target;
[0010] Step C: construct a text classification model based on the classification target, input the real-time text data into the text classification model to obtain classification data, and output the classification data as collected network information.
[0011] Furthermore, the preprocessing described in step A includes segmentation, word segmentation, stop word removal, part-of-speech tagging, punctuation removal, number removal, special character removal, traditional Chinese character conversion, pinyin removal, and text vectorization.
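A few of the preprocessing operations listed above can be sketched as follows. This is a minimal illustration only: it assumes English text, uses whitespace splitting in place of a real word segmenter, and relies on a small hypothetical stop-word list (`STOP_WORDS`). Part-of-speech tagging, traditional Chinese character conversion, pinyin removal, and text vectorization are not shown.

```python
import re

# Hypothetical stop-word list for illustration; a real system would load a larger one.
STOP_WORDS = {"a", "an", "the", "to", "at", "of", "and", "how", "will", "this"}

def preprocess(text: str) -> list[str]:
    """Lowercase, remove numbers and punctuation, split into words, drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # number removal
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation and special character removal
    tokens = text.split()                  # whitespace word segmentation (assumption)
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("How to install a WiFi network at home?")
print(tokens)  # ['install', 'wifi', 'network', 'home']
```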
[0012] Furthermore, the method for extracting topics from the preprocessed real-time text data includes:
[0013] Remove adverbs, compound words, and adjectives from the preprocessed real-time text data, and retain nouns to form a noun dictionary:
[0014]
[0015] Where the noun dictionary is B, each row represents a noun corresponding to one of the historical retrieval data, the dictionary length is m, and the number of words is n; the nouns are matched with the dictionary to construct a high-dimensional sparse matrix; the sparse matrix is decomposed into the product of the basis matrix and the coefficient matrix:
[0016] $N_{m \times n} = A_{n \times r} \times U_{r \times m}$
[0017] Where r is the number of columns, m is the dictionary length, and n is the number of words in the dictionary; the sparse matrix is $N_{m \times n}$; the basis matrix with n rows and r columns is $A_{n \times r}$; and the coefficient matrix with r columns and m rows is $U_{r \times m}$. The basis matrix A is the set of topics, and the coefficient matrix U is the set of matched topic words. The high-dimensional matrix is reduced in dimensionality over multiple iterations, and the iteration stops when the following conditions are met:
[0018] $\|A_{t+1} - A_t\| < \varepsilon$
[0019] $\|U_{t+1} - U_t\| < \varepsilon$
[0020] Where t is the iteration number; ε is an arbitrarily small real number; $A_{t+1}$ and $U_{t+1}$ are the basis matrix and the coefficient matrix of the (t+1)-th iteration; and $A_t$ and $U_t$ are the basis matrix and the coefficient matrix of the t-th iteration. The keywords are output as the extraction result.
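The factorization and its stopping rule can be sketched as below. The text does not name the update rule, so this sketch assumes multiplicative-update non-negative matrix factorization, which is one common choice. It also assumes shapes for which the product is defined (N of shape m x n, A of shape m x r, U of shape r x n). The function name `factorize` and the toy matrix are hypothetical.

```python
import numpy as np

def factorize(N, r, eps=1e-4, max_iter=500, seed=0):
    """Factor a non-negative matrix N (m x n) into A (m x r) @ U (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = N.shape
    A = rng.random((m, r)) + 1e-3   # basis matrix (topics)
    U = rng.random((r, n)) + 1e-3   # coefficient matrix (topic words)
    tiny = 1e-12                    # guards against division by zero
    iterations = 0
    for _ in range(max_iter):
        # Multiplicative updates: an assumed update rule, not stated in the text.
        U_next = U * (A.T @ N) / (A.T @ A @ U + tiny)
        A_next = A * (N @ U_next.T) / (A @ U_next @ U_next.T + tiny)
        # Stopping rule from the text: both matrices change by less than eps.
        done = (np.linalg.norm(A_next - A) < eps
                and np.linalg.norm(U_next - U) < eps)
        A, U = A_next, U_next
        iterations += 1
        if done:
            break
    return A, U, iterations

# Toy noun-count matrix with two visible topic blocks.
N = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)
A, U, iterations = factorize(N, r=2)
error = np.linalg.norm(N - A @ U)
print(iterations, round(float(error), 4))
```

The threshold `eps` plays the role of ε above; a smaller value trades more iterations for a closer fit.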
[0021] Furthermore, the method for obtaining text entities based on the topic includes:
[0022] The topic is encoded, and the predicted text is labeled using a two-stream structure with fine-grained and coarse-grained single-identifier masks. An improved long short-term memory (LSTM) neural network is used for text entity recognition, and the neuron parameters are updated as follows:
[0023] $e_t = \sigma(R_e x_t + Q_e g_{t-1} + b_e)$
[0024] $a_t = \sigma(R_a x_t + Q_a g_{t-1} + b_a)$
[0025] $v_t = \sigma(R_v x_t + Q_v g_{t-1} + b_v)$
[0026] $g_t = \sigma_t \times \tanh(c_t)$
[0027] Where $e_t$, $a_t$, and $v_t$ are the forget gate, input gate, and output gate at time t; $g_{t-1}$ is the topic (the hidden-layer state) at the previous time step; $x_t$ is the topic input at time t; $R_e$ and $Q_e$ are the weight matrices of the forget gate, $R_a$ and $Q_a$ those of the input gate, and $R_v$ and $Q_v$ those of the output gate; $b_e$, $b_a$, and $b_v$ are the biases of the forget gate, input gate, and output gate; σ is the activation function; $\sigma_t$ is the activation function at time t; $c_t$ is the memory cell at time t; and $\tanh(c_t)$ is the hyperbolic tangent of the memory cell;
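The gate equations above can be illustrated with a single recurrent step in NumPy. The text does not give the memory-cell update, so this sketch assumes the standard LSTM form (candidate weights `R_c`, `Q_c`, `b_c` are hypothetical additions). It also assumes that the factor written as $\sigma_t$ in the output equation acts as the output gate $v_t$. The "improved" aspects of the network are not specified and are not modeled here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, seed=0):
    """Random small weights for gates e, a, v and the assumed candidate c."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("e", "a", "v", "c"):
        params[f"R_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params[f"Q_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params

def lstm_step(x_t, g_prev, c_prev, p):
    e_t = sigmoid(p["R_e"] @ x_t + p["Q_e"] @ g_prev + p["b_e"])  # forget gate
    a_t = sigmoid(p["R_a"] @ x_t + p["Q_a"] @ g_prev + p["b_a"])  # input gate
    v_t = sigmoid(p["R_v"] @ x_t + p["Q_v"] @ g_prev + p["b_v"])  # output gate
    # Assumption: standard candidate and memory-cell update (not given in the text).
    c_hat = np.tanh(p["R_c"] @ x_t + p["Q_c"] @ g_prev + p["b_c"])
    c_t = e_t * c_prev + a_t * c_hat
    # Assumption: the output gate v_t is the factor multiplying tanh(c_t).
    g_t = v_t * np.tanh(c_t)
    return g_t, c_t

params = init_params(input_dim=4, hidden_dim=3)
x_seq = np.random.default_rng(1).normal(size=(5, 4))  # five encoded inputs
g, c = np.zeros(3), np.zeros(3)
for x_t in x_seq:
    g, c = lstm_step(x_t, g, c, params)
print(g.shape)  # (3,)
```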
[0028] The encoded topic is segmented into words, and an efficient pointer decoder is used to exclude non-entity first and non-entity last characters from the matrix coordinates. The resulting encoded vector sequence is then used to calculate the score function for consecutive sequences of type c entities based on the entity score function and the vector sequence.
[0029] $W_c(i, j) = W_s(i, j) + (R_w)^{T} [x_i : x_j]$
[0030] Where $W_s(i, j)$ is the entity score function for the continuous sequence from i to j, and $(R_w)^{T} [x_i : x_j]$ is the score of a type-c entity whose sequence indices run from i to j. The results are sorted in descending order of score, and the top three to five are output as text entities.
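The span scoring and top-k selection can be sketched as follows. The sketch assumes that $[x_i : x_j]$ denotes the concatenation of the two endpoint vectors and that only spans with i ≤ j are valid candidates. The function name `top_spans` and the toy inputs are hypothetical.

```python
import numpy as np

def top_spans(X, W_s, R_w, k=3):
    """Score every span (i, j) with i <= j and return the k best as (score, i, j)."""
    length = X.shape[0]
    scored = []
    for i in range(length):
        for j in range(i, length):                   # spans ending before they start are excluded
            span_vec = np.concatenate([X[i], X[j]])  # assumed meaning of [x_i : x_j]
            score = float(W_s[i, j] + R_w @ span_vec)
            scored.append((score, i, j))
    scored.sort(key=lambda item: item[0], reverse=True)  # descending by score
    return scored[:k]

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                  # three encoded tokens
W_s = np.zeros((3, 3))
W_s[0, 1] = 0.5                             # base span score for span (0, 1)
R_w = np.array([1.0, 0.0, 0.0, 1.0])        # type-specific weight vector
best = top_spans(X, W_s, R_w, k=3)
print(best[0])  # (2.5, 0, 1)
```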
[0031] Furthermore, the method for extracting keywords from the text entities to obtain the first data includes:
[0032] Calculate the word frequency of text entities based on keywords:
[0033]
[0034] Where $t_{i,j}$ is the number of times text entity i appears in entity text data j, $t_{k,j}$ is the number of times text entity k appears in entity text data j, and there are n text entities. Calculate the inverse document frequency of each text entity:
[0035]
[0036] Where D is the total number of pieces of information in the text data, $|\{j : t_i \in d_j\}|$ is the number of pieces of information containing text entity i, and $s_i$ is the inverse document frequency of text entity i. The selection score is obtained from the term frequency and the inverse document frequency:
[0037] $w_i = V_{i,j} \cdot s_i$
[0038] Where $w_i$ is the selection score of text entity i and $V_{i,j}$ is its term frequency. The selection scores are sorted in descending order, and the first 8 text entities are output as the first data.
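The selection score can be illustrated with the conventional TF-IDF definitions. The term-frequency and inverse-document-frequency formulas are not reproduced above, so this sketch assumes the standard forms: count divided by total count in the document, and the natural logarithm of D over the number of documents containing the entity. The function name `select_entities` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def select_entities(documents, target_index, top_k=8):
    """Rank the entities of one document by term frequency times inverse document frequency."""
    counts = Counter(documents[target_index])
    total = sum(counts.values())
    D = len(documents)
    scores = {}
    for entity, count in counts.items():
        tf = count / total                                   # assumed standard TF
        containing = sum(1 for d in documents if entity in d)
        idf = math.log(D / containing)                       # assumed standard IDF
        scores[entity] = tf * idf                            # w_i = V_ij * s_i
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

documents = [
    ["wifi", "router", "wifi", "home"],
    ["cloud", "storage", "home"],
    ["pizza", "oven", "home"],
]
ranked = select_entities(documents, target_index=0)
print([entity for entity, _ in ranked])  # ['wifi', 'router', 'home']
```

An entity that appears in every document, such as "home" here, receives a score of zero and falls to the end of the ranking.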
[0039] Furthermore, a method for obtaining second data by performing syntactic analysis on the preprocessed real-time text data includes:
[0040] The preprocessed real-time text data is decomposed into words according to grammatical rules. A statistical method is used to perform grammatical parsing on the words. Based on the grammatical relationships between words in a sentence, the grammatical structure of the sentence is constructed. The dependency relationships between words are analyzed. The grammatical structure of the sentence is represented by a tree structure. The tree structure of the sentence is parsed and analyzed. The real-time text data after parsing and analysis is output as the second data.
[0041] Furthermore, the formulas for calculating the similarity between the first data and the second data are as follows:
[0042]
[0043] Where the first data set is X; the real-time text data set is Y; the i-th first data is $X_i$; the j-th real-time text data is $Y_j$; the first data set has n data points and the second data set has m data points. The formula also uses the mean of the first data set and the mean of the real-time text data set, and yields the similarity between first data i and real-time text data j. The similarity of the second data is calculated in the same way.
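The similarity formula itself is not reproduced above. Because it is defined in terms of two data sets and their means, one plausible reading is a Pearson-style correlation. The sketch below assumes that reading and assumes equal-length numeric vectors; it should not be taken as the formula of the invention.

```python
import numpy as np

def pearson_similarity(x, y):
    """Pearson correlation of two equal-length vectors; returns 0.0 if either is constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()          # deviations from the mean of the first data set
    yc = y - y.mean()          # deviations from the mean of the real-time text data set
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0.0:
        return 0.0
    return float((xc * yc).sum() / denom)

same_direction = pearson_similarity([1, 2, 3], [2, 4, 6])
opposite = pearson_similarity([1, 2, 3], [3, 2, 1])
print(round(same_direction, 6), round(opposite, 6))  # 1.0 -1.0
```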
[0044] Furthermore, a method for obtaining the classification target by weighting the similarity between the first data and the second data includes:
[0045] Weights for the similarity of the first data and the similarity of the second data are calculated separately by the analytic hierarchy process (AHP) and by the entropy method. The two sets of weights are then combined into a comprehensive weight for each similarity:
[0046]
[0047] Where $\alpha_j$ is the weight calculated by the analytic hierarchy process, $\beta_j$ is the weight calculated by the entropy method, and $\tau_j$ is the weight of the j-th similarity. The weights are output.
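The combination of AHP and entropy weights can be sketched as below. The combination formula is not reproduced above, so the sketch assumes a normalized product, $\tau_j = \alpha_j \beta_j / \sum_k \alpha_k \beta_k$, which is one common way to merge the two. The AHP weights use the row geometric-mean approximation, and the pairwise comparison matrix is hypothetical. The result is not meant to reproduce any weight reported elsewhere in this document.

```python
import numpy as np

def ahp_weights(pairwise):
    """Approximate AHP weights by the normalized geometric mean of each row."""
    P = np.asarray(pairwise, dtype=float)
    gm = np.prod(P, axis=1) ** (1.0 / P.shape[1])
    return gm / gm.sum()

def entropy_weights(samples):
    """Entropy-method weights; rows are samples, columns are indicators."""
    X = np.asarray(samples, dtype=float)
    P = X / X.sum(axis=0)                       # share of each sample per indicator
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    entropy = -plogp.sum(axis=0) / np.log(X.shape[0])
    divergence = 1.0 - entropy                  # more dispersion gives more weight
    return divergence / divergence.sum()

def combine(alpha, beta):
    """Assumed combination rule: normalized product of the two weight vectors."""
    product = alpha * beta
    return product / product.sum()

# Hypothetical expert judgment: the first similarity is twice as important.
alpha = ahp_weights([[1.0, 2.0],
                     [0.5, 1.0]])
# Illustrative similarity values, one row per text.
beta = entropy_weights([[0.93, 0.81],
                        [0.92, 0.88],
                        [0.90, 0.85]])
tau = combine(alpha, beta)
print(alpha.round(4), tau.sum().round(6))
```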
[0048] Furthermore, a text classification model based on a long short-term memory (LSTM) neural network is constructed according to the classification target. The preprocessed real-time text data is randomly divided into a training set and a test set in a 3:2 ratio. The training set is input into the retrieval matching model for training, and the test set is input into the trained retrieval matching model. The parameters of the text classification model are optimized continuously, and training stops once the accuracy and the efficiency for the classification target both exceed 0.91.
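The 3:2 random split and the stopping condition can be sketched as follows. The function names `split_3_to_2` and `should_stop` are hypothetical, and how "efficiency" is measured is not specified in the text, so it is treated here as a plain number supplied by the caller.

```python
import random

def split_3_to_2(samples, seed=0):
    """Shuffle the samples and split them into training and test sets at a 3:2 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 3 // 5
    return shuffled[:cut], shuffled[cut:]

def should_stop(accuracy, efficiency, threshold=0.91):
    """Training stops only when both measures exceed the threshold."""
    return accuracy > threshold and efficiency > threshold

train_set, test_set = split_3_to_2(range(10))
print(len(train_set), len(test_set))  # 6 4
```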
[0049] In a second aspect, an information retrieval device includes:
[0050] Data acquisition module: used to acquire real-time text data, preprocess the real-time text data, and obtain first data and second data based on the preprocessed real-time text data, wherein:
[0051] The preprocessed real-time text data is subjected to topic extraction, text entities are obtained based on the topic, and keywords are extracted from the text entities to obtain the first data;
[0052] The preprocessed real-time text data is subjected to syntactic analysis to obtain the second data.
[0053] Similarity module: used to calculate the similarity between the first data and the second data, and to weight the similarity between the first data and the second data to obtain the classification target;
[0054] Collection module: used to construct a text classification model based on the classification target, input the real-time text data into the text classification model to obtain classification data, and output the classification data as collected network information.
[0055] The beneficial effects of this invention are:
[0056] This invention is a network information collection method based on real-time text analysis. Compared with existing technologies, this invention has the following technical advantages:
[0057] This invention improves the accuracy of network information collection by preprocessing, acquiring first and second data, constructing classification targets and classification steps, thereby enhancing the precision of network information collection. It also makes network information collection intelligent, which can greatly save resources and manpower costs, improve work efficiency, and realize the automatic collection of network information. It can extract themes, entities and keywords from network information to be processed in real time, which is of great significance to network information collection. It can adapt to network information collection according to different standards and network information collection needs of different users, and has a certain degree of universality.
Attached Figure Description
[0058] Figure 1 is a flowchart illustrating the steps of a network information collection method based on real-time text analysis according to the present invention.
Detailed Implementation
[0059] The present invention will be further described below through specific embodiments. The illustrative embodiments and descriptions herein are used to explain the present invention, but are not intended to limit the present invention.
[0060] The network information collection method for real-time text analysis of this invention includes the following steps:
[0061] As shown in Figure 1, this embodiment includes the following steps:
[0062] Step A: acquire real-time text data, preprocess the real-time text data, and obtain first data and second data based on the preprocessed real-time text data, wherein:
[0063] The preprocessed real-time text data is subjected to topic extraction, text entities are obtained based on the topic, and keywords are extracted from the text entities to obtain the first data;
[0064] The preprocessed real-time text data is subjected to syntactic analysis to obtain the second data.
[0065] In the actual evaluation, the following network information will be used as the research object:
[0066] Network Information 1: "How to install a WiFi network at home? This article will introduce how to install a WiFi network at home, allowing you to easily enjoy the convenience of high-speed internet. Installing a WiFi network requires a router and a computer. The specific steps are as follows: First, connect the router to the computer and log in to the administrator interface; second, set the router's WiFi parameters; finally, enable the router's DHCP server so that your devices can automatically obtain IP addresses and DNS server information."
[0067] Network Information 2: "Analysis of the Advantages and Disadvantages of Cloud Storage. This article will introduce the advantages and disadvantages of cloud storage. Cloud storage is a way to store data via the Internet, which can achieve centralized management and backup of data. On the advantage side, cloud storage has high scalability and availability, can be accessed anytime, anywhere, and reduces storage costs. However, it also has some disadvantages, such as data security issues, access speed limitations, and the requirement of an internet connection to access it."
[0068] Network Information 3: "How to make a delicious pizza at home? This article will show you how to make a delicious pizza at home. First, you need to prepare pizza dough and various toppings, such as cheese, tomato sauce, sausage, vegetables, etc. Then, place the pizza dough on a baking sheet, spread an appropriate amount of tomato sauce, sprinkle an appropriate amount of cheese and other toppings, and finally bake it in a preheated oven for 10-15 minutes."
[0069] Step B: calculate the similarity of the first data and the similarity of the second data, and then weight the two similarities to obtain the classification target;
[0070] In the actual assessment, the classification target for Network Information 1 was 0.8772, the classification target for Network Information 2 was 0.9024, and the classification target for Network Information 3 was 0.878.
[0071] Step C: construct a text classification model based on the classification target, input the real-time text data into the text classification model to obtain classification data, and output the classification data as collected network information.
[0072] In the actual assessment, Network Information 1 was classified as network installation, Network Information 2 as cloud storage analysis, and Network Information 3 as food preparation.
[0073] In this embodiment, the preprocessing in step A includes segmentation, word segmentation, stop word removal, part-of-speech tagging, punctuation removal, number removal, special character removal, traditional Chinese character conversion, pinyin removal, and text vectorization.
[0074] In the actual assessment, the texts after preprocessing are as follows:
[0075] Network Information 1: "Home\Installation\WIFI\Network\Introduction\Home\Installation\Network\Easy\Enjoy\High-Speed Network\Convenient Experience\Install\WiFi Network\Router\Computer\Specific Steps\First\Router\Connect\Computer\Administrator\Interface\Settings\Router\WiFi\Parameters\Router\DHCP\Server\Enable\Device\Automatically Obtain\IP Address\DNS Server\Information"
[0076] Network Information 2: "Cloud Storage\Advantages\Disadvantages\Analysis\Introduction\Cloud Storage\Advantages\Disadvantages\Cloud Storage\Internet\Data Storage\Methods\Implementation\Data\Centralized\Management\Backup\Cloud Storage Scalability\Availability\Access Anytime, Anywhere\Reduced\Storage Costs\Disadvantages\Data Security Issues\Access Speed Limits\Network Access".
[0077] Network Information 3: "Homemade Pizza Introduction Homemade Pizza Preparation Pizza Dough Toppings Cheese Tomato Sauce Sausage Vegetables Pizza Dough Spread Tomato Sauce on Baking Pan Cheese Other Toppings Place in Oven Bake".
[0078] In this embodiment, the method for extracting topics from the preprocessed real-time text data includes:
[0079] Remove adverbs, compound words, and adjectives from the preprocessed real-time text data, and retain nouns to form a noun dictionary:
[0080]
[0081] Where the noun dictionary is B, each row represents a noun corresponding to one of the historical retrieval data, the dictionary length is m, and the number of words is n; the nouns are matched with the dictionary to construct a high-dimensional sparse matrix; the sparse matrix is decomposed into the product of the basis matrix and the coefficient matrix:
[0082] $N_{m \times n} = A_{n \times r} \times U_{r \times m}$
[0083] Where r is the number of columns, m is the dictionary length, and n is the number of words in the dictionary; the sparse matrix is $N_{m \times n}$; the basis matrix with n rows and r columns is $A_{n \times r}$; and the coefficient matrix with r columns and m rows is $U_{r \times m}$. The basis matrix A is the set of topics, and the coefficient matrix U is the set of matched topic words. The high-dimensional matrix is reduced in dimensionality over multiple iterations, and the iteration stops when the following conditions are met:
[0084] $\|A_{t+1} - A_t\| < \varepsilon$
[0085] $\|U_{t+1} - U_t\| < \varepsilon$
[0086] Where t is the iteration number; ε is an arbitrarily small real number; $A_{t+1}$ and $U_{t+1}$ are the basis matrix and the coefficient matrix of the (t+1)-th iteration; and $A_t$ and $U_t$ are the basis matrix and the coefficient matrix of the t-th iteration. The keywords are output as the extraction result;
[0087] In the actual assessment, the topic of Network Information 1 was "Home\Installation\WIFI\Network", the topic of Network Information 2 was "Cloud Storage\Advantages\Disadvantages\Analysis", and the topic of Network Information 3 was "Home\Making\Pizza".
[0088] In this embodiment, the method for obtaining text entities based on the topic includes:
[0089] The topic is encoded, and the predicted text is labeled using a two-stream structure with fine-grained and coarse-grained single-identifier masks. An improved long short-term memory (LSTM) neural network is used for text entity recognition, and the neuron parameters are updated as follows:
[0090] $e_t = \sigma(R_e x_t + Q_e g_{t-1} + b_e)$
[0091] $a_t = \sigma(R_a x_t + Q_a g_{t-1} + b_a)$
[0092] $v_t = \sigma(R_v x_t + Q_v g_{t-1} + b_v)$
[0093] $g_t = \sigma_t \times \tanh(c_t)$
[0094] Where $e_t$, $a_t$, and $v_t$ are the forget gate, input gate, and output gate at time t; $g_{t-1}$ is the topic (the hidden-layer state) at the previous time step; $x_t$ is the topic input at time t; $R_e$ and $Q_e$ are the weight matrices of the forget gate, $R_a$ and $Q_a$ those of the input gate, and $R_v$ and $Q_v$ those of the output gate; $b_e$, $b_a$, and $b_v$ are the biases of the forget gate, input gate, and output gate; σ is the activation function; $\sigma_t$ is the activation function at time t; $c_t$ is the memory cell at time t; and $\tanh(c_t)$ is the hyperbolic tangent of the memory cell;
[0095] The encoded topic is segmented into words, and an efficient pointer decoder is used to exclude non-entity first and non-entity last characters from the matrix coordinates. The resulting encoded vector sequence is then used to calculate the score function for consecutive sequences of type c entities based on the entity score function and the vector sequence.
[0096] $W_c(i, j) = W_s(i, j) + (R_w)^{T} [x_i : x_j]$
[0097] Where $W_s(i, j)$ is the entity score function for the continuous sequence from i to j, and $(R_w)^{T} [x_i : x_j]$ is the score of a type-c entity whose sequence indices run from i to j. The results are sorted in descending order of score, and the top three to five are output as text entities;
[0098] In the actual assessment, the text entity of network information 1 is "home\install\WIFI", the text entity of network information 2 is "cloud storage\advantages\disadvantages", and the text entity of network information 3 is "home\make\pizza".
[0099] In this embodiment, the method for extracting keywords from the text entities to obtain the first data includes:
[0100] Calculate the word frequency of text entities based on keywords:
[0101]
[0102] Where $t_{i,j}$ is the number of times text entity i appears in entity text data j, $t_{k,j}$ is the number of times text entity k appears in entity text data j, and there are n text entities. Calculate the inverse document frequency of each text entity:
[0103]
[0104] Where D is the total number of pieces of information in the text data, $|\{j : t_i \in d_j\}|$ is the number of pieces of information containing text entity i, and $s_i$ is the inverse document frequency of text entity i. The selection score is obtained from the term frequency and the inverse document frequency:
[0105] $w_i = V_{i,j} \cdot s_i$
[0106] Where $w_i$ is the selection score of text entity i and $V_{i,j}$ is its term frequency. The selection scores are sorted in descending order, and the first 8 text entities are output as the first data.
[0107] In the actual assessment, the first data for Network Information 1 was "Installation\WIFI", the first data for Network Information 2 was "Cloud Storage\Advantages\Disadvantages", and the first data for Network Information 3 was "Making\Pizza".
[0108] In this embodiment, the method for obtaining second data by performing syntactic analysis on the preprocessed real-time text data includes:
[0109] The preprocessed real-time text data is decomposed into words according to grammatical rules. A statistical method is used to perform grammatical parsing on the words. Based on the grammatical relationships between words in the sentence, the grammatical structure of the sentence is constructed. The dependency relationships between words are analyzed. The grammatical structure of the sentence is represented by a tree structure. The tree structure of the sentence is parsed and analyzed. The real-time text data after parsing and analysis is output as the second data.
[0110] In the actual assessment, the second data of network information 1 is the agent "installation", the proper noun "at home", the patient "network", and the modifier "WIFI". The second data of network information 2 is the possession of cloud storage, the modifier "through the Internet", the storage is data, and the method is centralized. The second data of network information 3 is the patient "pizza", the agent "making", and the proper noun "at home".
[0111] In this embodiment, the formulas for calculating the similarity between the first data and the second data are as follows:
[0112]
[0113] Where the first data set is X; the real-time text data set is Y; the i-th first data is $X_i$; the j-th real-time text data is $Y_j$; the first data set has n data points and the second data set has m data points. The formula also uses the mean of the first data set and the mean of the real-time text data set, and yields the similarity between first data i and real-time text data j. The similarity of the second data is calculated in the same way;
[0114] In the actual evaluation, the similarity of the first data of network information 1 was 0.93, the similarity of the second data of network information 1 was 0.81, the similarity of the first data of network information 2 was 0.92, the similarity of the second data of network information 2 was 0.88, the similarity of the first data of network information 3 was 0.9, and the similarity of the second data of network information 3 was 0.85.
[0115] In this embodiment, the method for obtaining the classification target by weighting the similarity of the first data and the similarity of the second data includes:
[0116] Weights for the similarity of the first data and the similarity of the second data are calculated separately by the analytic hierarchy process (AHP) and by the entropy method. The two sets of weights are then combined into a comprehensive weight for each similarity:
[0117]
[0118] Where $\alpha_j$ is the weight calculated by the analytic hierarchy process, $\beta_j$ is the weight calculated by the entropy method, and $\tau_j$ is the weight of the j-th similarity. The weights are output;
[0119] In the actual evaluation, the weight of the first data point was 0.56, and the weight of the second data point was 0.44.
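The figures reported in this embodiment can be cross-checked with a short calculation. The sketch assumes that the classification target is the weighted sum of the two similarities; the text does not state this outright, but a weighted sum with the weights 0.56 and 0.44 and the similarities from paragraph [0114] reproduces the classification targets in paragraph [0070]. The function name `classification_target` is hypothetical.

```python
def classification_target(sim_first, sim_second, w_first=0.56, w_second=0.44):
    """Weighted sum of the first-data and second-data similarities (assumed form)."""
    return w_first * sim_first + w_second * sim_second

# (similarity of first data, similarity of second data) per network information item
similarities = {1: (0.93, 0.81), 2: (0.92, 0.88), 3: (0.90, 0.85)}
targets = {
    item: round(classification_target(first, second), 4)
    for item, (first, second) in similarities.items()
}
print(targets)  # {1: 0.8772, 2: 0.9024, 3: 0.878}
```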
[0120] In this embodiment, a text classification model based on a long short-term memory (LSTM) neural network is constructed according to the classification target. The preprocessed real-time text data is randomly divided into a training set and a test set in a 3:2 ratio. The training set is input into the retrieval matching model for training, and the test set is input into the trained retrieval matching model. The parameters of the text classification model are optimized continuously, and training stops once the accuracy and the efficiency for the classification target both exceed 0.91.
[0121] In a second aspect, an information retrieval device includes:
[0122] Data acquisition module: used to acquire real-time text data, preprocess the real-time text data, and obtain first data and second data based on the preprocessed real-time text data, wherein:
[0123] The preprocessed real-time text data is subjected to topic extraction, text entities are obtained based on the topic, and keywords are extracted from the text entities to obtain the first data;
[0124] The preprocessed real-time text data is subjected to syntactic analysis to obtain the second data.
[0125] Similarity module: used to calculate the similarity between the first data and the second data, and to weight the similarity between the first data and the second data to obtain the classification target;
[0126] Collection module: used to construct a text classification model based on the classification target, input the real-time text data into the text classification model to obtain classification data, and output the classification data as collected network information.
[0127] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for collecting network information based on real-time text analysis, characterized in that it includes the following steps:
Step A: acquire real-time text data, preprocess the real-time text data, and obtain first data and second data based on the preprocessed real-time text data, wherein: the preprocessed real-time text data is subjected to topic extraction, text entities are obtained based on the topic, and keywords are extracted from the text entities to obtain the first data; the preprocessed real-time text data is subjected to syntactic analysis to obtain the second data;
Step B: calculate the similarity of the first data and the similarity of the second data, and then weight the two similarities to obtain the classification target;
Step C: construct a text classification model based on the classification target, input the real-time text data into the text classification model to obtain classification data, and output the classification data as collected network information.
The method for obtaining text entities based on the topic includes: the topic is encoded, and the predicted text is labeled using a two-stream structure with fine-grained and coarse-grained single-identifier masks. An improved long short-term memory (LSTM) neural network is used for text entity recognition, and the neuron parameters are updated as follows:
$e_t = \sigma(R_e x_t + Q_e g_{t-1} + b_e)$
$a_t = \sigma(R_a x_t + Q_a g_{t-1} + b_a)$
$v_t = \sigma(R_v x_t + Q_v g_{t-1} + b_v)$
$g_t = \sigma_t \times \tanh(c_t)$
Where $e_t$, $a_t$, and $v_t$ are the forget gate, input gate, and output gate at time t; $g_{t-1}$ is the topic (the hidden-layer state) at the previous time step; $x_t$ is the topic input at time t; $R_e$ and $Q_e$ are the weight matrices of the forget gate, $R_a$ and $Q_a$ those of the input gate, and $R_v$ and $Q_v$ those of the output gate; $b_e$, $b_a$, and $b_v$ are the biases of the forget gate, input gate, and output gate; σ is the activation function; $\sigma_t$ is the activation function at time t; $c_t$ is the memory cell at time t; and $\tanh(c_t)$ is the hyperbolic tangent of the memory cell;
The encoded topic is segmented into words, and an efficient pointer decoder is used to exclude non-entity first and non-entity last characters from the matrix coordinates. The resulting encoded vector sequence is then used to calculate the score function for consecutive sequences of type c entities based on the entity score function and the vector sequence:
$W_c(i, j) = W_s(i, j) + (R_w)^{T} [x_i : x_j]$
Where $W_s(i, j)$ is the entity score function for the continuous sequence from i to j, and $(R_w)^{T} [x_i : x_j]$ is the score of a type-c entity whose sequence indices run from i to j. The results are sorted in descending order of score, and the top three to five are output as text entities;
The method for extracting keywords from the text entities to obtain the first data includes: calculating the word frequency of the text entities based on keywords, where $t_{i,j}$ is the number of times text entity i appears in entity text data j, $t_{k,j}$ is the number of times text entity k appears in entity text data j, and there are n text entities; calculating the inverse document frequency of each text entity, where D is the total number of pieces of information in the text data, $|\{j : t_i \in d_j\}|$ is the number of pieces of information containing text entity i, and $s_i$ is the inverse document frequency of text entity i; and obtaining the selection score from the term frequency and the inverse document frequency:
$w_i = V_{i,j} \cdot s_i$
Where $w_i$ is the selection score of text entity i and $V_{i,j}$ is its term frequency. The selection scores are sorted in descending order, and the first 8 text entities are output as the first data.
2. The network information collection method based on real-time text analysis according to claim 1, characterized in that, The preprocessing described in step A includes segmentation, word segmentation, stop word removal, part-of-speech tagging, punctuation removal, number removal, special character removal, traditional Chinese character conversion, pinyin removal, and text vectorization.
3. The network information collection method based on real-time text analysis according to claim 1, characterized in that, A method for extracting topics from the preprocessed real-time text data includes: Remove adverbs, compound words, and adjectives from the preprocessed real-time text data, and retain nouns to form a noun dictionary: The terminology dictionary is B, where each row represents a term corresponding to a historical retrieval data point. The dictionary has a length of m and contains n terms. The terms are matched against the dictionary to construct a high-dimensional sparse matrix. This sparse matrix is then decomposed into the product of a basis matrix and a coefficient matrix. The sparse matrix with r columns, m length, and n words is given by [formula missing]. The basis matrix with n rows and r columns is The coefficient matrix with r columns and m rows is The basis matrix is the set of topics, and the coefficient matrix U is the set of matched topic words. The process involves multiple iterations to reduce the dimensionality of the high-dimensional matrix, stopping the iteration when the following condition is met: Where the number of iterations is t, and any small real number is . The basis matrix of the (t+1)th iteration is The coefficient matrix of the (t+1)th iteration is The basis matrix of the t-th iteration is The coefficient matrix of the t-th iteration is The keyword is output as the extraction result.
4. The network information collection method based on real-time text analysis according to claim 1, characterized in that a method for obtaining the second data by performing syntactic analysis on the preprocessed real-time text data includes: The preprocessed real-time text data is decomposed into words according to grammatical rules, and a statistical method is used to perform grammatical parsing on the words. Based on the grammatical relationships between the words in a sentence, the grammatical structure of the sentence is constructed and the dependency relationships between the words are analyzed. The grammatical structure of the sentence is represented as a tree structure, which is then parsed and analyzed. The real-time text data after parsing and analysis is output as the second data.
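The tree representation mentioned in claim 4 can be sketched as follows. The claim does not name a parser, so the sketch assumes that some statistical dependency parser has already produced, for each word, the index of its head word and a relation label; the input triples, the relation labels, and the function name `build_tree` are hypothetical.

```python
def build_tree(tokens):
    """Build a nested dict tree from (word, head_index, relation) triples.

    head_index is the position of the governing word; -1 marks the root.
    """
    children = {i: [] for i in range(len(tokens))}
    root = None
    for i, (_, head, _) in enumerate(tokens):
        if head == -1:
            root = i
        else:
            children[head].append(i)
    if root is None:
        raise ValueError("no root token found")

    def node(i):
        word, _, relation = tokens[i]
        return {"word": word, "relation": relation,
                "children": [node(c) for c in children[i]]}

    return node(root)

# Hypothetical parser output for the sentence "system collects text".
parsed = [("system", 1, "nsubj"), ("collects", -1, "root"), ("text", 1, "obj")]
tree = build_tree(parsed)
```

Here the verb becomes the root of the tree, and its subject and object hang beneath it as child nodes.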
5. The network information collection method based on real-time text analysis according to claim 1, characterized in that the formulas for calculating the similarity between the first data and the second data are as follows: [formula missing], where the first data set is X, the real-time text data set is Y, the average value of the first data set is [formula missing], the average value of the real-time text data set is [formula missing], the i-th first data is [formula missing], the j-th real-time text data is [formula missing], the first data set has n data points, the second data set has m data points, and the similarity between first data i and real-time text data j is [formula missing]. The similarity for the second data is calculated in the same way.
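The similarity formula did not survive in this text; only its ingredients are named (two data sets, their mean values, and their individual elements). One common measure built from those ingredients is the Pearson correlation coefficient, so the sketch below assumes it. That choice, the requirement of equal-length vectors, and the function name `pearson_similarity` are assumptions and may differ from the patent's actual formula.

```python
import math

def pearson_similarity(x, y):
    """Pearson correlation of two equal-length numeric vectors (assumed measure)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0          # a constant vector carries no correlation signal
    return cov / (norm_x * norm_y)

# Hypothetical feature vectors for a first-data item and a real-time text item.
sim = pearson_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The result lies between -1 and 1, with values near 1 indicating that the two vectors rise and fall together.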
6. The network information collection method based on real-time text analysis according to claim 1, characterized in that a method for obtaining a classification target by weighting the similarity between the first data and the second data includes: Weights for the similarities of the first data and the second data are calculated using the analytic hierarchy process (AHP) and the entropy method, respectively, and the two sets of weights are then combined into comprehensive weights: [formula missing], where the weights calculated by the analytic hierarchy process are [formula missing], the weights calculated by the entropy method are [formula missing], and the weight of the j-th similarity is [formula missing]. The weights are output.
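The weighting step in claim 6 can be sketched under several assumptions, since the formulas are missing here. The AHP weights are approximated by the normalized geometric means of the rows of a pairwise comparison matrix, the entropy weights follow the usual entropy weight method on column-normalized data, and the two are combined by normalizing their element-wise product. All three choices, the function names, and the sample numbers are illustrative assumptions.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights by row geometric means."""
    means = [math.prod(row) ** (1 / len(row)) for row in pairwise]
    total = sum(means)
    return [g / total for g in means]

def entropy_weights(data):
    """Entropy weight method; rows are samples, columns are indicators.

    Needs at least two samples and positive column sums.
    """
    n = len(data)
    divergences = []
    for column in zip(*data):
        total = sum(column)
        p = [v / total for v in column]
        entropy = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)
        divergences.append(1 - entropy)   # more dispersion gives more weight
    total = sum(divergences)
    return [d / total for d in divergences]

def combined_weights(ahp, entropy):
    """Assumed combination rule: normalized element-wise product."""
    products = [a * e for a, e in zip(ahp, entropy)]
    total = sum(products)
    return [p / total for p in products]

# Hypothetical inputs: the first similarity is judged 3 times as important
# as the second, and three observed samples of both similarities.
a = ahp_weights([[1, 3], [1 / 3, 1]])
e = entropy_weights([[0.9, 0.4], [0.1, 0.6], [0.5, 0.5]])
w = combined_weights(a, e)
```

The AHP part encodes subjective judgment and the entropy part reflects how much each similarity actually varies in the data, which is the usual motivation for combining the two.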
7. The network information collection method based on real-time text analysis according to claim 1, characterized in that a text classification model based on a long short-term memory (LSTM) neural network is constructed according to the classification target. The preprocessed real-time text data is randomly divided into a training set and a test set in a 3:2 ratio. The training set is input into the text classification model for training, and the test set is input into the trained text classification model. The parameters of the text classification model are continuously optimized until the accuracy and efficiency for the classification target are both higher than 0.91, at which point training stops.
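The data split and stopping rule in claim 7 can be sketched independently of the model itself. In the sketch below, `train_step` and `evaluate` are hypothetical callables standing in for one round of LSTM training and for test-set evaluation; the claim does not define how "efficiency" is measured, so it is treated simply as a second metric returned by `evaluate`. The round cap `max_rounds` is also an assumption.

```python
import random

def split_3_to_2(samples, seed=0):
    """Randomly split samples into training and test sets in a 3:2 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 3 // 5
    return shuffled[:cut], shuffled[cut:]

def train_until_threshold(train_step, evaluate, threshold=0.91, max_rounds=100):
    """Run train_step() until both metrics from evaluate() exceed threshold.

    Returns the number of rounds run, or None if max_rounds is reached first.
    """
    for round_no in range(1, max_rounds + 1):
        train_step()
        accuracy, efficiency = evaluate()
        if accuracy > threshold and efficiency > threshold:
            return round_no
    return None

# Toy stand-in for a model: both metrics improve by 0.125 per round.
state = {"score": 0.5}

def fake_step():
    state["score"] += 0.125

def fake_eval():
    return state["score"], state["score"]

train_set, test_set = split_3_to_2(range(10))
rounds = train_until_threshold(fake_step, fake_eval)
```

With ten samples the split yields six training and four test items, and the toy model crosses the 0.91 threshold on its fourth round.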
8. A network information collection system based on real-time text analysis, used to perform the method according to any one of claims 1-7, characterized in that it comprises: a data acquisition module, used to acquire real-time text data, preprocess the real-time text data, and obtain first data and second data based on the preprocessed real-time text data, wherein: the preprocessed real-time text data is subjected to topic extraction, text entities are obtained based on the topic, and keywords are extracted from the text entities to obtain the first data; and the preprocessed real-time text data is subjected to syntactic analysis to obtain the second data; a similarity module, used to calculate the similarity between the first data and the second data, and to weight the similarity between the first data and the second data to obtain the classification target; and a collection module, used to construct a text classification model based on the classification target, input the real-time text data into the text classification model to obtain classification data, and output the classification data as the collected network information.