System for generating feature vectors for long texts to realize classification

The technology generates feature vectors for long texts in order to classify them. It targets the problems of existing network information extraction systems, namely their dependence on a master node, their poor stability, and the paralysis of the whole extraction system when that node fails, and it reduces data complexity.

Status: Inactive
Publication Date: 2019-11-05
上海鸿翼软件技术股份有限公司

AI Technical Summary

Problems solved by technology

Existing network information extraction systems mainly target enterprise-level users; they offer a single function and are neither flexible nor easy to use.
Extraction systems built on a distributed network are also unstable: the system depends on the master node, so if the master node fails, the entire extraction system is paralyzed.
Moreover, because every slave node must communicate with the master node, resources are allocated sequentially with no unified resource scheduling, which results in poor system performance.



Examples


Embodiment 1

[0029] Embodiment 1: The following is an example application scenario of a system that generates feature vectors for long texts in order to classify them.

[0030] The system for generating feature vectors for long texts to realize classification comprises the following modules:

[0031] a data preprocessing module, a word vector calculation module, a high-dimensional clustering module, and a long text classification module.

[0032] The data preprocessing module includes a word segmentation module and a module for removing irrelevant words from the text. First, the original text data is segmented into words based on a Trie tree. Then irrelevant words are removed from the text, with high-frequency and low-frequency segmented words handled differently: meaningless function words, prepositions, pronouns and other stop words are removed from the high-frequency segmented words, and performi...
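The two preprocessing steps named in [0032] can be illustrated with a minimal sketch. It is not the patent's implementation: the forward longest-match strategy, the names `Trie`, `segment` and `remove_stop_words`, and the toy dictionary and stop-word list are all assumptions made for illustration, and the separate handling of high-frequency and low-frequency words is left out because the source text is truncated at that point.

```python
class Trie:
    """Dictionary trie used for word segmentation (hypothetical helper)."""

    END = "$end"  # marker key; cannot collide with single-character keys

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self.END] = True  # a complete dictionary word ends here

    def longest_match(self, text, start):
        """Return the end index of the longest dictionary word that begins
        at `start`; fall back to a single character if nothing matches."""
        node = self.root
        best = start + 1
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self.END in node:
                best = i + 1
        return best


def segment(text, trie):
    """Forward longest-match segmentation (an assumed strategy)."""
    tokens = []
    pos = 0
    while pos < len(text):
        end = trie.longest_match(text, pos)
        tokens.append(text[pos:end])
        pos = end
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


# Toy data, assumed for illustration only.
VOCAB = ["the", "long", "text", "longtext", "classifier"]
STOP_WORDS = {"the"}

trie = Trie(VOCAB)
tokens = segment("thelongtextclassifier", trie)
filtered = remove_stop_words(tokens, STOP_WORDS)
```

Here `tokens` is `["the", "longtext", "classifier"]`: the trie prefers the longer entry `longtext` over `long`, and the stop-word filter then drops `the`.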


Abstract

The invention relates to a system for generating feature vectors for long texts to realize classification. The system comprises a data preprocessing module, a word vector calculation module, a high-dimensional clustering module and a long text classification module. The data preprocessing module deletes redundant and invalid data, which reduces data complexity and processing dimension and improves performance and result accuracy. The word vector calculation module computes word vectors using an improved dynamic-dimension Skip-Gram algorithm, and the high-dimensional clustering module then clusters those word vectors. The long text classification module classifies the long texts according to the clustering result.
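One way to picture how clustered word vectors could yield a feature vector for a long text is sketched below. Everything specific in it is an assumption: the hand-written 2-D vectors stand in for the output of the Skip-Gram module, plain k-means stands in for the high-dimensional clustering module, and the normalized cluster histogram is only one plausible document representation, since the abstract does not say how the clustering result is turned into a classification input.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic start (first k points).
    Returns the cluster index assigned to each point."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def document_features(tokens, word_to_cluster, k):
    """Normalized histogram of cluster ids over the tokens of one text.
    Tokens without a word vector are ignored."""
    counts = [0] * k
    for token in tokens:
        if token in word_to_cluster:
            counts[word_to_cluster[token]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * k


# Toy word vectors, assumed for illustration (not trained by Skip-Gram).
WORD_VECTORS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "stock": (0.1, 0.9),
    "market": (0.2, 0.8),
}

K = 2
words = list(WORD_VECTORS)
labels = kmeans([WORD_VECTORS[w] for w in words], K)
word_to_cluster = dict(zip(words, labels))

features = document_features(["goal", "match", "stock", "unknown"], word_to_cluster, K)
```

The resulting fixed-length vector has one entry per cluster regardless of text length, so any standard classifier could consume it; choosing that classifier is outside what the abstract specifies.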

Description

Technical field

[0001] The invention relates to the technical field of the Internet, and is a system for classifying long texts by generating feature vectors.

Background technique

[0002] With the advent of the Internet age, Internet information data is growing at an extremely fast rate. With the development of big data, there is an urgent need for a fast, large-scale and stable method of obtaining Internet information, so network information extraction systems have very broad application prospects. Most traditional network information extraction methods are based on static analysis of a page: they extract the link tags in the page in order to obtain links to other pages. These network information extraction systems mainly target enterprise-level users; they offer a single function and are neither flexible nor easy to use. Extraction systems built on a distributed network are also unstable. The system depends on the master node. ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F17/27, G06K9/62
CPC: G06F16/35, G06F18/23, G06F18/2411
Inventor: 龙凌云, 张华
Owner: 上海鸿翼软件技术股份有限公司