
A text processing method and system

A text processing technology applied in the fields of digital data processing, natural language data processing, and special data processing applications. It addresses the problems that keyword statistics alone cannot achieve a good mining effect, that the needs of business personnel are not considered, and that texts vary in length.

Active Publication Date: 2017-02-15
GUOXIN YOUE DATA CO LTD

AI Technical Summary

Problems solved by technology

Structured data analysis cannot fully mine and discover the semantics in big data.
Unstructured text mining faces four challenges. First, language diversity makes maintenance difficult: texts express the same thing in many ways, irregular usages such as abbreviations are common, and every language expression has to be enumerated exhaustively. Second, business categories are numerous and their rules change quickly: every time a category changes, the language rules of all related categories must be reorganized, so the maintenance workload is huge and maintenance efficiency is low. Third, multiple languages must be processed simultaneously: texts in different languages need to be analyzed at the same time and rules must be established for each language separately, which requires maintenance personnel to master several languages and places too high a demand on them. Fourth, noise in the text makes classification difficult: texts differ in length and the correlations among them are intricate, so keyword statistics alone cannot achieve a good mining effect.
[0003] However, existing technologies generally apply statistical methods to text mining without considering the needs of business personnel; they provide only mining algorithms, which causes considerable trouble for business personnel.
The problem facing text mining technology is how to analyze and mine the valuable information that users care about from one or many unstructured texts, so that business personnel can define mining requirements and mining rules from a business perspective, regardless of the linguistic ambiguity caused by diverse habits of language expression in the text.



Examples


Embodiment 1

[0054] This embodiment provides a text processing method, which includes the following steps:

[0055] S1. Establish a classification hyperplane function.

[0056] In this embodiment, before text data can be predicted, a classification hyperplane function must first be established; the text data to be predicted is then classified through this function. For example, given two news items, one about basketball and one about diet, these two items can be used as training texts to obtain the classification hyperplane function. Text data to be predicted (later news items) is then passed through the function to determine whether it belongs to the basketball news or the diet news.
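As a minimal sketch of the idea in [0056], the snippet below fits a linear classification hyperplane f(x) = w·x + b on two toy training texts and uses its sign to classify later texts. The perceptron update rule, the whitespace tokenizer, the sample sentences, and the names `vectorize`, `train_hyperplane`, and `predict` are all illustrative assumptions; the embodiment does not give these details here, and the perceptron is only a stand-in for the support vector machine mentioned in the abstract.

```python
from collections import Counter

def vectorize(text, vocab):
    """Map a text to a word-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def train_hyperplane(samples, labels, epochs=20):
    """Fit w, b of f(x) = w.x + b with the perceptron rule (a stand-in, not an SVM)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: move the hyperplane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(text, vocab, w, b):
    """Classify a text by the side of the hyperplane it falls on."""
    x = vectorize(text, vocab)
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "basketball" if score > 0 else "diet"

# Hypothetical training texts: label +1 for basketball, -1 for diet.
train_texts = [
    "the team won the basketball game with a late dunk",
    "a balanced diet includes vegetables and fruit",
]
train_labels = [1, -1]
vocab = sorted({w for t in train_texts for w in t.lower().split()})
X = [vectorize(t, vocab) for t in train_texts]
w, b = train_hyperplane(X, train_labels)

print(predict("the basketball team lost the game", vocab, w, b))
print(predict("fruit and vegetables in your diet", vocab, w, b))
```

Words that never appeared in training (such as "lost") contribute nothing to the score, which is one reason a real system would need far more than two training texts.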

[0057] In this embodiment, the establishment of the classification hyperplane function may include the following steps:

[0058] S10: Perform word segmentation process...

Embodiment 2

[0110] This embodiment provides a text processing system, which includes:

[0111] A classification hyperplane function building module, used to establish the classification hyperplane function;

[0112] A text prediction module, used to predict text through the classification hyperplane function;

[0113] Wherein, the classification hyperplane function building module includes:

[0114] An entry-document matrix building unit, used to perform word segmentation processing on the text and establish an entry-document matrix;

[0115] A feature extraction unit, used to extract features from the entry-document matrix through a decision tree algorithm;

[0116] A classification hyperplane function construction unit, used to construct the classification hyperplane function.
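The feature extraction unit can be illustrated with a small sketch. The text does not say which decision tree algorithm is used, so the code below assumes an ID3-style criterion: each entry is scored by the information gain of splitting the documents on whether the entry occurs, and only the top-scoring entries are kept. The function names, the toy matrix, and the labels are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    ent = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        ent -= p * math.log2(p)
    return ent

def information_gain(row, labels):
    """Gain from splitting documents on whether an entry occurs (count > 0)."""
    present = [y for c, y in zip(row, labels) if c > 0]
    absent = [y for c, y in zip(row, labels) if c == 0]
    n = len(labels)
    remainder = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - remainder

def select_features(entries, matrix, labels, k):
    """Keep the k entries (matrix rows) with the highest information gain."""
    ranked = sorted(
        zip(entries, matrix),
        key=lambda pair: information_gain(pair[1], labels),
        reverse=True,
    )
    return [entry for entry, _ in ranked[:k]]

# Toy entry-document matrix: one row per entry, one column per document.
entries = ["ball", "diet", "the", "score"]
matrix = [
    [1, 2, 0, 0],  # "ball" occurs only in the sports documents
    [0, 0, 1, 3],  # "diet" occurs only in the food documents
    [1, 1, 1, 1],  # "the" occurs everywhere, so it carries no class information
    [1, 0, 0, 1],  # "score" is split evenly across classes
]
labels = ["sports", "sports", "food", "food"]
selected = select_features(entries, matrix, labels, k=2)
print(selected)
```

Dropping low-gain entries shrinks the dimension of the vectors passed on to the classifier, which is the benefit the abstract attributes to the decision tree step.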

[0117] In this embodiment, similar to Embodiment 1, the entry-document matrix building unit reads the text into the R language program, uses a word segmentation tool or user-defined word segmentation rules to split the text into...
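The following sketch shows one way an entry-document matrix could be built after word segmentation. The embodiment uses an R language program whose code is not shown; Python is used here only for illustration. The regex-based `segment` rule is a hypothetical user-defined segmentation rule suited to English text (Chinese text would need a dedicated segmentation tool), and `build_entry_document_matrix` is an assumed name.

```python
import re
from collections import Counter

def segment(text):
    """Hypothetical segmentation rule: lowercase, then split on non-alphanumeric runs."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def build_entry_document_matrix(docs):
    """Return (entries, matrix), where matrix[i][j] counts entry i in document j."""
    tokenized = [Counter(segment(d)) for d in docs]
    entries = sorted(set().union(*tokenized))
    matrix = [[counts[e] for counts in tokenized] for e in entries]
    return entries, matrix

docs = ["Basketball game tonight", "Healthy diet, healthy life"]
entries, matrix = build_entry_document_matrix(docs)
for entry, row in zip(entries, matrix):
    print(entry, row)
```

Each row is one entry and each column one document, so a row such as the one for "healthy" records that the entry is absent from the first document and occurs twice in the second.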


Abstract

The invention provides a text processing method and system. The text processing method comprises the steps of S1, establishing a classification hyperplane function; and S2, performing prediction on a newly input text via the classification hyperplane function. Step S1 specifically comprises the sub-steps of S10, performing word segmentation processing on a text and establishing an entry-document matrix; S20, extracting features from the entry-document matrix via a decision tree algorithm; and S30, constructing the classification hyperplane function. After word segmentation of a stored text, the sentence features of the text are extracted; because the features are extracted according to the decision tree algorithm, the number of dimensions of the model training points in the support vector machine is reduced and the training time is shortened. The feature vectors of texts are extracted according to decision tree training, and text classification is performed by a multi-kernel support vector machine according to the feature vectors, so that the method and the system have the advantages of accurate calculation, fewer model training samples, short training time and high text classification accuracy.
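Reading the abstract's support vector machine as a multiple-kernel one (an assumption), a common construction is a weighted sum of base kernels evaluated on the selected feature vectors. The sketch below combines a linear kernel and an RBF kernel and builds the resulting Gram matrix; the choice of base kernels, the weights, `gamma`, and all function names are hypothetical and are not taken from the patent.

```python
import math

def linear_kernel(x, z):
    """Plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def combined_kernel(x, z, weights=(0.5, 0.5)):
    """Weighted sum of base kernels; nonnegative weights keep it a valid kernel."""
    return weights[0] * linear_kernel(x, z) + weights[1] * rbf_kernel(x, z)

def gram_matrix(samples, kernel):
    """Pairwise kernel values, the input a kernel-based SVM solver works from."""
    return [[kernel(a, b) for b in samples] for a in samples]

# Toy feature vectors standing in for decision-tree-selected text features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = gram_matrix(features, combined_kernel)
for row in K:
    print([round(v, 4) for v in row])
```

A solver that accepts a precomputed Gram matrix could then be trained on `K`; how the kernel weights are chosen is a separate design decision that this sketch leaves fixed.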

Description

Technical field

[0001] The invention relates to the technical field of intelligent text information processing, in particular to a text processing method and system.

Background technique

[0002] 80% of social big data is unstructured data, and processing unstructured big data is the biggest challenge that big data faces. Structured data analysis cannot fully mine and discover the semantics in big data. The challenges of unstructured text mining include maintenance challenges brought about by language diversity: language expressions in text vary widely, irregular usages such as abbreviations are common, and all language expressions must be exhaustively enumerated, so maintaining the details of language expression is difficult. They also include maintenance challenges brought about by numerous business categories and fast-changing rules: there are many business categories and they change quickly. Every time a category changes, it is necessary to reorganize t...


Application Information

IPC(8): G06F17/30, G06F17/27, G06K9/62
CPC: G06F16/355, G06F16/36, G06F40/284, G06F18/2411
Inventor: 张斌德, 夏珺峥, 李彩虹
Owner: GUOXIN YOUE DATA CO LTD