Vectorization method and device of text

A text vectorization method and device, applied in the field of text vectorization, that addresses problems such as the inability to restore semantics, poor fault tolerance, and adverse effects on machine learning results, and that avoids cascading errors while achieving good fault tolerance.

Status: Inactive. Publication date: 2018-09-25
BEIJING DIDI INFINITY TECH & DEV

AI Technical Summary

Problems solved by technology

[0007] 1. Existing techniques generally depend on Chinese word segmentation. Segmentation works well on written sentences but poorly on colloquial sentences such as public opinion comments, where it introduces a considerable number of errors. Because these errors cascade through later processing steps, they strongly affect the final machine learning results. Reliance on Chinese word segmentation also means poor fault tolerance for colloquial sentences such as public opinion comments.
[0008] 2. TF-IDF-style text vectorization produces vectors of high dimension, ranging from tens of...


Embodiment Construction

[0068] The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present disclosure, not all of them. All other embodiments that persons of ordinary skill in the art obtain from these embodiments without creative effort fall within the protection scope of the present disclosure.

[0069] Some of the terms used in the embodiments of the present disclosure are explained below.

[0070] The user equipment (UE) mentioned in the embodiments of the present disclosure refers to equipment such as a mobile terminal or a personal computer (PC). Examples include smartphones, personal digital assistants (PDAs), tablets, laptops, in-vehicle computers, handheld game consoles, ...


Abstract

The invention discloses a text vectorization method and device and relates to the field of text vectorization. The method includes: obtaining the text to be processed and determining its application type to obtain a sample of the text; extracting all single-character elements of the sample to obtain the sample's single-character set; extracting double-character elements of the sample, according to the sample's application type, to obtain its double-character set; combining the single-character set and the double-character set to obtain a word list; and building the text vectors of the text according to the word list. The method and device omit Chinese word segmentation, which avoids the errors that segmentation introduces into colloquial sentences such as public opinion comments, along with their subsequent cascading effects. The method and device also tolerate wrongly written characters in such colloquial sentences well.
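The steps in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the excerpt does not say how the application type controls double-character extraction or how the vector values are computed. Here a hypothetical `min_bigram_count` threshold stands in for the application-type-dependent rule, and the vector holds raw occurrence counts. The function names `build_vocabulary` and `vectorize` are likewise invented for this example.

```python
from collections import Counter


def build_vocabulary(samples, min_bigram_count=2):
    """Build a term -> position index from single characters and adjacent character pairs.

    min_bigram_count is a hypothetical stand-in for the application-type-dependent
    rule that selects double-character elements; the source does not specify it.
    """
    singles = set()
    bigram_counts = Counter()
    for text in samples:
        # Drop whitespace so that pairs are formed from visible characters only.
        chars = [c for c in text if not c.isspace()]
        singles.update(chars)
        bigram_counts.update(a + b for a, b in zip(chars, chars[1:]))
    # Assumption: keep only pairs seen at least min_bigram_count times.
    doubles = {bg for bg, n in bigram_counts.items() if n >= min_bigram_count}
    # The word list is the single-character set followed by the double-character set.
    word_list = sorted(singles) + sorted(doubles)
    return {term: i for i, term in enumerate(word_list)}


def vectorize(text, index):
    """Count occurrences of each word-list entry in text (assumed weighting: raw counts)."""
    chars = [c for c in text if not c.isspace()]
    terms = chars + [a + b for a, b in zip(chars, chars[1:])]
    vector = [0] * len(index)
    for term in terms:
        position = index.get(term)
        if position is not None:  # characters and pairs outside the word list are ignored
            vector[position] += 1
    return vector


samples = ["司机很好", "司机迟到"]
index = build_vocabulary(samples, min_bigram_count=2)
vector = vectorize("司机好", index)
```

Because no word segmenter is involved, a mistyped character changes only the few single-character and pair entries that contain it, and the rest of the vector is unaffected.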

Description

Technical field

[0001] The present invention relates to the field of text vectorization, and in particular to a text vectorization method and device.

Background

[0002] Machine learning algorithms take a vector as input and produce either a continuous or a discrete value as output. Text classification and clustering are very important applications of machine learning, and text vectorization is their first step, so it directly determines the quality of the final machine learning result.

[0003] Existing text vectorization techniques are as follows:

[0004] TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and data mining. The dimension of the sentence vector equals the size of the vocabulary, and the value of each dimension is the weight calculated by the TF-IDF method of the word corresponding to the...
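To make the TF-IDF sentence vector in [0004] concrete, here is a minimal Python sketch. It assumes one common variant of the weighting, with term frequency as count divided by document length and inverse document frequency as log(N / df). The source does not state which variant it means. The inputs are assumed to be already segmented into word lists, and the function name `tfidf_vectors` is made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of pre-segmented documents.

    Each vector has one dimension per vocabulary word. Assumed weighting:
    tf = count / document length, idf = log(N / document frequency).
    """
    vocabulary = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocabulary)}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vector = [0.0] * len(vocabulary)
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            vector[index[term]] = tf * idf
        vectors.append(vector)
    return vocabulary, vectors


docs = [["good", "driver"], ["late", "driver"]]
vocabulary, vectors = tfidf_vectors(docs)
```

The vector length equals the vocabulary size, which is why this representation grows quickly on real corpora. A word that appears in every document, such as "driver" here, gets weight zero under this variant.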


Application Information

IPC(8): G06F17/30
Inventors: 刘家兵, 刘永波, 吴春龙, 张少松
Owner BEIJING DIDI INFINITY TECH & DEV