Bag-of-words model text representation method and system based on fault-tolerant rough set, and medium

A text representation technology based on the bag-of-words model, applied in unstructured text data retrieval, text database clustering/classification, special data processing applications, etc. It addresses the sparsity and lack of semantic association of the bag-of-words model, with the effect of reducing sparsity.

Active Publication Date: 2020-05-05
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

The bag-of-words model suffers from sparsity and a lack of semantic association.
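To make the two drawbacks concrete, here is a minimal sketch of a plain count-based bag-of-words representation. The helper names `bag_of_words` and `sparsity` and the two toy documents are illustrative assumptions, not part of the patent. Two documents about the same topic that share no words get vectors that are half zeros and have a dot product of zero.

```python
from collections import Counter


def bag_of_words(docs):
    """Build a count matrix: one row per document, one column per vocabulary word.

    Tokenization is a plain whitespace split (an assumption for this sketch).
    """
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def sparsity(rows):
    """Fraction of zero cells in the count matrix."""
    cells = [value for row in rows for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


# Toy corpus (hypothetical): same topic, no shared vocabulary.
vocab, rows = bag_of_words(["car engine repair", "automobile motor service"])
dot = sum(a * b for a, b in zip(rows[0], rows[1]))

print(vocab)           # ['automobile', 'car', 'engine', 'motor', 'repair', 'service']
print(sparsity(rows))  # 0.5
print(dot)             # 0, the model sees no association between the documents
```

The zero dot product is the "lack of semantic association": synonyms such as "car" and "automobile" occupy unrelated columns.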



Examples


Embodiment Construction

[0070] The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the invention.

[0071] The technical solution by which the present invention solves the above technical problems is as follows:

[0072] Referring to figure 1, the present invention aims to solve the problems of the traditional bag-of-words model. It proposes a text representation method that overcomes the model's sparsity and lack of semantic association, makes the text representation dense, and mines the latent semantics of the text. The technical scheme of the present invention is as follows:

[0073] A text representation method based on a fault-tolerant rough set bag-of-words model, which comprises the following steps:

[0074] 1. Perform data preprocessing on the corpus: (1) remove the header and...


Abstract

The invention discloses a bag-of-words model text representation method based on a fault-tolerant rough set, comprising the following steps: 101, a computer acquires a corpus and performs a preprocessing operation on it; 102, the computer determines the base word set of the corpus; 103, the computer obtains the uncertainty function (fault-tolerant class) of each base word and constructs an uncertainty function matrix; 104, the computer calculates the fuzzy membership degree of each base word with respect to each document and constructs a fuzzy membership matrix; 105, based on the fuzzy membership matrix, the computer obtains the upper approximation and lower approximation of each document and constructs upper and lower approximation matrices; 106, the computer constructs the text representation matrix of the corpus based on two weight calculation schemes. The method mainly applies a fault-tolerant rough set model to calculate the fuzzy membership degree of the words in a corpus with respect to each document, obtains the upper and lower approximations of the documents, and constructs a corresponding weight calculation method for document representation, in line with current technical development requirements.
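Steps 103 to 105 can be sketched as follows. This is an illustration only. It assumes the formulation commonly used in tolerance rough set models for text: a word's fault-tolerant class holds the words that co-occur with it in at least `theta` documents, and membership is the fraction of that class present in the document. The patent's exact definitions are not given in this excerpt and may differ, and the function names and the threshold `theta` are hypothetical. The two weight calculation schemes of step 106 are not specified here, so the sketch stops at the approximations.

```python
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each word to its tolerance class.

    Assumed definition: a class contains the word itself plus every word
    that co-occurs with it in at least `theta` documents.
    `docs` is a list of sets of words.
    """
    vocab = sorted(set().union(*docs))
    cooccurrence = {}
    for doc in docs:
        for a, b in combinations(sorted(doc), 2):
            cooccurrence[(a, b)] = cooccurrence.get((a, b), 0) + 1
    classes = {word: {word} for word in vocab}
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def membership(word_class, doc):
    """Assumed fuzzy membership: share of the tolerance class found in the document."""
    return len(word_class & doc) / len(word_class)


def approximations(classes, doc):
    """Lower approximation: membership equals 1. Upper approximation: membership above 0."""
    lower = {w for w, c in classes.items() if membership(c, doc) == 1.0}
    upper = {w for w, c in classes.items() if membership(c, doc) > 0.0}
    return lower, upper


# Toy corpus (hypothetical): "rough" and "set" co-occur in two documents.
corpus = [{"rough", "set", "model"}, {"rough", "set", "text"}, {"text", "model"}]
classes = tolerance_classes(corpus, theta=2)
lower, upper = approximations(classes, {"rough", "text"})

print(sorted(lower))  # ['text']
print(sorted(upper))  # ['rough', 'set', 'text']
```

In this toy run the upper approximation adds "set" to a document that never contains it, which is how such a model can make a representation denser and bring in related words.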

Description

Technical field

[0001] The invention belongs to the fields of machine learning, artificial intelligence technology and natural language processing, and is a text representation method in natural language processing.

Background technique

[0002] Text representation is the basis of Natural Language Processing (NLP). With the explosive growth of the Internet, countless text data continue to accumulate. However, unlike numeric data, which is a structured data type, document or text data is unstructured data. Unstructured data is not suitable for direct application in machine learning or deep learning algorithms. As the basis of natural language processing and text mining tasks, efficient text or document representation is particularly important. A major challenge in document representation is how to transform unstructured text data into structured data. For an effective document representation, it not only needs to truly reflect the content of the document, but also has the...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F16/35, G06F16/335, G06F16/36
CPC: G06F16/35, G06F16/36, G06F16/335
Inventor: 邱东, 江海欢, 闫睿腾
Owner: CHONGQING UNIV OF POSTS & TELECOMM