
Text classification screening method using LDA

A screening method and text classification technology, applied in text database clustering/classification, text database indexing, unstructured text data retrieval, etc. It addresses problems such as insufficient data quality and high labeling cost, and achieves an excellent classification effect while saving cost.

Pending Publication Date: 2021-04-16
SHANGHAI GOLDEN BRIDGE INFOTECH CO LTD

AI Technical Summary

Problems solved by technology

In many cases no labeled data is readily available, so how to supply the model with data of the highest possible quality becomes a concern.
[0003] Training a model is inseparable from data, but in many cases there is not enough usable data (the data quality is too low, or the cost of labeling is too high). The industry has therefore proposed so-called unsupervised learning, but it is still rarely used in practice; more often, the remedy is to add more training samples.



Examples


Embodiment Construction

[0033] The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

[0034] The core model used in the present invention is the LDA (Latent Dirichlet Allocation) topic model, and a series of steps and strategies for data screening are designed around it. The main principles of the LDA model are:

[0035] The LDA model is a three-layer Bayesian topic model. It uses unsupervised learning to discover the hidden topic information in text, that is, the latent "topics" or "concepts". The essence of latent semantic analysis is to use the co-occurrence of terms in the text to discover the text's topic structure. This method does not ...
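To make the three layers (documents, topics, words) concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in pure Python. The patent text shown here does not say which inference algorithm or library is used, so the sampler, the function name `train_lda`, and the hyperparameters `alpha`, `beta`, and `iters` are all illustrative assumptions.

```python
import random

def train_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA on tokenized docs (lists of words) by collapsed Gibbs sampling.

    Hypothetical helper for illustration; hyperparameters are assumptions.
    Returns (vocab, theta, phi): theta[d][k] is the topic mix of document d,
    phi[k][v] is the word distribution of topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics
    doc_topic = [[0] * K for _ in docs]        # count of topic k in document d
    topic_word = [[0] * V for _ in range(K)]   # count of word v in topic k
    topic_total = [0] * K                      # total words assigned to topic k
    assign = []                                # topic assignment of every token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)               # random initial topic
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    theta = [[(c + alpha) / (len(doc) + K * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [[(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
           for k in range(K)]
    return vocab, theta, phi

# Toy corpus (made up for illustration only).
docs = [
    "bank loan interest rate".split(),
    "interest rate bank credit".split(),
    "match goal team player".split(),
    "team player goal score".split(),
]
vocab, theta, phi = train_lda(docs, num_topics=2)
```

Each row of `theta` and each row of `phi` is a probability distribution, which is what makes the model "three-layer": a document is a mixture of topics, and a topic is a mixture of words learned only from term co-occurrence.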


Abstract

The invention provides a text classification screening method using LDA. The method comprises: obtaining a data set comprising a plurality of short sentences; preprocessing the data with natural language processing methods to clean and organize it; determining a topic and manually selecting sentences that fit the topic; building a text vector matrix from the selected sentences using a bag-of-words model; training a first LDA model on the vector matrix; screening the remaining sentences with the first LDA model, by calculating the correlation between the text set and the topic words obtained from the first LDA model and using that correlation as the threshold for judging whether a sentence fits the selected topic; adding the sentences that pass the topic-correlation screening and training a second LDA model; using the second LDA model to judge and screen the remaining sentences by cosine similarity; and taking the sentences selected across all three screenings as the text data that meets the screening target.
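Two building blocks of this pipeline, the bag-of-words vector matrix and the cosine-similarity screen, can be sketched as follows. This is a simplified illustration under stated assumptions: it compares raw bag-of-words counts against the summed seed vector instead of the LDA topic representations the abstract describes, and the helper names, whitespace tokenization, and the `0.5` threshold are hypothetical choices, not values from the patent.

```python
import math
from collections import Counter

def bag_of_words(sentences, vocab=None):
    """Build a count matrix; words outside a supplied vocab are ignored."""
    tokenized = [s.lower().split() for s in sentences]  # assumed tokenizer
    if vocab is None:
        vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for w, count in Counter(toks).items():
            if w in index:
                row[index[w]] = count
        matrix.append(row)
    return vocab, matrix

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def screen(sentences, vectors, reference, threshold):
    """Keep sentences whose vector is at least `threshold` similar to reference."""
    return [s for s, v in zip(sentences, vectors)
            if cosine(v, reference) >= threshold]

# Manually selected seed sentences and remaining candidates (toy data).
seed = ["the bank raised interest rates", "interest rates affect bank loans"]
rest = ["bank loans carry interest", "the cat sat on the mat"]

vocab, seed_matrix = bag_of_words(seed)
_, rest_matrix = bag_of_words(rest, vocab)
reference = [sum(col) for col in zip(*seed_matrix)]  # summed seed vector
kept = screen(rest, rest_matrix, reference, threshold=0.5)
```

With this toy data, only the finance-related candidate passes the threshold. In the method described above, the vectors being compared would come from the trained LDA models rather than from raw word counts.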

Description

Technical field

[0001] The invention relates to the field of natural language processing. It can effectively screen sentences that match selected topics, prepare data sets for various machine learning algorithms, or perform text classification.

Background

[0002] Machine learning is now widely used in many fields. For models that process natural language, however, it is often necessary to fix a specific topic in advance and train the model on it. Training such a model requires a human-labeled data set to ensure model quality. In many cases no labeled data is readily available, so how to supply the model with data of the highest possible quality becomes a concern.

[0003] Training a model is inseparable from data, but in many cases there is not enough usable data (the data quality is too low, or the cost of labeling is too high). The industry has therefore proposed so-called unsupervised learning, but it is still rarely used in practice, and mor...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F16/335, G06F16/31, G06F40/211, G06F40/216, G06F40/242, G06F40/30, G06K9/62, G06N20/00
Inventor: 赵博, 吕建文, 周兴晖, 陈力, 薛柔月, 金鑫, 蒋尚秀
Owner: SHANGHAI GOLDEN BRIDGE INFOTECH CO LTD