Junk corpus screening method, system and device based on LGBM model and BTM model

A screening method based on an LGBM model and a BTM model, applicable to character and pattern recognition, natural language data processing, instruments, and related fields. It addresses the influence of subjective factors in related technologies, reduces the workload and cost of manual labeling, and preserves inference speed.

Pending Publication Date: 2021-03-26
BEIJING XUEZHITU NETWORK TECH

AI Technical Summary

Problems solved by technology

[0008] The embodiments of the present application provide a junk corpus screening method based on the LGBM model and the BTM model, to at least address the influence of subjective factors in related technologies.


Examples

Embodiment 1

[0056] Referring to Figures 1 to 3, this example discloses a specific implementation of a junk corpus screening method based on the LGBM model and the BTM model (hereinafter referred to as the "method").

[0057] Referring specifically to Figures 1 and 2, the method disclosed in this embodiment mainly includes the following steps:

[0058] Step S1: extract comments on commodities to obtain comment data.

[0059] Specifically, in some of the embodiments, texts that meet the analysis conditions are selected from the massive text database of the e-commerce platform, and comment data is extracted separately for different categories of commodities. For example, data extraction is performed on products such as "milk" and "cosmetics".

[0060] Step S2 is then executed: the BTM model is used to perform topic mining on the comment data, and the high-frequency words in spam comments are summarized from the mining results.
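
A rough sketch of the raw material that step S2 works with, assuming BTM here refers to the biterm topic model, which learns topics from unordered word pairs (biterms) that co-occur in a short text. The source does not specify a tokenizer or a BTM implementation, so the function names below and the whitespace tokenization are hypothetical. Topic inference itself is not shown, and the comments passed to `high_frequency_words` are assumed to have already been grouped under a spam-like topic.

```python
from collections import Counter
from itertools import combinations


def tokenize(comment):
    # Assumption: whitespace tokenization. Chinese reviews would need
    # a proper word segmenter, which the source does not name.
    return comment.lower().split()


def biterms(tokens):
    # Unordered pairs of distinct words that co-occur in one short text.
    # Simplification: repeated words in a comment are collapsed first.
    return list(combinations(sorted(set(tokens)), 2))


def biterm_counts(comments):
    # Corpus-level biterm frequencies, the input a biterm topic model fits on.
    counts = Counter()
    for comment in comments:
        counts.update(biterms(tokenize(comment)))
    return counts


def high_frequency_words(comments, k):
    # Count each word once per comment, then keep the k most common.
    counts = Counter()
    for comment in comments:
        counts.update(sorted(set(tokenize(comment))))
    return [word for word, _ in counts.most_common(k)]
```

Counting each word once per comment keeps a single padded review from dominating the list; whether the patent counts words this way is not stated.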

[0061] Specifically, in some of the e...

Embodiment 2

[0074] In combination with the junk corpus screening method based on the LGBM model and the BTM model disclosed in Embodiment 1, this embodiment discloses a specific implementation of a junk corpus screening system based on the LGBM model and the BTM model (hereinafter referred to as "the system").

[0075] Referring to Figure 4, the system includes:

[0076] The extracting module 100, which extracts comments on commodities to obtain comment data;

[0077] The mining module 200, which uses the BTM model to perform topic mining on the comment data and summarizes the high-frequency words of spam comments from the mining results;

[0078] The training module 300, which trains an LGBM model based on the comment data and the high-frequency words of the spam comments;

[0079] The screening module 400, which uses the trained LGBM model to screen out the spam comment corpus.
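
The training and screening modules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patent's implementation: the source does not say which features the LGBM model is trained on, so the four features in `build_features` are hypothetical, as are the function names and the convention that label 1 means spam. Training relies on the third-party `lightgbm` package and its scikit-learn style `LGBMClassifier`.

```python
def build_features(comment, spam_words):
    # Hypothetical features derived from the spam high-frequency words:
    # token count, spam-word hits, spam-word ratio, unique-token ratio.
    tokens = comment.lower().split()
    n = len(tokens)
    hits = sum(1 for t in tokens if t in spam_words)
    return [
        float(n),
        float(hits),
        hits / n if n else 0.0,
        len(set(tokens)) / n if n else 0.0,
    ]


def train_screening_model(comments, labels, spam_words):
    # Training module 300 (sketch). Imported lazily so the feature code
    # above can be used without lightgbm installed.
    from lightgbm import LGBMClassifier

    X = [build_features(c, spam_words) for c in comments]
    model = LGBMClassifier(n_estimators=100)
    model.fit(X, labels)
    return model


def screen(model, comments, spam_words):
    # Screening module 400 (sketch): split comments into kept and spam.
    # Assumption: the model predicts 1 for spam and 0 otherwise.
    X = [build_features(c, spam_words) for c in comments]
    preds = model.predict(X)
    kept = [c for c, p in zip(comments, preds) if p != 1]
    spam = [c for c, p in zip(comments, preds) if p == 1]
    return kept, spam
```

Feeding the mined high-frequency words into hand-built features is one plausible way to combine the two models while keeping labeling effort low; the patent text available here does not confirm this exact coupling.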

[0080] Specifically, in some of these embodiments, a review classification module 500 is also include...

Embodiment 3

[0088] Referring to Figure 5, this embodiment discloses a specific implementation of a computer device. The computer device may comprise a processor 81 and a memory 82 storing computer program instructions.

[0089] Specifically, the processor 81 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.

[0090] The memory 82 may include mass storage for data or instructions. By way of example and not limitation, the memory 82 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 82 may comprise removable or non-r...


Abstract

The invention discloses a junk corpus screening method, system and device based on an LGBM model and a BTM model. The method comprises: extracting comments on different types of commodities to obtain comment data; performing topic mining on the comment data using a BTM model, and summarizing the high-frequency words of spam comments from the mining results; training an LGBM model based on the comment data and the spam comment high-frequency words; and screening out the spam comment corpus using the trained LGBM model. The invention can screen out junk comments that are irrelevant to the commented commodities while preserving inference speed and reducing manual annotation.

Description

Technical field

[0001] The invention relates to the fields of computer applications and natural language processing. More specifically, the present invention relates to a junk corpus screening method, system and device based on an LGBM model and a BTM model.

Background

[0002] With the development of e-commerce, a large number of user comments on commodities have been generated on the Internet. These review texts are an important corpus for mining consumer opinions on products. However, because e-commerce platforms reward users for posting reviews, some users produce large amounts of spam text when reviewing products, for example by padding the word count or copying irrelevant content, which interferes with effective mining of the data. Therefore, filtering spam comments out of a large amount of data, and keeping the valuable comments for subsequent consumer opinion mining, is very important.

[0003] At present, natural language processing, deep learning and other ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/33, G06F40/216, G06F40/258, G06F40/289, G06K9/62
CPC: G06F40/216, G06F40/258, G06F40/289, G06F16/3344, G06F18/214
Inventor: 王东海, 卫海天
Owner: BEIJING XUEZHITU NETWORK TECH