
Text processing method and system and medium

A text processing technology, applied in the field of data processing, that addresses problems such as the erroneous exclusion of repeated content and the inability to distinguish long sentences from short sentences, and achieves the effects of improving robustness, reducing the risk of explosion, and improving clarity.

Inactive Publication Date: 2021-10-08
SAIC-GM-WULING AUTOMOBILE CO LTD
Cites: 5; Cited by: 1

AI Technical Summary

Problems solved by technology

[0004] The main purpose of the present invention is to provide a text processing method, system and medium that improve the robustness of text deduplication, avoid the defect that repeated content in short and ultra-short texts such as car reviews is easily excluded by mistake, and solve the problem that the traditional cosine vector method cannot distinguish long sentences from short sentences with repeated meanings.
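The long-versus-short problem named in [0004] can be reproduced with plain cosine similarity over word counts. The sketch below is only an illustration: the function name `cosine_similarity` and the English sample tokens are assumptions, not taken from the patent.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Plain cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A short comment and a long comment that repeats the same words three times.
short_text = ["engine", "noisy"]
long_text = ["engine", "noisy"] * 3

# The two count vectors point in the same direction, so the score is
# approximately 1.0 even though the texts differ greatly in length.
score = cosine_similarity(short_text, long_text)
```

Because cosine similarity depends only on the direction of the vectors, scaling every count by the same factor leaves the score unchanged, which is why length alone cannot separate the two comments.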



Examples


Detailed Description of the Embodiments

[0031] It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0032] At present, vehicle enterprises mostly deduplicate automobile user comments in one of two ways: absolute mapping of content, or conversion of the text into a vector space model (VSM) followed by similarity analysis on the resulting high-dimensional space vectors. Because the comments are short and their semantic structure is complex, the deduplication results are unstable. The locality-sensitive hashing method ignores sentence order, finds locally close words, and then determines duplicates by Hamming distance. It is effective for duplicate checking in long texts, but it very easily removes content by mistake in short and ultra-short texts such as car reviews.
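The Hamming-distance check described in [0032] can be sketched as follows. This is a minimal illustration that assumes SimHash as the locality-sensitive hash; the source does not name a specific scheme, and `simhash`, `hamming_distance`, and the 64-bit fingerprint width are hypothetical choices.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the source does not specify one


def simhash(tokens):
    """Build a 64-bit SimHash fingerprint from a list of tokens."""
    weights = [0] * FINGERPRINT_BITS
    for token in tokens:
        # Use the first 8 bytes of the MD5 digest as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(FINGERPRINT_BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Token order does not affect the fingerprint, matching the
# "ignores sentence order" behaviour described in the text.
fp_forward = simhash(["engine", "is", "noisy"])
fp_reversed = simhash(["noisy", "is", "engine"])
distance = hamming_distance(fp_forward, fp_reversed)
```

With only a handful of tokens, each token contributes a large share of the per-bit votes, so a single changed word can move many bits. That is one plausible reading of why the text calls this approach unreliable for short and ultra-short comments.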

[0033] In the present invention, in order to digitize the text, Chinese word segmentation and stop-word removal (stop words are high-frequency words that do not affect semantics) are used to con...


Abstract

The invention discloses a text processing method, system and medium. The method comprises the following steps: acquiring an automobile user comment text; performing word segmentation and stop-word removal on the automobile user comment text; extracting keywords from the text after word segmentation and stop-word removal to obtain a keyword extraction result; constructing a corresponding similarity vector space and vectorizing the automobile user comment text to obtain an ultra-high-dimensional vector; performing unbalanced cosine similarity analysis on the automobile user comment text based on the high-dimensional vector to obtain a phrase similarity; if the phrase similarity is greater than a preset threshold, taking the automobile user comment text as a text to be deleted; otherwise, retaining the automobile user comment text. The method improves the robustness of text deduplication, overcomes the defect that repeated content in short and ultra-short texts such as automobile comments is very easily eliminated by mistake, and solves the problem that a traditional cosine vector cannot distinguish long sentences from short sentences that repeat the same meaning.
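The sequence of steps in the abstract can be sketched roughly as below. Several parts are stand-ins and should be read as assumptions: whitespace splitting replaces Chinese word segmentation, a frequency-based `extract_keywords` replaces the unspecified keyword extractor, plain cosine similarity replaces the "unbalanced cosine similarity" (which this excerpt does not define), and the stop-word list and the 0.8 threshold are hypothetical values. Comparing each comment against the comments kept so far is also an assumption.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "very", "and"}  # hypothetical stop-word list


def preprocess(text):
    """Segment and drop stop words (whitespace split is a stand-in segmenter)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def extract_keywords(tokens, top_k=5):
    """Keep the top_k most frequent tokens (a stand-in keyword extractor)."""
    return [w for w, _ in Counter(tokens).most_common(top_k)]


def vectorize(tokens, vocabulary):
    """Map tokens to a count vector over the shared vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def cosine(u, v):
    """Plain cosine similarity; a placeholder for the unbalanced variant."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def deduplicate(comments, threshold=0.8):
    """Split comments into (kept, to_delete) by similarity to kept comments."""
    keyword_lists = [extract_keywords(preprocess(c)) for c in comments]
    vocabulary = sorted({w for kws in keyword_lists for w in kws})
    kept, to_delete, kept_vectors = [], [], []
    for comment, keywords in zip(comments, keyword_lists):
        vec = vectorize(keywords, vocabulary)
        if any(cosine(vec, kv) > threshold for kv in kept_vectors):
            to_delete.append(comment)
        else:
            kept.append(comment)
            kept_vectors.append(vec)
    return kept, to_delete


comments = [
    "The engine is very noisy",
    "engine noisy",
    "The seats are comfortable",
]
kept, to_delete = deduplicate(comments)
```

In this sample, the second comment reduces to the same keywords as the first and is marked for deletion, while the third is kept. A real implementation would swap in a Chinese segmenter and the patent's own similarity measure in place of the stand-ins.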

Description

Technical field

[0001] The present invention relates to the technical field of data processing, in particular to a text processing method, system and medium.

Background

[0002] At present, keyword extraction for car after-sales problems and forum user comments mostly relies on either the MD5 method of absolute content mapping or locality-sensitive hashing followed by cosine similarity analysis of high-dimensional space vectors. The traditional MD5 method detects identical text content very efficiently, but a slight change in the characters causes repeated content to go unrecognized. The locality-sensitive hashing method ignores the order of different sentence patterns, finds locally close words, and then determines duplicates by Hamming distance. This works for checking duplicates in long text segments. However, it very easily excludes content by mistake in short and ultra-short texts such as car reviews, and the robustness of dedupli...
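The MD5 behaviour described in the background can be seen in a few lines. The helper names `md5_fingerprint` and `exact_dedup` and the sample strings are illustrative assumptions; only `hashlib.md5` is a real library call.

```python
import hashlib


def md5_fingerprint(text):
    """Hex MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def exact_dedup(texts):
    """Keep the first occurrence of each text with a distinct MD5 digest."""
    seen = set()
    unique = []
    for text in texts:
        fingerprint = md5_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(text)
    return unique


# The exact repeat is removed, but the variant with one extra character
# survives because its digest is completely different.
result = exact_dedup(["engine noisy", "engine noisy", "engine noisy!"])
```

This is fast because each text is reduced to a fixed-size digest and checked with a set lookup, but any character-level difference, however small, produces a different digest.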


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/289, G06F40/216, G06F40/30
CPC: G06F40/289, G06F40/216, G06F40/30
Inventor: 王伟梁玮兰斌旋彭婧龙鲜菊
Owner: SAIC-GM-WULING AUTOMOBILE CO LTD