De-weighting method and apparatus for short text

A deduplication method and apparatus for short text, applied in the field of text processing, that addresses problems such as overly strict judgment conditions and achieves improved generalization ability and efficiency with a reduced amount of calculation.

Status: Inactive. Publication Date: 2017-04-19
BEIJING INTELLIGENT STEWARD CO LTD

AI Technical Summary

Problems solved by technology

[0004] In view of this, the present invention proposes a method and device for deduplication of short texts, which solves the problem of overly strict judgment conditions in text deduplication and improves the generalization ability and efficiency of short-text deduplication.

Examples

Embodiment 1

[0016] Figure 1 is a flowchart of a method for deduplication of short text in Embodiment 1 of the present invention. The method is used to deduplicate short text. It can be executed by a device with a document processing function, and the device can be implemented in software and/or hardware, for example a typical user terminal device such as a mobile phone or a computer. In this embodiment, the generalization relationship refers to the relationship between a general description and a specific description of an element, where the specific description builds on and extends the general description. Generalization refers to operating on elements to make them more general. The method for deduplication of short text in this embodiment includes steps S110, S120, S130 and S140.

[0017] Step S110, acquiring text string information of the short text.

[0018] Specifically, the user inputs a text string to be processed to ...

Embodiment 2

[0027] Figure 2 is a flowchart of a method for deduplicating short text in Embodiment 2 of the present invention. This embodiment further explains steps S120, S130 and S140 on the basis of Embodiment 1. In step S120, obtaining the keywords of the text string according to the word segmentation information of the text string includes: removing stop words from the word segmentation information and performing normalization. In step S130, the factors affecting the keyword weight include at least the frequency of each keyword and/or the inverse document frequency, and obtaining a text substring that contains a threshold number of keywords includes: removing from the text string the keywords whose weight is below a preset weight threshold; or selecting the threshold number of keywords in the text string according to the weights of the keywords; and combining two or more keywords into phrases. In step S140, removing duplicates o...
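Steps S120 and S130 as described above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the stop-word list, the regular-expression tokenizer (a stand-in for a real word segmenter), the smoothed IDF formula, and the function names `extract_keywords`, `keyword_weights` and `select_top_keywords` are all assumptions introduced here.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}


def extract_keywords(text):
    """Step S120 (sketch): segment, normalize to lowercase, drop stop words."""
    # Assumption: a regex word tokenizer stands in for a real word segmenter.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def keyword_weights(keywords, corpus_keyword_sets):
    """Step S130 (sketch): weight = term frequency * smoothed inverse document frequency."""
    n_docs = len(corpus_keyword_sets)
    weights = {}
    for word, count in Counter(keywords).items():
        doc_freq = sum(1 for doc in corpus_keyword_sets if word in doc)
        # Assumption: add-one smoothing so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        weights[word] = (count / len(keywords)) * idf
    return weights


def select_top_keywords(weights, threshold_number):
    """Keep the highest-weighted keywords; ties are broken alphabetically."""
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:threshold_number]


corpus = ["the cheap flight to Beijing", "cheap hotel in Beijing", "flight delayed"]
corpus_sets = [set(extract_keywords(t)) for t in corpus]
keywords = extract_keywords("the cheap cheap flight to Beijing")
top = select_top_keywords(keyword_weights(keywords, corpus_sets), 2)
```

The alternative mentioned in the text, dropping every keyword whose weight falls below a preset weight threshold, would replace the slice in `select_top_keywords` with a filter on `weights[w]`.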

Embodiment 3

[0039] Figure 3 shows a deduplication method for short text in Embodiment 3 of the present invention. On the basis of Embodiment 1 and Embodiment 2, this embodiment, as a preferred embodiment, describes the deduplication operation between two text strings. Specifically, the method for deduplication of short text in this embodiment includes steps S310, S320, S330, S340, S350, S360 and S370.

[0040] Step S310, acquiring information of the first text string and the second text string.

[0041] Step S320, performing word segmentation on the first text string to obtain word segmentation information of the first text string, and performing word segmentation on the second text string to obtain word segmentation information of the second text string.

[0042] Step S330, performing stop word removal and normalization operations on the word segmentation of the first text string to obtain keyword information of the first text string; and performing stop w...
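Steps S310 to S330 can be sketched for a pair of strings as shown below. The source text breaks off before the later steps, so the final comparison in `is_duplicate` is purely an assumption made for illustration (two strings are treated as duplicates when their keyword sets are equal). The stop-word list and the regex tokenizer are likewise hypothetical.

```python
import re

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}


def keywords_of(text):
    """S320 + S330 (sketch): segment, lowercase, and remove stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return frozenset(t for t in tokens if t not in STOP_WORDS)


def is_duplicate(first_text, second_text):
    """ASSUMPTION: duplicates are strings whose keyword sets are identical.

    The later steps of the embodiment are not visible in the source, so
    this comparison rule is illustrative only.
    """
    return keywords_of(first_text) == keywords_of(second_text)
```

Under these assumptions, `is_duplicate("The price of the ticket", "ticket price")` is true, because both strings reduce to the same two keywords.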

Abstract

An embodiment of the invention discloses a deduplication method for short text. The method comprises the steps of: obtaining text string information of the short text; performing word segmentation on the text string, and obtaining keywords of the text string according to the word segmentation information of the text string; obtaining a text substring according to the weights corresponding to the keywords, wherein the text substring comprises a threshold number of keywords; and removing duplicate items of the text substring. According to the technical scheme provided by the embodiment, obtaining the keywords of the text string generalizes the original text string, which improves the generalization capability and efficiency of deduplication; at the same time, the amount of computation is low, and deduplication among multiple text strings is achieved.
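One possible reading of the four steps in the abstract, applied to a list of short texts, is sketched below. It is not the patented implementation: term frequency alone is used as the weight for simplicity, and the stop-word list, the regex tokenizer, the default `threshold_number` and the names `substring_key` and `deduplicate` are assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "me", "please"}


def substring_key(text, threshold_number=3):
    """Reduce a text to its top-weighted keywords (the "text substring")."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]
    counts = Counter(tokens)
    # Assumption: weight = raw frequency; ties are broken alphabetically.
    top = sorted(counts, key=lambda w: (-counts[w], w))[:threshold_number]
    # Sort so that keyword order in the original text does not matter.
    return tuple(sorted(top))


def deduplicate(texts, threshold_number=3):
    """Keep the first text seen for each distinct keyword substring."""
    seen = set()
    unique = []
    for text in texts:
        key = substring_key(text, threshold_number)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


messages = ["Please call me back", "please CALL me back!", "Meeting moved to Friday"]
result = deduplicate(messages)
```

Because each text is compared through a small keyword tuple rather than its full content, a single set lookup per text is enough, which is consistent with the low computation cost the abstract claims.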

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of text processing, and in particular to a method and device for deduplication of short text.

Background

[0002] Text deduplication refers to removing identical characters, words, or semantically similar components from a text string. With the continuous development of Internet technology, a large number of short message streams have appeared. The number of these messages is huge, but each is generally very short. This kind of information is often called short text: text that is generally within 200 characters, such as mobile phone short messages sent through mobile communication networks, instant messages sent through instant messaging software, comments on weblogs, and comments on Internet news.

[0003] The current text deduplication methods are mainly the text hashing method and the similarity comparison method. Text hashing methods are divided into consistent hashing an...
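To illustrate why a judgment condition based on exact hashing can be too strict, the following sketch deduplicates texts by a digest of their full content. It is a generic baseline and not one of the specific hashing methods named in the source; the choice of SHA-256 and the name `exact_hash_dedup` are assumptions.

```python
import hashlib


def exact_hash_dedup(texts):
    """Baseline sketch: keep one text per distinct SHA-256 digest."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# The trailing "!" changes the digest, so the near-duplicate survives.
kept = exact_hash_dedup(["hello world", "hello world", "hello world!"])
```

Only byte-identical strings collapse here, so trivially different variants of the same message are all retained.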

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/24, G06F17/27
CPC: G06F40/166, G06F40/284
Inventor: 李苗苗
Owner: BEIJING INTELLIGENT STEWARD CO LTD