Segmentation word-taking method and system for social text

A word segmentation technology for social text, applied in the field of social text processing to achieve accurate segmentation results

Active Publication Date: 2022-02-08
成都无糖信息技术有限公司

AI Technical Summary

Problems solved by technology

[0005] To address the prior art's inability to accurately segment the social texts involved in network fraud, the present invention proposes a method and system for segmenting and extracting words from social text. The chat messages of people engaged in Internet fraud have a strong, distinctive language style and differ greatly from ordinary chat content. The invention therefore forms a text recognition and segmentation technique tailored to each type of corpus, and accurately segments the text and extracts words from it.



Examples


Embodiment 1

[0058] As shown in Figure 1, a method for segmenting and extracting words from social text is provided, including:

[0059] S1: Collect the raw text of chat messages sent during the past month by persons engaged in online fraud, and clean that raw text; specifically:

[0060] S1.1: Use regular expressions to remove invalid characters in the original text data, including: invisible characters, URLs, numbers, non-Chinese characters, @ strings and meaningless characters;

[0061] S1.2: Determine whether any sensitive-word separators are present and, if so, replace them with a null character:

[0062] S1.2.1: First create a set of candidate sensitive-word separators and add to it all emoticons and punctuation symbols found in the original text data;

[0063] S1.2.2: Then use a regular expression to extract all the candidate sensitive word separators and add them to the first list, ...
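Steps S1.1 and S1.2.1 to S1.2.2 can be illustrated with a minimal Python sketch. The patent excerpt does not give the actual regular expressions, so every pattern and name below (`clean_text`, `extract_candidate_separators`, the Unicode ranges chosen for Chinese characters and emoticons) is an assumption made for illustration. The excerpt also does not state the order in which cleaning and separator detection run, so the sketch treats them as two independent functions over the raw text. The later, truncated steps that decide which candidates really are sensitive-word separators are not attempted.

```python
import re
import string
from typing import List

# Assumed patterns; the patent does not publish its regular expressions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\S+")
# Assumption: "Chinese characters" means the CJK Unified Ideographs block.
NON_CHINESE_RE = re.compile(r"[^\u4e00-\u9fff]")

# Assumed candidate separators: ASCII punctuation, common Chinese
# punctuation, and two Unicode ranges that cover most emoji and symbols.
CJK_PUNCT = "，。！？、；：（）《》【】…·～"
EMOJI_RANGES = "\U0001F300-\U0001FAFF\u2600-\u27BF"
SEPARATOR_RE = re.compile(
    "[" + re.escape(string.punctuation + CJK_PUNCT) + EMOJI_RANGES + "]"
)


def clean_text(raw: str) -> str:
    """S1.1 sketch: drop URLs and @ strings, then every non-Chinese character.

    Removing non-Chinese characters also removes digits, whitespace and
    invisible characters, so they need no separate pattern here.
    """
    text = URL_RE.sub("", raw)
    text = MENTION_RE.sub("", text)
    return NON_CHINESE_RE.sub("", text)


def extract_candidate_separators(raw: str) -> List[str]:
    """S1.2.1 to S1.2.2 sketch: collect every emoticon or punctuation mark,
    in order of appearance, into the 'first list' the text mentions."""
    return SEPARATOR_RE.findall(raw)
```

For example, `extract_candidate_separators("刷*单，返😀利")` collects the three inserted symbols, which the truncated later steps would then examine.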



Abstract

The invention discloses a method and system for segmenting and extracting words from social text, and belongs to the technical field of social text processing. To solve the prior-art problem that social text involving persons engaged in network fraud cannot be segmented accurately, the system comprises a text preprocessing module, an N-gram lexicon creation module, a word segmentation function module and an N-gram lexicon update module. A self-defined word segmentation function in the word segmentation function module segments the text accurately. An update period is set, and the new data generated during that period is used to update the N-gram lexicon, which improves segmentation accuracy. Because the chat messages of people engaged in network fraud have a strong, distinctive language style and differ greatly from ordinary chat content, a text recognition and segmentation technique tailored to each type of corpus is formed, and the text is accurately segmented and its words extracted.
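The N-gram lexicon creation and update modules named in the abstract might look roughly like the following Python sketch. The abstract does not say whether the N-grams are taken over characters or words, what the maximum N is, or how long the update period lasts. The sketch assumes character-level N-grams up to a hypothetical `max_n`, and the names `char_ngrams`, `build_lexicon` and `update_lexicon` are illustrative only. The self-defined word segmentation function is not described in this excerpt and is not reproduced here.

```python
from collections import Counter
from typing import Iterable, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_lexicon(texts: Iterable[str], max_n: int = 3) -> Counter:
    """Count all character N-grams of length 1..max_n over the cleaned texts."""
    lexicon: Counter = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            lexicon.update(char_ngrams(text, n))
    return lexicon


def update_lexicon(lexicon: Counter, new_texts: Iterable[str],
                   max_n: int = 3) -> Counter:
    """Fold the counts from one update period's new data into the lexicon."""
    lexicon.update(build_lexicon(new_texts, max_n))
    return lexicon
```

A scheduler would call `update_lexicon` once per update period with the cleaned messages collected during that period. Keeping raw counts rather than probabilities makes that incremental update a simple addition.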

Description

Technical field

[0001] The invention belongs to the technical field of social text processing, and in particular relates to a method and system for segmenting and extracting words from social text.

Background

[0002] With the development of the Internet, suspected Internet fraud is becoming increasingly severe, and new forms of it keep emerging. Behind this growth is a large supporting industry that allows each stage of Internet fraud to operate independently. As a result, the cost of committing fraud has fallen and the number of online fraud cases keeps reaching new highs.

[0003] In the existing technology, people engaged in cyber fraud generally communicate and trade through various anonymous communication software and dark web forums. The language they use is distinctive in style and often contains jargon and code words that only insiders can understand.

[0004] For this kind ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/289, G06F40/242, G06F16/23, G06F16/215
CPC: G06F40/289, G06F40/242, G06F16/23, G06F16/215
Inventor: 刘晓雪, 王剑辉, 伍仪洲, 张瑞冬, 童永鳌, 朱鹏
Owner: 成都无糖信息技术有限公司