Text word segmentation processing method and device, equipment and medium

A text word segmentation processing method, applied in the fields of electrical digital data processing and natural language data processing. It addresses the low average efficiency of existing approaches, in which the time needed to extract segmented words grows with token length and reduces work efficiency, and thereby improves the efficiency of word segmentation.

Active Publication Date: 2021-12-24
GUANGZHOU HUADUO NETWORK TECH

AI Technical Summary

Problems solved by technology

[0006] First, directly matching a word against a dictionary list with a large number of elements is inefficient, as is enumerating all possible word combinations and then looking each one up in turn.
[0007] Second, when traversing from back to front, the time complexity is at least O(n^2). As the length of the token increases, the time needed to extract the segmented words increases significantly, which greatly affects overall work efficiency.
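The quadratic cost described in [0007] can be illustrated with a minimal sketch. The excerpt does not spell out the prior-art procedure, so the function below is only one plausible reading: a naive back-to-front matcher that tries every start position for each end position and looks each candidate up in a plain list. The name `backward_max_match` and the `checks` counter are hypothetical and exist only for illustration.

```python
def backward_max_match(token, dictionary):
    """Naive back-to-front segmentation (illustrative assumption, not the patent's method).

    Returns the words found and the number of candidate substrings that were
    looked up, so the growth in work can be observed directly.
    """
    words = []
    checks = 0
    end = len(token)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shorter ones.
        for start in range(0, end):
            checks += 1
            candidate = token[start:end]
            if candidate in dictionary:  # linear scan when `dictionary` is a list
                words.append(candidate)
                end = start
                break
        else:
            # No dictionary word ends at this position; drop one character.
            end -= 1
    words.reverse()
    return words, checks


words, checks = backward_max_match("goodmorning", ["good", "morning"])
print(words, checks)
```

For a token of length n that contains no dictionary word, this sketch performs n + (n - 1) + ... + 1 = n(n + 1)/2 lookups, and each lookup itself scans the list, which matches the two inefficiencies named in [0006] and [0007].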



Embodiment Construction

[0071] Embodiments of the present application are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the figures are exemplary, serve only to explain the present application, and shall not be construed as limiting it.

[0072] Those skilled in the art will understand that, unless otherwise stated, the singular forms "a", "an", "said" and "the" used herein may also include plural forms. It should be further understood that the word "comprising" used in the specification of the present application refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be under...


Abstract

The invention discloses a text word segmentation processing method, together with a corresponding device, equipment and medium. The method comprises the following steps. First, a text to be segmented is collected; the text comprises a plurality of suspected words connected in series, and each suspected word is composed of phonetic characters. Second, all characters in the text are traversed in sequence; during the traversal, redundant characters formed by continuous repetition within a suspected word are ignored, each suspected word is converted into a word in a dictionary tree, and the words are added to a result list in order. The dictionary tree comprises a plurality of paths that start from its root node and reach different end nodes, and the nodes along each path store, in sequence, the characters of a single word. Finally, the words in the result list are output in order as the word segmentation result. Because segmentation is carried out against the dictionary tree, abnormally repeated characters can be handled during segmentation: redundant characters in the text are ignored, and the words contained in the text are extracted accurately.
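The steps in the abstract can be sketched as follows. This is a minimal illustration, not the claimed implementation: the excerpt does not state the matching rules, so longest-match selection, skipping of characters that start no dictionary word, and treating a character as redundant only when it repeats the previously consumed character and has no matching child node are all assumptions. The names `TrieNode`, `build_trie` and `segment` are hypothetical.

```python
class TrieNode:
    """One node of the dictionary tree; each path from the root spells a word."""

    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if the path from the root ends a word here


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def segment(text, root):
    """Traverse the text once per word start, ignoring repeated characters."""
    result = []
    i, n = 0, len(text)
    while i < n:
        node, consumed, prev = root, [], None
        last_word, last_end = None, i
        j = i
        while j < n:
            ch = text[j]
            if ch in node.children:
                # Normal step down the tree.
                node = node.children[ch]
                consumed.append(ch)
                prev = ch
                j += 1
                if node.is_word:
                    last_word, last_end = "".join(consumed), j
            elif ch == prev:
                # Assumed rule: a repeat of the previous character is redundant.
                j += 1
                if node.is_word:
                    last_end = j  # absorb repeats that trail a complete word
            else:
                break
        if last_word is None:
            i += 1  # assumed rule: skip a character that starts no word
        else:
            result.append(last_word)
            i = last_end
    return result


trie = build_trie(["good", "morning", "he", "who"])
print(segment("goooooodmorning", trie))
print(segment("hewho", trie))
```

Each character lookup is a single dictionary access on the current node, so the work per starting position is bounded by the length of the text that follows one path in the tree rather than by the size of the word list.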

Description

Technical field

[0001] The present application relates to the technical field of computer word segmentation, in particular to a text word segmentation processing method, and also to the corresponding device, equipment and non-volatile storage medium of the method.

Background technique

[0002] Word segmentation, that is, splitting text paragraphs or sentences into several words, is one of the most basic steps in text-related "Natural Language Processing" (NLP), and it plays a very important role. Obtaining word segmentation results accurately and quickly helps ensure the accuracy of subsequent NLP and improves work efficiency.

[0003] After several tokens are obtained based on the format of the text itself (such as spaces between words) and punctuation marks, tokens that join multiple words (such as He-Who-Must-Not-Be-Named) and normal words into which repeated letters have been introduced (such as g-o-o-o-o-o-d) are often not reasonably segmented or restored to normal w...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/289, G06F40/242
CPC: G06F40/289, G06F40/242, Y02D10/00
Inventor: 李世家, 姜博怀
Owner: GUANGZHOU HUADUO NETWORK TECH