Key-based segmentation method and device for character sequences

A keyword-based character sequence segmentation method and device, applied in the field of character sequence word segmentation. The technology addresses the problems of word segmentation ambiguity, ambiguity in word recognition, and insufficient new word recognition ability. Its effect is to avoid word segmentation ambiguity and to improve word segmentation accuracy and new word recognition ability.

Active Publication Date: 2015-06-03
北京金蝶云基科技有限公司

AI Technical Summary

Problems solved by technology

[0007] 1. Word segmentation accuracy cannot meet requirements: some applications, such as the segmentation of expressions and formulas, require 100% segmentation accuracy; otherwise the calculation results will be wrong.
[0008] 2. Word segmentation is ambiguous: words cannot be recognized unambiguously. For example, when sequences such as +, +=, if, and else if appear in formulas or expressions, they cannot be recognized reliably.
[0009] 3. New word recognition is insufficient: when new entries appear in the system, the algorithm cannot recognize them well.
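
The ambiguity described in [0008] can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `naive_segment` and the first-match strategy are assumptions chosen only to show how an unordered keyword list can split += incorrectly.

```python
def naive_segment(text, keywords):
    """Scan left to right and take the first listed keyword that matches.

    Hypothetical baseline for illustration; the keyword order is arbitrary,
    so a short keyword can shadow a longer one that starts the same way.
    """
    tokens, i = [], 0
    while i < len(text):
        for kw in keywords:
            if text.startswith(kw, i):
                tokens.append(kw)
                i += len(kw)
                break
        else:
            # No keyword starts here: emit the single character as a token.
            tokens.append(text[i])
            i += 1
    return tokens


# "+" is listed before "+=", so "+=" is never matched as one unit.
print(naive_segment("a+=b", ["+", "+=", "if", "else if"]))
# → ['a', '+', '=', 'b']
```

Here the operator += is broken into + and =, which would change the meaning of a formula.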

Detailed Description of the Embodiments

[0047] The solution of the embodiment of the present invention is mainly as follows: define the priority of each keyword, establish a keyword list, sort the keywords in the list by priority, and, taking the keyword as the minimum segmentation unit, segment the character sequence according to a predetermined word segmentation algorithm. This improves word segmentation accuracy and new word recognition ability and avoids word segmentation ambiguity.
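
The flow in [0047] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `Keyword`, `build_keyword_list`, and `segment` are hypothetical, a larger priority number is assumed to mean "match first", the priorities shown are chosen by hand, and a simple left-to-right scan stands in for the unspecified "predetermined word segmentation algorithm".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    text: str
    priority: int  # assumption: a larger number is tried first


def build_keyword_list(keywords):
    """Sort the keywords by priority, highest first."""
    return sorted(keywords, key=lambda k: k.priority, reverse=True)


def segment(sequence, keyword_list):
    """Split `sequence`, treating each keyword as an indivisible unit."""
    tokens, pending, i = [], [], 0
    while i < len(sequence):
        # Take the highest-priority keyword that starts at position i.
        match = next(
            (k.text for k in keyword_list if sequence.startswith(k.text, i)),
            None,
        )
        if match is None:
            pending.append(sequence[i])  # not part of any keyword
            i += 1
            continue
        if pending:
            tokens.append("".join(pending))  # flush the non-keyword run
            pending.clear()
        tokens.append(match)
        i += len(match)
    if pending:
        tokens.append("".join(pending))
    return tokens


keyword_list = build_keyword_list([
    Keyword("+", 1), Keyword("+=", 2),
    Keyword("if", 1), Keyword("else if", 2),
])
print(segment("total+=price", keyword_list))
# → ['total', '+=', 'price']
```

Because += is given a higher priority than +, it is matched as a single unit. Note that this sketch does not check word boundaries, so a keyword such as if would also be matched inside a longer identifier; a fuller version would need to handle that case.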

[0048] As shown in Figure 1, an embodiment of the present invention proposes a keyword-based character sequence segmentation method, including:

[0049] Step S101, loading keywords and creating a keyword list;

[0050] Keywords (keys) include user-defined keywords and predefined keywords. User-defined keywords are dynamically loaded by the program and can be maintained outside the system; predefined keywords are fixed and built into the system. Predefined keywords are commonly used ...
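
Step S101 might be sketched in Python as below. Everything specific here is an assumption made for illustration: the built-in set `PREDEFINED_KEYWORDS`, the function `load_keywords`, the integer priorities, and the JSON format for user-defined keywords are hypothetical and do not come from the patent text.

```python
import json

# Hypothetical built-in keywords with illustrative priorities.
PREDEFINED_KEYWORDS = {"+": 1, "+=": 2, "if": 1, "else if": 2}


def load_keywords(custom_json):
    """Create a keyword list from predefined and user-defined keywords.

    `custom_json` is assumed to be a JSON object mapping keyword text to
    priority, maintained outside the system (for example in a config file).
    """
    keywords = dict(PREDEFINED_KEYWORDS)
    for text, priority in json.loads(custom_json).items():
        # Predefined keywords are fixed, so a colliding custom entry
        # is ignored rather than allowed to override the built-in one.
        keywords.setdefault(text, int(priority))
    return list(keywords.items())


# A user adds the function name ROUND and tries to redefine "+".
print(load_keywords('{"ROUND": 1, "+": 9}'))
# → [('+', 1), ('+=', 2), ('if', 1), ('else if', 2), ('ROUND', 1)]
```

Keeping user-defined keywords in external data means new entries can be added without changing the program, which is one way to read the "dynamically loaded" wording above.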


Abstract

The invention relates to a key-based segmentation method and device for character sequences. The method includes: loading keys and establishing a key list; sorting the keys according to the priority of each key in the key list; and, according to the sorting result and taking the key as the smallest segmentation unit, segmenting the character sequence according to a predetermined word segmentation algorithm. By defining key priorities, establishing the key list, ordering the keys by priority, and treating each key as the smallest segmentation unit, the method and device improve word segmentation accuracy and new word recognition ability, avoid word segmentation ambiguity, and satisfy application scenarios with strict word segmentation requirements, such as formula and function verification and analysis.

Description

Technical field

[0001] The present invention relates to the technical field of character sequence word segmentation, and in particular to a keyword-based character sequence segmentation method and device.

Background

[0002] Commonly used word segmentation algorithms currently fall into the following three categories:

[0003] 1. Word segmentation based on string matching; 2. Word segmentation based on understanding; 3. Word segmentation based on statistics. The three categories are compared in Table 1 below:

[0004] Table 1

Word segmentation method    String matching    Understanding    Statistics
Ambiguity recognition       Poor               Strong           Strong
New word recognition        Poor               Strong           Strong
Requires a dictionary       Yes                No               No
Requires a corpus           No                 No               ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/27
Inventor: 阳荣
Owner: 北京金蝶云基科技有限公司