Word segmentation method, computer-readable storage medium and system supporting a large number of lexicons

A word segmentation method in the field of word segmentation algorithms, applicable to computing, instruments, electrical digital data processing, etc. It addresses slow performance, segmentation that does not consider the text as a whole, and the lack of support for a large number of lexicons, and thereby yields more reasonable segmentation results and higher word segmentation efficiency.

Active Publication Date: 2021-06-22
启业云大数据(南京)有限公司

AI Technical Summary

Problems solved by technology

[0004] 1. Dictionary management is weak and does not support a large number of lexicons;
[0005] 2. When a large number of lexicons is used, the search is not optimized and performance is slow;
[0006] 3. The logic for hitting long words ("big words") in the dictionary is only a simple weighting scheme that does not consider the text as a whole, so the word segmentation result is unreasonable.



Examples


Embodiment 1

[0043] This embodiment proposes a word segmentation method that supports a large number of lexicons. Its flow is shown in Figure 1 and includes the following steps:

[0044] Step 1: Build a domain dictionary, and establish a first-level index and a second-level index for each word in the domain dictionary whose length is greater than N. The key of the first-level index is the first M characters of the word, and its value is the length of the word. The key of the second-level index is the combination of the first M characters of the word and the length of the word, and its value is the hash mapping result of the word.
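The two indexes described in Step 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the values of `N` and `M`, the use of SHA-256 as the "hash mapping", the set-valued index entries, and the helper names `build_indexes` and `find_candidates` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

N = 3  # assumed threshold: only words longer than N characters are indexed
M = 2  # assumed prefix length used as the first-level key


def word_hash(word: str) -> str:
    """Hypothetical stand-in for the patent's unspecified hash mapping."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


def build_indexes(words, n=N, m=M):
    """Build the first-level index (prefix -> word lengths) and the
    second-level index ((prefix, length) -> word hashes)."""
    first = defaultdict(set)
    second = defaultdict(set)
    for w in words:
        if len(w) <= n:
            continue  # short words are not indexed
        head = w[:m]
        first[head].add(len(w))
        second[(head, len(w))].add(word_hash(w))
    return first, second


def find_candidates(text, first, second, m=M):
    """Scan the text; at each position use the first-level index to learn
    which lengths are worth testing, then confirm via the second-level index."""
    hits = []
    for i in range(len(text) - m + 1):
        head = text[i:i + m]
        for length in sorted(first.get(head, ())):
            cand = text[i:i + length]
            if len(cand) != length:
                continue  # candidate would run past the end of the text
            if word_hash(cand) in second.get((head, length), ()):
                hits.append((i, cand))
    return hits


first_idx, second_idx = build_indexes(["自然语言处理", "自然语言", "分词"])
print(find_candidates("自然语言处理技术", first_idx, second_idx))
```

Under these assumptions, the first-level lookup limits the work at each text position to the few lengths that actually occur for that prefix, and the second-level lookup replaces a scan of the dictionary with a hash membership test.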

[0045] Specifically, the domain dictionary may be a domain dictionary of one domain, or multiple domain dictionaries of different domains, and each domain dictionary has an identifier indicating a corresponding domain.

[0046] In the domain dictionary, a first-level index and a second-level index are also established for ...

Embodiment 2

[0095] This embodiment proposes a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the word segmentation method is implemented.

Embodiment 3

[0097] This embodiment proposes a word segmentation system that supports a large number of lexicons and implements the word segmentation method. Referring to Figure 1, the system includes an offline model unit, a domain dictionary module, a domain search module and a word segmentation module, where:

[0098] The domain dictionary module stores pre-built domain dictionaries for different domains. Each word longer than N in a domain dictionary has a first-level index and a second-level index. The module opens the dictionary to users, allowing them to dynamically add new words and custom words. It also provides dictionary management functions, for example:

[0099] Users can tag words with domains to facilitate searching by domain;

[0100] Users can mark words according to the 4-tag method to facilitate offline training;

[0101] On the management page, users can make these annotations take effect...
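The 4-tag annotation mentioned in [0100] is not spelled out in the text available here. A common 4-tag scheme in Chinese word segmentation labels each character as B (begin), M (middle), E (end) or S (single-character word). Assuming that is the scheme meant, a minimal sketch of converting a segmented sentence into tags looks like this (the function name `to_bmes` is hypothetical):

```python
def to_bmes(words):
    """Convert a list of segmented words into per-character B/M/E/S tags.

    Assumption: the "4-tag method" refers to the common B/M/E/S scheme.
    """
    tags = []
    for w in words:
        if not w:
            continue  # ignore empty strings
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


print(to_bmes(["自然语言", "的", "处理"]))
```

Tag sequences of this kind are a typical training input for sequence-labelling segmenters, which is consistent with the stated purpose of facilitating offline training.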


Abstract

The present invention proposes a word segmentation method, a computer-readable storage medium and a system supporting a large number of lexicons. The method includes the following steps: constructing a domain dictionary; constructing an offline word segmentation model based on the domain dictionary; segmenting the original text with the offline word segmentation model to obtain a first word segmentation result; extracting the words to be searched from the original text, performing a first-level index search and a second-level index search in the domain dictionary based on those words, and screening the second-level index results to extract candidate words; recombining the candidate words with the first word segmentation result, constructing a directed graph of the original text from the recombined result, and calculating the optimal word segmentation result with the shortest path method. The invention combines single-domain word segmentation results with long-word ("big word") search results, constructs a directed graph from the combined results, and converts the search for the optimal segmentation scheme into an optimal path problem that can be solved quickly, which makes it well suited to segmenting long words.
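The final step of the abstract, building a directed graph over the text and taking the shortest path, can be illustrated with a small sketch. The edge weights are not given in the text available here, so this example assumes a uniform cost of 1 per word (the path with the fewest words wins) and adds single-character fallback edges so the graph is always connected. The function name `segment_by_shortest_path` and the `(start, word)` candidate format are hypothetical.

```python
def segment_by_shortest_path(text, first_result, candidates):
    """Combine a first segmentation with candidate words and pick the
    segmentation with the fewest words (assumed uniform edge weight).

    text         -- the original string
    first_result -- list of words whose concatenation equals text
    candidates   -- list of (start_offset, word) found by dictionary search
    """
    if "".join(first_result) != text:
        raise ValueError("first_result must cover the text exactly")
    n = len(text)
    # Nodes are character boundaries 0..n; an edge i -> j means text[i:j] is a word.
    edges = {i: {i + 1} for i in range(n)}  # single-character fallback edges
    pos = 0
    for w in first_result:
        edges[pos].add(pos + len(w))
        pos += len(w)
    for start, w in candidates:
        if w and text[start:start + len(w)] == w:  # skip candidates that do not match
            edges[start].add(start + len(w))

    # The graph is a DAG ordered by position, so one left-to-right pass suffices.
    inf = float("inf")
    cost = [inf] * (n + 1)
    back = [-1] * (n + 1)
    cost[0] = 0
    for i in range(n):
        if cost[i] == inf:
            continue
        for j in sorted(edges[i]):
            if cost[i] + 1 < cost[j]:
                cost[j] = cost[i] + 1
                back[j] = i

    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    j = n
    while j > 0:
        i = back[j]
        words.append(text[i:j])
        j = i
    return words[::-1]


print(segment_by_shortest_path(
    "自然语言处理技术",
    ["自然", "语言", "处理", "技术"],
    [(0, "自然语言处理")],
))
```

With uniform weights the shortest path naturally prefers a long dictionary word over several short ones. A real system would likely use weights derived from word frequency or model scores instead, which this sketch does not attempt to reproduce.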

Description

Technical field

[0001] The invention relates to natural language processing (NLP) in the field of artificial intelligence, and in particular to a word segmentation method, a computer-readable storage medium and a system supporting a large number of lexicons.

Background

[0002] Many word segmentation tools exist today, such as jieba and pyltp. Although these tools segment words effectively, word usage differs between fields in practice, so the segmentation of the same sentence should also differ between fields. However, most existing technologies segment words based on a single dictionary, which may lead to unsatisfactory results.

[0003] For these reasons, current word segmentation schemes have begun to introduce domain dictionaries, but the following defects remain:

[0004] 1. The dictionary manag...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/284, G06F40/242
CPC: G06F40/242, G06F40/284
Inventor: 王三明, 王聪明, 胡小敏
Owner: 启业云大数据(南京)有限公司