Word segmentation method supporting a large number of lexicons, computer-readable storage medium, and system

A word segmentation method in the field of word segmentation algorithms, applicable to computing, instruments, and electrical digital data processing. It addresses problems such as slow performance, hit logic that ignores the text as a whole, and lack of support for a large number of lexicons, with the effect of producing more reasonable results and improving word segmentation efficiency.

Active Publication Date: 2021-02-02
启业云大数据(南京)有限公司

AI Technical Summary

Problems solved by technology

[0004] 1. Dictionary management is weak and does not support a large number of lexicons;
[0005] 2. With a large number of lexicons, search is not optimized and performance is slow;
[0006] 3. The hit logic for big words in the dictionary is a simple weighting scheme that does not consider the text as a whole, so the word segmentation results are unreasonable.

Examples

Embodiment 1

[0043] This embodiment proposes a word segmentation method that supports a large number of lexicons. Its flow is shown in Figure 1 and includes the following steps:

[0044] Step 1: Build a domain dictionary, and establish a first-level index and a second-level index for each word in the domain dictionary whose length is greater than N. The key of the first-level index is the first M characters of the word, and its value is the length of the word; the key of the second-level index is the combination of the first M characters of the word and the length of the word, and its value is the hash mapping result of the word.
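
The two-level index of Step 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the values chosen for N and M, the use of MD5 as the "hash mapping", the storage of several lengths or hashes under one key as a set, and the helper names `word_hash`, `build_indexes`, and `find_candidates` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

N = 2  # assumed threshold: only words longer than N characters are indexed
M = 2  # assumed number of leading characters used as the index key


def word_hash(word):
    """Stable hash of a word (MD5 is an arbitrary choice for this sketch)."""
    return hashlib.md5(word.encode("utf-8")).hexdigest()


def build_indexes(words):
    """Build the first-level and second-level indexes for words longer than N."""
    first_level = defaultdict(set)   # first M characters -> word lengths
    second_level = defaultdict(set)  # (first M characters, length) -> word hashes
    for word in words:
        if len(word) <= N:
            continue
        prefix = word[:M]
        first_level[prefix].add(len(word))
        second_level[(prefix, len(word))].add(word_hash(word))
    return first_level, second_level


def find_candidates(text, first_level, second_level):
    """Return (start, word) pairs in text that hit both index levels."""
    hits = []
    for start in range(len(text) - M + 1):
        prefix = text[start:start + M]
        # First-level search: which word lengths exist for this prefix?
        for length in sorted(first_level.get(prefix, ())):
            piece = text[start:start + length]
            if len(piece) < length:
                continue  # not enough text left for a word of this length
            # Second-level search: does a word with this prefix, length, and hash exist?
            if word_hash(piece) in second_level.get((prefix, length), ()):
                hits.append((start, piece))
    return hits


first, second = build_indexes(["自然语言处理", "自然语言", "处理"])
print(find_candidates("我爱自然语言处理技术", first, second))
```

Because the first-level lookup returns only the lengths that actually occur for a prefix, the scan tests a handful of substrings per position instead of every possible length, which is one plausible reason for the two-level design.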

[0045] Specifically, the domain dictionary may be a single dictionary for one domain or multiple dictionaries for different domains, and each domain dictionary carries an identifier indicating its domain.

[0046] In the domain dictionary, a primary index and a secondary index are also established for ...

Embodiment 2

[0095] This embodiment proposes a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the word segmentation method is implemented.

Embodiment 3

[0097] This embodiment proposes a word segmentation system that supports a large number of lexicons and implements the word segmentation method. As shown in Figure 1, the system includes an offline model unit, a domain dictionary module, a domain search module, and a word segmentation module, where:

[0098] The domain dictionary module stores pre-built domain dictionaries for different domains. Each word longer than N in a domain dictionary has a first-level index and a second-level index. The module opens the dictionary to users, allowing them to dynamically add new words and custom words. It also provides dictionary management functions, for example:

[0099] Users can tag words with domains to facilitate searching by domain;

[0100] Users can tag words according to the 4-tag method to facilitate offline training;

[0101] On the management page, users can make these annotations take effect...
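
The text does not say which tag set the 4-tag method uses. A common convention in Chinese word segmentation is BMES (Begin, Middle, End, Single), and the sketch below assumes that convention; the function name `to_bmes` is hypothetical.

```python
def to_bmes(words):
    """Convert a segmented sentence into (character, tag) pairs.

    Assumes the BMES convention: B = first character of a multi-character
    word, M = middle character, E = last character, S = single-character word.
    """
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))
            continue
        tagged.append((word[0], "B"))
        for ch in word[1:-1]:
            tagged.append((ch, "M"))
        tagged.append((word[-1], "E"))
    return tagged


print(to_bmes(["我", "爱", "自然语言"]))
```

Character-level labels of this kind are the usual training input for sequence-labeling segmenters, which matches the stated purpose of facilitating offline training.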

Abstract

The invention provides a word segmentation method supporting a large number of lexicons, together with a computer-readable storage medium and a system. The method comprises the following steps: constructing a domain dictionary; building an offline word segmentation model based on the domain dictionary; segmenting the original text with the offline word segmentation model to obtain a first word segmentation result; extracting to-be-searched words from the original text, performing a first-level index search and a second-level index search in the domain dictionary based on those words, and screening the second-level index results to extract candidate words; and recombining the candidate words with the first word segmentation result, constructing a directed graph of the original text from the recombined result, and computing the optimal word segmentation result with a shortest-path method. By combining the single-domain word segmentation result with the big-word search result and building a directed graph from the combination, the method converts the search for the optimal segmentation scheme into an optimal-path problem that can be solved quickly, which makes it well suited to segmenting big words.
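
The final step, building a directed graph and taking a shortest path, can be sketched as follows. The abstract does not specify edge weights, so this sketch assumes every word costs 1, which makes the shortest path the segmentation with the fewest words and therefore favors big words. The single-character fallback edges and the function name `best_segmentation` are also assumptions.

```python
def best_segmentation(text, first_result, candidates):
    """Pick a segmentation by shortest path over character positions.

    first_result: words from the offline model; must concatenate to text.
    candidates:   (start, word) pairs found by the dictionary search.
    Assumption: every edge (word) has cost 1.
    """
    if "".join(first_result) != text:
        raise ValueError("first_result must cover the text exactly")
    n = len(text)
    # edges[i] holds every position j such that text[i:j] is an allowed word.
    edges = [set() for _ in range(n)]
    for i in range(n):
        edges[i].add(i + 1)  # single characters guarantee that a path exists
    pos = 0
    for word in first_result:
        edges[pos].add(pos + len(word))
        pos += len(word)
    for start, word in candidates:
        if word and text[start:start + len(word)] == word:
            edges[start].add(start + len(word))
    # Positions are already in topological order, so one forward pass suffices.
    cost = [float("inf")] * (n + 1)
    back = [-1] * (n + 1)
    cost[0] = 0
    for i in range(n):
        for j in sorted(edges[i]):
            if cost[i] + 1 < cost[j]:
                cost[j] = cost[i] + 1
                back[j] = i
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    j = n
    while j > 0:
        i = back[j]
        words.append(text[i:j])
        j = i
    return words[::-1]


text = "我爱自然语言处理技术"
first_result = ["我", "爱", "自然", "语言", "处理", "技术"]
candidates = [(2, "自然语言"), (2, "自然语言处理")]
print(best_segmentation(text, first_result, candidates))
```

A real system would likely weight edges by word frequency or model score rather than a constant; the graph construction and the path search stay the same.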

Description

Technical field

[0001] The invention relates to the technical field of natural language processing (NLP) in artificial intelligence, and in particular to a word segmentation method supporting a large number of lexicons, a computer-readable storage medium, and a system.

Background

[0002] Many word segmentation tools are currently available, such as jieba and pyltp. Although these tools segment words effectively, word usage habits differ between fields in practice, so the segmentation of the same sentence should also differ between fields. However, most existing technologies segment words based on a single dictionary, which may lead to unsatisfactory word segmentation results.

[0003] For these reasons, current word segmentation schemes have begun to introduce domain dictionaries, but the following defects remain:

[0004] 1. The dictionary manag...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/284, G06F40/242
CPC: G06F40/242, G06F40/284
Inventor: 胡小敏
Owner: 启业云大数据(南京)有限公司