Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
A word segmentation method that combines the MMseg algorithm with pointwise mutual information, applicable to computing, natural language analysis, special data processing applications, and similar fields. It addresses the problems of slow word segmentation speed and low word segmentation efficiency, and improves segmentation accuracy and ambiguity resolution ability.
Examples
Embodiment 1
[0037] As shown in Figure 1, the method provided by the present invention uses the MMseg algorithm to perform dictionary-based word segmentation on the text and, once the segmentation results are obtained, uses the pointwise mutual information algorithm to correct them.
[0038] The pointwise mutual information algorithm corrects the word segmentation results as follows: calculate the pointwise mutual information of adjacent words x and y in the text, then judge whether that value is greater than the set threshold; if so, word x and word y are combined and segmented as a single independent word.
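The correction step in [0038] can be illustrated with a minimal Python sketch. It assumes the standard definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))), with probabilities estimated from unigram and adjacent-bigram counts over an already segmented corpus. The function names, the estimation scheme, and the left-to-right merging order are assumptions made for illustration; the source does not specify them.

```python
import math
from collections import Counter


def count_ngrams(sentences):
    """Count words and adjacent word pairs over pre-segmented sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def pmi(x, y, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (x, y), in bits.

    Assumed estimator: relative frequencies of unigrams and bigrams.
    Returns -inf when the pair was never observed.
    """
    if bigrams[(x, y)] == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = bigrams[(x, y)] / n_bi
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return math.log2(p_xy / (p_x * p_y))


def merge_by_pmi(words, unigrams, bigrams, threshold):
    """Merge adjacent words whose PMI exceeds the threshold (left to right)."""
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and pmi(words[i], words[i + 1], unigrams, bigrams) > threshold:
            merged.append(words[i] + words[i + 1])
            i += 2  # both words were consumed by the merge
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy corpus of segmented sentences (hypothetical data).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["c", "d", "a", "b"]]
unigrams, bigrams = count_ngrams(corpus)
result = merge_by_pmi(["a", "b", "c"], unigrams, bigrams, threshold=2.0)
```

In this toy corpus the pair ("a", "b") occurs in every sentence, so its PMI exceeds the threshold of 2.0 and the two words are merged, while "c" is left alone. A suitable threshold in practice would have to be tuned on real data.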
[0039] The MMseg algorithm is a dictionary-based word segmentation algorithm. It can be understood in two parts: the matching algorithm and the disambiguation rules. The matching algorithm includes two word segmentation methods: simpl...
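Because the description of the matching methods is cut off in the source, the following Python sketch shows only generic forward maximum matching, a common dictionary-based strategy. It is not necessarily the exact matching procedure of the patent. The function name `forward_max_match` and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy dictionary segmentation: at each position take the longest
    dictionary word that starts there, falling back to a single character."""
    if max_len is None:
        # Longest dictionary entry bounds the search window (at least 1).
        max_len = max(1, max((len(w) for w in dictionary), default=1))
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                result.append(candidate)
                i += size
                break
    return result


# Toy dictionary (hypothetical); letters stand in for characters.
toy_dictionary = {"ab", "abc", "cd"}
segments = forward_max_match("abcd", toy_dictionary)
```

Here the greedy choice yields "abc" followed by "d", even though "ab" followed by "cd" would use only dictionary words. Ambiguities of this kind are what disambiguation rules and a later correction step are meant to address.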
Embodiment 2
[0066] In this embodiment, an experiment is carried out based on the method of Embodiment 1. Web crawler software was used to capture 5,000 product title descriptions from a shopping website as the experimental corpus, and meaningless symbols such as punctuation, underscores, and special symbols were filtered out. Of these, 3,500 were used as training text and 1,500 as test text. The statistics are as follows:
[0067]
Data set | Number of sentences | Word count | Single word / multiple word
Training corpus | 3500 | 54834 | 2541 / 34156
Test corpus | 1500 | 24545 | 1154 / 14354
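The corpus preparation described in this embodiment (symbol filtering followed by a 3,500 / 1,500 split) could be sketched in Python as follows. The regular expression, the function names, and the sequential (non-random) split are assumptions; the source does not say which symbols were filtered or how the split was drawn.

```python
import re

# Assumed filter: drop every character that is neither a word character
# nor whitespace, and additionally drop underscores.
_SYMBOLS = re.compile(r"[^\w\s]|_")


def clean_title(title):
    """Remove punctuation, underscores, and other special symbols."""
    return _SYMBOLS.sub("", title).strip()


def split_corpus(titles, n_train):
    """Clean all titles, then take the first n_train as training text
    and the remainder as test text (sequential split, an assumption)."""
    cleaned = [clean_title(t) for t in titles]
    return cleaned[:n_train], cleaned[n_train:]


# Placeholder titles standing in for the 5,000 crawled product titles.
titles = [f"item_{i}!" for i in range(5000)]
train, test = split_corpus(titles, n_train=3500)
```

In Python 3, `\w` matches Unicode word characters by default for `str` patterns, so Chinese characters in product titles would be preserved by this filter.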