LSTM-based mixed corpus word segmentation method
A word segmentation method for mixed corpora, applied in natural language data processing, special data processing applications, instruments, and related fields. It addresses the loss of word segmentation accuracy, the dependence on dictionaries, and the inability to recognize unregistered (out-of-vocabulary) words, and it improves segmentation accuracy.
Embodiment Construction
[0053] In order to make the above-mentioned features and advantages of the present invention more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.
[0054] The flow chart of the present invention is shown in Figure 1. Its implementation can be divided into two stages: 1) a training stage and 2) a prediction stage.
[0055] (1) Training stage:
[0056] Step 1: Data preprocessing. The mixed corpus involves multiple languages, such as Simplified Chinese, Traditional Chinese, Japanese and Korean. We first format the tagged data of each language so that each token of the word segmentation result occupies one line. For example, for the sentence "I am Chinese, I love China." in Traditional Chinese, Simplified Chinese and Japanese, the data format is as follows:
[0057] Simplified Chinese:
[0058] I
[0059] am
[0060] Chinese people
[0061] ,
[0062] I
[0063] Love
[0064] China
[0065] .
[0066] traditional Chin...
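The one-token-per-line format described in Step 1 can be sketched as follows. This is a minimal illustration, not the patent's own implementation: the function names `format_segmented` and `parse_segmented` are hypothetical, and English glosses of the example sentence stand in for the original-language tokens.

```python
from typing import Iterable, List


def format_segmented(tokens: Iterable[str]) -> str:
    """Render one segmented sentence with one token per line."""
    return "\n".join(tokens) + "\n"


def parse_segmented(text: str) -> List[str]:
    """Recover the token list from the one-token-per-line format."""
    # Blank lines are skipped so trailing newlines do not create empty tokens.
    return [line for line in text.splitlines() if line.strip()]


# English glosses of the example sentence (assumed stand-ins for the
# original-language tokens, which are not reproduced here).
tokens = ["I", "am", "Chinese", ",", "I", "love", "China", "."]

formatted = format_segmented(tokens)
print(formatted, end="")

# Reading the file content back yields the same segmentation.
assert parse_segmented(formatted) == tokens
```

Keeping one token per line makes the format language-independent, so the same reader can plausibly be reused for each language in the mixed corpus.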