Text big data-oriented Chinese word segmentation method

A Chinese word segmentation and big data technology, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc., that improves throughput and accuracy and reduces word segmentation time.

Active Publication Date: 2015-03-11
WUHAN SHUWEI TECH

AI Technical Summary

Problems solved by technology

To improve throughput, word segmentation efficiency, and accuracy in a big data environment, the present invention proposes a Chinese word segmentation method for text big data. It uses the MapReduce computing model to process massive data and combines statistics-based segmentation with string-matching-based segmentation, which addresses the accuracy, practicality, and efficiency problems of Chinese word segmentation on large text data.
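
One way to picture "combining statistics and string matching" is sketched below. This is only an illustration under stated assumptions, not the method claimed in the patent: it runs forward and backward maximum matching against a dictionary (the string-matching part) and, when the two disagree, keeps the candidate with the higher sum of log word frequencies (a very simple statistical criterion). The lexicon, its frequencies, and all function names are hypothetical.

```python
import math

# Hypothetical toy lexicon with made-up frequencies (not taken from the patent).
LEXICON = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "的": 100, "起源": 30}
MAX_LEN = max(len(word) for word in LEXICON)
TOTAL = sum(LEXICON.values())


def forward_max_match(text):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the lexicon.
            if size == 1 or piece in LEXICON:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in LEXICON:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def score(words):
    """Sum of log relative frequencies; unknown words get a floor count of 1."""
    return sum(math.log(LEXICON.get(word, 1) / TOTAL) for word in words)


def segment(text):
    """Keep whichever matching result the frequency statistics prefer."""
    fwd, bwd = forward_max_match(text), backward_max_match(text)
    return fwd if score(fwd) >= score(bwd) else bwd


print(segment("研究生命的起源"))
```

With this toy lexicon, forward matching greedily takes "研究生" first, while backward matching yields "研究 / 生命 / 的 / 起源"; the frequency score is what breaks the tie between the two candidates.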

Embodiment Construction

[0051] The technical terms used in the present invention are first explained below:

[0052] MapReduce computing model: MapReduce is a general software framework proposed by Google to implement distributed parallel computing tasks. It simplifies the parallel software programming model on super-large clusters composed of ordinary computers, and can be used for parallel computing of large-scale data sets.
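
The programming model described in [0052] can be illustrated with a small single-process sketch. It is not Hadoop code and does not run in parallel; it only shows the three conceptual phases (map, group by key, reduce) using a word count as the job. All function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) record and stream its output pairs."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer once per key with all values collected for that key."""
    return [reducer(key, values) for key, values in groups]


def count_mapper(_offset, line):
    """Emit (token, 1) for every whitespace-separated token in the line."""
    for token in line.split():
        yield token, 1


def count_reducer(token, counts):
    """Sum the partial counts for one token."""
    return token, sum(counts)


def run(records):
    return reduce_phase(shuffle(map_phase(records, count_mapper)), count_reducer)


print(run([(0, "a b a"), (6, "b c")]))
```

On a real cluster the map and reduce calls run on different machines and the grouping step moves data over the network; the sketch only preserves the data flow.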

[0053] Hadoop Distributed File System: Hadoop is a distributed system infrastructure developed by the Apache Foundation. The core of the Hadoop framework consists of HDFS (Hadoop Distributed File System) and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over massive data. When a MapReduce job runs on Hadoop, there are three main roles: JobClient, JobTracker, and TaskTracker. The JobClient submits tasks; the JobTracker monitors the running status of tasks and schedules them accordingly; TaskTracker actively...

Abstract

The invention discloses a Chinese word segmentation method for text big data and belongs to the field of natural language processing. The method comprises the following steps: (1) decomposing local massive data files into data blocks; (2) performing Map processing on the decomposed data blocks to obtain <Key, Value> key-value pairs in which the Key is the offset and the Value is the text content; (3) obtaining the final word segmentation result through a series of word segmentation processing steps, and emitting <Key, Value> key-value pairs in which the Key is the offset and the Value is the word segmentation result as the output of the Map function; (4) performing Reduce processing on the <Key, Value> key-value pairs output by the Map function to obtain an index file that maps the key-value pairs to the original file, together with a word segmentation result file, and writing the aggregated final result to HDFS (Hadoop Distributed File System). The method maintains word segmentation accuracy on text big data while greatly improving system throughput and Chinese word segmentation efficiency, and therefore has high practical value.
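
The four steps in the abstract can be sketched in plain Python as follows. This is a single-process illustration under several assumptions, not the patented implementation: records are lines keyed by character offset (Hadoop's text input uses byte offsets), the index simply maps each original offset to the position of its result line, nothing is written to HDFS, and the segmenter is a stand-in passed as a parameter. All names are hypothetical.

```python
def read_records(text):
    """Steps 1-2: cut the input into line records keyed by their starting offset."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)


def map_segment(offset, line, segmenter):
    """Step 3: keep the offset as Key and replace the Value with the word list."""
    return offset, segmenter(line)


def reduce_collect(pairs):
    """Step 4: order results by offset and build an index plus a result text.

    The index pairs each original offset with the offset of the corresponding
    line in the result text (an assumed layout for illustration).
    """
    index, out_lines, position = [], [], 0
    for offset, words in sorted(pairs):
        rendered = " ".join(words)
        index.append((offset, position))
        out_lines.append(rendered)
        position += len(rendered) + 1  # +1 for the newline that joins lines
    return index, "\n".join(out_lines)


def run_pipeline(text, segmenter):
    mapped = [map_segment(off, line, segmenter) for off, line in read_records(text)]
    return reduce_collect(mapped)


# Stand-in segmenter: `list` splits a line into single characters.
index, result = run_pipeline("我爱\n北京\n", list)
print(index)
print(result)
```

Keying every record by its offset is what lets the Reduce side restore the original order and relate each segmented line back to its source position, even though the Map tasks may finish in any order.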

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and more specifically relates to a Chinese word segmentation method for text big data.

Background

[0002] In recent years, Internet information has grown explosively. The volume of text on the Internet keeps growing and information resources keep increasing, so it is becoming more and more difficult to manually extract important information from massive data. The information that users are interested in is submerged in a large amount of irrelevant information. To obtain valuable information from this mass of resources, natural language processing technology has attracted the attention of Internet companies; search engine companies such as Google and Baidu have done extensive research in the field of natural language processing.

[0003] In the big data environment, the processing of massive data requires the use of parallel di...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventor: 邹复好, 周可, 唐小蔓, 郑胜, 张胜, 陈进才, 李春花
Owner: WUHAN SHUWEI TECH