Text big data-oriented Chinese word segmentation method

A Chinese word segmentation technology for big data, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc., which improves throughput and accuracy and reduces word segmentation time.

Active Publication Date: 2015-03-11

WUHAN SHUWEI TECH

5 Cites, 19 Cited by

Problems solved by technology

To improve throughput, word segmentation efficiency, and accuracy in a big data environment, the present invention proposes a Chinese word segmentation method for text big data. It uses the MapReduce computing model to process massive data and combines statistics-based and string-matching-based segmentation, which effectively addresses the accuracy, practicability, and efficiency problems of Chinese word segmentation on large text data.

Embodiment Construction

[0051] The technical terms used in the present invention are first explained below:

[0052] MapReduce computing model: MapReduce is a general software framework proposed by Google for implementing distributed parallel computing tasks. It simplifies the parallel programming model on very large clusters built from ordinary computers, and can be used for parallel computation over large-scale data sets.
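The computing model described in [0052] can be illustrated with a minimal single-process sketch. It is not Hadoop code and is not taken from the patent; the names `map_reduce`, `count_mapper`, and `sum_reducer` are hypothetical, and a real cluster would run the map and reduce phases in parallel across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, group-by-key (shuffle), and reduce in one process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    # Map phase: every record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: values with the same key are collected together.
            grouped[key].append(value)
    # Reduce phase: each key's values are combined into a single result.
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}


def count_mapper(line: str) -> Iterator[Tuple[str, int]]:
    # Hypothetical mapper: emit (token, 1) for each whitespace-separated token.
    for token in line.split():
        yield token, 1


def sum_reducer(key: str, values: List[int]) -> int:
    # Hypothetical reducer: add up the counts emitted for one token.
    return sum(values)


lines = ["big data big text", "text data"]
counts = map_reduce(lines, count_mapper, sum_reducer)
print(counts)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both phases can be distributed without changing the user-supplied functions, which is the simplification the paragraph refers to.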

[0053] Hadoop Distributed File System: Hadoop is a distributed system infrastructure developed by the Apache Foundation. The core of the Hadoop framework consists of HDFS (Hadoop Distributed File System) and MapReduce. HDFS provides storage for massive data, and MapReduce provides computation over massive data. In Hadoop MapReduce, there are mainly three roles: JobClient, JobTracker, and TaskTracker. JobClient is used to submit tasks; JobTracker is used to monitor the running status of tasks and perform the corresponding scheduling; TaskTracker actively...

Abstract

The invention discloses a Chinese word segmentation method for text big data and belongs to the field of natural language processing. The method comprises the following steps: (1) splitting local massive data files into data blocks; (2) applying Map processing to each data block to obtain <Key, Value> pairs in which the Key is the offset and the Value is the text content; (3) applying a series of word segmentation steps to obtain the final segmentation result, and emitting <Key, Value> pairs in which the Key is the offset and the Value is the segmentation result as the output of the Map function; (4) applying Reduce processing to the <Key, Value> pairs output by the Map function to obtain an index file that maps the pairs to the original file, together with a word segmentation result file, and writing the aggregated final result to HDFS (Hadoop Distributed File System). The method maintains word segmentation accuracy on text big data while greatly improving system throughput and Chinese word segmentation efficiency, so it has high practical value.
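The four steps of the abstract can be sketched in a single process as follows. This is an illustration under stated assumptions, not the patented implementation: the patent's "series of word segmentation processing" is not detailed here, so plain forward maximum matching against a toy dictionary stands in for it; offsets are character offsets rather than byte offsets; the layout of the index is a guess; and the final write to HDFS is omitted. All names (`DICTIONARY`, `read_records`, `segment`, `map_fn`, `reduce_fn`) are hypothetical.

```python
from typing import Iterator, List, Tuple

# Hypothetical toy dictionary; a real system would load a large lexicon.
DICTIONARY = {"中文", "分词", "方法", "大数据", "文本"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def read_records(text: str) -> Iterator[Tuple[int, str]]:
    """Steps (1)-(2): yield <offset, line> pairs (character offsets, an assumption)."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)


def segment(text: str) -> List[str]:
    """Stand-in segmenter: forward maximum matching against DICTIONARY."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words


def map_fn(offset: int, line: str) -> Tuple[int, List[str]]:
    """Step (3): emit <offset, segmentation result>."""
    return offset, segment(line)


def reduce_fn(pairs: List[Tuple[int, List[str]]]) -> Tuple[List[Tuple[int, int]], List[str]]:
    """Step (4): build an index (source offset -> result offset) and the result lines."""
    index: List[Tuple[int, int]] = []
    result_lines: List[str] = []
    result_offset = 0
    for offset, words in sorted(pairs, key=lambda pair: pair[0]):
        line = " ".join(words)
        index.append((offset, result_offset))
        result_lines.append(line)
        result_offset += len(line) + 1  # +1 for the newline in the result file
    return index, result_lines


source = "中文分词方法\n文本大数据\n"
mapped = [map_fn(offset, line) for offset, line in read_records(source)]
index, result_lines = reduce_fn(mapped)
print(result_lines)
print(index)
```

Keeping the offset as the key through both phases is what lets the reduce step line each segmented record up with its position in the original file.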

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and more specifically relates to a Chinese word segmentation method for text big data.

Background technique

[0002] In recent years, information on the Internet has grown explosively. The volume of text on the Internet keeps increasing and information resources keep multiplying, so it is becoming more and more difficult to manually extract important information from massive data; the information users care about is buried in a large amount of irrelevant information. To obtain valuable information from this mass of resources, natural language processing technology has attracted the attention of Internet companies; search engine companies such as Google and Baidu have conducted extensive research in the field.

[0003] In the big data environment, the processing of massive data requires the use of parallel di...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/27

Inventor: 邹复好, 周可, 唐小蔓, 郑胜, 张胜, 陈进才, 李春花

Owner: WUHAN SHUWEI TECH
