Text big data-oriented Chinese word segmentation method

A Chinese word segmentation method using big data technology, applied in the fields of electrical digital data processing, special data processing applications, and instruments, which improves throughput and accuracy and reduces word segmentation time.
CN104408034A · Active · Publication Date: 2015-03-11 · WUHAN SHUWEI TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
WUHAN SHUWEI TECH
Publication Date
2015-03-11


Abstract

The invention discloses a Chinese word segmentation method for text big data and belongs to the field of natural language processing. The method comprises the following steps: (1) decomposing large local data files into data blocks; (2) applying Map processing to each decomposed data block to obtain <Key, Value> pairs in which the Key is the offset and the Value is the text content; (3) applying a series of word segmentation steps to obtain the final segmentation result, and emitting <Key, Value> pairs in which the Key is the offset and the Value is the segmentation result as the output of the Map function; (4) applying Reduce processing to the <Key, Value> pairs produced by the Map function to obtain an index file of the <Key, Value> pairs corresponding to the original file, together with a word segmentation result file, and writing the aggregated final result to HDFS (Hadoop Distributed File System). The method maintains word segmentation accuracy on text big data while greatly improving system throughput and Chinese word segmentation efficiency, which gives it high practical value.
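The four steps of the abstract can be illustrated with a minimal single-process sketch in Python. It is not the patented implementation: the abstract does not specify the "series of word segmentation steps", so forward maximum matching against a toy dictionary is swapped in as a stand-in, and the names `TOY_DICT`, `split_into_blocks`, `segment`, `map_fn` and `reduce_fn` are hypothetical. Offsets here are character offsets of each line, a simplification of the offsets a Hadoop job would supply, and the results are returned in memory rather than written to HDFS.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Toy dictionary, purely illustrative (an assumption, not taken from the patent).
TOY_DICT: Set[str] = {"自然", "语言", "自然语言", "处理", "中文", "分词", "大数据"}
MAX_WORD_LEN = max(len(word) for word in TOY_DICT)


def split_into_blocks(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Steps (1)-(2): pair each line with its character offset in the input."""
    blocks: List[Tuple[int, str]] = []
    offset = 0
    for line in lines:
        blocks.append((offset, line))
        offset += len(line) + 1  # +1 accounts for the newline separator
    return blocks


def segment(text: str) -> List[str]:
    """Forward maximum matching, a stand-in for the patent's segmentation steps."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in TOY_DICT:
                words.append(candidate)
                i += size
                break
    return words


def map_fn(offset: int, text: str) -> Tuple[int, str]:
    """Step (3): emit <offset, segmentation result> for one input record."""
    return offset, " ".join(segment(text))


def reduce_fn(pairs: Iterable[Tuple[int, str]]) -> Tuple[Dict[int, int], List[str]]:
    """Step (4): order by offset, then build an index and the result lines.

    The index maps each original offset to the row of its segmentation
    result, standing in for the index file described in the abstract.
    """
    ordered = sorted(pairs)
    result_lines = [value for _, value in ordered]
    index = {key: row for row, (key, _) in enumerate(ordered)}
    return index, result_lines


if __name__ == "__main__":
    records = split_into_blocks(["自然语言处理", "中文分词"])
    index, result = reduce_fn(map_fn(k, v) for k, v in records)
    print(index)
    print(result)
```

Keeping the offset as the key through both phases is what lets the reduce side restore the original order and relate each segmented line back to its position in the source file.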

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and more specifically relates to a Chinese word segmentation method for text big data.

Background art

[0002] In recent years, information on the Internet has grown explosively. The volume of text on the Internet keeps growing and information resources keep increasing, so it is becoming more and more difficult to manually extract important information from massive data; the information that users care about is buried in large amounts of irrelevant information. To extract valuable information from this mass of resources, natural language processing technology has attracted the attention of Internet companies, and search engine companies such as Google and Baidu have carried out extensive research in the field.

[0003] In the big data environment, the processing of massive data requires the use of parallel di...

Claims
