Construction method and device of word segmentation training data

A method for constructing word segmentation training data, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the data sparsity and limited data sources of existing word segmentation training data, enriching the data sources and overcoming the data sparsity problem.

Active Publication Date: 2015-02-04
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

Not every piece of corpus data has matching web page content on the Internet that contains anchor text, so the data sources available to the anchor-text-based scheme are very limited.
If word segmentation training data is obtained entirely in this way, it will therefore suffer from an obvious data sparsity problem.



Examples


Embodiment Construction

[0022] The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for convenience of description, the drawings show only the parts related to the present invention rather than the entire content.

[0023] Figure 1 and Figure 2 show a first embodiment of the invention.

[0024] Figure 1 is a flow chart of the method for constructing word segmentation training data provided by the first embodiment of the present invention. Referring to Figure 1, the method for constructing the word segmentation training data includes:

[0025] S110. Obtain the query statement issued by the user in one query session, and the webpage title of the webpage link that the user clicked in the query results for that query statement.
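Step S110 can be illustrated with a minimal sketch. The `Click` and `QuerySession` record types, their field names, and the use of a click timestamp are all hypothetical; the source does not specify how session logs are stored. Choosing the last click follows the abstract's wording ("the webpage finally clicked by the user").

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Click:
    """One clicked search result (hypothetical log record)."""
    title: str        # title of the clicked webpage
    timestamp: float  # time of the click, in seconds


@dataclass
class QuerySession:
    """One query session: a query statement and the clicks it produced (hypothetical)."""
    query: str
    clicks: List[Click] = field(default_factory=list)


def query_title_pair(session: QuerySession) -> Optional[Tuple[str, str]]:
    """Return (query, title of the last clicked page), or None if nothing was clicked."""
    if not session.clicks:
        return None
    # Assumption: the click with the largest timestamp is the final click.
    last = max(session.clicks, key=lambda c: c.timestamp)
    return session.query, last.title
```

Sessions without any click yield `None` and would simply be skipped, since they provide no title to compare against the query.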

[0026] Since ...


Abstract

An embodiment of the invention discloses a method and device for constructing word segmentation training data. The method comprises the following steps: acquiring a user's query statement in one query session of the user and the webpage title of the webpage finally clicked by the user; comparing the query statement with the webpage title to obtain a common character string shared by the query statement and the webpage title; and performing word segmentation on the query statement and the webpage title according to the obtained common character string. The method and device provided by the embodiment enrich the data sources of word segmentation training data and address the data sparsity problem of such data.
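The compare-then-segment idea in the abstract can be sketched as follows. This is only an illustration under stated assumptions: it takes the single longest common substring as the "common character string" and cuts each text at that substring's boundaries. The excerpt does not say whether the patented method uses one or several common strings or how ties are resolved, and the function names are hypothetical.

```python
from typing import List


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous string shared by a and b (first one found on ties)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def split_on(text: str, common: str) -> List[str]:
    """Cut text at the boundaries of the first occurrence of common (assumed rule)."""
    if not text:
        return []
    idx = text.find(common) if common else -1
    if idx < 0:
        return [text]
    parts = [text[:idx], common, text[idx + len(common):]]
    return [p for p in parts if p]


query = "苹果手机价格"
title = "最新苹果手机报价"
common = longest_common_substring(query, title)
print(common)                   # 苹果手机
print(split_on(query, common))  # ['苹果手机', '价格']
print(split_on(title, common))  # ['最新', '苹果手机', '报价']
```

The dynamic-programming table keeps only two rows, so memory grows with the title length rather than with the product of both lengths, which is adequate for short queries and titles.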

Description

Technical field

[0001] Embodiments of the present invention relate to the technical field of natural language processing, and in particular to a method and device for constructing word segmentation training data.

Background

[0002] Most word segmentation techniques rely on a background corpus, so the annotation quality of the corpus data determines the quality of the final word segmentation results. At present, the corpus data in most corpora is labeled manually. Manual labeling places high demands on the professional skill of the annotators and is time-consuming and laborious, which makes word segmentation of corpus data inefficient.

[0003] One solution for improving the efficiency of word segmentation of corpus data is to use the anchor text on web pages as a reference when segmenting the corpus data. For example, the text "John Venn was a 19th-century Brit...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/3344, G06F16/9535
Inventor: 石磊, 张开旭
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD