Method and device for segmentation on basis of webpage content classification

A word segmentation technique based on web page content categories, applied in the field of search. It addresses wasted device system resources, poor recognition accuracy for common words, and incorrect word segmentation results, with the effect of improving word segmentation accuracy, reducing re-input by the user, and reducing indexing time.

Inactive Publication Date: 2014-08-27
BEIJING QIHOO TECH CO LTD +1

AI Technical Summary

Problems solved by technology

This frequency-based method only needs to count how often word groups occur in the corpus, but it has clear limitations. It often extracts groups that co-occur frequently but are not words, such as "this", "one", "some", "my", and "many"; its recognition accuracy for common words is poor; and its time and space overhead is large.
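The limitation described above can be illustrated with a minimal sketch. It assumes the simplest possible statistic, a count of adjacent character pairs, which the source does not specify. The function names and the toy corpus are hypothetical, and the string "我的" is an assumed back-translation of the "my" example.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count every pair of adjacent characters across the corpus."""
    counts = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - 1):
            counts[sentence[i:i + 2]] += 1
    return counts


def candidate_words(sentences, min_count):
    """Treat any pair seen at least min_count times as a word candidate."""
    counts = bigram_counts(sentences)
    return {pair for pair, n in counts.items() if n >= min_count}


# Hypothetical toy corpus: "my book", "my pen", "my home".
corpus = ["我的书", "我的笔", "我的家"]

# "我的" ("my") passes the frequency threshold although it is a
# pronoun plus a particle rather than a dictionary word.
candidates = candidate_words(corpus, min_count=3)
```

Raising the threshold does not help here, because function-word combinations are exactly the pairs that occur most often.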
[0006] On the one hand, the result of word segmentation is wrong, so that the related information obtained later is very different from the original expe


Embodiment Construction

[0080] Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

[0081] Referring to Figure 1, a flow chart of the steps of an embodiment of a method for word segmentation based on web content categories according to an embodiment of the present invention is shown. The method may include the following steps:

[0082] Step 101: Extract text information of webpage content in search resources;

[0083] The processing flow of a search engine can generally be divided into two parts, the first part ...
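As a rough illustration of Step 101, the sketch below pulls the visible text out of an HTML page using Python's standard `html.parser` module. The class name `TextExtractor`, the helper `extract_text`, and the choice to skip `script` and `style` content are assumptions made for the example, not details taken from the patent.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the page's text fragments joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p{}</style><title>News</title></head>"
    "<body><p>Hello <b>world</b></p><script>var x=1;</script></body></html>"
)
text = extract_text(page)
```

A production crawler would also need to handle character encodings and boilerplate such as navigation bars, which this sketch leaves out.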

Abstract

The embodiment of the invention provides a method and a device for word segmentation based on web page content categories. The method comprises the following steps: extracting the text information of web page content in search resources; classifying the text information according to the category of the web page content; and segmenting the text information using the segmentation dictionary that corresponds to its category. Because the text information is first classified and then segmented with a category-specific dictionary, segmentation adapts better to the language characteristics of each category, segmentation accuracy for each category improves, and segmentation is optimized locally. More accurate segmentation also brings results closer to the user's intent and improves the user experience, which reduces operations such as re-entering queries and searching again and makes operation simpler. The device in turn has to respond to fewer user operations, which reduces its consumption of system resources.
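The three steps in the abstract can be sketched as follows, under stated assumptions. The abstract does not say how text is classified or how a dictionary is applied, so this example substitutes a keyword-count classifier and forward maximum matching. The category names, keyword lists, and dictionaries are all hypothetical.

```python
# Hypothetical per-category segmentation dictionaries.
DICTIONARIES = {
    "sports": {"世界杯", "足球", "比赛"},          # World Cup, football, match
    "tech": {"搜索", "引擎", "搜索引擎", "分词"},  # search, engine, search engine, segmentation
}

# Hypothetical keywords used by the stand-in classifier.
CLASS_KEYWORDS = {
    "sports": ["足球", "世界杯"],
    "tech": ["搜索", "分词"],
}


def classify(text):
    """Pick the category whose keywords occur most often in the text."""
    scores = {
        label: sum(text.count(word) for word in words)
        for label, words in CLASS_KEYWORDS.items()
    }
    return max(scores, key=scores.get)


def segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary entry at each
    position, falling back to a single character when nothing matches."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment_by_class(text):
    """Classify the text, then segment it with that category's dictionary."""
    label = classify(text)
    return label, segment(text, DICTIONARIES[label])


label, tokens = segment_by_class("搜索引擎分词")
```

Segmenting the same string with the `sports` dictionary instead would fall back to single characters, which is the kind of mismatch that category-specific dictionaries are meant to avoid.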

Description

Technical field

[0001] The invention relates to the technical field of search, in particular to a method for word segmentation processing based on web content categories and a device for word segmentation processing based on web content categories.

Background technique

[0002] With the rapid development of the Internet, network applications tend to be diversified, and the amount of information on the Internet has increased dramatically.

[0003] In various occasions, users often need to input key information to obtain related information. For example, enter keywords in a search engine to search for web information, enter keywords in a forum to search for posts, and so on.

[0004] Word segmentation is the basis for information processing and information retrieval. All information processing and information retrieval work are performed after word segmentation. Therefore, the segmentation error will be added to the subsequent processing, and it is difficult to eliminate. Because of...

Application Information

IPC(8): G06F17/30
CPC: G06F16/951, G06F16/3344
Inventor: 项碧波
Owner: BEIJING QIHOO TECH CO LTD