Webpage text classification method and device and webpage text identification method and device

Text classification and webpage technology, applied to webpage text classification and webpage text recognition methods and devices. It addresses problems such as unsatisfactory classification features, ignored important attributes, and the exaggerated role of invalid words.

Active Publication Date: 2017-10-24
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

[0004] Because massive text collections are polysemous, ambiguous, and heterogeneous, the classification features selected by existing techniques are unsatisfactory. For example, the role of some invalid words is often exaggerated, or important attributes of some feature segmentation words are ignored, which results in extremely low accuracy of web page text classification.



Examples


Embodiment Construction

[0098] To make the above objects, features, and advantages of the present application clearer and easier to understand, the present application is described in further detail below in conjunction with the accompanying drawings and specific embodiments.

[0099] Text classification obtains the mapping rules between categories and unknown texts by training on a given set of texts. That is, it calculates the correlation between texts and categories and then determines the category of a text with the trained classifier.

[0100] Text classification is a supervised learning process: it finds the relationship model (classifier) between text attributes (features) and text categories based on a set of labeled training texts, and then uses this learned relationship model to judge the category of new text. The overall process of text classification can be divided into two parts: training and classification. The purpose of training is ...
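
The two-part process described in [0100] can be illustrated with a minimal sketch. The source does not name a classifier at this point, so the multinomial Naive Bayes model below is an assumed stand-in, and the function names `train` and `classify`, the whitespace tokenization, and the sample texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Training part: learn per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of training texts
    vocab = set()
    for text, category in labeled_texts:
        tokens = text.lower().split()   # assumption: whitespace tokenization
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Classification part: pick the category with the highest log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # category prior
        for token in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical labeled training texts.
training_set = [
    ("cheap phone discount sale", "shopping"),
    ("buy shoes sale price", "shopping"),
    ("football match score goal", "sports"),
    ("team wins league match", "sports"),
]
model = train(training_set)
print(classify(model, "discount price on shoes"))
print(classify(model, "league goal"))
```

The learned `model` plays the role of the relationship model between features and categories; any other supervised classifier could be substituted without changing the train-then-classify structure.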


Abstract

The embodiment of the invention provides a webpage text classification method and device and a webpage text identification method and device. The webpage text classification method comprises the steps of collecting text data in a webpage; carrying out word segmentation on the text data and obtaining basic segmentation words; calculating first attribute values and second attribute values of the basic segmentation words; calculating feature values of the basic segmentation words according to the first attribute values and second attribute values; screening feature segmentation words from the basic segmentation words according to the feature values; calculating weights corresponding to the feature segmentation words; and taking the weights as the feature vectors of the corresponding feature segmentation words and training a classification model through adoption of the feature vectors. According to the embodiment of the invention, the feature extraction objectivity and accuracy are effectively ensured; the influences of features on the classification are taken into consideration; the webpage text classification accuracy is improved; and a user can obtain valid information timely and accurately from massive texts.
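
The steps listed in the abstract can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented method: the abstract does not define the first and second attribute values, the feature value formula, the weight, or the classifier. Here document coverage and category concentration stand in for the two attribute values, their product is the feature value, a TF-IDF style score is the weight, and a nearest-centroid model is the classifier. All function names and sample pages are hypothetical.

```python
import math
from collections import Counter, defaultdict


def segment(text):
    # Assumption: whitespace splitting stands in for a real word segmenter.
    return text.lower().split()


def document_frequencies(docs, labels):
    df = Counter()                     # word -> number of documents containing it
    df_by_cat = defaultdict(Counter)   # word -> category -> document count
    for tokens, label in zip(docs, labels):
        for word in set(tokens):
            df[word] += 1
            df_by_cat[word][label] += 1
    return df, df_by_cat


def feature_values(df, df_by_cat, n_docs, n_cats):
    values = {}
    for word in df:
        first = df[word] / n_docs                       # hypothetical attribute 1: coverage
        share = max(df_by_cat[word].values()) / df[word]
        # Hypothetical attribute 2: concentration in one category, 0 at chance level.
        second = (share - 1 / n_cats) / (1 - 1 / n_cats) if n_cats > 1 else 0.0
        values[word] = first * second
    return values


def select_features(values, k):
    # Highest feature value first; ties are broken alphabetically.
    return sorted(values, key=lambda w: (-values[w], w))[:k]


def weight_vector(tokens, features, df, n_docs):
    tf = Counter(tokens)
    total = len(tokens) or 1
    # TF-IDF style weight (assumed); words unseen in training get weight 0.
    return [tf[w] / total * (math.log(n_docs / df[w]) + 1) if df[w] else 0.0
            for w in features]


def train_centroids(vectors, labels):
    sums, counts = {}, Counter()
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] += 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict(centroids, vec):
    return max(sorted(centroids), key=lambda c: cosine(centroids[c], vec))


# Hypothetical collected page texts with their categories.
pages = [
    ("the cheap phone is on sale", "shopping"),
    ("the shoes are on sale today", "shopping"),
    ("the team won the match", "sports"),
    ("the match ended in a draw", "sports"),
]
docs = [segment(text) for text, _ in pages]
labels = [category for _, category in pages]
df, df_by_cat = document_frequencies(docs, labels)
values = feature_values(df, df_by_cat, len(docs), len(set(labels)))
features = select_features(values, k=3)
vectors = [weight_vector(d, features, df, len(docs)) for d in docs]
model = train_centroids(vectors, labels)
query = weight_vector(segment("shoes on sale"), features, df, len(docs))
print(features, predict(model, query))
```

With these stand-in attributes, a word such as "the" that appears in every category receives a feature value of zero and is screened out, which reflects the stated goal of not exaggerating invalid words.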

Description

Technical Field

[0001] The present application relates to the technical field of text classification, and in particular to a method for classifying web page text, a device for classifying web page text, a method for recognizing web page text, and a device for recognizing web page text.

Background Art

[0002] In today's information society, various forms of information have greatly enriched people's lives. With the large-scale popularization of the Internet in particular, the amount of information on the network is growing rapidly: electronic documents, emails, and web pages flood the network, resulting in information clutter. To find the information we need quickly, accurately, and comprehensively, text classification has become an important way to organize and manage text data effectively, and it has attracted more and more attention.

[0003] Webpage text classification refers to determining the category of the corresponding webpage according ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F40/237
CPC: G06F16/95, G06F16/36, G06F16/355, G06F40/237
Inventor: 段秉南
Owner: ALIBABA GRP HLDG LTD