
Method and device for webpage text classification, method and device for webpage text recognition

A technology for web page text classification, applied in the field of web page text classification devices and web page text recognition devices. It addresses problems such as the exaggerated role of invalid words, ignored important attributes, and unsatisfactory classification features, with the stated effects of ensuring objective and accurate feature extraction and improving classification accuracy.

Active Publication Date: 2021-04-30
ALIBABA GRP HLDG LTD
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] Because massive texts are polysemous, ambiguous, and heterogeneous, the existing technology selects classification features poorly. For example, the role of some invalid words is often exaggerated, or important attributes of some feature word segments are ignored, resulting in extremely low accuracy of web page text classification.



Examples


Detailed Description of the Embodiments

[0098] To make the above objects, features, and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.

[0099] Text classification learns the mapping rules between categories and unknown texts by training on a given set of texts; that is, it calculates the correlation between texts and categories, and then determines the category of a text with the trained classifier.

[0100] Text classification is a supervised learning process: it finds the relationship model (classifier) between text attributes (features) and text categories from a set of labeled training texts, and then uses this learned model to judge the category of new text. The overall process of text classification can be divided into two parts: training and classification. The purpose of training is ...
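The two-part process of training and classification can be illustrated with a minimal sketch. The source does not name a classifier, so the multinomial Naive Bayes model with add-one smoothing used here is an assumption, as are the function names `train_nb` and `classify_nb`, the whitespace tokenizer, and the toy training texts.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_texts):
    """Training phase: collect per-category word counts from labeled texts."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of training texts
    vocab = set()
    for text, category in labeled_texts:
        tokens = text.lower().split()   # assumed stand-in for word segmentation
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify_nb(model, text):
    """Classification phase: pick the category with the highest log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        total_words = sum(word_counts[category].values())
        # Log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothed log likelihood of the word.
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical labeled training set.
training_set = [
    ("cheap phone sale", "shopping"),
    ("football match score", "sports"),
]
nb_model = train_nb(training_set)
predicted = classify_nb(nb_model, "phone sale today")
```

Here `train_nb` corresponds to the training part (learning the relationship between features and categories) and `classify_nb` to the classification part (applying the learned model to new text).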


Abstract

Embodiments of the present application provide a method and device for classifying web page text, and a method and device for recognizing web page text. The method for classifying web page text includes: collecting text data in a web page; performing word segmentation on the text data to obtain basic word segments; calculating a first attribute value and a second attribute value for each basic word segment; calculating a feature value for each basic word segment from the first attribute value and the second attribute value; screening feature word segments out of the basic word segments according to the feature values; calculating a weight for each feature word segment; and taking the weights as the feature vector of the corresponding feature word segments and using the feature vector to train a classification model. The embodiments of the present application ensure the objectivity and accuracy of feature extraction and also take into account the influence of features on classification, thereby improving the accuracy of web page text classification and making it easier for users to obtain effective information from massive texts in a timely and accurate manner.
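The steps listed in the abstract can be sketched end to end as follows. This excerpt does not define the first and second attribute values, the feature value formula, or the weighting scheme, so every concrete choice below is an assumption made for illustration: the first attribute is the spread of a word's per-class document rate, the second is an inverse document frequency, the feature value is their product, the weight is term count times feature value, and the classifier is a nearest-centroid model. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def segment(text):
    # Assumed stand-in for word segmentation: lowercase whitespace split.
    return text.lower().split()


def select_features(token_lists, labels, top_k):
    """Score basic word segments from two attribute values; keep the top_k."""
    classes = sorted(set(labels))
    class_sizes = Counter(labels)
    doc_freq = defaultdict(Counter)  # word -> class -> documents containing it
    for tokens, label in zip(token_lists, labels):
        for word in set(tokens):
            doc_freq[word][label] += 1
    scores = {}
    for word, per_class in doc_freq.items():
        # First attribute (assumed): std. deviation of per-class document rates.
        rates = [per_class[c] / class_sizes[c] for c in classes]
        mean = sum(rates) / len(rates)
        first = math.sqrt(sum((r - mean) ** 2 for r in rates) / len(rates))
        # Second attribute (assumed): smoothed inverse document frequency.
        second = math.log(len(token_lists) / sum(per_class.values())) + 1.0
        # Feature value (assumed): product of the two attribute values.
        scores[word] = first * second
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k], scores


def vectorize(tokens, features, scores):
    # Weight (assumed): term count times the word's feature value.
    counts = Counter(tokens)
    return [counts[w] * scores[w] for w in features]


def train(texts, labels, top_k=4):
    token_lists = [segment(t) for t in texts]
    features, scores = select_features(token_lists, labels, top_k)
    class_sizes = Counter(labels)
    centroids = {c: [0.0] * len(features) for c in class_sizes}
    for tokens, label in zip(token_lists, labels):
        for i, v in enumerate(vectorize(tokens, features, scores)):
            centroids[label][i] += v / class_sizes[label]
    return {"features": features, "scores": scores, "centroids": centroids}


def classify(model, text):
    vec = vectorize(segment(text), model["features"], model["scores"])

    def similarity(category):
        return sum(x * y for x, y in zip(vec, model["centroids"][category]))

    return max(sorted(model["centroids"]), key=similarity)


# Hypothetical web page texts and labels.
texts = ["cheap phone sale", "phone sale today",
         "football match today", "football match score"]
labels = ["shop", "shop", "sport", "sport"]
model = train(texts, labels, top_k=4)
result = classify(model, "big football match")
```

In this toy data the word "today" appears equally often in both classes, so its assumed first attribute is zero and it is screened out, which mirrors the stated goal of not exaggerating the role of words that do not help classification.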

Description

Technical field

[0001] The present application relates to the technical field of text classification, and in particular to a method for classifying web page text, a device for classifying web page text, a method for recognizing web page text, and a device for recognizing web page text.

Background

[0002] In today's information society, various forms of information have greatly enriched people's lives. With the large-scale popularization of the Internet in particular, the amount of information on the network is growing rapidly: electronic documents, emails, and web pages flood the network, resulting in information clutter. To find the information we need quickly, accurately, and comprehensively, text classification has become an important way to organize and manage text data effectively, and it has attracted more and more attention.

[0003] Webpage text classification refers to determining the category of the corresponding webpage according ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F16/36, G06F16/95, G06F40/237
CPC: G06F16/95, G06F16/36, G06F16/355, G06F40/237
Inventor: 段秉南
Owner: ALIBABA GRP HLDG LTD