Method of Tibetan language webpage text classification based on semanteme

A text classification and web page technique, applied to semantic-based Tibetan web page text classification, that addresses three problems: constraints on applied research, the lack of knowledge base resources, and the large amount of computation.

Active Publication Date: 2013-07-24
MINZU UNIVERSITY OF CHINA

AI Technical Summary

Problems solved by technology

However, the lack of Tibetan ontology knowledge base resources restricts applied research at the semantic level of Tibetan.
Moreover, in the traditional web text classification method, Tibetan words are conside...

Embodiment Construction

[0041] The technical solutions of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments.

[0042] Figure 1 is a flow chart of the semantic-based Tibetan web page text classification method of the present invention. As shown in Figure 1, the method comprises:

[0043] Step 101, extract text information used to characterize the webpage from the Tibetan webpage.

[0044] In step 101, the text information is first extracted from the Tibetan web page using a rule-based method. The extracted text information is denoted X_1 and includes the body content of the page CT_1, the column of the page CL_1, the title of the page T_1, and the publication date of the page D_1;

[0045] Specifically, the rule-based method pre-analyzes the characteristics of each website's set of web pages, and the corresponding ...
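The record X_1 described in [0044] can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the `PageText` class, the `RULES` table, and the HTML class names `column`, `date`, and `content` are hypothetical and stand in for whatever per-site rules the method actually derives, which the text above does not spell out.

```python
import re
from dataclasses import dataclass


@dataclass
class PageText:
    """Text information X_1 extracted from one web page."""
    body: str    # CT_1: body content of the page
    column: str  # CL_1: column (site section) of the page
    title: str   # T_1: title of the page
    date: str    # D_1: publication date of the page


# Hypothetical rule set for a single website. A real deployment would
# hold one such table per site, written after inspecting its pages.
RULES = {
    "title": re.compile(r"<title>(.*?)</title>", re.S),
    "column": re.compile(r'<div class="column">(.*?)</div>', re.S),
    "date": re.compile(r'<span class="date">(\d{4}-\d{2}-\d{2})</span>'),
    "body": re.compile(r'<div class="content">(.*?)</div>', re.S),
}

TAG = re.compile(r"<[^>]+>")


def _first(rule: re.Pattern, html: str) -> str:
    """Return the first match of a rule with inner tags stripped, or ''."""
    match = rule.search(html)
    return TAG.sub("", match.group(1)).strip() if match else ""


def extract_page_text(html: str) -> PageText:
    """Apply the site rules to raw HTML and build the X_1 record."""
    return PageText(
        body=_first(RULES["body"], html),
        column=_first(RULES["column"], html),
        title=_first(RULES["title"], html),
        date=_first(RULES["date"], html),
    )
```

A field whose rule does not match is left as an empty string, so a page that lacks, say, a date still yields a usable record. Regular expressions are used only to keep the sketch dependency-free; an HTML parser would be more robust on nested markup.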

Abstract

The invention relates to a semantic-based method for classifying Tibetan web page text. The method comprises the following steps: first, extracting from the Tibetan web page the text information that characterizes the page; then performing word segmentation on the text information and representing the resulting words as a word vector space; next, mapping the words in the word vector space to concepts in a semantic space according to a preset Tibetan classification ontology, to obtain the semantic space of the text to be classified; and finally, using a classification algorithm to classify the semantic space of the text to be classified according to the preset semantic space of a training sample set. The method preprocesses the web page and adopts a KNN classification algorithm based on weighted semantic network text similarity to achieve real-time, efficient classification of Tibetan web pages.
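The concept mapping and KNN steps summarized in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented algorithm: the ontology is modeled as a plain word-to-concept dictionary, the input is assumed to be already segmented, and ordinary cosine similarity with similarity-weighted voting is swapped in for the weighted semantic network similarity, whose definition is not given in the text above. All function names are hypothetical.

```python
import math
from collections import Counter


def to_concept_vector(words, ontology):
    """Map segmented words to concept counts.

    `ontology` is a hypothetical word -> concept dictionary standing in
    for the Tibetan classification ontology. Words it does not contain
    are dropped.
    """
    return Counter(ontology[w] for w in words if w in ontology)


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Label `query` by a similarity-weighted vote of its k nearest samples.

    `training` is a list of (concept_vector, label) pairs. Cosine
    similarity is an assumption used here in place of the weighted
    semantic network similarity named in the abstract.
    """
    if not training:
        raise ValueError("training set is empty")
    scored = [(cosine(query, vec), label) for vec, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]
```

Working in concept space rather than word space means two texts that share no surface words can still be judged similar when their words map to the same ontology concepts, which is the motivation the abstract gives for the semantic mapping step.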

Description

Technical field

[0001] The invention relates to data preprocessing technology, in particular to a semantic-based Tibetan web page text classification method.

Background

[0002] With the rapid development of informatization and the economy in Tibetan areas, the number of Tibetan internet users and web pages is growing at a striking rate. The Internet has become a carrier for transmitting and sharing Tibetan information, and a place for Tibetan people to express their opinions. Inappropriate remarks may stir up public opinion, and negative information poses a greater threat to social and public safety. Tibetan web page text classification is the premise and basis for Tibetan online public opinion monitoring, and has important research value.

[0003] Semantic classification of Tibetan web text is a key technology for processing and organizing large amounts of web text data. It can automatically determine the text...

Application Information

IPC(8): G06F17/30, G06F17/27
Inventor 胥桂仙
Owner MINZU UNIVERSITY OF CHINA