Traditional Mongolian webpage recognition method and traditional Mongolian webpage recognition system

A recognition method and recognition device, applied in the network field, that address two problems: absolute frequency does not account for the characteristics of word usage in texts from different fields, and the accuracy of webpage language identification varies greatly.

Status: Inactive. Publication date: 2015-05-06
MINZU UNIVERSITY OF CHINA
Cites: 3. Cited by: 0

AI Technical Summary

Problems solved by technology

Of the three existing techniques for identifying webpage languages, recognition based on high-frequency words is more effective than the other two. However, it considers only the absolute frequency of language units and ignores how word usage differs across texts from different fields. As a result, the recognition accuracy for webpage languages varies greatly.

Embodiment Construction

[0023] The technical solutions of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments.

[0024] Figure 1 is a flow chart of the traditional Mongolian webpage recognition method provided by the first embodiment. As shown in Figure 1, the method includes:

[0025] Step S101, acquiring and counting the word frequency and document frequency of each word in the traditional Mongolian webpage corpus.

[0026] Specifically, each word in the traditional Mongolian webpage corpus is obtained, and the word frequency TF_i and document frequency DF_i of each word are counted, where i ≥ 0.

[0027] Here, within a given file, term frequency (TF) refers to the number of times a given word appears in that file.

[0028] Within a given file set, document frequency (DF) refers to the number of files in the set that contain a given word.
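The counting in step S101 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `count_tf_df`, the whitespace tokenization, and the choice to sum each word's occurrences over the whole corpus are assumptions made for illustration only.

```python
from collections import Counter

def count_tf_df(documents):
    """Return (tf, df) Counters for a list of document strings.

    Hypothetical helper: the patent does not specify tokenization or
    how per-file counts are aggregated over the corpus.
    """
    tf = Counter()  # total occurrences of each word across the corpus
    df = Counter()  # number of documents that contain each word
    for text in documents:
        words = text.split()   # assumption: whitespace-delimited tokens
        tf.update(words)       # every occurrence counts toward TF
        df.update(set(words))  # each word counts once per document for DF
    return tf, df

# Toy corpus standing in for traditional Mongolian webpages.
corpus = ["a b a", "b c", "a"]
tf, df = count_tf_df(corpus)
```

A real traditional Mongolian corpus would likely need encoding normalization before tokenization, which this sketch leaves out.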

[0029] Optionally, before obtaining and counting the word frequency a...

Abstract

The invention relates to a traditional Mongolian webpage recognition method and a traditional Mongolian webpage recognition system. The method includes the following steps: the word frequency and document frequency of each word in a traditional Mongolian webpage corpus are obtained and counted, and the harmonic mean of each word is calculated; the words are sorted by harmonic mean in descending order, a first number of top-ranked words are selected, and their harmonic means are accumulated to obtain a first accumulated sum; the word frequencies of those selected words in the webpage to be recognized are counted and accumulated to obtain a second accumulated sum; when the difference between the first accumulated sum and the second accumulated sum is less than or equal to a first threshold, the webpage to be recognized is determined to be a traditional Mongolian webpage. The method recognizes traditional Mongolian webpages with high accuracy and efficiency, which helps in collecting traditional Mongolian webpages and implementing a traditional Mongolian full-text search engine.
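The steps in the abstract can be sketched roughly as follows. This is a hedged illustration, not the patented implementation: the harmonic mean is assumed to be taken over each word's TF and DF (2·TF·DF / (TF + DF)), the "difference" is assumed to be an absolute difference, and the names `harmonic_mean`, `build_profile`, `is_traditional_mongolian`, `top_n`, and `threshold` are all hypothetical.

```python
from collections import Counter

def harmonic_mean(tf, df):
    # Assumption: harmonic mean of a word's term and document frequency.
    return 2.0 * tf * df / (tf + df) if tf + df else 0.0

def build_profile(tf, df, top_n):
    """Pick the top_n words by harmonic mean and sum their scores."""
    scores = {w: harmonic_mean(tf[w], df[w]) for w in tf}
    # Descending by score; ties broken alphabetically for determinism.
    top_words = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    first_sum = sum(scores[w] for w in top_words)
    return top_words, first_sum

def is_traditional_mongolian(page_text, top_words, first_sum, threshold):
    """Compare the page's counts of the selected words to the corpus sum."""
    page_tf = Counter(page_text.split())  # assumption: whitespace tokens
    second_sum = sum(page_tf[w] for w in top_words)
    # Assumption: "difference" is read as an absolute difference.
    return abs(first_sum - second_sum) <= threshold
```

For example, `build_profile(tf, df, 2)` returns the two highest-scoring words and their accumulated harmonic means, which can then be passed to `is_traditional_mongolian` together with a threshold. In practice the two sums would probably need to be on comparable scales (for instance, normalized frequencies); the source excerpt does not say how this is handled.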

Description

Technical field

[0001] The invention relates to the field of network technology, and in particular to a traditional Mongolian webpage recognition method and device.

Background

[0002] Traditional Mongolian is the official written form of the Mongolian language in the Inner Mongolia Autonomous Region of China (that is, the orthography of Mongolian written in the Mongolian alphabet). Traditional Mongolian network resources are an important way for the Mongolian people to transmit information and share resources in their own language, and they are also the main platform for passing on traditional Mongolian culture. Full-text search engines are therefore of great significance. Compared with Chinese and English network resources, traditional Mongolian network resources in China are relatively few, and their encoding is complex, so collecting them accurately and efficiently is very important. Preliminary...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 王志娟
Owner: MINZU UNIVERSITY OF CHINA