Webpage class feature vector extracting method based on ant colony algorithm

Combining an ant colony algorithm with feature word extraction, this text-mining method addresses the complexity and redundancy of Internet information, which learning algorithms struggle to handle, and aims for high accuracy.

Active Publication Date: 2014-04-23
TONGJI UNIV
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

However, Internet information is complex and redundant. The problem is how to extract accurate information from an inaccurate sample set as the feature values of a class, and obtain the weight of each feature word, without making the task too complicated for the learning algorithm to handle.

Embodiment Construction

[0029] The present invention builds on a traditional search engine. Categories are extracted from the manually classified DMOZ directory, and a web crawler then crawls the first 200 results returned by a full-text search engine for each category name. After noise such as webpage tags and advertisements is removed, the text of each web page is extracted to form the sample set. A tokenizer then segments the training set, stop words and low-frequency words are removed, and the following are counted: the frequency of each word in each article, the total number of documents in which the word appears, the number of times the word and the class name co-occur, and the total number of articles. Finally, the improved ant colony algorithm extracts the feature words and obtains their weight values, yielding the class and its feature words. The specific structure of the class is shown in Figure 1. The classifier is constructed in thi...
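The counting step in this paragraph can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `collect_statistics` and the input layout (a dict from article id to an already tokenized, stop-word-filtered word list) are hypothetical, and co-occurrence is assumed here to mean "the word and the class name appear in the same article", which the text does not spell out.

```python
from collections import Counter


def collect_statistics(articles, class_name):
    """Gather the four statistics named in the paragraph above.

    articles: dict mapping article id -> list of tokens (hypothetical layout;
              tokenization and stop-word removal are assumed done already).
    class_name: the category name the articles were crawled for.
    """
    per_article = {}      # article id -> Counter of word frequencies
    doc_freq = Counter()  # word -> number of articles containing it
    cooccur = Counter()   # word -> number of articles shared with class_name

    for art_id, tokens in articles.items():
        counts = Counter(tokens)
        per_article[art_id] = counts
        has_class = class_name in counts
        for word in counts:
            doc_freq[word] += 1
            # Assumption: co-occurrence is counted once per article.
            if has_class and word != class_name:
                cooccur[word] += 1

    return per_article, doc_freq, cooccur, len(articles)
```

For example, calling `collect_statistics({1: ["ant", "sports"], 2: ["ant", "ant"]}, "sports")` would record that "ant" occurs twice in article 2, appears in two articles, and co-occurs with the class name in one of them.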

Abstract

The invention relates to a method for extracting feature words with an improved ant colony algorithm. The method comprises the following steps. During preprocessing, all information is stored in hash tables: coco_prepare stores the information of each article, consisting of the article id, its words, and the number of occurrences of each word; readhdfs_prepare stores the statistics of each class's training set, consisting of the word frequency of each word, the number of documents containing it, and the number of its co-occurrences with the class name. The parameters of the ant colony algorithm are then set, including the number of ants M, the number of iterations N, the number of ant steps (that is, the number of feature words) K, the initial path pheromone matrix adMatrixs, the local update decay rate p1, the global update decay rate p2, and the amount of pheromone m released by each ant. The method introduces the ant colony algorithm for the first time to the problem of extracting accurate feature vectors for classes when accurate sample sets are lacking.
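To make the roles of M, N, K, p1, p2 and m concrete, here is a generic ant colony sketch in Python. It should be read as an illustration under stated assumptions, not as the patented procedure: the abstract does not give the transition or update rules, so the ones below follow a common ant-colony-system pattern. The function name `select_feature_words`, the per-word `scores` heuristic, and the use of one pheromone value per word (rather than the path matrix adMatrixs) are all simplifications introduced for this example.

```python
import random


def select_feature_words(scores, M=10, N=20, K=3, p1=0.1, p2=0.1, m=1.0, seed=0):
    """Pick K feature words and normalized weights with a simple ant colony loop.

    scores: dict word -> non-negative heuristic score (hypothetical; it could be
            derived from word/class co-occurrence statistics). Requires K <= len(scores).
    M ants, N iterations, K steps per ant, p1/p2 local/global decay, m deposit.
    """
    rng = random.Random(seed)
    words = sorted(scores)
    tau0 = 1.0                                   # assumed initial pheromone level
    pheromone = {w: tau0 for w in words}         # simplification of adMatrixs
    best_set, best_quality = None, float("-inf")

    for _ in range(N):
        for _ in range(M):
            candidates = list(words)
            chosen = []
            for _ in range(K):                   # one word per ant step
                weights = [pheromone[w] * (scores[w] + 1e-9) for w in candidates]
                w = rng.choices(candidates, weights=weights, k=1)[0]
                candidates.remove(w)
                chosen.append(w)
                # Local update: pull pheromone back toward tau0 to keep exploring.
                pheromone[w] = (1 - p1) * pheromone[w] + p1 * tau0
            quality = sum(scores[w] for w in chosen)
            if quality > best_quality:
                best_set, best_quality = chosen, quality
        # Global update: evaporate everywhere, then deposit m on the best set so far.
        for w in words:
            pheromone[w] *= 1 - p2
        for w in best_set:
            pheromone[w] += m

    total = sum(pheromone[w] for w in best_set)
    return {w: pheromone[w] / total for w in best_set}
```

The returned dict maps each selected word to a weight that sums to 1 across the K words; using the final pheromone level as the weight is a design choice of this sketch only.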

Description

Technical field

[0001] The invention relates to text mining and is applied to web page classification.

Background

[0002] Web text mining is a set of methods and tools for extracting useful information from massive numbers of web pages, and web page classification is one of its main tasks. In machine learning, training a classifier presupposes a sample set that accurately represents each class, to serve as the training set and test set. There are three main ways to obtain the sample set: (1) use an existing public corpus; (2) manually collect a sample set for each class name; (3) use web crawlers. In practice, class names are defined to suit differing needs. Method (1) offers too small a corpus to meet actual needs, and method (2) is time-consuming and labor-intensive, so in practice method (3) is used to obtain training sets. However, the Internet information is co...

Application Information

IPC(8): G06F17/30
CPC: G06F16/35, G06N3/00
Inventors: 蒋昌俊, 陈闳中, 闫春钢, 丁志军, 王鹏伟, 孙海春, 邓晓栋, 刘俊俊
Owner: TONGJI UNIV