Webpage class feature vector extracting method based on ant colony algorithm

An ant colony algorithm combined with feature word extraction, applied in the field of text mining, addresses the complexity and redundancy of Internet information, which learning algorithms otherwise cannot handle, and achieves high accuracy.

Active Publication Date: 2014-04-23

TONGJI UNIV

Cites: 3. Cited by: 0.

AI Technical Summary

Problems solved by technology

However, Internet information is complex and redundant. The problem is how to extract accurate information from an inaccurate sample set as the feature values of a class, and how to obtain the weight of each feature word, without making the task too complicated for the learning algorithm to handle.

Embodiment Construction

[0029] The present invention builds on a traditional search engine. Categories are extracted from the manually classified DMOZ directory. A web crawler then fetches the first 200 results returned by a full-text search engine for each category name, and after noise such as webpage tags and advertisements is removed, the text of each web page is extracted to form the sample set. A tokenizer then segments the training set, stop words and low-frequency words are removed, and the following statistics are collected: the frequency of each word in each article, the total number of documents in which the word appears, the number of times the word co-occurs with the class name, and the total number of articles. Finally, the improved ant colony algorithm extracts the feature words and their weights, yielding each class together with its feature words. The specific structure of the class is shown in Figure 1. The classifier is constructed in thi...
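
The statistics step described in [0029] can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `collect_statistics`, the tiny `STOP_WORDS` set, the `min_count` threshold, and the choice to count co-occurrence at the article level (a word co-occurs with the class name when both appear in the same article) are all assumptions made for the example. The input is assumed to be already segmented into tokens.

```python
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def collect_statistics(articles, class_name, min_count=2):
    """Collect per-article and per-class word statistics.

    articles:   dict mapping article id -> list of tokens (already segmented)
    class_name: the class name used as the search query
    min_count:  hypothetical threshold below which a word is "low-frequency"
    """
    per_article = {}        # article id -> Counter of word frequencies
    total_freq = Counter()  # word -> occurrences across the whole training set
    doc_freq = Counter()    # word -> number of articles containing the word
    cooccur = Counter()     # word -> number of articles shared with class_name

    for article_id, tokens in articles.items():
        counts = Counter(t for t in tokens if t not in STOP_WORDS)
        per_article[article_id] = counts
        total_freq.update(counts)
        doc_freq.update(counts.keys())
        # Assumption: co-occurrence is counted once per article.
        if class_name in counts:
            cooccur.update(w for w in counts if w != class_name)

    # Drop low-frequency words from every table.
    keep = {w for w, c in total_freq.items() if c >= min_count}
    per_article = {
        aid: {w: c for w, c in counts.items() if w in keep}
        for aid, counts in per_article.items()
    }
    class_stats = {w: (total_freq[w], doc_freq[w], cooccur[w]) for w in keep}
    return per_article, class_stats, len(articles)


if __name__ == "__main__":
    sample = {
        1: ["sports", "football", "the", "match", "football"],
        2: ["football", "match", "news"],
        3: ["weather", "news"],
    }
    per_article, class_stats, total = collect_statistics(sample, "sports")
    for word, stats in sorted(class_stats.items()):
        print(word, stats)
```

Each entry of `class_stats` holds the word frequency, the document count, and the co-occurrence count for one word, which are the per-class quantities the paragraph lists; `per_article` holds the per-article word frequencies.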

Abstract

The invention relates to a method for extracting feature words with an improved ant colony algorithm. The method comprises the following steps. During preprocessing, all information is stored in hash tables: coco_prepare stores the information of each article, consisting of the article id, its words, and the number of occurrences of each word; readhdfs_prepare stores the statistics of the training set of each class, consisting of the word frequency of each word, the number of files in which it appears, and the number of times it co-occurs with the class name. The parameters of the ant colony algorithm are then set: the number of ants M, the number of iterations N, the number of steps per ant (that is, the number of feature words) K, the initial path pheromone matrix adMatrixs, the local update decay rate p1, the global update decay rate p2, and the amount of pheromone m released by each ant. The method introduces the ant colony algorithm, for the first time, to the problem of extracting accurate feature vectors for classes when accurate sample sets are lacking.
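
To make the role of the parameters M, N, K, p1, p2 and m concrete, here is a generic ant-colony-style sketch in Python. It is not the patent's improved algorithm, whose selection rule and update formulas are not given in this text. The function name `aco_select_features`, the per-word heuristic `scores`, the initial pheromone `tau0`, the `seed`, the quality measure (sum of scores), and the final normalization of pheromone into weights are all assumptions. The pheromone store adMatrixs is simplified here to one value per word rather than a matrix.

```python
import random


def aco_select_features(scores, M=10, N=30, K=3, p1=0.1, p2=0.1, m=1.0,
                        tau0=1.0, seed=0):
    """Pick K feature words and weights with a simple ant colony loop.

    scores: dict word -> non-negative heuristic score (hypothetical input,
            e.g. derived from word/class co-occurrence statistics)
    M, N, K, p1, p2, m: as named in the abstract
    tau0, seed: assumptions added for this sketch
    """
    words = sorted(scores)
    if K > len(words):
        raise ValueError("K cannot exceed the number of candidate words")
    rng = random.Random(seed)
    pheromone = {w: tau0 for w in words}  # simplified stand-in for adMatrixs
    best_set, best_quality = None, float("-inf")

    for _ in range(N):                    # N iterations
        for _ in range(M):                # M ants per iteration
            candidates = list(words)
            chosen = []
            for _ in range(K):            # each ant takes K steps
                weights = [pheromone[w] * (scores[w] + 1e-9)
                           for w in candidates]
                pick = rng.choices(candidates, weights=weights, k=1)[0]
                candidates.remove(pick)
                chosen.append(pick)
                # Local update: decay toward tau0 to encourage exploration.
                pheromone[pick] = (1 - p1) * pheromone[pick] + p1 * tau0
            quality = sum(scores[w] for w in chosen)  # assumed quality measure
            if quality > best_quality:
                best_set, best_quality = chosen, quality
        # Global update: evaporate everywhere, deposit m on the best path.
        for w in words:
            pheromone[w] *= (1 - p2)
        for w in best_set:
            pheromone[w] += m

    total = sum(pheromone[w] for w in best_set)
    return {w: pheromone[w] / total for w in best_set}


if __name__ == "__main__":
    demo_scores = {"football": 5.0, "match": 4.0, "league": 3.0,
                   "weather": 0.0, "recipe": 0.0}
    print(aco_select_features(demo_scores, K=3))
```

The returned dictionary maps each selected feature word to a weight, and the weights sum to 1. Using the accumulated pheromone as the weight is one plausible design choice among several; the source does not say how the weight values are derived.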

Description

Technical field

[0001] The invention relates to text mining and is applied to web page classification.

Background

[0002] Web text mining is a set of methods and tools for extracting useful information from massive numbers of web pages, and web page classification is one of its main tasks. In machine learning, training a classifier requires a sample set that accurately represents each class, to be used as the training set and the test set. There are three main ways to obtain the sample set: (1) use an existing public corpus; (2) manually collect a sample set for each class name; (3) use web crawlers. In practice, class names are defined according to differing requirements. The public corpora of method (1) are small and do not meet actual needs, and method (2) is time-consuming and labor-intensive, so in practice method (3) is used to obtain training sets. However, the Internet information is co...

Application Information

IPC(8): G06F17/30

CPC: G06F16/35, G06N3/00

Inventor: 蒋昌俊, 陈闳中, 闫春钢, 丁志军, 王鹏伟, 孙海春, 邓晓栋, 刘俊俊

Owner: TONGJI UNIV
