
Wiki semantic matching-based document classification method and system

A document classification and semantic matching technology, applied in the Internet field, that addresses the low efficiency and inaccuracy of existing document classification technology.

Active Publication Date: 2017-02-01
WENZHOU UNIV OUJIANG COLLEGE
Cites: 5 | Cited by: 12

AI Technical Summary

Problems solved by technology

[0008] To overcome the conflict between effectiveness and efficiency faced by wiki semantic matching methods, the present invention provides a document classification method and system based on wiki semantic matching. Its purpose is to efficiently calculate the similarity between documents in order to classify them, thereby solving the technical problems of low efficiency or inaccuracy in existing document classification technology.



Examples


Embodiment 1

[0179] A document classification method based on wiki semantic matching, which uses a pre-constructed wiki semantic reference space.

[0180] Extract 100,000 conceptual entities from the Wikipedia database, and preprocess the concepts according to the following steps:

[0181] A. Word segmentation: Use the NLTK tokenizer (www.nltk.org) to express each concept as a set of independent words, and lowercase each word;

[0182] B. Stop word removal: Remove stop words, including prepositions, pronouns, and articles, from the set of independent words obtained for each concept in step A, so that each concept is expressed as a set of words that carry independent meaning;

[0183] C. Stemming: Use the Snowball framework (snowball.tartarus.org/texts/introduction.html) to convert each word in the set obtained for each concept in step B into its stem, so that each concept is expressed as a set of keywords, it can be w...
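Steps A to C can be sketched as follows. This is a minimal, dependency-free illustration, not the patent's implementation: a regular expression stands in for the NLTK tokenizer, a crude suffix stripper stands in for the Snowball stemmer, and the stop-word list and all function names are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list (articles, prepositions,
# pronouns). The source does not give its actual list.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and",
              "it", "he", "she", "they"}

# Assumption: crude suffix stripping, standing in for the Snowball stemmer.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Step A: split text into lowercase words (stand-in for NLTK)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(words):
    """Step B: drop words that carry no independent meaning."""
    return [w for w in words if w not in STOP_WORDS]


def crude_stem(word):
    """Step C: strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def concept_keywords(concept):
    """Run steps A to C and return the concept's keyword set."""
    return {crude_stem(w) for w in remove_stop_words(tokenize(concept))}
```

For example, `concept_keywords("History of the Computing Machines")` yields the keyword set `{"history", "comput", "machin"}`. A real stemmer would handle far more suffixes and exceptions than this stand-in.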

Embodiment 2

[0240] A document classification system based on wiki semantic matching, including:

[0241] The first module, which has the wiki semantic reference space built in, obtains the set of text documents to be classified and, for each text document, uses keyword matching to obtain the document's keyword set and uses matching rules to match the document's related reference concept set from the wiki semantic reference space. The corresponding keyword set and reference concept set are then submitted to the second module.

[0242] The first module includes a keyword matching submodule and a reference concept matching submodule.

[0243] The keyword matching submodule obtains the keyword set of a given text document, and includes:

[0244] A word segmentation component, which represents a given text document as a set of independent words and submits that set to the stop word removal component;

[0245] T...


Abstract

The invention discloses a wiki semantic matching-based document classification method and system. The method comprises the following steps: (1) for each text document D in a document set, obtaining a keyword set of the text document by keyword matching, and matching in a wiki semantic reference space according to a matching rule to obtain a reference concept set related to the text document; (2) generating a keyword vector of the text document from its keyword set, and generating a concept vector of the text document from its keyword vector and reference concept set; (3) calculating the comprehensive similarity between any two text documents in the set of text documents to be classified according to their concept vectors and keyword vectors; and (4) performing classification according to the comprehensive similarity between any two text documents. The system comprises a first module, a second module, a third module and a fourth module. The method and system overcome the conflict between effectiveness and efficiency faced by wiki semantic matching methods and provide an efficient online document classification method.
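Steps (2) and (3) can be illustrated with a short sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: keyword vectors are term counts, a concept's weight is the summed count of the document keywords it contains, vectors are compared by cosine similarity, and the comprehensive similarity is a weighted sum controlled by a hypothetical parameter `alpha`. Step (4), the classification itself, is not shown.

```python
import math
from collections import Counter


def keyword_vector(keywords):
    """Assumed form of a keyword vector: keyword -> occurrence count."""
    return Counter(keywords)


def concept_vector(kw_vec, reference_concepts):
    """Assumed form of a concept vector: concept -> summed count of the
    document keywords that also appear in that concept's keyword set."""
    vec = {}
    for name, concept_kws in reference_concepts.items():
        score = sum(kw_vec[k] for k in concept_kws if k in kw_vec)
        if score:
            vec[name] = score
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def comprehensive_similarity(doc_a, doc_b, reference_concepts, alpha=0.5):
    """Hypothetical blend: alpha weights the concept-level similarity,
    (1 - alpha) weights the keyword-level similarity."""
    ka, kb = keyword_vector(doc_a), keyword_vector(doc_b)
    ca = concept_vector(ka, reference_concepts)
    cb = concept_vector(kb, reference_concepts)
    return alpha * cosine(ca, cb) + (1 - alpha) * cosine(ka, kb)


# Toy reference space and documents (already tokenized and stemmed).
reference = {"Automobile": {"car", "engin"}, "Fruit": {"appl", "banana"}}
doc_a = ["car", "fast"]
doc_b = ["engin", "loud"]
doc_c = ["banana", "ripe"]
```

In this toy data, `doc_a` and `doc_b` share no keywords but both map to the `Automobile` concept, so their comprehensive similarity is 0.5 with the default `alpha`, while `doc_a` and `doc_c` score 0.0.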

Description

Technical field

[0001] The invention belongs to the technical field of the Internet, and more specifically relates to a document classification method and system based on wiki semantic matching.

Background technique

[0002] With the development of World Wide Web technology, the explosive growth in the number of online text documents urgently requires efficient text classification algorithms that help users quickly navigate and browse those documents. Traditional text document classification methods usually adopt "keyword text matching technology" to measure the similarity between text documents; that is, the similarity between text documents is measured by analyzing the keywords they have in common. However, keyword text matching considers only the surface text of the keywords in a document, not the semantic information behind them, which leads to many problems, such as semantic confus...
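The limitation of keyword text matching can be seen in a small sketch. The background does not name a specific measure, so the Jaccard overlap of word sets used here is an assumption, and `keyword_overlap` is a hypothetical helper.

```python
def keyword_overlap(doc_a, doc_b):
    """Jaccard overlap of the two documents' word sets (assumed measure)."""
    a = set(doc_a.lower().split())
    b = set(doc_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Same meaning, different surface words: no overlap is detected.
synonym_score = keyword_overlap("fast car", "quick automobile")

# Shared surface words: overlap is detected.
shared_score = keyword_overlap("fast car", "fast car race")
```

Here `synonym_score` is 0.0 even though the two phrases mean nearly the same thing, while `shared_score` is 2/3. Surface matching alone cannot see the semantic link in the first pair.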


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 吴宗大, 徐湖鹏
Owner: WENZHOU UNIV OUJIANG COLLEGE