
A method and a system for constructing a classification corpus by means of the Internet

Corpus and Internet technology, applicable to neural learning methods, text database clustering/classification, text database indexing, and related areas, that addresses problems such as poor accuracy and disregard of web page layout.

Active Publication Date: 2019-01-25
杭州数湾信息科技有限公司 +1
Cites: 7; cited by: 11

AI Technical Summary

Problems solved by technology

[0006] The technical problem to be solved by the present invention is the poor accuracy of current methods for constructing a classified corpus from the Internet, which rely only on node topology and ignore web page layout.



Embodiment Construction

[0028] Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

[0029] As shown in Figure 1, the present invention provides a method for constructing a dynamic classification corpus from Internet corpora, comprising the following steps. S1, setting the target category: the user sets the target category and a number of initial keywords. For a target category A, n keywords are set, n ≥ 1, K = {k_1, k_2, ..., k_n}; the keywords mainly describe the characteristic words contained in information of this category. S2, setting information sources: the user provides several information sources, or the first N results returned by a search engine for the target category's initial keywords are used as Internet information sources. Each information source includes a website address and several information source description keywords, a...
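The inputs to steps S1 and S2 can be pictured as two small records. The following is a minimal sketch, not the patent's implementation: the class names `TargetCategory` and `InformationSource`, the field names, and the `keyword_overlap` function are all hypothetical. In particular, the Jaccard overlap is an assumed stand-in for comparing category keywords with source description keywords, since the excerpt does not give the actual rating formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetCategory:
    """S1: a user-defined category with n >= 1 initial keywords (hypothetical type)."""
    name: str
    keywords: frozenset

    def __post_init__(self):
        # The text requires n >= 1 keywords per target category.
        if len(self.keywords) < 1:
            raise ValueError("a target category needs at least one keyword")


@dataclass(frozen=True)
class InformationSource:
    """S2: a website address plus keywords describing the source (hypothetical type)."""
    url: str
    description_keywords: frozenset


def keyword_overlap(category, source):
    """Jaccard overlap between category and source keywords.

    Assumption: this measure is illustrative only; the source text
    does not specify how keywords are matched.
    """
    union = category.keywords | source.description_keywords
    if not union:
        return 0.0
    shared = category.keywords & source.description_keywords
    return len(shared) / len(union)


if __name__ == "__main__":
    sports = TargetCategory("sports", frozenset({"football", "league", "match"}))
    site = InformationSource(
        "https://example.com/sports",  # placeholder address
        frozenset({"football", "match", "news"}),
    )
    print(keyword_overlap(sports, site))
```

Sets are used for the keywords because the text treats K as a set; a weighted representation would be needed if keyword importance mattered.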


Abstract

The invention relates to natural language processing technology, in particular to a method for constructing a classification corpus by means of the Internet, comprising the following steps: S1, setting a target category; S2, setting an information source; S3, rating the information source; S4, collecting and analyzing information; S5, filtering the candidate document corpus set; S6, outputting the classification corpus. The substantial effect of the present invention is that, while minimizing manual intervention, it uses Internet pages with clear classification labels and dynamically updated list content as corpus information sources and combines web page typesetting characteristics with the topological structure of web page DOM nodes, which improves the accuracy of extracting topic corpus from web pages. Through a matching evaluation system for target categories and information source keywords, the dynamic Internet corpus is screened by quantitative similarity between texts, and a high-quality text classification corpus is constructed.
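Screening candidate documents by quantitative text similarity, as in step S5, could look roughly like the sketch below. It is an illustration under stated assumptions: cosine similarity over raw term counts, the function names, and the default threshold of 0.2 are all hypothetical choices, because the abstract does not say which similarity measure or cutoff the invention uses.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (a simple bag of words)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(keywords, documents, threshold=0.2):
    """Keep documents whose similarity to the category keywords meets the threshold.

    Assumption: both the measure and the threshold are illustrative,
    not taken from the patent text.
    """
    reference = term_counts(" ".join(keywords))
    return [
        doc for doc in documents
        if cosine_similarity(reference, term_counts(doc)) >= threshold
    ]


if __name__ == "__main__":
    docs = [
        "The football match ended in a draw",
        "Stock prices fell sharply today",
    ]
    for kept in filter_candidates(["football", "match"], docs):
        print(kept)
```

Regex word tokenization suits space-delimited text only; Chinese web pages would need a word segmenter in place of `term_counts`.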

Description

Technical field

[0001] The invention relates to natural language processing technology, in particular to a method and system for constructing a classified corpus by means of the Internet.

Background

[0002] With the rapid growth of Internet information, search engines have become an indispensable tool for browsing network information. A search engine queries its website database with the keywords provided by the user and presents a list of websites likely to interest the user. In many cases, however, users find it difficult to choose keywords that accurately describe the retrieval target, especially when they lack knowledge of the field being searched, and this seriously affects the accuracy of the returned results. At the same time, because of the large number of web pages that must be indexed, traditional general-purpose search engines often return results belonging to different topics in order to balance accuracy and recall. This strat...


Application Information

IPC(8): G06F16/31, G06F16/35, G06N3/08
CPC: G06N3/08
Inventor: 闵勇
Owner: 杭州数湾信息科技有限公司