A topic corpus construction method and system based on a search engine

A search-engine-based corpus construction technique, applied in the fields of network data retrieval, other database retrieval, web data retrieval using information identifiers, and the like. It addresses the problems of undefined categories and an insufficient number of ODP-annotated documents, improving relevance, offering good applicability, and reducing the annotation workload.

Active Publication Date: 2019-06-25
INST OF INFORMATION ENG CAS

AI Technical Summary

Problems solved by technology

Applying ODP's annotated corpus raises three problems: first, most of the web pages covered by ODP are English web pages; second, practical applications involve many undefined (new) categories; third, the number of annotated documents in ODP cannot meet the demand.




Embodiment Construction

[0049] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below through specific embodiments and accompanying drawings.

[0050] In the search-engine-based topic corpus construction method of this embodiment, the input is a topic word and the output is a set of topic documents. The overall process is shown in Figure 1; the specific steps are:

[0051] (1) Seed page acquisition. In this step, a search engine is used to acquire topic-related seed web pages. The main modules are a query conversion module and a meta search module. The query conversion module converts the topic words into search-engine query words; the conversion can be knowledge-base-based, feedback-based, or performed manually. The meta search module sends the query words to the search engine and obtains the query results of...
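The two modules in step (1) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the knowledge base contents, the function names `convert_query` and `meta_search`, and the stand-in `fake_engine` are all hypothetical, and merging results across engines with de-duplication is an assumption, since the source text is truncated at that point.

```python
from typing import Callable, Dict, List

# Hypothetical knowledge base mapping a topic word to related query terms.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "basketball": ["basketball news", "NBA"],
}

# A search engine is modelled as any callable: query -> list of result URLs.
SearchEngine = Callable[[str], List[str]]


def convert_query(topic: str, kb: Dict[str, List[str]] = KNOWLEDGE_BASE) -> List[str]:
    """Knowledge-base-style query conversion: topic word -> query words."""
    queries = [topic]
    for term in kb.get(topic, []):
        if term not in queries:
            queries.append(term)
    return queries


def meta_search(queries: List[str], engines: List[SearchEngine]) -> List[str]:
    """Send every query to every engine and merge results, keeping first-seen order."""
    seen = set()
    seeds: List[str] = []
    for query in queries:
        for engine in engines:
            for url in engine(query):
                if url not in seen:
                    seen.add(url)
                    seeds.append(url)
    return seeds


def fake_engine(query: str) -> List[str]:
    """Stand-in for a real search engine so the sketch runs offline."""
    slug = query.replace(" ", "-")
    return [f"https://example.com/{slug}/1", "https://example.com/shared"]


queries = convert_query("basketball")
seeds = meta_search(queries, [fake_engine])
```

Modelling each engine as a plain callable keeps the meta search independent of any particular search API; a real engine client would be wrapped in a function with the same shape.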


Abstract

The invention relates to a topic corpus construction method and system based on a search engine. The method comprises the following steps: 1) obtaining topic-related seed web pages by using a search engine; 2) expanding the seed web pages to find list pages; 3) judging the list pages to obtain those genuinely related to the topic; 4) extracting the links in the list pages genuinely related to the topic and downloading them to obtain the original web pages; and 5) performing text extraction on the original web pages to form the final topic corpus. The system comprises a seed web page acquisition unit, a list page discovery unit, a list page auditing unit, a web page downloading unit and a text extraction unit. Compared with the prior art, the method greatly reduces the amount of manual annotation required to construct a topic corpus of the same scale, and it applies well to the construction of corpora for a wide range of topics.
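The five steps can be sketched in Python as a pipeline of pluggable stages, one per unit of the system. This is an illustrative outline only: the class `CorpusPipeline`, its field names, and the in-memory `FAKE_WEB` data are hypothetical. The relevance check `is_relevant` stands in for the list page auditing unit and could be a human reviewer or a classifier; the source does not specify which.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CorpusPipeline:
    """Hypothetical wiring of the five steps; each stage is supplied by the caller."""
    get_seeds: Callable[[str], List[str]]        # step 1: topic -> seed page URLs
    find_list_pages: Callable[[str], List[str]]  # step 2: seed URL -> list page URLs
    is_relevant: Callable[[str, str], bool]      # step 3: (topic, list page URL) -> keep?
    extract_links: Callable[[str], List[str]]    # step 4: list page URL -> document URLs
    download: Callable[[str], str]               # step 4: document URL -> raw HTML
    extract_text: Callable[[str], str]           # step 5: raw HTML -> plain text

    def build(self, topic: str) -> List[str]:
        seeds = self.get_seeds(topic)
        list_pages = [lp for seed in seeds for lp in self.find_list_pages(seed)]
        relevant = [lp for lp in list_pages if self.is_relevant(topic, lp)]
        corpus: List[str] = []
        seen = set()
        for page in relevant:
            for link in self.extract_links(page):
                if link in seen:  # skip documents linked from several list pages
                    continue
                seen.add(link)
                corpus.append(self.extract_text(self.download(link)))
        return corpus


# Tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB: Dict[str, List[str]] = {
    "seed": ["list/sports", "list/ads"],
    "list/sports": ["doc/1", "doc/2"],
    "list/ads": ["doc/3"],
}

pipeline = CorpusPipeline(
    get_seeds=lambda topic: ["seed"],
    find_list_pages=lambda seed: FAKE_WEB[seed],
    is_relevant=lambda topic, page: topic in page,
    extract_links=lambda page: FAKE_WEB[page],
    download=lambda url: f"<html><body>{url}</body></html>",
    extract_text=lambda html: html.replace("<html><body>", "").replace("</body></html>", ""),
)
corpus = pipeline.build("sports")
```

Filtering at the list-page level in step 3 means one relevance decision covers every document linked from that page, which is one way the method could reduce the manual annotation effort the abstract mentions.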

Description

Technical field

[0001] The invention relates to the automatic construction of corpora and to topic classification based on statistical machine learning, and is especially suited to cases where topic classification lacks a training corpus.

Background art

[0002] With the development of artificial intelligence, text classification has been widely used in various fields. Typical classification tasks include topic classification and sentiment classification. Topic classification assigns documents to categories according to the topics their content describes. From the perspective of computer input and output, the input is a document and the output is a topic category. Currently, text classification mainly uses methods based on machine learning. Such methods need training data, that is, for each topic category, there must be a batch of text documents related to this categ...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F16/953, G06F16/955, G06F16/958
Inventor: 李鹏, 王斌, 周美林, 齐保元, 梅钰
Owner: INST OF INFORMATION ENG CAS