Webpage information extraction method

A webpage information extraction technology, applied in the fields of instruments, computing, and electrical digital data processing. It addresses problems such as the coarse granularity, poor quality, and low accuracy of extracted candidate attributes.

Inactive
Publication Date: 2012-06-13
PEKING UNIV
2 Cites, 48 Cited by

AI Technical Summary

Problems solved by technology

[0004] However, current methods share a problem: they extract only some candidate attributes and do not post-process what they extract, so the resulting candidate attributes are coarse-grained and inaccurate. Polysemous words are handled poorly, and the results can be added to a knowledge base only after manual selection.
These methods also do not evaluate the attributes, even though some attributes are closely related to the target concept and others only weakly related. Selecting the closely related attributes benefits the classification of concepts.


Examples

Embodiment Construction

[0041] Assume that all attributes of the concept "star" need to be extracted, and that the input is a list of target instances of that concept, that is, a collection of stars such as Andy Lau and Zhang Ziyi. First, extract the candidate attributes corresponding to the instance list, together with their attribute values, from various network encyclopedia data sources. Next, use the attribute value information to perform synonym induction on the candidate attributes: find attributes with similar meanings and merge them. Then use web resources to evaluate the candidate attributes and select those that are closely related to the target concept. Finally, analyze the attribute values and predict the value type corresponding to each attribute. Each step is described in detail below (for the overall process, see Figure 1).
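The synonym induction step can be illustrated with a minimal sketch. It assumes that two attribute names are synonymous when the (instance, value) pairs recorded under them overlap heavily. The Jaccard measure, the `threshold` parameter, the function name `merge_synonymous_attributes`, and the toy records are all assumptions made for illustration; the text above does not specify how similarity is computed.

```python
from collections import defaultdict
from itertools import combinations


def merge_synonymous_attributes(records, threshold=0.5):
    """Group attribute names whose (instance, value) pairs overlap heavily.

    records: iterable of (instance, attribute, value) triples.
    threshold: hypothetical Jaccard cutoff, not taken from the source text.
    """
    # Collect the set of (instance, value) pairs observed for each attribute.
    pairs = defaultdict(set)
    for instance, attribute, value in records:
        pairs[attribute].add((instance, value))

    # Union-find over attribute names, so merges are transitive.
    parent = {name: name for name in pairs}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sorted(pairs), 2):
        overlap = len(pairs[a] & pairs[b])
        union = len(pairs[a] | pairs[b])
        if union and overlap / union >= threshold:
            parent[find(a)] = find(b)

    # Collect the merged sets and return them in a stable order.
    groups = defaultdict(set)
    for name in pairs:
        groups[find(name)].add(name)
    return sorted(sorted(group) for group in groups.values())


# Toy records for illustration only.
records = [
    ("Andy Lau", "birthday", "1961-09-27"),
    ("Andy Lau", "date of birth", "1961-09-27"),
    ("Zhang Ziyi", "birthday", "1979-02-09"),
    ("Zhang Ziyi", "date of birth", "1979-02-09"),
    ("Andy Lau", "birthplace", "Hong Kong"),
]
groups = merge_synonymous_attributes(records)
print(groups)
```

Here "birthday" and "date of birth" carry identical values for both instances, so they land in one set, while "birthplace" stays on its own. A real system would likely need value normalization (date formats, units) before comparing values.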

[0042] A. Build an instance list and extract candidate at...


Abstract

The invention discloses a webpage information extraction method, in particular a method for extracting concept attributes from a network encyclopedia data source and processing them. The method comprises the following steps: constructing an instance list, and extracting candidate attributes of the instances in the list from multi-source heterogeneous data sources; performing synonym induction on the extracted attributes, and putting synonymous attributes in the same set; sub-classifying the induced attributes; analyzing the attribute value types corresponding to the classified attributes; and recommending the attributes and the corresponding attribute value type information to a user, or storing them in a structured database. With this scheme, high-quality concept attribute information can be extracted from webpages, a knowledge base can be better constructed, and other natural language processing tasks, such as attribute value extraction, text classification, and classification of search engine query logs, can be better performed.
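As a rough illustration of the attribute value type analysis step, the sketch below labels each observed value with a coarse type and takes a majority vote per attribute. The type names ("date", "number", "text"), the regular expressions, and the function names are hypothetical; the abstract does not say how value types are determined.

```python
import re
from collections import Counter

# Hypothetical type patterns, checked in order; not taken from the source text.
TYPE_PATTERNS = [
    ("date", re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$")),
    ("number", re.compile(r"^-?\d+(\.\d+)?\s*[A-Za-z%]*$")),
]


def classify_value(value):
    """Return a coarse type label for a single attribute value."""
    text = value.strip()
    for type_name, pattern in TYPE_PATTERNS:
        if pattern.match(text):
            return type_name
    return "text"  # fallback when no pattern matches


def predict_value_type(values):
    """Return the most common type label among an attribute's values.

    Assumes `values` is non-empty.
    """
    counts = Counter(classify_value(v) for v in values)
    return counts.most_common(1)[0][0]


# Toy values for illustration only.
height_type = predict_value_type(["174 cm", "165 cm", "about 170"])
birthday_type = predict_value_type(["1961-09-27", "1979-02-09"])
print(height_type, birthday_type)
```

Majority voting tolerates a few noisy values per attribute, which matters when values come from heterogeneous sources. The predicted type could then be stored alongside the attribute in a structured database.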

Description

Technical field

[0001] The invention provides a method for extracting webpage information, in particular a method for extracting concept attributes from a network encyclopedia data source and processing them.

Background art

[0002] Today, with the explosive growth of Internet text, an important research task is how to organize information and represent knowledge reasonably and effectively, and how to build a good knowledge base so that people can quickly obtain the knowledge they want from massive numbers of web pages. In knowledge base construction, concepts and attributes are the core elements of knowledge representation. A concept is an object that reflects objective things and their unique attributes, and an attribute is a description of the characteristics of a concept. Attribute information gives a more comprehensive understanding of the characteristics of a concept. Therefore, in the automatic construction of knowledge base...

Application Information

IPC(8): G06F17/30
Inventor: 穗志方, 李文杰
Owner: PEKING UNIV