Adaptive information extraction method for webpage characteristics

An information extraction technology adaptive to web page features, applied in special data processing applications, instruments, calculations, etc. It addresses the problem that extraction tasks cannot achieve high accuracy, and offers high accuracy, strong scalability, and a simple extension process.

Status: Inactive. Publication date: 2011-11-23
HUAZHONG UNIV OF SCI & TECH
Cites: 4. Cited by: 56

AI Technical Summary

Problems solved by technology

(3) Redundancy: the same information may appear repeatedly on multiple sites.
This method has some capacity for automatic extraction, ...


Examples


[0089] Take the process of extracting information from the academic home page http://www.cs.uiuc.edu/~hanj/ as an example. Based on the search engine's ranking, the first search result is selected as the author's academic homepage.

[0090] Use an HTML parser to parse the page, obtain the sub-links, and select the following sub-pages for further analysis according to the link keywords and context:

[0091] http://www.cs.uiuc.edu/homes/hanj/pubs/index.htm

[0092] https://agora.cs.illinois.edu/display/cs591han/Research+Publications+-+Data+Mining+Research+Group+at+CS%2C+UIUC
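The sub-link selection described in [0090] can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the patent's implementation: the keyword list `LINK_KEYWORDS` and the function `select_sub_pages` are hypothetical names, and "context" is reduced here to the anchor text of each link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list; the source does not enumerate its link keywords.
LINK_KEYWORDS = ("pub", "paper", "research", "article")


class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> element currently open, if any
        self._text = []     # anchor text fragments of that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def select_sub_pages(html, base_url, keywords=LINK_KEYWORDS):
    """Return links whose URL or anchor text contains any keyword."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    selected = []
    for url, text in collector.links:
        haystack = (url + " " + text).lower()
        if any(keyword in haystack for keyword in keywords):
            selected.append(url)
    return selected
```

Filtering links before fetching them is what keeps the crawler from downloading irrelevant pages; a fuller version would presumably also weigh the text surrounding each link.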

[0093] Divide each page to be analyzed into text units. Taking the home page as an example, the following results are obtained:

[0094]

[0095]
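As a rough illustration of dividing a page into text units, the sketch below splits on block-level tags with Python's standard `html.parser`. The tag set `BLOCK_TAGS` and the function `split_text_units` are assumptions made for this example; the method described in the source also divides by element attributes and contents, which this sketch does not attempt.

```python
from html.parser import HTMLParser

# Assumed set of tags that mark a boundary between text units.
BLOCK_TAGS = {"p", "div", "li", "tr", "td", "br", "ul", "ol", "table",
              "h1", "h2", "h3", "h4"}
# Content of these tags is never visible text, so it is skipped.
SKIP_TAGS = {"script", "style"}


class TextUnitSplitter(HTMLParser):
    """Accumulate visible text and cut a new unit at every block boundary."""

    def __init__(self):
        super().__init__()
        self.units = []
        self._buf = []
        self._skip_depth = 0

    def flush(self):
        # Join fragments and normalize runs of whitespace to single spaces.
        text = " ".join("".join(self._buf).split())
        if text:
            self.units.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)


def split_text_units(html):
    """Return the list of text units found in an HTML string."""
    splitter = TextUnitSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()  # emit any trailing text after the last block tag
    return splitter.units
```

Inline tags such as `<b>` or `<a>` are deliberately not boundaries, so a name and its formatting stay in one cohesive unit.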

[0096] Use a support vector machine to classify the text units above as author name, irrelevant data, university information, email address, or article information. According to the determined category, further ...


Abstract

The invention discloses a method for extracting information from an academic home page. The method comprises the following steps: (1) finding an academic home page on the Internet; (2) crawling and analyzing the academic home page, using a heuristic strategy to reduce the crawling of irrelevant pages and thereby speed up analysis; (3) parsing the page into Document Object Model (DOM) form and dividing it according to the attributes and contents of its elements to obtain a list of cohesive text units; (4) identifying the text units with information recognizers, each of which identifies only one information type, and extracting subfields from the text information; (5) performing association analysis on the extraction results, using the associations among pieces of information to resolve ambiguities and fill in missing fields; and (6) matching the extraction results against a database and eliminating redundant data, with the results stored in a semantic database as semantic data. By combining heuristic rules, a machine learning method, and a conditional probability model, the method extracts academic information from academic home pages efficiently and accurately.
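Steps (4) and (6) can be pictured with the small sketch below. It is only an illustration under stated assumptions: the recognizers here are simple regular-expression and keyword rules standing in for the trained classifiers the abstract refers to, the database is replaced by an in-memory dictionary of sets, and all names (`recognize_email`, `recognize_institution`, `extract_fields`, `drop_known`) are hypothetical.

```python
import re

# Deliberately loose email pattern; an assumption made for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def recognize_email(unit):
    """Recognizer for exactly one type: return an email address or None."""
    match = EMAIL_RE.search(unit)
    return match.group(0) if match else None


def recognize_institution(unit):
    """Crude keyword rule standing in for a trained classifier."""
    lowered = unit.lower()
    if "university" in lowered or "institute" in lowered:
        return unit.strip()
    return None


# One recognizer per information type, as in step (4).
RECOGNIZERS = {
    "email": recognize_email,
    "institution": recognize_institution,
}


def extract_fields(units):
    """Run every recognizer over every text unit and group hits by field."""
    results = {}
    for unit in units:
        for field, recognizer in RECOGNIZERS.items():
            value = recognizer(unit)
            if value is not None:
                results.setdefault(field, []).append(value)
    return results


def drop_known(results, known):
    """Stand-in for step (6): remove values already present in storage.

    `known` maps a field name to the set of values stored so far.
    """
    return {
        field: [v for v in values if v not in known.get(field, set())]
        for field, values in results.items()
    }
```

Keeping each recognizer responsible for a single type means a new information type can be supported by registering one more function in `RECOGNIZERS`, which is one plausible reading of the "simple extension process" mentioned in the summary.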

Description

Technical field

[0001] The invention belongs to the field of information extraction systems, and in particular relates to an information extraction method adaptive to web page features. The method is especially suitable for extracting information such as author names, email addresses, institution information, and published articles from academic homepages.

Background

[0002] With the advent of the information age, the Internet has gradually become the main way for people to share and obtain information. All kinds of information are published on the Internet as web pages for people to read. However, with the explosive growth of Internet information, people find it increasingly difficult to find the information they need. On the one hand, the amount of information is huge; on the other hand, the way information is presented is very flexible and free, which raises the cost of identifying the target information. Therefore,...


Application Information

IPC(8): G06F17/30
Inventors: 金海, 李毅, 赵峰, 严奉伟
Owner: HUAZHONG UNIV OF SCI & TECH