Adaptive information extraction method for webpage characteristics

An information extraction technology that adapts to web page features, applied in special data processing applications, instruments, calculations, etc. It addresses the problem that extraction tasks cannot achieve high accuracy, and offers high accuracy, strong scalability, and a simple extension process.

Inactive Publication Date: 2011-11-23

HUAZHONG UNIV OF SCI & TECH

Cites: 4; Cited by: 56


Problems solved by technology

(3) Redundancy: the same information may appear repeatedly on multiple sites.

This method has some capacity for automatic extraction, but because its underlying layer still relies on regular-expression matching, it cannot achieve high accuracy on complex extraction tasks.


Examples


[0089] Take the process of extracting information from the academic home page http://www.cs.uiuc.edu/~hanj/ as an example. Based on the search engine's ranking, the first search result is selected as the author's academic homepage.

[0090] Use an HTML parser to parse the page, obtain the sub-links, and select the following sub-pages for further analysis according to the link keywords and context:

[0091] http://www.cs.uiuc.edu/homes/hanj/pubs/index.htm

[0092] https://agora.cs.illinois.edu/display/cs591han/Research+Publications+-+Data+Mining+Research+Group+at+CS%2C+UIUC
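
The sub-link selection in [0090] can be illustrated with a minimal sketch built on Python's standard `html.parser`. The keyword list, the class and function names, and the sample markup are all assumptions made for illustration; the source does not enumerate its link keywords, and "context" is approximated here by the anchor text alone.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list; the source does not say which keywords it uses.
LINK_KEYWORDS = ("pub", "publication", "paper", "research")


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def select_sub_pages(html, base_url, keywords=LINK_KEYWORDS):
    """Return links whose URL or anchor text mentions one of the keywords."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    selected = []
    for url, text in collector.links:
        haystack = (url + " " + text).lower()
        if any(k in haystack for k in keywords) and url not in selected:
            selected.append(url)
    return selected


# Invented sample markup, not the content of the real home page.
sample_html = (
    '<p><a href="pubs/index.htm">Publications</a> '
    '<a href="/photos/">Family photos</a></p>'
)
sample_links = select_sub_pages(sample_html, "http://www.cs.uiuc.edu/homes/hanj/")
```

Filtering on keywords before fetching keeps the crawler from following links that are unlikely to hold academic information, which is the purpose of the heuristic strategy in this step.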

[0093] Divide each page to be analyzed into text units. Taking the home page as an example, the following results are obtained:

[0094]

[0095]
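
The division into text units described in [0093] might look like the following sketch. It assumes that block-level elements mark unit boundaries; the source says only that pages are divided by the attributes and contents of their elements, so the tag sets, the names, and the sample page below are hypothetical.

```python
from html.parser import HTMLParser

# Assumed boundary tags: any of these starts or ends a text unit.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "tr", "td", "table", "br",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Content inside these tags is never treated as page text.
SKIP_TAGS = {"script", "style"}


class TextUnitSplitter(HTMLParser):
    """Split a page into whitespace-normalized text units at block boundaries."""

    def __init__(self):
        super().__init__()
        self.units = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        if text:
            self.units.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text after the last tag


def split_text_units(html):
    splitter = TextUnitSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.units


# Invented sample page for illustration only.
sample_page = (
    "<h1>Jane Doe</h1>"
    "<p>Professor, <b>Example University</b><br>jane@example.edu</p>"
    "<script>var x = 1;</script>"
)
sample_units = split_text_units(sample_page)
```

Inline tags such as `<b>` do not break a unit, so text that belongs together stays in one unit, while `<br>` and paragraph boundaries separate the affiliation line from the email address.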

[0096] Use a support vector machine to classify the above text units into the following categories: author's name, irrelevant data, university information, email address, and article information. According to the determined category, further ...


Abstract

The invention discloses a method for extracting information from an academic home page. The method comprises the following steps: (1) finding an academic home page on the Internet; (2) crawling and analyzing the academic home page, using a heuristic strategy to reduce the crawling of irrelevant pages and thereby speed up analysis; (3) parsing the page into Document Object Model (DOM) form and dividing it according to the attributes and contents of its elements to obtain a list of cohesive text units; (4) identifying the text units with information recognizers, where each information recognizer identifies only one information type, and performing subfield extraction on the text information; (5) performing association analysis on the extraction results, using the associations among the information to eliminate ambiguity and fill in missing fields; and (6) matching the extraction results against a database and eliminating redundant data, with the extraction results stored in a semantic database in the form of semantic data. By combining heuristic rules, a machine learning method, and a conditional probability model, the method can extract academic information from academic home pages efficiently and accurately.
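
Steps (4) and (6) of the abstract can be sketched as follows. This is only an illustration of the "one recognizer per information type" layout and of duplicate elimination: the regular-expression recognizers stand in for the trained recognizers the method describes, a plain Python set stands in for the semantic database, and every name and sample string is hypothetical.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")
TITLE_RE = re.compile(r'"([^"]+)"')


def recognize_email(unit):
    """Return the address subfield if the unit contains an email, else None."""
    match = EMAIL_RE.search(unit)
    return {"address": match.group(0)} if match else None


def recognize_article(unit):
    """Crude stand-in: a quoted title plus a four-digit year counts as an article."""
    title = TITLE_RE.search(unit)
    year = YEAR_RE.search(unit)
    if title and year:
        return {"title": title.group(1), "year": int(year.group(0))}
    return None


# Each recognizer identifies exactly one information type, as in step (4).
RECOGNIZERS = {"email": recognize_email, "article": recognize_article}


def extract(units):
    """Tag each text unit with the first recognizer that accepts it."""
    records = []
    for unit in units:
        for info_type, recognizer in RECOGNIZERS.items():
            fields = recognizer(unit)
            if fields is not None:
                records.append({"type": info_type, **fields})
                break
        # Units that no recognizer accepts are treated as irrelevant data.
    return records


def store_new(records, database):
    """Step (6): add only records that the database does not already hold."""
    added = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in database:
            database.add(key)
            added.append(record)
    return added


# Invented text units for illustration only.
sample_text_units = [
    "Email: jane@example.edu",
    'J. Doe. "Mining Example Data." Example Conf., 2010.',
    "Office hours: Monday",
]
sample_records = extract(sample_text_units)
sample_db = set()
first_pass = store_new(sample_records, sample_db)
second_pass = store_new(sample_records, sample_db)  # nothing new the second time
```

Keeping each recognizer independent means a new information type can be supported by registering one more function, which matches the scalability benefit the summary claims.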

Description

Technical field

[0001] The invention belongs to the field of information extraction systems, and in particular relates to an information extraction method that adapts to web page features. The method is especially suitable for extracting information such as author names, email addresses, institution information, and published articles from academic homepages.

Background

[0002] With the advent of the information age, the Internet has gradually become the main way for people to share and obtain information. All kinds of information are published on the Internet as web pages for people to read. However, with the explosive growth of Internet information, people find it increasingly difficult to locate the information they need. On the one hand, the amount of information is huge; on the other hand, information is presented in very flexible and free forms, which raises the cost of identifying the target information. Therefore,...


Application Information


IPC(8): G06F17/30

Inventor: 金海, 李毅, 赵峰, 严奉伟

Owner: HUAZHONG UNIV OF SCI & TECH
