Systems and methods for identifying and extracting data from HTML pages

What is AI technical title?
AI technical title is built by Patsnap AI team. It summarizes the technical point description of the patent document.
a technology of html pages and data extraction, applied in the field of analyzing and extracting information from web pages, can solve the problems of difficult computer programs to do so without knowing in advance which pieces to use, page complexity and variable, and may not be much more complex and variable. achieve the effect of quick extraction

Inactive Publication Date: 2005-12-08

OATH INC

View PDF22 Cites 37 Cited by

Summary
Abstract
Description
Claims
Application Information

AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology

Benefits of technology

"The present invention provides a system and method for automatically analyzing web pages formatted using HTML or other markup languages to extract desired information. The system can identify and extract information from multiple web pages with minimal human setup. The algorithm is robust and can handle small changes to the formatting of documents. The system can be used for data mining, marketing analysis, testing, and quality assurance purposes. The invention allows for the identification and extraction of structured data from unstructured web pages, making it possible to build databases from unstructured web information. The method involves selecting a model page, identifying a first area of interest, and parsing the model page and a second web page to determine the first and second strings of symbols. The system can extract the second area of interest from the second page if it matches the first string of symbols. The invention is applicable to any markup language, such as XML, WML, HTML, DHTML, and HDML."

Problems solved by technology

People reading D and D′ can easily parse the information and understand its different pieces, but it is difficult for a computer program to so do without knowing in advance which pieces are included and how they are arranged.

However, this same extraction mechanism, when analyzing the second document for product Q, will miss the price of product Q, because neither the ON SALE text nor the red formatting is present.

In general, the page may be much more complex and variable.

Method used

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine

Image

Smart Image Click on the blue labels to locate them in the text.

Viewing Examples

Smart Image

Examples

Experimental program

Comparison scheme

Effect test

Embodiment Construction

[0039]FIG. 1 illustrates a general overview of an information retrieval and communication network 10 including a client device 20 according to an embodiment of the present invention. In computer network 10, client device 20 is coupled through the Internet 40, or other communication network, to servers 501 to 50N. Client device 20 is also interconnected to server 30 either directly, over any LAN or WAN connection, or over the Internet 40. As will be described herein, client device 20 is configured according to the present invention to access and retrieve web pages from any of servers 501 to 50N, identify and extract desired information therefrom, and provide the information to server 30 to populate database 35. Although as described herein, access and processing of web pages is performed using client device 20, it will be understood that server 30 can also be configured to access and process web pages according to the present invention described herein.

[0040] Several elements in the...

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine

Login to View More

PUM

Login to View More

Abstract

Systems and methods for analyzing HTML formatted web pages to automatically identify and extract desired information. A computer algorithm identifies and extracts different pieces of information from different web pages automatically after minimal manual setup. The algorithm automatically analyzes pages with different content if they have the same, or similar, formats. The algorithm is fast and efficient and performs the extraction process quickly in real-time. The systems and methods are useful to build databases from unstructured web information. The algorithm can be used as an agent that captures information about products, and compares prices or other characteristics. It can also be used to populate structured databases that, given the different pieces of information, can analyze products and their characteristics. And it can also be used for data mining applications looking for patterns useful for marketing analyses, or other uses.

Description

COPYRIGHT NOTICE [0001] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. BACKGROUND OF THE INVENTION [0002] The present invention relates generally to analyzing and extracting information from web pages, and more particularly to automatically identifying and extracting desired information in web pages. [0003] The World Wide Web (WWW) is now the premier outlet to publish information of all types and forms. Documents published on the web, commonly called web pages, are published using a language called HTML (or Hyper Text Markup Language), which sets standards for the formatting of documents. These standards make it possible for people to read and understand documents no matter ...

Claims

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine

Login to View More

Application Information

Patent Timeline

Login to View More

Patent Type & AuthorityApplications(United States)

IPC IPC(8): G06F17/00G06F17/30

CPCY10S707/99931G06F16/951Y10S707/99936

InventorMANBER, UDILU, QI

OwnerOATH INC

Systems and methods for identifying and extracting data from HTML pages

AI Technical Summary This helps you quickly interpret patents by identifying the three key elements: Problems solved by technologyMethod usedBenefits of technology

Benefits of technology

Problems solved by technology

Method used

Image

Examples

Embodiment Construction

PUM

Abstract

Description

Claims

Application Information

AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology