Systems and methods for identifying and extracting data from HTML pages

A technology for HTML pages and data extraction, applied in the field of analyzing and extracting information from web pages. It addresses the problem that a computer program has difficulty parsing the pieces of information on a page without knowing in advance which pieces are included and how they are arranged, and that pages may be complex and variable. It achieves the effect of quick extraction.

Status: Inactive
Publication Date: 2005-12-08
OATH INC

AI Technical Summary

Benefits of technology

[0031] The present invention provides systems and methods for analyzing web pages formatted using HTML or other markup language to automatically identify and extract desired information. In one embodiment, aspects of the invention are embodied in a computer algorithm that identifies and extracts different pieces of information from different web pages automatically after minimal manual setup. The algorithm automatically analyzes pages with different content if they have the same, or similar, formats. The algorithm is robust, in the sense that it operates successfully and correctly in the presence of small changes to the formatting of documents. The algorithm is fast and efficient and performs the extraction process quickly in real-time. Many database and data mining applications require structured data—they have to know the meanings of numbers and text, and not just their values, so they can infer relationships among them. Using the techniques of the present invention, it becomes possible to build databases from unstructured web information. The algorithm can be implemented in an agent that captures information about products, and compares prices or other characteristics. The algorithm can also be used to populate structured databases that, given the different pieces of information, can analyze products and their characteristics. Additionally, the algorithm can be used for data mining applications, e.g., looking for patterns useful for marketing analyses, for testing and quality assurance (QA) purposes, or other uses.

Problems solved by technology

People reading D and D′ can easily parse the information and understand its different pieces, but it is difficult for a computer program to do so without knowing in advance which pieces are included and how they are arranged.
However, this same extraction mechanism, when analyzing the second document for product Q, will miss the price of product Q, because neither the ON SALE text nor the red formatting is present.
In general, the page may be much more complex and variable.
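The failure mode described above can be illustrated with a minimal sketch. The two page snippets and the extraction rule below are hypothetical, since the source does not reproduce the documents for products P and Q; they only assume that the first page marks its price with ON SALE text in red and the second page does not.

```python
import re

# Hypothetical markup for two product pages (not taken from the patent).
PAGE_P = '<h1>Product P</h1><p>ON SALE <font color="red">$19.99</font></p>'
PAGE_Q = '<h1>Product Q</h1><p>Price: <b>$24.50</b></p>'

# Brittle rule: treat as the price whatever appears in red right after "ON SALE".
BRITTLE_RULE = re.compile(
    r'ON SALE\s*<font color="red">\s*(\$\d+\.\d{2})\s*</font>'
)


def extract_price_brittle(html):
    """Return the price string if the formatting cue is present, else None."""
    match = BRITTLE_RULE.search(html)
    return match.group(1) if match else None


print(extract_price_brittle(PAGE_P))  # $19.99
print(extract_price_brittle(PAGE_Q))  # None: the cue is absent, so the price is missed
```

The rule works only because it was tuned to one page's formatting, which is the dependence on advance knowledge that the passage identifies as the problem.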



Embodiment Construction

[0039] FIG. 1 illustrates a general overview of an information retrieval and communication network 10 including a client device 20 according to an embodiment of the present invention. In computer network 10, client device 20 is coupled through the Internet 40, or other communication network, to servers 50_1 to 50_N. Client device 20 is also interconnected to server 30 either directly, over any LAN or WAN connection, or over the Internet 40. As will be described herein, client device 20 is configured according to the present invention to access and retrieve web pages from any of servers 50_1 to 50_N, identify and extract desired information therefrom, and provide the information to server 30 to populate database 35. Although access and processing of web pages is described herein as being performed using client device 20, it will be understood that server 30 can also be configured to access and process web pages according to the present invention.
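The data flow in this paragraph (retrieve pages, extract information, populate a database) can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `FAKE_SERVERS` mapping stands in for the remote web servers, an in-memory SQLite database stands in for database 35, and `extract_record` is a placeholder extractor, not the algorithm of the invention. All names, URLs, and page contents are hypothetical.

```python
import re
import sqlite3

# Hypothetical stand-in for the remote web servers: URL -> HTML.
FAKE_SERVERS = {
    "http://shop-a.example/p": "<h1>Product P</h1><span class='price'>$19.99</span>",
    "http://shop-b.example/q": "<h1>Product Q</h1><span class='price'>$24.50</span>",
}


def fetch_page(url):
    """Client-side retrieval; a real client would issue an HTTP request here."""
    return FAKE_SERVERS[url]


def extract_record(html):
    """Placeholder extractor (assumed page format), returning (name, price)."""
    name = re.search(r"<h1>(.*?)</h1>", html).group(1)
    price = re.search(r"class='price'>\$(\d+\.\d{2})<", html).group(1)
    return name, float(price)


def populate_database(conn, urls):
    """Fetch each page, extract a record, and store it as a structured row."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    for url in urls:
        conn.execute(
            "INSERT INTO products VALUES (?, ?)", extract_record(fetch_page(url))
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for database 35
populate_database(conn, sorted(FAKE_SERVERS))
rows = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(rows)
```

Keeping retrieval, extraction, and storage as separate functions mirrors the paragraph's point that the same processing could run on either the client device or the server.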

[0040] Several elements in the...


Abstract

Systems and methods for analyzing HTML formatted web pages to automatically identify and extract desired information. A computer algorithm identifies and extracts different pieces of information from different web pages automatically after minimal manual setup. The algorithm automatically analyzes pages with different content if they have the same, or similar, formats. The algorithm is fast and efficient and performs the extraction process quickly in real-time. The systems and methods are useful to build databases from unstructured web information. The algorithm can be used as an agent that captures information about products, and compares prices or other characteristics. It can also be used to populate structured databases that, given the different pieces of information, can analyze products and their characteristics. And it can also be used for data mining applications looking for patterns useful for marketing analyses, or other uses.

Description

COPYRIGHT NOTICE

[0001] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

[0002] The present invention relates generally to analyzing and extracting information from web pages, and more particularly to automatically identifying and extracting desired information in web pages.

[0003] The World Wide Web (WWW) is now the premier outlet to publish information of all types and forms. Documents published on the web, commonly called web pages, are published using a language called HTML (or Hyper Text Markup Language), which sets standards for the formatting of documents. These standards make it possible for people to read and understand documents no matter ...


Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/00, G06F17/30
CPC: Y10S707/99931, G06F16/951, Y10S707/99936
Inventor: MANBER, UDI; LU, QI
Owner: OATH INC