
Template-based structured document classification and extraction

A structured document classification and extraction technology, applied in structured data retrieval, computing models, database models, etc., that addresses the impracticality of manually reverse engineering data extraction templates.

Active Publication Date: 2021-08-03
GOOGLE LLC

AI Technical Summary

Problems solved by technology

With the ever-changing content and layout of B2C communications, manually reverse engineering data extraction templates can become impractical.




Embodiment Construction

[0024] FIG. 1 shows an example environment in which a corpus of structured documents 100 can be clustered into clusters 132_1-m, and in which clusters containing structured documents can be analyzed to generate data extraction templates 134_1-m. As used herein, "structured documents" may refer to B2C communications such as emails, text messages (e.g., SMS, MMS), instant messages, and any other B2C communications that are typically (but not always) generated automatically, e.g., using templates. Additionally, in some implementations, structured documents may include other types of documents, such as letters (e.g., in Portable Document Format (“PDF”) and/or word processing formats), invoices, bills, receipts, invitations (e.g., invitations received via social networking applications), or other structured documents that may not be considered communications and/or attachments to other communications (e.g., email). In various implementations, structured documents may be struct...
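
The clustering step described above can be sketched roughly as follows. This excerpt does not state which criterion is used to form clusters, so grouping by an identical sequence of opening HTML tags is purely an assumption made for illustration; the names `structural_signature` and `cluster_documents` and the sample corpus are likewise hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records the order of opening tags and ignores all text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structural_signature(html):
    """Assumed cluster key: the ordered tuple of opening tags in the document."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return tuple(collector.tags)


def cluster_documents(corpus):
    """Group documents that share exactly the same structural signature."""
    clusters = defaultdict(list)
    for document in corpus:
        clusters[structural_signature(document)].append(document)
    return list(clusters.values())


# Hypothetical corpus: two flight notices built from one layout, plus an invoice.
corpus = [
    "<html><body><p>Your flight AB123 departs at 09:00</p></body></html>",
    "<html><body><p>Your flight CD456 departs at 17:30</p></body></html>",
    "<html><body><h1>Invoice</h1><p>Amount due: $40</p></body></html>",
]
clusters = cluster_documents(corpus)
```

Under this assumed criterion the two flight notices land in one cluster and the invoice in another. A real system would likely need a more tolerant similarity measure, since automatically generated documents can vary slightly in markup.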



Abstract

This application relates to template-based structured document classification and extraction. This includes automatically generating data extraction templates for structured documents (e.g., B2C emails, invoices, bills, invitations, etc.) and assigning categories to those data extraction templates to streamline data extraction from subsequent structured documents. In various embodiments, a data extraction template generated from a cluster of structured documents that share fixed content may be identified. Features of the cluster of structured documents may be applied as input to an extraction machine learning model, trained to provide the locations of transient fields in structured documents, in order to determine the location of a transient field in the cluster. An association between the data extraction template and the determined transient field location may be stored. Based on this association, data points may be extracted from a given structured document of a user that shares at least the fixed content with the cluster of structured documents. The extracted data points can be presented to the user.
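
As a rough illustration of the flow in the abstract, the sketch below derives a template from a cluster, stores an association between the template and field locations, and extracts data points from a new document that shares the fixed content. It rests on several assumptions: documents are compared token by token after whitespace splitting, every document in a cluster has the same token count, and the field locations are hard-coded as a stand-in for the output of the trained extraction model, which is not implemented here. All function names and sample strings are hypothetical.

```python
def build_template(cluster):
    """Keep tokens shared by every document; mark varying positions as None."""
    tokenized = [document.split() for document in cluster]
    length = len(tokenized[0])
    if any(len(tokens) != length for tokens in tokenized):
        raise ValueError("this sketch assumes equal token counts in a cluster")
    template = []
    for position in zip(*tokenized):
        # A position is fixed content only if all documents agree on its token.
        template.append(position[0] if len(set(position)) == 1 else None)
    return template


def matches(template, tokens):
    """True if the tokens agree with the template at every fixed position."""
    return len(tokens) == len(template) and all(
        fixed is None or fixed == token
        for fixed, token in zip(template, tokens)
    )


def extract(template, field_locations, document):
    """Return named data points, or None if the fixed content does not match."""
    tokens = document.split()
    if not matches(template, tokens):
        return None
    return {name: tokens[index] for index, name in field_locations.items()}


cluster = [
    "Your flight AB123 departs at 09:00",
    "Your flight CD456 departs at 17:30",
]
template = build_template(cluster)

# Stand-in for the extraction model's output (assumed, not learned here):
# token index -> field name, stored in association with the template.
field_locations = {2: "flight_number", 5: "departure_time"}

data_points = extract(
    template, field_locations, "Your flight EF789 departs at 06:15"
)
```

Returning `None` for a document whose fixed content differs keeps the template from being applied to the wrong kind of document; how mismatches are actually handled is not described in this excerpt.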

Description

Technical field

[0001] This application relates to template-based structured document classification and extraction.

Background

[0002] A user may be overwhelmed by a flood of business-to-consumer ("B2C") emails and similar communications informing the user of various information (e.g., itinerary receipts, delinquent bill notifications, upcoming event notifications, etc.). If the user does not set a reminder, create a calendar entry, or take other similar action in response to receiving such a communication, the user may, for example, miss a meeting, fail to pay a bill, or miss a flight. Additionally, various data points in the communications that may be immediately relevant to the user, such as information related to an upcoming or current trip (e.g., flight information, hotel reservations, event/venue information, etc.), may be spread across multiple different communications and may be difficult for the user to find.

[0003] Data and other similar docume...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/28, G06N20/00
CPC: G06F16/285, G06F16/288, G06N20/00, G06Q10/10, G06N20/20, G06F16/93, G06F40/174, G06F40/186
Inventor: 盛盈, 卢一峰, 谢婧, 杨杰, 路易斯·加西亚·普埃约, 楼季楠, 詹姆斯·文特
Owner: GOOGLE LLC