Neural networks for digital ink processing

The digital ink processing system efficiently interprets and processes handwritten inputs to modify digital content items, addressing the limitations of conventional systems by enhancing user interaction and content modification capabilities.

WO2025137556A9 · PCT designated stage · Publication Date: 2026-06-11 · GOOGLE LLC

Patent Information

Authority / Receiving Office: WO
Patent Type: Applications
Current Assignee / Owner: GOOGLE LLC
Filing Date: 2024-12-20
Publication Date: 2026-06-11

AI Technical Summary

Technical Problem

Conventional systems that accept handwritten strokes from users typically treat handwriting as only an input methodology for text or graphic content, lacking the ability to process and interpret handwritten annotations to modify digital content items efficiently.

Method used

A digital ink processing system that generates and modifies digital content items by processing handwritten user instructions, utilizing a hierarchical representation format to interpret and process handwritten inputs, enabling features like spell-check, grammar correction, and text completion.

Benefits of technology

Enables users to interactively create and modify digital content items more conveniently with a significantly improved user experience by interpreting and processing handwritten inputs as instructions for modifying digital content.

✦ Generated by Eureka AI based on patent content.


Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing handwritten user instructions submitted by a user to generate and modify digital content items. According to one aspect, there is provided a method comprising: receiving input data characterizing a document displayed on a device, the input data comprising data characterizing a plurality of handwritten strokes; generating, from the input data, a hierarchical representation of the document that characterizes relationships between objects represented by respective subsets of the plurality of handwritten strokes, wherein the hierarchical representation follows a particular hierarchical representation format; and processing the hierarchical representation of the document to identify a set of one or more candidate modifications to the document, wherein each candidate modification comprises one or more actions to change the document.

Description

Attorney Docket No. 56113-0548WO1

NEURAL NETWORKS FOR DIGITAL INK PROCESSING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of priority to U.S. Provisional Application No. 63/613,642 filed on December 21, 2023, and U.S. Provisional Application No. 63/548,826 filed on February 1, 2024, the contents of which are hereby incorporated by reference.

BACKGROUND

[0002] This specification relates to processing data using machine learning models.

[0003] Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.

[0004] Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

SUMMARY

[0005] This specification generally describes a system implemented as computer programs on one or more computers in one or more locations that can generate and modify digital content items by processing handwritten user instructions submitted by a user, e.g., by writing with a finger or stylus on a touch screen, writing with a mouse or pointing device, writing using a writing or drawing tablet, writing using an electronic pen, and so on.

[0006] The digital content items can be, e.g., digital notes, images, text documents, maps, and so on.

[0007] According to one aspect, there is provided a method performed by one or more computers, the method comprising: receiving input data characterizing a user document displayed on a user device, the input data comprising data characterizing a plurality of handwritten strokes submitted by a user in the user document while the user document is displayed on the user device; generating, from the input data, a hierarchical representation of the user document that characterizes relationships between objects represented by respective subsets of the plurality of handwritten strokes, wherein the hierarchical representation follows a particular hierarchical representation format; and processing the hierarchical representation of the user document to identify a set of one or more candidate modifications to the user document, wherein each candidate modification comprises one or more actions to change the user document.

[0008] By generating hierarchical representations that represent both handwritten inputs and digital content items, the system can efficiently parse and interpret handwritten inputs within the context of the digital content items to generate updates for the digital content items. For example, the described systems can parse and interpret handwritten instructions to perform particular processing operations on the digital content item. As another example, the described systems can parse, interpret, and process received handwritten inputs as content to be added to the digital content item. For example, the system can identify misspellings within received handwritten text and propose correct spellings, identify ungrammatical sentences in the user document and propose a grammatically correct rephrasing, identify incomplete blocks of text within received handwritten text and propose appropriate text completions, and so on.
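As a rough illustration of how a misspelling might become a candidate modification made of one or more actions, the following Python sketch matches recognized words against a small word list. The `Action` and `CandidateModification` types, the `VOCABULARY` list, and the use of `difflib` string similarity are all assumptions made for this example; the specification does not state how spelling candidates are found.

```python
import difflib
from dataclasses import dataclass

# Hypothetical word list; a real system would use a much larger lexicon or a model.
VOCABULARY = ["handwritten", "document", "digital", "stroke", "the", "a", "in"]


@dataclass(frozen=True)
class Action:
    kind: str         # e.g. "replace_text" (assumed action name)
    target: str       # text the action applies to
    replacement: str  # proposed new text


@dataclass(frozen=True)
class CandidateModification:
    description: str
    actions: tuple    # one or more Action objects


def propose_spelling_fixes(recognized_words):
    """Return one candidate modification per word that looks misspelled."""
    candidates = []
    for word in recognized_words:
        if word.lower() in VOCABULARY:
            continue  # known word, nothing to propose
        # Closest vocabulary entry by difflib's similarity ratio (assumed heuristic).
        matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=0.8)
        if matches:
            action = Action("replace_text", word, matches[0])
            candidates.append(
                CandidateModification(f"Fix spelling of '{word}'", (action,))
            )
    return candidates
```

Each returned candidate carries its own actions, so a user interface could present candidates individually and apply only those the user accepts.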

[0009] By interpreting and processing handwritten inputs as representing instructions for modifying digital content items, described systems enable users to interactively create and modify a variety of digital content items. Compared to conventional systems that accept handwritten strokes from users, which typically treat handwriting as only an input methodology for text or graphic content, the described systems can process and interpret handwritten annotations from users to modify digital content items. The described systems therefore enable users to more conveniently create and interact with digital content items (e.g., more efficiently, with a significantly improved user experience, etc.).

[0010] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] FIG. 1 illustrates an example digital ink processing system interacting with a user to update an item of digital content.

[0012] FIG. 2 is a block diagram of an example digital ink processing system.

[0013] FIG. 3 is a flow diagram of an example process for generating content updates for a digital content item by processing handwritten inputs using a digital ink processing system.

[0014] FIG. 4 is a block diagram for an example handwriting encoding system.

[0015] FIG. 5 illustrates an example handwritten stroke token for a digital ink encoding representing a handwritten input.

[0016] FIG. 6 is a flow diagram of an example process for generating a digital ink encoding for a handwritten input using a handwriting encoding system.

[0017] FIG. 7 is a flow diagram of an example process for training a handwriting encoding system.

[0018] FIG. 8 illustrates an example hierarchical encoding for a digital content item.

[0019] FIG. 9 is a flow diagram of an example process for generating a hierarchical encoding for a digital content item using a graph generating neural network.

[0020] FIG. 10 is a flow diagram of an example process for training a graph generating neural network to generate hierarchical encodings for digital content items.

[0021] FIG. 11 is a flow diagram of an example process for generating content updates for a digital content item using a digital ink processing neural network.

[0022] FIG. 12 is a flow diagram of an example process for training a digital ink processing neural network to generate content updates for a digital content item.

[0023] FIG. 13 illustrates creating and editing a user document by processing and interpreting handwritten inputs using a digital ink processing system.

[0024] FIG. 14 illustrates creating and modifying a digital note by processing and interpreting handwritten inputs using a digital ink processing system.

[0025] FIG. 15 illustrates interacting with a web page by processing and interpreting handwritten inputs using a digital ink processing system.

[0026] FIG. 16 illustrates editing an image by processing and interpreting handwritten inputs using a digital ink processing system.

[0027] FIG. 17 illustrates editing a map by processing and interpreting handwritten inputs using a digital ink processing system.

[0028] Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0029] FIG. 1 illustrates an example digital ink processing system 102 interacting with a user 104 to update a digital content item 106.

[0030] The digital ink processing system 102 can interact with the user by means of a user interface 108. The user interface 108 can present (e.g., display) the digital content item 106 to the user 104 and can receive inputs from the user 104. In particular, the user interface 108 can receive data characterizing handwritten strokes 110 (e.g., data characterizing handwritten text, drawings, symbols, etc.) from the user 104.

[0031] The user interface 108 can be a user interface of a device (e.g., a desktop computer, a laptop, a mobile phone, a tablet, a touch-screen interface, and so on). As an example, the device can be a user device, e.g., a mobile phone of the user 104, a laptop of the user 104, a tablet of the user 104, and so on. As another example, the device can be part of a point-of-sale system (e.g., a kiosk, a tablet, a mobile phone, etc., that can display information about a user purchase and receive inputs from the user 104) with which the user 104 interacts. As another example, the device can be part of an advertisement system (e.g., a kiosk, a tablet, a mobile phone, etc., that can display an advertisement and receive inputs from the user 104) with which the user 104 interacts. As another example, the device can be part of an informational system (e.g., a kiosk, a tablet, a mobile phone, etc., that can display information and receive inputs from the user 104 to answer user questions) with which the user 104 interacts.

[0032] The data characterizing the handwritten strokes 110 can characterize handwritten strokes as submitted by the user 104 to the user interface 108 using any of a variety of input methods, e.g., writing with a finger or stylus on a touch screen, writing with a mouse or pointing device, writing using a writing or drawing tablet, writing using an electronic pen, and so on. In some implementations, the data characterizing the handwritten strokes 110 can include an image of handwritten strokes (e.g., an image of handwritten text, drawings, symbols, etc.) submitted by the user 104 to the user interface 108.

[0033] The digital ink processing system 102 can process the handwritten strokes 110 and the digital content item 106 to generate content updates 112 for the digital content item 106. As described in more detail below with reference to FIG. 2, the digital ink processing system 102 can process and interpret the handwritten strokes 110 using the digital content item 106 as context to generate the content updates 112 modifying the digital content item 106 (e.g., adding content to the item 106, removing content from the item 106, editing content within the item 106, etc.).

[0034] In particular, the digital ink processing system 102 can process and interpret the handwritten strokes 110 as representations of instructions from the user 104 for modifying the digital content item 106. For example, the digital ink processing system 102 can process and interpret the handwritten strokes 110 as annotations that identify and provide instructions for modifying portions of the digital content item 106.

[0035] As an example, the digital content item 106 can be a user document (e.g., a text document) and the handwritten strokes 110 can represent instructions from the user 104 for modifying portions of text within the user document. As further examples, the handwritten strokes 110 can represent instructions to modify the user document by, e.g., adding text, removing portions of text, editing portions of text, rearranging portions of text, and so on, and the digital ink processing system 102 can update the user document in accordance with the instructions represented by the handwritten strokes 110. Creating and editing a user document by processing and interpreting handwritten inputs is described in more detail below with reference to FIG. 13.

[0036] As another example, the digital content item 106 can be a digital note (e.g., a digital note including data in a variety of formats, such as text data, image data, digital ink data representing handwritten annotations, etc.) for the user 104 and the handwritten strokes 110 can represent instructions from the user 104 for modifying portions of the digital note. As further examples, the handwritten strokes 110 can represent instructions to modify the digital note by, e.g., adding content to the digital note, removing content, editing content, rearranging content, linking pieces of content, and so on, and the digital ink processing system 102 can update the digital note in accordance with the instructions represented by the handwritten strokes 110. Creating and modifying a digital note by processing and interpreting handwritten inputs is described in more detail below with reference to FIG. 14.

[0037] As another example, the digital content item 106 can be a web page and the handwritten strokes 110 can represent instructions from the user 104 for processing content of the web page and modifying local data stored for the web page. As further examples, the handwritten strokes 110 can represent instructions to process content of the web page by, e.g., searching selected text using a search engine, saving images and text from the web page, generating a text summary of some or all of the web page, and so on, and the digital ink processing system 102 can process the web page in accordance with the instructions represented by the handwritten strokes 110. As another example, the handwritten strokes 110 can represent digital notes (e.g., handwritten text, symbols, drawings, etc.) created by the user 104 for the web page and the digital ink processing system 102 can process the handwritten strokes 110 to save a local copy of the digital note that, e.g., can be displayed (e.g., as an overlay for the web page) when the user 104 accesses the web page. Interacting with a web page by processing and interpreting handwritten inputs is described in more detail below with reference to FIG. 15.

[0038] As another example, the digital content item 106 can be an image and the handwritten strokes 110 can represent instructions from the user 104 for modifying portions of the image. As a further example, the handwritten strokes 110 can represent annotations of the image from the user 104 that identify respective regions of the image (e.g., as circled, pointed to, and so on, by the user 104) and represent instructions to modify the respective regions of the image (e.g., by performing color correction, applying image processing filters, generating image content using a generative machine learning model, and so on) and the digital ink processing system 102 can update the image in accordance with the instructions represented by the handwritten strokes 110. Editing an image by processing and interpreting handwritten inputs is described in more detail below with reference to FIG. 16.

[0039] As another example, the digital content item 106 can be a map and the handwritten strokes 110 can represent instructions from the user 104 for processing, e.g., points, paths, regions, and so on depicted by the map. As a further example, the handwritten strokes 110 can represent annotations of the map from the user 104 that identify respective geo-located points, paths, regions, and so on of the map (e.g., as circled, pointed to, and so on, by the user 104) and the digital ink processing system 102 can update the map in accordance with the annotations represented by the handwritten strokes 110 (e.g., by including geo-located labels, routes, boundaries, and so on within the map as specified by the handwritten strokes 110). Editing a map by processing and interpreting handwritten inputs is described in more detail below with reference to FIG. 17.

[0040] By interpreting and processing the handwritten strokes 110 from the user 104 as representing instructions for modifying the digital content item 106, the digital ink processing system 102 enables users to interactively create and modify a variety of digital content items. Compared to conventional systems that accept handwritten strokes from users, which typically treat handwriting as only an input methodology for text or graphic content, the digital ink processing system 102 can process and interpret handwritten annotations from users to modify digital content items. The digital ink processing system 102 therefore enables users to more conveniently create and interact with digital content items (e.g., more efficiently, with a significantly improved user experience, etc.).

[0041] FIG. 2 is a block diagram of an example digital ink processing system 102. The digital ink processing system 102 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

[0042] The digital ink processing system 102 can process data characterizing handwritten strokes 110 and a digital content item 106 to generate content updates 112 modifying the digital content item 106 (e.g., adding content to the item 106, removing content from the item 106, editing content within the item 106, etc.). In particular, the digital ink processing system 102 can use the digital content item 106 as context as part of processing and interpreting the handwritten strokes 110 as instructions for modifying the digital content item 106.

[0043] The data characterizing the handwritten strokes 110 can have any of a variety of formats for representing handwritten inputs. For example, the data characterizing the handwritten strokes 110 can be images of handwritten strokes. As another example, the data characterizing the handwritten strokes 110 can be geometric data specifying handwritten strokes, e.g., as captured by an input device such as an electronic pen, a writing tablet, a touch screen, and so on.

[0044] The digital ink processing system 102 can include a handwriting encoding system 202, a content encoding system 204, and a digital ink processing neural network 206, which are each described next (and throughout this specification).

[0045] The handwriting encoding system 202 can process the data characterizing the handwritten strokes 110 to generate a digital ink encoding 208 representing the handwritten strokes 110. The digital ink encoding 208 can be a sequence of tokens representing the handwritten strokes 110. The handwriting encoding system 202 is described in more detail below with reference to FIGs. 4-7.

[0046] The content encoding system 204 can process the digital content item 106 and the digital ink encoding 208 for the handwritten strokes 110 to generate a hierarchical encoding 210 (e.g., a hierarchical representation) of both the digital content item 106 and the handwritten strokes 110.

[0047] In some implementations, the content encoding system 204 can directly process the digital content item 106 as part of generating the hierarchical encoding 210. In other implementations, the content encoding system can receive and process an initial hierarchical encoding of the digital content item 106 as part of generating the hierarchical encoding 210 representing both the digital content item 106 and the handwritten strokes 110 (e.g., generating the hierarchical encoding 210 by adding elements representing the handwritten strokes 110 to the initial hierarchical encoding of the digital content item 106).
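The second option above, adding elements for the handwritten strokes to an initial hierarchical encoding, can be pictured with a small tree structure. This is a minimal sketch only: the `Node` class, the node kinds ("document", "paragraph", "ink"), and the `add_ink` helper are hypothetical names, and the actual hierarchical representation format is the one described with reference to FIG. 8.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                 # assumed node kinds: "document", "paragraph", "ink"
    content: str = ""
    children: list = field(default_factory=list)


def add_ink(root, target_index, ink_tokens):
    """Attach an ink node under the child of `root` that the strokes annotate."""
    ink_node = Node(kind="ink", content=" ".join(ink_tokens))
    root.children[target_index].children.append(ink_node)
    return root


def to_outline(node, depth=0):
    """Render the tree as indented lines, one per node, for inspection."""
    lines = ["  " * depth + f"{node.kind}: {node.content}".rstrip()]
    for child in node.children:
        lines.extend(to_outline(child, depth + 1))
    return lines
```

Placing the ink node beneath the element it annotates keeps the relationship between the strokes and the existing content explicit in the structure itself.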

[0048] The digital ink processing neural network 206 can process data characterizing the hierarchical encoding 210 representing the digital content item 106 and the handwritten strokes 110 to generate the content updates 112 for the digital content item 106. The digital ink processing neural network 206 can have any neural network architecture appropriate for processing the data characterizing the hierarchical encoding 210 and generating the content updates 112. For example, the digital ink processing neural network 206 can be a token processing network (e.g., a language model, a vision language model, etc.) configured to process an input token sequence representing the hierarchical encoding 210 to generate an output token sequence representing the content updates 112 for the digital content item 106. As another example, the digital ink processing neural network 206 can be a graph neural network configured to process an input graph representing the hierarchical encoding 210 to generate an updated output graph representing the digital content item 106 as updated in accordance with the received handwritten strokes 110.
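For the token processing case, the hierarchical encoding has to be presented as a flat input token sequence. One way to do that, offered here only as an assumption, is a depth-first traversal that wraps each element in opening and closing marker tokens. The tuple layout `(kind, content, children)` and the `<kind>` / `</kind>` markers are hypothetical and are not taken from the specification.

```python
def serialize(node):
    """Depth-first flattening of a (kind, content, children) tuple into tokens."""
    kind, content, children = node
    tokens = [f"<{kind}>"]          # opening marker for this element
    tokens.extend(content.split())  # whitespace-split content tokens
    for child in children:
        tokens.extend(serialize(child))
    tokens.append(f"</{kind}>")     # closing marker keeps nesting recoverable
    return tokens
```

Because every element is closed explicitly, the nesting can be recovered from the flat sequence, which matters if the output token sequence is to be mapped back onto the content item.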

[0049] An example process for generating the content updates 112 for the digital content item 106 using the digital ink processing system 102 is described in more detail below with reference to FIG. 3.

[0050] FIG. 3 is a flow diagram of an example process for generating content updates for a digital content item by processing handwritten inputs using a digital ink processing system. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a digital ink processing system, e.g., the digital ink processing system 102 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 300.

[0051] In general, the system can perform the process 300 as part of an interaction with a user. For example, at each of a sequence of time steps, the system can display the current digital content item to the user (e.g., by way of a user interface), receive a current handwritten input from the user (e.g., by way of the user interface) characterizing a requested modification to the current digital content item, and perform the process 300 to generate an updated digital content item for the time step in accordance with the request characterized by the current handwritten input. Each time the system generates an updated digital content item, the system can display the updated digital content item to the user and can await additional handwritten inputs from the user characterizing requests for further modifications to the digital content item.
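The interaction described above amounts to a loop over time steps in which each handwritten input updates the current digital content item. A minimal sketch follows, assuming the per-step processing is supplied as a callable; `run_session` and `append_text` are hypothetical names, and `append_text` is only a toy stand-in for the process 300.

```python
def run_session(document, handwritten_inputs, process):
    """Apply `process` once per time step and record each displayed document."""
    displayed = [document]  # the item as first shown to the user
    for handwritten_input in handwritten_inputs:
        document = process(document, handwritten_input)
        displayed.append(document)  # the updated item is shown before the next input
    return displayed


def append_text(document, handwritten_input):
    """Toy stand-in for the process 300: treats each input as text to append."""
    return f"{document} {handwritten_input}".strip()
```

For instance, `run_session("", ["buy", "milk"], append_text)` records the empty starting note followed by one updated note per handwritten input.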

[0052] The system can receive data characterizing the handwritten input (step 302). The system can receive data characterizing any of a variety of handwritten inputs. For example, the data characterizing the handwritten input can characterize handwritten strokes as submitted by a user to a user interface by, e.g., writing with a finger or stylus on a touch screen, writing with a mouse or pointing device, writing using a writing or drawing tablet, writing using an electronic pen, and so on. As another example, the data characterizing the handwritten input can include an image of handwritten strokes (e.g., an image of handwritten text, drawings, symbols, etc.).

[0053] The system can generate a digital ink encoding representing the handwritten input (step 304). In particular, the system can process the data characterizing the handwritten input using a digital ink encoding system to generate the digital ink encoding for the handwritten input. The digital ink encoding can be a sequence of tokens representing the handwritten input, with each token representing a respective one or more handwritten strokes of the handwritten input.

[0054] The digital ink encoding system is described in more detail below with reference to FIGs. 4-7. In particular, an example process for generating the digital ink encoding using the digital ink encoding system is described in more detail below with reference to FIG. 6.

[0055] The system can process the digital content item and the digital ink encoding for the handwritten input using a content encoding system to generate a hierarchical encoding (e.g., a hierarchical representation) of both the digital content item and the handwritten input (step 306). An example hierarchical encoding for the digital content item is described in more detail below with reference to FIG. 8. An example process for generating the hierarchical representation for the digital content item and the handwritten input is described in more detail below with reference to FIG. 9.

[0056] The system can process the hierarchical encoding for the handwritten input and the digital content item using a digital ink processing neural network to generate the content updates for the digital content item (step 308). The digital ink processing neural network can have any neural network architecture appropriate for processing the data characterizing the hierarchical encoding and generating the content updates. For example, the digital ink processing neural network can be a token processing network (e.g., a language model, a vision language model, etc.) configured to process an input token sequence representing the hierarchical encoding to generate an output token sequence representing the content updates for the digital content item. As another example, the digital ink processing neural network can be a graph neural network configured to process an input graph representing the hierarchical encoding to generate an updated output graph representing the digital content item as updated in accordance with the received handwritten strokes.

[0057] The digital ink processing neural network is described in more detail below with reference to FIG. 11 and FIG. 12. In particular, an example process for generating the content updates for the digital content item by processing the digital ink encoding and the content encoding using the digital ink processing neural network is described in more detail below with reference to FIG. 11.

[0058] The system can output the generated updates for the digital content item (step 310). In general, the updated digital content item can be stored and used for any of a variety of downstream tasks. As a particular example, when the system interacts with a user to update the digital content item, the system can display the updated digital content item to the user and can await additional handwritten inputs from the user characterizing requests for further modifications to the digital content item.
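Steps 302 to 310 of the process 300 chain three components in a fixed order. The sketch below shows only that data flow; the function name `process_300`, its parameters, and the three toy stand-in functions are assumptions for illustration, and none of them reflects how the actual encoders or the neural network work.

```python
def process_300(handwritten_input, content_item, encode_ink, encode_content, ink_network):
    """Run steps 302-310 once; the three callables are hypothetical stand-ins."""
    # Step 302: the data characterizing the handwritten input arrives as an argument.
    # Step 304: encode the handwritten input as a sequence of tokens.
    ink_encoding = encode_ink(handwritten_input)
    # Step 306: build one hierarchical encoding of the content item and the ink.
    hierarchical_encoding = encode_content(content_item, ink_encoding)
    # Step 308: let the network turn the hierarchical encoding into content updates.
    content_updates = ink_network(hierarchical_encoding)
    # Step 310: output the generated updates.
    return content_updates


# Toy stand-ins so the sketch runs end to end.
def toy_encode_ink(strokes):
    return [f"<stroke_{i}>" for i, _ in enumerate(strokes)]


def toy_encode_content(item, ink_tokens):
    return {"document": item, "ink": ink_tokens}


def toy_network(encoding):
    return [f"annotate '{encoding['document']}' with {len(encoding['ink'])} strokes"]
```

Passing the components in as callables mirrors the statement that the network can have any appropriate architecture: a token processing network or a graph neural network could be substituted without changing the surrounding flow.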

[0059] FIG. 4 is a block diagram for an example handwriting encoding system 202. The handwriting encoding system 202 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

[0060] As described above, the handwriting encoding system 202 can process the data characterizing a handwritten input to generate a digital ink encoding representing the handwritten input. As an example, the handwriting encoding system 202 can receive input stroke data 402 specifying handwritten strokes within the handwritten input (e.g., as captured by an input device such as an electronic pen, a writing tablet, a touch screen, and so on). The input stroke data 402 can characterize paths (e.g., curves) of the handwritten strokes by specifying, e.g., positions of points along the paths of the handwritten strokes, geometrical properties (e.g., curvatures, stroke widths, etc.) of the paths at points along the paths, properties of an input device (e.g., a pressure of the input device, a velocity of the input device, an angle and/or an orientation of the input device, etc.) creating the handwritten strokes at points along the paths of the handwritten strokes, and so on. As another example, the handwriting encoding system 202 can receive input image data 404 characterizing an image of the handwritten input. The handwriting encoding system 202 can output the digital ink encoding of the handwritten input as a sequence of handwritten stroke tokens 406 representing the handwritten strokes of the handwritten input.
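To make the kinds of per-point data listed above concrete, the following sketch stores a position, a timestamp, and a pressure for each sampled point and derives a speed and a turning angle from consecutive points. The `StrokePoint` fields, the timestamp, the pressure normalization, and the use of the turning angle as a rough curvature proxy are all assumptions; the specification does not prescribe a data layout or how such properties are computed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StrokePoint:
    x: float
    y: float
    t: float         # timestamp in seconds (assumed field)
    pressure: float  # normalized to [0, 1] (assumed convention)


def speeds(points):
    """Speed of the input device between consecutive points."""
    result = []
    for a, b in zip(points, points[1:]):
        dt = b.t - a.t
        distance = math.hypot(b.x - a.x, b.y - a.y)
        result.append(distance / dt if dt > 0 else 0.0)
    return result


def turning_angles(points):
    """Signed change of heading at each interior point, a rough curvature proxy."""
    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        heading_in = math.atan2(b.y - a.y, b.x - a.x)
        heading_out = math.atan2(c.y - b.y, c.x - b.x)
        delta = heading_out - heading_in
        # Wrap into [-pi, pi] so a small turn never reads as a near-full rotation.
        angles.append(math.atan2(math.sin(delta), math.cos(delta)))
    return angles
```

A stroke with n sampled points yields n - 1 speeds and n - 2 turning angles, so these derived properties belong to segments and interior points rather than to every sample.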

[0061] The handwriting encoding system 202 can include a task processing neural network 408 and a token generating neural network 409. The task processing neural network 408 can process the data characterizing the handwritten input (e.g., the input stroke data 402, the input image data 404, or both) to generate an input encoding (e.g., a network input) for the token generating neural network 409. The token generating neural network 409 can process the input encoding produced by the task processing neural network 408 to generate the output sequence of handwritten stroke tokens 406 representing the handwritten strokes of the handwritten input.

[0062] The task processing neural network 408 can include a handwritten stroke encoder 410 and an image processing neural network 414, which are each described next.

[0063] The handwritten stroke encoder 410 can process the input stroke data 402 to generate an input stroke encoding 412 representing the handwritten strokes of the handwritten input. The input stroke encoding 412 can be, e.g., one or more encoding vectors of numerical values, each encoding vector representing a respective portion of the input stroke data 402. In particular, the input stroke encoding 412 can be a sequence of handwritten stroke tokens representing the handwritten input, as described in more detail below with reference to FIG. 5. The input encoding generated by the task processing neural network 408 and processed by the token generating neural network 409 can include the input stroke encoding 412.

[0064] The image processing neural network 414 can be configured (e.g., trained) to process the input image data 404 to generate an input image encoding 416 representing the handwritten strokes of the handwritten input. The input image encoding 416 can be, e.g., one or more encoding vectors of numerical values, each encoding vector representing a respective portion of the input image data 404. In particular, the input image encoding 416 can be a sequence of image tokens representing the input image of the handwritten input. The input encoding generated by the task processing neural network 408 and processed by the token generating neural network 409 can include the input image encoding 416.

[0065] The image processing neural network 414 can have any appropriate neural network architecture for processing the input image data 404 to generate the input image encoding 416. For example, the image processing neural network 414 can include any of a variety of processing layers (e.g., convolutional layers, recurrent layers, attention layers, etc.) in any combination appropriate for processing the input image data 404 to generate the input image encoding 416. In particular, when the input image encoding 416 is a sequence of image tokens representing the input image of the handwritten input, the image processing neural network 414 can be a token processing neural network (e.g., a vision transformer) configured to generate the input image encoding 416 by applying a sequence of attention operations to an initial token sequence representing the input image of the handwritten input.

[0066] The image processing neural network 414 can be trained using any appropriate machine learning technique. In particular, as described in more detail below with reference to FIG. 7, the image processing neural network 414 can be trained as part of training the handwriting encoding system 202.

[0067] The token generating neural network 409 can be configured (e.g., trained) to process the input encoding produced by the task processing neural network 408 to generate the sequence of handwritten stroke tokens 406 representing the handwritten input. For example, when the handwriting encoding system 202 receives input stroke data 402, the token generating neural network 409 can process the input stroke encoding 412 for the input stroke data 402 to generate the output sequence of handwritten stroke tokens 406. As another example, when the handwriting encoding system 202 receives input image data 404, the token generating neural network 409 can process the input image encoding 416 for the input image data 404 to generate the output sequence of handwritten stroke tokens 406.

[0068] The token generating neural network 409 can have any appropriate neural network architecture for generating the output sequence of handwritten stroke tokens 406. For example, the token generating neural network 409 can include any of a variety of processing layers (e.g., convolutional layers, recurrent layers, attention layers, etc.) in any combination appropriate for generating the output sequence of handwritten stroke tokens 406. In particular, the token generating neural network 409 can be a token processing neural network (e.g., a language model, a vision language model, etc.) configured to apply a sequence of attention operations to an input token sequence (e.g., including handwritten stroke tokens generated by the handwritten stroke encoder 410, image tokens generated by the image processing neural network 414, etc.) to generate the output sequence of handwritten stroke tokens 406.

[0069] When the token generating neural network 409 is a token processing neural network, the token generating neural network 409 can be configured to process input tokens and generate output tokens from a particular token vocabulary (e.g., a discrete set of tokens that can be processed and / or generated by the token generating neural network 409). The handwritten stroke encoder 410 can be configured to generate the input stroke encoding 412 as a sequence of initial handwritten stroke tokens selected from the token vocabulary of the token generating neural network 409. Similarly, the image processing neural network 414 can be configured to generate the input image encoding 416 as a sequence of image tokens selected from the token vocabulary of the token generating neural network 409.
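
As an illustration of a single token vocabulary shared by text, image, and handwritten stroke tokens, the following minimal sketch partitions one ID space into disjoint ranges. The class name VocabLayout, its fields, and the range ordering are hypothetical and are not taken from this specification; the specification does not say how the vocabulary is organized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabLayout:
    """Hypothetical layout: one ID space split into three disjoint ranges."""
    num_text: int    # number of text tokens (assumed)
    num_image: int   # number of image tokens (assumed)
    num_stroke: int  # number of handwritten stroke tokens (assumed)

    @property
    def size(self) -> int:
        return self.num_text + self.num_image + self.num_stroke

    @staticmethod
    def _check(index: int, limit: int) -> None:
        if not 0 <= index < limit:
            raise ValueError(f"index {index} outside [0, {limit})")

    def text_id(self, index: int) -> int:
        self._check(index, self.num_text)
        return index

    def image_id(self, index: int) -> int:
        self._check(index, self.num_image)
        return self.num_text + index

    def stroke_id(self, index: int) -> int:
        self._check(index, self.num_stroke)
        return self.num_text + self.num_image + index

    def kind(self, token_id: int) -> str:
        """Report which range a vocabulary ID falls in."""
        if not 0 <= token_id < self.size:
            raise ValueError(f"token id {token_id} outside vocabulary")
        if token_id < self.num_text:
            return "text"
        if token_id < self.num_text + self.num_image:
            return "image"
        return "stroke"


layout = VocabLayout(num_text=100, num_image=50, num_stroke=20)
```

With a layout of this kind, encoders for different modalities can emit IDs that one token processing network consumes without ambiguity about which modality each token belongs to.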

[0070] In some implementations, the token generating neural network 409 can be configured to generate an output sequence of content tokens 418 (e.g., selected from the vocabulary of the token generating neural network 409) that characterize a content of the handwritten input. For example, when the handwritten input includes handwritten text, the token generating neural network 409 can generate an output sequence of content tokens 418 that includes one or more text tokens representing one or more characters or words within the handwritten text. As another example, when the handwritten input includes handwritten symbols, the token generating neural network 409 can generate an output sequence of content tokens 418 that includes one or more content tokens representing classifications of the handwritten symbols (e.g., classifying the handwritten symbols as representing lines, arrows, shapes, brackets, drawings, etc.).

[0071] In some implementations, the task processing neural network 408 can be configured (e.g., trained) to receive an input prompt 420 characterizing a request to perform a particular processing task and generate an input encoding that can be processed by the token generating neural network 409 to perform the particular processing task. For example, the input prompt 420 can include a request to derender input image data 404 depicting a handwritten input (e.g., a request to generate a sequence of handwritten stroke tokens representing the handwritten input), the task processing neural network 408 can generate an appropriate input encoding for the token generating neural network 409, and the token generating neural network 409 can generate the sequence of handwritten stroke tokens 406 in response to the input prompt 420. As another example, the input prompt 420 can include a request to determine the content of the handwritten input, the task processing neural network 408 can generate an appropriate input encoding for the token generating neural network 409, and the token generating neural network 409 can generate the sequence of content tokens 418 in response to the input prompt 420.

[0072] The handwriting encoding system 202 can receive the input prompt 420 from any of a variety of sources. As an example, when the handwriting encoding system 202 is a part of a digital ink processing system (e.g., the digital ink processing system 102 of FIG. 1), the system can receive the input prompt 420 from the digital ink processing system. As another example, the system can receive the input prompt 420 from a user.

[0073] The input prompt 420 can be a text prompt representing a request to perform a particular processing task. The task processing neural network 408 can include a text processing neural network 422 configured (e.g., trained) to generate an input prompt encoding 424 representing the input prompt 420. When the token generating neural network 409 is a token processing neural network, the input prompt encoding 424 can be a sequence of text tokens (e.g., as selected from the vocabulary of the token generating neural network 409) representing the input prompt 420. The text processing neural network 422 can have any appropriate network architecture for processing the input prompt 420 to generate the input prompt encoding 424, e.g., a transformer network, a text embedding network, and so on.

[0074] The handwriting encoding system 202 can be trained using any appropriate machine learning technique. When the task processing neural network 408 is configured to process input prompts characterizing requests to perform particular processing tasks as part of generating input encodings for the token generating neural network 409, the handwriting encoding system 202 can be trained using training data that includes examples for multiple different processing tasks (e.g., by including training examples with example input prompts for the multiple different processing tasks). An example process for training the handwriting encoding system 202 is described in more detail below with reference to FIG. 7.

[0075] By generating multiple output modalities (e.g., textual content and handwritten stroke data), the handwriting encoding system 202 can be trained using a combination of training datasets for different tasks. For example, the handwriting encoding system 202 can be trained using training data for determining the textual content of handwritten text and using separate training data for determining the accuracy of handwritten strokes identified by the handwriting encoding system 202. Conventional methods for processing handwritten text often rely on training data for a particular handwritten text processing task, which may be limited or difficult to obtain. By training based on performance in multiple handwriting processing tasks, the handwriting encoding system 202 can be trained using a greater amount of training data than conventional methods. Therefore, by producing both text content and stroke prediction outputs and by training on data for both text content and stroke prediction tasks, the handwriting encoding system 202 can be more efficiently trained (e.g., in terms of computational costs, such as training time, memory usage, power consumption, etc.) to produce more accurate outputs (e.g., as determined by performance metrics for either the text content or the stroke prediction tasks) than conventional methods.

[0076] When the token generating neural network 409 generates handwritten stroke tokens 406 for a handwritten input, the output token sequence generated by the token generating neural network 409 can be used as a digital ink encoding 426 of the handwritten input that can efficiently represent handwritten text as a combination of textual content tokens and handwritten stroke tokens. This allows the handwriting encoding system 202 or another system (e.g., the digital ink processing system 102 of FIG. 1) to perform any of a variety of downstream tasks starting from an image or other depiction of handwritten text. For example, the handwritten text can be digitized and displayed on a display of a user device, allowing the user to interact with the handwritten text by submitting strokes using touch or other inputs. As a further example, the handwriting encoding system 202 or another system (e.g., the digital ink processing system 102 of FIG. 1) can efficiently process the content of the handwritten text based on the textual content tokens generated by the handwriting encoding system 202, e.g., to perform spell-check, grammar check, a language processing task, and so on. As another example, the handwriting encoding system 202 or another system (e.g., the digital ink processing system 102 of FIG. 1) can efficiently process handwritten strokes of the handwritten text based on the handwritten stroke tokens generated by the described systems, e.g., to modify a writing style of the text, to add, remove, or modify handwritten strokes, and so on.

[0077] In particular, when the handwriting encoding system 202 is part of a digital ink processing system (e.g., the digital ink processing system 102 of FIG. 1), the digital ink processing system can process digital ink encodings of handwritten user inputs as part of creating and modifying digital content items. An example process of generating digital ink encodings for handwritten inputs using the handwriting encoding system 202 is described in more detail below with reference to FIG. 6.

[0078] FIG. 5 illustrates a sequence of handwritten stroke tokens 502 for a digital ink encoding representing a handwritten stroke 504.

[0079] The sequence of tokens 502 representing the handwritten stroke 504 can include one or more tokens associated with each of one or more points along a path of the handwritten stroke 504. For example, as illustrated in FIG. 5, the sequence of tokens 502 includes one or more tokens for each of 7 sampled points along the path of the handwritten stroke 504.

[0080] Each token of the sequence of tokens 502 can therefore represent (e.g., encode) particular properties of the handwritten stroke 504 at the point along the path of the stroke 504 associated with the token. For example, as illustrated in FIG. 5, the sequence of tokens 502 includes, for each of the 7 sampled points along the path of the handwritten stroke 504, associated tokens representing (e.g., encoding) a 2-dimensional (e.g., X-Y) position of the point within a fixed canvas for the handwritten stroke 504.

[0081] In particular, for each of a set of properties of the handwritten stroke 504 and for each of the points along the path of the stroke 504, the sequence of tokens 502 can include a respective token from a set of tokens for the property that represents (e.g., encodes) a value of the property at the point along the path of the stroke 504. The set of tokens for each property can quantize possible values of the property (e.g., for each property, the set of tokens for the property can be a discrete set of tokens, with each token representing a particular value for the property). For each property and for each of the points along the path of the stroke 504, the sequence of tokens 502 can include a token from the set of tokens for the property that represents a closest value for the property to the value of the property for the point of the handwritten stroke 504.

[0082] For example, as illustrated in FIG. 5, the sequence of tokens 502 includes, for each of the 7 sampled points along the path of the handwritten stroke 504, a respective token representing (e.g., encoding) a Y position of the point selected from a set of tokens 506-A through 506-E representing (e.g., encoding) possible Y positions for points of the stroke 504. Similarly, as illustrated in FIG. 5, the sequence of tokens 502 includes, for each of the 7 sampled points along the path of the handwritten stroke 504, a respective token representing (e.g., encoding) an X position of the point selected from a set of tokens 508-A through 508-E representing (e.g., encoding) possible X positions for points of the stroke 504.
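
The nearest-value quantization described above can be sketched as follows. This is a minimal illustration, assuming evenly spaced token values on a canvas normalized to [0, 1] and a Y-then-X token order per point; the function names, the spacing, the canvas range, and the ordering are assumptions made here and are not specified by FIG. 5.

```python
def make_bins(num_tokens, lo, hi):
    """Representative value for each token, evenly spaced (an assumption)."""
    if num_tokens < 2:
        raise ValueError("need at least two tokens")
    step = (hi - lo) / (num_tokens - 1)
    return [lo + k * step for k in range(num_tokens)]


def nearest_token(value, bins):
    """Index of the token whose representative value is closest to value."""
    return min(range(len(bins)), key=lambda k: abs(bins[k] - value))


def tokenize_stroke(points, x_bins, y_bins):
    """Emit one Y token and one X token per sampled (x, y) point."""
    tokens = []
    for x, y in points:
        tokens.append(("Y", nearest_token(y, y_bins)))
        tokens.append(("X", nearest_token(x, x_bins)))
    return tokens


# Five tokens per axis, mirroring the five-element sets 506-A..E and 508-A..E.
bins = make_bins(5, 0.0, 1.0)
stroke_tokens = tokenize_stroke([(0.1, 0.9), (0.5, 0.3)], bins, bins)
```

Here the point (0.1, 0.9) maps to Y token 4 and X token 0, and the point (0.5, 0.3) maps to Y token 1 and X token 2. A finer token set reduces quantization error at the cost of a larger vocabulary.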

[0083] For illustrative purposes, FIG. 5 illustrates the sequence of tokens 502 including tokens representing X and Y positions for the sampled points along the path of the stroke 504. However, the sequence of tokens 502 can include tokens representing (e.g., encoding) any of a variety of properties of points of the stroke 504. For example, the sequence of tokens 502 can include tokens characterizing geometric properties of the stroke 504 at each of the sampled points such as, e.g., a curvature of the stroke 504 at each point, a width of the stroke 504 at each point, and so on. As another example, the sequence of tokens 502 can include tokens characterizing properties of an input device (e.g., an electronic pen, a finger, etc.) creating the stroke 504 at each of the sampled points such as, e.g., a pressure of the input device at each point, a velocity of the input device at each point, an angle and / or an orientation of the input device at each point, and so on.

[0084] In some implementations, the sequence of tokens 502 can include tokens representing properties of the handwritten stroke 504 as a whole. For example, the sequence of tokens 502 can include tokens representing a classification of the handwritten stroke (e.g., as a character, a symbol, a drawing, a letter, etc.). As another example, the sequence of tokens 502 can include tokens representing a content of the handwritten stroke 504 (e.g., an identity of a letter represented by the stroke 504, a type of symbol depicted by the stroke 504, a shape depicted by the stroke 504, etc.).

[0085] FIG. 6 is a flow diagram of an example process for generating a digital ink encoding for a handwritten input using a handwriting encoding system. For convenience, the process 600 will be described as being performed by a system of one or more computers located in one or more locations. For example, a handwriting encoding system, e.g., the handwriting encoding system 202 of FIG. 2, appropriately programmed in accordance with this specification, can perform the process 600.

[0086] The system can receive data characterizing a handwritten input (step 602). The handwritten input can include a plurality of handwritten strokes depicting, e.g., handwritten text, handwritten symbols, handwritten drawings, and so on.

[0087] As an example, the system can receive input stroke data specifying handwritten strokes within the handwritten input (e.g., as captured by an input device such as an electronic pen, a writing tablet, a touch screen, and so on). The input stroke data can characterize paths (e.g., curves) of the handwritten strokes by specifying, e.g., positions of points along the paths of the handwritten strokes, geometrical properties (e.g., curvatures, stroke widths, etc.) of the paths at points along the paths, properties of an input device (e.g., a pressure of the input device, a velocity of the input device, an angle and / or an orientation of the input device, etc.) creating the handwritten strokes at points along the paths of the handwritten strokes, and so on.

[0088] As another example, the system can receive input image data characterizing an image of the handwritten input.

[0089] In some implementations, the system can receive an input prompt (step 604). For example, the system can receive the input prompt from a user. As another example, when the system is part of another system (e.g., the digital ink processing system 102 of FIG. 1), the system can receive the input prompt from the other system.

[0090] The input prompt can be a text prompt representing a request to perform a particular processing task. As an example, the input prompt can include a request to derender the data characterizing the handwritten input (e.g., to generate a digital ink representation of the handwritten input that represents the handwritten strokes of the handwritten input). As another example, the input prompt can include a request to recognize (e.g., identify) content depicted by the handwritten input. The input prompt can be, e.g., “Derender the handwritten input”, “Recognize the handwritten input”, “Derender and recognize the handwritten input”, and so on.

[0091] The system can process the data characterizing the handwritten input to generate an input encoding of the handwritten input (step 606). In particular, the system can process the data characterizing the handwritten input using a task processing neural network to generate the input encoding of the handwritten input.

[0092] For example, the task processing neural network can include a handwritten stroke encoder configured to process input stroke data characterizing the handwritten input to generate an input stroke encoding representing each of the handwritten strokes of the handwritten input. The input stroke encoding can be, e.g., one or more encoding vectors of numerical values, each encoding vector representing a respective portion of the input stroke data. In particular, as described in more detail above with reference to FIG. 5, the input stroke encoding can be a sequence of handwritten stroke tokens representing the handwritten input, with each token representing (e.g., encoding) a respective property of an associated handwritten stroke within the handwritten input. The task processing neural network can include the input stroke encoding within the input encoding of the handwritten input.

[0093] As another example, the task processing neural network can include an image processing neural network configured to process input image data characterizing the handwritten input to generate an input image encoding representing the image of the handwritten input. The input image encoding can be, e.g., one or more encoding vectors of numerical values, each encoding vector representing a respective portion of the input image data. In particular, the input image encoding can be a sequence of image tokens representing the input image of the handwritten input. The task processing neural network can include the input image encoding within the input encoding of the handwritten input.

[0094] The image processing neural network can have any appropriate neural network architecture for processing the input image data to generate the input image encoding. For example, the image processing neural network can include any of a variety of processing layers (e.g., convolutional layers, recurrent layers, attention layers, etc.) in any combination appropriate for processing the input image data to generate the input image encoding. In particular, the image processing neural network can be a token processing neural network (e.g., a vision transformer) configured to generate the input image encoding by applying a sequence of attention operations to an initial token sequence representing the input image of the handwritten input.

[0095] When the system receives an input prompt, the task processing neural network can generate an input prompt encoding representing the input prompt by processing the input prompt using a text processing neural network. The input prompt encoding can be, e.g., one or more encoding vectors of numerical values, each encoding vector representing a respective portion of the input prompt. In particular, the input prompt encoding can be a sequence of text tokens representing the input prompt. The text processing neural network can have any appropriate network architecture for processing the input prompt to generate the input prompt encoding, e.g., a transformer network, a text embedding network, and so on. The task processing neural network can include the input prompt encoding within the input encoding of the handwritten input.

[0096] The system can process the input encoding of the handwritten input to generate the output digital ink encoding of the handwritten input (step 608). The digital ink encoding of the handwritten input can include a sequence of handwritten stroke tokens representing the handwritten strokes of the handwritten input.

[0097] For each handwritten stroke of the handwritten input, the output sequence of handwritten stroke tokens can include one or more handwritten stroke tokens characterizing the handwritten stroke. In particular, as described in more detail above with reference to FIG. 5, each handwritten stroke token for a handwritten stroke can characterize a respective property of a point along a path (e.g., a curve) of the handwritten stroke. For example, for each handwritten stroke, the sequence of handwritten stroke tokens can include tokens characterizing spatial coordinates (e.g., X positions, Y positions, etc.) of associated points along the path (e.g., curve) of the handwritten stroke, geometrical properties (e.g., curvatures, stroke widths, etc.) at associated points along the path (e.g., curve) of the handwritten stroke, and properties of an input device (e.g., pressures of the input device, velocities of the input device, angles and / or orientations of the input device, etc.) creating the handwritten stroke at points along the path (e.g., curve) of the handwritten stroke.

[0098] The system can generate the digital ink encoding by processing the input encoding of the handwritten input using a token generating neural network. The token generating neural network can have any appropriate neural network architecture for generating the output sequence of handwritten stroke tokens. For example, the token generating neural network can include any of a variety of processing layers (e.g., convolutional layers, recurrent layers, attention layers, etc.) in any combination appropriate for generating the output sequence of handwritten stroke tokens. In particular, the token generating neural network can be a token processing neural network (e.g., a language model, a vision language model, etc.) configured to apply a sequence of attention operations to an input token sequence (e.g., including handwritten stroke tokens generated by the handwritten stroke encoder, image tokens generated by the image processing neural network, etc.) to generate the output sequence of handwritten stroke tokens.

[0099] In some implementations, the token generating neural network can be configured to generate an output sequence of content tokens that characterize a content of the handwritten input. For example, when the handwritten input includes handwritten text, the token generating neural network can generate an output sequence of content tokens that includes one or more text tokens representing one or more characters or words within the handwritten text. As another example, when the handwritten input includes handwritten symbols, the token generating neural network can generate an output sequence of content tokens that includes one or more content tokens representing classifications of the handwritten symbols (e.g., classifying the handwritten symbols as representing lines, arrows, shapes, brackets, drawings, etc.).

[0100] The token generating neural network can be configured (e.g., trained) to receive an input prompt encoding characterizing a request to perform a particular processing task and generate a network output to perform the particular processing task. For example, when the system receives an input prompt representing a request to derender input image data depicting a handwritten input (e.g., a request to generate a sequence of handwritten stroke tokens representing the handwritten input), the token generating neural network can generate the sequence of handwritten stroke tokens for the handwritten input in response to the input prompt. As another example, when the system receives an input prompt representing a request to determine the content of the handwritten input, the token generating neural network can generate the sequence of content tokens for the handwritten input in response to the input prompt.

[0101] The token generating neural network can be configured to process the input prompt encoding by any of a variety of means. As an example, the token generating neural network can be configured to process a concatenation of the input prompt encoding and an encoding of the handwritten input (e.g., an input stroke encoding or an input image encoding for the handwritten input) as a network input as part of generating the digital ink encoding of the handwritten input. As a further example, when the token generating neural network is a token processing neural network, the token generating neural network can process an input token sequence formed by concatenating respective token sequences of the input prompt encoding and of the encoding for the handwritten input (e.g., the input stroke encoding or the input image encoding for the handwritten input). As another example, the token generating neural network can be configured to conditionally process the encoding of the handwritten input based on the input prompt by receiving and processing the input prompt encoding as a conditioning input. For example, when the token generating neural network is a token processing neural network, the token generating neural network can include one or more cross-attention layers configured to perform cross-attention operations using the input prompt encoding.

[0102] The token generating neural network can be trained using any appropriate machine learning technique. In particular, the token generating neural network can be jointly trained with the task processing neural network. When the task processing neural network is configured to process input prompts characterizing requests to perform particular processing tasks as part of generating input encodings for the token generating neural network, the task processing neural network and the token generating neural network can be trained using training data that includes examples for multiple different processing tasks (e.g., by including training examples with example input prompts for the multiple different processing tasks), as described in more detail below with reference to FIG. 7.
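
The concatenation option described above, in which prompt tokens and handwriting tokens form one input token sequence, can be sketched as follows. This is a minimal illustration only: the function name, the prompt-first ordering, and the optional separator token are assumptions made here, since the specification does not fix the order or mention a separator.

```python
from typing import List, Optional, Sequence


def build_input_sequence(
    prompt_tokens: Sequence[int],
    handwriting_tokens: Sequence[int],
    separator: Optional[int] = None,
) -> List[int]:
    """Concatenate prompt tokens and handwriting tokens into one sequence.

    The prompt is placed first (an assumption). If a separator token ID is
    given (hypothetical), it is inserted between the two parts.
    """
    sequence = list(prompt_tokens)
    if separator is not None:
        sequence.append(separator)
    sequence.extend(handwriting_tokens)
    return sequence


plain = build_input_sequence([1, 2], [10, 11])
with_sep = build_input_sequence([1, 2], [10, 11], separator=0)
```

The cross-attention alternative differs in that the prompt encoding is not placed in the input sequence at all; it is instead supplied as a separate conditioning input that the network attends to.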

[0103] The digital ink encodings generated by the system can be used to perform any of a variety of handwriting processing and recognition tasks. In particular, when the system is part of a digital ink processing system (e.g., the digital ink processing system 102 of FIG. 1), the digital ink processing system can process digital ink encodings of handwritten user inputs as part of creating and modifying digital content items.

[0104] FIG. 7 is a flow diagram of an example process for training a handwriting encoding system. For convenience, the process 700 will be described as being performed by a system of one or more computers located in one or more locations. For example, a digital ink processing system, e.g., the digital ink processing system 102 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 700.

[0105] The system can receive training data for a plurality of handwriting processing tasks (step 702). The training data can include a plurality of training examples, with each training example including: (i) an example handwritten input for the training example, (ii) an example input prompt for the training example representing a request to perform a particular processing task for the training example, and (iii) a target digital ink encoding for the training example to perform the particular processing task for the training example.
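
The three-part training example described above can be represented with a simple record type. The sketch below is illustrative only: the class name, field names, field types, and the helper that groups examples by prompt are hypothetical and are not part of this specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Point = Tuple[float, float]   # (x, y) position, assumed 2-dimensional
Stroke = List[Point]          # one handwritten stroke as sampled points


@dataclass
class TrainingExample:
    # (i) example handwritten input: stroke data or encoded image bytes
    handwritten_input: Union[List[Stroke], bytes]
    # (ii) example input prompt naming the processing task
    input_prompt: str
    # (iii) target digital ink encoding, here a token ID sequence (assumed)
    target_encoding: List[int]


def group_by_task(examples: List[TrainingExample]) -> Dict[str, List[TrainingExample]]:
    """Group examples by prompt text, as a stand-in for grouping by task."""
    groups: Dict[str, List[TrainingExample]] = {}
    for example in examples:
        groups.setdefault(example.input_prompt, []).append(example)
    return groups


examples = [
    TrainingExample([[(0.0, 0.0), (1.0, 1.0)]], "Derender the handwritten input", [4, 0, 1, 2]),
    TrainingExample(b"image-bytes", "Recognize the handwritten input", [7]),
]
groups = group_by_task(examples)
```

Keeping the prompt inside each example lets one dataset mix derendering and recognition examples, which is what allows a single system to be trained on several tasks at once.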

[0106] The system can train the handwriting encoding system over a sequence of training iterations. At each training iteration, the system can perform steps 704 through 710.

[0107] The system can process the example handwritten inputs of one or more training examples for the training iteration using the handwriting encoding system to generate corresponding digital ink encodings for the training iteration (step 704). In particular, the system can generate the digital ink encodings for the training iteration using the handwriting encoding system following process 600 described in more detail above with reference to FIG. 6.

[0108] The system can evaluate an objective function for the handwriting encoding system using the generated digital ink encodings for the training iteration (step 706). The objective function for the handwriting encoding system can measure a reconstruction loss for each of the training examples for the training iteration.

[0109] In general, the reconstruction loss for each training example can measure a difference between the digital ink encoding generated by the handwriting encoding system and the target digital ink encoding for the training example. The reconstruction loss for each training example can include different loss terms depending on the processing task for the training example.

[0110] For example, when the processing task for a training example is to determine handwritten strokes within the handwritten input for the training example, the target digital ink encoding can characterize target handwritten strokes for the training example and the reconstruction loss can include an error calculated based on differences between the target handwritten strokes and handwritten strokes characterized by the digital ink encoding generated by the handwriting encoding system for the training example. As a further example, the reconstruction loss can include a Chamfer loss that measures a distance between points along the target handwritten strokes and points along the handwritten strokes characterized by the digital ink encoding generated by the handwriting encoding system for the training example, following:

[0111] Where { } is a set of points sampled along the handwritten strokes characterized by the digital ink encoding generated by the handwriting encoding system for the training example and {xj} is a set of points sampled along the target handwritten strokes for the training example.
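
The equation referenced in paragraph [0110] is not reproduced in this text, so the exact form used in the specification is unknown. As an illustration only, one common symmetric form of the Chamfer distance between a predicted point set $\{\hat{x}_i\}_{i=1}^{N}$ and a target point set $\{x_j\}_{j=1}^{M}$ is $\frac{1}{N}\sum_{i}\min_{j}\lVert \hat{x}_i - x_j\rVert + \frac{1}{M}\sum_{j}\min_{i}\lVert x_j - \hat{x}_i\rVert$. The symbol $\hat{x}_i$, the averaging, and the use of unsquared Euclidean distance are assumptions made here; the original may differ (e.g., squared distances or sums instead of means). A minimal sketch of this form:

```python
import math


def chamfer_distance(predicted, target):
    """Symmetric Chamfer distance between two non-empty point sets.

    Uses mean nearest-neighbor Euclidean distance in each direction,
    which is one common variant and an assumption of this sketch.
    """
    if not predicted or not target:
        raise ValueError("both point sets must be non-empty")

    def mean_min_distance(source, reference):
        # For each source point, distance to its nearest reference point.
        return sum(
            min(math.dist(p, q) for q in reference) for p in source
        ) / len(source)

    return mean_min_distance(predicted, target) + mean_min_distance(target, predicted)


same = chamfer_distance([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 1.0)])
far = chamfer_distance([(0.0, 0.0)], [(3.0, 4.0)])
```

Because each point is matched to its nearest neighbor in the other set, the measure does not require the two strokes to be sampled with the same number of points. This brute-force version costs O(N·M) distance evaluations.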

[0112] As another example, the target digital ink encoding for the training example can include a target sequence of handwritten stroke tokens for the training example and the reconstruction loss for the training example can measure a likelihood of the handwriting encoding system generating the target sequence of handwritten stroke tokens for the training example when processing the example handwritten input for the training example.

[0113] As another example, when the processing task for the training example is to determine content of the handwritten input for the training example, the target digital ink encoding for the training example can include a target sequence of content tokens for the training example and the reconstruction loss for the training example can measure a likelihood of the handwriting encoding system generating the target sequence of content tokens for the training example when processing the example handwritten input for the training example.
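A likelihood-based reconstruction loss of this kind is often implemented as a negative log-likelihood summed over the target token sequence. The following minimal sketch assumes the handwriting encoding system exposes one probability distribution per output step; the function name and the token names (`pen_down`, `move`) are hypothetical placeholders.

```python
import math


def sequence_nll(step_probabilities, target_tokens):
    """Negative log-likelihood of a target token sequence.

    step_probabilities[i] maps each candidate token to the probability the
    model assigns to it at step i. Lower values mean the target sequence is
    more likely under the model.
    """
    total = 0.0
    for probabilities, token in zip(step_probabilities, target_tokens):
        total -= math.log(probabilities[token])
    return total


# Toy per-step distributions over two hypothetical stroke tokens.
step_probabilities = [
    {"pen_down": 0.5, "move": 0.5},
    {"pen_down": 0.25, "move": 0.75},
]
nll = sequence_nll(step_probabilities, ["pen_down", "move"])
```

The same function applies whether the tokens represent handwritten strokes or recognized content, since only the target sequence changes.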

[0114] The system can update parameters of the handwriting encoding system to optimize the objective function (step 708). The system can update the parameters of the handwriting encoding system using any appropriate machine learning technique. For example, the system can determine gradients of the objective function with respect to the parameters of the handwriting encoding system and can determine updates for the parameters using, e.g., stochastic gradient descent, ADAM, and so on.

[0115] The system can determine whether the training is complete (step 710). The system can use any of a variety of criteria to determine whether the training is complete. For example, the system can determine that training is complete after a pre-determined number of training iterations. As another example, the system can determine that training is complete when a value of the objective function falls below a pre-determined threshold. As another example, the system can determine that training is complete when a difference between values of the objective function for the current training iteration and a previous training iteration falls below a pre-determined threshold.

[0116] If the system determines that training is not complete, the system can continue to a next training iteration (e.g., return to step 704).

[0117] When the system determines that training is complete, the system can provide the trained handwriting encoding system (step 712).
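The iteration structure of steps 704 through 712, including the three example stopping criteria, can be sketched as below. This is a toy illustration: a one-parameter quadratic objective stands in for the reconstruction loss, plain gradient descent stands in for the optimizer, and all names and threshold values are assumptions.

```python
def training_complete(losses, max_iterations=1000,
                      loss_threshold=1e-3, delta_threshold=1e-6):
    """Return True when any of the three example stopping criteria is met."""
    if len(losses) >= max_iterations:           # fixed number of iterations
        return True
    if losses and losses[-1] < loss_threshold:  # objective below a threshold
        return True
    if len(losses) >= 2 and abs(losses[-2] - losses[-1]) < delta_threshold:
        return True                             # objective has stopped changing
    return False


def train(initial_parameter, learning_rate=0.1):
    """Toy loop: evaluate the objective, update the parameter, check completion."""
    parameter, losses = initial_parameter, []
    while not training_complete(losses):
        losses.append(parameter ** 2)               # evaluate objective (step 706)
        parameter -= learning_rate * 2 * parameter  # gradient update (step 708)
    return parameter, losses                        # provide trained model (step 712)


trained_parameter, loss_history = train(1.0)
```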

[0118] FIG. 8 illustrates an example hierarchical encoding 802 for a digital content item 804. For illustrative purposes, the digital content item 804 of FIG. 8 is a digital note that includes an image and handwritten text. In general, hierarchical encodings can be generated for any of a variety of digital content items, such as, e.g., digital notes, images, text documents, maps, and so on.

[0119] In general, the hierarchical encoding 802 identifies elements (e.g., objects) of the digital content item 804 and specifies relationships between the elements of the digital content item 804. The hierarchical encoding 802 can identify each element of the digital content item 804 as either a base object (e.g., an image, a handwritten stroke, a text character, and so on) or as a particularly formatted collection of other elements (e.g., a word, a handwritten symbol, a text line, a text block, a list, and so on). The hierarchical encoding 802 can identify various relationships between the elements of the digital content item, e.g., identifying one element as containing another element, annotating another element, pointing to another element, and so on. When the hierarchical encoding 802 identifies a first element of the digital content item as containing a second element, the hierarchical encoding 802 can identify any elements contained by the second element as themselves being contained by the first element. In some implementations, the hierarchical encoding 802 can identify each element of the digital content item as being contained by, at most, one other element of the digital content item.
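The two containment properties described above (containment is transitive, and each element has at most one container) can be illustrated with a small tree structure. This is a sketch under assumed names; the `Element` class and its methods are hypothetical and are not part of the described system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Element:
    """An element of a hierarchical encoding with at most one container."""
    kind: str
    parent: Optional["Element"] = None
    children: list = field(default_factory=list)

    def add(self, child):
        # Enforce the "contained by at most one other element" rule.
        if child.parent is not None:
            raise ValueError("element already has a container")
        child.parent = self
        self.children.append(child)
        return child

    def contains(self, other):
        # Containment is transitive: walk up the chain of containers.
        node = other.parent
        while node is not None:
            if node is self:
                return True
            node = node.parent
        return False


block = Element("TEXTBLOCK")
line = block.add(Element("TEXTLINE"))
word = line.add(Element("WORD"))
```

Here `block.contains(word)` holds even though the word was only added to the line, which mirrors the rule that elements contained by the second element are also contained by the first.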

[0120] For example, as illustrated in FIG. 8, the hierarchical encoding 802 for the digital content item 804 identifies an image and various elements composed of handwritten strokes within the digital content item 804. In particular, the hierarchical encoding 802 identifies words formed by collections of handwritten strokes, lines of handwritten text formed by collections of words, and blocks of handwritten text formed by collections of lines of handwritten text. Additionally, the hierarchical encoding 802 for the digital content item 804 identifies a list that includes multiple lines of handwritten text (e.g., “Camping”, “Picnic”, “Hiking”) as list elements. The hierarchical encoding 802 also identifies handwritten arrows linking different elements of the digital content item (e.g., an arrow pointing from the word “Weather” in the text block “Weather can be uncertain” to the word “Camping” of the identified list).

[0121] The hierarchical representation 802 can follow a particular hierarchical representation format that specifies a set of rules for the hierarchical representation 802 of the digital content item 804. The particular hierarchical representation format for the hierarchical representation 802 can define a set of syntactic rules that determine the types of elements included within the hierarchical representation 802 and the allowed relationships between the elements of the hierarchical representation. As an example, the hierarchical encoding 802 can follow the particular hierarchical representation format listed below in Table 1.

TYPE                         DESCRIPTION
[ELEMENT] WORD               A textual word that contains characters.
[ELEMENT] BULLET             A bullet symbol.
[ELEMENT] TEXTLINE           A line of text containing words or other symbols.
[ELEMENT] TEXTBLOCK          A block of text containing textlines.
[ELEMENT] LIST               A list of elements (e.g., a list of textlines).
[ELEMENT] ANNOTATION         Strokes annotating other elements (e.g., strikethroughs, underlines, etc.).
[ELEMENT] ARROW              A connector that connects to elements.
[ELEMENT] DRAWING            A non-text sketch made of handwritten strokes.
[ELEMENT] IMAGE              A digital image.
[RELATIONSHIP] CONTAINMENT   A relationship between a parent element and a child element.
[RELATIONSHIP] ANNOTATING    A relationship between an annotation and an annotated element.
[RELATIONSHIP] POINTING      A relationship describing an arrow pointing to/from another element.

Table 1: An example hierarchical representation format.
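One way to picture how such a format acts as a set of syntactic rules is a small validity check over the element and relationship types of Table 1. The sketch below is illustrative only: the `ALLOWED_CHILDREN` map is an assumption inferred from the descriptions in Table 1 (for example, that a TEXTBLOCK contains TEXTLINEs), and the function name `is_valid` is hypothetical.

```python
ELEMENT_TYPES = {
    "WORD", "BULLET", "TEXTLINE", "TEXTBLOCK", "LIST",
    "ANNOTATION", "ARROW", "DRAWING", "IMAGE",
}
RELATIONSHIP_TYPES = {"CONTAINMENT", "ANNOTATING", "POINTING"}

# Assumed containment rules, inferred from the Table 1 descriptions.
ALLOWED_CHILDREN = {
    "TEXTLINE": {"WORD", "BULLET"},
    "TEXTBLOCK": {"TEXTLINE"},
    "LIST": {"TEXTLINE"},
}


def is_valid(relationship, source_type, target_type):
    """Check one relationship between two element types against the format."""
    if relationship not in RELATIONSHIP_TYPES:
        return False
    if source_type not in ELEMENT_TYPES or target_type not in ELEMENT_TYPES:
        return False
    if relationship == "CONTAINMENT":
        return target_type in ALLOWED_CHILDREN.get(source_type, set())
    if relationship == "ANNOTATING":
        return source_type == "ANNOTATION"
    # POINTING: the source of the relationship is an arrow element.
    return source_type == "ARROW"


word_in_line = is_valid("CONTAINMENT", "TEXTLINE", "WORD")
```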

[0122] For the purpose of illustrating the hierarchical encoding 802, the example hierarchical representation format listed in Table 1 describes elements and relationships relevant to digital notes of handwritten text. However, hierarchical representation formats for hierarchical encodings of digital content items can include elements and relationships not listed in Table 1. In particular, hierarchical representation formats can include elements and relationships relevant to creating and modifying other types of digital content items (e.g., text documents, digital maps, and so on).

[0123] As an example, a hierarchical representation format for text documents can include separate elements representing typed text (e.g., elements representing typed words, typed lines of text, typed blocks of text, etc.) and representing handwritten text (e.g., handwritten characters, handwritten words, handwritten lines of text, handwritten blocks of text, etc.). By including separate elements for typed text and handwritten text, a hierarchical representation format for a text document can explicitly differentiate between an element of text (e.g., typed text) that represents text content of the text document and an element of text (e.g., handwritten text) that represents an instruction from a user to modify the text document.

[0124] As another example, a hierarchical representation format for images can include elements representing hand drawn shapes (e.g., shapes drawn by a user that can specify particular regions of the images) and elements representing spatial regions of the images (e.g., as enclosed by a handwritten shape). By including elements representing hand drawn shapes and elements representing spatial regions of images, a hierarchical representation format for an image can identify regions of the image selected by a user for processing and modification.

[0125] As another example, a hierarchical representation format for a digital map can include respective elements representing geolocated points, paths, regions, and so on within the digital map. By including elements for geolocated features of digital maps, a hierarchical representation format for a digital map can specify that a hand drawn symbol represents a particular location within the map, that a hand drawn line represents a particular path or route within the map, that a hand drawn shape represents a particular spatial region within the map, and so on.

[0126] In some implementations, the hierarchical representation format for a digital content item can include elements representing user instructions or commands. For example, when the digital content item is a digital note that includes handwritten text, the hierarchical representation format can include separate elements representing handwritten content of the digital note and handwritten instructions for modifying the digital note. In other implementations, e.g., when the digital content item does not include handwritten content, each element of the hierarchical representation format for the digital content item that includes handwritten content can automatically represent a user instruction or a portion of a user instruction.

[0127] The hierarchical representation format for a digital content item can define allowed modifications for the digital content item. The syntactic rules of the hierarchical representation format (e.g., the elements and relationships defined by the hierarchical representation format) can determine the allowed modifications for the digital content item. For example, when the hierarchical representation format of a digital content item includes elements representing handwritten text that are defined to only include handwritten strokes or other elements representing handwritten text, the hierarchical representation format can restrict modifications of handwritten text elements of the digital content item to those that modify a textual content of the handwritten text (e.g., text insertion, text deletion, spelling correction, grammar correction, etc.) or a display style of the handwritten text (e.g., resizing, restyling using handwriting synthesis, etc.). As another example, when the hierarchical representation format of a digital content item includes elements representing containers of handwritten text (e.g., lists of handwritten elements, tables of handwritten elements, etc.), the hierarchical representation format can restrict modifications of the containers of handwritten text within the digital content item to those that, e.g., add handwritten elements to the containers, remove handwritten elements from the containers, rearrange handwritten elements within the containers, modify a display style of the containers (e.g., element spacing, element alignment, etc.), and so on.

[0128] Any of a variety of systems can use the hierarchical representation format for a digital content item as part of updating the digital content item or generating candidate modifications to the digital content item. For example, a user interface system (e.g., the user interface system 108 of FIG. 1) can process user inputs (e.g., text inputs, mouse inputs, gestures, and so on) to apply corresponding modifications to the digital content item as allowed by the hierarchical representation format for the digital content item. As another example, a processing system (e.g., the digital ink processing system 102 of FIG. 1) can apply modifications to the digital content item or suggest candidate modifications for the digital content item as allowed by the hierarchical representation format for the digital content item.

[0129] In particular, a digital ink processing neural network (e.g., the digital ink processing neural network 204 of FIG. 2) can leverage the hierarchical representation format for a digital content item to more efficiently process handwritten user inputs to determine appropriate updates or candidate modifications to the digital content item. For example, the digital ink processing neural network can be configured to perform a fixed set of processing operations (e.g., using a set of sub-networks of the digital ink processing neural network, a set of external APIs that can be called by the digital ink processing neural network, and so on) to process elements of the digital content item. Each of the fixed set of processing operations can perform modifications of respective elements of the digital content item that are allowed by the hierarchical representation format for the digital content item. As part of processing a handwritten user input for a digital content item to update the digital content item, the digital ink processing neural network can identify any elements of the hierarchical representation of the digital content item that require modification, select operations from the fixed set of processing operations that are required to update the digital content item in accordance with the handwritten input, and perform the selected operations from the set of processing operations to update the digital content item.
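The idea of a fixed set of operations, each restricted to the element types for which the format allows it, can be sketched as a simple registry keyed by element type. Everything in this sketch is hypothetical: the operation names, the dictionary-based element representation, and the `apply_operation` helper are illustrative stand-ins for the sub-networks or external APIs mentioned above.

```python
# Hypothetical fixed registry: element type -> allowed operation -> implementation.
ALLOWED_OPERATIONS = {
    "WORD": {
        "replace_text": lambda element, text: {**element, "text": text},
    },
    "LIST": {
        "append_item": lambda element, item: {
            **element, "items": element["items"] + [item]},
        "remove_item": lambda element, item: {
            **element, "items": [i for i in element["items"] if i != item]},
    },
}


def apply_operation(element, name, argument):
    """Apply an operation only if the format allows it for this element type."""
    operations = ALLOWED_OPERATIONS.get(element["type"], {})
    if name not in operations:
        raise ValueError(f"{name!r} is not allowed for {element['type']}")
    # Operations return a new element and leave the input unchanged.
    return operations[name](element, argument)


activities = {"type": "LIST", "items": ["Camping", "Picnic"]}
updated = apply_operation(activities, "append_item", "Hiking")
```

Restricting the registry per element type means an interpreted instruction such as "fix the spelling" can never be routed to a container element, which is one way the format can narrow the search for candidate modifications.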

[0130] An encoding system (e.g., the content encoding system 206 of FIG. 2) can process the digital content item 804 to generate a graph characterizing the hierarchical encoding 802 for the digital content item 804. The graph characterizing the hierarchical encoding 802 can include a respective graph node representing each element specified by the hierarchical encoding 802 and can include a respective graph edge representing each relationship specified by the hierarchical encoding 802. In particular, as described in more detail below with reference to FIG. 9, the encoding system can generate the graph characterizing the hierarchical encoding 802 using a graph generating neural network.

[0131] FIG. 9 is a flow diagram of an example process for generating a hierarchical encoding for a digital content item using a graph generating neural network. For convenience, the process 900 will be described as being performed by a system of one or more computers located in one or more locations. For example, a content encoding system, e.g., the content encoding system 206 of FIG. 2, appropriately programmed in accordance with this specification, can perform the process 900.

[0132] The graph generating neural network can be configured (e.g., trained) to: (i) process input data characterizing the digital content item and, optionally, a handwritten input and (ii) produce an output graph characterizing the digital content item following a graph representation format. In particular, the output graph can be a hierarchical encoding of the digital content item following a hierarchical representation format for the digital content item, as described in more detail above with reference to FIG. 8. As described in more detail below with reference to FIG. 10, the graph generating network can be trained to reproduce ground-truth target graphs representing the contents of example digital content items based on training data that includes example input data characterizing the example digital content items.

[0133] The output graph can include: (i) graph nodes that each represent a respective element within the digital content item and (ii) graph edges that each connect a respective pair of graph nodes and characterize a relationship between the content elements characterized by the pair of graph nodes. For example, a first graph node can represent a word, a second graph node can represent a line of text, and a graph edge between the first and second graph nodes can represent that the line of text contains the word.

[0134] The output graph can associate (e.g., assign) additional data to each graph node that characterizes the contents of the content element represented by the node. For example, the graph representation format can assign additional data to a graph node representing a word in the document that specifies the text of the word.

[0135] As another example, the output graph can assign additional data to each graph node identifying a respective one of a plurality of node types for the graph node. For example, the plurality of graph node types in the output graph can include, e.g., a stroke type that represents individual handwritten strokes, an image type that represents digital images, node types representing collections of handwritten texts of certain forms (e.g., a word node type, a textline node type, a textblock node type, etc.), node types representing collections of non-textual handwritten symbols of certain forms (e.g., a drawing node type, an arrow node type, a bullet point node type, etc.), node types representing collections of elements as organized in certain manners (e.g., a list node type, a table node type, etc.), and so on.

[0136] As another example, the output graph can assign additional data to each graph edge identifying a respective one of a plurality of edge types for the graph edge. For example, the plurality of graph edge types in the output graph can include, e.g., edge types representing containment relationships that characterize elements of the digital content item containing and/or being contained by other elements of the digital content item, edge types representing interaction relationships that characterize elements of the digital content item interacting with other elements of the digital content item in certain manners (e.g., pointing to, pointing from, annotating, etc.).

[0137] The node types within the output graph can be hierarchically organized into a sequence of hierarchical levels following containment relationships between the node types. The graph generating neural network can generate the output graph following the sequence of hierarchical levels (e.g., by sequentially grouping individual handwritten strokes to identify words, grouping handwritten words to identify lines of handwritten text, grouping lines of handwritten text to identify blocks of handwritten text, and so on).

[0138] The graph generating neural network can include a sequence of graph neural network layers. Each graph neural network layer can be associated with a hierarchical level of the output graph and configured to perform graph operations to process an input graph for the layer and generate an output graph for the layer for the hierarchical level of the output graph associated with the graph neural network layer. In particular, the graph generating neural network can include one or more clustering layers that are configured to generate graph nodes of higher hierarchical levels of the output graph by grouping (e.g., clustering) graph nodes from lower hierarchical levels of the output graph.

[0139] The example process 900 for generating the output graph for a digital content item using the graph generating neural network is described next.

[0140] The system can receive data characterizing a digital content item (step 902). The data characterizing the digital content item can identify base elements of the digital content item (e.g., handwritten strokes, handwritten words, typed characters, typed words, images, and so on) and can specify geometrical properties of the base elements, such as positions and extents of the base elements within the digital content item.

[0141] The system can optionally receive data characterizing a handwritten input (step 904). For example, the system can receive a digital ink encoding of the handwritten input as generated by a handwriting encoding system (e.g., the handwriting encoding system 202 of FIG. 2). In general, the data characterizing the handwritten input can identify base elements of the handwritten input (e.g., handwritten strokes, handwritten words, etc.) and can specify geometrical properties of the base elements of the handwritten input, such as positions and extents of the base elements within the handwritten input. The handwritten input can be generated as an overlay of the digital content item (e.g., by a user drawing or writing on a representation of the digital content item) and the data characterizing the handwritten input can specify geometrical properties of the base elements of the handwritten input relative to the digital content item, such as positions and extents of the base elements of the handwritten input within the digital content item.

[0142] The system can generate an initial graph representing the base elements of the digital content item (step 906). In particular, the system can generate the initial graph to include graph nodes representing respective base elements of the digital content item. Each graph node of the initial graph can be a terminal node at a lowest hierarchical level of the output graph for the digital content item (e.g., each graph node of the initial graph can represent a base element of the digital content item or the handwritten input that does not contain other elements).

[0143] The system can associate (e.g., assign) data to the graph nodes of the initial graph characterizing the respective base elements of the digital content item represented by the graph nodes. In particular, the system can determine initial graph node embeddings for each graph node of the initial graph that characterize, e.g., content of the base element represented by the graph node, a classification of the base element represented by the graph node, a position within the digital content item of the base element represented by the graph node, an extent within the digital content item of the base element represented by the graph node, and so on.
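A minimal sketch of step 906 is shown below, assuming each base element arrives with a type label and an axis-aligned bounding box. The hand-built feature vector (center position and extent) is only a stand-in for the initial graph node embeddings, which in practice could be produced by learned components; the function name and field names are hypothetical.

```python
def build_initial_graph(base_elements):
    """Create one terminal (level 0) node per base element, with no edges yet."""
    nodes = []
    for index, element in enumerate(base_elements):
        x0, y0, x1, y1 = element["bbox"]
        nodes.append({
            "id": index,
            "type": element["type"],  # classification of the base element
            "level": 0,               # lowest hierarchical level
            # Stand-in embedding: center x, center y, width, height.
            "embedding": [(x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0],
        })
    return {"nodes": nodes, "edges": []}


initial_graph = build_initial_graph([
    {"type": "STROKE", "bbox": (0, 0, 2, 4)},
    {"type": "IMAGE", "bbox": (10, 10, 30, 20)},
])
```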

[0144] To determine text content of handwritten elements represented by graph nodes of the initial graph, the system can process handwritten elements of the digital content item (and, optionally, of the handwritten input) using any of a variety of text processing neural networks of the graph generating neural network.

[0145] As an example, the graph generating network can include a text segmentation network configured to process input data characterizing strokes representing handwritten text and segment the strokes into individual characters. The text segmentation network can have any appropriate network architecture (e.g., a Transformer-based network architecture) and can be trained to reproduce ground-truth segmentations for example handwritten inputs. As part of generating the initial graph, the system can process input data characterizing handwritten strokes (e.g., within the digital content item or, optionally, within the handwritten input) to produce one or more character segmentations of the handwritten strokes. The system can use the character segmentations generated by the text segmentation network to include geometric data characterizing handwritten text within the graph nodes representing handwritten text in the initial graph, such as data representing sizes of handwritten characters within the handwritten text, data characterizing word baselines for the handwritten text, and so on.

[0146] As another example, the graph generating network can include a text recognition network configured to process input data characterizing handwritten strokes representing handwritten text and produce text transcriptions of the handwritten text. The text recognition network can have any appropriate architecture (e.g., an LSTM network architecture) and can be trained to reproduce ground-truth target transcriptions for example handwritten text. In some implementations, as part of generating the hierarchical representation of the user document, the system can process the input data that characterizes handwritten strokes within the user document to produce one or more text transcriptions using the text recognition network and include text from the one or more produced text transcriptions within the hierarchical representation.

[0147] The system can then generate the output graph for the digital content item by processing and updating the initial graph using graph neural network layers associated with each of one or more hierarchical levels of the output graph.

[0148] In particular, for each hierarchical level of the output graph, the system can process a current graph using a clustering layer of the graph generating neural network for the hierarchical level to generate graph nodes for the hierarchical level of the output graph (step 908). For example, a first clustering layer of the graph generating neural network can cluster graph nodes representing handwritten strokes to generate graph nodes representing handwritten words, while subsequent clustering layers can, e.g., cluster graph nodes representing handwritten words to generate graph nodes representing handwritten lines of text, cluster graph nodes representing handwritten lines of text to generate graph nodes representing handwritten blocks of text, and so on. When the graph generating neural network generates a graph node for the output graph representing an element of the digital content item that contains other elements of the digital content item, the graph generating neural network can include graph edges between the graph node for the containing element and each graph node for the contained elements that represent the containment relationship between the elements. In some implementations, the system can process the current graph to identify additional relationships (e.g., pointing to, pointing from, annotating, etc.) between elements represented by graph nodes of the current graph and can include graph edges representing the identified relationships. As a particular example, the system can utilize certain heuristics to identify the additional relationships, such as by identifying the additional relationships when two elements of the digital content item are within a threshold distance and do not have a containment relationship between each other.
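The grouping and edge-creation pattern of step 908, together with the distance heuristic for additional relationships, can be sketched as follows. This is a simplified stand-in: a fixed horizontal-gap rule replaces the learned clustering layer, and the function names, identifiers, and thresholds are all assumptions.

```python
import math


def cluster_by_gap(strokes, max_gap):
    """Group strokes into words when the horizontal gap is at most max_gap."""
    groups = []
    for stroke in sorted(strokes, key=lambda s: s["x0"]):
        if groups and stroke["x0"] - max(s["x1"] for s in groups[-1]) <= max_gap:
            groups[-1].append(stroke)
        else:
            groups.append([stroke])
    return groups


def containment_edges(groups):
    """One CONTAINMENT edge from each new word node to each stroke it groups."""
    return [
        (f"word{index}", stroke["id"], "CONTAINMENT")
        for index, group in enumerate(groups)
        for stroke in group
    ]


def nearby_unrelated(centers, contained_pairs, threshold):
    """Pairs within a threshold distance that have no containment relationship."""
    ids = sorted(centers)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if (a, b) not in contained_pairs and (b, a) not in contained_pairs
        and math.dist(centers[a], centers[b]) <= threshold
    ]


strokes = [
    {"id": "s0", "x0": 0.0, "x1": 1.0},
    {"id": "s1", "x0": 1.2, "x1": 2.0},
    {"id": "s2", "x0": 5.0, "x1": 6.0},
]
groups = cluster_by_gap(strokes, max_gap=0.5)
edges = containment_edges(groups)
candidates = nearby_unrelated(
    {"arrow": (0.0, 0.0), "word0": (1.0, 0.0), "list": (9.0, 9.0)},
    contained_pairs={("list", "word0")},
    threshold=2.0,
)
```

Repeating the same group-then-connect step with the word nodes as input would yield line nodes, and so on up the hierarchy.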

[0149] For each hierarchical level of the output graph, the system can process the current graph using any graph processing layers of the graph generating neural network associated with the hierarchical level to update embeddings for the graph nodes and graph edges of the output graph (step 910). Each graph processing layer can, for example, perform a sequence of message passing operations to update the graph node and graph edge embeddings of the current graph. For example, the graph generating neural network can process the current graph using a graph processing layer for a hierarchical level of list elements to update a graph node embedding for a list of elements of the digital content item to represent, e.g., a position of the list, a spacing or alignment for the list, a combined content of the list, and so on based on the positions, extents, and contents of the elements of the list represented by the graph nodes for the elements of the list.
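As a rough illustration of the list example, a single child-to-parent aggregation step might look like the sketch below. Fixed, hand-written aggregation functions are used here in place of learned message passing, and the field names (`bbox`, `content`, `spacing`) and the function name are hypothetical.

```python
def update_list_node(list_node, item_nodes):
    """Aggregate positions, extents, and contents of list items into the list node."""
    boxes = [item["bbox"] for item in item_nodes]  # (x0, y0, x1, y1) per item
    updated = dict(list_node)
    # Position and extent of the list: union of the item bounding boxes.
    updated["bbox"] = (
        min(b[0] for b in boxes), min(b[1] for b in boxes),
        max(b[2] for b in boxes), max(b[3] for b in boxes),
    )
    # Combined content of the list.
    updated["content"] = [item["content"] for item in item_nodes]
    # Spacing: average vertical distance between consecutive item tops.
    tops = sorted(b[1] for b in boxes)
    gaps = [later - earlier for earlier, later in zip(tops, tops[1:])]
    updated["spacing"] = sum(gaps) / len(gaps) if gaps else 0.0
    return updated


items = [
    {"bbox": (0, 0, 5, 1), "content": "Camping"},
    {"bbox": (0, 2, 4, 3), "content": "Picnic"},
    {"bbox": (0, 4, 4, 5), "content": "Hiking"},
]
list_node = update_list_node({"type": "LIST"}, items)
```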

[0150] After updating the graph using the graph generating neural network, the system can finally return the generated output graph for the digital content item (step 912).

[0151] FIG. 10 is a flow diagram of an example process for training a graph generating neural network to generate hierarchical encodings for digital content items. For convenience, the process 1000 will be described as being performed by a system of one or more computers located in one or more locations. For example, a digital ink processing system, e.g., the digital ink processing system 102 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 1000.

[0152] The system can receive training data that includes target graph encodings for a plurality of example digital content items (step 1002). The training data can include a plurality of training examples, with each training example including data characterizing: (i) an example digital content item for the training example and (ii) a target graph encoding representing the example digital content item for the training example. Some or all of the plurality of training examples can include data characterizing respective example handwritten inputs. In particular, some or all of the training examples can include digital ink encodings for respective example handwritten inputs as generated by processing the example handwritten inputs using a handwriting encoding system (e.g., by processing the example handwritten inputs using the handwriting encoding system 202 of FIG. 2 following the process 600 described in FIG. 6). When a training example includes an example handwritten input, the target graph encoding for the training example can jointly represent both the example digital content item for the training example and the example handwritten input for the training example.

[0153] The system can train the graph generating neural network over a sequence of training iterations. At each training iteration, the system can perform steps 1004 through 1010.

[0154] The system can process the example digital content items of one or more training examples for the training iteration using the graph generating neural network to generate corresponding graph encodings for the example digital content items for the training iteration (step 1004). In particular, for each example digital content item, the system can generate each hierarchical level of the graph encoding for the example digital content item by processing graph data from the target graph encoding for the digital content item for previous hierarchical levels of the graph encoding (e.g., following steps 908 and 910 of the process 900 described above with reference to FIG. 9).

[0155] The system can evaluate an objective function for the graph generating neural network that measures an error between the target graph encodings and the generated graph encodings for the example digital content items for the training example (step 1006). In particular, the objective function for the graph generating neural network can measure an error between corresponding graph node embeddings and graph edge embeddings of the target and generated graph encodings for the example digital content items for the training example. The graph node embeddings and the graph edge embeddings can include feature vectors representing any of a variety of continuous features (e.g., positions, sizes, extents, etc.) and categorical features (e.g., classifications, labels, etc.) of the graph nodes and graph edges, and the objective function for the graph generating neural network can measure any appropriate error between each corresponding feature vector within the generated and target graph encodings. For example, the objective function for the graph generating neural network can measure a regression loss (e.g., an L2 loss) between corresponding continuous feature vectors and a classification loss (e.g., a cross-entropy loss) between corresponding categorical feature vectors of the target and generated graph encodings.
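A combined objective of this kind can be sketched as the sum of an L2 term over continuous features and a cross-entropy term over categorical features. The sketch assumes generated and target nodes are already matched one-to-one and uses hypothetical field names (`position`, `type_probs`, `type_index`); edge terms would follow the same pattern.

```python
import math


def l2_loss(predicted, target):
    """Squared-error regression loss between two continuous feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def cross_entropy(predicted_probabilities, target_index):
    """Classification loss for one categorical feature."""
    return -math.log(predicted_probabilities[target_index])


def graph_loss(generated_nodes, target_nodes):
    """Sum regression and classification losses over matched node pairs."""
    total = 0.0
    for generated, target in zip(generated_nodes, target_nodes):
        total += l2_loss(generated["position"], target["position"])
        total += cross_entropy(generated["type_probs"], target["type_index"])
    return total


generated = [{"position": [1.0, 2.0], "type_probs": [0.5, 0.5]}]
target = [{"position": [1.0, 4.0], "type_index": 0}]
objective = graph_loss(generated, target)
```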

[0156] The system can update parameters of the graph generating neural network to optimize the objective function for the graph generating neural network (step 1008). The system can update the parameters of the graph generating neural network using any appropriate machine learning technique. For example, the system can determine gradients of the objective function with respect to the parameters of the graph generating neural network and can determine updates for the parameters using, e.g., stochastic gradient descent, ADAM, and so on.

[0157] The system can determine whether the training is complete (step 1010). The system can use any of a variety of criteria to determine whether the training is complete. For example, the system can determine that training is complete after a pre-determined number of training iterations. As another example, the system can determine that training is complete when a value of the objective function falls below a pre-determined threshold. As another example, the system can determine that training is complete when a difference between values of the objective function for the current training iteration and a previous training iteration falls below a pre-determined threshold.

[0158] If the system determines that training is not complete, the system can continue to a next training iteration (e.g., return to step 1004).

[0159] When the system determines that training is complete, the system can provide the trained graph generating neural network (step 1012).
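
As an illustration only, the parameter update and the three example completion criteria described above (an iteration budget, an objective threshold, and a minimum change in the objective between iterations) can be sketched as a plain gradient descent loop. The function name, argument names, and default values are assumptions made for this sketch; plain gradient descent stands in for whichever optimizer is actually used.

```python
def train(params, grad_fn, loss_fn, lr=0.1, max_iters=1000,
          loss_threshold=None, min_delta=None):
    """Gradient descent with three optional stopping criteria.

    Returns (params, iterations_run, stop_reason). All names are illustrative.
    """
    prev_loss = None
    for it in range(1, max_iters + 1):
        grads = grad_fn(params)
        # Move each parameter against its gradient.
        params = [p - lr * g for p, g in zip(params, grads)]
        loss = loss_fn(params)
        # Criterion: objective value falls below a threshold.
        if loss_threshold is not None and loss < loss_threshold:
            return params, it, "loss_threshold"
        # Criterion: change in objective between iterations falls below a threshold.
        if (min_delta is not None and prev_loss is not None
                and abs(prev_loss - loss) < min_delta):
            return params, it, "min_delta"
        prev_loss = loss
    # Criterion: pre-determined number of iterations reached.
    return params, max_iters, "max_iters"
```

For example, calling `train([1.0], lambda p: [2 * x for x in p], lambda p: sum(x * x for x in p), loss_threshold=1e-3)` minimizes a simple quadratic and stops once the objective drops below the threshold.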

[0160] FIG. 11 is a flow diagram of an example process for generating content updates for a digital content item using a digital ink processing neural network. For convenience, the process 1100 will be described as being performed by a system of one or more computers located in one or more locations. For example, a digital ink processing neural network, e.g., the digital ink processing neural network 204 of FIG. 2, appropriately programmed in accordance with this specification, can perform the process 1100.

[0161] The system can receive data characterizing a hierarchical encoding for the digital content item (step 1102). For example, the hierarchical encoding of the digital content item can be generated by a content encoding system (e.g., by the content encoding system 206 of FIG. 2). In particular, the hierarchical encoding of the digital content item can be a graph encoding of the digital content item (e.g., as generated by a graph generating neural network following the process 600 described above with reference to FIG. 6).

[0162] In some implementations, the system can directly receive and process the hierarchical encoding of the digital content item. For example, the digital ink processing neural network can be a graph neural network that can directly receive and process a graph encoding of the digital content item. In other implementations, the system can receive and process a representation of the hierarchical encoding (e.g., a text sequence, a token sequence, and so on that specifies the hierarchical encoding). For example, the digital ink processing neural network can be a token processing neural network (e.g., a language model, a visual language model, and so on) that can receive and process a token sequence representing the hierarchical encoding of the digital content item.
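
As an illustration only, one way a hierarchical encoding could be flattened into a token sequence for a token processing neural network is a depth-first traversal that brackets each element with opening and closing tokens. The nested-dictionary layout and the tag-style tokens are assumptions made for this sketch; the specification does not prescribe a particular serialization.

```python
def to_tokens(node):
    """Depth-first serialization of a nested dict into a flat token list.

    `node` is a hypothetical dict of the form
    {"type": str, "text": optional str, "children": optional list of nodes}.
    """
    tokens = ["<" + node["type"] + ">"]
    if node.get("text"):
        # Whitespace splitting stands in for a real tokenizer.
        tokens.extend(node["text"].split())
    for child in node.get("children", []):
        tokens.extend(to_tokens(child))
    tokens.append("</" + node["type"] + ">")
    return tokens
```

Because every element is closed explicitly, the nesting can be recovered from the flat sequence, which is what allows the sequence to specify the hierarchy rather than only its text content.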

[0163] The hierarchical encoding can include instructions for performing particular processing tasks (e.g., text processing tasks, language processing tasks, image processing tasks, etc.) to update the digital content item. For example, the hierarchical encoding can jointly encode both the digital content item and a handwritten input representing a request to perform a particular processing task to update the digital content item.

[0164] In some implementations, the system can be configured to receive additional data characterizing a selected or requested processing task. For example, the hierarchical encoding of the digital content item can represent only the digital content item itself, and the system can receive instructions (e.g., from a user by way of a user interface) to perform a particular processing task to update the digital content item.

[0165] In some implementations, the system can process the received data to determine one or more processing systems to use as part of updating the digital content item (step 1104). For example, when the digital ink processing neural network is a token processing neural network, the digital ink processing neural network can be configured to process the received data to generate an output token sequence that identifies a processing system to use to update the digital content item and specifies an input that can be provided to the processing system to generate the update for the digital content item. As another example, when the digital ink processing neural network is a graph neural network, the digital ink processing neural network can be configured to process the hierarchical encoding of the digital content item to generate a graph that specifies which processing systems should be used to update the digital content item. For example, the digital ink processing neural network can generate graph node and edge embeddings for a graph encoding of the digital content item that identify particular processing systems to use to update the digital content item and elements (e.g., sub-graphs) of the graph encoding that can be processed by the identified processing systems to generate updates for the digital content item.

[0166] The processing systems can include any of a variety of external processing systems (e.g., that the system can call by an API) or processing systems included within the system.

[0167] For example, the processing systems can include a text generation model (e.g., a handwriting synthesis generative neural network). The text generation model can be configured (e.g., trained) to replicate example lines of text based on training data comprising the example lines of text. The text generation model can be trained to replicate example lines of handwritten text using training data that includes example lines of handwritten text. As a further example, the system can identify that the hierarchical encoding of the digital content item includes a request to generate handwritten text of a certain style and can determine that the text generation model should be used to process a particular input from the digital content item in order to generate the requested handwritten text.

[0168] As another example, the processing systems can include a language processing neural network (e.g., a language model trained to perform language processing tasks). The language processing neural network can be configured to perform, e.g., transcription, translation, text completion, text generation, spell checking, grammar checking, and so on. As a further example, the system can identify that the hierarchical encoding of the digital content item includes a request to perform a particular language processing task and can determine that the language processing neural network should be used to process a particular input from the digital content item in order to perform the requested language processing task.

[0169] As another example, the processing systems can include an image processing model (e.g., an image processing neural network trained to perform image processing tasks). The image processing model can be configured to perform, e.g., image generation, color correction, image resizing, image cropping, and so on. As a further example, the system can identify that the hierarchical encoding of the digital content item includes a request to perform a particular image processing task and can determine that the image processing model should be used to process a particular input from the digital content item in order to perform the requested image processing task.

[0170] As another example, the processing systems can include a geographical data processing model. The geographical data processing model can be configured to perform, e.g., route planning, travel time estimation, distance estimation, and so on. As a further example, the system can identify that the hierarchical encoding of the digital content item includes a request to perform a particular geographical data processing task and can determine that the geographical data processing model should be used to process a particular input from the digital content item in order to perform the requested geographical data processing task.

[0171] As another example, the system can be configured to interact with APIs of, e.g., user calendars, user shopping carts, and so on. For example, the system can identify that the hierarchical encoding of the digital content item includes items for a schedule and can determine that an API of a user’s calendar should be used to process a particular input from the digital content item in order to update the user’s calendar. As another example, the system can identify that the hierarchical encoding of the digital content item includes items for a shopping list and can determine that an API of a user’s shopping cart should be used to process a particular input from the digital content item in order to update the user’s shopping cart.
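
As an illustration only, routing an output token sequence to an identified processing system can be sketched as a lookup in a registry of callables. The registry, the system identifiers, the stand-in functions, and the assumed output layout (a system identifier followed by the input tokens) are all hypothetical; no real calendar or language model API is called.

```python
def spell_check(text):
    # Stand-in for a language processing system; only fixes one hard-coded typo.
    return text.replace("teh", "the")

def add_to_calendar(text):
    # Stand-in for a calendar API call; returns a description instead of calling anything.
    return "calendar event: " + text

# Hypothetical registry mapping system identifiers to callables.
PROCESSING_SYSTEMS = {"language": spell_check, "calendar": add_to_calendar}

def dispatch(output_tokens, registry=PROCESSING_SYSTEMS):
    """Route a model output of the assumed form [system_id, input_token, ...]."""
    if not output_tokens:
        raise ValueError("empty output token sequence")
    system_id = output_tokens[0]
    payload = " ".join(output_tokens[1:])
    if system_id not in registry:
        raise KeyError("unknown processing system: " + system_id)
    return registry[system_id](payload)
```

Keeping the processing systems behind a registry means that internal systems and external API wrappers can be added or swapped without changing how the neural network's output is interpreted.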

[0172] The system can generate content updates for the digital content item by processing the received data (step 1106). In general, the system can generate content updates for the digital content item by adding, removing, or modifying elements and relationships specified by the hierarchical encoding of the digital content item. For example, the system can generate data characterizing an updated hierarchical representation of the digital content item. This enables the system to generate or select content updates from a set of allowed candidate modifications of the digital content item (e.g., allowed by the hierarchical representation format for the digital content item, as described in more detail above with reference to FIG. 8).

[0173] The system can generate any of a variety of content updates for the digital content item. For example, the system can update the digital content item to include handwritten elements from a handwritten input received from a user. As another example, the system can update the digital content item by performing a processing task to update particular elements of the digital content item. For example, the system can generate and include text within the digital content item (e.g., rephrase text content in accordance with a handwritten instruction from a user, propose correct spellings for misspelled words within the digital content item, propose rewritten text for ungrammatical sentences in the digital content item). As another example, the system can generate and process images within the digital content item (e.g., generate image content, perform image filtering, perform image cropping, perform image resizing, etc. in accordance with a handwritten instruction from a user). As another example, the system can synthesize handwritten elements to include within the digital content item (e.g., by generating handwritten text using a text generation model, by cloning handwritten strokes from a user within the digital content item, by computing one or more Bezier curves for the synthesized handwritten elements, etc.). As another example, the system can resize and rearrange elements of the digital content item (e.g., the system can resize and rearrange elements of the digital content item following hand-drawn arrows and shapes from a user). As another example, the system can remove elements from the digital content item (e.g., remove elements of the digital content item that a user has crossed out).
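
As an illustration only, generating content updates by adding, removing, or modifying elements, restricted to a set of allowed actions, can be sketched as follows. The flat mapping from element identifiers to content, the update dictionary keys, and the set of allowed action names are assumptions made for this sketch; an actual hierarchical representation would also carry relationships between elements.

```python
ALLOWED_ACTIONS = {"add", "remove", "modify"}  # assumed set of permitted action types

def apply_update(elements, update):
    """Return a new element mapping with one update applied.

    `elements` maps element ids to content; `update` is a hypothetical dict
    with keys "action", "id", and (for add/modify) "content".
    """
    action = update["action"]
    if action not in ALLOWED_ACTIONS:
        raise ValueError("action not allowed: " + action)
    updated = dict(elements)  # copy so the original document is left untouched
    if action == "remove":
        updated.pop(update["id"], None)
    elif action == "add":
        if update["id"] in updated:
            raise ValueError("element already exists: " + update["id"])
        updated[update["id"]] = update["content"]
    else:  # "modify"
        if update["id"] not in updated:
            raise KeyError(update["id"])
        updated[update["id"]] = update["content"]
    return updated
```

Returning a new mapping rather than mutating the input keeps the original available, which is convenient when a proposed update is later shown to a user for approval before being saved.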

[0174] When the system identifies processing systems to use to generate content updates for the digital content item (e.g., following step 1104), the system can generate the content updates for the content item using the identified processing systems. In particular, the system can process a particular input from the digital content item using an identified processing system to generate a content update for the digital content item.

[0175] The system can finally output data characterizing the updated digital content item (step 1108). For example, the system can output an updated hierarchical encoding of the digital content item. As another example, the system can output a sequence of text or a sequence of tokens representing an updated hierarchical encoding of the digital content item.

[0176] In some implementations, the updated digital content item can be saved. In other implementations, the updated digital content item can be presented to a user for approval, e.g., by means of a user interface. For example, when the system proposes a corrected spelling for a word within the digital content item or a correction for an ungrammatical sentence within the digital content item, the proposal can be presented to the user for approval before storing the updated digital content item.

[0177] FIG. 12 is a flow diagram of an example process for training a digital ink processing neural network to generate content updates for a digital content item. For convenience, the process 1200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a digital ink processing system, e.g., the digital ink processing system 102 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 1200.

[0178] The system can receive training data that includes target network outputs specifying target updates for example digital content items (step 1202). The training data can include a plurality of training examples, with each training example including data characterizing: (i) a hierarchical encoding for an example digital content item for the training example and (ii) target network outputs specifying target updates for the example digital content item for the training example. The training examples can include hierarchical encodings for respective example digital content items as generated by processing the example digital content items using a content encoding system (e.g., by processing the example digital content items using the content encoding system 206 of FIG. 2 following the process 900 described with reference to FIG. 9).

[0179] As an example, when the digital ink processing neural network is a token processing network (e.g., a language model, a visual language model, etc.), each training example can include: (i) an example input sequence of tokens representing a hierarchical encoding of a digital content item for the training example and (ii) a target output token sequence that identifies particular processing systems and inputs to be provided to the identified processing systems to generate the target content updates for the example digital content item. As another example, when the digital ink processing neural network is a graph neural network, each training example can include: (i) a graph encoding of the example digital content item for the training example and (ii) a target output graph that includes graph node embeddings and graph edge embeddings that identify particular processing systems and input elements from the digital content item that should be processed by the identified processing systems to generate the target content updates for the example digital content item.

[0180] The system can train the digital ink processing neural network over a sequence of training iterations. At each training iteration, the system can perform steps 1204 through 1208.

[0181] The system can evaluate an objective function for the digital ink processing neural network that measures a likelihood of the digital ink processing neural network generating target network outputs from the training data (step 1204). As part of evaluating the objective function, the system can process one or more training examples using the digital ink processing neural network.

[0182] For example, when the digital ink processing neural network is a token processing network, the objective function can measure a likelihood of the digital ink processing neural network generating the target output token sequences for the training examples for the training iteration by processing corresponding example input token sequences for the training iteration.

[0183] As another example, when the digital ink processing neural network is a graph neural network, the system can measure an error (e.g., a cross-entropy loss for each node and edge embedding) between the target output graphs for the training iteration and corresponding output graphs generated by the digital ink processing neural network processing the example input graphs for the training iteration.
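
As an illustration only, for the token processing case the likelihood of a target output token sequence is commonly expressed as a negative log-likelihood summed over token positions, which training then minimizes. The per-position probability dictionaries below are stand-ins for real model outputs, and the function name is an assumption made for this sketch.

```python
import math

def sequence_nll(step_probs, target_tokens):
    """Negative log-likelihood of `target_tokens`.

    `step_probs[i]` is a dict mapping each vocabulary token to the probability
    the model assigns to it at position i (a stand-in for real model outputs).
    """
    if len(step_probs) != len(target_tokens):
        raise ValueError("one probability distribution is needed per target token")
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target_tokens))
```

Minimizing this quantity is equivalent to maximizing the likelihood of the target sequence, and it is the same per-position cross-entropy that the graph case applies to each node and edge embedding.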

[0184] The system can update parameters of the digital ink processing neural network to optimize the objective function for the digital ink processing neural network (step 1206). The system can update the parameters of the digital ink processing neural network using any appropriate machine learning technique. For example, the system can determine gradients of the objective function with respect to the parameters of the digital ink processing neural network and can determine updates for the parameters using, e.g., stochastic gradient descent, ADAM, and so on.

[0185] The system can determine whether the training is complete (step 1208). The system can use any of a variety of criteria to determine whether the training is complete. For example, the system can determine that training is complete after a pre-determined number of training iterations. As another example, the system can determine that training is complete when a value of the objective function falls below a pre-determined threshold. As another example, the system can determine that training is complete when a difference between values of the objective function for the current training iteration and a previous training iteration falls below a pre-determined threshold.

[0186] If the system determines that training is not complete, the system can continue to a next training iteration (e.g., return to step 1204).

[0187] When the system determines that training is complete, the system can provide the trained digital ink processing neural network (step 1210).

[0188] FIG. 13 illustrates creating and editing a user document 1302 by processing and interpreting handwritten inputs 1304-A, 1304-B, and 1304-C using a digital ink processing system.

[0189] The handwritten inputs 1304-A, 1304-B, and 1304-C can represent instructions for modifying portions of text within the user document 1302. The digital ink processing system can process and interpret the handwritten inputs 1304-A, 1304-B, and 1304-C to generate updates for the user document 1302 in accordance with the instructions represented by the handwritten inputs 1304-A, 1304-B, and 1304-C.

[0190] For example, the digital ink processing system can interpret a handwritten input (e.g., the handwritten input 1304-A as illustrated in FIG. 13) as representing an instruction to remove a particular portion of text from the user document 1302 and can generate an update to the user document 1302 that removes the particular portion of text. As another example, the digital ink processing system can interpret a handwritten input (e.g., the handwritten input 1304-B as illustrated in FIG. 13) as representing an instruction to insert a particular portion of text (e.g., handwritten text within the user input) within the user document 1302 and can generate an update to the user document 1302 that inserts the particular portion of text. As another example, the digital ink processing system can interpret a handwritten input (e.g., the handwritten input 1304-B as illustrated in FIG. 13) as representing an instruction to change a format of a particular portion of text within the user document 1302 and can generate an update to the user document 1302 that appropriately changes the format of the particular portion of text within the user document 1302. As another example, the digital ink processing system can interpret a handwritten input (e.g., the handwritten input 1304-C as illustrated in FIG. 13) as representing an instruction to move a particular portion of text within the user document 1302 and can generate an update to the user document 1302 that appropriately rearranges text within the user document 1302.
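
As an illustration only, once handwritten inputs have been interpreted as remove, insert, or move instructions, the resulting updates to plain text can be sketched with character offsets. The function names and the use of character offsets to identify a portion of text are assumptions made for this sketch; the specification does not state how a portion of text is addressed.

```python
def delete_span(text, start, end):
    """Remove text[start:end]."""
    return text[:start] + text[end:]

def insert_text(text, position, new_text):
    """Insert new_text before the character at `position`."""
    return text[:position] + new_text + text[position:]

def move_span(text, start, end, dest):
    """Move text[start:end] to index `dest` of the text that remains after removal."""
    span = text[start:end]
    return insert_text(delete_span(text, start, end), dest, span)
```

Expressing a move as a removal followed by an insertion keeps the set of primitive actions small, at the cost of requiring the destination to be given relative to the text after removal.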

[0191] In some implementations, the digital ink processing system can determine that a particular processing system is required to update the user document 1302 and can use the particular processing system to update the user document 1302. For example, when the digital ink processing system includes the particular processing system, the digital ink processing system can process data characterizing content from the user document 1302, the handwritten inputs, or both using the particular processing system as part of updating the user document 1302. As another example, the digital ink processing system can transmit data characterizing content from the user document 1302, the handwritten inputs, or both to the particular processing system (e.g., via an API of the particular processing system) and can receive outputs from the particular processing system as part of updating the user document 1302. As a particular example, the digital ink processing system can interpret a handwritten input (e.g., the handwritten input 1304-B as illustrated in FIG. 13) as representing an instruction to generate text to include within the user document 1302 and can call an API for a text generation neural network to generate the requested text for the user document 1302.

[0192] FIG. 14 illustrates creating and modifying a digital note 1402 for a user by processing and interpreting a handwritten input 1404 from the user using a digital ink processing system.

[0193] The digital note 1402 can include data of a variety of formats, such as text data, image data, digital ink data representing handwritten annotations, and so on. The digital ink processing system can process and interpret the handwritten input 1404 from the user to generate updates for the digital note 1402. In particular, the handwritten input 1404 can represent instructions to modify the digital note 1402 by, e.g., adding content to the digital note 1402, removing content, editing content, rearranging content, linking pieces of content, and so on, and the digital ink processing system can update the digital note 1402 in accordance with the instructions represented by the handwritten input 1404.

[0194] In some implementations, the digital ink processing system can determine that a particular processing system is required to update the digital note 1402 and can use the particular processing system to update the digital note 1402. For example, when the digital ink processing system includes the particular processing system, the digital ink processing system can process data characterizing content from the digital note 1402, the handwritten input 1404, or both using the particular processing system as part of updating the digital note 1402. As another example, the digital ink processing system can transmit data characterizing content from the digital note 1402, the handwritten input 1404, or both to the particular processing system (e.g., via an API of the particular processing system) and can receive outputs from the particular processing system as part of updating the digital note 1402.

[0195] For example, the digital ink processing system can interpret the handwritten input 1404 as representing handwritten text to include within the digital note 1402 and can process text of the handwritten input 1404 using, e.g., a spellchecking system to propose respellings of words within the handwritten input 1404, a grammar checking system to propose grammar corrections for text within the handwritten input 1404, and so on. As another example, the digital ink processing system can interpret the handwritten input 1404 as representing a request to generate handwritten text within the digital note 1402 (e.g., to restyle handwritten text within the digital note 1402) and can generate handwritten text to include within the digital note 1402 using a handwriting synthesis system. As another example, the digital ink processing system can interpret the handwritten input 1404 as including items for a calendar of the user and can call an API to add the items to the calendar. As another example, the digital ink processing system can interpret the handwritten input 1404 as including items for a shopping list of the user and can call an API to add the items to a digital shopping cart for the user.

[0196] FIG. 15 illustrates interacting with a web page 1502 by processing and interpreting a handwritten input 1504 using a digital ink processing system.

[0197] The handwritten input 1504 can represent instructions from a user for interacting with the web page 1502, and the digital ink processing system can interpret the handwritten input 1504 to process content of the web page 1502 and modify local data stored for the web page 1502. For example, the digital ink processing system can interpret the handwritten input 1504 as representing instructions to process content of the web page 1502 by, e.g., searching selected text using a search engine, saving images and text from the web page 1502, generating a text summary of some or all of the web page 1502, and so on, and can process the web page 1502 in accordance with the instructions represented by the handwritten input 1504. As another example, the digital ink processing system can interpret the handwritten input 1504 as representing digital notes (e.g., handwritten text, symbols, drawings, etc.) created by the user for the web page 1502 and can process the handwritten input 1504 to save a local copy of the digital note that, e.g., can be displayed (e.g., as an overlay for the web page) when the user accesses the web page 1502.

[0198] In some implementations, the digital ink processing system can determine that a particular processing system is required to interact with the web page 1502 as instructed by the handwritten input 1504 and can use the particular processing system to interact with the web page 1502. For example, when the digital ink processing system includes the particular processing system, the digital ink processing system can process data characterizing content from the web page 1502, the handwritten input 1504, or both using the particular processing system as part of interacting with the web page 1502. As another example, the digital ink processing system can transmit data characterizing content from the web page 1502, the handwritten input 1504, or both to the particular processing system (e.g., via an API of the particular processing system) and can receive outputs from the particular processing system as part of interacting with the web page 1502. As a particular example, the digital ink processing system can interpret the handwritten input 1504 as representing an instruction to generate a summary of the web page 1502 and can call an API for a language model to generate the requested summary of the web page 1502.

[0199] FIG. 16 illustrates editing an image 1602 by processing and interpreting handwritten inputs using a digital ink processing system.

[0200] The handwritten inputs (e.g., including the handwritten inputs 1604-A and 1604-B of FIG. 16) can represent instructions for modifying portions of the image, and the digital ink processing system can interpret the handwritten inputs to modify the image as directed by the handwritten inputs. For example, the digital ink processing system can interpret the handwritten inputs as representing annotations of the image identifying respective regions of the image (e.g., by circling the respective regions, pointing to the respective regions, etc.) and providing instructions to modify the respective regions of the image (e.g., by performing color correction, applying image processing filters, generating image content using a generative machine learning model, etc.) and can update the image in accordance with the instructions represented by the handwritten inputs.

[0201] In some implementations, the digital ink processing system can determine that a particular processing system is required to modify the image 1602 and can use the particular processing system to modify the image 1602. For example, when the digital ink processing system includes the particular processing system, the digital ink processing system can process data characterizing content from the image 1602, the handwritten inputs, or both using the particular processing system as part of modifying the image 1602. As another example, the digital ink processing system can transmit data characterizing content from the image 1602, the handwritten inputs, or both to the particular processing system (e.g., via an API of the particular processing system) and can receive outputs from the particular processing system as part of modifying the image 1602. As a particular example, the digital ink processing system can interpret a handwritten input as representing an instruction to generate image data to include within the image 1602 and can call an API for an image generation neural network to generate the requested image data for the image 1602.

[0202] FIG. 17 illustrates editing a map 1702 by processing and interpreting a handwritten input 1704 using a digital ink processing system.

[0203] The handwritten input 1704 can represent instructions for processing, e.g., points, paths, regions, and so on depicted by the map, and the digital ink processing system can interpret the handwritten input 1704 to edit the map as directed by the handwritten input 1704. For example, the digital ink processing system can interpret the handwritten input 1704 as representing annotations of the map identifying respective geo-located points, paths, regions, and so on of the map (e.g., by circling portions of the map, pointing to portions of the map, etc.) and can update the map in accordance with the annotations represented by the handwritten input 1704 (e.g., by including geo-located labels, routes, boundaries, and so on within the map as specified by the handwritten input 1704).

[0204] In some implementations, the digital ink processing system can determine that a particular processing system is required to update the map 1702 and can use the particular processing system to update the map 1702. For example, when the digital ink processing system includes the particular processing system, the digital ink processing system can process data characterizing content from the map 1702, the handwritten input 1704, or both using the particular processing system as part of updating the map 1702. As another example, the digital ink processing system can transmit data characterizing content from the map 1702, the handwritten input 1704, or both to the particular processing system (e.g., via an API of the particular processing system) and can receive outputs from the particular processing system as part of updating the map 1702. As a particular example, the digital ink processing system can interpret the handwritten input 1704 as representing a planned trip between two locations on the map 1702 and can call an API for a travel planning system to determine, e.g., a suggested route between the two locations, a predicted travel time for the trip, and so on.

[0205] This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

[0206] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

[0207] The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[0208] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

[0209] In this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

[0210] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.

[0211] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

[0212] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

[0213] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

[0214] Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

[0215] Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a JAX framework.

[0216] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

[0217] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

[0218] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

[0219] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0220] In addition to the embodiments described above, the following embodiments are also innovative:

[0221] Embodiment 1 is a method performed by one or more computers, the method comprising: receiving input data characterizing a user document displayed on a device, the input data comprising data characterizing a plurality of handwritten strokes submitted by a user in the user document while the user document is displayed on the device; generating, from the input data, a hierarchical representation of the user document that characterizes relationships between objects represented by respective subsets of the plurality of handwritten strokes, wherein the hierarchical representation follows a particular hierarchical representation format; and processing the hierarchical representation of the user document to identify a set of one or more candidate modifications to the user document, wherein each candidate modification comprises one or more actions to change the user document.

[0222] Embodiment 2 is the method of embodiment 1, wherein: the particular hierarchical representation format defines a set of syntactic rules that classify whether a modification to the user document is an allowed modification to the user document; and each of the set of one or more candidate modifications is an allowed modification to the user document according to the set of syntactic rules.

[0223] Embodiment 3 is the method of embodiment 1 or embodiment 2, further comprising: selecting one or more proposed modifications from the set of one or more candidate modifications to the user document; and providing, for presentation to the user on the device, data characterizing the one or more proposed modifications.

[0224] Embodiment 4 is the method of any one of embodiments 1-3, further comprising: updating the hierarchical representation based on one or more modifications from the set of candidate modifications to the user document.

[0225] Embodiment 5 is the method of embodiment 4, wherein updating the hierarchical representation based on one or more modifications from the set of candidate modifications to the user document comprises: receiving data from the user characterizing one or more selected modifications to the user document, wherein each of the one or more selected modifications is an allowed modification to the user document according to the set of syntactic rules; and updating the hierarchical representation based on the one or more selected modifications characterized by the received data.

[0226] Embodiment 6 is the method of any one of embodiments 1-5, further comprising: receiving data from the user characterizing the addition of one or more handwritten strokes to the user document; and updating the hierarchical representation based on the received data.

[0227] Embodiment 7 is the method of any one of embodiments 1-6, further comprising: receiving data from the user characterizing the deletion of one or more handwritten strokes from the user document; and updating the hierarchical representation based on the received data.

[0228] Embodiment 8 is the method of any one of embodiments 1-7, further comprising: providing, for presentation to the user on the device, an updated user document based on the updated hierarchical representation.

[0229] Embodiment 9 is the method of any one of embodiments 1-8, wherein the particular hierarchical representation format is a graph representation format comprising: (i) graph nodes, where each graph node represents a document object within the user document; and (ii) edges between the graph nodes, where each edge connects a respective pair of graph nodes and characterizes a relationship connecting the document objects characterized by the pair of graph nodes.

[0230] Embodiment 10 is the method of embodiment 9, wherein the graph representation format assigns additional data to each graph node that characterizes the contents of the document object represented by the node.

[0231] Embodiment 11 is the method of embodiment 9 or embodiment 10, wherein the graph representation format assigns a respective one of a plurality of node types to each graph node.

[0232] Embodiment 12 is the method of embodiment 11, wherein the node type assigned to each node identifies syntactic properties of the document object represented by the node.

[0233] Embodiment 13 is the method of embodiment 12, wherein the syntactic properties of each particular document object define, at least in part, one or more syntactic rules for the particular document object.

[0234] Embodiment 14 is the method of embodiment 12 or embodiment 13, wherein one of the plurality of node types is a stroke type that represents an individual handwritten stroke within the user document.

[0235] Embodiment 15 is the method of any one of embodiments 12-14, wherein one of the plurality of node types is an image type that represents a digital image within the user document.

[0236] Embodiment 16 is the method of any one of embodiments 12-15, wherein at least one of the plurality of node types represents a collection of handwritten text of a certain form within the user document.

[0237] Embodiment 17 is the method of any one of embodiments 12-16, wherein at least one of the plurality of node types represents a collection of non-textual handwritten symbols of a certain form within the user document.

[0238] Embodiment 18 is the method of any one of embodiments 12-17, wherein at least one of the plurality of node types represents a collection of document objects as organized in a certain manner within the user document.

[0239] Embodiment 19 is the method of any one of embodiments 1-18, wherein the graph representation format assigns a respective one of a plurality of edge types to each graph edge.
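As an illustration of the graph representation format of embodiments 9-19, the following is a minimal Python sketch of typed nodes carrying content data and typed edges between them. The class names, the specific node types, and the "contains" edge label are assumptions introduced only for this example; the specification does not prescribe a concrete data layout.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    # Hypothetical node types, loosely following embodiments 14-18.
    STROKE = "stroke"
    IMAGE = "image"
    TEXT_LINE = "text_line"
    LIST = "list"


@dataclass
class Node:
    node_id: int
    node_type: NodeType
    # Additional data characterizing the contents of the document object.
    content: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int    # node_id of the first document object
    target: int    # node_id of the second document object
    relation: str  # edge type; the label vocabulary is an assumption


@dataclass
class DocumentGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: int, target: int, relation: str) -> None:
        # An edge may only connect nodes that are already in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, relation))

    def children(self, node_id: int, relation: str = "contains") -> list:
        # Targets of all edges of the given type leaving node_id.
        return [e.target for e in self.edges
                if e.source == node_id and e.relation == relation]


# A text line object that contains two handwritten strokes.
graph = DocumentGraph()
graph.add_node(Node(0, NodeType.TEXT_LINE, {"transcription": "hi"}))
graph.add_node(Node(1, NodeType.STROKE, {"points": [(0.0, 0.0), (0.0, 1.0)]}))
graph.add_node(Node(2, NodeType.STROKE, {"points": [(1.0, 0.0), (1.0, 1.0)]}))
graph.add_edge(0, 1, "contains")
graph.add_edge(0, 2, "contains")
```

In this sketch the node type and edge type are plain labels; a fuller implementation could attach the syntactic rules of embodiments 13 and 21 to each type.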

[0240] Embodiment 20 is the method of embodiment 19, wherein the edge type assigned to each edge identifies syntactic properties of the relationship between the document objects represented by the pair of graph nodes joined by the graph edge.

[0241] Embodiment 21 is the method of embodiment 20, wherein the syntactic properties of each particular relationship define, at least in part, one or more syntactic rules for the particular relationship.

[0242] Embodiment 22 is the method of embodiment 20 or embodiment 21, wherein the syntactic properties of each particular relationship define, at least in part, one or more syntactic rules for the document objects linked by the particular relationship.

[0243] Embodiment 23 is the method of any one of embodiments 19-22, wherein one of the plurality of edge types represents a containment relationship that characterizes one document object within the user document as containing another document object within the user document.

[0244] Embodiment 24 is the method of embodiment 23, wherein the graph representation format requires that, when the format assigns an edge characterizing a first document object as containing a second document object, the second document object characterizes a plurality of document elements that is a subset of the plurality of elements characterized by the first document object.

[0245] Embodiment 25 is the method of embodiment 24, wherein the graph representation format requires that each document object within the user document is characterized as being contained by at most one other document object within the user document.
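The two containment constraints of embodiments 24 and 25 can be checked mechanically. The sketch below is one possible validator, assuming each document object is described by the set of element identifiers (e.g., stroke ids) it covers; the function name and the input layout are hypothetical and chosen only for illustration.

```python
def check_containment(elements, contains_edges):
    """Return a list of human-readable rule violations.

    elements: dict mapping an object id to the set of element ids it covers.
    contains_edges: list of (parent_id, child_id) containment edges.
    """
    violations = []
    parent_of = {}
    for parent, child in contains_edges:
        # Embodiment 24: the contained object's elements must be a subset
        # of the containing object's elements.
        if not elements[child] <= elements[parent]:
            violations.append(f"{child} is not a subset of {parent}")
        # Embodiment 25: each object is contained by at most one other object.
        if child in parent_of and parent_of[child] != parent:
            violations.append(f"{child} has more than one container")
        parent_of.setdefault(child, parent)
    return violations


elements = {
    "list": {1, 2, 3, 4},
    "line": {1, 2, 3},
    "word": {1, 2},
}
valid = check_containment(elements, [("list", "line"), ("line", "word")])
invalid = check_containment(elements, [("line", "word"), ("list", "word")])
```

Here `valid` is empty, while `invalid` reports that the word object has two containers. A check of this kind could serve as one of the syntactic rules used to decide whether a candidate modification is allowed.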

[0246] Embodiment 26 is the method of any one of embodiments 19-25, wherein at least one of the plurality of edge types represents an interaction relationship of a certain manner that characterizes one document object interacting with another document object in the certain manner.

[0247] Embodiment 27 is the method of any one of embodiments 9-26, wherein generating, from the input data, the hierarchical representation of the user document further comprises: generating the hierarchical representation of the user document using a graph generating neural network configured to (i) process the input data characterizing the user document; and (ii) produce an output graph characterizing the user document following the graph representation format.

[0248] Embodiment 28 is the method of embodiment 27, wherein the graph generating neural network comprises a sequence of graph neural network layers, with each graph neural network layer configured to: process data characterizing an input graph for the layer; and generate data characterizing an output graph for the layer.

[0249] Embodiment 29 is the method of embodiment 28, wherein: the input data characterizing the user document comprises numerical data characterizing geometric properties of document elements within the user document; and the first graph neural network layer within the sequence of graph neural network layers is configured to process an input graph determined by the geometric properties of the document elements.

[0250] Embodiment 30 is the method of embodiment 29, wherein the graph generating network is configured to produce the output graph such that each terminal node of the output graph corresponds to a node within the input graph for the first graph neural network layer.

[0251] Embodiment 31 is the method of embodiment 29 or embodiment 30, wherein the graph generating neural network further comprises one or more clustering layers, with each clustering layer configured to: process data characterizing an input graph for the clustering layer; and generate data characterizing an output graph for the clustering layer, such that the output graph has as many or fewer nodes than the input graph for the clustering layer.
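To make the clustering layer of embodiment 31 concrete, the following sketch pools an input graph into an output graph with no more nodes than the input. It is a simplified stand-in, not the trained layer: the cluster assignment is passed in as a fixed list (a learned layer would predict it), node features are averaged per cluster, and all names are assumptions.

```python
def cluster_layer(features, edges, assignment):
    """Pool a graph according to a node-to-cluster assignment.

    features: list of per-node feature vectors (lists of floats).
    edges: list of (i, j) node index pairs.
    assignment: assignment[i] is the cluster index of node i; indices are
        assumed to be contiguous, 0..K-1, with every cluster non-empty.
    Returns (pooled_features, pooled_edges).
    """
    num_clusters = max(assignment) + 1
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for node, cluster in enumerate(assignment):
        counts[cluster] += 1
        for k in range(dim):
            sums[cluster][k] += features[node][k]
    # Each output node is the mean of the input nodes merged into it.
    pooled = [[s / counts[c] for s in sums[c]] for c in range(num_clusters)]
    # Keep one edge between two clusters if any of their members were linked.
    pooled_edges = sorted({(assignment[i], assignment[j])
                           for i, j in edges
                           if assignment[i] != assignment[j]})
    return pooled, pooled_edges


# Three stroke-level nodes; the first two are merged into one cluster.
features = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0]]
edges = [(0, 1), (1, 2)]
pooled, pooled_edges = cluster_layer(features, edges, assignment=[0, 0, 1])
```

Because every input node maps to exactly one cluster, the output graph can never have more nodes than the input graph, which is the property embodiment 31 requires.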

[0252] Embodiment 32 is the method of embodiment 31, wherein the graph generating network is configured to produce the output graph such that each non-terminal node of the output graph corresponds to a particular node within the output graph for one of the clustering layers.

[0253] Embodiment 33 is the method of embodiment 32, wherein the graph generating network is configured to produce the output graph such that each edge of the output graph corresponds to a particular edge within the output graphs from one of the layers of the graph generating network.

[0254] Embodiment 34 is the method of any one of embodiments 27-33, wherein the graph generating network has been trained to reproduce ground-truth target graphs representing the contents of example documents based on training data comprising example input data characterizing the example documents.

[0255] Embodiment 35 is the method of any one of embodiments 27-34, wherein the graph generating network further comprises a text recognition network configured to: process input data characterizing handwritten strokes representing handwritten text; and produce text transcriptions of the input data.

[0256] Embodiment 36 is the method of embodiment 35, wherein the text recognition network is an LSTM network.

[0257] Embodiment 37 is the method of embodiment 35 or embodiment 36, wherein the text recognition network has been trained to reproduce ground-truth target transcriptions for example input data.

[0258] Embodiment 38 is the method of any one of embodiments 35-37, wherein generating the hierarchical representation of the user document using the graph generating neural network further comprises: processing input data that characterizes handwritten strokes within the user document to produce one or more text transcriptions; and including text from the one or more produced text transcriptions within the hierarchical representation.

[0259] Embodiment 39 is the method of any one of embodiments 27-38, wherein the graph generating network further comprises a text segmentation network configured: to process input data characterizing strokes representing handwritten text; and to segment the strokes into individual characters.

[0260] Embodiment 40 is the method of embodiment 39, wherein the text segmentation network is a Transformer-based network.

[0261] Embodiment 41 is the method of embodiment 39 or embodiment 40, wherein the text segmentation network has been trained to reproduce ground-truth segmentations for example input.

[0262] Embodiment 42 is the method of any one of embodiments 39-41, wherein generating the hierarchical representation of the user document using the graph generating neural network further comprises: processing input data that characterizes handwritten strokes associated with segments of handwritten text within the user document to produce one or more character segmentations; and including geometric data characterizing the handwritten text within the hierarchical representation based on the one or more produced character segmentations.

[0263] Embodiment 43 is the method of embodiment 42, wherein the geometric data characterizing the handwritten text comprises data representing the sizes of handwritten characters within the segments.

[0264] Embodiment 44 is the method of embodiment 42 or embodiment 43, wherein the geometric data characterizing the handwritten text further comprises word baselines.
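The geometric data of embodiments 43 and 44 can be derived from a character segmentation in many ways. The sketch below assumes each segmented character is summarized by an axis-aligned bounding box in screen coordinates (y grows downward) and estimates the word baseline with a least-squares line through the bottom center of each box. Both the box format and the fitting method are assumptions for illustration; descenders, for instance, would need separate handling.

```python
def character_sizes(char_boxes):
    """Return (width, height) for each (x0, y0, x1, y1) character box."""
    return [(x1 - x0, y1 - y0) for x0, y0, x1, y1 in char_boxes]


def fit_baseline(char_boxes):
    """Fit y = slope * x + intercept through the bottom centers of the boxes."""
    xs = [(x0 + x1) / 2 for x0, _, x1, _ in char_boxes]
    ys = [y1 for _, _, _, y1 in char_boxes]  # bottom edge of each character
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # A single character (or vertically stacked boxes): flat baseline.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Three characters of equal size written on a slightly descending line.
boxes = [(0, 0, 2, 10), (2, 1, 4, 11), (4, 2, 6, 12)]
sizes = character_sizes(boxes)
slope, intercept = fit_baseline(boxes)
```

For these boxes the fitted baseline has slope 0.5 and intercept 9.5, so newly synthesized characters could be placed along the same slanted line.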

[0265] Embodiment 45 is the method of any one of embodiments 1-44, wherein at least one of the set of candidate modifications comprises actions that include one or more synthesized handwritten elements in the user document, the actions comprising: synthesizing one or more handwritten elements for the user document; and including the one or more synthesized handwritten elements within the document.

[0266] Embodiment 46 is the method of embodiment 45, wherein synthesizing one or more handwritten elements for the user document comprises: cloning one or more handwritten strokes.

[0267] Embodiment 47 is the method of embodiment 46, wherein cloning the one or more handwritten strokes comprises: cloning the one or more handwritten strokes from handwritten strokes included within the user document.

[0268] Embodiment 48 is the method of embodiment 46, wherein cloning the one or more handwritten strokes comprises: accessing data characterizing a plurality of example handwritten elements; and cloning the one or more handwritten strokes from handwritten strokes included within the plurality of example handwritten elements.

[0269] Embodiment 49 is the method of embodiment 48, wherein the plurality of example handwritten elements includes elements written by the user.

[0270] Embodiment 50 is the method of embodiment 45, wherein synthesizing one or more handwritten elements for the user document comprises: computing one or more Bezier curves for the handwritten elements.
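As one example of computing a Bezier curve for a synthesized handwritten element, the sketch below samples a cubic Bezier curve from four control points using the standard Bernstein form B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3. The choice of a cubic curve, the function name, and the sampling density are assumptions; the specification does not fix the curve degree or how control points are chosen.

```python
def cubic_bezier(p0, p1, p2, p3, num_points=16):
    """Sample num_points (>= 2) points along a cubic Bezier curve."""
    points = []
    for i in range(num_points):
        t = i / (num_points - 1)
        u = 1.0 - t
        # Bernstein basis weights for the four control points.
        b0, b1, b2, b3 = u ** 3, 3 * u ** 2 * t, 3 * u * t ** 2, t ** 3
        x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
        y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
        points.append((x, y))
    return points


# An arch-shaped stroke from (0, 0) to (1, 0), sampled at five points.
stroke = cubic_bezier((0, 0), (0, 1), (1, 1), (1, 0), num_points=5)
```

The sampled points start at the first control point and end at the last one, so consecutive curves can be chained into a longer synthesized stroke by sharing endpoints.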

[0271] Embodiment 51 is the method of embodiment 45, wherein the one or more handwritten elements comprise a line of text.

[0272] Embodiment 52 is the method of embodiment 51, wherein synthesizing one or more handwritten elements for the user document comprises: processing data characterizing the line of text using a text generation model to produce one or more handwritten strokes for the line of text.

[0273] Embodiment 53 is the method of embodiment 52, wherein the data characterizing the line of text characterizes a transcription of the line of text.

[0274] Embodiment 54 is the method of embodiment 52 or embodiment 53, wherein the data characterizing the line of text characterizes geometric properties of the line of text.

[0275] Embodiment 55 is the method of any one of embodiments 52-54, wherein the data characterizing the line of text characterizes a textual context for the line of text.

[0276] Embodiment 56 is the method of any one of embodiments 52-55, wherein the data characterizing the line of text characterizes a style for the line of text.

[0277] Embodiment 57 is the method of any one of embodiments 52-56, wherein the text generation model is a generative neural network.

[0278] Embodiment 58 is the method of embodiment 57, wherein the text generative neural network has been trained to replicate example lines of text based on training data comprising the example lines of text.

[0279] Embodiment 59 is the method of any one of embodiments 1-58, wherein at least one of the set of candidate modifications to the user document comprises actions to change a word in the user document to correct the spelling of the word.

[0280] Embodiment 60 is the method of any one of embodiments 1-59, wherein at least one of the set of candidate modifications to the user document comprises actions to change one or more words of a sentence in the user document to correct the grammar of the sentence.

[0281] Embodiment 61 is the method of any one of embodiments 1-60, wherein processing the hierarchical representation of the user document to identify a set of one or more candidate modifications to the user document further comprises: processing data characterizing the hierarchical representation using a language processing neural network to generate a language processing network output; and including a generated completion for a textual element within the user document based on the language processing network output.

[0282] Embodiment 62 is the method of embodiment 61, wherein the data characterizing the hierarchical representation characterizes transcribed text within the hierarchical representation.

[0283] Embodiment 63 is the method of embodiment 61 or embodiment 62, wherein the data characterizing the hierarchical representation characterizes a structure of the hierarchical representation.

[0284] Embodiment 64 is the method of any one of embodiments 61-63, wherein the data characterizing the hierarchical representation characterizes relationships between document objects in the hierarchical representation.

[0285] Embodiment 65 is the method of any one of embodiments 61-64, wherein the language processing neural network is a large language model trained to perform one or more language processing tasks.

[0286] Embodiment 66 is the method of any one of embodiments 61-65, wherein processing the hierarchical representation of the user document to identify a set of one or more candidate modifications to the user document further comprises: receiving data from the user characterizing a selected language processing task.

[0287] Embodiment 67 is the method of embodiment 66, wherein processing data characterizing the hierarchical representation using a language processing neural network to generate a language processing network output further comprises: processing data characterizing the hierarchical representation alongside data characterizing the selected language processing task using a language processing neural network to generate the language processing network output.

[0288] Embodiment 68 is the method of any one of embodiments 61-67, wherein at least one of the set of candidate modifications to the user document comprises actions to complete a textual element of the user document.

[0289] Embodiment 69 is the method of embodiment 68, wherein the textual element is part or all of a sentence.

[0290] Embodiment 70 is the method of embodiment 68, wherein the textual element is a block of text.

[0291] Embodiment 71 is the method of embodiment 68, wherein the textual element is a list.

[0292] Embodiment 72 is the method of any one of embodiments 68-71, wherein the actions to complete a textual element of the user document are based on the language processing network output.

[0293] Embodiment 73 is the method of any one of embodiments 61-72, wherein at least one of the set of candidate modifications to the user document comprises actions to rephrase a segment of text in the user document.

[0294] Embodiment 74 is the method of embodiment 73, wherein the actions to rephrase a segment of text in the user document are based on the language processing network output.

[0295] Embodiment 75 is the method of any one of embodiments 61-74, wherein at least one of the set of candidate modifications to the user document comprises actions to rearrange document objects within the user document.

[0296] Embodiment 76 is the method of embodiment 75, wherein the actions to rearrange document objects within the user document are based on the language processing network output.

[0297] Embodiment 77 is the method of any one of embodiments 61-76, wherein at least one of the set of candidate modifications to the user document comprises actions to rearrange document objects within an organized collection of document objects in the user document.

[0298] Embodiment 78 is the method of embodiment 77, wherein the actions to rearrange document objects within an organized collection of document objects in the user document are based on the language processing network output.

[0299] Embodiment 79 is the method of any one of embodiments 61-78, further comprising: calling an application programming interface (API) to perform a task for the user based on the language processing network output.

[0300] Embodiment 80 is a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the operations of the respective method of any one of embodiments 1-79.

[0301] Embodiment 81 is one or more computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of embodiments 1-79.

[0302] Embodiment 82 is a method, comprising: receiving data characterizing handwritten text; and processing the received data using a neural network to generate a neural network output for the received data that comprises an output sequence of tokens, wherein the output sequence of tokens includes a sequence of one or more handwritten stroke tokens characterizing the handwritten strokes of the handwritten text.

[0303] Embodiment 83 is the method of embodiment 83's parent, embodiment 82, wherein, for each of one or more handwritten strokes of the handwritten text: the output sequence of tokens includes a sequence of one or more handwritten stroke tokens that characterize the handwritten stroke.

[0304] Embodiment 84 is the method of embodiment 83, wherein each handwritten stroke token characterizes one or more spatial coordinates for an associated point along a curve of the handwritten stroke.

[0305] Embodiment 85 is the method of embodiment 84, wherein one or more handwritten stroke tokens characterize a value of a first spatial coordinate of points along the curve of the handwritten stroke.

[0306] Embodiment 86 is the method of embodiment 85, wherein one or more handwritten stroke tokens characterize a value of a second spatial coordinate of points along the curve of the handwritten stroke.
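As an illustration of embodiments 84-86, the following minimal sketch shows one way a handwritten stroke could be serialized into tokens, with separate tokens for the first (x) and second (y) spatial coordinate of each sampled point. The grid size, the token spellings, and the function names are assumptions made for illustration only; the embodiments do not prescribe a particular vocabulary or quantization scheme.

```python
from typing import List, Tuple

Point = Tuple[float, float]

GRID = 256  # assumed quantization resolution per axis (hypothetical value)


def quantize(value: float, lo: float, hi: float, bins: int = GRID) -> int:
    """Map a coordinate in [lo, hi] to an integer bin in [0, bins - 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clamped = min(max(value, lo), hi)
    return min(int((clamped - lo) / (hi - lo) * bins), bins - 1)


def stroke_to_tokens(stroke: List[Point], width: float, height: float) -> List[str]:
    """Serialize one stroke as <stroke>, alternating x/y tokens, </stroke>."""
    tokens = ["<stroke>"]
    for x, y in stroke:
        tokens.append(f"x{quantize(x, 0.0, width)}")   # first spatial coordinate
        tokens.append(f"y{quantize(y, 0.0, height)}")  # second spatial coordinate
    tokens.append("</stroke>")
    return tokens


def tokens_to_stroke(tokens: List[str], width: float, height: float) -> List[Point]:
    """Invert stroke_to_tokens up to quantization error (uses bin centers)."""
    coords = [t for t in tokens if t[0] in "xy" and t[1:].isdigit()]
    points = []
    for xt, yt in zip(coords[0::2], coords[1::2]):
        x = (int(xt[1:]) + 0.5) / GRID * width
        y = (int(yt[1:]) + 0.5) / GRID * height
        points.append((x, y))
    return points
```

Under these assumptions, decoding recovers each point to within one grid cell, so the token sequence characterizes the curve of the stroke rather than reproducing it exactly.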

[0307] Embodiment 87 is the method of any one of embodiments 82-86, wherein the received data characterizing the handwritten text includes an image of the handwritten text.

[0308] Embodiment 88 is the method of any one of embodiments 82-87, further comprising: receiving an input prompt specifying a processing task; and wherein processing the received data using the neural network to generate the neural network output for the received data comprises processing the received data and the input prompt using a neural network to generate a neural network output to perform the specified task for the received data.

[0309] Embodiment 89 is the method of any one of embodiments 82-88, wherein the output sequence of tokens includes one or more text content tokens that characterize the textual content of the handwritten text.

[0310] Embodiment 90 is the method of any one of embodiments 82-89, wherein the neural network includes a task processing neural network configured to process task processing network inputs characterizing handwritten text and generate corresponding task processing network outputs for the task processing network inputs.

[0311] Embodiment 91 is the method of embodiment 90, wherein: the neural network includes a token generating neural network and processing the received data using a neural network to generate a neural network output for the received data comprises: processing the received data using the task processing neural network to generate a task processing network output for the received data; and processing the task processing network output using the token generating neural network to generate the output sequence of tokens for the received data.

[0312] Embodiment 92 is the method of embodiment 91, when dependent on embodiment 83, wherein: the task processing neural network includes an image processing neural network configured to process images of handwritten text and generate corresponding image processing network outputs for the images; and processing the received data using the task processing neural network comprises: processing the image of the handwritten text using the image processing neural network to generate an image processing network output for the received data; and generating the task processing network output based on the image processing network output for the received data.

[0313] Embodiment 93 is the method of embodiment 91 or embodiment 92, when dependent on embodiment 84, wherein: the input prompt is a text prompt; the task processing neural network includes a text processing neural network configured to process text prompts and generate corresponding text processing network outputs for the text prompts; and processing the received data using the task processing neural network comprises: processing the input prompt using the text processing neural network to generate a text processing network output for the text prompts for the input prompt; and generating the task processing network output based on the text processing network output for the input prompt.

[0314] Embodiment 94 is the method of embodiment 93, when dependent on embodiment 92, wherein generating the task processing network output for the input prompt comprises: generating the task processing network output by concatenating the image processing network output for the image of the handwritten text and the text processing network output for the input prompt.
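The concatenation described in embodiment 94 can be sketched as follows. The sketch assumes that both network outputs are sequences of embedding vectors of equal width and that they are joined along the sequence axis with the image output first; the embodiment states only that the two outputs are concatenated, so the axis, the ordering, and the name `concat_task_inputs` are illustrative assumptions.

```python
from typing import List

Embedding = List[float]


def concat_task_inputs(image_out: List[Embedding], text_out: List[Embedding]) -> List[Embedding]:
    """Concatenate two embedding sequences along the sequence axis."""
    widths = {len(v) for v in image_out + text_out}
    if len(widths) > 1:
        raise ValueError(f"embedding widths differ: {sorted(widths)}")
    # Image embeddings first, then prompt embeddings (ordering is an assumption).
    return [list(v) for v in image_out] + [list(v) for v in text_out]


image_out = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # e.g. 3 image patch embeddings
text_out = [[1.0, 0.0], [0.0, 1.0]]                # e.g. 2 prompt token embeddings
task_out = concat_task_inputs(image_out, text_out)
print(len(task_out), len(task_out[0]))  # prints "5 2"
```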

[0315] Embodiment 95 is the method of any one of embodiments 92-94, wherein the image processing neural network is configured to process images of handwritten text and generate corresponding output sequences of tokens characterizing the images of handwritten text.

[0316] Embodiment 96 is the method of embodiment 95, wherein the image processing neural network is a vision transformer network.

[0317] Embodiment 97 is the method of any one of embodiments 93-96, wherein the text processing neural network is configured to process input prompts and generate corresponding output sequences of tokens characterizing the input prompts.

[0318] Embodiment 98 is the method of any one of embodiments 82-97, wherein the neural network is a large language model.

[0319] Embodiment 99 is the method of any one of embodiments 88-98, wherein the processing task includes determining a textual content of the handwritten text.

[0320] Embodiment 100 is the method of any one of embodiments 88-99, wherein the processing task includes determining the handwritten strokes of the handwritten text.

[0321] Embodiment 101 is the method of any one of embodiments 82-100, wherein the neural network has been trained following operations comprising: receiving training data that includes examples of handwritten text and corresponding target network outputs for one or more processing tasks; and, for one or more training steps: for one or more of the examples of handwritten text in the training data: processing the example of handwritten text using the neural network to generate a neural network output for the example of handwritten text and determining a reconstruction loss for the example based on differences between (i) the neural network output and (ii) the corresponding target network output for the example; and training the neural network by optimizing the determined reconstruction losses.

[0322] Embodiment 102 is the method of embodiment 101, when dependent on embodiment 88, wherein processing the example of handwritten text using the neural network to generate a neural network output for the example of handwritten text comprises: processing the example of handwritten text and an input prompt specifying a processing task using a neural network to generate a neural network output to perform the specified task for the example of handwritten text.

[0323] Embodiment 103 is the method of embodiment 102 when dependent on embodiment 99, wherein determining a reconstruction loss for the example comprises: determining the reconstruction loss for the example based on differences between (i) textual content characterized by the neural network output and (ii) a corresponding ground truth textual content for the example.

[0324] Embodiment 104 is the method of embodiment 102 or embodiment 103, when dependent on embodiment 100, wherein determining a reconstruction loss for the example comprises: determining the reconstruction loss for the example based on differences between (i) handwritten strokes characterized by the neural network output and (ii) corresponding ground truth handwritten strokes for the example.
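The two reconstruction losses of embodiments 103 and 104 can be illustrated with the sketch below. The embodiments require only a loss based on differences between the network output and the ground truth; the specific choices here (mean negative log-likelihood for textual content, mean squared point distance for strokes) and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def text_loss(probs: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the ground-truth text tokens.

    probs[i] is the predicted distribution over the vocabulary at step i
    (an assumed output format, not one specified by the embodiments).
    """
    if not target_ids or len(probs) != len(target_ids):
        raise ValueError("one non-empty probability row is required per target token")
    nll = [-math.log(max(row[t], 1e-12)) for row, t in zip(probs, target_ids)]
    return sum(nll) / len(nll)


def stroke_loss(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean squared distance between predicted and ground-truth stroke points."""
    if not target or len(pred) != len(target):
        raise ValueError("strokes must be non-empty and have the same number of points")
    sq = [(px - tx) ** 2 + (py - ty) ** 2 for (px, py), (tx, ty) in zip(pred, target)]
    return sum(sq) / len(sq)
```

A training step would then sum or average such per-example losses over a batch and update the network parameters to reduce the total; the optimizer is left out because the embodiments do not name one.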

[0325] Embodiment 105 is a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective method of any one of embodiments 82-104.

[0326] Embodiment 106 is one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations of the respective method of any one of embodiments 82-104.

[0327] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

[0328] What is claimed is:

Claims

1. A method performed by one or more computers, the method comprising: receiving input data characterizing a document displayed on a device, the input data comprising data characterizing a plurality of handwritten strokes submitted while the document is displayed on the device; generating, from the input data, a hierarchical representation of the document that characterizes relationships between objects represented by respective subsets of the plurality of handwritten strokes, wherein the hierarchical representation follows a particular hierarchical representation format; and processing the hierarchical representation of the document to identify a set of one or more candidate modifications to the document, wherein each candidate modification comprises one or more actions to change the document.

2. The method of claim 1, wherein: the particular hierarchical representation format defines a set of syntactic rules that classify whether a modification to the document is an allowed modification to the document; and each of the set of one or more candidate modifications is an allowed modification to the document according to the set of syntactic rules.
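As a rough illustration of claim 2, the sketch below models each syntactic rule as a predicate and keeps only the candidate modifications that every rule accepts. The `Modification` fields, the action and type names, and the example rule are hypothetical; the claim does not define how rules are encoded.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Modification:
    action: str        # e.g. "replace_text", "delete" (hypothetical action names)
    target_type: str   # node type of the object being changed (hypothetical)


Rule = Callable[[Modification], bool]  # returns True when the rule is satisfied


def allowed(mod: Modification, rules: List[Rule]) -> bool:
    """A modification is allowed only if every syntactic rule accepts it."""
    return all(rule(mod) for rule in rules)


def text_edits_need_text(mod: Modification) -> bool:
    """Hypothetical rule: text can only be replaced inside textual objects."""
    return mod.action != "replace_text" or mod.target_type == "text_line"


candidates = [
    Modification("replace_text", "text_line"),
    Modification("replace_text", "image"),
    Modification("delete", "image"),
]
# Only modifications accepted by all rules remain candidates.
kept = [m for m in candidates if allowed(m, [text_edits_need_text])]
```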

3. The method of claim 1 or claim 2, further comprising: selecting one or more proposed modifications from the set of one or more candidate modifications to the document; and providing, for presentation on the device, data characterizing the one or more proposed modifications.

4. The method of any one of claims 1-3, further comprising: updating the hierarchical representation based on one or more modifications from the set of candidate modifications to the document.

5. The method of claim 4, wherein updating the hierarchical representation based on one or more modifications from the set of candidate modifications to the document comprises: receiving data characterizing one or more selected modifications to the document, wherein each of the one or more selected modifications is an allowed modification to the document according to the set of syntactic rules; and updating the hierarchical representation based on the one or more selected modifications characterized by the received data.

6. The method of any preceding claim, further comprising: receiving data characterizing the addition of one or more handwritten strokes to the document; and updating the hierarchical representation based on the received data.

7. The method of any preceding claim, further comprising: receiving data characterizing the deletion of one or more handwritten strokes from the document; and updating the hierarchical representation based on the received data.

8. The method of any preceding claim, further comprising: providing, for presentation on the device, an updated document based on the updated hierarchical representation.

9. The method of any preceding claim, wherein the particular hierarchical representation format is a graph representation format comprising: (i) graph nodes, where each graph node represents a document object within the document; and (ii) edges between the graph nodes, where each edge connects a respective pair of graph nodes and characterizes a relationship connecting the document objects characterized by the pair of graph nodes.

10. The method of claim 9, wherein the graph representation format assigns additional data to each graph node that characterizes the contents of the document object represented by the node.

11. The method of claim 9 or claim 10, wherein the graph representation format assigns a respective one of a plurality of node types to each graph node.

12. The method of claim 11, wherein the node type assigned to each node identifies syntactic properties of the document object represented by the node.

13. The method of claim 12, wherein the syntactic properties of each particular document object define, at least in part, one or more syntactic rules for the particular document object.

14. The method of claim 12 or claim 13, wherein one of the plurality of node types is a stroke type that represents an individual handwritten stroke within the document.

15. The method of any one of claims 12-14, wherein one of the plurality of node types is an image type that represents a digital image within the document.

16. The method of any one of claims 12-15, wherein at least one of the plurality of node types represents a collection of handwritten text of a certain form within the document.

17. The method of any one of claims 12-16, wherein at least one of the plurality of node types represents a collection of non-textual handwritten symbols of a certain form within the document.

18. The method of any one of claims 12-17, wherein at least one of the plurality of node types represents a collection of document objects as organized in a certain manner within the document.

19. The method of any preceding claim, wherein the graph representation format assigns a respective one of a plurality of edge types to each graph edge.

20. The method of claim 19, wherein the edge type assigned to each edge identifies syntactic properties of the relationship between the document objects represented by the pair of graph nodes joined by the graph edge.

21. The method of claim 20, wherein the syntactic properties of each particular relationship define, at least in part, one or more syntactic rules for the particular relationship.

22. The method of claim 20 or claim 21, wherein the syntactic properties of each particular relationship define, at least in part, one or more syntactic rules for the document objects linked by the particular relationship.

23. The method of any one of claims 19-22, wherein one of the plurality of edge types represents a containment relationship that characterizes one document object within the document as containing another document object within the document.

24. The method of claim 23, wherein the graph representation format requires that, when the format assigns an edge characterizing a first document object as containing a second document object, the second document object characterizes a plurality of document elements that is a subset of the plurality of elements characterized by the first document object.

25. The method of claim 24, wherein the graph representation format requires that each document object within the document is characterized as being contained by at most one other document object within the document.
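The graph representation format of claims 9-25 can be sketched as a small data structure that enforces the two containment constraints of claims 24 and 25: a contained object covers a subset of its container's elements, and each object has at most one container. The class names, field names, node type strings, and the `"contains"` edge label are assumptions made for illustration, not terms defined by the claims.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    node_type: str             # e.g. "stroke", "word" (hypothetical type names)
    elements: FrozenSet[int]   # ids of the document elements the object covers


@dataclass
class DocumentGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, dst, edge_type)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def container_of(self, node_id: str) -> Optional[str]:
        """Return the id of the node that contains node_id, if any."""
        for src, dst, kind in self.edges:
            if kind == "contains" and dst == node_id:
                return src
        return None

    def add_containment(self, parent_id: str, child_id: str) -> None:
        parent, child = self.nodes[parent_id], self.nodes[child_id]
        # Claim 24: the contained object's elements are a subset of the container's.
        if not child.elements <= parent.elements:
            raise ValueError("child elements must be a subset of the parent's")
        # Claim 25: each object is contained by at most one other object.
        if self.container_of(child_id) is not None:
            raise ValueError("a document object can have at most one container")
        self.edges.append((parent_id, child_id, "contains"))
```

Because every object has at most one container, the containment edges form a forest, which is what makes the representation hierarchical; other edge types, such as the interaction relationships of claim 26, could be stored in the same edge list with a different label.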

26. The method of any one of claims 19-25, wherein at least one of the plurality of edge types represents an interaction relationship of a certain manner that characterizes one document object interacting with another document object in the certain manner.

27. The method of any one of claims 9-26, wherein generating, from the input data, the hierarchical representation of the document further comprises: generating the hierarchical representation of the document using a graph generating neural network configured to: (i) process the input data characterizing the document; and (ii) produce an output graph characterizing the document following the graph representation format.

28. The method of claim 27, wherein: the graph generating neural network comprises a sequence of graph neural network layers, with each graph neural network layer configured to: process data characterizing an input graph for the layer; and generate data characterizing an output graph for the layer.

29. The method of claim 28, wherein: the input data characterizing the document comprises numerical data characterizing geometric properties of document elements within the document; and the first graph neural network layer within the sequence of graph neural network layers is configured to process an input graph determined by the geometric properties of the document elements.

30. The method of claim 29, wherein the graph generating network is configured to produce the output graph such that each terminal node of the output graph corresponds to a node within the input graph for the first graph neural network layer.

31. The method of claim 29 or claim 30, wherein the graph generating neural network further comprises one or more clustering layers, with each clustering layer configured to: process data characterizing an input graph for the clustering layer; and generate data characterizing an output graph for the clustering layer, such that the output graph has as many or fewer nodes than the input graph for the clustering layer.
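The node-count property of the clustering layer in claim 31 can be illustrated with the sketch below, which collapses groups of input nodes into single output nodes. It assumes a hard assignment of each input node to a cluster is already available; in a trained clustering layer that assignment would be predicted by the network, which this sketch does not attempt to model. The function and parameter names are hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def cluster_graph(
    nodes: Set[Hashable],
    edges: Set[Edge],
    assign: Dict[Hashable, Hashable],
) -> Tuple[Set[Hashable], Set[Edge]]:
    """Collapse each group of input nodes into a single output node."""
    missing = nodes - set(assign)
    if missing:
        raise ValueError(f"nodes without a cluster: {sorted(map(str, missing))}")
    # Each cluster label becomes one output node, so the count cannot grow.
    out_nodes = {assign[n] for n in nodes}
    # Keep an edge only when it joins two different clusters.
    out_edges = {(assign[a], assign[b]) for a, b in edges if assign[a] != assign[b]}
    return out_nodes, out_edges
```

For example, assigning strokes `s1` and `s2` to a word cluster `w1` and stroke `s3` to `w2` turns a three-node input graph into a two-node output graph, consistent with the requirement that the output has as many or fewer nodes than the input.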

32. The method of claim 31, wherein the graph generating network is configured to produce the output graph such that each non-terminal node of the output graph corresponds to a particular node within the output graph for one of the clustering layers.

33. The method of claim 32, wherein the graph generating network is configured to produce the output graph such that each edge of the output graph corresponds to a particular edge within the output graphs from one of the layers of the graph generating network.

34. The method of any one of claims 27-33, wherein the graph generating network has been trained to reproduce ground-truth target graphs representing the contents of example documents based on training data comprising example input data characterizing the example documents.

35. The method of any one of claims 27-34, wherein the graph generating network further comprises a text recognition network configured to: process input data characterizing handwritten strokes representing handwritten text; and produce text transcriptions of the input data.

36. The method of claim 35, wherein the text recognition network is an LSTM network.

37. The method of claim 35 or claim 36, wherein the text recognition network has been trained to reproduce ground-truth target transcriptions for example input data.

38. The method of any one of claims 35-37, wherein generating the hierarchical representation of the document using the graph generating neural network further comprises: processing input data that characterizes handwritten strokes within the document to produce one or more text transcriptions; and including text from the one or more produced text transcriptions within the hierarchical representation.

39. The method of any one of claims 27-38, wherein the graph generating network further comprises a text segmentation network configured to: process input data characterizing strokes representing handwritten text; and segment the strokes into individual characters.

40. The method of claim 39, wherein the text segmentation network is a Transformer based network.

41. The method of claim 39 or claim 40, wherein the text segmentation network has been trained to reproduce ground-truth segmentations for example input.

42. The method of any one of claims 39-41, wherein generating the hierarchical representation of the document using the graph generating neural network further comprises: processing input data that characterizes handwritten strokes associated with segments of handwritten text within the document to produce one or more character segmentations; and including geometric data characterizing the handwritten text within the hierarchical representation based on the one or more produced character segmentations.

43. The method of claim 42, wherein the geometric data characterizing the handwritten text comprises data representing the sizes of handwritten characters within the segments.

44. The method of claim 42 or claim 43, wherein the geometric data characterizing the handwritten text further comprises word baselines.

45. The method of any preceding claim, wherein at least one of the set of candidate modifications comprises actions that include one or more synthesized handwritten elements in the document, the actions comprising: synthesizing one or more handwritten elements for the document; and including the one or more synthesized handwritten elements within the document.

46. The method of claim 45, wherein synthesizing one or more handwritten elements for the document comprises: cloning one or more handwritten strokes.

47. The method of claim 46, wherein cloning the one or more handwritten strokes comprises: cloning the one or more handwritten strokes from handwritten strokes included within the document.

48. The method of claim 46, wherein cloning the one or more handwritten strokes comprises: accessing data characterizing a plurality of example handwritten elements; and cloning the one or more handwritten strokes from handwritten strokes included within the plurality of example handwritten elements.

49. The method of claim 48, wherein the plurality of example handwritten elements includes elements written by a user.

50. The method of claim 45, wherein synthesizing one or more handwritten elements for the document comprises: computing one or more Bezier curves for the handwritten elements.
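As an illustration of claim 50, the sketch below evaluates a cubic Bezier curve with the standard Bernstein-polynomial formula and samples it into a polyline that could serve as a synthesized stroke. The choice of a cubic curve, the sample count, and the function names are assumptions; the claim does not state the curve degree or how control points are chosen.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Bernstein basis weights for the four control points.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_stroke(p0: Point, p1: Point, p2: Point, p3: Point, n: int = 16) -> List[Point]:
    """Sample n evenly spaced parameter values to obtain stroke points."""
    if n < 2:
        raise ValueError("n must be at least 2 to include both endpoints")
    return [cubic_bezier(p0, p1, p2, p3, i / (n - 1)) for i in range(n)]
```

The sampled polyline starts at `p0` and ends at `p3`, while `p1` and `p2` shape the curve between them.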

51. The method of claim 45, wherein the one or more handwritten elements comprise a line of text.

52. The method of claim 51, wherein synthesizing one or more handwritten elements for the document comprises: processing data characterizing the line of text using a text generation model to produce one or more handwritten strokes for the line of text.

53. The method of claim 52, wherein the data characterizing the line of text characterizes a transcription of the line of text.

54. The method of claim 52 or claim 53, wherein the data characterizing the line of text characterizes geometric properties of the line of text.

55. The method of any one of claims 52-54, wherein the data characterizing the line of text characterizes a textual context for the line of text.

56. The method of any one of claims 52-55, wherein the data characterizing the line of text characterizes a style for the line of text.

57. The method of any one of claims 52-56, wherein the text generation model is a generative neural network.

58. The method of claim 57, wherein the text generative neural network has been trained to replicate example lines of text based on training data comprising the example lines of text.

59. The method of any preceding claim, wherein at least one of the set of candidate modifications to the document comprises actions to change a word in the document to correct the spelling of the word.

60. The method of any preceding claim, wherein at least one of the set of candidate modifications to the document comprises actions to change one or more words of a sentence in the document to correct the grammar of the sentence.

61. The method of any preceding claim, wherein processing the hierarchical representation of the document to identify a set of one or more candidate modifications to the document further comprises: processing data characterizing the hierarchical representation using a language processing neural network to generate a language processing network output; and including a generated completion for a textual element within the document based on the language processing network output.

62. The method of claim 61, wherein the data characterizing the hierarchical representation characterizes transcribed text within the hierarchical representation.

63. The method of claim 61 or claim 62, wherein the data characterizing the hierarchical representation characterizes a structure of the hierarchical representation.

64. The method of any one of claims 61-63, wherein the data characterizing the hierarchical representation characterizes relationships between document objects in the hierarchical representation.

65. The method of any one of claims 61-64, wherein the language processing neural network is a large language model trained to perform one or more language processing tasks.

66. The method of any one of claims 61-65, wherein processing the hierarchical representation of the document to identify a set of one or more candidate modifications to the document further comprises: receiving data characterizing a selected language processing task.

67. The method of claim 66, wherein processing data characterizing the hierarchical representation using a language processing neural network to generate a language processing network output further comprises: processing data characterizing the hierarchical representation alongside data characterizing the selected language processing task using a language processing neural network to generate the language processing network output.

68. The method of any one of claims 61-67, wherein at least one of the set of candidate modifications to the document comprises actions to complete a textual element of the document.

69. The method of claim 68, wherein the textual element is part or all of a sentence.

70. The method of claim 68, wherein the textual element is a block of text.

71. The method of claim 68, wherein the textual element is a list.

72. The method of any one of claims 68-71, wherein the actions to complete a textual element of the document are based on the language processing network output.

73. The method of any one of claims 61-72, wherein at least one of the set of candidate modifications to the document comprises actions to rephrase a segment of text in the document.

74. The method of claim 73, wherein the actions to rephrase a segment of text in the document are based on the language processing network output.

75. The method of any one of claims 61-74, wherein at least one of the set of candidate modifications to the document comprises actions to rearrange document objects within the document.

76. The method of claim 75, wherein the actions to rearrange document objects within the document are based on the language processing network output.

77. The method of any one of claims 61-76, wherein at least one of the set of candidate modifications to the document comprises actions to rearrange document objects within an organized collection of document objects in the document.

78. The method of claim 77, wherein the actions to rearrange document objects within an organized collection of document objects in the document are based on the language processing network output.

79. The method of any one of claims 61-78, further comprising: calling an application programming interface (API) to perform a task based on the language processing network output.

80. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the operations of the respective method of any one of claims 1-79.

81. One or more computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-79.

82. A method, comprising: receiving data characterizing handwritten text; and processing the received data using a neural network to generate a neural network output for the received data, that comprises an output sequence of tokens, wherein the output sequence of tokens includes a sequence of one or more handwritten stroke tokens characterizing the handwritten strokes of the handwritten text.

83. The method of claim 82, wherein: for each of one or more handwritten strokes of the handwritten text: the output sequence of tokens includes a sequence of one or more handwritten stroke tokens that characterize the handwritten stroke.

84. The method of claim 83, wherein each handwritten stroke token characterizes one or more spatial coordinates for an associated point along a curve of the handwritten stroke.

85. The method of claim 84, wherein one or more handwritten stroke tokens characterize a value of a first spatial coordinate of points along the curve of the handwritten stroke.

86. The method of claim 85, wherein one or more handwritten stroke tokens characterize a value of a second spatial coordinate of points along the curve of the handwritten stroke.

87. The method of any one of claims 82-86, wherein the received data characterizing the handwritten text includes an image of the handwritten text.

88. The method of any one of claims 82-87, further comprising: receiving an input prompt specifying a processing task; and wherein processing the received data using the neural network to generate the neural network output for the received data comprises: processing the received data and the input prompt using a neural network to generate a neural network output to perform the specified task for the received data.

89. The method of any one of claims 82-88, wherein the output sequence of tokens includes one or more text content tokens that characterize the textual content of the handwritten text.

90. The method of any one of claims 82-89, wherein the neural network includes a task processing neural network configured to process task processing network inputs characterizing handwritten text and generate corresponding task processing network outputs for the task processing network inputs.

91. The method of claim 90, wherein: the neural network includes a token generating neural network; and processing the received data using a neural network to generate a neural network output for the received data comprises: processing the received data using the task processing neural network to generate a task processing network output for the received data; and processing the task processing network output using the token generating neural network to generate the output sequence of tokens for the received data.

92. The method of claim 91, when dependent on claim 83, wherein: the task processing neural network includes an image processing neural network configured to process images of handwritten text and generate corresponding image processing network outputs for the images; and processing the received data using the task processing neural network comprises: processing the image of the handwritten text using the image processing neural network to generate an image processing network output for the received data; and generating the task processing network output based on the image processing network output for the received data.

93. The method of claim 91 or claim 92, when dependent on claim 84, wherein: the input prompt is a text prompt; the task processing neural network includes a text processing neural network configured to process text prompts and generate corresponding text processing network outputs for the text prompts; and processing the received data using the task processing neural network comprises: processing the input prompt using the text processing neural network to generate a text processing network output for the text prompts for the input prompt; and generating the task processing network output based on the text processing network output for the input prompt.

94. The method of claim 93, when dependent on claim 92, wherein generating the task processing network output for the input prompt comprises: generating the task processing network output by concatenating the image processing network output for the image of the handwritten text and the text processing network output for the input prompt.

95. The method of any one of claims 92-94, wherein the image processing neural network is configured to process images of handwritten text and generate corresponding output sequences of tokens characterizing the images of handwritten text.

96. The method of claim 95, wherein the image processing neural network is a vision transformer network.

97. The method of any one of claims 93-96, wherein the text processing neural network is configured to process input prompts and generate corresponding output sequences of tokens characterizing the input prompts.

98. The method of any one of claims 82-97, wherein the neural network is a large language model.

99. The method of any one of claims 88-98, wherein the processing task includes determining a textual content of the handwritten text.

100. The method of any one of claims 88-99, wherein the processing task includes determining the handwritten strokes of the handwritten text.

101. The method of any one of claims 82-100, wherein the neural network has been trained following operations comprising: receiving training data that includes examples of handwritten text and corresponding target network outputs for one or more processing tasks; for one or more training steps: for one or more of the examples of handwritten text in the training data: processing the example of handwritten text using the neural network to generate a neural network output for the example of handwritten text; and determining a reconstruction loss for the example based on differences between (i) the neural network output and (ii) the corresponding target network output for the example; and training the neural network by optimizing the determined reconstruction losses.
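The training operations of claim 101 can be sketched, under strong simplifying assumptions, as the loop below. A two-parameter linear model stands in for the neural network, mean squared error stands in for the reconstruction loss, and plain gradient descent stands in for the optimizer; none of these choices is specified by the claims, and the names `reconstruction_loss` and `train` are hypothetical.

```python
from typing import List, Tuple

# (input features for one example, corresponding target network output)
Example = Tuple[List[float], List[float]]


def reconstruction_loss(output: List[float], target: List[float]) -> float:
    """Mean squared difference between network output and target output
    (an assumed choice of reconstruction loss)."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


def train(data: List[Example], steps: int = 200, lr: float = 0.1):
    """Fit a toy model out_i = w * x_i + b by gradient descent on the
    average reconstruction loss over the training examples."""
    w, b = 0.0, 0.0
    history = []  # average loss recorded at each training step
    for _ in range(steps):
        grad_w = grad_b = total = 0.0
        for x, target in data:
            out = [w * xi + b for xi in x]            # forward pass
            total += reconstruction_loss(out, target)  # per-example loss
            n = len(x)
            # Analytic gradients of the mean squared error for this example.
            grad_w += sum(2 * (o - t) * xi for o, t, xi in zip(out, target, x)) / n
            grad_b += sum(2 * (o - t) for o, t in zip(out, target)) / n
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
        history.append(total / len(data))
    return w, b, history


# Usage: targets follow 2 * x + 1, so training should move w toward 2 and b toward 1.
data = [([0.0, 0.5, 1.0], [1.0, 2.0, 3.0]),
        ([1.0, 1.5, 2.0], [3.0, 4.0, 5.0])]
w, b, history = train(data)
```

The recorded `history` makes it easy to check that the average reconstruction loss falls over the training steps.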

102. The method of claim 101, when dependent on claim 88, wherein processing the example of handwritten text using the neural network to generate a neural network output for the example of handwritten text comprises: processing the example of handwritten text and an input prompt specifying a processing task using a neural network to generate a neural network output to perform the specified task for the example of handwritten text.

103. The method of claim 102, when dependent on claim 99, wherein determining a reconstruction loss for the example comprises: determining the reconstruction loss for the example based on differences between (i) textual content characterized by the neural network output and (ii) a corresponding ground truth textual content for the example.

104. The method of claim 102 or claim 103, when dependent on claim 100, wherein determining a reconstruction loss for the example comprises: determining the reconstruction loss for the example based on differences between (i) handwritten strokes characterized by the neural network output and (ii) corresponding ground truth handwritten strokes for the example.
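Claims 103 and 104 leave the measure of "differences" open. As one possible reading, the sketch below scores textual content with a mean negative log-likelihood of the ground truth characters, and scores handwritten strokes with a mean squared distance between corresponding points. Both measures, the per-position probability dictionaries, the assumption that predicted and ground truth strokes have matching point counts, and the names `text_loss` and `stroke_loss` are assumptions for illustration only.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y) coordinate of one ink sample
Stroke = List[Point]          # one pen-down to pen-up trajectory


def text_loss(predicted: List[Dict[str, float]], truth: str) -> float:
    """Mean negative log-likelihood of the ground truth characters, given
    one predicted character distribution per position (assumed format)."""
    if not truth or len(predicted) != len(truth):
        raise ValueError("need one non-empty distribution per truth character")
    eps = 1e-9  # floor to avoid log(0) for characters given no probability
    nll = sum(-math.log(max(dist.get(ch, 0.0), eps))
              for dist, ch in zip(predicted, truth))
    return nll / len(truth)


def stroke_loss(predicted: List[Stroke], truth: List[Stroke]) -> float:
    """Mean squared Euclidean distance between corresponding points of the
    predicted and ground truth strokes (assumes matching point counts)."""
    if len(predicted) != len(truth):
        raise ValueError("stroke counts must match")
    total, count = 0.0, 0
    for p_stroke, t_stroke in zip(predicted, truth):
        if len(p_stroke) != len(t_stroke):
            raise ValueError("point counts must match within each stroke")
        for (px, py), (tx, ty) in zip(p_stroke, t_stroke):
            total += (px - tx) ** 2 + (py - ty) ** 2
            count += 1
    if count == 0:
        raise ValueError("no points to compare")
    return total / count


# Usage with toy values.
t = text_loss([{"h": 0.9, "n": 0.1}, {"i": 0.8, "l": 0.2}], "hi")
s = stroke_loss([[(0.0, 0.0), (1.0, 1.0)]], [[(0.0, 0.0), (1.0, 2.0)]])
```

In practice, strokes of differing lengths would need resampling or alignment before a point-wise comparison like this one applies.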

105. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective method of any one of claims 82-104.

106. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations of the respective method of any one of claims 82-104.