Token stream differencing with moved-block detection

A technique for token stream differencing that detects moved blocks of tokens. It addresses the confusing, cluttered results reports that existing differencing techniques can produce, and it offers simpler change tracking, effective detection of moved blocks of text, and good moved-block detection performance.

Inactive; Publication Date: 2009-01-08
ADOBE SYST INC
5 Cites, 4 Cited by

AI Technical Summary

Benefits of technology

[0006] In general, in one aspect, the invention involves obtaining a first token stream and a second token stream, comparing the first and second token streams to identify a group of tokens that are substantially similar in the first and second token streams, the similar-tokens group including common sub-sequences, which are identical in the first and second token streams, and at least one unmatched token, and presenting matched token information corresponding to the similar-tokens group to represent changes in document flow. Implementations of the invention can include one or more of the following advantageous features.
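To make the notion of a similar-tokens group concrete, the following minimal sketch splits two substantially similar blocks into their common sub-sequences and their unmatched tokens. It uses Python's standard `difflib.SequenceMatcher` as a stand-in comparison routine; the function name `describe_similar_group` and the returned dictionary layout are assumptions made for illustration and do not come from the patent.

```python
from difflib import SequenceMatcher

def describe_similar_group(block_a, block_b):
    """Split two similar token blocks into common runs and unmatched tokens.

    Hypothetical helper: difflib stands in for whatever comparison the
    described technique actually uses.
    """
    matcher = SequenceMatcher(a=block_a, b=block_b, autojunk=False)
    # Drop the zero-length sentinel block that difflib appends at the end.
    blocks = [m for m in matcher.get_matching_blocks() if m.size > 0]
    common = [list(block_a[m.a:m.a + m.size]) for m in blocks]
    matched_a = {i for m in blocks for i in range(m.a, m.a + m.size)}
    matched_b = {j for m in blocks for j in range(m.b, m.b + m.size)}
    return {
        "common": common,
        "unmatched_a": [t for i, t in enumerate(block_a) if i not in matched_a],
        "unmatched_b": [t for j, t in enumerate(block_b) if j not in matched_b],
        "ratio": matcher.ratio(),
    }

group = describe_similar_group(
    "the quick brown fox jumps".split(),
    "the quick red fox jumps".split(),
)
print(group["common"], group["unmatched_a"], group["unmatched_b"])
```

In this example the two blocks share the runs "the quick" and "fox jumps", while "brown" and "red" are the unmatched tokens.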
[0014] The invention can be implemented to realize one or more of the following advantages. Move detection, including detection of moved blocks that were changed slightly but are still substantially similar, allows simpler change tracking between revisions of a document. Additions and deletions are also identified, and the techniques can be used with tokens that represent various types of sequential data. In text documents, a word-level token granularity can be used, while still effectively detecting moved blocks of text with small modifications relative to the size of the block. Moves that occur within a single line can be detected, and text reflows need not cause problems for moved-block detection.
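One naive way to approximate detection of moved blocks that were changed slightly is sketched below. It is not the patented method: it simply runs a standard diff, collects the deleted and inserted runs, and pairs runs whose similarity ratio clears a threshold. The function name `find_moved_blocks` and the parameters `min_ratio` and `min_len` are hypothetical.

```python
from difflib import SequenceMatcher

def find_moved_blocks(original, modified, min_ratio=0.6, min_len=3):
    """Pair deleted runs with similar inserted runs and report them as moves.

    Returns a list of ((a_start, a_end), (b_start, b_end), ratio) with
    half-open index ranges. Naive sketch; thresholds are assumptions.
    """
    matcher = SequenceMatcher(a=original, b=modified, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.append((i1, i2))
        if tag in ("insert", "replace"):
            inserted.append((j1, j2))

    moves, used = [], set()
    for i1, i2 in deleted:
        if i2 - i1 < min_len:
            continue
        best = None  # (ratio, index into inserted)
        for idx, (j1, j2) in enumerate(inserted):
            if idx in used or j2 - j1 < min_len:
                continue
            ratio = SequenceMatcher(
                a=original[i1:i2], b=modified[j1:j2], autojunk=False
            ).ratio()
            if ratio >= min_ratio and (best is None or ratio > best[0]):
                best = (ratio, idx)
        if best is not None:
            used.add(best[1])
            moves.append(((i1, i2), inserted[best[1]], best[0]))
    return moves

original = "alpha beta gamma delta one two three four five six".split()
modified = "one two three four five six alpha beta new gamma delta".split()
for src, dst, ratio in find_moved_blocks(original, modified):
    print(src, "->", dst, round(ratio, 2))
```

Because the tokens are words rather than lines, the pairing is unaffected by where line breaks fall. A known limitation of this sketch is that a moved block adjacent to other edits can be merged with them into a single diff run, which lowers its similarity ratio.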
[0015] Very good moved-block detection performance can be obtained, even on token streams that have many similarities and are of roughly the same length. A predefined sequences differencing technique can be used as a replaceable component with the systems and techniques of the invention. This design modularity enables additional performance improvements to be realized in the future when faster differencing techniques become available.
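The idea of a replaceable differencing component can be sketched by passing the differ in as a function. In the sketch below, the `Differ` interface (a callable returning `(a_start, b_start, size)` matches), `difflib_differ`, `lcs_differ`, and `unmatched_counts` are all hypothetical names chosen for illustration; the patent does not specify this interface.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple

Match = Tuple[int, int, int]  # (start in a, start in b, run length)
Differ = Callable[[Sequence[str], Sequence[str]], List[Match]]

def difflib_differ(a, b):
    """Differ backed by the standard library's SequenceMatcher."""
    m = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(x.a, x.b, x.size) for x in m.get_matching_blocks() if x.size]

def lcs_differ(a, b):
    """Differ backed by a textbook dynamic-programming LCS, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    matches, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            last = matches[-1] if matches else None
            if last and last[0] + last[2] == i and last[1] + last[2] == j:
                matches[-1] = (last[0], last[1], last[2] + 1)  # extend the run
            else:
                matches.append((i, j, 1))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return matches

def unmatched_counts(a, b, differ: Differ = difflib_differ):
    """Count tokens left unmatched in each stream, using any pluggable differ."""
    matched = sum(size for _, _, size in differ(a, b))
    return len(a) - matched, len(b) - matched

a, b = "a b c d".split(), "a x c d".split()
print(unmatched_counts(a, b), unmatched_counts(a, b, differ=lcs_differ))
```

Any function with the same signature can be swapped in, so a faster differ could replace either implementation without changes to the calling code.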
[0016] Moreover, moved blocks of tokens can be grouped together before a final presentation. Collecting the identified moved blocks into larger groups can reduce visual complexity of a final results presentation, and the grouped blocks can be displayed using colors to assist a reviewer. For example, when the tokens are words in a text document, displaying the grouped blocks using multiple colors can assist a reader in identifying changes in document flow.
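A minimal sketch of grouping and coloring, assuming moved blocks are given as `(a_start, a_end, b_start, b_end)` half-open ranges: neighboring blocks that are close together in both streams are merged, and each resulting group receives a color from a rotating palette. The merge rule, the `max_gap` parameter, and the palette are illustrative assumptions, not details from the patent.

```python
from itertools import cycle

def group_moved_blocks(blocks, max_gap=2):
    """Merge moved blocks separated by at most max_gap tokens in both streams."""
    groups = []
    for blk in sorted(blocks):
        if groups:
            a0, a1, b0, b1 = groups[-1]
            close_in_a = 0 <= blk[0] - a1 <= max_gap
            close_in_b = 0 <= blk[2] - b1 <= max_gap
            if close_in_a and close_in_b:
                # Absorb the small gap so the group displays as one block.
                groups[-1] = (a0, blk[1], b0, blk[3])
                continue
        groups.append(tuple(blk))
    return groups

def assign_colors(groups, palette=("yellow", "cyan", "magenta", "green")):
    """Pair each group with a display color, reusing the palette as needed."""
    return list(zip(groups, cycle(palette)))

blocks = [(0, 3, 10, 13), (4, 6, 14, 16), (20, 25, 2, 7)]
for group, color in assign_colors(group_moved_blocks(blocks)):
    print(group, color)
```

Here the first two blocks are one token apart in both streams, so they merge into a single group, and the distant third block stays separate and gets its own color.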

Problems solved by technology

When existing differencing techniques do identify moved blocks, the displayed results can be very confusing: small additions and/or deletions within a moved block of text can create a checker-boarding effect in the generated results, in which moved and unmoved words interleave, cluttering the results report.

Embodiment Construction

[0026] FIG. 1 is a flowchart showing a process of token stream differencing. A token stream is an ordered sequence of tokens. The ordered sequence can be in an electronic document. As used herein, the terms “electronic document” and “document” mean a set of electronic data, including both electronic data stored in a file and electronic data received over a network. An electronic document does not necessarily correspond to a file. A document may be stored in a portion of a file that holds other documents, in a single file dedicated to the document in question, or in a set of coordinated files.

[0027] Tokens in a token stream can represent nearly anything. For text files, tokens can be characters, words, or lines of text, and can include white space tokens or other text elements. For example, the tokens in a text file can be the words in the file arranged within the token stream in reading order. Tokens can be any discrete data elements that can be arranged in a sequence. For example, in...
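The token granularities mentioned above (characters, words, lines, and optional white space tokens) can be illustrated with a few small tokenizers. This is a minimal sketch; the function names and the `keep_whitespace` flag are assumptions made for illustration.

```python
import re

def tokenize_words(text, keep_whitespace=False):
    """Return word tokens in reading order.

    With keep_whitespace=True, each run of white space is also a token.
    """
    pattern = r"\s+|\S+" if keep_whitespace else r"\S+"
    return re.findall(pattern, text)

def tokenize_lines(text):
    """Return one token per line of text."""
    return text.splitlines()

def tokenize_chars(text):
    """Return one token per character."""
    return list(text)

sample = "the quick  fox\njumps"
print(tokenize_words(sample))
print(tokenize_words(sample, keep_whitespace=True))
print(tokenize_lines(sample))
```

Word tokens without white space make the resulting stream insensitive to reflowed text, since line breaks and spacing do not appear in the stream at all.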

Abstract

Methods and apparatus implementing systems and techniques for differencing token streams and detecting moved blocks of tokens. In general, in one implementation, the technique includes: obtaining a first token stream and a second token stream, comparing the first and second token streams to identify a group of tokens that are substantially similar in the first and second token streams, the similar-tokens group including common sub-sequences, which are identical in the first and second token streams, and at least one unmatched token, and presenting matched token information corresponding to the similar-tokens group to represent changes in document flow.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation application of and claims priority to U.S. application Ser. No. 10/272,858 filed on Oct. 16, 2002. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.

BACKGROUND OF THE INVENTION

[0002] The present application describes systems and techniques relating to token stream differencing, for example, comparison of text documents to identify document changes.

[0003] Various techniques exist for comparing token streams. Such comparison is commonly referred to as differencing or as a diff operation. Differencing two token streams typically involves comparing two versions of a token stream, commonly referred to as the original stream and the modified stream, and looking for differences between them. In the context of text comparison, many differencing processes use individual text characters or words as the tokens. Such diff processes ar...
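As a point of reference for the basic diff operation described in the background, the following minimal sketch compares an original and a modified stream of word tokens and lists the differences. It relies on Python's standard `difflib.SequenceMatcher`; the function name `diff_tokens` is hypothetical, and this ordinary diff reports only insertions, deletions, and replacements, with no move detection.

```python
from difflib import SequenceMatcher

def diff_tokens(original, modified):
    """Return (tag, original_tokens, modified_tokens) for each changed span.

    tag is 'replace', 'delete', or 'insert', as reported by difflib.
    """
    matcher = SequenceMatcher(a=original, b=modified, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, original[i1:i2], modified[j1:j2]))
    return changes

original = "the cat sat".split()
modified = "the dog sat".split()
print(diff_tokens(original, modified))
```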

Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/27, G06F17/22
CPC: G06F17/277, G06F17/2211, G06F40/194, G06F40/284
Inventors: IE, WILLIAM; ALTMAN, ADAM E.; ROWE, EDWARD R. W.
Owner: ADOBE SYST INC