
Digital Investigation 16 (2016) S104–S113

DFRWS 2016 Europe – Proceedings of the Third Annual DFRWS Europe

Forensic analysis of cloud-native artifacts

Vassil Roussev*, Shane McCulley
Greater New Orleans Center for Information Assurance (GNOCIA), University of New Orleans, New Orleans, LA, 70148, USA

Keywords: Cloud forensics, Google Docs format, Reverse engineering, Cloud-native artifacts, kumodocs, kumodd

* Corresponding author. E-mail addresses: [email protected] (V. Roussev), [email protected] (S. McCulley).

http://dx.doi.org/10.1016/j.diin.2016.01.013
1742-2876/© 2016 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract

Forensic analysis of cloud artifacts is still in its infancy; current approaches overwhelmingly follow the traditional method of collecting artifacts on a client device. In this work, we introduce the concept of analyzing cloud-native digital artifacts: data objects that maintain the persistent state of web/SaaS applications. Unlike traditional applications, in which the persistent state takes the form of files in the local file system, web apps download the necessary state on the fly and leave no trace in local storage.

Using Google Docs as a case study, we demonstrate that such artifacts can have a completely different structure: their state is often maintained in the form of a complete (or partial) log of user editing actions. Thus, the traditional approach of obtaining a snapshot in time of the state of the artifacts is inherently forensically deficient in that it ignores potentially critical information on the evolution of a document over time. Further, cloud-native artifacts have no standardized external representation, which raises questions with respect to their long-term preservation and interpretation.

© 2016 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Introduction

The traditional business model of the software industry has been software as a product (SaaP); that is, software is acquired like any physical product and, once the sale is complete, the owner can use it as they see fit for an unlimited period of time. The alternative, software as a service (SaaS), is a subscription-based model, which did not start becoming practical until the emergence of widespread Internet access some two decades ago. Conceptually, the move from SaaP to SaaS shifts the responsibility for operating the software and its environment from the customer to the provider. Technologically, such a shift was enabled by the growth of the Internet as a universal means of communication, and was facilitated by the emergence of the web browser as a standardized client user interface (UI) platform.


The traditional analytical model of digital forensics has been client-centric: the investigator works with physical evidence carriers, such as storage media or integrated compute devices (e.g., smartphones). On the client (or standalone) device it is easy to identify where the computations are performed and where the results/traces are stored. Therefore, research has focused on discovering and acquiring every little piece of log and timestamp information, and extracting every last bit of discarded data that applications and the OS may have left behind.

The introduction of Gmail in 2004, the first web 2.0 application in widespread use, demonstrated that all the essential technological prerequisites for mass, web-based SaaS deployments had been met. The introduction of the first public cloud services by Amazon in 2006 enabled any vendor to rent scalable, server-side infrastructure and become a SaaS provider. A decade later, the transition to SaaS is moving at full speed, and the need to understand it forensically is becoming ever more critical.

This massive technological shift presents a qualitatively new challenge for forensics, one that cannot be addressed by minor adjustments to tools and practices. Specifically, the SaaS model disrupts the familiar client-centric world: both code and data are delivered over the network on demand, and thus become moving forensic targets. For example, a Google Docs document shows up as nothing more than a hyperlink on the local disk; the actual content is downloaded and made available for editing only in the browser.

Fig. 1. SaaS application architecture.

In this work, we approach the problem by going directly to the data source, the service provider, using both public and private APIs and data structures. This leads to a new approach that, we believe, is a preview of how cloud forensic tools will be built.

Related work

The primary focus of previous work on cloud storage forensics has been on adapting the traditional application forensics approach to finding client-side artifacts. This involves blackbox differential analysis, where before and after images are created and compared to deduce the essential functions of the application. Section (Client-based data acquisition & analysis) summarizes representative work in this area.

Section (API-based data acquisition & analysis) presents a more recent alternative, which seeks to avoid the limitations of client acquisition by working with the provider's API.

Client-based data acquisition & analysis

Chung et al. (2012) analyzed four cloud storage services (Amazon S3, Google Docs, Dropbox, and Evernote) in search of traces left by them on the client system that can be used in criminal cases. They reported that the analyzed services may create different artifacts depending on specific features of the services, and proposed a process model for forensic investigation of cloud storage services based on the collection and analysis of artifacts of the target cloud storage services from client systems. The procedure includes gathering volatile data from a Mac/Windows system (if available), and then retrieving data from the Internet history, log files, and directories. On mobile devices, they rooted an Android phone to gather data; for the iPhone, they used iTunes information, such as backup files. The objective was to check whether traces of a cloud storage service exist in the collected data.

In Hale (2013), Hale analyzes the Amazon Cloud Drive and discusses the digital artifacts left behind after an Amazon Cloud Drive account has been accessed or manipulated from a computer. There are two ways to manipulate an Amazon Cloud Drive account: one is via the web application accessible through a web browser, and the other is a client application provided by Amazon that can be installed on the system. After analyzing the two methods, he found artifacts of the interface in the web browser history and among cached files. He also found application artifacts in the Windows registry, application installation files in the default location, and an SQLite database used to keep track of pending upload/download tasks.

Quick and Choo (2013) analyzed Dropbox and discuss the artifacts left behind after a Dropbox account has been accessed or manipulated. Using hash analysis and keyword searches, they try to determine whether the client software provided by Dropbox has been used. This involves extracting the account username from browser history (Mozilla Firefox, Google Chrome, and Microsoft Internet Explorer), and tracing the use of Dropbox through several avenues, such as directory listings, prefetch files, link files, thumbnails, registry, browser history, and memory captures. In follow-up work, Quick and Choo (2014) use a similar conceptual approach to analyze the client-side operation and artifacts of Google Drive, and provide a starting point for investigators.

Martini and Choo (2013) have researched the operation of ownCloud, which is a self-hosted file synchronization and sharing solution. As such, it occupies a slightly different niche, as it is much more likely for the client and server sides to be under the control of the same person/organization. They were able to recover artifacts including sync and file management metadata (logging, database and configuration data), cached files describing the files the user has stored on the client device and uploaded to the cloud environment or vice versa, and browser artifacts.

API-based data acquisition & analysis

The client-side acquisition approaches discussed so far have one big assumption in common; namely, that all the data artifacts of interest can be acquired from the client. The problem is that this is not true in the general case, and is likely to be untrue in the common case. As illustrated in Fig. 1, the client can no longer be considered the original source of the data. Rather, it maintains a cached version that is likely incomplete (in more ways than one) and potentially out of date.

Considering the above functional architecture, there arethree major lapses in client-based acquisitions of cloud-hosted data:

Partial replication. The most obvious problem is that none of the clients working with a cloud storage account may have a complete copy of the data. Currently, cloud storage providers offer selective replication so that devices with less local storage (smartphones) are not overwhelmed. Going forward, as data accumulates online, it would become increasingly impractical (and unnecessary) to maintain a complete local copy. Amazon already offers unlimited storage at $60/year, and that is a lot of data to clone locally. From a forensic standpoint, a client-based acquisition is blind to the overall picture, and has no means to guarantee completeness.

Artifact revisions. Most storage services provide automatic revision tracking that keeps copies of previous versions of user files. However, these are not present in the client cache, and are only recalled on demand (via a web interface). Forensically, this is another dimension along which a client-based acquisition is blind and incomplete.

Cloud-native artifacts. Web applications rarely store persistent state on the client devices. There are some notable exceptions, such as the caching of user credentials and offline operation while disconnected, but the norm is that all data is hosted on the server side. This gives rise to the concept of cloud-native data, which we use to describe internal data structures used by SaaS applications that are not stored on the client persistently. This clearly creates a problem for client-side methods, as the data of forensic interest is not present locally.

In Roussev et al. (2016), we argued that the only way to fully address the first two aspects of the problem is to utilize the service provider's official API. We developed a cloud drive acquisition tool, kumodd, which can perform full API-based acquisition for four major cloud providers: Google Drive, Dropbox, Box, and Microsoft OneDrive. Our tool can enumerate and download all files associated with one of the above accounts, along with all of their revisions. Further, it can acquire snapshots of cloud-native artifacts in standard formats, such as PDF, via the API.

However, kumodd cannot possibly acquire cloud-native artifacts in their original form because they are not exposed through the official API. For example, a Google Docs document is represented as a hyperlink, and there are no means in the API to acquire the content. From a developer's point of view, such artifacts are internal data structures, and there is no necessity to provide access to them via the API. In effect, there is a private communication protocol between the client and server components of the web app that is used alongside the public one (Fig. 1).

The remainder of this discussion focuses on the acquisition and analysis of cloud-native artifacts, using Google Docs as a case study.

Understanding Google Docs

For the purposes of this discussion, we use Google Docs to refer to the entire suite of online office, productivity and collaboration tools offered by Google. We use Documents, Sheets, Slides, etc., to refer to the individual applications in that suite.

In all likelihood, the very first cloud-native tool with forensic applications is DraftBack (draftback.com): a browser extension created by the writer and programmer James Somers, which can replay the complete history of a Documents document. The primary intent of the code is to give writers the ability to look over their own shoulder and analyze how they write. Coincidentally, this is precisely what a forensic investigator would like to be able to do: rewind to any point in the life of a document, right to the very beginning.

In addition to providing in-browser playback (using the Quill open source editor (Chen and Mulligan)) of all the plaintext editing actions, either in fast-forward or real-time mode, DraftBack provides an analytical interface which maps the time of editing sessions to locations in the document (Fig. 2).

Fig. 2. DraftBack analytical interface (edited for size).

This can be used to narrow down the scope of inquiry for long-lived documents. Somers' work, although not motivated by forensics, is an example of SaaS analysis that does not rely on trace data resident on the client; all results are produced solely by (partially) reverse engineering the web application's private protocol. Assuming that an investigator is in possession of valid user credentials, or such are provided by Google under legal order, the examination can be performed on the spot; any spot with a browser and an Internet connection.

These observations served as the starting point of our own work, in an effort to build a true forensic tool that understands the needs of the investigative process.

Documents

In 2010, Google unveiled a new version of Google Docs (Google, 2010a), allowing for greater real-time online collaboration. The new Documents editor, named kix, handles rendering elements like a traditional word processor, a clear break from prior practice, in which an editable HTML element was used. Kix was "designed specifically for character-by-character real time collaboration using operational transformation" (Google, 2010b). (Operational transformation is a concurrency management mechanism that eschews preventive locking in favor of reactive, on-the-fly resolution of conflicting user actions by transforming the editing operations to achieve consistency.) Another important technical decision was to keep a detailed history of document revisions that allows users to go back to any previous version; this feature is available to any collaborator with editing privileges.

Google's approach to storing the revisions is also different from most prior solutions: rather than keep a series of snapshots, the complete history of editing actions, since the creation of the document, is retained. When a specific version is needed, the log is replayed from the beginning until the desired time; replaying the entire log yields the current version. This design means that, in effect, there is no delete operation that irrevocably destroys data, and that has important forensic (and privacy) implications.

To support fine-grain revisions, as well as collaborative editing, user actions are pushed to the server as often as every 150 ms, depending on the speed of input. In collaboration mode, these fine-grained actions, primarily insertions and deletions of text and images, are merged on the server end, and a unified history of the document is recorded. The actions, potentially transformed, are then pushed to the other clients to ensure consistent, up-to-date views of the document.

The number of major revisions available via the public API corresponds to the major revisions shown to the user. Major style changes seem to prompt more of those types of revisions; for example, our working document, where we keep track of our experiments, has over 5100 incremental revisions but only six major ones. However, the test document we used for reverse engineering purposes has 27 major revisions with fewer than 1000 incremental ones. It appears the passage of time since the last edit also plays a role; starting a new session does not seem to be enough to trigger a new major revision.

Fig. 3. Chunked snapshot for a document containing the text "Test document" (shortened).

The internal representation of the document, as delivered to the client, is in the form of a JSON object called changelog. The structure is deeply nested, but contains one array per revision, with most elements of the array containing objects (key–value pairs). Each array ends with identifying information for that revision, as follows: an epoch timestamp in Unix format, the Google ID of the author, revision number, session ID, session revision number, and the revision itself.
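To make the layout concrete, the following is a hand-written, annotated sketch (not captured output) of one such per-revision array. The field order follows the enumeration above; all values are invented, and the abbreviated key names inside the operation object (including "ty" for the action type) are illustrative rather than a complete specification:

    [
        1463745600000,                             # epoch timestamp (Unix format, ms)
        "1234567890",                              # Google ID of the author (invented)
        1,                                         # revision number
        "5a6b7c8d",                                # session ID (invented)
        1,                                         # session revision number
        {"ty": "is", "ibi": 1, "s": "Test docum"}  # the revision itself (keys abbreviated)
    ]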

Each time the document is opened, a new session is generated, and the number of revisions that occur within that session is tracked. Some revisions, such as inserting an object, appear as a single entry with multiple actions in the form of a multiset, which contains a series of nested dictionaries. The keys in the dictionary are abbreviated (2–4 characters), almost certainly for performance reasons, but not outright obfuscated.

The changelog contains a special chunked snapshot object, which contains all the information needed to create the document as of the starting revision. The length of the snapshot varies greatly depending on the number of embedded kix objects and paragraphs; it has only two entries (containing default text styles) for revisions starting at 1.

For any revision with text in the document, the first element of the snapshot consists of a plaintext string of all text in the document, followed by default styles for title, subtitle, and headings h1 through h6, the language of the document, and the first paragraph index and paragraph styles. The next several elements are all kix anchors for embedded objects like comments or suggestions, followed by a listing of each contiguous format area with the styles that should be applied to those sections, as well as paragraphs and associated IDs used to jump to those sections from a table of contents.

Fig. 3 shows the representation of a minimal example document, one in which the text "Test document" has been typed. In this case, the snapshot (starting on line 3) contains the state of the document before the very last update: the typing of the last three letters, "ent". Thus, the snapshot contains a text insertion for the string "Test docum" (line 4, highlighted), as well as a number of default style definitions and other basic document properties. The log part of the document contains a single insertion of the string "ent" (line 2, highlighted) with the appropriate timestamp and identifying information.

More generally, for a document described from revision x to revision y, there would be a snapshot of the state at revision x, followed by y − x entries in the changelog describing each individual change between revisions x and y. The ability to choose the range of changes to load allows kix to balance the flexibility of allowing users to go back in time against the need to be efficient and not needlessly replay ancient document history.

The changelog for a specific range of versions can be obtained manually by using the development tools built into modern browsers. After logging in and opening the document, the list of network requests contains a load URL of the form https://docs.google.com/documents/d/<doc_id>/load?id=<doc_id>&start=<start_rev>&end=<end_rev>&token=<auth_token>, where doc_id is the unique document identifier, start_rev is the initial revision (snapshot), end_rev is the end of the revision range, and auth_token is an authentication token (Fig. 4). The revisions start at one and must not exceed the actual number of revisions, and the start cannot be greater than the end.

To retrieve the document from the command line, we can compose a request using the URL in the address bar and the necessary authentication headers (Google Chrome provides a convenient "copy as cURL" option that constructs the full command line automatically). Alternatively, the URL could be opened in a browser window.
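For illustration, a minimal Python sketch of the same request, assuming the requests library and authentication material copied from a logged-in browser session (the token, cookie, and revision values are placeholders):

    import json
    import requests

    DOC_ID = "<doc_id>"          # unique document identifier
    START_REV, END_REV = 1, 100  # 1 <= start <= end <= total number of revisions

    url = "https://docs.google.com/documents/d/%s/load" % DOC_ID
    params = {"id": DOC_ID, "start": START_REV, "end": END_REV,
              "token": "<auth_token>"}

    # Replay the cookies/headers of an authenticated session; Chrome's
    # "copy as cURL" option shows exactly which values are needed.
    headers = {"Cookie": "<copied from the browser>"}

    response = requests.get(url, params=params, headers=headers)

    # The body may carry a short protective prefix ahead of the JSON payload;
    # stripping everything before the first brace is a simple way to parse it.
    body = response.text
    changelog = json.loads(body[body.index("{"):])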

Fig. 4. Example load request and changelog response.

To facilitate automated collection, we built a Python tool, called kumodocs, which uses the Google Drive API to acquire the changelog for a given range of versions. It also parses the JSON result and converts it into a flat CSV format to simplify its automated processing with existing tools. Each line contains a timestamp, user id, revision number, session id, session revision, and action type, followed by a dictionary of the key-value pairs involved in any modifications. This format is closer to that of traditional logs, and makes it easier both to examine the editing events manually and to use command-line text processing tools. The style modifications are encoded in dictionaries so that they can be readily used (in Python, or JavaScript) to replay the events in a different editor.
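For example, a single flattened row might look like the following (all values invented; column order as listed above):

    1463745600000,1234567890,42,5a6b7c8d,7,is,"{'ibi': 1, 's': 'Test docum'}"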

The first stage in this process is to obtain the plaintext content of the document, followed by the application of the decoded formatting styles, and the addition of embedded objects (like images). Once the changelog is acquired, obtaining the plaintext is relatively easy: apply all string insert and delete operations, and ignore everything else.
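A minimal sketch of that reduction, assuming the log has already been parsed into (action_type, payload) pairs and using the key names from Table 1 (is/ds, ibi, si/ei, s); kix indices are taken here to be 1-based, as the snapshot examples suggest:

    def extract_plaintext(entries):
        """Replay only string insertions/deletions; ignore styling, elements, etc."""
        text = ""
        for action_type, payload in entries:
            if action_type == "is":                      # insert string
                i = payload["ibi"] - 1                   # insert-before index (1-based)
                text = text[:i] + payload["s"] + text[i:]
            elif action_type == "ds":                    # delete string
                si, ei = payload["si"] - 1, payload["ei"]
                text = text[:si] + text[ei:]             # remove inclusive range
        return text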

Actions manipulating page elements, such as a table, equation, picture, etc., have a type of ae (add element), de (delete element), or te (tether element); the latter is associated with a kix anchor and kix id. Element insertions are accompanied by a multiset of style adjustments, containing large dictionaries of initialization values. Objects like comments and suggestions only contain anchor and id information in the changelog, and no actual text content.

Picture element insertions contain a source location (URL), with uploaded files containing a local URL accessible through HTML5's FileSystem API (filesystem:https://docs.google.com/persistent/docs/documents/<doc_id>/image/<image_id>). Inserting an image from Google Drive produces a source URL in the changelog from the googleusercontent.com domain (Google's CDN). Upon further examination of the HTML elements in the revision document, we established that they were referencing a different CDN link, even immediately after insertion. As expected, images inserted from URLs also had a copy in the CDN, given that the source might not be available after insertion.

By analyzing the network requests, we found that the (internal) Documents API has a renderdata method. It is used with a POST request with the same headers and query strings as the load method used to fetch the changelog:


https://docs.google.com/document/d/<doc_id>/renderdata?id=<doc_id>

The renderdata request body contains, in effect, a bulkdata request in the form:
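As a rough illustration, the body pairs each embedded image's ID with the containing document; only the cosmoId and container field names are attested in the discussion below, and the surrounding framing here is hypothetical:

    [["image", {"cosmoId": "<i_cid value from the changelog>", "container": "<doc_id>"}],
     ["image", {"cosmoId": "<another i_cid>", "container": "<doc_id>"}]]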

The cosmoId values observed correspond to the i_cid attribute of embedded pictures in the changelog, and the container is the document id. The renderdata response contains a list of the CDN-hosted URLs, which are world-readable.

To understand the behavior of CDN-stored images, we embedded two freshly taken photos (never published on the Internet) into a new document; one of the images was embedded directly from the local file system, the other one via Google Drive. After deleting both images in the document, the CDN-hosted links continued to be available (without authentication); this was tested via a script which downloaded the images every hour, and they remained available for the duration of the test (72 h).
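The probe amounts to a scheduled fetch; a minimal sketch, where the CDN link is a placeholder for a URL returned by renderdata:

    import time
    import urllib.request

    CDN_URL = "https://lh3.googleusercontent.com/<image_id>"  # placeholder link

    for _ in range(72):                       # one probe per hour, for 72 hours
        try:
            with urllib.request.urlopen(CDN_URL) as resp:
                print(time.ctime(), "still available, HTTP", resp.getcode())
        except Exception as exc:
            print(time.ctime(), "no longer available:", exc)
        time.sleep(3600)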

In a related experiment, we embedded two different pictures in a similar way into a new sheet. Then, we deleted the entire document from Google Drive; the picture links remained live for approximately another hour before disappearing. Taken together, the experiments suggest that an embedded image remains available from the CDN as long as at least one revision of a document references it; once all references are deleted, the object is garbage collected. Forensically, this is an interesting behavior that can potentially uncover very old data, long considered destroyed by its owners.

From a security perspective, the CDN design is not unreasonable: it has the security model of a dead drop, where anyone who knows the location can access it. Given the length of the identifier and its apparent randomness, it would be infeasible to guess. From the point of view of application design, it is effectively necessary to maintain CDN objects that could potentially be needed to restore a previous version of a document. However, the behavior is not necessarily intuitive to users, and can bring back to life artifacts long considered erased.


Reverting to a previous version is another operation that does not destroy the editing history of the document; instead, a revert operation containing a snapshot of the desired new state is added to the history. In other words, the reversion operation itself can later be walked back and the state before it examined, consistent with the append-only design chosen by Google.

Access to embedded Google Drawings objects is different from embedded images: the changelog references them by a unique drawing id. The drawing can then be accessed via https://docs.google.com/drawings/d/<drawing_id>/image?w=<width>&h=<height>. This URL does require Google authentication, and the authenticated user must have appropriate access permissions.

Slides and drawings

We found that the Slides app uses a similar changelog approach to transfer the state of the artifacts. That is, the data is communicated as an abstract data structure, which is interpreted and rendered by the JavaScript client code. The overall formatting of the log is similar, with the most important difference being that it is sent as an array of arrays and values, instead of the dictionaries and values used by Documents (Fig. 5). It appears that all of the keys have been removed from the dictionaries and only the values are sent, in effect, as a tuple. This makes the reverse engineering a bit more cumbersome, but it still allows us to track the mapping between chosen actions and their encoding, as before.

The first element in each update contains an integer encoding of the type field, with 15 corresponding to string insertion, 16 to text deletion, 3 to text box creation, and so on (the Appendix provides a summary of our findings). The multiset operation appears in the description of more complex events, such as the creation of a new slide, with operations detailing the type of slide inserted, followed by several insert actions, one for each text box on that slide.

Fig. 5. Slides changelog example (truncated).
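During analysis, the codes we recovered (Table 2) can be used to label raw update tuples; a small sketch, with the tuple layout deliberately simplified:

    # Operation codes recovered by differential analysis (see Table 2).
    SLIDES_OPS = {
        0: "delete box", 3: "add box", 4: "multiset", 5: "modify box",
        6: "adjust page element", 9: "adjust page style",
        12: "add slide", 13: "delete slide", 14: "move slide",
        15: "add text", 16: "delete text", 17: "adjust text style",
        18: "set slide attributes", 22: "insert table",
        43: "transition", 44: "insert image",
    }

    def label_update(update):
        """The first element of each update tuple encodes its operation type."""
        return SLIDES_OPS.get(update[0], "unknown (%r)" % (update[0],))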


The first slide consists of a title and subtitle, where the title id is i0 and the subtitle id is i1. Every other text box has a unique id, such as g675a3a03f_0_2, with the final element being incremented whenever a new text box is created with that id. The first number appears to be a version/session identifier: closing the slide and reopening the web page causes it to increase by 1; the 10-digit prefix also changes occasionally, but we have not yet established the exact circumstances. Text box descriptions have a six-element list describing the starting x, y position in the page, orientation, and scalar size associated with it. The granularity of changes for Slides is higher than in Documents, as each character has its own revision, rather than occasionally being grouped together with others. Each insertion details the id of the text box and the index at which to insert in that text box; deletion is similar, in that a text box id and range are given.

Adding a slide consists of a group of operations (a transaction) containing the creation of a slide, the setting of slide attributes, and the insertion of text boxes (based on the template). Duplicating a slide is a large transaction, consisting of the creation of the same slide type and, for each text box on the old slide, the addition of a box, as well as the respective text and style. Deletion is another transaction, where each box is deleted from the slide first, followed by the slide itself. Changing the theme of a slide creates a massive number of actions inside a transaction: an entirely new slide is created, each text box is created and has 30–40 modification actions associated with it, followed by the old slide having all of its text boxes deleted and, finally, the old slide itself deleted.

As shown in Fig. 6, the structure of the drawing objects' changelog is a simplified version of the Slides changelog. Unlike Fig. 5, this one shows the complete artifact for a drawing consisting of a single text box with the word "Test".

Fig. 6. Drawings changelog/snapshot example.


Sheets

We applied the same differential analysis approach to the Sheets app as before, and monitored the network interactions with the server to understand the protocol. It appears that Sheets, which also supports incremental versioning after every update, works differently. When a request for a specific version is performed, the response is a browser-ready HTML document. In other words, both the computations and the HTML encoding are performed on the server, and the final result is spoon-fed to the browser for rendering; no dynamic adjustments are necessary. It is still feasible to extract the state of the spreadsheet after every update; however, critical information, such as formulas, is not available.

The solution to this problem is to use the Google Sheets API, which provides the means to extract the content of individual cells, including formulas. Such an effort is beyond the scope of this discussion, in part due to the very different communication protocol adopted in Sheets. The results of the API calls to retrieve cell ranges are encoded in XML using the Atom Syndication Format (RFC 4287, 5988) and will require more complex parsing than the lightweight JSON used in Documents and Slides.
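For reference, walking such an Atom feed is straightforward with a standard XML library; a generic sketch (the feed URL and the exact namespaced element carrying the formula are provider-specific and not shown here):

    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    def atom_entries(feed_xml):
        """Yield (title, content) text pairs from an Atom feed of cell entries."""
        root = ET.fromstring(feed_xml)
        for entry in root.findall(ATOM + "entry"):
            title = entry.findtext(ATOM + "title")      # e.g., the cell address
            content = entry.findtext(ATOM + "content")  # the displayed cell value
            yield title, content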

Suggestions and comments

Suggestions are marked-up edits to the document that can be accepted, or rejected, by the collaborators; this is similar to the "track changes" mode in Microsoft Word. They are present in the changelog and are treated similarly to other changes; however, they have dedicated operation types that allow the editor to treat them differently in terms of formatting and UI (Fig. 7).

Fig. 7. Suggestion changelog example (truncated).

Fig. 8. Comment changelog example (truncated).

Comments are not explicitly represented in the changelog; instead, a kix anchor id is present (Fig. 8). Fortunately, the Google Drive API has a list method, which allows the retrieval of all comments associated with a document, including deleted ones. However, the actual content of deleted comments is stripped away; only current and resolved ones are available.
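As an illustration using the google-api-python-client with an authorized Drive v3 service object (a sketch; authorization setup omitted, and field names follow the current comments resource rather than the API version available at the time of this study):

    # 'drive' is an authorized googleapiclient service object for Drive v3.
    result = drive.comments().list(
        fileId="<doc_id>",
        includeDeleted=True,   # deleted comments are listed...
        fields="comments(author,createdTime,deleted,resolved,content)",
    ).execute()

    for comment in result.get("comments", []):
        # ...but the 'content' of deleted comments comes back stripped.
        print(comment.get("createdTime"), comment.get("deleted"),
              comment.get("content"))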

Thus, both comments and suggestions are part of the long-term history of a document and are readily recoverable via either the public or the private service interface.

PoC tool: kumodocs

At present, our proof of concept tool, kumodocs, works on Documents and Slides artifacts. Given a range of revisions (corresponding to a time interval), the tool will acquire and interpret the changelog and will produce a plaintext version of the document as of the last specified revision. Kumodocs will also acquire all embedded images that were active for at least part of the period.

For Slides, we map all text edits from the changelog to the individual text boxes, and output the result in a series of files: slide0_box0.txt, slide0_box1.txt, slide1_box0.txt, etc. (empty ones are skipped). We also acquire all suggestions and comments associated with the document, as appropriately named text files.

Summary

Our experience with analyzing the three Google Docs applications validates our motivating concerns. Namely, we saw that, even within the same suite of tools, the approaches to maintaining and rendering the internal state of the artifacts vary in a non-trivial fashion. Based on the observed differences, it is almost certain that these three products have been developed by different teams, and each product bears the stamp of its designers.

Documents is closest to how most web applications are developed: data is communicated in abstract form in structured JSON and rendered on the client. The Slides protocol is clearly not designed to be human-readable; it appears that its design is dominated by efficiency concerns. The numeric encoding of operations and the flat data structure (the use of JSON is nominal) make it faster to interpret on the client. Like Documents, the data is communicated in pure form, and all rendering is performed by the client. Sheets takes an entirely server-centric approach: all calculations and all rendering work are done on the server.

In the absence of any standardized external representations, such as the ones used by standalone client applications, there is a definite need to develop tools and (most likely) new formats that allow the acquisition, long-term preservation, and interpretation of cloud-native artifacts. Similar to DraftBack, our prototype has the ability to perform a basic playback of text editing using the Quill open source editor.

Discussion

Based on our analysis, there are several interesting implications for the forensic examination of Google Docs artifacts.

Online preview. One unexpected result is that it is not unsound for an investigator to review a document in editing mode (in order to have access to all revisions since creation). Since the history of the document is an append-only log, it is practically impossible for any user to spoil the document, as any modifications can easily be undone. However, whole-document deletion is still a problem, so we need a "write blocker" app/browser extension that monitors HTTP requests and filters out requests that could spoil the evidence.

The golden hour. It appears that Google's CDN, which hosts embedded objects (images), keeps them around for about an hour after deletion. This opens up the opportunity to potentially recover from last-minute deletions by combining methods from browser and memory forensics with the retrieval of remnant CDN objects.

Reverse engineering is still critical. Our experience shows that reverse engineering is still needed in the cloud forensics environment, although the emphasis will likely shift to network protocols. While prior work (Roussev et al., 2016) showed that the public API is a valuable source of evidence, this experience has brought back into focus the need to understand how SaaS applications work by means of reverse engineering their private protocols and data structures. Fortunately, this is graybox (and not blackbox) analysis, as we can monitor all communications and can instrument the client (JavaScript) code at critical junctures.

Long-term preservation is a challenge. One conceptual challenge is the problem of both storing the acquired evidence and retaining the ability to correctly interpret it. Internal data, like the changelog, is an irreplaceable source of evidence; however, it needs to be interpreted/rendered in order to have meaning to the analyst. Unlike traditional standalone applications, we do not have the ability to retain the application's code (which is split between the client and the server); thus, any solution would involve some level of equivalency translation. So far, we have addressed the lowest-hanging fruit: replaying plaintext editing commands and acquiring embedded images. A more complete solution would be to translate the log into commands for a comparable application, where all formatting can be faithfully reproduced so that the visual appearance is retained.

How representative is Google Docs? It is important for follow-up work to understand to what degree the design of Google Docs is representative of the broader class of online collaborative application suites. We have done some preliminary studies of several similar tools, such as Zoho Writer, Microsoft Word Online, and Dropbox Paper. The initial impression is that, as per the real-time requirements of such apps, incremental updates are continuously sent to the server, and fine-grain versions of the document are made available to the user (and the investigator). Unlike Google Docs, we did not readily identify an internal API mechanism by which the log of editing actions could be retrieved. However, there are indications that the log itself likely exists on the server and that the versions shown to the user are generated from it on the fly. There are also signs that the append-only log is an idea that appeals to developers; e.g., reverting to an older version in Zoho Writer causes it to be added to the list of user-selectable versions (tagged with "reverted"); Word and Paper have similar concepts.

Conclusion

In this work, we performed an initial examination of Google Docs artifacts in an effort to understand the challenges and opportunities presented by cloud-native artifacts. We define such artifacts as data objects used by web/SaaS applications that are hosted exclusively in the cloud infrastructure, and are not stored persistently on client devices. The specific contributions to the field are as follows:

Problem formulation. We argued that the traditional approach of client-side evidence acquisition is completely blind to cloud-native artifacts. Further, the artifacts of greatest importance are internal data structures that contain important historical information. Therefore, we need to develop forensic tools that can acquire and interpret them, as well as the means to independently render the history of an artifact for archival purposes.

Artifacts & behavior analysis. We performed an analysis of Google Docs artifacts, with a primary emphasis on the Documents and Slides applications and their changelog internal data structure. We greatly expanded upon Somers' initial analysis (Somers) and systematically documented our findings (Appendix: Changelog keys). Further, we investigated the mechanisms used for embedding objects into the artifacts, and showed that Google's CDN is a potentially vast source of recoverable data, with an apparently unlimited timeframe.

PoC tool development. The main practical result of this work is the development of a set of proof-of-concept tools that extract and process the history of Documents and Slides. Currently, we can extract the text content for any fine-grain revision, the embedded images and drawings, as well as the history of comments associated with the document. The tool is called kumodocs, and is available on GitHub at https://github.com/kumofx/kumodocs.

In addition to forensics, the tool can also be used to perform a quick privacy audit, as it will identify all images, suggestions, and comments that have been ostensibly deleted, but are still recoverable.

In the immediate future, we expect to complete the analysis of the remaining apps in the Google Docs suite, and to release a more complete specification document similar to (Metz, 2012). Following that, we expect to build a complete solution that allows for the screening, acquisition, and long-term preservation of Google Docs evidence.

Appendix. Changelog keys

This appendix contains an (incomplete) set of the operation and attribute encodings used in the two versions of the changelog. Its purpose is illustrative; a complete description will be the subject of a separate specification document.

Table 1. Documents changelog keys.

Key(s)              Interpretation

Operations
  mlti              multi-operation (transaction)
  is, ds            insert/delete string
  ae, de, ue, te    embedded elements: add, delete, update, tether (to anchor)
  rvrt              revert to earlier revision
  op                operational transformation
  sdef_ps, sdef_ts  set default paragraph/text style
  as, sm            adjust/modify
  msfd, usfd, sas   suggestion added/rejected/accepted
  sugid             suggestion id

Operation attributes
  mts               multi-operation description
  s                 string argument
  si, ei            starting/ending index
  ibi               insert before index
  tbs_al, tbs_of    table alignment/offset
  das_a             datasheet anchor

Document style & attributes
  ds_pw, ds_ph      page width/height
  ds_mt, ds_mb      top/bottom margin
  ds_ml, ds_mr      left/right margin
  lgs_l             language

Header styles
  hs_t, hs_st, hs_nt  title, subtitle, normal text
  hs_h1 ... hs_h6     headings h1 through h6

Paragraph style
  ps_hdid, ps_hd    heading id/style
  ps_al, ps_ls      horizontal alignment/line spacing
  ps_il, ps_ifl     indent line/first line (amount)
  ps_sb, ps_sa      space before/after paragraph (amount)

Text style
  ts_ff, ts_fs      font family/size
  ts_fgc, ts_bgc    foreground/background color
  ts_bd, ts_it      bold/italic
  ts_un, ts_st      underline/strikethrough
  ts_sc, ts_va      small caps/vertical alignment


Table 2. Slides changelog codes.

Code                Interpretation

Operations
  0                 delete box
  3                 add box
  4                 multiset
  5                 modify box
  6                 adjust page element
  9                 adjust page style
  12                add slide
  13                delete slide
  14                move slide
  15                add text
  16                delete text
  17                adjust text style
  18                set slide attributes
  22                insert table
  43                transition
  44                insert image

Style modifications
  [0, 1]            bold flag
  [1, 1]            italics flag
  [2, 1]            underline
  [4, hexvalue]     color
  [5, fontfamily]   font family
  [6, fontsize]     font size (6..400)
  [7, fontmod]      super/subscript font (1 = super, 2 = sub)
  [11, spacing]     line spacing (100/115/150/200)
  [12, halign]      horizontal alignment (1 = left (default), 2 = center, 3 = right, 4 = justified)
  [20, 1]           strikethrough flag
  [44, valign]      vertical alignment (0 = top, 1 = middle, 2 = bottom)


References

Chen J, Mulligan B. Quill rich text editor. URL: https://github.com/quilljs/quill/.

Chung H, Park J, Lee S, Kang C. Digital forensic investigation of cloud storage services. Digit Investig 2012;9(2):81–95. http://dx.doi.org/10.1016/j.diin.2012.05.015.

Google. The next generation of Google Docs. 2010. URL: http://googleblog.blogspot.com/2010/04/next-generation-of-google-docs.html.

Google. Google Drive blog archive: May 2010. 2010. URL: http://googledrive.blogspot.com/2010_05_01_archive.html.

Hale J. Amazon Cloud Drive forensic analysis. Digit Investig 2013;10(3):259–65. http://dx.doi.org/10.1016/j.diin.2013.04.006.

Martini B, Choo K-KR. Cloud storage forensics: ownCloud as a case study. Digit Investig 2013;10(4):287–99. http://dx.doi.org/10.1016/j.diin.2013.08.005.

Metz J. Expert witness compression format version 2 specification, working draft. 2012. https://goo.gl/iXmkBf.

Quick D, Choo K-KR. Dropbox analysis: data remnants on user machines. Digit Investig 2013;10(1):3–18. http://dx.doi.org/10.1016/j.diin.2013.02.003.

Quick D, Choo K-KR. Google Drive: forensic analysis of data remnants. J Netw Comput Appl 2014;40:179–93. http://dx.doi.org/10.1016/j.jnca.2013.09.016.

Roussev V, Barreto A, Ahmed I. Forensic acquisition of cloud drives. In: Peterson G, Shenoi S, editors. Advances in Digital Forensics, vol. XII. Springer; 2016.

Somers J. How I reverse engineered Google Docs to play back any document's keystrokes. URL: http://features.jsomers.net/how-i-reverse-engineered-google-docs/.

