Proceedings on Privacy Enhancing Technologies 2018

Steven Englehardt*, Jeffrey Han, and Arvind Narayanan*

*Corresponding Author: Steven Englehardt: Princeton University, E-mail: ste@cs.princeton.edu
Jeffrey Han: Princeton University, E-mail: jh34@alumni.princeton.edu
*Corresponding Author: Arvind Narayanan: Princeton University, E-mail: arvindn@cs.princeton.edu

I never signed up for this! Privacy implications of email tracking

Abstract: We show that the simple act of viewing emails contains privacy pitfalls for the unwary. We assembled a corpus of commercial mailing-list emails, and find a network of hundreds of third parties that track email recipients via methods such as embedded pixels. About 30% of emails leak the recipient's email address to one or more of these third parties when they are viewed. In the majority of cases, these leaks are intentional on the part of email senders, and further leaks occur if the recipient clicks links in emails. Mail servers and clients may employ a variety of defenses, but we analyze 16 servers and clients and find that they are far from comprehensive. We propose, prototype, and evaluate a new defense, namely stripping tracking tags from emails based on enhanced versions of existing web tracking protection lists.

1 Introduction

Email began as a non-interactive protocol for sending simple textual messages. But modern email clients support much of the functionality of the web, and the explosion of third-party web tracking has also extended to emails, especially mailing lists. Surprisingly, while there is a vast literature on web tracking, email tracking has seen little research.

The ostensible purpose of email tracking is for senders to know which emails have been read by which recipients. Numerous companies offer such services to email senders [11, 14, 22], and mail clients that have privacy features advertise them as a way for users to protect their privacy from email senders [20, 31, 42]. But we find that email tracking is far more sophisticated: a large network of third parties also receive this information, and it is linked to users' cookies, and hence to their activities across the web. Worse, with many email clients, third-party trackers receive the user's email address when the user views emails. Further, when users click links in emails, regardless of the email client, we find additional leaks of the email address to trackers. These privacy breaches are our primary interest in this work.

We show that much of the time, leaks of email addresses to third parties are intentional on the part of commercial email senders. The resulting links between identities and web history profiles belie the claim of "anonymous" web tracking. The practice enables onboarding, or online marketing based on offline activity [9], as well as cross-device tracking, or linking between different devices of the same user [12]. And although email addresses are not always shared with third parties in plaintext (sometimes they are hashed), we argue that hashing does little to protect privacy in this context (Section 8).

Email tracking is possible because modern graphical email clients allow rendering a subset of HTML. JavaScript is invariably stripped, but embedded images and stylesheets are allowed. These are downloaded and rendered by the email client when the user views the email (unless they are proxied by the user's email server; of the providers we studied (Section 6.2), only Gmail and Yandex do so). Crucially, many email clients, and almost all web browsers, in the case of webmail, send third-party cookies with these requests, allowing linking to web profiles. The email address is leaked by being encoded as a parameter into these third-party URLs.

When links in emails pointing to the sender's website are clicked, the resulting leaks are outside the control of the email client or the email server. Even if the link doesn't contain any identifier, the web browser that opens the link will send the user's cookie with the request. The website can then link the cookie to the user's email address; this link may have been established when the user provided her email address to the sender via a web form. Finally, the sender can pass on the email address, and other personally identifiable information (PII) if available, to embedded third parties using methods such as redirects and referrer headers.


We now outline the methods we used, our findings, and our proposed defenses against email tracking.

1.1 Methods

Building on the OpenWPM web crawler [17], we created a tool to automatically search for mailing list subscription forms on websites and fill them in. It is challenging to scale such a tool due to numerous idiosyncrasies of websites (Section 3). Our crawler visited 15,700 sites and attempted to sign up for emails on each of these. The resulting corpus contains 12,618 emails from 902 distinct senders. The tool may be of independent interest for studying questions such as PII leakage from contact forms [38].

Next, we discuss how we detect instances of PII in network traffic (Section 4.1). This is a challenging problem because data might be encoded or hashed, possibly iteratively (e.g., double hashing, or base-64 encoded and then hashed). In this study we focus exclusively on leaks of email addresses, but our techniques are agnostic to the type of PII. We examine leaks in network traffic resulting from a simulation of a user reading the corpus of emails collected as above (Section 4). We also simulated the user clicking on a sample of the links in the emails received, and looked for leaks in the resulting web traffic (Section 5).

We present a set of heuristics to classify such leakage as intentional or accidental (Section 4.1). Intentional leakage suggests a business relationship between the party sending the information and the party receiving it, whereas accidental leakage happens due to poor programming practices [23, 24].

Email providers (e.g., Gmail, employers) and email clients (e.g., Apple Mail, Thunderbird) may both employ measures to mitigate email tracking, such as proxying of images¹ or suppressing cookies. We built a tool that allows users and researchers to test the behavior of email providers and clients to assess the ability of email senders and third parties to track users. We use it ourselves to survey 16 email clients (Section 6.2).

¹ Providers proxy resources by rewriting all remote resources in an email to point to a location on the provider's server. The provider requests the resource from the third-party server, rather than the user requesting it directly.

1.2 The state of email tracking

Email tracking is pervasive. We find that 85% of emails in our corpus contain embedded third-party content, and 70% contain resources categorized as trackers by popular tracking-protection lists. There are an average of 5.2 and a median of 2 third parties per email which embeds any third-party content, and nearly 900 third parties contacted at least once. But the top ones are familiar: Google-owned third parties (Doubleclick, Google APIs, etc.) are present in one-third of emails.

We simulate users viewing emails in a full-fledged email client (Section 4). We find that about 29% of emails leak the user's email address to at least one third party, and about 19% of senders sent at least one email that had such a leak. The majority of these leaks (62%) are intentional, based on our heuristics. Tracking protection is helpful, but not perfect: it reduces the number of email leaks by 87%. Interestingly, the top trackers that receive these leaked emails are different from the top web trackers. We present a case study of the most-common tracker, LiveIntent (Section 4.5).

We also simulate users clicking on links in emails, which causes a page to load in a full-fledged web browser (Section 5). We find that 11% of links contain embedded content requests that leak the email address to a third party, and at least 35% of senders include at least one such link in one email. The top third-party domains and organizations that receive these leaked email addresses are substantially similar to the list of top third parties overall.

1.3 Evaluating and improving defenses

We identify five possible defenses against email tracking: content proxying, HTML filtering, cookie blocking, referrer blocking, and request blocking. There are three possible ways to deploy defenses: by the mail server, the mail user agent, or the web user agent (i.e., the browser that handles links that are clicked in emails). We present a systematization of how each of these entities could deploy each of these defenses (Section 6.1).

The defenses that can be deployed by web browsers to protect against leaks of emails are nearly identical to defenses against web tracking in general. This is a mature area of research, and there are numerous tools on the market based on filter lists. Based on our data analysis, we identify a list of 27,125 distinct URLs (from 133 domains) that receive leaked email addresses and are not blocked by prominent filter lists, presumably because these trackers are specific to emails (Section 7). We believe that these would make useful additions to existing filter lists. Except for this contribution, we focus our analysis of defenses on mail servers and mail user agents rather than web browsers.

Based on our analysis of 16 email servers and clients (Section 6.2), we find that a patchwork of defenses is employed, and no setup offers complete protection from the threats we identify. Perhaps the best option for privacy-conscious users today is to use webmail and install tracker-blocking extensions such as uBlock Origin or Ghostery.

We show that HTML filtering can be an effective defense. The idea is to rewrite email bodies to remove tracking elements. This can be done by either the mail server or the mail user agent. We prototype an element filtering tool based on existing tracking-protection lists, and evaluate its effectiveness (Section 7).
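To make the idea of HTML filtering concrete, the following is a minimal sketch and not the prototype evaluated in Section 7. It assumes a hypothetical set of blocked hosts (BLOCKED_HOSTS) in place of a real tracking-protection list, matches hosts exactly, and only drops img and link elements; comments and doctype declarations are not preserved.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: a hypothetical stand-in for a tracking-protection list.
BLOCKED_HOSTS = {"tracker.example"}


class TrackerStripper(HTMLParser):
    """Re-emit an email body, dropping img/link tags that point at blocked hosts."""

    def __init__(self):
        # Keep character references unconverted so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _is_tracker(self, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                if urlsplit(value).hostname in BLOCKED_HOSTS:
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "link") and self._is_tracker(attrs):
            return  # drop the tracking element entirely
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_trackers(html):
    parser = TrackerStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A mail server or mail user agent could apply a function like strip_trackers to each message body before display; a production filter would match against full filter-list rules rather than a host set.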

2 Related work

Email security and privacy. The literature on email security and privacy has focused on authentication of emails and the privacy of email contents. For example, Durumeric et al. found that the long tail of SMTP servers largely fail to deploy encryption and authentication, leaving users vulnerable to downgrade attacks, which are widespread in the wild [15]. Holz et al. also found that email is poorly secured in transit, often due to configuration errors [21]. We study an orthogonal problem. Securing email in transit will not defend against email tracking, and vice versa.

Third-party web tracking. Email tracking is an outgrowth of third-party web tracking, which has grown tremendously in prevalence and complexity since the 1990s [13, 26, 28, 35]. Today Google is the most prominent tracker through various third-party domains, and can track users across nearly 80% of sites [27]. Web tracking has expanded from simple HTTP cookies to include more persistent tracking techniques to "respawn" or re-instantiate HTTP cookies through Flash cookies [37], cache E-Tags, and HTML5 localStorage [10]. Overall, tracking is moving from stateful to stateless techniques: device fingerprinting attempts to identify users by a combination of the device's properties [16, 25]. Such techniques have advanced quickly [19, 30, 33], and are now widespread on the web [7, 8, 17, 32]. These techniques allow trackers to compile unique browsing histories, but they do not link histories to identity.

Compared to web tracking, email tracking does not use fingerprinting because (most) email clients prohibit JavaScript. On the other hand, email readily provides a unique, persistent, real-world identifier, namely the email address. Web tracking researchers have created a number of tools for detecting and measuring tracking and privacy, such as FPDetective [8], OpenWPM [17], and FourthParty [28]. We use OpenWPM for most of our measurements in this paper.

PII leakage. Leaks of PII of logged-in users from first-party websites to third parties are rampant; the early papers on this problem were by Krishnamurthy et al. [23, 24]. PII leaks enable trackers to potentially attach identities to browsing histories. More recent work includes detection of PII leakage to third parties in smartphone apps [34, 40], PII leakage in contact forms [38], PII leakage that enables cross-device tracking [12], and data leakage due to browser extensions [39].

The common problem faced by these authors (and by us) is that PII may be obfuscated. When the data collection is crowdsourced [34, 40] rather than automated, there is the further complication that the strings that constitute PII are not specified by the researcher and thus not known in advance. On the other hand, crowdsourced data collection allows obtaining numerous instances of each type of leak, which might make detection easier.

Various approaches are seen in prior work. Ren et al. employ heuristics for splitting fields in network traffic and detecting likely keys; they then apply machine learning to discriminate between PII and other fields [34]. Starov et al. apply differential testing, that is, varying the PII entered into the system and detecting the resulting changes in information flows [38]. This is challenging to apply in our context because we observed frequent A/B testing in the commercial emails in our corpus, which makes it tricky to attribute observed changes to PII. This is an area for future work. Finally, our own approach is most similar to that of Brookman et al. [12] and Starov et al. [39], who test combinations of encodings and/or hashes.

3 Collecting a dataset of emails

We now describe how we assembled a large-scale corpus of mailing-list emails. We do not attempt to study a "typical" user's mailbox, since we have no empirical data from real users' mailboxes. Rather, our goal in assembling a large corpus is to study the overall landscape of third-party tracking of emails: identify as many trackers as possible (feeding into our enhancements to existing tracking-protection lists) and as many interesting behaviors as possible (such as different hashes and encodings of email addresses).

High-level architecture of crawler
Assemble a list of sites. For each site:
  - Find pages potentially containing forms. For each page:
    - Find the best form on the page via top-down form detection and bottom-up form detection. If a form was found:
      * Fill in the form
      * Fill in any secondary forms if necessary
      * Once a form has been submitted, skip the rest of the pages and continue to next site

High-level architecture of server
Receive and store email. For each email:
  - Check for and process confirmation links

Fig. 1. High-level architecture of the email collection system, with the individual modules italicized.

To achieve scale, we use an automated approach. We created a web crawler, based on the OpenWPM web privacy measurement tool [17], to search for and fill in forms that appear to be mailing-list subscriptions. The crawler has five modules, and the server that processes emails has two modules. They are both described at a high level in Fig. 1. We now describe each of the seven modules in turn.

Assemble a list of sites. Alexa maintains a public list of the top 1 million websites based on monthly traffic statistics, as well as rankings of the top 500 websites by category. We used the "Shopping" and "News" categories, since we found them more likely to contain newsletters. In addition, we visited the top 14,700 sites of the 1 million sites, for a total of 15,700 sites.

Detect and rank forms. When the crawler cannot locate a form on the landing page, it searches through all internal links (<a> tags) in the DOM until a page containing a suitable form is found. A ranked list of terms, shown in Table 1, is used to prioritize the links most likely to contain a mailing list. On each page, forms are detected using both a top-down and bottom-up procedure. The top-down procedure examines all fields contained in <form> elements. Forms which have a higher z-index and more input fields are given a higher rank, while forms which appear to be part of user account registration are given a lower rank. If no <form> elements are found, we attempt to discover forms contained in alternative containers (e.g., forms in <div> containers) using a bottom-up procedure. We start with each <input> element and recursively examine its parents until one with a submit button is found. For further details, see Top-down form detection and Bottom-up form detection in Appendix Section 10.1.

Fill in the form. Once a form is found, the crawler must fill out the input fields such that all inputs validate. The crawler fills all visible form fields, including <input> tags, <select> tags (i.e., dropdown lists), and other submit <button> tags. Most websites use the general text type for all text inputs. We surveyed a number of top websites to determine common naming practices for input fields, and filled the fields with the data of the expected type. For example, name fields were filled with a generic first and last name. After submitting a form, we wait for a few seconds and re-run the procedure to fill follow-up fields, if required. For further details, see Determining form field type and Handling two-part form submissions in Appendix Section 10.1.

Receive and store email. We set up an SMTP server to receive emails. The server accepts any mail sent to an existing email address and rejects it otherwise. It then parses the contents of the mail and logs metadata (such as the sender address, subject text, and recipient address) to a central database. All textual portions of the message contents are written to disk. We provide implementation details in Appendix Section 10.2.

Check for and process confirmation links. Our server will check the first email sent to each email address to determine if the mailing list requires additional user interaction to confirm the subscription. If the initial email's subject or rendered body text includes the keywords "confirm", "verify", "validate", or "activate", we extract potential confirmation links from the email. For HTML emails, we collect links which match these keywords along with additional lower-priority keywords "subscribe" or "click". For plain-text emails, we simply choose the longest link text. Emails with the past-tense keywords "confirmed", "subscribed", and "activated" in subject lines are skipped, as are links with the text "unsubscribe", "cancel", "deactivate", and "view". If any link is found, it is visited using OpenWPM.
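The keyword rules for HTML emails can be sketched as follows. This is an illustrative reading of the description above, not the server's actual code: the function name, the (link_text, href) input shape, and the choice to match keywords against link text only are assumptions. Note that the exclusion checks run first, because "confirm" is a substring of "confirmed" and "activate" is a substring of "deactivate".

```python
PRIMARY = ("confirm", "verify", "validate", "activate")
SECONDARY = ("subscribe", "click")          # lower-priority keywords
PAST_TENSE = ("confirmed", "subscribed", "activated")
EXCLUDED = ("unsubscribe", "cancel", "deactivate", "view")


def pick_confirmation_link(subject, body_text, links):
    """Return the href of the most likely confirmation link, or None.

    links is a list of (link_text, href) pairs from an HTML email.
    """
    subject_l = subject.lower()
    body_l = body_text.lower()
    # Skip emails that announce an already-completed subscription.
    if any(word in subject_l for word in PAST_TENSE):
        return None
    # Only proceed if the email appears to ask for confirmation.
    if not any(word in subject_l or word in body_l for word in PRIMARY):
        return None
    ranked = []
    for text, href in links:
        text_l = text.lower()
        if any(word in text_l for word in EXCLUDED):
            continue
        if any(word in text_l for word in PRIMARY):
            ranked.append((0, href))
        elif any(word in text_l for word in SECONDARY):
            ranked.append((1, href))
    if not ranked:
        return None
    # Prefer primary keywords; keep document order among equal priorities.
    return min(ranked, key=lambda item: item[0])[1]
```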

Form submission measurement. Our crawler discovered and attempted to submit forms on 3,335 sites. We received at least one email from 1,242 (37%) of those sites. To understand the types of form submission failures, we ran a follow-up measurement in August 2017, where we took screenshots of the pages before and after the initial and follow-up form submissions. We manually examined a random sample of sites on which a form submission was attempted. We summarize the results in Table 2.

Description | Keywords | Location
Email list registration | newsletter, weekly ad, subscribe, inbox, email, sale alert | link text
Generic registration | signup, sign up, sign me up, register, create, join | link text
Generic articles/posts | article, news, 2017 | link URL
Selecting language/region | us, =us&, en-us | link URL
Blacklist | unsubscribe, mobile, phone | link text

Table 1. The web crawler chooses links to click based on keywords that appear in the link text or URL. The keywords were generated by iterating on an initial set of terms, optimizing for the success of mailing list sign-ups on the top sites. We created an initial set of search terms and manually observed the crawler interact with the top pages. Each time the crawler missed a mailing list sign-up form or failed to go to a page containing a sign-up form, we inspected the page and updated the set of keywords. This process was repeated until the crawler was successful on the sampled sites.

Submission classification of sampled sites
Total successful submissions | 38%
  → Mailing lists subscription | 32%
  → User account registration | 6%
Failed: required a CAPTCHA | 16%
Failed: unsupported form fields | 25%
Unable to classify via screenshots | 21%

Table 2. Submission success status of a sample of 252 of the 3,335 form submissions made during the sign-up crawl. The success and failure classification was determined through a manual review of screenshots taken before and after an attempted form submission.

When filling forms, our crawler will interact with user account registration forms, mailing list sign-up forms, and contact forms. The successful submissions were mostly mailing list sign-ups and a small number of user account registrations, which are included as they can be tied to a mailing list. The failed submissions were mostly caused by forms other than mailing lists. In fact, more than 70% of the failures caused by a captcha or unsupported field were not mailing list form submissions. Overall, only 11% of the sampled mailing list interactions resulted in a captcha. Since our primary focus is mailing lists, we leave the evaluation of complex and captcha-protected forms to future work.

Email corpus. The assembled corpus contains a total of 12,618 HTML emails from 902 sites. We received an average of around 14 emails per site and a median of 5. A few sites had very active mailing lists, with 20 sites sending over 100 emails during the test period. We observe that we received no spam, which we confirmed both by manual inspection of a sample of emails as well as by finding an exact one-to-one correspondence between the 902 senders in our dataset and the unique email addresses that we generated. This ensures that the results represent the behavior of the sites where we registered rather than spammers.

4 Privacy leaks when viewing emails

4.1 Measurement methodology

Simulating a webmail client. To measure web tracking in email bodies, we render the emails using a simulated webmail client in an OpenWPM instance. Many webmail clients remove a subset of HTML tags from the email body to restrict the capabilities of rendered content. In particular, JavaScript is invariably removed, while iframe tags and CSS [6] have mixed support. We simulate a permissive webmail client, one which disables JavaScript and removes the Referer header from all requests, but applies no other restrictions to the rendered content.

The email content is served on localhost, but is accessed through the domain localtest.me (which resolves to localhost) to avoid any special handling the browser may have for the local network. We configure OpenWPM to run 15 measurement instances in parallel. Each email is loaded twice in its own measurement instance: once with a fresh profile, and then again, keeping the same browser profile, after sleeping for 10 seconds. This is intended to allow remote content on the page to load both with and without browser state present. Indeed, we observe some tracking images which redirect to new domains upon every subsequent reload of the same email.


Classifying third-party content. Many email clients load embedded content directly from remote servers (we further explore the properties of email clients in Section 6.2). Thus, remote content present in multiple emails can track users in the same way third-party content can track users across sites on the web. However, unlike the web, there isn't always a clear distinction of which requests are "third-party" and which are "first-party". For example, all resources loaded by webmail clients are considered third-party by the browser. We consider any request to a domain² which is different than both the domain on which we signed up for the mailing list and the domain of the sender's email address to be a third-party request.
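The third-party rule can be illustrated with a short sketch. It assumes a tiny hard-coded suffix set (PUBLIC_SUFFIXES) in place of the full Public Suffix List, and the helper names are hypothetical; a domain is compared by its public suffix plus the one label preceding it (PS+1).

```python
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}


def ps_plus_one(hostname):
    """Return the public suffix plus the label immediately preceding it."""
    labels = hostname.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix so "co.uk" wins over a shorter match.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return hostname.lower()  # unknown suffix: fall back to the full hostname


def is_third_party(request_url, signup_url, sender_address):
    """True if the request's PS+1 differs from both the sign-up site and the sender."""
    request_domain = ps_plus_one(urlsplit(request_url).hostname or "")
    signup_domain = ps_plus_one(urlsplit(signup_url).hostname or "")
    sender_domain = ps_plus_one(sender_address.rsplit("@", 1)[-1])
    return request_domain not in (signup_domain, sender_domain)
```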

Detecting email leakage. Email addresses leak to remote servers through resource requests. Detecting these leaks is not as simple as searching for email addresses in requests, since the addresses may be hashed or encoded, sometimes iteratively. To detect such leakage, we develop a methodology that, given a set of encodings and hashes, a plaintext email address, and a URL token, is able to determine if the token is a transformation of the email address. Starting with the plaintext email address, we pre-compute a candidate set of tokens by applying all supported encodings and hashes iteratively, stopping once we reach three nested encodings or hashes. We then take the URL token and apply all supported decodings to the value, checking if the result is present in the candidate set. If not, we iteratively apply decodings until we reach a level of three nested decodings.

In a preliminary measurement we found no examples of a value that was encoded before being hashed. This is unsurprising, as hashed email addresses are used to sync data between parties, and adding a transformation before the hash would prevent that use case. Thus, when analyzing the requests in this dataset, we restrict ourselves to at most three nested hashes for a set of 24 supported hashes, including md5, sha1, and sha256. For encodings, we apply all possible combinations of 10 encodings, including base64, urlencoding, and gzip. The full list of supported hashes and encodings is given in Appendix 10.3.
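A minimal sketch of this matching procedure follows. It assumes only three hashes and two decoders rather than the full sets listed in the appendix, and the function names are illustrative. Candidates are chains of up to three nested hashes of the plaintext address; the URL token is decoded up to three levels deep and compared against that set.

```python
import base64
import hashlib
import urllib.parse

# Assumption: a small illustrative subset of the supported hashes and encodings.
HASHES = {
    "md5": lambda b: hashlib.md5(b).hexdigest().encode(),
    "sha1": lambda b: hashlib.sha1(b).hexdigest().encode(),
    "sha256": lambda b: hashlib.sha256(b).hexdigest().encode(),
}
DECODERS = {
    "base64": lambda b: base64.b64decode(b, validate=True),
    "urlencoding": lambda b: urllib.parse.unquote_to_bytes(b),
}


def candidate_set(email, max_depth=3):
    """Return the plaintext email plus every chain of up to max_depth nested hashes."""
    seen = {email.encode()}
    frontier = set(seen)
    for _ in range(max_depth):
        frontier = {h(value) for value in frontier for h in HASHES.values()}
        seen |= frontier
    return seen


def is_leak(token, candidates, max_depth=3):
    """True if the token, after at most max_depth nested decodings, is a candidate."""
    frontier = {token.encode()}
    for _ in range(max_depth + 1):
        if frontier & candidates:
            return True
        decoded = set()
        for value in frontier:
            for decode in DECODERS.values():
                try:
                    decoded.add(decode(value))
                except ValueError:
                    # Not valid input for this decoder; try the next one.
                    continue
        frontier = decoded
    return False
```

Pre-computing the hash chains once per email address keeps the per-token work limited to cheap decodings, which matters when every URL parameter in the crawl has to be checked.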

Classifying email leakage. Email leaks may not be intentional. If an email address is included in the query string or path of a document URL, it may automatically end up in the Referer header of subsequent requests from that document. Requests which result in a redirect also often add the referrer of the previous request to the query string of the new request. In many instances this happens irrespective of the presence of an email address in the original request. The situation is made more complex on the web, since third-party JavaScript can dynamically build URLs and trigger requests.

² A domain is identified by its public suffix plus the component of the hostname immediately preceding its public suffix (PS+1).

The reduced HTML support and lack of JavaScript execution in email clients makes it possible to determine intentionality for most leaks. When an email is rendered, requests can result from three sources: from elements embedded in the original HTML, from within an embedded iframe (if supported by the client), or from a redirected request.

1. If a leak occurs in a Referer header, it is unintentional. For webmail clients, the Referer header (if enabled) will be the client itself. A mail sender can embed an iframe which loads a URL that includes the user's email address, with the explicit intention that the user's email leak to third parties via the Referer header. However, we chose not to include this possibility because email senders have multiple direct options for sharing information with third parties that do not rely on the sparsely supported iframe tag.

2. If a leak occurs in a request to a resource embedded directly in the HTML of the email body (and is not the result of a redirect), it is intentional. We can determine intentionality since any request resulting from an HTML document must have been constructed by the email sender. Note that this does not hold for web documents, since embedded JavaScript can dynamically construct requests during the page visit.

3. If a request results from a redirect, the party responsible for the leak is the party whose request (i.e., the triggering URL) responded with a redirect to the new location (i.e., the target URL). We classify a leak as intentional if the leaked value is hashed between the triggering URL and the target URL, or if there are more encodings or hashes of the leaked value included in the target URL than in the triggering URL. If the target URL includes a full copy of the triggering URL (in any encoding), the leak is unintentional. All other cases are classified as ambiguous, such as the case where a target URL includes only the query string of the triggering URL.
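The three rules above can be read as a small decision procedure. The sketch below is one possible interpretation, not the authors' implementation: the Leak record, its field names, and the way "hashed between the triggering URL and the target URL" is approximated (a hash form present in the target but absent from the trigger) are all assumptions.

```python
from dataclasses import dataclass, field

HASH_NAMES = {"md5", "sha1", "sha256"}  # assumption: illustrative subset


@dataclass
class Leak:
    in_referer: bool = False                # leak observed in a Referer header
    from_redirect: bool = False             # request was the target of a redirect
    trigger_forms: list = field(default_factory=list)  # forms of the email in the triggering URL
    target_forms: list = field(default_factory=list)   # forms of the email in the target URL
    target_embeds_trigger_url: bool = False  # target URL contains a full copy of the triggering URL


def classify(leak):
    # Rule 1: Referer-header leaks are treated as unintentional.
    if leak.in_referer:
        return "unintentional"
    # Rule 2: requests embedded directly in the email HTML were built by the sender.
    if not leak.from_redirect:
        return "intentional"
    # Rule 3: redirects are attributed to the party that issued the redirect.
    newly_hashed = any(
        form in HASH_NAMES and form not in leak.trigger_forms
        for form in leak.target_forms
    )
    if newly_hashed or len(leak.target_forms) > len(leak.trigger_forms):
        return "intentional"
    if leak.target_embeds_trigger_url:
        return "unintentional"
    return "ambiguous"
```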

I never signed up for this Privacy implications of email tracking 7

Measuring blocked tags. Tracking protection tools which block resource requests offer users protection against the tracking embedded in emails. We evaluate the effectiveness of these tools by checking the requests in our dataset against two major blocklists: EasyList and EasyPrivacy [4]. These lists block advertisement and tracking related requests, and are bundled with several popular blocking extensions, including AdBlock Plus [1] and uBlock Origin [5]. We use the BlockListParser library [3] to determine if a request would have been blocked (see footnote 3) by an extension utilizing these lists. We classify a request as blocked if it matches any of the following three conditions:

1. The request directly matches the filter list.
2. The request is the result of a redirect and any request earlier in the redirect was blocked.
3. The request is loaded in an iframe and the iframe document request (or any resulting redirect) was blocked.
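The three conditions above can be expressed as a small recursive check. The sketch below is a simplified illustration under stated assumptions: the `Request` record, the URL-keyed dictionary, and the `matches_filter` callable (a stand-in for a real filter-list matcher, not the BlockListParser API) are all hypothetical, and redirect chains are assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    url: str
    # URL of the request that redirected to this one, if any (hypothetical field).
    redirected_from: Optional[str] = None
    # URL of the final iframe document this request was loaded in, if any.
    iframe_document: Optional[str] = None


def is_blocked(url, requests, matches_filter):
    """Return True if the request for `url` would not have been made."""
    req = requests[url]
    # Condition 1: the request directly matches the filter list.
    if matches_filter(req.url):
        return True
    # Condition 2: any request earlier in the redirect chain was blocked.
    if req.redirected_from is not None and is_blocked(
        req.redirected_from, requests, matches_filter
    ):
        return True
    # Condition 3: the enclosing iframe document (or a redirect leading to it)
    # was blocked.
    if req.iframe_document is not None and is_blocked(
        req.iframe_document, requests, matches_filter
    ):
        return True
    return False
```

Because the check only needs the recorded request graph, it can be run offline over crawl data without a blocking extension installed.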

It is possible to do this classification in an offline fashion because of the lack of Javascript support in email clients. This removes the need to run measurements with one of the aforementioned extensions installed. In environments that support Javascript, content can be loaded dynamically and as the result of interactions between several scripts. In such an environment it is much more difficult to determine which requests would have been blocked by a single script appearing on the block list.

4.2 Email provides much of the same tracking opportunities as the web

Remote resources embedded in email content can track users across emails. As we show in our survey of email clients (Section 6.2), many email clients allow remote resources to set persistent cookies and send those cookies with resource requests. In total, we find that 10,724 of the measured emails (85%) embed resources from at least one third party, with an average of 5 third parties per email. The distribution of embedded third parties is far from uniform: we find a median of two per email, and a small number of emails embedding as many as 50 third parties (Figure 2).

Footnote 3. We set the parser options as we would expect them to be set for a request occurring in a webmail client. For example, all requests are considered third-party requests.

Domain                  % of Emails   % of Top 1M
doubleclick.net         22.2          47.5
mathtag.com             14.2          7.9
dotomi.com              12.7          3.5
adnxs.com               12.2          13.2
tapad.com               11.0          2.6
liadm.com               11.0          0.4
returnpath.net          11.0          <0.1
bidswitch.net           10.5          4.9
fonts.googleapis.com    10.2          39.4
list-manage.com         10.1          <0.1

Table 3. Top third-party domains by percentage of the 12,618 emails in the corpus. For comparison, we show the percentage of the top 1 million websites on which these third parties are present.

Fig. 2. CDF of third parties per email, aggregating data across the initial viewing and re-opening of an email. In addition, 1.4% of emails have between 25 and 53 third parties.

Table 3 shows the top third-party domains present in email. Many of these parties also have a large presence on the web [17], blurring the line between email and web tracking. On webmail clients, requests to these cross-context third parties will use the same cookies, allowing them to track both a user's web browsing and email habits. In total, the emails visited during our crawls embed resources from 879 third parties.

4.3 Leaks of email addresses to third parties are common

In addition to being able to track email habits, 99 third parties (11%) also gain access to a user's email address, whether in plaintext or hashed. In email clients which support cookies, these third parties will receive the email address alongside any cookies they've set on the user's device. Trackers which are also present on the web will thus be able to link this address with the user's browsing history profile.

Around 19% of the 902 senders leaked the user's email address to a third party in at least one email, and in total, 29% of emails contain leaks to third parties. We find that a majority of these leaks, 62% of the 100,963 leaks to third parties, are intentional. These intentional leaks mostly occur through remote content embedded directly by the sender. Furthermore, 1% of leaks are classified as unintentional, with the remainder considered ambiguous. While we do not attempt to determine how these identifiers are being used, plaintext and hashed emails can be used for persistent tracking, cross-device tracking, and syncing information between parties.

Leak                   # of Senders    # of Recipients
MD5                    100 (11.1%)     38 (38.5%)
SHA1                   64 (7.1%)       19 (19.2%)
SHA256                 69 (7.6%)       13 (13.1%)
Plaintext Domain       55 (6.1%)       2 (2.0%)
Plaintext Address      77 (8.5%)       54 (54.5%)
URL Encoded Address    6 (0.6%)        8 (8.1%)
SHA1 of MD5            1 (0.1%)        1 (1.0%)
SHA256 of MD5          1 (0.1%)        1 (1.0%)
MD5 of MD5             1 (0.1%)        1 (1.0%)
SHA384                 1 (0.1%)        1 (1.0%)

Table 4. Email address leakage to third parties by encoding. Percentages are given out of a total of 902 senders and 99 third-party leak recipients. All hashes are of the full email address. Email "domain" is the part of the address after the "@". These appear to be a misuse of LiveIntent's API (Section 4.5).

The leaked addresses are often hashed. Although we can detect email addresses hashed with 24 different functions and up to three nested layers, we only find MD5, SHA1, and SHA256 in frequent use. Table 4 summarizes the number of senders and receivers of each encoding. The relatively low diversity of hashes and encodings suggests that these techniques are not being used to obfuscate the collection of email addresses. In fact, the query parameters which contain hashed emails sometimes identify the hash functions used in the parameter name (e.g., a string like md5=<md5 hash of email> appearing in the HTTP request). The design of APIs like LiveIntent's, which first receives an email address and then syncs with a number of other parties (Section 4.5), suggests that these hashed addresses may be used to share or link data from multiple parties.
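Detection of nested hashes can be illustrated with a short Python sketch. It is not the detection code used in the study: the four hash functions, the assumption that each layer hashes the lowercase hex digest of the previous layer, and the function names are all choices made for this example (plaintext and URL-encoded forms are omitted for brevity).

```python
import hashlib

# Assumption: a small subset of hash functions; the study checks 24.
HASHES = ("md5", "sha1", "sha256", "sha384")


def candidate_tokens(email, max_layers=3):
    """Map a label such as 'sha1 of md5' to the corresponding hex digest."""
    tokens = {}
    layer = {"": email}
    for _ in range(max_layers):
        next_layer = {}
        for label, value in layer.items():
            for name in HASHES:
                digest = hashlib.new(name, value.encode("utf-8")).hexdigest()
                new_label = name if not label else f"{name} of {label}"
                next_layer[new_label] = digest
        tokens.update(next_layer)
        layer = next_layer
    return tokens


def find_hashed_email(url, email, max_layers=3):
    """Return the labels of all hashed forms of `email` found in `url`."""
    lowered = url.lower()
    return sorted(
        label
        for label, token in candidate_tokens(email, max_layers).items()
        if token in lowered
    )
```

With four functions and three layers this precomputes 84 candidate strings per address, so scanning every request URL in a crawl for them is inexpensive.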

Recipient Organization    # of Senders
LiveIntent                68 (7.5%)
Acxiom                    46 (5.1%)
Litmus Software           28 (3.1%)
Conversant Media          26 (2.9%)
Neustar                   24 (2.7%)
apxlv.com                 18 (2.0%)
5421114717                18 (2.0%)
Trancos                   17 (1.9%)
WPP                       17 (1.9%)
548261160                 16 (1.8%)

Table 5. Top organizations receiving email address leaks by number of the 902 total senders. A domain is used in place of an organization when it isn't clear which organization it belongs to.

Table 5 identifies the top organizations (see footnote 4) which receive leaked email addresses. This shows that email address collection from emails is largely consolidated to a few major players, which are mostly distinct from the popular web trackers. In fact, only one of the top 10 organizations, Neustar, is found in the top 20 third-party organizations on the top 1 million websites, as measured by Englehardt and Narayanan [17]. Also surprising is the prevalence of leaks to IP addresses, which account for eight of the top 20 domains receiving email addresses. This may be due to the relatively ephemeral nature of newsletter emails, which removes concerns of IP address churn over time.

4.4 Reopening emails brings in new third parties

Despite the lack of Javascript support, email views are dynamic. The email content itself is static, but any remote resources embedded in it may return different responses each time the email is viewed, and even redirect to different third parties. To examine the effects of this, we load every email first with a "clean" browser profile and then again without clearing the profile. Surprisingly, the average Jaccard similarity [36] between the sets of third parties loaded during the first and second views of the same email is only 60%.
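As a reminder of how the similarity is computed, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below uses made-up third-party sets for a single email; the domain names are only illustrative and do not come from the dataset.

```python
def jaccard(first, second):
    """Jaccard similarity |A & B| / |A | B| of two collections of domains."""
    a, b = set(first), set(second)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


# Hypothetical third parties seen on the first and second view of one email.
first_view = {"doubleclick.net", "liadm.com", "bidswitch.net"}
second_view = {"doubleclick.net", "liadm.com", "adnxs.com"}
similarity = jaccard(first_view, second_view)  # 2 shared / 4 total = 0.5
```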

The majority of emails, two-thirds, load fewer third parties when the email is reopened compared to the initial view. However, about 21% of emails load at least one resource when an email is reopened that wasn't present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

Footnote 4. We map domains to organizations using the classification provided by Libert [27], adding several new email-specific organizations. When an organization could not be found, we use the PS+1.

Row  Request URL
0    http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0
1    http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN>
2    http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
3    http://x.bidswitch.net/ul_cb/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
4    http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...]
5    http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...]
6    http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent
7    http://i.liadm.com/s/19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1>
8    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc
9    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc=
10   http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1
11   http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=
12   http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent

Table 6. Redirect chain from a LiveIntent Email Tracking Pixel. URL query strings are truncated for clarity (using [...]).

The number of leaks between email loads stays relatively constant, with fewer than 50 emails leaking to new parties on the second load (see footnote 5). However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak if the redirectors share data based on the email or email hash they receive.

4.5 Case study: LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section, we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com and includes the user's plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=. The URL also includes the domain portion of the user's address.

Footnote 5. We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

In rows 2-12, the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10, LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.

4.6 Request blockers help, but don't fix the problem

Privacy conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content (see footnote 6). We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both plaintext and hashed email addresses still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

Footnote 6. Thunderbird supports most of the popular Firefox extensions, and as such Thunderbird users can also deploy these defenses. See Table 12 for more details.

Encoding               # of Senders    # of Recipients
Plaintext Address      34 (3.7%)       34 (66.7%)
MD5                    21 (2.3%)       12 (23.5%)
SHA1                   14 (1.6%)       6 (11.8%)
URL Encoded Address    4 (0.4%)        4 (7.8%)
SHA256                 4 (0.4%)        2 (3.9%)
SHA384                 1 (0.1%)        1 (2.0%)

Table 7. Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain         # of Senders
mediawallahscript.com    7
jetlore.com              4
scrippsnetworks.com      4
alocdn.com               3
richrelevance.com        3
ivitrack.com             2
intentiq.com             2
gatehousemedia.com       2
realtime.email           2
ziffimages.com           2

Table 8. The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support Javascript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit them individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href property of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. This helps ensure that we have as diverse a set of landing pages as possible by stripping fragment and query string identifiers that may not influence the landing page.
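The sampling strategy can be sketched as a round-robin draw over bins. The following is a rough illustration rather than the crawler's actual code: `ps_plus_one` is a naive, hypothetical stand-in that keeps the last two hostname labels (a real implementation would consult the Public Suffix List), and the fixed random seed is only there to make the example repeatable.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def ps_plus_one(hostname):
    """Hypothetical stand-in for a Public Suffix List lookup."""
    return ".".join(hostname.split(".")[-2:])


def sample_links(links, limit=200, seed=0):
    """Sample up to `limit` links for one sender, one per (PS+1, path) bin per pass."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for link in sorted(set(links)):
        parts = urlsplit(link)
        # Query string and fragment are ignored when choosing the bin.
        bins[(ps_plus_one(parts.hostname or ""), parts.path)].append(link)
    for bucket in bins.values():
        rng.shuffle(bucket)

    sampled = []
    while bins and len(sampled) < limit:
        for key in list(bins):
            sampled.append(bins[key].pop())  # without replacement
            if not bins[key]:
                del bins[key]
            if len(sampled) == limit:
                break
    return sampled
```

Taking one link per bin before revisiting any bin favors distinct landing pages over many copies of the same page that differ only in tracking parameters.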

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both Javascript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client's URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if the email address is contained either in the URL or in the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset, there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph except the first are slight underestimates due to our limit of 200 links per sender.

Recipient Organization    # of Senders
Google                    247 (27.4%)
Facebook                  160 (17.7%)
Twitter                   94 (10.4%)
Adobe                     81 (9.0%)
Microsoft                 73 (8.1%)
Pinterest                 72 (8.0%)
LiveIntent                69 (7.6%)
Akamai                    69 (7.6%)
Acxiom                    68 (7.5%)
AppNexus                  61 (6.8%)

Table 9. The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain          # of Senders
google-analytics.com      200 (22.2%)
doubleclick.net           196 (21.7%)
google.com                159 (17.6%)
facebook.com              154 (17.1%)
facebook.net              145 (16.1%)
fonts.googleapis.com      102 (11.3%)
googleadservices.com      96 (10.6%)
twitter.com               94 (10.4%)
googletagmanager.com      87 (9.6%)
gstatic.com               78 (8.6%)

Table 10. The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient's privacy, namely the recipient's mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail's client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail's IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

Defense              Email server    Email client    Web browser
Content proxying     X
HTML filtering       X               X
Cookie blocking                      X               X
Referrer blocking    X               X               X
Request blocking                     X               X

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn't prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user's IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user's email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably, the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent) and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
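A server-side rewrite of this kind might look like the following sketch, which uses only Python's standard-library HTML parser. It is an illustration under assumptions, not a production sanitizer: the class and function names are invented for this example, the HTML is re-serialized rather than preserved byte for byte, and processing instructions are dropped.

```python
from html import escape
from html.parser import HTMLParser


class ReferrerRewriter(HTMLParser):
    """Re-emit HTML, adding rel="noreferrer" to links and a referrer meta tag."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit(self, tag, attrs, closing=""):
        if tag == "a":
            # Keep existing rel tokens and append noreferrer if missing.
            tokens = (dict(attrs).get("rel") or "").split()
            if "noreferrer" not in tokens:
                tokens.append("noreferrer")
            attrs = [(n, v) for n, v in attrs if n != "rel"]
            attrs.append(("rel", " ".join(tokens)))
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit(tag, attrs)
        if tag == "head":
            # Document-wide Referrer Policy delivered via a meta tag.
            self.out.append('<meta name="referrer" content="no-referrer">')

    def handle_startendtag(self, tag, attrs):
        self._emit(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def add_referrer_protection(html):
    rewriter = ReferrerRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

The link attribute covers clicks, while the meta policy covers embedded resources in clients that honor it; whether a given mail client respects either mechanism would need to be tested.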

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter list based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients (see footnote 7). Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

Footnote 7. https://emailtracking.openwpm.com


Mail Client       Platform   Proxies Content   Blocks Images   Blocks Referrers   Blocks Cookies          Ext. Support
Gmail             Web        Yes               No              L: Yes, I: Yes†    Yes†                    Yes
Yahoo Mail        Web        No                Yes             L: Yes, I: No      No                      Yes
Outlook Web App   Web        No                Yes             No                 No                      Yes
Outlook.com       Web        No                No              No                 No                      Yes
Yandex Mail       Web        Yes               No              L: Yes, I: Yes†    Yes†                    Yes
GMX               Web        No                No              No                 No                      Yes
Zimbra            Web        No                Yes             No                 No                      Yes
163.com           Web        No                No              No                 No                      Yes
Sina              Web        No                No              No                 No                      Yes
Apple Mail        iOS        No                No              Yes                Yes                     No
Gmail             iOS        Yes               No              Yes                Yes                     No
Gmail             Android    Yes               No              Yes                Yes                     No
Apple Mail        Desktop    No                No              Yes                Yes                     No
Windows Mail      Desktop    No                No              Yes                No                      No
Outlook 2016      Desktop    No                Yes             Yes                No                      No
Thunderbird       Desktop    No                Yes             Yes                Optional (Default: No)  Yes

Table 12. A survey of the privacy impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird). Images are only blocked for messages considered spam. † Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python, using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
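The general shape of such a filtering module can be sketched with Python's standard library alone. This is not the authors' script: the class and function names are invented, and `is_tracker` is a hypothetical callable standing in for a filter-list match (the real prototype uses BlockListParser with EasyList and EasyPrivacy, whose API is not reproduced here).

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class ResourceFilter(HTMLParser):
    """Re-emit HTML while dropping <img> and <link> tags with blocked URLs."""

    URL_ATTRS = {"img": "src", "link": "href"}

    def __init__(self, is_tracker):
        super().__init__(convert_charrefs=False)
        self.is_tracker = is_tracker
        self.out = []

    def handle_starttag(self, tag, attrs):
        url = dict(attrs).get(self.URL_ATTRS.get(tag, ""))
        if url and self.is_tracker(url):
            return  # strip the embedded resource entirely
        self.out.append(self.get_starttag_text())  # original tag text, unchanged

    handle_startendtag = handle_starttag

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def filter_email(raw_message, is_tracker):
    """Strip blocked resources from every text/html part of a raw email."""
    msg = message_from_string(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            html_filter = ResourceFilter(is_tracker)
            html_filter.feed(part.get_content())
            html_filter.close()
            part.set_content("".join(html_filter.out), subtype="html")
    return msg
```

Because only URLs present in the stored HTML are inspected, a filter of this kind cannot see third parties reached through later redirects.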

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs, representing 133 distinct parties, which contain leaks of email addresses but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                        # of Senders
li.<PS+1>/imp                      51 (5.7%)
partner.<PS+1>                     7 (0.7%)
stripe.<PS+1>/stripeimage          4 (0.4%)
p.<PS+1>/espopen                   4 (0.4%)
api.<PS+1>/layoutssection<N>       4 (0.4%)
<PS+1>/customer-service            3 (0.3%)
mi.<PS+1>/prp                      3 (0.3%)
dmtk.<PS+1>                        3 (0.3%)
links.<PS+1>/eopen                 3 (0.3%)
eads.<PS+1>/imp                    3 (0.3%)

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
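The pattern generation described in the caption of Table 13 can be approximated as follows. This is a sketch under assumptions: the function name is invented, the PS+1 is passed in by the caller rather than computed from the Public Suffix List, "integers" is interpreted as path segments consisting only of digits, and the example URLs are made up.

```python
import re
from urllib.parse import urlsplit


def url_pattern(url, ps_plus_one):
    """Reduce a request URL to a hostname-and-path pattern."""
    parts = urlsplit(url)  # the query string and fragment are discarded
    host = parts.hostname or ""
    if host == ps_plus_one:
        host = "<PS+1>"
    elif host.endswith("." + ps_plus_one):
        host = host[: -len(ps_plus_one)] + "<PS+1>"

    segments = parts.path.split("/")
    # Drop the last path segment if it looks like a file name with an extension.
    if re.search(r"\.[A-Za-z0-9]+$", segments[-1]):
        segments = segments[:-1]
    # Assumption: only all-digit segments count as integers.
    path = "/".join("<N>" if seg.isdigit() else seg for seg in segments)
    return host + path


pattern_a = url_pattern("http://li.example.com/imp?e=user@example.com", "example.com")
pattern_b = url_pattern("http://api.example.org/layouts/section/42/banner.png", "example.org")
```

Grouping leaking URLs by such patterns across senders is what allows structurally similar first-party endpoints to be recognized as the same underlying service.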

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available (see footnote 8). It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

Footnote 8. https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
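To make the two arguments above concrete, the following is a minimal sketch, not taken from our measurement code. The log contents, the choice of unsalted SHA-256, and the lowercase normalization are all assumptions made for illustration. It shows a lookup by a known address and a toy dictionary attack over an enumerable candidate set.

```python
import hashlib
from typing import Iterable, List, Optional


def email_hash(address: str) -> str:
    """Unsalted SHA-256 of a normalized address (normalization is an assumption)."""
    return hashlib.sha256(address.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical tracker log keyed by hashed email address.
TRACKER_LOG = {
    email_hash("alice@example.com"): ["/pixel?campaign=1", "/pixel?campaign=7"],
    email_hash("bob@example.com"): ["/pixel?campaign=3"],
}


def lookup(known_address: str) -> List[str]:
    """An adversary who already knows the address just hashes it and looks it up."""
    return TRACKER_LOG.get(email_hash(known_address), [])


def dictionary_attack(target_hash: str,
                      local_parts: Iterable[str],
                      domains: Iterable[str]) -> Optional[str]:
    """Enumerate candidate addresses until one hashes to the target."""
    domains = list(domains)
    for local in local_parts:
        for domain in domains:
            candidate = f"{local}@{domain}"
            if email_hash(candidate) == target_hash:
                return candidate
    return None
```

Because every party must derive the same hash from the same address, no salt can be used, so both functions work against any log built this way.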

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing, a process in which different trackers exchange and link together their own IDs for the same user, occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies, advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
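A possible starting point for such differential testing is sketched below. This is an illustration under simplifying assumptions: it compares only query-string parameters of two requests to the same endpoint, and the function name and example URLs are hypothetical. It does nothing to separate identifier differences from A/B-testing differences.

```python
from urllib.parse import parse_qsl, urlsplit


def differing_params(url_a: str, url_b: str) -> set:
    """Names of query parameters whose values differ between two requests.

    url_a and url_b are assumed to come from the same email template sent to
    two different registered addresses. Requests to different endpoints are
    treated as not comparable.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.netloc, a.path) != (b.netloc, b.path):
        return set()
    params_a = dict(parse_qsl(a.query, keep_blank_values=True))
    params_b = dict(parse_qsl(b.query, keep_blank_values=True))
    shared = params_a.keys() & params_b.keys()
    return {name for name in shared if params_a[name] != params_b[name]}
```

Parameters flagged this way are only candidates: a per-recipient campaign variant would be flagged just like a transformed email address.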

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads. https://adblockplus.org. Online; accessed 2017-09-05.

[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup. Online; accessed 2017-09-05.

[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.

[4] EasyList and EasyPrivacy. https://easylist.to. Online; accessed 2017-09-05.

[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.

[6] CSS Support Guide for Email Clients. Campaign Source. https://www.campaignmonitor.com/css (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.


[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.

[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.

[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.

[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. 2011.

[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.

[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.

[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.

[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.

[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor MITM: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.

[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.

[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.

[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.

[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.

[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.

[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of TLS-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016.

[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.

[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.

[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.

[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.

[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet Jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.

[27] Timothy Libert. Exposing the invisible web: An analysis of third-party HTTP requests on 1 million websites. International Journal of Communication, 9:18, 2015.

[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.

[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.

[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.

[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.

[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.

[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.

[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and controlling PII leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.

[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.

[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.

[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.

[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of PII via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.

[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.

[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.

[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.

[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from its parent; thus the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login" and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
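The three ranking criteria above can be expressed as a lexicographic sort key. The sketch below is illustrative only and is not the crawler's code: the FormCandidate record, its field names, and the assumption that the z-index has already been resolved through the form's ancestors are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

LOGIN_HINTS = ("login", "sign in")
MODAL_HINTS = ("modal", "dialog")


@dataclass
class FormCandidate:
    html: str        # serialized HTML of the candidate form
    z_index: int     # resolved z-index; 0 if every ancestor is "auto" (assumption)
    num_inputs: int  # number of visible input fields


def rank_key(form: FormCandidate) -> tuple:
    """Higher tuples are better; Python compares tuples left to right."""
    text = form.html.lower()
    looks_modal = any(hint in text for hint in MODAL_HINTS)
    looks_login = any(hint in text for hint in LOGIN_HINTS)
    # 1. topmost form (modal/dialog naming breaks ties), 2. login forms lower,
    # 3. more input fields preferred.
    return (form.z_index, looks_modal, not looks_login, form.num_inputs)


def best_form(candidates: List[FormCandidate]) -> Optional[FormCandidate]:
    """Return the highest-ranked candidate, or None if there are none."""
    return max(candidates, key=rank_key) if candidates else None
```

Encoding the criteria as one tuple keeps their priority order explicit: a later criterion only matters when all earlier ones tie.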

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
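The bottom-up walk can be illustrated on a toy DOM. The Node class and the email and submit heuristics below are stand-ins invented for this sketch. A real crawler would operate on the browser's live DOM, and the walk here is written as a loop rather than recursion.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """Minimal stand-in for a DOM element (hypothetical)."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def walk(node: Node) -> Iterator[Node]:
    """Depth-first traversal of a subtree, including the root."""
    yield node
    for child in node.children:
        yield from walk(child)


def is_email_input(node: Node) -> bool:
    if node.tag != "input":
        return False
    hints = " ".join(node.attrs.values()).lower()
    return node.attrs.get("type") == "email" or "email" in hints


def is_submit(node: Node) -> bool:
    return node.tag == "button" or (
        node.tag == "input" and node.attrs.get("type") == "submit")


def find_logical_form(root: Node) -> Optional[Node]:
    """Climb from the first email input to the nearest ancestor with a submit button."""
    for node in walk(root):
        if is_email_input(node):
            container = node.parent
            while container is not None:
                if any(is_submit(n) for n in walk(container)):
                    return container
                container = container.parent
    return None
```

Stopping at the first ancestor that contains a submit button is what makes the result the smallest enclosing unit.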

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
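One way to combine the type attribute with these contextual hints is sketched below. The keyword lists are hypothetical examples rather than the lists our crawler uses, and the attribute dictionary stands in for whatever the DOM API returns.

```python
from typing import Dict

# Attributes named in the text as the most frequent carriers of hints.
HINT_ATTRS = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword lists, checked in insertion order.
KEYWORDS = {
    "email": ("email", "e-mail"),
    "phone": ("phone", "mobile"),
    "password": ("password", "passwd"),
}

# HTML5 input types that identify the field on their own.
SPECIFIC_TYPES = {"email": "email", "tel": "phone", "password": "password"}


def classify_field(attrs: Dict[str, str]) -> str:
    """Guess what an <input> asks for, from its type and contextual hints."""
    declared = attrs.get("type", "text").lower()
    if declared in SPECIFIC_TYPES:
        return SPECIFIC_TYPES[declared]
    hints = " ".join(attrs.get(name, "") for name in HINT_ATTRS).lower()
    for label, words in KEYWORDS.items():
        if any(word in hints for word in words):
            return label
    return "unknown"
```

Substring matching of this kind is deliberately loose and will misfire on some attribute values, so a real implementation would need a curated, ranked keyword list.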

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results, since duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types that are non-text (i.e., not text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
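The recursive pruning step can be sketched as follows. Note that this illustration uses Python's standard email package, not the SubEtha SMTP and JavaMail stack described above. The function name is hypothetical.

```python
from email import message_from_bytes, policy
from email.message import Message
from typing import Optional


def strip_non_text(msg: Message) -> Optional[Message]:
    """Recursively drop every leaf part whose content type is not text/*.

    Returns the pruned message, or None if the message itself is a
    non-text leaf. Multipart containers are kept even if they end up empty.
    """
    if msg.is_multipart():
        kept = []
        for part in msg.get_payload():
            pruned = strip_non_text(part)
            if pruned is not None:
                kept.append(pruned)
        msg.set_payload(kept)
        return msg
    return msg if msg.get_content_maintype() == "text" else None


def parse_and_strip(raw: bytes) -> Optional[Message]:
    """Parse raw message bytes and prune non-text parts before storage."""
    return strip_non_text(message_from_bytes(raw, policy=policy.default))
```

Because the message tree is pruned in place, the headers and multipart boundaries of the surviving parts are left untouched.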

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
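To illustrate how such a list might be used, the sketch below precomputes candidate tokens by applying transforms iteratively to an email address and then searches a URL for them. It covers only a small subset of the hashes and encodings listed above, and the function names and the default depth of 2 are assumptions made for this example rather than a description of our detection code.

```python
import base64
import hashlib
import urllib.parse
from typing import Dict, List, Tuple

# A small, illustrative subset of the supported transforms (bytes -> bytes).
TRANSFORMS = {
    "md5": lambda b: hashlib.md5(b).hexdigest().encode(),
    "sha1": lambda b: hashlib.sha1(b).hexdigest().encode(),
    "sha256": lambda b: hashlib.sha256(b).hexdigest().encode(),
    "base16": base64.b16encode,
    "base32": base64.b32encode,
    "base64": base64.b64encode,
    "urlencoding": lambda b: urllib.parse.quote(b, safe="").encode(),
}


def candidate_tokens(email: str, max_depth: int = 2) -> Dict[bytes, Tuple[str, ...]]:
    """Map each derived token to the chain of transforms that produced it.

    The empty chain () stands for the plaintext address itself.
    """
    raw = email.encode("utf-8")
    tokens = {raw: ()}
    frontier = {raw: ()}
    for _ in range(max_depth):
        next_frontier = {}
        for value, chain in frontier.items():
            for name, transform in TRANSFORMS.items():
                out = transform(value)
                if out not in tokens:
                    tokens[out] = chain + (name,)
                    next_frontier[out] = chain + (name,)
        frontier = next_frontier
    return tokens


def find_leaks(url: str, email: str, max_depth: int = 2) -> List[Tuple[str, ...]]:
    """Return the transform chains whose output appears verbatim in the URL."""
    haystack = url.encode("utf-8")
    return sorted(chain for token, chain in
                  candidate_tokens(email, max_depth).items() if token in haystack)
```

The number of candidate tokens grows exponentially with the depth, which is why exhaustive testing of all combinations is not feasible.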

10.4 Top parties redirecting to new third parties on email reload

| Redirecting party | Organization | Avg. add'l parties | S | E |
|---|---|---|---|---|
| pippio.com | Acxiom | 5.7 | 7 | 32 |
| liadm.com* | LiveIntent | 3.7 | 68 | 1097 |
| rlcdn.com | Acxiom | 1.7 | 11 | 551 |
| imiclk.com | MediaMath | 1.3 | 2 | 4 |
| mathtag.com | MediaMath | 1.1 | 11 | 382 |
| alcmpn.com | ALC† | 0.8 | 6 | 132 |
| emltrk.com | Litmus | 0.7 | 41 | 638 |
| acxiom-online.com | Acxiom | 0.4 | 2 | 33 |
| dyneml.com | PowerInbox | 0.1 | 3 | 13 |
| adnxs.com | AppNexus | 0.1 | 19 | 277 |

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (S), out of 902 total, and the number of emails (E), out of 12,618 total, on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don't fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload


We now outline the methods we used, our findings, and our proposed defenses against email tracking.

1.1 Methods

Building on the OpenWPM web crawler [17], we created a tool to automatically search for mailing list subscription forms on websites and fill them in. It is challenging to scale such a tool due to numerous idiosyncrasies of websites (Section 3). Our crawler visited 15,700 sites and attempted to sign up for emails on each of these. The resulting corpus contains 12,618 emails from 902 distinct senders. The tool may be of independent interest for studying questions such as PII leakage from contact forms [38].

Next, we discuss how we detect instances of PII in network traffic (Section 4.1). This is a challenging problem because data might be encoded or hashed, possibly iteratively (e.g., double hashing, or base-64 encoded and then hashed). In this study, we focus exclusively on leaks of email addresses, but our techniques are agnostic to the type of PII. We examine leaks in network traffic resulting from a simulation of a user reading the corpus of emails collected as above (Section 4). We also simulated the user clicking on a sample of the links in the emails received, and looked for leaks in the resulting web traffic (Section 5).

We present a set of heuristics to classify such leakage as intentional or accidental (Section 4.1). Intentional leakage suggests a business relationship between the party sending the information and the party receiving it, whereas accidental leakage happens due to poor programming practices [23, 24].

Email providers (e.g., Gmail, employers) and email clients (e.g., Apple Mail, Thunderbird) may both employ measures to mitigate email tracking, such as proxying of images¹ or suppressing cookies. We built a tool that allows users and researchers to test the behavior of email providers and clients to assess the ability of email senders and third parties to track users. We use it ourselves to survey 16 email clients (Section 6.2).

¹ Providers proxy resources by rewriting all remote resources in an email to point to a location on the provider's server. The provider requests the resource from the third-party server, rather than the user requesting it directly.

1.2 The state of email tracking

Email tracking is pervasive. We find that 85% of emails in our corpus contain embedded third-party content, and 70% contain resources categorized as trackers by popular tracking-protection lists. There are an average of 5.2 and a median of 2 third parties per email which embeds any third-party content, and nearly 900 third parties contacted at least once. But the top ones are familiar: Google-owned third parties (Doubleclick, Google APIs, etc.) are present in one-third of emails.

We simulate users viewing emails in a full-fledged email client (Section 4). We find that about 29% of emails leak the user's email address to at least one third party, and about 19% of senders sent at least one email that had such a leak. The majority of these leaks (62%) are intentional, based on our heuristics. Tracking protection is helpful, but not perfect: it reduces the number of email leaks by 87%. Interestingly, the top trackers that receive these leaked emails are different from the top web trackers. We present a case study of the most common tracker, LiveIntent (Section 4.5).

We also simulate users clicking on links in emails, which causes a page to load in a full-fledged web browser (Section 5). We find that 11% of links contain embedded content requests that leak the email address to a third party, and at least 35% of senders include at least one such link in one email. The top third-party domains and organizations that receive these leaked email addresses are substantially similar to the list of top third parties overall.

1.3 Evaluating and improving defenses

We identify five possible defenses against email tracking: content proxying, HTML filtering, cookie blocking, referrer blocking, and request blocking. There are three possible ways to deploy defenses: by the mail server, the mail user agent, or the web user agent (i.e., the browser that handles links that are clicked in emails). We present a systematization of how each of these entities could deploy each of these defenses (Section 6.1).

The defenses that can be deployed by web browsers to protect against leaks of emails are nearly identical to defenses against web tracking in general. This is a mature area of research, and there are numerous tools on the market based on filter lists. Based on our data analysis, we identify a list of 27,125 distinct URLs (from 133 domains) that receive leaked email addresses and are not blocked by prominent filter lists, presumably because these trackers are specific to emails (Section 7). We believe that these would make useful additions to existing filter lists. Except for this contribution, we focus our analysis of defenses on mail servers and mail user agents rather than web browsers.

Based on our analysis of 16 email servers and clients (Section 6.2), we find that a patchwork of defenses is employed, and no setup offers complete protection from the threats we identify. Perhaps the best option for privacy-conscious users today is to use webmail and install tracker-blocking extensions such as uBlock Origin or Ghostery.

We show that HTML filtering can be an effective defense. The idea is to rewrite email bodies to remove tracking elements. This can be done by either the mail server or the mail user agent. We prototype an element filtering tool based on existing tracking-protection lists, and evaluate its effectiveness (Section 7).
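The idea behind HTML filtering can be illustrated with a deliberately simplified sketch. It is not our prototype: it handles only <img> tags, uses regular expressions instead of a real HTML parser, and matches against a hypothetical set of blocked domains rather than the rule syntax of actual tracking-protection lists.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blocklist; a real deployment would load filter-list rules.
BLOCKED_DOMAINS = {"tracker.example"}

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(r"""\bsrc\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def filter_email_html(html: str) -> str:
    """Remove <img> tags that would fetch a resource from a blocked domain."""
    def replace(match):
        src = SRC_ATTR.search(match.group(0))
        if src and is_blocked(src.group(1)):
            return ""
        return match.group(0)
    return IMG_TAG.sub(replace, html)
```

Since the filtering happens before the message is rendered, the request to the tracker is never made, regardless of how the mail client handles remote content.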

2 Related work

Email security and privacy. The literature on email security and privacy has focused on authentication of emails and the privacy of email contents. For example, Durumeric et al. found that the long tail of SMTP servers largely fail to deploy encryption and authentication, leaving users vulnerable to downgrade attacks, which are widespread in the wild [15]. Holz et al. also found that email is poorly secured in transit, often due to configuration errors [21]. We study an orthogonal problem. Securing email in transit will not defend against email tracking, and vice versa.

Third-party web tracking. Email tracking is an outgrowth of third-party web tracking, which has grown tremendously in prevalence and complexity since the 1990s [13, 26, 28, 35]. Today Google is the most prominent tracker through various third-party domains, and can track users across nearly 80% of sites [27]. Web tracking has expanded from simple HTTP cookies to include more persistent tracking techniques to "respawn" or re-instantiate HTTP cookies through Flash cookies [37], cache E-Tags, and HTML5 localStorage [10]. Overall, tracking is moving from stateful to stateless techniques: device fingerprinting attempts to identify users by a combination of the device's properties [16, 25]. Such techniques have advanced quickly [19, 30, 33], and are now widespread on the web [7, 8, 17, 32]. These techniques allow trackers to compile unique browsing histories, but they do not link histories to identity.

Compared to web tracking, email tracking does not use fingerprinting because (most) email clients prohibit JavaScript. On the other hand, email readily provides a unique, persistent, real-world identifier, namely the email address. Web tracking researchers have created a number of tools for detecting and measuring tracking and privacy, such as FPDetective [8], OpenWPM [17], and FourthParty [28]. We use OpenWPM for most of our measurements in this paper.

PII leakage. Leaks of PII of logged-in users from first-party websites to third parties are rampant; the early papers on this problem were by Krishnamurthy et al. [23, 24]. PII leaks enable trackers to potentially attach identities to browsing histories. More recent work includes detection of PII leakage to third parties in smartphone apps [34, 40], PII leakage in contact forms [38], PII leakage that enables cross-device tracking [12], and data leakage due to browser extensions [39].

The common problem faced by these authors (and by us) is that PII may be obfuscated. When the data collection is crowdsourced [34, 40] rather than automated, there is the further complication that the strings that constitute PII are not specified by the researcher and thus not known in advance. On the other hand, crowdsourced data collection allows obtaining numerous instances of each type of leak, which might make detection easier.

Various approaches are seen in prior work. Ren et al. employ heuristics for splitting fields in network traffic and detecting likely keys; they then apply machine learning to discriminate between PII and other fields [34]. Starov et al. apply differential testing, that is, varying the PII entered into the system and detecting the resulting changes in information flows [38]. This is challenging to apply in our context because we observed frequent A/B testing in the commercial emails in our corpus, which makes it tricky to attribute observed changes to PII. This is an area for future work. Finally, our own approach is most similar to that of Brookman et al. [12] and Starov et al. [39], who test combinations of encodings and/or hashes.

3 Collecting a dataset of emails

We now describe how we assembled a large-scale corpus of mailing-list emails. We do not attempt to study a "typical" user's mailbox, since we have no empirical data from real users' mailboxes. Rather, our goal in assembling a large corpus is to study the overall landscape of third-party tracking of emails: identify as many trackers as possible (feeding into our enhancements to existing tracking-protection lists), and as many interesting behaviors as possible (such as different hashes and encodings of email addresses).

High-level architecture of crawler
Assemble a list of sites. For each site:
– Find pages potentially containing forms. For each page:
  – Find the best form on the page via top-down form detection and bottom-up form detection. If a form was found:
    * Fill in the form
    * Fill in any secondary forms if necessary
    * Once a form has been submitted, skip the rest of the pages and continue to next site

High-level architecture of server
Receive and store email. For each email:
– Check for and process confirmation links

Fig. 1. High-level architecture of the email collection system, with the individual modules italicized.

To achieve scale, we use an automated approach. We created a web crawler based on the OpenWPM web privacy measurement tool [17] to search for and fill in forms that appear to be mailing-list subscriptions. The crawler has five modules, and the server that processes emails has two modules. They are both described at a high level in Fig. 1. We now describe each of the seven modules in turn.

Assemble a list of sites. Alexa maintains a public list of the top 1 million websites based on monthly traffic statistics, as well as rankings of the top 500 websites by category. We used the "Shopping" and "News" categories, since we found them more likely to contain newsletters. In addition, we visited the top 14,700 sites of the 1 million sites, for a total of 15,700 sites.

Detect and rank forms. When the crawler cannot locate a form on the landing page, it searches through all internal links (<a> tags) in the DOM until a page containing a suitable form is found. A ranked list of terms, shown in Table 1, is used to prioritize the links most likely to contain a mailing list. On each page, forms are detected using both a top-down and bottom-up procedure. The top-down procedure examines all fields contained in <form> elements. Forms which have a higher z-index and more input fields are given a higher rank, while forms which appear to be part of user account registration are given a lower rank. If no <form> elements are found, we attempt to discover forms contained in alternative containers (e.g., forms in <div> containers) using a bottom-up procedure. We start with each <input> element and recursively examine its parents until one with a submit button is found. For further details, see Top-down form detection and Bottom-up form detection in Appendix Section 10.1.
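As an illustration of the keyword-based link prioritization, the following is a minimal sketch rather than the crawler's actual code. The keyword tiers are transcribed from Table 1, and treating the table's row order as the rank order is our assumption; the function names `link_rank` and `prioritize` are hypothetical, and the very short URL term "us" is omitted because a plain substring match on it would fire on almost every URL.

```python
# Keyword tiers transcribed from Table 1; using row order as priority is an assumption.
TIERS = [
    ("text", ["newsletter", "weekly ad", "subscribe", "inbox", "email", "sale alert"]),
    ("text", ["signup", "sign up", "sign me up", "register", "create", "join"]),
    ("url", ["article", "news", "2017"]),
    ("url", ["=us&", "en-us"]),
]
BLACKLIST = ["unsubscribe", "mobile", "phone"]  # matched against link text


def link_rank(link_text, url):
    """Return the index of the first matching tier (lower is better), or None."""
    fields = {"text": link_text.lower(), "url": url.lower()}
    # The blacklist is checked first, so "unsubscribe" never matches "subscribe".
    if any(term in fields["text"] for term in BLACKLIST):
        return None
    for rank, (location, keywords) in enumerate(TIERS):
        if any(term in fields[location] for term in keywords):
            return rank
    return None


def prioritize(links):
    """Order (link_text, url) pairs by tier, dropping links that match nothing."""
    ranked = [(link_rank(text, url), text, url) for text, url in links]
    kept = [item for item in ranked if item[0] is not None]
    return [(text, url) for _, text, url in sorted(kept, key=lambda item: item[0])]
```

Under these assumptions, a link labeled "Get our newsletter" would be visited before one labeled "Create account", while "Unsubscribe" is never followed.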

Fill in the form. Once a form is found, the crawler must fill out the input fields such that all inputs validate. The crawler fills all visible form fields, including <input> tags, <select> tags (i.e., dropdown lists), and other submit <button> tags. Most websites use the general text type for all text inputs. We surveyed a number of top websites to determine common naming practices for input fields, and filled the fields with the data of the expected type. For example, name fields were filled with a generic first and last name. After submitting a form, we wait for a few seconds and re-run the procedure to fill follow-up fields, if required. For further details, see Determining form field type and Handling two-part form submissions in Appendix Section 10.1.

Receive and store email. We set up an SMTP server to receive emails. The server accepts any mail sent to an existing email address, and rejects it otherwise. It then parses the contents of the mail, and logs metadata (such as the sender address, subject text, and recipient address) to a central database. All textual portions of the message contents are written to disk. We provide implementation details in Appendix Section 10.2.
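The parsing step can be sketched with Python's standard `email` package. This is only an illustration of extracting the metadata fields named above and the textual parts of a message; the function name `extract_metadata` and the returned structure are our assumptions, and the SMTP receiving and database logging are omitted.

```python
from email import message_from_string
from email.policy import default


def extract_metadata(raw_message):
    """Return (metadata dict, list of textual parts) for one raw email."""
    msg = message_from_string(raw_message, policy=default)
    metadata = {
        "sender": str(msg["From"]),
        "recipient": str(msg["To"]),
        "subject": str(msg["Subject"]),
    }
    # Keep only text/* parts (e.g. text/plain, text/html); containers are skipped.
    text_parts = [
        part.get_content()
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]
    return metadata, text_parts
```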

Check for and process confirmation links. Our server will check the first email sent to each email address to determine if the mailing list requires additional user interaction to confirm the subscription. If the initial email's subject or rendered body text includes the keywords "confirm", "verify", "validate", or "activate", we extract potential confirmation links from the email. For HTML emails, we collect links which match these keywords, along with additional lower-priority keywords "subscribe" or "click". For plain-text emails, we simply choose the longest link text. Emails with the past-tense keywords "confirmed", "subscribed", and "activated" in subject lines are skipped, as are links with the text "unsubscribe", "cancel", "deactivate", and "view". If any link is found, it is visited using OpenWPM.
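The keyword rules for HTML emails can be sketched as follows. This is a simplified illustration, not the server's actual code: the function names are hypothetical, matching is by case-insensitive substring on the link text (an assumption, since the matching details are not given here), and the plain-text case is left out.

```python
CONFIRM_WORDS = ("confirm", "verify", "validate", "activate")
LOW_PRIORITY_WORDS = ("subscribe", "click")
DONE_WORDS = ("confirmed", "subscribed", "activated")   # past tense, in subject lines
SKIP_LINK_WORDS = ("unsubscribe", "cancel", "deactivate", "view")


def needs_confirmation(subject, body_text):
    """Decide whether the first email appears to ask for a confirmation step."""
    subject_l = subject.lower()
    # Past-tense subjects are checked first, since "confirmed" contains "confirm".
    if any(word in subject_l for word in DONE_WORDS):
        return False
    haystack = subject_l + " " + body_text.lower()
    return any(word in haystack for word in CONFIRM_WORDS)


def pick_confirmation_links(links):
    """Given (link_text, href) pairs, return hrefs with high-priority matches first."""
    high, low = [], []
    for text, href in links:
        text_l = text.lower()
        if any(word in text_l for word in SKIP_LINK_WORDS):
            continue
        if any(word in text_l for word in CONFIRM_WORDS):
            high.append(href)
        elif any(word in text_l for word in LOW_PRIORITY_WORDS):
            low.append(href)
    return high + low
```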

Form submission measurement. Our crawler discovered and attempted to submit forms on 3,335 sites. We received at least one email from 1,242 (37%) of those sites. To understand the types of form submission failures, we ran a follow-up measurement in August 2017, where we took screenshots of the pages before and after the initial and follow-up form submissions. We manually examined a random sample of sites on which a form submission was attempted. We summarize the results in Table 2.

Description                  Keywords                                                     Location
Email list registration      newsletter, weekly ad, subscribe, inbox, email, sale alert   link text
Generic registration         signup, sign up, sign me up, register, create, join          link text
Generic articles/posts       article, news, 2017                                          link URL
Selecting language/region    us, =us&, en-us                                              link URL
Blacklist                    unsubscribe, mobile, phone                                   link text

Table 1. The web crawler chooses links to click based on keywords that appear in the link text or URL. The keywords were generated by iterating on an initial set of terms, optimizing for the success of mailing list sign-ups on the top sites. We created an initial set of search terms and manually observed the crawler interact with the top pages. Each time the crawler missed a mailing list sign-up form or failed to go to a page containing a sign-up form, we inspected the page and updated the set of keywords. This process was repeated until the crawler was successful on the sampled sites.

Submission classification of sampled sites
Total successful submissions          38%
  → Mailing list subscription         32%
  → User account registration          6%
Failed: required a CAPTCHA            16%
Failed: unsupported form fields       25%
Unable to classify via screenshots    21%

Table 2. Submission success status of a sample of 252 of the 3,335 form submissions made during the sign-up crawl. The success and failure classification was determined through a manual review of screenshots taken before and after an attempted form submission.

When filling forms, our crawler will interact with user account registration forms, mailing list sign-up forms, and contact forms. The successful submissions were mostly mailing list sign-ups, and a small number of user account registrations, which are included as they can be tied to a mailing list. The failed submissions were mostly caused by forms other than mailing lists. In fact, more than 70% of the failures caused by a captcha or unsupported field were not mailing list form submissions. Overall, only 11% of the sampled mailing list interactions resulted in a captcha. Since our primary focus is mailing lists, we leave the evaluation of complex and captcha-protected forms to future work.

Email corpus. The assembled corpus contains a total of 12,618 HTML emails from 902 sites. We received an average of around 14 emails per site and a median of 5. A few sites had very active mailing lists, with 20 sites sending over 100 emails during the test period. We observe that we received no spam, which we confirmed both by manual inspection of a sample of emails as well as by finding an exact one-to-one correspondence between the 902 senders in our dataset and the unique email addresses that we generated. This ensures that the results represent the behavior of the sites where we registered, rather than spammers.

4 Privacy leaks when viewing emails

4.1 Measurement methodology

Simulating a webmail client. To measure web tracking in email bodies, we render the emails using a simulated webmail client in an OpenWPM instance. Many webmail clients remove a subset of HTML tags from the email body to restrict the capabilities of rendered content. In particular, Javascript is always removed, while iframe tags and CSS [6] have mixed support. We simulate a permissive webmail client, one which disables Javascript and removes the Referer header from all requests, but applies no other restrictions to the rendered content.

The email content is served on localhost, but is accessed through the domain localtest.me (which resolves to localhost) to avoid any special handling the browser may have for the local network. We configure OpenWPM to run 15 measurement instances in parallel. Each email is loaded twice in its own measurement instance: once with a fresh profile, and then again keeping the same browser profile after sleeping for 10 seconds. This is intended to allow remote content on the page to load both with and without browser state present. Indeed, we observe some tracking images which redirect to new domains upon every subsequent reload of the same email.


Classifying third-party content. Many email clients load embedded content directly from remote servers (we further explore the properties of email clients in Section 6.2). Thus, remote content present in multiple emails can track users in the same way third-party content can track users across sites on the web. However, unlike the web, there isn't always a clear distinction of which requests are "third-party" and which are "first-party". For example, all resources loaded by webmail clients are considered third-party by the browser. We consider any request to a domain² which is different than both the domain on which we signed up for the mailing list and the domain of the sender's email address to be a third-party request.
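This definition can be sketched in a few lines. The example below is an illustration under stated assumptions: `PUBLIC_SUFFIXES` is a tiny hard-coded stand-in for the real Public Suffix List, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}


def ps_plus_one(hostname):
    """Return the public suffix plus the label immediately preceding it (PS+1)."""
    labels = hostname.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        # Scanning from the left finds the longest matching public suffix first.
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return hostname.lower()


def is_third_party(request_url, signup_domain, sender_address):
    """A request is third-party if its PS+1 differs from both first-party domains."""
    request_domain = ps_plus_one(urlsplit(request_url).hostname or "")
    sender_domain = ps_plus_one(sender_address.rsplit("@", 1)[-1])
    return request_domain not in {ps_plus_one(signup_domain), sender_domain}
```

For instance, with a sign-up site of www.washingtonexaminer.com, a request to inbox.washingtonexaminer.com is first-party, while a request to x.bidswitch.net is third-party.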

Detecting email leakage. Email addresses leak to remote servers through resource requests. Detecting these leaks is not as simple as searching for email addresses in requests, since the addresses may be hashed or encoded, sometimes iteratively. To detect such leakage, we develop a methodology that, given a set of encodings and hashes, a plaintext email address, and a URL token, is able to determine if the token is a transformation of the email address. Starting with the plaintext email address, we pre-compute a candidate set of tokens by applying all supported encodings and hashes iteratively, stopping once we reach three nested encodings or hashes. We then take the URL token and apply all supported decodings to the value, checking if the result is present in the candidate set. If not, we iteratively apply decodings until we reach a level of three nested decodings.

In a preliminary measurement, we found no examples of a value that was encoded before being hashed. This is unsurprising, as hashed email addresses are used to sync data between parties, and adding a transformation before the hash would prevent that use case. Thus, when analyzing the requests in this dataset, we restrict ourselves to at most three nested hashes for a set of 24 supported hashes, including md5, sha1, and sha256. For encodings, we apply all possible combinations of 10 encodings, including base64, urlencoding, and gzip. The full list of supported hashes and encodings is given in Appendix 10.3.
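A minimal sketch of this detection procedure is shown below. It is illustrative only: it supports three hashes and two encodings rather than the full set in Appendix 10.3, it assumes hashes are exchanged as lowercase hex digests, and the function names are our own.

```python
import base64
import hashlib
from urllib.parse import unquote

HASHES = ("md5", "sha1", "sha256")  # a small subset of the supported hashes


def candidate_set(email, depth=3):
    """Pre-compute the plaintext address plus up to `depth` nested hex digests."""
    layer, seen = {email}, {email}
    for _ in range(depth):
        layer = {
            hashlib.new(name, value.encode()).hexdigest()
            for value in layer
            for name in HASHES
        }
        seen |= layer
    return seen


def _decodings(token):
    """Apply each supported decoding once; drop results that change nothing."""
    results = {unquote(token)}
    try:
        results.add(base64.b64decode(token, validate=True).decode("utf-8"))
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        pass
    results.discard(token)
    return results


def is_leak(token, email, depth=3):
    """Check the token, then up to `depth` layers of nested decodings."""
    candidates = candidate_set(email)
    layer = {token}
    for _ in range(depth + 1):
        if layer & candidates:
            return True
        layer = {decoded for value in layer for decoded in _decodings(value)}
        if not layer:
            return False
    return False
```

Because no value was found to be encoded before being hashed, the sketch only hashes on the candidate side and only decodes on the token side.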

Classifying email leakage. Email leaks may not be intentional. If an email address is included in the query string or path of a document URL, it may automatically end up in the Referer header of subsequent requests from that document. Requests which result in a redirect also often add the referrer of the previous request to the query string of the new request. In many instances this happens irrespective of the presence of an email address in the original request. The situation is made more complex on the web, since third-party Javascript can dynamically build URLs and trigger requests.

2 A domain is identified by its public suffix plus the component of the hostname immediately preceding its public suffix (PS+1).

The reduced HTML support and lack of Javascript execution in email clients make it possible to determine intentionality for most leaks. When an email is rendered, requests can result from three sources: from elements embedded in the original HTML, from within an embedded iframe (if supported by the client), or from a redirected request.

1. If a leak occurs in a Referer header, it is unintentional. For webmail clients, the Referer header (if enabled) will be the client itself. A mail sender can embed an iframe which loads a URL that includes the user's email address, with the explicit intention that the user's email leak to third parties via the Referer header. However, we chose not to include this possibility because email senders have multiple direct options for sharing information with third parties that do not rely on the sparsely supported iframe tag.

2. If a leak occurs in a request to a resource embedded directly in the HTML of the email body (and is not the result of a redirect), it is intentional. We can determine intentionality since any request resulting from an HTML document must have been constructed by the email sender. Note that this does not hold for web documents, since embedded Javascript can dynamically construct requests during the page visit.

3. If a request results from a redirect, the party responsible for the leak is the party whose request (i.e., the triggering URL) responded with a redirect to the new location (i.e., the target URL). We classify a leak as intentional if the leaked value is hashed between the triggering URL and the target URL, or if there are more encodings or hashes of the leaked value included in the target URL than in the triggering URL. If the target URL includes a full copy of the triggering URL (in any encoding), the leak is unintentional. All other cases are classified as ambiguous, such as the case where a target URL includes only the query string of the triggering URL.
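The three rules above can be sketched as a small classifier. This is a simplification under stated assumptions: only the plaintext address and three hex-digest hashes are recognized, "in any encoding" is reduced to URL-encoding, and the names `leaked_forms` and `classify_leak` are hypothetical.

```python
import hashlib
from urllib.parse import unquote


def leaked_forms(url, email):
    """Return which forms of the address (plaintext or hash) appear in a URL."""
    forms = {"plaintext": email}
    for name in ("md5", "sha1", "sha256"):
        forms[name] = hashlib.new(name, email.encode()).hexdigest()
    decoded = unquote(url)
    return {name for name, value in forms.items() if value in decoded}


def classify_leak(source, email, target_url=None, triggering_url=None):
    """source is 'referer', 'embedded', or 'redirect' (rules 1 to 3 above)."""
    if source == "referer":
        return "unintentional"
    if source == "embedded":
        return "intentional"
    before = leaked_forms(triggering_url, email)
    after = leaked_forms(target_url, email)
    newly_hashed = (after - before) - {"plaintext"}
    if newly_hashed or len(after) > len(before):
        return "intentional"
    if triggering_url in unquote(target_url):
        return "unintentional"
    return "ambiguous"
```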


Measuring blocked tags. Tracking protection tools which block resource requests offer users protection against the tracking embedded in emails. We evaluate the effectiveness of these tools by checking the requests in our dataset against two major blocklists, EasyList and EasyPrivacy [4]. These lists block advertisement and tracking related requests, and are bundled with several popular blocking extensions, including AdBlock Plus [1] and uBlock Origin [5]. We use the BlockListParser library [3] to determine if a request would have been blocked³ by an extension utilizing these lists. We classify a request as blocked if it matches any of the following three conditions:

1. The request directly matches the filter list.
2. The request is the result of a redirect, and any request earlier in the redirect was blocked.
3. The request is loaded in an iframe, and the iframe document request (or any resulting redirect) was blocked.

It is possible to do this classification in an offline fashion because of the lack of Javascript support in email clients. This removes the need to run measurements with one of the aforementioned extensions installed. In environments that support Javascript, content can be loaded dynamically and as the result of interactions between several scripts. In such an environment it is much more difficult to determine which requests would have been blocked by a single script appearing on the block list.
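The offline classification amounts to propagating a "blocked" flag along redirect chains and into iframes. The sketch below illustrates this under assumptions of our own: the request log format (dicts with `id`, `url`, and optional `redirect_from` and `iframe_parent` keys, listed in load order) is hypothetical, and `matches_filter` is a caller-supplied predicate standing in for the filter-list match rather than the BlockListParser API.

```python
def classify_blocked(request_log, matches_filter):
    """Map request id -> True if the request would have been blocked.

    request_log must list each request after the request it was redirected
    from and after the iframe document request that loaded it.
    """
    blocked = {}
    for req in request_log:
        blocked[req["id"]] = (
            matches_filter(req["url"])                          # condition 1
            or blocked.get(req.get("redirect_from"), False)     # condition 2
            or blocked.get(req.get("iframe_parent"), False)     # condition 3
        )
    return blocked
```

Because the flag is inherited from the previous hop, a request late in a redirect chain is marked blocked whenever any earlier hop matched the list.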

4.2 Email provides much of the same tracking opportunities as the web

Remote resources embedded in email content can track users across emails. As we show in our survey of email clients (Section 6.2), many email clients allow remote resources to set persistent cookies, and send those cookies with resource requests. In total, we find that 10,724 of the measured emails (85%) embed resources from at least one third party, with an average of 5 third parties per email. The distribution of embedded third parties is far from uniform; we find a median of two per email, and a small number of emails embedding as many as 50 third parties (Figure 2).

3 We set the parser options as we would expect them to be set for a request occurring in a webmail client. For example, all requests are considered third-party requests.

Domain                  % of Emails   % of Top 1M
doubleclick.net         22.2%         47.5%
mathtag.com             14.2%         7.9%
dotomi.com              12.7%         3.5%
adnxs.com               12.2%         13.2%
tapad.com               11.0%         2.6%
liadm.com               11.0%         0.4%
returnpath.net          11.0%         <0.1%
bidswitch.net           10.5%         4.9%
fonts.googleapis.com    10.2%         39.4%
list-manage.com         10.1%         <0.1%

Table 3. Top third-party domains by percentage of the 12,618 emails in the corpus. For comparison, we show the percentage of the top 1 million websites on which these third parties are present.

Fig. 2. CDF of third parties per email, aggregating data across the initial viewing and re-opening of an email. In addition, 1.4% of emails have between 25 and 53 third parties.

Table 3 shows the top third-party domains present in email. Many of these parties also have a large presence on the web [17], blurring the line between email and web tracking. On webmail clients, requests to these cross-context third parties will use the same cookies, allowing them to track both a user's web browsing and email habits. In total, the emails visited during our crawls embed resources from 879 third parties.

4.3 Leaks of email addresses to third parties are common

In addition to being able to track email habits, 99 third parties (11%) also gain access to a user's email address, whether in plaintext or hashed. In email clients which support cookies, these third parties will receive the email address alongside any cookies they've set on the user's device. Trackers which are also present on the web will thus be able to link this address with the user's browsing history profile.

Around 19% of the 902 senders leaked the user's email address to a third party in at least one email, and in total, 29% of emails contain leaks to third parties. We find that a majority of these leaks, 62% of the 100,963 leaks to third parties, are intentional. These intentional leaks mostly occur through remote content embedded directly by the sender. Furthermore, 1% of leaks are classified as unintentional, with the remainder considered ambiguous. While we do not attempt to determine how these identifiers are being used, plaintext and hashed emails can be used for persistent tracking, cross-device tracking, and syncing information between parties.

Leak                    # of Senders    # of Recipients
MD5                     100 (11.1%)     38 (38.5%)
SHA1                    64 (7.1%)       19 (19.2%)
SHA256                  69 (7.6%)       13 (13.1%)
Plaintext Domain        55 (6.1%)       2 (2.0%)
Plaintext Address       77 (8.5%)       54 (54.5%)
URL Encoded Address     6 (0.6%)        8 (8.1%)
SHA1 of MD5             1 (0.1%)        1 (1.0%)
SHA256 of MD5           1 (0.1%)        1 (1.0%)
MD5 of MD5              1 (0.1%)        1 (1.0%)
SHA384                  1 (0.1%)        1 (1.0%)

Table 4. Email address leakage to third parties by encoding. Percentages are given out of a total of 902 senders and 99 third-party leak recipients. All hashes are of the full email address. Email "domain" is the part of the address after the "@". These appear to be a misuse of LiveIntent's API (Section 4.5).

The leaked addresses are often hashed. Although we can detect email addresses hashed with 24 different functions and up to three nested layers, we only find MD5, SHA1, and SHA256 in frequent use. Table 4 summarizes the number of senders and receivers of each encoding. The relatively low diversity of hashes and encodings suggests that these techniques are not being used to obfuscate the collection of email addresses. In fact, the query parameters which contain hashed emails sometimes identify the hash functions used in the parameter name (e.g., a string like md5=<md5 hash of email> appearing in the HTTP request). The design of APIs like LiveIntent's, which first receives an email address and then syncs with a number of other parties (Section 4.5), suggests that these hashed addresses may be used to share or link data from multiple parties.

Recipient Organization    # of Senders
LiveIntent                68 (7.5%)
Acxiom                    46 (5.1%)
Litmus Software           28 (3.1%)
Conversant Media          26 (2.9%)
Neustar                   24 (2.7%)
apxlv.com                 18 (2.0%)
54.211.147.17             18 (2.0%)
Trancos                   17 (1.9%)
WPP                       17 (1.9%)
54.82.61.160              16 (1.8%)

Table 5. Top organizations receiving email address leaks by number of the 902 total senders. A domain is used in place of an organization when it isn't clear which organization it belongs to.

Table 5 identifies the top organizations⁴ which receive leaked email addresses. This shows that email address collection from emails is largely consolidated to a few major players, which are mostly distinct from the popular web trackers. In fact, only one of the top 10 organizations, Neustar, is found in the top 20 third-party organizations on the top 1 million websites, as measured by Englehardt and Narayanan [17]. Also surprising is the prevalence of leaks to IP addresses, which account for eight of the top 20 domains receiving email addresses. This may be due to the relatively ephemeral nature of newsletter emails, which removes concerns of IP address churn over time.

4.4 Reopening emails brings in new third parties

Despite the lack of Javascript support, email views are dynamic. The email content itself is static, but any remote resources embedded in it may return different responses each time the email is viewed, and even redirect to different third parties. To examine the effects of this, we load every email first with a "clean" browser profile and then again without clearing the profile. Surprisingly, the average Jaccard similarity [36] between the sets of third parties loaded during the first and second views of the same email is only 60%.
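For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the third-party sets in the usage note are made up for illustration, and returning 1.0 when both views load no third parties is our own convention.

```python
def jaccard(first_view, second_view):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of third parties."""
    first, second = set(first_view), set(second_view)
    union = first | second
    # Convention (assumption): two empty sets are treated as identical.
    return len(first & second) / len(union) if union else 1.0
```

For example, if the first view loads {doubleclick.net, liadm.com, bidswitch.net} and the second loads {liadm.com, bidswitch.net, adnxs.com, mathtag.com}, two parties are shared out of five in total, giving a similarity of 0.4.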

The majority of emails, two-thirds, load fewer third parties when the email is reopened compared to the initial view. However, about 21% of emails load at least one resource when an email is reopened that wasn't present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

4 We map domains to organizations using the classification provided by Libert [27], adding several new email-specific organizations. When an organization could not be found, we use the PS+1.

Row   Request URL
0     http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0
1     http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN>
2     http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
3     http://x.bidswitch.net/ul_cb/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
4     http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...]
5     http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...]
6     http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent
7     http://i.liadm.com/s/19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1>
8     http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc
9     http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc=
10    http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1
11    http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=
12    http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent

Table 6. Redirect chain from a LiveIntent Email Tracking Pixel. URL query strings are truncated for clarity (using [...]).

The number of leaks between email loads stays relatively constant, with fewer than 50 emails leaking to new parties on the second load⁵. However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak if the redirectors share data based on the email or email hash they receive.

4.5 Case study: LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section, we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com, and includes the user's plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=. The URL also includes the domain portion of the user's address.

5 We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

In rows 2-12, the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10, LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.

4.6 Request blockers help, but don't fix the problem

Privacy-conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content⁶. We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both plaintext and email hashes still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

6 Thunderbird supports most of the popular Firefox extensions, and as such Thunderbird users can also deploy these defenses. See Table 12 for more details.

Encoding                # of Senders    # of Recipients
Plaintext Address       34 (3.7%)       34 (66.7%)
MD5                     21 (2.3%)       12 (23.5%)
SHA1                    14 (1.6%)       6 (11.8%)
URL Encoded Address     4 (0.4%)        4 (7.8%)
SHA256                  4 (0.4%)        2 (3.9%)
SHA384                  1 (0.1%)        1 (2.0%)

Table 7. Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain          # of Senders
mediawallahscript.com     7
jetlore.com               4
scrippsnetworks.com       4
alocdn.com                3
richrelevance.com         3
ivitrack.com              2
intentiq.com              2
gatehousemedia.com        2
realtime.email            2
ziffimages.com            2

Table 8. The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support Javascript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit them individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href property of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. This helps ensure that we have as diverse a set of landing pages as possible by stripping fragment and query string identifiers that may not influence the landing page.
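The binning and sampling strategy can be sketched as follows. This is an illustration, not the authors' code: the function name `sample_links` is hypothetical, the full hostname is used as a stand-in for the PS+1 unless the caller supplies a `site_key` function, and the fixed random seed is only there to make the sketch reproducible.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_links(links, limit=200, seed=0, site_key=None):
    """Sample up to `limit` unique links for one sender, one per bin per round."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for link in sorted(set(links)):
        parts = urlsplit(link)
        host = parts.hostname or ""
        # Bin by (site, path); query strings and fragments are ignored.
        bins[(site_key(host) if site_key else host, parts.path)].append(link)
    for bucket in bins.values():
        rng.shuffle(bucket)
    sample = []
    # Round-robin over the bins, drawing without replacement.
    while bins and len(sample) < limit:
        for key in list(bins):
            if len(sample) >= limit:
                break
            sample.append(bins[key].pop())
            if not bins[key]:
                del bins[key]
    return sample
```

Drawing one link per bin in each round means that distinct landing pages are covered before a second query-string variant of the same page is sampled.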

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both Javascript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client's URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if it is contained either in the URL or the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset, there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph except the first are slight underestimates due to our limit of 200 links per sender.

Recipient Organization    # of Senders
Google                    247 (27.4%)
Facebook                  160 (17.7%)
Twitter                   94 (10.4%)
Adobe                     81 (9.0%)
Microsoft                 73 (8.1%)
Pinterest                 72 (8.0%)
LiveIntent                69 (7.6%)
Akamai                    69 (7.6%)
Acxiom                    68 (7.5%)
AppNexus                  61 (6.8%)

Table 9. The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain          # of Senders
google-analytics.com      200 (22.2%)
doubleclick.net           196 (21.7%)
google.com                159 (17.6%)
facebook.com              154 (17.1%)
facebook.net              145 (16.1%)
fonts.googleapis.com      102 (11.3%)
googleadservices.com      96 (10.6%)
twitter.com               94 (10.4%)
googletagmanager.com      87 (9.6%)
gstatic.com               78 (8.6%)

Table 10. The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient's privacy, namely the recipient's mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail's client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail's IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

| Defense | Email server | Email client | Web browser |
|---|---|---|---|
| Content proxying | X | | |
| HTML filtering | X | X | |
| Cookie blocking | | X | X |
| Referrer blocking | X | X | X |
| Request blocking | | X | X |

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn't prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user's IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user's email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably, the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent), and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well, by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
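The server-side rewriting described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the mechanism any surveyed provider uses: the function name add_noreferrer is hypothetical, anchors are found with a regular expression rather than a full HTML parser, and anchors that already carry a rel attribute are left untouched.

```python
import re

# Referrer Policy expressed as a meta tag, as suggested in the text.
META = '<meta name="referrer" content="no-referrer">'

def add_noreferrer(html: str) -> str:
    """Add rel="noreferrer" to <a> tags and insert a Referrer Policy meta tag."""
    def fix_anchor(match: re.Match) -> str:
        tag = match.group(0)
        if re.search(r'\brel\s*=', tag, flags=re.I):
            return tag  # simplification: existing rel attributes are not merged
        return tag[:-1] + ' rel="noreferrer">'

    rewritten = re.sub(r'<a\b[^>]*>', fix_anchor, html, flags=re.I)
    # Place the policy right after <head> when present, otherwise at the top.
    if re.search(r'<head\b[^>]*>', rewritten, flags=re.I):
        return re.sub(r'(<head\b[^>]*>)', r'\1' + META, rewritten,
                      count=1, flags=re.I)
    return META + rewritten
```

A production filter would more likely parse the HTML properly and merge noreferrer into any existing rel value; the regular expression keeps the sketch short.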

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter-list-based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients.⁷ Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

⁷ https://emailtracking.openwpm.com


| Mail Client | Platform | Proxies Content | Blocks Images | Blocks Referrers | Blocks Cookies | Ext. Support |
|---|---|---|---|---|---|---|
| Gmail | Web | Yes | No | L: Yes, I: Yes† | Yes† | Yes |
| Yahoo Mail | Web | No | Yes | L: Yes, I: No | No | Yes |
| Outlook Web App | Web | No | Yes | No | No | Yes |
| Outlook.com | Web | No | No | No | No | Yes |
| Yandex Mail | Web | Yes | No | L: Yes, I: Yes† | Yes† | Yes |
| GMX | Web | No | No | No | No | Yes |
| Zimbra | Web | No | Yes | No | No | Yes |
| 163.com | Web | No | No | No | No | Yes |
| Sina | Web | No | No | No | No | Yes |
| Apple Mail | iOS | No | No | Yes | Yes | No |
| Gmail | iOS | Yes | No | Yes | Yes | No |
| Gmail | Android | Yes | No | Yes | Yes | No |
| Apple Mail | Desktop | No | No | Yes | Yes | No |
| Windows Mail | Desktop | No | No | Yes | No | No |
| Outlook 2016 | Desktop | No | Yes | Yes | No | No |
| Thunderbird | Desktop | No | Yes | Yes | Optional (Default: No) | Yes |

Table 12. A survey of the privacy-impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request-blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird). Images are only blocked for messages considered spam. † Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python, using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
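To make the filtering step concrete, the following is a minimal sketch using only the Python standard library. It is not the prototype itself: the prototype matches URLs with BlockListParser against EasyList and EasyPrivacy, whereas BLOCK_PATTERNS below is a pair of hand-written, hypothetical regular expressions standing in for compiled filter rules, and the names TrackerStripper and filter_html are illustrative.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-ins for rules compiled from EasyList/EasyPrivacy.
BLOCK_PATTERNS = [
    re.compile(r"^https?://([^/]+\.)?tracker\.example/"),
    re.compile(r"/pixel\.gif(\?|$)"),
]

def is_blocked(url):
    return any(p.search(url) for p in BLOCK_PATTERNS)

class TrackerStripper(HTMLParser):
    """Re-emit HTML unchanged, except for <img>/<link> tags with blocked URLs."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        url = attr_map.get("src") or attr_map.get("href") or ""
        if tag in ("img", "link") and is_blocked(url):
            return  # strip the embedded resource
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)  # raw text already ends in "/>"

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def filter_html(html):
    parser = TrackerStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

In a mail pipeline, filter_html would be applied to each text/html part of a message before delivery; locating those parts is omitted here.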

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak the email address to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that, in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

| URL Pattern | # of Senders |
|---|---|
| li.<PS+1>/imp | 51 (5.7%) |
| partner.<PS+1> | 7 (0.7%) |
| stripe.<PS+1>/stripeimage | 4 (0.4%) |
| p.<PS+1>/espopen | 4 (0.4%) |
| api.<PS+1>/layoutssection<N> | 4 (0.4%) |
| <PS+1>/customer-service | 3 (0.3%) |
| mi.<PS+1>/prp | 3 (0.3%) |
| dmtk.<PS+1> | 3 (0.3%) |
| links.<PS+1>/eopen | 3 (0.3%) |
| eads.<PS+1>/imp | 3 (0.3%) |

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
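The pattern-generation steps in the caption of Table 13 can be sketched as follows. This is an illustration rather than the code used for the table. One assumption is stated loudly: the public suffix plus one is approximated by the last two hostname labels, whereas a faithful implementation would consult the Public Suffix List. The function name url_pattern is hypothetical.

```python
import re
from urllib.parse import urlsplit

def url_pattern(url: str) -> str:
    """Reduce a request URL to a structural pattern (hostname and path only)."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # ASSUMPTION: the last two labels form the public suffix plus one.
    subdomains = [re.sub(r"\d+", "<N>", label) for label in labels[:-2]]
    host = ".".join(subdomains + ["<PS+1>"])

    segments = parts.path.split("/")
    # Strip the last portion of the path if it ends with a file extension.
    if re.search(r"\.[A-Za-z0-9]+$", segments[-1]):
        segments = segments[:-1]
    path = re.sub(r"\d+", "<N>", "/".join(segments))
    return host + path
```

Grouping leaking URLs by this pattern and counting distinct senders per group gives a ranking of the kind shown in the table.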

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available.⁸ It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

⁸ https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
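A short sketch illustrates why unsalted hashing offers so little protection. The choice of MD5 and the lowercase normalization are assumptions made for the example; the same lookup works for any unsalted hash that the parties compute identically. The function names are hypothetical.

```python
import hashlib

def md5_hex(email: str) -> str:
    # ASSUMPTION: addresses are trimmed and lowercased before hashing.
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()

def reverse_hashes(observed_hashes, candidate_emails):
    """Map observed hashes back to addresses using a precomputed table."""
    table = {md5_hex(e): e for e in candidate_emails}
    return {h: table[h] for h in observed_hashes if h in table}

leaked = md5_hex("alice@example.com")
recovered = reverse_hashes([leaked], ["bob@example.com", "alice@example.com"])
```

Here recovered maps the leaked hash straight back to alice@example.com. With a list of known or enumerated addresses, the cost is one hash per candidate, and no salt stands in the way because every party must derive the same value.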

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing occurs when viewing emails, a process in which different trackers exchange and link together their own IDs for the same user. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
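As a rough illustration of the proposed differential test, the sketch below compares two request URLs observed for two different registered addresses on the same site and reports the query parameters whose values differ. The function name and the restriction to query-string parameters are assumptions of the example; identifiers embedded in paths would need separate handling, and A/B testing can still produce false positives.

```python
from urllib.parse import parse_qs, urlsplit

def differing_params(url_a: str, url_b: str) -> dict:
    """Return parameters whose values differ between two URLs
    that share the same hostname and path."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.hostname, a.path) != (b.hostname, b.path):
        return {}
    qa, qb = parse_qs(a.query), parse_qs(b.query)
    return {k: (qa[k], qb[k]) for k in qa.keys() & qb.keys() if qa[k] != qb[k]}
```

Parameters that differ consistently across many pairs of addresses are candidates for transformed email addresses and could then be inspected manually.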

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads. https://adblockplus.org. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source. https://www.campaignmonitor.com/css (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.
[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor MITM: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of TLS-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.
[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet Jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party HTTP requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and controlling PII leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.
[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of PII via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.
[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from the parent; thus, the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.
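The three criteria above can be expressed as a single sort key, as in the sketch below. It is an illustration only: FormCandidate, rank_key, and best_form are hypothetical names, and the z_index field is assumed to already hold the resolved stacking order (with 0 standing in when every ancestor reports auto), whereas the crawler resolves it by walking the DOM.

```python
from dataclasses import dataclass

@dataclass
class FormCandidate:
    html: str        # outer HTML of the candidate form
    z_index: int     # ASSUMPTION: resolved stacking order, 0 if all "auto"
    num_inputs: int  # number of input fields in the form

def rank_key(form: FormCandidate):
    text = form.html.lower()
    is_modal = any(word in text for word in ("modal", "dialog"))
    is_login = any(word in text for word in ("login", "log in", "sign in"))
    # Tuples compare element by element, so earlier criteria take priority:
    # stacking order, modal/dialog tie-break, non-login, then field count.
    return (form.z_index, is_modal, not is_login, form.num_inputs)

def best_form(candidates):
    return max(candidates, key=rank_key)
```

Encoding the criteria as a tuple keeps their priority order explicit and makes it easy to add or reorder rules.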

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
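A field-type heuristic along these lines might look like the following sketch. The attribute list comes from the text above; the keyword lists in FIELD_KEYWORDS are hypothetical, since the crawler's actual terms are not given in this section, and plain substring matching is a deliberate simplification.

```python
# Attributes that commonly carry contextual hints, as listed in the text.
HINT_ATTRS = ("name", "class", "id", "placeholder", "value", "for", "title")

# HYPOTHETICAL keyword lists, for illustration only.
FIELD_KEYWORDS = {
    "email": ("email", "e-mail"),
    "tel": ("phone", "tel", "mobile"),
    "zip": ("zip", "postal"),
}

def guess_field_type(attrs: dict, inner_html: str = "") -> str:
    """Guess what a form field asks for from its attributes and inner HTML."""
    declared = (attrs.get("type") or "text").lower()
    if declared in ("email", "tel"):
        return declared  # HTML5 input types are decisive on their own
    hints = " ".join(str(attrs.get(a) or "") for a in HINT_ATTRS)
    haystack = (hints + " " + inner_html).lower()
    for field_type, keywords in FIELD_KEYWORDS.items():
        if any(k in haystack for k in keywords):
            return field_type
    return "text"
```

Substring matching will misfire on words such as "hotel"; a more careful version would tokenize the hint text before matching.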

Handling two-part form submissions Aftersubmitting a form we are sometimes prompted to fillout another longer form before the registration is ac-cepted This second form might appear on the samepage (ie using JavaScript) or on a separate page ei-ther through a redirect or as a pop-up window We takea simplistic approach the crawler waits a few secondsthen applies the same form-finding procedure first onany pop-up windows and then on the original windowThis approach may have the effect of submitting thesame form twice but we argue that this does not pro-duce any adverse resultsmdashduplicate form submissionsare a plausible user interaction that web services shouldbe expected to handle gracefully

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types that are non-text (i.e., not text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
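The recursive pruning of non-text subparts can be sketched as follows. The server described above is written in Java with the JavaMail API; this sketch swaps in Python's standard email package purely for illustration, and the function name and demonstration message are assumptions.

```python
from email.message import Message
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def strip_non_text(msg: Message) -> Message:
    """Recursively drop subparts whose main content type is not text."""
    if msg.is_multipart():
        kept = [strip_non_text(part) for part in msg.get_payload()
                if part.is_multipart() or part.get_content_maintype() == "text"]
        msg.set_payload(kept)
    return msg


# Small demonstration message: a plain/HTML alternative plus a binary attachment.
demo = MIMEMultipart("mixed")
body = MIMEMultipart("alternative")
body.attach(MIMEText("hello", "plain"))
body.attach(MIMEText("<p>hello</p>", "html"))
demo.attach(body)
demo.attach(MIMEApplication(b"\x00\x01", Name="blob.bin"))

remaining = [part.get_content_type() for part in strip_non_text(demo).walk()]
```

Nested multipart containers are kept and pruned in turn, so the tree structure of the message survives while attachments are discarded.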

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party    Organization   Avg. # add'l parties   # S    # E
pippio.com           Acxiom         5.7                      7     32
liadm.com*           LiveIntent     3.7                     68   1097
rlcdn.com            Acxiom         1.7                     11    551
imiclk.com           MediaMath      1.3                      2      4
mathtag.com          MediaMath      1.1                     11    382
alcmpn.com           ALC†           0.8                      6    132
emltrk.com           Litmus         0.7                     41    638
acxiom-online.com    Acxiom         0.4                      2     33
dyneml.com           PowerInbox     0.1                      3     13
adnxs.com            AppNexus       0.1                     19    277

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of the same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don't fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload


because these trackers are specific to emails (Section 7). We believe that these would make useful additions to existing filter lists. Except for this contribution, we focus our analysis of defenses on mail servers and mail user agents rather than web browsers.

Based on our analysis of 16 email servers and clients (Section 6.2), we find that a patchwork of defenses is employed and no setup offers complete protection from the threats we identify. Perhaps the best option for privacy-conscious users today is to use webmail and install tracker-blocking extensions such as uBlock Origin or Ghostery.

We show that HTML filtering can be an effective defense. The idea is to rewrite email bodies to remove tracking elements. This can be done by either the mail server or the mail user agent. We prototype an element filtering tool based on existing tracking-protection lists and evaluate its effectiveness (Section 7).

2 Related work

Email security and privacy. The literature on email security and privacy has focused on authentication of emails and the privacy of email contents. For example, Durumeric et al. found that the long tail of SMTP servers largely fail to deploy encryption and authentication, leaving users vulnerable to downgrade attacks, which are widespread in the wild [15]. Holz et al. also found that email is poorly secured in transit, often due to configuration errors [21]. We study an orthogonal problem: securing email in transit will not defend against email tracking, and vice versa.

Third-party web tracking. Email tracking is an outgrowth of third-party web tracking, which has grown tremendously in prevalence and complexity since the 1990s [13, 26, 28, 35]. Today Google is the most prominent tracker through various third-party domains, and can track users across nearly 80% of sites [27]. Web tracking has expanded from simple HTTP cookies to include more persistent tracking techniques to "respawn" or re-instantiate HTTP cookies through Flash cookies [37], cache E-Tags, and HTML5 localStorage [10]. Overall, tracking is moving from stateful to stateless techniques: device fingerprinting attempts to identify users by a combination of the device's properties [16, 25]. Such techniques have advanced quickly [19, 30, 33] and are now widespread on the web [7, 8, 17, 32]. These techniques allow trackers to compile unique browsing histories, but they do not link histories to identity.

Compared to web tracking, email tracking does not use fingerprinting because (most) email clients prohibit JavaScript. On the other hand, email readily provides a unique, persistent, real-world identifier, namely the email address. Web tracking researchers have created a number of tools for detecting and measuring tracking and privacy, such as FPDetective [8], OpenWPM [17], and FourthParty [28]. We use OpenWPM for most of our measurements in this paper.

PII leakage. Leaks of PII of logged-in users from first-party websites to third parties are rampant; the early papers on this problem were by Krishnamurthy et al. [23, 24]. PII leaks enable trackers to potentially attach identities to browsing histories. More recent work includes detection of PII leakage to third parties in smartphone apps [34, 40], PII leakage in contact forms [38], PII leakage that enables cross-device tracking [12], and data leakage due to browser extensions [39].

The common problem faced by these authors (and by us) is that PII may be obfuscated. When the data collection is crowdsourced [34, 40] rather than automated, there is the further complication that the strings that constitute PII are not specified by the researcher and thus not known in advance. On the other hand, crowdsourced data collection allows obtaining numerous instances of each type of leak, which might make detection easier.

Various approaches are seen in prior work. Ren et al. employ heuristics for splitting fields in network traffic and detecting likely keys; they then apply machine learning to discriminate between PII and other fields [34]. Starov et al. apply differential testing, that is, varying the PII entered into the system and detecting the resulting changes in information flows [38]. This is challenging to apply in our context because we observed frequent A/B testing in the commercial emails in our corpus, which makes it tricky to attribute observed changes to PII. This is an area for future work. Finally, our own approach is most similar to that of Brookman et al. [12] and Starov et al. [39], who test combinations of encodings and/or hashes.

3 Collecting a dataset of emails

We now describe how we assembled a large-scale corpus of mailing-list emails. We do not attempt to study a "typical" user's mailbox, since we have no empirical data from real users' mailboxes. Rather, our goal in assembling a large corpus is to study the overall landscape of third-party tracking of emails: identify as many trackers as possible (feeding into our enhancements to existing tracking-protection lists) and as many interesting behaviors as possible (such as different hashes and encodings of email addresses).

High-level architecture of crawler
Assemble a list of sites. For each site:
– Find pages potentially containing forms. For each page:
  – Find the best form on the page via top-down form detection and bottom-up form detection. If a form was found:
    * Fill in the form
    * Fill in any secondary forms if necessary
    * Once a form has been submitted, skip the rest of the pages and continue to next site

High-level architecture of server
Receive and store email. For each email:
– Check for and process confirmation links

Fig. 1. High-level architecture of the email collection system, with the individual modules italicized.

To achieve scale, we use an automated approach. We created a web crawler based on the OpenWPM web privacy measurement tool [17] to search for and fill in forms that appear to be mailing-list subscriptions. The crawler has five modules, and the server that processes emails has two modules. They are both described at a high level in Fig. 1. We now describe each of the seven modules in turn.

Assemble a list of sites. Alexa maintains a public list of the top 1 million websites based on monthly traffic statistics, as well as rankings of the top 500 websites by category. We used the "Shopping" and "News" categories, since we found them more likely to contain newsletters. In addition, we visited the top 14,700 sites of the 1 million sites, for a total of 15,700 sites.

Detect and rank forms. When the crawler cannot locate a form on the landing page, it searches through all internal links (<a> tags) in the DOM until a page containing a suitable form is found. A ranked list of terms, shown in Table 1, is used to prioritize the links most likely to contain a mailing list. On each page, forms are detected using both a top-down and a bottom-up procedure. The top-down procedure examines all fields contained in <form> elements. Forms which have a higher z-index and more input fields are given a higher rank, while forms which appear to be part of user account registration are given a lower rank. If no <form> elements are found, we attempt to discover forms contained in alternative containers (e.g., forms in <div> containers) using a bottom-up procedure. We start with each <input> element and recursively examine its parents until one with a submit button is found. For further details, see Top-down form detection and Bottom-up form detection in Appendix Section 10.1.

Fill in the form. Once a form is found, the crawler must fill out the input fields such that all inputs validate. The crawler fills all visible form fields, including <input> tags, <select> tags (i.e., dropdown lists), and other submit <button> tags. Most websites use the general text type for all text inputs. We surveyed a number of top websites to determine common naming practices for input fields, and filled the fields with data of the expected type. For example, name fields were filled with a generic first and last name. After submitting a form, we wait for a few seconds and re-run the procedure to fill follow-up fields, if required. For further details, see Determining form field type and Handling two-part form submissions in Appendix Section 10.1.

Receive and store email. We set up an SMTP server to receive emails. The server accepts any mail sent to an existing email address and rejects it otherwise. It then parses the contents of the mail and logs metadata (such as the sender address, subject text, and recipient address) to a central database. All textual portions of the message contents are written to disk. We provide implementation details in Appendix Section 10.2.

Check for and process confirmation links. Our server checks the first email sent to each email address to determine if the mailing list requires additional user interaction to confirm the subscription. If the initial email's subject or rendered body text includes the keywords "confirm", "verify", "validate", or "activate", we extract potential confirmation links from the email. For HTML emails, we collect links which match these keywords, along with additional lower-priority keywords "subscribe" or "click". For plain-text emails, we simply choose the longest link text. Emails with the past-tense keywords "confirmed", "subscribed", and "activated" in subject lines are skipped, as are links with the text "unsubscribe", "cancel", "deactivate", and "view". If any link is found, it is visited using OpenWPM.
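The keyword rules for HTML emails can be sketched as below. The keyword lists are taken from the paragraph above; the function names, the (text, URL) link representation, and the example.com URLs used to exercise it are assumptions, and plain substring matching is a simplification of whatever matching the server actually performs.

```python
CONFIRM = ("confirm", "verify", "validate", "activate")
LOW_PRIORITY = ("subscribe", "click")
PAST_TENSE = ("confirmed", "subscribed", "activated")
SKIP_LINK_TEXT = ("unsubscribe", "cancel", "deactivate", "view")


def needs_confirmation(subject, body_text):
    """Decide whether the first email from a list asks for a confirmation click."""
    subject = subject.lower()
    if any(word in subject for word in PAST_TENSE):
        return False  # the subscription is already confirmed
    text = subject + " " + body_text.lower()
    return any(word in text for word in CONFIRM)


def pick_html_link(links):
    """Pick a confirmation URL from (link text, URL) pairs, or None."""
    for keywords in (CONFIRM, LOW_PRIORITY):  # high-priority keywords first
        for text, url in links:
            text = text.lower()
            if any(word in text for word in SKIP_LINK_TEXT):
                continue
            if any(word in text for word in keywords):
                return url
    return None
```

Checking the skip list before the keyword lists matters, because "unsubscribe" contains "subscribe" and "deactivate" contains "activate".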

Form submission measurement. Our crawler discovered and attempted to submit forms on 3,335 sites. We received at least one email from 1,242 (37%) of those sites. To understand the types of form submission failures, we ran a follow-up measurement in August 2017, where we took screenshots of the pages before and after the initial and follow-up form submissions. We manually examined a random sample of sites on which a form submission was attempted. We summarize the results in Table 2.

Description | Keywords | Location
Email list registration | newsletter, weekly ad, subscribe, inbox, email, sale alert | link text
Generic registration | signup, sign up, sign me up, register, create, join | link text
Generic articles/posts | article, news, 2017 | link URL
Selecting language/region | us, =us&, en-us | link URL
Blacklist | unsubscribe, mobile, phone | link text

Table 1. The web crawler chooses links to click based on keywords that appear in the link text or URL. The keywords were generated by iterating on an initial set of terms, optimizing for the success of mailing list sign-ups on the top sites. We created an initial set of search terms and manually observed the crawler interact with the top pages. Each time the crawler missed a mailing list sign-up form or failed to go to a page containing a sign-up form, we inspected the page and updated the set of keywords. This process was repeated until the crawler was successful on the sampled sites.

Submission classification              % of sampled sites
Total successful submissions           38%
  → Mailing list subscription          32%
  → User account registration           6%
Failed: required a CAPTCHA             16%
Failed: unsupported form fields        25%
Unable to classify via screenshots     21%

Table 2. Submission success status of a sample of 252 of the 3,335 form submissions made during the sign-up crawl. The success and failure classification was determined through a manual review of screenshots taken before and after an attempted form submission.

When filling forms, our crawler will interact with user account registration forms, mailing list sign-up forms, and contact forms. The successful submissions were mostly mailing list sign-ups and a small number of user account registrations, which are included as they can be tied to a mailing list. The failed submissions were mostly caused by forms other than mailing lists. In fact, more than 70% of the failures caused by a captcha or unsupported field were not mailing list form submissions. Overall, only 11% of the sampled mailing list interactions resulted in a captcha. Since our primary focus is mailing lists, we leave the evaluation of complex and captcha-protected forms to future work.

Email corpus. The assembled corpus contains a total of 12,618 HTML emails from 902 sites. We received an average of around 14 emails per site and a median of 5. A few sites had very active mailing lists, with 20 sites sending over 100 emails during the test period. We observe that we received no spam, which we confirmed both by manual inspection of a sample of emails and by finding an exact one-to-one correspondence between the 902 senders in our dataset and the unique email addresses that we generated. This ensures that the results represent the behavior of the sites where we registered rather than spammers.

4 Privacy leaks when viewing emails

4.1 Measurement methodology

Simulating a webmail client. To measure web tracking in email bodies, we render the emails using a simulated webmail client in an OpenWPM instance. Many webmail clients remove a subset of HTML tags from the email body to restrict the capabilities of rendered content. In particular, JavaScript is exclusively removed, while iframe tags and CSS [6] have mixed support. We simulate a permissive webmail client: one which disables JavaScript and removes the Referer header from all requests, but applies no other restrictions to the rendered content.

The email content is served on localhost, but is accessed through the domain localtest.me (which resolves to localhost) to avoid any special handling the browser may have for the local network. We configure OpenWPM to run 15 measurement instances in parallel. Each email is loaded twice in its own measurement instance: once with a fresh profile, and then again, keeping the same browser profile, after sleeping for 10 seconds. This is intended to allow remote content on the page to load both with and without browser state present. Indeed, we observe some tracking images which redirect to new domains upon every subsequent reload of the same email.


Classifying third-party content. Many email clients load embedded content directly from remote servers (we further explore the properties of email clients in Section 6.2). Thus, remote content present in multiple emails can track users in the same way third-party content can track users across sites on the web. However, unlike the web, there isn't always a clear distinction of which requests are "third-party" and which are "first-party". For example, all resources loaded by webmail clients are considered third-party by the browser. We consider any request to a domain (footnote 2) which is different from both the domain on which we signed up for the mailing list and the domain of the sender's email address to be a third-party request.
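The third-party test can be sketched as follows, comparing domains at the level of public suffix plus one label. The tiny PUBLIC_SUFFIXES set is a toy assumption standing in for the full Public Suffix List, and the function names and the sender address in the usage are hypothetical.

```python
from urllib.parse import urlsplit

# Toy subset for illustration; a real measurement would load the Public Suffix List.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}


def ps_plus_one(host):
    """Return the public suffix plus the label immediately preceding it."""
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels)):  # longest candidate suffix is tried first
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host.lower()


def is_third_party(request_url, signup_domain, sender_address):
    """True if the request goes to neither the sign-up site nor the sender's domain."""
    request_domain = ps_plus_one(urlsplit(request_url).hostname or "")
    sender_domain = sender_address.rsplit("@", 1)[-1]
    first_parties = {ps_plus_one(signup_domain), ps_plus_one(sender_domain)}
    return request_domain not in first_parties
```

Under this rule a subdomain of the sign-up site counts as first-party, while a tracker on another registered domain counts as third-party.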

Detecting email leakage. Email addresses leak to remote servers through resource requests. Detecting these leaks is not as simple as searching for email addresses in requests, since the addresses may be hashed or encoded, sometimes iteratively. To detect such leakage, we develop a methodology that, given a set of encodings and hashes, a plaintext email address, and a URL token, is able to determine if the token is a transformation of the email address. Starting with the plaintext email address, we pre-compute a candidate set of tokens by applying all supported encodings and hashes iteratively, stopping once we reach three nested encodings or hashes. We then take the URL token and apply all supported decodings to the value, checking if the result is present in the candidate set. If not, we iteratively apply decodings until we reach a level of three nested decodings.

In a preliminary measurement, we found no examples of a value that was encoded before being hashed. This is unsurprising, as hashed email addresses are used to sync data between parties, and adding a transformation before the hash would prevent that use case. Thus, when analyzing the requests in this dataset, we restrict ourselves to at most three nested hashes for a set of 24 supported hashes, including md5, sha1, and sha256. For encodings, we apply all possible combinations of 10 encodings, including base64, urlencoding, and gzip. The full list of supported hashes and encodings is given in Appendix 10.3.
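The candidate-set approach can be illustrated with a reduced sketch. It assumes only three hashes and two encodings rather than the full lists in Appendix 10.3, hashes are taken over the hex digest of the previous step, and all function names are hypothetical.

```python
import base64
import hashlib
from urllib.parse import quote, unquote

HASHES = {name: (lambda s, n=name: hashlib.new(n, s.encode()).hexdigest())
          for name in ("md5", "sha1", "sha256")}
ENCODERS = {
    "base64": lambda s: base64.b64encode(s.encode()).decode(),
    "urlencode": lambda s: quote(s, safe=""),
}


def candidate_set(email, depth=3):
    """Map every transformed value (up to `depth` nested steps) to its chain."""
    seen = {email: ()}
    frontier = {email: ()}
    for _ in range(depth):
        next_frontier = {}
        for value, chain in frontier.items():
            encoded_already = any(step in ENCODERS for step in chain)
            for name, fn in {**HASHES, **ENCODERS}.items():
                if name in HASHES and encoded_already:
                    continue  # encode-then-hash was never observed, so skip it
                out = fn(value)
                if out not in seen:
                    seen[out] = next_frontier[out] = chain + (name,)
        frontier = next_frontier
    return seen


def decodings(token):
    yield unquote(token)
    try:
        yield base64.b64decode(token, validate=True).decode()
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        pass


def find_leak(token, candidates, depth=3):
    """Return the transformation chain if `token` derives from the email, else None."""
    frontier = {token}
    for _ in range(depth + 1):
        for value in frontier:
            if value in candidates:
                return candidates[value]
        frontier = {d for value in frontier for d in decodings(value)}
    return None
```

An empty chain `()` denotes a plaintext match. Pre-computing the forward transformations once per address keeps the per-token work to a few decodings and dictionary lookups.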

Classifying email leakage. Email leaks may not be intentional. If an email address is included in the query string or path of a document URL, it may automatically end up in the Referer header of subsequent requests from that document. Requests which result in a redirect also often add the referrer of the previous request to the query string of the new request. In many instances this happens irrespective of the presence of an email address in the original request. The situation is made more complex on the web, since third-party JavaScript can dynamically build URLs and trigger requests.

2 A domain is identified by its public suffix plus the component of the hostname immediately preceding its public suffix (PS+1).

The reduced HTML support and lack of JavaScript execution in email clients make it possible to determine intentionality for most leaks. When an email is rendered, requests can result from three sources: from elements embedded in the original HTML, from within an embedded iframe (if supported by the client), or from a redirected request.

1. If a leak occurs in a Referer header, it is unintentional. For webmail clients, the Referer header (if enabled) will be the client itself. A mail sender can embed an iframe which loads a URL that includes the user's email address, with the explicit intention that the user's email leak to third parties via the Referer header. However, we chose not to include this possibility because email senders have multiple direct options for sharing information with third parties that do not rely on the sparsely supported iframe tag.

2. If a leak occurs in a request to a resource embedded directly in the HTML of the email body (and is not the result of a redirect), it is intentional. We can determine intentionality since any request resulting from an HTML document must have been constructed by the email sender. Note that this does not hold for web documents, since embedded JavaScript can dynamically construct requests during the page visit.

3. If a request results from a redirect, the party responsible for the leak is the party whose request (i.e., the triggering URL) responded with a redirect to the new location (i.e., the target URL). We classify a leak as intentional if the leaked value is hashed between the triggering URL and the target URL, or if there are more encodings or hashes of the leaked value included in the target URL than in the triggering URL. If the target URL includes a full copy of the triggering URL (in any encoding), the leak is unintentional. All other cases are classified as ambiguous, such as the case where a target URL includes only the query string of the triggering URL.
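The three rules above can be sketched as a single classifier. This is a simplified illustration: the forms of the leaked value in each URL are assumed to be given as sets of labels such as "plaintext" or "md5", only raw and percent-encoded copies of the triggering URL are checked, and the order in which the redirect conditions are tested is an assumption. All names and example.com URLs are hypothetical.

```python
from urllib.parse import quote

HASH_FORMS = {"md5", "sha1", "sha256"}


def classify_leak(source, trigger_url="", target_url="",
                  trigger_forms=frozenset(), target_forms=frozenset()):
    """Label a leak as 'intentional', 'unintentional', or 'ambiguous'."""
    if source == "referer":
        return "unintentional"
    if source == "embedded":
        return "intentional"
    # source == "redirect": compare how the address appears before and after.
    newly_hashed = (set(target_forms) - set(trigger_forms)) & HASH_FORMS
    if newly_hashed or len(target_forms) > len(trigger_forms):
        return "intentional"
    copies = (trigger_url, quote(trigger_url, safe=""))
    if trigger_url and any(copy in target_url for copy in copies):
        return "unintentional"
    return "ambiguous"
```

A redirect that hashes a plaintext address, as in the transition from row 0 to row 1 of Table 6, is labeled intentional under these rules.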


Measuring blocked tags. Tracking protection tools which block resource requests offer users protection against the tracking embedded in emails. We evaluate the effectiveness of these tools by checking the requests in our dataset against two major blocklists, EasyList and EasyPrivacy [4]. These lists block advertisement and tracking related requests, and are bundled with several popular blocking extensions, including AdBlock Plus [1] and uBlock Origin [5]. We use the BlockListParser library [3] to determine if a request would have been blocked (footnote 3) by an extension utilizing these lists. We classify a request as blocked if it matches any of the following three conditions:

1. The request directly matches the filter list.
2. The request is the result of a redirect, and any request earlier in the redirect chain was blocked.
3. The request is loaded in an iframe, and the iframe document request (or any resulting redirect) was blocked.

It is possible to do this classification in an offline fashion because of the lack of JavaScript support in email clients. This removes the need to run measurements with one of the aforementioned extensions installed. In environments that support JavaScript, content can be loaded dynamically and as the result of interactions between several scripts. In such an environment it is much more difficult to determine which requests would have been blocked by a single script appearing on the block list.
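The offline classification can be sketched as a single pass over recorded requests. The request record layout (id, url, redirect_from, iframe_doc), the toy_filter matcher, and the example.com URLs are assumptions; a real run would match URLs against the EasyList and EasyPrivacy rules.

```python
def classify_blocked(requests, matches_filter):
    """Return the ids of requests a blocklist-based extension would have stopped.

    `requests` must be in load order, so parents are seen before children.
    """
    blocked = set()
    for req in requests:
        inherited = (req.get("redirect_from") in blocked
                     or req.get("iframe_doc") in blocked)
        if matches_filter(req["url"]) or inherited:
            blocked.add(req["id"])
    return blocked


# Stand-in matcher; a real run would consult EasyList and EasyPrivacy rules.
def toy_filter(url):
    return "tracker.example" in url


requests = [
    {"id": 1, "url": "http://tracker.example/pixel.gif"},
    {"id": 2, "url": "http://sync.example/a", "redirect_from": 1},
    {"id": 3, "url": "http://sync.example/b", "redirect_from": 2},
    {"id": 4, "url": "http://cdn.example/frame.html"},
    {"id": 5, "url": "http://cdn.example/logo.png", "iframe_doc": 4},
]
blocked_ids = classify_blocked(requests, toy_filter)
```

Because blocked status propagates from parent to child, blocking the first request in a chain also removes every redirect that would have followed it.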

4.2 Email provides much of the same tracking opportunities as the web

Remote resources embedded in email content can track users across emails. As we show in our survey of email clients (Section 6.2), many email clients allow remote resources to set persistent cookies and send those cookies with resource requests. In total, we find that 10,724 of the measured emails (85%) embed resources from at least one third party, with an average of 5 third parties per email. The distribution of embedded third parties is far from uniform: we find a median of two per email, and a small number of emails embedding as many as 50 third parties (Figure 2).

3 We set the parser options as we would expect them to be set for a request occurring in a webmail client. For example, all requests are considered third-party requests.

Domain                  % of Emails   % of Top 1M
doubleclick.net         22.2%         47.5%
mathtag.com             14.2%          7.9%
dotomi.com              12.7%          3.5%
adnxs.com               12.2%         13.2%
tapad.com               11.0%          2.6%
liadm.com               11.0%          0.4%
returnpath.net          11.0%         <0.1%
bidswitch.net           10.5%          4.9%
fonts.googleapis.com    10.2%         39.4%
list-manage.com         10.1%         <0.1%

Table 3. Top third-party domains by percentage of the 12,618 emails in the corpus. For comparison, we show the percentage of the top 1 million websites on which these third parties are present.

Fig. 2. CDF of third parties per email, aggregating data across the initial viewing and re-opening of an email. In addition, 1.4% of emails have between 25 and 53 third parties.

Table 3 shows the top third-party domains present in email. Many of these parties also have a large presence on the web [17], blurring the line between email and web tracking. On webmail clients, requests to these cross-context third parties will use the same cookies, allowing them to track both a user's web browsing and email habits. In total, the emails visited during our crawls embed resources from 879 third parties.

4.3 Leaks of email addresses to third parties are common

In addition to being able to track email habits, 99 third parties (11%) also gain access to a user's email address, whether in plaintext or hashed. In email clients which support cookies, these third parties will receive the email address alongside any cookies they've set on the user's device. Trackers which are also present on the web will thus be able to link this address with the user's browsing history profile.

Around 19% of the 902 senders leaked the user's email address to a third party in at least one email, and in total, 29% of emails contain leaks to third parties. We find that a majority of these leaks, 62% of the 100,963 leaks to third parties, are intentional. These intentional leaks mostly occur through remote content embedded directly by the sender. Furthermore, 1% of leaks are classified as unintentional, with the remainder considered ambiguous. While we do not attempt to determine how these identifiers are being used, plaintext and hashed emails can be used for persistent tracking, cross-device tracking, and syncing information between parties.

Leak                   # of Senders    # of Recipients
MD5                    100 (11.1%)     38 (38.5%)
SHA1                    64 (7.1%)      19 (19.2%)
SHA256                  69 (7.6%)      13 (13.1%)
Plaintext Domain        55 (6.1%)       2 (2.0%)
Plaintext Address       77 (8.5%)      54 (54.5%)
URL Encoded Address      6 (0.6%)       8 (8.1%)
SHA1 of MD5              1 (0.1%)       1 (1.0%)
SHA256 of MD5            1 (0.1%)       1 (1.0%)
MD5 of MD5               1 (0.1%)       1 (1.0%)
SHA384                   1 (0.1%)       1 (1.0%)

Table 4. Email address leakage to third parties by encoding. Percentages are given out of a total of 902 senders and 99 third-party leak recipients. All hashes are of the full email address. Email "domain" is the part of the address after the "@". These appear to be a misuse of LiveIntent's API (Section 4.5).

The leaked addresses are often hashed. Although we can detect email addresses hashed with 24 different functions and up to three nested layers, we only find MD5, SHA1, and SHA256 in frequent use. Table 4 summarizes the number of senders and receivers of each encoding. The relatively low diversity of hashes and encodings suggests that these techniques are not being used to obfuscate the collection of email addresses. In fact, the query parameters which contain hashed emails sometimes identify the hash functions used in the parameter name (e.g., a string like md5=<md5 hash of email> appearing in the HTTP request). The design of APIs like LiveIntent's, which first receives an email address and then syncs with a number of other parties (Section 4.5), suggests that these hashed addresses may be used to share or link data from multiple parties.

Recipient Organization   # of Senders
LiveIntent               68 (7.5%)
Acxiom                   46 (5.1%)
Litmus Software          28 (3.1%)
Conversant Media         26 (2.9%)
Neustar                  24 (2.7%)
apxlv.com                18 (2.0%)
5421114717               18 (2.0%)
Trancos                  17 (1.9%)
WPP                      17 (1.9%)
548261160                16 (1.8%)

Table 5. Top organizations receiving email address leaks by number of the 902 total senders. A domain is used in place of an organization when it isn't clear which organization it belongs to.

Table 5 identifies the top organizations (footnote 4) which receive leaked email addresses. This shows that email address collection from emails is largely consolidated to a few major players, which are mostly distinct from the popular web trackers. In fact, only one of the top 10 organizations, Neustar, is found in the top 20 third-party organizations on the top 1 million websites, as measured by Englehardt and Narayanan [17]. Also surprising is the prevalence of leaks to IP addresses, which account for eight of the top 20 domains receiving email addresses. This may be due to the relatively ephemeral nature of newsletter emails, which removes concerns of IP address churn over time.

4.4 Reopening emails brings in new third parties

Despite the lack of JavaScript support, email views are dynamic. The email content itself is static, but any remote resources embedded in it may return different responses each time the email is viewed, and even redirect to different third parties. To examine the effects of this, we load every email first with a "clean" browser profile and then again without clearing the profile. Surprisingly, the average Jaccard similarity [36] between the sets of third parties loaded during the first and second views of the same email is only 60%.
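For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the function name and the convention that two empty sets have similarity 1.0 are assumptions, and the domains in the usage are only illustrative.

```python
def jaccard(first_view, second_view):
    """Jaccard similarity of the third-party sets seen in two views of an email."""
    a, b = set(first_view), set(second_view)
    if not a and not b:
        return 1.0  # assumed convention: two empty views are identical
    return len(a & b) / len(a | b)


similarity = jaccard(
    {"liadm.com", "bidswitch.net", "doubleclick.net"},
    {"liadm.com", "bidswitch.net", "iponweb.net"},
)
```

Here two of the four distinct domains appear in both views, so the similarity is 0.5.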

The majority of emails (two-thirds) load fewer third parties when the email is reopened compared to the initial view. However, about 21% of emails load at least one resource when an email is reopened that wasn't present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

4 We map domains to organizations using the classification provided by Libert [27], adding several new email-specific organizations. When an organization could not be found, we use the PS+1.

Row  Request URL
0    http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0
1    http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN>
2    http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
3    http://x.bidswitch.net/ul_cb/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
4    http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...]
5    http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...]
6    http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent
7    http://i.liadm.com/s/19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1>
8    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc
9    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc=
10   http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1
11   http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=
12   http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent

Table 6. Redirect chain from a LiveIntent email tracking pixel. URL query strings are truncated for clarity (using [...]).

The number of leaks between email loads stays relatively constant, with fewer than 50 emails leaking to new parties on the second load.⁵ However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak, if the redirectors share data based on the email or email hash they receive.
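The first-view versus second-view comparison described above can be illustrated with a short sketch. This is only an illustration of the Jaccard measure, not the measurement code used in the study; the third-party names are hypothetical placeholders.

```python
def jaccard_similarity(first_view, second_view):
    """Return |A ∩ B| / |A ∪ B| for two collections of third-party names."""
    a, b = set(first_view), set(second_view)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical third parties observed on two views of the same email.
first = {"tracker-a.example", "cdn-b.example", "ads-c.example"}
second = {"tracker-a.example", "cdn-b.example", "sync-d.example"}

similarity = jaccard_similarity(first, second)  # 2 shared / 4 distinct = 0.5
new_on_reload = second - first                  # parties seen only on the second view
```

Averaging `jaccard_similarity` over all emails gives the kind of summary statistic reported above, and a non-empty `new_on_reload` marks an email that contacts a new party when reopened.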

4.5 Case study: LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section, we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com and includes the user's plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=. The URL also includes the domain portion of the user's address.

⁵ We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

In rows 2-12, the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10, LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.
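To make the leak in rows 0 and 1 concrete, the following minimal sketch checks whether a URL's query string carries an address in plaintext or as an MD5, SHA1, or SHA256 hash. It is an illustration only: the full detection procedure is described in Section 4.1 and covers more encodings, and the function names and the example host here are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl


def email_tokens(address):
    """Map an encoding name to the string a tracker might receive for the address."""
    raw = address.encode("utf-8")
    return {
        "plaintext": address,
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def leaked_parameters(url, address):
    """Return {query parameter: encoding} for parameters whose value equals a token."""
    by_value = {value: name for name, value in email_tokens(address).items()}
    found = {}
    # parse_qsl percent-decodes values, so "user%40example.com" matches the plaintext.
    for key, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if value in by_value:
            found[key] = by_value[value]
    return found


address = "user@example.com"
tokens = email_tokens(address)
# Hypothetical URL shaped like row 1 of Table 6.
url = ("http://p.tracker.example/imp?m=" + tokens["md5"]
       + "&sh=" + tokens["sha1"] + "&sh2=" + tokens["sha256"] + "&p=0")
hits = leaked_parameters(url, address)  # {'m': 'md5', 'sh': 'sha1', 'sh2': 'sha256'}
```

Exact matching on parameter values is a deliberate simplification; a token embedded inside a longer value or nested encoding would need substring or recursive decoding checks.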

4.6 Request blockers help, but don't fix the problem

Privacy-conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content.⁶ We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both plaintext and hashed email addresses still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

⁶ Thunderbird supports most of the popular Firefox extensions, and as such Thunderbird users can also deploy these defenses. See Table 12 for more details.

Encoding              # of Senders   # of Recipients
Plaintext Address     34 (3.7%)      34 (66.7%)
MD5                   21 (2.3%)      12 (23.5%)
SHA1                  14 (1.6%)      6 (11.8%)
URL Encoded Address   4 (0.4%)       4 (7.8%)
SHA256                4 (0.4%)       2 (3.9%)
SHA384                1 (0.1%)       1 (2.0%)

Table 7: Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain        # of Senders
mediawallahscript.com   7
jetlore.com             4
scrippsnetworks.com     4
alocdn.com              3
richrelevance.com       3
ivitrack.com            2
intentiq.com            2
gatehousemedia.com      2
realtime.email          2
ziffimages.com          2

Table 8: The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support JavaScript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit them individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href attribute of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. This helps ensure that we have as diverse a set of landing pages as possible by stripping fragment and query string identifiers that may not influence the landing page.
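The sampling strategy above can be sketched as follows. This is a simplified illustration rather than the study's code: it uses the standard-library HTML parser instead of BeautifulSoup to stay self-contained, it approximates PS+1 by the last two hostname labels (a real implementation would consult the Public Suffix List), and all function names are hypothetical.

```python
import random
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def naive_ps1(hostname):
    """Crude stand-in for PS+1: the last two labels of the hostname."""
    return ".".join(hostname.lower().split(".")[-2:])


def sample_links(email_bodies, limit=200, seed=0):
    """Bin one sender's unique links by (PS+1, path), then draw one per bin in rounds."""
    unique_links = set()
    for body in email_bodies:
        extractor = LinkExtractor()
        extractor.feed(body)
        unique_links.update(extractor.links)

    bins = defaultdict(list)
    for link in sorted(unique_links):
        parts = urlsplit(link)
        if parts.scheme in ("http", "https") and parts.hostname:
            # Query string and fragment are ignored when choosing the bin.
            bins[(naive_ps1(parts.hostname), parts.path)].append(link)

    rng = random.Random(seed)
    for links in bins.values():
        rng.shuffle(links)

    sampled = []
    while len(sampled) < limit and any(bins.values()):
        for links in bins.values():
            if links and len(sampled) < limit:
                sampled.append(links.pop())  # without replacement
    return sampled
```

Drawing one link per bin in each round means that distinct landing pages are covered before any bin contributes a second link, which is the diversity property the text describes.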

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both JavaScript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client's URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if the address is contained either in the URL or the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.
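As a rough sketch of the rule just stated, the function below marks a recipient host as having received a leak when any known token of the address appears in the request URL or in its Referer header. The request record layout (a dict with "url" and "headers") is an assumption made for this example, and the token list would in practice come from the Section 4.1 procedure; the sketch also does not separate first parties from third parties.

```python
from urllib.parse import unquote, urlsplit


def leak_recipients(requests, tokens):
    """Return {recipient host: set of places ("url", "referer") where a token appeared}."""
    hits = {}
    for req in requests:
        host = urlsplit(req["url"]).hostname
        fields = {
            "url": unquote(req["url"]),
            "referer": unquote(req.get("headers", {}).get("Referer", "")),
        }
        for place, text in fields.items():
            if any(token in text for token in tokens):
                hits.setdefault(host, set()).add(place)
    return hits


# Hypothetical requests recorded after clicking a link that embeds the address.
tokens = ["user@example.com"]
requests = [
    {"url": "https://cdn.widgets.example/lib.js",
     "headers": {"Referer": "https://shop.example/welcome?email=user%40example.com"}},
    {"url": "https://stats.example/collect?uid=user@example.com", "headers": {}},
    {"url": "https://fonts.example/a.css",
     "headers": {"Referer": "https://shop.example/"}},
]
recipients = leak_recipients(requests, tokens)
```

The first request shows why the Referer header matters here: the resource URL itself is clean, yet the landing page URL that carries the address is forwarded to the embedded party.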

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset, there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph, except the first, are slight underestimates due to our limit of 200 links per sender.

Recipient Organization   # of Senders
Google                   247 (27.4%)
Facebook                 160 (17.7%)
Twitter                  94 (10.4%)
Adobe                    81 (9.0%)
Microsoft                73 (8.1%)
Pinterest                72 (8.0%)
LiveIntent               69 (7.6%)
Akamai                   69 (7.6%)
Acxiom                   68 (7.5%)
AppNexus                 61 (6.8%)

Table 9: The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain        # of Senders
google-analytics.com    200 (22.2%)
doubleclick.net         196 (21.7%)
google.com              159 (17.6%)
facebook.com            154 (17.1%)
facebook.net            145 (16.1%)
fonts.googleapis.com    102 (11.3%)
googleadservices.com    96 (10.6%)
twitter.com             94 (10.4%)
googletagmanager.com    87 (9.6%)
gstatic.com             78 (8.6%)

Table 10: The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient's privacy, namely the recipient's mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail's client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail's IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

Defense             Email server   Email client   Web browser
Content proxying    X
HTML filtering      X              X
Cookie blocking                    X              X
Referrer blocking   X              X              X
Request blocking                   X              X

Table 11: Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn't prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user's IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user's email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent) and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful, since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well, by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
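The server-side rewrite mentioned at the end of the paragraph above could look roughly like the sketch below. It is an assumption-laden illustration, not a description of any deployed mail server: it uses regular expressions rather than a real HTML parser, leaves existing rel attributes untouched instead of merging values, and only adds the meta tag when the email has a head element. The function name is hypothetical.

```python
import re

META_TAG = '<meta name="referrer" content="no-referrer">'


def add_referrer_protection(html):
    """Add rel="noreferrer" to <a> tags and a no-referrer meta policy to <head>."""

    def patch_anchor(match):
        tag = match.group(0)
        if re.search(r"\brel\s*=", tag, flags=re.IGNORECASE):
            return tag  # simplification: an existing rel attribute is left as-is
        return tag[:-1] + ' rel="noreferrer">'

    html = re.sub(r"<a\b[^>]*>", patch_anchor, html, flags=re.IGNORECASE)
    # Insert the policy right after the first opening <head> tag, if there is one.
    html = re.sub(r"(<head\b[^>]*>)", lambda m: m.group(1) + META_TAG,
                  html, count=1, flags=re.IGNORECASE)
    return html
```

For example, `add_referrer_protection('<a href="https://shop.example/">go</a>')` returns `'<a href="https://shop.example/" rel="noreferrer">go</a>'`. A production filter would parse the markup properly, since regular expressions mishandle unusual but valid HTML.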

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter-list-based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients.⁷ Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.
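A tester of this kind can be approximated with the standard library, as sketched below. This is not the code behind the tester described above: the host name, URL layout, use of a single per-test token for both URLs, and the function names are all assumptions made for illustration, and sending the message and serving the two URLs are left out.

```python
import uuid
from email.message import EmailMessage

BASE_URL = "https://tester.example"  # hypothetical host for the test server


def build_test_email(recipient, client_name):
    """Return (message, token); the token ties later requests to this test run."""
    token = uuid.uuid4().hex
    msg = EmailMessage()
    msg["To"] = recipient
    msg["From"] = "tester@tester.example"
    msg["Subject"] = f"Email privacy test ({client_name})"
    msg.set_content("This test requires an HTML-capable mail client.")
    msg.add_alternative(
        f'<html><body><img src="{BASE_URL}/image/{token}.png" width="1" height="1">'
        f'<a href="{BASE_URL}/link/{token}">Click to finish the test</a></body></html>',
        subtype="html",
    )
    return msg, token


def record_request(kind, token, client_ip, headers, timestamp):
    """Summarize one image or link request the way a tester's log might."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "kind": kind,
        "token": token,
        "ip": client_ip,
        "timestamp": timestamp,
        "cookie": lowered.get("cookie"),
        "referrer": lowered.get("referer"),
        "user_agent": lowered.get("user-agent"),
    }
```

Comparing the record for the image request with the record for the link click shows, per client, whether cookies and referrers are sent and whether the request arrives from the user's own IP address or from a proxy.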

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

⁷ https://emailtracking.openwpm.com/


Mail Client       Platform   Proxies Content   Blocks Images   Blocks Referrers   Blocks Cookies           Ext. Support
Gmail             Web        Yes               No              L: Yes, I: Yes†    Yes†                     Yes
Yahoo Mail        Web        No                Yes             L: Yes, I: No      No                       Yes
Outlook Web App   Web        No                Yes             No                 No                       Yes
Outlook.com       Web        No                No              No                 No                       Yes
Yandex Mail       Web        Yes               No              L: Yes, I: Yes†    Yes†                     Yes
GMX               Web        No                No              No                 No                       Yes
Zimbra            Web        No                Yes             No                 No                       Yes
163.com           Web        No                No              No                 No                       Yes
Sina              Web        No                No              No                 No                       Yes
Apple Mail        iOS        No                No              Yes                Yes                      No
Gmail             iOS        Yes               No              Yes                Yes                      No
Gmail             Android    Yes               No              Yes                Yes                      No
Apple Mail        Desktop    No                No              Yes                Yes                      No
Windows Mail      Desktop    No                No              Yes                No                       No
Outlook 2016      Desktop    No                Yes             Yes                No                       No
Thunderbird       Desktop    No                Yes             Yes                Optional (Default: No)   Yes

Table 12: A survey of the privacy impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird). Images are only blocked for messages considered spam. † Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python, using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
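The pipeline described above can be sketched with the standard library alone. This is not the prototype itself: the two regular expressions are hypothetical stand-ins for rules compiled from EasyList and EasyPrivacy (the prototype uses BlockListParser for that step), tag matching is done with regular expressions rather than a full HTML parser, and the function names are invented for this example.

```python
import re
from email import message_from_string, policy

# Hypothetical stand-ins for regular expressions compiled from filter lists.
BLOCK_PATTERNS = [
    re.compile(r"^https?://([^/]+\.)?tracker\.example/"),
    re.compile(r"/open\.gif(\?|$)"),
]
TAG_RE = re.compile(r"<(img|link)\b[^>]*>", re.IGNORECASE)
URL_ATTR_RE = re.compile(r"""\b(?:src|href)\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def is_blocked(url):
    """True if any blocklist pattern matches the URL."""
    return any(pattern.search(url) for pattern in BLOCK_PATTERNS)


def filter_html(html):
    """Drop <img> and <link> tags whose src or href matches the blocklist."""

    def drop_if_blocked(match):
        url = URL_ATTR_RE.search(match.group(0))
        return "" if url and is_blocked(url.group(1)) else match.group(0)

    return TAG_RE.sub(drop_if_blocked, html)


def filter_message(raw_message):
    """Parse an email, filter every text/html part, and return the message."""
    msg = message_from_string(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            part.set_content(filter_html(part.get_content()), subtype="html")
    return msg
```

Because only URLs that appear literally in the body are inspected, a tag whose first URL is benign but later redirects to a tracker passes through unchanged, which is the limitation relative to request blocking noted in the evaluation.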

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak the email address to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                      # of Senders
li.<PS+1>/imp                    51 (5.7%)
partner.<PS+1>/                  7 (0.7%)
stripe.<PS+1>/stripe/image       4 (0.4%)
p.<PS+1>/espopen                 4 (0.4%)
api.<PS+1>/layouts/section/<N>   4 (0.4%)
<PS+1>/customer-service          3 (0.3%)
mi.<PS+1>/p/rp                   3 (0.3%)
dmtk.<PS+1>/                     3 (0.3%)
links.<PS+1>/e/open              3 (0.3%)
eads.<PS+1>/imp                  3 (0.3%)

Table 13: The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
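The pattern generation described in the caption of Table 13 can be sketched as below. The sketch makes several assumptions that the caption does not settle: PS+1 is approximated by the last two hostname labels instead of the Public Suffix List, only path segments that consist entirely of digits are replaced with <N>, and a "file extension" is taken to be a dot followed by one to five alphanumeric characters. The function names are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def naive_ps1(hostname):
    """Crude stand-in for PS+1: the last two labels of the hostname."""
    return ".".join(hostname.split(".")[-2:])


def url_pattern(url):
    """Reduce a URL to hostname and path, with PS+1 and integers generalized."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    ps1 = naive_ps1(host)
    host_pattern = host[: len(host) - len(ps1)] + "<PS+1>"
    segments = parts.path.split("/")
    if re.search(r"\.[A-Za-z0-9]{1,5}$", segments[-1]):
        segments = segments[:-1]  # drop a trailing file name
    segments = ["<N>" if s.isdigit() else s for s in segments]
    return host_pattern + "/".join(segments)


def rank_patterns(leaking_requests):
    """Rank patterns by the number of distinct senders with a matching leak.

    leaking_requests is an iterable of (sender, url) pairs.
    """
    senders = defaultdict(set)
    for sender, url in leaking_requests:
        senders[url_pattern(url)].add(sender)
    return sorted(((p, len(s)) for p, s in senders.items()),
                  key=lambda item: (-item[1], item[0]))
```

For instance, `url_pattern("http://li.sender.example/imp?e=user%40example.com")` yields `li.<PS+1>/imp`, so the same first-party endpoint on different senders' domains collapses into one row.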

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available.⁸ It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy have not received much research attention despite their central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

⁸ https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext (over unencrypted HTTP).

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
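Both arguments can be demonstrated in a few lines. The sketch below assumes unsalted SHA256 over the address exactly as typed; the log layout, addresses, and function names are hypothetical, and a realistic dictionary attack would enumerate a far larger candidate list.

```python
import hashlib


def sha256_hex(address):
    """Unsalted SHA256 of the address, hex encoded."""
    return hashlib.sha256(address.encode("utf-8")).hexdigest()


def reverse_hashes(observed_hashes, candidate_addresses):
    """Dictionary attack: hash each candidate and look it up among observed hashes."""
    observed = set(observed_hashes)
    return {sha256_hex(a): a for a in candidate_addresses if sha256_hex(a) in observed}


def records_for(address, log):
    """Known-target lookup: hash the address once and filter the log by that hash."""
    target = sha256_hex(address)
    return [entry for entry in log if entry["email_hash"] == target]


# Hypothetical tracker log keyed by hashed address.
log = [
    {"email_hash": sha256_hex("alice@example.com"), "url": "https://news.example/a"},
    {"email_hash": sha256_hex("bob@example.com"), "url": "https://shop.example/b"},
]
candidates = [f"{name}@example.com" for name in ("alice", "carol", "bob")]
recovered = reverse_hashes([entry["email_hash"] for entry in log], candidates)
alice_records = records_for("alice@example.com", log)
```

The absence of a salt is what makes `records_for` a plain equality test: any party that knows the address computes the same value the tracker stored.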

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing (a process in which different trackers exchange and link together their own IDs for the same user) occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies, advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
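The core comparison behind differential testing might be sketched as follows. This is a proposal-level illustration with hypothetical names and URLs; it compares one pair of requests to the same endpoint and makes no attempt to separate address-derived values from per-message or A/B-test identifiers, which is exactly the confound noted above.

```python
from urllib.parse import urlsplit, parse_qsl


def differing_parameters(url_a, url_b):
    """Return query parameters present in both URLs whose values differ."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.hostname, a.path) != (b.hostname, b.path):
        return {}  # only compare requests to the same endpoint
    params_a = dict(parse_qsl(a.query, keep_blank_values=True))
    params_b = dict(parse_qsl(b.query, keep_blank_values=True))
    return {key: (params_a[key], params_b[key])
            for key in params_a.keys() & params_b.keys()
            if params_a[key] != params_b[key]}


# Hypothetical requests triggered by emails sent to two different addresses.
diff = differing_parameters(
    "http://t.example/px?cid=77&u=9f2c&v=1",
    "http://t.example/px?cid=77&u=41ab&v=1",
)
```

Parameters that differ across addresses but stay constant across repeated emails to the same address would be the strongest candidates for transformed email addresses.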

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References[1] Adblock Plus - Surf the web without annoying ads https

adblockplusorg Online accessed 2017-09-05[2] BeautifulSoup httpswwwcrummycomsoftware

BeautifulSoup Online accessed 2017-09-05[3] BlockListParser httpsgithubcomshivamagarwal-iitb

BlockListParser Online accessed 2017-09-05[4] EasyList and EasyPrivacy httpseasylistto Online

accessed 2017-09-05[5] uBlock Origin - An efficient blocker for Chromium and Fire-

fox Fast and lean httpsgithubcomgorhilluBlockOnline accessed 2017-09-05

[6] CSS Support Guide for Email Clients Campaign Sourcehttpswwwcampaignmonitorcomcss (Archive httpswwwwebcitationorg6rLLXBX0E) 2014

I never signed up for this Privacy implications of email tracking 16

[7] Gunes Acar Christian Eubank Steven Englehardt MarcJuarez Arvind Narayanan and Claudia Diaz The web neverforgets Persistent tracking mechanisms in the wild In Pro-ceedings of ACM CCS pages 674ndash689 ACM 2014

[8] Gunes Acar Marc Juarez Nick Nikiforakis Claudia DiazSeda Guumlrses Frank Piessens and Bart Preneel Fpdetectivedusting the web for fingerprinters In Proceedings of the2013 ACM SIGSAC conference on Computer amp communica-tions security pages 1129ndash1140 ACM 2013

[9] Julia Angwin Why online tracking is getting creepier ProP-ublica Jun 2014

[10] Mika D Ayenson Dietrich James Wambach Ashkan SoltaniNathan Good and Chris Jay Hoofnagle Flash cookies andprivacy II Now with html5 and etag respawning 2011

[11] Bananatag Email Tracking for Gmail Outlook and otherclients httpsbananatagcomemail-tracking Onlineaccessed 2017-09-04

[12] Justin Brookman Phoebe Rouge Aaron Alva Alva andChristina Yeung Cross-device tracking Measurement anddisclosures In Proceedings of the Privacy Enhancing Tech-nologies Symposium 2017

[13] Ceren Budak Sharad Goel Justin Rao and Georgios ZervasUnderstanding emerging threats to online advertising InProceedings of the ACM Conference on Economics andComputation 2016

[14] ContactMonkey Email Tracking for Outlook and Gmailhttpswwwcontactmonkeycomemail-tracking Onlineaccessed 2017-09-04

[15] Zakir Durumeric David Adrian Ariana Mirian James Kas-ten Elie Bursztein Nicolas Lidzborski Kurt Thomas VijayEranti Michael Bailey and J Alex Halderman Neither snownor rain nor mitm An empirical analysis of email deliv-ery security In Proceedings of the 2015 ACM Conferenceon Internet Measurement Conference pages 27ndash39 ACM2015

[16] Peter Eckersley How unique is your web browser In In-ternational Symposium on Privacy Enhancing TechnologiesSymposium pages 1ndash18 Springer 2010

[17] Steven Englehardt and Arvind Narayanan Online trackingA 1-million-site measurement and analysis In ACM Confer-ence on Computer and Communications Security 2016

[18] Steven Englehardt Dillon Reisman Christian Eubank Pe-ter Zimmerman Jonathan Mayer Arvind Narayanan andEdward W Felten Cookies that give you away The surveil-lance implications of web tracking In Proceedings of the24th Conference on World Wide Web 2015

[19] David Fifield and Serge Egelman Fingerprinting web usersthrough font metrics In International Conference on Finan-cial Cryptography and Data Security 2015

[20] Gmail Help Choose whether to show images httpssupportgooglecommailanswer145919 Online accessed2017-09-06

[21] Ralph Holz Johanna Amann Olivier Mehani Mohamed AliKacircafar and Matthias Wachs TLS in the wild An internet-wide analysis of tls-based protocols for electronic commu-nication In 23nd Annual Network and Distributed SystemSecurity Symposium NDSS 2016 San Diego CaliforniaUSA February 21-24 2016 2016

[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.

[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.

[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.

[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.

[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.

[27] Timothy Libert. Exposing the invisible web: An analysis of third-party HTTP requests on 1 million websites. International Journal of Communication, 9:18, 2015.

[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.

[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.

[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.

[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.

[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.

[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.

[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. Recon: Revealing and controlling PII leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.

[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.

[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.

[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.

[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of PII via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.

[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.

[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.

[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.

[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parent; thus the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.
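The three criteria above can be read as a lexicographic sort key. The following is a minimal sketch of that reading, not the crawler's actual code: the `Element` class, its field names, and the treatment of a missing z-index as 0 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a DOM node (not the crawler's real type)."""
    z_index: str = "auto"               # CSS z-index as a string
    parent: Optional["Element"] = None
    html: str = ""                      # outer HTML of the element
    num_inputs: int = 0                 # number of input fields in the form


def effective_z_index(el: Optional[Element]) -> int:
    """Walk up the parents until a non-auto z-index is found (0 at the root)."""
    while el is not None:
        if el.z_index != "auto":
            return int(el.z_index)
        el = el.parent
    return 0


def rank_key(form: Element) -> tuple:
    html = form.html.lower()
    looks_modal = any(s in html for s in ("modal", "dialog"))
    looks_login = any(s in html for s in ("login", "log in", "sign in"))
    # Larger tuples win: stacking order, modal tie-break, non-login, field count.
    return (effective_z_index(form), looks_modal, not looks_login, form.num_inputs)


def best_form(candidates: list) -> Element:
    return max(candidates, key=rank_key)
```

Encoding the criteria as a tuple keeps their priority order explicit: a later criterion only matters when all earlier ones tie.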

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
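A minimal sketch of the bottom-up search follows, using a hypothetical `Node` tree in place of a real DOM. The class, its `add` helper, and the rule for what counts as a submit button are assumptions; the paper does not give these details.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Node:
    """Hypothetical, simplified DOM node used only for this illustration."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def descendants(node: Node):
    for child in node.children:
        yield child
        yield from descendants(child)


def is_submit(node: Node) -> bool:
    # Assumption: any <button> or <input type="submit"> counts as a submit control.
    return node.tag == "button" or (
        node.tag == "input" and node.attrs.get("type") == "submit")


def find_logical_form(email_input: Node) -> Optional[Node]:
    """Climb from an email input to the first ancestor containing a submit button."""
    container = email_input.parent
    while container is not None:
        if any(is_submit(n) for n in descendants(container)):
            return container
        container = container.parent
    return None
```

Because the search starts at the input and moves outward, the first match is the smallest enclosing container, which is the property the paragraph above relies on.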

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
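As a rough illustration of how such attribute hints could be combined, the sketch below checks the type attribute first and then falls back to keyword matching over the listed attributes. The keyword table is hypothetical; the paper does not list the terms the crawler actually uses.

```python
# Attributes named in the text as the most frequent carriers of hints.
HINT_ATTRIBUTES = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword table, for illustration only.
FIELD_KEYWORDS = {
    "email": ("email", "e-mail"),
    "tel": ("phone", "tel", "mobile"),
    "name": ("name",),
}


def infer_field_type(attrs: dict) -> str:
    """Guess what a text input expects from its HTML attributes."""
    explicit = attrs.get("type", "text").lower()
    if explicit in ("email", "tel"):
        return explicit                      # HTML5 type is decisive
    hints = " ".join(str(attrs.get(a, "")) for a in HINT_ATTRIBUTES).lower()
    for field_type, keywords in FIELD_KEYWORDS.items():
        if any(k in hints for k in keywords):
            return field_type
    return "unknown"
```

Substring matching is deliberately crude here (a value such as "hotel" would match "tel"); a real implementation would need tokenization or a ranked keyword list.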

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows, and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types that are non-text (i.e., not text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
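The recursive pruning step can be sketched with Python's standard `email` package. This is only an analogue for illustration: the server described above is built on SubEtha SMTP and JavaMail, and the function names here are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def strip_non_text(msg: EmailMessage) -> EmailMessage:
    """Recursively drop subparts whose content type is not text/*."""
    if msg.is_multipart():
        kept = [
            strip_non_text(part)
            for part in msg.get_payload()
            # Keep containers (to recurse into them) and textual leaves.
            if part.is_multipart() or part.get_content_maintype() == "text"
        ]
        msg.set_payload(kept)
    return msg


def load_and_strip(raw: bytes) -> EmailMessage:
    """Parse raw DATA bytes and remove non-textual subparts before storage."""
    return strip_non_text(message_from_bytes(raw, policy=policy.default))
```

Multipart containers are kept so that nested text parts survive; a container left empty after pruning is not cleaned up in this sketch.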

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party    Organization   Avg. add'l parties   S     E
pippio.com           Acxiom         5.7                  7     32
liadm.com*           LiveIntent     3.7                  68    1097
rlcdn.com            Acxiom         1.7                  11    551
imiclk.com           MediaMath      1.3                  2     4
mathtag.com          MediaMath      1.1                  11    382
alcmpn.com           ALC†           0.8                  6     132
emltrk.com           Litmus         0.7                  41    638
acxiom-online.com    Acxiom         0.4                  2     33
dyneml.com           PowerInbox     0.1                  3     13
adnxs.com            AppNexus       0.1                  19    277

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (S) out of 902 total and the number of emails (E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.<firstparty>.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel



High-level architecture of crawler
Assemble a list of sites. For each site:
  - Find pages potentially containing forms. For each page:
    - Find the best form on the page via top-down form detection and bottom-up form detection. If a form was found:
      * Fill in the form
      * Fill in any secondary forms if necessary
      * Once a form has been submitted, skip the rest of the pages and continue to next site

High-level architecture of server
Receive and store email. For each email:
  - Check for and process confirmation links

Fig. 1. High-level architecture of the email collection system, with the individual modules italicized.

of third-party tracking of emails, identify as many trackers as possible (feeding into our enhancements to existing tracking-protection lists), and as many interesting behaviors as possible (such as different hashes and encodings of email addresses).

To achieve scale, we use an automated approach. We created a web crawler based on the OpenWPM web privacy measurement tool [17] to search for and fill in forms that appear to be mailing-list subscriptions. The crawler has five modules, and the server that processes emails has two modules. They are both described at a high level in Fig. 1. We now describe each of the seven modules in turn.

Assemble a list of sites. Alexa maintains a public list of the top 1 million websites based on monthly traffic statistics, as well as rankings of the top 500 websites by category. We used the "Shopping" and "News" categories, since we found them more likely to contain newsletters. In addition, we visited the top 14,700 sites of the 1 million sites, for a total of 15,700 sites.

Detect and rank forms. When the crawler cannot locate a form on the landing page, it searches through all internal links (<a> tags) in the DOM until a page containing a suitable form is found. A ranked list of terms, shown in Table 1, is used to prioritize the links most likely to contain a mailing list. On each page, forms are detected using both a top-down and a bottom-up procedure. The top-down procedure examines all fields contained in <form> elements. Forms which have a higher z-index and more input fields are given a higher rank, while forms which appear to be part of user account registration are given a lower rank. If no <form> elements are found, we attempt to discover forms contained in alternative containers (e.g., forms in <div> containers) using a bottom-up procedure. We start with each <input> element and recursively examine its parents until one with a submit button is found. For further details, see Top-down form detection and Bottom-up form detection in Appendix Section 10.1.

Fill in the form. Once a form is found, the crawler must fill out the input fields such that all inputs validate. The crawler fills all visible form fields, including <input> tags, <select> tags (i.e., dropdown lists), and other submit <button> tags. Most websites use the general text type for all text inputs. We surveyed a number of top websites to determine common naming practices for input fields, and filled the fields with the data of the expected type. For example, name fields were filled with a generic first and last name. After submitting a form, we wait for a few seconds and re-run the procedure to fill follow-up fields, if required. For further details, see Determining form field type and Handling two-part form submissions in Appendix Section 10.1.

Receive and store email. We set up an SMTP server to receive emails. The server accepts any mail sent to an existing email address, and rejects it otherwise. It then parses the contents of the mail, and logs metadata (such as the sender address, subject text, and recipient address) to a central database. All textual portions of the message contents are written to disk. We provide implementation details in Appendix Section 10.2.

Check for and process confirmation links. Our server will check the first email sent to each email address to determine if the mailing list requires additional user interaction to confirm the subscription. If the initial email's subject or rendered body text includes the keywords "confirm", "verify", "validate", or "activate", we extract potential confirmation links from the email. For HTML emails, we collect links which match these keywords, along with additional lower-priority keywords "subscribe" or "click". For plain-text emails, we simply choose the longest link text. Emails with the past-tense keywords "confirmed", "subscribed", and "activated" in subject lines are skipped, as are links with the text "unsubscribe", "cancel", "deactivate", and "view". If any link is found, it is visited using OpenWPM.
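The keyword rules above might be sketched as follows for HTML emails. The function names, the `(text, url)` link representation, and the choice to match on link text only are assumptions; the keyword lists come from the paragraph above.

```python
CONFIRM = ("confirm", "verify", "validate", "activate")
LOW_PRIORITY = ("subscribe", "click")
ALREADY_DONE = ("confirmed", "subscribed", "activated")
SKIP_LINK = ("unsubscribe", "cancel", "deactivate", "view")


def needs_confirmation(subject: str, body_text: str) -> bool:
    """Decide whether the first email from a list asks for a confirmation click."""
    subject_l = subject.lower()
    if any(k in subject_l for k in ALREADY_DONE):
        return False                     # past tense: nothing left to confirm
    text = subject_l + " " + body_text.lower()
    return any(k in text for k in CONFIRM)


def pick_confirmation_link(links):
    """links: list of (link_text, url) pairs. Returns a URL, or None."""
    fallback = None
    for text, url in links:
        t = text.lower()
        if any(k in t for k in SKIP_LINK):
            continue                     # checked first: "unsubscribe" contains "subscribe"
        if any(k in t for k in CONFIRM):
            return url                   # high-priority keyword wins immediately
        if fallback is None and any(k in t for k in LOW_PRIORITY):
            fallback = url
    return fallback
```

The skip list has to be tested before the positive keywords because several skip terms contain them as substrings ("deactivate" contains "activate").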

Form submission measurement. Our crawler discovered and attempted to submit forms on 3,335 sites. We received at least one email from 1,242 (37%) of those sites. To understand the types of form submission failures, we ran a follow-up measurement in August 2017, where we took screenshots of the pages before and after the initial and follow-up form submissions. We manually examined a random sample of sites on which a form submission was attempted. We summarize the results in Table 2.

Description                 Keywords                                                      Location
Email list registration     newsletter, weekly ad, subscribe, inbox, email, sale alert    link text
Generic registration        signup, sign up, sign me up, register, create, join           link text
Generic articles/posts      article, news, 2017                                           link URL
Selecting language/region   us, =us&, en-us                                               link URL
Blacklist                   unsubscribe, mobile, phone                                    link text

Table 1. The web crawler chooses links to click based on keywords that appear in the link text or URL. The keywords were generated by iterating on an initial set of terms, optimizing for the success of mailing list sign-ups on the top sites. We created an initial set of search terms and manually observed the crawler interact with the top pages. Each time the crawler missed a mailing list sign-up form or failed to go to a page containing a sign-up form, we inspected the page and updated the set of keywords. This process was repeated until the crawler was successful on the sampled sites.

Submission classification of sampled sites
Total successful submissions          38%
  → Mailing list subscription         32%
  → User account registration          6%
Failed: required a CAPTCHA            16%
Failed: unsupported form fields       25%
Unable to classify via screenshots    21%

Table 2. Submission success status of a sample of 252 of the 3,335 form submissions made during the sign-up crawl. The success and failure classification was determined through a manual review of screenshots taken before and after an attempted form submission.

When filling forms, our crawler will interact with user account registration forms, mailing list sign-up forms, and contact forms. The successful submissions were mostly mailing list sign-ups and a small number of user account registrations, which are included as they can be tied to a mailing list. The failed submissions were mostly caused by forms other than mailing lists. In fact, more than 70% of the failures caused by a captcha or unsupported field were not mailing list form submissions. Overall, only 11% of the sampled mailing list interactions resulted in a captcha. Since our primary focus is mailing lists, we leave the evaluation of complex and captcha-protected forms to future work.

Email corpus. The assembled corpus contains a total of 12,618 HTML emails from 902 sites. We received an average of around 14 emails per site and a median of 5. A few sites had very active mailing lists, with 20 sites sending over 100 emails during the test period. We observe that we received no spam, which we confirmed both by manual inspection of a sample of emails and by finding an exact one-to-one correspondence between the 902 senders in our dataset and the unique email addresses that we generated. This ensures that the results represent the behavior of the sites where we registered, rather than spammers.

4 Privacy leaks when viewing emails

4.1 Measurement methodology

Simulating a webmail client. To measure web tracking in email bodies, we render the emails using a simulated webmail client in an OpenWPM instance. Many webmail clients remove a subset of HTML tags from the email body to restrict the capabilities of rendered content. In particular, Javascript is exclusively removed, while iframe tags and CSS [6] have mixed support. We simulate a permissive webmail client: one which disables Javascript and removes the Referer header from all requests, but applies no other restrictions to the rendered content.

The email content is served on localhost, but is accessed through the domain localtest.me (which resolves to localhost) to avoid any special handling the browser may have for the local network. We configure OpenWPM to run 15 measurement instances in parallel. Each email is loaded twice in its own measurement instance: once with a fresh profile, and then again keeping the same browser profile after sleeping for 10 seconds. This is intended to allow remote content on the page to load both with and without browser state present. Indeed, we observe some tracking images which redirect to new domains upon every subsequent reload of the same email.


Classifying third-party content. Many email clients load embedded content directly from remote servers (we further explore the properties of email clients in Section 6.2). Thus, remote content present in multiple emails can track users in the same way third-party content can track users across sites on the web. However, unlike the web, there isn't always a clear distinction of which requests are "third-party" and which are "first-party". For example, all resources loaded by webmail clients are considered third-party by the browser. We consider any request to a domain² which is different than both the domain on which we signed up for the mailing list and the domain of the sender's email address to be a third-party request.
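A minimal sketch of this third-party test follows. The `PUBLIC_SUFFIXES` set is a tiny illustrative subset (a real implementation would load the full Public Suffix List), and the function names and example hostnames are hypothetical.

```python
from urllib.parse import urlsplit

# Tiny illustrative subset; an assumption standing in for the Public Suffix List.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}


def ps_plus_one(hostname: str) -> str:
    """Return the public suffix plus the label immediately before it (PS+1)."""
    labels = hostname.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        # Longest candidate suffix is tried first.
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return hostname.lower()      # e.g. raw IP addresses are returned unchanged


def is_third_party(request_url: str, signup_domain: str, sender_address: str) -> bool:
    """True if the request goes to neither the sign-up site nor the sender's domain."""
    request_domain = ps_plus_one(urlsplit(request_url).hostname or "")
    first_parties = {
        ps_plus_one(signup_domain),
        ps_plus_one(sender_address.rsplit("@", 1)[-1]),
    }
    return request_domain not in first_parties
```

Treating both the sign-up domain and the sender's domain as first parties matters because many sites send mail from a separate domain that they also control.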

Detecting email leakage. Email addresses leak to remote servers through resource requests. Detecting these leaks is not as simple as searching for email addresses in requests, since the addresses may be hashed or encoded, sometimes iteratively. To detect such leakage, we develop a methodology that, given a set of encodings and hashes, a plaintext email address, and a URL token, is able to determine if the token is a transformation of the email address. Starting with the plaintext email address, we pre-compute a candidate set of tokens by applying all supported encodings and hashes iteratively, stopping once we reach three nested encodings or hashes. We then take the URL token and apply all supported decodings to the value, checking if the result is present in the candidate set. If not, we iteratively apply decodings until we reach a level of three nested decodings.

In a preliminary measurement, we found no examples of a value that was encoded before being hashed. This is unsurprising, as hashed email addresses are used to sync data between parties, and adding a transformation before the hash would prevent that use case. Thus, when analyzing the requests in this dataset, we restrict ourselves to at most three nested hashes for a set of 24 supported hashes, including md5, sha1, and sha256. For encodings, we apply all possible combinations of 10 encodings, including base64, urlencoding, and gzip. The full list of supported hashes and encodings is given in Appendix 10.3.
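The candidate-set idea can be sketched as follows, using only three hashes and two encodings from the supported list. Several details are assumptions made for illustration: nested hashes are applied to the hex digest of the previous step, and the decoding side is reduced to a single URL-decode pass instead of the iterative decoding described above.

```python
import base64
import hashlib
from urllib.parse import quote, unquote


def _hex_hash(name):
    # Assumption: each hash is applied to, and emits, the lowercase hex digest.
    return lambda b: hashlib.new(name, b).hexdigest().encode()


TRANSFORMS = {
    "md5": _hex_hash("md5"),
    "sha1": _hex_hash("sha1"),
    "sha256": _hex_hash("sha256"),
    "base64": base64.b64encode,
    "urlencoding": lambda b: quote(b, safe="").encode(),
}


def candidate_tokens(email: str, max_depth: int = 3) -> dict:
    """Map every derived token to the chain of transforms that produced it."""
    found = {email.encode(): ()}
    frontier = dict(found)
    for _ in range(max_depth):
        next_frontier = {}
        for value, chain in frontier.items():
            for name, fn in TRANSFORMS.items():
                out = fn(value)
                if out not in found:          # keep the shortest chain only
                    found[out] = next_frontier[out] = chain + (name,)
        frontier = next_frontier
    return found


def find_leak(url_token: str, candidates: dict):
    """Return the transform chain if the token derives from the address, else None."""
    for attempt in (url_token.encode(), unquote(url_token).encode()):
        if attempt in candidates:
            return candidates[attempt]
    return None
```

Pre-computing the candidate set once per address turns each URL token check into a dictionary lookup, which is what makes scanning every request in a crawl practical.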

Classifying email leakage. Email leaks may not be intentional. If an email address is included in the query string or path of a document URL, it may automatically end up in the Referer header of subsequent requests from that document. Requests which result in a redirect also often add the referrer of the previous request to the query string of the new request. In many instances this happens irrespective of the presence of an email address in the original request. The situation is made more complex on the web, since third-party Javascript can dynamically build URLs and trigger requests.

² A domain is identified by its public suffix plus the component of the hostname immediately preceding its public suffix (PS+1).

The reduced HTML support and lack of Javascript execution in email clients make it possible to determine intentionality for most leaks. When an email is rendered, requests can result from three sources: from elements embedded in the original HTML, from within an embedded iframe (if supported by the client), or from a redirected request.

1. If a leak occurs in a Referer header, it is unintentional. For webmail clients, the Referer header (if enabled) will be the client itself. A mail sender can embed an iframe which loads a URL that includes the user's email address, with the explicit intention that the user's email leak to third parties via the Referer header. However, we chose not to include this possibility because email senders have multiple direct options for sharing information with third parties that do not rely on the sparsely supported iframe tag.

2. If a leak occurs in a request to a resource embedded directly in the HTML of the email body (and is not the result of a redirect), it is intentional. We can determine intentionality since any request resulting from an HTML document must have been constructed by the email sender. Note that this does not hold for web documents, since embedded Javascript can dynamically construct requests during the page visit.

3. If a request results from a redirect, the party responsible for the leak is the party whose request (i.e., the triggering URL) responded with a redirect to the new location (i.e., the target URL). We classify a leak as intentional if the leaked value is hashed between the triggering URL and the target URL, or if there are more encodings or hashes of the leaked value included in the target URL than in the triggering URL. If the target URL includes a full copy of the triggering URL (in any encoding), the leak is unintentional. All other cases are classified as ambiguous, such as the case where a target URL includes only the query string of the triggering URL.
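Rule 3 can be sketched as a small decision function. The inputs are assumptions: each URL is accompanied by the set of forms (for example "plaintext", "md5", "urlencoding") in which a leak detector found the address. The order in which the checks are applied, and the restriction of "any encoding" to URL-encoding, are also choices made here for illustration, not details given in the text.

```python
from urllib.parse import quote

HASH_FORMS = {"md5", "sha1", "sha256"}   # illustrative subset of supported hashes


def contains_copy(trigger_url: str, target_url: str) -> bool:
    """True if the target embeds the whole triggering URL, raw or URL-encoded."""
    once = quote(trigger_url, safe="")
    twice = quote(once, safe="")
    return any(v in target_url for v in (trigger_url, once, twice))


def classify_redirect_leak(trigger_url, target_url, trigger_forms, target_forms):
    """Classify a leak that appears in the target of a redirect."""
    new_forms = set(target_forms) - set(trigger_forms)
    if new_forms & HASH_FORMS:
        return "intentional"             # value was hashed between the two URLs
    if contains_copy(trigger_url, target_url):
        return "unintentional"           # redirector merely forwarded its own URL
    if new_forms:
        return "intentional"             # additional encodings were added
    return "ambiguous"
```

The copy check sits between the two intentional checks so that a URL-encoded copy of the triggering URL, which necessarily contains a URL-encoded address, is not mistaken for a deliberately added encoding.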


Measuring blocked tags. Tracking protection tools which block resource requests offer users protection against the tracking embedded in emails. We evaluate the effectiveness of these tools by checking the requests in our dataset against two major blocklists: EasyList and EasyPrivacy [4]. These lists block advertisement and tracking related requests, and are bundled with several popular blocking extensions, including AdBlock Plus [1] and uBlock Origin [5]. We use the BlockListParser library [3] to determine if a request would have been blocked³ by an extension utilizing these lists. We classify a request as blocked if it matches any of the following three conditions:

1. The request directly matches the filter list.
2. The request is the result of a redirect and any request earlier in the redirect was blocked.
3. The request is loaded in an iframe and the iframe document request (or any resulting redirect) was blocked.

It is possible to do this classification in an offline fashion because of the lack of Javascript support in email clients. This removes the need to run measurements with one of the aforementioned extensions installed. In environments that support Javascript, content can be loaded dynamically and as the result of interactions between several scripts. In such an environment, it is much more difficult to determine which requests would have been blocked by a single script appearing on the block list.
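The three blocking conditions lend themselves to a short recursive check over logged requests. In the sketch below, the `Request` record and its field names are hypothetical, and `matches` is a caller-supplied predicate standing in for a filter-list lookup; it does not reproduce the BlockListParser API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    """Hypothetical log record for one resource request."""
    url: str
    redirected_from: Optional["Request"] = None   # previous hop in a redirect chain
    frame_document: Optional["Request"] = None    # final request for the enclosing iframe


def is_blocked(req: Optional[Request], matches: Callable[[str], bool]) -> bool:
    """Apply the three conditions: direct match, blocked earlier hop, blocked iframe."""
    if req is None:
        return False
    return (
        matches(req.url)
        or is_blocked(req.redirected_from, matches)
        or is_blocked(req.frame_document, matches)
    )
```

Because the iframe document is itself a `Request` with its own `redirected_from` chain, the recursion also covers a blocked redirect on the way to the iframe document.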

4.2 Email provides much of the same tracking opportunities as the web

Remote resources embedded in email content can track users across emails. As we show in our survey of email clients (Section 6.2), many email clients allow remote resources to set persistent cookies and send those cookies with resource requests. In total, we find that 10,724 of the measured emails (85%) embed resources from at least one third party, with an average of 5 third parties per email. The distribution of embedded third parties is far from uniform; we find a median of two per email, and a small number of emails embedding as many as 50 third parties (Figure 2).

³ We set the parser options as we would expect them to be set for a request occurring in a webmail client. For example, all requests are considered third-party requests.

Domain                  % of Emails   % of Top 1M
doubleclick.net         22.2          47.5
mathtag.com             14.2          7.9
dotomi.com              12.7          3.5
adnxs.com               12.2          13.2
tapad.com               11.0          2.6
liadm.com               11.0          0.4
returnpath.net          11.0          <0.1
bidswitch.net           10.5          4.9
fonts.googleapis.com    10.2          39.4
list-manage.com         10.1          <0.1

Table 3. Top third-party domains by percentage of the 12,618 emails in the corpus. For comparison, we show the percentage of the top 1 million websites on which these third parties are present.

Fig. 2. CDF of third parties per email, aggregating data across the initial viewing and re-opening of an email. In addition, 1.4% of emails have between 25 and 53 third parties.

Table 3 shows the top third-party domains present in email. Many of these parties also have a large presence on the web [17], blurring the line between email and web tracking. On webmail clients, requests to these cross-context third parties will use the same cookies, allowing them to track both a user's web browsing and email habits. In total, the emails visited during our crawls embed resources from 879 third parties.

4.3 Leaks of email addresses to third parties are common

In addition to being able to track email habits, 99 third parties (11%) also gain access to a user's email address, whether in plaintext or hashed. In email clients which support cookies, these third parties will receive the email address alongside any cookies they've set on the user's device. Trackers which are also present on the web will thus be able to link this address with the user's browsing history profile.

Around 19% of the 902 senders leaked the user's email address to a third party in at least one email, and in total, 29% of emails contain leaks to third parties. We find that a majority of these leaks, 62% of the 100,963 leaks to third parties, are intentional. These intentional leaks mostly occur through remote content embedded directly by the sender. Furthermore, 1% of leaks are classified as unintentional, with the remainder considered ambiguous. While we do not attempt to determine how these identifiers are being used, plaintext and hashed emails can be used for persistent tracking, cross-device tracking, and syncing information between parties.

Leak                    # of Senders    # of Recipients
MD5                     100 (11.1%)     38 (38.5%)
SHA1                    64 (7.1%)       19 (19.2%)
SHA256                  69 (7.6%)       13 (13.1%)
Plaintext Domain        55 (6.1%)       2 (2.0%)
Plaintext Address       77 (8.5%)       54 (54.5%)
URL Encoded Address     6 (0.6%)        8 (8.1%)
SHA1 of MD5             1 (0.1%)        1 (1.0%)
SHA256 of MD5           1 (0.1%)        1 (1.0%)
MD5 of MD5              1 (0.1%)        1 (1.0%)
SHA384                  1 (0.1%)        1 (1.0%)

Table 4. Email address leakage to third parties by encoding. Percentages are given out of a total of 902 senders and 99 third-party leak recipients. All hashes are of the full email address. Email "domain" is the part of the address after the "@". These appear to be a misuse of LiveIntent's API (Section 4.5).

The leaked addresses are often hashed. Although we can detect email addresses hashed with 24 different functions and up to three nested layers, we only find MD5, SHA1, and SHA256 in frequent use. Table 4 summarizes the number of senders and receivers of each encoding. The relatively low diversity of hashes and encodings suggests that these techniques are not being used to obfuscate the collection of email addresses. In fact, the query parameters which contain hashed emails sometimes identify the hash functions used in the parameter name (e.g., a string like md5=<md5 hash of email> appearing in the HTTP request). The design of APIs like LiveIntent’s, which first receives an email address and then syncs with a number of other parties (Section 4.5), suggests that these hashed addresses may be used to share or link data from multiple parties.
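To make the idea of searching for hashed and nested-hashed addresses concrete, the following is a minimal sketch and not the detection code used in the study. The function names, the four-entry hash list, and the two-layer default are assumptions chosen to keep the example small; the study describes 24 functions and up to three layers.

```python
import hashlib
from urllib.parse import unquote

# Hypothetical subset of hash functions; the study reports searching for 24.
HASHES = ("md5", "sha1", "sha256", "sha384")


def candidate_encodings(email, max_layers=2):
    """Map each encoded form of `email` to a label such as 'sha1 of md5'."""
    found = {email: "plaintext"}
    layer = {email: ""}
    for _ in range(max_layers):
        next_layer = {}
        for value, label in layer.items():
            for name in HASHES:
                digest = hashlib.new(name, value.encode("utf-8")).hexdigest()
                next_layer[digest] = f"{name} of {label}" if label else name
        found.update(next_layer)
        layer = next_layer
    return found


def find_leaks(url, email):
    """Return labels of every encoding of `email` that appears in `url`."""
    # Percent-decoding first also catches URL-encoded plaintext addresses.
    haystack = unquote(url).lower()
    return sorted(
        label
        for token, label in candidate_encodings(email.lower()).items()
        if token in haystack
    )
```

Calling find_leaks on a request URL returns an empty list when no known encoding of the address is present. A substring search of this kind cannot find encodings outside the candidate set, which is the gap that differential testing (Section 8) is meant to address.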

Recipient Organization    # of Senders
LiveIntent                68 (7.5%)
Acxiom                    46 (5.1%)
Litmus Software           28 (3.1%)
Conversant Media          26 (2.9%)
Neustar                   24 (2.7%)
apxlv.com                 18 (2.0%)
54.211.147.17             18 (2.0%)
Trancos                   17 (1.9%)
WPP                       17 (1.9%)
54.82.61.160              16 (1.8%)

Table 5. Top organizations receiving email address leaks, by number of the 902 total senders. A domain is used in place of an organization when it isn’t clear which organization it belongs to.

Table 5 identifies the top organizations (footnote 4) which receive leaked email addresses. This shows that email address collection from emails is largely consolidated to a few major players, which are mostly distinct from the popular web trackers. In fact, only one of the top 10 organizations, Neustar, is found in the top 20 third-party organizations on the top 1 million websites, as measured by Englehardt and Narayanan [17]. Also surprising is the prevalence of leaks to IP addresses, which account for eight of the top 20 domains receiving email addresses. This may be due to the relatively ephemeral nature of newsletter emails, which removes concerns of IP address churn over time.

4.4 Reopening emails brings in new third parties

Despite the lack of JavaScript support, email views are dynamic. The email content itself is static, but any remote resources embedded in it may return different responses each time the email is viewed, and even redirect to different third parties. To examine the effects of this, we load every email first with a “clean” browser profile and then again without clearing the profile. Surprisingly, the average Jaccard similarity [36] between the sets of third parties loaded during the first and second views of the same email is only 60%.
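For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below shows the computation for the third parties seen in two views of one email; the function name and the convention of returning 1.0 for two empty sets are assumptions made for illustration.

```python
def jaccard_similarity(first_view, second_view):
    """Jaccard similarity |A & B| / |A | B| of two collections of third parties."""
    a, b = set(first_view), set(second_view)
    if not a and not b:
        # Assumed convention: two empty views are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)
```

For example, if the first view loads three third parties and the second view loads two of those plus one new party, the similarity is 2/4 = 0.5.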

The majority of emails, two-thirds, load fewer third parties when the email is reopened compared to the initial view. However, about 21% of emails load at least one resource when an email is reopened that wasn’t present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

Footnote 4: We map domains to organizations using the classification provided by Libert [27], adding several new email-specific organizations. When an organization could not be found, we use the PS+1.

Row  Request URL
0    http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0
1    http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN>
2    http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
3    http://x.bidswitch.net/ul_cbsync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
4    http://p.adsymptotic.com/dpx?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...]
5    http://p.adsymptotic.com/dpx?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...]
6    http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent
7    http://i.liadm.com/s19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1>
8    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc
9    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc=
10   http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1
11   http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=
12   http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent

Table 6. Redirect chain from a LiveIntent email tracking pixel. URL query strings are truncated for clarity (using [...]).

The number of leaks between email loads stays relatively constant, with fewer than 50 emails leaking to new parties on the second load (footnote 5). However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak if the redirectors share data based on the email or email hash they receive.

4.5 Case study: LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section, we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com and includes the user’s plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=. The URL also includes the domain portion of the user’s address.

Footnote 5: We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

In rows 2-12, the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10, LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.

4.6 Request blockers help, but don’t fix the problem

Privacy-conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content (footnote 6). We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both plaintext and hashed email addresses still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

Footnote 6: Thunderbird supports most of the popular Firefox extensions, and as such, Thunderbird users can also deploy these defenses. See Table 12 for more details.

Encoding               # of Senders    # of Recipients
Plaintext Address      34 (3.7%)       34 (66.7%)
MD5                    21 (2.3%)       12 (23.5%)
SHA1                   14 (1.6%)       6 (11.8%)
URL Encoded Address    4 (0.4%)        4 (7.8%)
SHA256                 4 (0.4%)        2 (3.9%)
SHA384                 1 (0.1%)        1 (2.0%)

Table 7. Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain          # of Senders
mediawallahscript.com     7
jetlore.com               4
scrippsnetworks.com       4
alocdn.com                3
richrelevance.com         3
ivitrack.com              2
intentiq.com              2
gatehousemedia.com        2
realtime.email            2
ziffimages.com            2

Table 8. The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support JavaScript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit them individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href property of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. This helps ensure that we have as diverse a set of landing pages as possible by stripping fragment and query string identifiers that may not influence the landing page.
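The binning and sampling strategy can be sketched as follows. This is an illustrative reading of the description above rather than the crawler’s actual code: the function names are hypothetical, and ps_plus_one is a deliberately crude stand-in that takes the last two hostname labels instead of consulting the Public Suffix List.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def ps_plus_one(hostname):
    """Crude PS+1 stand-in: the last two labels of the hostname.

    Assumption for illustration only; a real implementation should use the
    Public Suffix List so that names like 'example.co.uk' are handled.
    """
    return ".".join(hostname.lower().split(".")[-2:])


def sample_links(links, limit=200, seed=None):
    """Sample up to `limit` unique links, one per (PS+1, path) bin per pass."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for link in sorted(set(links)):
        parts = urlsplit(link)
        if parts.hostname is None:
            continue  # skip mailto: links, relative links, and similar
        bins[(ps_plus_one(parts.hostname), parts.path)].append(link)
    for bucket in bins.values():
        rng.shuffle(bucket)

    sampled = []
    # Repeatedly take one link from each non-empty bin, without replacement.
    while len(sampled) < limit and any(bins.values()):
        for bucket in bins.values():
            if bucket and len(sampled) < limit:
                sampled.append(bucket.pop())
    return sampled
```

Because links that differ only in query string or fragment fall into the same bin, the first pass over the bins already yields one link per distinct landing-page candidate before any bin is visited a second time.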

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both JavaScript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client’s URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if the address is contained either in the URL or the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset, there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph, except the first, are slight underestimates due to our limit of 200 links per sender.

Recipient Organization    # of Senders
Google                    247 (27.4%)
Facebook                  160 (17.7%)
Twitter                   94 (10.4%)
Adobe                     81 (9.0%)
Microsoft                 73 (8.1%)
Pinterest                 72 (8.0%)
LiveIntent                69 (7.6%)
Akamai                    69 (7.6%)
Acxiom                    68 (7.5%)
AppNexus                  61 (6.8%)

Table 9. The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain          # of Senders
google-analytics.com      200 (22.2%)
doubleclick.net           196 (21.7%)
google.com                159 (17.6%)
facebook.com              154 (17.1%)
facebook.net              145 (16.1%)
fonts.googleapis.com      102 (11.3%)
googleadservices.com      96 (10.6%)
twitter.com               94 (10.4%)
googletagmanager.com      87 (9.6%)
gstatic.com               78 (8.6%)

Table 10. The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient’s privacy, namely the recipient’s mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail’s client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail’s IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

Defense              Email server    Email client    Web browser
Content proxying     X
HTML filtering       X               X
Cookie blocking                      X               X
Referrer blocking    X               X               X
Request blocking                     X               X

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn’t prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user’s IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user’s email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably, the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient’s mail server (Mail Transfer Agent), and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful, since it makes it harder for third parties to compile a profile of the recipient’s email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user’s mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter-list-based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients (footnote 7). Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

Footnote 7: https://emailtracking.openwpm.com


Mail Client        Platform   Proxies Content   Blocks Images   Blocks Referrers    Blocks Cookies          Ext. Support
Gmail              Web        Yes               No              L: Yes, I: Yes†     Yes†                    Yes
Yahoo Mail         Web        No                Yes             L: Yes, I: No       No                      Yes
Outlook Web App    Web        No                Yes             No                  No                      Yes
Outlook.com        Web        No                No              No                  No                      Yes
Yandex Mail        Web        Yes               No              L: Yes, I: Yes†     Yes†                    Yes
GMX                Web        No                No              No                  No                      Yes
Zimbra             Web        No                Yes             No                  No                      Yes
163.com            Web        No                No              No                  No                      Yes
Sina               Web        No                No              No                  No                      Yes
Apple Mail         iOS        No                No              Yes                 Yes                     No
Gmail              iOS        Yes               No              Yes                 Yes                     No
Gmail              Android    Yes               No              Yes                 Yes                     No
Apple Mail         Desktop    No                No              Yes                 Yes                     No
Windows Mail       Desktop    No                No              Yes                 No                      No
Outlook 2016       Desktop    No                Yes             Yes                 No                      No
Thunderbird        Desktop    No                Yes             Yes                 Optional (Default: No)  Yes

Table 12. A survey of the privacy impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests “I” and with link clicks “L”), blocks external resources from setting cookies, and whether or not the client supports request blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird). Images are only blocked for messages considered spam. † Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
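The filtering step can be illustrated with the standard library alone. The sketch below is not the prototype described above and does not use the BlockListParser API: BLOCK_PATTERNS is a hypothetical two-rule stand-in for expressions compiled from EasyList and EasyPrivacy, and the class and function names are invented for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-ins for rules compiled from EasyList/EasyPrivacy.
BLOCK_PATTERNS = [
    re.compile(r"^https?://([^/]+\.)?tracker\.example/"),
    re.compile(r"/imp\?"),
]


def is_blocked(url):
    return any(pattern.search(url) for pattern in BLOCK_PATTERNS)


class TrackerStripper(HTMLParser):
    """Re-emit HTML, dropping <img> and <link> tags whose URL is blocked."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = None
        if tag == "img":
            url = attrs.get("src")
        elif tag == "link":
            url = attrs.get("href")
        if url and is_blocked(url):
            return  # strip the embedded tracking resource
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: emit (or drop) the original text exactly once.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def filter_html(html):
    """Return `html` with blocked images and stylesheets removed."""
    parser = TrackerStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A deployment would apply filter_html to each text/html part of a message after signature verification, consistent with the DKIM discussion in Section 6.1. As with any static rewrite, only URLs present in the markup are checked.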

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email addresses to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren’t identical: the one difference is that, in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don’t support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs, representing 133 distinct parties, which contain leaks of email addresses but which aren’t blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user’s email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent’s API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                      # of Senders
li.<PS+1>/imp                    51 (5.7%)
partner.<PS+1>                   7 (0.7%)
stripe.<PS+1>/stripeimage        4 (0.4%)
p.<PS+1>/espopen                 4 (0.4%)
api.<PS+1>/layoutssection<N>     4 (0.4%)
<PS+1>/customer-service          3 (0.3%)
mi.<PS+1>/prp                    3 (0.3%)
dmtk.<PS+1>                      3 (0.3%)
links.<PS+1>/eopen               3 (0.3%)
eads.<PS+1>/imp                  3 (0.3%)

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender’s emails. All values are given out of the total of 902 senders studied.
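The pattern-generation steps listed in the caption can be sketched as a small normalization function. This is an illustration under stated assumptions, not the script used to build the table: the function name is hypothetical, the caller is assumed to supply the PS+1 (derived elsewhere, for instance from the Public Suffix List), and a path segment is treated as a file name whenever it ends in a dot followed by word characters.

```python
import re
from urllib.parse import urlsplit

_DIGITS = re.compile(r"\d+")


def url_pattern(url, ps_plus_one):
    """Reduce a request URL to a structural pattern such as 'li.<PS+1>/imp'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path  # query string and fragment are discarded

    # Drop the final path segment when it looks like a file name.
    head, _, last = path.rpartition("/")
    if re.search(r"\.\w+$", last):
        path = head

    # Split off the registrable domain so it can be replaced by a placeholder.
    if host == ps_plus_one or host.endswith("." + ps_plus_one):
        subdomain = host[: len(host) - len(ps_plus_one)]
        placeholder = "<PS+1>"
    else:
        subdomain, placeholder = host, ""

    # Replace integers before inserting the placeholder, so that the "1" in
    # "<PS+1>" is not itself rewritten.
    return _DIGITS.sub("<N>", subdomain) + placeholder + _DIGITS.sub("<N>", path)
```

Grouping leaking URLs by this pattern and counting distinct senders per group gives a ranking of the kind shown in Table 13.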

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available (footnote 8). It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient’s IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

Footnote 8: https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary’s goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
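The lookup described in this paragraph takes only a few lines, which is the point. The sketch below assumes a hypothetical log of (hashed identifier, record) pairs keyed by the unsalted MD5 of the lowercased address; the function names and log format are illustrative only.

```python
import hashlib


def md5_hex(email):
    """Unsalted MD5 of the lowercased address, as a hex string."""
    return hashlib.md5(email.lower().encode("utf-8")).hexdigest()


def lookup_records(logs, known_emails):
    """Return {email: [records]} for log entries keyed by a known email's hash.

    `logs` is an iterable of (hashed_id, record) pairs; the format is assumed.
    """
    targets = {md5_hex(email): email for email in known_emails}
    matches = {}
    for hashed_id, record in logs:
        email = targets.get(hashed_id)
        if email is not None:
            matches.setdefault(email, []).append(record)
    return matches
```

No hash is reversed at any point: the adversary starts from the addresses of interest, so the cost is one hash per address and one dictionary lookup per log entry.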

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user’s mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing, a process in which different trackers exchange and link together their own IDs for the same user, occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
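A first cut of the differential test could compare the query strings of corresponding requests observed for two registrations. The sketch below is a hypothetical illustration of that comparison only; the function name is invented, and pairing up the requests (and separating real identifiers from A/B-test noise) is left out.

```python
from urllib.parse import parse_qsl, urlsplit


def differing_parameters(url_a, url_b):
    """Names of query parameters whose values differ between two requests.

    Both URLs are assumed to be the same tracking request observed for two
    different registered addresses. Requests to different endpoints are not
    comparable and yield an empty set.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.hostname, a.path) != (b.hostname, b.path):
        return set()
    params_a = dict(parse_qsl(a.query, keep_blank_values=True))
    params_b = dict(parse_qsl(b.query, keep_blank_values=True))
    shared = params_a.keys() & params_b.keys()
    return {name for name in shared if params_a[name] != params_b[name]}
```

Parameters flagged this way are only candidates: session tokens, timestamps, and campaign variants also differ between registrations, so repeated observations per address would be needed before treating a parameter as a transformed email address.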

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads! https://adblockplus.org/. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup/. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to/. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source, https://www.campaignmonitor.com/css/ (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.


[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking/. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor MITM: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of TLS-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.
[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet Jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party HTTP requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and controlling PII leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.

[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of PII via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.
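The link-selection step can be sketched as follows. This is an illustrative reading of the procedure, not the crawler's actual code: the Link class, the substring matching, and the grouping of multi-word keywords are assumptions, and only three of the keyword groups from Table 1 are included.

```python
from dataclasses import dataclass
from typing import List, Optional

# Ranked keyword groups, best first (after Table 1). Whether the crawler uses
# plain substring matching is an assumption made for this sketch.
RANKED_TERMS = [
    ("text", ["newsletter", "weekly ad", "subscribe", "inbox", "email", "sale alert"]),
    ("text", ["signup", "sign up", "sign me up", "register", "create", "join"]),
    ("url", ["article", "news", "2017"]),
]
BLACKLIST = ["unsubscribe", "mobile", "phone"]


@dataclass
class Link:
    """Hypothetical stand-in for an <a> element found on the landing page."""
    text: str
    url: str
    visible: bool = True
    external: bool = False


def link_rank(link: Link) -> Optional[int]:
    """Return the rank of a link (0 is best), or None if it should be skipped."""
    if not link.visible or link.external:
        return None
    text, url = link.text.lower(), link.url.lower()
    if any(word in text for word in BLACKLIST):
        return None
    for rank, (location, terms) in enumerate(RANKED_TERMS):
        haystack = text if location == "text" else url
        if any(term in haystack for term in terms):
            return rank
    return None


def order_links(links: List[Link]) -> List[Link]:
    """Matching links in the order the crawler would try them."""
    ranked = [(link_rank(link), link) for link in links]
    matching = [(rank, link) for rank, link in ranked if rank is not None]
    matching.sort(key=lambda pair: pair[0])  # stable: page order breaks ties
    return [link for _, link in matching]
```

The crawler would click the first link returned by order_links and fall back to the next one only if no form is found on the resulting page.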

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parent; thus the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tags), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
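One way to express the three ordered criteria is as a tuple-valued sort key, since tuples compare element by element. The sketch below is a minimal illustration under stated assumptions: Node is a hypothetical, simplified stand-in for a DOM element, and treating an all-auto ancestor chain as z-index 0 is a choice made here rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a DOM element."""
    html: str = ""
    z_index: str = "auto"            # "auto" or an integer string
    parent: Optional["Node"] = None
    num_inputs: int = 0


def effective_z_index(node):
    """Walk up the parents until a non-auto z-index is found (0 at the root)."""
    current = node
    while current is not None:
        if current.z_index != "auto":
            return int(current.z_index)
        current = current.parent
    return 0


def candidate_key(form):
    """Sort key mirroring the criteria in order; larger tuples are preferred."""
    html = form.html.lower()
    looks_modal = any(s in html for s in ("modal", "dialog"))
    looks_login = any(s in html for s in ("login", "log in", "sign in"))
    return (effective_z_index(form), looks_modal, not looks_login, form.num_inputs)


def best_form(candidates):
    return max(candidates, key=candidate_key)
```

A form nested in a high z-index overlay wins regardless of its length; among forms at the same stacking level, login forms lose to other candidates, and the number of input fields only matters once the earlier criteria tie.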

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
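The bottom-up search can be sketched with the standard library alone. To stay self-contained, the example parses well-formed markup with xml.etree.ElementTree, whereas a real crawler would work on the live browser DOM; the attribute checks in is_email_input and is_submit are simplified assumptions.

```python
import xml.etree.ElementTree as ET


def is_email_input(el):
    """Simplified check: an <input> whose attributes mention "email"."""
    if el.tag != "input":
        return False
    hints = " ".join(el.get(a, "") for a in ("type", "name", "id", "placeholder"))
    return "email" in hints.lower()


def is_submit(el):
    """A <button> defaults to type=submit; inputs must declare it."""
    if el.tag == "button":
        return el.get("type", "submit") == "submit"
    return el.tag == "input" and el.get("type") == "submit"


def find_logical_form(root):
    """Return the smallest ancestor of an email input that also holds a submit button."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter("input"):
        if not is_email_input(el):
            continue
        container = parent_of.get(el)
        while container is not None:
            if any(is_submit(d) for d in container.iter()):
                return container
            container = parent_of.get(container)
    return None
```

Walking upward from the email field, rather than downward from the document root, is what makes the returned container the smallest one that still includes a submit button.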

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
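A field classifier along these lines might first trust the HTML5 type attribute and then fall back to contextual hints. The sketch below is illustrative only: the keyword lists in FIELD_KEYWORDS are hypothetical, since the text does not enumerate the ones actually used.

```python
# Attributes the text lists as the most frequent carriers of hints.
HINT_ATTRIBUTES = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword lists, chosen only for illustration.
FIELD_KEYWORDS = {
    "email": ("email", "e-mail"),
    "phone": ("phone", "mobile"),
    "password": ("password", "passwd"),
    "zip": ("zip", "postal"),
}

# HTML5 input types that identify a field on their own.
DECLARED_TYPES = {"email": "email", "tel": "phone", "password": "password"}


def classify_field(attrs, inner_html=""):
    """Guess a field's type from its type attribute, then from contextual hints."""
    declared = attrs.get("type", "text").lower()
    if declared in DECLARED_TYPES:
        return DECLARED_TYPES[declared]
    hints = " ".join(attrs.get(a, "") for a in HINT_ATTRIBUTES) + " " + inner_html
    hints = hints.lower()
    for field_type, keywords in FIELD_KEYWORDS.items():
        if any(keyword in hints for keyword in keywords):
            return field_type
    return "unknown"
```

Here attrs is a plain dictionary of the tag's attributes, and inner_html carries the body of tags such as <button>.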

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows, and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results, since duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts whose content types are not textual (i.e., not text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
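The recursive pruning of non-text subparts can be illustrated with Python's standard email package. Note that this swaps in a different library from the JavaMail API named above, purely to give a compact, runnable sketch; the function name strip_non_text_parts is an assumption.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def strip_non_text_parts(msg):
    """Recursively drop every leaf subpart whose content type is not text/*."""
    if not msg.is_multipart():
        return
    kept = []
    for part in msg.get_payload():
        if part.is_multipart():
            strip_non_text_parts(part)   # descend into nested multipart containers
            kept.append(part)
        elif part.get_content_maintype() == "text":
            kept.append(part)
    msg.set_payload(kept)


# Build a small multipart message with a plain body, an HTML body, and an image.
original = EmailMessage()
original["Subject"] = "Weekly deals"
original.set_content("plain body")
original.add_alternative("<p>html body</p>", subtype="html")
original.add_attachment(b"\x89PNG", maintype="image", subtype="png", filename="logo.png")

received = message_from_bytes(original.as_bytes(), policy=policy.default)
strip_non_text_parts(received)
remaining_types = [part.get_content_type() for part in received.walk()]
```

After pruning, remaining_types contains only the multipart containers and the two text parts; the image attachment is gone while the tree structure of the message is preserved.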

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.

10.4 Top parties redirecting to new third parties on email reload

| Redirecting Party | Organization | Avg. # add'l parties | # S | # E |
|---|---|---|---|---|
| pippio.com | Acxiom | 5.7 | 7 | 32 |
| liadm.com* | LiveIntent | 3.7 | 68 | 1097 |
| rlcdn.com | Acxiom | 1.7 | 11 | 551 |
| imiclk.com | MediaMath | 1.3 | 2 | 4 |
| mathtag.com | MediaMath | 1.1 | 11 | 382 |
| alcmpn.com | ALC† | 0.8 | 6 | 132 |
| emltrk.com | Litmus | 0.7 | 41 | 638 |
| acxiom-online.com | Acxiom | 0.4 | 2 | 33 |
| dyneml.com | PowerInbox | 0.1 | 3 | 13 |
| adnxs.com | AppNexus | 0.1 | 19 | 277 |

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel


| Description | Keywords | Location |
|---|---|---|
| Email list registration | newsletter, weekly ad, subscribe, inbox, email, sale alert | link text |
| Generic registration | signup, sign up, sign me up, register, create, join | link text |
| Generic articles/posts | article, news, 2017 | link URL |
| Selecting language/region | us, =us&, en-us | link URL |
| Blacklist | unsubscribe, mobile, phone | link text |

Table 1. The web crawler chooses links to click based on keywords that appear in the link text or URL. The keywords were generated by iterating on an initial set of terms, optimizing for the success of mailing list sign-ups on the top sites. We created an initial set of search terms and manually observed the crawler interact with the top pages. Each time the crawler missed a mailing list sign-up form or failed to go to a page containing a sign-up form, we inspected the page and updated the set of keywords. This process was repeated until the crawler was successful on the sampled sites.

| Submission classification | % of sampled sites |
|---|---|
| Total successful submissions | 38% |
| → Mailing list subscription | 32% |
| → User account registration | 6% |
| Failed: required a CAPTCHA | 16% |
| Failed: unsupported form fields | 25% |
| Unable to classify via screenshots | 21% |

Table 2. Submission success status of a sample of 252 of the 3,335 form submissions made during the sign-up crawl. The success and failure classification was determined through a manual review of screenshots taken before and after an attempted form submission.

where we took screenshots of the pages before and after the initial and follow-up form submissions. We manually examined a random sample of sites on which a form submission was attempted. We summarize the results in Table 2.

When filling forms, our crawler will interact with user account registration forms, mailing list sign-up forms, and contact forms. The successful submissions were mostly mailing list sign-ups and a small number of user account registrations, which are included as they can be tied to a mailing list. The failed submissions were mostly caused by forms other than mailing lists. In fact, more than 70% of the failures caused by a captcha or unsupported field were not mailing list form submissions. Overall, only 11% of the sampled mailing list interactions resulted in a captcha. Since our primary focus is mailing lists, we leave the evaluation of complex and captcha-protected forms to future work.

Email corpus. The assembled corpus contains a total of 12,618 HTML emails from 902 sites. We received an average of around 14 emails per site and a median of 5. A few sites had very active mailing lists, with 20 sites sending over 100 emails during the test period. We observe that we received no spam, which we confirmed both by manual inspection of a sample of emails and by finding an exact one-to-one correspondence between the 902 senders in our dataset and the unique email addresses that we generated. This ensures that the results represent the behavior of the sites where we registered rather than spammers.

4 Privacy leaks when viewing emails

4.1 Measurement methodology

Simulating a webmail client. To measure web tracking in email bodies, we render the emails using a simulated webmail client in an OpenWPM instance. Many webmail clients remove a subset of HTML tags from the email body to restrict the capabilities of rendered content. In particular, JavaScript is always removed, while iframe tags and CSS [6] have mixed support. We simulate a permissive webmail client: one which disables JavaScript and removes the Referer header from all requests, but applies no other restrictions to the rendered content.

The email content is served on localhost, but is accessed through the domain localtest.me (which resolves to localhost) to avoid any special handling the browser may have for the local network. We configure OpenWPM to run 15 measurement instances in parallel. Each email is loaded twice in its own measurement instance: once with a fresh profile, and then again, keeping the same browser profile, after sleeping for 10 seconds. This is intended to allow remote content on the page to load both with and without browser state present. Indeed, we observe some tracking images which redirect to new domains upon every subsequent reload of the same email.


Classifying third-party content. Many email clients load embedded content directly from remote servers (we further explore the properties of email clients in Section 6.2). Thus, remote content present in multiple emails can track users in the same way third-party content can track users across sites on the web. However, unlike the web, there isn't always a clear distinction of which requests are "third-party" and which are "first-party". For example, all resources loaded by webmail clients are considered third-party by the browser. We consider any request to a domain² which is different from both the domain on which we signed up for the mailing list and the domain of the sender's email address to be a third-party request.
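This definition can be sketched as a comparison of PS+1 domains. The example below is a rough illustration: PUBLIC_SUFFIXES is a tiny hypothetical subset standing in for the full Public Suffix List, and the function names are assumptions.

```python
from urllib.parse import urlsplit

# Tiny illustrative subset; a real implementation would load the full
# Public Suffix List instead of hard-coding suffixes.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}


def ps_plus_one(hostname):
    """Public suffix plus the label immediately preceding it (PS+1)."""
    labels = hostname.lower().rstrip(".").split(".")
    for i in range(len(labels)):            # longest candidate suffix first
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return hostname.lower()                 # unknown suffix: keep the hostname


def is_third_party(request_url, signup_domain, sender_address):
    """True if the request goes to neither the sign-up domain nor the sender's domain."""
    request_domain = ps_plus_one(urlsplit(request_url).hostname or "")
    first_party = {
        ps_plus_one(signup_domain),
        ps_plus_one(sender_address.rsplit("@", 1)[-1]),
    }
    return request_domain not in first_party
```

With this definition, a pixel served from a subdomain of the sender's mail domain counts as first-party, even though a browser rendering the email in a webmail client would treat it as third-party.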

Detecting email leakage. Email addresses leak to remote servers through resource requests. Detecting these leaks is not as simple as searching for email addresses in requests, since the addresses may be hashed or encoded, sometimes iteratively. To detect such leakage, we develop a methodology that, given a set of encodings and hashes, a plaintext email address, and a URL token, is able to determine if the token is a transformation of the email address. Starting with the plaintext email address, we pre-compute a candidate set of tokens by applying all supported encodings and hashes iteratively, stopping once we reach three nested encodings or hashes. We then take the URL token and apply all supported decodings to the value, checking if the result is present in the candidate set. If not, we iteratively apply decodings until we reach a level of three nested decodings.

In a preliminary measurement, we found no examples of a value that was encoded before being hashed. This is unsurprising, as hashed email addresses are used to sync data between parties, and adding a transformation before the hash would prevent that use case. Thus, when analyzing the requests in this dataset, we restrict ourselves to at most three nested hashes for a set of 24 supported hashes, including md5, sha1, and sha256. For encodings, we apply all possible combinations of 10 encodings, including base64, urlencoding, and gzip. The full list of supported hashes and encodings is given in Appendix 10.3.
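A reduced version of this detection procedure is sketched below. It covers only three hashes and three encodings out of the supported sets, and it assumes that nested hashes are computed over the lowercase hex digest of the previous layer; both restrictions are simplifications made for illustration, as are the function names.

```python
import base64
import binascii
import hashlib
from urllib.parse import unquote

HASHES = ("md5", "sha1", "sha256")   # small subset of the supported hashes
MAX_DEPTH = 3


def candidate_tokens(email):
    """Map the plaintext and every chain of up to MAX_DEPTH nested hashes to a label."""
    candidates = {email: "plaintext"}
    frontier = {email: ""}
    for _ in range(MAX_DEPTH):
        next_frontier = {}
        for value, label in frontier.items():
            for name in HASHES:
                digest = hashlib.new(name, value.encode()).hexdigest()
                next_frontier[digest] = name + (" of " + label if label else "")
        candidates.update(next_frontier)
        frontier = next_frontier
    return candidates


def _b64(token):
    return base64.b64decode(token, validate=True).decode()


def _b16(token):
    return base64.b16decode(token, casefold=True).decode()


DECODERS = {"base64": _b64, "base16": _b16, "urlencoding": unquote}


def find_leak(token, email):
    """Describe how `token` encodes `email`, or return None if no match is found."""
    candidates = candidate_tokens(email)
    frontier = {token: []}
    for _ in range(MAX_DEPTH + 1):           # check after 0, 1, 2, and 3 decodings
        for value, layers in frontier.items():
            if value in candidates:
                return " -> ".join(layers + [candidates[value]])
        next_frontier = {}
        for value, layers in frontier.items():
            for name, decode in DECODERS.items():
                try:
                    decoded = decode(value)
                except (binascii.Error, ValueError):
                    continue                  # not valid input for this decoder
                if decoded != value:
                    next_frontier.setdefault(decoded, layers + [name])
        frontier = next_frontier
    return None
```

Pre-computing the hashed candidates once per address keeps the per-token work limited to decoding, which matters when the same address is checked against every token in a large request log.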

Classifying email leakage. Email leaks may not be intentional. If an email address is included in the query string or path of a document URL, it may automatically end up in the Referer header of subsequent requests from that document. Requests which result in a redirect also often add the referrer of the previous request to the query string of the new request. In many instances, this happens irrespective of the presence of an email address in the original request. The situation is made more complex on the web, since third-party JavaScript can dynamically build URLs and trigger requests.

2 A domain is identified by its public suffix plus the component of the hostname immediately preceding its public suffix (PS+1).

The reduced HTML support and lack of JavaScript execution in email clients makes it possible to determine intentionality for most leaks. When an email is rendered, requests can result from three sources: from elements embedded in the original HTML, from within an embedded iframe (if supported by the client), or from a redirected request.

1. If a leak occurs in a Referer header, it is unintentional. For webmail clients, the Referer header (if enabled) will be the client itself. A mail sender can embed an iframe which loads a URL that includes the user's email address, with the explicit intention that the user's email leak to third parties via the Referer header. However, we chose not to include this possibility because email senders have multiple direct options for sharing information with third parties that do not rely on the sparsely supported iframe tag.

2. If a leak occurs in a request to a resource embedded directly in the HTML of the email body (and is not the result of a redirect), it is intentional. We can determine intentionality since any request resulting from an HTML document must have been constructed by the email sender. Note that this does not hold for web documents, since embedded JavaScript can dynamically construct requests during the page visit.

3. If a request results from a redirect, the party responsible for the leak is the party whose request (i.e., the triggering URL) responded with a redirect to the new location (i.e., the target URL). We classify a leak as intentional if the leaked value is hashed between the triggering URL and the target URL, or if there are more encodings or hashes of the leaked value included in the target URL than in the triggering URL. If the target URL includes a full copy of the triggering URL (in any encoding), the leak is unintentional. All other cases are classified as ambiguous, such as the case where a target URL includes only the query string of the triggering URL.
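The three rules can be condensed into a small decision function. The sketch below is one possible reading rather than the authors' implementation: it assumes the forms in which the address appears in each URL (for example "plaintext" or "md5") have already been extracted into sets, and it only recognizes a raw or percent-encoded copy of the triggering URL, not every encoding.

```python
from urllib.parse import quote

HASH_NAMES = {"md5", "sha1", "sha256"}   # illustrative subset of hash labels


def classify_leak(in_referer, via_redirect, triggering_url="", target_url="",
                  triggering_forms=frozenset(), target_forms=frozenset()):
    """Return "intentional", "unintentional", or "ambiguous" for one leak."""
    # Rule 1: leaks through the Referer header are treated as unintentional.
    if in_referer:
        return "unintentional"
    # Rule 2: requests embedded directly in the email HTML are intentional.
    if not via_redirect:
        return "intentional"
    # Rule 3: redirects. A newly hashed value, or more forms of the address in
    # the target than in the triggering URL, indicates intent.
    new_forms = set(target_forms) - set(triggering_forms)
    newly_hashed = any(form in HASH_NAMES for form in new_forms)
    if newly_hashed or len(target_forms) > len(triggering_forms):
        return "intentional"
    # A full copy of the triggering URL in the target suggests automatic forwarding.
    if triggering_url and (triggering_url in target_url
                           or quote(triggering_url, safe="") in target_url):
        return "unintentional"
    return "ambiguous"
```

The final branch covers cases such as a target URL that repeats only the query string of the triggering URL, where the evidence does not point clearly either way.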


Measuring blocked tags. Tracking protection tools which block resource requests offer users protection against the tracking embedded in emails. We evaluate the effectiveness of these tools by checking the requests in our dataset against two major blocklists, EasyList and EasyPrivacy [4]. These lists block advertisement and tracking related requests, and are bundled with several popular blocking extensions, including AdBlock Plus [1] and uBlock Origin [5]. We use the BlockListParser library [3] to determine if a request would have been blocked³ by an extension utilizing these lists. We classify a request as blocked if it matches any of the following three conditions:

1. The request directly matches the filter list.
2. The request is the result of a redirect and any request earlier in the redirect was blocked.
3. The request is loaded in an iframe and the iframe document request (or any resulting redirect) was blocked.

It is possible to do this classification in an offline fashion because of the lack of JavaScript support in email clients. This removes the need to run measurements with one of the aforementioned extensions installed. In environments that support JavaScript, content can be loaded dynamically and as the result of interactions between several scripts. In such an environment it is much more difficult to determine which requests would have been blocked by a single script appearing on the block list.
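The offline classification amounts to propagating a "blocked" verdict along redirect chains and iframe ancestry. In the sketch below, the Request class is a hypothetical record of a logged request, and matches_filter_list is a caller-supplied predicate standing in for the filter-list lookup; neither reflects the BlockListParser API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Hypothetical record of one logged request."""
    url: str
    redirected_from: Optional["Request"] = None   # previous hop in a redirect chain
    iframe_document: Optional["Request"] = None   # request that loaded the enclosing iframe


def is_blocked(request, matches_filter_list):
    """Apply the three conditions: direct match, blocked earlier hop, blocked iframe."""
    if matches_filter_list(request.url):
        return True
    if request.redirected_from is not None and is_blocked(
            request.redirected_from, matches_filter_list):
        return True
    if request.iframe_document is not None and is_blocked(
            request.iframe_document, matches_filter_list):
        return True
    return False
```

Because a blocked request never completes, everything downstream of it (later redirect hops and the contents of a blocked iframe) would never have been fetched, which is why the verdict is inherited rather than checked per URL.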

4.2 Email provides much of the same tracking opportunities as the web

Remote resources embedded in email content can track users across emails. As we show in our survey of email clients (Section 6.2), many email clients allow remote resources to set persistent cookies and send those cookies with resource requests. In total, we find that 10,724 of the measured emails (85%) embed resources from at least one third party, with an average of 5 third parties per email. The distribution of embedded third parties is far from uniform; we find a median of two per email, and a small number of emails embedding as many as 50 third parties (Figure 2).

3 We set the parser options as we would expect them to be set for a request occurring in a webmail client. For example, all requests are considered third-party requests.

| Domain | % of Emails | % of Top 1M |
|---|---|---|
| doubleclick.net | 22.2 | 47.5 |
| mathtag.com | 14.2 | 7.9 |
| dotomi.com | 12.7 | 3.5 |
| adnxs.com | 12.2 | 13.2 |
| tapad.com | 11.0 | 2.6 |
| liadm.com | 11.0 | 0.4 |
| returnpath.net | 11.0 | <0.1 |
| bidswitch.net | 10.5 | 4.9 |
| fonts.googleapis.com | 10.2 | 39.4 |
| list-manage.com | 10.1 | <0.1 |

Table 3. Top third-party domains by percentage of the 12,618 emails in the corpus. For comparison, we show the percentage of the top 1 million websites on which these third parties are present.

Fig. 2. CDF of third parties per email, aggregating data across the initial viewing and re-opening of an email. In addition, 1.4% of emails have between 25 and 53 third parties.

Table 3 shows the top third-party domains present in email. Many of these parties also have a large presence on the web [17], blurring the line between email and web tracking. On webmail clients, requests to these cross-context third parties will use the same cookies, allowing them to track both a user's web browsing and email habits. In total, the emails visited during our crawls embed resources from 879 third parties.

4.3 Leaks of email addresses to third parties are common

In addition to being able to track email habits, 99 third parties (11%) also gain access to a user's email address, whether in plaintext or hashed. In email clients which support cookies, these third parties will receive the email address alongside any cookies they've set on the user's device. Trackers which are also present on the web will thus be able to link this address with the user's browsing history profile.

Around 19% of the 902 senders leaked the user's email address to a third party in at least one email, and in total, 29% of emails contain leaks to third parties. We find that a majority of these leaks, 62% of the 100,963 leaks to third parties, are intentional. These intentional leaks mostly occur through remote content embedded directly by the sender. Furthermore, 1% of leaks are classified as unintentional, with the remainder considered ambiguous. While we do not attempt to determine how these identifiers are being used, plaintext and hashed emails can be used for persistent tracking, cross-device tracking, and syncing information between parties.

| Leak | # of Senders | # of Recipients |
|---|---|---|
| MD5 | 100 (11.1%) | 38 (38.5%) |
| SHA1 | 64 (7.1%) | 19 (19.2%) |
| SHA256 | 69 (7.6%) | 13 (13.1%) |
| Plaintext Domain | 55 (6.1%) | 2 (2.0%) |
| Plaintext Address | 77 (8.5%) | 54 (54.5%) |
| URL Encoded Address | 6 (0.6%) | 8 (8.1%) |
| SHA1 of MD5* | 1 (0.1%) | 1 (1.0%) |
| SHA256 of MD5* | 1 (0.1%) | 1 (1.0%) |
| MD5 of MD5* | 1 (0.1%) | 1 (1.0%) |
| SHA384 | 1 (0.1%) | 1 (1.0%) |

Table 4. Email address leakage to third parties by encoding. Percentages are given out of a total of 902 senders and 99 third-party leak recipients. All hashes are of the full email address. Email "domain" is the part of the address after the "@".
* These appear to be a misuse of LiveIntent's API (Section 4.5).

The leaked addresses are often hashed. Although we can detect email addresses hashed with 24 different functions and up to three nested layers, we only find MD5, SHA1, and SHA256 in frequent use. Table 4 summarizes the number of senders and receivers of each encoding. The relatively low diversity of hashes and encodings suggests that these techniques are not being used to obfuscate the collection of email addresses. In fact, the query parameters which contain hashed emails sometimes identify the hash functions used in the parameter name (e.g., a string like md5=<md5 hash of email> appearing in the HTTP request). The design of APIs like LiveIntent's, which first receives an email address and then syncs with a number of other parties (Section 4.5), suggests that these hashed addresses may be used to share or link data from multiple parties.

Recipient Organization    # of Senders
LiveIntent                68 (7.5%)
Acxiom                    46 (5.1%)
Litmus Software           28 (3.1%)
Conversant Media          26 (2.9%)
Neustar                   24 (2.7%)
apxlv.com                 18 (2.0%)
54.211.147.17             18 (2.0%)
Trancos                   17 (1.9%)
WPP                       17 (1.9%)
54.82.61.160              16 (1.8%)

Table 5. Top organizations receiving email address leaks, by number of the 902 total senders. A domain is used in place of an organization when it isn't clear which organization it belongs to.

Table 5 identifies the top organizations⁴ which receive leaked email addresses. This shows that email address collection from emails is largely consolidated to a few major players, which are mostly distinct from the popular web trackers. In fact, only one of the top 10 organizations, Neustar, is found in the top 20 third-party organizations on the top 1 million websites, as measured by Englehardt and Narayanan [17]. Also surprising is the prevalence of leaks to IP addresses, which account for eight of the top 20 domains receiving email addresses. This may be due to the relatively ephemeral nature of newsletter emails, which removes concerns of IP address churn over time.

4.4 Reopening emails brings in new third parties

Despite the lack of JavaScript support, email views are dynamic. The email content itself is static, but any remote resources embedded in it may return different responses each time the email is viewed, and even redirect to different third parties. To examine the effects of this, we load every email first with a "clean" browser profile and then again without clearing the profile. Surprisingly, the average Jaccard similarity [36] between the sets of third parties loaded during the first and second views of the same email is only 60%.
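For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below computes it for two views of one email; the domain sets are made-up illustrations, not measured data, and treating two empty sets as identical is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty views are identical
    return len(a & b) / len(a | b)


# Hypothetical third-party sets for the first and second view of one email.
first_view = {"liadm.com", "bidswitch.net", "doubleclick.net"}
second_view = {"liadm.com", "doubleclick.net", "adsymptotic.com"}

print(jaccard(first_view, second_view))  # 2 shared / 4 distinct = 0.5
```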

The majority of emails (two-thirds) load fewer third parties when the email is reopened compared to the initial view. However, about 21% of emails load at least one resource when an email is reopened that wasn't present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

⁴ We map domains to organizations using the classification provided by Libert [27], adding several new email-specific organizations. When an organization could not be found, we use the PS+1.

Row   Request URL
0     http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0
1     http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN>
2     http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
3     http://x.bidswitch.net/ul_cb/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
4     http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...]
5     http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...]
6     http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent
7     http://i.liadm.com/s/19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1>
8     http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc
9     http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc=
10    http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1
11    http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=
12    http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent

Table 6. Redirect chain from a LiveIntent email tracking pixel. URL query strings are truncated for clarity (using [...]).

The number of leaks between email loads stays relatively constant, with fewer than 50 emails leaking to new parties on the second load.⁵ However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak if the redirectors share data based on the email or email hash they receive.

4.5 Case study: LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section, we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com and includes the user's plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=. The URL also includes the domain portion of the user's address.

In rows 2-12, the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10, LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.

⁵ We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

4.6 Request blockers help, but don't fix the problem

Privacy-conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content.⁶ We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both plaintext and hashed email addresses still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

⁶ Thunderbird supports most of the popular Firefox extensions, and as such Thunderbird users can also deploy these defenses. See Table 12 for more details.

Encoding               # of Senders    # of Recipients
Plaintext Address      34 (3.7%)       34 (66.7%)
MD5                    21 (2.3%)       12 (23.5%)
SHA1                   14 (1.6%)       6 (11.8%)
URL Encoded Address    4 (0.4%)        4 (7.8%)
SHA256                 4 (0.4%)        2 (3.9%)
SHA384                 1 (0.1%)        1 (2.0%)

Table 7. Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain          # of Senders
mediawallahscript.com     7
jetlore.com               4
scrippsnetworks.com       4
alocdn.com                3
richrelevance.com         3
ivitrack.com              2
intentiq.com              2
gatehousemedia.com        2
realtime.email            2
ziffimages.com            2

Table 8. The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support JavaScript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit each link individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href attribute of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. Binning strips fragment and query string identifiers that may not influence the landing page, which helps ensure that we have as diverse a set of landing pages as possible.
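The sampling strategy above can be sketched as a round-robin draw over bins. This is an illustrative reading of the description, not the authors' code: the PS+1 is approximated by the last two hostname labels (a real implementation would consult the Public Suffix List), and the function names, seed, and example links are hypothetical.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def ps_plus_one(hostname):
    """Crude stand-in for PS+1: the last two labels of the hostname."""
    return ".".join(hostname.lower().split(".")[-2:])


def sample_links(links, limit=200, seed=0):
    """Sample up to `limit` unique links, one per (PS+1, path) bin per round."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for link in set(links):  # unique links only
        parts = urlsplit(link)
        bins[(ps_plus_one(parts.hostname or ""), parts.path)].append(link)
    for bucket in bins.values():
        bucket.sort()        # fixed starting order so the seed is reproducible
        rng.shuffle(bucket)
    sampled = []
    # Visit every bin once per round; popping gives sampling without replacement.
    while len(sampled) < limit and any(bins.values()):
        for key in sorted(bins):
            if bins[key] and len(sampled) < limit:
                sampled.append(bins[key].pop())
    return sampled


if __name__ == "__main__":
    links = [
        "http://a.shop.example/sale?uid=1",
        "http://b.shop.example/sale?uid=2",
        "http://shop.example/about",
    ]
    print(sample_links(links, limit=2))
```

Because the first round takes one link from every bin, links that differ only in their query string or fragment are drawn only after every distinct landing page has been covered once.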

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both JavaScript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client's URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if the address is contained in either the URL or the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph except the first are slight underestimates due to our limit of 200 links per sender.

Recipient Organization    # of Senders
Google                    247 (27.4%)
Facebook                  160 (17.7%)
Twitter                   94 (10.4%)
Adobe                     81 (9.0%)
Microsoft                 73 (8.1%)
Pinterest                 72 (8.0%)
LiveIntent                69 (7.6%)
Akamai                    69 (7.6%)
Acxiom                    68 (7.5%)
AppNexus                  61 (6.8%)

Table 9. The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain          # of Senders
google-analytics.com      200 (22.2%)
doubleclick.net           196 (21.7%)
google.com                159 (17.6%)
facebook.com              154 (17.1%)
facebook.net              145 (16.1%)
fonts.googleapis.com      102 (11.3%)
googleadservices.com      96 (10.6%)
twitter.com               94 (10.4%)
googletagmanager.com      87 (9.6%)
gstatic.com               78 (8.6%)

Table 10. The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient's privacy, namely the recipient's mail server, mail user agent, and web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo Mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo Mail's client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail's IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

Defense              Email server    Email client    Web browser
Content proxying     X
HTML filtering       X               X
Cookie blocking                      X               X
Referrer blocking    X               X               X
Request blocking                     X               X

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn't prevent the recipient's email address from being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user's IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user's email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably, the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent) and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful, since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the Referer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter-list-based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients.⁷ Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, the timestamp, and the headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore, we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

⁷ https://emailtracking.openwpm.com


Mail Client        Platform   Proxies Content   Blocks Images   Blocks Referrers    Blocks Cookies           Ext. Support
Gmail              Web        Yes               No              L: Yes, I: Yes†     Yes†                     Yes
Yahoo Mail         Web        No                Yes             L: Yes, I: No       No                       Yes
Outlook Web App    Web        No                Yes             No                  No                       Yes
Outlook.com        Web        No                No              No                  No                       Yes
Yandex Mail        Web        Yes               No              L: Yes, I: Yes†     Yes†                     Yes
GMX                Web        No                No              No                  No                       Yes
Zimbra             Web        No                Yes             No                  No                       Yes
163.com            Web        No                No              No                  No                       Yes
Sina               Web        No                No              No                  No                       Yes
Apple Mail         iOS        No                No              Yes                 Yes                      No
Gmail              iOS        Yes               No              Yes                 Yes                      No
Gmail              Android    Yes               No              Yes                 Yes                      No
Apple Mail         Desktop    No                No              Yes                 Yes                      No
Windows Mail       Desktop    No                No              Yes                 No                       No
Outlook 2016       Desktop    No                Yes             Yes                 No                       No
Thunderbird        Desktop    No                Yes             Yes                 Optional (Default: No)   Yes

Table 12. A survey of the privacy-impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request-blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
* Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python, using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
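The filtering step described above can be illustrated with a short sketch. It is not the prototype itself: instead of BlockListParser with EasyList and EasyPrivacy, it uses two hypothetical regular expressions (BLOCK_PATTERNS) as stand-ins for compiled filter rules, relies only on the Python standard library, and inspects only img and link tags. All names and example domains are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-ins for rules compiled from a filter list.
BLOCK_PATTERNS = [
    re.compile(r"^https?://([^/]+\.)?tracker\.example/"),
    re.compile(r"/imp\?"),
]

# Tags that embed remote resources, and the attribute holding the URL.
URL_ATTRS = {"img": "src", "link": "href"}


def is_blocked(url):
    return any(pattern.search(url) for pattern in BLOCK_PATTERNS)


class TagStripper(HTMLParser):
    """Re-emit the HTML unchanged, except for blocked embedded resources."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        url = dict(attrs).get(URL_ATTRS.get(tag, ""))
        if url and is_blocked(url):
            return  # drop the tag entirely
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)  # self-closing tags: same decision

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def filter_html(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    body = ('<p>Hi</p><img src="http://px.tracker.example/o.gif?e=u@x.org">'
            '<img src="http://cdn.sender.example/logo.png">')
    print(filter_html(body))
```

Like any static filter, this sketch only sees URLs present in the HTML body, so a resource that reaches a tracker through a redirect would pass through untouched.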

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak the email address to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs, representing 133 distinct parties, which contain leaks of email addresses but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                        # of Senders
li.<PS+1>/imp                      51 (5.7%)
partner.<PS+1>/                    7 (0.7%)
stripe.<PS+1>/stripe/image         4 (0.4%)
p.<PS+1>/espopen                   4 (0.4%)
api.<PS+1>/layouts/section<N>      4 (0.4%)
<PS+1>/customer-service            3 (0.3%)
mi.<PS+1>/p/rp                     3 (0.3%)
dmtk.<PS+1>/                       3 (0.3%)
links.<PS+1>/e/open                3 (0.3%)
eads.<PS+1>/imp                    3 (0.3%)

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
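The pattern-generation steps listed in the caption of Table 13 can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the PS+1 is approximated by the last two hostname labels (a real implementation would use the Public Suffix List), integers are replaced only in the path, a "file extension" is taken to be a dot followed by one to five alphanumeric characters, and the example URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_pattern(url):
    """Reduce a request URL to a hostname-plus-path pattern."""
    parts = urlsplit(url)  # query string and fragment are discarded
    labels = (parts.hostname or "").split(".")
    # Assumption: PS+1 is the last two labels, e.g. "sender.example".
    host = ".".join(labels[:-2] + ["<PS+1>"])
    segments = parts.path.split("/")
    # Drop the last path segment if it ends with a file extension.
    if re.search(r"\.[A-Za-z0-9]{1,5}$", segments[-1]):
        segments = segments[:-1]
    path = re.sub(r"\d+", "<N>", "/".join(segments))
    return host + path


if __name__ == "__main__":
    print(url_pattern("http://li.sender.example/imp?s=123&e=abc"))
    print(url_pattern("http://api.sender.example/layouts/section3/image.png"))
```

Grouping leaking requests by this pattern and counting distinct senders per group would then give a ranking of the kind shown in the table.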

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available.⁸ It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

⁸ https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
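The lookup described in this paragraph takes only a few lines. The sketch below is illustrative: it assumes unsalted MD5 over the lowercased, whitespace-trimmed address (the normalization is an assumption), and the addresses and logged values are made up.

```python
import hashlib


def md5_hex(address):
    """Unsalted MD5 of a normalized address (normalization is assumed)."""
    return hashlib.md5(address.strip().lower().encode()).hexdigest()


def reidentify(logged_hashes, known_addresses):
    """Map each logged hash back to a known address, where one matches."""
    table = {md5_hex(address): address for address in known_addresses}
    return {h: table[h] for h in logged_hashes if h in table}


if __name__ == "__main__":
    # Hypothetical hashes observed in request logs.
    logs = [md5_hex("alice@example.com"), "0" * 32]
    print(reidentify(logs, ["alice@example.com", "bob@example.com"]))
```

Because every party must derive the same hash from the same address, the table has to be built only once per hash function and can then be reused against any log.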

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing, a process in which different trackers exchange and link together their own IDs for the same user, occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
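One possible shape for such a differential test is sketched below. It is an assumption about how the comparison could be done, not something the authors implemented: two requests to the same host and path, captured for two different registered addresses, are compared parameter by parameter. The function name and URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def differing_params(url_a, url_b):
    """Names of query parameters whose values differ between two requests
    to the same host and path (captured for two different addresses)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.hostname, a.path) != (b.hostname, b.path):
        return set()  # not the same endpoint, so not comparable
    params_a = dict(parse_qsl(a.query))
    params_b = dict(parse_qsl(b.query))
    return {name for name in params_a.keys() & params_b.keys()
            if params_a[name] != params_b[name]}


if __name__ == "__main__":
    # Hypothetical requests observed for two different registered addresses.
    print(differing_params("http://t.example/imp?c=42&u=9f86d0",
                           "http://t.example/imp?c=42&u=3a7bd3"))
```

A parameter flagged this way is only a candidate: campaign variants from A/B testing, timestamps, and per-message IDs also differ between addresses, so repeated observations per address would be needed to separate them out.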

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads. https://adblockplus.org. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source, https://www.campaignmonitor.com/css (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.
[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. Fpdetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with html5 and etag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J Alex Halderman. Neither snow nor rain nor mitm: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of tls-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.
[24] Balachander Krishnamurthy and Craig E Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R Mayer and John C Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.
[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? quantifying the leakage of pii via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.
[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form’s input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site’s mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parent; thus the crawler recursively checks a form’s parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings “modal” or “dialog” within the form’s HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings “login”, “log in”, and “sign in” within a form’s HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.
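The three criteria above can be read as a lexicographic sort key. The sketch below is illustrative only: the Node and FormCandidate classes are hypothetical stand-ins for DOM elements, a form with no non-auto ancestor is assumed to sit at stacking level 0, and the hint strings follow the text (with “log in” assumed as the second login spelling).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    z_index: str = "auto"            # raw CSS value: "auto" or an integer string
    parent: Optional["Node"] = None


@dataclass
class FormCandidate:
    node: Node
    html: str
    n_inputs: int


LOGIN_HINTS = ("login", "log in", "sign in")
MODAL_HINTS = ("modal", "dialog")


def effective_z_index(node: Optional[Node]) -> int:
    """Walk up the ancestors until a non-auto z-index is found (0 at the root)."""
    while node is not None:
        if node.z_index != "auto":
            return int(node.z_index)
        node = node.parent
    return 0


def rank_key(form: FormCandidate) -> tuple:
    text = form.html.lower()
    return (
        -effective_z_index(form.node),               # 1. topmost form first
        not any(h in text for h in MODAL_HINTS),     #    tie-break: modal/dialog names
        any(h in text for h in LOGIN_HINTS),         # 2. login forms last
        -form.n_inputs,                              # 3. more input fields first
    )


def best_form(candidates: list) -> FormCandidate:
    return min(candidates, key=rank_key)


overlay = Node(z_index="1000")
popup = FormCandidate(Node(parent=overlay), '<form class="newsletter-modal">', 1)
footer = FormCandidate(Node(), '<form id="footer-signup">', 3)
login = FormCandidate(Node(), '<form action="/login">', 2)
chosen = best_form([footer, login, popup])
```

Encoding the criteria as a tuple keeps the ordering explicit: a later criterion is only consulted when all earlier ones tie.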

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
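A minimal sketch of the bottom-up search follows. The Element class is a hypothetical toy DOM, and the email and submit-button heuristics are deliberately simplified assumptions; the crawler's actual matching rules are richer than this.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    parent: Optional["Element"] = field(default=None, repr=False)

    def add(self, child: "Element") -> "Element":
        child.parent = self
        self.children.append(child)
        return child

    def descendants(self) -> Iterator["Element"]:
        for child in self.children:
            yield child
            yield from child.descendants()


def is_email_input(el: Element) -> bool:
    # Simplified assumption: look at the type and name attributes only.
    return el.tag == "input" and (
        el.attrs.get("type") == "email" or "email" in el.attrs.get("name", "").lower()
    )


def is_submit(el: Element) -> bool:
    return el.tag == "button" or (el.tag == "input" and el.attrs.get("type") == "submit")


def find_logical_form(root: Element) -> Optional[Element]:
    """Return the closest ancestor of an email field that also holds a submit button."""
    for el in root.descendants():
        if not is_email_input(el):
            continue
        container = el.parent
        while container is not None:
            if any(is_submit(d) for d in container.descendants()):
                return container
            container = container.parent
    return None


page = Element("div")
nav = page.add(Element("div", {"id": "nav"}))
nav.add(Element("input", {"type": "text", "name": "q"}))
signup = page.add(Element("div", {"id": "signup"}))
row = signup.add(Element("span"))
row.add(Element("input", {"type": "text", "name": "email"}))
signup.add(Element("button"))
found = find_logical_form(page)
```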

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
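As a rough illustration of this attribute-based inference, the sketch below checks the type attribute first and then falls back to keyword hints in the attributes listed above. The keyword table and the function name are hypothetical; they are not the crawler's actual term lists.

```python
# Attributes in which contextual hints were most frequently found (from the text).
HINT_ATTRS = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword table, for illustration only.
KEYWORDS = {
    "email": ("email", "e-mail"),
    "tel": ("phone", "mobile"),
}


def guess_field_type(attrs: dict) -> str:
    """Guess what an <input> asks for: 'email', 'tel', or 'unknown'."""
    explicit = attrs.get("type", "text").lower()
    if explicit in KEYWORDS:
        return explicit                      # HTML5-specific type, no guessing needed
    haystack = " ".join(attrs.get(a, "") for a in HINT_ATTRS).lower()
    for field_type, words in KEYWORDS.items():
        if any(word in haystack for word in words):
            return field_type
    return "unknown"


by_type = guess_field_type({"type": "tel"})
by_hint = guess_field_type({"type": "text", "placeholder": "Your e-mail"})
no_hint = guess_field_type({"type": "text", "name": "q"})
```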

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure, first on any pop-up windows and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types that are not text/*, such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
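The recursive pruning step can be sketched as follows. Our server is built on SubEtha SMTP and JavaMail; the example below instead uses Python's standard email package purely as an illustration, and strip_non_text is a hypothetical helper name.

```python
from email.message import EmailMessage, Message


def strip_non_text(msg: Message) -> None:
    """Recursively drop leaf subparts whose content type is not text/*."""
    if not msg.is_multipart():
        return
    kept = []
    for part in msg.get_payload():
        if part.is_multipart():
            strip_non_text(part)             # descend into nested multiparts
            kept.append(part)
        elif part.get_content_maintype() == "text":
            kept.append(part)
    msg.set_payload(kept)


# Build a small multipart/mixed message: text/plain + text/html + an image.
demo = EmailMessage()
demo["Subject"] = "Weekly newsletter"
demo.set_content("plain body")
demo.add_alternative("<p>html body</p>", subtype="html")
demo.add_attachment(b"\x89PNG", maintype="image", subtype="png", filename="logo.png")

strip_non_text(demo)
remaining_types = [part.get_content_type() for part in demo.walk()]
```

Multipart containers are kept even when they end up empty, so the tree structure of the stored message still mirrors the original.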

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party    Organization    Avg. # add’l parties    # S    # E
pippio.com           Acxiom          5.7                     7      32
liadm.com*           LiveIntent      3.7                     68     1097
rlcdn.com            Acxiom          1.7                     11     551
imiclk.com           MediaMath       1.3                     2      4
mathtag.com          MediaMath       1.1                     11     382
alcmpn.com           ALC†            0.8                     6      132
emltrk.com           Litmus          0.7                     41     638
acxiom-online.com    Acxiom          0.4                     2      33
dyneml.com           PowerInbox      0.1                     3      13
adnxs.com            AppNexus        0.1                     19     277

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don’t fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload


Classifying third-party content. Many email clients load embedded content directly from remote servers (we further explore the properties of email clients in Section 6.2). Thus, remote content present in multiple emails can track users in the same way third-party content can track users across sites on the web. However, unlike the web, there isn’t always a clear distinction of which requests are “third-party” and which are “first-party”. For example, all resources loaded by webmail clients are considered third-party by the browser. We consider any request to a domain² which is different than both the domain on which we signed up for the mailing list and the domain of the sender’s email address to be a third-party request.

Detecting email leakage. Email addresses leak to remote servers through resource requests. Detecting these leaks is not as simple as searching for email addresses in requests, since the addresses may be hashed or encoded, sometimes iteratively. To detect such leakage, we develop a methodology that, given a set of encodings and hashes, a plaintext email address, and a URL token, is able to determine if the token is a transformation of the email address. Starting with the plaintext email address, we pre-compute a candidate set of tokens by applying all supported encodings and hashes iteratively, stopping once we reach three nested encodings or hashes. We then take the URL token and apply all supported decodings to the value, checking if the result is present in the candidate set. If not, we iteratively apply decodings until we reach a level of three nested decodings.

In a preliminary measurement we found no examples of a value that was encoded before being hashed. This is unsurprising, as hashed email addresses are used to sync data between parties, and adding a transformation before the hash would prevent that use case. Thus, when analyzing the requests in this dataset, we restrict ourselves to at most three nested hashes for a set of 24 supported hashes, including md5, sha1, and sha256. For encodings, we apply all possible combinations of 10 encodings, including base64, urlencoding, and gzip. The full list of supported hashes and encodings is given in Appendix 10.3.
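The two-stage search can be sketched in a few lines. This is a simplified illustration, not our measurement code: it covers only three hashes and three encodings from Appendix 10.3, does not apply the hash-only restriction described above, and the helper names candidate_tokens and find_leak are hypothetical.

```python
import base64
import hashlib
from urllib.parse import quote, unquote

HASHES = {
    name: (lambda s, n=name: hashlib.new(n, s.encode()).hexdigest())
    for name in ("md5", "sha1", "sha256")
}
ENCODERS = {
    "base64": lambda s: base64.b64encode(s.encode()).decode(),
    "base16": lambda s: s.encode().hex(),
    "urlencode": lambda s: quote(s, safe=""),
}
DECODERS = {
    "base64": lambda s: base64.b64decode(s, validate=True).decode(),
    "base16": lambda s: bytes.fromhex(s).decode(),
    "urlencode": unquote,
}


def candidate_tokens(email: str, depth: int = 3) -> dict:
    """Map every transformed value to the chain of transforms that produced it."""
    transforms = {**HASHES, **ENCODERS}
    seen = {email: ()}
    frontier = {email: ()}
    for _ in range(depth):
        next_frontier = {}
        for value, chain in frontier.items():
            for name, fn in transforms.items():
                out = fn(value)
                if out not in seen:
                    seen[out] = next_frontier[out] = chain + (name,)
        frontier = next_frontier
    return seen


def find_leak(token: str, candidates: dict, depth: int = 3):
    """Return the matching transform chain, or None if the token is not a leak."""
    frontier = {token}
    for _ in range(depth + 1):               # check after 0, 1, ..., depth decodings
        for value in frontier:
            if value in candidates:
                return candidates[value]
        decoded = set()
        for value in frontier:
            for decode in DECODERS.values():
                try:
                    decoded.add(decode(value))
                except ValueError:           # covers binascii.Error and UnicodeDecodeError
                    pass
        frontier = decoded
    return None


email = "user@example.com"
candidates = candidate_tokens(email)
md5_token = hashlib.md5(email.encode()).hexdigest()
nested = email
for _ in range(4):                           # one layer deeper than the candidate set
    nested = base64.b64encode(nested.encode()).decode()
```

Pre-computing the candidate set once per address keeps the per-token work to a handful of decodings and set lookups, which matters when every URL token in the dataset has to be checked.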

Classifying email leakage. Email leaks may not be intentional. If an email address is included in the query string or path of a document URL, it may automatically end up in the Referer header of subsequent requests from that document. Requests which result in a redirect also often add the referrer of the previous request to the query string of the new request. In many instances this happens irrespective of the presence of an email address in the original request. The situation is made more complex on the web, since third-party Javascript can dynamically build URLs and trigger requests.

2 A domain is identified by its public suffix plus the component of the hostname immediately preceding its public suffix (PS+1).

The reduced HTML support and lack of Javascript execution in email clients makes it possible to determine intentionality for most leaks. When an email is rendered, requests can result from three sources: from elements embedded in the original HTML, from within an embedded iframe (if supported by the client), or from a redirected request.

1. If a leak occurs in a Referer header, it is unintentional. For webmail clients the Referer header (if enabled) will be the client itself. A mail sender can embed an iframe which loads a URL that includes the user’s email address, with the explicit intention that the user’s email leak to third parties via the Referer header. However, we chose not to include this possibility because email senders have multiple direct options for sharing information with third parties that do not rely on the sparsely supported iframe tag.

2. If a leak occurs in a request to a resource embedded directly in the HTML of the email body (and is not the result of a redirect), it is intentional. We can determine intentionality since any request resulting from an HTML document must have been constructed by the email sender. Note that this does not hold for web documents, since embedded Javascript can dynamically construct requests during the page visit.

3. If a request results from a redirect, the party responsible for the leak is the party whose request (i.e., the triggering URL) responded with a redirect to the new location (i.e., the target URL). We classify a leak as intentional if the leaked value is hashed between the triggering URL and the target URL, or if there are more encodings or hashes of the leaked value included in the target URL than in the triggering URL. If the target URL includes a full copy of the triggering URL (in any encoding), the leak is unintentional. All other cases are classified as ambiguous, such as the case where a target URL includes only the query string of the triggering URL.


Measuring blocked tags. Tracking protection tools which block resource requests offer users protection against the tracking embedded in emails. We evaluate the effectiveness of these tools by checking the requests in our dataset against two major blocklists, EasyList and EasyPrivacy [4]. These lists block advertisement and tracking related requests, and are bundled with several popular blocking extensions, including AdBlock Plus [1] and uBlock Origin [5]. We use the BlockListParser library [3] to determine if a request would have been blocked³ by an extension utilizing these lists. We classify a request as blocked if it matches any of the following three conditions:

1. The request directly matches the filter list.
2. The request is the result of a redirect, and any request earlier in the redirect was blocked.
3. The request is loaded in an iframe, and the iframe document request (or any resulting redirect) was blocked.

It is possible to do this classification in an offline fashion because of the lack of Javascript support in email clients. This removes the need to run measurements with one of the aforementioned extensions installed. In environments that support Javascript, content can be loaded dynamically and as the result of interactions between several scripts. In such an environment it is much more difficult to determine which requests would have been blocked by a single script appearing on the block list.
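The three conditions translate into a short recursive check over the recorded request graph. In the sketch below the Request class and its field names are hypothetical, and matches_filter is a stand-in predicate for a real blocklist matcher such as BlockListParser, whose API is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    url: str
    redirected_from: Optional["Request"] = None   # previous hop in a redirect chain
    iframe_document: Optional["Request"] = None   # final request that loaded the enclosing iframe


def is_blocked(req: Request, matches_filter: Callable[[str], bool]) -> bool:
    # Condition 1: the request itself matches the filter list.
    if matches_filter(req.url):
        return True
    # Condition 2: an earlier request in the redirect chain was blocked.
    if req.redirected_from is not None and is_blocked(req.redirected_from, matches_filter):
        return True
    # Condition 3: the iframe document request (or a redirect leading to it) was blocked.
    if req.iframe_document is not None and is_blocked(req.iframe_document, matches_filter):
        return True
    return False


def toy_filter(url: str) -> bool:
    """Toy stand-in for a blocklist: block anything served from tracker.example."""
    return "tracker.example" in url


pixel = Request("http://tracker.example/pixel.gif")
sync = Request("http://partner.example/sync", redirected_from=pixel)
frame_doc = Request("http://cdn.example/frame.html", redirected_from=sync)
in_frame = Request("http://images.example/banner.png", iframe_document=frame_doc)
standalone = Request("http://images.example/logo.png")
```

Because a blocked request never fires, everything downstream of it is treated as blocked too; this is what lets the classification run offline on the recorded redirect chains.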

4.2 Email provides much of same tracking opportunities as the web

Remote resources embedded in email content can track users across emails. As we show in our survey of email clients (Section 6.2), many email clients allow remote resources to set persistent cookies, and send those cookies with resource requests. In total, we find that 10,724 of the measured emails (85%) embed resources from at least one third party, with an average of 5 third parties per email. The distribution of embedded third parties is far from uniform; we find a median of two per email, and a small number of emails embedding as many as 50 third parties (Figure 2).

3 We set the parser options as we would expect them to be set for a request occurring in a webmail client. For example, all requests are considered third-party requests.

Domain                  % of Emails    % of Top 1M
doubleclick.net         22.2%          47.5%
mathtag.com             14.2%          7.9%
dotomi.com              12.7%          3.5%
adnxs.com               12.2%          13.2%
tapad.com               11.0%          2.6%
liadm.com               11.0%          0.4%
returnpath.net          11.0%          <0.1%
bidswitch.net           10.5%          4.9%
fonts.googleapis.com    10.2%          39.4%
list-manage.com         10.1%          <0.1%

Table 3. Top third-party domains by percentage of the 12,618 emails in the corpus. For comparison, we show the percentage of the top 1 million websites on which these third parties are present.

Fig. 2. CDF of third parties per email, aggregating data across the initial viewing and re-opening of an email. In addition, 1.4% of emails have between 25 and 53 third parties.

Table 3 shows the top third-party domains present in email. Many of these parties also have a large presence on the web [17], blurring the line between email and web tracking. On webmail clients, requests to these cross-context third parties will use the same cookies, allowing them to track both a user’s web browsing and email habits. In total, the emails visited during our crawls embed resources from 879 third parties.

4.3 Leaks of email addresses to third parties are common

In addition to being able to track email habits, 99 third parties (11%) also gain access to a user’s email address, whether in plaintext or hashed. In email clients which support cookies, these third parties will receive the email address alongside any cookies they’ve set on the user’s device. Trackers which are also present on the web will thus be able to link this address with the user’s browsing history profile.

Around 19% of the 902 senders leaked the user’s email address to a third party in at least one email, and in total, 29% of emails contain leaks to third parties. We find that a majority of these leaks, 62% of the 100,963 leaks to third parties, are intentional. These intentional leaks mostly occur through remote content embedded directly by the sender. Furthermore, 1% of leaks are classified as unintentional, with the remainder considered ambiguous. While we do not attempt to determine how these identifiers are being used, plaintext and hashed emails can be used for persistent tracking, cross-device tracking, and syncing information between parties.

Leak                    # of Senders    # of Recipients
MD5                     100 (11.1%)     38 (38.5%)
SHA1                    64 (7.1%)       19 (19.2%)
SHA256                  69 (7.6%)       13 (13.1%)
Plaintext Domain        55 (6.1%)       2 (2.0%)
Plaintext Address       77 (8.5%)       54 (54.5%)
URL Encoded Address     6 (0.6%)        8 (8.1%)
SHA1 of MD5             1 (0.1%)        1 (1.0%)
SHA256 of MD5           1 (0.1%)        1 (1.0%)
MD5 of MD5              1 (0.1%)        1 (1.0%)
SHA384                  1 (0.1%)        1 (1.0%)

Table 4. Email address leakage to third parties by encoding. Percentages are given out of a total of 902 senders and 99 third-party leak recipients. All hashes are of the full email address. Email “domain” is the part of the address after the “@”. These appear to be a misuse of LiveIntent’s API (Section 4.5).

The leaked addresses are often hashed. Although we can detect email addresses hashed with 24 different functions and up to three nested layers, we only find MD5, SHA1, and SHA256 in frequent use. Table 4 summarizes the number of senders and receivers of each encoding. The relatively low diversity of hashes and encodings suggests that these techniques are not being used to obfuscate the collection of email addresses. In fact, the query parameters which contain hashed emails sometimes identify the hash functions used in the parameter name (e.g., a string like md5=<md5 hash of email> appearing in the HTTP request). The design of APIs like LiveIntent’s, which first receives an email address and then syncs with a number of other parties (Section 4.5), suggests that these hashed addresses may be used to share or link data from multiple parties.

Recipient Organization    # of Senders
LiveIntent                68 (7.5%)
Acxiom                    46 (5.1%)
Litmus Software           28 (3.1%)
Conversant Media          26 (2.9%)
Neustar                   24 (2.7%)
apxlv.com                 18 (2.0%)
54.211.147.17             18 (2.0%)
Trancos                   17 (1.9%)
WPP                       17 (1.9%)
54.82.61.160              16 (1.8%)

Table 5. Top organizations receiving email address leaks by number of the 902 total senders. A domain is used in place of an organization when it isn’t clear which organization it belongs to.

Table 5 identifies the top organizations⁴ which receive leaked email addresses. This shows that email address collection from emails is largely consolidated to a few major players, which are mostly distinct from the popular web trackers. In fact, only one of the top 10 organizations, Neustar, is found in the top 20 third-party organizations on the top 1 million websites, as measured by Englehardt and Narayanan [17]. Also surprising is the prevalence of leaks to IP addresses, which account for eight of the top 20 domains receiving email addresses. This may be due to the relatively ephemeral nature of newsletter emails, which removes concerns of IP address churn over time.

4.4 Reopening emails brings in new third parties

Despite the lack of Javascript support, email views are dynamic. The email content itself is static, but any remote resources embedded in it may return different responses each time the email is viewed, and even redirect to different third parties. To examine the effects of this, we load every email first with a “clean” browser profile and then again without clearing the profile. Surprisingly, the average Jaccard similarity [36] between the sets of third parties loaded during the first and second views of the same email is only 60%.
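For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below is a plain set-based version, not the scikit-learn function cited above, and the third-party names are made up for illustration.

```python
def jaccard(first: set, second: set) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not first and not second:
        return 1.0
    return len(first & second) / len(first | second)


# Hypothetical third parties seen on the first and second view of one email.
first_view = {"liadm.com", "bidswitch.net", "doubleclick.net"}
second_view = {"bidswitch.net", "doubleclick.net", "adnxs.com", "mathtag.com"}
similarity = jaccard(first_view, second_view)   # 2 shared out of 5 distinct parties
```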

The majority of emails (two-thirds) load fewer third parties when the email is reopened compared to the initial view. However, about 21% of emails load at least one resource when an email is reopened that wasn’t present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

4 We map domains to organizations using the classification provided by Libert [27], adding several new email-specific organizations. When an organization could not be found, we use the PS+1.

Row  Request URL
0    http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0
1    http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN>
2    http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
3    http://x.bidswitch.net/ul_cb/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
4    http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...]
5    http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...]
6    http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent
7    http://i.liadm.com/s/19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1>
8    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc
9    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc=
10   http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1
11   http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=
12   http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent

Table 6. Redirect chain from a LiveIntent Email Tracking Pixel. URL query strings are truncated for clarity (using [...]).

The number of leaks between email loads stays relatively constant, with less than 50 emails leaking to new parties on the second load⁵. However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak if the redirectors share data based on the email or email hash they receive.

45 Case study LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com and includes the user's plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=. The URL also includes the domain portion of the user's address.

5 We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

In rows 2-12 the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10 LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.
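The hashed identifiers described for row 1 can be illustrated with a minimal sketch. It assumes the address is trimmed and lowercased before hashing, which the text does not state; the function name `hashed_identifiers` is hypothetical, and only the parameter names m=, sh=, and sh2= come from the redirect chain.

```python
import hashlib


def hashed_identifiers(email: str) -> dict:
    """Return MD5, SHA1, and SHA256 hex digests keyed by the parameter names in row 1."""
    # Assumption: normalize the address before hashing (not confirmed by the text).
    data = email.strip().lower().encode("utf-8")
    return {
        "m": hashlib.md5(data).hexdigest(),       # MD5 hash
        "sh": hashlib.sha1(data).hexdigest(),     # SHA1 hash
        "sh2": hashlib.sha256(data).hexdigest(),  # SHA256 hash
    }


params = hashed_identifiers("Alice@Example.com")
```

Because the digests are deterministic, any party that receives one of them can match it against the same hash computed elsewhere from the same address.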

4.6 Request blockers help, but don't fix the problem

Privacy-conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content (see footnote 6). We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both

6 Thunderbird supports most of the popular Firefox extensions, and as such Thunderbird users can also deploy these defenses. See Table 12 for more details.


Encoding               # of Senders    # of Recipients
Plaintext Address      34 (3.7%)       34 (66.7%)
MD5                    21 (2.3%)       12 (23.5%)
SHA1                   14 (1.6%)       6 (11.8%)
URL Encoded Address    4 (0.4%)        4 (7.8%)
SHA256                 4 (0.4%)        2 (3.9%)
SHA384                 1 (0.1%)        1 (2.0%)

Table 7. Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain          # of Senders
mediawallahscript.com     7
jetlore.com               4
scrippsnetworks.com       4
alocdn.com                3
richrelevance.com         3
ivitrack.com              2
intentiq.com              2
gatehousemedia.com        2
realtime.email            2
ziffimages.com            2

Table 8. The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

plaintext and email hashes still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support JavaScript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit each link individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href property of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. This helps ensure that we have as diverse a set of landing pages as possible by stripping fragment and query string identifiers that may not influence the landing page.
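The sampling strategy can be sketched as follows. This is an illustration under stated assumptions, not the crawler's actual code: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, approximates the PS+1 by the last two hostname labels (a real implementation would consult the Public Suffix List), and reads "one link from each bin" as a round-robin over bins. The names `LinkExtractor`, `naive_ps1`, and `sample_links` are hypothetical.

```python
import random
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def naive_ps1(hostname):
    # Assumption: crude PS+1 approximation using the last two labels.
    return ".".join(hostname.split(".")[-2:])


def sample_links(email_bodies, limit=200, seed=0):
    """Bin one sender's links by (PS+1, path), then draw round-robin from the bins."""
    bins = defaultdict(set)
    for body in email_bodies:
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            parts = urlsplit(link)
            if parts.scheme in ("http", "https") and parts.hostname:
                bins[(naive_ps1(parts.hostname), parts.path)].add(link)

    rng = random.Random(seed)
    pools = [sorted(links) for links in bins.values()]
    for pool in pools:
        rng.shuffle(pool)

    sampled = []
    while pools and len(sampled) < limit:
        next_round = []
        for pool in pools:
            if len(sampled) == limit:
                break
            sampled.append(pool.pop())  # sampling without replacement
            if pool:
                next_round.append(pool)
        pools = next_round
    return sampled
```

Drawing one link per bin before revisiting any bin means that links differing only in query string or fragment are sampled last, which is the diversity property the text describes.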

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both JavaScript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client's URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if the email address is contained in either the URL or the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.
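A simplified sketch of this check is shown below. The full procedure belongs to Section 4.1 and is not reproduced here; this version assumes it is enough to search the request URL and Referer header for a handful of encodings of the address (the ones listed in Table 7). The function names are hypothetical.

```python
import hashlib
from urllib.parse import quote


def email_encodings(email):
    """Map an encoding name to the token that would appear in a leaking request."""
    raw = email.encode("utf-8")
    encodings = {
        "plaintext": email,
        "urlencoded": quote(email, safe=""),
    }
    for name in ("md5", "sha1", "sha256", "sha384"):
        encodings[name] = hashlib.new(name, raw).hexdigest()
    return encodings


def find_leaks(email, url, referer=""):
    """Return the names of the encodings found in the URL or the Referer header."""
    haystack = (url + " " + referer).lower()
    return sorted(
        name
        for name, token in email_encodings(email).items()
        if token.lower() in haystack
    )
```

A request whose URL is clean can still count as a leak when the referring page's URL carries the address, which is why both strings are searched.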

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that


Recipient Organization    # of Senders
Google                    247 (27.4%)
Facebook                  160 (17.7%)
Twitter                   94 (10.4%)
Adobe                     81 (9.0%)
Microsoft                 73 (8.1%)
Pinterest                 72 (8.0%)
LiveIntent                69 (7.6%)
Akamai                    69 (7.6%)
Acxiom                    68 (7.5%)
AppNexus                  61 (6.8%)

Table 9. The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain          # of Senders
google-analytics.com      200 (22.2%)
doubleclick.net           196 (21.7%)
google.com                159 (17.6%)
facebook.com              154 (17.1%)
facebook.net              145 (16.1%)
fonts.googleapis.com      102 (11.3%)
googleadservices.com      96 (10.6%)
twitter.com               94 (10.4%)
googletagmanager.com      87 (9.6%)
gstatic.com               78 (8.6%)

Table 10. The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph, except the first, are slight underestimates due to our limit of 200 links per sender.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient's privacy, namely the recipient's mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail's client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail's IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

Defense              Email server    Email client    Web browser
Content proxying     X
HTML filtering       X               X
Cookie blocking                      X               X
Referrer blocking    X               X               X
Request blocking                     X               X

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn't prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user's IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user's email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent), and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful, since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well, by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
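The server-side rewriting mentioned at the end of the paragraph above might look like the following sketch. It is deliberately naive: it uses regular expressions rather than a real HTML parser, leaves links that already carry a rel attribute untouched, and the function name `add_noreferrer` is hypothetical. A production filter would need to handle malformed markup.

```python
import re

# Referrer Policy delivered through a meta element.
META = '<meta name="referrer" content="no-referrer">'


def add_noreferrer(html):
    """Add rel="noreferrer" to <a> tags and insert a referrer-policy meta tag."""

    def patch(match):
        tag = match.group(0)
        if re.search(r"\srel\s*=", tag, flags=re.I):
            return tag  # simplification: existing rel attributes are left as they are
        return tag[:-1] + ' rel="noreferrer">'

    # Match opening <a ...> tags only (not <abbr>, <address>, and so on).
    html = re.sub(r"<a(?=[\s>])[^>]*>", patch, html, flags=re.I)

    if re.search(r"<head\b[^>]*>", html, flags=re.I):
        return re.sub(r"(<head\b[^>]*>)", r"\1" + META, html, count=1, flags=re.I)
    # Assumption: for fragments without a <head>, prepend the meta tag.
    return META + html
```

The rel attribute covers link clicks, while the meta policy is intended to cover requests for embedded resources in clients that honor it.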

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter-list-based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients (see footnote 7). Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.
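As an illustration of the message such a tester sends, the sketch below builds an HTML email containing one tracking image and one link, each with a unique URL. It is not the tester's actual code: the host `tester.example`, the URL layout, and the function name `build_test_email` are hypothetical, and the server that would log the resulting requests is omitted.

```python
import uuid
from email.message import EmailMessage

BASE_URL = "https://tester.example"  # hypothetical logging server


def build_test_email(recipient, client_name):
    """Return (message, record) for one test of one email client."""
    token = uuid.uuid4().hex  # unique per test, ties requests back to this email
    image_url = f"{BASE_URL}/image/{token}.png"
    link_url = f"{BASE_URL}/link/{token}"

    msg = EmailMessage()
    msg["To"] = recipient
    msg["From"] = "tester@tester.example"
    msg["Subject"] = f"Privacy test for {client_name}"
    msg.set_content("This message requires an HTML-capable client.")
    msg.add_alternative(
        f'<html><body><img src="{image_url}">'
        f'<a href="{link_url}">Click here</a></body></html>',
        subtype="html",
    )
    # What the server would store and later join with the logged request headers.
    record = {"token": token, "email": recipient, "client": client_name}
    return msg, record
```

When the image or link URL is later requested, the token in the path identifies which address and client the logged IP address, timestamp, and headers belong to.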

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

7 https://emailtracking.openwpm.com


Mail Client        Platform   Proxies Content   Blocks Images   Blocks Referrers    Blocks Cookies           Ext. Support
Gmail              Web        Yes               No              L: Yes, I: Yes†     Yes†                     Yes
Yahoo Mail         Web        No                Yes             L: Yes, I: No       No                       Yes
Outlook Web App    Web        No                Yes             No                  No                       Yes
Outlook.com        Web        No                No              No                  No                       Yes
Yandex Mail        Web        Yes               No              L: Yes, I: Yes†     Yes†                     Yes
GMX                Web        No                No              No                  No                       Yes
Zimbra             Web        No                Yes             No                  No                       Yes
163.com            Web        No                No              No                  No                       Yes
Sina               Web        No                No              No                  No                       Yes
Apple Mail         iOS        No                No              Yes                 Yes                      No
Gmail              iOS        Yes               No              Yes                 Yes                      No
Gmail              Android    Yes               No              Yes                 Yes                      No
Apple Mail         Desktop    No                No              Yes                 Yes                      No
Windows Mail       Desktop    No                No              Yes                 No                       No
Outlook 2016       Desktop    No                Yes             Yes                 No                       No
Thunderbird        Desktop    No                Yes             Yes                 Optional (Default: No)   Yes

Table 12. A survey of the privacy-impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
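The filtering step can be sketched in a few lines. This is not the prototype itself: it does not use BlockListParser or real EasyList syntax, and instead assumes the filter rules are already available as plain regular expressions. It uses the standard library's `html.parser` to re-emit the document while dropping matching `<img>` and `<link>` tags; the names `TrackerStripper` and `strip_trackers` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TrackerStripper(HTMLParser):
    """Re-emit HTML, omitting <img>/<link> tags whose URL matches a blocking rule."""

    URL_ATTR = {"img": "src", "link": "href"}

    def __init__(self, patterns):
        super().__init__(convert_charrefs=False)
        self.patterns = [re.compile(p) for p in patterns]
        self.out = []

    def _blocked(self, tag, attrs):
        attr = self.URL_ATTR.get(tag)
        url = dict(attrs).get(attr) if attr else None
        return bool(url) and any(p.search(url) for p in self.patterns)

    def handle_starttag(self, tag, attrs):
        if not self._blocked(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)  # self-closing tag: emit once, no end tag

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_trackers(html, patterns):
    """Return the HTML with blocked embedded resources removed."""
    parser = TrackerStripper(patterns)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the filter only sees URLs written in the email body, a resource that starts at an unlisted URL and redirects to a tracker is not removed by this kind of static pass.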

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email addresses to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that filtering of static files is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                        # of Senders
li.<PS+1>/imp                      51 (5.7%)
partner.<PS+1>/                    7 (0.7%)
stripe.<PS+1>/stripe/image         4 (0.4%)
p.<PS+1>/espopen                   4 (0.4%)
api.<PS+1>/layouts/section/<N>     4 (0.4%)
<PS+1>/customer-service            3 (0.3%)
mi.<PS+1>/p/rp                     3 (0.3%)
dmtk.<PS+1>/                       3 (0.3%)
links.<PS+1>/e/open                3 (0.3%)
eads.<PS+1>/imp                    3 (0.3%)

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
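The pattern generation described in the caption of Table 13 can be sketched as below. Two points are assumptions rather than details from the text: the PS+1 is passed in by the caller (it would normally be derived from the Public Suffix List), and "integers" is read as any run of digits in the path. The function name `url_pattern` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_pattern(url, ps1):
    """Reduce a request URL to a sender-independent pattern."""
    parts = urlsplit(url)  # query string and fragment are discarded below
    host = parts.hostname or ""
    if host == ps1 or host.endswith("." + ps1):
        # Replace the public suffix plus one, keeping any subdomain prefix.
        host = host[: len(host) - len(ps1)] + "<PS+1>"

    segments = parts.path.split("/")
    if re.search(r"\.[A-Za-z0-9]+$", segments[-1]):
        segments[-1] = ""  # drop a trailing file name such as open.gif
    path = re.sub(r"\d+", "<N>", "/".join(segments))
    return host + path
```

Grouping URLs by the returned string makes requests from different senders to the same tracking endpoint fall into one bucket, which is how structurally similar first-party URLs can be spotted.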

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available (see footnote 8). It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

8 https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
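Both points can be made concrete with a toy dictionary attack. The sketch assumes unsalted SHA256 over the plain address and a small, hypothetical candidate list built from local parts and domains; a realistic attack would enumerate far larger lists, typically on GPUs. The function name `reverse_email_hashes` is an illustration only.

```python
import hashlib
from itertools import product


def reverse_email_hashes(target_hashes, local_parts, domains):
    """Recover addresses whose unsalted SHA256 digest appears in target_hashes."""
    targets = set(target_hashes)
    recovered = {}
    for local, domain in product(local_parts, domains):
        candidate = f"{local}@{domain}"
        digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
        if digest in targets:
            recovered[digest] = candidate
    return recovered


# Lookup of a known address: hash it once and search the leaked digests.
leaked = hashlib.sha256(b"jane.doe@example.com").hexdigest()
found = reverse_email_hashes(
    [leaked], ["john.doe", "jane.doe"], ["example.org", "example.com"]
)
```

Since no salt is involved, the same candidate list works against every party that received the hash, and the work is shared across all targets.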

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing occurs when viewing emails, a process in which different trackers exchange and link together their own IDs for the same user. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
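A minimal sketch of the comparison step, assuming the two URLs are corresponding requests observed for two addresses registered on the same site. The function name `differing_params` is hypothetical, and pairing up corresponding requests (and separating A/B-test variation from address-dependent variation) is left out.

```python
from urllib.parse import parse_qs, urlsplit


def differing_params(url_a, url_b):
    """Return query parameters present in both URLs whose values differ."""
    qa = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    qb = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return sorted(k for k in qa.keys() & qb.keys() if qa[k] != qb[k])


suspects = differing_params(
    "http://t.example/p?id=7&u=abc123",
    "http://t.example/p?id=7&u=def456",
)
```

Parameters flagged this way are only candidates: per-message identifiers and timestamps also differ between requests, so each one would still need manual review.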

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads. https://adblockplus.org. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup/. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source. https://www.campaignmonitor.com/css/ (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.


[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking/. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor MITM: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of TLS-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.

[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM Workshop on Online Social Networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet Jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party HTTP requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and Privacy (SP), 2013 IEEE Symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and controlling PII leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI Spring Symposium: Intelligent Information Privacy Management, volume 2010, pages 158–163, 2010.

[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? quantifying the leakage of pii via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.

[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.

[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.

[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.

[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form’s input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site’s mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parent; thus the crawler recursively checks a form’s parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings “modal” or “dialog” within the form’s HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings “login”, “log in”, and “sign in” within a form’s HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
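The three ordered criteria above can be read as a lexicographic sort key. The following is a minimal sketch under stated assumptions, not the crawler's actual code: `FormCandidate`, `rank_key`, and `best_form` are hypothetical names, and the z-index is assumed to have already been resolved by walking up the form's ancestors (0 when every ancestor is `auto`).

```python
from dataclasses import dataclass


@dataclass
class FormCandidate:
    """Hypothetical summary of one candidate form found on a page."""
    html: str        # raw HTML of the form
    z_index: int     # resolved z-index (assumed 0 if all ancestors are "auto")
    num_inputs: int  # number of input fields in the form


LOGIN_HINTS = ("login", "log in", "sign in")
MODAL_HINTS = ("modal", "dialog")


def rank_key(form: FormCandidate):
    """Build a tuple so that a larger tuple means a better candidate."""
    html = form.html.lower()
    is_modal = any(hint in html for hint in MODAL_HINTS)
    is_login = any(hint in html for hint in LOGIN_HINTS)
    # 1. topmost form first (modal/dialog hint breaks z-index ties),
    # 2. login forms ranked lower, 3. more input fields preferred.
    return (form.z_index, is_modal, not is_login, form.num_inputs)


def best_form(candidates):
    """Return the best candidate, or None if there are no candidates."""
    return max(candidates, key=rank_key) if candidates else None
```

Because Python compares tuples element by element, a later criterion only matters when all earlier ones tie, which mirrors the "in order" wording of the list.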

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
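The bottom-up search can be illustrated on a toy DOM tree. This is a sketch under assumptions: the `Node` class stands in for real DOM elements, and the `is_email_input` and `is_submit` heuristics are simplified placeholders rather than the crawler's actual matching rules.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """Toy stand-in for a DOM element (hypothetical, for illustration)."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def is_email_input(node):
    # Simplified heuristic: any attribute value mentioning "email".
    if node.tag != "input":
        return False
    hints = " ".join(str(v) for v in node.attrs.values()).lower()
    return "email" in hints


def is_submit(node):
    # Simplified heuristic: any <button> or <input type="submit">.
    if node.tag == "button":
        return True
    return node.tag == "input" and node.attrs.get("type") == "submit"


def find_logical_form(root):
    """Return the smallest ancestor of an email field that holds a submit button."""
    for node in walk(root):
        if not is_email_input(node):
            continue
        container = node.parent
        while container is not None:
            if any(is_submit(n) for n in walk(container)):
                return container
            container = container.parent
    return None
```

Walking upward from the email field, rather than downward from the page root, is what keeps the returned container small: the first ancestor that contains a submit button is returned, and unrelated inputs elsewhere on the page are never included.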

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
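A field classifier along these lines might first trust a specific HTML5 type and otherwise fall back to keyword hints in the listed attributes. The sketch below is illustrative only: the `KEYWORDS` table and the `classify_field` name are assumptions, not the keyword lists used by the crawler.

```python
# Attributes the text names as the most frequent carriers of contextual hints.
HINT_ATTRS = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword lists, checked in insertion order.
KEYWORDS = {
    "email": ("email", "e-mail"),
    "tel": ("phone", "tel", "mobile"),
    "zip": ("zip", "postal"),
    "name": ("name",),
}


def classify_field(attrs, inner_html=""):
    """Guess a field's type from its attributes and inner HTML."""
    input_type = attrs.get("type", "text").lower()
    if input_type in ("email", "tel"):
        # A specific HTML5 type is decisive on its own.
        return input_type
    hints = " ".join(str(attrs.get(a, "")) for a in HINT_ATTRS)
    hints = (hints + " " + inner_html).lower()
    for field_type, words in KEYWORDS.items():
        if any(word in hints for word in words):
            return field_type
    return "unknown"
```

Substring matching of this kind is deliberately loose, so a production classifier would likely need word-boundary checks to avoid false positives such as "hotel" matching "tel".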

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows, and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types that are non-text (i.e., not text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
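The recursive pruning of non-text subparts can be sketched with Python's standard `email` package. This is only an illustration of the idea: the server described above uses the JavaMail API in Java, and `strip_non_text` is a hypothetical helper name.

```python
from email.message import Message
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def strip_non_text(msg: Message) -> Message:
    """Recursively drop subparts whose content type is not text/*."""
    if msg.is_multipart():
        kept = [
            strip_non_text(part)
            for part in msg.get_payload()
            # Keep nested multipart containers and any text/* leaf.
            if part.is_multipart() or part.get_content_maintype() == "text"
        ]
        msg.set_payload(kept)
    return msg


# Example: a mixed message with a text/HTML alternative and a binary attachment.
alternative = MIMEMultipart("alternative")
alternative.attach(MIMEText("hello", "plain"))
alternative.attach(MIMEText("<p>hello</p>", "html"))
outer = MIMEMultipart("mixed")
outer.attach(alternative)
outer.attach(MIMEApplication(b"\x00\x01"))

kept_types = [part.get_content_type() for part in strip_non_text(outer).walk()]
```

After pruning, `kept_types` lists only the two multipart containers and the two text parts; the binary attachment is gone while the tree structure of the message is preserved.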

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
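One way to use such a list is to precompute every hashed or encoded form of the email address, nested up to a fixed depth, and then search request URLs for those tokens. The sketch below assumes this substring-search approach and covers only a small subset of the functions listed above; `candidate_tokens` and `find_leaks` are hypothetical names.

```python
import base64
import hashlib
from urllib.parse import quote


def _hex_digest(name):
    """Return a bytes-to-bytes transform producing a lowercase hex digest."""
    return lambda data: hashlib.new(name, data).hexdigest().encode()


# A small illustrative subset of the supported hashes and encodings.
TRANSFORMS = {
    "md5": _hex_digest("md5"),
    "sha1": _hex_digest("sha1"),
    "sha256": _hex_digest("sha256"),
    "base64": lambda data: base64.b64encode(data),
    "urlencoding": lambda data: quote(data, safe="").encode(),
}


def candidate_tokens(email, max_layers=3):
    """Map each searchable token to a label such as 'sha1 of md5'."""
    tokens = {email: "plaintext"}
    frontier = {email.encode(): ()}
    for _ in range(max_layers):
        next_frontier = {}
        for value, chain in frontier.items():
            for name, transform in TRANSFORMS.items():
                out = transform(value)
                new_chain = chain + (name,)
                next_frontier[out] = new_chain
                # Keep the shortest (earliest found) label for each token.
                tokens.setdefault(out.decode(), " of ".join(reversed(new_chain)))
        frontier = next_frontier
    return tokens


def find_leaks(url, email):
    """Return the sorted labels of all address encodings found in the URL."""
    tokens = candidate_tokens(email)
    return sorted({label for token, label in tokens.items() if token in url})
```

For example, a URL carrying the SHA1 of the MD5 hex digest of the address is reported with the label "sha1 of md5". The number of candidate tokens grows exponentially with the nesting depth, which is one reason to cap the depth at a small value such as three.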

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party    Organization   Avg. # add’l parties   # S   # E
pippio.com           Acxiom         5.7                    7     32
liadm.com*           LiveIntent     3.7                    68    1097
rlcdn.com            Acxiom         1.7                    11    551
imiclk.com           MediaMath      1.3                    2     4
mathtag.com          MediaMath      1.1                    11    382
alcmpn.com           ALC†           0.8                    6     132
emltrk.com           Litmus         0.7                    41    638
acxiom-online.com    Acxiom         0.4                    2     33
dyneml.com           PowerInbox     0.1                    3     13
adnxs.com            AppNexus       0.1                    19    277

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel



Measuring blocked tags. Tracking protection tools which block resource requests offer users protection against the tracking embedded in emails. We evaluate the effectiveness of these tools by checking the requests in our dataset against two major blocklists, EasyList and EasyPrivacy [4]. These lists block advertisement and tracking related requests, and are bundled with several popular blocking extensions, including AdBlock Plus [1] and uBlock Origin [5]. We use the BlockListParser library [3] to determine if a request would have been blocked³ by an extension utilizing these lists. We classify a request as blocked if it matches any of the following three conditions:

1. The request directly matches the filter list.
2. The request is the result of a redirect, and any request earlier in the redirect chain was blocked.
3. The request is loaded in an iframe, and the iframe document request (or any resulting redirect) was blocked.

It is possible to do this classification in an offline fashion because of the lack of JavaScript support in email clients. This removes the need to run measurements with one of the aforementioned extensions installed. In environments that support JavaScript, content can be loaded dynamically and as the result of interactions between several scripts. In such an environment it is much more difficult to determine which requests would have been blocked by a single script appearing on the block list.
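The three blocking conditions can be applied offline to recorded requests, as in the following sketch. The `Request` record and the `matches_filter_list` callback are assumptions made for illustration; the callback stands in for whatever filter-list matcher is used and is not the BlockListParser API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    """Hypothetical record of one logged resource request."""
    url: str
    redirected_from: Optional["Request"] = None  # previous hop in a redirect chain
    iframe_document: Optional["Request"] = None  # document request of the enclosing iframe


def is_blocked(req: Request, matches_filter_list: Callable[[str], bool]) -> bool:
    """Classify a request as blocked under the three conditions."""
    hop = req
    while hop is not None:
        # Condition 1 (the request itself) and condition 2 (any earlier redirect hop).
        if matches_filter_list(hop.url):
            return True
        hop = hop.redirected_from
    if req.iframe_document is not None:
        # Condition 3: the enclosing iframe document, or a redirect leading to it.
        return is_blocked(req.iframe_document, matches_filter_list)
    return False
```

The sketch assumes that every request loaded inside an iframe carries a reference to the final document request of that iframe, so that walking that request's redirect chain covers "any resulting redirect".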

4.2 Email provides much of the same tracking opportunities as the web

Remote resources embedded in email content can track users across emails. As we show in our survey of email clients (Section 6.2), many email clients allow remote resources to set persistent cookies, and send those cookies with resource requests. In total, we find that 10,724 of the measured emails (85%) embed resources from at least one third party, with an average of 5 third parties per email. The distribution of embedded third parties is far from uniform; we find a median of two per email, and a small number of emails embedding as many as 50 third parties (Figure 2).

³ We set the parser options as we would expect them to be set for a request occurring in a webmail client. For example, all requests are considered third-party requests.

Domain                  % of Emails   % of Top 1M
doubleclick.net         22.2          47.5
mathtag.com             14.2          7.9
dotomi.com              12.7          3.5
adnxs.com               12.2          13.2
tapad.com               11.0          2.6
liadm.com               11.0          0.4
returnpath.net          11.0          <0.1
bidswitch.net           10.5          4.9
fonts.googleapis.com    10.2          39.4
list-manage.com         10.1          <0.1

Table 3. Top third-party domains by percentage of the 12,618 emails in the corpus. For comparison, we show the percentage of the top 1 million websites on which these third parties are present.

Fig. 2. CDF of third parties per email, aggregating data across the initial viewing and re-opening of an email. In addition, 1.4% of emails have between 25 and 53 third parties.

Table 3 shows the top third-party domains present in email. Many of these parties also have a large presence on the web [17], blurring the line between email and web tracking. On webmail clients, requests to these cross-context third parties will use the same cookies, allowing them to track both a user’s web browsing and email habits. In total, the emails visited during our crawls embed resources from 879 third parties.

4.3 Leaks of email addresses to third parties are common

In addition to being able to track email habits, 99 third parties (11%) also gain access to a user’s email address, whether in plaintext or hashed. In email clients which support cookies, these third parties will receive the email address alongside any cookies they’ve set on the user’s device. Trackers which are also present on the web will thus be able to link this address with the user’s browsing history profile.

Around 19% of the 902 senders leaked the user’s email address to a third party in at least one email, and in total, 29% of emails contain leaks to third parties. We find that a majority of these leaks, 62% of the 100,963 leaks to third parties, are intentional. These intentional leaks mostly occur through remote content embedded directly by the sender. Furthermore, 1% of leaks are classified as unintentional, with the remainder considered ambiguous. While we do not attempt to determine how these identifiers are being used, plaintext and hashed emails can be used for persistent tracking, cross-device tracking, and syncing information between parties.

Leak                   # of Senders   # of Recipients
MD5                    100 (11.1%)    38 (38.5%)
SHA1                   64 (7.1%)      19 (19.2%)
SHA256                 69 (7.6%)      13 (13.1%)
Plaintext Domain       55 (6.1%)      2 (2.0%)
Plaintext Address      77 (8.5%)      54 (54.5%)
URL Encoded Address    6 (0.6%)       8 (8.1%)
SHA1 of MD5            1 (0.1%)       1 (1.0%)
SHA256 of MD5          1 (0.1%)       1 (1.0%)
MD5 of MD5             1 (0.1%)       1 (1.0%)
SHA384                 1 (0.1%)       1 (1.0%)

Table 4. Email address leakage to third parties by encoding. Percentages are given out of a total of 902 senders and 99 third-party leak recipients. All hashes are of the full email address. Email “domain” is the part of the address after the “@”. These appear to be a misuse of LiveIntent’s API (Section 4.5).

The leaked addresses are often hashed. Although we can detect email addresses hashed with 24 different functions and up to three nested layers, we only find MD5, SHA1, and SHA256 in frequent use. Table 4 summarizes the number of senders and receivers of each encoding. The relatively low diversity of hashes and encodings suggests that these techniques are not being used to obfuscate the collection of email addresses. In fact, the query parameters which contain hashed emails sometimes identify the hash functions used in the parameter name (e.g., a string like md5=<md5 hash of email> appearing in the HTTP request). The design of APIs like LiveIntent’s, which first receives an email address and then syncs with a number of other parties (Section 4.5), suggests that these hashed addresses may be used to share or link data from multiple parties.

Recipient Organization   # of Senders
LiveIntent               68 (7.5%)
Acxiom                   46 (5.1%)
Litmus Software          28 (3.1%)
Conversant Media         26 (2.9%)
Neustar                  24 (2.7%)
apxlv.com                18 (2.0%)
54.211.147.17            18 (2.0%)
Trancos                  17 (1.9%)
WPP                      17 (1.9%)
54.82.61.160             16 (1.8%)

Table 5. Top organizations receiving email address leaks by number of the 902 total senders. A domain is used in place of an organization when it isn’t clear which organization it belongs to.

Table 5 identifies the top organizations⁴ which receive leaked email addresses. This shows that email address collection from emails is largely consolidated to a few major players, which are mostly distinct from the popular web trackers. In fact, only one of the top 10 organizations, Neustar, is found in the top 20 third-party organizations on the top 1 million websites, as measured by Englehardt and Narayanan [17]. Also surprising is the prevalence of leaks to IP addresses, which account for eight of the top 20 domains receiving email addresses. This may be due to the relatively ephemeral nature of newsletter emails, which removes concerns of IP address churn over time.

4.4 Reopening emails brings in new third parties

Despite the lack of JavaScript support, email views are dynamic. The email content itself is static, but any remote resources embedded in it may return different responses each time the email is viewed, and even redirect to different third parties. To examine the effects of this, we load every email first with a “clean” browser profile and then again without clearing the profile. Surprisingly, the average Jaccard similarity [36] between the sets of third parties loaded during the first and second views of the same email is only 60%.
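For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The following plain-Python sketch uses made-up third-party sets, not data from the study, and is a direct set-based implementation rather than the scikit-learn function cited above.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


# Illustrative (invented) third-party sets for two views of one email.
first_view = {"doubleclick.net", "liadm.com", "bidswitch.net", "mathtag.com"}
second_view = {"doubleclick.net", "liadm.com", "adnxs.com"}

similarity = jaccard(first_view, second_view)  # 2 shared / 5 total = 0.4
```

In this toy case two of the five distinct third parties appear in both views, giving a similarity of 0.4.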

The majority of emails (two-thirds) load fewer third parties when the email is reopened compared to the initial view. However, about 21% of emails load at least one resource when an email is reopened that wasn’t present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

⁴ We map domains to organizations using the classification provided by Libert [27], adding several new email-specific organizations. When an organization could not be found, we use the PS+1.

Row   Request URL
0     http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0
1     http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN>
2     http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
3     http://x.bidswitch.net/ul_cb/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
4     http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...]
5     http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...]
6     http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent
7     http://i.liadm.com/s/19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1>
8     http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc
9     http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc=
10    http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1
11    http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=
12    http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent

Table 6. Redirect chain from a LiveIntent Email Tracking Pixel. URL query strings are truncated for clarity (using [...]).

The number of leaks between email loads stays relatively constant, with fewer than 50 emails leaking to new parties on the second load.⁵ However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak if the redirectors share data based on the email or email hash they receive.

4.5 Case study: LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section, we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com, and includes the user’s plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=. The URL also includes the domain portion of the user’s address.

⁵ We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

In rows 2 - 12, the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10, LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.

4.6 Request blockers help, but don’t fix the problem

Privacy-conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content.⁶ We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both plaintext addresses and email hashes still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

⁶ Thunderbird supports most of the popular Firefox extensions, and as such Thunderbird users can also deploy these defenses. See Table 12 for more details.

Encoding               # of Senders   # of Recipients
Plaintext Address      34 (3.7%)      34 (66.7%)
MD5                    21 (2.3%)      12 (23.5%)
SHA1                   14 (1.6%)      6 (11.8%)
URL Encoded Address    4 (0.4%)       4 (7.8%)
SHA256                 4 (0.4%)       2 (3.9%)
SHA384                 1 (0.1%)       1 (2.0%)

Table 7. Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain         # of Senders
mediawallahscript.com    7
jetlore.com              4
scrippsnetworks.com      4
alocdn.com               3
richrelevance.com        3
ivitrack.com             2
intentiq.com             2
gatehousemedia.com       2
realtime.email           2
ziffimages.com           2

Table 8. The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support JavaScript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit them individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href property of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. This helps ensure that we have as diverse a set of landing pages as possible by stripping fragment and query string identifiers that may not influence the landing page.
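The binning and round-robin sampling described above can be sketched as follows. Two assumptions are labeled loudly: `ps_plus_one` is a crude stand-in that keeps the last two host labels (a faithful implementation would consult the Public Suffix List), and `sample_links` with its `seed` parameter is a hypothetical helper, not the authors' code.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def ps_plus_one(host):
    # Simplification: last two labels only. This is wrong for suffixes such
    # as "co.uk"; a real implementation would use the Public Suffix List.
    return ".".join(host.split(".")[-2:])


def sample_links(links, limit=200, seed=0):
    """Sample up to `limit` unique links, one per (PS+1, path) bin per round."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for link in sorted(set(links)):
        parts = urlsplit(link)
        # Query string and fragment are ignored when choosing the bin.
        bins[(ps_plus_one(parts.hostname or ""), parts.path)].append(link)
    for bucket in bins.values():
        rng.shuffle(bucket)
    sampled = []
    while len(sampled) < limit and any(bins.values()):
        for bucket in bins.values():
            if bucket and len(sampled) < limit:
                sampled.append(bucket.pop())  # without replacement
    return sampled
```

Taking one link from every bin before returning to any bin a second time is what spreads the sample across distinct landing pages when a sender reuses the same path with many different query strings.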

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both JavaScript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client’s URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if it is contained either in the URL or the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.
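The URL-or-Referer rule can be expressed as a short check over logged requests. In this sketch the request records are plain dictionaries with hypothetical `url` and `referer` keys, `tokens` is assumed to be a precomputed set of plaintext, hashed, and encoded forms of the address, and no first-party versus third-party filtering is applied.

```python
from urllib.parse import urlsplit


def leak_recipients(requests, tokens):
    """Return hostnames whose request URL or Referer contains any token."""
    recipients = set()
    for req in requests:
        # Search both places in which the address may appear.
        haystack = req["url"] + " " + req.get("referer", "")
        if any(token in haystack for token in tokens):
            recipients.add(urlsplit(req["url"]).hostname)
    return recipients
```

A request counts toward its destination host even when the address appears only in the Referer header, which is how an address embedded in a landing page URL can reach every resource that page loads.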

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset, there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph except the first are slight underestimates due to our limit of 200 links per sender.

Recipient Organization   # of Senders
Google                   247 (27.4%)
Facebook                 160 (17.7%)
Twitter                  94 (10.4%)
Adobe                    81 (9.0%)
Microsoft                73 (8.1%)
Pinterest                72 (8.0%)
LiveIntent               69 (7.6%)
Akamai                   69 (7.6%)
Acxiom                   68 (7.5%)
AppNexus                 61 (6.8%)

Table 9. The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain         # of Senders
google-analytics.com     200 (22.2%)
doubleclick.net          196 (21.7%)
google.com               159 (17.6%)
facebook.com             154 (17.1%)
facebook.net             145 (16.1%)
fonts.googleapis.com     102 (11.3%)
googleadservices.com     96 (10.6%)
twitter.com              94 (10.4%)
googletagmanager.com     87 (9.6%)
gstatic.com              78 (8.6%)

Table 10. The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient’s privacy, namely the recipient’s mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail’s client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail’s IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

Defense             Email server   Email client   Web browser
Content proxying    X
HTML filtering      X              X
Cookie blocking                    X              X
Referrer blocking   X              X              X
Request blocking                   X              X

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn’t prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user’s IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user’s email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably, the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent) and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well, by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
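The server-side rewriting described above can be sketched with Python's standard html.parser module. This is a minimal illustration under stated assumptions, not the approach of any surveyed mail server: the class and function names are hypothetical, only a and area tags are treated as links, and the meta tag is inserted only when the email already contains a head element.

```python
from html import escape
from html.parser import HTMLParser

# Referrer Policy delivered via a meta tag, inserted right after <head>.
META = '<meta name="referrer" content="no-referrer">'


class NoReferrerRewriter(HTMLParser):
    """Re-emit an HTML document, adding rel="noreferrer" to every link."""

    def __init__(self):
        # Keep character references as written so text is passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs):
        if tag not in ("a", "area"):
            # Non-link tags are copied verbatim from the source.
            return self.get_starttag_text()
        tokens = (dict(attrs).get("rel") or "").split()
        if "noreferrer" not in tokens:
            tokens.append("noreferrer")
        attrs = [(n, v) for n, v in attrs if n != "rel"]
        attrs.append(("rel", " ".join(tokens)))
        # Attribute values arrive unescaped, so escape them again on output.
        rendered = "".join(
            f" {n}" if v is None else f' {n}="{escape(v, quote=True)}"'
            for n, v in attrs
        )
        return f"<{tag}{rendered}>"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._start(tag, attrs))
        if tag == "head":
            self.out.append(META)

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._start(tag, attrs))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def add_referrer_protection(html_text):
    """Return html_text with noreferrer links and a no-referrer meta policy."""
    parser = NoReferrerRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Existing rel tokens such as nofollow are preserved, so the rewrite only ever adds restrictions to a link.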

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter list based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients.⁷ Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

⁷ https://emailtracking.openwpm.com/


Mail Client       Platform   Proxies Content   Blocks Images   Blocks Referrers   Blocks Cookies           Ext. Support
Gmail             Web        Yes               No              L: Yes, I: Yes†    Yes†                     Yes
Yahoo Mail        Web        No                Yes             L: Yes, I: No      No                       Yes
Outlook Web App   Web        No                Yes             No                 No                       Yes
Outlook.com       Web        No                No              No                 No                       Yes
Yandex Mail       Web        Yes               No              L: Yes, I: Yes†    Yes†                     Yes
GMX               Web        No                No              No                 No                       Yes
Zimbra            Web        No                Yes             No                 No                       Yes
163.com           Web        No                No              No                 No                       Yes
Sina              Web        No                No              No                 No                       Yes
Apple Mail        iOS        No                No              Yes                Yes                      No
Gmail             iOS        Yes               No              Yes                Yes                      No
Gmail             Android    Yes               No              Yes                Yes                      No
Apple Mail        Desktop    No                No              Yes                Yes                      No
Windows Mail      Desktop    No                No              Yes                No                       No
Outlook 2016      Desktop    No                Yes             Yes                No                       No
Thunderbird       Desktop    No                Yes             Yes                Optional (Default: No)   Yes

Table 12. A survey of the privacy-impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests, "I", and with link clicks, "L"), blocks external resources from setting cookies, and whether or not the client supports request-blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
* Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python, using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
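The filtering step can be illustrated with a short sketch. It is not the prototype itself: the two regular expressions below are hypothetical stand-ins for rules compiled from EasyList and EasyPrivacy, the BlockListParser library is deliberately not used here, and the class name TrackerStripper is an assumption made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-ins for regular expressions compiled from filter lists.
BLOCK_PATTERNS = [
    re.compile(r"^https?://([^/]+\.)?tracker\.example/"),
    re.compile(r"/pixel\.gif(\?|$)"),
]

# Tags that embed remote resources, and the attribute holding the URL.
URL_ATTRS = {"img": "src", "link": "href"}


class TrackerStripper(HTMLParser):
    """Copy HTML through unchanged, dropping embedded resources that match."""

    def __init__(self, patterns):
        super().__init__(convert_charrefs=False)
        self.patterns = patterns
        self.out = []
        self.removed = []

    def _blocked(self, tag, attrs):
        attr = URL_ATTRS.get(tag)
        url = dict(attrs).get(attr) if attr else None
        if url and any(p.search(url) for p in self.patterns):
            self.removed.append(url)
            return True
        return False

    def handle_starttag(self, tag, attrs):
        if not self._blocked(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self._blocked(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def filter_email_html(html_text, patterns=BLOCK_PATTERNS):
    """Return (filtered_html, removed_urls) for one text/html email body."""
    stripper = TrackerStripper(patterns)
    stripper.feed(html_text)
    stripper.close()
    return "".join(stripper.out), stripper.removed
```

Because only URLs literally present in the markup are examined, a sketch like this shares the limitation of static filtering: resources reached through a redirect are not seen.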

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                      # of Senders
li.<PS+1>/imp                    51 (5.7%)
partner.<PS+1>/                  7 (0.7%)
stripe.<PS+1>/stripeimage        4 (0.4%)
p.<PS+1>/espopen                 4 (0.4%)
api.<PS+1>/layoutssection<N>     4 (0.4%)
<PS+1>/customer-service          3 (0.3%)
mi.<PS+1>/prp                    3 (0.3%)
dmtk.<PS+1>/                     3 (0.3%)
links.<PS+1>/eopen               3 (0.3%)
eads.<PS+1>/imp                  3 (0.3%)

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
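The pattern-generation steps in the caption of Table 13 can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the table: the public suffix plus one is approximated by the last two hostname labels (a real implementation would consult the Public Suffix List), integers are replaced only in the path, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def ps_plus_one(hostname):
    """Naive PS+1: the last two labels. Assumption: wrong for e.g. co.uk."""
    return ".".join(hostname.split(".")[-2:])


def url_pattern(url):
    """Reduce a request URL to a structural pattern such as li.<PS+1>/imp."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    base = ps_plus_one(host)
    prefix = host[: len(host) - len(base)]  # subdomain labels, e.g. "li."
    path = parts.path or "/"
    # Strip the last path portion if it ends with a file extension.
    head, _, last = path.rpartition("/")
    if re.search(r"\.[A-Za-z0-9]{1,5}$", last):
        path = head or "/"
    # Replace integers with a placeholder.
    path = re.sub(r"\d+", "<N>", path)
    return f"{prefix}<PS+1>{path}"


def rank_patterns(leaks):
    """Rank patterns by distinct senders; leaks is an iterable of (sender, url)."""
    senders = defaultdict(set)
    for sender, url in leaks:
        senders[url_pattern(url)].add(sender)
    counts = ((pattern, len(s)) for pattern, s in senders.items())
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```

Counting distinct senders rather than requests keeps a single high-volume sender from dominating the ranking, which matches how the table is ordered.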

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available.⁸ It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

⁸ https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
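Both observations can be demonstrated in a few lines. The sketch below is illustrative only: the choice of SHA-256, the lowercase-and-trim normalization, the log structure, and all function names are assumptions, not a description of any particular tracker.

```python
import hashlib


def hash_email(email):
    """Unsalted hash of a normalized address (normalization is an assumption)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def retrieve_records(known_emails, log):
    """Look up records keyed by hashed email, given plaintext addresses."""
    return {e: log[hash_email(e)] for e in known_emails if hash_email(e) in log}


def dictionary_attack(target_hash, candidates):
    """Return the first candidate address whose hash matches, else None."""
    for email in candidates:
        if hash_email(email) == target_hash:
            return email
    return None
```

Because every party must derive the same value from the same address, no salt can be used, so a single pass over a list of candidate addresses is enough to invert any hash that appears in it.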

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing occurs when viewing emails, a process in which different trackers exchange and link together their own IDs for the same user. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies, advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
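A minimal sketch of the differential comparison might look like the following. It assumes two requests to the same endpoint have already been paired up, one per registered address, and it considers only query-string parameters; the function name and this pairing step are assumptions, since the approach is proposed here only as future work.

```python
from urllib.parse import parse_qsl, urlsplit


def differing_params(url_a, url_b):
    """Return {name: (value_a, value_b)} for shared parameters that differ.

    The URLs are expected to hit the same host and path; otherwise the
    comparison is not meaningful and an empty dict is returned.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.hostname, a.path) != (b.hostname, b.path):
        return {}
    query_a = dict(parse_qsl(a.query))
    query_b = dict(parse_qsl(b.query))
    shared = query_a.keys() & query_b.keys()
    return {k: (query_a[k], query_b[k]) for k in shared if query_a[k] != query_b[k]}
```

Parameters flagged this way are only candidates: values that vary because of A/B testing, timestamps, or cache busters would also be reported, so repeated loads of the same email would be needed to rule those out.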

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads! https://adblockplus.org/. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup/. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to/. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source, https://www.campaignmonitor.com/css/ (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.

[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking/. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J Alex Halderman. Neither snow nor rain nor MITM: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of TLS-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.

[24] Balachander Krishnamurthy and Craig E Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet Jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party HTTP requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R Mayer and John C Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and controlling PII leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.
[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of PII via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from its parent; thus, the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
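The two-stage lookup described above (the type attribute first, contextual hints second) can be sketched as below. The keyword table is a hypothetical example rather than the crawler's actual list, and the function takes a plain dictionary of attributes so that the sketch does not depend on any browser automation API.

```python
import re

# Attributes reported above as the most frequent carriers of contextual hints.
HINT_ATTRS = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword table, checked in order; the real list is not given here.
KEYWORDS = [
    ("email", re.compile(r"e-?mail", re.I)),
    ("phone", re.compile(r"phone|mobile|\btel\b", re.I)),
    ("zip", re.compile(r"\bzip\b|postal", re.I)),
    ("name", re.compile(r"name", re.I)),
]

# HTML5 input types that identify the field without further hints.
TYPE_MAP = {"email": "email", "tel": "phone"}


def classify_field(attrs, inner_html=""):
    """Guess a field's kind from its type attribute, then from textual hints."""
    field_type = (attrs.get("type") or "").lower()
    if field_type in TYPE_MAP:
        return TYPE_MAP[field_type]
    hints = " ".join(str(attrs.get(a) or "") for a in HINT_ATTRS)
    hints = f"{hints} {inner_html}"
    for label, pattern in KEYWORDS:
        if pattern.search(hints):
            return label
    return "unknown"
```

Checking the more specific keywords before the generic "name" pattern avoids misreading a hint like "email_name" as a name field.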

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows, and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types that are non-text (text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
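The recursive pruning of non-text subparts can be illustrated with Python's standard email package. Note that this swaps in a different technology for illustration: the server described above is built on SubEtha SMTP and the JavaMail API, and the function name strip_non_text is hypothetical.

```python
from email import message_from_bytes, policy


def strip_non_text(part):
    """Recursively drop leaf subparts whose content type is not text/*."""
    if part.is_multipart():
        kept = [
            child
            for child in part.get_payload()
            # Keep containers (pruned below) and textual leaves only.
            if child.is_multipart() or child.get_content_maintype() == "text"
        ]
        for child in kept:
            strip_non_text(child)
        part.set_payload(kept)
    return part


def store_ready_bytes(raw_message):
    """Parse raw DATA bytes, prune non-text parts, and serialize for storage."""
    message = message_from_bytes(raw_message, policy=policy.default)
    return strip_non_text(message).as_bytes()
```

Multipart containers are always retained and pruned in place, so the tree structure of the message survives even when every attachment inside it is discarded.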

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc
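To make concrete how layered hashes and encodings can be searched for, the sketch below precomputes every transformation chain of an address up to a fixed depth and looks for each result in a request URL. It covers only a small subset of the lists above (md5, sha1, sha256, base16, base32, base64, urlencoding), and the function names and the default depth of two are assumptions made for illustration.

```python
import base64
import hashlib
from urllib.parse import quote

# A small subset of the supported transforms; each maps bytes to bytes.
TRANSFORMS = {
    "md5": lambda b: hashlib.md5(b).hexdigest().encode(),
    "sha1": lambda b: hashlib.sha1(b).hexdigest().encode(),
    "sha256": lambda b: hashlib.sha256(b).hexdigest().encode(),
    "base16": lambda b: base64.b16encode(b).lower(),
    "base32": base64.b32encode,
    "base64": base64.b64encode,
    "urlencoding": lambda b: quote(b).encode(),
}


def candidate_tokens(email, depth=2):
    """Map each transform chain (a tuple of names) to the resulting string."""
    frontier = {(): email.encode("utf-8")}
    tokens = {}
    for _ in range(depth):
        next_frontier = {}
        for chain, value in frontier.items():
            for name, transform in TRANSFORMS.items():
                next_frontier[chain + (name,)] = transform(value)
        tokens.update(next_frontier)
        frontier = next_frontier
    return {chain: value.decode("ascii") for chain, value in tokens.items()}


def find_leaks(url, email, depth=2):
    """Return the transform chains whose output appears in the URL."""
    hits = [("plaintext",)] if email in url else []
    hits += [c for c, token in candidate_tokens(email, depth).items() if token in url]
    return hits
```

The number of candidates grows exponentially with the depth, which is why exhaustive testing of all combinations is not feasible and the search has to be capped.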

10.4 Top parties redirecting to new third parties on email reload

| Redirecting Party | Organization | Avg. # add'l parties | # S | # E |
|---|---|---|---|---|
| pippio.com | Acxiom | 5.7 | 7 | 32 |
| liadm.com* | LiveIntent | 3.7 | 68 | 1097 |
| rlcdn.com | Acxiom | 1.7 | 11 | 551 |
| imiclk.com | MediaMath | 1.3 | 2 | 4 |
| mathtag.com | MediaMath | 1.1 | 11 | 382 |
| alcmpn.com | ALC† | 0.8 | 6 | 132 |
| emltrk.com | Litmus | 0.7 | 41 | 638 |
| acxiom-online.com | Acxiom | 0.4 | 2 | 33 |
| dyneml.com | PowerInbox | 0.1 | 3 | 13 |
| adnxs.com | AppNexus | 0.1 | 19 | 277 |

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.<firstparty>.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel


I never signed up for this Privacy implications of email tracking 8

ties. We find that a majority of these leaks, 62% of the 100,963 leaks to third parties, are intentional. These intentional leaks mostly occur through remote content embedded directly by the sender. Furthermore, 1% of leaks are classified as unintentional, with the remainder considered ambiguous. While we do not attempt to determine how these identifiers are being used, plaintext and hashed emails can be used for persistent tracking, cross-device tracking, and syncing information between parties.

| Leak | # of Senders | # of Recipients |
|---|---|---|
| MD5 | 100 (11.1%) | 38 (38.5%) |
| SHA1 | 64 (7.1%) | 19 (19.2%) |
| SHA256 | 69 (7.6%) | 13 (13.1%) |
| Plaintext Domain | 55 (6.1%) | 2 (2.0%) |
| Plaintext Address | 77 (8.5%) | 54 (54.5%) |
| URL Encoded Address | 6 (0.6%) | 8 (8.1%) |
| SHA1 of MD5 | 1 (0.1%) | 1 (1.0%) |
| SHA256 of MD5 | 1 (0.1%) | 1 (1.0%) |
| MD5 of MD5 | 1 (0.1%) | 1 (1.0%) |
| SHA384 | 1 (0.1%) | 1 (1.0%) |

Table 4. Email address leakage to third parties by encoding. Percentages are given out of a total of 902 senders and 99 third-party leak recipients. All hashes are of the full email address. Email "domain" is the part of the address after the "@". These appear to be a misuse of LiveIntent's API (Section 4.5).

The leaked addresses are often hashed. Although we can detect email addresses hashed with 24 different functions and up to three nested layers, we only find MD5, SHA1, and SHA256 in frequent use. Table 4 summarizes the number of senders and receivers of each encoding. The relatively low diversity of hashes and encodings suggests that these techniques are not being used to obfuscate the collection of email addresses. In fact, the query parameters which contain hashed emails sometimes identify the hash functions used in the parameter name (e.g., a string like md5=<md5 hash of email> appearing in the HTTP request). The design of APIs like LiveIntent's, which first receives an email address and then syncs with a number of other parties (Section 4.5), suggests that these hashed addresses may be used to share or link data from multiple parties.

| Recipient Organization | # of Senders |
|---|---|
| LiveIntent | 68 (7.5%) |
| Acxiom | 46 (5.1%) |
| Litmus Software | 28 (3.1%) |
| Conversant Media | 26 (2.9%) |
| Neustar | 24 (2.7%) |
| apxlv.com | 18 (2.0%) |
| 54.211.147.17 | 18 (2.0%) |
| Trancos | 17 (1.9%) |
| WPP | 17 (1.9%) |
| 54.82.61.160 | 16 (1.8%) |

Table 5. Top organizations receiving email address leaks, by number of the 902 total senders. A domain is used in place of an organization when it isn't clear which organization it belongs to.

Table 5 identifies the top organizations⁴ which receive leaked email addresses. This shows that email address collection from emails is largely consolidated to a few major players, which are mostly distinct from the popular web trackers. In fact, only one of the top 10 organizations, Neustar, is found in the top 20 third-party organizations on the top 1 million websites, as measured by Englehardt and Narayanan [17]. Also surprising is the prevalence of leaks to IP addresses, which account for eight of the top 20 domains receiving email addresses. This may be due to the relatively ephemeral nature of newsletter emails, which removes concerns of IP address churn over time.

4.4 Reopening emails brings in new third parties

Despite the lack of JavaScript support, email views are dynamic. The email content itself is static, but any remote resources embedded in it may return different responses each time the email is viewed, and even redirect to different third parties. To examine the effects of this, we load every email first with a "clean" browser profile, and then again without clearing the profile. Surprisingly, the average Jaccard similarity [36] between the sets of third parties loaded during the first and second views of the same email is only 60%.
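For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch follows; treating two empty sets as identical is a convention chosen for this example, and the domain sets in the usage note are illustrative rather than measured data.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        # Convention for this sketch: two empty sets are identical.
        return 1.0
    return len(a & b) / len(a | b)
```

For instance, if a first view loads {"liadm.com", "doubleclick.net", "bidswitch.net"} and a second view loads {"liadm.com", "doubleclick.net", "adnxs.com"}, two of the four distinct parties are shared, giving a similarity of 0.5.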

The majority of emails (two-thirds) load fewer third parties when the email is reopened compared to the initial view. However, about 21% of emails load at least one resource when an email is reopened that wasn't present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

4 We map domains to organizations using the classification provided by Libert [27], adding several new email-specific organizations. When an organization could not be found, we use the PS+1.

| Row | Request URL |
|---|---|
| 0 | http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0 |
| 1 | http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN> |
| 2 | http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...] |
| 3 | http://x.bidswitch.net/ul_cb/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...] |
| 4 | http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...] |
| 5 | http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...] |
| 6 | http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent |
| 7 | http://i.liadm.com/s/19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1> |
| 8 | http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc |
| 9 | http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc= |
| 10 | http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1 |
| 11 | http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd= |
| 12 | http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent |

Table 6. Redirect chain from a LiveIntent email tracking pixel. URL query strings are truncated for clarity (using [...]).

The number of leaks between email loads stays relatively constant, with fewer than 50 emails leaking to new parties on the second load⁵. However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak if the redirectors share data based on the email or email hash they receive.

5 We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

4.5 Case study: LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section, we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com and includes the user's plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=. The URL also includes the domain portion of the user's address.

In rows 2-12, the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10, LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.
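One way to confirm that parameters such as m=, sh=, and sh2= carry hashes of a known address is to recompute the digests and compare them against the query string. The sketch below does this with the standard library; the function name hashed_address_params and the trimming and lowercasing of the address before hashing are assumptions of this example.

```python
import hashlib
from urllib.parse import parse_qs, urlsplit


def hashed_address_params(url: str, address: str) -> dict:
    """Report which query parameters equal a hash of `address`."""
    # Assumption: the sender hashes the trimmed, lowercased address.
    addr = address.strip().lower().encode()
    expected = {
        hashlib.md5(addr).hexdigest(): "md5",
        hashlib.sha1(addr).hexdigest(): "sha1",
        hashlib.sha256(addr).hexdigest(): "sha256",
    }
    found = {}
    for name, values in parse_qs(urlsplit(url).query).items():
        for value in values:
            if value.lower() in expected:
                found[name] = expected[value.lower()]
    return found
```

Applied to a URL shaped like row 1 of Table 6, this would be expected to return {"m": "md5", "sh": "sha1", "sh2": "sha256"}, while parameters such as p= and dom= are ignored.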

4.6 Request blockers help, but don't fix the problem

Privacy-conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content⁶. We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

6 Thunderbird supports most of the popular Firefox extensions, and as such, Thunderbird users can also deploy these defenses. See Table 12 for more details.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both plaintext and email hashes still occur. In Table 8, we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

| Encoding | # of Senders | # of Recipients |
|---|---|---|
| Plaintext Address | 34 (3.7%) | 34 (66.7%) |
| MD5 | 21 (2.3%) | 12 (23.5%) |
| SHA1 | 14 (1.6%) | 6 (11.8%) |
| URL Encoded Address | 4 (0.4%) | 4 (7.8%) |
| SHA256 | 4 (0.4%) | 2 (3.9%) |
| SHA384 | 1 (0.1%) | 1 (2.0%) |

Table 7. Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

| Recipient Domain | # of Senders |
|---|---|
| mediawallahscript.com | 7 |
| jetlore.com | 4 |
| scrippsnetworks.com | 4 |
| alocdn.com | 3 |
| richrelevance.com | 3 |
| ivitrack.com | 2 |
| intentiq.com | 2 |
| gatehousemedia.com | 2 |
| realtime.email | 2 |
| ziffimages.com | 2 |

Table 8. The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

5 Privacy leaks when clicking links in emails

In Section 4, we explore the privacy impact of a user opening and rendering an email. In this section, we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support JavaScript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit them individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href property of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. This helps ensure that we have as diverse a set of landing pages as possible by stripping fragment and query string identifiers that may not influence the landing page.
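The binning and round-robin sampling described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: ps_plus_one approximates the PS+1 by the last two host labels (a real implementation would consult the Public Suffix List), and the seed parameter is added here only to make the sketch reproducible.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def ps_plus_one(host: str) -> str:
    # Simplification: treat the last two labels as the PS+1.
    return ".".join(host.lower().split(".")[-2:])


def sample_links(links, limit=200, seed=0):
    """Sample up to `limit` unique links, one per (PS+1, path) bin per round."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for link in sorted(set(links)):
        parts = urlsplit(link)
        # Query strings and fragments are ignored when choosing the bin.
        bins[(ps_plus_one(parts.hostname or ""), parts.path)].append(link)
    for bucket in bins.values():
        rng.shuffle(bucket)
    sampled = []
    # Take one link from every non-empty bin, then repeat, until the
    # bins are exhausted or the limit is reached.
    while len(sampled) < limit and any(bins.values()):
        for bucket in bins.values():
            if bucket and len(sampled) < limit:
                sampled.append(bucket.pop())
    return sampled
```

Because each round draws at most one link per bin, links that differ only in their query string compete for the same slot, which favors distinct landing pages over near-duplicates.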

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both JavaScript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client's URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if the address is contained either in the URL or in the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset, there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph except the first are slight underestimates due to our limit of 200 links per sender.

| Recipient Organization | # of Senders |
|---|---|
| Google | 247 (27.4%) |
| Facebook | 160 (17.7%) |
| Twitter | 94 (10.4%) |
| Adobe | 81 (9.0%) |
| Microsoft | 73 (8.1%) |
| Pinterest | 72 (8.0%) |
| LiveIntent | 69 (7.6%) |
| Akamai | 69 (7.6%) |
| Acxiom | 68 (7.5%) |
| AppNexus | 61 (6.8%) |

Table 9. The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

| Recipient Domain | # of Senders |
|---|---|
| google-analytics.com | 200 (22.2%) |
| doubleclick.net | 196 (21.7%) |
| google.com | 159 (17.6%) |
| facebook.com | 154 (17.1%) |
| facebook.net | 145 (16.1%) |
| fonts.googleapis.com | 102 (11.3%) |
| googleadservices.com | 96 (10.6%) |
| twitter.com | 94 (10.4%) |
| googletagmanager.com | 87 (9.6%) |
| gstatic.com | 78 (8.6%) |

Table 10. The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient's privacy, namely the recipient's mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail's client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail's IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

| Defense | Email server | Email client | Web browser |
|---|---|---|---|
| Content proxying | X | | |
| HTML filtering | X | X | |
| Cookie blocking | | X | X |
| Referrer blocking | X | X | X |
| Request blocking | | X | X |

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn't prevent the recipient email address from being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user's IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user's email address to a tracking profile. In this way, it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably, the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7, we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent), and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful, since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
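A server-side rewrite of this kind could look roughly like the sketch below. It is a deliberately simple, regex-based illustration: the function name add_noreferrer is an assumption, links that already carry a rel attribute are left untouched, and the meta tag is simply prepended to the fragment, whereas a production filter would use a real HTML parser and place the tag in the document head.

```python
import re

# Referrer Policy delivered through a meta element.
META = '<meta name="referrer" content="no-referrer">'
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
REL_ATTR = re.compile(r"\brel\s*=", re.IGNORECASE)


def add_noreferrer(html: str) -> str:
    """Add rel="noreferrer" to links and prepend a referrer policy tag."""

    def fix(match):
        attrs = match.group(1)
        if REL_ATTR.search(attrs):
            # Simplification: an existing rel attribute is not merged.
            return match.group(0)
        return '<a rel="noreferrer"' + attrs + ">"

    return META + A_TAG.sub(fix, html)
```

The two mechanisms overlap on purpose in this sketch: the policy tag covers embedded resources as well as links, while the rel attribute covers link clicks in clients that ignore the policy tag.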

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter-list-based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients⁷. Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore, we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

7 https://emailtracking.openwpm.com


| Mail Client | Platform | Proxies Content | Blocks Images | Blocks Referrers | Blocks Cookies | Ext. Support |
|---|---|---|---|---|---|---|
| Gmail | Web | Yes | No | L: Yes, I: Yes† | Yes† | Yes |
| Yahoo Mail | Web | No | Yes | L: Yes, I: No | No | Yes |
| Outlook Web App | Web | No | Yes | No | No | Yes |
| Outlook.com | Web | No | No | No | No | Yes |
| Yandex Mail | Web | Yes | No | L: Yes, I: Yes† | Yes† | Yes |
| GMX | Web | No | No | No | No | Yes |
| Zimbra | Web | No | Yes | No | No | Yes |
| 163.com | Web | No | No | No | No | Yes |
| Sina | Web | No | No | No | No | Yes |
| Apple Mail | iOS | No | No | Yes | Yes | No |
| Gmail | iOS | Yes | No | Yes | Yes | No |
| Gmail | Android | Yes | No | Yes | Yes | No |
| Apple Mail | Desktop | No | No | Yes | Yes | No |
| Windows Mail | Desktop | No | No | Yes | No | No |
| Outlook 2016 | Desktop | No | Yes | Yes | No | No |
| Thunderbird | Desktop | No | Yes | Yes | Optional (Default: No) | Yes |

Table 12. A survey of the privacy-impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request-blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
* Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
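The strip-and-rewrite step can be illustrated with the standard library alone. The sketch below does not use BlockListParser and does not parse real EasyList syntax: BLOCK_PATTERNS is a hypothetical stand-in for compiled filter rules, and the regex-based tag matching is a simplification of proper HTML parsing.

```python
import re

# Hypothetical stand-ins for compiled filter-list rules; real EasyList and
# EasyPrivacy rules use their own syntax and need a dedicated parser.
BLOCK_PATTERNS = [
    re.compile(r"^https?://([^/]+\.)?liadm\.com/"),
    re.compile(r"/imp\?"),
]

# Matches <img ...> and <link ...> tags and captures their src/href URL.
RESOURCE_TAG = re.compile(
    r"<(?:img|link)\b[^>]*?\b(?:src|href)\s*=\s*[\"']([^\"']+)[\"'][^>]*>",
    re.IGNORECASE,
)


def is_blocked(url: str) -> bool:
    """True if any blocking rule matches the resource URL."""
    return any(p.search(url) for p in BLOCK_PATTERNS)


def filter_html(html: str) -> str:
    """Strip <img> and <link> tags whose URL matches a blocking rule."""
    return RESOURCE_TAG.sub(
        lambda m: "" if is_blocked(m.group(1)) else m.group(0), html
    )
```

Resources that match no rule are left byte-for-byte unchanged, so the visible content of the email is preserved apart from the removed tags.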

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email addresses to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

| URL Pattern | # of Senders |
|---|---|
| li.<PS+1>/imp | 51 (5.7%) |
| partner.<PS+1>/ | 7 (0.7%) |
| stripe.<PS+1>/stripe/image | 4 (0.4%) |
| p.<PS+1>/espopen | 4 (0.4%) |
| api.<PS+1>/layouts/section<N> | 4 (0.4%) |
| <PS+1>/customer-service | 3 (0.3%) |
| mi.<PS+1>/p/rp | 3 (0.3%) |
| dmtk.<PS+1>/ | 3 (0.3%) |
| links.<PS+1>/e/open | 3 (0.3%) |
| eads.<PS+1>/imp | 3 (0.3%) |

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
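The pattern-generation steps in the caption can be sketched as below. This is an approximation for illustration: ps_plus_one uses the last two host labels in place of a Public Suffix List lookup, the file-extension test is a simple regular expression, and the function name url_pattern is an assumption of this example.

```python
import re
from urllib.parse import urlsplit


def ps_plus_one(host: str) -> str:
    # Simplification: last two labels. A real implementation would
    # consult the Public Suffix List.
    return ".".join(host.split(".")[-2:])


def url_pattern(url: str) -> str:
    """Reduce a request URL to a structural pattern (hostname and path)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    base = ps_plus_one(host)
    # Replace the registrable domain, keeping any subdomain prefix.
    host = host[: len(host) - len(base)] + "<PS+1>"
    segments = parts.path.split("/")
    # Drop the final path segment when it looks like a file name.
    if re.search(r"\.[A-Za-z0-9]+$", segments[-1]):
        segments = segments[:-1]
    # Replace every run of digits in the path with <N>.
    path = re.sub(r"\d+", "<N>", "/".join(segments))
    return host + path
```

Under these assumptions, a request to http://li.washingtonexaminer.com/imp?e=... reduces to li.<PS+1>/imp, so requests from different first-party domains collapse onto the same pattern.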

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available⁸. It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
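The lookup argument is easy to demonstrate. In the sketch below, an unsalted hash observed in a log is reversed by hashing a list of candidate addresses in advance; the function names build_lookup and reverse_hash and the example addresses are assumptions made for illustration.

```python
import hashlib


def build_lookup(candidates, algorithm="md5"):
    """Precompute hash -> address for every candidate address."""
    return {
        hashlib.new(algorithm, addr.encode()).hexdigest(): addr
        for addr in candidates
    }


def reverse_hash(observed_hash, lookup):
    """Return the address behind `observed_hash`, or None if not enumerated."""
    return lookup.get(observed_hash.lower())
```

Because the hashes are unsalted, one precomputed table serves every log and every party that uses the same hash function, which is why the known-address scenario above requires no brute force at all.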

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out whether cookie syncing, a process in which different trackers exchange and link together their own IDs for the same user, occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies, advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
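A minimal sketch of the comparison step in differential testing follows. It assumes two request URLs to the same endpoint, observed for two different registered addresses; the function name `differing_params` and the example host are hypothetical. Parameters whose values differ are only candidates, since A/B testing or per-request nonces would also produce differences.

```python
from urllib.parse import parse_qsl, urlsplit


def differing_params(url_a: str, url_b: str) -> dict:
    """Return query parameters present in both URLs whose values differ.

    URLs that point at different hosts or paths are not comparable, so an
    empty dict is returned for them.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.netloc, a.path) != (b.netloc, b.path):
        return {}
    params_a = dict(parse_qsl(a.query, keep_blank_values=True))
    params_b = dict(parse_qsl(b.query, keep_blank_values=True))
    return {
        name: (params_a[name], params_b[name])
        for name in params_a.keys() & params_b.keys()
        if params_a[name] != params_b[name]
    }
```

For instance, `differing_params("http://t.example/p?u=abc&c=7", "http://t.example/p?u=xyz&c=7")` flags only `u`, which would then be inspected by hand or compared across additional registrations.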

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads. https://adblockplus.org. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source, https://www.campaignmonitor.com/css (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.

[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor MITM...: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of TLS-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.
[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet Jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party HTTP requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and controlling PII leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.

[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of PII via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.
[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parent; thus the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
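The candidate ordering above can be expressed as a sort key. The sketch below is illustrative and is not the crawler's code: `Node` is a hypothetical stand-in for a DOM element, the `z_index`, `html`, and `num_inputs` fields are assumed to have been extracted already, and treating a missing z-index as 0 at the root is our simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element."""
    z_index: str = "auto"            # CSS z-index as written, e.g. "auto" or "100"
    parent: Optional["Node"] = None  # enclosing element, None at the root
    html: str = ""                   # serialized HTML of the element
    num_inputs: int = 0              # number of input fields inside the form


def effective_z_index(node: Node) -> int:
    """Walk up the parents until a non-auto z-index is found (0 if none)."""
    current = node
    while current is not None:
        if current.z_index != "auto":
            return int(current.z_index)
        current = current.parent
    return 0


def rank_key(form: Node):
    """Higher tuples win: topmost, modal-like, not a login form, more inputs."""
    html = form.html.lower()
    is_modal = "modal" in html or "dialog" in html
    is_login = any(word in html for word in ("login", "log in", "sign in"))
    return (effective_z_index(form), is_modal, not is_login, form.num_inputs)


def best_form(candidates):
    """Return the preferred candidate form according to rank_key."""
    return max(candidates, key=rank_key)
```

Tuple comparison applies the criteria in order, so the input-field count only matters when the earlier criteria tie.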

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
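As a rough sketch of the bottom-up search, the following uses Python's standard `xml.etree.ElementTree` on well-formed markup. Real pages would need a tolerant HTML parser and a live DOM; the helper names and the attribute hints checked in `is_email_input` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_email_input(el) -> bool:
    """Heuristic: an <input> whose common attributes mention 'email'."""
    if el.tag != "input":
        return False
    hints = " ".join(
        el.attrib.get(key, "") for key in ("type", "name", "id", "placeholder")
    )
    return "email" in hints.lower()


def has_submit(el) -> bool:
    """True if the element contains a <button> or an <input type='submit'>."""
    for child in el.iter():
        if child.tag == "button":
            return True
        if child.tag == "input" and child.attrib.get("type", "").lower() == "submit":
            return True
    return False


def find_logical_form(root):
    """Return the smallest ancestor of an email input that holds a submit control."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter("input"):
        if not is_email_input(el):
            continue
        container = parent_of.get(el)
        while container is not None:
            if has_submit(container):
                return container
            container = parent_of.get(container)
    return None
```

Walking upward from the email field and stopping at the first container with a submit control is what keeps the result small; searching downward from the page root would tend to return the whole page.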

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
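A simplified version of this classification might look as follows. The attribute list comes from the text above, but the keyword table `FIELD_KEYWORDS` and the function name `classify_field` are hypothetical; the crawler's actual keyword lists are not reproduced here.

```python
# Attributes that, per the survey above, most often carry contextual hints.
HINT_ATTRIBUTES = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword table, checked in order.
FIELD_KEYWORDS = [
    ("email", ("email", "e-mail")),
    ("phone", ("phone", "mobile")),
    ("password", ("password", "passwd")),
    ("name", ("name",)),
]


def classify_field(attrs: dict, inner_html: str = "") -> str:
    """Guess a field's purpose from its type attribute, then from hints."""
    declared = attrs.get("type", "").lower()
    if declared == "email":
        return "email"
    if declared == "tel":
        return "phone"
    # Fall back to keyword matching over attribute values and inner HTML.
    haystack = " ".join(attrs.get(attr, "") for attr in HINT_ATTRIBUTES)
    haystack = (haystack + " " + inner_html).lower()
    for field_type, keywords in FIELD_KEYWORDS:
        if any(keyword in haystack for keyword in keywords):
            return field_type
    return "unknown"
```

The explicit HTML5 type is trusted first because it is unambiguous; keyword matching is the fallback for the common case of a generic text input.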

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types other than text/*, such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
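The recursive pruning step can be illustrated with Python's standard `email` package. This is a sketch of the idea only: our server is implemented on SubEtha SMTP and JavaMail, and the function name `strip_non_text` is hypothetical.

```python
from email import message_from_string
from email.message import Message


def strip_non_text(msg: Message) -> Message:
    """Recursively drop subparts whose content type is not text/*.

    Multipart containers are kept and pruned in turn, so the tree
    structure of the message is preserved.
    """
    if msg.is_multipart():
        kept = [
            strip_non_text(part)
            for part in msg.get_payload()
            if part.is_multipart() or part.get_content_maintype() == "text"
        ]
        msg.set_payload(kept)
    return msg
```

A message parsed with `message_from_string` can be passed directly to `strip_non_text` and then serialized with `as_string()` before being written to disk.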

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
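To give a sense of how such a list is used, the sketch below builds a small set of candidate tokens for one address and searches a URL for them. It covers only a handful of the hashes and encodings listed above, applies each in a single layer, and uses hypothetical function names; the detection procedure in Section 4.1 is more extensive.

```python
import base64
import hashlib
import urllib.parse


def candidate_tokens(address: str) -> set:
    """Return single-layer hashes and encodings of an email address."""
    raw = address.encode("utf-8")
    tokens = {address, urllib.parse.quote(address, safe="")}
    # A few of the supported hashes, as lowercase hex digests.
    for name in ("md5", "sha1", "sha256", "sha384"):
        tokens.add(hashlib.new(name, raw).hexdigest())
    # A few of the supported encodings.
    tokens.add(base64.b16encode(raw).decode("ascii").lower())
    tokens.add(base64.b32encode(raw).decode("ascii"))
    tokens.add(base64.b64encode(raw).decode("ascii"))
    return tokens


def leaks_address(url: str, address: str) -> bool:
    """True if any candidate token for the address appears in the URL."""
    return any(token in url for token in candidate_tokens(address))
```

Matching here is an exact, case-sensitive substring test, which is an assumption of the sketch; a URL carrying an uppercase hex digest would need additional normalization.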

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party    Organization   Avg. add'l parties   # S   # E
pippio.com           Acxiom         5.7                  7     32
liadm.com*           LiveIntent     3.7                  68    1097
rlcdn.com            Acxiom         1.7                  11    551
imiclk.com           MediaMath      1.3                  2     4
mathtag.com          MediaMath      1.1                  11    382
alcmpn.com           ALC†           0.8                  6     132
emltrk.com           Litmus         0.7                  41    638
acxiom-online.com    Acxiom         0.4                  2     33
dyneml.com           PowerInbox     0.1                  3     13
adnxs.com            AppNexus       0.1                  19    277

Table 14: Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel.


Row  Request URL
0    http://inbox.washingtonexaminer.com/imp?[...]&e=<EMAIL>&p=0
1    http://p.liadm.com/imp?[...]&m=<MD5(address)>&sh=<SHA1(address)>&sh2=<SHA256(address)>&p=0&dom=<EMAIL_DOMAIN>
2    http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
3    http://x.bidswitch.net/ul_cb/sync?ssp=liveintent&bidder_id=5298&licd=3357&x=EGFM[...]
4    http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswitch_ssp_id=liveintent&_redirect=[...]
5    http://p.adsymptotic.com/d/px/?_pid=12688&_psign=d3e69[...]&bidswit[...]&_redirect=[...]&_expected_cookie=[...]
6    http://x.bidswitch.net/sync?dsp_id=126&user_id=84f3[...]&ssp=liveintent
7    http://i.liadm.com/s/19751?bidder_id=5298&licd=3357&bidder_uuid=<UUID_1>
8    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm&google_sc
9    http://cm.g.doubleclick.net/pixel?google_nid=liveintent_dbm&google_cm=&google_sc=&google_tc=
10   http://p.liadm.com/match_g?bidder_id=24314&bidder_uuid=<UUID_2>&google_cver=1
11   http://x.bidswitch.net/sync?ssp=liveintent&bidder_id=5298&licd=
12   http://pool.udsp.iponweb.net/sync?ssp=bidswitch&bidswitch_ssp_id=liveintent

Table 6: Redirect chain from a LiveIntent Email Tracking Pixel. URL query strings are truncated for clarity (using [...]).

least one resource when an email is reopened that wasn't present the first time. A small number of third parties are disproportionately responsible for this: they load different sets of additional third parties each time the email is opened (Table 14 in the Appendix).

The number of leaks between email loads stays relatively constant, with less than 50 emails leaking to new parties on the second load (see footnote 5). However, as the comparison of Table 14 with Table 5 shows, many of the top leak recipients are also responsible for redirecting to the highest number of new parties. Thus, reloading an email increases the number of potential recipients of a leak, if the redirectors share data based on the email or email hash they receive.

4.5 Case study: LiveIntent

LiveIntent receives email addresses from the largest number of senders, 68 in total. In this section, we analyze a sample of the request chains that result in leaks to LiveIntent. Table 6 shows an example redirect chain of a single pixel embedded in an email from the washingtonexaminer.com mailing list. The initial request (row 0) is to a subdomain of washingtonexaminer.com, and includes the user's plaintext email address in the e= query string parameter. The domain redirects to liadm.com (row 1), a LiveIntent domain, and includes the MD5, SHA1, and SHA256 hashes of the email address in the parameters m=, sh=, and sh2=.

Footnote 5: We exclude leaks which occur to a different IP address on the second load. This occurs in 349 emails, but is less meaningful given the dynamic nature of IP addresses.

The URL also includes the domain portion of the user's address.

In rows 2-12, the request redirects through several other domains and back to itself, exchanging what appear to be partner IDs and bidder IDs. In rows 7 and 10, LiveIntent receives a UUID from the domain in the previous request, which could allow it to exchange information with those trackers outside of the browser.

4.6 Request blockers help, but don't fix the problem

Privacy-conscious users often deploy blocking extensions, such as uBlock Origin, Privacy Badger, or Ghostery, to block tracking requests. Since webmail clients are browser-based, these blocking extensions can also filter requests that occur while displaying email content (see footnote 6). We use our blocked tag detection methodology (Section 4.1) to determine which resources would have been blocked by the popular EasyList and EasyPrivacy blocklists. We then examine the remaining requests to determine how frequently email addresses continue to leak.

Overall, the blocklists cut the number of third parties receiving leaked email addresses from any sender nearly in half, from 99 to 51. Likewise, the number of senders which leak email addresses in at least one email is greatly reduced, from 19% to just 7%. However, as Table 7 shows, a significant number of leaks of both

Footnote 6: Thunderbird supports most of the popular Firefox extensions, and as such, Thunderbird users can also deploy these defenses. See Table 12 for more details.

Encoding               # of Senders   # of Recipients
Plaintext Address      34 (3.7%)      34 (66.7%)
MD5                    21 (2.3%)      12 (23.5%)
SHA1                   14 (1.6%)      6 (11.8%)
URL Encoded Address    4 (0.4%)       4 (7.8%)
SHA256                 4 (0.4%)       2 (3.9%)
SHA384                 1 (0.1%)       1 (2.0%)

Table 7: Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain         # of Senders
mediawallahscript.com    7
jetlore.com              4
scrippsnetworks.com      4
alocdn.com               3
richrelevance.com        3
ivitrack.com             2
intentiq.com             2
gatehousemedia.com       2
realtime.email           2
ziffimages.com           2

Table 8: The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

plaintext and email hashes still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support JavaScript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit them individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href property of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. This helps ensure that we have as diverse a set of landing pages as possible by stripping fragment and query string identifiers that may not influence the landing page.
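The sampling strategy can be sketched as a round-robin draw over bins. The code below is illustrative rather than the code used in our crawl: `naive_ps1` is a crude stand-in that keeps the last two host labels, whereas a real PS+1 computation needs the Public Suffix List, and the fixed `seed` is only there to make the sketch reproducible.

```python
import random
from urllib.parse import urlsplit


def naive_ps1(hostname: str) -> str:
    """Crude PS+1 approximation: the last two labels of the hostname."""
    return ".".join(hostname.lower().split(".")[-2:])


def sample_links(links, limit=200, seed=0):
    """Sample up to `limit` unique links, one per (PS+1, path) bin per round."""
    rng = random.Random(seed)
    bins = {}
    for link in sorted(set(links)):
        parts = urlsplit(link)
        key = (naive_ps1(parts.hostname or ""), parts.path)
        bins.setdefault(key, []).append(link)
    for bucket in bins.values():
        rng.shuffle(bucket)

    sampled = []
    while bins and len(sampled) < limit:
        # One pass takes a single link from every bin that still has links.
        for key in list(bins):
            sampled.append(bins[key].pop())
            if not bins[key]:
                del bins[key]
            if len(sampled) >= limit:
                break
    return sampled
```

Taking one link per bin in each round means that links differing only in query string or fragment are the last to be drawn, which is what favors distinct landing pages.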

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both JavaScript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client's URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if the email address is contained either in the URL or the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset, there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that

Recipient Organization   # of Senders
Google                   247 (27.4%)
Facebook                 160 (17.7%)
Twitter                  94 (10.4%)
Adobe                    81 (9.0%)
Microsoft                73 (8.1%)
Pinterest                72 (8.0%)
LiveIntent               69 (7.6%)
Akamai                   69 (7.6%)
Acxiom                   68 (7.5%)
AppNexus                 61 (6.8%)

Table 9: The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain         # of Senders
google-analytics.com     200 (22.2%)
doubleclick.net          196 (21.7%)
google.com               159 (17.6%)
facebook.com             154 (17.1%)
facebook.net             145 (16.1%)
fonts.googleapis.com     102 (11.3%)
googleadservices.com     96 (10.6%)
twitter.com              94 (10.4%)
googletagmanager.com     87 (9.6%)
gstatic.com              78 (8.6%)

Table 10: The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph except the first are slight underestimates due to our limit of 200 links per sender.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient's privacy, namely the recipient's mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo Mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo Mail's client-side JavaScript, and the web browser is, again, Firefox. Or consider a user reading her university mail via Gmail's IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

Defense             Email server   Email client   Web browser
Content proxying    X
HTML filtering      X              X
Cookie blocking                    X              X
Referrer blocking   X              X              X
Request blocking                   X              X

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn't prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user's IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user's email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent), and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well, by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
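The server-side rewriting described above could look roughly like the following. This is a minimal sketch, not the behavior of any surveyed mail server: the function name `add_noreferrer` is hypothetical, and the regex-based rewriting is an assumption made for brevity where a production filter would use a real HTML parser.

```python
import re

# Referrer Policy delivered through a meta tag.
META = '<meta name="referrer" content="no-referrer">'

def add_noreferrer(html: str) -> str:
    """Add rel="noreferrer" to <a> tags and insert a no-referrer meta tag."""
    def fix_anchor(match: re.Match) -> str:
        tag = match.group(0)
        if re.search(r'\brel\s*=', tag, re.IGNORECASE):
            return tag  # this sketch leaves existing rel attributes untouched
        return tag[:-1] + ' rel="noreferrer">'

    html = re.sub(r'<a\b[^>]*>', fix_anchor, html, flags=re.IGNORECASE)
    if re.search(r'<head\b[^>]*>', html, re.IGNORECASE):
        # Place the policy right after the opening <head> tag.
        html = re.sub(r'(<head\b[^>]*>)', r'\1' + META, html,
                      count=1, flags=re.IGNORECASE)
    else:
        # No <head> element: prepend the policy to the fragment.
        html = META + html
    return html
```

A fuller version would merge `noreferrer` into an existing `rel` value instead of skipping such links.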

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter-list-based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients.⁷ Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

⁷ https://emailtracking.openwpm.com


Mail Client       Platform   Proxies Content   Blocks Images   Blocks Referrers   Blocks Cookies           Ext. Support
Gmail             Web        Yes               No              L: Yes, I: Yes†    Yes†                     Yes
Yahoo Mail        Web        No                Yes             L: Yes, I: No      No                       Yes
Outlook Web App   Web        No                Yes             No                 No                       Yes
Outlook.com       Web        No                No              No                 No                       Yes
Yandex Mail       Web        Yes               No              L: Yes, I: Yes†    Yes†                     Yes
GMX               Web        No                No              No                 No                       Yes
Zimbra            Web        No                Yes             No                 No                       Yes
163.com           Web        No                No              No                 No                       Yes
Sina              Web        No                No              No                 No                       Yes
Apple Mail        iOS        No                No              Yes                Yes                      No
Gmail             iOS        Yes               No              Yes                Yes                      No
Gmail             Android    Yes               No              Yes                Yes                      No
Apple Mail        Desktop    No                No              Yes                Yes                      No
Windows Mail      Desktop    No                No              Yes                No                       No
Outlook 2016      Desktop    No                Yes             Yes                No                       No
Thunderbird       Desktop    No                Yes             Yes                Optional (Default: No)   Yes

Table 12. A survey of the privacy-impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request-blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
* Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python, using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
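The filtering pipeline just described can be illustrated with a short sketch. It is not the prototype itself: it uses only the Python standard library, the two regular expressions in `BLOCK_RULES` are hypothetical stand-ins for compiled EasyList and EasyPrivacy rules, and the tag matching is regex-based for brevity where a real module would use an HTML parser and a filter-list library.

```python
import re
from email import message_from_string
from email.message import Message

# Hypothetical stand-ins for compiled filter-list rules (not real EasyList entries).
BLOCK_RULES = [re.compile(r"//tracker\.example/"), re.compile(r"/pixel\.gif\?")]

# Matches <img ...> and <link ...> tags and captures the src/href URL.
RESOURCE_TAG = re.compile(
    r'<(?:img|link)\b[^>]*?\b(?:src|href)\s*=\s*["\']([^"\']+)["\'][^>]*>',
    re.IGNORECASE,
)

def filter_html(html: str) -> str:
    """Strip embedded-resource tags whose URL matches any blocking rule."""
    def strip_if_blocked(match: re.Match) -> str:
        url = match.group(1)
        return "" if any(rule.search(url) for rule in BLOCK_RULES) else match.group(0)
    return RESOURCE_TAG.sub(strip_if_blocked, html)

def filter_message(msg: Message) -> Message:
    """Rewrite every text/html part of a parsed email in place."""
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            charset = part.get_content_charset() or "utf-8"
            html = part.get_payload(decode=True).decode(charset, errors="replace")
            # Drop the old transfer encoding so the new payload is re-encoded.
            del part["Content-Transfer-Encoding"]
            part.set_payload(filter_html(html), charset=charset)
    return msg
```

Because only URLs literally present in the HTML are inspected, a sketch like this shares the limitation of static filtering: it cannot see requests that arise from redirects.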

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                      # of Senders
li.<PS+1>/imp                    51 (5.7%)
partner.<PS+1>                    7 (0.7%)
stripe.<PS+1>/stripe/image        4 (0.4%)
p.<PS+1>/espopen                  4 (0.4%)
api.<PS+1>/layouts/section<N>     4 (0.4%)
<PS+1>/customer-service           3 (0.3%)
mi.<PS+1>/p/rp                    3 (0.3%)
dmtk.<PS+1>                       3 (0.3%)
links.<PS+1>/e/open               3 (0.3%)
eads.<PS+1>/imp                   3 (0.3%)

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
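The pattern-generation steps in the caption can be sketched as follows. This is an illustration under stated assumptions, not the script behind Table 13: the function name `url_pattern` is hypothetical, the public suffix plus one is approximated as the last two hostname labels (a real implementation would consult the Public Suffix List), and integers are replaced only in the path.

```python
import re
from urllib.parse import urlsplit

def url_pattern(url: str) -> str:
    """Reduce a request URL to a coarse structural pattern."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Assumption: PS+1 is the last two labels, e.g. example.com.
    host = ".".join(labels[:-2] + ["<PS+1>"])
    segments = parts.path.split("/")
    # Drop the final path segment when it ends with a file extension.
    if segments and re.search(r"\.[A-Za-z0-9]+$", segments[-1]):
        segments = segments[:-1]
    # Replace runs of digits in the remaining path with <N>.
    path = re.sub(r"\d+", "<N>", "/".join(segments))
    return host + path
```

Grouping leaking URLs by the returned string and counting distinct senders per group would then give a ranking of the kind shown in the table.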

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available.⁸ It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

⁸ https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
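The lookup argument can be made concrete in a few lines. The sketch below is purely illustrative: the `logged` table, the profile labels, and the choice of unsalted SHA-256 over the lowercased address are assumptions, not a description of any particular tracker.

```python
import hashlib

def sha256_hex(email: str) -> str:
    """Unsalted hash of a normalized email address."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

# Hypothetical records keyed by hashed email, as a third party might store them.
logged = {
    sha256_hex("alice@example.com"): "profile-17",
    sha256_hex("bob@example.com"): "profile-42",
}

def lookup(known_emails, records):
    """Return the records recoverable for addresses the adversary already knows."""
    return {e: records[sha256_hex(e)] for e in known_emails if sha256_hex(e) in records}
```

Calling `lookup(["alice@example.com"], logged)` recovers Alice's record without reversing any hash, which is why an unsalted, shared hash offers no protection against an adversary who starts from a list of addresses.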

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing occurs when viewing emails, a process in which different trackers exchange and link together their own IDs for the same user. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
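One way the proposed differential test might be sketched is shown below. This is an assumption-laden illustration of the idea, not something the study implemented: the function name `differing_params` is hypothetical, and it compares only query parameters of two requests to the same host and path.

```python
from urllib.parse import urlsplit, parse_qsl

def differing_params(url_a: str, url_b: str) -> set:
    """Query parameters whose values differ between two registrations' requests."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.hostname, a.path) != (b.hostname, b.path):
        return set()  # only compare requests to the same endpoint
    qa, qb = dict(parse_qsl(a.query)), dict(parse_qsl(b.query))
    return {key for key in qa.keys() & qb.keys() if qa[key] != qb[key]}
```

Parameters flagged this way are only candidates: per-message identifiers and A/B test variants also differ between recipients, so repeated observations across several emails per address would be needed to separate them from transformed email addresses.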

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads. https://adblockplus.org/. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup/. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to/. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source. https://www.campaignmonitor.com/css/ (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.


[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674-689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129-1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with html5 and etag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking/. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor mitm: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27-39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1-18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of tls-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016.

[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.
[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7-12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541-555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361-374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12-12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158-163, 2010.
[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? quantifying the leakage of pii via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20-33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481-1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121-132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from its parent; thus, the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
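The bottom-up search can be sketched on a small, well-formed markup tree. This is a simplified illustration, not the crawler's code: it uses Python's `xml.etree.ElementTree` in place of a live browser DOM, the function names are hypothetical, and the email check (the substring "email" in any attribute value) and the submit check are assumed stand-ins for the crawler's actual heuristics.

```python
import xml.etree.ElementTree as ET

def is_submit(el) -> bool:
    """Treat <button> and <input type="submit"> as submit controls."""
    return el.tag == "button" or (el.tag == "input" and el.get("type") == "submit")

def find_logical_form(root):
    """Return the smallest container holding an email input and a submit control."""
    # ElementTree has no parent pointers, so build a child-to-parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for field in root.iter("input"):
        hints = " ".join(field.attrib.values()).lower()
        if "email" not in hints:
            continue
        node = parent_of.get(field)
        while node is not None:
            if any(is_submit(el) for el in node.iter()):
                return node
            node = parent_of.get(node)  # climb one level and retry
    return None
```

Climbing from the input upward means the first match is the innermost container, which corresponds to the "smallest logical form unit" described above.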

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
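A field classifier along these lines might look like the sketch below. The attribute list comes from the paragraph above; everything else is an assumption for illustration, in particular the `FIELD_KEYWORDS` table, which is hypothetical and much cruder than a real keyword list would need to be.

```python
import re

# Attributes named in the text as the most frequent carriers of hints.
HINT_ATTRS = ["name", "class", "id", "placeholder", "value", "for", "title"]

# Hypothetical keyword table; checked in insertion order.
FIELD_KEYWORDS = {
    "email": r"e-?mail",
    "phone": r"phone|tel|mobile",
    "zip": r"zip|postal",
}

def guess_field_type(attrs: dict) -> str:
    """Guess what an <input> asks for from its type and contextual hints."""
    html_type = attrs.get("type", "text").lower()
    if html_type == "email":
        return "email"
    if html_type == "tel":
        return "phone"
    # Fall back to keyword matching over the hint-bearing attributes.
    hints = " ".join(str(attrs.get(a, "")) for a in HINT_ATTRS).lower()
    for field, pattern in FIELD_KEYWORDS.items():
        if re.search(pattern, hints):
            return field
    return "unknown"
```

Substring patterns such as `tel` will also match unrelated words, so a practical classifier would need word-boundary handling and a ranked keyword list.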

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure, first on any pop-up windows and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts whose content types are not text (text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
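The recursive pruning step can be illustrated with Python's standard `email` package. This is only an analogue for exposition: the server described above uses SubEtha SMTP and JavaMail, and the function name `strip_non_text` and the sample message are assumptions.

```python
from email import message_from_string
from email.message import Message

def strip_non_text(msg: Message) -> Message:
    """Recursively drop leaf subparts whose content type is not text/*."""
    if msg.is_multipart():
        kept = []
        for part in msg.get_payload():
            if part.is_multipart():
                kept.append(strip_non_text(part))  # descend into nested multiparts
            elif part.get_content_maintype() == "text":
                kept.append(part)
        msg.set_payload(kept)
    return msg

RAW = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed; boundary=XX\n"
    "\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>hi</p>\n"
    "--XX\n"
    "Content-Type: image/png\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "iVBORw0KGgo=\n"
    "--XX--\n"
)
msg = strip_non_text(message_from_string(RAW))
print([part.get_content_type() for part in msg.walk()])
```

Only leaves are tested against text/*; multipart containers are kept so that the tree structure of the remaining text parts is preserved.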

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
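To make the role of these lists concrete, the sketch below derives a handful of candidate tokens from an email address and searches a request URL for them. It covers only a small, single-layer subset of the hashes and encodings listed above, using Python's standard library; the helper names `candidate_tokens` and `find_leaks` are hypothetical and this is not the detection code used in the study.

```python
import base64
import hashlib
import urllib.parse
import zlib

def candidate_tokens(email):
    """Map a label to one encoded or hashed form of the address (single layer only)."""
    raw = email.encode("utf-8")
    tokens = {
        "plaintext": email,
        "urlencoded": urllib.parse.quote(email, safe=""),
        "base64": base64.b64encode(raw).decode("ascii"),
        "base16": base64.b16encode(raw).decode("ascii").lower(),
        "crc32": format(zlib.crc32(raw), "08x"),
    }
    for name in ("md5", "sha1", "sha256", "sha384"):
        tokens[name] = hashlib.new(name, raw).hexdigest()
    return tokens

def find_leaks(url, email):
    """Return the labels of all tokens that appear in the URL."""
    lowered = url.lower()  # hex digests may be sent in upper case
    return sorted(
        name for name, tok in candidate_tokens(email).items()
        if tok in url or tok in lowered
    )

email = "user@example.com"
print(find_leaks("https://t.example/o?e=user%40example.com", email))
```

A fuller detector would also apply the transformations in combination (for example, a hash of an encoded address), which this sketch does not attempt.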

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party    Organization    Avg. # add'l parties    # S    # E
pippio.com           Acxiom          5.7                     7      32
liadm.com*           LiveIntent      3.7                     68     1097
rlcdn.com            Acxiom          1.7                     11     551
imiclk.com           MediaMath       1.3                     2      4
mathtag.com          MediaMath       1.1                     11     382
alcmpn.com           ALC†            0.8                     6      132
emltrk.com           Litmus          0.7                     41     638
acxiom-online.com    Acxiom          0.4                     2      33
dyneml.com           PowerInbox      0.1                     3      13
adnxs.com            AppNexus        0.1                     19     277

Table 14: Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total, and the number of emails (# E) out of 12,618 total, on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel.

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don't fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload


Encoding               # of Senders    # of Recipients
Plaintext Address      34 (3.7%)       34 (66.7%)
MD5                    21 (2.3%)       12 (23.5%)
SHA1                   14 (1.6%)       6 (11.8%)
URL Encoded Address    4 (0.4%)        4 (7.8%)
SHA256                 4 (0.4%)        2 (3.9%)
SHA384                 1 (0.1%)        1 (2.0%)

Table 7: Encodings used in leaks to third parties after filtering requests with EasyList and EasyPrivacy. Totals are given out of 902 email senders and 51 third-party leak recipients.

Recipient Domain         # of Senders
mediawallahscript.com    7
jetlore.com              4
scrippsnetworks.com      4
alocdn.com               3
richrelevance.com        3
ivitrack.com             2
intentiq.com             2
gatehousemedia.com       2
realtime.email           2
ziffimages.com           2

Table 8: The top third-party leak recipient domains after filtering requests with EasyList and EasyPrivacy. All recipients receive leaks from less than 1% of the 902 senders studied.

plaintext and email hashes still occur. In Table 8 we see that there are still several third-party domains which receive email address leaks despite blocking. Several of these domains are known trackers which could be included in the blocklists. In addition, IP addresses and CDN domains are still recipients of leaked email addresses. Blocking on other URL features, such as the URL path, could help reduce leaks to these domains.

5 Privacy leaks when clicking links in emails

In Section 4 we explore the privacy impact of a user opening and rendering an email. In this section we explore the privacy impact of a user clicking links within an email. Once a user clicks a link in an email, the link is typically opened in a web browser. Unlike email clients, web browsers will typically support JavaScript and advanced features of HTML, creating many potential avenues for privacy leaks. However, the only way an email address can propagate to a page visit is through the direct embedding of the address in a link contained in the original email body.

5.1 Measurement methodology

Sampling links from emails. To evaluate the privacy leaks which occur when links in emails are clicked, we generate a dataset of links from the HTML content of all emails and visit them individually in an instrumented browser. To extract the links from mail content, we parse all email bodies with BeautifulSoup [2] and extract the href property of all <a> tags. We sample up to 200 unique links per sender using the following sampling strategy. First, we bin links across all emails from a sender by the PS+1 and path of the link. Next, we sample one link from each bin without replacement until there are no more links or we reach a limit of 200. Because binning ignores fragment and query string identifiers that may not influence the landing page, this helps ensure that we have as diverse a set of landing pages as possible.
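The binning and round-robin sampling can be sketched as below. Two assumptions are made for brevity and are not from the paper: `ps_plus_one` is a naive last-two-labels stand-in for a real Public Suffix List lookup, and `sample_links` with its `seed` parameter is a hypothetical helper.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit

def ps_plus_one(host):
    # Assumption: naive stand-in for a Public Suffix List lookup (wrong for e.g. .co.uk).
    return ".".join(host.lower().split(".")[-2:])

def sample_links(links, limit=200, seed=0):
    """Bin links by (PS+1, path), then draw one link per bin per pass until the limit."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for link in set(links):  # unique links only
        parts = urlsplit(link)
        bins[(ps_plus_one(parts.hostname or ""), parts.path)].append(link)
    for bucket in bins.values():
        bucket.sort()
        rng.shuffle(bucket)
    sampled = []
    while len(sampled) < limit and any(bins.values()):
        for bucket in bins.values():
            if bucket and len(sampled) < limit:
                sampled.append(bucket.pop())  # without replacement
    return sampled

links = [
    "https://news.example.com/a?uid=1",
    "https://news.example.com/a?uid=2",
    "https://www.example.com/b",
]
picked = sample_links(links, limit=2)
print(sorted(urlsplit(u).path for u in picked))
```

With a limit of two, the two links that differ only in their query string compete for one slot, so both landing pages are represented.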

Simulating link clicks. To simulate a user clicking a link, we visit each link in an OpenWPM instance using a fresh browser profile. The browser fully loads the page and sleeps for 10 seconds before closing. Unlike the email viewing simulation (Section 4), we enable both JavaScript and Referer headers. This simulation replicates what happens when a link is clicked in a standalone email client: only the URL of the clicked link is passed to the browser for handling. In a webmail client, the initial request resulting from the click may also contain a cookie and a Referer header containing the email client's URL. We do not simulate these headers in our crawl.

Detecting email address leakage. To detect leakage of email addresses, we use the procedure described in Section 4.1. Since the Referer header is enabled for these measurements, we consider a party to have received a leak if the address is contained either in the URL or the Referer header of the resource request to that party. Email addresses can also be shared with the party through the Cookie header, request POST bodies, websocket connections, WebRTC connections, and so on. We consider these out of scope for this analysis.

5.2 Results

We found that about 11% of links contain requests that leak the email address to a third party. About 12% of all emails contain at least one such link, and among this subset, there are an average of 3.5 such links per email. The percentage of the 902 senders that leak the email address in at least one link in one email is higher: 35.5%. Finally, there were over 1,400 distinct third parties that received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph, except the first, are slight underestimates due to our limit of 200 links per sender.

Recipient Organization    # of Senders
Google                    247 (27.4%)
Facebook                  160 (17.7%)
Twitter                   94 (10.4%)
Adobe                     81 (9.0%)
Microsoft                 73 (8.1%)
Pinterest                 72 (8.0%)
LiveIntent                69 (7.6%)
Akamai                    69 (7.6%)
Acxiom                    68 (7.5%)
AppNexus                  61 (6.8%)

Table 9: The top leak recipient organizations based on a sample of simulated link clicks. All values are out of 902 total senders.

Recipient Domain         # of Senders
google-analytics.com     200 (22.2%)
doubleclick.net          196 (21.7%)
google.com               159 (17.6%)
facebook.com             154 (17.1%)
facebook.net             145 (16.1%)
fonts.googleapis.com     102 (11.3%)
googleadservices.com     96 (10.6%)
twitter.com              94 (10.4%)
googletagmanager.com     87 (9.6%)
gstatic.com              78 (8.6%)

Table 10: The top leak recipient domains based on a sample of simulated link clicks. All values are out of 902 senders.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient's privacy, namely the recipient's mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail's client-side JavaScript, and the web browser is, again, Firefox. Or consider a user reading her university mail via Gmail's IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

Defense              Email server    Email client    Web browser
Content proxying     X
HTML filtering       X               X
Cookie blocking                      X               X
Referrer blocking    X               X               X
Request blocking                     X               X

Table 11: Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn't prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user's IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user's email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent), and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful, since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well, by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
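The server-side rewrite mentioned at the end of the paragraph above might look like the following sketch, built on Python's standard `html.parser`. It is an illustration under stated assumptions rather than a deployed filter: the class name `NoReferrerRewriter` is hypothetical, any existing rel value is replaced instead of merged, and the meta tag is only added when the email already contains a <head> element.

```python
from html import escape
from html.parser import HTMLParser

class NoReferrerRewriter(HTMLParser):
    """Re-emit HTML, forcing rel="noreferrer" on links and adding a referrer policy."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit(self, tag, attrs, close=">"):
        if tag == "a":
            # Replace any existing rel value; a fuller version would merge tokens.
            attrs = [(k, v) for k, v in attrs if k != "rel"] + [("rel", "noreferrer")]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"' for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}{close}")
        if tag == "head":
            self.out.append('<meta name="referrer" content="no-referrer">')

    def handle_starttag(self, tag, attrs):
        self._emit(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def add_referrer_protection(html):
    rewriter = NoReferrerRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)

SAMPLE = '<html><head></head><body><a href="https://example.com/?a=1&amp;b=2">x</a></body></html>'
print(add_referrer_protection(SAMPLE))
```

Attribute values are re-escaped on output because the parser hands them over already unescaped.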

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter-list-based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients (see footnote 7). Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore, we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

7 https://emailtracking.openwpm.com


Mail Client        Platform    Proxies Content    Blocks Images    Blocks Referrers    Blocks Cookies           Ext. Support
Gmail              Web         Yes                No               L: Yes, I: Yes†     Yes†                     Yes
Yahoo Mail         Web         No                 Yes              L: Yes, I: No       No                       Yes
Outlook Web App    Web         No                 Yes              No                  No                       Yes
Outlook.com        Web         No                 No               No                  No                       Yes
Yandex Mail        Web         Yes                No               L: Yes, I: Yes†     Yes†                     Yes
GMX                Web         No                 No               No                  No                       Yes
Zimbra             Web         No                 Yes              No                  No                       Yes
163.com            Web         No                 No               No                  No                       Yes
Sina               Web         No                 No               No                  No                       Yes
Apple Mail         iOS         No                 No               Yes                 Yes                      No
Gmail              iOS         Yes                No               Yes                 Yes                      No
Gmail              Android     Yes                No               Yes                 Yes                      No
Apple Mail         Desktop     No                 No               Yes                 Yes                      No
Windows Mail       Desktop     No                 No               Yes                 No                       No
Outlook 2016       Desktop     No                 Yes              Yes                 No                       No
Thunderbird        Desktop     No                 Yes              Yes                 Optional (Default: No)   Yes

Table 12: A survey of the privacy-impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request-blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
* Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
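The strip-and-rewrite step can be sketched with the standard library alone. This is not the prototype's code: the two regular expressions in `BLOCK_RULES` are hypothetical stand-ins for compiled EasyList and EasyPrivacy rules, no BlockListParser calls are used, and the class name `TrackerStripper` is an assumption.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-ins for rules compiled from EasyList / EasyPrivacy.
BLOCK_RULES = [
    re.compile(r"^https?://([^/]+\.)?tracker\.example/"),
    re.compile(r"/pixel\.gif(\?|$)"),
]

class TrackerStripper(HTMLParser):
    """Copy HTML through unchanged, dropping <img>/<link> tags whose URL is blocked."""
    URL_ATTR = {"img": "src", "link": "href"}

    def __init__(self, rules):
        super().__init__(convert_charrefs=False)
        self.rules, self.out, self.removed = rules, [], []

    def handle_starttag(self, tag, attrs):
        attr = self.URL_ATTR.get(tag)
        url = dict(attrs).get(attr) if attr else None
        if url and any(rule.search(url) for rule in self.rules):
            self.removed.append(url)  # strip the embedded resource
            return
        self.out.append(self.get_starttag_text())  # keep the original tag text

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def filter_html(html, rules=BLOCK_RULES):
    stripper = TrackerStripper(rules)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out), stripper.removed

BODY = (
    '<p>Hello</p><img src="https://tracker.example/open?e=abc">'
    '<img src="https://cdn.example.org/logo.png">'
)
cleaned, removed = filter_html(BODY)
print(cleaned)
```

Comments and doctype declarations are dropped by this sketch; a production filter would pass them through as well.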

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak the email address to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                       # of Senders
li.<PS+1>/imp                     51 (5.7%)
partner.<PS+1>/                   7 (0.7%)
stripe.<PS+1>/stripeimage         4 (0.4%)
p.<PS+1>/espopen                  4 (0.4%)
api.<PS+1>/layoutssection<N>      4 (0.4%)
<PS+1>/customer-service           3 (0.3%)
mi.<PS+1>/prp                     3 (0.3%)
dmtk.<PS+1>/                      3 (0.3%)
links.<PS+1>/eopen                3 (0.3%)
eads.<PS+1>/imp                   3 (0.3%)

Table 13: The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
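The pattern-generation steps in the caption of Table 13 can be sketched as follows. Several details are assumptions rather than the authors' implementation: `ps_plus_one` is a naive last-two-labels stand-in for a Public Suffix List lookup, integers are replaced in the path only, and a "file extension" is taken to be a dot followed by one to five alphanumeric characters.

```python
import re
from urllib.parse import urlsplit

def ps_plus_one(host):
    # Assumption: naive stand-in for a Public Suffix List lookup.
    return ".".join(host.split(".")[-2:])

def url_pattern(url):
    """Reduce a URL to hostname + path with <PS+1> and <N> placeholders."""
    parts = urlsplit(url)  # query string and fragment are discarded
    host = (parts.hostname or "").lower()
    base = ps_plus_one(host)
    host = host[: len(host) - len(base)] + "<PS+1>"
    segments = parts.path.split("/")
    if re.search(r"\.[A-Za-z0-9]{1,5}$", segments[-1]):
        segments = segments[:-1] + [""]  # drop a trailing file name
    path = re.sub(r"\d+", "<N>", "/".join(segments))
    return host + path

print(url_pattern("https://li.example.com/imp?e=abc"))
print(url_pattern("http://api.shop.co/layouts/section12/img.png"))
```

Grouping leaking URLs by the resulting string is what lets structurally identical endpoints on different first-party domains be counted together.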

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available (see footnote 8). It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
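A toy version of the dictionary attack makes the point concrete. The candidate local parts and domains below are invented for illustration, and `build_lookup` is a hypothetical helper; a real attacker would enumerate a vastly larger candidate set.

```python
import hashlib
from itertools import product

def build_lookup(local_parts, domains, algo="sha256"):
    """Precompute hash -> address for every candidate address (unsalted)."""
    table = {}
    for local, domain in product(local_parts, domains):
        address = f"{local}@{domain}"
        table[hashlib.new(algo, address.encode("utf-8")).hexdigest()] = address
    return table

# A hashed identifier as it might appear in a tracking URL.
observed = hashlib.sha256(b"jane.doe@example.com").hexdigest()

lookup = build_lookup(["john.doe", "jane.doe", "jdoe"], ["example.com", "example.org"])
print(lookup.get(observed))  # jane.doe@example.com
```

Because no salt is involved, the same table works against every party that receives the hash, and a single known address can be looked up with one hash computation.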

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing, a process in which different trackers exchange and link together their own IDs for the same user, occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
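The core comparison in differential testing is small enough to sketch. The function `differing_params` and the example URLs are hypothetical; the idea is simply to compare the same request as observed for two different registered addresses.

```python
from urllib.parse import parse_qsl, urlsplit

def differing_params(url_a, url_b):
    """Return query parameters present in both URLs whose values differ."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.hostname, a.path) != (b.hostname, b.path):
        return []  # not the same endpoint, so not comparable
    qa, qb = dict(parse_qsl(a.query)), dict(parse_qsl(b.query))
    return sorted(k for k in qa.keys() & qb.keys() if qa[k] != qb[k])

# The same tracking request as seen in emails to two different addresses.
first = "https://t.example/o?campaign=42&u=ab12cd"
second = "https://t.example/o?campaign=42&u=ff90e1"
print(differing_params(first, second))  # ['u']
```

A parameter flagged this way is only a candidate: per-message identifiers and A/B test variants also differ between recipients, which is the confound noted above.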

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References[1] Adblock Plus - Surf the web without annoying ads https

adblockplusorg Online accessed 2017-09-05[2] BeautifulSoup httpswwwcrummycomsoftware

BeautifulSoup Online accessed 2017-09-05[3] BlockListParser httpsgithubcomshivamagarwal-iitb

BlockListParser Online accessed 2017-09-05[4] EasyList and EasyPrivacy httpseasylistto Online

accessed 2017-09-05[5] uBlock Origin - An efficient blocker for Chromium and Fire-

fox Fast and lean httpsgithubcomgorhilluBlockOnline accessed 2017-09-05

[6] CSS Support Guide for Email Clients Campaign Sourcehttpswwwcampaignmonitorcomcss (Archive httpswwwwebcitationorg6rLLXBX0E) 2014

I never signed up for this Privacy implications of email tracking 16

[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.

[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. Fpdetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.

[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.

[10] Mika D Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with html5 and etag respawning. 2011.

[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.

[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.

[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.

[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.

[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J Alex Halderman. Neither snow nor rain nor mitm: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.

[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.

[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.

[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.

[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.

[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.

[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of tls-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.

[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.

[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.

[24] Balachander Krishnamurthy and Craig E Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.

[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.

[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.

[27] Timothy Libert. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication, 9:18, 2015.

[28] Jonathan R Mayer and John C Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.

[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.

[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.

[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.

[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.

[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.

[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.

[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.

[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.

[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.

[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? quantifying the leakage of pii via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.

[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.

[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.

[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.

[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.
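The link-selection step can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the term lists `RANKED_TERMS` and `BLACKLIST` are placeholders we made up (the real ranked terms are those of Table 1), and links are assumed to arrive as simple `(text, url, visible)` tuples.

```python
from urllib.parse import urlsplit

# ASSUMPTION: illustrative keyword lists only; the paper's ranked terms are in Table 1.
RANKED_TERMS = ["newsletter", "subscribe", "sign up", "register"]
BLACKLIST = ["unsubscribe", "mobile", "phone"]


def rank_links(links, page_host):
    """Return candidate URLs ordered best-first.

    links: iterable of (link_text, url, visible) tuples (a hypothetical format).
    """
    scored = []
    for text, url, visible in links:
        host = urlsplit(url).hostname  # None for relative URLs
        if not visible or (host is not None and host != page_host):
            continue  # drop invisible links and links to external sites
        lowered = text.lower()
        if any(term in lowered for term in BLACKLIST):
            continue  # the blacklist is applied to the link text
        haystack = lowered + " " + url.lower()
        for rank, term in enumerate(RANKED_TERMS):
            if term in haystack:
                scored.append((rank, url))
                break
    return [url for _, url in sorted(scored)]
```

A crawler built this way would click the first returned URL, run form detection, and fall back to the next URL if no form is found.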

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form’s input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site’s mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parent; thus the crawler recursively checks a form’s parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings “modal” or “dialog” within the form’s HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings “login”, “log in”, and “sign in” within a form’s HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
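The three ranking criteria above can be expressed as a single sort key. The sketch below is an illustration under stated assumptions: `Element` is a hypothetical stand-in for a DOM node, and treating an all-`auto` ancestor chain as z-index 0 is our choice, not something the text specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element."""
    html: str = ""
    z_index: str = "auto"
    parent: Optional["Element"] = None
    num_inputs: int = 0


def effective_z_index(el):
    """Walk up the parents until a non-auto z-index is found."""
    node = el
    while node is not None:
        if node.z_index != "auto":
            return int(node.z_index)
        node = node.parent
    return 0  # ASSUMPTION: reached the root without an explicit value


def sort_key(form):
    html = form.html.lower()
    looks_modal = "modal" in html or "dialog" in html
    looks_login = any(s in html for s in ("login", "log in", "sign in"))
    # Smaller tuples win: topmost first, modal-named forms break ties,
    # login forms rank lower, then forms with more inputs are preferred.
    return (-effective_z_index(form), not looks_modal, looks_login, -form.num_inputs)


def best_form(candidates):
    return min(candidates, key=sort_key)
```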

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags) without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
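A bottom-up search of this kind might look like the sketch below. `Node` is a hypothetical minimal DOM model, and the `is_email_input` and `is_submit` heuristics are simplified assumptions rather than the crawler's real matching rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical minimal DOM node."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def is_email_input(node):
    # ASSUMPTION: any attribute value mentioning "email" marks an email field.
    if node.tag != "input":
        return False
    return "email" in " ".join(str(v) for v in node.attrs.values()).lower()


def is_submit(node):
    return node.tag == "button" or (
        node.tag == "input" and node.attrs.get("type") == "submit"
    )


def find_logical_form(root):
    """Return the smallest ancestor of an email input that also holds a submit button."""
    for node in descendants(root):
        if is_email_input(node):
            container = node.parent
            while container is not None:
                if any(is_submit(d) for d in descendants(container)):
                    return container
                container = container.parent
    return None
```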

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
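As a rough illustration of attribute-based type inference, the sketch below checks the declared type first and then falls back to keyword hints in the attributes listed above. The `KEYWORDS` table is an assumption made for the example; the paper does not list the keywords the crawler uses.

```python
HINT_ATTRS = ("name", "class", "id", "placeholder", "value", "for", "title")

# ASSUMPTION: illustrative keywords, checked in this order.
KEYWORDS = {
    "email": ("email", "e-mail"),
    "tel": ("phone", "tel", "mobile"),
    "zip": ("zip", "postal"),
}


def infer_field_type(attrs, inner_html=""):
    """Guess what a form field asks for from its attributes and inner HTML."""
    declared = attrs.get("type", "text").lower()
    if declared in ("email", "tel"):
        return declared  # the HTML5 type attribute is decisive when present
    hints = " ".join(attrs.get(a, "") for a in HINT_ATTRS) + " " + inner_html
    hints = hints.lower()
    for field_type, words in KEYWORDS.items():
        if any(word in hints for word in words):
            return field_type
    return "unknown"
```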

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure, first on any pop-up windows and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts whose content types are not text (text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
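The recursive pruning of non-text subparts can be sketched with Python's standard `email` package. Note that this swaps in Python for illustration; the server described above uses SubEtha SMTP and JavaMail, and `strip_non_text` is a name we introduce here.

```python
from email import message_from_bytes, policy


def strip_non_text(msg):
    """Recursively drop MIME subparts whose main type is not text.

    Returns the pruned message, or None if the part itself should be discarded.
    """
    if msg.is_multipart():
        kept = [strip_non_text(part) for part in msg.get_payload()]
        # Replace the list of subparts with only those that survived pruning.
        msg.set_payload([part for part in kept if part is not None])
        return msg
    return msg if msg.get_content_maintype() == "text" else None


def load_and_strip(raw_bytes):
    """Parse raw message bytes and prune them before storage."""
    return strip_non_text(message_from_bytes(raw_bytes, policy=policy.default))
```

Multipart containers are kept even when all of their children are removed, which preserves the tree structure at the cost of occasionally storing an empty container.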

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
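To illustrate how such a list might be used, the sketch below precomputes a few hashed and encoded forms of an email address and searches a request URL for any of them. It covers only a small subset of the functions listed above (those available in the Python standard library), applies a single layer of hashing or encoding, and is our own simplification rather than the paper's detection procedure.

```python
import base64
import hashlib
from urllib.parse import quote


def candidate_tokens(email):
    """Build a set of strings whose presence in a URL suggests a leak."""
    raw = email.encode("utf-8")
    tokens = {email, quote(email, safe="")}  # plaintext and URL-encoded
    for name in ("md5", "sha1", "sha256"):
        digest = hashlib.new(name, raw).hexdigest()
        tokens.update({digest, digest.upper()})  # hex digests in either case
    tokens.add(base64.b64encode(raw).decode("ascii"))
    tokens.add(base64.b32encode(raw).decode("ascii"))
    tokens.update({raw.hex(), raw.hex().upper()})  # base16
    return tokens


def leaks_email(url, email):
    return any(token in url for token in candidate_tokens(email))
```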

10.4 Top parties redirecting to new third parties on email reload

| Redirecting Party | Organization | Avg. # add’l parties | # S | # E |
|---|---|---|---|---|
| pippio.com | Acxiom | 5.7 | 7 | 32 |
| liadm.com* | LiveIntent | 3.7 | 68 | 1097 |
| rlcdn.com | Acxiom | 1.7 | 11 | 551 |
| imiclk.com | MediaMath | 1.3 | 2 | 4 |
| mathtag.com | MediaMath | 1.1 | 11 | 382 |
| alcmpn.com | ALC† | 0.8 | 6 | 132 |
| emltrk.com | Litmus | 0.7 | 41 | 638 |
| acxiom-online.com | Acxiom | 0.4 | 2 | 33 |
| dyneml.com | PowerInbox | 0.1 | 3 | 13 |
| adnxs.com | AppNexus | 0.1 | 19 | 277 |

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S), out of 902 total, and the number of emails (# E), out of 12,618 total, on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel.


| Recipient Organization | # of Senders |
|---|---|
| Google | 247 (27.4%) |
| Facebook | 160 (17.7%) |
| Twitter | 94 (10.4%) |
| Adobe | 81 (9.0%) |
| Microsoft | 73 (8.1%) |
| Pinterest | 72 (8.0%) |
| LiveIntent | 69 (7.6%) |
| Akamai | 69 (7.6%) |
| Acxiom | 68 (7.5%) |
| AppNexus | 61 (6.8%) |

Table 9. The top leak recipient organizations, based on a sample of simulated link clicks. All values are out of 902 total senders.

| Recipient Domain | # of Senders |
|---|---|
| google-analytics.com | 200 (22.2%) |
| doubleclick.net | 196 (21.7%) |
| google.com | 159 (17.6%) |
| facebook.com | 154 (17.1%) |
| facebook.net | 145 (16.1%) |
| fonts.googleapis.com | 102 (11.3%) |
| googleadservices.com | 96 (10.6%) |
| twitter.com | 94 (10.4%) |
| googletagmanager.com | 87 (9.6%) |
| gstatic.com | 78 (8.6%) |

Table 10. The top leak recipient domains, based on a sample of simulated link clicks. All values are out of 902 senders.

received the email address in one or more of our simulated link clicks. We expect that all statistics in this paragraph, except the first, are slight underestimates due to our limit of 200 links per sender.

Table 9 shows the top organizations that receive leaked email addresses, and Table 10 shows the top domains. Over a quarter of senders leak the email address to Google in at least one link.

The most striking difference between these results and the corresponding results for viewing emails is that these lists look very similar to the list of top third-party trackers [17], with the addition of a small number of organizations specific to email tracking. This motivates the privacy concern that identities could potentially be attached to third-party web tracking profiles.

6 Evaluation of defenses

6.1 Landscape of defenses

Defenses against tracking can be employed by several parties. We ignore mail senders and trackers themselves, since email tracking is a thriving commercial space, and our evidence suggests that senders by and large cooperate with trackers to leak email addresses. We instead focus on parties who have an incentive to protect the recipient’s privacy, namely the recipient’s mail server, mail user agent, and the web browser.

The lines between these roles can be blurry, so we illustrate with two examples. Consider a user reading Yahoo mail via Firefox. The email server is Yahoo, the email client is Firefox together with Yahoo mail’s client-side JavaScript, and the web browser is again Firefox. Or consider a user reading her university mail via Gmail’s IMAP feature on her iPhone. For our purposes, both the university and Gmail count as email servers, since either of them is in a position to employ defenses. The email client is the Gmail iOS app, and the web browser is Safari.

| Defense | Email server | Email client | Web browser |
|---|---|---|---|
| Content proxying | X | | |
| HTML filtering | X | X | |
| Cookie blocking | | X | X |
| Referrer blocking | X | X | X |
| Request blocking | | X | X |

Table 11. Applicability of each of the five possible defenses to each of the three contexts in which they may be deployed. An X indicates that the defense is applicable.

Table 11 summarizes the applicability of various defenses to the three roles. We discuss each in turn.

Content proxying. Email tracking is possible because of embedded content such as images and CSS (cascading style sheets). To prevent this, some email servers, notably Gmail, proxy embedded content. Thus, when the recipient views the email, the mail user agent does not make any requests to third parties.

This defense doesn’t prevent the recipient email address being leaked to third parties, since it is leaked by being encoded in the URL. In fact, it hinders efforts by the mail client to prevent email address leakage (see request blocking below). However, it prevents third parties from learning the user’s IP address, client device properties, and when the email was read (depending on how the proxy is configured). Most importantly, it prevents the third-party cookie from being sent, and thus prevents the third party from linking the user’s email address to a tracking profile. In this way it is a complement to cookie blocking.

This defense can be deployed by the email server. Conceivably the email client might have its own server component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7 we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient’s mail server (Mail Transfer Agent), and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful since it makes it harder for third parties to compile a profile of the recipient’s email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user’s mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well, by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
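The server-side rewriting described above could be sketched as follows. This is a deliberately simple regex-based illustration (a production filter would use a real HTML parser), and the function name `add_referrer_protection` is ours. Links that already carry a rel attribute are left untouched in this sketch.

```python
import re

META_TAG = '<meta name="referrer" content="no-referrer">'


def add_referrer_protection(html):
    """Add rel="noreferrer" to links and insert a Referrer Policy meta tag."""

    def fix_anchor(match):
        tag = match.group(0)
        if re.search(r"\brel\s*=", tag, re.IGNORECASE):
            return tag  # simplification: do not merge with an existing rel value
        return tag[:-1] + ' rel="noreferrer">'

    html = re.sub(r"<a\b[^>]*>", fix_anchor, html, flags=re.IGNORECASE)
    if re.search(r"<head\b[^>]*>", html, re.IGNORECASE):
        # Place the policy right after the opening <head> tag.
        html = re.sub(
            r"(<head\b[^>]*>)", r"\1" + META_TAG, html, count=1, flags=re.IGNORECASE
        )
    else:
        html = META_TAG + html
    return html
```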

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter list based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.
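To make the filter-list mechanism concrete, the sketch below translates a small subset of Adblock Plus filter syntax (the `||` domain anchor, the `^` separator, and the `*` wildcard) into regular expressions. It ignores rule options, exception rules, and element hiding, so it is an approximation of how real blockers interpret EasyList, not a faithful implementation; the example rules are invented.

```python
import re


def rule_to_regex(rule):
    """Translate a simplified Adblock Plus style rule into a compiled regex."""
    pattern = re.escape(rule)
    pattern = pattern.replace(r"\*", ".*")  # wildcard
    pattern = pattern.replace(r"\^", r"(?:[^\w\-.%]|$)")  # separator character
    if pattern.startswith(r"\|\|"):
        # "||" anchors the rule at the start of a domain or any subdomain.
        pattern = r"^[a-z][a-z0-9+.\-]*://(?:[^/?#]*\.)?" + pattern[4:]
    return re.compile(pattern, re.IGNORECASE)


def is_blocked(url, rules):
    return any(rule_to_regex(rule).search(url) for rule in rules)
```

For example, with the hypothetical rules `["||tracker.test^", "/pixel.gif?"]`, a request to `http://sub.tracker.test/a.png` would be blocked while `http://nottracker.test/` would not.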

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients.⁷ Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.
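A tester along these lines could be sketched as below. The base URL, sender address, and function names are hypothetical, and the actual tool (footnote 7) may be structured quite differently; the sketch only shows how a unique token ties the image and link requests back to one test email.

```python
import time
import uuid
from email.message import EmailMessage


def build_test_email(recipient, client_name, base_url="https://tester.example"):
    """Create a test email containing a uniquely named image and link."""
    token = uuid.uuid4().hex
    image_url = f"{base_url}/image/{token}.png"
    link_url = f"{base_url}/link/{token}"
    msg = EmailMessage()
    msg["To"] = recipient
    msg["From"] = "tester@tester.example"  # hypothetical sender address
    msg["Subject"] = f"Email privacy test ({client_name})"
    msg.set_content("This test requires an HTML-capable mail client.")
    msg.add_alternative(
        f'<html><body><img src="{image_url}">'
        f'<a href="{link_url}">Click here</a></body></html>',
        subtype="html",
    )
    return msg, token


def record_request(log, token, kind, ip, headers):
    """Store what a client revealed when it fetched the image or followed the link."""
    log.setdefault(token, []).append({
        "kind": kind,  # "image" or "link"
        "ip": ip,
        "timestamp": time.time(),
        "cookie": headers.get("Cookie"),
        "referer": headers.get("Referer"),
        "user_agent": headers.get("User-Agent"),
    })
```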

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

⁷ https://emailtracking.openwpm.com/

| Mail Client | Platform | Proxies Content | Blocks Images | Blocks Referrers | Blocks Cookies | Ext. Support |
|---|---|---|---|---|---|---|
| Gmail | Web | Yes | No | L: Yes, I: Yes† | Yes† | Yes |
| Yahoo Mail | Web | No | Yes | L: Yes, I: No | No | Yes |
| Outlook Web App | Web | No | Yes | No | No | Yes |
| Outlook.com | Web | No | No | No | No | Yes |
| Yandex Mail | Web | Yes | No | L: Yes, I: Yes† | Yes† | Yes |
| GMX | Web | No | No | No | No | Yes |
| Zimbra | Web | No | Yes | No | No | Yes |
| 163.com | Web | No | No | No | No | Yes |
| Sina | Web | No | No | No | No | Yes |
| Apple Mail | iOS | No | No | Yes | Yes | No |
| Gmail | iOS | Yes | No | Yes | Yes | No |
| Gmail | Android | Yes | No | Yes | Yes | No |
| Apple Mail | Desktop | No | No | Yes | Yes | No |
| Windows Mail | Desktop | No | No | Yes | No | No |
| Outlook 2016 | Desktop | No | Yes | Yes | No | No |
| Thunderbird | Desktop | No | Yes | Yes | Optional (Default: No) | Yes |

Table 12. A survey of the privacy impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests “I” and with link clicks “L”), blocks external resources from setting cookies, and whether or not the client supports request blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
* Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
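The stripping step can be illustrated with a short sketch. To avoid guessing at the BlockListParser API, this example swaps in a plain set of blocked domains (`BLOCKED_DOMAINS`, a made-up stand-in for EasyList and EasyPrivacy matching) and uses a regular expression in place of a full HTML parser; it is not the prototype's code.

```python
import re
from urllib.parse import urlsplit

# ASSUMPTION: a toy blocklist standing in for EasyList/EasyPrivacy rule matching.
BLOCKED_DOMAINS = {"tracker.test"}

# Matches <img ...> and <link ...> tags and captures their src/href URL.
RESOURCE_TAG = re.compile(
    r"""<(?:img|link)\b[^>]*?\b(?:src|href)\s*=\s*["']([^"']+)["'][^>]*>""",
    re.IGNORECASE,
)


def url_is_blocked(url):
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def filter_html(html):
    """Remove embedded image and stylesheet tags whose URL is on the blocklist."""
    return RESOURCE_TAG.sub(
        lambda m: "" if url_is_blocked(m.group(1)) else m.group(0), html
    )
```

Applied to the text/html parts of a stored message, a filter of this shape removes the tracking tag entirely, so the mail client never issues the request.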

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren’t identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don’t support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren’t blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user’s email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent’s API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

| URL Pattern | # of Senders |
|---|---|
| li.<PS+1>/imp | 51 (5.7%) |
| partner.<PS+1>/ | 7 (0.7%) |
| stripe.<PS+1>/stripe/image | 4 (0.4%) |
| p.<PS+1>/espopen | 4 (0.4%) |
| api.<PS+1>/layouts/section<N> | 4 (0.4%) |
| <PS+1>/customer-service | 3 (0.3%) |
| mi.<PS+1>/p/rp | 3 (0.3%) |
| dmtk.<PS+1>/ | 3 (0.3%) |
| links.<PS+1>/e/open | 3 (0.3%) |
| eads.<PS+1>/imp | 3 (0.3%) |

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender’s emails. All values are given out of the total of 902 senders studied.
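The pattern-generation steps in the caption can be sketched as a small function. One loud assumption: `naive_ps1` simply takes the last two hostname labels, which is wrong for suffixes such as co.uk; a faithful version would consult the Public Suffix List.

```python
import re
from urllib.parse import urlsplit


def naive_ps1(host):
    # ASSUMPTION: last two labels approximate the public suffix plus one.
    return ".".join(host.split(".")[-2:])


def url_pattern(url):
    """Reduce a request URL to a structural pattern, as described for Table 13."""
    parts = urlsplit(url)  # the query string and fragment are dropped
    host = parts.hostname or ""
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    if re.search(r"\.[A-Za-z0-9]+$", last_segment):
        path = path.rsplit("/", 1)[0]  # strip a trailing file name
    ps1 = naive_ps1(host)
    subdomain = host[: len(host) - len(ps1)]
    # Replace integers before inserting the placeholder, since "<PS+1>" contains a digit.
    return (
        re.sub(r"\d+", "<N>", subdomain) + "<PS+1>" + re.sub(r"\d+", "<N>", path)
    )
```

Grouping leaking URLs by this pattern and counting distinct senders per group would yield a ranking of the kind shown in the table.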

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available.⁸ It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient’s IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

⁸ https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary’s goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
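The lookup attack in the last paragraph takes only a few lines. In this sketch the log format (pairs of hashed identifier and record), the choice of SHA-256, and the lowercase normalization are all assumptions made for illustration.

```python
import hashlib


def hash_email(email):
    # ASSUMPTION: trackers normalize by trimming and lowercasing before hashing.
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def find_records(log, known_emails):
    """Return (email, record) pairs for every log entry matching a known address.

    log: iterable of (hashed_identifier, record) pairs, a hypothetical format.
    """
    targets = {hash_email(email): email for email in known_emails}
    return [(targets[h], record) for h, record in log if h in targets]
```

Because the hash is unsalted and deterministic, the adversary needs no inversion step at all: hashing the target address once is enough to index into the log.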

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second our corpus of emails is not intended to berepresentative and we are unable to draw conclusionsabout the extent of tracking in the typical userrsquos mail-box

Third our simulation of a user viewing emails as-sumes a permissive user agent We expect that thisclosely approximates a webmail setup with defaultbrowser settings (on browsers except Safari whichblocks third-party cookies by default) but we have nottested this assumption

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing, a process in which different trackers exchange and link together their own IDs for the same user, occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies, advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
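The core comparison in differential testing could look roughly like the sketch below. It is a hypothetical illustration rather than an implemented part of this work; the function names and the decision to compare only query-string parameters on the same host and path are assumptions.

```python
from urllib.parse import parse_qsl, urlsplit


def endpoint(url: str) -> tuple:
    """Scheme-independent identity of a request: (hostname, path)."""
    parts = urlsplit(url)
    return (parts.hostname, parts.path)


def differing_parameters(url_a: str, url_b: str) -> set:
    """Query parameters shared by both URLs whose values differ.

    url_a and url_b are assumed to come from emails sent by the same site
    to two different registered addresses.
    """
    if endpoint(url_a) != endpoint(url_b):
        return set()
    a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    return {name for name in a.keys() & b.keys() if a[name] != b[name]}
```

A parameter flagged this way is only a candidate: campaign or A/B variant identifiers also differ between recipients, so each candidate would still need to be checked across several emails.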

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads. https://adblockplus.org. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source, https://www.campaignmonitor.com/css (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.
[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. Fpdetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with html5 and etag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor mitm: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of tls-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.
[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.
[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? quantifying the leakage of pii via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.
[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.
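The link selection step can be sketched as follows. This is a simplified illustration, not the crawler's code: the term list and blacklist below are hypothetical stand-ins (the actual ranked terms are those of Table 1), and the Link record is an assumed representation of an <a> tag.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical stand-ins for the ranked terms of Table 1 and for the blacklist.
RANKED_TERMS = ["newsletter", "subscribe", "sign up", "register"]
BLACKLIST = ["unsubscribe", "mobile", "phone"]


@dataclass
class Link:
    text: str
    url: str
    visible: bool = True


def rank(link, landing_host):
    """Return the rank of the best matching term (0 is best), or None to skip."""
    if not link.visible:
        return None
    host = urlsplit(link.url).hostname
    if host is not None and host != landing_host:
        return None  # link to an external site
    text = link.text.lower()
    if any(word in text for word in BLACKLIST):
        return None
    haystack = text + " " + link.url.lower()
    for position, term in enumerate(RANKED_TERMS):
        if term in haystack:
            return position
    return None


def order_links(links, landing_host):
    """Links worth clicking, best rank first."""
    scored = [(rank(link, landing_host), link) for link in links]
    matches = [(score, link) for score, link in scored if score is not None]
    matches.sort(key=lambda pair: pair[0])  # stable: ties keep page order
    return [link for _, link in matches]
```

The crawler would then click through the returned links in order until the form-finding procedure succeeds.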

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from its parent; thus the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
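The three criteria above can be expressed as a single sort key. The sketch below is illustrative only: Element is an assumed, simplified stand-in for a DOM node, and treating an all-auto ancestor chain as z-index 0 is an assumption of the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    html: str = ""
    z_index: str = "auto"  # CSS value as a string: "auto" or an integer
    parent: Optional["Element"] = None
    input_count: int = 0


def effective_z_index(element):
    """Walk up the parents until a non-auto z-index is found (0 at the root)."""
    node = element
    while node is not None:
        if node.z_index != "auto":
            return int(node.z_index)
        node = node.parent
    return 0


def sort_key(form):
    html = form.html.lower()
    looks_modal = "modal" in html or "dialog" in html
    looks_login = any(s in html for s in ("login", "log in", "sign in"))
    # Tuples compare element by element, so earlier criteria dominate:
    # stacking order, modal tie-break, non-login preference, field count.
    return (effective_z_index(form), looks_modal, not looks_login, form.input_count)


def best_candidate(forms):
    return max(forms, key=sort_key)
```

Encoding the criteria as a tuple keeps their priority order explicit: a later criterion only matters when all earlier ones tie.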

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
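A minimal sketch of the bottom-up search follows. Node is an assumed toy DOM representation, and the email-field and submit-button tests are simplified placeholders for the crawler's actual heuristics.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def descendants(self):
        for child in self.children:
            yield child
            yield from child.descendants()


def is_email_field(node):
    if node.tag != "input":
        return False
    hints = " ".join(str(value) for value in node.attrs.values()).lower()
    return node.attrs.get("type") == "email" or "email" in hints


def is_submit(node):
    if node.tag == "button":
        return node.attrs.get("type", "submit") == "submit"
    return node.tag == "input" and node.attrs.get("type") == "submit"


def find_logical_form(root):
    """Smallest ancestor of an email field that also contains a submit button."""
    for node in root.descendants():
        if is_email_field(node):
            container = node.parent
            while container is not None:
                if any(is_submit(d) for d in container.descendants()):
                    return container
                container = container.parent
    return None
```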

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
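As an illustration of this two-stage check (type attribute first, contextual hints second), consider the sketch below. The attribute list comes from the paragraph above; the keyword table and the returned labels are hypothetical and much smaller than what a real crawler would need.

```python
import re

HINT_ATTRIBUTES = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword table; separators are normalized to spaces before matching.
FIELD_KEYWORDS = {
    "email": ("email", "e mail"),
    "phone": ("phone", "mobile"),
    "zip": ("zip", "postal"),
    "first_name": ("first name", "firstname", "fname"),
}


def classify_field(attrs, inner_html=""):
    """Guess what a field asks for from its type attribute, then from hints."""
    input_type = (attrs.get("type") or "text").lower()
    if input_type == "email":
        return "email"
    if input_type == "tel":
        return "phone"
    hints = " ".join(attrs.get(name) or "" for name in HINT_ATTRIBUTES)
    hints = re.sub(r"[_\-]+", " ", (hints + " " + inner_html).lower())
    for field_type, keywords in FIELD_KEYWORDS.items():
        if any(keyword in hints for keyword in keywords):
            return field_type
    return "unknown"
```

Plain substring matching is crude (short keywords can match inside unrelated words), which is one reason the keyword table here avoids very short terms.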

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows, and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types other than text/*, such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
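The recursive pruning step can be sketched with Python's standard email package. Note that this is a swapped-in illustration: the server described above uses the JavaMail API in Java, and the function names below are assumptions of the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def strip_non_text(msg) -> None:
    """Recursively drop leaf subparts whose content type is not text/*."""
    if not msg.is_multipart():
        return
    kept = []
    for part in msg.get_payload():
        if part.is_multipart():
            strip_non_text(part)  # descend into nested multipart containers
            kept.append(part)
        elif part.get_content_maintype() == "text":
            kept.append(part)
    msg.set_payload(kept)


def load_and_strip(raw: bytes):
    """Parse raw message bytes and prune non-text subparts before storage."""
    msg = message_from_bytes(raw, policy=policy.default)
    strip_non_text(msg)
    return msg


def build_demo_message() -> bytes:
    """A small multipart message with a text, an HTML, and an image part."""
    msg = EmailMessage()
    msg["Subject"] = "Sale"
    msg.set_content("plain body")
    msg.add_alternative("<p>html body</p>", subtype="html")
    msg.add_attachment(b"\x89PNG\r\n", maintype="image", subtype="png",
                       filename="banner.png")
    return msg.as_bytes()
```

Calling load_and_strip(build_demo_message()) leaves the multipart containers and the two text parts in place and drops the image attachment.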

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
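One way such a list can be used is to precompute every layered transformation of the address up to a fixed depth and then search request URLs for the results. The sketch below is an assumption-laden illustration, not the detection code used in this study: it covers only a small subset of the transforms listed above (those available in the Python standard library), and the function names, the depth limit, and the use of lowercase hex digests are choices made for the example.

```python
import base64
import hashlib
import zlib
from urllib.parse import quote


def _hex(hash_fn):
    return lambda data: hash_fn(data).hexdigest().encode("ascii")


# Subset of the supported transforms; each maps bytes to bytes.
TRANSFORMS = {
    "md5": _hex(hashlib.md5),
    "sha1": _hex(hashlib.sha1),
    "sha256": _hex(hashlib.sha256),
    "crc32": lambda data: format(zlib.crc32(data), "08x").encode("ascii"),
    "base16": base64.b16encode,
    "base32": base64.b32encode,
    "base64": base64.b64encode,
    "urlencoding": lambda data: quote(data, safe="").encode("ascii"),
}


def candidate_tokens(email: str, depth: int = 2) -> dict:
    """Map each transformed form of the address to the chain that produced it."""
    start = email.encode("utf-8")
    tokens = {start: ()}
    frontier = {start: ()}
    for _ in range(depth):
        next_frontier = {}
        for value, chain in frontier.items():
            for name, transform in TRANSFORMS.items():
                out = transform(value)
                if out not in tokens:
                    tokens[out] = chain + (name,)
                    next_frontier[out] = chain + (name,)
        frontier = next_frontier
    return tokens


def find_leaks(url: str, email: str, depth: int = 2) -> list:
    """Transform chains whose output appears verbatim in the URL."""
    haystack = url.encode("utf-8")
    return [chain for token, chain in candidate_tokens(email, depth).items()
            if token in haystack]
```

The number of candidate tokens grows exponentially with depth, which is why exhaustive coverage of all combinations is not feasible.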

10.4 Top parties redirecting to new third parties on email reload

| Redirecting Party | Organization | Avg. add'l parties | # S | # E |
|---|---|---|---|---|
| pippio.com | Acxiom | 5.7 | 7 | 32 |
| liadm.com* | LiveIntent | 3.7 | 68 | 1097 |
| rlcdn.com | Acxiom | 1.7 | 11 | 551 |
| imiclk.com | MediaMath | 1.3 | 2 | 4 |
| mathtag.com | MediaMath | 1.1 | 11 | 382 |
| alcmpn.com | ALC† | 0.8 | 6 | 132 |
| emltrk.com | Litmus | 0.7 | 41 | 638 |
| acxiom-online.com | Acxiom | 0.4 | 2 | 33 |
| dyneml.com | PowerInbox | 0.1 | 3 | 13 |
| adnxs.com | AppNexus | 0.1 | 19 | 277 |

Table 14: Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S), out of 902 total, and the number of emails (# E), out of 12,618 total, on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don't fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload

component through which embedded resources are proxied, but no email clients currently work this way, and further, it would introduce its own privacy vulnerabilities, so we ignore this possibility.

HTML filtering. HTML filtering refers to modifying the contents of HTML emails to mitigate tracking. It may be applied by the email server or the client, but it is more suitable to the server, since the client can generally achieve the same effect in other ways, e.g., by request blocking or modifying the rendering engine. It is rarely applied today, and only in minimal ways. In Section 7, we prototype a comprehensive HTML filtering technique.

HTML filtering modifies the content of the email body, and thus might interfere with some email authentication methods, notably DomainKeys Identified Mail (DKIM). However, since filtering is carried out by the recipient's mail server (Mail Transfer Agent), and not by intermediate mail relays, filtering can be done after the signature has been verified, and thus there is no impact on email authentication.

The following three techniques are applicable in one of two scenarios: when the email client requests embedded resources, or when the web browser handles clicks on links in emails.

Cookie blocking. Cookie blocking in the email client prevents third-party cookies from being sent when embedded content is requested. It is especially relevant in the webmail context, where the cookie allows third parties to link an email address to a web browsing profile. Even otherwise, blocking cookies is helpful since it makes it harder for third parties to compile a profile of the recipient's email viewing (they can always do this for the subset of emails where the email address is leaked).

Referrer blocking. If the email client sends the Referer header when loading embedded resources, it can allow several types of leaks. Depending on the implementation, the referrer may encode which client is being used and which specific email is being read. If the recipient forwarded an email to someone else and the email is being viewed in a different user's mailbox, it could leak this information. Worse, if the client supports iframes in emails, and the email address happens to be in the iframe URL, all requests to resources embedded in that iframe will accidentally leak the email address. For all these reasons, referrer blocking is a privacy-enhancing measure. There is little legitimate use for the referrer header in the context of email. While clients can certainly block the header (as can web browsers), servers can do this as well by rewriting HTML to add the rel="noreferrer" attribute to links and inserting a Referrer Policy via the meta tag.
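The server-side rewriting mentioned in the last sentence could be sketched as follows, using only Python's standard html.parser. This is a hedged illustration, not a deployed implementation: the class and function names are assumptions, the meta tag is only inserted when the email already contains a <head> element, and self-closing tags are passed through unchanged.

```python
from html import escape
from html.parser import HTMLParser


class ReferrerRewriter(HTMLParser):
    """Re-emit an HTML document, adding referrer protections along the way."""

    def __init__(self):
        # Keep character references as written so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.out.append(self._rebuild_link(attrs))
        else:
            self.out.append(self.get_starttag_text())
        if tag == "head":
            self.out.append('<meta name="referrer" content="no-referrer">')

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # passed through unchanged

    def _rebuild_link(self, attrs):
        attrs = dict(attrs)
        tokens = (attrs.get("rel") or "").split()
        if "noreferrer" not in tokens:
            tokens.append("noreferrer")
        attrs["rel"] = " ".join(tokens)
        # Attribute values arrive unescaped from the parser, so escape them again.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs.items()
        )
        return f"<a{rendered}>"

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def add_referrer_protection(html_text: str) -> str:
    parser = ReferrerRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Existing rel values are preserved and extended rather than replaced, so a link marked rel="nofollow" becomes rel="nofollow noreferrer".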

Request blocking. Request blocking is a powerful technique which is well known due to ad blockers and other browser privacy extensions. It relies on manually compiled filter lists containing thousands of regular expressions that define third-party content to be blocked. The most widely used ad-blocking list is EasyList, and the most widely used tracker-blocking list is EasyPrivacy. Filter list based blocking introduces false positives and false negatives [43], but the popularity of ad blocking suggests that many users find the usability trade-off to be acceptable. While request-blocking extensions are supported primarily by web browsers, some email clients also have support for them, notably Thunderbird.

6.2 Survey of email clients

We built an email privacy tester to discover which defenses are deployed by which popular email servers and clients (footnote 7). Browser support for tracking protection has been extensively studied elsewhere [29], so we do not consider it here.

The email privacy tester allows the researcher to enter an email address and the name of an email client, and then sends an email to that address containing a tracking image and a link. The image and the link both have unique URLs. The researcher views the email in the specified email client, and then clicks on the link. The server records the following information: the email address, the email client, the IP address, timestamp, and headers sent for both the image and the link requests. The list of headers includes the cookie, referrer, and user agent.

We created accounts with a total of 9 email providers and tested them with a total of 16 email clients using various devices available in our lab. We analyzed the data recorded by the email privacy tester and summarize the results in Table 12. We found that if defenses are deployed by email servers at all, they are only enabled for specific email clients (typically the default webmail client). Therefore, we do not report on servers separately, but instead fold them into the analysis of clients. We also found that HTML filtering in a general form is not deployed, but only in the limited form of image and referrer blocking, so we report on that instead.

7 https://emailtracking.openwpm.com

| Mail Client | Platform | Proxies Content | Blocks Images | Blocks Referrers | Blocks Cookies | Ext. Support |
|---|---|---|---|---|---|---|
| Gmail | Web | Yes | No | L: Yes, I: Yes† | Yes† | Yes |
| Yahoo Mail | Web | No | Yes | L: Yes, I: No | No | Yes |
| Outlook Web App | Web | No | Yes | No | No | Yes |
| Outlook.com | Web | No | No | No | No | Yes |
| Yandex Mail | Web | Yes | No | L: Yes, I: Yes† | Yes† | Yes |
| GMX | Web | No | No | No | No | Yes |
| Zimbra | Web | No | Yes | No | No | Yes |
| 163.com | Web | No | No | No | No | Yes |
| Sina | Web | No | No | No | No | Yes |
| Apple Mail | iOS | No | No | Yes | Yes | No |
| Gmail | iOS | Yes | No | Yes | Yes | No |
| Gmail | Android | Yes | No | Yes | Yes | No |
| Apple Mail | Desktop | No | No | Yes | Yes | No |
| Windows Mail | Desktop | No | No | Yes | No | No |
| Outlook 2016 | Desktop | No | Yes | Yes | No | No |
| Thunderbird | Desktop | No | Yes | Yes | Optional (Default: No) | Yes |

Table 12: A survey of the privacy impacting features of email clients. We explore whether the client: proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
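The strip-and-rewrite step can be illustrated with the sketch below. It is not the prototype's code: it does not use BlockListParser, the two regular expressions are hypothetical stand-ins for rules compiled from EasyList and EasyPrivacy, and only <img> and stylesheet <link> tags are considered.

```python
import re
from html.parser import HTMLParser

# Hypothetical rules standing in for expressions compiled from the filter lists.
BLOCK_RULES = [
    re.compile(r"^https?://([^/]+\.)?tracker\.example/"),
    re.compile(r"/open\.gif(\?|$)"),
]


def is_blocked(url: str) -> bool:
    return any(rule.search(url) for rule in BLOCK_RULES)


class TrackerFilter(HTMLParser):
    """Re-emit HTML, omitting embedded resources whose URL matches a rule."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.removed = []

    def _keep(self, tag, attrs):
        attrs = dict(attrs)
        url = None
        if tag == "img":
            url = attrs.get("src")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            url = attrs.get("href")
        if url and is_blocked(url):
            self.removed.append(url)
            return False
        return True

    def handle_starttag(self, tag, attrs):
        if self._keep(tag, attrs):
            self.out.append(self.get_starttag_text())

    handle_startendtag = handle_starttag  # <img ... /> is treated the same way

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def filter_html(html_text: str):
    """Return the filtered HTML and the list of URLs that were stripped."""
    parser = TrackerFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out), parser.removed
```

Because the filter only sees URLs present in the email body, a resource that becomes a tracker only after a redirect would not be caught by this kind of static rewriting.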

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27,125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

| URL Pattern | # of Senders |
|---|---|
| li.<PS+1>/imp | 51 (5.7%) |
| partner.<PS+1>/ | 7 (0.7%) |
| stripe.<PS+1>/stripe/image | 4 (0.4%) |
| p.<PS+1>/espopen | 4 (0.4%) |
| api.<PS+1>/layouts/section<N> | 4 (0.4%) |
| <PS+1>/customer-service | 3 (0.3%) |
| mi.<PS+1>/prp | 3 (0.3%) |
| dmtk.<PS+1>/ | 3 (0.3%) |
| links.<PS+1>/eopen | 3 (0.3%) |
| eads.<PS+1>/imp | 3 (0.3%) |

Table 13: The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
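The pattern-generation rules in the caption of Table 13 can be sketched as below. This is an illustration under stated assumptions: the three-entry suffix set is a hypothetical stand-in for the full Public Suffix List, the file-extension test is a simple heuristic, and integers are replaced in the path only.

```python
import re
from urllib.parse import urlsplit

# Tiny, hypothetical stand-in for the Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def ps_plus_one(host: str) -> str:
    """Registrable domain: the longest matching public suffix plus one label."""
    labels = host.lower().split(".")
    for i in range(len(labels)):
        if i > 0 and ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    return host.lower()


def url_pattern(url: str) -> str:
    parts = urlsplit(url)  # query string and fragment are dropped below
    host = parts.hostname or ""
    domain = ps_plus_one(host)
    host_pattern = host[: len(host) - len(domain)] + "<PS+1>"
    segments = parts.path.split("/")
    if re.search(r"\.[A-Za-z0-9]{1,5}$", segments[-1]):
        segments = segments[:-1]  # last path portion ends with a file extension
    path = re.sub(r"\d+", "<N>", "/".join(segments))
    return host_pattern + path
```

Grouping leaking URLs by the returned string and counting distinct senders per group would then yield a ranking of the kind shown in the table.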

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available (footnote 8). It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

8 Discussion and conclusion

Privacy risks of email tracking. Email security and privacy has not received much research attention despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing. The putative justification for email address leaks in the online ad tech industry is that the address is hashed. However, hashing of PII, including emails, is not a meaningful privacy protection. This is folk knowledge in the security community, but bears repeating. Compared to hashing of passwords, there are several reasons why hashing of email addresses is far more easily reversible via variants of a dictionary attack. First, while (at least) some users attempt to maximize the entropy of passwords, most users aim to pick memorable emails, and hence the set of potential emails is effectively enumerable. Due to GPUs, trillions of hashes can be attempted at low cost. Second, unlike password hashing, salting is not applicable to email hashing, since multiple third parties need to be able to independently derive the same hash from the email address.

Perhaps most importantly, if the adversary's goal is to retrieve records corresponding to a known email address or set of email addresses, then hashing is pointless: the adversary can simply hash the email addresses and then look them up. For example, if the adversary is a surveillance agency, as discussed above, and seeks to retrieve network logs corresponding to a given email address, this is trivially possible despite hashing.
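The lookup argument can be made concrete with a short Python sketch. The choice of SHA-256, the function names, and the example addresses are assumptions made for illustration; the same reasoning applies to any unsalted hash.

```python
import hashlib


def hash_email(email: str) -> str:
    """Unsalted hash, as a third party might compute it (SHA-256 assumed here)."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def build_lookup(candidate_emails):
    """Map each candidate's hash back to the plaintext address."""
    return {hash_email(e): e for e in candidate_emails}


def reidentify(leaked_hashes, candidate_emails):
    """Return the leaked hashes that correspond to a known candidate address."""
    lookup = build_lookup(candidate_emails)
    return {h: lookup[h] for h in leaked_hashes if h in lookup}
```

Because no salt is involved, the lookup table is built once and reused against any log of leaked hashes; enumerating candidate addresses is the only cost.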

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing, a process in which different trackers exchange and link together their own IDs for the same user, occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies, advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
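One possible shape for such a differential test is sketched below in Python. It is an untested proposal rather than part of our measurement: the function names and the example URLs are hypothetical, and a parameter is flagged only when the values seen for two registrations never overlap.

```python
from urllib.parse import parse_qsl, urlsplit


def params_by_endpoint(urls):
    """Collect observed values per (host, path, parameter name)."""
    seen = {}
    for url in urls:
        parts = urlsplit(url)
        for name, value in parse_qsl(parts.query, keep_blank_values=True):
            key = (parts.hostname or "", parts.path, name)
            seen.setdefault(key, set()).add(value)
    return seen


def differing_params(urls_a, urls_b):
    """Parameters present for both registrations whose values never coincide."""
    a, b = params_by_endpoint(urls_a), params_by_endpoint(urls_b)
    return sorted(k for k in a.keys() & b.keys() if a[k].isdisjoint(b[k]))
```

Parameters flagged this way are only candidates: campaign identifiers varied by A/B testing would also differ between the two registrations, which is the confound noted above.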

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements

We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References

[1] Adblock Plus - Surf the web without annoying ads. https://adblockplus.org. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source. https://www.campaignmonitor.com/css (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.
[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. Fpdetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with html5 and etag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor mitm: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of tls-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.
[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.
[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of pii via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.
[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parent; thus the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.
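The ordered criteria above amount to a lexicographic comparison, which might be sketched in Python as follows. `FormCandidate` and its fields are hypothetical names introduced for this illustration, and we assume the effective z-index has already been resolved by walking up the form's ancestors (0 when every ancestor is auto).

```python
from dataclasses import dataclass


@dataclass
class FormCandidate:
    html: str        # serialized HTML of the form
    z_index: int     # effective z-index after resolving "auto" via ancestors
    num_inputs: int  # number of input fields in the form


def sort_key(form: FormCandidate):
    text = form.html.lower()
    looks_modal = "modal" in text or "dialog" in text
    looks_login = any(s in text for s in ("login", "log in", "sign in"))
    # Tuples compare element by element, so earlier criteria dominate:
    # stacking order, modal tie-break, non-login, then number of inputs.
    return (form.z_index, looks_modal, not looks_login, form.num_inputs)


def best_form(candidates):
    """Return the candidate that ranks highest under the ordered criteria."""
    return max(candidates, key=sort_key)
```

Encoding the criteria as a tuple keeps the priority order explicit: a later criterion only matters when all earlier ones tie.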

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
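The bottom-up search can be illustrated on a small, well-formed document using Python's standard `xml.etree.ElementTree`. This is a simplified sketch rather than the crawler's code: real pages are rarely valid XML, and the helper names and the attributes inspected are assumptions.

```python
import xml.etree.ElementTree as ET


def is_email_input(el):
    """Heuristic check for an email field based on a few attributes."""
    if el.tag != "input":
        return False
    hints = " ".join(el.get(a, "") for a in ("type", "name", "id", "placeholder"))
    return "email" in hints.lower()


def has_submit(container):
    """True if the container holds a button or a submit input."""
    for el in container.iter():
        if el.tag == "button":
            return True
        if el.tag == "input" and el.get("type") == "submit":
            return True
    return False


def find_logical_form(root):
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter("input"):
        if not is_email_input(el):
            continue
        node = parents.get(el)
        # Walk upwards until a container with a submit control is found.
        while node is not None:
            if has_submit(node):
                return node
            node = parents.get(node)
    return None
```

Because the walk starts at the email field and stops at the first ancestor with a submit control, the returned element is the smallest enclosing container that satisfies both conditions.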

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
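A field classifier along these lines might look like the following Python sketch. The attribute list comes from the text above; the keyword table `FIELD_KEYWORDS` and the function name are hypothetical and far smaller than what a real crawler would need.

```python
HINT_ATTRIBUTES = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword table, checked in order (assumption for this sketch).
FIELD_KEYWORDS = [
    ("email", ("email", "e-mail")),
    ("tel", ("phone", "tel", "mobile")),
    ("password", ("password", "passwd")),
    ("zip", ("zip", "postal")),
]


def classify_field(attrs: dict, inner_html: str = "") -> str:
    """Guess a field's type from its attributes and inner HTML."""
    declared = attrs.get("type", "text").lower()
    # HTML5 input types are trusted when present.
    if declared in ("email", "tel", "password"):
        return declared
    # Otherwise fall back to contextual hints spread across attributes.
    hints = " ".join(str(attrs.get(a, "")) for a in HINT_ATTRIBUTES)
    hints = (hints + " " + inner_html).lower()
    for field_type, keywords in FIELD_KEYWORDS:
        if any(k in hints for k in keywords):
            return field_type
    return "unknown"
```

Substring matching is deliberately loose and can misfire (for example, "tel" inside "hotel"), so a practical version would need word-boundary handling and a broader vocabulary.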

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows, and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types that are non-text (text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
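The recursive pruning step can be illustrated with Python's standard `email` package. Note that this is a swapped-in technique for illustration only: the server described above is built on SubEtha SMTP and the JavaMail API, and the function names below, including the small `build_example` helper, are our own.

```python
from email.message import Message
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def strip_non_text(msg: Message) -> Message:
    """Recursively drop subparts whose content type is not text/*."""
    if msg.is_multipart():
        kept = [
            strip_non_text(part)
            for part in msg.get_payload()
            # Keep nested multipart containers (pruned recursively) and text parts.
            if part.is_multipart() or part.get_content_maintype() == "text"
        ]
        msg.set_payload(kept)
    return msg


def build_example() -> Message:
    """Small multipart message with one HTML part and one binary attachment."""
    msg = MIMEMultipart("mixed")
    msg.attach(MIMEText("<p>hello</p>", "html"))
    msg.attach(MIMEApplication(b"\x00\x01", "octet-stream"))
    return msg
```

Calling `strip_non_text(build_example())` leaves a multipart container holding only the text/html part, which is the form in which a message would be written to disk under this scheme.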

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
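To show how such lists can be combined into search strings, the Python sketch below stacks a small subset of the transforms up to a fixed depth and checks whether any result appears in a URL. It is an illustration under assumptions, not our detection code: only two hashes and four encodings are included, the depth limit of 2 is arbitrary, and the function names are hypothetical.

```python
import base64
import hashlib
import urllib.parse

HASHES = ("sha1", "sha256")  # subset of the supported hashes

ENCODERS = {
    "base16": base64.b16encode,
    "base32": base64.b32encode,
    "base64": base64.b64encode,
    "urlencoding": lambda b: urllib.parse.quote_from_bytes(b, safe="").encode("ascii"),
}


def candidate_tokens(email: str, depth: int = 2):
    """Return every string obtainable by stacking up to `depth` transforms."""
    transforms = dict(ENCODERS)
    for name in HASHES:
        # Hashes operate on bytes and yield a lowercase hex digest as bytes.
        transforms[name] = lambda b, n=name: hashlib.new(n, b).hexdigest().encode("ascii")
    frontier = {email.encode("utf-8")}
    found = set()
    for _ in range(depth):
        frontier = {fn(value) for value in frontier for fn in transforms.values()}
        found |= frontier
    return {token.decode("ascii") for token in found}


def leaks_email(url: str, email: str) -> bool:
    """True if the URL contains the address or any transformed variant."""
    return email in url or any(t in url for t in candidate_tokens(email))
```

The number of candidates grows exponentially with the depth, which is one reason exhaustive coverage of all combinations is out of reach.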

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party    Organization   Avg. # add'l parties   # S   # E
pippio.com           Acxiom         5.7                    7     32
liadm.com*           LiveIntent     3.7                    68    1097
rlcdn.com            Acxiom         1.7                    11    551
imiclk.com           MediaMath      1.3                    2     4
mathtag.com          MediaMath      1.1                    11    382
alcmpn.com           ALC†           0.8                    6     132
emltrk.com           Litmus         0.7                    41    638
acxiom-online.com    Acxiom         0.4                    2     33
dyneml.com           PowerInbox     0.1                    3     13
adnxs.com            AppNexus       0.1                    19    277

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel.

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don't fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload


Mail Client       Platform   Proxies Content   Blocks Images   Blocks Referrers   Blocks Cookies          Ext. Support
Gmail             Web        Yes               No              L: Yes, I: Yes†    Yes†                    Yes
Yahoo Mail        Web        No                Yes             L: Yes, I: No      No                      Yes
Outlook Web App   Web        No                Yes             No                 No                      Yes
Outlook.com       Web        No                No              No                 No                      Yes
Yandex Mail       Web        Yes               No              L: Yes, I: Yes†    Yes†                    Yes
GMX               Web        No                No              No                 No                      Yes
Zimbra            Web        No                Yes             No                 No                      Yes
163.com           Web        No                No              No                 No                      Yes
Sina              Web        No                No              No                 No                      Yes
Apple Mail        iOS        No                No              Yes                Yes                     No
Gmail             iOS        Yes               No              Yes                Yes                     No
Gmail             Android    Yes               No              Yes                Yes                     No
Apple Mail        Desktop    No                No              Yes                Yes                     No
Windows Mail      Desktop    No                No              Yes                No                      No
Outlook 2016      Desktop    No                Yes             Yes                No                      No
Thunderbird       Desktop    No                Yes             Yes                Optional (Default: No)  Yes

Table 12. A survey of the privacy impacting features of email clients. We explore whether the client proxies image requests, blocks images by default, blocks referrer headers from being sent (with image requests "I" and with link clicks "L"), blocks external resources from setting cookies, and whether or not the client supports request blocking extensions, either through the browser (for web clients) or directly (in the case of Thunderbird).
* Images are only blocked for messages considered spam.
† Blocking occurs as a result of proxied content.

7 Proposed defense

We argue that tracking protection should be at the center of a defensive strategy against email tracking. It can be employed either via HTML filtering on the server or via request blocking on the client. Tracking protection (and ad blocking) based on filter lists has proven to be effective and popular in web browsers, and its limitations manageable. The other defenses we examined all have serious drawbacks: for example, content proxying comes at a cost to the email server and makes email leaks worse, and cookie blocking is at best a partial solution.

We propose to improve tracking protection in two ways.

Server-side email content filtering. First, we prototype a server-side HTML filtering module. We use the existing, standard EasyList and EasyPrivacy filter lists. Our filtering script is written in Python using the BlockListParser library [3]. It scans for any HTML content (text/html) in email bodies, parses those contents, identifies embedded resources (images or CSS) whose URLs match one of the regular expressions in the filter lists, strips them out, and rewrites the HTML.
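The strip-and-rewrite step can be sketched with Python's standard `html.parser`. This is an illustrative approximation, not the prototype itself: the `BLOCKED` regular expressions are hypothetical stand-ins for rules compiled from EasyList and EasyPrivacy, the BlockListParser library is not used here, and only `<img src>` and `<link href>` resources are considered.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns standing in for rules compiled from a filter list.
BLOCKED = [
    re.compile(r"^https?://([^/]+\.)?tracker\.example/"),
    re.compile(r"/open\.gif(\?|$)"),
]


def is_blocked(url: str) -> bool:
    return any(p.search(url) for p in BLOCKED)


class TrackerStripper(HTMLParser):
    """Re-emits the input HTML, omitting embedded resources with blocked URLs."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _keep(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            url = attrs.get("src")
        elif tag == "link":
            url = attrs.get("href")
        else:
            url = None
        return not (url and is_blocked(url))

    def handle_starttag(self, tag, attrs):
        if self._keep(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self._keep(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def filter_html(html: str) -> str:
    parser = TrackerStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the filter only sees URLs written in the HTML itself, it cannot catch trackers that are reached through a redirect from an allowed URL.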

To test the effectiveness of HTML filtering, we ran our leak detection procedure on the filtered corpus of emails. We exclude one sender due to a measurement issue. We found that 11.0% of senders will leak email addresses to a third party in at least one email, and 11.5% of emails contain embedded resources which leak email to a third party. Overall, 62 third parties received leaked email addresses, down from 99. As tracking-protection lists improve (see below), we can expect these numbers to decrease further. These numbers are very close to the corresponding numbers for request blocking (Section 4.6). The two techniques aren't identical: the one difference is that in static files, filtering is limited to the URLs present in the body of the HTML, and will miss those that result from a redirect. However, this difference is small, and we conclude that HTML filtering is essentially as effective as request blocking.

Note that webmail users can already enjoy tracking protection, but server-side deployment will help all users, including those who use email clients that don't support request-blocking extensions.

Filling gaps in tracking-protection lists. As a second line of defense, we use our dataset to identify a list of 27125 URLs representing 133 distinct parties which contain leaks of email addresses, but which aren't blocked by EasyList or EasyPrivacy. These include first parties in addition to third parties. We are able to identify first-party tracking URLs by observing groups of URLs of similar structure across different first-party domains. For example, 51 email senders leak the user's email address to a URL of the form li.<public suffix + 1>/imp, which appears to be part of LiveIntent's API (Section 4.5). We summarize the most common structures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                      # of Senders
li.<PS+1>/imp                    51 (5.7%)
partner.<PS+1>                   7 (0.7%)
stripe.<PS+1>/stripeimage        4 (0.4%)
p.<PS+1>/espopen                 4 (0.4%)
api.<PS+1>/layoutssection<N>     4 (0.4%)
<PS+1>/customer-service          3 (0.3%)
mi.<PS+1>/prp                    3 (0.3%)
dmtk.<PS+1>                      3 (0.3%)
links.<PS+1>/eopen               3 (0.3%)
eads.<PS+1>/imp                  3 (0.3%)

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.

We suspect that the reason so many trackers aremissed is that many of them are not active in the regu-lar web tracking space We have made the list of leakingURLs missed by tracking protection lists publicly avail-able8 It should be straightforward to add regular ex-pressions to filter lists based on these URLs we suggestthat filter list creators should regularly conduct scansof email corpora to identify new trackers

8 Discussion and conclusionPrivacy risks of email tracking Email security andprivacy has not received much research attention despiteits central importance in digital life We showed thatcommercial emails contain a high density of third-partytrackers This is of concern not only because trackerscan learn the recipientrsquos IP address when emails wereopened and so on but also because these third partiesare by and large the same ones that are involved in webtracking This means that trackers can connect emailaddresses to browsing histories and profiles which leadsto further privacy breaches such as cross-device tracking

8 httpsgistgithubcomenglehardt6438c5d775ffd535b317d5c6ce3cde61

and linking of online and offline activities Indeed emailis an underappreciated avenue for straightforward cross-device tracking since recipients tend to view emails onmultiple devices

The advice provided by many mail clients may mis-lead users into thinking the privacy risks associated withremote content are fairly limited The remote contenthelp pages of Gmail [20] Yahoo Mail [42] and Thun-derbird [31] all discuss the threat strictly in terms ofthe email sender learning information about the userrather than a number of third parties

Even network adversaries can benefit from the leaksin emails The NSA is known to piggyback on advertis-ing cookies for surveillance [18] and our work suggestsone way in which a surveillance agency might attachidentities to web activity records in line with the find-ings of Englehardt et al [18] Indeed nearly 91 ofURLs containing leaks of emails are sent in plaintext

Ineffectiveness of hashing The putative justi-fication for email address leaks in the online ad techindustry is that the address is hashed However hash-ing of PII including emails is not a meaningful pri-vacy protection This is folk knowledge in the securitycommunity but bears repeating Compared to hashingof passwords there are several reasons why hashing ofemail addresses is far more easily reversible via vari-ants of a dictionary attack First while (at least) someusers attempt to maximize the entropy of passwordsmost users aim to pick memorable emails and hence theset of potential emails is effectively enumerable Due toGPUs trillions of hashes can be attempted at low costSecond unlike password hashing salting is not applica-ble to email hashing since multiple third parties need tobe able to independently derive the same hash from theemail address

Perhaps most importantly if the adversaryrsquos goalis to retrieve records corresponding to a known emailaddress or set of email addresses then hashing ispointlessmdashthe adversary can simply hash the email ad-dresses and then look them up For example if the ad-versary is a surveillance agency as discussed above andseeks to retrieve network logs corresponding to a givenemail address this is trivially possible despite hashing

Limitations. We mention several limitations of our work. First, despite the large number of heuristics that went into identifying and submitting forms, it is a fundamentally hard problem, and our crawler fails in many cases, including pages requiring complex mouse interactions, pages containing very poorly structured HTML, and captcha-protected form submission pages. Moreover, it is difficult to programmatically distinguish between successful and failed form submissions. Looking at received network data is impractical, since responses could easily include text for both success and failure messages. On the other hand, looking only at changes in the rendered text on the webpage is more feasible, but would require handling many possible edge cases (e.g., page redirects, alerts, pop-up windows, iframes) and might still be too unreliable to use as a metric for success.

Second, our corpus of emails is not intended to be representative, and we are unable to draw conclusions about the extent of tracking in the typical user's mailbox.

Third, our simulation of a user viewing emails assumes a permissive user agent. We expect that this closely approximates a webmail setup with default browser settings (on browsers except Safari, which blocks third-party cookies by default), but we have not tested this assumption.

Future work. Finally, we mention several potential areas of future work.

Mailing list managers. It would be helpful to better understand the relationship between email senders and mailing list managers (such as Constant Contact). To what extent is email tracking driven by senders versus mailing list managers? When a sender sets up a marketing campaign with a mailing list manager, is the tracking disclosed to the sender?

PII leakage in registration forms. Researchers have previously found leakage of PII to third parties in contact forms on websites [38]. As far as we know, there has been no large-scale study of PII leakage in registration forms, where more sensitive information is often present (e.g., phone numbers, street addresses, and passwords). Recording and analyzing the third-party requests made during our crawls is an important area for further investigation.

Cookie syncing. It would be interesting to find out if cookie syncing (a process in which different trackers exchange and link together their own IDs for the same user) occurs when viewing emails. Past work has shown that this happens among the vast majority of top third parties on the web [17], so we suspect that it occurs through email as well.

A/B testing. We notice some clear instances of A/B testing in our data, as might be expected in marketing campaigns. Specifically, we registered multiple email addresses on some sites at roughly the same time, and found several emails sent at nearly the same time (milliseconds apart) with different subject lines and email bodies, advertising different products. We have not attempted to reverse-engineer or systematically analyze these differences, but it may be interesting to see if and how the received content changes in response to read receipts, click-through metrics, or other types of user interactions.

Differential testing. Despite testing for various encodings, hashes, and combinations, it is possible that we have missed some leaks of email addresses. We cannot hope to exhaustively test for all combinations of encodings and hashes. Instead, we propose differential testing: by registering multiple email addresses on the same site, we can look for parameters in URLs that are different for different email addresses, which are suggestive of transformed email addresses. The difficulty with this approach is that A/B testing, mentioned above, is a confound.
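A minimal sketch of the comparison step, in Python, might look as follows. It assumes the two requests were captured from emails sent to two different registered addresses, and that only requests to the same host and path are comparable; the function name and the example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def differing_params(url_a, url_b):
    """Return query parameter names whose values differ between two
    requests to the same host and path (empty set if not comparable)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.netloc, a.path) != (b.netloc, b.path):
        return set()
    params_a = dict(parse_qsl(a.query, keep_blank_values=True))
    params_b = dict(parse_qsl(b.query, keep_blank_values=True))
    shared = params_a.keys() & params_b.keys()
    return {name for name in shared if params_a[name] != params_b[name]}


# Hypothetical requests observed for two different registered addresses.
suspects = differing_params(
    "http://tracker.example/open?uid=9f2c&campaign=42",
    "http://tracker.example/open?uid=71ab&campaign=42",
)
```

A differing parameter is only a candidate: per-recipient message IDs and A/B campaign variants differ across addresses as well, so flagged parameters would still need manual review.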

In summary, we hope that our work leads to greater awareness of the privacy risks of email tracking, spurs further research on the topic, and paves the way for deployment of robust defenses.

9 Acknowledgements
We would like to thank the anonymous reviewers, Aylin Caliskan, Paul-Olivier Dehaye, Joel Reardon, and Paul Van Oorschot for their helpful comments. We're also grateful to Günes Acar, Paul Ellenbogen, Marc Juarez, Harry Kalodner, Marcela Melara, and Laura Roberts for their assistance in compiling data for our email survey.

This work was supported by NSF Grant CNS 1526353, by a research grant from Mozilla, and by Amazon AWS Cloud Credits for Research.

References
[1] Adblock Plus - Surf the web without annoying ads! https://adblockplus.org. Online; accessed 2017-09-05.
[2] BeautifulSoup. https://www.crummy.com/software/BeautifulSoup. Online; accessed 2017-09-05.
[3] BlockListParser. https://github.com/shivamagarwal-iitb/BlockListParser. Online; accessed 2017-09-05.
[4] EasyList and EasyPrivacy. https://easylist.to. Online; accessed 2017-09-05.
[5] uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock. Online; accessed 2017-09-05.
[6] CSS Support Guide for Email Clients. Campaign Source, https://www.campaignmonitor.com/css (Archive: https://www.webcitation.org/6rLLXBX0E), 2014.
[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.
[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. Fpdetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.
[10] Mika D Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with html5 and etag respawning. 2011.
[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.
[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.
[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.
[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.
[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J Alex Halderman. Neither snow nor rain nor mitm: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.
[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.
[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.
[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.
[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.
[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.
[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of tls-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.
[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.
[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.
[24] Balachander Krishnamurthy and Craig E Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.
[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.
[27] Timothy Libert. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication, 9:18, 2015.
[28] Jonathan R Mayer and John C Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.
[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.
[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.
[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.
[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.
[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.
[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.
[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.
[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.
[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.
[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? quantifying the leakage of pii via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.
[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.
[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.
[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.
[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.
[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.
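The link-selection step can be sketched in Python as below. The term list and blacklist are hypothetical stand-ins (the actual ranked terms are those of Table 1), and links are modeled as simple (text, URL, visible) tuples rather than live DOM elements.

```python
from urllib.parse import urlsplit

# Hypothetical stand-ins for the ranked terms of Table 1, best rank first.
RANKED_TERMS = ["newsletter", "subscribe", "sign up", "register"]
# Hypothetical blacklist for unsubscribe and phone-based registration links.
BLACKLIST = ["unsubscribe", "mobile", "phone"]


def rank_links(links, site_host):
    """links: iterable of (text, url, visible). Returns URLs, best rank first."""
    scored = []
    for text, url, visible in links:
        host = urlsplit(url).netloc
        if not visible or (host and host != site_host):
            continue  # drop invisible links and links to external sites
        text_lower = text.lower()
        if any(word in text_lower for word in BLACKLIST):
            continue
        haystack = text_lower + " " + url.lower()
        for rank, term in enumerate(RANKED_TERMS):
            if term in haystack:
                scored.append((rank, url))
                break
    return [url for _, url in sorted(scored)]


example_links = [
    ("Subscribe", "/subscribe", True),
    ("Newsletter", "https://example.com/news", True),
    ("Unsubscribe", "/unsub", True),
    ("Sign up", "https://other.com/signup", True),
    ("Register", "/register", False),
]
ordered = rank_links(example_links, "example.com")
```

The crawler would then visit the returned URLs in order, stopping at the first page on which a form is found.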

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form's input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site's mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from its parent; thus, the crawler recursively checks a form's parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings "modal" or "dialog" within the form's HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings "login", "log in", and "sign in" within a form's HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
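The three criteria above can be expressed as a single lexicographic sort key. The following Python sketch assumes a simplified element model (the hypothetical `Node` class) in place of a browser DOM, and treats an element with no non-auto ancestor as having stacking level 0; both are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Simplified stand-in for a DOM element."""
    html: str = ""
    z_index: str = "auto"
    n_inputs: int = 0
    parent: Optional["Node"] = None


def effective_z_index(node):
    """Walk up the parents until a non-auto z-index is found (0 at the root)."""
    while node is not None:
        if node.z_index != "auto":
            return int(node.z_index)
        node = node.parent
    return 0


def sort_key(form):
    html = form.html.lower()
    is_modal = any(s in html for s in ("modal", "dialog"))
    is_login = any(s in html for s in ("login", "log in", "sign in"))
    # 1. topmost first (modal/dialog naming breaks ties), 2. login forms
    # last, 3. more input fields first.
    return (-effective_z_index(form), not is_modal, is_login, -form.n_inputs)


def best_form(candidates):
    return min(candidates, key=sort_key)


overlay = Node(z_index="1000")
modal_form = Node(html='<form class="newsletter-modal">', n_inputs=1, parent=overlay)
footer_form = Node(html='<form id="footer-signup">', n_inputs=1)
login_form = Node(html='<form id="login">', n_inputs=2)
long_form = Node(html="<form>", n_inputs=5)
```

Encoding the criteria as a tuple keeps their priority order explicit: a later criterion only matters when all earlier ones tie.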

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
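A minimal Python sketch of the bottom-up search follows. It uses the standard library's `xml.etree.ElementTree` on a well-formed snippet purely for illustration; real pages are rarely well-formed XML, and a crawler would walk the browser's DOM instead. The helper names and the attribute heuristics are assumptions.

```python
import xml.etree.ElementTree as ET


def is_email_input(el):
    """Heuristic: an <input> whose common attributes mention 'email'."""
    if el.tag != "input":
        return False
    attrs = ("type", "name", "id", "placeholder")
    hints = " ".join(el.get(a, "") for a in attrs).lower()
    return "email" in hints


def has_submit(container):
    """True if the container holds a <button> or an <input type="submit">."""
    for el in container.iter():
        if el.tag == "button" or (el.tag == "input" and el.get("type") == "submit"):
            return True
    return False


def find_logical_form(root):
    """Return the smallest ancestor of an email input that has a submit button."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter("input"):
        if not is_email_input(el):
            continue
        container = parent_of.get(el)
        while container is not None:
            if has_submit(container):
                return container
            container = parent_of.get(container)
    return None


page = ET.fromstring(
    '<body><div id="wrap"><div id="signup">'
    '<span><input type="text" name="email"/></span><button>Join</button>'
    "</div><p>text</p></div></body>"
)
container = find_logical_form(page)
```

Here the search starts at the email field, rejects the enclosing `<span>` (no submit button), and stops at the `signup` container rather than the larger `wrap` element.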

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
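As an illustration of this attribute-based matching, the Python sketch below classifies a field from a dictionary of its attributes. The attribute list comes from the paragraph above; the keyword lists and field-type labels are hypothetical, and substring matching of this kind can misfire (for example, "hotel" contains "tel").

```python
# Attributes named above as the most frequent carriers of contextual hints.
HINT_ATTRS = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword lists, checked in order.
FIELD_KEYWORDS = [
    ("email", ("email", "e-mail")),
    ("tel", ("phone", "tel", "mobile")),
    ("zip", ("zip", "postal")),
    ("name", ("name",)),
]


def classify_field(attrs, inner_html=""):
    """Guess a field's type from its type attribute, then from contextual hints."""
    input_type = attrs.get("type", "text").lower()
    if input_type in ("email", "tel"):
        return input_type  # specific HTML5 input types are decisive
    hints = " ".join(str(attrs.get(a, "")) for a in HINT_ATTRS).lower()
    hints += " " + inner_html.lower()
    for field_type, keywords in FIELD_KEYWORDS:
        if any(keyword in hints for keyword in keywords):
            return field_type
    return "unknown"
```

For example, `classify_field({"type": "text", "placeholder": "Your e-mail"})` falls through the type check and is matched by the placeholder hint instead.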

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure first on any pop-up windows, and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with non-text content types (anything other than text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
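The recursive pruning step can be sketched with Python's standard `email` package. This is an illustration only: the server described above is built on SubEtha SMTP and the JavaMail API in Java, and the function name `strip_non_text` is hypothetical.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def strip_non_text(msg):
    """Recursively drop leaf subparts whose content type is not text/*."""
    if msg.is_multipart():
        kept = []
        for part in msg.get_payload():
            if part.is_multipart():
                kept.append(strip_non_text(part))
            elif part.get_content_maintype() == "text":
                kept.append(part)
        msg.set_payload(kept)
    return msg


# Example message: a text/HTML alternative plus a binary attachment.
outer = MIMEMultipart("mixed")
body = MIMEMultipart("alternative")
body.attach(MIMEText("hello", "plain"))
body.attach(MIMEText("<p>hello</p>", "html"))
outer.attach(body)
outer.attach(MIMEApplication(b"\x00\x01"))

cleaned = strip_non_text(message_from_bytes(outer.as_bytes()))
kept_types = [part.get_content_type() for part in cleaned.walk()]
```

Multipart containers are kept so that the tree structure of the message survives; only non-text leaves are removed.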

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
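To illustrate how such a list can be used for leak detection, the Python sketch below precomputes transformed forms of an email address by chaining hashes and encodings up to a fixed depth, then searches a request URL for any of them. Only a small subset of the transforms above is included, and the nesting depth of 2, the function names, and the example URL are assumptions.

```python
import base64
import hashlib
from urllib.parse import quote

# A small subset of the hashes and encodings listed above.
TRANSFORMS = {
    "md5": lambda b: hashlib.md5(b).hexdigest().encode(),
    "sha1": lambda b: hashlib.sha1(b).hexdigest().encode(),
    "sha256": lambda b: hashlib.sha256(b).hexdigest().encode(),
    "base16": base64.b16encode,
    "base32": base64.b32encode,
    "base64": base64.b64encode,
    "urlencoding": lambda b: quote(b, safe="").encode(),
}


def candidate_tokens(email, depth=2):
    """Map each transformed form of the address to the chain that produced it."""
    found = {email.encode(): ()}
    frontier = dict(found)
    for _ in range(depth):
        next_frontier = {}
        for value, chain in frontier.items():
            for name, transform in TRANSFORMS.items():
                token = transform(value)
                if token not in found and token not in next_frontier:
                    next_frontier[token] = chain + (name,)
        found.update(next_frontier)
        frontier = next_frontier
    return found


def find_leaks(url, tokens):
    """Return the transform chains whose output appears in the URL."""
    url_bytes = url.encode()
    return sorted(chain for token, chain in tokens.items() if token in url_bytes)


tokens = candidate_tokens("alice@example.com")
inner = hashlib.md5(b"alice@example.com").hexdigest().encode()
leaky_url = "http://tracker.example/p?h=" + hashlib.sha256(inner).hexdigest()
leaks = find_leaks(leaky_url, tokens)
```

The token set grows roughly exponentially with depth, which is one reason exhaustive testing of all combinations is not feasible.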

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party    Organization   Avg. # add'l parties   # S   # E
pippio.com           Acxiom         5.7                    7     32
liadm.com*           LiveIntent     3.7                    68    1097
rlcdn.com            Acxiom         1.7                    11    551
imiclk.com           MediaMath      1.3                    2     4
mathtag.com          MediaMath      1.1                    11    382
alcmpn.com           ALC†           0.8                    6     132
emltrk.com           Litmus         0.7                    41    638
acxiom-online.com    Acxiom         0.4                    2     33
dyneml.com           PowerInbox     0.1                    3     13
adnxs.com            AppNexus       0.1                    19    277

Table 14. Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel.

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don't fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload


tures in the leaking URLs missed by tracking protection lists in Table 13.

URL Pattern                      # of Senders
li.<PS+1>/imp                    51 (5.7%)
partner.<PS+1>                   7 (0.7%)
stripe.<PS+1>/stripeimage        4 (0.4%)
p.<PS+1>/espopen                 4 (0.4%)
api.<PS+1>/layoutssection<N>     4 (0.4%)
<PS+1>/customer-service          3 (0.3%)
mi.<PS+1>/prp                    3 (0.3%)
dmtk.<PS+1>                      3 (0.3%)
links.<PS+1>/eopen               3 (0.3%)
eads.<PS+1>/imp                  3 (0.3%)

Table 13. The top URL patterns from URLs which leak email addresses and are missed by tracking protection lists (Section 4.6). The patterns are generated by stripping request URLs to hostname and path, replacing the public suffix plus one with <PS+1>, replacing integers with <N>, and stripping the last portion of the path if it ends with a file extension. The patterns are ranked by the number of senders which make at least one leaking request matching that pattern in any of the sender's emails. All values are given out of the total of 902 senders studied.
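The pattern-generation steps in the caption can be sketched in Python as follows. The public suffix plus one is approximated here by the last two hostname labels, which is a simplifying assumption (a faithful implementation would consult the Public Suffix List); the function names and example URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit


def ps_plus_one(hostname):
    """Naive stand-in for the public suffix plus one: the last two labels.
    This is wrong for suffixes such as co.uk."""
    return ".".join(hostname.split(".")[-2:])


def url_pattern(url):
    """Reduce a request URL to a sender-independent pattern."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path  # the query string and fragment are dropped
    # Strip the last path segment if it ends with a file extension.
    head, _, last = path.rpartition("/")
    if re.search(r"\.[A-Za-z0-9]+$", last):
        path = head
    # Replace the PS+1 part of the hostname with a placeholder.
    base = ps_plus_one(host)
    prefix = host[: len(host) - len(base)]
    # Replace integers with <N> (applied to subdomain and path; an assumption).
    prefix = re.sub(r"\d+", "<N>", prefix)
    path = re.sub(r"\d+", "<N>", path)
    return prefix + "<PS+1>" + path
```

For instance, `url_pattern("http://li.example.com/imp?e=abc")` yields `li.<PS+1>/imp`, so requests from different senders that share the same tracking setup collapse to one pattern.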

We suspect that the reason so many trackers are missed is that many of them are not active in the regular web tracking space. We have made the list of leaking URLs missed by tracking protection lists publicly available (see footnote 8). It should be straightforward to add regular expressions to filter lists based on these URLs; we suggest that filter list creators should regularly conduct scans of email corpora to identify new trackers.

8 Discussion and conclusion
Privacy risks of email tracking. Email security and privacy has not received much research attention, despite its central importance in digital life. We showed that commercial emails contain a high density of third-party trackers. This is of concern not only because trackers can learn the recipient's IP address, when emails were opened, and so on, but also because these third parties are by and large the same ones that are involved in web tracking. This means that trackers can connect email addresses to browsing histories and profiles, which leads to further privacy breaches such as cross-device tracking and linking of online and offline activities. Indeed, email is an underappreciated avenue for straightforward cross-device tracking, since recipients tend to view emails on multiple devices.

Footnote 8: https://gist.github.com/englehardt/6438c5d775ffd535b317d5c6ce3cde61

The advice provided by many mail clients may mislead users into thinking the privacy risks associated with remote content are fairly limited. The remote content help pages of Gmail [20], Yahoo Mail [42], and Thunderbird [31] all discuss the threat strictly in terms of the email sender learning information about the user, rather than a number of third parties.

Even network adversaries can benefit from the leaks in emails. The NSA is known to piggyback on advertising cookies for surveillance [18], and our work suggests one way in which a surveillance agency might attach identities to web activity records, in line with the findings of Englehardt et al. [18]. Indeed, nearly 91% of URLs containing leaks of emails are sent in plaintext.

Ineffectiveness of hashing The putative justi-fication for email address leaks in the online ad techindustry is that the address is hashed However hash-ing of PII including emails is not a meaningful pri-vacy protection This is folk knowledge in the securitycommunity but bears repeating Compared to hashingof passwords there are several reasons why hashing ofemail addresses is far more easily reversible via vari-ants of a dictionary attack First while (at least) someusers attempt to maximize the entropy of passwordsmost users aim to pick memorable emails and hence theset of potential emails is effectively enumerable Due toGPUs trillions of hashes can be attempted at low costSecond unlike password hashing salting is not applica-ble to email hashing since multiple third parties need tobe able to independently derive the same hash from theemail address

Perhaps most importantly if the adversaryrsquos goalis to retrieve records corresponding to a known emailaddress or set of email addresses then hashing ispointlessmdashthe adversary can simply hash the email ad-dresses and then look them up For example if the ad-versary is a surveillance agency as discussed above andseeks to retrieve network logs corresponding to a givenemail address this is trivially possible despite hashing

LimitationsWe mention several limitations of ourwork First despite the large number of heuristics thatwent into identifying and submitting forms it is a fun-damentally hard problem and our crawler fails in manycases including pages requiring complex mouse interac-tions pages containing very poorly structured HTMLand captcha-protected form submission pages More-over it is difficult to programmatically distinguish be-

I never signed up for this Privacy implications of email tracking 15

tween successful and failed form submissions Lookingat received network data is impractical since responsescould easily include text for both success and failuremessages On the other hand looking only at changesin the rendered text on the webpage is more feasiblebut would require handling many possible edge cases(eg page redirects alerts pop-up windows iframes)and might still be too unreliable to use as a metric forsuccess

Second our corpus of emails is not intended to berepresentative and we are unable to draw conclusionsabout the extent of tracking in the typical userrsquos mail-box

Third our simulation of a user viewing emails as-sumes a permissive user agent We expect that thisclosely approximates a webmail setup with defaultbrowser settings (on browsers except Safari whichblocks third-party cookies by default) but we have nottested this assumption

Future work Finally we mention several potentialareas of future work

Mailing list managers It would be helpful to bet-ter understand the relationship between email sendersand mailing list managers (such as Constant Contact)To what extent is email tracking driven by senders ver-sus mailing list managers When a sender sets up amarketing campaign with a mailing list manager is thetracking disclosed to the sender

PII leakage in registration forms Researchers havepreviously found leakage of PII to third parties in con-tact forms on websites [38] As far as we know there hasbeen no large-scale study of PII leakage in registrationforms where more sensitive information is often present(eg phone numbers street addresses and passwords)Recording and analyzing the third-party requests madeduring our crawls is an important area for further inves-tigation

Cookie syncing It would be interesting to find outif cookie syncing occurs when viewing emailsmdasha processin which different trackers exchange and link togethertheir own IDs for the same user Past work has shownthat this happens among the vast majority of top thirdparties on the web [17] so we suspect that it occursthrough email as well

AB testing We notice some clear instances of ABtesting in our data as might be expected in market-ing campaigns Specifically we registered multiple emailaddresses on some sites at roughly the same time andfound several emails sent at nearly the same time (mil-liseconds apart) with different subject lines and emailbodies advertising different products We have not at-

tempted to reverse-engineer or systematically analyzethese differences but it may be interesting to see if andhow the received content changes in response to readreceipts click-through metrics or other types of userinteractions

Differential testing Despite testing for various en-codings hashes and combinations it is possible thatwe have missed some leaks of email addresses We can-not hope to exhaustively test for all combinations ofencodings and hashes Instead we propose differentialtesting by registering multiple email addresses on thesame site we can look for parameters in URLs that aredifferent for different email addresses which are sugges-tive of transformed email addresses The difficulty withthis approach is that AB testing mentioned above isa confound

In summary we hope that our work leads to greaterawareness of the privacy risks of email tracking spursfurther research on the topic and paves the way fordeployment of robust defenses

9 AcknowledgementsWe would like to thank the anonymous reviewers AylinCaliskan Paul-Olivier Dehaye Joel Reardon and PaulVan Oorschot for their helpful comments Wersquore alsograteful to Guumlnes Acar Paul Ellenbogen Marc JuarezHarry Kalodner Marcela Melara and Laura Roberts fortheir assistance in compiling data for our email survey

This work was supported by NSF Grant CNS1526353 by a research grant from Mozilla and by Ama-zon AWS Cloud Credits for Research

References[1] Adblock Plus - Surf the web without annoying ads https

adblockplusorg Online accessed 2017-09-05[2] BeautifulSoup httpswwwcrummycomsoftware

BeautifulSoup Online accessed 2017-09-05[3] BlockListParser httpsgithubcomshivamagarwal-iitb

BlockListParser Online accessed 2017-09-05[4] EasyList and EasyPrivacy httpseasylistto Online

accessed 2017-09-05[5] uBlock Origin - An efficient blocker for Chromium and Fire-

fox Fast and lean httpsgithubcomgorhilluBlockOnline accessed 2017-09-05

[6] CSS Support Guide for Email Clients Campaign Sourcehttpswwwcampaignmonitorcomcss (Archive httpswwwwebcitationorg6rLLXBX0E) 2014

I never signed up for this Privacy implications of email tracking 16

[7] Gunes Acar Christian Eubank Steven Englehardt MarcJuarez Arvind Narayanan and Claudia Diaz The web neverforgets Persistent tracking mechanisms in the wild In Pro-ceedings of ACM CCS pages 674ndash689 ACM 2014

[8] Gunes Acar Marc Juarez Nick Nikiforakis Claudia DiazSeda Guumlrses Frank Piessens and Bart Preneel Fpdetectivedusting the web for fingerprinters In Proceedings of the2013 ACM SIGSAC conference on Computer amp communica-tions security pages 1129ndash1140 ACM 2013

[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.

[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. 2011.

[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.

[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.

[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.

[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.

[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor MITM...: An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.

[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.

[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.

[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.

[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.

[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.

[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An Internet-wide analysis of TLS-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.

[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.

[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.

[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.

[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.

[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet Jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.

[27] Timothy Libert. Exposing the invisible web: An analysis of third-party HTTP requests on 1 million websites. International Journal of Communication, 9:18, 2015.

[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.

[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.

[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.

[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.

[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and Privacy (SP), 2013 IEEE Symposium on, pages 541–555. IEEE, 2013.

[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.

[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and controlling PII leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.

[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.

[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.

[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.

[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of PII via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.

[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.

[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.

[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.

[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.
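The link selection step can be sketched as follows. This is a minimal illustration and not the crawler's actual code: the term list and blacklist below are hypothetical stand-ins (the real ranked terms are those in Table 1), and links are assumed to have already been extracted into plain dictionaries with "text", "url", and "visible" keys.

```python
from urllib.parse import urlparse

# Hypothetical ranked terms, best first. The real list is the one in Table 1.
RANKED_TERMS = ["newsletter", "subscribe", "sign up", "register", "article"]
# Hypothetical blacklist entries for unsubscribe and phone-based pages.
BLACKLIST = ["unsubscribe", "phone", "sms"]


def rank_links(links, landing_url):
    """Return the URLs of candidate links, highest-ranked first."""
    landing_host = urlparse(landing_url.lower()).netloc
    scored = []
    for position, link in enumerate(links):
        text = link["text"].lower()
        url = link["url"].lower()
        if not link["visible"]:
            continue  # invisible links are filtered out
        host = urlparse(url).netloc
        if host and host != landing_host:
            continue  # links to external sites are filtered out
        if any(word in text for word in BLACKLIST):
            continue  # blacklist is matched against the link text only
        for rank, term in enumerate(RANKED_TERMS):
            if term in text or term in url:
                # Keep the best (lowest) rank; page order breaks ties.
                scored.append((rank, position, link["url"]))
                break
    return [url for _, _, url in sorted(scored)]
```

A crawler built on this sketch would visit the returned URLs in order, running the form-finding procedure on each until a form is found.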

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form’s input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site’s mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parent; thus the crawler recursively checks a form’s parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings “modal” or “dialog” within the form’s HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings “login”, “log in”, and “sign in” within a form’s HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tags), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
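The three ordered criteria above amount to a lexicographic comparison, which the following sketch expresses as a sort key. The FormCandidate record is a hypothetical simplification: it assumes the z-index values of the form and each of its ancestors have already been read from the page, and it treats a chain consisting only of auto as stacking level 0, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FormCandidate:
    html: str             # the form's HTML source
    z_indexes: List[str]  # z-index of the form, then of each ancestor up to the root
    num_inputs: int       # number of input fields in the form


def effective_z_index(form: FormCandidate) -> int:
    """Walk from the form up to the root until a non-auto z-index is found."""
    for value in form.z_indexes:
        if value != "auto":
            return int(value)
    return 0  # assumption: an all-auto chain counts as level 0


def looks_like_modal(form: FormCandidate) -> bool:
    html = form.html.lower()
    return "modal" in html or "dialog" in html


def looks_like_login(form: FormCandidate) -> bool:
    html = form.html.lower()
    return any(s in html for s in ("login", "log in", "sign in"))


def choose_form(candidates: List[FormCandidate]) -> Optional[FormCandidate]:
    """Pick the best candidate using the criteria in order."""
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda f: (
            effective_z_index(f),     # 1. topmost form first
            looks_like_modal(f),      #    tie-break on "modal" / "dialog"
            not looks_like_login(f),  # 2. login forms rank lower
            f.num_inputs,             # 3. prefer more input fields
        ),
    )
```

Because Python compares tuples element by element, a later criterion only matters when all earlier ones tie, which matches the "in order" wording of the list.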

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
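A minimal sketch of the bottom-up walk, under simplifying assumptions: the page is given as well-formed XML so that Python's standard xml.etree.ElementTree can parse it (real pages would need a tolerant HTML parser or a live DOM), an email field is recognized by a crude keyword match, and any <button> element counts as a submit button.

```python
import xml.etree.ElementTree as ET


def is_email_input(el):
    """Crude check: does any common attribute of this <input> mention 'email'?"""
    if el.tag != "input":
        return False
    hints = " ".join(el.get(a, "") for a in ("type", "name", "id", "placeholder"))
    return "email" in hints.lower()


def has_submit_button(container):
    """True if the container or any descendant is a button or a submit input."""
    for el in container.iter():
        if el.tag == "button":
            return True
        if el.tag == "input" and el.get("type", "").lower() == "submit":
            return True
    return False


def find_logical_form(root):
    """Return the smallest ancestor of an email input that also holds a submit button."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter("input"):
        if not is_email_input(el):
            continue
        node = parents.get(el)
        while node is not None:
            if has_submit_button(node):
                return node
            node = parents.get(node)
    return None
```

The first ancestor that satisfies the check is also the smallest one, since the walk proceeds outward from the input field.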

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
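One way to combine the type attribute with the contextual hints is sketched below. The attribute list comes from the text; the keyword patterns and the returned labels are hypothetical and far smaller than what a real crawler would need.

```python
import re

# Attributes named in the text as the most frequent carriers of hints.
HINT_ATTRIBUTES = ("name", "class", "id", "placeholder", "value", "for", "title")

# Hypothetical keyword patterns, checked in order (more specific first).
FIELD_PATTERNS = [
    ("email", re.compile(r"e-?mail")),
    ("phone", re.compile(r"phone|mobile")),
    ("name", re.compile(r"name")),
]


def classify_field(attrs, inner_html=""):
    """Guess what a form field asks for from its attributes and inner HTML."""
    input_type = attrs.get("type", "text").lower()
    # HTML5 input types are decisive when a site actually uses them.
    if input_type == "email":
        return "email"
    if input_type == "tel":
        return "phone"
    # Otherwise fall back to hints scattered across attributes and the body.
    hints = " ".join(attrs.get(a, "") for a in HINT_ATTRIBUTES) + " " + inner_html
    hints = hints.lower()
    for field_type, pattern in FIELD_PATTERNS:
        if pattern.search(hints):
            return field_type
    return "unknown"
```

For example, classify_field({"type": "text", "placeholder": "Your e-mail"}) falls through the type check and is labeled from the placeholder text.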

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure, first on any pop-up windows and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts with content types that are non-text (i.e., not text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
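The recursive pruning of non-text subparts can be illustrated with Python's standard email package. This is a stand-in for illustration only: the server described above is built on SubEtha SMTP and the JavaMail API, and build_sample_message is a hypothetical helper that exists just to give the sketch something to prune.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def strip_non_text_parts(msg):
    """Recursively drop leaf subparts whose content type is not text/*."""
    if not msg.is_multipart():
        return
    kept = []
    for part in msg.get_payload():
        if part.is_multipart():
            strip_non_text_parts(part)  # descend into nested multipart bodies
            kept.append(part)
        elif part.get_content_maintype() == "text":
            kept.append(part)
        # any other leaf (image/*, application/*, ...) is discarded
    msg.set_payload(kept)


def build_sample_message():
    """Hypothetical test message: plain text, HTML, and an image attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Weekly deals"
    msg.set_content("plain body")
    msg.add_alternative("<p>html body</p>", subtype="html")
    msg.add_attachment(b"not really a PNG", maintype="image", subtype="png",
                       filename="logo.png")
    # Round-trip through bytes, as if the message had arrived over SMTP.
    return message_from_bytes(msg.as_bytes(), policy=policy.default)
```

Calling strip_non_text_parts on the sample message leaves the text/plain and text/html parts in place and removes the image, while the multipart containers that give the message its tree structure are preserved.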

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
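To show how such a list might be used, the sketch below precomputes transformed versions of an email address and searches a request URL for any of them. It is an assumption-laden illustration, not the detection code behind the results: it covers only a handful of the listed hashes and encodings (those available in Python's standard library), and the nesting depth of two layers is an arbitrary choice made here.

```python
import base64
import hashlib
import urllib.parse


def _hasher(name):
    """Return a function mapping bytes to the hex digest (as bytes) of `name`."""
    return lambda data: hashlib.new(name, data).hexdigest().encode()


# A small subset of the supported hashes and encodings listed above.
TRANSFORMS = {
    "md5": _hasher("md5"),
    "sha1": _hasher("sha1"),
    "sha256": _hasher("sha256"),
    "base16": base64.b16encode,
    "base32": base64.b32encode,
    "base64": base64.b64encode,
    "urlencoding": lambda data: urllib.parse.quote(data, safe="").encode(),
}


def candidate_tokens(email, max_depth=2):
    """Map each transformed form of the address to the chain that produced it."""
    tokens = {email.encode(): ()}
    frontier = dict(tokens)
    for _ in range(max_depth):
        next_frontier = {}
        for data, chain in frontier.items():
            for name, transform in TRANSFORMS.items():
                out = transform(data)
                if out not in tokens:
                    tokens[out] = chain + (name,)
                    next_frontier[out] = chain + (name,)
        frontier = next_frontier
    return tokens


def find_leaks(url, email, max_depth=2):
    """Return the transform chains whose output appears somewhere in the URL."""
    haystack = url.encode()
    tokens = candidate_tokens(email, max_depth)
    return sorted({chain for token, chain in tokens.items() if token in haystack})
```

An empty chain in the result means the address appears in plaintext; a chain such as ("sha256", "base64") means the URL contains the base64 encoding of the SHA-256 hex digest. The number of candidate tokens grows quickly with depth, so any exhaustive search of this kind has to be bounded.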

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party     Organization    Avg. # add’l parties    # S    # E
pippio.com            Acxiom          5.7                     7      32
liadm.com*            LiveIntent      3.7                     68     1097
rlcdn.com             Acxiom          1.7                     11     551
imiclk.com            MediaMath       1.3                     2      4
mathtag.com           MediaMath       1.1                     11     382
alcmpn.com            ALC†            0.8                     6      132
emltrk.com            Litmus          0.7                     41     638
acxiom-online.com     Acxiom          0.4                     2      33
dyneml.com            PowerInbox      0.1                     3      13
adnxs.com             AppNexus        0.1                     19     277

Table 14: Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S) out of 902 total and the number of emails (# E) out of 12,618 total on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel.

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don’t fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload


[9] Julia Angwin Why online tracking is getting creepier ProP-ublica Jun 2014

[10] Mika D Ayenson Dietrich James Wambach Ashkan SoltaniNathan Good and Chris Jay Hoofnagle Flash cookies andprivacy II Now with html5 and etag respawning 2011

[11] Bananatag Email Tracking for Gmail Outlook and otherclients httpsbananatagcomemail-tracking Onlineaccessed 2017-09-04

[12] Justin Brookman Phoebe Rouge Aaron Alva Alva andChristina Yeung Cross-device tracking Measurement anddisclosures In Proceedings of the Privacy Enhancing Tech-nologies Symposium 2017

[13] Ceren Budak Sharad Goel Justin Rao and Georgios ZervasUnderstanding emerging threats to online advertising InProceedings of the ACM Conference on Economics andComputation 2016

[14] ContactMonkey Email Tracking for Outlook and Gmailhttpswwwcontactmonkeycomemail-tracking Onlineaccessed 2017-09-04

[15] Zakir Durumeric David Adrian Ariana Mirian James Kas-ten Elie Bursztein Nicolas Lidzborski Kurt Thomas VijayEranti Michael Bailey and J Alex Halderman Neither snownor rain nor mitm An empirical analysis of email deliv-ery security In Proceedings of the 2015 ACM Conferenceon Internet Measurement Conference pages 27ndash39 ACM2015

[16] Peter Eckersley How unique is your web browser In In-ternational Symposium on Privacy Enhancing TechnologiesSymposium pages 1ndash18 Springer 2010

[17] Steven Englehardt and Arvind Narayanan Online trackingA 1-million-site measurement and analysis In ACM Confer-ence on Computer and Communications Security 2016

[18] Steven Englehardt Dillon Reisman Christian Eubank Pe-ter Zimmerman Jonathan Mayer Arvind Narayanan andEdward W Felten Cookies that give you away The surveil-lance implications of web tracking In Proceedings of the24th Conference on World Wide Web 2015

[19] David Fifield and Serge Egelman Fingerprinting web usersthrough font metrics In International Conference on Finan-cial Cryptography and Data Security 2015

[20] Gmail Help Choose whether to show images httpssupportgooglecommailanswer145919 Online accessed2017-09-06

[21] Ralph Holz Johanna Amann Olivier Mehani Mohamed AliKacircafar and Matthias Wachs TLS in the wild An internet-wide analysis of tls-based protocols for electronic commu-nication In 23nd Annual Network and Distributed SystemSecurity Symposium NDSS 2016 San Diego CaliforniaUSA February 21-24 2016 2016

[22] HubSpot Start Email Tracking Today httpswwwhubspotcomproductssalesemail-tracking Online ac-

cessed 2017-09-04[23] Balachander Krishnamurthy Konstantin Naryshkin and

Craig Wills Privacy leakage vs protection measures thegrowing disconnect In Proceedings of the Web 2011

[24] Balachander Krishnamurthy and Craig E Wills On the leak-age of personally identifiable information via online socialnetworks In Proceedings of the 2nd ACM workshop onOnline social networks pages 7ndash12 ACM 2009

[25] Pierre Laperdrix Walter Rudametkin and Benoit BaudryBeauty and the beast Diverting modern web browsers tobuild unique browser fingerprints In 37th IEEE Symposiumon Security and Privacy 2016

[26] Adam Lerner Anna Kornfeld Simpson Tadayoshi Kohnoand Franziska Roesner Internet jones and the raiders of thelost trackers An archaeological study of web tracking from1996 to 2016 In 25th USENIX Security Symposium 2016

[27] Timothy Libert Exposing the invisible web An analysis ofthird-party http requests on 1 million websites InternationalJournal of Communication 918 2015

[28] Jonathan R Mayer and John C Mitchell Third-party webtracking Policy and technology In 2012 IEEE Symposiumon Security and Privacy IEEE 2012

[29] Georg Merzdovnik Markus Huber Damjan Buhov NickNikiforakis Sebastian Neuner Martin Schmiedecker andEdgar Weippl Block me if you can A large-scale study oftracker-blocking tools In Proceedings of the 2nd IEEE Euro-pean Symposium on Security and Privacy (IEEE EuroSampP)2017

[30] Keaton Mowery and Hovav Shacham Pixel perfect Finger-printing canvas in HTML5 W2SP 2012

[31] Mozilla Support Remote Content in Messages httpssupportmozillaorgen-USkbremote-content-in-messagesOnline accessed 2017-09-04

[32] Nick Nikiforakis Alexandros Kapravelos Wouter JoosenChristopher Kruegel Frank Piessens and Giovanni VignaCookieless monster Exploring the ecosystem of web-baseddevice fingerprinting In Security and privacy (SP) 2013IEEE symposium on pages 541ndash555 IEEE 2013

[33] Lukasz Olejnik Gunes Acar Claude Castelluccia and Clau-dia Diaz The leaking battery A privacy analysis of theHTML5 Battery Status API Technical report 2015

[34] Jingjing Ren Ashwin Rao Martina Lindorfer ArnaudLegout and David Choffnes Recon Revealing and control-ling pii leaks in mobile network traffic In Proceedings of the14th Annual International Conference on Mobile SystemsApplications and Services pages 361ndash374 ACM 2016

[35] Franziska Roesner Tadayoshi Kohno and David WetherallDetecting and defending against third-party tracking onthe web In Proceedings of the 9th USENIX conferenceon Networked Systems Design and Implementation pages12ndash12 USENIX Association 2012

[36] scikit-learn Jaccard Similarity Score httpscikit-learnorgstablemodulesgeneratedsklearnmetricsjaccard_similarity_scorehtml Online accessed 2017-09-05

[37] Ashkan Soltani Shannon Canty Quentin Mayo LaurenThomas and Chris Jay Hoofnagle Flash cookies and pri-vacy In AAAI spring symposium intelligent informationprivacy management volume 2010 pages 158ndash163 2010

[38] Oleksii Starov Phillipa Gill and Nick Nikiforakis Are yousure you want to contact us quantifying the leakage of pii

I never signed up for this Privacy implications of email tracking 17

via website contact forms Proceedings on Privacy Enhanc-ing Technologies 2016(1)20ndash33 2016

[39] Oleksii Starov and Nick Nikiforakis Extended trackingpowers Measuring the privacy diffusion enabled by browserextensions In Proceedings of the 26th International Confer-ence on World Wide Web pages 1481ndash1490 2017

[40] Narseo Vallina-Rodriguez Christian Kreibich Mark Allmanand Vern Paxson Lumen Fine-grained visibility and controlof mobile traffic in user-space 2017

[41] W3C 410 Forms - HTML5 httpswwww3orgTRhtml5formshtml Online accessed 2017-09-07

[42] Yahoo Help Block images in your incoming Yahoo Mailemails httpshelpyahoocomkbSLN5043html Onlineaccessed 2017-09-06

[43] Zhonghao Yu Sam Macbeth Konark Modi and Josep MPujol Tracking the trackers In Proceedings of the 25thInternational Conference on World Wide Web pages 121ndash132 International World Wide Web Conferences SteeringCommittee 2016

10 Appendix

101 Form discovery and fillingmethodology

Choosing pages on which to search for forms Thecrawler searches through all links (ltagt tags) on the land-ing page to find pages that are most likely to contain amailing list form It does this by matching the link textand URL against a ranked list of terms which are shownin Table 1 As an initial step we filter out invisiblelinks and links to external sites We check that the linktext does not contain words in our blacklist which aimsto avoid unsubscribe pages and phone-based registra-tion If we have found any links that match the crawlerclicks on the one with the highest rank then runs theform-finding procedure on the new page and any newlyopened pop-up windows If no forms are found it goesback and repeats this process for the remaining linksThe reason for clicking on generic article links is thatwe have come across several news sites with newsletterforms only within article pages We also make sure toselect the English language or USEnglish locale whenavailable since our keywords are in English

Top-down form detection For each page thecrawler visits it first searches through the HTML DOMfor any potential email registration forms When sitesuse the standard ltformgt element it can simply iteratethrough each formrsquos input fields (ltinputgt tags) and seeif any text fields ask for an email address (by matchingon input type and keywords) If so it marks the form as

a candidate and then chooses the best candidate usingthe following criteria (in order)1 Always return the topmost form Any form stacked

on top of other elements is probably a modal or dia-log and we find that the most common use of thesecomponents is to promote a sitersquos mailing lists Werely on the z-index CSS property which specifies thestacking order of an element in relation to others (asa relative arbitrary integer) Note that most DOMelements take the default z-index value of auto in-heriting the actual value from its parent thus thecrawler recursively checks a formrsquos parent elementsuntil it finds a non-auto value or reaches the rootof the DOM tree To break ties it also searches forthe literal strings ldquomodalrdquo or ldquodialogrdquo within theformrsquos HTML since we find that such componentsare usually descriptively named

2 Rank login forms lower This is the other class offorms that often asks for an email address so thecrawler explicitly checks for the strings ldquologinrdquo ldquologinrdquo and ldquosign inrdquo within a formrsquos HTML to avoidthese when other candidates are present

3 Prefer forms with more input fields This is mainlyhelpful for identifying the correct follow-up form ifwe submit our email address in the footer of a pagethe same footer might be present on the page we getredirected to In this scenario the form we want topick is the longer one

Additionally registration forms are sometimes foundinside of inline frames (ltiframegt tag) which are ef-fectively separate HTML pages embedded in the mainpage If necessary we iterate through each frame andapply the same procedure to locate registration formswithin them

Bottom-up form detection A growing numberof sites place logical forms inside of generic containerelements (eg ltdivgt or ltspangt tags) without using anyltformgt tags Therefore if top-down form detection failswe take a bottom-up approach the crawler first iteratesthrough all the ltinputgt elements on the page to checkif any email address fields exist at all then recursivelyexamines their parents to find the first container thatalso contains a submit button This container is usuallythe smallest logical form unit that includes all of therelevant input fields

Determining form field type Once a form isdiscovered we need to determine which fields are con-tained in the form and fill each field with valid dataWe skip any invisible elements since a real user wouldnot be expected to fill them Some fields can be iden-

I never signed up for this Privacy implications of email tracking 18

tified by their type attribute alonemdashfor example telfor phone numbers and email for email addressesmdashbutthese specific types were introduced in the relatively re-cent HTML5 standard [41] and most websites still usethe general text type for all text inputs In our sur-vey of the top sites we found that contextual hints arescattered across many tag attributes with the most fre-quent being name class id placeholder value forand title In addition tags that contain HTML bod-ies (such as ltbuttongt tags) often contain hints in theinnerHTML

  • I never signed up for this! Privacy implications of email tracking
    • 1 Introduction
      • 1.1 Methods
      • 1.2 The state of email tracking
      • 1.3 Evaluating and improving defenses
    • 2 Related work
    • 3 Collecting a dataset of emails
    • 4 Privacy leaks when viewing emails
      • 4.1 Measurement methodology
      • 4.2 Email provides much of same tracking opportunities as the web
      • 4.3 Leaks of email addresses to third parties are common
      • 4.4 Reopening emails brings in new third parties
      • 4.5 Case study: LiveIntent
      • 4.6 Request blockers help, but don’t fix the problem
    • 5 Privacy leaks when clicking links in emails
      • 5.1 Measurement methodology
      • 5.2 Results
    • 6 Evaluation of defenses
      • 6.1 Landscape of defenses
      • 6.2 Survey of email clients
    • 7 Proposed defense
    • 8 Discussion and conclusion
    • 9 Acknowledgements
    • 10 Appendix
      • 10.1 Form discovery and filling methodology
      • 10.2 Mail server implementation
      • 10.3 Supported hash functions and encodings for leak detection
      • 10.4 Top parties redirecting to new third parties on email reload

[7] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of ACM CCS, pages 674–689. ACM, 2014.

[8] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. Fpdetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.

[9] Julia Angwin. Why online tracking is getting creepier. ProPublica, Jun 2014.

[10] Mika D. Ayenson, Dietrich James Wambach, Ashkan Soltani, Nathan Good, and Chris Jay Hoofnagle. Flash cookies and privacy II: Now with html5 and etag respawning. 2011.

[11] Bananatag. Email Tracking for Gmail, Outlook and other clients. https://bananatag.com/email-tracking. Online; accessed 2017-09-04.

[12] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung. Cross-device tracking: Measurement and disclosures. In Proceedings of the Privacy Enhancing Technologies Symposium, 2017.

[13] Ceren Budak, Sharad Goel, Justin Rao, and Georgios Zervas. Understanding emerging threats to online advertising. In Proceedings of the ACM Conference on Economics and Computation, 2016.

[14] ContactMonkey. Email Tracking for Outlook and Gmail. https://www.contactmonkey.com/email-tracking. Online; accessed 2017-09-04.

[15] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J. Alex Halderman. Neither snow nor rain nor mitm... An empirical analysis of email delivery security. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, pages 27–39. ACM, 2015.

[16] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies Symposium, pages 1–18. Springer, 2010.

[17] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM Conference on Computer and Communications Security, 2016.

[18] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. Cookies that give you away: The surveillance implications of web tracking. In Proceedings of the 24th Conference on World Wide Web, 2015.

[19] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, 2015.

[20] Gmail Help. Choose whether to show images. https://support.google.com/mail/answer/145919. Online; accessed 2017-09-06.

[21] Ralph Holz, Johanna Amann, Olivier Mehani, Mohamed Ali Kâafar, and Matthias Wachs. TLS in the wild: An internet-wide analysis of tls-based protocols for electronic communication. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.

[22] HubSpot. Start Email Tracking Today. https://www.hubspot.com/products/sales/email-tracking. Online; accessed 2017-09-04.

[23] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, 2011.

[24] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.

[25] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 37th IEEE Symposium on Security and Privacy, 2016.

[26] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.

[27] Timothy Libert. Exposing the invisible web: An analysis of third-party http requests on 1 million websites. International Journal of Communication, 9:18, 2015.

[28] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In 2012 IEEE Symposium on Security and Privacy. IEEE, 2012.

[29] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (IEEE EuroS&P), 2017.

[30] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. W2SP, 2012.

[31] Mozilla Support. Remote Content in Messages. https://support.mozilla.org/en-US/kb/remote-content-in-messages. Online; accessed 2017-09-04.

[32] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and privacy (SP), 2013 IEEE symposium on, pages 541–555. IEEE, 2013.

[33] Lukasz Olejnik, Gunes Acar, Claude Castelluccia, and Claudia Diaz. The leaking battery: A privacy analysis of the HTML5 Battery Status API. Technical report, 2015.

[34] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374. ACM, 2016.

[35] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 12–12. USENIX Association, 2012.

[36] scikit-learn. Jaccard Similarity Score. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html. Online; accessed 2017-09-05.

[37] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI spring symposium: intelligent information privacy management, volume 2010, pages 158–163, 2010.

[38] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are you sure you want to contact us? Quantifying the leakage of pii via website contact forms. Proceedings on Privacy Enhancing Technologies, 2016(1):20–33, 2016.

[39] Oleksii Starov and Nick Nikiforakis. Extended tracking powers: Measuring the privacy diffusion enabled by browser extensions. In Proceedings of the 26th International Conference on World Wide Web, pages 1481–1490, 2017.

[40] Narseo Vallina-Rodriguez, Christian Kreibich, Mark Allman, and Vern Paxson. Lumen: Fine-grained visibility and control of mobile traffic in user-space. 2017.

[41] W3C. 4.10 Forms - HTML5. https://www.w3.org/TR/html5/forms.html. Online; accessed 2017-09-07.

[42] Yahoo Help. Block images in your incoming Yahoo Mail emails. https://help.yahoo.com/kb/SLN5043.html. Online; accessed 2017-09-06.

[43] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M. Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.

10 Appendix

10.1 Form discovery and filling methodology

Choosing pages on which to search for forms. The crawler searches through all links (<a> tags) on the landing page to find pages that are most likely to contain a mailing list form. It does this by matching the link text and URL against a ranked list of terms, which are shown in Table 1. As an initial step, we filter out invisible links and links to external sites. We check that the link text does not contain words in our blacklist, which aims to avoid unsubscribe pages and phone-based registration. If we have found any links that match, the crawler clicks on the one with the highest rank, then runs the form-finding procedure on the new page and any newly opened pop-up windows. If no forms are found, it goes back and repeats this process for the remaining links. The reason for clicking on generic article links is that we have come across several news sites with newsletter forms only within article pages. We also make sure to select the English language or US/English locale when available, since our keywords are in English.
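The link-selection step can be illustrated with a minimal Python sketch. It is not the crawler's actual code: the Link record, the rank_links function, and both term lists are hypothetical stand-ins (the real ranked terms are those of Table 1), and visibility is assumed to have been computed already by the browser.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical term lists for illustration only; earlier entries rank higher.
RANKED_TERMS = ["newsletter", "subscribe", "sign up", "register"]
BLACKLIST = ["unsubscribe", "mobile", "phone"]


@dataclass
class Link:
    text: str
    url: str
    visible: bool = True


def rank_links(links, site_host):
    """Return the links worth clicking, best-ranked first."""
    scored = []
    for link in links:
        host = urlparse(link.url).hostname
        # Drop invisible links and links to external sites
        # (relative URLs have no hostname and count as internal).
        if not link.visible or (host is not None and host != site_host):
            continue
        text = link.text.lower()
        # The blacklist is checked against the link text only.
        if any(word in text for word in BLACKLIST):
            continue
        haystack = text + " " + link.url.lower()
        ranks = [i for i, term in enumerate(RANKED_TERMS) if term in haystack]
        if ranks:
            scored.append((min(ranks), link))
    scored.sort(key=lambda pair: pair[0])  # stable sort keeps page order on ties
    return [link for _, link in scored]
```

A crawler built this way would click the first returned link, run form detection, and fall back to the next link if no form is found.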

Top-down form detection. For each page the crawler visits, it first searches through the HTML DOM for any potential email registration forms. When sites use the standard <form> element, it can simply iterate through each form’s input fields (<input> tags) and see if any text fields ask for an email address (by matching on input type and keywords). If so, it marks the form as a candidate, and then chooses the best candidate using the following criteria (in order):

1. Always return the topmost form. Any form stacked on top of other elements is probably a modal or dialog, and we find that the most common use of these components is to promote a site’s mailing lists. We rely on the z-index CSS property, which specifies the stacking order of an element in relation to others (as a relative, arbitrary integer). Note that most DOM elements take the default z-index value of auto, inheriting the actual value from their parents; thus the crawler recursively checks a form’s parent elements until it finds a non-auto value or reaches the root of the DOM tree. To break ties, it also searches for the literal strings “modal” or “dialog” within the form’s HTML, since we find that such components are usually descriptively named.

2. Rank login forms lower. This is the other class of forms that often asks for an email address, so the crawler explicitly checks for the strings “login” and “sign in” within a form’s HTML to avoid these when other candidates are present.

3. Prefer forms with more input fields. This is mainly helpful for identifying the correct follow-up form: if we submit our email address in the footer of a page, the same footer might be present on the page we get redirected to. In this scenario, the form we want to pick is the longer one.

Additionally, registration forms are sometimes found inside of inline frames (<iframe> tag), which are effectively separate HTML pages embedded in the main page. If necessary, we iterate through each frame and apply the same procedure to locate registration forms within them.
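One way to express the ordered criteria above is as a tuple-valued sort key, sketched below in Python. The Element class, its fields, and the function names are hypothetical; in particular, treating an element with no non-auto ancestor as z-index 0 is an assumption of this sketch, not something the text specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    html: str = ""
    z_index: str = "auto"  # CSS value as a string: "auto" or an integer
    parent: Optional["Element"] = None
    num_inputs: int = 0


def effective_z_index(el):
    """Walk up the parents until a non-auto z-index is found."""
    node = el
    while node is not None:
        if node.z_index != "auto":
            return int(node.z_index)
        node = node.parent
    return 0  # assumption: reaching the root counts as the lowest layer


def sort_key(form):
    html = form.html.lower()
    looks_modal = "modal" in html or "dialog" in html
    looks_login = "login" in html or "sign in" in html
    # Tuples compare left to right, mirroring the order of the criteria:
    # stacking order, modal/dialog tie-break, not a login form, input count.
    return (effective_z_index(form), looks_modal, not looks_login, form.num_inputs)


def best_candidate(forms):
    """Pick the candidate form that ranks highest under the criteria."""
    return max(forms, key=sort_key)
```

Encoding the criteria as a tuple keeps the priority order explicit: a later criterion only matters when all earlier ones tie.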

Bottom-up form detection. A growing number of sites place logical forms inside of generic container elements (e.g., <div> or <span> tags), without using any <form> tags. Therefore, if top-down form detection fails, we take a bottom-up approach: the crawler first iterates through all the <input> elements on the page to check if any email address fields exist at all, then recursively examines their parents to find the first container that also contains a submit button. This container is usually the smallest logical form unit that includes all of the relevant input fields.
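The bottom-up search can be sketched on a toy DOM tree as follows. The Node class and the helper names are hypothetical, and the email-field and submit-button checks are simplified assumptions (any attribute value containing "email", and either a <button> or an <input type="submit">).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def descendants(node):
    """Yield every node below `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def is_email_input(node):
    if node.tag != "input":
        return False
    hints = " ".join(str(v) for v in node.attrs.values()).lower()
    return "email" in hints


def is_submit(node):
    if node.tag == "button":
        return True
    return node.tag == "input" and node.attrs.get("type") == "submit"


def find_logical_form(root):
    """Return the smallest ancestor of an email field that also holds a submit button."""
    for node in descendants(root):
        if not is_email_input(node):
            continue
        container = node.parent
        while container is not None:
            if any(is_submit(d) for d in descendants(container)):
                return container
            container = container.parent
    return None
```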

Determining form field type. Once a form is discovered, we need to determine which fields are contained in the form and fill each field with valid data. We skip any invisible elements, since a real user would not be expected to fill them. Some fields can be identified by their type attribute alone (for example, tel for phone numbers and email for email addresses), but these specific types were introduced in the relatively recent HTML5 standard [41], and most websites still use the general text type for all text inputs. In our survey of the top sites, we found that contextual hints are scattered across many tag attributes, with the most frequent being name, class, id, placeholder, value, for, and title. In addition, tags that contain HTML bodies (such as <button> tags) often contain hints in the innerHTML.
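A minimal sketch of this kind of field classification is shown below. The attribute list comes from the text; the keyword patterns, the category names, and the classify_field function are hypothetical choices made for illustration.

```python
import re

# Attributes named in the text as the most frequent carriers of hints.
HINT_ATTRIBUTES = ["name", "class", "id", "placeholder", "value", "for", "title"]

# Hypothetical keyword patterns, checked in order; not the crawler's real list.
FIELD_PATTERNS = [
    ("email", re.compile(r"e-?mail")),
    ("phone", re.compile(r"phone|\btel\b|mobile")),
    ("zip", re.compile(r"zip|postal")),
    ("name", re.compile(r"name")),
]


def classify_field(attrs, inner_html=""):
    """Guess what a form field asks for from its attributes and inner HTML."""
    input_type = attrs.get("type", "text").lower()
    # HTML5 input types identify the field on their own.
    if input_type == "email":
        return "email"
    if input_type == "tel":
        return "phone"
    # Otherwise, fall back to contextual hints in attribute values.
    hints = " ".join(attrs.get(a, "") for a in HINT_ATTRIBUTES) + " " + inner_html
    hints = hints.lower()
    for field_type, pattern in FIELD_PATTERNS:
        if pattern.search(hints):
            return field_type
    return "unknown"
```

Checking the more specific patterns first matters: a field named "email_name" would otherwise be misread as a name field.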

Handling two-part form submissions. After submitting a form, we are sometimes prompted to fill out another, longer form before the registration is accepted. This second form might appear on the same page (i.e., using JavaScript) or on a separate page, either through a redirect or as a pop-up window. We take a simplistic approach: the crawler waits a few seconds, then applies the same form-finding procedure, first on any pop-up windows and then on the original window. This approach may have the effect of submitting the same form twice, but we argue that this does not produce any adverse results: duplicate form submissions are a plausible user interaction that web services should be expected to handle gracefully.

10.2 Mail server implementation

The mail server receives emails using SubEtha SMTP, a library offering a simple, low-level API to handle incoming mail. The server accepts any mail sent to (RCPT TO) an existing email address, and rejects it otherwise. The mail contents (DATA) are parsed in MIME format using the JavaMail API, and the raw message contents are written to disk. MIME messages consist of a set of headers and a content body, with the required Content-Type header indicating the format of the content; notably, a multipart content body contains additional MIME message subparts, enabling messages to be arranged in a tree structure. To save disk space, we recursively scan multipart MIME messages for subparts whose content types are not text (text/*), such as attached images or other data, and discard them before storing the messages, since we do not examine any non-textual content.
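The recursive pruning of non-text subparts can be sketched as follows. The server described above is built on SubEtha SMTP and the JavaMail API; this sketch instead uses Python's standard email package as a stand-in, and the function name strip_non_text and the example message are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def strip_non_text(msg):
    """Recursively drop leaf subparts whose content type is not text/*."""
    if not msg.is_multipart():
        return
    kept = []
    for part in msg.get_payload():
        if part.is_multipart():
            strip_non_text(part)  # descend into nested multipart containers
            kept.append(part)
        elif part.get_content_maintype() == "text":
            kept.append(part)
    msg.set_payload(kept)


# Example: a newsletter with plain-text and HTML bodies plus an image attachment.
original = EmailMessage()
original["Subject"] = "Newsletter"
original.set_content("plain body")
original.add_alternative("<p>html body</p>", subtype="html")
original.add_attachment(b"\x89PNG", maintype="image", subtype="png",
                        filename="logo.png")

parsed = message_from_bytes(original.as_bytes(), policy=policy.default)
strip_non_text(parsed)
types = [part.get_content_type() for part in parsed.walk()]
```

After pruning, the message tree keeps its multipart containers and text leaves, so the HTML body that carries tracking markup is still available for analysis.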

10.3 Supported hash functions and encodings for leak detection

Supported hashes and checksums: md2, md4, md5, sha, sha1, sha256, sha224, sha384, sha3-224, sha3-256, sha3-384, sha3-512, murmurhash2 (signed and unsigned), murmurhash3 32-bit, murmurhash3 64-bit, murmurhash3 128-bit, ripemd160, whirlpool, blake2b, blake2s, crc32, adler32.

Supported encodings: base16, base32, base58, base64, urlencoding, deflate, gzip, zlib, entity, yenc.
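To illustrate how such a list might be used, the sketch below builds candidate tokens from an email address by stacking a small subset of the hashes and encodings above, then searches a request URL for any of them. It assumes that leak detection amounts to substring search over layered transformations; the function names, the two-layer default, and the chosen subset (md5, sha1, sha256, base16, base32, base64, urlencoding) are assumptions of this sketch.

```python
import base64
import hashlib
import urllib.parse


def layer(value: bytes):
    """Apply each supported transformation once (a small subset of the full list)."""
    for hash_fn in (hashlib.md5, hashlib.sha1, hashlib.sha256):
        yield hash_fn(value).hexdigest().encode("ascii")
    yield base64.b16encode(value).lower()
    yield base64.b32encode(value)
    yield base64.b64encode(value)
    yield urllib.parse.quote(value, safe="").encode("ascii")


def candidate_tokens(email, depth=2):
    """All strings reachable from the address with up to `depth` stacked layers."""
    tokens = {email.encode("utf-8")}
    frontier = set(tokens)
    for _ in range(depth):
        frontier = {out for value in frontier for out in layer(value)} - tokens
        tokens |= frontier
    return {token.decode("utf-8") for token in tokens}


def leaks_address(url, email, depth=2):
    """True if the URL contains the address or any layered hash/encoding of it."""
    haystacks = (url, urllib.parse.unquote(url))
    return any(token in hay
               for token in candidate_tokens(email, depth)
               for hay in haystacks)
```

The number of candidates grows exponentially with the number of layers, so the depth has to stay small when the full list of hashes and encodings is used.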

10.4 Top parties redirecting to new third parties on email reload

Redirecting Party     Organization   Avg. # add’l parties   # S    # E
pippio.com            Acxiom         5.7                    7      32
liadm.com*            LiveIntent     3.7                    68     1097
rlcdn.com             Acxiom         1.7                    11     551
imiclk.com            MediaMath      1.3                    2      4
mathtag.com           MediaMath      1.1                    11     382
alcmpn.com            ALC†           0.8                    6      132
emltrk.com            Litmus         0.7                    41     638
acxiom-online.com     Acxiom         0.4                    2      33
dyneml.com            PowerInbox     0.1                    3      13
adnxs.com             AppNexus       0.1                    19     277

Table 14: Top parties by average number of new third-party resources in a redirect chain when an email is reloaded. The number of senders (# S), out of 902 total, and the number of emails (# E), out of 12,618 total, on which this occurs is given for each redirecting party. We exclude redirecting parties that only exhibit this behavior in emails from a single sender. In total, there are 12 parties which exhibit this type of redirect behavior.
* Includes statistics for chains which redirect to http://p.liadm.com/imp in the first redirect. We observe a common pattern of URLs of the form li.firstparty.com redirecting first to this endpoint, which then redirects to a number of other third parties.
† American List Counsel.
