
StegaStamp: Invisible Hyperlinks in Physical Photographs

Matthew Tancik∗   Ben Mildenhall∗   Ren Ng
University of California, Berkeley

{tancik, bmild, ren}@berkeley.edu

Abstract

Imagine a world in which each photo, printed or digitally displayed, hides arbitrary digital data that can be accessed through an internet-connected imaging system. Another way to think about this is physical photographs that have unique QR codes invisibly embedded within them. This paper presents an architecture, algorithms, and a prototype implementation addressing this vision. Our key technical contribution is StegaStamp, the first steganographic algorithm to enable robust encoding and decoding of arbitrary hyperlink bitstrings into photos in a manner that approaches perceptual invisibility. StegaStamp comprises a deep neural network that learns an encoding/decoding algorithm robust to image perturbations that approximate the space of distortions resulting from real printing and photography. Our system prototype demonstrates real-time decoding of hyperlinks for photos from in-the-wild video subject to real-world variation in print quality, lighting, shadows, perspective, occlusion and viewing distance. Our prototype system robustly retrieves 56 bit hyperlinks after error correction – sufficient to embed a unique code within every photo on the internet.

1. Introduction

Our vision is a future in which each photo in the real world invisibly encodes a unique hyperlink to arbitrary information. This information is accessed by pointing a camera at the photo and using the system described in this paper to decode and follow the hyperlink. In the future, augmented-reality (AR) systems may perform this task continuously, visually overlaying retrieved information alongside each photo in the user's view.

Our approach is related to the ubiquitous QR code and similar technologies, which are now commonplace for a wide variety of data-transfer tasks, such as sharing web addresses, purchasing goods, and tracking inventory. Our approach can be thought of as a complementary solution that avoids visible, ugly barcodes, and enables digital information to be invisibly and ambiently embedded into the ubiquitous imagery of the modern visual world.

∗ Authors contributed equally to this work.

It is worth taking a moment to consider three potential use cases of our system. First, at the farmer's market, a stand owner may add photos of each type of produce alongside the price, encoded with extra information for customers about the source farm, nutrition information, recipes, and seasonal availability. Second, in the lobby of a university department, a photo directory of faculty may be augmented by encoding a unique URL for each person's photo that contains the professor's webpage, office hours, location, and directions. Third, New York City's Times Square is plastered with digital billboards. Each image frame displayed may be encoded with a URL containing further information about the products, company, and promotional deals.

Figure 1 presents an overview of our system, which we call StegaStamp, in the context of a typical usage flow. The inputs are an image and a desired hyperlink. First, we assign the hyperlink a unique bit string (analogous to the process used by URL-shortening services such as tinyurl.com). Second, we use our StegaStamp encoder to embed the bit string into the target image. This produces an encoded image that is ideally perceptually identical to the input image. As described in detail in Section 4, our encoder is implemented as a deep neural network jointly trained with a second network that implements decoding. Third, the encoded image is physically printed (or shown on an electronic display) and presented in the real world. Fourth, a user takes a photo that contains the physical print. Fifth, the system uses an image detector to identify and crop out all images. Sixth, each image is processed with the StegaStamp decoder to retrieve the unique bitstring, which is used to follow the hyperlink and retrieve the information associated with the image.

Steganography [7] is the name of the main technical problem faced in this work – hiding secret data in a non-secret message, in this case an image. The key technical challenge solved in this paper is to dramatically increase the robustness and performance of steganography so that it can retrieve an arbitrary hyperlink in real-world scenarios.

arXiv:1904.05343v1 [cs.CV] 10 Apr 2019

Figure 1: Our deep learning system is trained to hide hyperlinks in images. First, an encoder network processes the input image and hyperlink bitstring into a StegaStamp (encoded image). The StegaStamp is then printed and captured by a camera. A detection network localizes and rectifies the StegaStamp before passing it to the decoder network. After the bits are recovered and error corrected, the user can follow the hyperlink. To train the encoder and decoder networks, we simulate the corruptions caused by printing, reimaging, and detecting the StegaStamp with a set of differentiable image augmentations.

Most previous work on image steganography [4, 15, 33, 34, 36] was designed for perfect digital transmission, and fails completely (see Figure 6, "None") under "physical transmission" – i.e., presenting images physically and using real cameras to capture an image for decoding. The closest work to our own is the HiDDeN system [38], which introduces various types of noise between encoding and decoding to increase robustness, but even this does not improve performance beyond guessing rate after undergoing "physical transmission" (see Figure 6, "Pixelwise").

We show how to achieve robust decoding even under "physical transmission," delivering excellent performance sufficient to encode and retrieve arbitrary hyperlinks for an essentially limitless number of images. Achieving this requires two main technical contributions. First, we extend the traditional steganography framework by adding a set of differentiable image corruptions between the encoder and decoder that successfully approximate the space of distortions resulting from "physical transmission" (i.e., real printing/display and photography). The result is robust retrieval of 95% of encoded bits in real-world conditions. Second, we show how to preserve high image quality – our prototype can encode 100 bits while preserving excellent perceptual image quality. Together, these allow our prototype to uniquely encode hidden hyperlinks for orders of magnitude more images than exist on the internet today (upper bounded by 100 trillion).

2. Related Work

2.1. Steganography

Steganography is the act of hiding data within other data and has a long history that can be traced back to ancient Greece. Our proposed task is a type of steganography where we hide a code within an image. Various methods have been developed for digital image steganography. Data can be hidden in the least significant bit of the image, subtle color variations, and subtle luminosity variations. Often methods are designed to evade steganalysis, the detection of hidden messages [12, 28]. We refer the interested reader to the survey by Cheddad et al. [7] that reviews a wide set of techniques.

The works most relevant to our proposal are methods that utilize deep learning to both encode and decode a message hidden inside an image [4, 15, 33, 34, 36]. Our method differs from existing techniques as we assume that the image will be corrupted by the display-imaging pipeline between the encoding and decoding steps. With the exception of HiDDeN [38], small image manipulations or corruptions would render existing techniques useless, as their target is encoding a large number of bits-per-pixel in a context of perfect digital transmission. HiDDeN introduces various types of noise between encoding and decoding to increase robustness but focuses only on the set of corruptions that would occur through digital image manipulations (e.g., JPEG compression and cropping). For use as a physical barcode, the decoder cannot assume perfect alignment, given the perspective shifts and pixel resampling guaranteed to occur when taking a casual photo.

Similar work exists addressing the digital watermarking problem, where copyright information and other metadata is imperceptibly hidden in a photograph or document, using both traditional [9] and learning-based [18] techniques. Some methods [25] particularly target the StirMark benchmark [27, 26]. StirMark tests watermarking techniques against a set of digital attacks, including geometrical distortions and image compression. However, digital watermarking methods are not typically designed to work after printing and recapturing an image, unlike our method.

2.2. Barcodes

Barcodes are one of the most popular solutions for transmitting a short string of data to a computing device, requiring only simple hardware (a laser reader or camera) and an area for printing or displaying the code. Traditional barcodes are a one dimensional pattern where bars of alternating thickness encode different values. The ubiquity of high quality cellphone cameras has led to the frequent use of two dimensional QR codes to transmit data to and from phones. For example, users can share contact information with one another, pay for goods, track inventory, or retrieve a coupon from an advertisement.

Past research has addressed the issue of robustly decoding existing and new barcode designs using cameras [22, 24]. Some designs particularly take advantage of the increased capabilities of cameras beyond simple laser scanners, such as incorporating color into the barcode [6]. Other work has proposed a method that determines where a barcode should be placed on an image and what color should be used to improve machine readability [23].

Another type of barcode is specially designed to transmit both a small identifier and a precise six degree-of-freedom orientation for camera localization or calibration, e.g., ArUco markers [13, 29]. Hu et al. [16] train a deep network to localize and identify ArUco markers in challenging real world conditions using data augmentation similar to ours. However, their focus is robust detection of highly visible preexisting markers, as opposed to robust decoding of messages hidden in arbitrary natural images.

2.3. Robust Adversarial Image Attacks

Adversarial image attacks on object classification CNNs are designed to minimally perturb an image such that the network produces an incorrect classification. Most relevant to our work are the demonstrations of adversarial examples in the physical world [3, 8, 11, 20, 21, 31, 32], where systems are made robust for imaging applications by modeling physically realistic perturbations (i.e., affine image warping, additive noise, and JPEG compression). Jan et al. [20] take a different approach, explicitly training a neural network to replicate the distortions added by an imaging system and showing that applying the attack to the distorted image increases the success rate.

These results demonstrate that networks can still be affected by small perturbations after the image has gone through an imaging pipeline. Our proposed task shares some similarities; however, classification targets 1 of n ≈ 2^10 labels, while we aim to uniquely decode 1 of 2^m messages, where m ≈ 100 is the number of encoded bits. Additionally, adversarial attacks do not assume the ability to modify the decoder network, whereas we explicitly train our decoder to cooperate with our encoder for maximum information transferal.

Figure 2: Examples of encoded images (panels: original image, StegaStamp, residual). The residual is calculated by the encoder network and added back to the original image to produce the encoded StegaStamp. These examples have 100 bit encoded messages and are robust to the image perturbations that occur through the printing and imaging pipelines.

3. Training for Real World Robustness

During training, we apply a set of differentiable image perturbations outlined in Figure 3 between the encoder and decoder to approximate the possible distortions caused by physically displaying and imaging the StegaStamps. Previous work on synthesizing robust adversarial examples used a similar method to attack classification networks in the wild (termed "Expectation over Transformation"), though they used a more limited set of transformations [3]. HiDDeN [38] used nonspatial perturbations to augment their steganography pipeline against digital perturbations only. Deep ChArUco [16] used both spatial and nonspatial perturbations to train a robust detector for the fixed category of ChArUco fiducial marker boards. We combine ideas from all of these works, training an encoder and decoder that cooperate to robustly transmit hidden messages through a physical display-imaging pipeline.

Figure 3: Image perturbation pipeline (input, then perspective warp, motion/defocus blur, color manipulation, noise, and JPEG compression). During training, we approximate the effects of a physical display-imaging pipeline in order to make our model robust for use in the real world. We take the output of the encoding network and apply the random transformations shown here before passing the image through the decoding network (see Section 3 for details).

3.1. Perspective Warp

Assuming a pinhole camera model, any two images of the same planar surface can be related by a homography. We generate a random homography to simulate the effect of a camera that is not precisely aligned with the encoded image marker. To sample the space of homographies, we perturb the four corner locations of the marker uniformly at random within a fixed range (up to ±40 pixels from their original coordinates) and solve a least squares problem to recover the homography that maps the original corners to their new locations. We bilinearly resample the original image to create the perspective warped image.
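A minimal NumPy/OpenCV sketch of this corner-jitter sampling is below; the ±40 pixel range follows the text, while the use of cv2 for the bilinear resampling is our own implementation choice rather than the paper's code.

```python
import numpy as np
import cv2  # used only for the bilinear warp; the homography itself is solved directly

def random_perspective_warp(img, max_shift=40, rng=None):
    """Jitter the four marker corners and warp the image under the resulting homography."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)

    # Standard 8x8 DLT system A x = b for the homography (h33 fixed to 1),
    # solved in a least-squares sense as described above.
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    params, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    H = np.append(params, 1.0).reshape(3, 3)

    # Bilinearly resample the original image under the sampled homography.
    return cv2.warpPerspective(img, H, (w, h), flags=cv2.INTER_LINEAR)
```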

3.2. Motion and Defocus Blur

Blur can result from both camera motion and inaccurate autofocus. To simulate motion blur, we sample a random angle and generate a straight line blur kernel with a width between 3 and 7 pixels. To simulate misfocus, we use a Gaussian blur kernel with its standard deviation randomly sampled between 1 and 3 pixels.
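A sketch of these two blur augmentations, with the kernel size and sigma ranges taken from the text; the exact kernel construction is our own approximation.

```python
import numpy as np
import cv2

def random_blur(img, rng=None):
    """Apply either a straight-line motion blur or a Gaussian defocus blur."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        # Motion blur: a line kernel 3-7 px long at a random angle.
        length = int(rng.integers(3, 8))
        angle = rng.uniform(0, np.pi)
        kernel = np.zeros((7, 7), np.float32)
        c = 3  # kernel center
        for t in np.linspace(-(length - 1) / 2, (length - 1) / 2, num=2 * length):
            x = int(round(c + t * np.cos(angle)))
            y = int(round(c + t * np.sin(angle)))
            kernel[y, x] = 1.0
        kernel /= kernel.sum()
        return cv2.filter2D(img, -1, kernel)
    # Defocus blur: Gaussian kernel with sigma sampled from [1, 3].
    sigma = rng.uniform(1.0, 3.0)
    return cv2.GaussianBlur(img, (0, 0), sigma)
```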

3.3. Noise

Noise introduced by camera systems is well studied and includes photon noise, dark noise, and shot noise [14]. In our system we assume standard non-photon-starved imaging conditions. We employ a Gaussian noise model (sampling the standard deviation σ ∼ U[0, 0.2]) to account for the imaging noise. We have found that printing does not introduce significant pixel-wise noise, since the printer will typically act to reduce the noise, as the mixing of the ink acts as a low pass filter.
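The noise model reduces to a one-line augmentation; the snippet below assumes images normalized to [0, 1].

```python
import numpy as np

def random_gaussian_noise(img, rng=None):
    # Zero-mean Gaussian noise with sigma ~ U[0, 0.2]; img is assumed to lie in [0, 1].
    rng = rng or np.random.default_rng()
    sigma = rng.uniform(0.0, 0.2)
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)
```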

3.4. Color Manipulation

Printers and displays have a limited gamut compared to the full RGB color space. Cameras modify their output using exposure settings, white balance, and a color correction matrix. We approximate these perturbations with a series of random affine color transformations (constant across the whole image) as follows:

1. Hue shift: adding a random color offset to each of the RGB channels, sampled uniformly from [−0.1, 0.1].

2. Desaturation: randomly linearly interpolating between the full RGB image and its grayscale equivalent.

3. Brightness and contrast: affine histogram rescaling mx + b with b ∼ U[−0.3, 0.3] and m ∼ U[0.5, 1.5].

After these transforms, we clip the color channels back to the range [0, 1].
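Putting the three transforms and the final clip together, one possible implementation is sketched below; the grayscale conversion and the order of the transforms are assumptions not specified above.

```python
import numpy as np

def random_color_transform(img, rng=None):
    """Random affine color jitter, constant across the image; img is HxWx3 in [0, 1]."""
    rng = rng or np.random.default_rng()
    out = img.astype(np.float32)

    # 1. Hue shift: a random per-channel offset in [-0.1, 0.1].
    out = out + rng.uniform(-0.1, 0.1, size=(1, 1, 3)).astype(np.float32)

    # 2. Desaturation: blend toward a grayscale version by a random amount.
    gray = out.mean(axis=2, keepdims=True)  # simple luminance proxy (assumption)
    t = rng.uniform(0.0, 1.0)
    out = (1 - t) * out + t * gray

    # 3. Brightness/contrast: y = m*x + b with m ~ U[0.5, 1.5], b ~ U[-0.3, 0.3].
    m, b = rng.uniform(0.5, 1.5), rng.uniform(-0.3, 0.3)
    out = m * out + b

    # Clip the channels back to [0, 1] as described in the text.
    return np.clip(out, 0.0, 1.0)
```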

3.5. JPEG Compression

Camera images are usually stored in a lossy format, such as JPEG. JPEG compresses images by computing the discrete cosine transform of each 8 × 8 block in the image and quantizing the resulting coefficients by rounding to the nearest integer (at varying strengths for different frequencies). This rounding step is not differentiable. We use the trick from [31] for approximating the quantization step near zero with the piecewise function

q(x) = x^3  if |x| < 0.5,
q(x) = x    if |x| ≥ 0.5,        (1)

which has nonzero derivative almost everywhere. We sample the JPEG quality uniformly within [50, 100].
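A sketch of how this approximation slots into the quantization step. The full JPEG pipeline (blockwise DCT, quality-dependent quantization tables) is omitted, and in training these operations would be written with differentiable framework ops rather than NumPy.

```python
import numpy as np

def approx_round(x):
    # Eq. (1): x**3 near zero (|x| < 0.5), identity elsewhere -- the trick from [31].
    return np.where(np.abs(x) < 0.5, x ** 3, x)

def approx_quantize(dct_block, quant_table):
    # Standard JPEG quantization of an 8x8 DCT block, with the non-differentiable
    # rounding replaced by approx_round.
    return approx_round(dct_block / quant_table) * quant_table

# Each training iteration would sample the JPEG quality uniformly from [50, 100],
# e.g. quality = np.random.uniform(50, 100), and map it to a quantization table.
```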

4. Implementation Details

4.1. Encoder

The encoder is trained to embed a message into an image while minimizing perceptual differences between the input and encoded images. We use a U-Net [30] style architecture that receives a four channel 400 × 400 pixel input (RGB channels from the input image plus one for the message) and outputs a three channel RGB residual image. The input message is represented as a 100 bit binary string, processed through a fully connected layer to form a 50 × 50 × 3 tensor, then upsampled to produce a 400 × 400 × 3 tensor. We find that applying this preprocessing to the message aids convergence. To enforce minimal perceptual distortion on the encoded StegaStamp, we use an L2 loss, the LPIPS perceptual loss [37], and a critic loss calculated between the encoded image and the original image. We present examples of encoded images in Figure 2.
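A simplified TensorFlow/Keras sketch of the message preprocessing and residual prediction described above; the small conv stack stands in for the full U-Net, and the layer widths and exact channel layout of the concatenated message tensor are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(image_size=400, msg_bits=100):
    image = layers.Input((image_size, image_size, 3))
    message = layers.Input((msg_bits,))

    # Project the 100-bit message to a 50x50x3 tensor, then upsample to image resolution.
    m = layers.Dense(50 * 50 * 3, activation="relu")(message)
    m = layers.Reshape((50, 50, 3))(m)
    m = layers.UpSampling2D(size=image_size // 50)(m)  # 400x400x3 message tensor

    # Concatenate with the image and predict an RGB residual (placeholder for the U-Net).
    x = layers.Concatenate()([image, m])
    for filters in (32, 64, 32):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    residual = layers.Conv2D(3, 1, padding="same", activation=None)(x)

    encoded = layers.Add()([image, residual])  # StegaStamp = image + residual
    return tf.keras.Model([image, message], [encoded, residual])
```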

4.2. Decoder

The decoder is a network trained to recover the hidden message from the encoded image. A spatial transformer network [19] is used to develop robustness against small perspective changes that are introduced while capturing and rectifying the encoded image. The transformed image is fed through a series of convolutional and dense layers and a sigmoid to produce a final output with the same length as the message. The decoder network is supervised using cross entropy loss.
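A compact sketch of such a decoder; the spatial transformer module is omitted for brevity and the layer sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_decoder(image_size=400, msg_bits=100):
    x_in = layers.Input((image_size, image_size, 3))
    x = x_in  # in the full system, a spatial transformer [19] would pre-align the input here
    for filters in (32, 64, 64, 128):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    bits = layers.Dense(msg_bits, activation="sigmoid")(x)  # one probability per message bit
    return tf.keras.Model(x_in, bits)

# Supervision: binary cross entropy between decoded probabilities and the true bits,
# e.g. tf.keras.losses.BinaryCrossentropy()(true_bits, predicted_bits).
```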

4.3. Detector

For real world use, we must detect and rectify StegaStamps within a wide field of view image before decoding them, since the decoder network alone is not designed to handle full detection within a much larger image. We fine-tune an off-the-shelf semantic segmentation network, BiSeNet [35], to segment areas of the image that are believed to contain StegaStamps. The network is trained using a dataset of randomly transformed StegaStamps embedded into high resolution images sampled from DIV2K [1]. At test time, we fit a quadrilateral to the convex hull of each of the network's proposed regions, then compute a homography to warp each quadrilateral back to a 400 × 400 pixel square for parsing by the decoder.
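The test-time post-processing can be sketched with OpenCV as below, assuming the fine-tuned segmentation network has already produced a binary mask; the polygon-approximation threshold and corner ordering are simplifications on our part.

```python
import numpy as np
import cv2

def rectify_detections(mask, frame, out_size=400):
    """Fit a quadrilateral to each proposed region's convex hull and warp it to a square."""
    crops = []
    # OpenCV 4.x return signature (contours, hierarchy).
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        hull = cv2.convexHull(contour)
        # Approximate the hull with a polygon; keep regions that reduce to 4 corners.
        quad = cv2.approxPolyDP(hull, 0.05 * cv2.arcLength(hull, True), True)
        if len(quad) != 4:
            continue
        src = quad.reshape(4, 2).astype(np.float32)  # note: corner ordering not handled here
        dst = np.float32([[0, 0], [out_size - 1, 0],
                          [out_size - 1, out_size - 1], [0, out_size - 1]])
        H = cv2.getPerspectiveTransform(src, dst)
        crops.append(cv2.warpPerspective(frame, H, (out_size, out_size)))
    return crops
```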

4.4. Encoder/Decoder Training Procedure

Training Data: During training, we use images from the MIRFLICKR dataset [17] combined with randomly sampled binary messages. We resample the images to 400 × 400 resolution.

Critic: As part of our total loss, we use a critic network that predicts whether a message is encoded in an image; it serves as a perceptual loss for the encoder/decoder pipeline. The network is composed of a series of convolutional layers followed by max pooling. To train the critic, an input image and an encoded image are classified, and the Wasserstein loss [2] is used as a supervisory signal. Training of the critic is interleaved with the training of the encoder/decoder.
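One plausible reading of this setup, sketched in Keras; the architecture widths and the exact Wasserstein formulation are assumptions on our part, not the paper's specification.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_critic(image_size=400):
    """Critic: a few conv + max-pool blocks producing a single score per image."""
    x_in = layers.Input((image_size, image_size, 3))
    x = x_in
    for filters in (8, 16, 32):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)
    score = layers.Dense(1)(layers.Flatten()(x))
    return tf.keras.Model(x_in, score)

# Wasserstein-style objectives [2], as we understand the description above:
#   critic_loss  = mean(critic(encoded)) - mean(critic(original))   # critic separates the two
#   L_C          = -mean(critic(encoded))                           # "critic loss" fed to the encoder
# Critic updates are interleaved with encoder/decoder updates.
```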

Loss Weighting: The training loss is the weighted sum of three image loss terms (residual regularization L_R, perceptual loss L_P, critic loss L_C) and the cross entropy message loss L_M:

L = λ_R L_R + λ_P L_P + λ_C L_C + λ_M L_M        (2)

Figure 4: Examples of our system deployed in-the-wild. We outline the StegaStamps detected and decoded by our system and show the message recovery accuracies. Our method works in the real world, exhibiting robustness to changing camera orientation, lighting, shadows, etc. You can find these examples and more in our supplemental video.

We find three loss function adjustments to particularly aid in convergence when training the networks (a code sketch of the corresponding schedule follows the list):

1. These image loss weights λ_R, λ_P, λ_C must initially be set to zero while the decoder trains to high accuracy, after which λ_R, λ_P, λ_C are increased linearly.

2. The image perturbation strengths must also start at zero. The perspective warping is the most sensitive perturbation and is increased at the slowest rate.

3. The model learns to add distracting patterns at the edge of the image (perhaps to assist in localization). We mitigate this effect by increasing the weight of the L2 loss at the edges with a cosine dropoff.
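A sketch of how adjustments 1 and 3 might look in code; the ramp length, maximum weights, and edge-falloff width are illustrative placeholders, not values reported in the paper.

```python
import numpy as np

def loss_weights(step, ramp_steps=10_000, max_weights=(1.5, 1.5, 0.5)):
    """Linear ramp for lambda_R, lambda_P, lambda_C (adjustment 1); message weight stays at 1."""
    ramp = min(step / ramp_steps, 1.0)
    lam_R, lam_P, lam_C = (ramp * w for w in max_weights)
    return lam_R, lam_P, lam_C, 1.0

def total_loss(L_R, L_P, L_C, L_M, step):
    """Weighted sum from Eq. (2), using the ramped weights."""
    lam_R, lam_P, lam_C, lam_M = loss_weights(step)
    return lam_R * L_R + lam_P * L_P + lam_C * L_C + lam_M * L_M

def edge_weight_mask(size=400, falloff=50):
    """Per-pixel weight for the L2 residual loss (adjustment 3): the weight rises toward the
    image border with a cosine profile to discourage edge patterns; `falloff` is assumed."""
    d = np.minimum.reduce([np.arange(size), np.arange(size)[::-1]])  # distance to nearer edge
    w = np.where(d < falloff, 1.0 + 0.5 * (1.0 + np.cos(np.pi * d / falloff)), 1.0)
    return np.maximum.outer(w, w)  # (size, size) mask, largest near the border
```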

5. Real-World & Simulation-Based Evaluation

We test our system in both real-world conditions and synthetic approximations of display-imaging pipelines. We show that our system works in-the-wild, recovering messages in uncontrolled indoor and outdoor environments. We evaluate our system in a controlled real world setting with 18 combinations of 6 different displays/printers and 3 different cameras. Across all settings combined (1890 captured images), we achieve a mean bit-accuracy of 98.7%. We conduct real and synthetic ablation studies with four different trained models to verify that our system is robust to each of the image perturbations we apply during training and that omitting these augmentations significantly decreases decoding performance.

Figure 5: Despite not explicitly training the method to be robust to occlusion, we find that our decoder can handle partial erasures gracefully, maintaining high accuracy.

5.1. In-the-Wild Robustness

Our method is tested on handheld cellphone camera videos captured in a variety of real-world environments. The StegaStamps are printed on a consumer printer. Examples of the captured frames with detected quadrilaterals and decoding accuracy are shown in Figure 4. We also demonstrate a surprising level of robustness when portions of the StegaStamp are covered by other objects (Figure 5). Please see our supplemental video for extensive examples of real world StegaStamp decoding, including examples of perfectly recovering 56 bit messages using BCH error correcting codes [5]. In practice, we generally find that if the bounding rectangle is accurately located, decoding accuracy will be high. However, it is not uncommon for the detector to miss the StegaStamp on a subset of video frames. In practice this is not an issue, because the code only needs to be recovered once. We expect future extensions that incorporate temporal information and custom detection networks can further improve the detection consistency.

5.2. Controlled Real World Experiments

In order to demonstrate that our model generalizes from synthetic perturbations to real physical display-imaging pipelines, we conduct a series of tests where encoded images are printed or displayed and captured by a camera prior to decoding. We randomly select 100 unique images from the ImageNet dataset [10] (which is disjoint from our training set) and embed random 100 bit messages within each image. We generate 5 additional StegaStamps with the same source image but different messages for a total of 105 test images. We conduct the experiments in a darkroom with fixed lighting.

Camera      Display              5th     25th    50th    Mean
--------------------------------------------------------------
Webcam      Printer: Enterprise  88%     94%     98%     95.9%
            Printer: Consumer    90%     98%     99%     98.1%
            Printer: Pro         97%     99%     100%    99.2%
            Screen: Monitor      94%     98%     99%     98.5%
            Screen: Laptop       97%     99%     100%    99.1%
            Screen: Cellphone    91%     98%     99%     97.7%
Cellphone   Printer: Enterprise  88%     96%     98%     96.8%
            Printer: Consumer    95%     99%     100%    99.0%
            Printer: Pro         97%     99%     100%    99.3%
            Screen: Monitor      98%     99%     100%    99.4%
            Screen: Laptop       98%     99%     100%    99.7%
            Screen: Cellphone    96%     99%     100%    99.2%
DSLR        Printer: Enterprise  86%     96%     99%     97.0%
            Printer: Consumer    97%     99%     100%    99.3%
            Printer: Pro         98%     99%     100%    99.5%
            Screen: Monitor      99%     100%    100%    99.8%
            Screen: Laptop       99%     100%    100%    99.8%
            Screen: Cellphone    99%     100%    100%    99.8%

Table 1: Real world decoding accuracy (percentage of bits correctly recovered) tested using a combination of six display methods (three printers and three screens) and three cameras. We show the 5th, 25th, and 50th percentiles and mean taken over 105 images chosen randomly from ImageNet [10] with randomly sampled 100 bit messages.

Figure 6: Real world ablation test of the four networks described in Section 5.3 using the cellphone camera and consumer printer pipeline from Table 1. The y-axis denotes which image perturbations were used to train each network (All, Spatial, Pixelwise, None, and Chance); the x-axis is bit recovery accuracy. We show the distribution of random guessing (with its mean of 0.5 indicated by the dotted line) to demonstrate that the no-perturbations ablation performs no better than chance. Previous work on digital steganography focuses on either the no-perturbations case of perfect transmission or on the pixelwise case where the image may be digitally modified but not spatially resampled. Adding spatial perturbations is critical for achieving high real-world performance.

Figure 7: Synthetic ablation tests showing the effect of training with various image perturbation combinations on bit recovery robustness. Panels: (a) all perturbations, (b) no perturbations, (c) pixelwise perturbations only, (d) spatial perturbations only; each panel plots bit recovery accuracy against test-time perturbation strength for warp, blur, noise, color, and JPEG. "Pixelwise" perturbations (c) are noise, color transforms, and JPEG compression, and "spatial" perturbations (d) are perspective warp and blur. To test robustness across a range of possible degradation, we parameterize the strength of each perturbation on a scale from 0 (weakest) to 1 (maximum value seen during training) to 2 (strongest). Models not trained against spatial perturbations (b-c) are highly susceptible to warp and blur, and the model trained only on spatial perturbations (d) is sensitive to color transformations. The lines show the mean accuracies and the shaded regions show the 25th-75th percentile range over 100 random images and messages. See Section 5.3 for details.

The printed images are fixed in a rig for consistency and captured by a tripod-mounted camera. The resulting photographs are cropped by hand, rectified, and passed through the decoder.

The images are printed using a consumer printer (HP LaserJet Pro M281fdw), an enterprise printer (HP LaserJet Enterprise CP4025), and a commercial printer (Xerox 700i Digital Color Press). The images are also digitally displayed on a matte 1080p monitor (Dell ST2410), a glossy high DPI laptop screen (Macbook Pro 15 inch), and an OLED cellphone screen (iPhone X). To image the StegaStamps, we use an HD webcam (Logitech C920), a cellphone camera (Google Pixel 3), and a DSLR camera (Canon 5D Mark II). All 105 images were captured across all 18 combinations of the 6 media and 3 cameras. The results of these tests are reported in Table 1. We see that our method is highly robust across a variety of different combinations of display or printer and camera; two-thirds of these scenarios yield a median accuracy of 100% and a 5th percentile of at least 95% perfect decoding. Our mean accuracy over all 1890 captured images is 98.7%.

5.3. Ablation Tests

We test how training with different subsets of the image perturbations described in Section 3 impacts decoding accuracy in the real world (Figure 6) and in a synthetic experiment (Figure 7). We evaluate both our base model (trained with all perturbations) and three additional models (trained with no perturbations, only pixelwise perturbations, and only spatial perturbations). Most work on image steganography focuses on hiding as much information as possible but assumes that no corruption will occur prior to decoding (similar to our "no perturbations" model). HiDDeN [38] incorporates augmentations into their training pipeline to increase robustness to perturbations. However, they apply only digital corruptions, similar to our pixelwise perturbations, not including any augmentations that spatially resample the image. Work in other domains [3, 16] has explored training deep networks through pixelwise and spatial perturbations, but to the best of our knowledge, we are the first to apply this technique to steganography.

For the real world test, we evaluate all four networks on the same 105 test StegaStamps from Section 5.2 using the cellphone camera and consumer printer combination. We see in Figure 6 that training with no perturbations yields accuracy no better than guessing and that adding pixelwise perturbations alone barely improves performance. This demonstrates that previous deep learning steganography methods (most similar to either the "None" or "Pixelwise" ablations) will fail when the encoded image is corrupted by a display-imaging pipeline. Training with spatial perturbations alone yields performance significantly higher than pixelwise perturbations alone; however, it cannot reliably recover enough data for practical use. Our presented method, combining both pixelwise and spatial perturbations, achieves the most precise and accurate results by a large margin.

We run a more exhaustive synthetic ablation study over 1000 images to separately test the effects of each training-time perturbation on accuracy. The results shown in Figure 7 follow a similar pattern to the real world ablation test. The model trained with no perturbations is surprisingly robust to color warps and noise but immediately fails in the presence of warp, blur, or any level of JPEG compression. Training with only pixelwise perturbations yields high robustness to those augmentations but still leaves the network vulnerable to any amount of pixel resampling from warping or blur. On the other hand, training with only spatial perturbations also confers increased robustness against JPEG compression (perhaps because it has a similar low-pass filtering effect to blurring). Again, training with both spatial and pixelwise augmentations yields the best result.

Figure 8: Four models trained to encode messages of different lengths (panels: original, 50 bits, 100 bits, 150 bits, 200 bits). The inset shows the residual relative to the original image. The perceptual quality decreases as more bits are encoded. We find that a message length of 100 bits provides good image quality and is sufficient to encode a virtually unlimited number of distinct hyperlinks using error correcting codes.

                 Message length
Metric       50       100      150      200
PSNR ↑      29.88    28.50    26.47    21.79
SSIM ↑      0.930    0.905    0.876    0.793
LPIPS ↓     0.100    0.101    0.128    0.184

Table 2: Image quality for models trained with different message lengths, averaged over 500 images. For PSNR and SSIM, higher is better. LPIPS [37] is a learned perceptual similarity metric; lower is better.

5.4. Practical Message Length

Our model can be trained to store different numbers of bits. In all previous examples, we use a message length of 100. Figure 8 compares encoded images from four separately trained models with different message lengths. Larger messages are more difficult to encode and decode; as a result, there is a trade-off between recovery accuracy and perceptual similarity. The associated image metrics are reported in Table 2. When training the models, the image and message losses are tuned such that the bit accuracy converges to at least 95%.

We settle on a message length of 100 bits as it provides a good compromise between image quality and information transfer. Given an estimate of at least 95% recovery accuracy, we can encode at least 56 error corrected bits using BCH codes [5]. As discussed in the introduction, this gives us the ability to uniquely map every recorded image in history to a corresponding StegaStamp. Accounting for error correction, using only 50 total message bits would drastically reduce the number of possible encoded hyperlinks to under one billion. The image degradation caused by encoding 150 or 200 bits is much more perceptible.
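The arithmetic behind these capacity claims, as a quick check; the exact data/parity split of the 50-bit case is an assumption for illustration.

```python
# Capacity headroom of 56 error-corrected bits versus the introduction's "100 trillion" bound.
codes_56_bits = 2 ** 56                  # ~7.2e16 distinct error-corrected hyperlinks
photos_upper_bound = 100 * 10 ** 12      # 100 trillion photos
print(codes_56_bits / photos_upper_bound)  # ~720x headroom, i.e. orders of magnitude more codes

# If only ~30 data bits survived after parity in a 50-bit message (an assumed split),
# capacity would drop to roughly a billion distinct hyperlinks:
print(2 ** 30)                           # 1,073,741,824
```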

5.5. Limitations

Though our system works with a high rate of success in the real world, it is still many steps from enabling an environment saturated with ubiquitous invisible hyperlinked images. Despite often being very subtle in high frequency textures, the residual added by the encoder network is sometimes perceptible in large low frequency regions of the image. Future work could improve upon our architecture and loss functions to train an encoder/decoder pair that uses more subtle encodings.

Additionally, we find our off-the-shelf detection network to be the bottleneck in our decoding performance during real world testing. A custom detection architecture optimized end to end with the encoder/decoder could increase detection performance. The current framework also assumes that the StegaStamps will be single, square images for the purpose of detection. We imagine that embedding multiple codes seamlessly into a single, larger image (such as a poster or billboard) could provide even more flexibility.

6. Conclusion

We have presented StegaStamp, an end-to-end deep learning framework for encoding 56 bit error corrected hyperlinks into arbitrary natural images. Our networks are trained through an image perturbation module that allows them to generalize to real world display-imaging pipelines. We demonstrate robust decoding performance on a variety of printer, screen, and camera combinations in an experimental setting. We also show that our method is stable enough to be deployed in-the-wild as a replacement for existing barcodes that is less intrusive and more aesthetically pleasing.

7. Acknowledgments

We would like to thank Coline Devin and Cecilia Zhang for their assistance on our demos and Utkarsh Singhal and Pratul Srinivasan for their helpful discussions and feedback. Support for this work was provided by the Fannie and John Hertz Foundation.

References

[1] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPR Workshops, 2017.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[3] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.
[4] S. Baluja. Hiding images in plain sight: Deep steganography. In NeurIPS, 2017.
[5] R. C. Bose and D. K. Ray-Chaudhuri. On a class of error correcting binary group codes. Information and Control, 1960.
[6] O. Bulan, H. Blasinski, and G. Sharma. Color QR codes: Increased capacity via per-channel data encoding and interference cancellation. In Color and Imaging Conference, 2011.
[7] A. Cheddad, J. Condell, K. Curran, and P. Mc Kevitt. Digital image steganography: Survey and analysis of current methods. Signal Processing, 90(3), 2010.
[8] S.-T. Chen, C. Cornelius, J. Martin, and D. H. P. Chau. ShapeShifter: Robust physical adversarial attack on Faster R-CNN object detector. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2018.
[9] I. Cox, M. Miller, J. Bloom, J. Fridrich, and T. Kalker. Digital Watermarking and Steganography. Morgan Kaufmann, 2007.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[11] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song. Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945, 2017.
[12] J. Fridrich, T. Pevny, and J. Kodovsky. Statistically undetectable JPEG steganography: Dead ends, challenges, and opportunities. In Proceedings of the 9th Workshop on Multimedia & Security, 2007.
[13] S. Garrido-Jurado, R. Muñoz-Salinas, F. Madrid-Cuevas, and R. Medina-Carnicer. Generation of fiducial marker dictionaries using mixed integer linear programming. Pattern Recognition, 2015.
[14] S. W. Hasinoff. Photon, Poisson noise. In Computer Vision: A Reference Guide. 2014.
[15] J. Hayes and G. Danezis. Generating steganographic images via adversarial training. In NeurIPS, 2017.
[16] D. Hu, D. DeTone, V. Chauhan, I. Spivak, and T. Malisiewicz. Deep ChArUco: Dark ChArUco marker pose estimation. arXiv preprint arXiv:1812.03247, 2018.
[17] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. ACM, 2008.
[18] B. Isac and V. Santhi. A study on digital image and video watermarking schemes using neural networks. International Journal of Computer Applications, 12(9):1–6, 2011.
[19] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NeurIPS, 2015.
[20] S. T. K. Jan, J. Messou, Y.-C. Lin, J.-B. Huang, and G. Wang. Connecting the digital and physical world: Improving the robustness of adversarial attacks. In AAAI, 2019.
[21] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[22] Y. Liu, J. Yang, and M. Liu. Recognition of QR code with mobile phones. In Chinese Control and Decision Conference. IEEE, 2008.
[23] E. Myodo, S. Sakazawa, and Y. Takishima. Method, apparatus and computer program for embedding barcode in color image. US Patent 8,550,366, 2013.
[24] E. Ohbuchi, H. Hanaizumi, and L. A. Hock. Barcode readers using the camera device in mobile phones. In International Conference on Cyberworlds. IEEE, 2004.
[25] S. Pereira and T. Pun. Robust template matching for affine resistant image watermarks. IEEE Transactions on Image Processing, 2000.
[26] F. A. Petitcolas. Watermarking schemes evaluation. IEEE Signal Processing Magazine, 2000.
[27] F. A. Petitcolas, R. J. Anderson, and M. G. Kuhn. Attacks on copyright marking systems. In International Workshop on Information Hiding. Springer, 1998.
[28] T. Pevny, T. Filler, and P. Bas. Using high-dimensional image models to perform highly undetectable steganography. In International Workshop on Information Hiding, 2010.
[29] F. Romero Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer. Speeded up detection of squared fiducial markers. Image and Vision Computing, 2018.
[30] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI. Springer International Publishing, 2015.
[31] R. Shin and D. Song. JPEG-resistant adversarial images. In NeurIPS Workshop on Machine Learning and Computer Security, 2017.
[32] C. Sitawarin, A. N. Bhagoji, A. Mosenia, M. Chiang, and P. Mittal. DARTS: Deceiving autonomous cars with toxic signs. arXiv preprint arXiv:1802.06430, 2018.
[33] W. Tang, S. Tan, B. Li, and J. Huang. Automatic steganographic distortion learning using a generative adversarial network. IEEE Signal Processing Letters, 2017.
[34] P. Wu, Y. Yang, and X. Li. StegNet: Mega image steganography capacity with deep convolutional network. Future Internet, 2018.
[35] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.
[36] K. A. Zhang, A. Cuesta-Infante, and K. Veeramachaneni. SteganoGAN: Pushing the limits of image steganography. arXiv preprint arXiv:1901.03892, 2019.
[37] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[38] J. Zhu, R. Kaplan, J. Johnson, and L. Fei-Fei. HiDDeN: Hiding data with deep networks. In ECCV, 2018.

