
M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

Junke Wang, Zuxuan Wu, Jingjing Chen, Yu-Gang Jiang
Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University

{17300240009, zxwu, chenjingjing, ygj}@fudan.edu.cn

ABSTRACT
The widespread dissemination of forged images generated by Deepfake techniques has posed a serious threat to the trustworthiness of digital information. This demands effective approaches that can detect perceptually convincing Deepfakes generated by advanced manipulation techniques. Most existing approaches combat Deepfakes with deep neural networks by mapping the input image to a binary prediction without capturing the consistency among different pixels. In this paper, we aim to capture the subtle manipulation artifacts at different scales for Deepfake detection. We achieve this with transformer models, which have recently demonstrated superior performance in modeling dependencies between pixels for a variety of recognition tasks in computer vision. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which uses a multi-scale transformer that operates on patches of different sizes to detect the local inconsistency at different spatial levels. To improve the detection results and enhance the robustness of our method to image compression, M2TR also takes frequency information, which is further combined with RGB features using a cross modality fusion module. Developing and evaluating Deepfake detection methods requires large-scale datasets. However, we observe that samples in existing benchmarks contain severe artifacts and lack diversity. This motivates us to introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods. On three Deepfake datasets, we conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods.

1 INTRODUCTION
Recent years have witnessed the rapid development of Deepfake techniques [26, 29, 43, 52], which enable attackers to manipulate the facial area of an image and generate a forged image. As synthesized images are becoming more photo-realistic, it is extremely difficult to distinguish whether an image/video has been manipulated, even for human eyes. At the same time, these forged images might be distributed on the Internet for malicious purposes, which could bring societal implications. The above challenges have driven the development of Deepfake forensics using deep neural networks [1, 5, 25, 32, 34, 40, 67]. Most existing approaches take as input a face region cropped out of an entire image and produce a binary real/fake prediction with deep CNN models. These methods capture artifacts from the face regions at a single scale with stacked convolutional operations. While decent detection results are achieved by stacked convolutions, they excel at modeling local information but fail to consider the relationships of pixels globally due to their constrained receptive field.

(a) FF++ [48] (b) DFD [9] (c) DFDC [13] (d) Celeb-DF [35]

Figure 1: Visual artifacts of forged images in existing datasets, including color mismatch (row 1 col 1, row 2 col 3, row 3 col 1, row 3 col 2, row 3 col 3), shape distortion (row 1 col 3, row 2 col 1), visible boundaries (row 2 col 2), and facial blurring (row 1 col 2, row 4 col 1, row 4 col 2, row 4 col 3).

We posit that relationships among pixels are particularly useful for Deepfake detection, since pixels in certain artifacts are clearly different from the remaining pixels in the image. On the other hand, we observe that forgery patterns vary in size. For instance, Figure 1 gives examples from popular Deepfake datasets. We can see that some forgery traces, such as color mismatch, occur in small regions (like the mouth corners), while other forgery signals, such as visible boundaries, almost span the entire image (see row 3 col 2 in Figure 1). Therefore, how to effectively explore regions of different scales in images is extremely critical for Deepfake detection.

To address the above limitations, we explore transformers to model the relationships of pixels due to their strong capability of long-term dependency modeling for both natural language processing tasks [12, 46, 58] and computer vision tasks [2, 14, 68]. Unlike traditional transformers operating on a single scale, we propose a multi-scale architecture to capture forged regions that potentially have different sizes. Furthermore, [15, 24, 44, 59, 65] suggest that the artifacts of forged images will be destroyed by perturbations such as JPEG compression, making them imperceptible in the RGB domain yet still detectable in the frequency domain. This motivates us to use frequency information as a complementary modality in order to reveal artifacts that are no longer perceptible in the RGB domain.

arXiv:2104.09770v2 [cs.CV] 21 Apr 2021


[Figure 2 diagram: the input face image X is processed by a convolutional backbone (Layer 1, Layer 2) to obtain RGB features f_s, which feed a Multi-scale Transformer with heads 1..N (patch sizes r_1..r_N, each applying Q/K/V self-attention followed by convolutions); in parallel, DCT and DCT^-1 produce the frequency map B and frequency features f_fq; a Cross Modality Fusion (CMF) module combines the streams into f_cmf, which goes to a classifier (Real/Fake) and a decoder that outputs the predicted mask M̂.]

Figure 2: Overview of the proposed M2TR. The input is a suspicious face image (H x W x C), and the output includes both a forgery detection result and a predicted mask (H x W x 1), which locates the forgery regions.

To this end, we introduce M2TR, a Multi-modal Multi-scale Transformer, for Deepfake detection. M2TR is a multimodal framework, consisting of a Multi-scale Transformer (MT) module and a Cross Modality Fusion (CMF) module. In particular, M2TR first extracts features of an input image with a few convolutional layers. We then generate patches of different sizes from the feature map, which are used as inputs to different heads of the transformer. Similarities of spatial patches across different scales are calculated to capture the inconsistency among different regions at multiple scales. This benefits the discovery of forgery artifacts, since certain subtle forgery clues, e.g., blurring and color inconsistency, are oftentimes hidden in small local patches. The outputs from the multi-scale transformer are further augmented with frequency information to derive fused feature representations using a cross modality fusion module. Finally, the integrated features are used as inputs to several convolutional layers to generate prediction results. In addition to binary classification, we also predict the manipulated regions of the face image in a multi-task manner. The rationale behind this is that binary classification tends to result in easily overfitted models. Therefore, we use face masks as additional supervisory signals to mitigate overfitting.
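To make this pipeline concrete, the following is a minimal PyTorch-style sketch of the forward pass just described; it is not the authors' implementation, and the module objects passed to the constructor (backbone, mt, freq_branch, cmf, classifier, mask_decoder) are hypothetical stand-ins for the components shown in Figure 2.

```python
# Minimal sketch of the M2TR forward pass (not the authors' code).
import torch.nn as nn

class M2TRSketch(nn.Module):
    def __init__(self, backbone, mt, freq_branch, cmf, classifier, mask_decoder):
        super().__init__()
        self.backbone = backbone          # shallow conv layers -> f_s
        self.mt = mt                      # multi-scale transformer -> f_mt
        self.freq_branch = freq_branch    # DCT-based branch -> f_fq
        self.cmf = cmf                    # cross modality fusion -> f_cmf
        self.classifier = classifier      # real/fake prediction head
        self.mask_decoder = mask_decoder  # predicted manipulation mask

    def forward(self, x):                  # x: (B, 3, H, W) face image
        f_s = self.backbone(x)             # (B, C, H/4, W/4) RGB features
        f_mt = self.mt(f_s)                # multi-scale attention features
        f_fq = self.freq_branch(x)         # frequency features, same size as f_s
        f_cmf = self.cmf(f_s, f_mt, f_fq)  # fused representation
        logits = self.classifier(f_cmf)    # binary real/fake logits
        mask = self.mask_decoder(f_cmf)    # (B, 1, H, W) predicted mask
        return logits, mask
```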

The availability of large-scale training data is an essential factor in the development of Deepfake detection methods. Existing Deepfake datasets include the UADFV dataset [63], the DeepFake-TIMIT dataset (DF-TIMIT) [28], the FaceForensics++ dataset (FF++) [48], the Google DeepFake detection dataset (DFD) [9], the Facebook DeepFake detection challenge (DFDC) dataset [13], the WildDeepfake dataset [69], and the Celeb-DF dataset [35]. However, the quality of visual samples in current Deepfake datasets is limited, containing clear artifacts (see Figure 1) like color mismatch, shape distortion, visible boundaries, and facial blurring. Therefore, there is still a huge gap between the images in existing datasets and forged images in the wild which are circulated on the Internet. Although the visual quality of Celeb-DF [35] is relatively high compared to others, it uses only one face swapping method to generate forged images, lacking sample diversity. In addition, there are no unbiased and comprehensive evaluation metrics to measure the quality of Deepfake datasets, which is not conducive to the development of subsequent Deepfake research.

In this paper, we present a large-scale and high-quality Deepfake dataset, the Swapping and Reenactment DeepFake (SR-DF) dataset, which is generated using state-of-the-art face swapping and facial reenactment methods for the development and evaluation of Deepfake detection methods. We visualize in Figure 4 sampled forged faces from the proposed SR-DF dataset. Besides, we propose a set of evaluation criteria to measure the quality of Deepfake datasets from different perspectives. We hope the release of the SR-DF dataset and the evaluation system will benefit future research on Deepfake detection.

Our work makes the following key contributions:

• We propose a Multi-modal Multi-scale Transformer (M2TR) for Deepfake forensics, which uses a multi-scale transformer to detect local inconsistency at different scales and leverages frequency features to improve the robustness of detection. Extensive experiments demonstrate that our method achieves state-of-the-art detection performance on different datasets.

• We introduce a large-scale and challenging Deepfake dataset, SR-DF, which is generated with state-of-the-art face swapping and facial reenactment methods.

• We construct the most comprehensive evaluation system and demonstrate that the SR-DF dataset is well-suited for training Deepfake detection methods due to its visual quality and diversity.



2 RELATED WORK
Deepfake Generation. Existing Deepfake generation methods can be divided into two categories: face swapping and facial reenactment. Face swapping methods [8, 10, 18, 38] replace the face of a source person in an image with the face of a target person. Most of these methods consist of three steps: segmenting the facial regions from the target image, manipulating the expression to match the source person, and blending with the background. Facial reenactment methods [54, 55, 61], on the other hand, transfer the facial expressions of the source person to the target person while preserving the identity information of the source person. Specifically, [23] uses a landmark decoder to model expression motions and a content encoder to extract identity information, thus handling the task by feature disentanglement.
Deepfake Detection. To mitigate the security threat brought by Deepfakes, a variety of methods have been proposed for Deepfake detection. [67] uses a two-stream architecture to capture facial manipulation clues and patch inconsistency separately, while [40] simultaneously identifies forged faces and locates the manipulated regions with multi-task learning. Recently, [32] proposes to detect blending boundaries based on the observation that the step of blending a forged face into the background is common to most existing face manipulation methods. [60] extracts features from the face image using a CNN model, which are then fed to a traditional single-scale transformer for forgery detection. However, most of these methods only focus on features in the RGB domain, thus failing to detect forged images which are manipulated subtly in the color space. Instead, [5] uses RGB and frequency information collaboratively to extract comprehensive forgery features. In this paper, we use a multi-scale transformer to capture local inconsistency at different scales for forgery detection, and additionally introduce a frequency modality to improve the robustness of our method to various image compression algorithms.
Visual Transformers. Transformers [58] have demonstrated impressive performance on natural language processing tasks due to their strong ability to model long-range context information. Recently, researchers have shown remarkable interest in using transformers for a variety of computer vision tasks. In particular, Vision Transformer (ViT) reshapes the image into a sequence of flattened patches and inputs them to a transformer encoder for image classification [14]. DETR uses a common CNN to extract semantic features from the input image, which are then input to a transformer-based encoder-decoder architecture for object detection [2]. In this paper, we adopt a multi-scale transformer which integrates multi-scale information for Deepfake detection.

3 APPROACH
Our goal is to detect the subtle forgery artifacts that are hidden in the inconsistency of local patches and to improve the robustness to image compression with frequency features. In this section, we introduce the Multi-modal Multi-scale Transformer (M2TR) for Deepfake detection, which consists of a multi-scale transformer (Sec. 3.1) and a cross modality fusion module (Sec. 3.2). Figure 2 gives an overview of the framework.

[Figure 3 diagram: f_s is split into flattened patches via a linear projection and processed by four heads with patch sizes 10, 20, 40, and 80; each head applies self-attention over its patches, the patch outputs are reassembled to the original resolution, and a 3×3 convolution merges the heads.]

Figure 3: Illustration of the Multi-scale Transformer.

3.1 Multi-scale Transformer
We wish to locate regions that contain manipulation artifacts and thus are inconsistent with other regions in the image. This requires modeling long-range relationships in images, i.e., calculating the similarity not only of regions in a local neighborhood but also of regions that lie far apart. Inspired by the great success of transformer models in capturing long-term context information, we use transformers for Deepfake detection. Unlike recent approaches that directly split an input image into multiple patches of the same size as inputs to transformers [14], we introduce a multi-scale transformer, which generates patches of different scales. The intuition is to cover regions of different sizes so as to identify artifacts generated by various manipulation methods.

More formally, denote the input image as X ∈ R^{H×W×3}, where H and W are the height and width of the image, respectively. f represents the backbone network, and f_t is the feature map extracted from the t-th layer. The feature map f_s ∈ R^{(H/4)×(W/4)×C} is first extracted from the shallow layers of f. Then, to capture forgery patterns at multiple scales, we split the feature map into spatial patches of different sizes and calculate patch-wise self-attention in different heads. Specifically, for the h-th head we extract N spatial patches of shape r_h × r_h × C from f_s, where N = (H/4r_h) × (W/4r_h), and reshape them into 1-dimensional vectors. After that, we use fully-connected layers to embed the flattened vectors into query embeddings {q^h_i}_{i=1}^{N}. Similar operations are applied to obtain the key and value embeddings, respectively. Then we calculate the patch-wise similarities by matrix multiplication and a softmax function:

α^h_{i,j} = softmax( q^h_i · (k^h_j)^T / √(r_h × r_h × C) ),  1 ≤ i, j ≤ N,  (1)

where q^h_i denotes the i-th query embedding and k^h_j denotes the j-th key embedding. Then we obtain the output for each query patch as a weighted sum of the values from relevant patches:

o^h_i = Σ_{j=1}^{N} α^h_{i,j} v^h_j.  (2)

After obtaining the outputs for all patches, we stitch them together and reshape them to the original spatial resolution. Finally, the features from different heads are concatenated and further passed through a 2D residual block to obtain the output f_mt ∈ R^{(H/4)×(W/4)×C}. The detailed architecture of the multi-scale transformer is illustrated in Figure 3.
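As an illustration, here is a minimal PyTorch sketch of a single attention head of the multi-scale transformer (Eq. 1 and 2), assuming the feature map side lengths are divisible by the patch size r_h; the class name PatchAttentionHead and the single shared projection per head are our own simplifications, not the authors' code.

```python
# Sketch of one head: split f_s into r x r patches, attend, reassemble.
import torch
import torch.nn as nn

class PatchAttentionHead(nn.Module):
    def __init__(self, channels, patch_size):
        super().__init__()
        self.r = patch_size
        dim = patch_size * patch_size * channels   # flattened patch dimension
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, f_s):                        # f_s: (B, C, H', W')
        B, C, H, W = f_s.shape
        r = self.r
        # split into non-overlapping r x r patches and flatten each one
        patches = f_s.unfold(2, r, r).unfold(3, r, r)              # (B, C, H/r, W/r, r, r)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * r * r)
        q, k, v = self.to_q(patches), self.to_k(patches), self.to_v(patches)
        attn = torch.softmax(q @ k.transpose(1, 2) / (C * r * r) ** 0.5, dim=-1)  # Eq. 1
        out = attn @ v                                             # Eq. 2: weighted sum
        # stitch the patch outputs back to the original spatial resolution
        out = out.reshape(B, H // r, W // r, C, r, r).permute(0, 3, 1, 4, 2, 5)
        return out.reshape(B, C, H, W)
```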



3.2 Cross Modality Fusion
It has been shown that artifacts in manipulated images and videos are no longer perceptible after compression operations such as JPEG compression [15, 24, 59, 65]. Following [5, 44], we also compute features from the frequency domain to complement RGB features. The resulting frequency features are combined with RGB features using a cross modality fusion module.

In particular, we first apply the Discrete Cosine Transform (DCT) to transform the input image X from the RGB domain to the frequency domain and obtain DCT(X) ∈ R^{H×W×1}. Benefiting from the properties of the DCT, low-frequency responses are placed in the top-left corner of DCT(X), while high-frequency responses are located in the bottom-right corner. Following [44], we separate the frequency domain into low, middle, and high frequency bands with three hand-crafted binary base filters {t^i_base}_{i=1}^{3} and obtain the decomposed frequency components:

d_i = DCT(X) ⊙ t^i_base,  i ∈ {1, 2, 3},  (3)

where ⊙ denotes the element-wise product. Empirically, we manually design the base filters with the following pattern: the low frequency band t^1_base is the first 1/16 of the entire spectrum, the middle frequency band t^2_base is between 1/16 and 1/8 of the spectrum, and the high frequency band t^3_base is the last 7/8 of the spectrum.

To preserve the shift invariance and local consistency of natural images and to exploit the representational capability of CNNs, we then invert d_i back into the RGB domain via the inverse DCT: b_i = DCT^{-1}(d_i), i ∈ {1, 2, 3}. Finally, we re-assemble {b_i}_{i=1}^{3} along the channel axis to obtain the frequency-aware spatial map B ∈ R^{H×W×3}, and input it to several stacked convolution layers to extract frequency features f_fq, whose size is the same as that of f_s.
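The band decomposition of Eq. 3 can be sketched as below with NumPy/SciPy. The masks based on the normalized anti-diagonal position are only one plausible reading of "first 1/16 / between 1/16 and 1/8 / last 7/8 of the spectrum"; the function name, the single-channel input, and the exact filter shapes are illustrative assumptions, not the paper's filters.

```python
# Sketch of Eq. 3: DCT -> three binary band filters -> inverse DCT -> stack.
import numpy as np
from scipy.fft import dctn, idctn

def frequency_maps(x, bands=((0, 1/16), (1/16, 1/8), (1/8, 1.0))):
    """x: (H, W) grayscale image; returns the (H, W, 3) frequency-aware map B."""
    H, W = x.shape
    d = dctn(x, norm="ortho")                       # DCT(X), low frequencies in top-left
    i, j = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pos = (i + j) / (H + W)                         # ~0 at top-left, <1 at bottom-right
    comps = []
    for lo, hi in bands:
        t_base = ((pos >= lo) & (pos < hi)).astype(x.dtype)   # binary base filter
        d_i = d * t_base                            # Eq. 3: element-wise product
        comps.append(idctn(d_i, norm="ortho"))      # back to the spatial domain
    return np.stack(comps, axis=-1)                 # frequency-aware map B

# Example: B = frequency_maps(np.random.rand(320, 320)); B.shape == (320, 320, 3)
```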

Given RGB features f_s and frequency features f_fq, we use a Cross Modality Fusion (CMF) module to combine them into a unified representation. Inspired by the self-attention architecture of transformers, we design a fusion block using the query-key-value mechanism. Specifically, we first embed f_s and f_fq into Q, K, and V using 1×1 convolutions Conv_q, Conv_k, and Conv_v, respectively:

Q = Conv_q(f_s),  K = Conv_k(f_fq),  V = Conv_v(f_fq),  (4)

where Q, K, and V retain the original spatial sizes. Then we flatten them along the spatial dimension to obtain the 2D embeddings Q̃, K̃, Ṽ ∈ R^{(HW/16)×C}, and calculate the fused features as:

f_fuse = softmax( Q̃ K̃^T / √(H/4 × W/4 × C) ) Ṽ.  (5)

Finally, we employ a residual connection by adding f_s, f_mt, and f_fuse, and use a convolutional layer to obtain the output f_cmf:

f_cmf = Conv_{3×3}(f_s + f_mt + f_fuse).  (6)
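The fusion block of Eq. 4-6 can be sketched as follows in PyTorch, assuming f_s, f_mt, and f_fq all have shape (B, C, H/4, W/4) as stated above; the class and argument names are ours, and details such as normalization layers are omitted.

```python
# Sketch of the Cross Modality Fusion block (Eq. 4-6).
import torch
import torch.nn as nn

class CrossModalityFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_s, f_mt, f_fq):
        B, C, H, W = f_s.shape                              # H, W are already H/4, W/4
        q = self.conv_q(f_s).flatten(2).transpose(1, 2)     # Q~: (B, HW/16, C), Eq. 4
        k = self.conv_k(f_fq).flatten(2).transpose(1, 2)    # K~: (B, HW/16, C)
        v = self.conv_v(f_fq).flatten(2).transpose(1, 2)    # V~: (B, HW/16, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / (H * W * C) ** 0.5, dim=-1)
        f_fuse = (attn @ v).transpose(1, 2).reshape(B, C, H, W)   # Eq. 5
        return self.conv_out(f_s + f_mt + f_fuse)                 # Eq. 6
```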

3.3 Loss functions
Cross-entropy loss. The feature map f_cmf is then passed through several layers, followed by a single-scale transformer (patch size equal to 2×2) to obtain global semantic features, which are finally used to predict whether the input image is real or fake using a cross-entropy loss L_cls:

L_cls = y log ŷ + (1 − y) log(1 − ŷ),  (7)

where y is set to 1 if the face image has been manipulated and to 0 otherwise, and ŷ denotes the label predicted by our network.
Segmentation loss. It is worth noting that using only a binary classifier tends to result in overfitted models. We additionally predict the manipulated face region as an auxiliary task to enrich the supervision for training the network. Specifically, we input the fused feature map to a decoder to produce a binary mask M̂ ∈ R^{H×W}:

L_seg = Σ_{i,j} [ M_{i,j} log M̂_{i,j} + (1 − M_{i,j}) log(1 − M̂_{i,j}) ],  (8)

where M_{i,j} is the ground-truth mask, with 1 indicating manipulated pixels and 0 otherwise.
Contrastive loss. Deepfake images generated by different facial manipulation methods differ in forgery patterns, while the distribution of real images is relatively stable. To improve the generalization ability of our detection model, we first calculate the feature center of N_p real samples, C_pos = (1/N_p) Σ_{i=1}^{N_p} f^pos_i, and additionally use a contrastive loss to encourage features from pristine samples to be closer to the feature center than those from manipulated samples. Formally, the contrastive loss is defined as:

L_con = (1/N_p) Σ_{i=1}^{N_p} d(f^pos_i, C_pos) − (1/N_n) Σ_{i=1}^{N_n} d(f^neg_i, C_pos),  (9)

where N_n denotes the number of negative samples and d computes the distance with cosine similarity. Finally, combining Eqn. 7, Eqn. 8, and Eqn. 9, the training objective can be written as:

L = L_cls + λ_1 L_seg + λ_2 L_con,  (10)

where λ_1 and λ_2 are balancing hyper-parameters. By default, we set λ_1 = 1 and λ_2 = 0.001.
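A compact sketch of this objective might look like the following, where standard binary cross-entropy stands in for L_cls and L_seg (the paper writes Eq. 7 and 8 without the leading minus sign) and the distance d is taken as 1 minus cosine similarity; the function name and tensor conventions are our assumptions.

```python
# Sketch of the training objective in Eq. 7-10, with lambda_1 = 1, lambda_2 = 0.001.
import torch.nn.functional as F

def m2tr_loss(logit, y, mask_logit, gt_mask, feat_pos, feat_neg,
              lambda1=1.0, lambda2=0.001):
    # Eq. 7: real/fake classification loss (logit and y have the same shape)
    l_cls = F.binary_cross_entropy_with_logits(logit, y)
    # Eq. 8: pixel-wise mask prediction loss on the decoder's logit map
    l_seg = F.binary_cross_entropy_with_logits(mask_logit, gt_mask)
    # Eq. 9: contrastive loss around the feature center of real samples
    c_pos = feat_pos.mean(dim=0, keepdim=True)                 # C_pos
    d_pos = 1 - F.cosine_similarity(feat_pos, c_pos, dim=1)    # distances of real features
    d_neg = 1 - F.cosine_similarity(feat_neg, c_pos, dim=1)    # distances of fake features
    l_con = d_pos.mean() - d_neg.mean()
    # Eq. 10: total objective
    return l_cls + lambda1 * l_seg + lambda2 * l_con
```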

4 SR-DF DATASET
To stimulate research on Deepfake forensics, we introduce a large-scale and challenging dataset, SR-DF. SR-DF is built upon the pristine videos in the FF++ dataset, which contain a diverse set of samples across different genders, ages, and ethnic groups. We first crop face regions in each video frame using [27], and then generate forged videos using state-of-the-art Deepfake generation techniques. Finally, we use the image harmonization method in [6] for post-processing. Below we introduce these steps in detail.

4.1 Dataset Construction
Synthesis Approaches. To guarantee the diversity of synthesized images, we use four facial manipulation methods, including two face swapping methods, FSGAN [42] and FaceShifter [31], and two facial reenactment methods, First-order-motion [50] and IcFace [56]. Note that the manipulation methods we leverage are all identity-agnostic: they can be applied to arbitrary face images without training in pairs, which is different from the FF++ [48] dataset. The detailed forgery image generation process is described in the Appendix.



[Figure 4 image grid, rows (a)-(d): rows (a) and (b) show an original frame followed by five reenacted results; rows (c) and (d) show source/target pairs and the corresponding swapped results.]

Figure 4: Example frames from the SR-DF dataset. The first two rows are generated by manipulating facial expressions: (a) First-order-motion and (b) IcFace, while the last two rows are generated by manipulating facial identity: (c) FaceShifter and (d) FSGAN.


Figure 5: Synthesized images of blending the altered face into the background image. We compare three blending methods: naive stitching (left), stitching with color transfer (middle), and stitching with DoveNet (right).

Table 1: A comparison of the SR-DF dataset with existing datasets for Deepfake detection. LQ: low-quality, HQ: high-quality.

Dataset | Real Videos | Real Frames | Forged Videos | Forged Frames
UADFV [63] | 49 | 17.3k | 49 | 17.3k
DF-TIMIT-LQ [28] | 320 | 34.0k | 320 | 34.0k
DF-TIMIT-HQ [28] | 320 | 34.0k | 320 | 34.0k
FF++ [48] | 1,000 | 509.9k | 4,000 | 1,830.1k
DFD [9] | 363 | 315.4k | 3,068 | 2,242.7k
DFDC [13] | 1,131 | 488.4k | 4,113 | 1,783.3k
WildDeepfake [69] | 3,805 | 440.5k | 3,509 | 739.6k
Celeb-DF [35] | 590 | 225.4k | 5,639 | 2,116.8k
SR-DF (ours) | 1,000 | 509.9k | 4,000 | 2,078.4k

boundaries, we use DoveNet [6] for post-processing, which is a state-of-the-art image harmonization method that makes the foreground compatible with the background. Note that the masks we use to distinguish foreground from background are generated using a face parsing model [16]. We compare the blending results with naive stitching and a color transfer algorithm [47] adopted by [35], and show an example of a synthesized image in Figure 5.

Existing Deepfake Datasets. The UADFV dataset [63] contains 49 real videos collected from YouTube and 49 Deepfake videos that are generated using FakeAPP [20]. The DeepFake-TIMIT dataset

[28] includes 320 real videos and 640 Deepfake videos (320 high-quality and 320 low-quality) generated with faceswap-GAN [19]. The FaceForensics++ (FF++) dataset [48] has 1,000 real videos from YouTube and 4,000 corresponding Deepfake videos that are generated with 4 face manipulation methods: Deepfakes [10], FaceSwap [18], Face2Face [55], and NeuralTextures [54]. The Google/Jigsaw DeepFake detection (DFD) dataset [9] includes 3,068 Deepfake videos that are generated based on 363 original videos. The Facebook DeepFake detection challenge (DFDC) dataset [13] is part of the DeepFake detection challenge, and has 1,131 original videos and 4,113 Deepfake videos. The WildDeepfake dataset [69] consists of 3,805 real videos and 3,509 fake videos that are collected from the Internet. The Celeb-DF dataset [35] contains 590 real videos and 5,639 Deepfake videos created using the same synthesis algorithm. We summarize the basic information of these existing datasets and our SR-DF dataset in Table 1.

4.2 Visual Quality Assessment

As mentioned above, how to measure the quality of forged images in these datasets is under-explored. Therefore, we introduce a variety of quantitative metrics to benchmark the quality of current datasets from four perspectives: identity retention, authenticity, temporal smoothness, and diversity. To the best of our knowledge, this is the most comprehensive evaluation system to measure the quality of Deepfake datasets.

Mask-SSIM. First, we follow [35] and adopt the Mask-SSIM score as a measurement of synthesized Deepfake images. Mask-SSIM refers to the SSIM score between the face regions of the forged image and the corresponding original image. We use [16] to generate facial masks and compute the Mask-SSIM on our face swapping subsets. Table 2 reports the average Mask-SSIM scores of all compared datasets, and the SR-DF dataset achieves the highest score.
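For reference, a masked SSIM of this form can be computed in a few lines. The sketch below assumes aligned forged/pristine frames and a binary face mask from a parsing model; the function name and inputs are our own placeholders rather than released code.

```python
# Minimal sketch of a masked SSIM score between a forged frame and its pristine
# counterpart. Assumes both images are already aligned and that `face_mask` is a
# binary (H, W) mask produced by a face parsing model.
import numpy as np
from skimage.metrics import structural_similarity

def mask_ssim(forged_rgb: np.ndarray, real_rgb: np.ndarray, face_mask: np.ndarray) -> float:
    # Per-pixel SSIM map over the full image (channel_axis=-1 for RGB, uint8 inputs).
    _, ssim_map = structural_similarity(
        real_rgb, forged_rgb, channel_axis=-1, full=True, data_range=255
    )
    # Average the SSIM map only inside the facial region.
    mask = face_mask.astype(bool)
    return float(ssim_map[mask].mean())
```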



Table 2: Average Mask-SSIM scores and perceptual loss of different Deepfake datasets. Mask-SSIM lies in the range [0, 1], with higher values corresponding to better image quality. Following [35], Mask-SSIM is calculated only on videos with exact correspondences for the DFD and DFDC datasets. For perceptual loss, a lower value indicates better image quality.

Dataset | FF++ | DFD | DFDC | Celeb-DF | Ours
Mask-SSIM ↑ | 0.82 | 0.86 | 0.85 | 0.91 | 0.92
Perceptual Loss ↓ | 0.67 | 0.69 | 0.63 | 0.59 | 0.60

Figure 6: A feature-space comparison of the Celeb-DF, FF++ (RAW), and SR-DF datasets. We use an ImageNet-pretrained ResNet-18 network to extract features and t-SNE [57] for dimension reduction. Note that we only select one frame from each video for visualization.

Table 3: Average $E_{warp}$ values of different datasets, with lower values corresponding to smoother temporal results. We also calculate $E_{warp}$ for the pristine videos in our dataset.

Dataset | DFD | FF++ | Celeb-DF | Ours | Real
$E_{warp}$ | 69.53 | 73.16 | 49.10 | 56.95 | 14.28

Perceptual Loss. Perceptual loss is usually used in face inpainting approaches [39, 64] to measure the similarity between the restored faces and the corresponding complete faces. Inspired by this, we use the relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1 layers of a VGG-19 network pretrained on ImageNet [11] to calculate the perceptual loss between the feature maps of forged faces and those of the corresponding real faces. Note that we use [27] to crop the facial regions. We compare the perceptual loss of different datasets in Table 2. Although the perceptual loss of the SR-DF dataset is slightly higher than that of Celeb-DF, it is lower than that of the other datasets by a large margin.

E_warp. The warping error $E_{warp}$ is used by [4, 22, 30] to measure temporal inconsistency for video style transfer. We use it to compute the $E_{warp}$ of consecutive forged frames in different datasets to quantitatively measure short-term consistency. Following [30], we use the method in [49] to calculate occlusion maps and PWC-Net [51] to obtain optical flow. The $E_{warp}$ values of different Deepfake datasets are shown in Table 3 for comparison.

Feature Space Distribution. As can be seen from the above, the Celeb-DF dataset [35] has decent visual quality. However, only one face swapping method was used to generate all of its forged images, which results in limited diversity of the data distribution. We illustrate this by visualizing the feature space of Celeb-DF [35], the FF++ dataset [48], and SR-DF in Figure 6. We can see that the data distribution of the Celeb-DF dataset is more concentrated, while the real and forged images of the FF++ dataset can be easily separated in the feature space. On the other hand, the data in the SR-DF dataset are more scattered in the 2D space.
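The visualization in Figure 6 follows a standard recipe; a minimal sketch is given below, assuming `frames` is a preprocessed batch containing one frame per video (variable names are our own, not released code).

```python
# Minimal sketch of the feature-space visualization: embed one frame per video
# with an ImageNet-pretrained ResNet-18 and project to 2D with t-SNE.
import torch
from torchvision.models import resnet18
from sklearn.manifold import TSNE

backbone = resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()   # keep the 512-d pooled features
backbone.eval()

@torch.no_grad()
def embed_2d(frames: torch.Tensor):
    """frames: (N, 3, 224, 224) normalized crops; N should exceed the default t-SNE perplexity."""
    feats = backbone(frames).cpu().numpy()             # (N, 512) features
    return TSNE(n_components=2).fit_transform(feats)   # (N, 2) points to plot
```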

5 EXPERIMENTS

5.1 Experimental Settings

Datasets. We conduct experiments on FaceForensics++ (FF++) [48], Celeb-DF [35], and the newly proposed SR-DF dataset. FF++ consists of 1,000 original videos with real faces, of which 720 videos are used for training, 140 are reserved for validation, and 140 for testing. Each video is manipulated by four Deepfake methods, i.e., Deepfakes [10], FaceSwap [18], Face2Face [55], and NeuralTextures [54]. Different degrees of compression are applied to both real and forged images to produce the high-quality (HQ) and low-quality (LQ) versions of FF++, respectively. Celeb-DF comprises 890 real videos and 5,639 Deepfake videos, of which 6,011 videos are used for training and 518 for testing. For SR-DF, we build on the 1,000 original videos in FF++ and generate 4,000 forged videos using four state-of-the-art subject-agnostic Deepfake generation techniques (see details above). We use the same training, validation, and test set partitioning as FF++.

When training on the FF++ and SR-DF datasets, following [44, 66], we augment the real images four times by repeated sampling to balance the number of real and fake samples. For FF++, we sample 270 frames from each video, following the setting in [44, 48].

Evaluation Metrics. We use the Accuracy score (ACC) and the Area Under the ROC Curve (AUC) as our evaluation metrics, which are commonly used in various classification tasks including Deepfake detection [32, 41, 44, 48, 66].

Implementation Details. For all real and forged images, we use dlib [27] to crop the face regions as inputs with a size of 320 × 320. The patch sizes in Sec. 4.2.1 are set to 80 × 80, 40 × 40, 20 × 20, and 10 × 10. For our backbone network, we use EfficientNet-b4 [53] pretrained on ImageNet [11]. We use Adam for optimization with a learning rate of 0.0001. The learning rate is decayed by a factor of 10 every 40 steps. We set the batch size to 24 and train the complete network for 90 epochs.
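For concreteness, the optimization schedule above could be set up roughly as follows; this is a sketch under our own assumptions (a stand-in torchvision backbone, a plain cross-entropy objective, and "steps" interpreted as epochs), not the authors' released training code.

```python
# Minimal sketch of the training setup described above: Adam, lr 1e-4, decayed
# by 10x every 40 steps, batch size 24. The backbone is a stand-in, not M2TR.
import torch
from torchvision.models import efficientnet_b4

model = efficientnet_b4(weights="IMAGENET1K_V1")
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 2)  # real/fake head

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

def train_epoch(loader):
    model.train()
    for faces, labels in loader:          # faces: (24, 3, 320, 320) cropped inputs
        optimizer.zero_grad()
        loss = criterion(model(faces), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                      # decay once per epoch (assumption)
```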

5.2 Evaluation on FaceForensics++

FF++ [48] is a widely used dataset in various Deepfake detection approaches [1, 7, 21, 32, 44, 66]. Therefore, we compare M2TR with current state-of-the-art Deepfake detection methods. We test the frame-level detection performance on RAW, HQ, and LQ, respectively, and report the ACC and AUC scores (%) in Table 4. We compare with top-performing methods, including: (i) Steg. Features [21], which assembles diverse noise component models to build a joint steganography detector; (ii) LD-CNN [7], which uses CNN models as residual-based local descriptors for forgery detection; (iii) MesoNet [1], which mines the mesoscopic properties of forged images with the shallow layers of convolutional networks; (iv) Face X-ray [32], which detects the discrepancies across blending boundaries; (v) F3-Net [44], which adopts a two-branch architecture where one branch makes use of frequency clues to recognize forgery patterns and the other extracts the discrepancy of frequency statistics between real and fake images; and (vi) MaDD [66], which proposes a multi-attentional Deepfake detection framework to capture artifacts with multiple attention maps.



Table 4: Quantitative frame-level detection results on the FaceForensics++ dataset under all quality settings. The best results are marked in bold.

Methods | LQ ACC | LQ AUC | HQ ACC | HQ AUC | RAW ACC | RAW AUC
Steg. Features [21] | 55.98 | - | 70.97 | - | 97.63 | -
LD-CNN [7] | 58.69 | - | 78.45 | - | 98.57 | -
MesoNet [1] | 70.47 | - | 83.10 | - | 95.23 | -
Face X-ray [32] | - | 61.6 | - | 87.4 | - | -
F3-Net [44] | 90.43 | 93.30 | 97.52 | 98.10 | 99.95 | 99.80
MaDD [66] | 88.69 | 90.40 | 97.60 | 99.29 | - | -
Ours | 92.35 | 94.22 | 98.23 | 99.48 | 99.21 | 99.91


Table 4 summarizes the results and comparisons. We can see that our method achieves state-of-the-art performance on all versions (i.e., LQ, HQ, and RAW) of FF++. This suggests the effectiveness of our approach in detecting Deepfakes of different visual qualities. Comparing across different versions of the FF++ dataset, we see that while most approaches achieve high performance on the high-quality version of FF++, they suffer a significant performance degradation on FF++ (LQ), where the forged images are compressed. This could be remedied by leveraging frequency information. While both F3-Net and M2TR use frequency features, M2TR achieves an accuracy of 92.35% in the LQ setting, outperforming the F3-Net approach by 1.92%.

5.3 Evaluation on Celeb-DF and SR-DF

In this section, we conduct experiments to evaluate the frame-level detection accuracy of our M2TR on the Celeb-DF [35] and SR-DF datasets. Note that we do not report the quantitative results of certain state-of-the-art Deepfake detection methods, including [32, 44, 66], because their code and models are not publicly available. The results are reported in Table 5. We observe that our M2TR achieves 99.9% and 90.5% on Celeb-DF and SR-DF, respectively, outperforming all the other Deepfake detection methods. This suggests that our approach is indeed effective for Deepfake detection across different datasets.

In addition, the quality of different Deepfake datasets can be evaluated by comparing the detection accuracy of the same detection method on different datasets. Given that Celeb-DF [35] contains high-quality samples (as discussed in Sec. 4.2, Celeb-DF achieves the best results among existing Deepfake datasets on the Mask-SSIM, perceptual loss, and $E_{warp}$ metrics), we calculate the average frame-level AUC scores of all compared detection methods on the Celeb-DF and SR-DF datasets, and report them in the last row of Table 5. The overall performance on SR-DF is 9.2% lower than that on Celeb-DF, which demonstrates that SR-DF is more challenging.

5.4 Generalization Ability

Generalization ability is at the core of Deepfake detection. We evaluate the generalization of our M2TR by training separately on the FF++ (HQ) and SR-DF datasets and testing on the other datasets. We follow [66] to sample 30 frames from each video and calculate the frame-level AUC scores. The comparison results are shown in Table 6.

Table 5: Frame-level AUC scores (%) of various Deepfake detection methods on the Celeb-DF and SR-DF datasets.

Methods | Celeb-DF | SR-DF
Xception [48] | 97.6 | 88.2
Multi-task [40] | 90.5 | 85.7
Capsule [41] | 93.2 | 81.5
DSW-FPA [33] | 94.8 | 86.6
DCViT [60] | 97.2 | 87.9
Ours | 99.9 | 90.5
Avg | 95.5 | 86.7

Note that for the Deepfake detection models that are not publicly available, we only use the results reported in their papers. The results in Table 6 demonstrate that our method achieves better generalization than most existing methods.
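As a reference point, the frame-level AUC used throughout these comparisons reduces to a single call once per-frame scores are collected; the arrays below are hypothetical placeholders, not results from the paper.

```python
# Minimal sketch of the frame-level AUC metric used in these comparisons.
# `labels` holds per-frame ground truth (0 = real, 1 = fake) and `scores`
# holds predicted fake probabilities; both are hypothetical examples.
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.90])
print(f"frame-level AUC: {roc_auc_score(labels, scores):.3f}")
```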

5.5 From Frames to Videos

Existing methods on Deepfake detection mainly perform evaluation on frames extracted from videos, even though full videos are provided. However, in real-world scenarios, most Deepfake data circulating on the Internet are fake videos; therefore, we also conduct experiments to evaluate our M2TR on video-level Deepfake detection. The most significant difference between videos and images is the additional temporal information across frame sequences. We demonstrate that M2TR can be easily extended for video modeling by adding a temporal transformer that combines the frame-level features generated by M2TR. We refer to such an extension as spatial-temporal M2TR (ST-M2TR).

In particular, we sample 16 frames at regular intervals from each video and directly use the model trained at the frame level to extract features for each frame. These features are then fed into a transformer block (with 4 stacked encoders, each with 8 attention heads, and an MLP head with two fully connected layers) to obtain video-level predictions. We report the AUC scores (%) and compare with: (1) P3D [45], which simplifies 3D convolutions with 2D filters on the spatial dimension and 1D temporal connections; (2) R3D [62], which encodes video sequences using a 3D fully convolutional network and then generates candidate temporal fragments for classification; (3) I3D [3], which expands 2D CNNs with an additional temporal dimension to introduce a two-stream inflated 3D convolutional network; and (4) M2TR$_{mean}$, which averages the M2TR features of different frames for video-level prediction. Note that (1) and (3) are designed for video action recognition, while (2) is designed for temporal activity detection, and we adapt them for video-level Deepfake detection. The results are summarized in Table 7. We can see that our method achieves the best performance on both FF++ and SR-DF.
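A rough sketch of such a temporal head is shown below, following the sizes stated above (4 encoder layers, 8 heads, a two-layer MLP); the feature dimension, pooling choice, and class count are our assumptions rather than the exact ST-M2TR implementation.

```python
# Rough sketch of the temporal aggregation described above. `frame_feats` is a
# (T=16, N, dim) tensor of per-frame features from the frame-level model
# (sequence-first layout, as expected by nn.TransformerEncoder by default).
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, dim: int = 512, num_classes: int = 2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        fused = self.temporal(frame_feats).mean(dim=0)  # pool over the time axis
        return self.mlp(fused)                          # video-level logits
```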

5.6 Ablation Study

Effectiveness of Different Components. The Multi-scale Transformer (MT) of our method is designed to capture local inconsistencies between patches of different sizes, while the Cross Modality Fusion (CMF) module is utilized to introduce frequency-modality features and fuse them effectively with RGB-modality features. To evaluate the effectiveness of MT and CMF, we remove each of them separately from M2TR and report the performance on FF++. The quantitative results are listed in Table 8, which validate that the



Table 6: AUC scores (%) for cross-dataset evaluation on the FF++, Celeb-DF, and SR-DF datasets. Note that some methods have not made their code public, so we directly use the data reported in their papers. “−” denotes the results are unavailable.

Training Set | Testing Set | Xception [48] | Multi-task [40] | Capsule [41] | DSW-FPA [33] | Two-Branch [36] | F3-Net [44] | MaDD [66] | DCViT [60] | Ours
FF++ | FF++ | 99.7 | 76.3 | 96.6 | 93.0 | 98.7 | 98.1 | 99.3 | 98.3 | 99.5
FF++ | Celeb-DF | 48.2 | 54.3 | 57.5 | 64.6 | 73.4 | 65.2 | 67.4 | 60.8 | 65.7
FF++ | SR-DF | 37.9 | 38.7 | 41.3 | 44.0 | - | - | - | 57.8 | 62.6
SR-DF | SR-DF | 88.2 | 85.7 | 81.5 | 86.6 | - | - | - | 87.9 | 90.5
SR-DF | FF++ | 63.2 | 58.9 | 60.6 | 69.1 | - | - | - | 62.6 | 77.9
SR-DF | Celeb-DF | 59.4 | 51.7 | 52.1 | 62.9 | - | - | - | 63.7 | 80.7

Table 7: Quantitative video-level detection results on different versions of the FF++ dataset and the SR-DF dataset. M2TR$_{mean}$ denotes averaging the M2TR features extracted for all frames as the video-level representation, while ST-M2TR denotes using a temporal transformer for fusion. The best results are marked in bold.

Method | FF++ (RAW) | FF++ (HQ) | FF++ (LQ) | SR-DF
P3D [45] | 80.9 | 75.23 | 67.05 | 65.97
R3D [62] | 96.15 | 95.00 | 87.72 | 73.24
I3D [3] | 98.23 | 96.70 | 93.18 | 80.11
M2TR$_{mean}$ | 98.19 | 98.77 | 93.28 | 82.09
ST-M2TR | 99.96 | 99.30 | 94.16 | 84.62

Table 8: Ablation results on FF++ (HQ) and FF++ (LQ) with and without the Multi-scale Transformer (MT) and CMF.

Method | LQ ACC (%) | LQ AUC (%) | HQ ACC (%) | HQ AUC (%)
Baseline (EfficientNet-b4) | 87.34 | 88.76 | 93.79 | 96.01
w/o MT | 90.44 | 91.89 | 95.36 | 97.17
w/o CMF | 90.97 | 92.58 | 96.25 | 98.54
Ours | 92.35 | 94.22 | 98.23 | 99.48

use of MT and CMF can effectively improve the detection performance of our model. In particular, the proposed CMF module brings a remarkable improvement under the low-quality (LQ) setting, i.e., about a 1.7% gain in AUC score, which mainly benefits from the complementary information of the frequency modality.

Effectiveness of the Multi-scale Design. To verify the effectiveness of using multi-scale patches in different heads of our multi-scale transformer, we replace MT with single-scale transformers of different patch sizes and conduct experiments on FF++ (HQ). The results in Table 9 demonstrate that our full model achieves the best performance with MT, i.e., 2.2%, 1.2%, and 0.5% higher AUC than the 40 × 40, 20 × 20, and 10 × 10 single-scale transformers, respectively. This confirms that the use of a multi-scale transformer is indeed effective.

Effectiveness of the Contrastive Loss. To illustrate the contribution of the contrastive loss to the generalization ability of our method, we train M2TR without its supervision and evaluate the cross-dataset detection accuracy. The comparison results are reported in Table 10. We can see that 1) when training on FF++ without the contrastive loss, the AUC decreases by 2.7% and 5.6% on Celeb-DF and SR-DF, respectively; 2)

Table 9: Ablation results on FF++ (HQ) using the multi-scale Transformer (MT) or a single-scale transformer.

Patch size | ACC (%) | AUC (%)
40 × 40 | 95.62 | 97.33
20 × 20 | 96.81 | 98.29
10 × 10 | 97.55 | 98.94
Ours | 98.23 | 99.48

Table 10: AUC (%) for cross-dataset evaluation on FF++ (HQ), Celeb-DF, and SR-DF with (denoted as M2TR) and without (denoted as M2TR$_{ncl}$) the supervision of the contrastive loss.

Training Set | Testing Set | M2TR$_{ncl}$ | M2TR
FF++ | Celeb-DF | 63.9 | 65.7
FF++ | SR-DF | 59.1 | 62.6
SR-DF | FF++ | 74.2 | 77.9
SR-DF | Celeb-DF | 78.5 | 80.7

when training on the SR-DF dataset without the contrastive loss, the AUC decreases by 4.7% and 2.7% on FF++ and Celeb-DF, respectively.

6 CONCLUSION

In this paper, we presented a Multi-modal Multi-scale Transformer (M2TR) for Deepfake detection, which uses a multi-scale transformer to capture subtle local inconsistencies at multiple scales. Additionally, we introduced a cross modality fusion module to improve the robustness against image compression. We also introduced a challenging dataset, SR-DF, which is generated with several state-of-the-art face swapping and facial reenactment methods, and built the most comprehensive evaluation system to quantitatively verify that the SR-DF dataset is better than existing datasets in terms of visual quality and data diversity. Extensive experiments on different datasets demonstrate the effectiveness and generalization ability of the proposed method.



REFERENCES
[1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. 2018. MesoNet: a compact facial video forgery detection network. In WIFS.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV.
[3] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR.
[4] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. 2017. Coherent online video style transfer. In ICCV.
[5] Shen Chen, Taiping Yao, Yang Chen, Shouhong Ding, Jilin Li, and Rongrong Ji. 2021. Local Relation Learning for Face Forgery Detection. In AAAI.
[6] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. 2020. DoveNet: Deep Image Harmonization via Domain Verification. In CVPR.
[7] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. 2017. Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection. In Workshop on IH&MMSec.
[8] Kevin Dale, Kalyan Sunkavalli, Micah K Johnson, Daniel Vlasic, Wojciech Matusik, and Hanspeter Pfister. 2011. Video face replacement. In SIGGRAPH Asia.
[9] DeepFake Detection Dataset. 2019. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html.
[10] Deepfakes. 2018. github. https://github.com/deepfakes/faceswap.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[13] Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton Ferrer. 2020. The deepfake detection challenge (dfdc) dataset. arXiv preprint arXiv:2006.07397 (2020).
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[15] Ricard Durall, Margret Keuper, Franz-Josef Pfreundt, and Janis Keuper. 2019. Unmasking deepfakes with simple features. arXiv preprint arXiv:1911.00686 (2019).
[16] Face-parsing. 2019. github. https://github.com/zllrunning/face-parsing.PyTorch.
[17] FaceShifter. 2020. github. https://github.com/mindslab-ai/faceshifter.
[18] Faceswap. 2018. github. https://github.com/MarekKowalski/FaceSwap/.
[19] Faceswap-GAN. 2019. github. https://github.com/shaoanlu/faceswap-GAN.
[20] Fakeapp. 2018. https://www.fakeapp.com/.
[21] Jessica Fridrich and Jan Kodovsky. 2012. Rich models for steganalysis of digital images. TIFS (2012).
[22] Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, and Wei Liu. 2017. Real-time neural style transfer for videos. In CVPR.
[23] Po-Hsiang Huang, Fu-En Yang, and Yu-Chiang Frank Wang. 2020. Learning Identity-Invariant Motion Representations for Cross-ID Face Reenactment. In CVPR.
[24] Ying Huang, Wenwei Zhang, and Jinzhuo Wang. 2020. Deep frequent spatial temporal learning for face anti-spoofing. arXiv preprint arXiv:2002.03723 (2020).
[25] Hyeonseong Jeon, Youngoh Bang, and Simon S Woo. 2020. FDFtNet: Facing off fake images using fake detection fine-tuning network. In ICT Systems Security and Privacy Protection.
[26] Ira Kemelmacher-Shlizerman. 2016. Transfiguring portraits. ACM TOG (2016).
[27] Davis E King. 2009. Dlib-ml: A machine learning toolkit. JMLR (2009).
[28] Pavel Korshunov and Sébastien Marcel. 2018. Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685 (2018).
[29] Mohammad Rami Koujan, Michail Christos Doukas, Anastasios Roussos, and Stefanos Zafeiriou. 2020. Head2head: Video-based neural head synthesis. arXiv preprint arXiv:2005.10954 (2020).
[30] Chenyang Lei, Yazhou Xing, and Qifeng Chen. 2020. Blind video temporal consistency via deep video prior. arXiv preprint arXiv:2010.11838 (2020).
[31] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. 2019. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457 (2019).
[32] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. 2020. Face x-ray for more general face forgery detection. In CVPR.
[33] Yuezun Li and Siwei Lyu. 2018. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656 (2018).
[34] Yuezun Li and Siwei Lyu. 2019. Exposing DeepFake Videos By Detecting Face Warping Artifacts. In CVPRW.
[35] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. 2020. Celeb-df: A large-scale challenging dataset for deepfake forensics. In CVPR.
[36] Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Gurudatt, and Wael AbdAlmageed. 2020. Two-branch recurrent network for isolating deepfakes in videos. In ECCV.
[37] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612 (2017).
[38] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. 2018. Rsgan: face swapping and editing using face and hair representation in latent spaces. arXiv preprint arXiv:1804.03447 (2018).
[39] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z Qureshi, and Mehran Ebrahimi. 2019. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212 (2019).
[40] Huy H Nguyen, Fuming Fang, Junichi Yamagishi, and Isao Echizen. 2019. Multi-task learning for detecting and segmenting manipulated facial images and videos. In BTAS.
[41] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. 2019. Use of a capsule network to detect fake images and videos. arXiv preprint arXiv:1910.12467 (2019).
[42] Yuval Nirkin, Yosi Keller, and Tal Hassner. 2019. Fsgan: Subject agnostic face swapping and reenactment. In ICCV.
[43] Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. 2018. Ganimation: Anatomically-aware facial animation from a single image. In ECCV.
[44] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In ECCV.
[45] Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV.
[46] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[47] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley. 2001. Color transfer between images. IEEE CG&A (2001).
[48] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. Faceforensics++: Learning to detect manipulated facial images. In ICCV.
[49] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. 2016. Artistic style transfer for videos. In GCPR.
[50] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2020. First order motion model for image animation. arXiv preprint arXiv:2003.00196 (2020).
[51] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. 2018. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In CVPR.
[52] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio. ACM TOG (2017).
[53] Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML.
[54] Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred neural rendering: Image synthesis using neural textures. ACM TOG (2019).
[55] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Niessner. 2016. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In CVPR.
[56] Soumya Tripathy, Juho Kannala, and Esa Rahtu. 2020. Icface: Interpretable and controllable face reenactment using gans. In WACV.
[57] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. JMLR (2008).
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
[59] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy to spot... for now. In CVPR.
[60] Deressa Wodajo and Solomon Atnafu. 2021. Deepfake Video Detection Using Convolutional Vision Transformer. arXiv preprint arXiv:2102.11126 (2021).
[61] Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy. 2018. Reenactgan: Learning to reenact faces via boundary transfer. In ECCV.
[62] Huijuan Xu, Abir Das, and Kate Saenko. 2017. R-c3d: Region convolutional 3d network for temporal activity detection. In ICCV.
[63] Xin Yang, Yuezun Li, and Siwei Lyu. 2019. Exposing deep fakes using inconsistent head poses. In ICASSP.
[64] Yang Yang, Xiaojie Guo, Jiayi Ma, Lin Ma, and Haibin Ling. 2019. Lafin: Generative landmark guided face inpainting. arXiv preprint arXiv:1911.11394 (2019).
[65] Ning Yu, Larry S Davis, and Mario Fritz. 2019. Attributing fake images to gans: Learning and analyzing gan fingerprints. In ICCV.
[66] Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Tianyi Wei, Weiming Zhang, and Nenghai Yu. 2021. Multi-attentional Deepfake Detection. In CVPR.
[67] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. 2017. Two-stream neural networks for tampered face detection. In CVPRW.
[68] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv preprint arXiv:2010.04159 (2020).
[69] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. 2020. WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection. In ACM MM.



A SYNTHESIS METHODS

A.1 FSGAN

FSGAN [42] follows the pipeline below to swap the face of the source image $I_s$ to that of the target image $I_t$. First, the swap generator $G_r$ estimates the swapped face $I_r$ and its segmentation mask $S_r$ based on $I_t$ and a heatmap encoding the facial landmarks of $I_s$, while $G_s$ estimates the segmentation mask $S_s$ of the source image $I_s$. Then the inpainting generator $G_c$ inpaints the missing parts of $I_r$ based on $S_s$ to estimate the complete swapped face $I_c$. Finally, using the segmentation mask $S_s$, the blending generator $G_b$ blends $I_c$ and $I_s$ to generate the final output $I_b$, which preserves the posture of $I_s$ but owns the identity of $I_t$. For our dataset, we directly use the pretrained model provided by [42] and run inference on our pristine videos.
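The pipeline can be summarized schematically as below; the generator callables and helper are hypothetical placeholders mirroring the description above, not the released FSGAN API.

```python
# Schematic sketch of the FSGAN inference pipeline as described above.
# G_r, G_s, G_c, G_b and landmarks_to_heatmap are placeholder callables.
def fsgan_swap(I_s, I_t, G_r, G_s, G_c, G_b, landmarks_to_heatmap):
    heatmap_s = landmarks_to_heatmap(I_s)   # heatmap of the source landmarks
    I_r, S_r = G_r(I_t, heatmap_s)          # swapped face and its segmentation mask
    S_s = G_s(I_s)                          # segmentation mask of the source image
    I_c = G_c(I_r, S_s)                     # inpaint the missing face parts
    I_b = G_b(I_c, I_s, S_s)                # blend into the source frame
    return I_b
```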

A.2 FaceShifter

The full FaceShifter [31] pipeline consists of two networks: AEI-Net for face swapping and HEAR-Net for occlusion handling. As the authors of [31] have not released their code, we use the implementation from [17], which only provides AEI-Net, and we train the model on our data. Specifically, AEI-Net is composed of three components: 1) an Identity Encoder, which adopts a pretrained state-of-the-art face recognition model to provide representative identity embeddings; 2) a Multi-level Attributes Encoder, which encodes the features of facial attributes; and 3) an AAD-Generator, which integrates the identity and attribute information at multiple feature levels and generates the swapped faces. We use the parameters declared in [31] to train the model.
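Schematically, an AEI-Net-style forward pass combines the three components above; the module names in this sketch are placeholders, not the interface of [17].

```python
# Schematic sketch of an AEI-Net-style forward pass as described above.
# identity_encoder, attribute_encoder, and aad_generator are placeholder modules.
def aei_net_forward(I_src, I_tgt, identity_encoder, attribute_encoder, aad_generator):
    z_id = identity_encoder(I_src)        # identity embedding of the source face
    z_attrs = attribute_encoder(I_tgt)    # multi-level attribute features of the target
    return aad_generator(z_id, z_attrs)   # swapped face integrating identity and attributes
```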

A.3 First-order-motion

First-order-motion [50] decouples appearance and motion information for subject-agnostic facial reenactment. The framework comprises two main modules: a motion estimation module, which uses a set of learned keypoints along with their local affine transformations to predict a dense motion field, and an image generation module, which combines the appearance extracted from the source image with the motion derived from the driving video and models the occlusions arising during target motions. To process our dataset, we use the model pretrained on the VoxCeleb dataset [37], which contains speech videos from speakers spanning a wide range of ethnic groups, accents, professions, and ages, and reenact the faces in our real videos.

A.4 IcFace

IcFace [56] is a generic face animator that is able to transfer the expressions of a driving image to a source image. Specifically, the generator $G_N$ takes the source image and neutral facial attributes as input and produces the source identity with a central pose and neutral expression. Then the generator $G_A$ takes the neutral image and attributes extracted from the driving image as input and produces an image with the source identity and the driving image's attributes. We train the complete model on our real videos in a self-supervised manner, using the parameters that the authors use to train on the VoxCeleb dataset [37].

B VISUALIZATION OF MASK DETECTION RESULTS

Our method also predicts the manipulated regions of the face image for Deepfake localization. We show some examples of the mask detection results in Figure 7, from which we can see that the manipulated areas can be accurately located.


Figure 7: Visual examples of the input image, the ground-truth mask, and the predicted mask.


