Learning from Video

Passes: Platinum Pass, Full Conference Pass, Full Conference One-Day Pass
Date: Monday, November 18th
Time: 2:15pm - 4:00pm
Venue: Plaza Meeting Room P2
Session Chair(s): Wolfgang Heidrich, King Abdullah University of Science and Technology (KAUST)

Colorblind-Shareable Videos by Synthesizing Temporal-Coherent Polynomial Coefficients

Abstract: To share the same visual content between people with color vision deficiency (CVD) and normal-vision people, attempts have been made to allocate the two visual experiences of a binocular display (wearing and not wearing glasses) to CVD and normal-vision audiences. However, existing approaches only work for still images. Although state-of-the-art temporal filtering techniques can be applied to smooth the per-frame generated content, they may fail to maintain the multiple binocular constraints needed in our application and, even worse, sometimes introduce color inconsistency (the same color regions map to different colors). In this paper, we propose to train a neural network to predict temporally coherent polynomial coefficients in the domain of global color decomposition. This indirect formulation solves the color inconsistency problem. Our key challenge is to design a neural network that predicts temporally coherent coefficients while maintaining all required binocular constraints. Our method is evaluated on various videos, and all metrics confirm that it outperforms all existing solutions.

Authors/Presenter(s):
Xinghong Hu, The Chinese University of Hong Kong, Hong Kong
Xueting Liu, Caritas Institute of Higher Education, Hong Kong
Zhuming Zhang, The Chinese University of Hong Kong, Hong Kong
Menghan Xia, The Chinese University of Hong Kong, Hong Kong
Chengze Li, The Chinese University of Hong Kong, Hong Kong; Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Tien-Tsin Wong, The Chinese University of Hong Kong, Hong Kong; Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
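A note on the indirect formulation above: because every pixel of a frame passes through one global polynomial color mapping, identical input colors always receive identical output colors, which is exactly the color-consistency property that per-pixel alternatives can lose. The minimal Python sketch below illustrates only that recomposition step; the degree-2 RGB basis, the coefficient shapes, and the idea that a temporally coherent predictor supplies per-frame coefficients are assumptions for illustration, not the authors' actual decomposition or network.

import numpy as np

# Hypothetical degree-2 polynomial basis in (r, g, b); the paper's exact basis
# and degree are not given in the abstract, so this choice is an assumption.
def poly_basis(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([np.ones_like(r), r, g, b,
                     r * r, g * g, b * b, r * g, g * b, r * b], axis=-1)

def recompose_frame(frame, coeffs):
    # frame:  (H, W, 3) floats in [0, 1]
    # coeffs: (10, 3) per-frame coefficients, assumed to come from a
    #         temporally coherent predictor as described in the abstract.
    mapped = poly_basis(frame) @ coeffs   # (H, W, 10) @ (10, 3) -> (H, W, 3)
    return np.clip(mapped, 0.0, 1.0)

# Same global mapping for every pixel: equal input colors map to equal outputs.
frame = np.random.rand(4, 4, 3)
coeffs = np.random.rand(10, 3) * 0.1
shared_view = recompose_frame(frame, coeffs)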
Animating Landscape: Self-Supervised Learning of Decoupled Motion and Appearance for Single-Image Video Synthesis

Abstract: Automatic generation of a high-quality video from a single image remains a challenging task despite the recent advances in deep generative models. This paper proposes a method that can create a high-resolution, long-term animation using convolutional neural networks (CNNs) from a single landscape image, where we mainly focus on skies and water. Our key observation is that the motion (e.g., moving clouds) and appearance (e.g., time-varying colors in the sky) in natural scenes have different time scales. We thus learn them separately and predict them with decoupled control, while handling future uncertainty in both predictions by introducing latent codes. Unlike previous methods that infer output frames directly, our CNNs predict spatially smooth intermediate data, namely flow fields for warping (motion) and color transfer maps (appearance), via self-supervised learning, i.e., without explicitly provided ground truth. These intermediate data are applied not to each previous output frame, but to the input image only once for each output frame. This design is crucial to alleviate error accumulation in long-term predictions, which is the essential problem in previous recurrent approaches. The output frames can be looped like a cinemagraph, and can also be controlled directly by specifying latent codes or indirectly via visual annotations. We demonstrate the effectiveness of our method through comparisons with state-of-the-art methods on video prediction as well as appearance manipulation.

Authors/Presenter(s):
Yuki Endo, University of Tsukuba; Toyohashi University of Technology, Japan
Yoshihiro Kanamori, University of Tsukuba, Japan
Shigeru Kuriyama, Toyohashi University of Technology, Japan

DeepRemaster: Temporal Source-Reference Attention Networks for Comprehensive Video Enhancement

Abstract: The remastering of vintage film comprises a variety of sub-tasks, including super-resolution, noise removal, and contrast enhancement, which aim to restore the deteriorated film medium to its original state. Additionally, due to the technical limitations of the time, most vintage film is either recorded in black and white or has low-quality colors, for which colorization becomes necessary. In this work, we propose a single framework to tackle the entire remastering task semi-interactively. Our work is based on temporal convolutional neural networks with attention mechanisms, trained on videos with data-driven deterioration simulation. Our proposed source-reference attention allows the model to handle an arbitrary number of reference color images to colorize long videos without the need for segmentation while maintaining temporal consistency. Quantitative analysis shows that our framework outperforms existing approaches, and that, in contrast to existing approaches, the performance of our framework increases with longer videos and more reference color images.

Authors/Presenter(s):
Satoshi Iizuka, University of Tsukuba, Japan
Edgar Simo-Serra, Waseda University, Japan

Write-A-Video: Computational Video Montage from Themed Text

Abstract: We present Write-A-Video, a tool for the creation of video montage using mostly text editing. Given an input themed text and a related video repository, either from online websites or personal albums, the tool allows novice users to generate a video montage much more easily than current video editing tools. The resulting video illustrates the given narrative, provides diverse visual content, and follows cinematographic guidelines. The process involves three simple steps: (1) the user provides input, mostly in the form of editing the text, (2) the tool automatically searches for semantically matching candidate shots from the video repository, and (3) an optimization method assembles the video montage. Visual-semantic matching between segmented text and shots is performed by cascaded keyword matching and visual-semantic embedding, which achieve better accuracy than alternative solutions. The video assembly is formulated as a hybrid optimization problem over a graph of shots, considering temporal constraints, cinematography metrics such as camera movement and tone, and user-specified cinematography idioms. Using our system, users without video editing experience are able to generate appealing videos.

Authors/Presenter(s):
Miao Wang, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Tsinghua University, Beijing, China
Guo-Wei Yang, BNRist, Tsinghua University, Beijing, China
Shi-Min Hu, BNRist, Tsinghua University, Beijing, China
Shing-Tung Yau, Harvard University, United States of America
Ariel Shamir, The Interdisciplinary Center, Herzliya, Israel
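To make the "optimization over a graph of shots" in the Write-A-Video abstract concrete, here is a toy Python dynamic program that picks one candidate shot per text segment while trading off a per-segment matching cost against a pairwise transition penalty. The cost functions, shot names, and the simple chain structure are hypothetical; the actual system uses a hybrid optimization with cinematography metrics and user-specified idioms that this sketch does not reproduce.

def assemble_montage(candidates, unary_cost, pairwise_cost):
    # candidates:    list of lists; candidates[i] are candidate shots for text segment i
    # unary_cost:    f(segment_index, shot) -> float   (semantic mismatch)
    # pairwise_cost: f(prev_shot, shot)     -> float   (transition penalty)
    # Viterbi-style chain optimization: best[(cost, path)] per current candidate.
    best = [(unary_cost(0, s), [s]) for s in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for s in candidates[i]:
            cost, path = min(
                ((c + pairwise_cost(p[-1], s), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + unary_cost(i, s), path + [s]))
        best = new_best
    return min(best, key=lambda cp: cp[0])   # (total cost, chosen shot sequence)

# Hypothetical toy inputs: two text segments with two candidate shots each.
candidates = [["beach_wide", "beach_close"], ["sunset_pan", "sunset_static"]]
unary = lambda i, s: 0.0 if ("beach" in s or "sunset" in s) else 1.0
pairwise = lambda a, b: 0.5 if (a.endswith("close") and b.endswith("pan")) else 0.0
total_cost, shots = assemble_montage(candidates, unary, pairwise)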
Neural Style-Preserving Visual Dubbing

Abstract: Dubbing is a technique for translating video content from one language to another. However, state-of-the-art visual dubbing techniques directly copy facial expressions from source to target actors without considering identity-specific idiosyncrasies such as a unique type of smile. We present a style-preserving visual dubbing approach from single video inputs, which maintains the signature style of target actors when modifying facial expressions, including mouth motions, to match foreign languages. At the heart of our approach is the concept of motion style, in particular for facial expressions, i.e., the person-specific way expressions change, which is yet another essential factor beyond visual accuracy in face editing applications. Our method is based on a recurrent generative adversarial network that captures the spatiotemporal co-activation of facial expressions and enables generating and modifying the facial expressions of the target actor while preserving their style. We train our model with unsynchronized source and target videos in an unsupervised manner using cycle-consistency and mouth expression losses, and synthesize photorealistic video frames using a layered neural face renderer. Our approach generates temporally coherent results and handles dynamic backgrounds. Our results show that our dubbing approach maintains the idiosyncratic style of the target actor better than previous approaches, even for widely differing source and target actors.

Authors/Presenter(s):
Hyeongwoo Kim, Max Planck Institute for Informatics, Germany
Mohamed Elgharib, Max Planck Institute for Informatics, Germany
Michael Zollhöfer, Stanford University, United States of America
Hans-Peter Seidel, Max Planck Institute for Informatics, Germany
Thabo Beeler, Disney Research, Switzerland
Christian Richardt, University of Bath, United Kingdom
Christian Theobalt, Max Planck Institute for Informatics, Germany
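As a rough illustration of how cycle-consistency and mouth expression terms can coexist in an unsupervised dubbing objective, the sketch below writes both losses over per-frame expression parameter sequences. The parameterization, the index set of mouth-related components, and the loss weights are assumptions for illustration only; the paper's recurrent GAN, its adversarial terms, and the layered neural face renderer are not reproduced here.

import torch

def cycle_consistency_loss(src_params, src_reconstructed):
    # Mapping source -> target style -> back to source should reproduce the
    # original sequence. Shapes (T, D) are an assumed per-frame expression
    # parameterization, e.g. blendshape-like coefficients.
    return torch.mean(torch.abs(src_params - src_reconstructed))

def mouth_expression_loss(translated_params, src_params, mouth_idx):
    # Keep assumed mouth-related components close to the source so lip motion
    # still matches the dubbed audio, while the remaining components are free
    # to adopt the target actor's style.
    return torch.mean(
        torch.abs(translated_params[:, mouth_idx] - src_params[:, mouth_idx])
    )

def total_loss(src, cycled, translated, mouth_idx, w_cycle=1.0, w_mouth=1.0):
    # Hypothetical combined objective for an (unspecified) sequence translator.
    return (w_cycle * cycle_consistency_loss(src, cycled)
            + w_mouth * mouth_expression_loss(translated, src, mouth_idx))

# Hypothetical shapes: 100 frames, 64 expression parameters, 10 mouth-related.
src = torch.randn(100, 64)
cycled = torch.randn(100, 64)
translated = torch.randn(100, 64)
loss = total_loss(src, cycled, translated, mouth_idx=list(range(10)))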