Overview Sessions
Session 1: Tuesday (19) 10:20-12:00
Talk 1
Speaker: Kazushi Ikeda
Affiliation: NAIST
Talk Title: Theoretical properties of deep learning methods
Talk Abstract: Deep learning is widely used as the most powerful technique in artificial intelligence, and it relies on several heuristics such as pre-training, dropout, batch normalization, and skip connections. Although deep learning was long regarded as a black box, recent theoretical work has clarified its properties, that is, why deep learning works well. This talk introduces some theoretical results to deepen the understanding of deep learning.
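For readers unfamiliar with the heuristics named above, the following is a minimal sketch (PyTorch, not from the talk) of a residual block that combines batch normalization, dropout, and a skip connection; all layer sizes are illustrative assumptions.

```python
# Minimal sketch (illustrative only): a residual block combining three of the
# heuristics named in the abstract -- batch normalization, dropout, and a skip
# connection. Dimensions are assumptions, not from the talk.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int = 128, p_drop: float = 0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)    # batch normalization
        self.drop = nn.Dropout(p_drop)    # dropout
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.bn1(self.fc1(x)))
        h = self.fc2(self.drop(h))
        return torch.relu(x + h)          # skip connection: output = x + F(x)

x = torch.randn(32, 128)                  # a dummy mini-batch of 32 samples
y = ResidualBlock()(x)
print(y.shape)                            # torch.Size([32, 128])
```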
Talk 2
Speaker: Jianquan Liu
Affiliation: NEC, Japan
Talk Title: Multimedia Engages with Database for Video Surveillance
Talk Abstract: In this talk, Dr. Liu will give an overview of the research topics currently being pursued at the Biometrics Research Laboratories of NEC Corporation, such as the processing and analysis of multimedia big data. The talk will mainly focus on how NEC has engaged, and is engaging, database and multimedia techniques for large-scale video surveillance in the past, present, and future, based on related research conducted at NEC. Representing the past, he will introduce a commercial-level demo system for surveillance video search, named Wally, which was exhibited at MM'14. Wally is a scalable, distributed, automated video surveillance system with rich search functionalities, integrated with image-processing products developed by NEC, such as NeoFace(R), FieldAnalyst, and StreamPro. Here, NeoFace(R) is one of the best face recognition technologies in the world, with the highest recognition accuracy. Subsequently, he will turn to the present status of video search, where the focus is on triggering the search process without any given query objects, so that the search can be performed automatically based on the analysis of certain pre-defined patterns. This part will be illustrated with a series of works published at SIGGRAPH'16, MM'16, MM'17, ICMR'18, MIPR'19, CBMI'19, BigMM'19, and MM'19. Finally, Dr. Liu will highlight some challenging issues and share future directions of video search for surveillance, such as search that integrates multiple features extracted from surveillance videos and the key applications of surveillance video search. Such directions should help guide research on multimedia big data in the near future.
Biography: Dr. Jianquan Liu received the M.E. and Ph.D. degrees from the University of Tsukuba, Japan, in 2009 and 2012, respectively. He was a development engineer at Tencent Inc. from 2005 to 2006 and a visiting researcher at the Chinese University of Hong Kong in 2010. He is currently a senior researcher at the Biometrics Research Laboratories of NEC Corporation, working on topics in multimedia data processing. He is also an adjunct assistant professor at the Graduate School of Science and Engineering, Hosei University, Japan, teaching courses related to data mining and databases. His research interests include high-dimensional similarity search, multimedia databases, web data mining and information retrieval, cloud storage and computing, and social network analysis. He has published over 50 papers at major international/domestic conferences and journals, received over 20 international/domestic awards, and filed over 40 PCT patents. He has also successfully transformed these technological contributions into commercial products in the industry. He has served or is serving as the PC Co-chair of IEEE ICME 2020, AIVR 2019, BigMM 2019, ISM 2018, ICSC 2018, ISM 2017, ICSC 2017, IRC 2017, and BigMM 2016; the Workshop Co-chair of IEEE AKIE 2018 and ICSC 2016; and the Demo Co-chair of IEEE MIPR 2019 and MIPR 2018. He is a member of ACM, IEEE, and the Database Society of Japan (DBSJ), a member of the expert committees for IEICE Mathematical Systems Science and its Applications and IEICE Data Engineering, and an associate editor of IEEE MultiMedia Magazine and the Journal of Information Processing (JIP).
Talk 3
Speaker: Chung-Hsien Wu
Affiliation: National Cheng Kung University
Talk Title: Multimodal Affective Computing for Mental Health Care
Talk Abstract: Affective computing is a rapidly emerging field whose goal is to recognize the emotional state of a user for rational decision making, social interaction, perception, and memory. There are numerous approaches to achieving this goal using single or multiple modalities, including textual, audio-visual, and physiological information. In past years, approaches to analyzing and recognizing affective expressions from a uni-modal input have been widely investigated. However, the performance of emotion recognition based on a single modality still has its limitations. To further improve emotion recognition performance, a promising research direction is to explore data fusion strategies for effectively integrating multimodal inputs. For affective computing applications, one of the most significant research domains concerns the interrelation between emotions and human health, both mental and physical. In past years, extensive research on affective computing has helped the medical community with technologies for better understanding emotions, identifying their impact on mental health, and offering new techniques for the diagnosis, therapy, and treatment of emotionally influenced diseases. With the growing and varied uses of human-computer interaction, people exhibit emotions that in certain contexts might influence their mental health. Technologies that can facilitate the application of affective computing in the medical realm are increasingly available and under constant development. Affective computing technology is now poised on the threshold of usability for the monitoring, recognition, and expression of emotions for various medical purposes.
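As a toy illustration of the multimodal fusion strategies mentioned above (not from the talk), the sketch below contrasts feature-level (early) fusion with decision-level (late) fusion of textual, audio-visual, and physiological features; the feature dimensions and the stand-in classifier are assumptions.

```python
# Toy sketch (illustrative only) of two common multimodal fusion strategies for
# emotion recognition: early (feature-level) vs. late (decision-level) fusion.
# Feature dimensions and the random-weight classifier are assumptions.
import numpy as np

rng = np.random.default_rng(0)
text_feat = rng.standard_normal(64)     # e.g., a sentence embedding
av_feat = rng.standard_normal(128)      # e.g., an audio-visual descriptor
physio_feat = rng.standard_normal(16)   # e.g., heart-rate / skin-conductance statistics

def classify(x, n_classes=4):
    """Stand-in for any per-modality or fused emotion classifier (random weights)."""
    logits = rng.standard_normal((n_classes, x.size)) @ x
    logits -= logits.max()                              # numerical stability
    return np.exp(logits) / np.exp(logits).sum()        # softmax probabilities

# Early (feature-level) fusion: concatenate the features, then classify once.
early = classify(np.concatenate([text_feat, av_feat, physio_feat]))

# Late (decision-level) fusion: classify each modality, then average the decisions.
late = np.mean([classify(f) for f in (text_feat, av_feat, physio_feat)], axis=0)

print("early fusion:", early.argmax(), " late fusion:", late.argmax())
```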
Session 2: Tuesday (19) 15:00-16:40
Talk 4
Speaker: Hsin-Min Wang
Affiliation: Academia Sinica
Talk Title: Variational Autoencoder (VAE)-based Voice Conversion
Talk Abstract: Voice conversion (VC) aims to convert the speech of a source speaker to that of a target speaker without changing the linguistic content. Although there are many types and applications of VC, here we focus on the most typical one, i.e., speaker voice conversion. By formulating the task as a regression problem in machine learning, a conversion function that maps the acoustic features of a source speaker to those of a target speaker is learned. Numerous VC approaches have been proposed, such as Gaussian mixture model (GMM)-based methods, deep neural network (DNN)-based methods, and exemplar-based methods. Most of them require parallel training data, i.e., the source and target speakers uttering the same transcripts for training. Since such data is hard to collect, non-parallel training has long remained one of the ultimate goals in VC. In this overview talk, I will first briefly introduce several representative VC models and then introduce our variational autoencoder (VAE)-based VC methods in a relatively in-depth manner. If time permits, I will also introduce a recently proposed deep learning-based quality assessment model for the VC task.
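The VAE-based VC idea can be sketched compactly. The following is a minimal sketch (PyTorch, assumptions throughout, not the speaker's actual model): an encoder maps an acoustic frame to a latent code, a decoder reconstructs the frame conditioned on a speaker embedding, and conversion amounts to decoding with the target speaker's embedding.

```python
# Minimal sketch (illustrative only) of VAE-based voice conversion. Dimensions,
# losses, and training details are assumptions, not the speaker's actual model.
import torch
import torch.nn as nn

FEAT_DIM, LATENT_DIM, N_SPK, SPK_DIM = 80, 16, 4, 8   # assumed sizes

class VAEVC(nn.Module):
    def __init__(self):
        super().__init__()
        self.spk_emb = nn.Embedding(N_SPK, SPK_DIM)
        self.enc = nn.Sequential(nn.Linear(FEAT_DIM, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, LATENT_DIM)
        self.to_logvar = nn.Linear(64, LATENT_DIM)
        self.dec = nn.Sequential(nn.Linear(LATENT_DIM + SPK_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, FEAT_DIM))

    def forward(self, x, spk_id):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        x_hat = self.dec(torch.cat([z, self.spk_emb(spk_id)], dim=-1))
        return x_hat, mu, logvar

model = VAEVC()
frames = torch.randn(10, FEAT_DIM)            # dummy source-speaker acoustic frames
src = torch.zeros(10, dtype=torch.long)       # source speaker id = 0
tgt = torch.ones(10, dtype=torch.long)        # target speaker id = 1

# Training objective: reconstruction error + KL divergence (the standard VAE loss).
x_hat, mu, logvar = model(frames, src)
loss = ((x_hat - frames) ** 2).mean() - 0.5 * (1 + logvar - mu ** 2 - logvar.exp()).mean()

# Conversion: encode the source frames, decode with the target speaker embedding.
converted, _, _ = model(frames, tgt)
print(converted.shape)                        # torch.Size([10, 80])
```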
Talk 5
Speaker: Chang-Su Kim
Affiliation: School of Electrical Engineering, Korea University, Seoul, Korea
Talk Title: Video Object Segmentation
Talk Abstract: Image and video segmentation is the process of separating objects from the background in still images or video sequences. It is applicable as a preliminary step in various vision applications, such as action recognition, content-based image and video retrieval, targeted content replacement, and image and video summarization. It is hence important to develop efficient image and video segmentation techniques. However, image and video segmentation is challenging due to a variety of difficulties, including boundary ambiguity, cluttered backgrounds, occlusion, and non-rigid object deformation. In particular, in video object segmentation, a primary (or salient) object is extracted from a video sequence in either a supervised or an unsupervised manner. In this talk, we overview recent techniques for video object segmentation, classify them according to their approaches, and discuss future research issues.
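As a toy illustration of the task (far simpler than the deep methods surveyed in the talk), the sketch below produces a crude unsupervised foreground mask by frame differencing; the threshold is an assumption.

```python
# Toy sketch (illustrative only): a crude unsupervised foreground mask via frame
# differencing -- far simpler than the methods surveyed in the talk, but it shows
# the basic idea of separating a moving object from the background.
import numpy as np

def foreground_mask(prev_frame: np.ndarray, frame: np.ndarray, thresh: float = 25.0):
    """Boolean mask marking pixels that changed between two grayscale frames."""
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    return diff > thresh

# Dummy two-frame "video": a bright square moves two pixels to the right.
f0 = np.zeros((64, 64), dtype=np.uint8); f0[20:30, 20:30] = 255
f1 = np.zeros((64, 64), dtype=np.uint8); f1[20:30, 22:32] = 255

mask = foreground_mask(f0, f1)
print(mask.sum(), "changed pixels")   # pixels newly covered or uncovered by the object
```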
Biography: Chang-Su Kim received the Ph.D. degree in electrical engineering from Seoul National University (SNU) with a Distinguished Dissertation Award in 2000. From 2000 to 2001, he was a Visiting Scholar with the Signal and Image Processing Institute, University of Southern California, Los Angeles. From 2001 to 2003, he coordinated the 3D Data Compression Group in the National Research Laboratory for 3D Visual Information Processing at SNU. From 2003 to 2005, he was an Assistant Professor in the Department of Information Engineering, Chinese University of Hong Kong. In September 2005, he joined the School of Electrical Engineering, Korea University, where he is a Professor. His research topics include image processing and computer vision. He has published more than 270 technical papers in international journals and conferences. In 2009, he received the IEEK/IEEE Joint Award for Young IT Engineer of the Year. In 2014, he received the Best Paper Award from the Journal of Visual Communication and Image Representation (JVCI). He is a member of the Multimedia Systems & Applications Technical Committee (MSATC) of the IEEE Circuits and Systems Society. He was an APSIPA Distinguished Lecturer for the 2017-2018 term. He served as an Editorial Board Member of JVCI and an Associate Editor of the IEEE Transactions on Image Processing. He is a Senior Area Editor of JVCI and an Associate Editor of the IEEE Transactions on Multimedia.
Talk 6
Speaker: Sanghoon Lee
Affiliation: School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea
Talk Title: Deep Learning for Perceptual Image/Video Quality Assessment
Talk Abstract: Perceptual image processing is playing an increasingly important role in the field of multimedia processing. Since human observers are the end users of image and video applications, great benefit can be derived from discovering methods for assessing image quality that correlate highly with human visual perception. In this talk, we review several perceptual image/video quality assessment (I/VQA) approaches. In part 1, we introduce basic human visual system (HVS) formulations such as foveation, contrast sensitivity, visual saliency, and sharpness perception, together with the related IQA works. In part 2, we show recent trends in I/VQA using robust deep learning networks, such as convolutional neural network (CNN)-based visual sensitivity learning. Through a detailed discussion of current efforts toward perceptual I/VQA methods, we believe that future multimedia services will be significantly advanced.
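As an illustration of the CNN-based direction mentioned in part 2 (not the speaker's model), the sketch below shows a minimal no-reference IQA regressor that maps image patches to a scalar quality score and pools the patch scores into an image score; the architecture and patch size are assumptions.

```python
# Minimal sketch (illustrative only) of a CNN-based no-reference IQA regressor:
# image patches in, a scalar quality score (e.g., a MOS estimate) out.
# Architecture and patch size are assumptions, not the speaker's models.
import torch
import torch.nn as nn

class PatchIQANet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.regressor = nn.Sequential(nn.Flatten(), nn.Linear(64, 1))

    def forward(self, patches):                 # patches: (N, 3, 32, 32)
        return self.regressor(self.features(patches)).squeeze(-1)

net = PatchIQANet()
patches = torch.rand(8, 3, 32, 32)              # dummy patches cropped from one image
patch_scores = net(patches)
image_score = patch_scores.mean()               # simple pooling of patch scores
print(float(image_score))
```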
Biography: Sanghoon Lee received his Ph.D. in electrical engineering from the University of Texas at Austin in 2000. From 1999 to 2002, he worked for Lucent Technologies. In March 2003, he joined the faculty of the Department of Electrical and Electronics Engineering, Yonsei University, Seoul, Korea, where he is a Full Professor. He was an Associate Editor of the IEEE Transactions on Image Processing (2010-2014), an Editor of the Journal of Communications and Networks (JCN) (2009-2015), and a Guest Editor for the IEEE Transactions on Image Processing (2013) and the Journal of Electronic Imaging (2015). He has been an Associate Editor of IEEE Signal Processing Letters (2014-) and Chair of the IEEE P3333.1 Quality Assessment Working Group (2011-). He currently serves as Chair of the APSIPA IVM Technical Committee (2018-) and as a member of the IEEE MMSP (2016-) and IVMSP (2014-) Technical Committees. He received the 2015 Yonsei Academic Award; Yonsei Outstanding Accomplishment Awards in 2010, 2015, 2016, and 2017; the 2012 Special Service Award from the IEEE Broadcast Technology Society; the 2013 Special Service Award from the IEEE Signal Processing Society; a 2013 Humantech Thesis Award from Samsung Electronics; a 2012 IEEE Seoul Section Student Paper Contest Award; a 2012 Qualcomm Innovation Award; and a Best Student Paper Award at QoMEX (International Conference on Quality of Multimedia Experience) 2018. His research interests include deep learning, image and video quality of experience, computer vision, and computer graphics.
Session 3: Wednesday (20) 15:00-16:40
Talk 7
Speaker: Hsueh-Ming Hang
Affiliation: National Chiao Tung University (NCTU)
Talk Title: Deep-Learning based Image Compression
Talk Abstract: The DCT-based transform coding technique has been adopted by international standards (ISO JPEG, ITU H.261/264/265, ISO MPEG-2/4/H, and many others) for 30 years. Although researchers are still trying to improve its efficiency by fine-tuning its components and parameters, the basic structure has not changed in the past 25 years. The recently developed deep learning technology may provide a new direction for constructing a high-compression image/video coding system. Recent results indicate that this new type of scheme has good potential for further improving compression efficiency. In this talk, we summarize the basic designs and the progress of this topic so far. We also describe a lossy image compression system using an autoencoder, which we proposed to participate in the Challenge on Learned Image Compression (CLIC) at CVPR 2019. The selected technologies include an autoencoder that incorporates (1) a principal component analysis (PCA) layer for energy compaction, (2) a uniform scalar quantizer for lossy compression, (3) a context-adaptive bitplane coder for entropy coding, and (4) a soft-bit-based rate estimator. Our aim is to produce reconstructed images with good subjective quality under the 0.15 bits-per-pixel constraint.
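The core of such a learned codec can be sketched briefly. The following minimal PyTorch sketch (assumptions throughout, not the CLIC submission itself) shows an autoencoder whose latent is uniformly scalar-quantized, with quantization approximated by additive uniform noise during training so gradients can flow; the PCA layer, context-adaptive bitplane coder, and soft-bit rate estimator mentioned in the abstract are omitted.

```python
# Minimal sketch (illustrative only) of the core of a learned lossy image codec:
# an autoencoder with a uniformly scalar-quantized latent. All sizes are assumptions.
import torch
import torch.nn as nn

class TinyImageCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 8, 5, stride=2, padding=2))           # latent: 8 channels, 1/4 resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 32, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 5, stride=2, padding=2, output_padding=1))

    def forward(self, x):
        y = self.encoder(x)
        if self.training:                      # differentiable proxy for rounding
            y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5)
        else:                                  # uniform scalar quantization (step size 1)
            y_hat = torch.round(y)
        return self.decoder(y_hat), y_hat

codec = TinyImageCodec().eval()
img = torch.rand(1, 3, 64, 64)                 # dummy image
recon, latent = codec(img)
mse = ((recon - img) ** 2).mean()              # distortion term of a rate-distortion objective
print(recon.shape, float(mse))
```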
Biography: Hsueh-Ming Hang received the Ph.D. in Electrical Engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1984. From 1984 to 1991, he was with AT&T Bell Laboratories, Holmdel, NJ, and then he joined the Electronics Engineering Department of National Chiao Tung University (NCTU), Hsinchu, Taiwan, in December 1991. From 2006 to 2009, he was appointed as Dean of the EECS College at National Taipei University of Technology (NTUT). From 2014 to 2017, he served as the Dean of the ECE College at NCTU. He has been actively involved in the international MPEG standards since 1984 and his current research interests include multimedia compression, spherical image/video processing, and deep-learning based image/video processing. He was an associate editor (AE) of the IEEE Transactions on Image Processing (1992-1994, 2008-2012). He was an IEEE Circuits and Systems Society Distinguished Lecturer (2014-2015) and was a Board Member of the Asia-Pacific Signal and Information Processing Association (APSIPA) (2013-2018) and a General Co-chair of IEEE International Conference on Image Processing (ICIP) 2019. He is a recipient of the IEEE Third Millennium Medal and is a Fellow of IEEE and IET and a member of Sigma Xi.
Talk 8
Speaker: Homer Chen
Affiliation: Department of Electrical Engineering, National Taiwan University
Talk Title: Autofocus
Talk Abstract: It is important for any imaging device to accurately and quickly find the in-focus lens position so that sharp images can be captured without human intervention. In this overview talk, I will discuss the design criteria and considerations for both contrast-detection autofocus (CDAF) and phase-detection autofocus (PDAF) and highlight some key milestone techniques. I will close the talk by presenting how deep learning can be applied to push the performance of autofocus to an unprecedented level.
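As a rough illustration of the CDAF principle (not from the talk), the sketch below steps a hypothetical lens through its range, measures a contrast-based sharpness score at each position, and stops once the score begins to drop; both the sharpness measure and the capture model are stand-in assumptions, not a real camera API.

```python
# Rough sketch (illustrative only) of contrast-detection autofocus: sweep the lens,
# score sharpness at each position, and stop when the score passes its peak.
import numpy as np

def sharpness(image: np.ndarray) -> float:
    """Simple focus measure: variance of a Laplacian-like second difference."""
    lap = (image[:-2, 1:-1] + image[2:, 1:-1] + image[1:-1, :-2] + image[1:-1, 2:]
           - 4.0 * image[1:-1, 1:-1])
    return float(lap.var())

def capture_at(lens_pos: int) -> np.ndarray:
    """Hypothetical capture model: detail fades as the lens moves away from the
    true in-focus position (assumed to be position 37)."""
    rng = np.random.default_rng(lens_pos)
    blur = abs(lens_pos - 37) / 10.0 + 0.1
    return rng.standard_normal((64, 64)) / (1.0 + blur)

def contrast_detection_af(start: int = 0, step: int = 5, max_pos: int = 100) -> int:
    best_pos, best_score = start, sharpness(capture_at(start))
    pos = start
    while pos + step <= max_pos:
        pos += step
        score = sharpness(capture_at(pos))
        if score < best_score:          # score dropped: the peak has been passed
            break
        best_pos, best_score = pos, score
    return best_pos

print("in-focus lens position ~", contrast_detection_af())
```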
Biography: Homer H. Chen (M'86-SM'01-F'03) received the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign. Dr. Chen's professional career has spanned industry and academia. Since August 2003, he has been with the College of Electrical Engineering and Computer Science, National Taiwan University, where he is a Distinguished Professor. Prior to that, he held various R&D management and engineering positions with U.S. companies over a period of 17 years, including AT&T Bell Labs, Rockwell Science Center, iVast, and Digital Island (acquired by Cable & Wireless). He was a U.S. delegate for ISO and ITU standards committees and contributed to the development of many new interactive multimedia technologies that are now part of the MPEG-4 and JPEG-2000 standards. His current research is related to multimedia signal processing, computational photography and display, and music data mining. Dr. Chen is an IEEE Fellow. He was an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology from 2004 to 2010, the IEEE Transactions on Image Processing from 1992 to 1994, and Pattern Recognition from 1989 to 1999. He served as a Guest Editor for the IEEE Transactions on Circuits and Systems for Video Technology in 1999, the IEEE Transactions on Multimedia in 2011, the IEEE Journal of Selected Topics in Signal Processing in 2014, and Springer Multimedia Tools and Applications in 2015. He was a Distinguished Lecturer of the IEEE Circuits and Systems Society from 2012 to 2013. He served on the IEEE Signal Processing Society Fourier Award Committee and the Fellow Reference Committee from 2015 to 2017. Currently, he serves on the Awards Board of the IEEE Signal Processing Society and on the Senior Editorial Board of the IEEE Journal of Selected Topics in Signal Processing.
Talk 9
Speaker: Jiwu Huang
Affiliation: Shenzhen University
Talk Title: Multimedia Forensics
Talk Abstract: Nowadays, multimedia materials are widely used in a variety of applications. With pervasive editing software, such as Adobe Photoshop for images, CoolEdit for audio, and Adobe Premiere for video, multimedia content can be doctored very easily. Recent progress in AI provides even more powerful tools for generating fake multimedia. This leads to a series of security problems in politics, social affairs, judicial investigations, and business activities. Therefore, it is of great significance to investigate problems concerning the authenticity and integrity of digital multimedia. Multimedia forensics has thus become an active research topic in information security, due to its important application in digital evidence authentication. Multimedia forensics aims to authenticate media based only on their content. In this talk, we first discuss the security issues raised by forged media, starting from some well-known examples. Then, the methodology of multimedia forensics is introduced. The progress of three important research topics in multimedia forensics, i.e., media source identification, tampering detection, and processing history analysis, is then reviewed. Finally, we will discuss several existing challenges in multimedia forensics.
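As a toy illustration of one tampering-detection idea (not from the talk), the sketch below computes a high-pass noise residual and flags image blocks whose residual statistics are outliers; real forensic detectors use far richer features and classifiers, and all sizes and thresholds here are assumptions.

```python
# Toy sketch (illustrative only) of a common first step in tampering detection:
# a high-pass noise residual plus a blockwise outlier test on its statistics.
import numpy as np

def noise_residual(img: np.ndarray) -> np.ndarray:
    """High-pass residual: each pixel minus the average of its 4 neighbours."""
    img = img.astype(np.float32)
    neigh = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]) / 4.0
    return img[1:-1, 1:-1] - neigh

def suspicious_blocks(img: np.ndarray, block: int = 16, z_thresh: float = 3.0):
    res = noise_residual(img)
    h, w = res.shape
    stats = {}
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            stats[(i, j)] = res[i:i + block, j:j + block].var()
    values = np.array(list(stats.values()))
    mean, std = values.mean(), values.std() + 1e-8
    # Blocks whose residual variance is a statistical outlier are flagged.
    return [pos for pos, v in stats.items() if abs(v - mean) / std > z_thresh]

# Dummy example: a smooth image with one heavily noisy pasted patch.
rng = np.random.default_rng(0)
img = np.full((128, 128), 128.0) + rng.standard_normal((128, 128))
img[40:72, 40:72] += 20.0 * rng.standard_normal((32, 32))   # simulated splice
print(suspicious_blocks(img))
```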
Session 4: Thursday (21) 10:20-12:00
Talk 10
Speaker: Seishi Takamura
Affiliation: NTT Japan
Talk Title: Recent advances on image/video compression and communication
Talk Abstract: We first review the history of video coding technology development and its standardization activities, and then review the latest and near-future research trends, particularly technologies for immersive experiences.
Talk 11
Speaker: Ying Loong Lee
Affiliation: Guangxi University for Nationalities
Talk Title: A Survey on Applications of Deep Reinforcement Learning in Resource Management for 5G Heterogeneous Networks
Talk Abstract: Heterogeneous networks (HetNets) have been regarded as the key technology for fifth-generation (5G) communications to support the explosive growth of mobile traffic. By deploying small cells within macrocells, HetNets can boost network capacity and support more users, especially in hotspot and indoor areas. Nonetheless, resource management for such networks becomes more complex than in conventional cellular networks due to the interference that arises between small cells and macrocells, which makes quality-of-service provisioning more challenging. Recent advances in deep reinforcement learning (DRL) have inspired its application to resource management for 5G HetNets. In this paper, a survey on the applications of DRL in resource management for 5G HetNets is conducted. In particular, we review DRL-based resource management schemes for 5G HetNets in various domains, including energy harvesting, network slicing, cognitive HetNets, coordinated multipoint transmission, and big data. An insightful comparative summary and analysis of the surveyed studies is provided to shed some light on the shortcomings and research gaps in the current advances in DRL-based resource management for 5G HetNets. Last but not least, several open issues and future directions are presented.
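To make the setting concrete, the sketch below (illustrative only, not from the survey) casts a small-cell power-control problem as a Markov decision process of the kind a DRL agent would be trained on: the state captures channel and interference measurements, the action selects a transmit power level, and the reward trades small-cell throughput against harm to macrocell users; all constants and the channel model are assumptions.

```python
# Toy sketch (illustrative only) of a HetNet resource-management problem cast as an
# MDP for DRL. The channel model, constants, and reward shaping are all assumptions.
import numpy as np

class ToySmallCellEnv:
    POWER_LEVELS = np.array([0.1, 0.5, 1.0, 2.0])   # candidate transmit powers (W)

    def __init__(self, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.state = self.reset()

    def reset(self):
        # State: [small-cell channel gain, measured macrocell interference level]
        self.state = self.rng.uniform(0.1, 1.0, size=2)
        return self.state.copy()

    def step(self, action: int):
        gain, macro_interf = self.state
        power = self.POWER_LEVELS[action]
        sinr = gain * power / (macro_interf + 0.05)    # assumed noise floor of 0.05
        throughput = np.log2(1.0 + sinr)               # Shannon-style rate
        qos_penalty = 0.5 * power * macro_interf       # harm caused to macrocell users
        reward = throughput - qos_penalty
        next_state = self.reset()                      # i.i.d. channel draws for simplicity
        return next_state, float(reward)

env = ToySmallCellEnv()
s = env.reset()
s, r = env.step(action=2)
print("reward for power level 2:", r)
```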
Talk 12
Speaker: Mau-Luen Tham
Affiliation: Universiti Tunku Abdul Rahman (UTAR), Malaysia
Talk Title: Deep Reinforcement Learning for Resource Allocation in 5G Communications
Talk Abstract: The rapid growth of data traffic has pushed the mobile telecommunication industry towards the adoption of fifth-generation (5G) communications. The cloud radio access network (CRAN), one of the key 5G enablers, facilitates fine-grained management of network resources by separating the remote radio head (RRH) from the baseband unit (BBU) via a high-speed fronthaul link. Classical resource allocation (RA) schemes rely on numerical techniques to optimize various performance metrics. Most of these works can be described as instantaneous, since the optimization decisions are derived from the current network state without considering past network states. While utility theory can incorporate long-term optimization effects into these decisions, the growing heterogeneity and complexity of network environments have rendered the RA problem intractable. One prospective candidate is reinforcement learning (RL), a dynamic programming framework which solves RA problems optimally over varying network states. Still, such methods cannot handle the high-dimensional state-action spaces that arise in CRAN problems. Driven by the success of machine learning, researchers have begun to explore the potential of deep reinforcement learning (DRL) to address RA problems. In this work, an overview of the major existing DRL approaches in CRAN is presented. We conclude by identifying current technical hurdles and potential future research directions.
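As a compact illustration of the DRL recipe underlying many of these RA schemes (illustrative only, not from the talk), the sketch below implements a minimal deep Q-network loop: a neural network approximates Q(state, action), actions are chosen epsilon-greedily, and transitions from a replay buffer are regressed towards a bootstrapped target; the toy environment and all hyper-parameters are assumptions, and the usual separate target network is omitted for brevity.

```python
# Minimal DQN sketch (illustrative only). The environment is a random stand-in for a
# CRAN resource-allocation problem; sizes and hyper-parameters are assumptions.
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA, EPS = 4, 3, 0.9, 0.1

q_net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=1000)                         # experience replay buffer

def toy_env_step(state, action):
    """Stand-in for a CRAN RA environment: returns a dummy reward and next state."""
    reward = float(torch.randn(()))
    next_state = torch.randn(STATE_DIM)
    return next_state, reward

state = torch.randn(STATE_DIM)
for step in range(200):
    # Epsilon-greedy selection over discrete RA decisions (e.g., RRH or power choices).
    if random.random() < EPS:
        action = random.randrange(N_ACTIONS)
    else:
        action = int(q_net(state).argmax())
    next_state, reward = toy_env_step(state, action)
    replay.append((state, action, reward, next_state))
    state = next_state

    if len(replay) >= 32:
        batch = random.sample(replay, 32)
        s = torch.stack([b[0] for b in batch])
        a = torch.tensor([b[1] for b in batch])
        r = torch.tensor([b[2] for b in batch])
        s2 = torch.stack([b[3] for b in batch])
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + GAMMA * q_net(s2).max(dim=1).values   # bootstrapped target
        loss = nn.functional.mse_loss(q_sa, target)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```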