We are pleased to offer the following tutorial sessions at APSIPA 2012. The tutorials run in parallel and are divided into a morning session and an afternoon session.


  • Video Surveillance

    Mark Liao


    Since the 9/11 attacks on the United States, counter-terrorism strategies have been given high priority in many countries, and surveillance cameras are now almost ubiquitous in modern cities. As a result, the amount of recorded data is enormous, and it is extremely difficult and time-consuming to search digital video content manually. This tutorial is split into two parts. In the first part, I will provide a comprehensive introduction to video surveillance, covering its past and present issues. In the second part, I will detail automatic ways to deal with surveillance videos. The topics covered in this regard are (1) fast coarse-to-fine video retrieval using shot-level spatio-temporal statistics [1]; (2) motion flow-based video retrieval [2]; (3) video-based human movement analysis and its application to surveillance systems [3]; and (4) spatiotemporal motion analysis for the detection and classification of moving targets [4]. In addition to the above topics, designing good ways to enhance the quality of surveillance videos is also important. In this regard, I will talk about (1) content-aware tone reproduction on images [5]; (2) spatiotemporal slice-based video stabilization; and (3) blurred license plate image recognition [6]. Recently, my group has been working hard to develop people counting systems for surveillance purposes. In [7], we developed a special-purpose people counting system for counting the number of people standing in front of a TV wall; in [8], we proposed a general-purpose people counting system. Since the number of people present in a short video clip can be a very important cue for speeding up a video search, I will also spend some time discussing this issue in the tutorial.



    Hong-Yuan Mark Liao received a BS degree in physics from National Tsing Hua University, Hsinchu, Taiwan, in 1981, and MS and PhD degrees in electrical engineering from Northwestern University in 1985 and 1990, respectively. In July 1991, he joined the Institute of Information Science, Academia Sinica, Taiwan, where he is currently a Research Fellow. In 2008, he became chair of the Computer Science and Information Engineering Division II of the National Science Council of Taiwan. He is jointly appointed as a professor in the Computer Science and Information Engineering Department of National Chiao Tung University and the Department of Electrical Engineering of National Cheng Kung University. From January 2009 to January 2012, he was jointly appointed as the Multimedia Information Chair Professor of National Chung Hsing University, and since August 2010 he has been an Adjunct Chair Professor of Chung Yuan Christian University. His current research interests include multimedia signal processing, video-based surveillance systems, video forensics, and multimedia protection.


    Dr. Liao received the Young Investigators' Award from Academia Sinica in 1998, the Distinguished Research Award from the National Science Council of Taiwan twice (2003-2006 and 2010-2013), and the National Invention Award in 2004. In 2008, he received a Distinguished Scholar Research Project Award from the National Science Council of Taiwan, and in 2010 he received the Academia Sinica Investigator Award. In June 2004, he served as conference co-chair of the 5th International Conference on Multimedia and Expo (ICME), and he was technical co-chair of the 8th ICME held in Beijing. In January 2011, Dr. Liao served as general co-chair of the 17th International Conference on Multimedia Modeling. From 2006 to 2008, he was president of the Image Processing and Pattern Recognition Society of Taiwan, and from 2008 to 2011 he served as a program director of the National Science Council of Taiwan.


    Dr. Liao is on the editorial boards of the IEEE Signal Processing Magazine and the IEEE Transactions on Image Processing. From 2009 to 2012, he served as an associate editor of the IEEE Transactions on Information Forensics and Security, and from 1998 to 2001 he was an associate editor of the IEEE Transactions on Multimedia.

  • Image Statistical Models for Information Forensics

    Yun Q. Shi


    Image statistical models are critically important for both image compression and image forensics. In the former, the model represents the image for applications such as data compression; in the latter, the model is used to distinguish an original image from a manipulated version. This tutorial addresses the latter.


    The histogram of an image under examination, the moments of the histogram, and the moments of the characteristic function have all been used as image models for forensic investigation. It was quickly found that the histogram is not powerful enough, since it captures only first-order statistics. The transition probability matrix of a Markov chain and the closely related co-occurrence matrix have since been widely used for forensic tasks. In addition, wavelet transforms, run-length statistics, and many other statistical tools have been applied to various forensic tasks, sometimes demonstrating significantly improved performance. This tutorial discusses the relationship between texture classification, an area that has been actively studied for more than 50 years, and digital forensics, a relatively new and increasingly active research area. Other technologies that have been studied for decades in texture classification, including wavelets, difference arrays, local binary patterns, Markov random fields, and Laws' masks, are introduced and presented, and some initial research and positive results in this regard are discussed as well.
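    The first-order versus second-order distinction above can be made concrete: a gray-level co-occurrence matrix counts pairs of pixel values at a fixed offset, and normalizing its rows yields Markov transition probabilities of the kind used as forensic features. The sketch below is illustrative only; the tiny 4-level image and the horizontal offset are invented:

```python
# Minimal sketch: gray-level co-occurrence matrix for a tiny 4-level image.
# The image, number of levels, and offset are illustrative choices.

def cooccurrence(image, levels, offset=(0, 1)):
    """Count pairs (image[r][c], image[r+dr][c+dc]) for the given offset."""
    dr, dc = offset
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                matrix[image[r][c]][image[rr][cc]] += 1
    return matrix

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
C = cooccurrence(image, levels=4)

# Normalizing each row of C gives the Markov transition probabilities
# used as second-order forensic features.
row_sums = [sum(row) for row in C]
P = [[v / s if s else 0.0 for v in row] for row, s in zip(C, row_sums)]
```

    A histogram of the same image would discard all of this neighbor information, which is why first-order features alone proved insufficient.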


    In summary, this tutorial addresses image statistical models for many image classification tasks, including image forensics.



    Dr. Yun Qing Shi joined the Department of ECE at the New Jersey Institute of Technology (NJIT), USA, in 1987 and is currently a professor there. He obtained his B.S. and M.S. degrees from Shanghai Jiao Tong University, Shanghai, China, and his M.S. and Ph.D. degrees from the University of Pittsburgh, PA. His research interests include multimedia forensics and security and multimedia signal processing. He is an author or coauthor of a book, 4 book chapters, and 250 papers in his research areas, holds 20 awarded US patents, and has delivered 100 invited talks around the world. He received the 2010 Innovators Award from the New Jersey Inventors Hall of Fame (NJIHOF) for Innovations in Digital Forensics and Security. His US patent 7,457,341, entitled "System and Method for Robust Reversible Data Hiding and Data Recovery in the Spatial Domain," won the 2010 Thomas Alva Edison Patent Award from the Research and Development Council of New Jersey. He has been an associate editor of the IEEE Transactions on Signal Processing and the IEEE Transactions on Circuits and Systems Part II, and was the founding editor-in-chief of the LNCS Transactions on Data Hiding and Multimedia Security (Springer). He is a Fellow of the IEEE for his contributions to multidimensional signal processing.



  • 3DTV: Technical Challenges for Realistic Experiences

    Yo-Sung Ho


    In recent years, various multimedia services have become available and the demand for three-dimensional television (3DTV) is growing rapidly. Since 3DTV is considered the next-generation broadcasting service that can deliver realistic and immersive experiences, a number of advanced 3D video technologies have been studied. In this tutorial lecture, we discuss current activities in 3DTV research and development. After reviewing the main components of the 3DTV system, we cover several challenging technical issues: representation of 3D scenes, acquisition of 3D video contents, illumination compensation and color correction, camera calibration and image rectification, depth map modeling and enhancement, 3D warping and depth map refinement, coding of multi-view video and depth maps, hole filling for occluded objects, and virtual view synthesis.



    Dr. Yo-Sung Ho received the B.S. and M.S. degrees in electronic engineering from Seoul National University, Seoul, Korea, in 1981 and 1983, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California, Santa Barbara, in 1990. He joined ETRI (Electronics and Telecommunications Research Institute), Daejeon, Korea, in 1983. From 1990 to 1993, he was with Philips Laboratories, Briarcliff Manor, New York, where he was involved in the development of the Advanced Digital High-Definition Television (AD-HDTV) system. In 1993, he rejoined the technical staff of ETRI and was involved in the development of the Korean DBS digital television and high-definition television systems. Since 1995, he has been with the Gwangju Institute of Science and Technology (GIST), where he is currently a professor in the Department of Information and Communications, and since August 2003 he has been director of the Realistic Broadcasting Research Center at GIST. He has given several tutorial lectures at international conferences, including the IEEE International Conference on Image Processing (ICIP) in 2009 and 2010, the IEEE International Conference on Multimedia & Expo (ICME) in 2010 and 2011, and the Pacific-Rim Conference on Multimedia (PCM) in 2006 and 2008. He has served as an associate editor of the IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT). His research interests include digital image and video coding, image analysis and image restoration, three-dimensional image modeling and representation, advanced source coding techniques, three-dimensional television (3DTV), and realistic broadcasting technologies.



  • Affective Computing on Speech and Language

    Chung-Hsien Wu, Jianhua Tao, Chia-Ping Chen


    Intact perception and experience of emotion are vital for communication in the social environment, and emotion perception in humans and non-human primates is closely associated with affective factors in communication. Human-machine interface technology has been investigated for several decades, and scientists have found that emotional skills can be an important component of intelligence, especially for human-human communication. Although human-computer interaction differs from human-human communication, some theories suggest that it essentially follows the basics of human-human interaction. Speech and language technologies have expanded the interaction modalities between humans and computer-supported communicational artifacts such as robots, PDAs, and mobile phones. In this tutorial, we present theoretical and practical work offering new and broad views of the latest research in affective information processing on speech and language. The tutorial spans a variety of theoretical background and applications, ranging from salient affective features and affective-cognitive models to affective information processing on speech and language.


    The tutorial is divided into three parts: speech-based emotion recognition, expressive speech synthesis, and text-based emotion or sentiment recognition. We begin with the basic theory, in which we introduce the concepts of affective states, expressive patterns, source-channel paradigm, and evaluation criteria. In each topic, we will review the state of the art by introducing current methods and presenting several applications. Eventually, technologies developed in different areas will be combined for future applications, so we will envision a few scenarios in which affective computing can make a difference.



    Prof. Chung-Hsien Wu received the Ph.D. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1991. Since August 1991, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, where he became professor and distinguished professor in August 1997 and August 2004, respectively. From 1999 to 2002, he served as chairman of the department, and he is currently deputy dean of the College of Electrical Engineering and Computer Science, National Cheng Kung University. He also worked at the Computer Science and Artificial Intelligence Laboratory of the Massachusetts Institute of Technology (MIT), Cambridge, MA, as a visiting scientist in summer 2003. He received the Outstanding Research Award of the National Science Council in 2010 and the Distinguished Electrical Engineering Professor Award of the Chinese Institute of Electrical Engineering, Taiwan, in 2011. He is currently an associate editor of the IEEE Transactions on Audio, Speech, and Language Processing, the IEEE Transactions on Affective Computing, and ACM Transactions on Asian Language Information Processing, and the subject editor on information engineering of the Journal of the Chinese Institute of Engineers (JCIE). His research interests include affective speech recognition, expressive speech synthesis, and spoken language processing. Dr. Wu is a senior member of the IEEE and a member of the International Speech Communication Association (ISCA). He was president of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) from 2009 to 2011. He was chair of the IEEE Tainan Signal Processing Chapter and has been vice chair of the IEEE Tainan Section since 2009.



    Prof. Jianhua Tao received the M.S. degree from Nanjing University, Nanjing, China, in 1996 and the Ph.D. degree from Tsinghua University, Beijing, China, in 2001. He is currently a professor with the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. His research interests include speech synthesis and recognition and emotional information processing. He developed several of the earliest speech systems in China and has published more than 100 papers in journals and proceedings, e.g., the IEEE Transactions on Audio, Speech, and Language Processing, ICASSP, Interspeech, ICME, ISCSLP, Speech Prosody, and ICPR. Prof. Tao has received several awards from major conferences, including Eurospeech 2001. He was elected vice-chairperson of the ISCA Special Interest Group on Chinese Spoken Language Processing (SIG-CSLP) for 2006-2010. Currently, he is an executive committee member of the HUMAINE association, a steering committee member of the IEEE Transactions on Affective Computing, and an editorial board member of IJSE, JMI, and IJCLCLP.



    Prof. Chia-Ping Chen received the B.S. degree from National Taiwan University in 1991 and the M.S. degree from National Tsing Hua University, both in physics. He received the Ph.D. degree in electrical engineering from the University of Washington, Seattle, in 2004. Since February 2005, he has been with the Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan, ROC. His main research interests are spoken language processing, affective speech synthesis, pattern recognition, and machine learning. Dr. Chen is a member of the IEEE and of the International Speech Communication Association (ISCA). He is currently on the board of the Association for Computational Linguistics and Chinese Language Processing.

  • Depth-based Coding and Processing for 3D Video

    Anthony Vetro


    Current 3D video services and equipment are primarily based on stereoscopic video. With advances in display technology and processing capabilities, it is anticipated that additional viewpoints will need to be generated from the transmitted 3D data format. In the context of 3D broadcasting systems, these additional views would be generated at the receiver from a compressed and reconstructed 3D data format. To achieve low-bandwidth communication and high-quality 3D rendering, depth-based formats are considered an attractive solution.


    This tutorial will cover the fundamentals of 3D imaging, including the principles of stereo vision, depth as a representation format, and various 3D display technologies. It will also describe the compression and processing of depth information and analyze the impact of depth quality on rendering. Various depth-based 3D formats being considered for standardization will be presented, applications that utilize depth will be discussed, and research challenges will be outlined.
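    To illustrate why depth quality matters for rendering, the sketch below uses the standard pinhole relation between depth and disparity for a rectified stereo pair, d = f·B/Z: errors in the depth Z translate directly into pixel-position errors in a synthesized view. The focal length, baseline, and depth values are invented for illustration, not drawn from the tutorial:

```python
# Sketch: depth-based view synthesis for a rectified stereo setup.
# disparity d = f * B / Z  (f: focal length in pixels, B: baseline, Z: depth).
# All numeric values are illustrative, not from any particular camera.

def disparity(depth, focal_px, baseline_m):
    """Pixel disparity induced by a point at the given depth (meters)."""
    return focal_px * baseline_m / depth

def warp_pixel(x, depth, focal_px, baseline_m):
    """Horizontal position of a pixel in a virtual view one baseline away."""
    return x - disparity(depth, focal_px, baseline_m)

f, B = 1000.0, 0.05          # 1000-pixel focal length, 5 cm baseline
d_near = disparity(1.0, f, B)   # near objects shift more (about 50 px)
d_far = disparity(10.0, f, B)   # far objects shift less (about 5 px)
```

    Because disparity falls off as 1/Z, depth errors on near objects cause much larger rendering artifacts than the same errors on distant ones, which is one reason depth map enhancement receives so much attention.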



    Anthony Vetro is a Group Manager at Mitsubishi Electric Research Labs, in Cambridge, Massachusetts. He joined Mitsubishi in 1996 and is currently responsible for research and standardization on video coding, as well as work on display processing, information security, sensing technologies, and speech/audio processing. He has published more than 150 papers and has been an active member of the MPEG and ITU-T video coding standardization committees for a number of years. He has served as editor and ad-hoc chair for several projects, including the Multiview Video Coding standard, and currently serves as Head of the US Delegation to MPEG. He received the B.S., M.S. and Ph.D. degrees in Electrical Engineering from Polytechnic University, in Brooklyn, NY. Dr. Vetro is also active in various IEEE conferences, technical committees and editorial boards. He has also received several awards for his work on transcoding and is a Fellow of the IEEE.



  • Real Time Image Processing on GPU

    In Kyu Park


    Recently, the GPU (graphics processing unit) has evolved into an extremely powerful computation resource. The purpose of GPGPU (general-purpose computation on GPU) is to achieve significant acceleration for computationally intensive tasks beyond the domain of graphics applications. The GPU is well suited to massive data-parallel processing with high floating-point (FP) arithmetic intensity, which is characteristic of many image processing algorithms. Such algorithms fit the GPU's data-parallel programming model very well and consequently achieve significant acceleration.
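    The data-parallel pattern described above can be sketched in plain Python: the same independent operation is applied to every pixel, which is exactly what a GPU kernel launch parallelizes (conceptually, one thread per pixel). The RGB-to-luma operation and the tiny image below are illustrative choices, not taken from the tutorial:

```python
# Sketch of the data-parallel pattern that maps well to GPUs: the same
# independent operation applied to every pixel. On a GPU, the body of
# per_pixel would run as one thread per pixel; here it is plain Python.
# The 2x2 RGB image is an illustrative stand-in for real data.

def per_pixel(rgb):
    """Luma from RGB -- pure arithmetic, no dependence on other pixels."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]

# Every pixel is independent, so these loops could run in any order --
# exactly the property a GPU kernel launch exploits for acceleration.
gray = [[per_pixel(p) for p in row] for row in image]
```

    Operations with neighborhood dependencies (filters, reductions) still parallelize on a GPU, but require the memory-layout and synchronization techniques that tutorials like this one cover.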


    In this tutorial, we will show how GPU parallel computing can be used effectively for real-time image processing. We will cover both desktop and mobile platforms, introducing algorithm design strategies and optimization techniques through useful image processing examples. We do not assume the audience has strong knowledge of GPGPU.



    In Kyu Park received the B.S., M.S., and Ph.D. degrees from Seoul National University (SNU) in 1995, 1997, and 2001, respectively, all in electrical engineering and computer science. From September 2001 to March 2004, he was a Member of Technical Staff at the Samsung Advanced Institute of Technology (SAIT). Since March 2004, he has been with the School of Information and Communication Engineering, Inha University, where he is an assistant professor. From January 2007 to February 2008, he was an exchange scholar at Mitsubishi Electric Research Laboratories (MERL). Dr. Park's research interests lie in the joint area of computer graphics and vision, including 3D shape reconstruction from multiple views, image-based rendering, computational photography, and GPGPU for image processing and computer vision. He is a member of IEEE and ACM.

  • Human Activity Understanding with a Depth Camera

    Zicheng Liu and Zhengyou Zhang


    Human activity understanding is a critical task for many multimedia applications, including human-computer interaction, interactive games and entertainment, surveillance and home monitoring, and senior assisted living. In the past decade, there has been a great deal of research on human activity recognition with conventional 2D video cameras. Recently, the availability of commodity depth cameras has brought a new level of excitement to this field, and rapid progress has been made on the new technical issues in activity understanding with 3D depth cameras. In this tutorial, we introduce the basics of using depth cameras for human activity understanding and provide a comprehensive overview of various visual representations and classification paradigms. The topics cover skeleton-based features, depth-map-based features, actionlet ensembles, action graphs, recognition of activities involving human-object interactions, hand gesture recognition, and real-time activity recognition. We will also introduce various publicly available datasets and discuss state-of-the-art performances on them.
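    As a concrete example of the skeleton-based features mentioned above, pairwise distances between tracked joints form a simple pose descriptor that is invariant to rotation and translation of the body. The joint names and 3D coordinates below are invented for illustration; a depth-camera SDK would supply on the order of 20 tracked joints per frame:

```python
# Sketch: pairwise-joint-distance features from one frame of skeleton data.
# Joint names and coordinates are invented; real skeletons have ~20 joints.
import math
from itertools import combinations

joints = {
    "head":       (0.0, 1.7, 2.0),
    "hand_left":  (-0.4, 1.1, 1.9),
    "hand_right": (0.4, 1.1, 1.9),
    "foot_left":  (-0.2, 0.0, 2.0),
}

def pose_descriptor(joints):
    """All pairwise Euclidean distances, in a fixed (sorted) joint order."""
    names = sorted(joints)
    return [math.dist(joints[a], joints[b]) for a, b in combinations(names, 2)]

features = pose_descriptor(joints)   # 4 joints -> 6 distances
```

    Stacking such per-frame descriptors over time yields a trajectory that a classifier can compare against trained action models; depth-map-based features complement this when skeleton tracking fails.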



    Zicheng Liu is a senior researcher at Microsoft Research, Redmond. His current research interests include human activity recognition, face modeling and animation, and multimedia collaboration. He received a Ph.D. in computer science from Princeton University. He has published over 80 papers in peer-reviewed international journals and conferences and holds over 50 granted patents. He co-authored the book "Face Geometry and Appearance Modeling: Concepts and Applications" (Cambridge University Press, 2011). He has served on the technical committees of many international conferences and was a technical co-chair of the 2006 IEEE International Workshop on Multimedia Signal Processing. He is a technical co-chair of the 2010 and 2014 IEEE International Conferences on Multimedia and Expo, a co-organizer of the 2011 and 2012 CVPR Workshops on Human Activity Understanding from 3D Data, and a general co-chair of the 2012 IEEE Visual Communications and Image Processing conference. He is an associate editor of both the Machine Vision and Applications journal and the Journal of Visual Communication and Image Representation. He is a senior member of the IEEE.



    Zhengyou Zhang received the B.S. degree in electronic engineering from Zhejiang University, Hangzhou, China, in 1985, the M.S. degree in computer science from the University of Nancy, Nancy, France, in 1987, and the Ph.D. degree in computer science and the Doctorate of Science (Habilitation à diriger des recherches) from the University of Paris XI, Paris, France, in 1990 and 1994, respectively.


    He is a Principal Researcher with Microsoft Research, Redmond, WA, USA, and the Research Manager of the “Multimedia, Interaction, and Communication” group. Before joining Microsoft Research in March 1998, he was with INRIA (French National Institute for Research in Computer Science and Control), France, for 11 years and was a Senior Research Scientist from 1991. In 1996-1997, he spent a one-year sabbatical as an Invited Researcher with the Advanced Telecommunications Research Institute International (ATR), Kyoto, Japan. He served as an Adjunct Chair Professor with Zhejiang University, Hangzhou, China. He is also an Affiliate Professor with the University of Washington, Seattle, WA, USA. He has published over 200 papers in refereed international journals and conferences, and has coauthored the following books: 3-D Dynamic Scene Analysis: A Stereo Based Approach (Springer-Verlag, 1992); Epipolar Geometry in Stereo, Motion and Object Recognition (Kluwer, 1996); Computer Vision (Chinese Academy of Sciences, 1998, 2003, in Chinese); Face Detection and Adaptation (Morgan and Claypool, 2010), and Face Geometry and Appearance Modeling (Cambridge University Press, 2011). He has given a number of keynotes in international conferences and invited talks in universities.


    Dr. Zhang is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), the Founding Editor-in-Chief of the IEEE Transactions on Autonomous Mental Development, an Associate Editor of the International Journal of Computer Vision, an Associate Editor of Machine Vision and Applications, and an Area Editor of the Journal of Computer Science and Technology. He served as an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence from 2000 to 2004, an Associate Editor of the IEEE Transactions on Multimedia from 2004 to 2009, and an Associate Editor of the International Journal of Pattern Recognition and Artificial Intelligence from 1997 to 2009, among others. He has been on the program committees of numerous international conferences in the areas of computer vision, audio and speech signal processing, multimedia, human-computer interaction, and autonomous mental development. In 2009, he was a member of the Pre- and Interim Steering Committees in charge of revamping the International Conference on Multimedia and Expo (ICME), the flagship multimedia conference sponsored by four IEEE societies. He has served as area chair, program chair, or general chair of a number of international conferences, including recently as a Program Co-Chair of the International Conference on Multimedia and Expo (ICME), July 2010; a Program Co-Chair of the ACM International Conference on Multimedia (ACM MM), October 2010; a Program Co-Chair of the ACM International Conference on Multimodal Interfaces (ICMI), November 2010; and a General Co-Chair of the IEEE International Workshop on Multimedia Signal Processing (MMSP), October 2011. He is serving as Chair of the new "Technical Briefs" track of the ACM SIGGRAPH Asia Conference, Nov. 28 - Dec. 1, 2012.

  • Statistical Modeling for Audio-Visual Speech Analysis, Recognition and Synthesis

    Frank K. Soong and Lijuan Wang


    In this tutorial we review the statistical approach, particularly hidden Markov models (HMMs), in current audio-visual speech processing. Two aspects will be specifically emphasized: data-driven and stochastic modeling. In this approach, audio/visual data is first collected to train a statistical parametric model that concisely represents the observed, evolving audio/visual signals in parametric form. The model consists of two components, audio and visual, each a dynamic HMM. The simultaneous co-occurrence of the audio and visual channels of speech units provides both acoustic and visual cues that human listeners/viewers can exploit in speech perception, recognition, or synthesis, either separately or jointly.
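    The stochastic-modeling idea can be made concrete with the standard HMM forward recursion, which scores how well an observation sequence fits a trained model. The two-state discrete model below is a toy with invented probabilities; real audio-visual models use continuous (e.g., Gaussian) emission densities and many more states:

```python
# Toy example of HMM scoring: the forward algorithm computes the total
# probability of an observation sequence under a discrete 2-state model.
# All probabilities are invented for illustration.

def forward(obs, pi, A, B):
    """pi: initial probs, A[i][j]: transition probs, B[i][o]: emission probs."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]   # initialization
    for o in obs[1:]:                                  # recursion over time
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)                                  # termination

pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.5],        # state 0 emission probs for symbols 0, 1
     [0.1, 0.9]]        # state 1 emission probs for symbols 0, 1
likelihood = forward([0, 1, 1], pi, A, B)
```

    In recognition, such likelihoods are compared across competing word or viseme models; in synthesis, the same trained parameters are used in reverse to generate smooth audio/visual parameter trajectories.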


    We review the statistical modeling techniques and the physical meaning of the model parameters, and show how to use them for recognizing an observed a/v speech trajectory or for synthesizing intelligible, natural speech and a talking head (avatar). Some less conventional audio and visual features, such as ultrasound, the Non-Audible Murmur (NAM) microphone, Magnetic Resonance Imaging (MRI), ElectroPalatoGraphy (EPG), and ElectroMagnetic Articulography (EMA), will be covered in addition to the more conventional a/v data collected by a standard microphone and optical camera. Speech recognition and synthesis research based upon these different audio/visual features will be reviewed.


    A case study of our universal HMM Trajectory Tiling (HTT) algorithm will be presented in depth, along with its applications to synthesizing high-quality speech and a photo-realistic talking head. Applications to foreign language learning and cross-lingual voice conversion will also be presented. The talking head and TTS system have been deployed at http://dict.bing.com.cn as a large, dynamic dictionary for helping English as a Second Language (ESL) learners; the web site is currently used by more than one million users daily.



    Frank K. Soong is a Principal Researcher and Manager of the Speech Group, Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans more than 30 years, first with Bell Labs, US, then with ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoder algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm behind voice-activated mobile phone products rated by Mobile Office Magazine (April 1993) as "outstandingly the best." He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package.

    He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including associate editor of the IEEE Transactions on Speech and Audio Processing and chairing IEEE workshops. He has published extensively, with more than 200 papers, and co-edited a widely used reference book, Automatic Speech and Speaker Recognition: Advanced Topics (Kluwer, 1996). He is a visiting professor of the Chinese University of Hong Kong (CUHK) and of a few other top-rated universities in China, and co-director of the MSRA-CUHK Joint Research Lab. He received his BS, MS, and PhD degrees from National Taiwan University, the University of Rhode Island, and Stanford University, respectively, all in electrical engineering. He is an IEEE Fellow.



    Lijuan Wang is a Researcher in the Speech Group, Microsoft Research Asia (MSRA), Beijing, China. She received her PhD degree from Tsinghua University, China, in 2006 and joined the Speech Group at MSRA the same year. Her research interests include avatar (talking head) synthesis, speech synthesis, and audio-visual signal processing. She is a member of the IEEE. More information about her can be found at http://research.microsoft.com/en-us/people/lijuanw/