13th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

DECEMBER 14 – 17, 2021, TOKYO, JAPAN
Venue: KFC Hall & Rooms
Kokusai Fashion Centre Bldg., Yokoami 1-6-1, Sumida City, Tokyo

Signal & Information Processing — Science for Signals, Data, and Intelligence

Important Dates

[April 1, 2021] Submission of Proposals for Special Sessions
[May 1, 2021] Submission of Proposals for Forum, Panel & Tutorial Sessions
[July 15, 2021] Submission of Regular Papers
[July 15, 2021] Submission of Special Session Papers
[July 16 – September 10, 2021] Submission of Research Abstracts
[August 31, 2021] Notification of Paper Acceptance
[October 1, 2021] Submission of Camera-Ready Papers
[October 1, 2021] Author (Early-Bird) Registration Deadline
[December 14 – 17, 2021] Tutorials, Summit and Conference Dates

Overview Session

+OS-1: Acoustic Signal Processing

Prof. Hiroshi Saruwatari (The University of Tokyo, Japan)
■ Speaker #1: Prof. Shoichi Koyama (The University of Tokyo, Japan)
Sound Field Analysis and Synthesis: Theoretical Advances and Applications to Spatial Audio Reproduction

Sound field analysis and synthesis are fundamental techniques in spatial acoustic signal processing, which aim to estimate or synthesize an acoustic field using a discrete set of microphones or loudspeakers. These techniques are essential for visualization/auralization of a sound field, VR/AR audio, creating personal sound zones, canceling noise within a spatial region, and so forth. Conventional techniques are largely based on boundary integral representations of the Helmholtz equation, such as the Kirchhoff-Helmholtz and Rayleigh integrals. In recent years, machine learning techniques incorporating characteristics of acoustic fields, referred to as wavefield-based machine learning (WBML), have evolved in this research field. WBML has the potential to further enhance the performance and applicability of sound field analysis and synthesis. In this overview talk, we will introduce these recent advancements. Specifically, kernel methods for sound field estimation and their application to spatial audio reproduction will be highlighted.
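
As a concrete illustration of the kernel approach mentioned above, here is a minimal sketch of kernel ridge regression for estimating a pressure field from discrete microphone readings. This is a toy, not the talk's actual method: the generic Gaussian kernel and all names are assumptions, whereas the methods covered in the talk use kernels adapted to the Helmholtz equation (e.g., sinc/Bessel kernels).

```python
import numpy as np

def kernel_interpolate(mic_pos, pressures, query_pos, width=0.2, reg=1e-8):
    """Kernel ridge regression for field estimation (toy Gaussian kernel)."""
    def kern(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * width ** 2))

    K = kern(mic_pos, mic_pos)
    alpha = np.linalg.solve(K + reg * np.eye(len(mic_pos)), pressures)
    return kern(query_pos, mic_pos) @ alpha

# 5x5 grid of "microphones" sampling a smooth pressure field
gx, gy = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
mics = np.column_stack([gx.ravel(), gy.ravel()])
p = np.cos(2 * np.pi * mics[:, 0])
p_at_mics = kernel_interpolate(mics, p, mics)  # reproduces the readings
```

The same fitted coefficients can then evaluate the field at arbitrary positions, which is the basis for spatial audio reproduction from a finite microphone set.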

Shoichi Koyama
Shoichi Koyama received the B.E., M.S., and Ph.D. degrees from the University of Tokyo, Tokyo, Japan, in 2007, 2009, and 2014, respectively. In 2009, he joined Nippon Telegraph and Telephone (NTT) Corporation, Tokyo, Japan, as a Researcher in acoustic signal processing. In 2014, he moved to the University of Tokyo, where he has been a Lecturer since 2018. From 2016 to 2018, he was also a Visiting Researcher with Paris Diderot University (Paris 7), Institut Langevin, Paris, France. His research interests include audio signal processing, acoustic inverse problems, and spatial audio. He was the recipient of the Itakura Prize Innovative Young Researcher Award from the Acoustical Society of Japan in 2015 and the Research Award from the Funai Foundation for Information Technology in 2018.
■ Speaker #2: Prof. Yusuke Hioka (University of Auckland, New Zealand)
Audio Signal Processing for Unmanned Aerial Vehicle Audition

Along with the rapid advancement of their technologies and expanding abilities, new applications of unmanned aerial vehicles (UAVs, a.k.a. drones) have been actively explored over the last decade. One such emerging application is sound recording, i.e., equipping UAVs with an "auditory" function, which has huge potential to deliver both commercial and societal benefits across various industries and sectors, such as filming/broadcasting, monitoring/surveillance, and search/rescue. However, a key challenge in UAV audition is the extensive noise generated by the UAV's propellers, known as ego noise, which significantly deteriorates the quality of sound recorded on the UAV. Research in signal processing for UAV audition has been actively conducted to achieve better auditory function by addressing this challenge. This talk will overview recent studies on audio signal processing for UAV audition, with a particular focus on techniques that emphasise sound from the target source while minimising the propeller noise. The talk will also introduce a case study where such technology is applied to commercial products.
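
As a toy illustration of the kind of noise suppression discussed in the talk (a hypothetical sketch, not the talk's actual techniques), spectral subtraction removes an estimate of the quasi-stationary ego-noise magnitude, e.g., obtained from rotor speed or noise-only frames, from the recorded spectrum:

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, floor=0.05):
    """Subtract a noise-magnitude estimate per frequency bin, with a
    spectral floor to avoid negative magnitudes (musical-noise guard)."""
    clean = noisy_mag - noise_mag
    return np.maximum(clean, floor * noisy_mag)

# Three example frequency bins: target-dominant, noise-dominant, mixed
noisy = np.array([1.0, 0.5, 2.0])
noise = np.array([0.4, 0.6, 0.3])
out = spectral_subtract(noisy, noise)
```

The noise-dominant bin is clamped to the floor rather than going negative, which is the basic trade-off between suppression depth and artifacts.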

Yusuke Hioka
Yusuke Hioka is a Senior Lecturer at the Acoustics Research Centre of the Department of Mechanical Engineering, the University of Auckland, Auckland, New Zealand. He received his B.E., M.E., and Ph.D. degrees in engineering in 2000, 2002, and 2005 from Keio University, Yokohama, Japan. From 2005 to 2012, he was with the NTT Cyber Space Laboratories, Nippon Telegraph and Telephone Corporation (NTT) in Tokyo. From 2010 to 2011, he was also a visiting researcher at Victoria University of Wellington, New Zealand. In 2013 he permanently moved to New Zealand and was appointed as a Lecturer at the University of Canterbury, Christchurch. Subsequently, in 2014, he moved to the current position at the University of Auckland, where he is also the Co-director of the Acoustic Research Centre and leads the Communication Acoustics Laboratory at the Centre. His research interests include audio and acoustic signal processing, room acoustics, human auditory perception and psychoacoustics. He is a Senior Member of the IEEE and a Member of the Acoustical Society of Japan and the Acoustical Society of New Zealand. Since 2016 he has been serving as the Chair of the IEEE New Zealand Signal Processing & Information Theory Chapter.
■ Speaker #3: Prof. Daichi Kitamura (National Institute of Technology, Kagawa College, Japan)
Blind Audio Source Separation Based on Time-Frequency Structure Models

Blind source separation (BSS) for audio signals is a technique to extract specific audio sources from an observed mixture signal. In particular, multichannel determined BSS has been studied for many years because of its capability: the separation can be achieved by a linear operation (multiplication by a demixing matrix), and the quality of the estimated audio sources is much better than that of other non-linear BSS algorithms. Determined BSS algorithms have their roots in independent component analysis (ICA), which assumes independence among sources and estimates the demixing matrix. ICA was later extended to independent low-rank matrix analysis (ILRMA) by introducing a low-rank time-frequency structure model for each source. With the advent of ILRMA, the combination of "demixing matrix estimation for linear BSS" and "time-frequency structure models for each source" has become a reliable approach to audio BSS problems. In this talk, we focus on a new flexible BSS algorithm called time-frequency-masking-based BSS (TFMBSS). In this method, thanks to a model-independent optimization algorithm, arbitrary time-frequency structure models can easily be utilized to estimate the demixing matrix in a plug-and-play manner. In addition to the theoretical basis of this algorithm, some TFMBSS applications combining group sparsity, harmonicity, or smoothness in the time-frequency domain will be reviewed.
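
The linear-demixing idea can be shown with a toy example. Here the demixing matrix is taken as the oracle inverse of a known mixing matrix purely to show that separation reduces to a single matrix multiplication; in actual determined BSS (ICA, ILRMA, TFMBSS), W is estimated blindly from the mixture alone:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))           # two toy non-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])    # mixing matrix (unknown in practice)
x = A @ s                                 # observed two-channel mixture

# Oracle demixing matrix, used only to illustrate that separation is
# one linear operation y = W x; determined BSS estimates W blindly.
W = np.linalg.inv(A)
y = W @ x
```

In practice the same multiplication is applied per frequency bin of the multichannel STFT, and the difficulty lies entirely in estimating W from x.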

Daichi Kitamura
Daichi Kitamura received the Ph.D. degree from SOKENDAI, Hayama, Japan. He joined The University of Tokyo in 2017 as a Research Associate and moved to the National Institute of Technology, Kagawa College, as an Assistant Professor in 2018. His research interests include audio source separation, statistical signal processing, and machine learning. He was the recipient of the Awaya Prize Young Researcher Award from the Acoustical Society of Japan (ASJ) in 2015, the Ikushi Prize from the Japan Society for the Promotion of Science in 2017, the Best Paper Award from IEEE Signal Processing Society Japan in 2017, the Itakura Prize Innovative Young Researcher Award from the ASJ in 2018, and the Young Author Best Paper Award from the IEEE Signal Processing Society.

+OS-2: Information Processing for Speech and Environmental Sounds

Prof. Hiroshi Saruwatari (The University of Tokyo, Japan)
■ Speaker #1: Prof. Jingdong Chen (Northwestern Polytechnical University, China)
Microphone Array Design and Processing for Acoustic Signal Acquisition and Enhancement

Voice communication and human-machine speech interaction systems face increasingly challenging application environments where, besides the speech signals of interest, noise, reverberation, echo, and interference coexist. Acquiring high-fidelity speech signals in such complicated acoustic environments is a very challenging problem, which involves the use of microphone arrays and many multichannel acoustic signal processing techniques. In this talk, I will present a brief overview of the basic problems and principles of sensing and processing speech signals. I will then focus on important challenges faced by teleconferencing, audio-bridging, and human-machine interface systems. I will elaborate, using examples, on how to design microphone arrays and beamforming algorithms to achieve noise reduction and interference suppression, thereby extracting speech signals of interest in noisy and reverberant acoustic environments.
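
As a minimal illustration of array beamforming, here is a toy frequency-domain delay-and-sum sketch for a uniform linear array, far simpler than the designs covered in the talk; all names and parameter values are assumptions. Aligning the per-microphone phases toward a steering direction sums the target coherently and attenuates sources from other directions:

```python
import numpy as np

def steering_vector(n_mics, spacing, theta, freq, c=343.0):
    """Phases of a far-field plane wave from angle theta (rad) at a ULA."""
    delays = np.arange(n_mics) * spacing * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq * delays)

def delay_and_sum(x, theta, spacing, freq):
    """x: per-microphone complex spectra at one frequency bin."""
    a = steering_vector(len(x), spacing, theta, freq)
    return np.conj(a) @ x / len(x)

# A unit-amplitude source arriving from broadside (theta = 0)
f, d, M = 2000.0, 0.05, 8
x = 1.0 * steering_vector(M, d, 0.0, f)

on_target = delay_and_sum(x, 0.0, d, f)              # steered at the source
off_target = delay_and_sum(x, np.deg2rad(60), d, f)  # steered away
```

Steering at the source preserves it with unit gain, while steering 60 degrees away attenuates it strongly, which is the spatial selectivity that noise reduction and interference suppression build on.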

Jingdong Chen
Jingdong Chen received his Ph.D. degree from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 1998. He is currently a professor at Northwestern Polytechnical University (NWPU) in Xi'an, China. Before joining NWPU in January 2011, he served as the Chief Scientist of WeVoice Inc. in New Jersey for one year. Prior to this position, he was with Bell Labs in New Jersey for nine years. Before joining Bell Labs, he held positions at Griffith University in Brisbane, Australia, and the Advanced Telecommunications Research Institute International (ATR) in Kyoto, Japan. Dr. Chen has long been working on the problems of speech enhancement, noise reduction, echo cancellation, and microphone array processing. He has authored and co-authored 14 monograph books and published over 200 papers in peer-reviewed journals and conferences. He has served the global research community in various capacities: as the Chair of the IEEE Xi'an Section, as an Associate Editor of the IEEE Transactions on Audio, Speech and Language Processing, and as a member of the editorial boards of several journals. He was the general chair of IWAENC 2016, IEEE ICSPCC 2021, and IEEE ChinaSIP 2014, the technical program co-chair of IEEE WASPAA 2009, IEEE TENCON 2013, and ChinaSIP 2014, and helped organize many other conferences. Dr. Chen received the IEEE Signal Processing Society Best Paper Award in 2009, the Best Paper Award from IEEE WASPAA in 2011, the Bell Labs Role Model Teamwork Award twice (in 2007 and 2009), the NASA Tech Brief Award twice (in 2009 and 2010), the Japan Trust International Research Grant from the Japan Key Technology Center in 1998, the "Distinguished Young Scientists Fund" from the National Natural Science Foundation of China (NSFC) in 2014, and the Young Author Best Paper Award from the National Conference on Man-Machine Speech Communications in 1998.
He is also the co-author of a journal paper for which his Ph.D. student, Chao Pan, received the IEEE Region 10 (Asia-Pacific) 2016 Distinguished Student Paper Award (First Prize). He was elevated to IEEE Fellow in 2021 "for contributions to microphone array processing and speech enhancement in noisy and reverberant environments".
■ Speaker #2: Prof. Berrak Sisman (Singapore University of Technology and Design, Singapore)
Emotion in Speech Synthesis

In this talk, Dr. Sisman will introduce the fundamentals of emotion in speech synthesis and recent advancements in the field through live demonstrations. She will present her technical contributions, which cover both emotional voice conversion and deep learning solutions for high-quality speech synthesis. She will also provide her perspectives on the technology challenges and future directions moving forward.

Dr. Berrak Sisman
Dr. Berrak Sisman is a tenure-track Assistant Professor at the Singapore University of Technology and Design (SUTD), where she is the Principal Investigator of the SUTD Speech & Intelligent Systems Lab. Prior to joining SUTD, she was a Postdoctoral Research Fellow with the National University of Singapore. She received the Ph.D. degree in Electrical and Computer Engineering from the National University of Singapore in 2020, fully funded by the Singapore International Graduate Award. During her Ph.D., she was a Visiting Scholar with the Centre for Speech Technology Research (CSTR), University of Edinburgh, in 2019, and was also attached to the RIKEN Advanced Intelligence Project, Japan, in 2018. Dr. Sisman has published in leading journals and conferences, including IEEE/ACM Transactions on Audio, Speech and Language Processing, Neural Networks, IEEE Signal Processing Letters, ASRU, INTERSPEECH, and ICASSP. She plays leadership roles in conference organization and is also active in technical committees. She served as an Area Chair at INTERSPEECH 2021 and INTERSPEECH 2022 and as the Publication Chair at ICASSP 2022. She has been elected a member of the IEEE Speech and Language Processing Technical Committee (SLTC) in the area of Speech Synthesis for the term from Jan. 2022 to Dec. 2024.
■ Speaker #3: Prof. Keisuke Imoto (Doshisha University, Japan)
Fundamentals and Recent Advances in Environmental Sound Analysis

We are surrounded by various kinds of sounds such as speech, music, and environmental sounds. For computers to understand these sounds and to realize human-like listening systems, environmental sound analysis is an essential technology, and it has been extensively developed. In this talk, we review the fundamentals of environmental sound analysis, including its problem definitions (e.g., acoustic scene analysis, sound event detection, and anomalous sound detection), available public datasets, applications, and challenges. We also introduce recent efforts in environmental sound analysis, including various deep-learning-based methods, performance evaluation methods, and our recent works, and discuss future research directions in this area.

Keisuke Imoto
Keisuke Imoto received his B.E. and M.E. degrees from Kyoto University in 2008 and 2010, respectively. He received his Ph.D. degree from SOKENDAI (The Graduate University for Advanced Studies) in 2017. He joined the Nippon Telegraph and Telephone Corporation (NTT) in 2010 and the Ritsumeikan University as an Assistant Professor in 2017. He moved to Doshisha University as an Associate Professor in 2020. He has been engaged in research on sound event detection, acoustic scene analysis, anomalous sound detection, and microphone array signal processing. He is a member of the IEEE Signal Processing Society and the Acoustical Society of Japan (ASJ). He received the Awaya Award from ASJ in 2013, the TAF Telecom System Technology Award in 2018, and the Sato Prize ASJ Paper Award from ASJ in 2020.

+OS-3: Multimedia Security

Prof. Jing-Ming Guo (National Taiwan University of Science and Technology, Taiwan)
■ Speaker #1: Prof. KokSheik Wong (Monash University Malaysia, Malaysia)
Complete Quality Preservation Reversible Data Hiding

Traditionally, data hiding is realized at the expense of slight quality degradation in the host content. This also applies to reversible methods, because the modified content (now containing the inserted data) is inevitably distorted. Although the introduced distortion may not be noticeable to the naked eye, a direct comparison between the original and modified contents will instantly reveal the differences. Recently, our research group put forward a few proposals to hide data while completely preserving the quality of the host content. Essentially, the quality of the content is exactly the same before and after data insertion. In general, the quality-preserving property is achieved by using two representations to render/encode the same entity, where one representation is associated with `0' and the other with `1'. In addition, the proposed techniques are reversible and the hiding capacity is scalable. In this overview, two proposed techniques, one for animated GIF and another for PDF files, will be presented, followed by a discussion of potential future research directions.
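
The two-representation idea can be sketched abstractly (a hypothetical toy, not the actual GIF/PDF schemes): each token of the host has two encodings that render identically, so choosing between them embeds one bit while the rendered content is unchanged.

```python
REP0, REP1 = "a", "A"              # hypothetical equivalent encodings
render = lambda tok: tok.lower()   # both encodings render identically

def embed(cover_tokens, bits):
    """Carry one bit per token by choosing its encoding."""
    out = list(cover_tokens)
    for i, b in enumerate(bits):
        out[i] = REP1 if b else REP0
    return out

def extract(stego_tokens, n_bits):
    return [1 if t == REP1 else 0 for t in stego_tokens[:n_bits]]

cover = [REP0] * 8
bits = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed(cover, bits)
# Quality preservation: rendered output is bit-for-bit identical
# before and after embedding.
same_render = [render(t) for t in stego] == [render(t) for t in cover]
```

Reversibility follows if the payload also records the original encoding choices, and capacity scales with the number of tokens that admit two encodings.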

KokSheik Wong
KokSheik Wong is an Associate Professor in the School of Information Technology, Monash University Malaysia. He has published more than 50 journal articles and 90 conference papers. He was the recipient of the COMSTECH-TWAS Joint Research Grants funded by the Organisation of Islamic Cooperation and an international partner in the H2020 grant IDENTITY (project 690907). He currently serves as the Editor-in-Chief of the APSIPA Newsletter and as an Associate Editor of the Journal of Information Security and Applications (JISA), as well as Signal. Dr. Wong served as a General Co-Chair of APSIPA ASC 2017. In 2019, he received the Best Paper Award at the 18th International Workshop on Digital-forensics and Watermarking (IWDW 2019) for his work on data hiding in PDF files. In 2021, he was also awarded the "Outstanding Reviewer Award" by IEEE Transactions on Multimedia.
■ Speaker #2: Prof. Wei Lu (Sun Yat-sen University, China)
Secure Robust JPEG Steganography for Social Networks

A secure robust JPEG steganographic scheme based on an autoencoder with adaptive BCH encoding is proposed, which can protect secret messages against JPEG compression channel interference. In the proposed scheme, the autoencoder is first pretrained to fit the transformation between the JPEG image before and after compression by the channel, so that the model fits the inverse procedure of the JPEG compression channel and can generate an intermediate image that resists JPEG compression. Then, BCH encoding is adaptively applied according to the content of the cover image to decrease the error rate of secret message extraction. Finally, DCT coefficient adjustment based on practical JPEG channel characteristics further improves robustness and statistical security. Experimental results demonstrate that the proposed robust JPEG steganographic algorithm provides better robustness and statistical security than prior state-of-the-art JPEG steganographic schemes.
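
BCH codes are beyond a short sketch, but a repetition code illustrates the same principle exploited here: redundancy in the embedded payload lets the secret message survive channel errors such as those introduced by JPEG recompression (an illustrative toy, not the proposed scheme).

```python
def rep_encode(bits, r=3):
    """Repeat each bit r times (stand-in for BCH redundancy)."""
    return [b for b in bits for _ in range(r)]

def rep_decode(coded, r=3):
    """Majority vote per group of r repeated bits."""
    return [int(sum(coded[i * r:(i + 1) * r]) > r // 2)
            for i in range(len(coded) // r)]

msg = [1, 0, 1, 1, 0]
coded = rep_encode(msg)
coded[4] ^= 1                 # one channel error (e.g., from recompression)
recovered = rep_decode(coded)
```

BCH codes achieve the same error correction with far less overhead, which is why the proposed scheme adapts the code rate to the cover content rather than repeating bits.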

Wei Lu
Wei Lu received the Ph.D. degree in computer science from Shanghai Jiao Tong University, China, in 2007. He is currently a Professor and the Director of the Institute for Cyberspace Security in the School of Computer Science and Engineering at Sun Yat-sen University, Guangzhou, China. He is a member of the Artificial Intelligence and Security Committee of the Chinese Society for Artificial Intelligence, a member of the Chinese Computer Society, and a member of the IEEE. His research interests include multimedia forensics and security, data hiding and watermarking, and privacy protection. He has published over 100 papers and received the second prize of the Shanghai Natural Science Award. He has served as an Associate Editor for Signal Processing and other international journals. He participated in drafting the national judicial authentication technical specifications "Technical Specification of Digital Image Metadata Forensics" (SF/T 0078-2020) and "Application Specification of Face Recognition Technology in Portrait Identification".
■ Speaker #3: Prof. Yuan-Gen Wang (Guangzhou University, China)
Advances in Watermarking Security

Watermarking is the art of hiding data in multimedia content in a robust manner, while keeping the imperceptibility of the hidden data. In the early studies of digital watermarking, the security received little attention, yet was later shown to be as important as the robustness and the imperceptibility. In this talk, I will first introduce the concept and requirements of watermarking security. Then, I will review some of the state-of-the-art watermarking security theories and algorithms, and discuss their merits and limitations. By analyzing the limitations of existing methods, I will outline the major challenges in watermarking security, which provide some future research directions on this topic.

Dr. Yuan-Gen Wang
Dr. Yuan-Gen Wang received the B.S. degree in physics from Jiangxi Normal University, Nanchang, China, in 1999, and the M.E. and Ph.D. degrees in communication and information system from Sun Yat-sen University, Guangzhou, China, in 2006 and 2013, respectively. Since 2011, he has been an Associate Professor with the Zhongkai University of Agriculture and Engineering, Guangzhou, China. From 2015 to 2016, he was a Research Scholar with the New Jersey Institute of Technology, Newark, NJ, USA. In 2017, he joined Guangzhou University, Guangzhou, China, where he is currently a Full Professor and the Deputy Dean with the School of Computer Science. In 2019, he visited Italy for one month as an Exchange Professor with the Department of Mathematics, University of Padua, Padua, Italy. Dr. Wang is a senior member of the IEEE. His research interests include digital watermarking, multimedia security, and image processing.

+OS-4: Deep Learning for Image Processing

Prof. Koichi Shinoda (Tokyo Institute of Technology, Japan)
■ Speaker #1: Prof. Nakamasa Inoue (Tokyo Institute of Technology, Japan)
Pre-training Neural Networks without Natural Images

Image representation learning is one of the most fundamental problems in computer vision with a broad range of applications, including object recognition and action recognition. Many recent studies on self-supervised learning have shown that the need for labeled images can be significantly reduced if a large number of unlabeled images is available for pre-training. However, a major effort is still required to collect data, and to resolve the dataset bias issue during the data collection process. In this overview talk, I will present a series of our recent work on pre-training without natural images, which aims to learn image representations from automatically generated images. Specifically, this talk covers pre-training algorithms, in which neural networks learn to classify procedural patterns such as fractals, color tiles, and Perlin noise. This talk will also include a review of recent research on self-supervised learning and discussions on future research directions.
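
The idea of procedurally generated pre-training categories can be sketched as follows. This toy uses Julia sets as a stand-in for the fractal categories in this line of work (an assumed simplification): each parameter value defines one "class" of images, so labels come for free and no natural images are collected.

```python
import numpy as np

def julia_image(c, size=32, iters=20):
    """Render a tiny Julia-set 'escape time' image for parameter c."""
    xs = np.linspace(-1.5, 1.5, size)
    z = xs[None, :] + 1j * xs[:, None]
    escape = np.zeros(z.shape, dtype=np.float32)
    for _ in range(iters):
        mask = np.abs(z) < 2
        z = np.where(mask, z * z + c, z)   # a point freezes once it escapes
        escape += mask                     # count iterations before escape
    return escape / iters

# Two parameters -> two distinct synthetic "classes", labels for free
img0 = julia_image(-0.8 + 0.156j)
img1 = julia_image(0.285 + 0.01j)
```

A network pre-trained to predict which parameter generated each image learns transferable features without any human annotation or dataset-bias concerns from web-scraped photos.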

Nakamasa Inoue
Nakamasa Inoue is an Assistant Professor in the department of computer science at Tokyo Institute of Technology, Japan. He received his B.E., M.E., and Ph.D. degrees in computer science from Tokyo Institute of Technology in 2009, 2011, and 2014, respectively. His main research interests lie in multimedia processing including video retrieval, image recognition, and speech recognition. He received the ACCV (Asian Conference on Computer Vision) Best Paper Honorable Mention Award in 2020 for work on pre-training without natural images.
■ Speaker #2: Prof. Rio Yokota (Tokyo Institute of Technology, Japan)
Approximate Second Order Optimization for Distributed Deep Learning

Second order optimization methods require the computation of Hessian, Gauss-Newton, or Fisher information matrices, which results in an intractable computational cost when done naively. Such matrices are not only used for optimization, but also for generalization metrics, continual learning, structured pruning, and gradient-based hyperparameter optimization. We show that these matrices can be computed using backpropagation and that various approximation methods exist that can accelerate their computation significantly. Furthermore, when training on distributed systems, the overhead of computing these matrices can be reduced significantly.
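
The point about tractable computation can be shown for a linear least-squares model, where the Gauss-Newton matrix is G = JᵀJ with Jacobian J: a matrix-vector product Gv = Jᵀ(Jv) needs only two Jacobian passes (one forward-mode, one reverse-mode) and never materializes G. This is an illustrative sketch; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))   # Jacobian of the model f(w) = X @ w
v = rng.standard_normal(d)

G = X.T @ X                       # explicit Gauss-Newton matrix: O(n d^2)
Gv_explicit = G @ v
Gv_matfree = X.T @ (X @ v)        # two passes; G is never materialized
```

Hessian-free and Kronecker-factored methods build on exactly this kind of matrix-free or structured product to make second-order information affordable at deep-learning scale.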

Rio Yokota
Rio Yokota is an Associate Professor at the Global Scientific Information and Computing Center at the Tokyo Institute of Technology. His research interest lies at the intersection of HPC and ML. On the HPC side, he has worked on hierarchical low-rank approximation methods such as FMM and H-matrices. He has worked on GPU computing since 2007 and won the Gordon Bell prize using the first GPU supercomputer in 2009. On the ML side, he works on distributed deep learning and second-order optimization. His work on training ImageNet in 2 minutes with second-order methods has been extended to various applications using second-order information.
■ Speaker #3: Prof. Hiroki Nakahara (Tokyo Institute of Technology, Japan)
Various Hardware Accelerator Implementations for Deep Learning

With the development of deep learning, the market for edge AI, including embedded systems, is expected to expand. Edge AI devices must perform a large number of operations with limited computational and power resources, and suitable data structures and architectures have been researched and developed. We first introduce various network structures that are effective for hardware implementation. Next, we describe a weight-reduction method for hardware implementation: low-bit quantization. Quantization techniques achieve high speed while reducing the amount of dedicated hardware, but recognition accuracy deteriorates, so we consider optimization methods to compensate. We then present weight sparseness, which approximates small weights by zero; since it also involves a trade-off between accuracy degradation and acceleration, we introduce a corresponding optimization method. Finally, we describe three types of hardware implementations and demonstrate the implementation results on an FPGA as a prototype platform for deep learning applications.
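
The two weight-reduction methods mentioned can be sketched minimally as follows, under assumed parameters and without the accuracy-recovery optimization discussed in the talk:

```python
import numpy as np

def quantize_int8(w):
    """Uniform symmetric 8-bit post-training quantization."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights (weight sparseness)."""
    thr = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thr, 0.0, w)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)

q, s = quantize_int8(w)
w_hat = dequantize(q, s)   # per-weight error bounded by half a step, s / 2
w_sparse = prune(w)        # about half the weights become exact zeros
```

On an FPGA, the int8 weights shrink multipliers and memory, and the zeros can be skipped entirely; both introduce the accuracy degradation that the talk's optimization methods aim to recover.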

Hiroki Nakahara
Hiroki Nakahara received the B.E., M.E., and Ph.D. degrees in computer science from the Kyushu Institute of Technology, Fukuoka, Japan, in 2003, 2005, and 2007, respectively. He has held research/faculty positions at the Kyushu Institute of Technology, Iizuka, Japan; Kagoshima University, Kagoshima, Japan; and Ehime University, Ehime, Japan. He is now an Associate Professor at the Tokyo Institute of Technology, Japan, and CEO/CRO/Co-Founder of Tokyo Artisan Intelligence Co., Ltd. He was the Workshop Chair of the International Workshop on Post-Binary ULSI Systems (ULSIWS) in 2014, 2015, 2016, and 2017, and served as the Program Chair of the 8th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART) in 2017. He received the 8th IEEE/ACM MEMOCODE Design Contest 1st Place Award in 2010, the SASIMI Outstanding Paper Award in 2010, the IPSJ Yamashita SIG Research Award in 2011, the 11th FIT Funai Best Paper Award in 2012, the 7th IEEE MCSoC-13 Best Paper Award in 2013, and the ISMVL 2013 Kenneth C. Smith Early Career Award in 2014. His research interests include logic synthesis, reconfigurable architecture, digital signal processing, embedded systems, and machine learning. He is a member of the IEEE, the ACM, and the IEICE.