APSIPA Distinguished Lecturers (1 January 2012 - 31 December 2013)

Abeer Alwan, UCLA, USA
Biography:
Abeer Alwan received her Ph.D. in EECS from MIT in 1992. Since
then, she has been with the Electrical Engineering Department
at UCLA as an Assistant Professor (1992-1996), Associate Professor
(1996-2000), Professor (2000-present), Vice Chair of the BME
program (1999-2001), Vice Chair of EE Graduate Affairs (2003-2006),
and Area Director of Signals and Systems (2006-2010). She established
and directs the Speech Processing and Auditory Perception Laboratory
at UCLA (http://www.ee.ucla.edu/~spapl). Her research interests
include modeling human speech production and perception mechanisms
and applying these models to improve speech-processing applications
such as noise-robust automatic speech recognition. She is the
recipient of the NSF Research Initiation Award (1993), the NIH
FIRST Career Development Award (1994), the UCLA-TRW Excellence
in Teaching Award (1994), the NSF Career Development Award (1995),
and the Okawa Foundation Award in Telecommunications (1997).
Dr. Alwan is an elected member of Eta Kappa Nu, Sigma Xi, Tau
Beta Pi, and the New York Academy of Sciences. She served, as
an elected member, on the Acoustical Society of America Technical
Committee on Speech Communication (1993-1999, and 2005-2008),
on the IEEE Signal Processing Technical Committees on Audio
and Electroacoustics (1996-2000) and on Speech Processing (1996-2001,
2005-2008, 2011-2013). She is a member of the Editorial Board
of Speech Communication, was an editor-in-chief of that journal
(2000-2003), was an Associate Editor (AE) of the IEEE Transactions
on Audio, Speech, and Language Processing (2006-2009), and is
an AE for the Journal of the Acoustical Society of America (JASA).
Dr. Alwan is a Fellow of the IEEE, the Acoustical Society of
America, and the International Speech Communication Association
(ISCA). She was a 2006-2007 Fellow of the Radcliffe Institute
for Advanced Study at Harvard University, and a Distinguished
Lecturer for ISCA.
Lectures:
Lecture 1: Dealing with Noisy and Limited Data: A Hybrid
Approach
This talk builds on Dr. Alwan's Keynote Speech at Interspeech
2008. It surveys the field and presents and compares state-of-the-art
techniques. Areas of interest include noise-robust ASR and rapid
speaker adaptation for both native and non-native speakers of
English. It will show how linguistically-motivated, auditorily-inspired,
and speech production-based models can improve performance and
lead to greater insights.
Lecture 2: Models of Speech Production and Perception and
Applications in Speech and Audio Coding, TTS, and Hearing Aids
This technical talk discusses the potential value of signal
processing algorithms that are based on models of how humans
produce and perceive speech with a focus on models of speech
perception in noise. It then surveys applications which have
benefited tremendously from such models. Applications include
speech and audio coding (e.g., CELP-based techniques which have
benefited from simplified models of speech production, and MPEG
which benefited from modeling aspects of auditory perception),
text-to-speech synthesis, and hearing aids as well as cochlear
implants.
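
As a small, concrete illustration of the auditory-perception modelling mentioned above (standard textbook material, not code from the talk), the sketch below evaluates Terhardt's approximation of the absolute threshold of hearing, the kind of curve MPEG-style psychoacoustic models use to decide which spectral components are inaudible and need not be coded:

```python
# Sketch: Terhardt's approximation of the absolute threshold of hearing.
import numpy as np

def threshold_in_quiet_db(f_hz):
    """Absolute hearing threshold in dB SPL for frequency f_hz (Hz)."""
    f = np.asarray(f_hz, dtype=float) / 1000.0   # convert to kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# Example: the ear is most sensitive in the 2-4 kHz region.
for f in [100, 1000, 3000, 10000]:
    print(f"{f:6d} Hz -> {threshold_in_quiet_db(f):6.1f} dB SPL")
```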
Lecture 3: Production, Analysis, and Perception of Voice Quality
Voice quality is due in part to patterns of vibration of a speaker's
vocal folds inside the larynx. In some languages, different
voice qualities can distinguish word meanings. This talk presents
our studies of voice quality, which include production and perception
studies as well as acoustic measurements of voice contrasts
in various languages.

Mrityunjoy Chakraborty, Indian Institute of Technology, India
Biography:
Mrityunjoy Chakraborty obtained his Bachelor of Engineering from Jadavpur
University, Calcutta, his Master of Technology from IIT Kanpur,
and his Ph.D. from IIT Delhi. He joined IIT Kharagpur as a faculty
member in 1994, where he currently holds the position of Professor
in Electronics and Electrical Communication Engineering. The teaching
and research interests of Prof. Chakraborty are in Digital and
Adaptive Signal Processing, VLSI Signal Processing, Linear Algebra
and DSP applications in Wireless Communications. In these areas,
Prof. Chakraborty has supervised several graduate theses, carried
out independent research and has several well cited publications.
Prof. Chakraborty has been an Associate Editor of the IEEE Transactions
on Circuits and Systems, Part I (2004-2007, 2010-2011, 2012)
and Part II (2008-2009), apart from being an elected member
of the DSP TC of the IEEE Circuits and Systems Society, a guest
editor of the EURASIP JASP (special issue) and a TPC member
of ICC (2007-2011) and Globecom (2008-2011). Prof. Chakraborty
is co-founder of the Asia Pacific Signal and Information Processing
Association (APSIPA), a member of the APSIPA steering committee
and also, the chair of the APSIPA TC on Signal and Information
Processing Theory and Methods (SIPTM). He has also been the
general chair as well as the TPC chair of the National Conference
on Communications 2012.
Prof. Chakraborty is a fellow of the Indian National Academy
of Engineering (INAE) and also a fellow of the IETE.
Lectures:
Lecture 1: A SPT Treatment to the Realization of the Sign-LMS
Based Adaptive Filters
The "sum of power of two (SPT)" is an effective format
to represent filter coefficients in a digital filter which reduces
the complexity of multiplications in the filtering process to
just a few shift and add operations. The canonic SPT is a special
sparse SPT representation that guarantees presence of at least
one zero between every two non-zero SPT digits. In the case
of adaptive filters, as the coefficients are updated with time
continuously, conversion to such canonic SPT forms is, however,
required at each time index, which is quite impractical and
requires additional circuitry. Also, as the position of the
non-zero SPT terms in the canonic SPT expression of each coefficient
word changes with time, it is not possible to carry out multiplications
involving the coefficients via a few "shift and add"
operations. This seminar addresses these issues, in the context
of a SPT based realization of adaptive filters belonging to
the sign-LMS family. Firstly, it proposes a bit serial adder
that takes as input two numbers, one (filter weights) in canonic
SPT and the other (data) in 2's complement form, producing an
output also in canonic SPT, which allows weight updating purely
in the canonic SPT domain. It is also shown how the canonic
SPT property of the input can be used to reduce the complexity
of the proposed adder. For multiplication, the canonic SPT word
for each coefficient is partitioned into non-overlapping digit
pairs and the data word is multiplied by each pair separately.
The fact that each pair can have at most one non-zero digit
is exploited further to reduce the complexity of the multiplication.
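
To make the canonic SPT idea concrete, here is a minimal Python sketch (standard non-adjacent-form recoding; the talk's bit-serial adder and digit-pair partitioning are not reproduced) that converts an integer coefficient to canonic SPT digits and multiplies by shifts and adds only:

```python
# Illustrative sketch: canonic SPT (non-adjacent form) recoding and
# shift-and-add multiplication. Not the talk's hardware architecture.
def to_canonic_spt(n):
    """Return canonic SPT digits of n, least-significant first, in {-1, 0, 1}.
    The result never has two adjacent non-zero digits."""
    digits = []
    while n != 0:
        if n & 1:                 # odd: pick +1 or -1 so (n - d) % 4 == 0
            d = 2 - (n % 4)       # n % 4 == 1 -> +1, n % 4 == 3 -> -1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits or [0]

def spt_multiply(x, digits):
    """Multiply x by the coefficient encoded in SPT digits via shifts/adds."""
    return sum((x << i) if d == 1 else -(x << i)
               for i, d in enumerate(digits) if d)

coeff = 23                          # 23 = 32 - 8 - 1: only 3 non-zero digits
digits = to_canonic_spt(coeff)
print(digits)                       # [-1, 0, 0, -1, 0, 1]
print(spt_multiply(10, digits))     # 230, via 3 shifts and 2 additions
```

Because no two adjacent digits are non-zero, an L-bit coefficient has at most about (L+1)/2 non-zero digits, which bounds the number of shift-and-add operations per multiplication.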
Lecture 2: Adaptive Identification of Sparse Systems - a
Convex Combination Approach
In the context of system identification, it is shown that sometimes
the level of sparseness in the system impulse response can vary
greatly depending on the time-varying nature of the system.
When the response is strongly sparse, convergence of conventional
approaches such as the least mean square (LMS) algorithm is poor. The recently
proposed, compressive sensing based sparsity-aware ZA-LMS algorithm
performs satisfactorily in strongly sparse environments, but
is shown to perform worse than the conventional LMS when sparseness
of the impulse response reduces. In this lecture, we present
an algorithm which works well both in sparse and non-sparse
circumstances and adapts dynamically to the level of sparseness,
using a convex combination based approach. The proposed algorithm
is supported by simulation results that show its robustness
against variable sparsity.
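
The sketch below illustrates the idea under assumed parameter values (it is not the lecture's exact algorithm): a plain LMS filter and a zero-attracting LMS (ZA-LMS) filter run in parallel, and a convex combination with mixing weight lambda = sigmoid(a) adapts toward whichever component currently performs better:

```python
# Minimal sketch of a convex combination of LMS and ZA-LMS; step sizes,
# attractor strength, and the system are assumed values for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, L = 20000, 64
h = np.zeros(L)
h[[5, 20, 40]] = [1.0, -0.5, 0.3]                 # strongly sparse system

mu, rho, mu_a = 0.005, 5e-5, 100.0
w_lms, w_za, a = np.zeros(L), np.zeros(L), 0.0
x = rng.standard_normal(N + L)

for n in range(N):
    u = x[n:n + L][::-1]                          # input regressor
    d = h @ u + 0.01 * rng.standard_normal()      # noisy desired signal
    y1, y2 = w_lms @ u, w_za @ u
    lam = 1.0 / (1.0 + np.exp(-a))                # mixing weight in (0, 1)
    e = d - (lam * y1 + (1.0 - lam) * y2)         # combined-filter error
    w_lms += mu * (d - y1) * u                    # plain LMS update
    w_za += mu * (d - y2) * u - rho * np.sign(w_za)   # zero-attractor term
    a += mu_a * e * (y1 - y2) * lam * (1.0 - lam)     # adapt the mixture
    a = np.clip(a, -4.0, 4.0)                     # keep lambda away from 0/1

w = lam * w_lms + (1.0 - lam) * w_za
print("lambda:", round(lam, 3))                   # small: ZA-LMS dominates
print("misalignment (dB):",
      round(10 * np.log10(np.sum((w - h) ** 2) / np.sum(h ** 2)), 1))
```

When the impulse response becomes dense, the same mechanism drives lambda toward the plain LMS component, which is the robustness-to-variable-sparsity property described above.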
Lecture 3: A Low Complexity Realization of the Sign-LMS Algorithm
using a Constrained, Minimally Redundant, Radix-4 Arithmetic
The sign-LMS algorithm is a popular adaptive filter that requires
only addition/subtraction but no multiplication in the weight
update loop. To reduce the complexity of multiplication that
arises in the filtering part of the sign-LMS algorithm, a special
radix-4 format is presented in this talk to represent each
filter coefficient. The chosen format guarantees sufficient
sparsity which in turn reduces the multiplicative complexity
as no partial product needs to be computed when the multiplicand
is a binary zero. Care is, however, taken to ensure that the
weight update process generates the updated weight also in the
same chosen radix-4 format, which is ensured by developing an
algorithm for adding a 2's complement number with a number given
in the adopted radix-4 format.
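
For illustration, the following sketch uses plain minimally redundant radix-4 (Booth-style) recoding with digits in {-2, ..., 2}; the constrained format of the talk imposes additional sparsity structure that is not reproduced here:

```python
# Illustrative sketch: minimally redundant radix-4 recoding. Each non-zero
# digit yields one partial product, i.e. the multiplicand shifted by 0 or 1
# extra bit and possibly negated.
def to_radix4(n):
    """Return radix-4 digits of n (least-significant first), each in {-2..2}."""
    digits = []
    while n != 0:
        r = n % 4
        d = r if r <= 2 else r - 4    # map remainder 3 to digit -1
        digits.append(d)
        n = (n - d) // 4
    return digits or [0]

def radix4_multiply(x, digits):
    """Multiply x by the recoded coefficient using shifts and adds only."""
    acc = 0
    for i, d in enumerate(digits):
        if d:                          # |d| in {1, 2}
            shift = 2 * i + (abs(d) == 2)
            acc += (x << shift) if d > 0 else -(x << shift)
    return acc

digits = to_radix4(23)                 # [-1, 2, 1]: -1 + 2*4 + 1*16 = 23
print(digits, radix4_multiply(10, digits))   # -> [-1, 2, 1] 230
```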
Lecture 4: New Algorithms for Multiplication and Addition
of CSD and 2's Complement Numbers
The CSD is a powerful sparse representation of digital data
that helps in reducing the complexity of multiplications in
a digital filter, by evaluating only those partial products
that correspond to the non-zero terms in the CSD word. This
talk will present new algorithms and architectures for adding
and multiplying CSD data with 2's complement words, where the
canonic property of the CSD data is used to reduce the complexity
of the implementation effectively.
Lecture 5: Compressed Sensing and Sparse System Identification
Recent emergence of the topic of "Compressed Sensing"
has generated a renewed dynamism in the area of sparse adaptive
filters and sparse system identification. This talk will provide
a review of the recent developments and trends in this area.
Lecture 6: Adaptive Estimation of Delay and Amplitude of
Sinusoidal Signals
In this seminar, we present a new adaptive filter for estimating
and tracking the delay and the relative amplitude of a sinusoid
vis-a-vis a reference sinusoid of the same frequency. By careful
choice of the sampling period, a two-tap FIR filter model is
constructed for the delayed signal. The delay and the amplitude
are estimated by identifying the FIR filter for which a delay
variable and an amplitude variable are updated in an LMS-like
manner, deploying, however, separate step sizes. Convergence
analysis proving convergence (in mean) of the delay and the
amplitude updates to their respective true values will be discussed
and MATLAB based simulation studies confirming satisfactory
estimation performance of the proposed algorithm will be presented.
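
A minimal simulation of the underlying idea (with assumed signal parameters, and a joint two-tap LMS update in place of the seminar's separate delay and amplitude updates) looks like this:

```python
# Sketch: identify a two-tap FIR model of a delayed, scaled sinusoid with
# LMS, then recover amplitude and delay from the converged taps.
import numpy as np

fs, f0 = 8000.0, 200.0            # sampling rate and tone frequency (assumed)
omega = 2 * np.pi * f0 / fs       # digital frequency (rad/sample)
A_true, tau_true = 0.7, 1.3e-3    # true amplitude and delay (seconds)

n = np.arange(20000)
x = np.sin(omega * n)                                # reference sinusoid
d = A_true * np.sin(omega * (n - tau_true * fs))     # delayed, scaled version
d = d + 0.01 * np.random.default_rng(1).standard_normal(n.size)

w, mu = np.zeros(2), 0.1
for k in range(1, n.size):                           # two-tap LMS identification
    u = np.array([x[k], x[k - 1]])
    w += mu * (d[k] - w @ u) * u

# d[k] ~ w[0] sin(omega k) + w[1] sin(omega (k-1)); expand and match terms:
c = w[0] + w[1] * np.cos(omega)   # coefficient of sin(omega k)
s = w[1] * np.sin(omega)          # coefficient of -cos(omega k)
A_hat = np.hypot(c, s)
tau_hat = np.arctan2(s, c) / (omega * fs)
print(f"A ~ {A_hat:.3f} (true {A_true}); tau ~ {tau_hat * 1e3:.2f} ms (true 1.30 ms)")
```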
Lecture 7: APSIPA and its mission and vision
This talk will introduce APSIPA and its present activities as
well as its short- and long-term missions to the audience.

Jen-Tzung Chien, National Chiao Tung University, Taiwan
Biography:
Jen-Tzung Chien received his Ph.D. degree in electrical engineering
from National Tsing Hua University, Hsinchu, Taiwan, in 1997.
During 1997-2012, he was with the Department of Computer Science
and Information Engineering, National Cheng Kung University,
Tainan, Taiwan. Since 2012, he has been with the Department
of Electrical and Computer Engineering, National Chiao Tung
University, Hsinchu, where he is currently a Professor. He has held
visiting researcher positions at Panasonic Technologies
Inc., Santa Barbara, CA; the Tokyo Institute of Technology,
Tokyo, Japan; the Georgia Institute of Technology, Atlanta,
GA; Microsoft Research Asia, Beijing, China; and the IBM
T. J. Watson Research Center, Yorktown Heights, NY. His research
interests include machine learning, speech recognition, blind
source separation, face recognition, and information retrieval.
Dr.
Chien is a senior member of the IEEE Signal Processing Society.
He served as an associate editor of the IEEE Signal Processing
Letters in 2008-2011 and as a tutorial speaker at ICASSP
2012. He has been appointed an APSIPA Distinguished Lecturer
for 2012-2013. He was a co-recipient of the Best Paper Award
of the IEEE Automatic Speech Recognition and Understanding Workshop
in 2011. He received the Young Investigator Award (Ta-You Wu
Memorial Award) from the National Science Council (NSC), Taiwan,
in 2003, the Research Award for Junior Research Investigators
from Academia Sinica, Taiwan, in 2004, and the NSC Distinguished
Research Awards, in 2006 and 2010.
Lectures:
Lecture 1: Machine Learning for Speech and Language Processing
In this lecture, I will present a series of machine learning
approaches to various applications relevant to speech and language
processing including acoustic modelling, language modelling,
speech recognition, blind source separation, document summarization,
information retrieval, and natural language understanding. In
general, speech and language processing involves extensive knowledge
of statistical models which are learnt from observation data.
However, in the real world, observation data are inevitably acquired
from heterogeneous environments in the presence of mislabeled, misaligned,
mismatched and ill-posed conditions. The estimated models suffer
from large complexity, ambiguity and uncertainty. Model regularization
becomes a crucial issue when constructing the speech and text
models for different information systems. In statistical machine
learning, uncertainty modeling and sparse coding algorithms provide
attractive and effective solutions to model regularization. This
lecture will address several recent works on Bayesian and sparse
learning. In particular, I will present Bayesian sensing hidden
Markov models and Dirichlet class language models for speech
recognition, online Gaussian process for blind source separation,
unsupervised structural learning for text representation, and
Bayesian nonparametrics for document summarization. In these
works, robust models are established against improper model
assumption, over-determined model complexity, ambient noise
interference, and nonstationary environment variations. Finally,
I will point out some potential topics on machine learning for
speech and language processing.
Lecture 2: Independent Component Analysis and Unsupervised
Learning
Independent component analysis (ICA) is not only popular for
blind source separation (BSS) but also for unsupervised learning
of salient features underlying the mixed observations. In speech
signals, these features may represent the specific speaker,
gender, accent, noise or environment, and can act as the basis
functions to span the vector space of the human voices in different
conditions. In this lecture, I will present recent works on
ICA and BSS and their applications in audio signal separation
and speech recognition. These works include independent voices
for speaker adaptation, information-theoretic learning based
on convex ICA, and nonstationary source separation via online
Gaussian process. Several machine learning algorithms are developed
to deal with the issues of model selection, model optimization,
model variations, nonstationary process, online learning, nonparametric
modelling, etc. Further research on unsupervised learning
and structural learning based on topic modelling will be addressed.
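
As a self-contained illustration of the BSS setting (using standard FastICA rather than the convex ICA or online Gaussian-process methods discussed in the lecture):

```python
# Sketch: blind separation of two mixed sources with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 4000)
s1 = np.sin(2 * np.pi * 3 * t)                 # source 1: sinusoid
s2 = np.sign(np.sin(2 * np.pi * 5 * t))        # source 2: square wave
S = np.c_[s1, s2] + 0.02 * rng.standard_normal((t.size, 2))

A = np.array([[1.0, 0.6], [0.4, 1.0]])         # unknown mixing matrix
X = S @ A.T                                    # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                   # recovered sources (up to
                                               # permutation, sign and scale)
# Correlation with the true sources should be near +/-1 after matching.
corr = np.corrcoef(np.c_[S, S_hat].T)[:2, 2:]
print(np.round(corr, 2))
```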

Li Deng, Microsoft Research, USA
Biography:
Dr. Li Deng received his Ph.D. from the University of Wisconsin-Madison.
He was an Assistant (1989-1992), Associate (1992-1996), and
Full Professor (1996-1999) at the University of Waterloo, Ontario,
Canada. He then joined Microsoft Research, Redmond, where he
is currently a Principal Researcher and where he received Microsoft
Research Technology Transfer, Goldstar, and Achievement Awards.
Prior to MSR, he also worked or taught at Massachusetts Institute
of Technology, ATR Interpreting Telecom. Research Lab. (Kyoto,
Japan), and HKUST. He has published over 300 refereed papers
in leading journals/conferences and 3 books covering broad areas
of human language technology, machine learning, and audio, speech,
and signal processing. He is a Fellow of the Acoustical Society
of America, a Fellow of the IEEE, and a Fellow of the International
Speech Communication Association. He is an inventor or co-inventor
of over 50 granted US, Japanese, or international patents. He
served on the Board of Governors of the IEEE Sig. Proc. Soc.
(2008-2010). More recently, he served as Editor-in-Chief for
IEEE Signal Processing Magazine (2009-2011), which, according
to the Thomson Reuters Journal Citation Reports released in 2010
and 2011, ranked first in both years among all 127 IEEE publications
and all 247 publications within the Electrical and Electronics
Engineering Category worldwide in terms of its impact factor,
and for which he received the 2011 IEEE SPS Meritorious Service
Award. He currently serves as Editor-in-Chief for IEEE Transactions
on Audio, Speech and Language Processing. His recent tutorials
on deep learning at APSIPA (Oct 2011) and at ICASSP (March 2012)
received the highest attendance rate at both conferences.
Lectures:
Lecture 1: Being Deep and Being Dynamic - New-Generation
Models and Methodology for Advancing Speech Technology
Semantic information embedded in the speech signal --- not only
the phonetic/linguistic content but also a full range of paralinguistic
information including speaker characteristics --- manifests
itself in a dynamic process rooted in the deep linguistic hierarchy
as an intrinsic part of the human cognitive system. Modeling
both the dynamic process and the deep structure for advancing
speech technology has been an active pursuit for more than
20 years, but it was not until recently (only a few years
ago) that a noticeable breakthrough was achieved by the new
methodology commonly referred to as "deep learning". The Deep Belief
Net (DBN) has recently been used to replace the Gaussian Mixture
Model (GMM) component in HMM-based speech recognition, and has
produced dramatic error rate reduction in both phone recognition
and large vocabulary speech recognition while keeping the HMM
component intact. On the other hand, the (constrained) Dynamic
Bayesian Net (referred to as DBN* here) has been developed for
many years to improve the dynamic models of speech while overcoming
the IID assumption as a key weakness of the HMM, with a set
of techniques and representations commonly known as hidden dynamic/trajectory
models or articulatory-like models. A history of these two largely
separate lines of "DBN/DBN*" research will be critically reviewed
and analyzed in the context of modeling deep and dynamic linguistic
hierarchy for advancing speech (as well as speaker) recognition
technology. Future directions will be discussed for this exciting
area of research that holds promise to build a foundation for
the next-generation speech technology with human-like cognitive
ability.
Lecture 2: Feature-Domain, Model-Domain, and Hybrid Approaches
to Noise-Robust Speech Recognition
Noise robustness has long been an active area of research that
captures significant interest from speech recognition researchers
and developers. In this lecture, we use the Bayesian framework
as a common thread to connect, analyze, and categorize a number
of popular approaches to noise robust speech recognition pursued
in the recent past. The topics covered in this lecture include:
1) Bayesian decision rules with unreliable features and unreliable
model parameters; 2) Principled ways of computing feature uncertainty
using structured speech distortion models; 3) Use of phase factor
in an advanced speech distortion model for feature compensation;
4) A novel perspective on model compensation as a special implementation
of the general Bayesian predictive classification rule capitalizing
on model parameter uncertainty; 5) Taxonomy of noise compensation
techniques using two distinct axes: feature vs. model domain
and structured vs. unstructured transformation; and 6) Noise
adaptive training as a hybrid feature-model compensation framework
and its various forms of extension.
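
For reference, one widely used form of the phase-sensitive distortion model behind items 2) and 3) (a standard formulation, not necessarily the exact variant in the lecture) relates noisy speech $y$, clean speech $x$, channel $h$ and noise $n$ in the log filter-bank domain as

$$ y = x + h + \log\!\left(1 + e^{\,n - x - h} + 2\alpha\, e^{\,(n - x - h)/2}\right), $$

where $\alpha$ is the phase factor modeling the cosine of the phase angle between the speech and noise spectra; setting $\alpha = 0$ recovers the classic phase-insensitive model used in vector-Taylor-series style feature compensation.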
Lecture 3: Machine Learning Paradigms for Speech Recognition
Automatic Speech Recognition (ASR) has historically been a driving
force behind many machine learning techniques, including the
ubiquitously used hidden Markov model, discriminative learning,
Bayesian learning, and adaptive learning. Moreover, machine
learning can and occasionally does use ASR as a large-scale,
realistic application to rigorously test the effectiveness of
a given technique, and to inspire new problems arising from
the inherently temporal nature of speech. On the other hand,
even though ASR is available commercially for some applications,
it is in general a largely unsolved problem - for many applications,
the performance of ASR is not yet on par with human performance.
New insight from modern machine learning methodology shows great
promise to advance the state-of-the-art in ASR technology performance.
This lecture provides the audience with an overview of modern machine
learning techniques as utilized in current ASR research and
systems. The intent of the lecture is to foster further cross-pollination
between the machine learning and speech recognition communities
than what has occurred in the past. The lecture is organized
according to the major machine learning paradigms that are either
popular already in or have potential for making significant
contributions to ASR technology. The paradigms presented and
elaborated in this lecture include generative and discriminative
learning; supervised, unsupervised, semisupervised, and active
learning; and adaptive and multitask learning. These learning
paradigms are motivated and discussed in the context of ASR
applications. I will finally present and analyze recent developments
of deep learning, sparse representations, and combinatorial
optimization focusing on their direct relevance to advancing
ASR technology.

Hsueh-Ming Hang, National Chiao Tung University, Taiwan
Biography:
Hsueh-Ming Hang received the B.S. and M.S. degrees in control
engineering and electronics engineering from National Chiao
Tung University, Hsinchu, Taiwan, in 1978 and 1980, respectively,
and the Ph.D. degree in electrical engineering from Rensselaer Polytechnic
Institute, Troy, NY, in 1984.
From 1984 to 1991, he was with AT&T Bell Laboratories, Holmdel,
NJ, and then he joined the Electronics Engineering Department
of National Chiao Tung University (NCTU), Hsinchu, Taiwan, in
December 1991. From 2006 to 2009, he took a leave from NCTU
and was appointed as Dean of the EECS College at National Taipei
University of Technology (NTUT). He is currently a Distinguished
Professor of the EE Dept at NCTU and an associate dean of the
ECE College, NCTU. He has been actively involved in the international
MPEG standards since 1984 and his current research interests
include multimedia compression, image/signal processing algorithms
and architectures, and multimedia communication systems.
Dr. Hang holds 13 patents (Taiwan, US and Japan) and has published
over 180 technical papers related to image compression, signal
processing, and video codec architecture. He was an associate
editor (AE) of the IEEE Transactions on Image Processing (TIP,
1992-1994), the IEEE Transactions on Circuits and Systems for
Video Technology (1997-1999), and is currently an AE of the IEEE
TIP again. He is co-editor and contributor of the Handbook of
Visual Communications published by Academic Press. He is a recipient
of the IEEE Third Millennium Medal and is a Fellow of IEEE and
IET and a member of Sigma Xi.
Lectures:
Lecture 1: Technology and Trends in Multi-camera Virtual-view
Systems
3D video products have been growing rapidly in recent years. One step
further, virtual-viewpoint (or free-viewpoint) video is becoming
a research focus. It is also an on-going standardization item
of the international MPEG Committee. Its aim is to define an
efficient data representation for multi-view (virtual-view)
synthesis at the receiver, which can be a multi-view autostereoscopic
display. The December 2011 3DVC contest results indicate that
such a system is plausible. In practice, a densely arranged
camera array is used to acquire input images and a virtual view
is synthesized by using the depth-image based rendering (DIBR)
technique. Two essential tools are needed for a virtual-view
synthesis system: depth estimation and view synthesis. We will
summarize the recent progress and future trends on this subject.
Some of our work is included in this talk.
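
A minimal sketch of DIBR forward warping (a rectified, horizontal-baseline setup is assumed; production systems add hole filling, multi-reference blending, and boundary handling):

```python
# Sketch: warp a reference view into a virtual view using per-pixel depth.
import numpy as np

def dibr_warp(image, depth, focal, baseline):
    """Shift each pixel by disparity = focal * baseline / depth (z-buffered)."""
    h, w = depth.shape
    virtual = np.zeros_like(image)
    zbuf = np.full((h, w), np.inf)
    disparity = np.round(focal * baseline / depth).astype(int)
    for y in range(h):
        for x in range(w):
            xv = x - disparity[y, x]            # target column in virtual view
            if 0 <= xv < w and depth[y, x] < zbuf[y, xv]:
                zbuf[y, xv] = depth[y, x]       # nearer pixels win occlusions
                virtual[y, xv] = image[y, x]
    return virtual                              # zeros mark disocclusion holes

img = np.tile(np.arange(16, dtype=float), (8, 1))
dep = np.full((8, 16), 100.0)
dep[:, 6:10] = 25.0                             # a near object, larger disparity
print(dibr_warp(img, dep, focal=50.0, baseline=4.0))
```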
Lecture 2: What's Next on Video Coding Technologies and Standards?
After the profound success of defining the H.264/AVC video coding
standard in 2002, the ITU-T Video Coding
Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group
(MPEG) have spent the past few years actively searching for new
or improved technologies that can achieve even higher compression
efficiency. After several years of effort, in January 2010, VCEG
and MPEG formed a joint team and issued a call-for-proposals for "High
Efficiency Video Coding (HEVC)". This standardization item
has attracted a lot of attention and has progressed very well over the
past 14 months. We thus expect a "new" video standard
to be specified in 2012. In addition, 3D video products have been
growing rapidly. One step further, free-viewpoint
video is becoming the next MPEG standardization item. Its aim is
to define an efficient data representation for multi-view (free-view)
synthesis at the receiver, which can be a multi-view auto-stereoscopic
display.

Kyoung Mu Lee, Seoul National University, Korea
Biography:
Kyoung Mu Lee received the B.S. and M.S. degrees in Control
and Instrumentation Engineering from Seoul National University
(SNU), Seoul, Korea in 1984 and 1986, respectively, and the Ph.D.
degree in Electrical Engineering from the University
of Southern California (USC), Los Angeles, California in 1993. He
was awarded the Korean Government Overseas Scholarship
during his Ph.D. studies. From 1993 to 1994 he was a research
associate in the SIPI (Signal and Image Processing Institute)
at USC. He was with the Samsung Electronics Co. Ltd. in Korea
as a senior researcher from 1994 to 1995. In August 1995, he
joined the Department of Electronics and Electrical Engineering of
Hong-Ik University, where he worked as an assistant and then associate
professor. Since September 2003, he has been with the Department of
Electrical Engineering and Computer Science at Seoul National
University as a professor, where he leads the Computer Vision Laboratory.
His primary research is focused on statistical methods in computer
vision that can be applied to various applications including
object recognition, segmentation, tracking and 3D reconstruction.
Prof. Lee has received several awards, in particular, the Most
Influential Paper over the Decade Award by the IAPR Machine
Vision Application in 2009, the ACCV Honorable Mention Award
in 2007, the Okawa Foundation Research Grant Award in 2006,
and the Outstanding Research Award by the College of Engineering
of SNU in 2010. He served as an Editorial Board member of the
EURASIP Journal of Applied Signal Processing, and is an associate
editor of the Machine Vision Application Journal, the IPSJ Transactions
on Computer Vision and Applications, the Journal of Information
Hiding and Multimedia Signal Processing, and IEEE Signal Processing
Letters. He has (co)authored more than 120 publications in refereed
journals and conferences including PAMI, IJCV, CVPR, ICCV and
ECCV.
Lectures:
Lecture 1: Statistical Sampling Approaches for Visual Tracking
Object tracking is one of the important and fundamental problems
in Computer Vision. It is challenging to track a target
in real-world environments where different types
of variations, such as illumination, shape, occlusion, or motion
changes, occur at the same time. Recently, several attempts have
been made to solve the problem, but still the results are far
from satisfactory due to their inherent limited modeling. In
this talk, to cope with this challenging problem we present
a novel approach based on a statistical sampling framework.
The underlying philosophy of our approach is that multiple trackers
can be constructed and integrated efficiently in a probabilistic
way. With a sampling method, the trackers themselves are sampled,
as well as the states of the targets. The trackers are adapted
or newly constructed depending on the current situation, so
that each specific tracker takes charge of a certain change
of the object. They are efficiently sampled using the Markov
Chain Monte Carlo method according to the appearance models,
motion models, state representation types, and observation types.
All trackers are then integrated into one compound tracker through
an interactive Markov Chain Monte Carlo (IMCMC) method, in which
the basic trackers communicate with one another interactively
while running in parallel. Experimental results show that the proposed
method tracks the object accurately and reliably in challenging
realistic videos, and outperforms state-of-the-art tracking
methods.
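
The core accept/reject step underlying such samplers can be sketched as follows (plain Metropolis-Hastings over the target state only; the full IMCMC method additionally samples tracker configurations and lets parallel chains interact, and the likelihood here is a hypothetical appearance score):

```python
# Sketch: random-walk Metropolis-Hastings sampling of a 2-D target state.
import numpy as np

rng = np.random.default_rng(0)

def likelihood(state, target=np.array([60.0, 40.0])):
    """Hypothetical appearance-model score for a candidate (x, y) state."""
    return np.exp(-0.5 * np.sum((state - target) ** 2) / 15.0 ** 2)

state, samples = np.array([0.0, 0.0]), []
for _ in range(5000):
    proposal = state + rng.normal(scale=5.0, size=2)   # random-walk proposal
    # Accept with probability min(1, p(proposal) / p(state)).
    if rng.random() < likelihood(proposal) / likelihood(state):
        state = proposal
    samples.append(state)

print(np.mean(samples[1000:], axis=0).round(1))        # ~ [60. 40.] after burn-in
```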
Lecture 2: Graph matching via Random Walks
Establishing feature correspondence between images lies at the
heart of computer vision problems, and a myriad of feature matching
algorithms have been proposed for a wide range of applications
such as object recognition, image retrieval, and image registration.
However, robust matching under non-rigid deformation and clutter
is still a challenging open problem. Furthermore, most conventional
methods require some supervised settings or restrictive assumptions
such as a reference image without severe clutter, a clean model
of the target object, and one-to-one object matching between
two images. In this talk, a graph-theoretic approach to robust
feature matching will be introduced. Based on a random walk
view of graph matching, image matching under non-rigid deformation
and severe clutter is addressed and also extended to high-order
image matching with high-level visual cues. Combining the method
with a novel graph-based mode-seeking in a progressive framework,
the proposed algorithms effectively solve the interconnected
problems of robust feature matching, object discovery, and outlier
elimination.
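
A stripped-down sketch of the random-walk view (power iteration over a correspondence affinity matrix; the lecture's reweighted walks, high-order extensions, and mode-seeking are not reproduced):

```python
# Sketch: candidate correspondences are graph nodes, pairwise geometric
# consistency forms edges, and power iteration yields soft matching scores.
import numpy as np

rng = np.random.default_rng(0)
P1 = rng.uniform(0, 10, (5, 2))                        # feature points, image 1
perm = rng.permutation(5)
P2 = P1[perm] + rng.normal(scale=0.05, size=(5, 2))    # deformed, shuffled copy

cand = [(i, j) for i in range(5) for j in range(5)]    # all 25 correspondences
M = np.zeros((25, 25))
for a, (i, j) in enumerate(cand):
    for b, (k, l) in enumerate(cand):
        if i != k and j != l:                          # pairwise length agreement
            d = abs(np.linalg.norm(P1[i] - P1[k]) - np.linalg.norm(P2[j] - P2[l]))
            M[a, b] = np.exp(-d ** 2 / 0.5)

x = np.ones(25) / 25.0                                 # uniform starting scores
for _ in range(50):                                    # power iteration / walk
    x = M @ x
    x /= np.linalg.norm(x)

match = {i: int(np.argmax(x[i * 5:(i + 1) * 5])) for i in range(5)}
truth = {i: int(np.where(perm == i)[0][0]) for i in range(5)}
print(match == truth, match)
```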

Weisi Lin, Nanyang Technological University, Singapore
Biography:
Weisi Lin graduated from Zhongshan University, China with B.Sc.
and M.Sc. degrees, and from King's College London, UK with a
Ph.D. He has held research positions at Zhongshan University, Shantou
University (China), Bath University (UK), the National University of
Singapore, the Institute of Microelectronics (Singapore) and the
Institute for Infocomm Research (Singapore). He served as Lab Head and
Acting Department Manager at the Institute for Infocomm Research.
He is now an associate professor in the School of Computer Engineering,
Nanyang Technological University, Singapore. He is a Chartered
Engineer, a senior member of IEEE, a fellow of the IET and an
Honorary Fellow of Singapore Institute of Engineering Technologists.
His areas of expertise include perception-inspired signal modeling,
perceptual multimedia quality evaluation, video compression,
and image processing and analysis; in these areas he has published
190+ refereed journal and conference papers, and been the Principal
Investigator of more than 10 major projects (with both academic
and industrial funding of over S$4m). He currently serves as
the Associate Editor for IEEE Trans on Multimedia, IEEE Signal
Processing Letters and Journal of Visual Communication and Image
Representation.
He co-chairs the IEEE MMTC interest group on Quality of Experience
(QoE), and has organized special sessions in IEEE ICME06, IEEE
IMAP07, PCM09, SPIE VCIP10, IEEE ISCAS 10, APSIPA11, MobiMedia
11 and IEEE ICME 12. He gave invited/panelist/keynote/tutorial
speeches in VPQM06, SPIE VCIP10, IEEE ICCCN07, PCM07, PCM09,
IEEE ISCAS08, IEEE ICME09, APSIPA10, IEEE ICIP10, and IEEE MMTC
QoEIG (2011). He is the Lead Guest Editor for the recent special
issue on New Subjective and Objective Methodologies for Audio
and Visual Signal Processing in IEEE Journal of Selected Topics
in Signal Processing. He also maintains long-term partnership
with a number of companies that are keen on perception-driven
technology for audiovisual signal processing.
Lectures:
Lecture 1: Recent Development in Perceptual Visual Quality
Evaluation
Quality (distortion) evaluation of images and video is useful
in many applications, and also crucial in shaping almost all
visual processing algorithms/systems, as well as their implementation,
optimization and testing. Since the human visual system (HVS)
is the final receiver and appreciator for most processed images
and videos (be they naturally captured or computer generated),
it would be beneficial to use a perceptual quality criterion
in the system design and optimization, instead of a traditional
one (e.g., MSE, SNR, PSNR, QoS). Through evolution,
the HVS has developed unique characteristics. Significant research
effort has been made toward modelling the HVS' picture-quality
evaluation mechanism and applying the models to various situations.
In this lecture, we will first introduce the major problems
associated with perceptual visual quality metrics (PVQMs) (to
be in line with the HVS perception), and the major research
and development work so far in the related fields. Then, the
two major modules in most current systems (i.e., feature detection
and feature pooling) are to be highlighted and explored, based
on the presenter's substantial project exposure. The lecture
aims at providing an up-to-date overview and classification
in perceptual quality gauging for images and videos. It will
also give comparison and comments for the current research activities,
with the presenter's understanding and experience in the said
areas.
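
As a concrete instance of the feature detection plus pooling structure, here is a simplified, single-window version of the well-known SSIM index (practical implementations compute it over local windows and pool the local scores):

```python
# Sketch: global SSIM comparing luminance, contrast and structure.
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM between two images of equal shape."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2) /
            ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(0)
ref = rng.uniform(0, 255, (64, 64))
noisy = np.clip(ref + rng.normal(scale=15, size=ref.shape), 0, 255)
print(round(ssim_global(ref, ref), 3), round(ssim_global(ref, noisy), 3))
```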
Lecture 2: Human-vision Friendly Processing for Images and
Graphics
Making machines perceive as human vision does can result
in resource savings (for instance, bandwidth, memory space,
computing power) and performance enhancement (such as the resultant
visual quality, and new functionalities), for both naturally
captured images and computer generated graphics. Significant
research effort has been made during the past decade toward modelling
the human vision mechanism and applying the resultant
models to various situations (image and video compression, watermarking,
channel coding, signal restoration and enhancement, computer
graphics and animation, visual content retrieval, etc.). The
human vision system's characteristics can be turned into
advantages for system design and optimization. In this talk,
we will first introduce the major problems, difficulties and
research efforts so far in the related fields. The basic engineering
models (like signal decomposition, visual attention, eye movement,
visibility threshold determination, and common artefact detection)
are then to be discussed. Afterward, different perceptually-driven
techniques and applications will be presented for visual signal
compression, enhancement, communication, and rendering, with
proper case studies. The last part of the lecture is devoted
to a summary, points of further discussion and possible future
research directions.

Helen Meng, The Chinese University of Hong Kong, Hong Kong SAR, China
Biography:
Helen Meng received the S.B., S.M., and Ph.D. degrees, all in
electrical engineering, from the Massachusetts Institute of
Technology, Cambridge. She joined The Chinese University of
Hong Kong in 1998, where she is currently Professor in the Department
of Systems Engineering and Engineering Management. In 1999,
she established the Human-Computer Communications Laboratory
and serves as Director. In 2005, she established the Microsoft-CUHK
Joint Laboratory for Human-Centric Computing and Interface Technologies
and serves as Co-Director. This laboratory was conferred the
national status of the Ministry of Education of China (MoE)
Key Laboratory in 2008. Helen also served as an Associate Dean
(Research) of the Faculty of Engineering from 2006 to 2010.
She received the MoE Higher Education Outstanding Scientific
Research Output Awards in Technological Advancements, for the
area of "Multimodal User Interactions with Multilingual
Speech and Language Technologies" in 2009. In previous
years, she has also received the Exemplary Teaching Award, Service
Award for establishment of the worldwide engineering undergraduate
exchange program and Young Researcher's Award from CUHK Faculty
of Engineering. In 2010, her co-authored paper received the
Best Oral Paper Award from the Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA).
Her research interest is in the area of human-computer interaction
via multimodal and multilingual spoken language systems, computer-aided
language learning systems, as well as translingual speech retrieval
technologies. She served as Editor-in-Chief of the IEEE Transactions
on Audio, Speech and Language Processing between 2009 and 2011.
She has also been an elected Board Member of the International Speech
Communication Association since 2007. Helen has served on
the IEEE Speech Technical Committee for two terms and on the
program committees of Interspeech for multiple years. She will
serve as the General Chair of ISCA SIG-CSLP's flagship conference
- International Symposium on Chinese Spoken Language Processing
(ISCSLP) in 2012 and the Technical Program Committee Chair of
Interspeech 2014.
Lectures:
Lecture 1: Development of Automatic Speech Recognition and
Synthesis Technologies to Support Chinese Learners of English
- the CUHK Experience
This talk presents an ongoing research initiative in the development
of speech technologies that strives to raise the efficacy of
computer-aided pronunciation training, especially for Chinese
learners of English. Our approach is grounded on the theory
of language transfer and involves a systematic phonological
comparison between the primary language (L1 being Chinese) and
secondary language (L2 being English) to predict possible segmental
and suprasegmental realizations that constitute mispronunciations
in L2 English. The predictions are validated based on a specially
designed corpus that consists of several hundred hours of L2
English speech. The speech data supports the development of
automatic speech recognition technologies that can detect and
diagnose mispronunciations. The diagnosis aims to support the
design of pedagogical and remedial instructions, which involves
text-to-speech synthesis technologies for corrective feedback
generation in audiovisual forms. This talk offers an overview
of the technologies, related experimental results and ongoing
work as well as future plans.
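
One building block of such a pipeline can be sketched as follows (an assumed post-recognition stage, not the CUHK system itself): align the recognized phone sequence against the canonical pronunciation and report differences as candidate mispronunciations:

```python
# Sketch: phone-level alignment for mispronunciation diagnosis.
from difflib import SequenceMatcher

canonical = "dh ih s ih z ah n ay s b uh k".split()      # "this is a nice book"
recognized = "d ih s ih z ah n ay s b uh k uh".split()   # learner's output

sm = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
for op, a0, a1, b0, b1 in sm.get_opcodes():
    if op == "replace":
        print(f"substitution: {canonical[a0:a1]} -> {recognized[b0:b1]}")
    elif op == "delete":
        print(f"deletion: {canonical[a0:a1]}")
    elif op == "insert":
        print(f"insertion: {recognized[b0:b1]}")
# "dh -> d" (th-stopping) and a trailing epenthetic vowel are classic
# negative-transfer patterns predicted by L1/L2 phonological comparison.
```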
Lecture 2: Multimodal Processing in Speech-based Interactions
Speech constitutes the primary form of human communication and
research in automatic speech processing has largely focused
on the audio modality. However, human communication is inherently
multimodal, which involves not only speech, but also expressions,
gaze, gestures, posture, movement and position, etc. Much can
be gained in terms of naturalness, performance and robustness
in human-computer interaction, by mimicking the human capacity
in jointly processing information available in multiple modalities.
Multimodality in speech is a vastly interdisciplinary research
area. This talk presents an overview of related activities,
including audiovisual speech recognition and synthesis that
incorporates information about the speaker's facial and lip
motions, bimodal interfaces that support speech and pen gestural
inputs for mobile computing, multi-biometric user authentication
that incorporates voiceprints, fingerprints and face recognition,
as well as co-processing of audio and visual information from
multiple speakers in social signal processing. We will also
present methods for multimodal fusion, i.e., the critical process
of information integration across modalities. This talk concludes
with a set of challenges for multimodal speech processing and
suggests possible directions for future work.
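
A minimal sketch of one common fusion scheme (score-level late fusion with assumed, hypothetical scores and weights; many other fusion strategies exist):

```python
# Sketch: normalize per-modality match scores, then combine with weights
# reflecting each modality's assumed reliability.
import numpy as np

def zscore(s, mean, std):
    """Normalize a raw matcher score using matcher-specific statistics."""
    return (s - mean) / std

# Raw scores for one authentication attempt (hypothetical matchers).
voice, finger, face = 2.1, 310.0, 0.62
scores = np.array([
    zscore(voice, 0.0, 1.5),       # voiceprint matcher
    zscore(finger, 200.0, 80.0),   # fingerprint matcher
    zscore(face, 0.5, 0.1),        # face matcher
])
weights = np.array([0.3, 0.5, 0.2])       # reliability weights, sum to 1
fused = weights @ scores
print("accept" if fused > 1.0 else "reject", round(fused, 2))
```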
Lecture 3: Modeling the Expressivity of Textual Semantics
for Text-to-Audiovisual Speech Synthesis in Avatar Animation
This talk describes expressive text-to-speech synthesis techniques
for a Chinese spoken dialog system, where the expressivity is
driven by the message content. We adapt the three-dimensional
pleasure-displeasure, arousal-nonarousal and dominance-submissiveness
(PAD) model for describing expressivity in input text semantics.
The context of our study is based on response messages generated
by a spoken dialog system in the tourist information domain.
We use the pleasure and arousal dimensions to describe expressivity
at the prosodic word level based on lexical semantics. The dominance
dimension is used to describe expressivity at the utterance
level based on dialog acts. We analyze contrastive (neutral
versus expressive) speech recordings to develop a nonlinear
perturbation model that incorporates the PAD values of a response
message to transform neutral speech into expressive speech.
Two levels of perturbation are implemented: local perturbation
at the prosodic word level and global perturbation at
the utterance level. Perceptual experiments indicate that the
proposed approach can significantly enhance expressivity in
response generation for a spoken dialog system. We also demonstrate
that the approach can be generalized to visual speech prosody
that includes head motions and facial expressions.
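
Purely as an illustration of driving prosody from PAD values (the functional form and coefficients below are invented; the talk's perturbation model is learned from contrastive recordings):

```python
# Hypothetical sketch: scale neutral prosody by PAD values at two levels.
def perturb_prosody(f0, duration, pad_word, pad_utterance, k=(0.2, 0.15, 0.1)):
    """Return expressive (f0, duration) from neutral values.
    pad_word = (pleasure, arousal) at the prosodic-word level;
    pad_utterance = dominance at the utterance level; all in [-1, 1]."""
    p, a = pad_word
    d = pad_utterance
    f0_out = f0 * (1 + k[0] * a + k[1] * p)        # local word-level perturbation
    dur_out = duration * (1 - k[2] * (a + d) / 2)  # global pacing perturbation
    return f0_out, dur_out

print(perturb_prosody(f0=220.0, duration=0.35,
                      pad_word=(0.6, 0.8), pad_utterance=0.4))
```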

Xiaokang Yang, Shanghai Jiao Tong University, China
Biography:
Xiaokang Yang received the B.S. degree from Xiamen University,
Xiamen, China, in 1994, the M.S. degree from the Chinese Academy
of Sciences, Shanghai, China, in 1997, and the Ph.D. degree
from Shanghai Jiao Tong University, Shanghai, China, in 2000.
He is currently a professor and the deputy director of the Institute
of Image Communication and Information Processing, Department
of Electronic Engineering, Shanghai Jiao Tong University, Shanghai,
China. From August 2007 to July 2008, he visited the Institute
for Computer Science, University of Freiburg, Germany, as an
Alexander von Humboldt Research Fellow. From September 2000
to March 2002, he worked as a Research Fellow in Centre for
Signal Processing, Nanyang Technological University, Singapore.
From April 2002 to October 2004, he was a Research Scientist
in the Institute for Infocomm Research (I2R), Singapore. He
has published over 150 refereed papers, and has filed 35 patents.
His current research interests include visual processing and
communication, media analysis and retrieval, and pattern recognition.
He received the National Science Fund for Distinguished Young Scholars
in 2010, the Professorship Award of Shanghai Special Appointment
(Eastern Scholar) in 2008, the Microsoft Young Professorship
Award in 2006, the Best Young Investigator Paper Award at the IS&T/SPIE
International Conference on Visual Communications and Image Processing
(VCIP 2003), and awards from the A*STAR and Tan Kah Kee foundations.
He is currently a member of the Editorial Boards of IEEE Signal Processing
Letters and Digital Signal Processing (Elsevier), a member
of APSIPA, a senior member of IEEE, a member of the Design and Implementation
of Signal Processing Systems (DISPS) Technical Committee of
the IEEE Signal Processing Society and a member of Visual Signal
Processing and Communications (VSPC) Technical Committee of
the IEEE Circuits and Systems Society. He was the special session
chair of Perceptual Visual Processing of IEEE ICME2006. He was
the technical program co-chair of IEEE SiPS2007 and the technical
program co-chair of the 3DTV workshop in conjunction with the 2010 IEEE
International Symposium on Broadband Multimedia Systems and
Broadcasting.
Lectures:
Lecture 1: Visual quality assessment incorporating the knowledge
from physiology, psychophysics and neuroscience
Perceptual quality assessment is an important research topic
of visual signal processing both in its own right and for its
utility in designing various optimal processing and coding algorithms.
With the rapid advances of vision-related research in the broad
areas of physiology and psychology in recent decades, it is
beneficial to incorporate biologically and neurologically
inspired theories and models into the study of visual signal
processing. In this talk, we will first review some classic
biological and psychological models for image and video quality
assessment. After that, some newly developed neurological theories
for human perceptions, especially the free energy principle,
will be introduced. The free energy principle will then be adapted
for the task of image quality assessment. The performance of
image quality metrics based on these biological, psychovisual and
neurological models will be briefly analyzed and discussed.
This talk emphasizes the importance, effectiveness and necessity
of incorporating knowledge from physiology, psychophysics and
neuroscience for the problem of visual signal processing.
Lecture 2: Smart video surveillance system in the context
of Internet-of-Things
Video surveillance networks are increasingly deployed in public
and private facilities, with tremendous potential value for
public safety. It is not feasible to monitor thousands (even
millions) of video sources manually. Huge volumes of surveillance
imagery are often simply directed to mass-storage devices,
to be used only forensically. Online automatic video analysis
represents a new trend toward smart video surveillance networks.
In this talk, we will first overview the challenging issues
on large-scale smart video surveillance networks, and present a
new paradigm of smart video surveillance in the context of IoT.
We then review the enabling techniques, including multimodal video
processors for deeper sensing, high-performance video coding
and transmission schemes for ubiquitous connection, and video analysis
and retrieval techniques for intelligent services.

Thomas Fang Zheng, Tsinghua University, China
Biography:
Dr. Thomas Fang Zheng is a full research professor and Vice
Dean of the Research Institute of Information Technology (RIIT),
Tsinghua University (THU), and Director of the Center for Speech
and Language Technologies (CSLT), RIIT, THU.
Since 1988, he has been working on speech and language processing.
He has been in charge of, or a key participant in,
the R&D of more than 30 national key projects and international
cooperation projects, and has received more than 10 awards
from the State Ministry (Commission) of Education, the State
Ministry (Commission) of Science and Technology, the Beijing
City, and others. So far, he has published over 200 journal
and conference papers, 11 of which (3 as first author) were
designated Excellent Papers, and 11 books (refer to http://cslt.riit.tsinghua.edu.cn/~fzheng
for details). He has been serving in many conferences, journals,
and organizations.
He is an IEEE Senior member, a CCF (China Computer Federation)
Senior Member, an Oriental COCOSDA (Committee for the international
Coordination and Standardization of speech Databases and input/output
Assessment methods) key member, an ISCA member, an APSIPA (Asia-Pacific
Signal and Information Processing Association) member, a council
member of Chinese Information Processing Society of China, a
council member of the Acoustical Society of China, a member
of the Phonetic Association of China, and so on.
He serves as Council Chair of Chinese Corpus Consortium (CCC),
a Steering Committee member and a BoG (Board of Governors) member
of APSIPA, Chair of the Steering Committee of the National Conference
on Man-Machine Speech Communication (NCMMSC) of China, head
of the Voiceprint Recognition (VPR) special topic group of the
Chinese Speech Interactive Technology Standard Group, Vice Director
of Subcommittee 2 on Human Biometrics Application of Technical
Committee 100 on Security Protection Alarm Systems of Standardization
Administration of China (SAC/TC100/SC2), a member of the Artificial
Intelligence and Pattern Recognition Committee of CCF.
He is an associate editor of IEEE Transactions on Audio, Speech,
and Language Processing, a member of editorial board of Speech
Communication, a member of editorial board of APSIPA Transactions
on Signal and Information Processing, an associate editor of
International Journal of Asian Language Processing, and a member
of editorial committee of the Journal of Chinese Information
Processing.
He previously served as co-chair of the Program Committee of the International
Symposium on Chinese Spoken Language Processing (ISCSLP) 2000,
member of Technical Committee of ISCSLP 2000, member of Organization
Committee of Oriental COCOSDA 2000, member of Program Committee
of NCMMSC 2001, member of Scientific Committee of ISCA Tutorial
and Research Workshop on Pronunciation Modeling and Lexicon
Adaptation for Spoken Language Technology 2002, member of Organization
Committee and international advisor of Joint International Conference
of SNLP-O-COCOSDA 2002, General Chair of Oriental COCOSDA 2003,
member of Scientific Committee of International Symposium on
Tonal Aspects of Languages (TAL) 2004, member of Scientific
Committee and Session Chair of ISCSLP 2004, chair of Special
Session on Speaker Recognition in ISCSLP 2006, Program Committee
Chair of NCMMSC 2007, Program Committee Chair of NCMMSC 2009,
Tutorial Co-Chair of APSIPA ASC 2009, Program Committee Chair
of NCMMSC 2011, general co-chair of APSIPA ASC 2011.
He has also been working on the construction of a "Study-Research-Product"
channel, devoting himself to transferring speech and language
technologies to industry, including language learning, embedded
speech recognition, speaker recognition for public security
and telephone banking, location-centered intelligent information
retrieval service, and so on. Now he holds over 10 patents in
various aspects of speech and language technologies.
He has supervised tens of doctoral and master's students,
several of whom received awards, and he has accordingly been named an
Excellent Graduate Supervisor. Over the years, he has received the 1997 Beijing City
Patriotic and Contributing Model Certificate, 1999 National
College Young Teacher (Teaching) Award issued by the Fok Ying
Tung Education Foundation of the Ministry of Education (MOE),
2000 1st Prize of Beijing City College Teaching Achievement
Award, 2001 2nd Prize Beijing City Scientific and Technical
Progress Award, 2007 3rd Prize of Science and Technology Award
of the Ministry of Public Security, and 2009 China "Industry-University-Research
Institute" Collaboration Innovation Award.
Lectures:
Lecture 1: Speaker Recognition Systems: Paradigms and Challenges
Speaker recognition applications are becoming more and more
popular. However, in practical applications many factors may
affect the performance of systems.
In this talk, a general introduction to speaker recognition
will be presented, including definition, applications, category,
and key issues in terms of research and application. Robust
speaker recognition technologies that are useful to speaker
recognition applications will be briefed, including cross channel,
multiple speaker, background noise, emotions, short utterance,
and time-varying (or aging). Recent research on time-varying
robust speaker recognition will be detailed.
Performance degradation due to time variation is a generally acknowledged
phenomenon in speaker recognition, and it is widely assumed that
speaker models should be updated from time to time to maintain
representativeness. However, such updating is costly, user-unfriendly,
and sometimes unrealistic, which hinders the technology
from practical applications. From a pattern recognition point
of view, the time-varying issue in speaker recognition calls for
features that are speaker-specific and as stable as possible
across time-varying sessions. Therefore, after searching for and
analyzing the most stable parts of feature space, a Discrimination-emphasized
Mel-frequency-warping method is proposed. In implementation,
each frequency band is assigned with a discrimination score,
which takes into account both speaker and session information,
and Mel-frequency-warping is done in feature extraction to
emphasize bands with higher scores. Experimental results show
that in the time-varying voiceprint database, this method can
not only improve speaker recognition performance with an EER
reduction of 19.1%, but also alleviate performance degradation
brought by time varying with a reduction of 8.9%.
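
The band-scoring idea can be sketched as follows (an F-ratio style score on synthetic filter-bank energies; the actual method combines speaker and session information and then warps the Mel scale accordingly):

```python
# Sketch: per-band F-ratio (between-speaker over within-speaker variance)
# as a discrimination score for frequency bands.
import numpy as np

rng = np.random.default_rng(0)
n_speakers, n_sessions, n_bands = 10, 5, 20
# Hypothetical log filter-bank energies: speakers differ mainly in bands 5-9.
speaker_means = rng.normal(size=(n_speakers, n_bands))
speaker_means[:, 5:10] *= 4.0
data = speaker_means[:, None, :] + rng.normal(
    scale=1.0, size=(n_speakers, n_sessions, n_bands))

grand = data.mean(axis=(0, 1))
spk = data.mean(axis=1)                                # per-speaker means
between = ((spk - grand) ** 2).mean(axis=0)            # between-speaker variance
within = ((data - spk[:, None, :]) ** 2).mean(axis=(0, 1))  # within-speaker
f_ratio = between / within                             # discrimination score
print(np.argsort(f_ratio)[::-1][:5])                   # most discriminative bands

# Bands with high scores would receive finer Mel-warping resolution, so the
# resulting features emphasize speaker-specific, session-stable regions.
```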
Lecture 2: A domain-specific language understanding framework
with its application in intelligent search engine
Compared with Google search services, vertical search is getting
more and more popular nowadays in China, especially with the rapid
growth of the use of short messages.
Different from general web search engines which basically use
keyword based techniques to provide information retrieval services,
vertical search will narrow its application to a specific domain,
such as travel, hotel, and shopping, so that semantic parsing
technologies can be used to retrieve more specific and deep
knowledge.
In this talk, a user-friendly, easy-to-use SDK named SDS Studio,
which is a domain-specific language understanding framework,
will be introduced, with detailed information on the design
of a robust parser based on a topic forest, a powerful dialogue
manager, a keyword extractor, and a semi-automatic grammar writer
and checker. By using this SDK, a developer can set up
a vertical search application very efficiently and precisely.
A Location-centered Services (LCS) application example will
also be presented. This application was implemented with the
above-mentioned SDS Studio, in collaboration with China
Mobile Co., the largest telecommunications company in China.
The services in this application include food, restaurant, hotel/house-renting,
transportation, sightseeing, entertainment, shopping (digital-products),
and so on, and all such information is related to a location
of interest to the user in the most recent query. A digital map is
used in the background for the location-related services.