Tuesday, December 4, 2012 (10:20 - 12:00)
OS.1-IVM.1 High
Efficiency Video Coding (HEVC)
Session Chairs: Siwei Ma,
Oscar Au, Jiaying Liu Location: Doheny
An
Efficient NEON-based Quarter-pel Interpolation Method for
HEVC
Hao Lv Peking University,
Ronggang Wang Peking University, Jie Wan Peking
University, Huizhu Jia Peking University, Xiaodong
Xie Peking University, Wen Gao Peking University
SIMD (Single Instruction Multiple
Data) instructions have been widely used for digital signal
processing and multimedia applications, especially video codec.
This paper proposes a quarter-pel interpolation acceleration
method for HEVC (High Efficiency Video Coding), implemented
with ARM SIMD instructions. Data-level parallelism
is exploited to use the SIMD capability of NEON effectively.
Experimental results show that the implementation of the proposed
method is approximately five times faster than that of the
HEVC reference software for the HEVC quarter-pel interpolation
operation.
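As a rough illustration of the data-level parallelism the paper exploits (a NumPy sketch, not the authors' NEON code; vectorized array operations stand in for the 8/16-pixel SIMD lanes of NEON), the following applies a fractional-pel filter across a whole pixel row at once. The tap values follow the HEVC luma interpolation filters and should be treated as illustrative here:

```python
import numpy as np

# HEVC-style luma fractional-pel filters (coefficients sum to 64).
HALF_PEL = np.array([-1, 4, -11, 40, 40, -11, 4, -1])   # 8-tap half-pel
QUARTER_PEL = np.array([-1, 4, -10, 58, 17, -5, 1])     # 7-tap quarter-pel

def interp_row(row, taps):
    # Correlate the filter with the row; NumPy processes the entire
    # vector in one call, mirroring NEON's per-instruction parallelism.
    acc = np.convolve(row.astype(np.int32), taps[::-1], mode='valid')
    return np.clip((acc + 32) >> 6, 0, 255).astype(np.uint8)  # round, /64

row = np.random.randint(0, 256, 64).astype(np.uint8)
print(interp_row(row, HALF_PEL)[:8])
```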
Efficient
SIMD Optimization of HEVC Encoder over X86 Processors
Keji Chen Peking University,
Yizhou Duan Peking University, Leju Yan Peking University,
Jun Sun Peking University, Zongming Guo Peking University
High Efficiency Video Coding (HEVC)
is the next-generation video coding standard, currently under development.
Based on the traditional hybrid coding framework, HEVC implements
enhanced tools to improve compression efficiency, at the cost
of a computational load far beyond the capacity of real-time
video applications. In this paper, we focus on the fast implementation
of the HEVC encoder over modern Intel x86 processors. First,
we identify the most time-consuming modules of HM 6.2 encoder,
represented by motion compensation, Hadamard transform, sum
of absolute/squared differences (SAD/SSD) calculation and integer transform.
Then the single-instruction-multiple-data (SIMD) methods are
proposed to optimize the computational performance of these
modules. Experimental results show that the optimized encoder
achieves 56%-85% time savings compared with the HM 6.2 encoder
on an Intel i5-750 processor.
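To make the SIMD idea concrete, here is a minimal, hypothetical sketch of one of the named modules: a vectorized SAD over an 8x8 block. One NumPy expression stands in for the packed subtract/absolute/accumulate sequence (e.g. the SSE2 PSADBW instruction) an x86 implementation would use; this is not the paper's code:

```python
import numpy as np

def sad(block_a, block_b):
    # Sum of absolute differences over a block; the widened subtraction
    # avoids uint8 wraparound, and the whole block is reduced at once.
    diff = block_a.astype(np.int16) - block_b.astype(np.int16)
    return int(np.abs(diff).sum())

a = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
b = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
print(sad(a, b))
```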
Lossy
and Lossless Intra Coding Performance Evaluation: HEVC, H.264/AVC,
JPEG 2000 and JPEG LS
Qi Cai Shanghai Jiao
Tong University, Li Song Shanghai Jiao Tong University,
Guichun Li Santa Clara University, Nam Ling Santa
Clara University
High Efficiency Video Coding (HEVC),
the latest international standard of video coding under development,
has shown a major breakthrough with regard to compression
efficiency. However, most currently published studies evaluate
the overall R-D performance of HEVC in
comparison to the prior H.264/AVC video coding standard. In this
paper, we present extensive rate-distortion performance comparisons
of image coding between the HEVC and previous image and intra-only
video coding standards, including JPEG 2000, JPEG LS and H.264/AVC
intra high profile. In addition, some recently reported performances
of HEVC are also reviewed and compared. The coding simulations
are conducted on a set of recommended video sequences during
the development of the HEVC standard. Experimental results
show that HEVC can offer consistent performance gains over
a wide range of bitrates on natural video sequences as expected.
In addition, we present comparison results for all these
standards in the lossless image coding scenario.
An
Adaptive Frame Complexity Based Rate Quantization Model for
Intra-Frame Rate Control of High Efficiency Video Coding (HEVC)
Lin Sun HKUST, Oscar
C. Au HKUST, Wei Dai HKUST, YuanFang Guo HKUST,
Ruobing Zou HKUST
An efficient and accurate R-Q model
is of great importance for intra-frame rate control in the latest
High Efficiency Video Coding (HEVC) standard. However, previous
methods focus mainly on the gradient-based rate quantization
(R-Q) model for intra bit rate control. In this paper,
we analyze the drawbacks of the gradient based frame complexity
measure when different Quantization Parameters (QPs) are applied.
Then we propose a novel edge based frame complexity measure
using the Gaussian Gradient operator with properly selected
parameters. To address the failure of the gradient-based rate
quantization model under different QPs, we combine these two
complexity measures to propose an adaptive frame complexity
based R-Q model for intra bit rate control. Simulations have
been conducted on HM6.2, the latest reference software of HEVC.
To our knowledge, this may be the first such work for HEVC,
so no classical methods already implemented in HEVC are available
for comparison. We therefore implement the traditional gradient
based rate quantization model and the Cauchy distribution based
rate quantization model in HEVC, and compare the bit rate mismatch
ratios of these methods with that of our proposed method. The
simulation results show that our proposed scheme achieves better
bit rate estimation for intra frames, with up to 33.1% mismatch
ratio reduction compared with the Cauchy distribution based
model and up to 13% compared with the gradient based model.
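A minimal sketch of the edge-based complexity idea, assuming a mean Gaussian-gradient-magnitude measure (the paper's actual operator parameters and the adaptive R-Q model itself are not reproduced; sigma is a free stand-in):

```python
import numpy as np
from scipy import ndimage

def edge_complexity(frame, sigma=1.0):
    # Mean Gaussian gradient magnitude as a frame-complexity proxy;
    # sigma stands in for the paper's "properly selected parameters".
    g = ndimage.gaussian_gradient_magnitude(frame.astype(np.float64), sigma)
    return g.mean()

frame = np.random.randint(0, 256, (64, 64))
print(edge_complexity(frame))
```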
Early
Termination of Coding Unit Splitting for HEVC
Qin Yu Peking University,
Xinfeng Zhang Peking University, Siwei Ma Peking
University, Shiqi Wang Peking University
The emerging high-efficiency video
coding (HEVC) standard employs a new coding structure characterized
by coding unit (CU), prediction unit (PU) and transform unit
(TU). It improves the coding efficiency significantly, but
also introduces great computation complexity on the decision
of optimal CU, PU and TU sizes. To reduce the encoding complexity,
we propose a CU splitting early termination scheme for inter
frame coding. In the proposed scheme, the characteristics
of prediction residuals are utilized to early terminate the
CU splitting. Specifically, the Mean Square Error (MSE) between
the prediction block and the original block for each CU level
is obtained and then compared with an adaptive threshold.
The recursive CU splitting process is early terminated according
to the threshold. Experimental results demonstrate that the
proposed algorithm achieves up to 34.83% total encoding time
reduction with less than 0.25% BD-rate increase on average.
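The early-termination test can be sketched directly from the abstract; the adaptive threshold rule is left as a fixed free parameter here, so this is an illustration rather than the authors' exact scheme:

```python
import numpy as np

def keep_splitting(pred, orig, threshold):
    # MSE between the prediction and original block at the current CU
    # level; recursion stops early once the residual falls below the
    # (here fixed, in the paper adaptive) threshold.
    mse = np.mean((pred.astype(np.float64) - orig.astype(np.float64)) ** 2)
    return mse >= threshold

orig = np.random.randint(0, 256, (32, 32))
pred = orig + np.random.randint(-2, 3, (32, 32))
print(keep_splitting(pred, orig, threshold=25.0))  # False: terminate early
```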
Hardware
Oriented Re-design and Matrix Approximation Analysis for Transform
in High Efficiency Video Coding (HEVC)
Lin Sun HKUST, Oscar
C. Au HKUST, Jiali Li HKUST, Ruobing Zou HKUST,
Wei Dai HKUST
In this paper, we propose an adaptive
truncated k-bit re-configurable approximation (aTra) method
which achieves video coding efficiency similar to the original
transform but replaces all inefficient multiplications
with conformed shift and addition operations, improving
hardware utilization and making simple
hardware implementation and highly efficient pipeline design
possible.
We also propose, possibly for the first time,
three mathematical constraints for hardware matrix approximation
that make the final performance controllable. The proposed method
can achieve regular data flow and massive operation reduction
balancing the rate distortion (RD) performance and data throughput.
Particularly for the current secondary transform, rotational
transform (ROT), we obtain a hardware-friendly ROT through
our proposed method based on one of the constraints. Simulation
results obtained with the HEVC reference software HM1.0 confirm
the validity of our method: it achieves performance similar to
the original transform while substantially outperforming a simple
hardware implementation.
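To illustrate the general idea of replacing multiplications with shift-and-add operations (a simplified, hypothetical stand-in for the paper's truncated k-bit scheme; shift_add_mul is our own name):

```python
def shift_add_mul(x, k_bits, coeff):
    # Approximate x * coeff using only shifts and additions, keeping
    # the k most significant nonzero bits of the coefficient.
    result, kept = 0, 0
    for shift in range(coeff.bit_length() - 1, -1, -1):
        if coeff >> shift & 1:
            result += x << shift      # one shift-and-add per kept bit
            kept += 1
            if kept == k_bits:
                break
    return result

print(shift_add_mul(7, k_bits=2, coeff=83), 7 * 83)  # approximation vs exact
```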
OS.2-IVM.2 Camera-based
Human Centric Computing: Technology and Applications
Session Chairs: Haowei
Liu, Weiyao Lin, YingLi Tian, Yi Wu Location: Beachwood
RGBD
Camera-based Activity Analysis
Chenyang Zhang The City
College of New York, Yingli Tian The City College of
New York
In this paper, we propose a new activity
analysis framework to facilitate the independence of elderly
adults living in the community, reduce risks, and enhance
the quality of life at home by using RGB-D cameras. Our contributions
include two aspects: 1) recognizing five activities related to
falling: standing, falling from standing, falling from
sitting, sitting on a chair, and sitting on the floor. The main analysis
is based on depth information because of its advantages in
handling illumination changes and protecting identity. If
the monitored person is out of the range of the 3D camera, an RGB-based
video analysis module is employed to continue the activity
monitoring. 2) Identifying the monitored person when there are
multiple people in the camera view by combining depth and RGB
information. We have collected a dataset under different lighting
conditions and ranges. Experimental results demonstrate the
effectiveness of the proposed framework.
Recognizing
Object Manipulation Activities Using Depth and Visual Cues
Haowei Liu Intel,
Matthai Philipose Microsoft Research, Ming-Ting Sun
University of Washington
We present the design of an approach
to recognize human activities that involve manipulating objects.
Our proposed approach identifies objects being manipulated
and models high-level tasks being performed accordingly. Realistic
settings for such tasks pose several problems for computer
vision, including sporadic occlusion by subjects, non-frontal
poses, and objects with few local features. We show how size
and segmentation information derived from depth data can address
these challenges using simple and fast techniques. In particular,
we show how to robustly and without supervision find the manipulating
hand, properly detect/recognize objects and properly use the
temporal information to fill in the gaps between sporadically
detected objects, all through careful inclusion of depth cues.
We evaluate our approach on a challenging dataset of 12 kitchen
tasks that involve 24 objects performed by 2 subjects. The
entire system yields 82%/84% precision (74%/83% recall) for
task/object recognition. Our techniques significantly outperform
the state-of-the-art in activity/object recognition rates.
Virtual
Mirror By Fusing Multiple RGB-D Cameras
Ju Shen University of
Kentucky, Sen-ching Samson Cheung University of Kentucky,
Jian Zhao Microsoft
The mirror is possibly the most common
optical device in our everyday life. Rendering a virtual mirror
using a joint camera-display system has a wide range of applications
from cosmetics to medicine. Existing works focus primarily
on simple modification of the mirror images of body parts
and provide no or limited range of viewpoint dependent rendering.
In this paper, we propose a framework for rendering mirror
images from a virtual mirror based on 3D point clouds and
color texture captured from a network of structured-light
RGB-D cameras. We validate our models by comparing the results
with a real mirror. Commodity structured-light cameras often
have missing and erroneous depth data which directly affect
the quality of the rendering. We address this problem via
a novel probabilistic model that accurately separates foreground
objects from background scene before correcting the erroneous
depth data. We experimentally demonstrate that our depth correction
algorithm outperforms other state-of-the-art techniques.
Periodic
Motion Detection With ROI-Based Similarity Measure And Extrema-Based
Reference-Frame Selection
Xintong Han Shanghai
Jiao Tong University, Gaojian Li Fudan University,
Weiyao Lin Shanghai Jiao Tong University, Xiaoqiong
Su Shanghai Jiao Tong University, Hongxiang Li University
of Louisville, Hua Yang Shanghai Jiao Tong University,
Hui Wei Fudan University
This paper presents a new algorithm
for detecting and analyzing the periodic motions in video
sequences. Different from the previous methods which detect
periodic motions from the entire frame, we propose a convex-hull-based
process to automatically determine the regions of interest
(ROI) of the motions and utilize an ROI-based similarity measure
to detect the motion periods. Furthermore, we also propose
an extrema-based method to select the optimal reference frame
for further improving the periodic detection performance.
Our proposed algorithm not only effectively detects motion
periods with both constant and variable period lengths, but
also has a clear advantage when handling periodic motions
with slight movements. Experimental results demonstrate the
effectiveness of our proposed method.
Abnormal
Crowd Behavior Detection Based on Local Pressure Model
Hua Yang Shanghai Jiao
Tong University, Yihua Cao Shanghai Jiao Tong University,
Shuang Wu Shanghai Jiao Tong University, Weiyao Lin
Shanghai Jiao Tong University, Shibao Zheng Shanghai
Jiao Tong University, Zhenghua Yu Shanghai Jiao Tong
University
Abnormal crowd behavior detection
is an important issue in crowd surveillance. In this paper,
a novel local pressure model is proposed to detect the abnormality
in large-scale crowd scenes based on local crowd characteristics.
These characteristics include the local density and velocity,
which are significant parameters for measuring the dynamics
of a crowd. A grid of particles is placed over the image to
reduce the computation of the local crowd parameters. Local
pressure is generated by applying these local characteristics
in the pressure model. A histogram is utilized to extract the statistical
properties of the magnitude and direction of the pressure. The
crowd feature vector of the whole frame is obtained through
the analysis of the Histogram of Oriented Pressure (HOP). An SVM
and a median filter are then adopted to detect anomalies.
The performance of the proposed method is evaluated on publicly
available datasets from UMN. The experimental results show
that the proposed method achieves higher accuracy than
previous methods in detecting abnormal crowd behavior.
Combining
RGB and Depth Features for Human Activity Recognition
Yang Zhao UESTC,
Zicheng Liu Microsoft, Lu Yang UESTC, Hong Cheng
UESTC
We study the problem of human activity
recognition from RGB-D sensors when the skeletons are not
available. The skeleton tracking in Kinect SDK works well
when the human subject is facing the camera and there are
no occlusions. In surveillance or senior home monitoring scenarios,
however, the camera is usually mounted higher than human subjects
and there may be serious occlusions. Consequently, the skeleton
tracking may not work well. In RGB image based activity recognition,
a popular approach that can handle cluttered background and
partial occlusions is the interest point based approach. When
both RGB and depth channels are available, one can still use
the interest point based approach. But there are questions
on whether we should detect interest points from RGB channel
or from depth channel, and what descriptor to use for the
depth channel. The goal of this paper is to compare the performances
of different ways of extracting interest points. In addition,
we have developed a depth map based descriptor which outperforms
the HOGHOF descriptor on the depth video. We show that the
best performance is achieved when we extract interest points
from RGB channel, and combine the RGB based descriptor and
depth map based descriptor.
OS.3-SLA.1 Speech
Processing and Its Applications
Session Chair: Jen-Tzung
Chien Location: Runyon
Open
Answer Scoring for S-CAT Automated Speaking Test System Using
Support Vector Regression
Yutaka Ono Chiba University,
Misuzu Otake Chiba University, Takahiro Shinozaki Chiba
University, Ryuichi Nisimura Wakayama University,
Takeshi Yamada University of Tsukuba,
Kenkichi Ishizuka University of Tsukuba,
Yasuo Horiuchi Chiba University,
Shingo Kuroiwa Chiba University, Shingo Imai University
of Tsukuba
We are developing the S-CAT computer test
system, which will be the first automated adaptive speaking
test for Japanese. The speaking ability of examinees is scored
using speech processing techniques without human raters. By
using computers for the scoring, it is possible to largely
reduce the scoring cost and provide a convenient means for
language learners to evaluate their learning status. While
the S-CAT test has several categories of question items, the open
answer question is technically the most challenging one, since
examinees talk freely about a given topic or argue a point
about given material. For this problem, we propose using
support vector regression (SVR) with various features. Some
of the features rely on speech recognition hypotheses and
others do not. SVR is more robust than multiple regression,
and the best result was obtained when 390 dimensional features
that combine everything were used. The correlation coefficients
between human rated and SVR estimated scores were 0.878, 0.847,
0.853, and 0.872 for fluency, accuracy, content, and richness
measures, respectively.
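A minimal sketch of the scoring stage, assuming scikit-learn's SVR and random stand-in features (the real system uses 390-dimensional features derived partly from recognition hypotheses; nothing here is the authors' code):

```python
import numpy as np
from sklearn.svm import SVR

# Train SVR to map utterance-level features to human scores, then
# score unseen answers; data and dimensions are illustrative.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))      # feature vectors per answer
y_train = rng.uniform(1, 5, size=200)     # human ratings (e.g. fluency)
model = SVR(kernel='rbf', C=1.0).fit(X_train, y_train)
print(model.predict(rng.normal(size=(3, 10))))  # estimated scores
```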
Singing
Voice Conversion Method Based on Many-to-Many Eigenvoice Conversion
and Training Data Generation Using a Singing-to-Singing Synthesis
System
Hironori Doi Nara Institute
of Science and Technology, Tomoki Toda Nara Institute
of Science and Technology, Tomoyasu Nakano National
Institute of Advanced Industrial Science and Technology,
Masataka Goto National Institute of Advanced Industrial
Science and Technology, Satoshi Nakamura Nara Institute
of Science and Technology
The voice quality (identity) of singing
voices is usually fixed in each singer. To overcome this limitation
and enable singers to freely change their voice quality using
signal-processing technologies, we propose a singing voice
conversion method based on many-to-many eigenvoice conversion
(EVC) that can convert the voice quality of an arbitrary source
singer into that of another arbitrary target singer. Previous
EVC-based methods required parallel data consisting of song
pairs of a single reference singer and many prestored target
singers for training a voice conversion model, but it was
difficult to record such data. Our proposed method therefore
uses a singing-to-singing synthesis system called VocaListener
to generate parallel data by imitating singing voices of many
prestored target singers with the system’s singing voices.
Experimental results show that our method succeeded in enabling
people to sing a song with the voice quality of a different
target singer even if only an extremely small amount of the
target singing voice is available.
Response
Generation based on Statistical Machine Translation for Speech-Oriented
Guidance System
Kazuma Nishimura Nara
Institute of Science and Technology, Hiromichi Kawanami
Nara Institute of Science and Technology, Hiroshi Saruwatari
Nara Institute of Science and Technology, Kiyohiro
Shikano Nara Institute of Science and Technology
Example-based response generation
is a robust and practical approach for a real-environment
information guidance system. However, this framework cannot
reflect differences in nuance, because the set of answer sentences
is fixed beforehand. To overcome this issue, we have proposed
response generation using a statistical machine translation
technique. In this paper, we make use of N-best speech recognition
candidates instead of manual transcription used in our previous
study. As a result, the generation rate of appropriate response
sentences was improved by using multiple recognition hypotheses.
Multi-Stream
Acoustic Model Adaptation for Noisy Speech Recognition
Tamura Satoshi Gifu
University, Hayamizu Satoru Gifu University
In this paper, a multi-stream-based
model adaptation method is proposed for speech recognition
in noisy or real environments. The proposed scheme comes from
our experience about audio-visual model adaptation. At first,
an acoustic feature vector is divided into several vectors
(e.g. static, first-order and second-order dynamic vectors),
namely streams. During adaptation, a stream with relatively
high recognition performance is updated using that stream only,
whereas a stream with lower recognition power is adapted
using all the streams that are superior to it. In
order to evaluate the proposed technique, recognition experiments
were conducted using every stream, and adaptation experiments
were also investigated for various combinations of
streams.
Design
of a Pitch Quantization and Pitch Correction System for Real-Time
Music Effects Signal Processing
Corey Cheng MIT
This paper describes the design of
a practical, real-time pitch quantization system intended
for digital musical effects signal processing. Like most modern
pitch quantizers, this system can be used to pitch correct
and even reharmonize out-of-tune singing to alternative musical
scales simultaneously (e.g. major, minor, diminished, etc.).
Pitch quantization can also be intentionally exaggerated to
produce distinctive effects processing, which results in an
emotionally inflected and/or “robotic” sound. This system
uses intentionally simple signal processing algorithms which
make real-time processing possible on constrained devices.
In particular, we employ tools such as an octave resolver
and range limiter, grain boundary expansion and contraction,
and transient detection to enhance the performance of our
system.
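The core snapping step of a pitch quantizer can be sketched as follows; the scale table, tonic, and function names are illustrative, not taken from the paper:

```python
import numpy as np

A4 = 440.0
MAJOR = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets of a major scale

def quantize_pitch(f0_hz, scale=MAJOR, tonic_midi=60):
    # Snap a detected F0 to the nearest note of the target scale; a
    # hard snap with no smoothing gives the "robotic" effect.
    midi = 69 + 12 * np.log2(f0_hz / A4)            # Hz -> fractional MIDI
    candidates = [tonic_midi + 12 * octave + s
                  for octave in range(-4, 5) for s in scale]
    nearest = min(candidates, key=lambda m: abs(m - midi))
    return A4 * 2 ** ((nearest - 69) / 12)          # MIDI -> Hz

print(quantize_pitch(450.0))  # slightly sharp A4 snapped back to 440 Hz
```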
OS.4-SPS.1 Complexity-efficient
Design and Implementation for Signal Processing Systems
Session Chairs: Shang-Ho
(Lawrence) Tsai, Chih-Hung Kuo Location: Laurel
A
Highly Parallel Design for Irregular LDPC Decoding on GPGPUs
Tsou-Han Chiu National
Chiao Tung University, Hsien-Kai Kuo National Chiao
Tung University, Bo-Cheng Charles Lai National Chiao
Tung University
Low-Density Parity-Check (LDPC) code
is a powerful error correcting code. It has been widely adopted
by many communication systems. Finding a fast and efficient
design of LDPC has been an active research area. This paper
proposes a high performance design for irregular LDPC decoding
on a general purpose graphic processing unit (GPGPU). A GPGPU
is a many-core architecture which enables massively parallel
computing. In this paper, a high degree of computation parallelism
has been exposed by decoding multiple LDPC code-words concurrently.
An innovative data structure is proposed to more efficiently
leverage memory coalescing for the irregular data accesses
of LDPC decoding. Data spatial locality is maximized by keeping
more reusable data within the on-chip cache of a GPGPU. The
data communication overhead between a host and a GPGPU is
minimized through a single word copy for the convergence check.
The experimental results show that the proposed design can achieve
up to 55.68X runtime improvement, when compared with a sequential
LDPC program on a CPU.
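The multi-codeword parallelism can be pictured with a small NumPy sketch: LLRs for W codewords are stored contiguously per variable node, so one vectorized check-node (min-sum) update touches the same index across all codewords, the access pattern that GPU memory coalescing rewards. The data layout and the min-sum variant here are our assumptions, not the paper's exact structure:

```python
import numpy as np

W, n = 8, 6                       # codewords decoded together, code length
llr = np.random.randn(W, n)       # channel LLRs, one row per codeword
check = [0, 2, 4]                 # variable nodes in one parity check

def min_sum_check_update(llr, check):
    sub = llr[:, check]                        # (W, deg) gathered columns
    # Product of the *other* signs: total sign product times own sign.
    sign = np.prod(np.sign(sub), axis=1, keepdims=True) * np.sign(sub)
    mags = np.abs(sub)
    two_smallest = np.partition(mags, 1, axis=1)   # per-row two smallest
    # Min over the others: second-smallest where we hold the minimum.
    out = np.where(mags == two_smallest[:, :1],
                   two_smallest[:, 1:2], two_smallest[:, :1])
    return sign * out                          # extrinsic msgs, all W at once

print(min_sum_check_update(llr, check).shape)  # (8, 3)
```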
Vehicular
Signal Transmission Using Power Line Communications
Yen-Chang Chen National
Chiao Tung University, Shang-Ho Tsai National Chiao
Tung University, Kai-jiun Yang National Chiao Tung University,
Ping-Fan Ho Industrial Technology Research Institute,
Kuo-Feng Tseng National Chiao Tung University,
Ho-Shun Chen National Chiao Tung University
We propose a power line communication
(PLC) system for transmitting control signals in a vehicle through
its internal power lines, so that the internal wiring can be
reduced and the vehicle made lighter. The signals from
different devices are multiplexed and modulated onto the power
lines by the transmitter. In the receiver, the noise
within the power lines is first filtered out. Afterwards the multiplexed
signals are selectively extracted by specific codes which
are minimally correlated. Finally the control signals are
restored from the error-control coded bits which make the
information more robust. The maximum data rate of the chip
is 50 kbps, and the die area is 3.74 mm^2 using a TSMC 0.18
um standard cell library with power consumption 22.49 mW.
High-Performance
Turbo-MIMO System Design with Iterative Soft-Detection and
Decoding
Der-Wei Yang National
Cheng Kung University, Jing-Shiun Lin National Cheng
Kung University, Shih-Hao Fang National Cheng Kung
University, Chia-Fen Lin National
Cheng Kung University, Ming-Der Shieh
National Cheng Kung University
In turbo-multiple-input multiple-output
(Turbo-MIMO) systems, the soft-output MIMO detector provides
a priori information to the turbo decoder. Unfortunately,
over Rayleigh fading channels, the resulting unreliable a priori
information degrades system performance.
In this paper, we propose an iterative method to acquire
highly reliable a priori information from the MIMO soft-detector
in Turbo-MIMO systems. Similar to the conventional updating
rules in the turbo decoding algorithm, we utilize the extrinsic
information from the turbo decoder to update the log-likelihood
ratios (LLRs) based on log-MAP algorithm in the list sphere
decoding (LSD) algorithm. To reduce the overall computational
complexity, different iteration profiles are also discussed.
Simulation results show that the proposed Turbo-MIMO system
can significantly improve the system performance compared
to that of the conventional Turbo-MIMO system.
Hardware
Architecture Design of Hybrid Distributed Video coding with
Frame Level Coding Mode Selection
Chieh-Chuan Chiu National
Taiwan University, Hsin-Fang Wu National Taiwan University,
Shao-Yi Chien National Taiwan University,
Chia-han Lee Academia Sinica,
V. Srinivasa Somayazulu Intel Corporation,
Yen-Kuang Chen Intel Corporation
Distributed video coding (DVC), a
new video coding paradigm based on Slepian-Wolf and Wyner-Ziv
theories, is a promising solution for implementing low-power
and low-cost distributed wireless video sensors since most
of the computation load is moved from the encoder to the decoder.
In this paper, the hardware architecture design of an efficient
distributed video coding system, hybrid DVC with frame-level
coding mode selection, is proposed. With the fully block-pipelined
architecture, coding mode pre-decision, and specially-designed
LDPC engine, the proposed hardware is an efficient solution
for distributed video sensors with high rate-distortion performance.
Hardware-Efficient
EVD Processor Architecture in FastICA for Epileptic Seizure
Detection
Yi-Hsin Shih National
Chiao Tung University, Tsan-Jieh Chen National Chiao
Tung University, Chia-Hsiang Yang National Chiao Tung
University, Herming Chiueh National
Chiao Tung University
Independent component analysis (ICA)
is a key signal processing technique to improve the detection
accuracy of epileptic seizures. It separates artifacts and
epileptic signals, which facilitates the succeeding signal
processing for seizure detection. FastICA is an efficient
algorithm to compute ICA through proper pre-processing. In
the preprocessing stage of the FastICA, eigenvalue decomposition
(EVD) is applied to reduce the convergence time of iterative
calculation of weights for demultiplexing received multi-channel
signals. To calculate EVD efficiently, the Jacobi method is
preferable since an array structure is proposed to decompose
the matrix efficiently by leveraging Givens rotations. Multiple
diagonal and off-diagonal processing elements run in parallel
to calculate EVD. The micro-rotations can be realized efficiently
by coordinate rotation digital computer (CORDIC), which calculates
trigonometric functions using only addition, shift, and table
lookup without dedicated multipliers. In this work, an approximate
Jacobi is adopted instead to reduce the number of iterations
significantly. Optimized rotation angles can be calculated
efficiently using shift-add operations for multiplications
with coefficients of power of 2 in the diagonal processing
elements. Normalization operation in the original mathematical
formulation can be omitted due to signal re-scaling in both
diagonal and off-diagonal processing elements. The number
of processing cycles is reduced by 6 times for each sweep
due to the reduced number of pipelining stages in the critical
path. The approximate Jacobi method provides a 6x speedup
(185-252 cycles instead of 1440 cycles) for a 6-channel EVD.
An overall 77.2% area reduction is achieved due to arithmetic
simplification and hardware reduction. The hardware architecture
is verified by testing with human electroencephalogram (EEG)
signals from the Freiburg database.
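For reference, a plain (exact-angle) cyclic Jacobi EVD looks like the sketch below; in the paper's hardware each Givens rotation would instead be an approximate CORDIC-style shift-add rotation, which this floating-point illustration does not attempt to model:

```python
import numpy as np

def jacobi_evd(A, sweeps=10):
    # Cyclic Jacobi: repeatedly zero one off-diagonal pair of a
    # symmetric matrix with a Givens rotation.
    A = A.astype(np.float64).copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-12:
                    continue
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                G = np.eye(n)
                G[p, p] = G[q, q] = c
                G[p, q], G[q, p] = s, -s
                A = G.T @ A @ G                  # annihilates A[p, q]
                V = V @ G
    return np.diag(A), V                         # eigenvalues, eigenvectors

M = np.random.randn(6, 6); M = (M + M.T) / 2     # 6-channel covariance stand-in
vals, vecs = jacobi_evd(M)
print(np.allclose(np.sort(vals), np.linalg.eigvalsh(M)))
```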
Tracking
Performance Analysis of the Set-Membership NLMS Adaptive Filtering
Algorithm
Reza Arablouei University
of South Australia, Kutluyil Dogancay University of
South Australia
In this paper, we analyze the tracking
performance of the set-membership normalized least mean squares
(SM-NLMS) adaptive filtering algorithm using the energy conservation
argument. The analysis leads to a nonlinear equation whose
solution gives the steady-state mean squared error (MSE) of
the SM-NLMS algorithm in a nonstationary environment. We prove
that there is always a unique positive solution for this equation.
The results predicted by the analysis show good agreement
with the simulation experiments.
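For concreteness, a standard SM-NLMS update loop is sketched below on a toy system-identification task (an illustration of the analyzed algorithm, not the paper's analysis code); note how the filter updates only when the error magnitude exceeds the bound gamma:

```python
import numpy as np

def sm_nlms(x, d, order=8, gamma=0.05, eps=1e-8):
    # Update only when the a priori error leaves the +/- gamma bound;
    # the data-dependent step size pulls the error back onto the bound.
    w = np.zeros(order)
    updates = 0
    for n in range(order, len(x)):
        u = x[n - order + 1:n + 1][::-1]   # regressor [x(n),...,x(n-order+1)]
        e = d[n] - w @ u                   # a priori error
        if abs(e) > gamma:
            mu = 1.0 - gamma / abs(e)      # SM-NLMS step size
            w += mu * e * u / (u @ u + eps)
            updates += 1
    return w, updates

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
h = rng.normal(size=8)                     # unknown system
d = np.convolve(x, h)[:2000] + 0.01 * rng.normal(size=2000)
w, k = sm_nlms(x, d)
print(k, np.max(np.abs(w - h)))            # sparse updates, small error
```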
OS.5-IVM.3 Recent
Topics in Computer Vision and Image Processing
Session Chair: Salina Abdul
Samad Location: Trousdale Estates
Facial
Image Prediction Using Exemplar-based Algorithm and Non-negative
Matrix Factorization
Hsuan-Ting Chang National
Yunlin University of Science and Technology, Hsiao-wei
Peng National Yunlin University of Science and Technology
Human aging face prediction is a popular
research topic because of its various useful applications,
such as security systems and missing-person search systems.
In this study, we propose an exemplar-based algorithm that
takes the environment of human growth into account. Moreover, both
non-negative matrix factorization and linear interpolation
methods are used to perform the prediction for six facial
ROIs. In the proposed method, we employ family images,
in which each family member has more than one image at different
ages, and we predict the image ROIs to replace the original
ones to obtain the prediction result. Because it is difficult
to collect the facial image ROIs of families at various ages,
we also use databases from the Internet. In the experimental
results, the correlation coefficient between the real and
predicted images can reach 0.82. However, factors such
as expression and lighting in the reference images can result
in a lower correlation coefficient.
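A minimal sketch of the NMF building block, using scikit-learn and random stand-in ROI data (the exemplar selection and age interpolation steps of the paper are not reproduced):

```python
import numpy as np
from sklearn.decomposition import NMF

# Learn non-negative basis images from facial-ROI training data, then
# reconstruct a patch in that basis; data and sizes are illustrative.
rng = np.random.default_rng(0)
faces = rng.random((100, 16 * 16))           # 100 vectorized ROI patches
model = NMF(n_components=12, init='nndsvda', max_iter=500)
W = model.fit_transform(faces)               # per-image coefficients
H = model.components_                        # non-negative basis images
recon = W @ H
print(np.corrcoef(faces[0], recon[0])[0, 1]) # reconstruction similarity
```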
Video
Prediction Block Structure and the Emerging High Efficiency
Video Coding Standard
Shan Liu MediaTek USA
Inc., Shawmin Lei MediaTek USA Inc.
In the ISO/IEC 14496-10 | ITU-T H.264
advanced video coding (AVC) standard, the prediction block
sizes can be 16x16, 8x8 and 4x4 for Intra prediction; 16x16,
16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 for Inter prediction. In
the first HEVC test model (HM1.0), each 2Nx2N Intra CU may
consist of either one 2Nx2N prediction unit (PU) or four NxN
prediction units; while each 2Nx2N Inter CU may consist of
one 2Nx2N PU, two 2NxN or Nx2N PUs or four NxN PUs. Since
then, some investigations have been made into the prediction
block structure based on HM1.0, including the removal of NxN
prediction partition mode for all coding units (CU) except
the smallest CU, and the removal of 4x4 Inter prediction.
Experimental results show that with both these simplifications,
the encoder complexity can be greatly reduced at minimal
cost in coding efficiency. Therefore both of these suggestions
were adopted by the HEVC standard.
Classification
of Beverages Using Electronic Nose and Machine Vision Systems
Mazlina Mamat Universiti
Kebangsaan, Salina Abdul Samad Universiti Kebangsaan
In this work, the classification of
beverages was conducted using three approaches: by using the
electronic nose alone, by using the machine vision alone and
by using the combination of electronic nose and machine vision.
A total of two hundred and twenty eight beverages from fifteen
different brands were used in this classification problem.
A supervised Support Vector Machine was used to classify beverages
according to their brand. Results show that the electronic
nose alone and the machine vision alone were able to classify
73.7% and 92.9% of the beverages correctly, respectively. When combining
the electronic nose and the machine vision, the classification
accuracy increased to 96.6%. Based on the results, it
can be concluded that the combination of the electronic nose
and the machine vision is able to extract more information
from the sample, hence improving the classification accuracy.
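The fusion step amounts to concatenating the two feature vectors before a single SVM; a hedged sketch with stand-in data follows (the feature dimensions are invented; only the 228-sample/15-brand counts come from the abstract):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
nose = rng.normal(size=(228, 6))        # gas-sensor responses per sample
vision = rng.normal(size=(228, 12))     # color/shape features per sample
labels = rng.integers(0, 15, size=228)  # 15 brands
X = np.hstack([nose, vision])           # sensor fusion by concatenation
clf = SVC(kernel='rbf').fit(X, labels)  # one supervised SVM over brands
print(clf.predict(X[:5]))
```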
A
Subjective Comparison of Depth Image Based Rendering and Frame
Compatible Stereo for Low Bit Rate 3D Video Coding
Peshala Pahalawatta DDD
Inc., Kevin Stec DDD Inc.
Frame compatible stereo video delivery
has become a de facto standard because it enables the delivery
of stereoscopic information over legacy devices that can currently
only decode a 2D signal. At the cost of reducing spatial resolution
of the images, frame compatible delivery also reduces the
bandwidth requirements for signaling stereoscopic 3D video.
The new generations of playback devices are less constrained
than legacy devices in that they are increasingly becoming
capable of decoding multiple video streams in parallel. Bandwidth,
however, remains an issue especially in mobile wireless and
real-time streaming environments. This paper explores the
use of texture and depth data to render 3D views, and compares
the bandwidth requirements of the depth based rendering method
to frame compatible stereo. Some interesting subjective observations
that affect the comparison are discussed along with the results
of a formal subjective evaluation. The relative merits and
drawbacks of each method are detailed both in terms of compression
efficiency and overall quality of experience.
Joint
Perceptually-Based Intra Prediction and Quantization for HEVC
Guoxin Jin Northwestern
University, Robert Cohen MERL, Anthony Vetro MERL,
Huifang Sun MERL
This paper proposes a new coding scheme
which jointly applies perceptual quality metrics to prediction,
quantization and rate-distortion optimization (RDO) within
the High Efficiency Video Coding (HEVC) framework. A new prediction
approach which uses template matching is introduced. The template
matching uses a structural similarity metric (SSIM) and a
Just-Noticeable Distortion (JND) model. The matched candidates
are linearly filtered to generate a prediction. We also modify
the JND model and use Supra-threshold Distortion (StD) as
the distortion measurement in RDO. Experimental results showing
improvements for coding textured areas are presented as well.
OS.6-BioSPS.1 Brain-body
Physiological Networks Connectivity and Synchrony Analysis
Session Chairs: Tomasz
M. Rutkowski, Zbigniew R. Struzik Location: Franklin Hills
Linear
and Nonlinear Features for Automatic Artifacts Removal from
MEG Data Based on ICA
Montri Phothisonothai University
of Tokyo, Hiroyuki Tsubomi University of Tokyo,
Aki Kondo University of Tokyo, Yoshio Minabe Kanazawa
University, Mitsuru Kikuchi University
of Tokyo, Katsumi Watanabe University
of Tokyo
This paper presents an automatic
method for removing physiological artifacts from magnetoencephalogram
(MEG) data based on independent component analysis (ICA).
The proposed features including kurtosis (K), probability
density (PD), central moment of frequency (CMoF), spectral
entropy (SpecEn), and fractal dimension (FD) were used to
identify the artifactual components such as cardiac, ocular,
muscular, and sudden high-amplitude changes. For an ocular
artifact, the frontal head region (FHR) thresholding was proposed.
In this paper, the ICA method was based on the FastICA algorithm
to decompose the underlying sources in the MEG data. Then, the
corresponding ICs responsible for artifacts were identified
by means of appropriate parameters. Comparison between MEG
and artifactual components showed statistical significance
at p < 0.001 for all features. The output artifact-free MEG
waveforms showed the applicability of the proposed method
in removing artifactual components.
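Two of the listed component features can be sketched as follows, assuming SciPy's kurtosis and Welch spectrum (the flagging thresholds and the remaining features are not reproduced; the normalization choices are our own):

```python
import numpy as np
from scipy.stats import kurtosis
from scipy.signal import welch

def artifact_features(ic, fs=1000.0):
    # Kurtosis is high for spiky ocular/cardiac components; normalized
    # spectral entropy is low for narrowband artifactual components.
    f, pxx = welch(ic, fs=fs, nperseg=256)
    p = pxx / pxx.sum()
    spec_entropy = -np.sum(p * np.log2(p + 1e-12)) / np.log2(len(p))
    return kurtosis(ic), spec_entropy

ic = np.random.randn(4000)              # stand-in independent component
print(artifact_features(ic))
```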
Sonification
of Muscular Activity in Human Movements using the Temporal
Patterns in EMG
Masaki Matsubara University
of Tsukuba, Hiroko Terasawa University of Tsukuba,
Hideki Kadone University of Tsukuba, Kenji Suzuki University
of Tsukuba/JAT, Shoji Makino University of Tsukuba
Biofeedback is currently considered
as an effective method for medical rehabilitation. It aims
to increase the awareness and recognition of the body's
motion by feeding back the physiological information to the
patients in real time. Our goal is to create an auditory biofeedback
that aids understanding of the dynamic motion involving multiple
muscular parts, with the ultimate aim of clinical rehabilitation
use. In this paper, we report the development of a real-time
sonification system using EMG, and we propose three sonification
methods that represent the data in pitch, timbre, and the
combination of polyphonic timbre and loudness. Our user evaluation
test involves the task of timing and order identification
and a questionnaire about the subjective comprehensibility
and the preferences, leading to a discussion of the task performance
and usability. The results show that the subjects can understand
the order of the muscular activities at 63.7% accuracy on
average, and the sonification method with polyphonic timbre
and loudness provides an 85.2% accuracy score on average,
showing its effectiveness. Regarding the preference of the
sound design, we found that there is not a direct relationship
between the task performance accuracy and the preference of
sound in the proposed implementations.
Higher-order
PLS for Classification of ERPs with Application to BCIs
Qibin Zhao Brain Science
Institute, RIKEN, Liqing Zhang Shanghai Jiao Tong University,
Cao Jianting Saitama Institute of Technology,
Andrzej Cichocki Brain Science Institute, RIKEN
The EEG signals recorded during Brain
Computer Interfaces (BCIs) are naturally represented by multi-way
arrays in spatial, temporal, and frequency domains. In order
to effectively extract the underlying components from brain
activities which correspond to the specific mental state,
we propose the higher-order PLS approach to find the latent
variables related to the target labels and then make classification
based on latent variables. To this end, the low-dimensional
latent space can be optimized by using the higher-order SVD
on a cross-product tensor, and the latent variables are considered
as shared components between observed data and target output.
The EEG signals recorded under the P300-type affective BCI
paradigm were used to demonstrate the effectiveness of our
new approach.
Spatial
Auditory BCI Paradigm Utilizing N200 and P300 Responses
Zhenyu Cai University
of Tsukuba, Tomek Rutkowski University of Tsukuba
The paper presents our recent results
obtained with a new auditory spatial localization based BCI
paradigm in which the ERP shape differences at early latencies
are employed to enhance the traditional P300 responses in
an oddball experimental setting. The concept relies on the
recent results in auditory neuroscience showing a possibility
to differentiate early anterior contralateral responses to
attended spatial sources. Contemporary stimuli-driven BCI
paradigms benefit mostly from the P300 ERP latencies in
so-called aha-response settings. We show the further enhancement
of the classification results in spatial auditory paradigms
by incorporating the N200 latencies, which differentiate the
brain responses to lateral, in relation to the subject head,
sound locations in the auditory space. The results reveal
that those early spatial auditory ERPs boost online classification
results of the BCI application. The online BCI experiments
with the multi-command BCI prototype support our research
hypothesis with higher classification results and
improved information-transfer rates.
EEG
Steady-State Synchrony Patterns Sonification
Teruaki Kaniwa University
of Tsukuba, Masaki Matsubara University of Tsukuba,
Tomek Rutkowski University of Tsukuba, Hiroko Terasawa
University of Tsukuba
This paper describes an application
of a multichannel EEG sonification approach. We present results
obtained with a multichannel sonification method tested with
steady-state EEG responses. We elucidate brain synchrony patterns
in the auditory domain by utilizing an EEG coherence measure.
The transitions in the synchrony patterns are represented
as timbre (i.e. spectro-temporal) deviation and as spatial
movement of the sound cluster. Our final sonification evaluation
experiment with six subjects confirms the validity of the
proposed brain synchrony elucidation approach.
OS.7-WCN.1 Cooperative
and Coordinated Wireless Communications
Session Chairs: Sau-Hsuan
Wu, Wan-Jen Huang, Feng-Tsun Chien Location: Whitley Heights
Ergodic
Mutual Information of Amplify-and-Forward MIMO Relay Channels
with LOS Components
Chung-Kai Hsu National
Sun Yat-sen University, Chao-Kai Wen National Sun Yat-sen
University, Jung-Chieh Chen National Kaohsiung
Normal University, Wan-Jen Huang National Sun Yat-sen University,
Pangan Ting Industrial Technology Research Institute
In this paper, we address the ergodic
mutual information of amplify-and-forward multiple-input multiple-output
two-hop relay channels. In these channels, the source terminal,
relay terminal, and destination terminal are equipped with
a number of correlated antennas, and there is a line-of-sight
component on each link. These models have wide applications
in the field of machine-type communication devices, such as
meters and sensors. Given channel matrices with Gaussian entries,
the mean of the mutual information is derived in the large-system
regime, in which the numbers of antennas at the transmitter
and the receiver go to infinity with a fixed ratio. Simulation
results demonstrate that even for a moderate number of antennas
at each end, the proposed analytical results are indistinguishable
from those obtained by Monte Carlo simulations. In
addition, the good approximation property holds even if the
entries of the channel matrices are non-Gaussian.
Relay
Selection in Multiuser Two-Way Cooperative Relaying Systems
Yi-Ru Liao National Chiao
Tung University, Feng-Tsun Chien National Chiao Tung
University, Min-Kuan Chang National Chung Hsing University
In this paper, we study a relay selection
(RS) problem in multi-user two-way cooperative relaying systems.
We consider a more practical scenario in which multiple users,
multiple relays and a single destination are involved in the
two-way network. In this paper, the code division multiple
access (CDMA) system with non-orthogonal spreading sequences
is employed to handle the multiuser interference. Relay selection
based on maximizing the SINR of the worse link is proposed
in this research. In addition, to mitigate the interference,
we consider the design of a linear filter at each relay such
that the minimum SINR of the worst link in the two-way transmission
is maximized. The result shows that the linear filter is similar
to a minimum mean-square error (MMSE) detector. Furthermore,
we simulate the proposed scheme with several different parameters
such as the numbers of users and relays, and the length of
spreading sequences. Also, we compare the proposed RS method
with random RS approach, and the result shows that our proposed
method has better performance in terms of the bit error rate
(BER).
Game
Theoretic Channel Allocation for the Delay-Sensitive Cognitive
Radio Networks
Wenson Chang National
Cheng Kung University, Yun-Li Yang Qisda Corporation
In this paper, we propose several
channel allocation schemes via game-theoretic approaches
for distributed cognitive radio (CR) networks. In contrast to the literature,
more important factors are taken into account when designing
the potential games for the interweave and underlay CR networks,
i.e. the queueing delay, complete inter-system interference
and protection of primary users (PUs). Particularly, in the
underlay CR networks, a PU can be protected by adaptively
adjusting the cost for secondary users (SUs) to share a subchannel
with it. Consequently, SUs and PUs can achieve higher end-to-end
throughput and maintain the desired signal-to-noise-and-interference
ratio, respectively. Moreover, to prove the convergence
of the proposed schemes, the associated potential functions
are also defined. Simulation results show that the proposed
schemes effectively reduce the
queueing delay at the cost of slightly decreased throughput.
Robust
Linear Beamformer Designs for CoMP AF Relaying in Downlink
Multi-Cell Networks
Chun-I Kuo MStar Semiconductor,
Inc., Sau-Hsuan Wu National Chiao Tung University,
Chun-Kai Tseng National Chiao Tung University
Robust beamforming methods are studied
to support relay-assisted coordinated multi-point (CoMP) retransmissions
in downlink multi-cell networks. Linear beamformers (BFers)
for relay stations of different cells are jointly designed
to maintain, in a CoMP amplify-and- forward (AF) relaying
manner, the target signal to interference-plus-noise ratios
(SINR) at the cellular boundaries of this type of networks.
Considering the feasibility in realizations, BFer designs
are only allowed to use the channel state information (CSI)
feedbacks of the wireless links inside a network. This kind
of designs turns out to be a challenging optimization problem
when attempting to maintain the SINR under the estimation
and quantization errors in CSI. A conservative criterion and
solution method is proposed for this robust design problem.
Despite the conservativeness, the proposed method appears
to provide an effective BFer design for CoMP AF relaying,
either from the perspective of power consumption or from the
viewpoints of BFers’ complexity and feasibility in
syntheses. Simulations also show that when applying the proposed
CoMP AF relaying method in Automatic Retransmission reQuest
(ARQ), data throughput can be efficiently increased for users
close to the joint cellular boundaries inside a multi-cell
network.
Zero-Forcing
Design of Precoders and Decoders in Multiuser CDMA Cooperative
Networks
Li-Chung Lo National
Sun Yat-sen University, Wan-Jen Huang National Sun
Yat-sen University, Chun-Ting Liu National Sun Yat-sen
University
We consider a multiuser cooperative CDMA
network where multiple sources transmit signals toward their
respective destinations with the assistance of multiple relays.
We propose joint designs of precoders at the relays and decoders
at the destinations to eliminate multiple-access interference (MAI) and improve system performance.
Specifically, two sub-optimal designs of precoders are developed
to maximize SNR averaged over all users and to maximize SNR
of the worst user respectively. It shows through computer
simulations that the precoder maximizing average SNR favors
the best user, while the precoder maximizing the minimal SNR
balances radio usage of relays such that all users can achieve
near- optimal diversity order.
Distributed
Beamforming with Compressed Feedback in Time-Varying Cooperative
Networks
Miao-Fen Jian National
Sun Yat-sen University, Wan-Jen Huang National Sun
Yat-sen University, Chao-Kai Wen National Sun Yat-sen
University
In this paper, we investigate distributed
beamforming with limited feedback for time varying cooperative
networks with multiple amplify-and-forward (AF) relays. With
perfect channel state information, transmit beamforming has
been shown to achieve significant diversity and coding gain
in both MIMO or cooperative systems. However, it requires
large amount of overhead for receiver to feed back channel
information or beamforming coefficients, which makes it impractical.
To perform transmit beamforming with limited feedback, the
destination can choose the best codeword as beamforming vector
from a predetermined codebook. In this work, we adopt the
generalized Lloyd algorithm (GLA) to optimize codebook in
terms of maximal average SNR. Furthermore, the feedback message
can be compressed by exploiting temporal correlation of channel
states. Specifically, we model channel states as a first-order
finite-state Markov chain, and propose two compression methods
according to the property of the transition probabilities
among different channel states. Simulations show that distributed
beamforming with compressed feedback performs closely to the
case with infinite feedback.
OS.8-SLA.2 Speech
Processing (I)
Session Chair: Jianwu Dang
Location: Mt. Olympus
Mandarin
Vowel Synthesis Based on 2D and 3D Vocal Tract Model by Finite-Difference
Time-Domain Method
Yuguang Wang Tianjin
University, Hongcui Wang Tianjin University, Jianguo
Wei Tianjin University, Jianwu Dang JAIST/Tianjin
University
The finite-difference time-domain (FDTD)
method is an effective numerical method for acoustic simulation.
This paper focuses on the details of Mandarin vowel synthesis
based on 2D and 3D vocal tract models using the FDTD method. To do
so, a 3D vocal tract shape and a vocal tract area function were
extracted from MRI volumetric images acquired during Chinese vowel
production. 3D and 2D models with staggered FDTD meshes were
constructed based on the vocal tract shape and the area function,
respectively. Finally, vowels were synthesized by simulating
sound wave propagation in the vocal tract using the FDTD method
with the two-mass vocal folds model. The formant frequencies
of synthesized vowels were compared to those of real speech
sounds. It is found that the mean absolute errors of formant
frequencies were 7.77% and 6.07% for 2D and 3D model, respectively.
Results suggest that both the 2D and 3D models are capable of
producing speech formants with about the same accuracy. However,
the 3D method exhibits more realistic phenomena in the high-frequency
region because it is based on the complete 3D vocal tract model.
It is also observed that the bandwidths of real speech can
be achieved by setting the normal sound absorption coefficient
within a proper range.
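The FDTD update itself is simple; below is a minimal 1D staggered-grid sketch of sound propagation in a uniform tube (the paper uses 2D/3D vocal tract meshes and a two-mass glottal model, none of which is modeled here; the source and constants are illustrative):

```python
import numpy as np

c, rho = 343.0, 1.2                 # sound speed (m/s), air density
dx = 3.83e-3                        # spatial step (m)
dt = dx / (c * np.sqrt(2))          # time step below the stability limit
N = 100
p = np.zeros(N)                     # pressure nodes
u = np.zeros(N + 1)                 # velocity nodes, staggered by dx/2
for step in range(500):
    u[1:-1] -= dt / (rho * dx) * (p[1:] - p[:-1])       # momentum update
    p += -rho * c**2 * dt / dx * (u[1:] - u[:-1])       # continuity update
    p[0] += np.sin(2 * np.pi * 100 * step * dt)         # toy glottal source
print(p[:5])
```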
Speaker
Adaptation Intensively Weighted on Mis-Recognized Speech Segments
Takahiro Oku NHK,
Yuya Fujita NHK, Akio Kobayashi NHK, Toru Imai
NHK
A “re-speak method” is an effective
speech recognition method for simultaneous closed-captioning
of live broadcasting programs picked up in noisy environments
featuring spontaneous or emotional commentary. An acoustic
model of the re-speaker needs to be constantly adapted according
to the re-speaker's daily health condition or level of
fatigue. In this paper, we propose efficient speaker adaptation
for the re-speak method. Conventional speaker adaptation is
performed uniformly over entire speech segments. In comparison,
our proposed speaker adaptation determines intensive adaptation
segments corresponding to recognition error parts by comparing
speech recognition results and manually error-corrected results.
These results are provided in real time by the simultaneous
closed-captioning process. Then, the frame-level statistics
for speaker adaptation are weighted more heavily, in
proportion to the degree of the recognition errors, over
the intensive adaptation segments than over the other
segments. In an experiment on an information variety program
in Japanese broadcasting, our speaker adaptation method reduced
the word error rate relatively by 3.4% compared with the conventional
uniform adaptation method.
Introduction
of False Detection Control Parameters in Spoken Term Detection
Yuto Furuya University
of Yamanashi, Satoshi Natori University of Yamanashi,
Hiromitsu Nishizaki University of Yamanashi, Yoshihiro
Sekiguchi University of Yamanashi
This paper describes spoken term detection
(STD) with false detection control. Our STD method uses phoneme
transition network (PTN) derived by multiple automatic speech
recognizers (ASRs) as an index. A PTN is almost the same
as a sub-word-based confusion network (CN), which is derived
from the output of an ASR. The PTN-based index we propose
is made of the outputs of multiple ASRs, which is known to
be robust to certain recognition errors and the out-of-vocabulary
problem. Our PTN was very effective at detecting query terms.
However, the PTN generates a lot of false detections especially
for short query terms. Therefore, we applied two false detection
control parameters to the Dynamic Time Warping-based term
detection engine. In addition, we changed the search parameters
depending on the length of a query term. Finally, the STD performance
with these parameters (F-measure of 0.785) was better than without them
(0.717).
Pipeline
Decomposition of Speech Decoders and Their Implementation
Based on Delayed Evaluation
Takahiro Shinozaki Chiba
University, Sadaoki Furui Tokyo Institute of Technology,
Yasuo Horiuchi Chiba University, Shingo Kuroiwa Chiba
University
For large vocabulary continuous speech
recognition, speech decoders process time sequences with context
information using large probabilistic models. The software
of such speech decoders tends to be large and complex since
it has to handle both the relationships among its component functions
and the timing of computation at the same time. In the traditional
signal processing area such as measurement and system control,
block-diagram-based implementations are common, where systems
are designed by connecting blocks of components. The connections
describe the flow of signals, and this framework greatly helps
in understanding and designing complex systems. In this research,
we show that speech decoders can be effectively decomposed
into diagrams or pipelines. Once they are decomposed into pipelines,
they can be easily implemented in a highly abstracted manner
using a pure functional programming language with delayed
evaluation. Based on this perspective, we have re-designed
our pure-functional decoder Husky, proposing a new design paradigm
for speech recognition systems. In the evaluation experiments,
it is shown that it efficiently works for a large vocabulary
continuous speech recognition task.
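Python generators give a similar delayed-evaluation flavor to the lazy functional pipelines described here; in the toy sketch below each decoder stage consumes and yields a lazy stream, and nothing is computed until the sink pulls (the stage names and logic are invented for illustration, not taken from Husky):

```python
def frames(samples, size=4):
    # Stage 1: lazily chop the sample stream into frames.
    for i in range(0, len(samples) - size + 1, size):
        yield samples[i:i + size]

def features(frame_stream):
    # Stage 2: one toy feature per frame.
    for f in frame_stream:
        yield sum(f) / len(f)

def decode(feature_stream, threshold=0.0):
    # Stage 3: toy frame-wise decision.
    for v in feature_stream:
        yield 'A' if v > threshold else 'B'

samples = [0.1, -0.2, 0.4, 0.3, -0.5, -0.1, 0.2, 0.0]
pipeline = decode(features(frames(samples)))  # wiring = block diagram
print(list(pipeline))                          # evaluation happens only here
```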
Consonant
Enhancement for Articulation Disorders Based on Non-negative
Matrix Factorization
Ryo Aihara Kobe University,
Ryoichi Takashima Kobe University, Tetsuya Takiguchi
Kobe University, Yasuo Ariki Kobe University
We present consonant enhancement of
the voice of a person with articulation disorders resulting
from athetoid cerebral palsy. The movements of such speakers
is limited by their athetoid symptoms, and their consonants
are often unstable or unclear, which makes it difficult for
them to communicate. Speech recognition for articulation disorders
has been studied; however, its recognition rate is still lower
than that of physically unimpaired persons. In this paper,
an exemplar-based spectral conversion using non-negative Matrix
Factorization (NMF) is applied to consonant enhancement of
a voice with articulation disorders. The source speaker's
spectrum is easily converted into a well-ordered speaker's
spectrum. Its effectiveness is examined for voice quality
and clarity of consonants for a person with articulation disorders.
PS.1-SLA.3 Speech
Recognition (I)
Session Chair: Masato Akagi
Location: Solano
Fast
Spoken Term Detection Using Pre-retrieval Results of Syllable
Bigrams
Hiroyuki Saito Iwate
Prefectural University, Yoshiaki Itoh Iwate Prefectural
University, Kazunori Kojima Iwate Prefectural University,
Masaaki Ishigame Iwate Prefectural University, Kazuyo
Tanaka Tsukuba University, Shi-wook Lee National
Institute of Advanced Industrial Science and Technology
We propose a method of the Spoken
Term Detection (STD) based on a priori retrieval results in
which plural syllables are used as query terms. In the proposed
method, all N-syllable combinations such as syllable bigrams
are searched for in spoken documents. In the first step of
the method, the retrieval results are prepared a priori, where
pre-retrieval results include candidates with scores matching
those of each N-syllable sequence. Given a query, the syllable
sequence of the query is divided into plural syllable sequences
whose lengths are the same as those of the pre-retrieval results.
In the second step, the candidate sections are filtered by
using the scores of the query’s syllable combinations. This reduction
in the number of candidate sections for detailed matching
leads to a large reduction of the retrieval time. In the third
step, these candidate sections are re-scored by performing
detailed matching. Experimental results show that the proposed
method reduces the retrieval time by 93% with a performance
degradation of less than 2 points.
Simplifying
Emotion Classification Through Emotion Distillation
Emily Provost University
of Michigan, Shrikanth Narayanan University of Southern
California
Many state-of-the-art emotion classification
systems are computationally complex. In this paper we present
an emotion distillation framework that decreases the need
for computationally complex algorithms while maintaining rich
and interpretable emotional descriptors. These representations
are important for emotionally-aware interfaces, which we will
increasingly see in technologies such as mobile devices with
personalized interaction paradigms and in behavioral informatics.
In both cases these technologies require the rapid distillation
of vast amounts of data to identify emotionally salient portions.
We demonstrate that emotion distillation can produce rich
emotional descriptors that serve as an input to simple classification
techniques. This system obtains results that match state-of-the-art
classification results on the USC IEMOCAP data.
Speech
Emotion Recognition System Based on a Dimensional Approach
Using a Three-Layered Model
Reda Elbarougy Japan
Advanced Institute of Science and Technology, Masato Akagi
Japan Advanced Institute of Science and Technology
This paper proposes a three-layer
model for estimating the expressed emotions in a speech signal
based on a dimensional approach. Most of the previous studies
using the dimensional approach mainly focused on the direct
relationship between acoustic features and emotion dimensions
(valence, activation, and dominance). However, the acoustic features that correlate with the valence dimension are fewer and weaker, and the valence dimension has been particularly difficult to predict. The ultimate goal of this study
is to improve the dimensional approach in order to precisely
predict the valence dimension. The proposed model consists
of three layers: acoustic features, semantic primitives, and
emotion dimensions. We aimed to construct a three-layer model that imitates the process by which humans perceive and recognize emotions. In this study, we first investigated the correlations
between the elements of the two-layered model and elements
of the three-layered model. In addition, we compared the two
models by applying a fuzzy inference system (FIS) to estimate
emotion dimensions. In our model, FIS was used to estimate
semantic primitives from acoustic features, then to estimate
emotion dimensions from the estimated semantic primitives.
The experimental results show that the proposed three-layered
model outperforms the traditional two-layered model.
A
Spoken Dialogue System Using Virtual Conversational Agent
with Augmented Reality
Shinji Miyake Tohoku
University, Akinori Ito Tohoku University
We have developed a spoken dialogue system using a virtual conversational agent with augmented reality. The proposed dialogue system has an architecture based on a question-and-answer database that contains many question-and-answer pairs. Additionally, we have developed two agents displayed using augmented reality, which behave as avatars of the objects to be operated. We evaluated users' impressions as well as the response accuracy of the proposed system. As a result, the existence of an agent increased the users' feeling of vividness of the conversation and their ease of talking to the system. In addition,
the system with an agent showed better response accuracy than
the system without agents.
Data-driven
Rescaled Teager Energy Cepstral Coefficients for Noise-robust
Speech Recognition
Chia-Ping Chen National
Sun Yat-sen University, Miau-Luan Hsu National Sun
Yat-sen University
We investigate data-driven rescaled
Teager energy cepstral coefficients (DRTECC) features for
noise-robust speech recognition. In the first stage, we apply
a bank of auditory gammatone filters (GTF) and extract Teager-Kaiser energy (TE) estimates, which substitute for the commonly used mel-spectrum. The output features of the first stage are called
the Teager energy cepstral coefficients (TECC). In the second
stage, we apply a piecewise rescaling operation of the cepstral
coefficients of the zeroth order to bridge the difference
between clean and noisy utterances. The segmentation point
is determined by voice activity detection (VAD), and the proportional
constants are data-driven. The resultant features are called
DRTECC. The proposed features are evaluated on the Aurora
2.0 database. The relative improvements over the baseline
MFCC features are significant.
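The Teager-Kaiser energy estimate at the heart of the first stage can be computed per filterbank channel with the discrete operator below (a generic sketch; the paper's gammatone filtering and rescaling stages are not reproduced):

```python
import numpy as np

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy operator:
    psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate values at the borders
    return psi
```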
PS.2-SLA.4 Audio
& Music Processing (I)
Session Chair: Ryo Takahashi
Location: Solano
Diffusion
Noise Suppression by Crystal-Shape Subtraction Array
Akira Tanaka Hokkaido
University, Ryo Takahashi Hokkaido University
Noise suppression of diffusion noise
by microphone arrays is discussed in this paper. In our previous
work, we proposed a method for jointly estimating signal and
noise correlation matrices from observations with diffusion
noise by using so-called crystal-shape microphone arrays,
and discussed the performance of the Wiener filter based on
those correlation matrices. In this paper, we propose a novel method for suppressing diffusion noise based on a newly adopted spectral subtraction scheme that uses the correlation matrices estimated by our previous method. We also verify the efficacy of the proposed method by computer simulations and show that it outperforms our previous Wiener-filter-based method.
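For reference, spectral subtraction itself reduces, per frequency bin, to subtracting an estimated noise power with a spectral floor; a generic sketch (the paper's noise estimate comes from the crystal-array correlation matrices, which this stand-in omits):

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, beta=0.01):
    """Power-domain spectral subtraction with a spectral floor.
    noisy_power, noise_power: per-bin power spectra (same shape)."""
    diff = noisy_power - noise_power
    floor = beta * noisy_power  # avoid negative power estimates
    return np.maximum(diff, floor)
```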
Reproduction
of Varied Sound Image Localization for Real Source in Stereo
Audio System
Satoshi Okuro Kansai
University, Yoshinobu Kajikawa Kansai University
In this paper, we propose a sound
reproduction system which can realize varied sound image localization
in stereo audio systems. The proposed system can suppress
unnatural variations of sound image localization with listener’s
movement and maintain the absolute position of sound image
so that a real source exists in the corresponding position.
Generally, human beings perceive the direction of a sound image on the horizontal plane according to the Interaural Level Difference
(ILD) and Interaural Time Difference (ITD) between signals
arriving at both ears. Accordingly, unnatural variation of
sound image localization accompanying listener’s movement
is due to the differences of ILD and ITD between the stereo
audio system and the real source. The proposed system therefore
compensates ILD and ITD using digital filters. Subjective assessment tests with ten subjects demonstrate that a fixed sound image can be realized in the proposed system, even as the listener moves away, by applying appropriate signal level ratios.
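A heavily simplified sketch of the compensation idea, applying a level ratio (ILD) and an integer-sample delay (ITD) to one channel; the function and parameters are illustrative, and practical systems use fractional-delay filters as the paper's digital-filter design implies:

```python
import numpy as np

def compensate_ild_itd(left, right, gain_db, delay_samples):
    """Apply a level ratio (ILD) and a non-negative integer-sample
    delay (ITD) to the right channel so the rendered image direction
    matches the real source for the current listener position."""
    right = np.roll(right, delay_samples) * 10.0 ** (gain_db / 20.0)
    if delay_samples > 0:
        right[:delay_samples] = 0.0  # clear samples wrapped around by np.roll
    return left, right
```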
A
Japanese Lyrics Writing Support System for Amateur Songwriters
Chihiro Abe Tohoku University,
Akinori Ito Tohoku University
In this paper, we propose a lyrics
writing support system focused on the number of syllables,
rhyme, and word accent. The system generates candidate sentences that satisfy user-specified conditions based on an N-gram model, and presents them. Users can use the system like a dictionary, and write lyrics by choosing from the presented sentences. In our subjective evaluations, we investigated how the system is actually used for writing lyrics. The usage logs and questionnaires showed that users want the system to present words suitable for their mental images, and that they used the presented words as keywords for the lyrics rather than verbatim.
Comparative
Study on Various Noise Reduction Methods with Decision-Directed
a Priori SNR Estimator via Higher-Order Statistics
Suzumi Kanehara Nara
Institute of Science and Technology, Hiroshi Saruwatari
Nara Institute of Science and Technology, Ryoichi Miyazaki
Nara Institute of Science and Technology, Kiyohiro
Shikano Nara Institute of Science and Technology, Kazunobu
Kondo Yamaha Corporate Research & Development Center
In this paper, we propose a new theoretical analysis of the amount of musical noise generated in several noise reduction methods with a decision-directed a priori SNR estimator, using higher-order statistics. In our previous study, a musical noise assessment based on kurtosis was successfully applied to spectral subtraction and the Wiener filter. However, this approach
cannot be applied to some high-quality noise reduction methods,
namely, the minimum mean-square error short-time spectral
amplitude estimator, the minimum mean square error log-spectral
amplitude estimator and the maximum a posteriori estimator,
because such methods include the decision-directed a priori
SNR estimator, which corresponds to a nonlinear recursive
(infinite) process for noise power spectral sequences. Therefore,
in this paper, we introduce a computationally efficient higher-order-moment
calculation method based on generalized Gauss-Laguerre quadrature.
We also mathematically clarify the justification for using a typical decision-directed parameter, namely the magic number 0.98, in the three types of decision-directed-based estimators from the viewpoint of the amounts of musical noise and speech distortion.
In addition, we perform comparison between these noise reduction
methods based on the mathematical analysis and human perception
test.
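The decision-directed a priori SNR estimator under analysis, with its "magic number" alpha = 0.98, takes the standard per-bin form sketched below (a generic formulation, not the paper's analysis code):

```python
import numpy as np

def decision_directed_snr(prev_amp2, noise_var, post_snr, alpha=0.98):
    """Per-bin decision-directed a priori SNR:
    xi = alpha * A^2(l-1) / lambda_d + (1 - alpha) * max(gamma - 1, 0),
    with the 'magic number' alpha = 0.98 analyzed in the paper."""
    return (alpha * prev_amp2 / noise_var
            + (1.0 - alpha) * np.maximum(post_snr - 1.0, 0.0))
```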
Toward
Polyphonic Musical Instrument Identification using Example-based
Sparse Representation
Mari Okamura Gifu University, Masanori Takehara Gifu University, Satoshi Tamura Gifu University, Satoru Hayamizu Gifu University
Musical instrument identification
is one of the major topics in music signal processing. In
this paper, we propose a musical instrument identification
method based on sparse representation for polyphonic sounds.
Such identification is still a challenging task, since it requires high-performance signal processing techniques.
The proposed scheme can be applied without any signal processing
such as source separation. Sample feature vectors for various
musical instruments are used for the base matrix of sparse
representation. We conducted two experiments to evaluate the
proposed method. First, the musical instrument identification
is tested for monophonic sounds using five musical instruments.
An average accuracy of 91.9% was obtained, which shows the effectiveness of the proposed method. Second, the musical instrument composition of polyphonic sounds containing two instruments is examined. It is found that the estimated weight vector
by sparse representation indicates the mixture ratio of two
instruments.
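A minimal sketch of the sparse-representation step, assuming a base matrix whose columns are instrument exemplars; non-negative least squares stands in here for whichever sparse solver the authors actually use:

```python
import numpy as np
from scipy.optimize import nnls

def instrument_weights(x, D):
    """Solve min_w ||x - D w||_2 with w >= 0, where the columns of D
    are sample feature vectors of individual instruments.  The relative
    magnitudes of w hint at the mixture ratio of a polyphonic frame."""
    w, _ = nnls(D, x)
    return w / (w.sum() + 1e-12)  # normalize to a mixture ratio
```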
Optimization
Scheme of Joint Noise Suppression and Dereverberation Based
on Higher-Order Statistics
Fine Aprilyanti Nara
Institute of Science and Technology, Hiroshi Saruwatari
Nara Institute of Science and Technology, Kiyohiro
Shikano Nara Institute of Science and Technology, Tomoya
Takatani Toyota Motor Company
In this paper, we apply the higher-order
statistics parameter to automatically improve the performance
of blind speech enhancement. Recently, a method to suppress both diffuse background noise and the late reverberation part of speech has been proposed, combining blind signal extraction
and Wiener filtering. However, this method requires a good
strategy for choosing the set of its parameters in order to
achieve the optimum result and to control the amount of musical
noise, which is a common problem in non-linear signal processing.
We present an optimization scheme to control the value of
Wiener filter coefficients used in this method, which depends
on the amount of musical noise generated, measured by higher-order
statistics. The noise reduction rate and cepstral distortion
are also evaluated to confirm the effectiveness of this scheme.
Tuesday, December 4, 2012 (14:30 - 16:10)
|
OS.9-IVM.4 Image
and Video Coding (I)
Session Chair: Hsueh-Ming
Hang Location: Doheny
Depth
Coding using Coded Boundary Patterns
Kai-Hsiang Yang National
Chiao Tung University, Hsueh-Ming Hang National Chiao
Tung University
The depth information plays an essential
role in the virtual-view (or free-viewpoint) 3D video systems.
In this paper, we propose a new algorithm to code a depth
map for the purpose of virtual view synthesis. The idea is
to use H.264/AVC to represent the rough shape (including depth
values) of a depth map and then additional information is
transmitted to improve the depth values around the object
boundaries. The complete encoding and decoding simulation
system was built on the H.264/AVC JM 18.0 platform. In our
experiments, three tools can be turned on individually and
thus four coding modes are defined and tested. Our data show
that these proposed tools offer advantages in either coding
efficiency or image quality improvement and some tools work
best on simple images while the others work best on complex
images. With proper parameter setting, the overall quality
of virtual view rendering is noticeably improved.
Color
Image Coding based on the Colorization
Takashi Ueno Keio University,
Yoshida Taichi Keio University, Masaaki Ikehara Keio
University
Colorization is a method which adds color components to grayscale images using color-assignment information provided by the user. Recently, a novel approach
to image compression called colorization based coding has
been proposed. It automatically extracts color assignations
from original color images at an encoder and restores color
components by a colorization method at a decoder. In this paper, we propose a method which improves conventional color image coding by regarding colorization as interpolation.
At the encoder, the proposed method subsamples chrominance
components considering colorization and subsampled chrominance
components are compressed by conventional methods. At the
decoder, subsampled chrominance components are interpolated
by colorization. Simulations reveal that the proposed method objectively improves the quality of the reconstructed images.
Improved
JPEG 2000 System Using LS Prediction and Grouping Context
Coding Scheme
Jian-Jiun Ding National
Taiwan University, Hsin-Hui Chen National Taiwan University,
Guan-Chen PanNational Taiwan University, Po-Hung Wu
National Taiwan University
In this paper, two coding strategies
are proposed to improve the coding efficiency of the JPEG2000
system. First, instead of using the embedded block coding
with optimized truncation (EBCOT) scheme in all subbands,
we apply an algorithm based on least-squares prediction in the LL part of the discrete wavelet transform domain. Moreover,
in the LH, HL, and HH parts, instead of using the MQ coder
and the fixed probability table, we group the 19 contexts
into 7 classes. Since the characteristics of the contexts
in EBCOT are different from one another, it is proper to use
different probability tables for different classes. Simulation
results demonstrate that the proposed methods significantly
improve the coding efficiency of the JPEG2000 system.
A
New In-Loop Filter for Depth Map Coding in HEVC
Hyunsuk Ko University
of Southern California, C.-C. Jay Kuo University of
Southern California
A depth-map is used to synthesize
virtual texture views in the multi-view plus depth (MVD) format.
In conventional video coding, a coded depth-map often suffers
from compression artifacts along object boundaries, which
have a negative effect on the quality of rendered images in
the view synthesis process. To address this problem, we propose
a depth-map boundary filtering technique to eliminate coding
artifacts while preserving sharp edges. This can be mathematically
formulated as a L0-norm minimization problem. This filtering
process is cascaded with the de-blocking filter in the emerging
HEVC video coding standard to result in a new in-loop filter.
Experimental results are given to show that the subjective
and objective quality of the synthesized views is enhanced
by the introduction of the new in-loop filter.
Modification
of Intra Angular Prediction in HEVC
Shohei Matsuo NTT Corporation,
Seishi Takamura NTT Corporation, Atsushi Shimizu NTT
Corporation
Intra prediction of the emerging High
Efficiency Video Coding (HEVC) standard has new features that
the existing video coding standard H.264/AVC does not have.
One example is that new angular prediction modes are added. Finer prediction directions make it possible to reduce the prediction error energy by generating the predicted signals more flexibly. The method used to generate reference samples for intra angular prediction plays an important role in coding efficiency.
In the angular prediction of HEVC, a simple 2-tap linear filter
is used to make reference samples. In this paper, the reference
samples are generated by the conventional 2-tap linear filter
or a DCT-based interpolation filter. The proposal improves
the intra prediction performance especially for small prediction
units such as 4x4 and 8x8. The average coding gains against
the anchor of HEVC test model (HM6.0) were about 0.34% and
0.31%, when the tap length of DCT-IF is set to four and six,
respectively. The maximum coding gains were about 2.2%, 3.3%,
and 3.9% for each component (Y, Cb, and Cr). In the case
of four tap interpolation, the average run-times of encoding
and decoding were about 102.84% and 100.96%, respectively.
OS.10-IVM.5 3DTV
and Free-viewpoint TV (I)
Session Chairs: Masayuki
Tanimoto, Yo-Sung Ho Location: Beachwood
3D
Video Coding with Depth Modeling Modes and View Synthesis
Optimization
Karsten Mueller Fraunhofer
HHI, Philipp Merkle Fraunhofer HHI, Gerhard Tech
Fraunhofer HHI, Thomas Wiegand Fraunhofer HHI
This paper presents efficient coding
tools for depth data in depth-enhanced video formats. The
method is based on the high-efficiency video codec (HEVC).
The developed tools include new depth modeling modes (DMMs),
in particular using non-rectangular wedgelet and contour block
partitions. As the depth data is used for synthesis of new
video views, a specific 3D video encoder optimization is used.
This view synthesis optimization (VSO) considers the exact
local distortion in a synthesized intermediate video portion
or image block for the depth map coding. In a fully optimized
3D-HEVC coder, VSO achieves average bit rate savings of 17%,
while DMMs gain 6% in BD rate, even though the depth rate only
contributes 10% to the overall MVD bit rate.
Depth
Map Up-sampling Based on Edge Layers
Danillo Graziosi MERL,
Dong Tian MERL, Anthony Vetro MERL
Depth map images are characterized by large homogeneous areas and strong edges. It has been
observed that efficient compression of the depth map is achieved
by applying a down-sampling operation prior to encoding. However,
since high resolution depth maps are typically required for
view synthesis, an up-sampling method that is able to recover
the loss of information is needed within this framework. In
this paper, an up-sampling algorithm that recovers the high
frequency content of depth maps using a novel edge layer concept
is proposed. This algorithm includes a method for extracting
edge layers from the corresponding texture images, which are
then used as part of a non-linear interpolation filter for depth map up-sampling. In the present work, the up-sampling
is applied as a post-processing operation to generate multiview
output for display. Views synthesized with our up-sampled
depth maps show the efficiency of our proposed technique relative
to conventional interpolation filters.
Hybrid
Plane Fitting for Depth Estimation
Lingfeng Xu HKUST,
Oscar C. Au HKUST, Wenxiu Sun HKUST, Yujun Li
HKUST, Jiali Li HKUST
In this paper, a novel plane fitting algorithm with low complexity and high accuracy is proposed to refine the depth maps generated by stereo matching. We first compute the confidence coefficient for each pixel in the depth map by cross checking and stable pixel calculation. According to the outlier pixel percentage for each segment, we choose one method, either the proposed weighted least-square-error based or the RANSAC based plane fitting algorithm, to estimate the plane parameters. Experimental results show that our method outperforms other existing plane fitting algorithms.
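A generic sketch of the RANSAC branch of such a hybrid plane fitting (the weighted least-square-error branch and the confidence/outlier logic are omitted; thresholds are illustrative):

```python
import numpy as np

def ransac_plane(points, n_iter=200, tol=1.0):
    """Fit z = a*x + b*y + c to Nx3 points by RANSAC; returns (a, b, c)."""
    rng = np.random.default_rng(0)
    best, best_inliers = None, -1
    for _ in range(n_iter):
        idx = rng.choice(len(points), 3, replace=False)
        A = np.c_[points[idx, :2], np.ones(3)]
        try:
            abc = np.linalg.solve(A, points[idx, 2])
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample
        resid = np.abs(points[:, :2] @ abc[:2] + abc[2] - points[:, 2])
        inliers = (resid < tol).sum()
        if inliers > best_inliers:
            best, best_inliers = abc, inliers
    return best
```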
Ray
Capture Systems for FTV
Masayuki Tanimoto Nagoya
Industrial Science Research Institute
FTV (Free-viewpoint Television) is
an innovative visual media that allows users to view a 3D
scene by freely changing their viewpoints. Thus, it enables
realistic viewing and free navigation of 3D scenes. FTV is the ultimate 3DTV, with an infinite number of views, and ranks at the top of visual media. FTV is not a conventional pixel-based
system but a ray-based system. Ray capture, processing and
display technologies have been developed for FTV. Here, three
types of ray capture systems are presented: multi-camera ray capture with view interpolation, all-around dense ray capture without view interpolation, and computational ray capture with a reduced number of pixel data.
OS.11-SLA.5 Multimodal
Information Processing - Algorithms and Applications
Session Chairs: Lei Xie,
Helen Meng Location: Runyon
Dimensional
Emotion Driven Facial Expression Synthesis Based on the Multi-Stream
DBN Model
Hao Wu Northwestern
Polytechnical University, Dongmei Jiang Northwestern
Polytechnical University, Yong Zhao Northwestern Polytechnical
University, Hichem Sahli Vrije Universiteit Brussel
This paper proposes a dynamic Bayesian
network (DBN) based MPEG-4 compliant 3D facial animation synthesis
method driven by the (Evaluation, Activation) values in the
continuous emotion space. For each emotion, a state synchronous
DBN model (SS_DBN) is firstly trained using the Cohn-Kanade
(CK) database with two streams of inputs: (i) the annotated
(Evaluation, Activation) values, and (ii) the extracted Facial Animation Parameters (FAPs) of the face image sequences. Then
given an input (Evaluation, Activation) sequence, the optimal
FAP sequence is estimated via the maximum likelihood estimation
(MLE) criterion, and then used to construct the MPEG-4 compliant
3D facial animation. Compared with the state-of-the-art approaches
where the mapping between the emotional space and the FAPs
has been made empirically, in our approach the mapping is
learned and optimized using DBN to fit the input (Evaluation,
Activation) sequence. Emotion recognition results on the constructed
facial animations, as well as subjective evaluations, show
that the proposed method obtains natural facial animations
representing well the dynamic process of the emotions from neutral to exaggerated.
Modeling
the Correlation between Modality Semantics and Facial Expressions
Jia Jia Tsinghua University,
Xiaohui Wang Tsinghua University, Zhiyong Wu Tsinghua
University, Lianhong Cai Tsinghua University, Helen
Meng Chinese University of Hong Kong
Facial expression plays an important
role in face-to-face human-computer communication. Although
considerable efforts have been made to enable computers to
speak like human beings, how to express the rich semantic
information through facial expression still remains a challenging
problem. In this paper, we use the concept of “modality” to
describe the semantic information which is related to the
mood, attitude and intention. We propose a novel parametric
mapping model to quantitatively characterize the non-verbal
modality semantics for facial expression animation. In particular,
seven-dimensional semantic parameters (SP) are first defined
to describe the modality information. Then, a set of motion
patterns represented with Key FAP (KFAP) is used to explore
the correlations of MPEG-4 facial animation parameters (FAP).
The SP-KFAP mapping model is trained with two approaches: a linear regression algorithm (AMMSE) and an artificial neural network (ANN).
Empirical analysis on a public facial image dataset verifies
the strong correlation between the SP and KFAP. We further
apply the mapping model to two different applications: facial
expression synthesis and modality semantics detection from
facial images. Both objective and subjective experimental
results on the public datasets show the effectiveness of the
proposed model. The results also indicate that the ANN method
can significantly improve the prediction accuracies in both
applications.
Detection
of Ball Hits in a Tennis Game Using Audio and Visual Information
Qiang Huang University
of East Anglia, Stephen Cox University of East Anglia,
Xiangzeng Zhou Northwestern Polytechnical
University, Lei Xie Northwestern
Polytechnical University
In this paper we describe a framework
to improve the detection of ball hit events in tennis games
by combining audio and visual information. Detection of the
presence and timing of these events is crucial for the understanding
of the game. However, neither modality on its own gives satisfactory
results: audio information is often corrupted by noise and
also suffers from acoustic mismatch between the training and
test data, and visual information is corrupted by complex
backgrounds, camera calibration, and the presence of multiple
moving objects. Our approach is to first attempt to track
the ball visually and hence estimate a sequence of candidate
positions for the ball, and to then locate putative ball hits
by analysing the ball’s position in this trajectory. To handle
the severe interferences caused by false ball candidates,
we smooth the trajectory by using locally weighted linear
regression and removing the frames where there are no candidates.
We use Gaussian mixture models to generate estimates of the
times of hits using the audio information, and then integrate
these two sources of information in a probabilistic framework.
Testing our approach on three complete tennis games shows
significant improvements in detection over a range of conditions
when compared with using a single modality.
Face
Sketch-to-Photo Synthesis from Simple Line Drawing
Yang Liang Zhejiang
University, Mingli Song Zhejiang University, Lei
Xie Northwestern Polytechnical University, Jiajun Bu
Zhejiang University, Chun Chen Zhejiang University
Face sketch-to-photo synthesis has
attracted increasing attention in recent years for its useful
applications on both digital entertainment and law enforcement.
Although great progress has been made, previous methods only work on face sketches with rich textures, which are not easy to obtain. In this paper, we propose a robust algorithm for
synthesizing a face photo from a simple line drawing that
contains only a few lines without any texture. In order to
obtain a robust result, firstly, the input sketch is divided
into several patches and edge descriptors are extracted from
these local input patches. Afterwards, an MRF framework is
built based on the divided local patches. Then a series of
candidate photo patches are synthesized for each local sketch
patch based on a coupled dictionary learned from a set of
training data. Finally, the MRF is optimized to get the final
estimated photo patches for each input sketch patch and a
realistic face photo is synthesized. Experimental results on the CUHK database have validated the effectiveness of the proposed
method.
High
Quality Lips Animation with Speech and Captured Facial Action
Unit as A/V Input
Lijuan Wang Microsoft,
Frank Soong Microsoft
Rendering realistic lip movements in an avatar using camera-captured human facial features is desirable in many applications, e.g. telepresence, video gaming, and social networking. We previously proposed using a Gaussian Mixture Model (GMM) to generate lip trajectories and successfully tested it in speech-to-lips conversion experiments, where only the audio signal (speech) is used as input. In this paper, the user's facial features, called Action Units (AUs), tracked in real time by the Microsoft Kinect SDK with a consumer-grade RGB camera, are combined with speech to form a joint A/V input for lip animation. We test the lip animation performance and show that the new combined A/V input can improve the conversion error rate by 22% in a speaker-dependent test, compared with a baseline system.
OS.12-SLA.6 Recent
Advances in Audio and Acoustic Signal Processing (I)
Session Chairs: Shoji Makino,
Hiroshi Saruwatari Location: Laurel
Beamformer
Design Using Measured Microphone Directivity Patterns: Robustness
to Modelling Error
Mark Thomas Microsoft
Research, Jens Ahrens Microsoft Research, Ivan
Tashev Microsoft Research
The design process for time-invariant
acoustic beamformers often assumes that the microphones have
an omnidirectional directivity pattern, a flat frequency response
in the range of interest, and a 2D environment in which wavefronts
propagate as a function of azimuth angle only. In this paper
we investigate those cases in which one or more of these assumptions
do not hold, considering a Minimum Variance Distortionless
Response (MVDR)-based solution that is optimized using measured
directivity patterns as a function of azimuth, elevation and
frequency. Robustness to modelling error is controlled by
a regularization parameter that produces a suboptimal but
more robust solution. A comparative study is made with the
4-element cardioid microphone array employed in Microsoft
Kinect for Windows, whose beamformer weights are calculated
with directivity patterns using (a) 2D cardioid models, (b)
3D cardioid models and (c) 3D measurements. Speech recognition
and PESQ results are used as evaluation criteria with a noisy
speech corpus, revealing empirically optimal regularization
parameters for each case and up to a 70% relative improvement
in word error rate comparing (a) and (c).
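For reference, the regularized MVDR solution referred to above has the standard form below, with diagonal loading as the robustness control; a generic sketch rather than the authors' design tool:

```python
import numpy as np

def mvdr_weights(R, d, mu=1e-2):
    """MVDR weights w = R^-1 d / (d^H R^-1 d) with diagonal loading
    (mu) as the regularization that trades optimality for robustness.
    R: noise covariance; d: (measured) steering vector."""
    n = R.shape[0]
    R_reg = R + mu * (np.trace(R).real / n) * np.eye(n)
    Rinv_d = np.linalg.solve(R_reg, d)
    return Rinv_d / (d.conj() @ Rinv_d)
```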
Theoretical
Analysis of Musical Noise in Nonlinear Noise Reduction Based
on Higher-Order Statistics
Yu Takahashi Yamaha
Corporation, Ryoichi Miyazaki Nara Institute of Science
and Technology, Hiroshi Saruwatari Nara Institute of
Science and Technology, Kazunobu Kondo Yamaha Corporate
Research & Development Center
In this paper, we review a musical-noise-generation analysis of nonlinear noise reduction techniques using higher-order statistics (HOS). Recently, an objective metric based on HOS has been proposed to analyze the nonlinear artifacts, i.e., musical noise, caused by nonlinear noise reduction techniques. Such a metric enables us to objectively compare any nonlinear methods from the perspective of the amount of musical noise generated. Furthermore, such a metric enables us to control the musical noise generated by nonlinear noise reduction techniques. In this paper, first, the mathematical principle of the HOS-based analysis of the amount of musical noise is described, and analysis and comparison examples for typical nonlinear noise reduction techniques are demonstrated. Next, it is clarified that finding a fixed point in the HOS leads to a no-musical-noise property in noise reduction. Finally, several extensions of the theory are discussed.
Auxiliary-function-based
Independent Vector Analysis with Power of Vector-norm Type
Weighting Functions
Nobutaka Ono National
Institute of Informatics
This paper presents auxiliary-function-based
independent vector analysis (AuxIVA) based on a generalized super-Gaussian source model or a Gaussian source model with time-varying variance. AuxIVA is a convergence-guaranteed
iterative algorithm for independent vector analysis (IVA)
with a spherical and super Gaussian source model, and the
source model can be characterized by a weighting function.
In this paper, as typical source models in AuxIVA, the generalized
super Gaussian distribution or the Gaussian distribution with
time-varying variance are considered. Both of them yield power-of-vector-norm type weighting functions with an exponent parameter β such that 0 ≤ β ≤ 2. A scaling and a clipping technique for numerical stability are also discussed. The separation performance of AuxIVA with several values of β is compared.
New
Analytical Calculation and Estimation of TDOA for Underdetermined
BSS in Noisy Environments
Takuro Maruyama University
of Tsukuba, Shoko Araki NTT Corporation Science Laboratories,
Tomohiro Nakatani NTT Corporation, Shigeki Miyabe University
of Tsukuba, Takeshi Yamada University of Tsukuba,
Shoji Makino University of Tsukuba, Atsushi Nakamura
NTT Corporation Science Laboratories
We have proposed a new algorithm for sparseness-based underdetermined blind source separation (BSS) that can cope with diffuse noise environments. This algorithm includes a technique for estimating the time-difference-of-arrival (TDOA) parameters separately in individual frequency bins for each source. In this paper, we propose several methods that integrate the frequency-bin-wise TDOA parameters to estimate the TDOA of each source. The accuracy of the TDOA estimation by the proposed approach is shown by experiments, in comparison with a conventional approach. The separation performance and calculation time of the proposed approach are also examined.
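As background, a conventional single-pair TDOA estimate (GCC-PHAT) looks as follows; the paper instead integrates frequency-bin-wise TDOA parameters per source, which this baseline sketch does not capture:

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs):
    """Classical GCC-PHAT delay estimate (in seconds) between two
    channels; the sign convention depends on the channel ordering."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    cc = np.fft.fftshift(cc)  # move zero lag to the center
    return (np.argmax(np.abs(cc)) - n // 2) / fs
```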
A
New Permutation Control Method for Frequency Domain BSS
Steven Grant Missouri
S&T, Christopher Osterwise Missouri S&T
This paper introduces a new frequency
domain blind source separation algorithm: Inter-frequency
Correlation with Microphone Diversity (ICMD). Here, we consider
using different sets of microphones where in each set the
number of microphones and sources are equal. In the frequency
domain, cascaded ICA initialization (CII) is used, where the
separation matrix of one bin is used to initialize the ICA
iterations of the next. CII greatly reduces the number of
permutation changes in successive bins. However, for a given
microphone set, it is not uncommon that ICA will fail to separate
some bins, thus defeating CII. This problem is addressed as
follows. 1) In addition to CII the inter- frequency correlation
matrix of the separated signals is used to align permutations
in successive frequency bins. 2) The condition number of this
matrix is monitored to determine if separation has failed
for the current bin and microphone set. 3) If so, an alternate set with better separation is selected and, again, inter-frequency correlation is used to align the permutations of the new set of microphones with the old. Results show a marked improvement in separation when there are three or more sources.
OS.13-BioSPS.2
Signal Processing Aspects of Brain Computer/Machine Interfaces
Session Chairs: Toshihisa
Tanaka, Yodchanan Wongsawat Location: Trousdale Estates
Toward
Multi-Command Auditory Brain Computer Interfacing Using Speech
Stimuli
Shuho Yoshimoto University
of Electro-Communications, Yoshikazu Washizawa University
of Electro-Communications, Toshihisa Tanaka Tokyo University
of Agriculture and Technology, Hiroshi Higashi Tokyo
University of Agriculture and Technology, Jun Tamura Nara
Institute of Science and Technology
Brain-computer interfaces (BCIs) based on event-related potentials (ERPs) are promising tools for communicating with patients suffering from severe disabilities. ERPs are evoked by various stimuli, such as auditory, olfactory, and visual stimuli. Some auditory-based BCIs using certain synthetic tones have been proposed; however, it is still challenging to increase the number of commands in auditory-based BCIs, since it is usually difficult for users to remember and distinguish the multiple tones that correspond to commands. We propose a new auditory BCI framework using speech stimuli, since it is easier for users to distinguish different speech stimuli than different simple tones. We show experimental results for a four-command BCI. The proposed speech-based BCI achieved a classification accuracy of more than 70 percent.
EEG
Energy Analysis Based On MEMD With ICA Pre-Processing
Yunchao Yin Saitama Institute of Technology, Jianting Cao Saitama Institute of Technology, Toshihisa Tanaka Tokyo University of Agriculture and Technology
Analysis of EEG energy is a useful technique in brain signal processing. This paper presents a data analysis method based on multivariate empirical mode decomposition (MEMD) with ICA pre-processing to calculate and evaluate the energy of EEG recorded from quasi-brain-death patients. The main advantage of introducing ICA pre-processing is that it reduces noise and other unexpected components. The simulation results illustrate the effectiveness and performance of the proposed method in brain death determination.
Auditory
Steady-State Response Stimuli Based BCI application - The
Optimization of the Stimuli Types and Lengths
Yoshihiro Matsumoto University
of Tsukuba, Tomek Rutkowski University of Tsukuba
We propose a method for improving the auditory BCI (aBCI) paradigm based on ASSR stimuli optimization, choosing each subject's best responses to AM-, flutter-, AM/FM- and click-envelope-modulated sounds. As the ASSR response features, we propose pairwise phase-locking values calculated from the EEG, which are then classified using a binary classifier to detect attended and ignored stimuli. We also report on the possibility of using stimuli as short as half a second, which is a step forward in ASSR-based aBCI.
The presented results are helpful for optimization of the
aBCI stimuli for each subject.
On
the Classification of EEG/HEG-based Attention Levels Via Time-Frequency
Selective Multilayer Perceptron For BCI-based Neurofeedback
System
Supassorn Rodrak Mahidol
University, Yodchanan Wongsawat Mahidol University
Attention Deficit/Hyperactivity Disorder (ADHD) is a neurobehavioral disorder which leads to difficulty in focusing, paying attention, and controlling normal behavior. Globally, the prevalence of ADHD is estimated to be 6.5%. Medication has been widely used for the treatment of ADHD symptoms, but patients risk suffering from drug side effects such as vomiting, rash, urticaria, cardiac arrhythmia, and insomnia. In this paper, we propose an alternative treatment system based on brain-computer interface (BCI)
technology called neurofeedback. The proposed neurofeedback
system simultaneously employs two important signals, i.e.
electroencephalogram (EEG) and hemoencephalogram (HEG), which
can quickly reveal the brain functional network. The treatment
criteria are that, for EEG signals, the patient needs to maintain
the beta activities (13-30 Hz) while reducing the alpha activities
(7-13 Hz). Simultaneously, HEG signals need to keep increasing toward preset thresholds of the brain blood oxygenation levels. Time-frequency selective multilayer
perceptron (MLP) is employed to capture the mentioned phenomena
in real-time. The experimental results show that the proposed
system yields the sensitivity of 98.16% and the specificity
of 95.57%. Furthermore, from the resulting weights of the proposed MLP, we can also conclude that HEG signals have the greatest impact on our neurofeedback treatment, followed by the alpha, beta, and theta activities, respectively.
Minimal-Assisted
SSVEP-based Brain-Computer Interface Device
Yunyong Punsawad Mahidol
University, Yodchanan Wongsawat Mahidol University
Steady-state visual evoked potential (SSVEP)-based brain-computer interface (BCI) devices are among the most accurate assistive technologies for persons with severe disabilities. However, with existing systems, persons with disabilities still need assistance over long periods of time as well as for continuous usage. To minimize this problem, we propose an SSVEP-based BCI system in which persons with disabilities can enable/disable the BCI device via alpha-band EEG and control electrical devices via SSVEP. A single-channel EEG (O1 or O2) is employed. Power spectral density via the periodogram at the four stimulation frequencies (6, 7, 8, and 13 Hz) and their harmonics is used as the features of interest. A simple threshold-based decision rule is applied to the selected features. With minimal need for assistance, the classification accuracy of the proposed system ranged from 75 to 100%.
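A minimal sketch of the feature/decision pipeline described above (periodogram power at the stimulation frequencies plus harmonics, then a threshold rule); the threshold and bandwidth values are illustrative:

```python
import numpy as np
from scipy.signal import periodogram

def ssvep_decision(eeg, fs, stim_freqs=(6, 7, 8, 13), ratio=3.0):
    """Periodogram power at each stimulation frequency plus its second
    harmonic; pick the strongest and accept it only if it stands out
    against the broadband mean (illustrative threshold rule)."""
    f, pxx = periodogram(eeg, fs)
    def band(f0, bw=0.5):
        return pxx[(f >= f0 - bw) & (f <= f0 + bw)].sum()
    scores = [band(f0) + band(2 * f0) for f0 in stim_freqs]
    best = int(np.argmax(scores))
    return best if scores[best] > ratio * pxx.mean() else None
```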
The
Spatial Real and Virtual Sound Stimuli Optimization for the
Auditory BCI
Nozomu Nishikawa University
of Tsukuba, Tomek Rutkowski University of Tsukuba
The paper presents results from a project aiming to create horizontally distributed surround sound sources and virtual sound images as auditory BCI (aBCI) stimuli. The purpose is to create evoked brain-wave response patterns that depend on attended or ignored sound directions. We propose a modified version of the vector base amplitude panning (VBAP) approach to achieve this goal. The spatial sound stimulus system so created for the novel oddball aBCI paradigm allows us to create a multi-command experimental environment, with very encouraging results reported in this paper. We also present results showing that a modulation of the sound image depth also changes the subjects' responses. Finally, we compare the proposed virtual sound approach with the traditional one based on real sound sources generated from real loudspeaker directions. The obtained results confirm the hypothesis that brain responses to the spatial type and depth of sound sources can be modulated independently, which allows for the development of a novel multi-command aBCI.
OS.14-IVM.6 Visual
Signal Compression
Session Chair: Gwo Giun
Lee Location: Franklin Hills
Image
Set Modeling by Exploiting Temporal-Spatial Correlation and
Photo Album Compression
Ruobing Zou HKUST,
Oscar C. Au HKUST, Guyue Zhou HKUST, Sijin Li
HKUST, Lin Sun HKUST
With the advance of digital photography, large amounts of personal photos are created and stored online or on personal computers. To save storage space and transmission bandwidth, we propose a new photo album
compression scheme by reducing both intra- and inter-image
redundancy. Specifically, we first cluster a collection of
images into groups each of which contains a set of similar
images. Under a proposed graph framework, an optimal compression
structure is derived from each cluster by finding the minimum
spanning tree (MST) at a minimum prediction cost. The MST
is trimmed for compression. Then by High Efficiency Video
Coding (HEVC), the album is compressed as a whole and every
cluster as a group of pictures (GOP), according to the predictive
order of the optimal structures. Our experimental results show an improvement of around 60% over using JPEG compression alone.
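The minimum-spanning-tree step at the core of the scheme can be sketched as follows, assuming a pairwise prediction-cost matrix per cluster (Prim's algorithm; variable names are illustrative):

```python
import numpy as np

def prim_mst(cost):
    """Prim's minimum spanning tree over a symmetric matrix of pairwise
    prediction costs; parent[i] is the reference image of image i
    (root = -1), giving the coding structure for one cluster."""
    n = cost.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    parent = np.full(n, -1)
    key = np.full(n, np.inf)
    key[0] = 0.0
    for _ in range(n):
        u = int(np.argmin(np.where(in_tree, np.inf, key)))
        in_tree[u] = True
        better = ~in_tree & (cost[u] < key)
        parent[better], key[better] = u, cost[u][better]
    return parent
```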
Largest
Coding Unit Based Framework for Non-local Means Filter
Masaaki Matsumura NTT
Corporation, Seishi Takamura NTT Corporation, Atsushi
Shimizu NTT Corporation
An important requirement for the in-loop filter of a video codec is low-delay capability for encoding and decoding. In this paper, we add a non-local means filter between the sample adaptive offset and the adaptive loop filter in HM7.0, the reference software of High Efficiency Video Coding, and propose a largest coding unit (LCU) based framework for the non-local means filter that can reconstruct a decoded picture in LCU order at the encoder and decoder. As a result, compared to the HM7.0 anchor, in the case of picture-based RD optimization, the average BD-rate improvements for luma and chroma are 0.36 to 1.52% and 0.04 to 1.37%, respectively. Similarly, the LCU-based framework improves them by 0.20 to 1.27% and 0.67 to 1.91%, respectively.
We confirm the maximum gain in the sequence of “Kimono” on
low-delay P; the gains are 3.50% (Y), 2.89% (U) and 1.84%
(V), respectively. Subjective quality improvements are also
observed.
A
Five-stage Pipeline Design of Binary Arithmetic Encoder in
H.264/AVC
Rui Song Xidian University,
Hongfei Cui Xidian University, Yunsong Li Xidian
University, Song Xiao Xidian University
Context-based Adaptive Binary Arithmetic Coding (CABAC) is a well-known bottleneck in the H.264/AVC encoder. Despite its high performance, its tight feedback loops make it difficult to parallelize. Most researchers focus on multi-bin processing regardless of the pipeline design, but without pipelining the overall performance is greatly limited. In this paper, the critical path for a hardware implementation of the binary arithmetic encoder (BAE) is analyzed in detail. We break the computation into the smallest feasible steps and re-arrange them into an appropriate pipeline to obtain a balanced latency at each stage. Further, a new binary arithmetic encoder architecture with a five-stage pipeline processing 1 bin per cycle is proposed; the critical-path latency is greatly reduced, and the frequency and throughput rate are improved. An FPGA implementation of the proposed pipelined architecture in our H.264 encoder is capable of a 190 Mbps encoding rate, and a maximum of 483 MHz can be achieved with SMIC 0.13 μm technology, which meets the requirements of QFHD encoding at 30 fps. The proposed architecture can be utilized in other designs to obtain better performance.
Architecture
of High-throughput Context Adaptive Variable Length Coding
Decoder in AVC/H.264
Gwo Giun (Chris) Lee National
Cheng Kung University, Shu-Ming Xu National Cheng Kung
University, Chun-Fu Chen National Cheng Kung University,
Ching-Jui Hsiao National Cheng Kung University
In this paper, a high-throughput Context Adaptive Variable Length Coding decoder capable of supporting AVC/H.264 HP@level 4.2 is presented. To increase throughput, multi-symbol decoders for LEVEL and “RunBefore”, together with a fast zero-insertion architecture, are presented to reduce processing cycles and reach a high throughput rate. Finally, the experimental results show that the throughput of the presented Context Adaptive Variable Length Coding decoder achieves the level limitation of level 4.2 in AVC/H.264, and the synthesis result shows that the gate count is about 17.2K gates at a clock constraint of 108 MHz.
An
Inter-Frame/Inter-View Cache Architecture Design for Multi-View
Video Decoders
Jui-Sheng Lee National
Chiao Tung University, Sheng-Han Wang National Chung-Cheng
University, Chih-Tai Chou National Chung-Cheng University,
Cheng-An Chien National Chung-Cheng University, Hsiu-Cheng
Chang National Chung-Cheng University, Jiun-In Guo
National Chiao Tung University
In this paper we propose a low-bandwidth two-level inter-frame/inter-view cache architecture for a view-scalable multi-view video decoder, which adopts two decoder cores to decode multi-view videos in parallel. The first-level L1 cache is developed for a single video decoder core and is able to reduce the bandwidth of inter-frame prediction by 60% on average. Moreover, we develop a second-level L2 cache architecture to reuse the same reference data for inter-view prediction among different decoder cores, which can reduce bandwidth by a further 35%. By adopting the proposed two-level cache architecture for inter-frame/inter-view prediction, we can reduce bandwidth by 80% in a view-scalable multi-view video decoder implementation, which achieves real-time HD1080 dual-view video decoding.
OS.15-SLA.7 Speech
Recognition (II)
Session Chair: Seiichi
Nakagawa Location: Whitley Heights
Acoustic
Model Training Using Committee-Based Active and Semi-Supervised
Learning for Speech Recognition
Takuya Tsutaoka Tokyo Institute of Technology, Koichi Shinoda Tokyo Institute of Technology
We propose an acoustic model training
method which combines committee-based active learning and
semi-supervised learning for large vocabulary continuous speech
recognition. In this method, each untranscribed training utterance is examined by a committee consisting of multiple speech recognizers, and the degree of disagreement in the committee on its transcription is used for selecting utterances. Utterances on which the committee members disagree are transcribed for active learning, while those on which they agree are used for semi-supervised learning. Our method was evaluated using the Corpus of Spontaneous
Japanese. It was shown that it achieved higher recognition
accuracy with lower transcription costs than random sampling,
active learning alone, and semi-supervised learning alone.
We also propose an alternative data selection method in semi-supervised
learning.
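A minimal sketch of the committee-based split described above, assuming each committee member is a callable recognizer returning a transcription hypothesis:

```python
from collections import Counter

def split_by_disagreement(utterances, recognizers):
    """Route each utterance by committee (dis)agreement: unanimous
    hypotheses feed semi-supervised learning with the agreed label;
    any disagreement sends the utterance to human transcription."""
    to_transcribe, auto_labeled = [], []
    for utt in utterances:
        hyps = [rec(utt) for rec in recognizers]  # one hypothesis per member
        top, count = Counter(hyps).most_common(1)[0]
        if count == len(hyps):
            auto_labeled.append((utt, top))       # committee agrees
        else:
            to_transcribe.append(utt)             # committee disagrees
    return to_transcribe, auto_labeled
```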
Distance
Attenuation Control of Spherical Loudspeaker Array
Shigeki Miyabe University
of Tsukuba, Takaya Hayashi University of Tsukuba,
Takeshi Yamada University of Tsukuba, Shoji Makino
University of Tsukuba
This paper describes the control of distance attenuation using a spherical loudspeaker array. Fisher et al. proposed radial filtering with a spherical microphone array to control the sensitivity to the distance from a sound source by modeling the propagation of waves in the spherical harmonic domain. Since transfer functions are not changed by swapping their inputs and outputs, we can apply the same radial-filtering theory used for microphone arrays to the filter design of distance attenuation control with loudspeaker arrays. Experimental results confirmed that the proposed method is effective at low frequencies.
Recognition
of Utterances with Grammatical Mistakes based on Optimization
of Language Model towards Interactive CALL Systems
Takuya Anzai Tohoku
University, Akinori Ito Tohoku University
To realize a voice-interactive CALL
system, it is necessary to recognize the learner’s utterance
correctly, including its grammatical mistakes. In this paper, we propose methods for improving the recognition accuracy of speech with grammatical mistakes. The proposed method builds on a previous method that uses an n-gram model trained on sentences generated using grammatical-error rules. We introduce two improvements to the previous method: one is utterance discrimination, to avoid introducing errors into correct utterances, and the other is optimization of the language model, where the probability of grammatical mistakes in the generated training text is optimized using the utterance discrimination score. As a result, we obtained a 0.92-point improvement, which is a 12% error reduction.
Fast
NMF Based Approach and Improved VQ Based Approach for Speech
Recognition From Mixed Sound
Shoichi Nakano Toyohashi
University of Technology, Kazumasa Yamamoto Toyohashi
University of Technology, Seiichi Nakagawa Toyohashi
University of Technology
We have considered a speech recognition method for mixed sound, consisting of speech and music, that removes only the music based on vector quantization (VQ) and non-negative matrix factorization (NMF). This paper describes a fast calculation technique for NMF-based music removal and an improvement using a VQ method. For isolated word recognition using the clean speech model, a 46% relative word error reduction was obtained compared with not removing the music. Furthermore, a high recognition rate, close to that of clean speech recognition, was obtained at 10 dB. In the multi-condition case, our proposed method reduced the error rate by 50% compared with the multi-condition model.
Expansion
of Training Texts to Generate a Topic-Dependent Language Model
for Meeting Speech Recognition
Kazushige Egashira Nagasaki
University, Kazuya Kojima Nagasaki University,
Masaru Yamashita Nagasaki University, Katsuya Yamauchi
Nagasaki University, Shoichi Matsunaga Nagasaki
University
This paper proposes expansion methods
for training texts (baseline) to generate a topic-dependent
language model for more accurate recognition of meeting speech.
It is very difficult to prepare a universal language model that can cope with the variety of topics discussed in meetings. Our strategy is to generate topic-dependent training texts based on two methods. The first is text collection from web pages using queries that consist of topic-dependent confident terms; these terms are selected from preparatory recognition results based on the TF-IDF (TF: term frequency; IDF: inverse document frequency) value of each term. The second technique is text generation using participants' names. Our topic-dependent language model was generated using these new texts and the baseline corpus. The language model generated by the proposed strategy reduced the perplexity by 16.4% and the out-of-vocabulary rate by 37.5% compared with the language model that used only the baseline corpus. This improvement was confirmed
through meeting speech recognition as well.
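A minimal sketch of the TF-IDF selection of confident terms, treating each preparatory recognition result as a bag of terms (the scoring details are illustrative):

```python
import math
from collections import Counter

def tfidf_terms(docs, top_k=20):
    """Select topic-dependent query terms by TF-IDF, where each doc is
    a list of terms from a preparatory recognition result."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    score = Counter()
    for d in docs:
        for t, c in Counter(d).items():
            tfidf = (c / len(d)) * math.log(n / df[t])
            score[t] = max(score[t], tfidf)  # best score across docs
    return [t for t, _ in score.most_common(top_k)]
```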
OS.16-WCN.2 Wireless
Communications and Networking (I)
Session Chair: Ioannis
Katsavounidis Location: Mt. Olympus
Packet
Loss Rate Estimation with Active and Passive Measurements
Atsushi Miyamoto Nara
Institute of Science and Technology, Kazuho Watanabe Nara
Institute of Science and Technology, Kazushi Ikeda Nara
Institute of Science and Technology
Network tomography is a problem of
estimating network properties such as the packet loss rates
of links using available packets. There are two kinds of methods
to measure packets: active and passive. An active measurement
specifies link information (paths) of packets a priori while
a passive measurement gets only the origins and destinations
of packets. The conventional methods for estimating the packet
loss rate of each link, one of the network tomography problems,
utilize only active measurements because passive measurements
have no link information. We propose a method that also utilizes passive measurements. The method regards the link information
in the passive measurements as latent variables and estimates
the variables and the loss rates of links simultaneously in
the framework of Bayesian inference. We show through numerical
experiments that our method outperforms the conventional algorithm
with only active measurements in the estimation accuracy.
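For reference, the forward model underlying loss tomography treats links as independent, so a path's loss rate follows from its links as sketched below (the paper's Bayesian treatment of latent link assignments is not reproduced):

```python
import numpy as np

def path_loss_rate(link_loss, path_links):
    """A packet survives a path iff it survives every link, so the
    path loss rate is 1 - prod_{l in path}(1 - p_l)."""
    p = np.asarray(link_loss, dtype=float)
    return 1.0 - np.prod(1.0 - p[list(path_links)])

# e.g. links 0 and 3 with loss rates 1% and 2%:
# path_loss_rate([0.01, 0.5, 0.2, 0.02], [0, 3])  ->  about 0.0298
```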
Investigating
Wireless Sensor Network Lifetime under Static Routing with
Unequal Energy Distribution
Apostolis Xenakis University
of Thessaly, Ioannis Katsavounidis University of Thessaly,
George Stamoulis University of Thessaly
In a Wireless Sensor Network (WSN), the sensed data must be gathered and transmitted to a base station, where it is further processed by end users. Since this kind of network consists of low-power nodes with limited battery power, power-efficient methods must be applied for node communication and data gathering in order to achieve long network lifetimes. In such networks, where in each round of communication every sensor node has data to send to a base station, it is very important to minimize the total energy consumed by the system in a round so that the total network lifetime is maximized. The lifetime of such a sensor network is the time until the base station can receive data from all sensors in the network. In this paper, besides the conventional protocol of direct transmission and the dynamic routing protocols proposed in the literature that aggregate data, we propose an algorithm based on static routing among sensor nodes with unequal energy distribution, in order to extend network lifetime and find a near-optimal node energy charging scheme that prolongs both node and network lifetime. Our simulation results show that our algorithm achieves longer network lifetimes, mainly because each node is free from maintaining complex route information, less infrastructure communication is needed, and the charging of nodes is not uniform.
Efficient
Algorithm with Lognormal Distributions for Overloaded MIMO
Wireless Systems
Kazi Obaidullah Hokkaido
University, Yoshikazu Miyanaga Hokkaido University
Due to its outstanding search capability and well-organized steps, the genetic algorithm (GA) has gained high interest in the field of overloaded multiple-input/multiple-output (MIMO) wireless communication systems. For an overloaded MIMO system employing spatial multiplexing transmission, we evaluate the performance and complexity of GA-based detection against the maximum likelihood (ML) approach. We consider transmit-correlated fading channels with a realistic Laplacian power azimuth spectrum. The values of the azimuth spread (AS) and Rician K-factor are set by means of the lognormal distributions obtained from the WINNER II channel models. First, we confirm that, for constant complexity, GA performance is the same for different combinations of GA parameters. Then, we compare the GA performance with ML in several WINNER II scenarios and channel matrix means. Finally, we compare the complexity of GA with that of ML. We find that GA performs similarly to ML across the SNR points for different scenarios and different deterministic ranks. We also find that, for comparable performance, GA complexity is much lower than that of ML, which is an advantage for field-programmable gate array (FPGA) design.
Location
Based Relay Selection Optimization in Mobile Cooperative Environment
Esam Obiedat CommScope
Inc., Chirag Warty Stanford University, Lei Cao
University of Mississippi
This paper proposes an optimal relay selection criterion based on the normalized location of the relay (between 0 and 1) relative to the source and destination in a cooperative coded system. The proposed optimization algorithm employs
distributed turbo product coding technique with hard and soft
decoding. It is shown that the link quality depends on the
location of the relay which in turn affects overall system
Bit Error Rate (BER) performance. The simulation model creates
several scenarios for location of intermediate relays when
the inter-user channel is experiencing distortion in the presence of different signal-to-noise ratios (SNRs). It is observed
that the link performance degrades as the relay proximity
changes with respect to the source and the destination. The
relay selection optimization algorithm provides the participating
nodes necessary information to select neighboring nodes depending
on link quality, thus lowering the BER and increasing the
overall network capacity.
A
Real-time Streaming Media Transmission Protocol for Multi-hop
Wireless Networks
Jianchao Du Xidian University,
Song Xiao Xidian University, Lei Quan Xidian University
Time delay as well as error accumulation
makes transmission of high-quality streaming media over multi-hop
wireless networks more challenging. Since conventional TCP/IP-based protocols drop and retransmit the whole packet once errors occur above the Physical Layer, which leads to long time delays and low transmission efficiency, a new protocol for real-time streaming media transmission is proposed in this paper. A packet control layer (PCL) is added between the Data Link and Network Layers to enable error-polluted data to be transmitted continuously, while an error control layer (ECL) is inserted
between Transport and Application Layer to further correct
errors in data stream. Moreover, a robust header conversion
method is applied to PCL to shorten time delay by reducing
packet retransmission probability and decrease the redundancy
of packet header. And an error-CRC-erasure coding scheme embedded
with CRC error correcting algorithm is adopted in ECL. Simulations
on DSP show that the proposed protocol can greatly reduce
time delay and obtain better error correction performance
compared with traditional protocol in BSC channel.
Optimal
Bit Allocation of Limited Rate Feedback for Cooperative Jamming
Xinjie Yang University
of California, Irvine, A. Lee Swindlehurst University
of California, Irvine
In this paper, we investigate bit
allocation schemes with limited rate feedback for cooperative
jamming. In addition to the transmitter and receiver, we assume
a passive eavesdropper and cooperative jammer are present.
In order to achieve a secure communications link against the
eavesdropper, the transmitter and jammer require channel state
information (CSI) to be fed back to them from the receiver.
Assuming feedback channels with a maximum sum feedback rate
constraint, the receiver must allocate the total number of
bits available to quantize the CSI between the transmitter
and jammer. This requires the receiver to balance the need
for a strong channel from the transmitter against the need
for the jammer to accurately null the receiver and reduce
the resulting interference. We propose an optimal bit allocation
strategy for this problem using mean-squared error as the
performance metric, and we use simulation examples to illustrate
its advantage over a non-optimized feedback allocation.
PS.3-IVM.7 Selected
Topics in Computer Vision and Multimedia
Session Chair: Mark Liao
Location: Solano
An
Algorithm for Radar Power Line Detection with Tracking
Qirong Ma University
of Washington, Darren Goshi Honeywell Corporation,
Long Bui Honeywell Corporation, Ming-Ting Sun University
of Washington
In this paper we deal with the problem of power line detection from millimeter-wave radar video. We propose an algorithm based on the Hough Transform, a Support Vector Machine, and particle filter tracking. We explore the defining characteristics of power lines in the radar video, and present an approach that utilizes these characteristics together with the temporal correlation of the power line objects. The particle filter framework naturally captures this temporal correlation, and the power-line-specific feature is embedded into the conditional likelihood measurement process of the particle filter. Experimental results validate the effectiveness of the power line detection approach.
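As a rough illustration of the first stage of such a pipeline, the following sketch implements a basic Hough transform for line detection on a binary edge map; the SVM validation and particle-filter tracking stages described above are omitted, and all parameters are assumptions.

```python
# A minimal Hough-transform sketch for line detection in a binary edge map.
import numpy as np

def hough_lines(edges, n_theta=180):
    """Return (accumulator, thetas, rhos) for a binary edge image."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))
    rhos = np.arange(-diag, diag + 1)
    acc = np.zeros((len(rhos), n_theta), dtype=np.int32)
    ys, xs = np.nonzero(edges)
    for theta_idx, t in enumerate(thetas):
        # rho = x cos(theta) + y sin(theta), shifted to a nonnegative index
        r = np.round(xs * np.cos(t) + ys * np.sin(t)).astype(int) + diag
        np.add.at(acc[:, theta_idx], r, 1)
    return acc, thetas, rhos

# Synthetic test: a horizontal line should produce a strong peak.
img = np.zeros((64, 64), dtype=bool)
img[32, :] = True
acc, thetas, rhos = hough_lines(img)
ri, ti = np.unravel_index(np.argmax(acc), acc.shape)
print("rho =", rhos[ri], "theta(deg) =", np.rad2deg(thetas[ti]))
```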
Multiple
Exposure Integration with Image Denoising
Ryo Matsuoka The University
of Kitakyushu, Masahiro Okuda The University of Kitakyushu,
Takao Jinno The University of Kitakyushu
We propose a denoising technique for multiple exposure image integration. In our method, noise removal is achieved by wavelet shrinkage of the multiple exposures and a novel weighting scheme for the integration. A weighted image is converted into low- and high-frequency elements by the shift-invariant wavelet transform, and the wavelet coefficients in the high frequencies are attenuated by thresholding based on wavelet-based hard shrinkage. The weights are designed to reduce sensor noise and quantization noise in the process of multiple exposure integration. Our method works especially well for noise in shadow areas. We show the validity of the proposed algorithm by simulating the method on several actual noisy images.
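The core shrinkage step can be illustrated as follows. This toy sketch uses a one-level undecimated Haar transform as a stand-in for the shift-invariant wavelet transform, with an assumed threshold; it is not the paper's weighting scheme.

```python
# A toy sketch of hard wavelet shrinkage with a one-level undecimated
# Haar transform (threshold and signals are assumptions).
import numpy as np

def haar_undecimated(x):
    x_next = np.roll(x, -1)
    approx = 0.5 * (x + x_next)
    detail = 0.5 * (x - x_next)
    return approx, detail          # exact inverse: x = approx + detail

def hard_shrink(c, thresh):
    return np.where(np.abs(c) > thresh, c, 0.0)

rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 4 * np.pi, 256))
noisy = clean + 0.2 * rng.standard_normal(256)

a, d = haar_undecimated(noisy)
d = hard_shrink(d, thresh=3 * np.std(d))   # kill small (noise-dominated) details
denoised = a + d
print("MSE before:", np.mean((noisy - clean) ** 2),
      "after:", np.mean((denoised - clean) ** 2))
```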
Palmprint
Verification using Gradient Maps and Support Vector Machines
Chun-Wei Lu Academia
Sinica, Ivy Fan Academia Sinica, Chin-Chuan Han
National United University, Jyh-Chian Chang Chinese
Cultural University, Kuo-Chin Fan National Central
University, Hong-Yuan Liao Academia Sinica
With the urgent demand for information security, biometric feature-based verification systems have been extensively explored in many application domains. However, the efficacy of existing biometric-based systems remains unsatisfactory, and many difficult problems are still to be solved. Among existing biometric features, the palmprint has been regarded as a unique and useful biometric feature due to its stable principal lines. In this paper, we propose a new method for palmprint recognition. We extract the gradient map of a palmprint and then verify it with a trained support vector machine (SVM). The procedure can be divided into three steps: image preprocessing, feature extraction, and verification. We used the multi-spectral palmprint database prepared by Hong Kong PolyU [14], which includes 6000 palm images collected from 250 individuals, to test our method. The experimental results demonstrate that our proposed method is reliable and efficient in verifying whether a person is genuine.
OpenQoS:
An OpenFlow Controller Design for Multimedia Delivery with
End-to-End Quality of Service over Software-Defined Networks
Hilmi Egilmez Koc University,
S. Tahsin Dane Koc University, K. Tolga Bagci Koc
University, A. Murat Tekalp Koc University
OpenFlow is a Software Defined Networking
(SDN) paradigm that decouples control and data forwarding
layers of routing. In this paper, we propose OpenQoS, which
is a novel OpenFlow controller design for multimedia delivery
with end-to-end Quality of Service (QoS) support. Our approach
is based on QoS routing where the routes of multimedia traffic
are optimized dynamically to fulfill the required QoS. We
measure performance of OpenQoS over a real test network and
compare it with the performance of the current state-of-the-art,
HTTP-based multi-bitrate adaptive streaming. Our experimental
results show that OpenQoS can guarantee seamless video delivery
with little or no video artifacts experienced by the end-users.
Moreover, unlike current QoS architectures, in OpenQoS the
guaranteed service is handled without having adverse effects
on other types of traffic in the network.
GIF-LR:GA-based
Informative Feature for Lipreading
Naoya Ukai Gifu University,
Takumi Seko Gifu University, Satoshi Tamura Gifu
University, Satoru Hayamizu Gifu University
In this paper, we propose a general and discriminative feature, GIF (GA-based Informative Feature), and apply it to lipreading (visual speech recognition). The feature extraction method consists of two transforms that convert an input vector to GIF for recognition. The transforms
can be computed using training data and Genetic Algorithm
(GA). For lipreading, we extract a fundamental feature as
an input vector from an image; the vector consists of intensity
values at all the pixels in an input lip image, which are
enumerated from left-top to right-bottom. Recognition experiments
of continuous digit utterances were conducted using an audio-visual
corpus including more than 268,000 lip images. The recognition
results show that the GIF-based method is better than the
baseline method using eigenlip features.
PS.4-BioSPS.3 Biomedical
Signal Processing and Systems
Session Chair: Bonnie Law
Location: Solano
Comparative
Study of Interactive Seed Generation for Growcut-Based Fast
3D MRI Segmentation
Toshihiko Yamasaki The
University of Tokyo, Tsuhan Chen Cornell University,
Masakazu Yagi Osaka University, Toshinori Hirai Kumamoto
University, Ryuji Murakami Kumamoto
University
This paper proposes a speed-enhanced growcut method and presents a comparative study of seed setting methods for fast 3D medical image (MRI) segmentation. Processing time tends to be long in 3D image segmentation because of the large number of neighboring voxels as well as the number of voxels themselves. In this paper, two seed setting methods are proposed for our fast growcut-based segmentation algorithm: a sphere-based bounding box method and a label-transfer-based method using SIFT flow. Experimental results demonstrate that tumor segmentation for each patient can be done very quickly compared to previous works. The segmentation accuracy can also be made very high with only a few user interactions.
Compressed
Sensing with Super-resolution in Magnetic Resonance using
Quadratic Phase Modulation
Satoshi Ito Utsunomiya
University, Yoshifumi Yamada Utsunomiya University
In recent years, compressed sensing (CS) has attracted considerable attention in the area of rapid MR imaging. Our group and Y. Wiaux have shown independently that the use of quadratic phase modulation prior to data acquisition can greatly improve the acceleration factor of CS. Quadratic phase modulation has the distinctive feature that signal extrapolation through post-processing is feasible. In this paper, we propose a novel image reconstruction method in which signal extrapolation is executed inside the CS reconstruction algorithm, resulting in improved spatial resolution. Simulation and experimental studies have revealed that the spatial resolution is considerably improved compared to images obtained by standard CS based on Fourier transform imaging.
Exploiting
Biclustering for Missing Value Estimation in DNA Microarray
Data
Kin-On Cheng The Hong
Kong Polytechnic University, Bonnie Law The Hong Kong
Polytechnic University, Wan-Chi Siu The Hong Kong Polytechnic
University
Missing values in gene expression data complicate subsequent analyses such as biclustering, which aims to find sets of co-expressed genes across a number of experimental conditions. Missing values thus need to be estimated before bicluster detection. Existing estimation algorithms rely on finding coherence among expression values across entire genes and/or across all conditions. Since both missing value estimation and bicluster detection aim at exploiting coherence inside the expression data, we propose to integrate them into a single framework. The benefits are twofold: missing value estimation can improve bicluster analysis, and the coherence in detected biclusters can be exploited for better missing value estimation. Experimental results show that the integrated framework outperforms existing missing value estimation algorithms. It reduces the error in missing value estimation and facilitates the detection of biologically meaningful biclusters.
A
Breast Tumor Classification Method based on Ultrasound BI-RADS
Data Mining
Jin Man Park Samsung
Electronics, Hyoungmin Park Samsung Electronics,
Jong-Ha Lee Samsung Electronics, Yeong Kyeong Seong
Samsung Electronics, Kyoung-Gu Woo Samsung Electronics,
Kyuseok Shim Seoul National University
In this paper, we propose a data analysis
method to select important characteristics of ultrasonic breast
images which suggest the malignancy of breast tumor. Based
on the analysis, we also present a method of creating a classifier
that can quickly and precisely predict the malignancy of breast
tumors from an ultrasonic breast image. The selection of important
characteristics enables us to focus on the image processing
algorithms that can effectively represent the selected characteristics.
By applying the data analysis method to more than 5,000 clinical
cases, we select a subset of image processing algorithms which
have better representative power for important characteristics.
Our classifier based on the subset of image processing algorithms
shows a comparable accuracy to a naive classifier which uses
a full set of the image processing algorithms. Thus, our classifier
can reduce the prediction time on demand by minimizing the
number of image processing algorithms. The experiments show that the malignancy of a tumor can be successfully predicted by our approach.
An
AdaBoost-Based Weighting Method for Localizing Human Brain
Magnetic Activity
Tetsuya Takiguchi Kobe
University, Ryoichi Takashima Kobe University,
Yasuo Ariki Kobe University, Toshiaki Imad University
of Washington, Lotus Lin University
of Washington, Patricia Kuhl University
of Washington, Masaki Kawakatsu Tokyo
Denki University
This paper shows that pattern classification
based on machine learning is a powerful tool for analyzing
human brain activity data obtained by magnetoencephalography
(MEG). In our previous work, a weighting method using multiple
kernel learning was proposed, but this method had a high computational
cost. In this paper, we propose a novel and fast weighting
method using an AdaBoost algorithm to find the sensor area
contributing to the accurate discrimination of vowels. Our
AdaBoost simultaneously estimates both the classification
boundary and the weight to each MEG sensor, with MEG amplitude
obtained from each pair of sensors being an element of the
feature vector. The estimated weight indicates how useful the corresponding sensor is for classifying the MEG response patterns. Our results for vowel recognition show that the large-weight MEG sensors lie mainly in a language area of the brain and that high classification accuracy (91.0%) is achieved in the latency range between 50 and 150 ms.
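To make the idea concrete, here is a hedged sketch of AdaBoost with one-feature decision stumps in which the accumulated stump weights act as per-sensor importance scores; the synthetic data and stump design are assumptions, not the authors' MEG setup.

```python
# AdaBoost with one-feature stumps; accumulated stump weights score sensors.
import numpy as np

rng = np.random.default_rng(2)
n, n_sensors = 400, 20
X = rng.standard_normal((n, n_sensors))
y = np.sign(X[:, 3] + 0.5 * X[:, 7] + 0.3 * rng.standard_normal(n))  # informative: 3, 7

w = np.full(n, 1.0 / n)                 # sample weights
sensor_weight = np.zeros(n_sensors)
for _ in range(30):                     # boosting rounds
    best = None
    for j in range(n_sensors):          # stump: sign(x_j - t) at a few thresholds
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - t)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, pred)
    err, j, pred = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    sensor_weight[j] += alpha           # credit the chosen sensor
    w *= np.exp(-alpha * y * pred)      # re-weight samples
    w /= w.sum()
print("top sensors:", np.argsort(sensor_weight)[::-1][:3])
```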
Tuesday, December 4, 2012 (16:30 - 18:10)
|
OS.17-SLA.8 Speech
Processing (II)
Session Chair: Waleed Abdulla
Location: Doheny
Optimizing
the Parameters of Decoding Graphs Using New Log-based MCE
Abdelaziz Abdelhamid The
University of Auckland, Waleed Abdulla The University
of Auckland
This paper proposes a new class loss function as an alternative to the standard sigmoid class loss function for optimizing the parameters of decoding graphs using discriminative training based on the minimum classification error (MCE) criterion. The standard sigmoid-based approach tends to ignore a significant number of training samples that have a large difference between the scores of the reference and their corresponding competing hypotheses, which degrades parameter optimization. The proposed function overcomes this limitation by considering almost all the training samples, and thus improves parameter optimization when tested on large decoding graphs. The decoding graph used in this research is an integrated network of weighted finite state transducers. The primary task examined is a 64K-word continuous speech recognition task. The experimental results show that the proposed method outperforms baseline systems based on both maximum likelihood estimation (MLE) and sigmoid-based MCE, and achieves a 28.9% reduction in word error rate (WER) when tested on the TIMIT speech database.
Voice
Activity Detection Based on Augmented Statistical Noise Suppression
Yasunari Obuchi Hitachi
Ltd., Ryu Takeda Hitachi Ltd., Naoyuki Kanda Hitachi
Ltd.
A new voice activity detection (VAD) algorithm using augmented statistical noise suppression is introduced. Statistical noise suppression is an effective tool for speech processing under noisy conditions, and it achieves the best VAD performance when the noise suppression is augmented in various ways. The speech distortion, which is usually a severe side effect of strong noise suppression, does not affect the VAD performance, and the correctly estimated signal power provides accurate detection of speech. The performance of the proposed algorithm is evaluated using the CENSREC-1-C public database, and it is confirmed that the proposed algorithm outperforms other algorithms such as the switching Kalman filter-based VAD.
Language
Modeling for Spoken Dialogue System based on Sentence Transformation
and Filtering using Predicate-Argument Structures
Koichiro Yoshino Kyoto
University, Shinsuke Mori Kyoto University, Tatsuya
Kawahara Kyoto University
We present a novel scheme of language
modeling for a spoken dialogue system by effectively exploiting
the back-end documents the system uses for information navigation.
The proposed method first converts sentences in the document, which are in a written, plain style, into spoken question-style queries, which are expected in spoken dialogue. In this process,
we conduct dependency analysis to extract verbs and relevant
phrases to generate natural sentences by applying transformation
rules. Then, we select sentences which have useful information
relevant to the target domain and thus are more likely to
be queried. For this purpose, we define predicate-argument
(P-A) templates based on a statistical measure in the target
document. An experimental evaluation shows that the proposed
method outperforms the conventional method in ASR performance,
and the sentence selection based on the P-A templates is effective.
Hybrid
Vector Space Model for Flexible Voice Search
Cheongjae Lee Kyoto
University, Tatsuya Kawahara Kyoto University
This paper addresses incorporation
of semantic analysis into information retrieval (IR) based
on the vector space model (VSM) for flexible matching of spontaneous
queries in a voice search system. Information of semantic
slots or concepts that correspond to database fields is expected to help enhance IR, but the semantic analyzer often fails
or needs a large amount of training data. We propose a hybrid
model which combines dedicated VSMs for concept slots with
a general VSM as a back-off. The model has been evaluated
in a book search task and shown to be effective and robust
against ASR and SLU errors.
An
Interference-Free Representation of Group Delay for Periodic
Signals
Hideki Kawahara Wakayama
University, Masanori Morise Ritsumeikan University,
Ryuichi Nisimura Wakayama University, Toshio Irino
Wakayama University
This article introduces a new group
delay representation for periodic signals. The proposed method
yields a group delay representation that is free from interferences
due to repetitive excitation. Power spectrum-weighted averaged
group delay using shifted copies of the weighted group delay
separated by a half fundamental frequency is proven to have
the desired property.
OS.18-SLA.9 Audio
& Music Processing (II)
Session Chair: Chang-chun
Bao Location: Beachwood
A
Blind Bandwidth Extension Method of Audio Signals based on
Volterra Series
Xing-tao Zhang Beijing
University of Technology, Chang-chun Bao Beijing University
of Technology, Xin Liu Beijing University of Technology,
Li-yan Zhang Beijing University of Technology
In this paper, a blind bandwidth extension method for audio signals is proposed, in which the fine structure of the high-frequency information is recovered based on a Volterra series. Combined with a Gaussian mixture model and codebook mapping to separately adjust the spectrum envelope and energy gain of the extended high-frequency components, the bandwidth of audio signals is extended from wideband to super-wideband. Furthermore, the proposed method is applied to a real audio codec. The performance of the proposed method is evaluated through objective and subjective tests on audio signals selected from MPEG items, and it is found that the proposed method outperforms the chaotic prediction method and the nearest-neighbor matching method. When the proposed algorithm is applied to the ITU-T G.722.1 wideband audio codec, the performance is comparable with that of the G.722.1C super-wideband audio codec at 24 kbps.
Personalized
Music Emotion Recognition via Model Adaptation
Ju-Chiang Wang Academia
Sinica, Yi-Hsuan Yang Academia Sinica, Hsin-Min
Wang Academia Sinica, Shyh-Kang Jeng National Taiwan
University
In music information retrieval (MIR) research, developing a computational model that comprehends the affective content of a music signal and utilizing such a model to organize music collections has been an essential topic. Emotion perception in music is by nature subjective. Consequently, building a general emotion recognition system that performs equally well for every user could be insufficient. In contrast, it would be more desirable for one’s personal computer/device to be able to understand his/her perception of music emotion. In our previous work, we have developed
the acoustic emotion Gaussians (AEG) model, which can learn
the broad emotion perception of music from general users.
Such a general music emotion model, called the background
AEG model in this paper, can recognize the perceived emotion
of unseen music from a general point of view. In this paper,
we go one step further to realize the personalized music emotion
modeling by adapting the background AEG model with a limited
number of emotion annotations provided by a target user in
an online and dynamic fashion. A novel maximum a posteriori
(MAP)-based algorithm is proposed to achieve this in a probabilistic
framework. We carry out quantitative evaluations on a well-known
emotion annotated corpus, MER60, to validate the effectiveness
of the proposed method for personalized music emotion recognition.
HRTF
Magnitude Modeling Using a Non-Regularized Least-Squares Fit
of Spherical Harmonics Coefficients on Incomplete Data
Jens Ahrens Microsoft
Research, Mark Thomas Microsoft Research, Ivan
Tashev Microsoft Research
Head-related transfer functions (HRTFs)
represent the acoustic transfer function from a sound source
at a given location to the ear drums of a human. They are
typically measured from discrete source positions at a constant
distance. Spherical harmonics decompositions have been shown
to provide a flexible representation of HRTFs. Practical constraints
often prevent the retrieval of measurement data from certain
directions, a circumstance that complicates the decomposition
of the measured data into spherical harmonics. A least-squares
fit of coefficients is a potential approach to determining
the coefficients of incomplete data. However, a straightforward
non-regularized fit tends to give unrealistic estimates for the region where no measurement data are available. Recently,
a regularized least-squares fit was proposed, which yields
well-behaved results for the unknown region at the expense
of reducing the accuracy of the data representation in the
known region. In this paper, we propose using a lower-order
non-regularized least-squares fit to achieve a well-behaved
estimation of the unknown data. This data then allows for
a high-order non-regularized least-squares fit over the entire
sphere. We compare the properties of all three approaches
applied to modeling the magnitudes of the HRTFs measured from
a manikin. The proposed approach reduces the normalized mean-square
error by approximately 7 dB in the known region and 11 dB
in the unknown region compared to the regularized fit.
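The basic non-regularized fit can be sketched as follows, using scipy's spherical harmonics to build a design matrix over the measured directions and solving an ordinary least-squares problem; the order, grid, and test function are illustrative assumptions.

```python
# Non-regularized least-squares fit of spherical harmonics coefficients
# to scattered data on an incompletely sampled sphere (toy setup).
import numpy as np
from scipy.special import sph_harm

def sh_design_matrix(azimuth, colatitude, order):
    cols = [sph_harm(m, n, azimuth, colatitude)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)

rng = np.random.default_rng(3)
n_pts = 200                                  # "measured" directions
az = rng.uniform(0, 2 * np.pi, n_pts)
col = rng.uniform(0.2 * np.pi, np.pi, n_pts) # no data near the north pole
data = 1.0 + 0.5 * np.cos(col) + 0.1 * np.sin(2 * az) * np.sin(col) ** 2

order = 3                                    # low order -> well-behaved extrapolation
A = sh_design_matrix(az, col, order)
coeffs, *_ = np.linalg.lstsq(A, data, rcond=None)

# Evaluate the fit in the unmeasured region (near the pole).
A_gap = sh_design_matrix(np.array([0.3]), np.array([0.05]), order)
print("estimate at gap:", (A_gap @ coeffs).real)
```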
Subjective
Similarity of Music: Data Collection for Individuality Analysis
Shota Kawabuchi Nagoya
University, Chiyomi Miyajima Nagoya University,
Norihide Kitaoka Nagoya University, Kazuya Takeda Nagoya
University
We describe a method of estimating
subjective music similarity from acoustic music similarity.
Recently, there have been many studies on the topic of music
information retrieval, but there continues to be difficulty
improving retrieval precision. For this reason, in this study
we analyze the individuality of subjective music similarity.
We collected subjective music similarity evaluation data for
individuality analysis using songs in the RWC music database,
a widely used database in the field of music information processing.
A total of 27 subjects listened to pairs of music tracks,
and evaluated each pair as similar or dissimilar. They also
selected the components of the music (melody, tempo/rhythm, vocals, instruments) that were similar. Each subject evaluated
the same 200 pairs of songs, thus the individuality of the
evaluation can be easily analyzed. Using the collected data,
we trained individualized distance functions between songs,
in order to estimate subjective similarity and analyze individuality.
Comparison
of Superimposition and Sparse Models in Blind Source Separation
By Multichannel Wiener Filter
Ryutaro Sakanashi University
of Tsukuba, Shigeki Miyabe University of Tsukuba,
Takeshi Yamada University of Tsukuba, Shoji Makino
University of Tsukuba
The multichannel Wiener filter proposed by Duong et al. can perform underdetermined blind source separation (BSS) with low distortion. This method assumes that the observed signal is the superimposition of multichannel source images generated from multivariate normal distributions. The covariance matrix in each time-frequency slot is estimated by an EM algorithm which treats the source images as hidden variables. Using the estimated parameters, the source images are separated as the maximum a posteriori estimate. It is worth noting that this method does not assume the sparseness of sources, which is usually assumed in underdetermined BSS. In this paper we investigate the effectiveness of the three attributes of Duong’s method, i.e., the source image model with multivariate normal distributions, the observation model without a sparseness assumption, and source separation by the multichannel Wiener filter. We newly formulate three BSS methods with a similar source image model but different observation models assuming sparseness, and we compare them with Duong’s method and conventional binary masking. Experimental results confirm the effectiveness of all three attributes of Duong’s method.
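For reference, the per-frequency multichannel Wiener filter itself reduces to a small matrix expression. The sketch below assumes known (synthetic) spatial covariances; in Duong's method these would come from the EM algorithm described above.

```python
# Per-frequency multichannel Wiener filter: W = Rs (Rs + Rn)^-1,
# with synthetic covariances standing in for EM estimates.
import numpy as np

rng = np.random.default_rng(4)
M = 2                                      # microphones
a = np.array([1.0, 0.7 + 0.3j])            # source steering vector (assumed)
Rs = 1.0 * np.outer(a, a.conj())           # rank-1 source image covariance
Rn = 0.2 * np.eye(M)                       # diffuse residual covariance

W = Rs @ np.linalg.inv(Rs + Rn)            # multichannel Wiener filter

s = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
x = a * s + np.sqrt(0.1) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
s_img_hat = W @ x                          # MAP estimate of the source image
print("estimated source image:", s_img_hat)
```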
OS.19-IVM.8 Visual
3D Scene Reconstruction and its Applications
Session Chair: Kyoung Mu
Lee Location: Runyon
Violin
Pedagogy for Finger and Bow Placement using Augmented Reality
Francois de Sorbier Keio
University, Hiroyuki Shiino Keio University, Hideo
Saito Keio University
Beginners need a long time before they can play the violin correctly. Learning the bowing technique is a difficult task and retains most of a beginner's attention. Beyond this, finger placement is also an important part of learning, but it is often underestimated. One difficulty is that the fingerboard of the violin does not have frets. In this ongoing work, we present a marker-less augmented reality system that advises novice players on their fingering and bowing. We display the virtual frets in real time by tracking the violin with a depth camera. We also capture and recognize the note currently played to direct the placement of the bow on the strings.
Robust
View Synthesis Under Varying Illumination Conditions Using
Segment-Based Disparity Estimation
Il-lyong Jung Korea
University, Chang-Su Kim Korea University
An intermediate view synthesis scheme
under varying illumination conditions is proposed in this
work. First, we estimate the disparity map based on cumulative
color histograms. Since the cumulative histogram of an image
represents the brightness ranks of pixels, the disparity estimation
is robust against varying illumination conditions. More specifically,
we divide each image into segments, and compute the cumulative
histogram of the representative values for these segments.
Then, we estimate the disparity map based on the similarity
of the cumulative histograms between stereo images. Second,
we transform the colors of stereo images adaptively using
the disparity map. Finally, we synthesize intermediate views
using the transformed stereo images and the disparity map.
Simulation results demonstrate that the proposed algorithm
provides better disparity maps and intermediate views under
varying illumination conditions than the conventional techniques.
Memory-Efficient
Belief Propagation in Stereo Matching on GPU
Young-kyu Choi Inha
University, Williem Inha University, In Kyu Park
Inha University
Belief propagation (BP) is a commonly used global energy minimization algorithm for solving the stereo matching problem in 3D reconstruction. However, it requires large memory bandwidth and data size. In this paper, we propose a novel memory-efficient BP algorithm for stereo matching on the Graphics Processing Unit (GPU). The data size and transfer bandwidth are significantly reduced by storing only part of the whole message. In order to maintain the accuracy of the matching result, the local messages are reconstructed using the shared memory available on the GPU. Experimental results show almost an order of magnitude reduction in global memory consumption and 21% to 46% savings in memory bandwidth compared to the conventional algorithm. The implementation on a recent GPU shows a 22.8-times speedup in execution time compared to execution on a CPU.
Real-Time
Panorama Image Synthesis By Fast Camera Pose Estimation
Beom Su Kim Seoul National
University, Sang Hwa Lee Seoul National University,
Nam Ik Cho Seoul National University
This paper proposes a fast panorama synthesis algorithm that runs on mobile devices in real time. Like most existing methods, the proposed method consists of the following steps: feature tracking, rotation matrix estimation, and image warping onto a target plane, where the feature tracking is usually the bottleneck for real-time implementation. Hence, we propose to track the features on a virtual sphere surface instead of the projected surface or image domain as in conventional methods. By performing the feature tracking on the sphere, the camera pose can be found by a linear, non-iterative least-squares method, whereas it is usually obtained by nonlinear, iterative methods. The fast estimation of camera pose makes outlier rejection more robust, since the camera pose can be inferred from the hypotheses in one iteration, which cannot be done in real time by iterative estimation. We also propose a two-step blending algorithm, i.e., cell-filling followed by linear blending along the cell boundaries. The panorama canvas is partitioned into many cells, where each cell contains pixels from the same shot. Hence, there is no stitching seam within a cell and only the boundaries need to be blended, which reduces the stitching artifacts significantly.
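The closed-form rotation step can be illustrated with the standard orthogonal Procrustes (SVD) solution for fitting a rotation to bearing vectors on the unit sphere; this is a generic linear least-squares rotation fit, and the paper's exact formulation may differ.

```python
# Closed-form rotation estimation from sphere-tracked correspondences
# via the orthogonal Procrustes / SVD solution.
import numpy as np

rng = np.random.default_rng(11)
p = rng.standard_normal((3, 50))
p /= np.linalg.norm(p, axis=0)                  # bearings on the unit sphere

angle = np.deg2rad(5.0)                         # ground-truth camera rotation (yaw)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
q = R_true @ p + 0.01 * rng.standard_normal((3, 50))

U, _, Vt = np.linalg.svd(q @ p.T)               # minimize ||q - R p||_F over rotations
D = np.diag([1, 1, np.linalg.det(U @ Vt)])      # guard against reflections
R_est = U @ D @ Vt
err = np.arccos(np.clip((np.trace(R_est.T @ R_true) - 1) / 2, -1, 1))
print("rotation error (deg):", np.rad2deg(err))
```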
Combining
Multi-view Stereo and Super Resolution in a Unified Framework
Haesol Park Seoul National
University, Kyoung Mu Lee Seoul National University,
Sang Uk Lee Seoul National University
In the multi-view stereo setting, the pixel correspondence problem and the super-resolution problem are interrelated, in the sense that the result of each problem could help solve the other. In this paper, we propose a novel method that solves the two problems together by optimizing a unified energy functional. The main difference from previous works is that the consistency between high-resolution images is considered along with the consistency of the high-resolution and low-resolution image pair at the same viewpoint. Experimental results show that our method outperforms the naive combination of single-image super resolution and multi-view stereo methods.
Confidence-based
Refinement of Corrupted Depth Maps
Satoshi Ikehata University
of Tokyo, Kiyoharu Aizawa University of Tokyo
This paper presents a practical depth-map refinement system designed for highly corrupted multiple depth maps. We define a pixel-wise confidence measure of depth value and apply a three-step depth-map refinement scheme (i.e., confidence-based depth-map fusion, confidence-weighted bundle optimization, and super-pixel-based planar propagation) to maximize the overall reliability of the depth maps. Our experimental results show that the refinement algorithm can dramatically improve highly corrupted depth maps acquired by previous approaches.
OS.20-SIPTM.1
Control, Optimization and Information Processing for Smart
Grid (I)
Session Chairs: Rongshan
Yu, Binbin Chen Location: Laurel
Distributed
State Estimation in Smart Grid with Communication Constraints
Hang Ma University of
Maryland, Yu-Han Yang University of Maryland, Yan
Chen University of Maryland, K. J. Ray Liu University
of Maryland
Distributed state estimation in the smart grid relies heavily on the availability of measurements. Transmitting many measurements within a small time interval is costly and sometimes even impossible. This paper explores the problem of distributed state estimation in the smart grid with a constraint on the number of measurements that can be transmitted in one step. It is shown that there exists a lower bound, depending on the structure of the grid, such that if the number of permissible measurements is above the bound, the estimator achieves the same performance as its peer without the constraint. Further, if the number of permissible measurements is below the lower bound, a tradeoff between the performance of the estimator and the measurements transmitted is needed to meet the constraint. A method to attain this tradeoff is offered in this paper. The proposed conclusions and methods are illustrated in simulations on the IEEE 14-bus system.
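As a toy illustration of the performance/measurement-count tradeoff, the following sketch runs weighted least-squares (WLS) state estimation on a linearized model while shrinking the permissible measurement set; the model and numbers are assumptions.

```python
# WLS state estimation with a restricted measurement set (toy linear model).
import numpy as np

rng = np.random.default_rng(5)
n_state, n_meas = 4, 12
H = rng.standard_normal((n_meas, n_state))   # linearized measurement matrix
x_true = rng.standard_normal(n_state)
z = H @ x_true + 0.05 * rng.standard_normal(n_meas)
R_inv = np.eye(n_meas) / 0.05 ** 2           # inverse measurement covariance

def wls(H, z, R_inv):
    G = H.T @ R_inv @ H                      # gain matrix
    return np.linalg.solve(G, H.T @ R_inv @ z)

for k in (12, 8, 5):                         # shrink the permissible measurement set
    idx = np.arange(k)
    x_hat = wls(H[idx], z[idx], R_inv[np.ix_(idx, idx)])
    print(k, "measurements -> error:", np.linalg.norm(x_hat - x_true))
```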
Cyclostationary
Noise Mitigation in Narrowband Powerline Communications
Jing Lin University
of Texas at Austin, Brian Evans University of Texas
at Austin
Future Smart Grid systems will intelligently
monitor and control energy flows in order to improve the efficiency
and reliability of power delivery. The monitoring and control
require low-delay, highly reliable, two-way communications
between customers, local utilities and regional utilities.
Narrowband powerline communication (NB-PLC) systems operating
in the 3--500 kHz band have been standardized to enable these
two-way communication links. In NB-PLC systems, additive non-Gaussian noise/interference is the primary limitation on communication performance. From field trials, the dominant source of this non-Gaussian noise/interference is cyclostationary. In this paper, we address the problem of cyclostationary noise mitigation in NB-PLC systems and other orthogonal frequency division multiplexing (OFDM) systems. The contributions of this paper include a parametric noise estimation algorithm based on a switching linear autoregressive (AR) process, and a simple adaptive noise whitening approach that can be immediately integrated into the conventional OFDM transceiver structure to improve its performance. In our simulations, the proposed noise whitening method achieves up to 3 dB SNR gain over conventional OFDM systems at SNRs higher than -3 dB.
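The whitening idea can be sketched in a few lines: fit AR coefficients to a noise segment via the Yule-Walker equations and apply the inverse filter. The AR order and signals below are illustrative assumptions, not the paper's switching AR estimator.

```python
# AR-based noise whitening: Yule-Walker fit, then inverse (whitening) filter.
import numpy as np
from scipy.signal import lfilter

def yule_walker(x, order):
    r = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)  # autocorrelation
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])  # x[n] ~ a1 x[n-1] + ... + ap x[n-p]

rng = np.random.default_rng(6)
white = rng.standard_normal(8000)
colored = lfilter([1.0], [1.0, -0.8, 0.3], white)  # synthetic colored noise

a = yule_walker(colored, order=2)
whitened = lfilter(np.concatenate(([1.0], -a)), [1.0], colored)
print("AR coeffs:", a)                             # close to [0.8, -0.3]
print("var in/out:", colored.var(), whitened.var())
```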
Load
Disaggregation Using Harmonic Analysis and Regularized Optimization
Jerry Chiang ADSC,
Tianzhu Zhang ADSC, Binbin Chen ADSC, Yih-chun
Hu University of Illinois at Urbana-Champaign
In this paper, we present a load disaggregation technique that uses regularized optimization together with harmonic frequency signatures of appliances. The benefits of our technique are twofold: 1) the regularized optimization is faster than integer programming; and 2) the harmonic frequency signatures allow us to disaggregate the loads using as few as 10 cycles (as little as 200 milliseconds) of samples, instead of having to wait for state changes from appliances or for weekly usage patterns to emerge. We test our proposed technique in proof-of-concept experiments and show that it returns accurate disaggregation results.
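A minimal sketch of the unmixing step, assuming made-up harmonic signatures: the aggregate harmonic measurement is decomposed with a ridge-regularized nonnegative least-squares fit, which is a fast convex solve rather than integer programming.

```python
# Harmonic-signature load disaggregation via regularized NNLS (toy data).
import numpy as np
from scipy.optimize import nnls

signatures = np.array([                 # columns: appliances; rows: harmonics 1,3,5,7
    [1.0, 1.0, 0.8],
    [0.1, 0.6, 0.0],
    [0.0, 0.3, 0.0],
    [0.0, 0.1, 0.2],
])
true_activation = np.array([1.0, 0.0, 2.0])   # appliance "on" levels
rng = np.random.default_rng(7)
measured = signatures @ true_activation + 0.01 * rng.standard_normal(4)

lam = 0.1                                # ridge-style regularization weight
A_aug = np.vstack([signatures, np.sqrt(lam) * np.eye(3)])
b_aug = np.concatenate([measured, np.zeros(3)])
activation, _ = nnls(A_aug, b_aug)       # fast convex solve, no integer programming
print("estimated activations:", activation)
```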
Dynamic
Incentive Strategy for Voluntary Demand Response based on
TDP Scheme
Haiyan Shu Institute
for Infocomm Research, Rongshan Yu Institute for Infocomm
Research, Susanto Rahardja Institute for Infocomm Research
The enhanced real-time metering and communication capabilities of smart meters and their associated advanced metering infrastructure make it possible for the utility company to extend demand response (DR) to small customers through time-dependent pricing (TDP). Considering the economics and the infrastructure cost, the utility company has to design an incentive scheme to attract traditional flat pricing (FP) users to the TDP scheme. In this process, the utility company may share its revenue from the TDP scheme with the TDP users. It is found that, by properly analyzing the energy procurement cost and user elasticity, a dynamic incentive strategy can be devised for a dual-tariff system in which flat pricing (FP) and TDP coexist. This dynamic incentive strategy gives an appropriate stimulus to users who join the TDP program, while guaranteeing the utility company’s profit at the same time.
Toward
Standards for Model-Based Control of Dynamic Interactions
in Large Electric Power Grids
Qixing Liu Carnegie
Mellon University, Marija Ilic Carnegie Mellon University
This paper is motivated by the recent need to manage possible instabilities between electrically-connected
system components and/or sub-systems (layers) in future electric
energy systems. It is shown that standard state-space models
of general multi-layered energy systems have fundamentally
the same structure which can be expressed in terms of: 1)
state variables representing stand-alone layer (sub-system)
dynamics; and, 2) an interaction variable between the layer
and the rest of the system. Once this is recognized, three
possible structure-based control designs are derived and analyzed
for their performance using a small power system model. The
three control designs considered are: 1) a decentralized component-level
output controller; 2) a decentralized sub-system (control area)
layer output controller; and, 3) a full-state centralized
system-level controller. Pros and cons of these three control
architectures and their implications on three qualitatively
different IT architectures and standards for dynamics in future
electric energy systems are discussed.
OS.21-BioSPS.4
Biomedical Image Acquisition, Reconstruction, and Quantitation
Session Chairs: Richard
Leahy, Justin Haldar, Krishna Nayak Location: Trousdale Estates
Magnetic
Resonance Techniques for Fat Quantification in Obesity
Houchun Hu Children’s
Hospital Los Angeles
As the prevalence of obesity and its comorbidities continues to rise in the United States and worldwide,
robust imaging techniques and accurate post-processing strategies
are critically needed to accurately quantify the distribution
of fat in the human body. Magnetic resonance imaging and spectroscopy
provide a wide array of sensitive methods to assess and characterize
fat in storage locations such as white adipose tissue depots
and “high-health-risk” ectopic sites such as organs and muscles.
Quantitative fat measurements provide useful information to
investigators in preventive medicine who monitor the efficacy
of dietary, exercise, and surgical interventions to combat
weight gain and obesity in longitudinal studies. They are
also useful to clinicians who study the implications of steatosis
and the pathophysiology of fat. The primary aim of this paper
is to provide a technical review of state-of-the-art proton
magnetic resonance methods in human body fat quantification.
The paper will emphasize the fundamental principles with which
several magnetic resonance techniques differentiate lean (water-dominant)
and fatty (fat-dominant) tissues and illustrate with examples
how each method can be appropriately used for fat quantification.
The paper will also briefly summarize post-processing procedures
that are currently in practice for extracting quantitative
fat endpoints, such as adipose tissue depot volume and percent
fat content in organs. Lastly, given its increased attention
in recent literature, the paper will discuss progress in the
imaging of human brown adipose tissue.
Quantitatively
Accurate Image Reconstruction for Clinical Whole-Body PET
Imaging
Evren Asma GE Global
Research, Sangtae Ahn GE Global Research, Hua Qian
GE Global Research, Girishankar Gopalakrishnan GE
Healthcare-Bangalore, Kris Thielemans King's College
London, Steven Ross GE Healthcare, Ravindra Manjeshwar
GE Global Research, Alexander Ganin GE Healthcare
We present a PET image reconstruction
approach that aims for accurate quantitation through model-based
physical corrections and rigorous noise control with clinically
acceptable image properties. We focus particularly on image
generation chain components that are critical to quantitation
such as physical system modeling, scatter correction, patient
motion correction and regularized image reconstruction. Through
realistic clinical datasets with inserted lesions, we demonstrate
the quantitation improvements due to detector point spread
function modeling, model-based single scatter estimation and
the associated object-dependent multiple scatter estimation
and non-rigid patient motion estimation and motion correction.
We also describe a penalized-likelihood (PL) whole-body clinical
PET image reconstruction approach using the relative difference
penalty that achieves superior quantitation over the clinically-widespread
ordered subsets expectation maximization (OSEM) algorithm
while maintaining visual image properties similar to OSEM
and therefore clinical acceptability. We discuss the axial
and in-plane smoothing modulation profiles that are necessary
to avoid large variations in noise and resolution levels.
The overall approach of accurate models for data acquisition, corrections for patient-related effects, and rigorous noise control greatly improves quantitation and, when combined with repeatable imaging protocols, limits quantitation variability to factors related to patient physiology and scanner performance differences.
Surface
Fluid Registration and Multivariate Tensor-Based Morphometry
in Newborns - The Effects of Prematurity on the Putamen
Jie Shi Arizona State
University, Yalin Wang Arizona State University,
Rafael Ceschin Children’s Hospital of Pittsburgh of UPMC,
Xing An Arizona State University, Marvin Nelson Children’s
Hospital Los Angeles and University of Southern California,
Ashok Panigrahy Children’s Hospital of Pittsburgh of UPMC,
Natasha Lepore Children’s Hospital Los Angeles and University
of Southern California
Many disorders that affect the brain
can cause shape changes in subcortical structures, and these
may provide biomarkers for disease detection and progression.
Automatic tools are needed to accurately identify and characterize
these alterations. In recent work, we developed a surface
multivariate tensor-based morphometry analysis (mTBM) to detect
morphological group differences in subcortical structures,
and we applied this method to study HIV/AIDS, Williams syndrome, Alzheimer's disease and prematurity. Here we
will focus more specifically on mTBM in neonates, which, in
its current form, starts with manually segmented subcortical structures from MR images of two subject groups, places
a conformal grid on each of their surfaces, registers them
to a template through a constrained harmonic map and provides
statistical comparisons between the two groups, at each vertex
of the template grid. We improve this pipeline in two ways: first, by replacing the constrained harmonic map with a new fluid registration algorithm that we recently developed; and second, by optimizing the pipeline to study the putamen in newborns.
Our analysis is applied to the comparison of the putamen in
premature and term born neonates. Recent whole-brain volumetric
studies have detected differences in this structure in babies
born preterm. Here we add to the literature on this topic
by zooming in on this structure, and by generating the first
surface-based maps of these changes. To do so, we use a dataset
of manually segmented putamens from T1-weighted brain MR images
from 17 preterm and 18 term-born neonates. Statistical comparisons
between the two groups are performed via four methods: univariate
and multivariate TBM, the commonly used medial axis distance,
and a combination of the last two statistics. We detect widespread
statistically significant differences in morphology between
the two groups.
High
Spatio-Temporal Resolution Dynamic Contrast-Enhanced MRI using
Compressed Sensing
Kyunghyun Sung University
of California, Los Angeles, Manoj Saranathan Stanford
University, Bruce Daniel Stanford University, Brian
Hargreaves Stanford University
Iterative thresholding methods have
been extensively studied as faster alternatives to convex
optimization methods for solving large-sized problems in compressed
sensing MRI. A novel iterative thresholding method, called
LCAMP (Location Constrained Approximate Message Passing),
is presented for reducing computational complexity and improving
reconstruction accuracy when a non-zero location (or sparse
support) constraint can be obtained from view shared images
in dynamic contrast-enhanced MRI (DCE-MRI). LCAMP modifies
the existing approximate message passing algorithm by replacing
the thresholding stage with a location constraint, which avoids
adjusting regularization parameters or thresholding levels.
This work is applied to breast DCE-MRI to demonstrate the
excellent reconstruction accuracy and low computation time
with highly undersampled data.
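To illustrate how a known-support (location) constraint can replace thresholding, here is a generic projected-Landweber sketch; it is not the LCAMP algorithm itself, and the problem sizes are assumptions.

```python
# Iterative reconstruction with a known-support constraint in place of
# thresholding (plain projected Landweber, not LCAMP).
import numpy as np

rng = np.random.default_rng(8)
n, m, k = 256, 96, 10
A = rng.standard_normal((m, n)) / np.sqrt(m)    # undersampled measurement matrix
support = rng.choice(n, k, replace=False)       # assumed known from view-shared images
x_true = np.zeros(n)
x_true[support] = rng.standard_normal(k)
y = A @ x_true

mask = np.zeros(n, dtype=bool)
mask[support] = True
x = np.zeros(n)
for _ in range(200):
    x = x + A.T @ (y - A @ x)          # gradient step
    x[~mask] = 0.0                     # project onto the known support
print("recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```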
Quantitative
Analysis of Myocardial Perfusion Images
Piotr Slomka Cedars-Sinai
Medical Center, Reza Arsanjani Cedars-Sinai Medical
Center, Yuan Xu Cedars-Sinai Medical Center, Daniel
Berman Cedars-Sinai Medical Center, Guido Germano Cedars-Sinai
Medical Center
Myocardial perfusion imaging is a
widely used test for the detection of coronary artery disease.
Automated measurements of perfusion can be obtained from three-dimensional
stress and rest images. The software segments the left ventricle
of the heart and compares image intensities to a normal subject database. In our research, we aim at the reduction and ultimately the elimination of human supervision in this process to improve
overall reproducibility and accuracy for disease detection.
We have developed several methods to this end such as automatic
detection of potentially incorrect contours and direct measurement
of stress-rest changes. Current state-of-the-art analysis
methods demonstrate better reproducibility and similar accuracy
when compared with experienced physicians. We aim to further
improve the diagnostic accuracy by data mining techniques,
combining several extracted image features with clinical information
about the patients. Preliminary results show further improvements
in accuracy, beyond that achieved by expert observers.
Correcting
Susceptibility-Induced Distortion in Diffusion-Weighted MRI
using Constrained Nonrigid Registration
Chitresh Bhushan University
of Southern California, Anand Joshi University of Southern
California, Justin Haldar University of Southern California,
Richard Leahy University of Southern California
Echo Planar Imaging (EPI) is the standard
pulse sequence used in fast diffusion-weighted magnetic resonance
imaging (MRI), but is sensitive to susceptibility-induced
inhomogeneities in the main B0 magnetic field. In diffusion
MRI of the human head, this leads to geometric distortion
of the brain in reconstructed diffusion images and a resulting
lack of correspondence with the high-resolution MRI scans
that are used to define the subject anatomy. In this study, we propose and test an approach to estimate and correct this distortion using a nonlinear registration framework based on mutual information. We use an anatomical image as the registration template and constrain the registration using spatial regularization and physics-based information about the characteristics of the distortion, without requiring any additional data collection.
Results are shown for simulated and experimental data. The
proposed method aligns diffusion images to the anatomical
image with an error of 1-3 mm in most brain regions.
OS.22-WCN.3 System
Design, Architecture, Physical Layer Security in MIMO systems
Session Chair: Y.-W. Peter
Hong Location: Franklin Hills
Data
Detection of Amplify-and-Forward User Cooperation in MIMO
Broadcasting Systems without Channel State Information Feedback
Shih-Jung Lu Academia
Sinica, Ronald Chang Academia Sinica, Wei-Ho Chung
Academia Sinica
In this paper, we consider the broadcasting
of data streams from a multiantenna source to several single-antenna
or multiantenna users in a data broadcasting network. To improve
the received data quality for all single-antenna users simultaneously
while not compromising the data rates of multiantenna users,
we propose a new cooperation scheme among single-antenna users.
Users cooperate in the amplify-and-forward (AF) mode to jointly
detect the spatially multiplexed data streams from the source
with estimated channel state information. Simulation results
verify the effectiveness of the proposed channel estimation
scheme based on very few pilots in conjunction with the maximum
likelihood (ML) detection scheme.
On
Secure Beamforming for Wiretap Channels with Partial Channel
State Information at the Transmitter
Pin-Hsun Lin National
Taiwan University, Shih-Chun Lin NTUST, Szu-Hsiang
Lai National Taiwan University, Hsuan-Jung Su National
Taiwan University
In this paper, we consider secure transmission in ergodic fast-fading multiple-input single-output single-antenna-eavesdropper (MISOSE) wiretap channels with only the statistics of the eavesdropper's channel state information at the transmitter (CSIT). Two kinds of legitimate CSIT are assumed, namely, full and statistical legitimate CSIT.
With full legitimate CSIT, we generalize and optimize the
previously proposed artificial noise (AN) aided secure beamforming
to improve its secrecy rate performance. The AN covariance matrix in our scheme is more flexible than that in the previous scheme, and according to our simulations the region of non-zero secrecy rate is enlarged significantly. For the case with statistical
legitimate CSIT, we further prove that the secure beamforming
is secrecy capacity achieving for the Rayleigh faded channels.
In this case, the AN is not necessary. Extensions to cases
where legitimate receiver and eavesdropper have multiple antennas
will also be discussed.
Secret
Key Generation over Correlated Wireless Fading Channels using
Vector Quantization
Hou-Tung Li National
Tsing Hua University, Yao-Win Peter Hong National Tsing
Hua University
Vector quantization schemes are proposed
in this work to extract secret keys from correlated wireless
fading channels. By assuming that the channel between two terminals is reciprocal, its estimates can be used as the
common randomness for generating secret keys at the two terminals.
Most schemes in the literature assume that channels are independent
over time and utilize scalar quantization on each element
of the estimated channel vector to generate secret key bits.
These schemes are simple to implement but yield high key disagreement
probability (KDP) at low SNR and low key entropy when channels
are highly correlated. In this work, two vector quantization
schemes, namely, the minimum key disagreement probability
(MKDP) and the minimum quadratic distortion (MQD) secret key
generation schemes, are proposed to effectively extract secret
keys from correlated channel estimates. The vector quantizers
are derived using KDP and QD as the respective distortion
measures. To further reduce KDP, each channel vector is first
pre-multiplied by an appropriately chosen unitary matrix to
rotate the vector away from quantization cell boundaries.
The MKDP scheme achieves the lowest KDP but requires high
complexity whereas the MQD scheme yields lower complexity
but at the cost of slightly increased KDP. Computer simulations
are provided to demonstrate the effectiveness of the proposed
vector quantization schemes.
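For intuition, the sketch below generates key bits at two terminals from noisy reciprocal channel estimates with a 1-bit scalar quantizer and measures the key disagreement probability (KDP); the proposed MKDP/MQD vector quantizers are more elaborate than this baseline.

```python
# Toy secret-key generation from reciprocal channel estimates:
# 1-bit scalar quantization at each terminal, then KDP measurement.
import numpy as np

rng = np.random.default_rng(9)
n_trials, snr = 10000, 10.0
h = rng.standard_normal(n_trials)                    # common channel realization
noise_std = np.sqrt(1.0 / snr)
h_a = h + noise_std * rng.standard_normal(n_trials)  # Alice's estimate
h_b = h + noise_std * rng.standard_normal(n_trials)  # Bob's estimate

bits_a = (h_a > 0).astype(int)                       # 1-bit scalar quantizer
bits_b = (h_b > 0).astype(int)
kdp = np.mean(bits_a != bits_b)
print("key disagreement probability:", kdp)
```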
Two-stage
Compensation for Non-ideal Effects in MIMO-OFDM Systems
Chih-Chi Wu National
Cheng Kung University, Ping Ma National Cheng Kung
University, Wun-De Hung National Cheng Kung University,
Chih-Hung Kuo National Cheng Kung University
In this paper, we propose a two-stage estimation and compensation scheme for the carrier frequency offset (CFO) and IQ imbalance in MIMO-OFDM systems. In the first stage, the receiver IQ imbalance is compensated along with CFO estimation. In the second stage, the transmitter IQ imbalance is compensated by taking advantage of the structure of Alamouti space-time block codes. Robust least-squares estimation is applied in all compensation processes. Compared with a conventional system that only compensates receiver IQ imbalance, simulation results show that the BER performance of the proposed system is significantly improved.
On
Blind Sequential Detection of Misbehaving Relay
Yang-Ming Yi National
Sun Yat-sen University, Li-Chung Lo National Sun Yat-sen
University, Wan-Jen Huang National Sun Yat-sen University
Consider a three-node cooperative system in which the relay may misbehave for selfish or adversarial reasons. We propose a blind sequential detection scheme that determines the relay's misbehavior with the fewest observations that meet the detection performance requirements. The likelihood function conditioned on the detected data symbols is derived for three types of misbehavior. The destination accumulates the log-likelihood ratio (LLR) of the received symbols and completes detection once the probabilities of false alarm and miss are both guaranteed to be below the required thresholds. Simulation results show that the proposed scheme demands only a small number of received symbols at SNRs greater than 10 dB.
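The sequential decision logic resembles Wald's sequential probability ratio test (SPRT). Below is a minimal SPRT sketch with Gaussian residuals under the two hypotheses; the misbehavior model and thresholds are illustrative assumptions.

```python
# Minimal SPRT: accumulate per-symbol LLRs, stop at Wald's thresholds.
import numpy as np

rng = np.random.default_rng(10)
alpha, beta = 1e-3, 1e-3
upper = np.log((1 - beta) / alpha)      # accept "misbehaving"
lower = np.log(beta / (1 - alpha))      # accept "honest"

# H0: honest relay, residuals ~ N(0,1); H1: misbehaving, residuals ~ N(0.5,1).
mu1, truth = 0.5, "misbehaving"
llr, n = 0.0, 0
while lower < llr < upper:
    r = rng.normal(mu1 if truth == "misbehaving" else 0.0, 1.0)
    llr += mu1 * r - 0.5 * mu1 ** 2     # log N(r; mu1,1) - log N(r; 0,1)
    n += 1
print("decision:", "misbehaving" if llr >= upper else "honest", "after", n, "symbols")
```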
OS.23-IVM.9 Perception-based
Multimedia Quality Assessment and Processing
Session Chairs: Xiaokang
Yang, Weisi Lin, Zhou Wang, Jingliang Peng Location: Whitley
Heights
An
Improved Full-Reference Image Quality Metric Based on Structure
Compensation
Ke Gu Shanghai Jiao
Tong University, Guangtao Zhai Shanghai Jiao Tong University,
Xiaokang YangShanghai Jiao Tong University, Wenjun
Zhang Shanghai Jiao Tong University
Over the last two decades, image quality assessment has been a major research area that has considerably helped to promote the development of image processing. Following the tremendous success of the Structural SIMilarity (SSIM) index in terms of the correlation between quality predictions and subjective scores, many improved algorithms have been developed, such as Multi-Scale SSIM (MS-SSIM) and Information content Weighted SSIM (IW-SSIM). However, a growing number of researchers have studied the effects of uneven responses to different image distortion categories on the prediction accuracy of quality metrics. Inspired by this, we propose an improved full-reference image quality assessment paradigm based on structure compensation. Experimental results on the Laboratory for Image and Video Engineering (LIVE) database and the Tampere Image Database 2008 (TID2008) confirm that the proposed approach has superior prediction performance compared to mainstream image quality metrics. It is worth emphasizing that our algorithm introduces no additional operators but only applies the SSIM function to compensate itself; furthermore, it also has an effective capability for image distortion classification.
Video
Quality Metric for Consistent Visual Quality Control in Video
Coding
Long Xu University of
Science and Technology Beijing, King Ngi Ngan Chinese
University of Hong Kong, Song Nan Li Chinese University
of Hong Kong, Lin Ma Chinese University of Hong Kong
Visual quality consistency is one of the most important issues in video quality assessment (VQA). When people view a video, they may have an unpleasant perceptual experience if the visual quality of the video frames is inconsistent, even if the average visual quality of the video is not too bad. Thus, consistent visual quality control is highly desirable in real-time video communication. Additionally, in conventional video communication, the channel bandwidth and buffer resources are limited. An unfair distribution of encoding resources among video frames results not only in inconsistent visual quality but also in other types of spatial distortion. In this paper, a new objective visual quality metric (VQM) is first proposed for measuring video quality in video coding. It makes full use of the information available in video coding without extra computational complexity. Second, a visual quality control algorithm is proposed to ensure consistent visual quality of video coding under the given channel and buffer resources. Finally, the experimental results indicate that the proposed VQM correlates well with the human visual system (HVS). In addition, consistent visual quality, better rate-distortion efficiency, accurate bit control, and buffer compliance can be achieved by the proposed visual quality control algorithm.
A
New No-reference Image Quality Assessment Model Based on DCT
Coefficients Distribution and PSNR
Zhengyou Wang Shijiazhuang
Tiedao University, Wan Wang Shijiazhuang Tiedao University,
Zhenxing Li Jiangxi University of Finance & Economics,
Jin Wang Shijiazhuang Tiedao University, Weisi Lin
Nanyang Technological University
Based on the traditional quality metric
PSNR, we propose a new no-reference image quality assessment
model nPSNR (No-reference PSNR) for JPEG compressed images.
The metric operates in the DCT domain and uses the DCT coefficient distribution. It estimates the MSQE (mean-squared
quantization error) of a decoded image with the distributions
of AC coefficients and DC coefficients of the encoded image.
Based on the MSQE, the overall nPSNR value of the image is
calculated. We test the proposed metric on the selected images
from the TID2008 database. Then, the computational scores
(nPSNR) are compared with the ground truth values (MOS) based
on three performance criteria. Experimental results demonstrate
that the proposed metric is more consistent with the subjective
perception than the state-of-the-art full-reference image
quality assessment metrics.
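The two steps of the estimator can be sketched: fit a distribution to the decoded DCT coefficients of each frequency, accumulate the expected squared quantization error, and convert the pooled MSQE to a PSNR-style score. A hedged Python sketch (the Laplacian model for AC coefficients and the uniform-error model for DC are assumptions made for illustration):

import numpy as np

def npsnr(dct_blocks, qtable):
    # dct_blocks: (num_blocks, 8, 8) dequantized DCT coefficients of a JPEG image.
    # qtable: the 8x8 quantization table. Returns an nPSNR-style value in dB.
    msqe = 0.0
    for u in range(8):
        for v in range(8):
            q = float(qtable[u, v])
            if u == 0 and v == 0:
                msqe += q * q / 12.0  # DC: uniform error within the bin (assumption)
                continue
            lam = 1.0 / (np.abs(dct_blocks[:, u, v]).mean() + 1e-9)  # Laplacian fit
            x = np.linspace(-8 * q, 8 * q, 4096)
            p = 0.5 * lam * np.exp(-lam * np.abs(x))      # model density
            err = (x - q * np.round(x / q)) ** 2          # squared quantization error
            msqe += np.sum(p * err) * (x[1] - x[0])       # expected value, numerically
    msqe /= 64.0
    return 10.0 * np.log10(255.0 ** 2 / max(msqe, 1e-9))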
A
Fusion Approach to Video Quality Assessment Based on Temporal
Decomposition
Tsung-Jung Liu University
of Southern California, Weisi Lin Nanyang Technological
University, C.-C. Jay Kuo University of Southern California
In this work, we decompose an input
video clip into multiple smaller intervals, measure the quality
of each interval separately, and apply a fusion approach to
integrate these scores into a final one. To give more details,
an input video clip is first decomposed into smaller units
along the temporal domain, called the temporal decomposition
units (TDUs). Next, for each TDU that consists of a small
number of frames, we adopt a proper video quality metric (specifically,
the MOVIE index in this work) to compute the quality scores
of all frames and, based on sociological findings, choose
the worst scores of TDUs for data fusion. Finally, a regression
approach is used to fuse selected worst scores from all TDUs
to get the ultimate quality score of the input video as a
whole. We conduct extensive experiments on the LIVE video
database, and show that the proposed approach indeed improves
MOVIE and is also competitive with other state-of-the-art
video quality metrics.
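The pipeline lends itself to a short sketch: group per-frame scores into TDUs, keep the worst score of each unit, and fuse with a trained regressor. A minimal Python sketch (TDU length and the linear regressor are assumptions; the paper computes per-frame scores with MOVIE):

import numpy as np
from sklearn.linear_model import LinearRegression

def tdu_worst_scores(frame_scores, tdu_len=32):
    # Split per-frame scores into temporal decomposition units and keep the
    # worst of each; "worst" = largest here, assuming a distortion measure
    # such as MOVIE where larger means worse.
    s = np.asarray(frame_scores, dtype=float)
    n = len(s) // tdu_len
    return s[:n * tdu_len].reshape(n, tdu_len).max(axis=1)

def fuse(X_train, y_train, X_test):
    # Regression fusion: rows of X are fixed-length worst-score vectors,
    # y holds the subjective (MOS-style) labels.
    return LinearRegression().fit(X_train, y_train).predict(X_test)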
Performance
Comparison of Decision Fusion Strategies in BMMF-based Image
Quality Assessment
Lina Jin Tampere University
of Technology, SeongHo Cho University of Southern California,
Tsung-Jung Liu University of Southern California, Karen
Egiazarian Tampere University of Technology, C.-C.
Jay Kuo University of Southern California
The block-based multi-metric fusion
(BMMF) is one of the state-of-the-art perceptual image quality
assessment (IQA) schemes. With this scheme, image quality
is analyzed in a block-by-block fashion according to the block
content type (i.e. smooth, edge and texture blocks) and the
distortion type. Then, a suitable IQA metric is adopted to
evaluate the quality of each block. Various fusion strategies
to combine the QA scores of all blocks are discussed in this
work. Specifically, factors such as the quality score distribution and the spatial distribution of blocks are examined using statistical methods. Finally, we compare the performance of
various fusion strategies based on the popular TID database.
Quality-of-Experience
Perception for Video Streaming Services: Preliminary Subjective
and Objective Results
Khalil ur Rehman Laghari
EMT-INRS, Omneya Issa Communications Research Centre
Canada, Filippo Speranza Communications Research Centre
Canada, Tiago Falk Institut National de la Recherche
Scientifique
Quality-of-Experience (QoE) is a human-centric notion that provides a blueprint of human perception, feelings, needs and intentions, while Quality-of-Service (QoS)
is a technology centric metric used to assess the performance
of a multimedia application and/or network. To ensure superior
video QoE, it is important to understand the relationship
between QoE and QoS. To achieve this goal, we conducted a
pilot subjective user study simulating a video streaming service
over a broadband network with varying distortion scenarios,
namely packet loss (0, 0.5, 1, 3, 7, and 15%), packet reordering
(0, 1, 5, 10, 20, and 30%), and coding bit rates (100, 400,
600, and 800 Kbps). Users were asked to rate their experience
using a subjective quantitative metric (termed Perceived Video
Quality, PVQ) and qualitative indicators of “experience.” Simulation
results suggest a) an exponential relationship between PVQ
and packet loss and between PVQ and packet reordering, and b)
a logarithmic relationship between PVQ and video bit rate.
Similar trends were observed with the qualitative indicators.
Exploratory analysis with two objective video quality metrics
suggests that trends similar to those obtained with the subjective
ratings were obtained, particularly with a full-reference
metric.
OS.24-IVM.10 Visual
Media Data Representation, Retrieval and Recognition
Session Chairs: Jingliang
Peng, Xin-Shun Xu Location: Mt. Olympus
A
User-driven Model for Content-based Image Retrieval
Yi Zhang Tianjin University,
Zhipeng Mo Tianjin University, Wenbo Li Tianjin
University, Tianhao Zhao Tianjin University
The aim of image retrieval systems is to provide retrieved results as close to users’ expectations as possible. However, users’ requirements differ across application scenarios even for the same concept and keywords. In this paper, we introduce a personalized image retrieval model driven by users’ operational history.
In our simulated system, three types of data, which are browsing
time, downloads and grades, are collected to generate a sort
criterion for retrieved image sets. According to the criterion,
the image collection is classified into a positive group,
a negative group and a testing group. Then a SVM classifier
is trained with image features extracted from three groups
and used to refine retrieved results. We test the proposed
method on several image sets. The experimental results show that our model effectively represents users’ demands and helps improve retrieval accuracy.
A
Poselet Based Key Frame Searching Approach in Sports Training
Videos
Wu Lifang Beijing University
of Technology, Zhang Jingwen Beijing University of
Technology, Yan Fenghui Beijing University of Technology
In some sports training applications, it is necessary to search for key frames in training videos for careful analysis. In this paper, we treat key frame searching as a pose estimation problem. First, a set of pose detectors is collected through a two-stage SVM training process, each of which can be interpreted as a learned pose-specific HOG weight classifier. Then we run each linear SVM classifier over the image in a multi-scale scanning mode. In order to resolve the problem of extreme similarity between adjacent frames, the detection hits at every scale in each frame are counted as the criterion for optimal key frame selection. The frame with the most detection hits is chosen as the key frame for the pose detector. Experimental results on weight-lifting training videos show the effectiveness of the proposed approach.
A
Novel Multi-instance Learning Algorithm with Application to
Image Classification
Xiaocong Xi Shandong
University, Xinshun Xu Shandong University, Xiaolin
Wang Shandong University
Image classification is an important
research topic due to its potential impact on both image processing
and understanding. However, due to the inherent ambiguity
of image-keyword mapping, this task becomes a challenge. From
the perspective of machine learning, image classification
task fits the multi-instance learning (MIL) framework very
well owing to the fact that a specific keyword is often relevant
to an object in an image rather than the entire image. In
this paper, we propose a novel MIL algorithm to address the image classification task. First, a new instance prototype extraction method is proposed to construct a projection space for each keyword. Then, each training sample is mapped into this projection space as a point, which converts the MIL problem into a standard supervised learning problem. Finally, an SVM is trained for each keyword. The experimental results on the benchmark data set Corel5k demonstrate that the new instance
prototype extraction method can result in more reliable instance
prototypes and faster running time, and the proposed MIL approach
outperforms some state-of-the-art MIL algorithms.
Hierarchical
Bag-of-Words Model for Joint Multi-View Object Representation
and Classification
Xiang Fu University
of Southern California, Sanjay Purushotham University
of Southern California, Daru Xu University of Southern
California, Jian Li University of Southern California,
C.-C. Jay Kuo University of Southern California
Multi-view object classification is
a challenging problem in image retrieval. One common approach
is to apply the visual bag-of-words (BoW) model to all view
representations of each object class and compare them with
the representation of the query image one by one so as to
determine the closest view of the object class. This approach
offers good matching performance, yet it demands a large amount
of computation and storage space. To address these issues,
we propose a novel hierarchical BoW model that provides a concise representation of each object class across multiple views.
When the higher-level BoW representation does not match that of the query instance, further comparison can be skipped.
We can also incorporate similar views to reduce the storage
space. We conduct experiments on a dataset of 3D object classes,
and show that the proposed approach achieves higher efficiency
in terms of lower computational complexity and storage space
while preserving good matching performance.
An
Analysis of Eating Activities for Automatic Food Type Recognition
Hyun-Jun Kim Samsung
Electronics, Mira Kim Samsung Electronics, Sun-Jae
Lee Samsung Electronics, Young Sang Choi Samsung
Electronics
Nowadays, chronic diseases such as type 2 diabetes and cardiovascular disease are considered among the most serious threats to a healthy life. These
kinds of diseases are primarily caused by an unhealthy lifestyle
including lack of exercise, irregular meal patterns and abuse
of addictive substances such as alcohol, caffeine and nicotine.
Therefore, observing our daily lives is crucial in developing
interventions to reduce the risk of lifestyle diseases. In
order to manage and predict progression of diseases of a patient,
objective measurement of lifestyle is essential. However, self-reporting questionnaires and interviews have limitations due to human error and the difficulty of administering them. In this paper, we analyze users' eating activities and their constituent sub-actions to develop an eating activity recognition system based on a wrist band with an embedded tri-axial accelerometer. By analyzing the actions within eating activities, we can improve the accuracy of eating activity recognition and also provide clues for identifying the type of food.
A
New Hybrid PCNN for Multi-Objects Image Segmentation
Zhenbo Li China Agricultural
University
Many image-based applications, such as multi-object tracking, are hampered by the problem of robust multi-object image segmentation. In this paper, we propose a new hybrid Pulse Coupled Neural Network (PCNN) method for multi-object segmentation. First, we use two saliency detection methods, Graph-Based Visual Saliency (GBVS) and Spectral Residual (SR), to find a more accurate object region (R1) and a more complete set of object regions (R2), respectively. Then an improved PCNN is used to work out the multiple objects from R1 and R2. A statistic of R1 is selected as an adaptive generator threshold for the PCNN and as a selection standard for the segmentation result, while R2 determines the correct number of objects in the image. Experiments on images selected from BSD and VOC and on two full image datasets (MSRC v2 and Weizmann) show that our method obtains a more correct object count and more accurate object regions than GBVS-PCNN [1] and adaptive PCNN [2].
Wednesday, December 5, 2012 (10:50 - 12:30)
|
OS.25-IVM.11 Recent
Topics in Image, Video, and Multimedia Processing (I)
Session Chair: Yi-Chong
Zeng Location: Doheny
3D
Shape Retrieval from a 2D Image as Query
Masaki Aono Toyohashi
University of Technology, Hiroki Iwabuchi Toyohashi
University of Technology
3D shape retrieval has gained popularity in recent years, yet preparing a 3D shape ourselves as the query input is difficult. An easier form of query input for 3D shape search is therefore much desired. In this paper, we propose a new method for defining a feature vector for 3D shape retrieval from a single 2D photographic image. Our feature vector is defined as a combination of Zernike moments and HOG (Histogram of Oriented Gradients), both of which can be extracted from a 2D image as well as from a 3D shape model. Comparative experiments demonstrate that our approach is very promising and effective as an initial clue for finding the relevant 3D shape model we have in mind.
A
Large-Scale Shape Benchmark for 3D Object Retrieval: Toyohashi
Shape Benchmark
Atsushi Tatsuma Toyohashi
University of Technology, Hitoshi Koyanagi Toyohashi
University of Technology, Masaki Aono Toyohashi University
of Technology
In this paper, we describe the Toyohashi Shape Benchmark (TSB), a publicly available new database of polygonal models collected from the World Wide Web. It consists of 10,000 models, which is, to our knowledge, the largest collection of 3D shape models used for benchmark testing. TSB includes 352 categories with labels and can be used for both 3D shape retrieval and 3D shape classification. Formerly, the best-known 3D shape benchmark has been the PSB, or Princeton Shape Benchmark, consisting of 1,814 models, half of which serve as training data and the remaining half as testing data; TSB is approximately six times larger than PSB. Unlike textual data such as the TREC and NTCIR collections, 3D shape repositories have suffered from a shortage of data and from the difficulty of testing the scalability of any algorithm that works on top of a given benchmark data set. In addition to the TSB, we propose a new shape descriptor which we call DB-VLAT (Depth-Buffered Vector of Locally Aggregated Tensors). In comparisons on the TSB, we demonstrate that our new shape descriptor exhibits the best search performance among the known programs to which we have had access on the Internet, including the Spherical Harmonic Descriptor and the Light-Field Descriptor. We believe that the TSB can be a step toward a next-generation 3D shape benchmark with a massive 3D data collection, and hope it will serve many purposes in both academia and industry.
Quality
Assessment of Finger-vein Image
Huafeng Qin Chongqing
University, Sheng Li Nanyang Technological University,
Alex Chichung Kot Nanyang Technological University,
Lan Qin Chongqing University
In this paper, we propose a novel quality assessment method for finger-vein images for quality control purposes. First of all, we divide a finger-vein image into
a set of non-overlapping blocks. In order to detect the local
vein patterns, each block is projected into the Radon space
using an average Radon transform. A local quality score is
estimated for each block according to the curvature in the
corresponding Radon space, based on which a global quality
score of the finger-vein is computed and assessed. Experimental
results show that our approach can effectively identify the
low quality finger-vein images, which is also helpful in improving
the performance of a finger-vein recognition system.
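A hedged Python sketch of the per-block scoring step (block size, the angle set, the second-difference curvature proxy, and mean pooling are all illustrative assumptions):

import numpy as np
from skimage.transform import radon

def block_quality(block, angles=np.arange(0.0, 180.0, 15.0)):
    # Project the block into Radon space (plain Radon transform here; the
    # paper uses an average Radon transform) and score how sharply the
    # local vein pattern shows up.
    sino = radon(block.astype(float), theta=angles, circle=False)
    return np.abs(np.diff(sino, n=2, axis=0)).max()  # curvature proxy

def global_quality(image, bs=32):
    h, w = image.shape
    scores = [block_quality(image[i:i + bs, j:j + bs])
              for i in range(0, h - bs + 1, bs)
              for j in range(0, w - bs + 1, bs)]
    return float(np.mean(scores))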
Preserving
Features in Multilevel Halftones
Lai-Yan Wong Hong Kong
Polytechnic University, Yuk-Hee Chan Hong Kong Polytechnic
University
Conventional threshold decomposition
(TD) based multilevel halftoning algorithms decompose an input
image into layers, halftone them sequentially with a binary
halftoning algorithm, and combine their binary halftones to
produce the final multilevel halftone. When these algorithms
are exploited to produce multilevel halftones, bright spatial
features are generally difficult to preserve as darker pixels
in the final multilevel output are positioned first. In this paper, we propose a solution to this problem. Simulation results show that the proposed method can provide an output of better quality compared with conventional TD-based algorithms.
Automatic
Recognition of Frame Quality Degradation For Inspection of
Surveillance Camera
Yi-Chong Zeng Institute
for Information Industry, Miao-Fen Chueh Institute
for Information Industry, Chi-Hung Tsai Institute for
Information Industry
When a surveillance camera breaks down, frame quality degrades directly. Because quality degradation happens only occasionally, it is difficult for people to become aware of it immediately. With the aim of automatically inspecting surveillance cameras, we propose an automatic method to recognize frame quality degradation. Seven features are extracted based on four kinds of measures, i.e., mean of structure similarity, variation of intensity difference, minimum of block correlation, and average color. These measures react differently to different degradations. Subsequently, linear discriminant analysis (LDA) is applied to the extracted features to train classifiers. Six classes are recognized in this work, including signal missing, color missing, local alteration, global alteration, periodic intensity change, and normal status. After performing degradation recognition, we determine whether the surveillance camera works normally or not. The experimental results demonstrate that the proposed method is capable of recognizing degradations as well as inspecting surveillance cameras.
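The training step is standard LDA over the seven features; a minimal sklearn sketch (the feature extraction itself is specific to the paper and omitted here):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: one 7-dimensional feature vector per frame (from the four measures);
# y: one of the six classes listed above.
def train_recognizer(X, y):
    return LinearDiscriminantAnalysis().fit(X, y)

# status = train_recognizer(X_train, y_train).predict(X_new)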
OS.26-SLA.10 Immersive
Audio and Cloud-assisted Audio Processing
Session Chairs: Woon-Seng
Gan, Yu RongShan, Ee-Leng Tan Location: Beachwood
Theories
and Signal Processing Techniques for the Implementation of
Sound Ball in Space Using Loudspeaker Array
Jung-Woo Choi KAIST,
Yang-Hann Kim KAIST
It is known that we can make a certain region contain more acoustic energy than others. This can be achieved by utilizing loudspeaker arrays and designing a multichannel filter that effectively controls the interference of sound waves in space and time. The concept of a bright and dark zone [Choi and Kim, “Generation of an acoustically bright zone with an illuminated region using multiple sources,” J. Acoust. Soc. Am., Vol. 111(4), 1695-1700, Apr. 2002] showed that this idea can be realized in practice. We then attempted to make a “sound ball” that utilizes the concept of bright and dark zones to generate a small spatial region with concentrated sound energy inside. A 32-speaker system surrounding the zone of interest was used to implement a ball that can be positioned and also moved. However, it was also found that the solution based on bright and dark zone control does not, in a strict sense, guarantee effective radiation of sound from the ball. Understanding this inherent limitation motivated us to design a novel means of obtaining a sound ball that radiates effectively. This requires solving the well-known Kirchhoff-Helmholtz integral equation for the case in which the source or sources are surrounded by an array of speakers. The sound ball is implemented by using a 50-channel spherical loudspeaker array, in which the loudspeakers are positioned on the Lebedev quadrature grid.
Cloud-based
Audio Fingerprinting Service
Wenyu Jiang Institute
for Infocomm Research, Yongwei Zhu Institute for Infocomm
Research, Xiaoming Bao Institute for Infocomm Research,
Rongshan Yu Institute for Infocomm Research
Audio Fingerprinting allows the identification
of a query audio clip by matching the query audio fingerprints
against a reference database. Traditionally, the matching
process, which is CPU and memory intensive, is implemented
either on a single computer (which is confined by CPU and
memory limits for large databases), or on a computer cluster
in a proprietary manner (which has limited flexibility in
scaling the database). We have implemented audio fingerprinting
prototype software that can run in a Cloud environment, specifically Hadoop/MapReduce. Because the MapReduce framework is designed
for stream data processing instead of database query, we discuss
how we address this challenge as well as other challenges
such as appropriate data input format and partitioning. A
performance evaluation of the software on a real dataset of
~8500 songs and real Hadoop clusters is presented to illustrate
its efficacy, where a batch query of 1000 60sec clips can
be completed in ~50sec in addition to ~30sec of database loading
time with a 12-node cluster configuration.
Repeating
Segment Detection in Songs using Audio Fingerprint Matching
Regunathan Radhakrishnan
Dolby Laboratories Inc., Wenyu Jiang Institute for
Infocomm Research
We propose an efficient repeating
segment detection approach that doesn't require computation
of the distance matrix for the whole song. The proposed framework
first extracts audio fingerprints for the whole song. Then, for each time step in the song, we perform a query to match a sequence
of M fingerprint codewords against the fingerprints of the
rest of the song. In order to find a match for the first fingerprint
query, a search tree data structure is built with the fingerprints
of the rest of the song. For subsequent fingerprint queries
for the rest of the song, the matching process dynamically
updates the search tree data structure to exclude the M fingerprint
codewords corresponding to each time step. For each matching
segment, we record the time offset from the query segment.
Following the matching process for the whole song, we compute
the histogram of the number of matching segments for each
offset. The peaks in this histogram correspond to offsets at which matches were found more often than others and can be used to pick out a set of repeating segments.
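A hedged Python sketch of the matching-and-histogram idea (a hash index stands in for the paper's dynamically updated search tree, and M is a placeholder):

import numpy as np
from collections import defaultdict

def repeat_offsets(fp, M=64):
    # fp: 1-D sequence of per-frame fingerprint codewords.
    index = defaultdict(list)
    for t in range(len(fp) - M + 1):
        index[tuple(fp[t:t + M])].append(t)
    hist = defaultdict(int)
    for t in range(len(fp) - M + 1):
        for m in index[tuple(fp[t:t + M])]:
            if abs(m - t) >= M:      # skip the trivial self-match
                hist[m - t] += 1     # vote for the query-to-match offset
    return hist                      # peaks indicate repeating segments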
Streaming
of Scalable Multimedia over Content Delivery Cloud
Xiaoming Bao Institute
for Infocomm Research, Rongshan Yu Institute for Infocomm
Research
Content Delivery Cloud (CDC) extends
Content Delivery Network (CDN) to provide elastic, scalable
and low cost services to the customers. For multimedia streaming
over CDC, caching the media content onto the edge server from
storage cloud is commonly used to minimize the latency of
content delivery. It is very important for a CDN to balance the resources being used (storage space, bandwidth, etc.) against the performance achieved. Commercial CDNs (such as Akamai, Limelight, and Amazon CloudFront) have their own proprietary
caching algorithms to deal with this issue. In this paper,
we propose a method to further improve the efficiency of the
caching system for scalable multimedia contents. Specifically,
we notice that a scalable multimedia content can be flexibly
truncated to lower bit rates on-the-fly based on the available
network bandwidth between the edge server and the end users. Therefore, it may not be necessary to cache such content at its highest quality/rate. Based on this observation, we show that the edge server can decide on an optimized truncation ratio
for the cached scalable multimedia contents to balance between
the quality of the media and the resource usage. The proposed
optimized truncation algorithm is analyzed and its efficacy
in improving the efficiency of the caching system is justified with simulation results.
Spatial
Sound Reproduction Using Conventional and Parametric Loudspeakers
Ee-Leng Joseph Tan Nanyang
Technological University, Woon-seng Gan NTU, Chiu-Hao
Chen Nanyang Technological University
The auditory image of a movie or game
scene can be decomposed into point-like sources and diffused
sources for effective and accurate audio synthesis. By embedding
appropriate visual and audio cues into objects in a 2D or
3D visual scene, an immersive and engaging experience can
be created. While there have been many recent breakthroughs in display technology, such as ultra high-definition (UHD) and 3D displays, conventional sound systems (stereo, 5.1, etc.) are still being used. Such an audio-visual setup may
degrade the overall experience. This degradation is directly
linked to the dispersive nature of the conventional loudspeaker,
and the rendered auditory image may be perceived to lack sharpness
in the spatial imaging due to the reverberant nature of a
room. This drawback tends to lead to comparably poor synthesis
of point-like sources as compared to diffused sources in the
rendered auditory image. On the other hand, the rendered auditory
image from a directional loudspeaker, such as the parametric
loudspeaker, may seem to lack spaciousness and sound envelopment
due to very little influence of the acoustics of a room. Therefore,
a directional loudspeaker is suitable for rendering point-like sources, but not diffused sources. In this paper, we propose a unique sound system which comprises conventional loudspeakers and parametric loudspeakers. This setup exploits the high
directivity of the parametric loudspeakers to render sharp
auditory images while producing the diffused sources of the auditory image using the conventional loudspeakers.
Interactive
3D Audio Rendering in Flexible Playback Configurations
Jean-Marc Jot DTS, Inc.
Interactive object-based 3D audio
spatialization technology has become commonplace in personal
computers and game consoles. While its primary current application
is 3-D game sound track rendering, it is ultimately necessary
in the implementation of any personal or shared immersive
virtual world (including multi-user communication and telepresence).
The successful development and deployment of such applications
in new mobile or online platforms involves maximizing the
plausibility of the synthetic 3D audio scene while minimizing
the computational and memory footprint of the audio rendering
engine. It also requires a flexible, standardized scene description
model to facilitate the development of applications targeting
multiple platforms. This paper presents a general computationally
efficient 3D positional audio and environmental reverberation
rendering engine applicable to a wide range of loudspeaker
or headphone playback configurations.
OS.27-SPS.2 Advances
in Circuits and Systems for Multimedia Processing and Analysis
(I)
Session Chairs: Takeshi
Ikenaga, Tse-Wei Chen Location: Runyon
A
Low Power ASIP for Precision Configurable FFT Processing
Yifan Bo Fudan University,
Jun Han Fudan University, Yao Zou Fudan University,
Xiaoyang Zeng Fudan University
The fast Fourier transform (FFT) is a key operation in digital communication systems, and different communication standards require various FFT lengths and precisions. In this paper, we present a low-power Application-Specific Instruction-set Processor (ASIP) supporting variable lengths (16-point to 4096-point) and bit precisions (8-bit to 16-bit) to meet these different requirements. We use scalable multipliers to construct the butterfly unit, which supports both 8-bit and 16-bit operation. The order of the butterfly operations is adjusted to reduce twiddle-factor ROM accesses, which reduces overall power consumption efficiently. Clock gating is implemented to shut down parts of the processor's pipeline during the FFT process to meet special low-power demands. Special instructions are tailored to make full use of the flexible hardware.
SIFT-Based
Low Complexity Keypoint Extraction and Its Real-Time Hardware
Implementation for Full-HD Video
Takahiro Suzuki Waseda
University, Takeshi Ikenaga Waseda University
Scale-Invariant Feature Transform (SIFT) has lately attracted attention in computer vision as a robust keypoint detection algorithm that is invariant to scale, rotation and illumination change. However, its computational complexity is too high for practical real-time applications. This paper proposes a low-complexity keypoint extraction algorithm based on the SIFT descriptor and the utilization of a database, together with its real-time hardware implementation for Full-HD resolution video. The proposed algorithm computes the SIFT descriptor at keypoints obtained by corner detection and selects a scale from the database. This makes it possible to parallelize the keypoint detection and descriptor computation modules in hardware, since these modules do not depend on each other in the proposed algorithm, in contrast with SIFT, which computes a scale. The processing time of descriptor computation in this hardware is independent of the number of keypoints because descriptor generation is pipelined at the pixel level. Evaluation results show that the proposed algorithm in software is 12 times faster than SIFT. Moreover, the proposed hardware on an FPGA is 427 times faster than SIFT and 61 times faster than the proposed algorithm in software. The proposed hardware performs keypoint extraction and matching at 60 fps for Full-HD video.
Halo
Artifacts Reduction Method for Variational based Realtime
Retinex Image Enhancement
Hiroshi Tsutsui Kyoto
University, Satoshi Yoshikawa Osaka University,
Hiroyuki Okuhata Synthesis Corporation, Takao Onoye
Osaka University
In this paper, we propose a novel
halo reduction method for variational based Retinex image
enhancement. In variational based Retinex image enhancement,
a cost function is designed based on the illumination characteristics.
The enhanced image is obtained by extracting the illumination
component, which gives minimum cost, from the given input
image. Although this approach gives good enhancement quality at a low computational cost, a problem known as the halo artifact, in which dark regions near edges remain dark after enhancement, still exists. In order to suppress such artifacts
effectively, the proposed method adaptively adjusts the parameter
of the cost function, which influences the trade-off relation
between reducing halo artifacts and preserving image contrast.
The proposed method is applicable to an existing realtime
Retinex image enhancement hardware implementation.
Design
and Analysis of a Many-Core Processor Architecture for Multimedia
Applications
Jyu-Yuan Lai National
Tsing Hua University, Po-Yu Chen National Tsing Hua
University, Ting-Shuo Hsu National Tsing Hua University,
Chih-Tsun Huang National Tsing Hua University, Jing-Jia
Liou National Tsing Hua University
We present a design of a many-core processor architecture with superior cost-effectiveness to fulfill the rapidly increasing demand for high-speed embedded multimedia applications. The prototype platform consists of sixteen processor
cores and a 4-by-4 mesh-based duplex network interconnection
with external memory. The hardware and software interface
in a bare-metal environment, i.e., without an Operating System
(OS), has been emphasized in our architecture. An on-chip
communication library is developed for practical parallel
applications. In addition, we propose two memory-based file
handling approaches to manipulate files with the lack of file-system
support by OS. Our file handling approach can effectively
reduce the minimum requirement of local memory without page
swapping for each core from 4 MB to 64 KB in a case study
of JPEG encoding. Furthermore, the analysis of instruction
and data caches is addressed for the trade-off between area
and speed. The experimental result indicates that our many-core
platform with its application development infrastructure is
efficient in delivering cost-effective multimedia applications
in a bare-metal environment.
An
Edge-based Adaptive Image Interpolation and Its VLSI Architecture
Hongbin Sun Xi’an Jiaotong
University, Fengwei Zhang Xi’an Jiaotong University,
Nanning Zheng Xi’an Jiaotong University
The design of high-quality yet real-time image interpolation has become increasingly important for digital TV SoCs, as flat-panel display resolutions steadily increase toward 4K*2K definition in the very near future. This paper aims to develop a real-time image interpolation algorithm that can achieve high-ratio image scaling with sharp and natural edges. Compared with conventional image interpolation approaches, which often suffer from either blurring/jaggedness or high computational cost, this paper proposes a highly efficient edge-directional image interpolation approach that supports multiple interpolation directions and hence preserves details and edges well. Experimental results show that the proposed image interpolation algorithm achieves high quality at high image scaling ratios while incurring only a very low computational cost. The VLSI architecture of the proposed image interpolation is also presented.
OS.28-SLA.11 Recent
Advances in Audio and Acoustic Signal Processing (II)
Session Chairs: Yoshinobu
Kajikawa, Woon-Seng Gan Location: Laurel
Psychoacoustic
Active Noise Control System
Tongwei Wang NTU,
Woon-seng Gan NTU, Yong-Kim Chong NTU
In practical active noise control
(ANC) applications, it is difficult to completely cancel out
the undesired noise. In order to improve the user's comfort
level in a noisy environment, a psychoacoustic ANC system
that incorporates masking techniques is proposed in this paper.
A two-stage approach of performing ANC, followed by masking
the residual noise with carefully selected masking signal,
is used to enhance the user's listening experience. In
order to mask the residual noise effectively, an automatic
gain controller (AGC) is used to give different gains to the
masking signal according to the residual noise level. A new mechanism to control the gain of the AGC is proposed so that the system can produce a comfortable listening experience for the user. The proposed hybrid ANC-masking system also takes into account any uncorrelated noise that is only captured
by the error microphone. Computer simulations are conducted
to show the superior performance of the proposed psychoacoustic
ANC system. Two real-signal cases are also considered in this
paper to test the effectiveness of combining ANC with masking.
Network-Based
Multi-Channel Signal Processing Using the Precision Time Protocol
Yoshifumi Chisaki Kumamoto
University, Dan Murakami Kumamoto University, Tsuyoshi
Usagawa Kumamoto University
A conventional microphone array system uses wired connections from each microphone to an input via an amplifier. In contrast, wireless transmission makes the array configuration flexible and is expected to enable a wide range of novel applications, such as impulse response measurement over a wide area. Not only in acoustics research but also in areas such as remote sensing of body motion, data acquisition with precise time synchronization is essential. When a signal received at a distributed microphone’s position is sent over a computer network, the data are packetized and can carry additional information from the receiving point. When a timestamp is included with the data at the acquisition position, that time depends on each acquisition client device's clock. Since it is possible to use the Network Time Protocol or the Precision Time Protocol, multichannel signal processing can be achieved with ease. This paper proposes a system for transmitting multiple signals over a computer network with an embedded time code used to synchronize those signals. The signal from a client is reconstructed at a server using the time code. Since the time differences between clients affect the performance of multichannel signal processing, a smaller time error at each client is preferred. This paper discusses how the time error between channels affects the performance of the distributed microphone array system.
Content/Context-Adaptive
Feature Selection for Environmental Sound Recognition
EnShuo Tsau University
of Southern California, Sachin Chachada University
of Southern California, C.-C. Jay Kuo University of
Southern California
Environmental sound recognition (ESR)
is a challenging problem that has gained much attention in recent years. A large number of audio features have been adopted for solving the ESR problem. In this work, we
focus on the problem of automatic feature selection. Specifically,
we propose two methods, called the content-adaptive and the
context-adaptive feature selection schemes to achieve this
goal. Finally, the superior performance of the proposed feature
selection methods is demonstrated when they are applied to
a medium-sized environmental database with a simple Bayesian
network classifier.
Blind
Depth Estimation Based on Primary-To-Ambient Energy Ratio
for 3-D Acoustic Depth Rendering
Se-Woon Jeon Yonsei University,
Dae Hee Youn Yonsei University, Young-cheol Park Yonsei
University
Since the advent of 3-D video, the
acoustic depth rendering for the proximity effect has been
an issue of great interest. In this study, we propose an algorithm
for estimating acoustic depth cues from a stereo audio signal, without a priori knowledge of the source-to-listener geometry and room environment. We employ principal component analysis
(PCA) to estimate the acoustic depth based on the primary-to-ambient energy ratio (PAR), which is related to the front-back movement of the sound source. For acoustic depth rendering, the distance variation of the sound source is parameterized by tracking the estimated depth cue. The proposed estimation algorithm was evaluated using stereo audio clips extracted from a real 3-D movie, and the results confirmed the effectiveness of the proposed acoustic depth estimation algorithm.
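A minimal Python sketch of the PCA-based PAR estimation (frame length, hop and the eigenvalue-based primary/ambient split are assumptions):

import numpy as np

def par_track(left, right, frame=1024, hop=512):
    # Frame-wise PCA on the stereo pair: the dominant eigenvalue of the 2x2
    # channel covariance is taken as primary energy, the other as ambient.
    pars = []
    n = min(len(left), len(right))
    for s in range(0, n - frame, hop):
        x = np.stack([left[s:s + frame], right[s:s + frame]])
        w = np.linalg.eigvalsh(np.cov(x))          # ascending eigenvalues
        pars.append(w[1] / (w[0] + 1e-12))         # primary-to-ambient ratio
    return np.array(pars)                          # depth cue tracked over frames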
Theoretical
Framework for Stochastic Modeling of FxLMS-based Active Noise
Control Dynamics
Iman Tabatabaei Ardekani
University of Auckland, Waleed Abdulla University
of Auckland
There have been several contributions on the theoretical modeling of FxLMS-based active noise control systems; however, deriving elegant closed-form expressions for the dynamical behavior of these systems has required a number of simplifying assumptions regarding the acoustic noise, the actual secondary path and its model. This paper develops a dynamic model for FxLMS-based ANC systems that considers a general stochastic acoustic noise and a general secondary path. Also, an arbitrary secondary path model, which is not necessarily a perfect model, is considered. The main distinction of this model is that previously derived dynamic models follow from it as special cases.
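For reference, the baseline algorithm whose dynamics are being modeled is the standard FxLMS update; a minimal Python sketch (filter lengths, step size and the FIR paths are placeholders, and s_hat deliberately need not equal s):

import numpy as np

def fxlms(x, d, s, s_hat, L=64, mu=1e-3):
    # x: reference noise, d: disturbance at the error mic,
    # s: true secondary path (FIR), s_hat: its possibly imperfect model.
    w = np.zeros(L)                   # adaptive control filter
    xbuf = np.zeros(L)                # recent reference samples
    ybuf = np.zeros(len(s))           # recent anti-noise samples
    fbuf = np.zeros(len(s_hat))       # reference history for s_hat
    fxbuf = np.zeros(L)               # filtered-reference samples
    e = np.zeros(len(x))
    for n in range(len(x)):
        xbuf = np.roll(xbuf, 1); xbuf[0] = x[n]
        ybuf = np.roll(ybuf, 1); ybuf[0] = w @ xbuf      # anti-noise
        e[n] = d[n] - s @ ybuf                           # residual error
        fbuf = np.roll(fbuf, 1); fbuf[0] = x[n]
        fxbuf = np.roll(fxbuf, 1); fxbuf[0] = s_hat @ fbuf
        w += mu * e[n] * fxbuf                           # FxLMS update
    return e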
OS.29-SIPTM.2
Recent Topics on Signal Processing in Noisy Environments
Session Chairs: Kiyoshi
Nishikawa, Arata Kawamura Location: Trousdale Estates
Low-complexity
Approximate LMMSE Channel Estimation for OFDM Systems
Shuichi Ohno Hiroshima
University, Emmanuel Manasseh Hiroshima University
A low-complexity linear minimum mean square error (LMMSE) channel estimator is proposed for orthogonal frequency-division multiplexing (OFDM) systems over frequency-selective channels. Using the law of large numbers, we approximate the LMMSE estimator to reduce the numerical complexity of the channel estimation. Our estimator exhibits performance comparable to the LMMSE estimator at low SNR but suffers from a performance floor due to the approximation, as verified by numerical simulations.
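For context, the exact frequency-domain LMMSE smoother being approximated has the textbook form below; the paper's contribution is a law-of-large-numbers approximation of it, which is not reproduced here:

import numpy as np

def lmmse_channel_estimate(h_ls, R_hh, snr_lin, beta=1.0):
    # h_ls: least-squares (pilot-divided) per-subcarrier channel estimates.
    # R_hh: channel frequency-correlation matrix; beta = 1 for
    # constant-modulus pilots (an assumption).
    N = len(h_ls)
    W = R_hh @ np.linalg.inv(R_hh + (beta / snr_lin) * np.eye(N))
    return W @ h_ls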
Mixture
Structure of Kernel Adaptive Filters for Improving the Convergence
Characteristics
Kiyoshi Nishikawa Tokyo
Metropolitan University, Hiroya Nakazato Tokyo Metropolitan
University
In this paper, we propose a mixture structure of linear and kernel adaptive filters to improve the convergence characteristics of the kernel normalized least mean square (KNLMS) adaptive algorithm. The proposed method is based on the concept of the affine constrained mixture structure for linear normalized LMS adaptive filters, which uses two or more adaptive filters concurrently. We derive the proposed structure and its implementation method, and confirm the effectiveness of the proposed method through computer simulations.
Pink
Noise Whitening Method for Pitch Synchronous LPC Analysis
Liu Liqing Saitama University,
Shimamura Tetsuya Saitama University
We present a new noise whitening method for pitch-synchronous LPC analysis under pink noise conditions. First, we use a rectangular window to extract two frames whose shift interval is a full pitch period. Then we subtract one frame from the other to obtain a new noise signal that is considered not to be corrupted by the voiced speech signal. The obtained noise signal can be used to design a new predictive whitening filter. The new whitening filter not only whitens the pink noise but also preserves the vocal tract and formant characteristics of the voiced speech signal. Using the whitened signal, we can improve the pitch synchronous addition and subtraction (PSAS) method under pink noise conditions. We discuss the properties of the whitened signal and the PSAS method. Experimental results indicate the effectiveness of the proposed method.
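A hedged Python sketch of the whitening-filter design (rectangular windows, Yule-Walker fitting and the LPC order p are illustrative choices):

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def whitening_filter(sig, start, T0, N, p=10):
    # Subtract two N-sample frames one pitch period T0 apart; the difference
    # is treated as a speech-free noise sample, from which a p-th order LPC
    # whitening filter A(z) is fitted.
    noise = sig[start:start + N] - sig[start + T0:start + T0 + N]
    r = np.correlate(noise, noise, 'full')[N - 1:N + p]   # lags 0..p
    a = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])        # Yule-Walker
    return np.concatenate(([1.0], -a))                    # A(z) coefficients

# whitened = lfilter(whitening_filter(x, s0, T0, N), [1.0], x)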
Self-Interference
Canceller for Full-Duplex Radio Relay Station Using Virtual
Coupling Wave Paths
Kazunori Hayashi Kyoto
University, Yasuo Fujishima Kyoto University, Megumi
Kaneko Kyoto University, Hideaki Sakai Kyoto University,
Riichi Kudo NTT Corporation, Tomoki Murakami NTT Corporation
This paper considers a coupling wave canceller for a full-duplex radio relay station using an adaptive antenna array. Taking advantage of the fact that the coupling waves to be cancelled at the relay station consist of its own past transmitted signals, we propose a beamforming method that uses not only the received signals at the actual antenna elements but also virtual received signals, which are generated in the relay station with artificial channel impulse responses, that is to say, virtual coupling wave paths. With this approach, the proposed method can eliminate coupling waves without increasing the number of actual antenna elements, even when the number of coupling wave paths is large due to high-speed communications. Computer simulation results show that the proposed method achieves coupling wave cancellation with a smaller number of antenna elements.
An
Adaptive MAP Speech Spectral Amplitude Estimator Combined
With a Zero Phase Noise Suppressor
Sayuri Kohmura Osaka
University, Arata Kawamura Osaka University, Youji
Iiguni Osaka University
We previously proposed an efficient MAP (maximum a posteriori) speech spectral amplitude estimator for stationary noise suppression. Although this method can strongly reduce stationary noise, it cannot reduce impulsive noise such as thunder, claps, and other impact noises. On the other hand, we also previously proposed a zero phase noise suppression method for impulsive noise reduction, whose effectiveness was confirmed through simulations. In this paper, we combine these two effective noise reduction methods to obtain a noise suppressor that can remove both stationary and impulsive noise. We evaluate its noise reduction capability for several types of noise. The simulation results show the effectiveness of the proposed noise suppression method.
OS.30-SIPTM.3
Control, Optimization and Information Processing for Smart
Grid (II)
Session Chairs: Anthony
Kuh, Urbashi Mitra, Anna Scaglione Location: Franklin Hills
An
Optimal Dynamic Pricing and Schedule Approach in V2G
Yi Han University of
Maryland, Yan Chen University of Maryland, Feng
Han University of Maryland, K. J. Ray Liu University
of Maryland
The Smart Grid (SG) can greatly improve the efficiency and reliability of the traditional grid. As a promising feature of the future SG, the Vehicle-to-Grid (V2G) technique exhibits great potential to balance the supply and demand of electrical power as well as to integrate renewable energy. Recently, some V2G-based schemes have been proposed to leverage the energy-storage capability of electric vehicles (EVs) to effectively reduce the energy loss caused by supply-demand mismatches. However, most existing schemes rely on the assumption that the charge station is profit-neutral, lacking adequate incentives for charge stations to deploy them widely. In this paper, we investigate a scenario where the charge station is modelled as an entity driven by its own profit. We formulate the interactions between the charge station and multiple EVs as a game in which two kinds of EVs, cooperative EVs and selfish EVs, are considered. Accounting for the intelligence of selfish EVs, a dynamic pricing scheme over multiple time slots is developed from the charge station’s perspective to maximize its own profit. Both theoretical analysis and simulation results show that through our scheme of dynamic pricing for selfish EVs and charging scheduling for cooperative EVs, the charge station can maximize its profit while the EVs maximize their utilities.
Power
Grid Vulnerability Measures to Cascading Overload Failures
Zhifang Wang Virginia
Commonwealth University, Anna Scaglione University
of California, Davis, Robert Thomas Cornell University
Cascading failure in power grids has long been recognized as a severe security threat to the national economy and society; it happens infrequently but can cause severe consequences. The causes of cascading phenomena can be extremely complicated due to many different and interacting mechanisms such as transmission overloads, protection equipment failures, transient instability, and voltage collapse. In the literature, a number of vulnerability measures for cascading failures have been proposed to identify the most critical components in the grid and to evaluate the damage caused by the removal of such components from the grid. In this paper, we propose a novel power grid vulnerability measure, the minimum safety time after one line trip, defined based on the stochastic cascading failure model [1]. We compare its performance with several other vulnerability measures through a set of statistical analyses.
Quickest
Detection of Unknown Power Quality Events for Smart Grids
Xingze He University
of Southern California, Man-On Pun Huawei Technologies,
C.-C. Jay Kuo University of Southern California
In this work, we study a change-point approach to provide the quickest detection of power quality (PQ) event occurrences for smart grids. Although both the occurrence time and the PQ event type are unknown beforehand, knowledge of the statistics of the post-event signal is required to implement the change-point approach. To circumvent this obstacle, we propose to model the unknown PQ events using different statistical distributions, namely the Gaussian, Gamma and inverse Gamma distributions. It is shown by computer simulation that all distributions under consideration can provide accurate PQ event detection. In particular, the inverse Gamma distribution demonstrates the most promising performance in our simulations.
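The change-point machinery can be sketched with Page's CUSUM test under an assumed post-change density (the threshold and densities below are placeholders; the paper's own statistic may differ in detail):

import numpy as np
from scipy import stats

def cusum_alarm(x, f0, f1, h=10.0):
    # f0, f1: pre- and assumed post-change densities, e.g. fitted Gaussian,
    # Gamma or inverse-Gamma pdfs. Returns the earliest alarm index.
    g = 0.0
    for n, xn in enumerate(x):
        g = max(0.0, g + np.log(f1(xn) + 1e-300) - np.log(f0(xn) + 1e-300))
        if g > h:
            return n
    return None

# e.g. f0 = stats.norm(0.0, 1.0).pdf; f1 = stats.invgamma(a=3.0).pdf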
Some
Problems in Demand Side Management
Lingwen Gan California
Institute of Technology, Libin Jiang California Institute
of Technology, Steven Low California
Institute of Technology, Ufuk Topcu California
Institute of Technology, Changhong Zhao California
Institute of Technology
We present a sample of problems in
demand side management in future power systems and illustrate
how they can be solved in a distributed manner using local
information. First, we consider a set of users served by a
single load-serving entity (LSE). The LSE procures capacity
a day ahead. When random renewable energy is realized at delivery
time, it manages user load through real-time demand response
and purchases balancing power on the spot market to meet the
aggregate demand. Hence optimal supply procurement by the
LSE and the consumption decisions by the users must be coordinated
over two timescales, a day ahead and in real time, in the
presence of supply uncertainty. Moreover, they must be computed
jointly by the LSE and the users since the necessary information
is distributed among them. We present distributed algorithms
to maximize expected social welfare. Instead of social welfare,
the second problem is to coordinate electric vehicle charging
to fill the valleys in aggregate electric demand profile,
or track a given desired profile. We present synchronous and
asynchronous algorithms and prove their convergence. Finally,
we show how loads can use locally measured frequency deviations to adapt their demand in real time in response to a shortfall in supply. We design a decentralized demand response mechanism that, together with the swing equations of the generators, minimizes the disutility of demand rationing in a decentralized manner.
Scale
Invariance and Long-Range Dependence in Smart Energy Grids
Marco Levorato University
of Southern California, Urbashi Mitra University of
Southern California
The shift from the traditional energy grid to the SmartGrid makes the features of scale invariance and long-range dependence, traditionally examined in the context of communication networks, extremely relevant to the modeling,
analysis and design of modern energy grids. The present paper
reviews mathematical concepts and tools central for the understanding
and analysis of these phenomena and contextualizes them to
the energy scenario. The framework proposed herein enables,
in addition to a more accurate modeling and design of smart
energy grids, the definition of novel algorithms for the detection
of events, e.g., anomalies, in SmartGrids.
OS.31-IVM.12 Image/Video
Retrieval and Multimedia Applications
Session Chair: Ming-Sui
Lee Location: Whitley Heights
Social
Album: Linking and Merging Online Albums based on Social Relationship
Kai-Yin Cheng National
Taiwan University, Tzu-Hao Kuo National Taiwan University,
Yu-Ting Wong National Taiwan University, Ming-Sui Lee
National Taiwan University, Bing-Yu Chen National
Taiwan University
This work designs a novel prototype system, Social Album, which utilizes social relationship data to link and merge individuals' online albums. Field study results indicate that co-event albums involving more than one participating individual constitute the majority of online albums. Two different views are designed based on feedback from the interviews: the indexing view provides a metro-map-like overview of the linked albums, while the browsing view allows individuals to peruse photos without looking at misaligned and duplicate photos from merged albums. Hence, with our Social Album system, sharing and gathering co-event photos becomes much easier than before, and browsing the photos in co-event albums becomes more efficient while still preserving the comprehensiveness of the whole event. Finally, a user study demonstrates the usefulness of the proposed system.
An
Efficient VLSI Architecture of Parallel Bit Plane Encoder
Based on CCSDS IDC
Yi Lu Xidian University,
Jie Lei Xidian University, Yunsong Li Xidian University
The Bit-Plane Encoder (BPE) is the key part of CCSDS-IDC; it encodes the coefficients of the 2-D Discrete Wavelet Transform (DWT) and is commonly considered the bottleneck in terms of throughput performance and hardware resource consumption. An efficient VLSI architecture of the BPE implemented with parallel and pipeline technology is proposed in this paper. In this architecture, all bit planes of each DWT coefficient can be encoded simultaneously, and pipelining is utilized in three functional parts of the bit plane coding. The proposed architecture has been implemented on a Xilinx FPGA; its throughput is improved threefold while its resource consumption is only about a quarter of that of published architectures.
Video
Instance Search for Embedded Marketing
Ting-Chu Lin Academia
Sinica, Jau-Hong Kao Industrial Technology Research
Institute, Chin-Te Liu Industrial Technology Research
Institute, Chia-Yin Tsai National Taiwan University,
Yu-Chiang Frank Wang Academia Sinica
With the rise of online sharing platforms
such as YouTube, advertisers become more interested in providing
relevant advertisements (ads) when the embedded products are
presented in videos during broadcast, so that the number of
hits and potential customers will be increased. Given the
product image of interest, we present a framework which allows
the advertisers or video deliverers to automatically detect
the embedded products throughout the video, so that relevant
ads or latest product information can be delivered to the
viewers accordingly. We advance the boundary preserving dense
local regions (BPLR) as the local descriptors for the query
and each video frame, and utilize different types of features
to describe the local region. To make our framework robust
yet efficient, we reduce the search space by applying the
technique of inverted index, and we propose a probabilistic
framework to identify the video frames in which the product
of interest is presented. Experiments on TRECVID, commercial,
and movie datasets confirm the effectiveness of our proposed
framework.
Learning
Sparse Dictionaries for Saliency Detection
Karen Guo National Tsing
Hua University, Hwann-Tzong Chen National Tsing Hua
University
We present a new method of predicting
the visually salient locations in an image. The basic idea
is to use the sparse coding coefficients as features and find
a way to reconstruct the sparse features into a saliency map.
In the training phase, we use the images and the corresponding
fixation values to train a feature-based dictionary for sparse
coding as well as a fixation-based dictionary for converting
the sparse coefficients into a saliency map. In the test phase,
given a new image, we can get its sparse coding from the feature-based
dictionary and then estimate the saliency map using the fixation-based
dictionary. We evaluate our results on two datasets with the
shuffled AUC score and show that our method is effective in
deriving the saliency map from sparse coding information.
Voting-Based
Depth Map Refinement and Propagation for 2D to 3D Conversion
Yu-Hsiang Chiu National
Taiwan University, Ming-Sui Lee National Taiwan University,
Wei-Kai Liao The White Rabbit Entertainment
In this paper, we propose a voting-based filter capable of enhancing the quality of a depth map sequence and interpolating missing frames. The main concept is that if the information in the filter window is not consistent, only the majority decides the output; the minority is treated as outliers. Compared with other depth map refinement methods based on the joint bilateral filter, the outlier detection of the proposed scheme ensures that only appropriate information is involved, so that halo effects are avoided. Moreover, in order to refine a sequence of depth maps, a space-time filtering extension is proposed. This extension refines each depth map according to the information of several adjacent frames rather than only one frame. As a result, the proposed method is capable of interpolating missing depth maps of a sequence, and errors in the depth maps are successfully reduced. The experimental results demonstrate that the voting-based filter not only provides a depth map sequence with good quality and temporal consistency but also, due to its flexibility, supports several post-processing operations for depth maps in 2D to 3D conversion.
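A hedged Python sketch of the single-frame voting step (window size, the tolerance that defines the majority, and averaging the majority values are assumptions; the space-time extension pools windows from adjacent frames in the same way):

import numpy as np

def voting_filter(depth, win=5, tol=8.0):
    # Values within tol of the window median form the majority and decide
    # the output; the minority is treated as outliers and excluded.
    h, w = depth.shape
    r = win // 2
    pad = np.pad(depth.astype(float), r, mode='edge')
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + win, j:j + win].ravel()
            med = np.median(patch)
            majority = patch[np.abs(patch - med) <= tol]
            out[i, j] = majority.mean()
    return out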
OS.32-SLA.12 Recent
Topics in Speech Processing (I)
Session Chair: Hiroshi
Saruwatari Location: Mt. Olympus
Hierarchical
Prosodic Boundary Prediction For Uyghur TTS
Hamdulla Askar Xinjiang
University, Guljamal Mamateli Xinjiang University,
Askar Rozi Xinjiang University, Imam Sayyare Xinjiang
University
Correct prosodic boundary prediction is crucial for the quality of synthesized speech. This paper presents the prosodic hierarchy of the Uyghur language, which is agglutinative. A two-layer bottom-up hierarchical approach based on conditional random fields (CRFs) is used for predicting prosodic word (PW) and prosodic phrase (PP) boundaries. In order to disambiguate the confusion between different prosodic boundaries at punctuation sites, a CRF-based prosodic boundary determination model is used and integrated with the bottom-up hierarchical approach. The word suffix feature is considered useful for prosodic boundary prediction and is added to the feature sets. The experimental results show that the proposed method successfully resolves the confusion between different prosodic boundaries and consequently further enhances the accuracy of prosodic boundary prediction.
A
Research of Dependencies Between Frequency Components and
Speaker Characteristics
Hyon Songgun Tianjin
University, Wang Hongcui Tianjin University, Jianguo
Wei Tianjin University, Jianwu Dang JAIST/Tianjin
University
This paper proposes a new speaker
feature extraction method, which is based on the non-uniformly
distributed speaker information in frequency bands. In order
to discard the linguistic information effectively, in this
study, we first examine the differences of the distribution
of individual information in the frequency region when a speaker
utters different phonemes. Then we adopt an improved F-ratio,
a phoneme-mean F-ratio, to measure the dependencies between
frequency components and individual characteristics. Based on
the results of this analysis, we adopt an adaptive frequency
filter to extract more discriminative features. The new feature
was combined with GMM speaker models and applied to a speaker
recognition database of 50 persons. The experiments show that
the error rate using the proposed feature is reduced by 28.5%
compared with the F-ratio feature, and by 68.02% compared with
the MFCC feature.
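A minimal sketch of a phoneme-mean F-ratio computation follows, assuming band-energy features grouped by phoneme and speaker; the exact normalization in the paper may differ.

# Minimal sketch: per-band F-ratio averaged over phonemes.
# `feats[p][s]` holds an (N_frames x N_bands) array for phoneme p, speaker s.
import numpy as np

def phoneme_mean_f_ratio(feats):
    ratios = []
    for per_speaker in feats.values():                      # one phoneme at a time
        means = np.array([f.mean(axis=0) for f in per_speaker.values()])
        within = np.mean([f.var(axis=0) for f in per_speaker.values()], axis=0)
        between = means.var(axis=0)                         # speaker-to-speaker spread
        ratios.append(between / (within + 1e-12))
    return np.mean(ratios, axis=0)                          # one score per frequency band

Bands with high scores carry more speaker-discriminative information and would be emphasized by the adaptive frequency filter.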
Development
of Note-Taking Support System with Speech Interface
Kohei Ota University
of Yamanashi, Hiromitsu Nishizaki University of Yamanashi,
Yoshihiro Sekiguchi University of Yamanashi
This paper describes a note-taking
support system with a speech interface. To solve problems
with existing note-taking methods, we implemented a speech
interface consisting of a combination of a touch panel and
graphical user interface in a note-taking support system.
As a system user listens to a speech, the content of the speech
is recognized and displayed on the system’s screen. Users
can take notes by simply touching or tracing the words automatically
displayed on the screen. In addition, the system can support
keyboard and handwritten input to cope with speech recognition
errors. The developed system was experimentally compared with
another note-taking method, a text editor on a personal computer.
Most of the subjects could take notes more quickly using
the system than using the text editor. The effectiveness of
the system was demonstrated in the experiment.
An
Online Evaluation System for English Pronunciation Intelligibility
for Japanese English Learners
Hiroshi Kibishi Toyohashi
University of Technology, Seiichi Nakagawa Toyohashi
University of Technology
We have previously proposed a statistical
method for estimating pronunciation proficiency and intelligibility
of presentations delivered in English by Japanese speakers.
In an offline test, we also evaluated possibly-confused pairs
of phonemes that are often mispronounced by Japanese native
speakers. In this study, we developed an online evaluation
system for English spoken by Japanese speakers using offline
techniques and carried out an evaluation of its effectiveness
based on experimental results. The results showed
that both the objective and subjective evaluations improved
when using this system.
Real-Time
Semi-Blind Speech Extraction with Speaker Direction Tracking
on Kinect
Yuji Onuma Nara Institute
of Science and Technology, Noriyoshi Kamado Nara Institute
of Science and Technology, Hiroshi Saruwatari Nara
Institute of Science and Technology, Kiyohiro Shikano
Nara Institute of Science and Technology
In this paper, speech recognition
accuracy improvement is addressed for ICA-based multichannel
noise reduction in a spoken-dialogue robot. First, to achieve
high recognition accuracy for the early utterance of the target
speaker, we introduce a new rapid ICA initialization method
combining robot image information and a prestored initial
separation filter bank. From this image information, an ICA
initial filter fitted to the user’s direction can be used
to save the user’s first utterance. Next, a new permutation
solving method using a probability statistics model is proposed
for realistic sound mixtures consisting of point-source speech
and diffuse noise. We implement these methods using user tracking
on Microsoft Kinect and evaluate them in speech recognition
experiments in a real environment. The experimental results
show that the proposed approaches can markedly improve the
word recognition accuracy.
Wednesday, December 5, 2012 (14:00 - 15:40)
|
OS.33-SLA.13 Fundamental
Technologies in Modern Speech Processing (I)
Session Chairs: Sadaoki
Furui, Li Deng Location: Doheny
Survey
on Approaches to Speech Recognition in Reverberant Environments
Takuya Yoshioka NTT
Corporation, Armin Sehr University of Erlangen-Nuremberg,
Marc Delcroix NTT Corporation, Keisuke Kinoshita NTT
Corporation, Roland Maas University of Erlangen-Nuremberg,
Tomohiro Nakatani NTT Corporation, Walter Kellermann
University of Erlangen-Nuremberg
This paper overviews the state of
the art in reverberant speech processing from the speech recognition
viewpoint. First, it points out that the key to successful
reverberant speech recognition is to account for long-term
dependencies between reverberant observations obtained from
consecutive time frames. Then, a diversity of approaches that
exploit the long-term dependencies in various ways is described,
ranging from signal and feature dereverberation to acoustic
model compensation tailored to reverberation. A framework
for classifying those approaches is presented to highlight
similarities and differences between them.
Recent
Developments in Large Vocabulary Continuous Speech Recognition
George Saon IBM T. J.
Watson Research Center, Jen-Tzung Chien National Chiao
Tung University
This paper overviews a series of recent
approaches to front-end processing, acoustic modeling, language
modeling, and back-end search and system combination which
have made contributions to large vocabulary continuous speech
recognition (LVCSR) systems. These approaches include the
feature transformations, speaker-adaptive features, and discriminative
features in front-end processing, the feature-space and model-space
discriminative training, deep neural networks, and speaker
adaptation in acoustic modeling, the backoff smoothing, large-span
modeling, and model regularization in language modeling, and
the system combination, cross-adaptation, and boosting in
search and system combination. Some future directions for
LVCSR research are also addressed.
Microphone
Array Processing for Distant Speech Recognition: Towards Real-World
Deployment
Kenichi Kumatani Disney
Research, Takayuki Arakawa NEC, Kazumasa Yamamoto
Toyohashi University of Technology, John McDonough
Carnegie Mellon University/Voci Technologies, Inc.,
Bhiksha Raj Carnegie Mellon University, Rita Singh Carnegie
Mellon University, Ivan Tashev Microsoft Research
Distant speech recognition (DSR) holds
out the promise of providing a natural human computer interface
in that it enables verbal interactions with computers without
the necessity of donning intrusive body- or head-mounted devices.
Recognizing distant speech robustly, however, remains a challenge.
This paper provides a overview of DSR systems based on microphone
arrays. In particular, we present recent work on acoustic
beamforming for DSR, along with experimental results verifying
the effectiveness of the various algorithms described here;
beginning from a word error rate (WER) of 14.3% with a single
microphone of a 64-channel linear array, our state-of-the-art
DSR system achieved a WER of 5.3%, which was comparable
to that of 4.2% obtained with a lapel microphone. Furthermore,
we report the results of speech recognition experiments on
data captured with a popular device, the Kinect. Even for
speakers at a distance of four meters from the Kinect, our
DSR system achieved acceptable recognition performance on
a large vocabulary task, a WER of 24.1%, beginning from a
WER of 42.5% with a single array channel.
Exploiting
Speech Production Information for Automatic Speech and Speaker
Modeling and Recognition -- Possibilities and New Opportunities
Vikram Ramanarayanan University
of Southern California, Prasanta Ghosh IBM Research
India, Adam Lammert University of Southern California,
Shrikanth Narayanan University of Southern California
We consider the potential for incorporating
direct, or inferred, speech production knowledge in speech
technology development. We first review the technologies that
can be used to capture speech articulation information. We
discuss how meaningful (speech and speaker) representations
can be derived from articulatory data thus captured and further
how they can be estimated from the acoustics in the absence
of these direct measurements. We present some applications
that have used speech production information to further the
state-of-the-art in automatic speech and speaker recognition.
We also offer an outlook on how such knowledge and applications
can in turn inform scientific understanding of the human speech
communication process.
OS.34-IVM.13 Recent
Topics in Image, Video, and Multimedia Processing (II)
Session Chair: Koichi Shinoda
Location: Beachwood
Color-Tone
Similarity on Digital Images
Hisakazu Kikuchi Niigata
University, Heikki Huttunen Tampere University of Technology,
Junghyeun Hwang Niigata University, Masahiro Yukawa
Niigata University, Shogo Muramatsu Niigata University,
Jaeho Shin Dongguk University
A color-tone similarity index (CSIM)
between two color images is presented and another index, picture
similarity index (PSIM), is also given for a comprehensive
similarity comparison between color images. CSIM is defined
by a statistical analysis of cumulative histograms in a hue-oriented
color space. It characterizes the color distributions, while
the existing structural similarity index reflects the spatial
structure involved with grayscale images. The behaviors of
CSIM are checked by comparisons of color code chips. Experimental
results are given. The proposed indexes, combined with SSIM,
are expected to provide a tool for color image quality analysis
(IQA).
Efficient
Model Training for HMM-based Person Identification by Gait
Muhammad Aqmar Tokyo
Institute of Technology, Koichi Shinoda Tokyo Institute
of Technology, Sadaoki Furui Tokyo Institute of Technology
In gait-based person identification,
statistical methods such as hidden Markov models (HMMs) have
proved to be effective. Their performance often degrades,
however, when the amount of training data for each walker
is insufficient. In this paper, we propose walker adaptation
and walker adaptive training, where the data from the other
walkers are effectively utilized in the model training. In
walker adaptation, maximum likelihood linear regression (MLLR)
is used to transform the parameters of the walker-independent
model to those of the target walker model. In walker adaptive
training, we effectively exclude the inter-walker variability
from the walker-independent model. In our evaluation, our
methods improved the identification performance even when
the amount of data was extremely small.
Real-Time
Both Hands Tracking Using CAMshift with Motion Mask and Probability
Reduction by Motion Prediction
Ryosuke Araki Waseda
University, Takeshi Ikenaga Waseda University,
Seiichi Gohshi Kogakuin University
Hand gesture interfaces are more intuitive
and convenient than traditional interfaces, and they are a central
part of the relationship between users and devices. Hand tracking
for hand gesture interfaces is an active area of research in image
processing. However, previous works have limitations such as
requiring multiple cameras or sensors, or working only against a
single-color background. This paper proposes a real-time algorithm
for tracking both hands based on “CAMshift (Continuous Adaptive
Mean Shift Algorithm)”, using only a single camera against
multi-color backgrounds. In order to track hands robustly, the
proposed algorithm uses a “motion mask” to combine color and
movement probability distributions, and “probability reduction”
for multi-hand tracking in unconstrained environments. Experimental
results demonstrate that this algorithm can precisely track both
hands of an operator against multi-color backgrounds and process
VGA-size input sequences from a web camera in real time (about 25 fps).
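A minimal OpenCV sketch of CAMshift tracking gated by a motion mask follows. The frame-differencing mask and its threshold are a stand-in for the paper's movement probability distribution, and the hue histogram `hist` is assumed to have been learned from an initial hand region.

# Minimal sketch: CAMshift on a color back-projection masked by motion.
import cv2
import numpy as np

def track_hand(frame, prev, window, hist):
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    color_prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # Frame differencing as a crude movement probability (an assumption,
    # not the paper's exact formulation).
    motion = cv2.absdiff(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY),
                         cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY))
    _, motion_mask = cv2.threshold(motion, 15, 255, cv2.THRESH_BINARY)
    prob = cv2.bitwise_and(color_prob, color_prob, mask=motion_mask)
    crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    box, window = cv2.CamShift(prob, window, crit)   # rotated box + next window
    return box, window

Running one tracker per hand and zeroing each hand's probability inside the other's window is one way to realize the "probability reduction" idea.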
High
Contrast Tone-mapping and its Application for Two-layer High
Dynamic Range Coding
Takao Jinno Toyohashi
University of Technology, Hiroya Watanabe University
of Kitakyushu, Masahiro Okuda University of Kitakyushu
Many applications for High Dynamic
Range (HDR) images require tone-mapping operations that preserve
details across the whole luminance range. This paper proposes a
high-contrast tone-mapping operator using multi-scale contrast
enhancement and applies it to high-efficiency two-layer HDR coding.
To visualize minute details, high-contrast tone-mapping operators
often apply strong enhancement, which degrades compression
efficiency in many conventional two-layer coding methods. In
contrast, our method achieves both high contrast and high
compression efficiency. Moreover, it can perform two types of
tone-mapping, generating images with either strong enhancement
or a natural look. The validity of our methods is shown through
experimental results.
Gradient-based
Global Features and Its Application to Image Retargeting
Izumi Ito Tokyo Institute
of Technology
We propose gradient-based global features
and their application to image retargeting. The proposed features
are used to build an importance map for image retargeting, which
represents the rough location of salient objects in an image.
We focus on areas, rather than points and lines, as the parts to
be assigned importance. Information about areas in multiple layers
provides the global features. Experimental results, compared with
state-of-the-art salient features for image retargeting,
demonstrate the effectiveness of the proposed features.
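One plausible reading of such an importance map is sketched below: gradient magnitudes blurred at several scales, so that point and line responses are spread into area-level (global) features. The scales are illustrative, not the paper's.

# Minimal sketch: multi-scale gradient-based importance map.
import numpy as np
from scipy import ndimage

def importance_map(img, scales=(1, 4, 16)):
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    mag = np.hypot(gx, gy)
    # Blurring at several scales turns point/line responses into
    # area-level layers; summing the layers gives the importance map.
    layers = [ndimage.gaussian_filter(mag, s) for s in scales]
    imp = sum(layers)
    return imp / (imp.max() + 1e-12)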
Finding
Canonical Views by Measuring Features on the Viewing Plane
Wencheng Wang Institute
of Software, CAS, Liming Yang Institute of Software,
CAS, Dongxu Wang Institute of Software, CAS
Canonical views refer to the
classical three-quarter views of a 3D object, generally preferred
by human beings because they are stable and produce
more meaningful and understandable images for the viewer.
Unlike existing methods to measure features in the 3D space
for view selection, this paper proposes to measure features
on the viewing plane, taking into account the influence of
feature deformation due to perspective projection on view
evaluation. Meanwhile, we try to make more features perceptible,
instead of displaying preferred features more prominently as
existing methods aim to do. As a result, we can effectively obtain
canonical views with only geometric computation, without the
troublesome semantic computation that existing techniques for
obtaining good views always require.
OS.35-SIPTM.4
Recent Advances in Sparse and Nonlinear Adaptive Signal Processing
Session Chairs: Mrityunjoy
Chakraborty, Tokunbo Ogunfunmi Location: Runyon
An
Alternative Kernel Adaptive Filtering Algorithm for Quaternion-Valued
Data
Tokunbo Ogunfunmi Santa
Clara University, Thomas Paul Santa Clara University
Nonlinear adaptive filters are becoming
more common and are especially useful where the performance of
linear adaptive filters is unacceptable. Application areas include
communications, image processing and biological systems.
Quaternion-valued data have also drawn recent interest in various
areas of statistical signal processing, including adaptive
filtering, image pattern recognition, and the modeling and tracking
of motion. The benefits of quaternion-valued processing include
performing data transformations in a 3- or 4-dimensional space
more conveniently than with vector algebra. In this paper we
describe an alternative kernel adaptive filter for quaternion-valued
data, which we refer to as the involution Quaternion Kernel Least
Mean Square (iQuat-KLMS) algorithm. The approach
is based on the Quaternion KLMS (Quat-KLMS) algorithm obtained
previously, as well as the recently developed involution gradient
(i-gradient). A modified HR Calculus for Hilbert Space is
used for finding cost function gradients defined on a quaternion
RKHS. Simulation tests with a synthetic quaternion channel
are used to verify the benefit of iQuat-KLMS in convergence
compared to Quat-KLMS.
Subspace
Based Blind Sparse Channel Estimation
Kazunori Hayashi Kyoto
University, Hiroki Matsushima Kyoto University,
Hideaki Sakai Kyoto University, Elisabeth Carvalho
Aalborg University, Petar Popovski
Aalborg University
The paper proposes a subspace-based
blind sparse channel estimation method using L1-L2 optimization,
replacing the L2-norm minimization in the conventional
subspace-based method with L1-norm minimization.
Numerical results confirm that the proposed method can significantly
improve the estimation accuracy for the sparse channel, while
achieving the same performance as the conventional subspace
method when the channel is dense. Moreover, the proposed method
enables us to estimate the channel response with unknown channel
order if the channel is sparse enough.
An
Efficient Iterative Method for Basis Pursuit Adaptive Filters
for Sparse Systems
Steven Grant Missouri
S&T, Pratik Shah Missouri S&T, Jacob Benesty
University of Quebec
The “proportionate” family of adaptive
filters has been in use over the past decade. Their fast convergence
for sparse systems makes them particularly useful in the network
echo canceller application. Recently, an iterative form of
the proportionate affine projection algorithm (PAPA), derived
from the basic principles of basis pursuit, has been shown
to have remarkably fast convergence for such sparse systems.
The number of samples for convergence is proportional to the
sparseness of the system which means that often full convergence
occurs in fewer samples than the length of the system's
impulse response. Here, we introduce a lower complexity implementation
with the same performance that is an iterative version of
proportionate normalized least mean squares (PNLMS).
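For reference, a minimal sketch of a single PNLMS-style update follows (the NLMS counterpart of the iterative PAPA above); the step size mu, regularization delta, and proportionate floor rho are illustrative values.

# Minimal sketch: one PNLMS update for weights w, input vector x, desired d.
import numpy as np

def pnlms_update(w, x, d, mu=0.5, delta=0.01, rho=0.01):
    e = d - w @ x                                   # a priori error
    g = np.maximum(np.abs(w), rho * max(delta, np.abs(w).max()))
    g = g / g.sum()                                 # proportionate per-tap gains
    gx = g * x
    w += mu * e * gx / (x @ gx + delta)             # normalized proportionate step
    return w, e

Large taps receive large effective step sizes, which is what gives proportionate filters their fast initial convergence on sparse systems.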
Sparse
Recovery from Convolved Output in Underwater Acoustic Relay
Networks
Sunav Choudhary University
of Southern California, Urbashi Mitra University of
Southern California
This paper explores criteria for unique
recovery from blind deconvolution under sparsity priors. Additionally
regularizing functions stemming from this problem framework
are developed. For key cases, it is possible to ensure unique
recoverability given the regularized problem statement. The
uniqueness results are informed by a matrix completion-based
viewpoint of blind deconvolution. Furthermore, this perspective
enables characterization of why blind deconvolution with two
sparse inputs is an inherently hard problem. Two blind deconvolution
algorithms are proposed which do not rely on alternating between
estimating one input signal while holding the other constant.
Evaluation of the algorithms is done via simulation
and shown to significantly outperform a previously proposed
method. Furthermore, a numerical illustration of recovery failure
for input signals whose sparsity does not satisfy the recovery
constraints is also provided.
A
Zero Attracting Proportionate Normalized Least Mean Square
Algorithm
Rajib Lochan Das Indian
Institute of Technology, Mrityunjoy Chakraborty Indian
Institute of Technology
The proportionate normalized least
mean square (PNLMS) algorithm, a popular tool for sparse system
identification, achieves fast initial convergence by assigning
independent step sizes to the different taps, each being proportional
to the magnitude of the respective tap weight. However, once
the active (i.e., non-zero) taps converge, the speed of convergence
slows down as the effective step sizes for the inactive (i.e.,
zero or near zero) taps become progressively less. In this
paper, we try to improve both the convergence speed and
the steady-state excess mean square error (EMSE) of the PNLMS
algorithm by introducing an l1-norm (of the coefficients)
penalty in the cost function, which introduces a so-called
zero-attractor term in the PNLMS weight update recursion.
The zero attractor induces further shrinkage of the coefficients,
especially of those corresponding to the inactive taps, and thus
arrests the slowing down of the convergence of the PNLMS algorithm,
apart from bringing down the steady-state EMSE. We have also
modified the cost function further, generating a reweighted zero
attractor that helps confine the “zero attraction” to the
inactive taps only.
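A minimal sketch of the zero-attracting step follows: a PNLMS update followed by a reweighted l1 shrinkage term; rho_za and eps are illustrative values.

# Minimal sketch: zero-attracting PNLMS (ZA-PNLMS) update.
import numpy as np

def za_pnlms_update(w, x, d, mu=0.5, delta=0.01, rho=0.01, rho_za=1e-4, eps=0.1):
    e = d - w @ x
    g = np.maximum(np.abs(w), rho * max(delta, np.abs(w).max()))
    g = g / g.sum()
    gx = g * x
    w += mu * e * gx / (x @ gx + delta)             # standard PNLMS step
    # Reweighted zero attractor: shrinkage is strongest for near-zero
    # (inactive) taps and negligible for large (active) ones.
    w -= rho_za * np.sign(w) / (1.0 + eps * np.abs(w))
    return w, e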
OS.36-SLA.14 Recent
Advances in Signal Processing/Filter Applications
Session Chair: Xiangui
Kang Location: Laurel
A
Comb Filter with Adaptive Notch Gain for Periodic Noise Reduction
Yosuke Sugiura Osaka
University, Arata Kawamura Osaka University, Youji
Iiguni Osaka University
A comb filter is used to eliminate
a periodic noise signal from an observed signal. For extracting
the desired signal, one of the most important factors of the
comb filter is the notch gain, which controls how much of the
observed signal is eliminated at the noise frequencies. Conventional
comb filters employ a pre-designed notch gain under the assumption
that the appropriate notch gain is known. Unfortunately, in
many practical situations, the appropriate notch gain is unknown
and often changes. In this paper, we propose a new comb filter
with the adaptive notch gain to automatically achieve the
appropriate notch gain. In the proposed method, we utilize
an adaptive line enhancer (ALE) instead of the conventional
notch gain multiplier. When the ALE completely estimates the
periodic noise signal, the ALE’s frequency response directly
gives the appropriate notch gain. Simulation results show
the effectiveness of the proposed adaptive comb filter.
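A minimal sketch of the ALE idea follows, assuming the noise period P (in samples) is known: an LMS predictor operating on a P-delayed reference estimates the periodic noise, which is then subtracted from the observation. Filter length and step size are illustrative.

# Minimal sketch: adaptive line enhancer used to cancel periodic noise.
import numpy as np

def ale_comb(x, P, taps=32, mu=1e-3):
    w = np.zeros(taps)
    y = np.zeros_like(x)
    for n in range(P + taps, len(x)):
        ref = x[n - P:n - P - taps:-1]   # delayed reference (decorrelated from speech)
        noise_est = w @ ref              # ALE output: the periodic component
        e = x[n] - noise_est             # enhanced (noise-suppressed) signal
        w += mu * e * ref                # LMS adaptation
        y[n] = e
    return y

When the ALE converges, its frequency response approximates the appropriate notch gain at each noise harmonic, which is the role of the fixed multiplier it replaces.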
A
Directional and Shift-Invariant Transform Based on M-channel
Rational-Valued Cosine-Sine Modulated Filter Banks
Seisuke Kyochi University
of Kitakyushu, Taizo Suzuki Nihon University, Yuichi
Tanaka Tokyo University of Agriculture and Technology
This paper proposes a directional
and shift-invariant transform based on M-channel rational-valued
cosine-sine modulated filter banks (R-CSMFBs) for the practical
implementation on hardware devices. M-channel CSMFBs can be
easily designed by the modulation of a prototype filter and
achieve a good stopband attenuation. In addition, in our previous
work, the directionality and the shift-invariance of CSMFBs
have been theoretically clarified. Thus, they can be an alternative
to the dual-tree complex wavelet transform (DTCWT), one of the most
popular directional and shift-invariant transforms. In this paper,
it is shown that the proposed lifting-based structure of the
R-CSMFB can also achieve rich directional selectivity and
shift-invariance even when the lifting coefficients are rounded
to rational values. Finally, simulations show that the R-CSMFB
provides better stopband attenuation and image denoising
performance than the conventional M-channel rational-valued DTCWT.
Robust
Median Filtering Forensics Based on the Autoregressive Model
of Median Filtered Residual
Xiangui Kang Sun Yat-Sen
University, Anjie Peng Sun Yat-Sen University,
K. J. Ray Liu University of Maryland, Matthew Stamm
University of Maryland
One important aspect of multimedia
forensics is exposing an image's processing history. Median
filtering is a popular noise removal and image enhancement
tool; recently, it has also become an effective anti-forensics tool.
An image is usually saved in a compressed format such as the
JPEG format. The forensic detection of median filtering from
a JPEG compressed image remains challenging, because typical
filter characteristics are suppressed by JPEG quantization
and blocking artifacts. In this paper, we introduce a robust
median filtering detection scheme based on the autoregressive
model of median filter residual. Median filtering is first
applied on a test image and the difference between the initial
image and the filtered output image is called the median filter
residual (MFR). The MFR is used as the forensic fingerprint.
Thus, the interference from the image edge and texture, which
is regarded as a limitation of the existing forensic methods,
can be reduced. To capture the statistical properties of the
MFR, we fit it to an autoregressive (AR) model. We then use
the AR coefficients as features for median filter detection.
Experimental results show that the proposed median filtering
detection method is very robust to JPEG post-compression
with a quality factor as low as 30. It distinguishes well
between median filtering and other manipulations, such as
Gaussian filtering, average filtering, and rescaling, and performs
well on low-resolution images of size 32 × 32. The proposed
method not only achieves much better performance than existing
state-of-the-art methods, but also uses a feature of very small
dimension, i.e., 10-D.
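A minimal sketch of the MFR fingerprint follows, with a simple 1-D AR fit over the flattened residual; the paper's AR formulation and the downstream classifier may differ. The order of 10 matches the reported feature dimension.

# Minimal sketch: median filter residual (MFR) + AR(10) coefficients.
import numpy as np
from scipy.signal import medfilt2d

def mfr_ar_features(img, order=10):
    mfr = medfilt2d(img.astype(np.float64), kernel_size=3) - img   # residual
    seq = mfr.reshape(-1)                                          # row-major scan
    # Fit AR(order): predict each sample from its `order` predecessors.
    X = np.column_stack([seq[order - k - 1:len(seq) - k - 1] for k in range(order)])
    y = seq[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs                                                  # 10-D feature vector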
Nonlinear
Signal Processing for Compensating Nonlinear Distortion of
Loudspeakers
Kenta Iwai Kansai University,
Yoshinobu Kajikawa Kansai University
In this paper, we propose a 3rd-order
nonlinear IIR filter for compensating nonlinear distortions
of loudspeaker systems. The 2nd-order nonlinear IIR filter
based on the Mirror filter is used for reducing nonlinear
distortions of loudspeaker systems. However, the 2nd-order
nonlinear IIR filter cannot reduce nonlinear distortions at
high frequencies because it does not include the nonlinearity
of the self-inductance of loudspeaker systems. On the other
hand, the proposed filter includes the effect of such self-inductance
and thus can reduce nonlinear distortions at high frequencies.
Experimental results demonstrate that the proposed filter can
reduce intermodulation distortion at high frequencies by 3.2 dB
more than the conventional filter.
Classifying
NMF Components Based on Vector Similarity for Speech and Music
Separation
Nengheng Zheng Shenzhen
University, Xia Li Shenzhen University, Yi Cai
Shenzhen University, Tan Lee CUHK
This paper presents a nonnegative
matrix factorization (NMF) component classification algorithm
for single-channel speech and music separation. Music-only
and music-speech mixture segments are first identified in
the audio stream via an audio segmentation technique. Then NMF
is applied for signal decomposition. The basis matrix of the
NMF output of the music-only segments provides prior knowledge
of the music component in the mixture signal. The NMF components
of the mixture signal, i.e., its basis and gain vectors, are
classified into speech and music based on the similarity between
each basis vector and the prior music basis matrix. A set
of SNR-dependent thresholding coefficients are empirically
determined for the classification. The separated speech and
music signals are reconstructed from the respectively classified
NMF components. Experimental results show the effectiveness
of the proposed method for speech and music separation, and
its superior performance over the traditional NMF-based separation
methods.
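A minimal sketch of the classification step follows, assuming a music basis W_music learned beforehand on music-only segments and a non-negative mixture magnitude spectrogram S; the cosine-similarity threshold thr stands in for the SNR-dependent coefficients.

# Minimal sketch: classify NMF components of a mixture by similarity
# to a prior music basis, then rebuild the two spectrograms.
import numpy as np
from sklearn.decomposition import NMF

def separate(S, W_music, n_components=20, thr=0.6):
    model = NMF(n_components=n_components, init='random', random_state=0, max_iter=400)
    W = model.fit_transform(S)                     # basis vectors of the mixture
    H = model.components_                          # gain vectors
    sim = (W.T @ W_music) / (np.linalg.norm(W, axis=0)[:, None]
                             * np.linalg.norm(W_music, axis=0)[None, :] + 1e-12)
    is_music = sim.max(axis=1) > thr               # closest-music cosine similarity
    S_music = W[:, is_music] @ H[is_music]
    S_speech = W[:, ~is_music] @ H[~is_music]
    return S_speech, S_music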
OS.37-SPS.3 Sparse
and Feature Representation for Image/Video Restoration
Session Chairs: Jiaying
Liu, Weisheng Dong, Zongming Guo Location: Trousdale Estates
Review
of Image Interpolation and Super-resolution
Wan-Chi Siu Hong Kong
Polytechnic University, Kwok Wai Hung Hong Kong Polytechnic
University
Image/video interpolation and super-resolution
are topics of great interest. Their applications include HDTV,
image coding, image resizing, image manipulation, face recognition
and surveillance. The objective is to increase the resolution
of an image/video through upsampling, deblurring, denoising,
etc. This paper reviews the development of various approaches
on image interpolation and super-resolution theory for image/video
enlargement in multimedia applications. Some basic formulations
will be derived such that readers can make use of them to
design their own, practical and efficient interpolation algorithms.
New results, such as hole filling using non-local means for
3D video synthesis and fast interpolation using a simplified
image model will be introduced. New directions and trends
will also be discussed at the end of the paper.
Super
Resolution with Edge-Constrained Motion Estimation
Yue Zhuo Peking University,
Jiaying Liu Peking University, Mading Li Peking
University, Zongming Guo Peking University
Motion estimation is a critical step
for most reconstruction-based super resolution methods. However,
accurate motion estimation is difficult, and the unavoidable
error rapidly degrades super-resolution performance. In
this paper, we present a robust way to perform super resolution
by improving motion estimation. Beginning with feature-point
matching, we compute the local motion parameters of feature-point
correspondences using a weighted Lucas-Kanade algorithm. Then an
accurate motion field is estimated by a support-region search,
which refers to edge information and accounts for the discontinuity
of motion boundaries and the consistency of the motion field.
Experimental results validate the efficacy of each step in the
proposed algorithm and show that it produces super-resolved
images with higher quality.
Super
Resolution For Subpixel Based Downsampled Images
Ketan Tang HKUST,
Oscar C. Au HKUST, Lu Fang USTC, Yuanfang Guo
HKUST, Pengfei Wan HKUST, Lingfeng Xu HKUST
Subpixel-based downsampling is a new
downsampling technique which utilizes the fact that each pixel
in an LCD is composed of three individually addressable subpixels.
Subpixel-based downsampling can provide higher apparent resolution
than pixel-based downsampling. In this paper we study the
inverse problem of subpixel-based downsampling. We find that
conventional pixel-based super resolution algorithms are not
suitable for subpixel-based downsampled images due to the
special downsampling pattern. We therefore propose a super
resolution algorithm designed specifically for subpixel-based
downsampled images, which uses a piecewise autoregressive model to model
spatial correlation of neighboring pixels, and incorporates
the special data degradation term corresponding to the subpixel
downsampling pattern. We formulate the super resolution problem
as a constrained least square problem and solve it using Gauss-Seidel
iteration. Experimental results demonstrate the effectiveness
of the proposed algorithm.
Image
Deblurring with Low-rank Approximation Structured Sparse Representation
Weisheng Dong Xidian
University, Guangming Shi Xidian University, Xin
Li West Virginia University
In recent years, sparse representation model (SRM) based image
deblurring approaches have shown promising results. However,
since most current SRMs do not utilize the spatial correlations
between the nonzero sparse coefficients, SRM-based image deblurring
methods often fail to faithfully recover sharp image edges.
In this paper, a structured SRM is employed to exploit the
local and nonlocal spatial correlation between the sparse
codes. The connection between the structured SRM and the low-rank
approximation model has also been presented, leading to an
efficient low-rank based optimization algorithm. An effective
image deblurring algorithm using the patch-based structured
SRM is then proposed. Experimental results demonstrate the
improvements of the proposed deblurring method over current
state-of-the-art image deblurring methods.
Face
Super-Resolution Based on Singular Value Decomposition
Muwei Jian Hong Kong
Polytechnic University, Kin Man Lam Hong Kong Polytechnic
University
In this paper, a novel face image
super-resolution approach based on singular value decomposition
(SVD) is proposed. We prove that the singular values of an
image at one resolution have approximately linear relationships
with their counterparts at other resolutions. This makes the
estimation of the singular values of the corresponding HR
face images more reliable. From the signal-processing point
of view, this can effectively preserve and reconstruct the
dominant information in the HR face image. Interpolating the
two other matrices obtained from the SVD of an LR face image
does not change either the primary facial structure or the
pattern of the face image. Furthermore, the mapping scheme
for interpolating the matrices can be viewed as a “coarse-to-fine” estimation
of HR face images, which uses the mapping matrices learned
from the corresponding reference image pairs. Experimental
results show that the proposed super-resolution scheme is
effective and efficient.
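A heavily simplified sketch of the SVD route follows. The constant singular-value scaling and the bicubic upsampling of the singular-vector matrices are naive stand-ins for the approximately linear relation and the learned mapping matrices the paper estimates from reference image pairs.

# Heavily simplified sketch (not the paper's mapping scheme).
import numpy as np
import cv2

def svd_upscale(lr, scale=2):
    U, s, Vt = np.linalg.svd(lr.astype(np.float64), full_matrices=False)
    s_hr = s * scale                                   # assumed linear singular-value relation
    U_hr = cv2.resize(U.astype(np.float32), (U.shape[1], U.shape[0] * scale),
                      interpolation=cv2.INTER_CUBIC)   # stand-in for the learned mapping
    Vt_hr = cv2.resize(Vt.astype(np.float32), (Vt.shape[1] * scale, Vt.shape[0]),
                       interpolation=cv2.INTER_CUBIC)
    return U_hr @ np.diag(s_hr).astype(np.float32) @ Vt_hr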
Robust
Single Image Super-resolution Based on Gradient Enhancement
Licheng Yu Shanghai
Jiao Tong University, Hongteng Xu Shanghai Jiao Tong
University, Yi Xu Shanghai Jiao Tong University,
Xiaokang Yang Shanghai Jiao Tong University
In this paper, we propose an image
super-resolution approach based on gradient enhancement. Local
constraints are established to obtain an enhanced gradient map,
while global sparsity constraints are imposed on the gradient
field to reduce noise effects in the super-resolution results.
We can then formulate the image reconstruction problem as
optimizing an energy function composed of the proposed sharpness
and sparsity regularization terms. The solution to this super-resolution
image reconstruction is finally achieved using the well-known
variable-splitting and penalty techniques. In comparison with
existing methods, the experimental results highlight the
computational efficiency of the proposed method and its
robustness to noisy scenes.
OS.38-SIPTM.5
Signal and Information Processing of Energy Signals
Session Chairs: Anthony
Kuh, Urbashi Mitra, Anna Scaglione Location: Franklin Hills
Networked
Loads in the Distribution Grid
Zhifang Wang Virginia
Commonwealth University, Xiao Li University of California,
Davis, Vishak MuthukumarUniversity of
California, Davis, Anna Scaglione University
of California, Davis, Sean Peisert Lawrence
Berkeley National Laboratory/University of California,
Davis, Charles McParland Lawrence Berkeley
National Laboratory
Central utility services are increasingly
networked systems that use an interconnection of sensors and
programmable logic controllers, and feed data to servers and
human-machine interfaces. These systems are connected to the
Internet so that they can be accessed remotely, and the network
in these plants is structured according to the SCADA model.
Although the physical systems themselves are generally designed
with high degrees of safety in mind, and designers of computer
systems are well advised to incorporate computer security
principles, a combined framework for supervisory control of
the physical and cyber architectures in these systems is still
lacking. Often absent are provisions to defend against external
and internal attacks, and even operator errors that might
bypass currently standalone security measures to cause undesirable
consequences. In this paper we examine a prototypical instance
of a SCADA network in the distribution grid that handles
central cooling and heating for a set of buildings. The electrical
loads are networked through programmable logic controllers
(PLCs), electrical meters, and networks that deliver data
to and from servers that are part of a SCADA system, which
has grown in size and complexity over many years.
Instantaneous
Frequency Estimation and Localization for ENF Signals
Adi Hajj-Ahmad University
of Maryland, Ravi Garg University of Maryland,
Min Wu University of Maryland
Forensic analysis based on Electric
Network Frequency (ENF) fluctuations is an emerging technology
to authenticate multimedia recordings. This class of techniques
requires extracting frequency fluctuations from multimedia
recordings and comparing them with the ground truth frequencies,
obtained from the power mains, at the corresponding time.
Most current guidelines for frequency estimation from the
ENF signal use non-parametric approaches. Such approaches
have limited time-frequency resolution due to inherent
time-frequency tradeoffs as well as computational constraints.
To facilitate robust high-resolution matching, it is
important to estimate instantaneous frequency using as few
samples as possible. The use of subspace-based methods for
high resolution frequency estimation is fairly new for ENF
analysis. In this paper, a systematic study of several high
resolution low-complexity frequency estimation algorithms
is conducted, focusing on estimating the frequencies in short
time-frames. After establishing the performance of several
frequency estimation algorithms, a study towards using the
ENF signal for estimating the location-of-recording is carried
out. Experiments conducted on ENF data collected in several
cities indicate the presence of location-specific signatures
that can be exploited for future forensic applications.
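A minimal sketch of subspace (MUSIC) frequency estimation on a short ENF frame follows, scanning a narrow band around a nominal 60 Hz mains frequency; the covariance size and scan grid are illustrative.

# Minimal sketch: MUSIC instantaneous-frequency estimate from frame x at rate fs.
import numpy as np

def music_enf(x, fs, m=30, band=(59.5, 60.5), ngrid=2001):
    N = len(x)
    X = np.column_stack([x[i:i + m] for i in range(N - m)])
    R = X @ X.T / (N - m)                        # sample covariance matrix
    vals, vecs = np.linalg.eigh(R)               # eigenvalues in ascending order
    En = vecs[:, :-2]                            # noise subspace (one real sinusoid = 2 dims)
    freqs = np.linspace(band[0], band[1], ngrid)
    k = np.arange(m)
    peaks = []
    for f in freqs:
        a = np.exp(2j * np.pi * f * k / fs)      # steering vector
        peaks.append(1.0 / np.linalg.norm(En.conj().T @ a) ** 2)
    return freqs[int(np.argmax(peaks))]          # pseudo-spectrum peak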
Modeling
Distributed PV Energy Using Stochastic Queuing Models
Anthony Kuh University
of Hawaii, Chuanyi Ji Georgia Institute of Technology,
Yun Wei Georgia Institute of Technology
In the past few years there has been
a tremendous growth in distributed PV generation on commercial
and residential buildings. Increasing distributed PV generation
has raised concerns about the stability of the distribution
grid due to the intermittency of solar PV energy. Before smart
grid optimization and control algorithms can be formulated
we must obtain a better understanding of the behavior of the
distributed PV energy contributions to the electrical grid.
This paper develops stochastic models to model each distributed
energy source using both spatial and temporal processing.
A goal is to develop simple stochastic models that accurately
model the distributed energy produced from the PV sources
with possible storage, so that key events (e.g., ramp-downs
due to cloud cover) can be predicted. The production of energy
from PV panels is modeled as a queue whose input is the
nonstationary solar irradiation, whose output is energy produced
according to a deterministic function, and whose buffer is storage,
from which energy can be sold to the grid or used by local loads.
A second queue models solar irradiation, with weather conditions
(sunny, partly cloudy, cloudy) as inputs.
Real-Time
Adaptive Distributed State Estimation in Smart Grids
Soummya Kar Carnegie
Mellon University, Jose Moura Carnegie Mellon University
The paper presents a fully distributed
framework for sequential recursive state estimation in inter-connected
electrical power systems. Specifically, the setup considered
involves a grid partitioned into multiple control areas that
communicate over a sparse communication network. In
the absence of a global sensor data fusion center (the conventional
centralized SCADA) and with sensing model uncertainties, an
adaptive distributed state estimation approach, the DAE,
is proposed in which the system control areas engage in a
collaborative joint (model) learning and (state) estimation
procedure through sequential information exchange over the
pre-assigned communication network. The proposed distributed
estimation methodology is recursive in that each system
control area refines its state estimate at a given sampling
instant by suitably combining its past estimate with the newly
collected local measurement(s) and the information obtained
from its communication neighbors. Under rather weak assumptions
of global observability and connectivity of the control area
communication network, the proposed distributed adaptive scheme
is shown to yield consistent system state estimates (i.e.,
estimates that converge to the true system state in the large
sample limit), the convergence rate being optimal in the Fisher
information sense. As discussed, the proposed approach based
on local communication and computation is suitable for real-time
implementation as opposed to conventional centralized SCADA
based estimation architectures with periodic data gathering
and processing. This has the potential of being more responsive
and adaptive to sensed data generated by advanced non-conventional
sensing resources like the PMUs with significantly higher
system sampling rates.
OS.39-SPS.4/BioSPS.5
Advances in Signal Processing Systems and Biomedical Signal
Processing
Session Chair: Alex Kot
Location: Whitley Heights
EAG:
Edge Adaptive Grid Data Hiding for Binary Image Authentication
Hong Cao Institute for
Infocomm Research, Alex Chichung Kot Nanyang Technological
University
This paper proposes a novel data hiding
method for authenticating binary images through establishing
dense edge adaptive grids (EAG) for invariantly selecting
good data-carrying pixel locations (DCPLs). Our method employs
a dynamic system structure with carefully designed local
content-adaptive processes (CAPs) to iteratively trace new contour
segments and to search for new DCPLs. By maintaining and updating
a location status map, we re-design the fundamental content-adaptive
switch, and a protection mechanism is proposed to
preserve the local CAPs' contexts as well as their corresponding
outcomes. Unlike existing contour-based methods, our
method addresses a key interference issue and, unprecedentedly,
is demonstrated to invariantly select the same sequence of DCPLs
for an arbitrary binary host image and its marked versions
in our contour-tracing based hiding method. Comparisons also
show that our method achieves a better trade-off between large
capacity and good perceptual quality than several
prior works for representative binary text and cartoon images.
Performance
Improvement of Closed-Form Joint Diagonalizer of Non-Negative
Hermitian Matrices
Akira Tanaka Hokkaido
University, Miho Murota Hokkaido University
Joint diagonalization of a series
of non-negative Hermitian matrices is an important technique
in signal processing, used for example in blind source separation
based on second-order statistics. In our previous works, we
introduced a closed-form solution of a joint diagonalizer
of non-negative Hermitian matrices and also proposed a method
for improving the performance of the solution in cases where
the given series of Hermitian matrices is not strictly jointly
diagonalizable. However, the performance of the method may degrade
when the number of given Hermitian matrices is comparatively
small. In this paper, we propose an improved version of the
closed-form joint diagonalizer of a given set of Hermitian matrices,
obtained by virtually increasing the number of Hermitian matrices.
Some numerical examples are also shown to verify the efficacy
of the proposed method.
Weighted-CS
for Reconstruction of Highly Under-sampled Dynamic MRI Sequences
Dornoosh Zonoobi NUS,
Ashraf A. Kassim NUS
This paper investigates the potential
of the new Weighted-Compressive Sensing approach which overcomes
the major limitations of other compressive sensing and current
state-of-the-art methods for low-rate reconstruction of sequences
of MRI images. The underlying idea of this approach is to
use the image of the previous time instance to extract an
estimated probability model for the image of interest, and
then use this model to guide the reconstruction process. This
is motivated by the observation that MRI images are highly
sparse in the wavelet domain and that their sparsity pattern
changes slowly over time.
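A minimal sketch of the weighted-l1 idea with CVXPY follows, assuming A and y are the current frame's undersampled measurement operator and data; the binary weighting rule derived from the previous frame's coefficients is a crude stand-in for the estimated probability model.

# Minimal sketch: weighted l1 recovery guided by the previous frame.
import numpy as np
import cvxpy as cp

def weighted_cs(A, y, prev_coeffs, w_lo=0.1, w_hi=1.0, tol=1e-3):
    n = A.shape[1]
    # Coefficients significant at time t-1 are cheap to keep nonzero at time t.
    w = np.where(np.abs(prev_coeffs) > tol, w_lo, w_hi)
    x = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(cp.norm1(cp.multiply(w, x))),
                      [A @ x == y])
    prob.solve()
    return x.value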
An
Anisotropic Diffusion Filter for Reducing Speckle Noise of
Ultrasound Images Based on Separability
Shen Liu Tianjin University,
Jianguo Wei Tianjin University, Bo Feng Tianjin
University, Wenhuan Lu Tianjin University, Bruce
Denby ESPCI ParisTech/Université Pierre et Marie Curie,
Qiang Fang Chinese Academy of Social Sciences, Jianwu
Dang JAIST/Tianjin University
Anisotropic diffusion is widely used for reducing speckle noise
in ultrasound images. However, traditional anisotropic diffusion
algorithms are poor at preserving edges and usually blur image
edges when denoising, which negatively affects subsequent image
analysis. In this paper, we modify standard speckle-reducing
anisotropic diffusion (SRAD) to improve its edge detection and
to suppress smoothing at edges by using the separability of images.
We extract contours from the original images, from images denoised
by SRAD, and from images denoised by our proposed method, and then
analyze and compare the accuracy of these three kinds of contours.
The results show that the proposed method preserves edges better
and yields higher-quality images than SRAD, which helps obtain
more accurate contours.
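For orientation, a minimal sketch of anisotropic diffusion in its classical Perona-Malik form follows; SRAD replaces the edge-stopping function below with a coefficient driven by local speckle statistics, which the paper further modulates using separability.

# Minimal sketch: Perona-Malik anisotropic diffusion on a grayscale image.
import numpy as np

def diffuse(img, n_iter=20, kappa=0.1, lam=0.2):
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        dn = np.roll(u, -1, axis=0) - u          # finite differences to 4 neighbours
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        # Edge-stopping function: diffusion is suppressed across strong edges.
        c = lambda d: np.exp(-(d / kappa) ** 2)
        u += lam * (c(dn) * dn + c(ds) * ds + c(de) * de + c(dw) * dw)
    return u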
Accelerating
Householder Bidiagonalization with ARM NEON Technology
Wenjun Yang Tsinghua
University, Zhenyu Liu Tsinghua University
Householder bidiagonalization is the
first step of Singular Value Decomposition (SVD), an important
algorithm in numerical linear algebra that is widely used
in video processing. NEON is a general-purpose Single Instruction
Multiple Data (SIMD) engine introduced in ARMv7 architecture,
which is targeted to accelerate multimedia and signal processing
on mobile platforms. In this paper, we propose a NEON-based
implementation and optimization of Householder bidiagonalization,
aiming to test the potential of NEON to handle low-dimensional
macroblocks if applied to future computation-intensive
video codecs. Intrinsics and inline assembly, the two most common
ways to utilize NEON, are compared in terms of performance.
Solutions to the problem of leftover elements in vectorization
are also discussed. Our study finally shows that, with hand-coded
inline assembly and all kinds of optimization, our NEON
implementation of Householder bidiagonalization gains a speedup
of 2.3 over the plain C version.
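For reference, a plain-NumPy equivalent of the "plain C version" of Householder bidiagonalization follows (assuming more rows than columns); a NEON port would vectorize the inner reflector applications.

# Minimal reference: reduce A (m x n, m >= n) to upper bidiagonal form.
import numpy as np

def house(x):
    v = x.astype(np.float64).copy()
    s = np.linalg.norm(x)
    v[0] += np.copysign(s, x[0] if x[0] != 0 else 1.0)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def bidiagonalize(A):
    B = A.astype(np.float64).copy()
    m, n = B.shape
    for k in range(n):
        v = house(B[k:, k])                       # zero column k below the diagonal
        B[k:, k:] -= 2.0 * np.outer(v, v @ B[k:, k:])
        if k < n - 2:
            v = house(B[k, k + 1:])               # zero row k beyond the superdiagonal
            B[k:, k + 1:] -= 2.0 * np.outer(B[k:, k + 1:] @ v, v)
    return B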
Thursday, December 6, 2012 (9:30 - 10:30)
|
OS.40-WCN.4 Wireless
Communications and Networking (II)
Session Chair: Wan-Jen
Huang Location: Doheny
Filter-And-Forward
Relay Design for OFDM Systems for Quality-of-Service Enhancement
Donggun Kim KAIST,
Junyeong Seo KAIST, Youngchul Sung KAIST
In this paper, the filter-and-forward
(FF) relay design for OFDM communication systems is considered
to enhance system performance over the conventional
amplify-and-forward (AF) relaying scheme. The design criterion
considered in this paper
is to maximize the worst subcarrier channel signal-to-noise
ratio (SNR) subject to the total relay transmit power constraint
in order to improve the overall transmission quality-of-service.
It is shown, by exploiting the eigen-property of circulant
matrices and the structure of Toeplitz matrices, that the
considered problem reduces to a semi-definite programming
(SDP) problem. Numerical results are provided to validate
our design method and the numerical results show that the
proposed FF relay outperforms the AF scheme significantly.
Preamble
Design for Joint Estimation of Channel and I/Q Imbalance in
MIMO-OFDM Systems
Emmanuel Manasseh Hiroshima
University, Shuichi Ohno Hiroshima University,
Masayoshi Nakamoto Hiroshima University
In this paper, preamble design for
estimation of frequency selective channels and In-phase/Quadrature-phase
(I/Q) imbalance in multiple input multiple output orthogonal
frequency-division-multiplexing (MIMO-OFDM) systems is proposed.
First, we utilize convex optimization to optimize the power of
all active subcarriers; then we employ cross-entropy (CE)
optimization to select an optimal preamble sequence that
minimizes the channel-estimate mean squared error (MSE) while
suppressing the effect of I/Q mismatch. To mitigate inter-antenna
interference, disjoint preamble sequences are utilized for each
transmit antenna. An algorithm to guarantee
that the preamble sequences are disjoint for each transmitter
is proposed. Numerical simulations are provided to verify
the advantages and effectiveness of the proposed preamble
sequences over the conventional sequences.
One-to-N
Wireless Power Transmission System Based on Multiple Access
One-Way In-Band Communication
Dong-Zo Kim Samsung
Electronics, Ki young Kim Samsung Advanced Institute
of Technology, Nam Yoon Kim Samsung Advanced Institute
of Technology, Yun-Kwon Park Samsung Advanced Institute
of Technology, Wang-Sang Lee KAIST, Jong-Won Yu
KAIST, Sangwook Kwon Samsung Advanced Institute
of Technology
For efficient wireless charging of multiple devices, the
transmitter (TX) system should be able to control the transmit
power level by monitoring the number of receiver (RX) devices
via their identification information, as well as the charging
status and capacity of each receiver. This work proposes a
one-way in-band communication scheme, a corresponding system
structure, and a charging algorithm for an efficient one-to-N
wireless charging system, which has been verified experimentally.
OS.41-SLA.15 Speech
Recognition (III)
Session Chair: Atsuhiko
Kai Location: Beachwood
Soft-clustering
Technique for Training Data in Age- and Gender-independent
Speech Recognition
Daisuke Enami Toyohashi
University of Technology, Faqiang Zhu Toyohashi University
of Technology, Kazumasa Yamamoto Toyohashi University
of Technology, Seiichi Nakagawa Toyohashi University
of Technology
In this paper, we propose approaches
for the Gaussian mixture model (GMM) based soft clustering
of training data and the GMM- or/and hidden Markov model (HMM)-based
cluster selection in age- and gender-independent speech recognition.
Typically, increasing the number of speaker classes leads
to more specific models in speaker-class-dependent speech
recognition, and thus better recognition performance. However,
the amount of data for each class model is reduced by the
increase in the number of classes, which leads to unreliable
model parameters. To solve the problem of the reduction of
training data, we propose a GMM-based soft clustering method
that allows overlap, and a method for selecting
a speaker model using a GMM or/and HMM. In an experiment,
we obtained a 5.0% absolute (24.9% relative) reduction in
word error rate (WER) over an age- and gender-dependent
baseline.
Ensemble
of SVM Trees for Multimodal Emotion Recognition
Viktor Rozgic Raytheon
BBN Technologies, Rohit Prasad Raytheon BBN Technologies,
Shirin Saleem Raytheon BBN Technologies, Sankaranarayanan
Ananthakrishnan Raytheon BBN Technologies, Rohit Kumar
Raytheon BBN Technologies
In this paper we address the sentence-level
multi-modal emotion recognition problem. We formulate the
emotion recognition task as a multi-category classification
problem and propose an innovative solution based on the automatically
generated ensemble of trees with binary support vector machines
(SVM) classifiers in the tree nodes. We demonstrate the efficacy
of our approach by performing four-way (anger, happiness,
sadness, neutral) and five-way (including excitement) emotion
recognition on the University of Southern California’s Interactive
Emotional Motion Capture (USC-IEMOCAP) corpus using combinations
of acoustic features, lexical features extracted from automatic
speech recognition (ASR) output and visual features extracted
from facial markers traced by a motion capture system. The
experiments show that the proposed ensemble of trees of binary
SVM classifiers outperforms classical multi-way SVM classification
with one-vs-one voting scheme and achieves state-of-the-art
results for all feature combinations.
Dereverberation
Based on Generalized Spectral Subtraction for Distant-talking
Speaker Recognition
Zhaofeng Zhang Shizuoka
University, Longbiao Wang Nagaoka University of Technology,
Atsuhiko Kai Shizuoka University
A dereverberation method based on
generalized spectral subtraction (GSS) using multi-channel
least mean square (MCLMS) was proposed previously. The results
of speech recognition experiments showed that this method
achieved a significant improvement compared to conventional
methods. In this paper, we apply this method to distant-talking
speaker recognition. However, the GSS-based dereverberation
method using clean speech models degrades the speaker recognition
performance while it is very effective for speech recognition.
One of the reasons may be that the GSS-based dereverberation
method causes distortions, such as speaker-characteristic
distortion between clean speech and dereverberant speech.
In this paper, we address this problem by training speaker
models using dereverberant speech which is obtained by suppressing
reverberation from arbitrary artificial reverberant speech.
The speaker recognition experiment was performed on large-scale
far-field speech with reverberant environments different from
the training environments. The proposed method achieved relative
error reduction rates of 88.2% compared to conventional CMN with
beamforming using clean speech models and 27.8% compared to
reverberant speech models, respectively.
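A minimal sketch of generalized spectral subtraction in the power-of-p domain follows; the delayed, attenuated magnitude used as the late-reverberation estimate is a crude stand-in for the MCLMS-estimated component, and all constants are illustrative.

# Minimal sketch: GSS-style dereverberation in the STFT domain.
import numpy as np
from scipy.signal import stft, istft

def gss_dereverb(x, fs, alpha=0.9, beta=0.05, p=2.0, delay=4):
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    late = np.zeros_like(mag)
    late[:, delay:] = 0.4 * mag[:, :-delay]          # assumed late-reverb magnitude
    sub = mag ** p - alpha * late ** p               # generalized subtraction
    mag_hat = np.maximum(sub, beta * mag ** p) ** (1.0 / p)   # spectral floor
    _, y = istft(mag_hat * np.exp(1j * phase), fs=fs, nperseg=512)
    return y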
OS.42-IVM.14 Image
Enhancement & Restoration
Session Chair: Jian-Jiun
Ding Location: Runyon
Edge-Membership
Based Blurred Image Reconstruction Algorithm
Wei-De Chang National
Taiwan University, Jian-Jiun Ding National Taiwan University,
Yu Chen National Taiwan University, Chir-Weei Chang
Industrial Technology Research Institute, Chuan-Chung
Chang Industrial Technology Research Institute
Enhancing the sharpness of edges and
avoiding the ringing effect are two important issues in blurred
image reconstruction. However, there is a tradeoff between
the two goals. A reconstruction filter with long impulse response
can reduce the ringing artifact, however, the sharpness of
the edge is decreased. By contrast, a short impulse response
reconstruction filter can perfectly retrieve the edge but
is not robust to noise. In this paper, an edge-membership
based blurred image reconstruction algorithm is proposed.
In order to achieve the two goals simultaneously, we design
two filters. One focuses on edge restoration and the other
one focuses on noise removing. After performing linear combination
of the outputs of the two reconstruction filters, the edges
are preserved and the ringing artifacts are removed at the
same time. Simulation results show that our approach can reconstruct
the blurred image with sharp edge and less ringing effect.
Flickers
and Black-streak Artifacts Removal on Display Monitors taken
by Video Cameras
Kotaro Nakazawa Shinshu
University, Keiichiro Shirai Shinshu University,
Masayuki Okamoto Shinshu University, Toshio Koga Yamagata
University
In this paper, we focus on removing
black streak artifacts from video images of LED and VFD display
monitors taken by video cameras. These streak artifacts affect
the image recognition of characters displayed on the monitor,
and also affect the quality of video compression. To interpolate
correct pixel intensities, we apply an image composition method
using light blending of images selected at an appropriate
frame interval. The frame interval is decided based on the
spectrum of the luminance oscillations at each pixel. Our
experimental results show some improvements in the appearance
of images and in video encoding.
Image
Restoration with Union of Directional Orthonormal DWTs
Shogo Muramatsu Niigata
University, Natsuki Aizawa Niigata University,
Masahiro Yukawa Niigata University
This work proposes to apply directional
lapped orthogonal transforms (DirLOTs) to image restoration.
A DirLOT is an orthonormal transform whose basis is allowed to be
anisotropic while retaining the symmetric, real-valued and
compact-support properties. In this work, DirLOTs are used to
generate symmetric
orthonormal discrete wavelet transforms and then a redundant
dictionary as a union of unitary transforms. The multiple
directional property is suitable for representing natural
images which contain diagonal edges and textures. The performances
of deblurring, super-resolution and inpainting are evaluated
for several images with the iterative-shrinkage/thresholding
algorithm. It is verified that the proposed dictionary yields
comparable or superior restoration performance to the non-subsampled
Haar transform.
OS.43-SIPTM.6
Signal and Information Processing Theory and Methods
Session Chair: Jorge Silva
Location: Laurel
Low-Complexity
Implementation of the Constrained Recursive Least-Squares
Algorithm
Reza Arablouei University
of South Australia, Kutluyil Dogancay University of
South Australia
A low-complexity implementation of
the constrained recursive least squares (CRLS) adaptive filtering
algorithm is developed based on the method of weighting and
the dichotomous coordinate descent (DCD) iterations. The method
of weighting is employed to incorporate the linear constraints
into the least squares problem of interest. The DCD iterations
are then used to solve the normal equations of the resultant
unconstrained least squares problem. The new algorithm has
a significantly smaller computational complexity than the
CRLS algorithm while delivering convergence performance on
par with it. Simulations demonstrate the effectiveness of
the proposed algorithm.
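A minimal sketch of the method of weighting follows, solving the resulting normal equations directly in place of the DCD iterations; the weight eta is illustrative.

# Minimal sketch: constrained least squares via the method of weighting.
import numpy as np

def weighted_cls(X, d, C, f, eta=1e6):
    # Append the constraint C w = f as heavily weighted rows, so any
    # unconstrained LS solver yields an approximately constrained solution.
    A = np.vstack([X, eta * C])
    b = np.concatenate([d, eta * f])
    R = A.T @ A            # normal equations; the paper's algorithm solves
    p = A.T @ b            # these recursively with low-cost DCD iterations
    return np.linalg.solve(R, p)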
A
Comparison of Two Algorithmic Recipes to Parametrize Rectangular
Orthogonal Matrices
Simone Fiori DII - UNIVPM,
Tetsuya Kaneko SIPLab - TUAT, Toshihisa Tanaka Tokyo
University of Agriculture and Technology
The present contribution focuses on
the parametrization of rectangular (‘tall-skinny’) orthogonal
matrices, which play a fundamental role in signal processing
and machine learning. Such matrices form a smooth curved space
termed the ‘compact Stiefel manifold’. This contribution
illustrates a numerical comparison of two algorithmic
recipes for parameterizing Stiefel matrices in signal processing.
Compressibility
of Infinite Sequences and its Interplay with Compressed Sensing
Recovery
Jorge Silva University
of Chile, Eduardo Pavez University of Chile
This work elaborates connections between notions of compressibility of infinite sequences, recently addressed by Amini et al., and the performance of compressed sensing (CS) recovery algorithms from linear measurements in the undersampled scenario. In particular, in the asymptotic regime where the signal dimension goes to infinity, we establish a new set of compressibility definitions over infinite sequences that guarantees arbitrarily good performance in an ℓ1 noise-to-signal ratio (ℓ1-NSR) sense with an arbitrarily small number of measurements per signal dimension.
OS.44-SLA.16 Behavioral
Informatics: Enabling Technologies and Applications (I)
Session Chair: Shrikanth
Narayanan Location: Trousdale Estates
A
Behaviorist Manifesto for the 21st Century
Brian Baucom University
of Utah, Esti Iturralde University of Southern California
Observational assessment of behavior
is a core measurement tool in modern psychological research
and practice. Despite the importance of observational assessment
and the tremendous amount of research devoted to refining
and enhancing the methodological foundations of these tools,
current best practices still bear a strong resemblance to
those from three decades prior. The emergent field of behavioral
signal processing, though relatively little known within the
field of psychology, has the potential to revolutionize observational
practice and resolve long-standing limitations of current
methods. In this paper, we illustrate the need for and potential
value of behavioral signal processing methods for observational
practice by examining three current issues in the observational
assessment of clinically distressed married couples that
are representative of challenges faced in numerous areas of
clinical psychology.
Using
Measures of Vocal Entrainment to Inform Outcome-Related Behaviors
in Marital Conflicts
Chi-Chun Lee University
of Southern California, Athanasios Katsamanis University
of Southern California, Brian Baucom University of
Southern California, Panayiotis Georgiou University
of Southern California, Shrikanth Narayanan University
of Southern California
Behavioral entrainment is an important,
naturally-occurring dynamic phenomenon in human interactions.
In this paper, we carry out two quantitative analyses of the
vocal entrainment phenomenon in the context of studying conflictual
marital interactions. We investigate the role of vocal entrainment
in reflecting different dimensions of couple-specific behaviors,
such as withdrawal, that are commonly used in assessing the effectiveness of couple therapy. The results
indicate a statistically-significant relation between these
behaviors and vocal entrainment, as quantified using our proposed
unsupervised signal-derived computational framework. We further
demonstrate the potential of the signal-based vocal entrainment
framework in characterizing influential factors in distressed couples' relationship satisfaction outcomes.
Automatic
Detection of Psychological Distress Indicators in Online Forum
Posts
Shirin Saleem Raytheon
BBN Technologies, Maciej Pacula Raytheon BBN Technologies,
Rachel Chasin Massachusetts Institute of Technology,
Rohit Kumar Raytheon BBN Technologies, Rohit Prasad
Raytheon BBN Technologies, Michael Crystal Raytheon
BBN Technologies, Brian Marx National
Center for PTSD at VA Boston Healthcare System, Denise
Sloan National Center for PTSD at VA Boston
Healthcare System, Jennifer Vasterling National
Center for PTSD at VA Boston Healthcare System, Theodore
Speroff Vanderbilt University
The stigma associated with mental
health issues makes face-to-face discussions with family members,
friends, or medical professionals difficult for many people.
In contrast, the Internet, due to its ubiquity and global
outreach, is increasingly becoming a popular medium for distressed
individuals to anonymously relate experiences. In this paper,
we present a system for automatically detecting psychological
distress indicators in informal text interactions on Internet
discussion forums. We compare a suite of innovative features
and classifiers on data downloaded from an online forum discussing
psychological health issues. Psychologists annotated individual
messages with a comprehensive set of distress labels derived
from the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV). The noisy nature of the forum posts and the large set of distress labels for multi-label text classification (many of which cannot be detected by a mere surface-form analysis of the text) make the task extremely challenging. A late fusion technique combines outputs from different classifiers, resulting in promising accuracy on this challenging multi-label classification problem.
OS.45-SLA.17 Recent
Topics in Speech Processing (II)
Session Chair: Longbiao
Wang Location: Franklin Hills
A
Study of Emotional Information Present in Articulatory Movements
Estimated Using Acoustic-To-Articulatory Inversion
Jangwon Kim University
of Southern California, Prasanta Ghosh IBM Research
India, Sungbok Lee University of Southern California,
Shrikanth Narayanan University of Southern California
This study examines emotion-specific
information (ESI) in the articulatory movements estimated
using acoustic-to-articulatory inversion on emotional speech.
We study two main aspects: (1) the degree of similarity between
the pair of estimated and original articulatory trajectories
for the same and different emotions and (2) the amount of
ESI present in the estimated trajectory. They are evaluated
using mean squared error between the articulatory pair and
by automated emotion classification. This study uses parallel
acoustic and articulatory data in 5 elicited emotions spoken
by 3 native American English speakers. We also test emotion classification performance using articulatory trajectories estimated from different acoustic feature sets, and find the results to be subject-dependent. Experimental results suggest that the ESI in the estimated trajectories, although smaller than that in direct articulatory measurements, is complementary to that in the prosodic features, suggesting the usefulness of estimated articulatory data for emotion research.
Robust
Feature Extraction to Utterance Fluctuations Due to Articulation
Disorders Based on Sparse Expression
Toshiya Yoshioka Kobe
University, Ryoichi Takashima Kobe University,
Tetsuya Takiguchi Kobe University, Yasuo Ariki Kobe
University
We investigated the speech recognition
of a person with articulation disorders resulting from athetoid
cerebral palsy. Recently, the accuracy of speaker-independent
speech recognition has been remarkably improved by the use
of stochastic modeling of speech. However, such acoustic models degrade recognition accuracy for a person with a different speech style (e.g., articulation disorders). In this paper, we discuss our efforts to build
an acoustic model for a person with articulation disorders.
The articulation of the first utterance tends to be more unstable than that of later utterances due to strain on speech-related muscles, and this instability degrades speech recognition.
Therefore, we propose a robust feature extraction method based
on exemplar-based sparse representation using NMF (Non-negative
Matrix Factorization). In our method, the unstable first utterance
is expressed as a linear and non-negative combination of a
small number of bases created using the more stable utterances
of a person with articulation disorders. Then, we use the combination coefficients as acoustic features. Their effectiveness
has been confirmed by word-recognition experiments.
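A hedged sketch of the feature extraction idea: each frame of the unstable utterance is approximated as a non-negative combination of exemplar bases drawn from the speaker's stable utterances, and the combination weights become the feature. Non-negative least squares stands in for the NMF-style solver here; the names and dimensions are illustrative only.

```python
import numpy as np
from scipy.optimize import nnls

def exemplar_feature(frame, bases):
    """frame: (D,) spectral feature of one frame; bases: (D, K) exemplars
    collected from stable utterances. Returns the (K,) activation vector,
    which replaces the raw spectral feature in recognition."""
    weights, _residual = nnls(bases, frame)
    return weights

# bases = np.abs(np.random.randn(40, 100))   # e.g., K = 100 stable exemplars
# feat = exemplar_feature(np.abs(np.random.randn(40)), bases)
```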
Distant-talking
Speaker Identification Using a Reverberation Model With Various
Artificial Room Impulse Responses
Longbiao Wang Nagaoka
University of Technology, Zhaofeng Zhang Shizuoka University,
Atsuhiko Kai Shizuoka University, Yoshiki Kishi NS
Solutions Kansai Corp.
In this paper, we propose a distant-talking
speaker recognition method using a reverberation model with
various artificial room impulse responses. Artificial room impulse responses generated with different speaker and microphone positions, room sizes, and wall reflection coefficients are convolved with clean speech to train an artificial reverberation speaker model. This artificial reverberation
model is also combined with a reverberation speaker model
trained with room impulse responses measured in real environments.
Speaker identification using a combination of the two reverberation speaker models achieved relative error reduction rates of 50.0% and 78.4% compared with a reverberation model trained on real-world room impulse responses and a clean speech model, respectively.
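The data-preparation step can be pictured with a toy example: synthesize an artificial room impulse response and convolve it with clean speech to obtain reverberant training material. The exponential-decay noise RIR below is a crude stand-in for responses generated with varying positions, room sizes, and wall reflection coefficients.

```python
import numpy as np
from scipy.signal import fftconvolve

def artificial_rir(rt60, fs, length_s=0.5, seed=0):
    """Exponentially decaying noise RIR with reverberation time rt60 (s)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(length_s * fs)) / fs
    decay = np.exp(-3.0 * np.log(10.0) * t / rt60)   # -60 dB at t = rt60
    return rng.standard_normal(len(t)) * decay

def reverberate(clean, rir):
    rev = fftconvolve(clean, rir)[:len(clean)]
    return rev / (np.max(np.abs(rev)) + 1e-12)       # normalize amplitude

# clean = ...  # clean 16 kHz speech samples
# train_sample = reverberate(clean, artificial_rir(rt60=0.6, fs=16000))
```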
OS.46-IVM.15 Intelligent
Object Detection and Identification for Visual Surveillance
and Security
Session Chair: Chia-Hung
Yeh Location: Whitley Heights
Background
Subtraction by Modeling Pixel and Neighborhood Information
Shu-Jhen Fan Jiang National
Sun Yat-sen University, Kahlil Muchtar Asia University,
Chih-Yang Lin Asia University, Li-Wei Kang Academia
Sinica, Chia-Hung Yeh National Sun Yat-sen University
In applications of the computer vision
field, a vision system is usually composed of several low
level and high level components, stacked on top of each other.
A better design of the lower level components usually results
in better accuracy of higher level functions, such as object
tracking, face recognition, and surveillance. In this paper,
we focus on the low level component design, background construction,
which is one of the most basic elements for a surveillance
system. The proposed method eases the problems that usually
occur in background construction, including aperture problem,
vacillating background, and shadow removal. In conventional
background construction methods, only the history information
(vertical direction) of pixels is usually considered. In contrast,
the proposed scheme not only uses the vertical direction but
also the neighborhood information (horizontal direction).
Experimental results show that the proposed scheme can detect objects more precisely, alleviate the aperture problem, and identify shadows and discard them from detected objects.
A
Block Based (t, n) Visual Cryptography Scheme for Unbounded
n and t=2, 3
Sian-Jheng Lin Academia
Sinica, Wei-Ho Chung Academia Sinica
The (t, n) visual cryptography (VC)
is a secret sharing scheme of decomposing a secret image into
n transparencies, and the stacking of any t out of n transparencies
reveals the secret content. The perfect security condition of a VC scheme imposes the strict requirement that any t−1 or fewer transparencies cannot extract any information about the secret. For n approaching infinity, previous studies consider
the scenario where the probabilistic model is a pixel-to-pixel
scheme that encodes each secret pixel to a corresponding pixel
in each transparency. In this paper, we extend the pixel-to-pixel
scheme to pixel-to-block scheme for the cases t=2 and 3. Given
a secret image, the proposed VC scheme generates a transparency by coding each secret image pixel to an m-pixel shadow block at the corresponding position of the transparency. Experiments
show that the stacking results reveal better visual quality
than the probabilistic model scheme.
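To make the pixel-to-block idea concrete, here is a generic probabilistic (2, n) construction: a white secret pixel is encoded with identical half-black blocks in all transparencies, a black pixel with independent ones, so stacking any two shares darkens black regions. This is only an illustration of the concept, not the paper's scheme.

```python
import numpy as np

def encode_pixel(secret_bit, n, m, rng):
    """Return an (n, m) array of shadow blocks (1 = opaque sub-pixel)."""
    half_black = np.r_[np.ones(m // 2), np.zeros(m - m // 2)]
    if secret_bit == 0:                       # white: one shared random block
        block = rng.permutation(half_black)
        return np.tile(block, (n, 1)).astype(int)
    # black: independent random blocks, so stacking covers ~3m/4 sub-pixels
    return np.stack([rng.permutation(half_black) for _ in range(n)]).astype(int)

rng = np.random.default_rng(1)
shares = encode_pixel(1, n=4, m=8, rng=rng)
stacked = np.maximum(shares[0], shares[1])    # stacking two shares = pixel OR
# White pixels stack to about m/2 opaque sub-pixels, black ones to about 3m/4,
# so the contrast reveals the secret while any single share is uniform noise.
```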
A
Real-Time Rear Obstacle Detection System Based On a Fish-Eye
Camera
Che-Tsung Lin ITRI,
Yu-Chen Lin ITRI, Wei-Cheng Liu ITRI, Chi-Wei
Lin ITRI
This paper proposes a rear-vision camera-based vehicle detection system that detects whether a vehicle is present in the ego lane and whether vehicles in adjacent lanes are overtaking. Distortion calibration is first applied to the source image, which helps the subsequent Hough transform detect lane lines. The rear vehicle in the ego lane is detected by a combination of feature-based and appearance-based approaches. When a vehicle in an adjacent lane is overtaking, the loss of its symmetry makes it very difficult to detect. Therefore, we propose a new detection algorithm that applies corner detection and motion vectors, computed using Local Binary Patterns (LBP), to determine whether any vehicles in adjacent lanes are overtaking. Our proposed algorithm achieves a high detection rate with low computational cost and has been successfully implemented on an ADI-BF561 600 MHz dual-core DSP.
OS.47-IVM.16 Image
Analysis and Recognition
Session Chair: Toshihiko
Yamasaki Location: Mt. Olympus
Spatial
Statistics for Spatial Pyramid Matching Based Image Recognition
Toshihiko Yamasaki University
of Tokyo, Tsuhan Chen Cornell University
This paper presents an image feature
extraction algorithm that enhances the object classification
accuracy in the spatial pyramid matching (SPM) framework.
The proposed method considers the spatial statistics of the
feature vectors by calculating the moment vectors. While the
original SPM algorithm captures the spatial distribution of the image feature descriptors, the proposed algorithm describes how such spatial distributions vary. The experiments are conducted using two state-of-the-art SPM-based methods on two commonly used datasets. The results demonstrate the validity of our proposed algorithm. The cases where the proposed
algorithm works well are also investigated. In addition, it
is demonstrated that the proposed feature and adding more
layers improve the classification accuracy in different situations.
Using
the Visual Words based on Affine-SIFT Descriptors for Face
Recognition
Yu-Shan Wu Chunghwa
Telecommunication Laboratories, Heng-Sung Liu Chunghwa
Telecommunication Laboratories, Gwo-Hwa Ju Chunghwa
Telecommunication Laboratories, Ting Wei Lee Chunghwa
Telecommunication Laboratories, Yen-Lin Chiu Chunghwa
Telecommunication Laboratories
Video-based face recognition has drawn
a lot of attention in recent years. On the other hand, Bag-of-visual
Words (BoWs) representation has been successfully applied
in image retrieval and object recognition recently. In this
paper, a video-based face recognition approach which uses
visual words is proposed. In the classic visual-words approach, Scale Invariant Feature Transform (SIFT) descriptors of an image are first extracted at interest points detected by difference of Gaussians (DoG); then k-means-based visual vocabulary generation is applied to replace these descriptors with the indexes of the closest visual words. However, for facial images, SIFT descriptors are not good enough due to variations in facial pose, expression, and lighting conditions. In this paper,
we use Affine-SIFT (ASIFT) descriptors as facial image representation.
Experimental results on UCSD/Honda Video Database and VidTIMIT
Video Database suggest that visual words based on Affine-SIFT
descriptors can achieve lower error rates in face recognition
task.
An
Open Framework for Video Content Analysis
Chia-Wei Liao Institute
for Information Industry, Kai-Hsuan Chan Institute
for Information Industry, Bin-Yi Cheng Institute for
Information Industry, Chi-Hung Tsai Institute for Information
Industry, Wen-Tsung Chang Institute for Information
Industry, Yu-Ling Chuang Institute for Information
Industry
In the past few years, the amount of Internet video has grown rapidly, and it has become a major market. Efficient video indexing and retrieval, therefore,
is now an important research and system-design issue. Reliable
extraction of metadata from video as indexes is one major
step toward efficient video management. There are numerous
video types, and theoretically, everybody can define his/her
own video types. The nature of video can be so different that
we may end up, for each video type, having a dedicated video
analysis module, which is in itself nontrivial to implement.
We believe an open video analysis framework should help when
one needs to process various types of videos. In this paper, we propose an open video analysis framework in which the video analysis modules are developed and deployed as plug-ins. In addition to plug-in management, it provides a runtime environment with standard libraries and proprietary rule-based automaton modules to facilitate plug-in development. A prototype has been implemented and validated with several experimental plug-ins.
PS.5-SLA.18 Speech
Coding and Processing and Recognition
Session Chair: Hideki Kawahara
Location: Solano
Speaking
Rate Dependent Multiple Acoustic Models Using Continuous Frame
Rate Normalization
Ban Sung Min Pusan National
University, Kim Hyung Soon Pusan National University
This paper proposes a method using speaking-rate-dependent multiple acoustic models for speech recognition. In this method, multiple acoustic models corresponding to various speaking rates are generated. Among them, the acoustic model matching the speaking rate of the test data is selected and used in recognition. To simulate the various speaking rates for the multiple acoustic models, we use a variable frame shift that depends on the speaking rate of each utterance instead of applying a fixed frame shift to all training utterances. Continuous Frame Rate Normalization (CFRN) is applied to each training utterance to control the frame shift. Experimental results show that the proposed method outperforms both the baseline and conventional CFRN on test utterances.
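Schematically, frame-rate normalization amounts to scaling the analysis frame shift by each utterance's speaking rate so that fast and slow speech produce comparable numbers of frames per phone. The rate measure, the reference rate, and the 10 ms base shift below are assumptions for illustration.

```python
import numpy as np

def normalized_frame_shift_ms(speaking_rate, reference_rate=4.0,
                              base_shift_ms=10.0):
    """speaking_rate: e.g., syllables per second for the utterance. A fast
    utterance gets a smaller shift (more frames), a slow one a larger shift."""
    return base_shift_ms * reference_rate / speaking_rate

def frame_starts(num_samples, fs, shift_ms, frame_ms=25.0):
    shift = max(1, int(fs * shift_ms / 1000.0))
    frame = int(fs * frame_ms / 1000.0)
    return np.arange(0, max(num_samples - frame, 1), shift)

# 6 syl/s -> 10 * 4/6 = 6.7 ms shift; 3 syl/s -> 10 * 4/3 = 13.3 ms shift.
```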
Emotion
Classification of Infant Cries with Consideration for Local
and Global Features
Kazuki Honda Nagasaki
University, Kazuki Kitahara Nagasaki University,
Shoichi Matsunaga Nagasaki University, Masaru Yamashita
Nagasaki University
In this paper, we propose an approach
to the classification of emotion clusters in infant cries
with consideration for frame-wise/local acoustic features
and global prosodic features. Our proposed approach has two main characteristics. First, the emotion cluster detection procedure is based on the most likely segment sequence, obtained with a maximum likelihood approach that combines the frame-wise likelihood and the global prosodic likelihood; the derived sequence delivers the emotion cluster as the classification result. We exploit the duration ratios of resonant cry segments and silent segments as prosodic features, with the duration ratios calculated from the derived segment sequence. The second characteristic is the use of pitch information, in addition to conventional power and spectral information, in modeling frame-wise acoustic features with hidden Markov models. The classification performance of our proposed approach with added pitch information (74.7%) was better than that of the classification method using only power and spectral features (71.5%). The proposed method based on a maximum likelihood approach using both frame-wise and global features achieved even better performance (75.5%).
On
the Use of Phase Information-based Joint Factor Analysis for
Speaker Verification under Channel Mismatch Condition
Ikuya Hirano Shizuoka
University, Longbiao Wang Nagaoka University of Technology,
Atsuhiko Kai Shizuoka University, Seiichi Nakagawa
Toyohashi University of Technology
Recent studies have shown that phase information contains speaker characteristics. A method to extract pitch-synchronous phase information has been proposed and shown to be very effective under channel-matched conditions. However, the phase changes across channels, so speaker recognition performance degrades drastically under channel-mismatched conditions. On the other hand, Joint Factor Analysis (JFA) is an approach that is robust to channel variability. In this paper, we propose phase information-based JFA for speaker verification under channel-mismatched conditions. Speaker verification experiments were performed using the NIST 2003 SRE database. Phase information-based JFA achieved relative equal error rate reductions of 20.9% for male and 17.4% for female speakers compared to the traditional Gaussian Mixture Model-Universal Background Model (GMM-UBM) system, which is affected by channel variability. Furthermore, by combining the phase information-based method with the MFCC-based method, we obtained better results than with the MFCC-based method alone.
Modulation
Transfer Function Design for a Flexible Cross Synthesis Vocoder
Based on F0 Adaptive Spectral Envelope Recovery
Taiki Nishi Wakayama
University, Ryuichi Nisimura Wakayama University,
Toshio Irino Wakayama University, Hideki Kawahara Wakayama
University
A new design procedure for a flexible cross-synthesis vocoder is proposed based on the TANDEM-STRAIGHT framework, an F0-adaptive spectral envelope estimator, and modulation transfer function design. The proposed procedure enables control of the speech intelligibility and the timbre identity of musical instruments or animal voices. Removing the averaged and smoothed logarithmic speech spectrum from the filter reduces the timbre modification effect on the filtered sounds, and manipulating the cut-off frequencies of the modulation transfer function used in the filter design enables control of the trade-off between intelligibility and timbre preservation.
Detecting
Child Speaker Based on Auditory Feature Vectors for VTL Estimation
Ryuichi Nisimura Wakayama
University, Shoko Miyamori Wakayama University,
Erika Okamoto Wakayama University, Hideki Kawahara
Wakayama University, Toshio Irino Wakayama University
We introduce novel auditory features
in the Hidden Markov Model (HMM) system for detecting child
speakers. The features derived by the gammachirp auditory
filterbank (GCFB) have been demonstrated to be suitable for
vocal tract length (VTL) estimation, both theoretically and
experimentally. We performed numerical experiments to distinguish
between child and adult speakers using HMMs trained on 2,360
speech samples collected through a web-based query interface,
and we compared the performance of the common mel-frequency
cepstral coefficients (MFCC) and the GCFB-based feature vectors.
We also introduce modulation features as a substitute for delta parameters. The results clearly demonstrate that GCFB reduces the error rate in distinguishing a child from an adult. To enhance our method for use as a web application,
we applied our original voice-enabled web framework to the
front-end interface of the proposed system.
Acoustic
Model Training Using Feature Vectors Generated by Manipulating
Speech Parameters of Real Speakers
Tetsuto Kawai Nagoya
University, Norihide Kitaoka Nagoya University,
Kazuya Takeda Nagoya University
In this paper, we propose a robust
speaker-independent acoustic model training method using generative
training to generate many pseudo-speakers from a small number
of real speakers. We focus on the difference between each
speaker’s vocal tract length, and manipulate it in order to
create many different pseudo-speakers with a range of vocal
tract lengths. This method employs frequency warping based on the inverted use of Vocal Tract Length Normalization (VTLN).
Another method for creating pseudo-speakers is to vary the
speaking rate of the speakers. This can be achieved by a method
called PICOLA (Pointer Interval Controlled OverLap and Add).
In experiments, we train acoustic models using these generated
pseudo-speakers in addition to the original speakers. Evaluation
results show that generating pseudo-speakers by manipulating speaking rates did not result in a sufficient increase in performance; vocal tract length warping, however, was effective.
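A sketch of the vocal tract length manipulation: a magnitude spectrum is resampled along a piecewise-linear VTLN-style warp with factor alpha, so one real speaker yields several virtual vocal tract lengths. The warp shape and the alpha range are common choices, not necessarily those of the paper.

```python
import numpy as np

def vtln_warp(freqs, alpha, f_nyq, knee_frac=0.875):
    """Piecewise-linear warp: slope alpha below the knee, then a straight
    line through (f_nyq, f_nyq) so the band edge maps to itself."""
    knee = knee_frac * f_nyq
    upper = f_nyq + (freqs - f_nyq) * (f_nyq - alpha * knee) / (f_nyq - knee)
    return np.where(freqs <= knee, alpha * freqs, upper)

def warp_spectrum(spectrum, alpha):
    """spectrum: magnitude spectrum on a uniform frequency grid."""
    n = len(spectrum)
    f = np.arange(n, dtype=float)
    return np.interp(vtln_warp(f, alpha, n - 1.0), f, spectrum)

# One spectrum, many pseudo vocal tract lengths:
# pseudo = [warp_spectrum(S, a) for a in np.arange(0.88, 1.13, 0.04)]
```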
A
Concatenative Speech Synthesis for Monosyllabic Languages
With Limited Data
Trung-Nghia Phung Japan
Advanced Institute of Science and Technology, Luong Chi
Mai Vietnamese Academy of Science and
Technology, Masato Akagi Japan Advanced Institute
of Science and Technology
The quality of unit-based concatenative speech synthesis is low, while corpus-based concatenative synthesis with unit selection is highly natural. However, unit selection requires a huge amount of data for concatenation, which limits its range of applications. In this paper, by using temporal decomposition to model contextual effects within and across syllables, we propose a context-fitting unit modification method and a context-matching unit selection method. The two proposed context-specific methods are used in our syllable-based concatenative speech synthesizer for monosyllabic languages. Experimental results with a Vietnamese synthesizer built on a small corpus show that the proposed methods are effective. As a consequence, the naturalness and intelligibility of the proposed synthesis are high even when only limited data are available for concatenation.
Feature
Reconstruction using Sparse Imputation for Noise Robust Audio-Visual
Speech Recognition
Peng Shen Gifu University,
Tamura Satoshi Gifu University, Hayamizu Satoru Gifu
University
In this paper, we propose to apply noise reduction to both the speech and visual signals by using exemplar-based sparse representation features for audio-visual speech recognition. First, we introduce sparse representation classification and describe how to utilize sparse imputation to reduce noise not only in the audio signal but also in the visual signal. We utilize a normalization method to improve the accuracy of the sparse representation classification, and propose a method to reduce the error rate of the visual signal when the normalization method is used. We show the effectiveness of our proposed noise reduction method: the audio features achieved up to 88.63% accuracy at -5 dB, a 6.24% absolute improvement over the additive noise reduction method, and the visual features achieved a 27.24% absolute improvement under gamma noise.
Statistical
Voice Conversion using GA-based Informative Feature
Kohei Sawada Gifu University,
Yoji Tagami Gifu University, Tamura Satoshi Gifu
University, Masanori Takehara Gifu University,
Hayamizu Satoru Gifu University
In order to make voice conversion (VC) robust to noise, we propose VC using a GA-based informative feature (GIF), adding a GIF extraction process to conventional VC. GIF has been proposed as a feature that can be applied not only to pattern recognition but also to related tasks; in speech recognition, moreover, GIF can improve recognition accuracy in noisy environments. We evaluated the performance of VC using spectral segmental features (the conventional method) and GIF, respectively. Objective experimental results indicate that, in noisy environments, the proposed method outperforms the conventional method. A subjective experiment was also conducted to compare the two. These results show that the application of GIF to VC is effective.
GIF-SP:
GA-based Informative Feature for Noisy Speech Recognition
Tamura Satoshi Gifu
University, Yoji Tagami Gifu University, Hayamizu
Satoru Gifu University
This paper proposes a novel discriminative feature extraction method. The method consists of two stages: in the first stage, a classifier is built for each class, categorizing an input vector as belonging to that class or not, and a first transformation is formed from all the parameters of the classifiers. In the second stage, another transformation that generates a feature vector is obtained to reduce the dimension and enhance recognition ability. These transformations are computed by applying a genetic algorithm. In order to evaluate the performance of the proposed feature, speech recognition experiments were conducted. Results in the clean training condition show that GIF greatly improves recognition accuracy over conventional MFCC in noisy environments. Multi-condition results also show that our proposed scheme is robust against differences in conditions.
A
Packet Loss Recovery of G.729 Speech Under Severe Packet Loss
Condition
Takeshi Nagano Tohoku
University, Akinori Ito Tohoku University
In VoIP applications, packet losses degrade speech quality; in particular, IP networks under large-scale disaster conditions can suffer severe packet losses. We investigate the influence of parameter loss on speech quality using G.729, and we examine the effect of a packet loss concealment method that uses redundant G.729 parameters. Compared with the "repetition" method, the proposed method improves speech quality. We also propose a bitrate reduction method that sends bit-flip positions instead of codebook indexes.
Microphone
Array Processing for Distant Speech Recognition: Spherical
Arrays
John McDonough Carnegie
Mellon University/Voci Technologies, Inc., Kenichi Kumatani
Disney Research, Pittsburgh, Bhiksha Raj Carnegie
Mellon Univerity
Distant speech recognition (DSR) holds
out the promise of the most natural human computer interface
because it enables man-machine interactions through speech,
without the necessity of donning intrusive body- or head-mounted
microphones. With the advent of the Microsoft Kinect, the
application of non-uniform linear arrays to the DSR problem
has become commonplace. Performance analysis of such arrays
is well-represented in the literature. Recently, spherical
arrays have become the subject of intense research interest
in the acoustic array processing community. Such arrays have
heretofore been analyzed solely with theoretical metrics under
idealized conditions. In this work, we analyze such arrays
under realistic conditions. Moreover, we compare a linear array with 64 channels and a total length of 128 cm to a spherical array with 32 channels and a radius of 4.2 cm;
we found that these provided word error rates of 9.3% and
9.9%, respectively, on a DSR task. For a speaker positioned
at an oblique angle with respect to the linear array, we recorded
error rates of 12.8% and 10.7%, respectively, for the linear
and spherical arrays. The compact size and outstanding performance of the spherical array recommend it well for space-limited and mobile applications such as home-gaming consoles and humanoid
robots.
Thursday, December 6, 2012 (10:50 - 12:30)
|
OS.48-SLA.19 Fundamental
Technologies in Modern Speech Processing (II)
Session Chair: Chia-Ping
Chen Location: Doheny
Sub-word
Modeling for Automatic Speech Recognition
Karen Livescu Toyota
Technological Institute at Chicago, Eric Fosler-Lussier Ohio State University, Florian Metze Carnegie
Mellon University
Discriminative
Training for ASR: Modeling, Criteria, Optimization, Implementation,
and Performance
Georg Heigold Google,
Hermann Ney RWTH Aachen University, Ralph Schlueter
RWTH Aachen University, Simon Wiesler RWTH Aachen
University
Deep
Neural Networks for Acoustic Modeling in Speech Recognition
Geoffrey Hinton University
of Toronto, Li Deng Microsoft Research, Dong Yu
Microsoft Research, George Dahl University of Toronto,
Abdel-rahman Mohamed University of Toronto, Navdeep
Jaitly University of Toronto, Vincent Vanhoucke Google,
Patrick Nguyen Google, Tara N. Sainath IBM T. J.
Watson Research Center, Brian Kingsbury IBM T. J. Watson
Research Center
Exemplar-Based
Processing for Speech Recognition
Tara N. Sainath IBM
T. J. Watson Research Center, Bhuvana Ramabhadran IBM
T. J. Watson Research Center, David Nahamoo IBM T.
J. Watson Research Center, Dimitri Kanevsky IBM T.
J. Watson Research Center, Dirk Van Compernolle KU
Leuven, Kris Demuynck KU Leuven, Jort F. Gemmeke
KU Leuven, Jerome R. Bellegarda Apple, Shiva
Sundaram Deutsche Telekom Laboratories
OS.49-IVM.17 Recent
Advances in High Fidelity Digital Imaging
Session Chairs: Masahiro
Okuda, Yuichi Tanaka Location: Beachwood
Directional
Image Decomposition Using Retargeting Pyramid
Yuichi Tanaka Tokyo
University of Agriculture and Technology, Keiichiro Shirai
Shinshu University
Retargeting pyramid (RP) is an alternative
method for multiscale image decomposition to the well-known
Laplacian pyramid (LP). RP can be obtained by replacing the
low-pass filtering process in LP with content-aware image
resizing (a.k.a. retargeting), which is a developing technique in computer vision research. Furthermore, we use RP for contourlet-based directional image decomposition. In our experiments, the proposed decomposition outperforms the LP-based contourlet transform for image denoising.
A
Hierarchical Convex Optimization Approach for High Fidelity
Solution Selection in Image Recovery
Shunsuke Ono Tokyo Institute
of Technology, Isao Yamada Tokyo Institute of Technology
The aim of this paper is to propose
a hierarchical convex optimization for selecting a high fidelity
image from possible solutions of a convex optimization problem
associated with existing image recovery methods. Image recovery
problems have been cast in certain convex optimization problems
which have infinitely many solutions in general. However,
existing convex optimization algorithms are designed to reach one solution arbitrarily, and hence cannot select a solution corresponding to a high fidelity image from the possible solutions. In this paper, we propose to select a high fidelity image by solving a newly-formulated hierarchical convex optimization problem. This problem is a constrained minimization of a convex criterion over the solution set of all images which are optimal
in the sense of any existing image recovery method. The hierarchical
convex optimization problem is efficiently solved by a proposed
iterative scheme based on the hybrid steepest descent method
with the help of a nonexpansive mapping related to the Douglas-Rachford
splitting type algorithms. Numerical results indicate that
our method appropriately selects a recovered image of high
fidelity in the case of inpainting and compressed sensing
recovery.
Head
Pose Estimation using Motion Subspace Matching on GPU
Nawarat Auttanugune Chulalongkorn
University, Thanarat Chalidabhongse Chulalongkorn University,
Supavadee Aramvith Chulalongkorn University
Head pose estimation is the process of recovering the 3D head orientation, in terms of yaw, pitch, and roll, from 2D images. However, the reduction of information from 3D to 2D makes this an ill-posed problem. In this paper, we propose a novel head pose estimation algorithm that includes facial feature tracking for Thai sign language recognition. Correct head pose estimation requires high-precision tracking of feature points, which is difficult with low-cost cameras whose input image quality is generally poor. To overcome this problem, we introduce an automatic camera signal calibration so that the features can be tracked correctly regardless of the quality of the input image sequences. Finally, as our approach is based on state-space search, local minima are a common problem; hence, we divide the search space into subspaces and perform parallel computation on a GPU.
High
Dynamic Range Image Compression using Base Map Coding
Takuya Fujiki University
of Kitakyushu, Nicola Adami Brescia University,
Takao Jinno Toyohashi University Of Technology,
Masahiro Okuda University of Kitakyushu
As high dynamic range (HDR) images generally have more than 10 bits (over 1024 levels) per channel, their enormous data size often needs to be reduced when transmitting or storing the images. The development of effective compression methods is thus an important research topic. Recently, many techniques for HDR image compression have been proposed, and several two-layer coding algorithms that separately encode a low dynamic range (LDR) image and a residual image have been studied. However, those methods are inefficient in coding performance. In this article, we propose a new two-layer coding algorithm for HDR images, which realizes a two-layered dynamic range. Our method encodes a base map, which is a blurred version of the HDR image, and an LDR image produced from the base map. Our algorithm significantly improves compression performance.
Efficient
Lossless Bit Depth Scalable Coding for HDR Images
Masahiro Iwahashi Nagaoka
University of Technology, Hitoshi Kiya Tokyo Metropolitan
University
This report proposes two-layer bit-depth scalable coding methods for high dynamic range (HDR) images expressed in floating-point format. From the base layer bit stream, low dynamic range (LDR) images are decoded; they are tone-mapped appropriately for human visual sensitivity and shortened to a standard bit depth, e.g., 8 bits. From the enhancement layer bit stream, HDR images are decoded. However, the bit depth of this layer is huge in existing methods. To reduce it, we divide the tone mapping into a reversible logarithmic mapping and its compensation. It is confirmed that the proposed methods significantly reduce the bit depth of the enhancement layer, even though the compensation slightly increases coding noise.
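A tiny illustration of why a reversible logarithmic mapping suits floating-point HDR data: frexp splits each sample into an integer exponent (a coarse log2 tone mapping) and a mantissa (the compensation), and ldexp recombines them bit-exactly. This only conveys the flavor of the idea, not the paper's exact mapping.

```python
import numpy as np

hdr = np.array([0.00123, 0.5, 1.0, 187.25, 4096.0], dtype=np.float32)

mantissa, exponent = np.frexp(hdr)          # hdr == mantissa * 2**exponent
reconstructed = np.ldexp(mantissa, exponent)

assert np.array_equal(hdr, reconstructed)   # bit-exact, hence lossless
# The exponent plane can be coded at a small bit depth, while the mantissa
# plane plays the role of the compensation layer.
```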
OS.50-WCN.5 Video/3D
Data Transmission over Mobile Network
Session Chairs: Hwangjun
Song, Jongwon Kim Location: Runyon
Real-Time
Acquisition and Representation of 3D Environmental Data
Se-Ho Lee Korea University,
Seong-Gyun Jeong Korea University, Tae-Young Chung
Korea University, Chang-Su Kim Korea University
We develop a mobile system to acquire
and represent 3D environmental data for modeling indoor spaces.
The system is composed of a laser range finder (LRF) and an
omni-directional camera. Multiple 3D point clouds from different
viewpoints are acquired as the geometric information by scanning
a scene with the LRF, while an omni-directional texture image
is acquired with the omni-directional camera. We merge those
multiple 3D point clouds into a single point cloud. We then
combine the point cloud and the texture image into a complete
3D mesh model in three steps. First, we downsample the point
cloud based on a voxel grid and estimate the normal vector
of each point. Second, using the normal vectors, we reconstruct
a 3D mesh based on the Poisson surface reconstruction. Third,
to assign texture information to the mesh surface, we estimate
the matching region in the omnidirectional image that corresponds
to each face of the mesh. Simulation results demonstrate that
the proposed system can reconstruct indoor spaces effectively.
Reliable
Scalable Video Multi-cast with Source Diversity and Inter-source
Network Decoding in Lossy Networks
Saran Tarnoi National
Institute of Informatics, Yusheng Ji National Institute
of Informatics, Wuttipong Kumwilaisak King Mongkut's University of Technology Thonburi
This paper presents a reliable scalable video multicast scheme with source diversity and inter-source network decoding in lossy networks. The source diversity technique provides path diversity, enabling better-quality layered video transmission in hostile environments. For each source, an optimization problem is formulated to find the best transmission route for each transmitted video layer. The objectives of the formulation are to maximize the number of transmitted video layers and the transmission reliability. The source providing the best overall throughput is selected as the primary source, while the rest serve as secondary sources. When the Quality-of-Service (QoS) guarantees of some transmitted video layers cannot be fulfilled by the primary source, the secondary source with the best QoS parameters is selected to transmit those layers to the destinations. The number of secondary sources used for transmission is increased until the QoS guarantees of all transmitted video layers are satisfied or all network resources are utilized. Network coding is deployed to multicast video layers from the same source for efficient resource usage. Network-coded data from different sources can be used together to decode the transmitted video data; in other words, each destination needs only a sufficient number of video packets from different sources to recover all transmitted video data. Simulations with different network topologies show improvements in both the objective and subjective quality of layered video multicast in lossy environments.
Advanced
Logical Superposition Modulation based Video Streaming Multicast
System over Wireless Networks
Kyuhwi Choi POSTECH,
Hwangjun Song POSTECH
In this paper, we present an advanced logical superposition modulation based video streaming multicast system over wireless networks. The traditional logical superposition modulation scheme generates symbols using a pre-fixed modulation scheme to overcome wireless channel diversity. In contrast, our proposed scheme determines the logical superposition modulation according to time-varying wireless channel conditions in order to maximize overall network throughput, similar to state-of-the-art adaptive modulation schemes. In addition, two-layer SVC video streams are effectively mapped to the determined superposition modulation scheme to support better-quality video streaming. Experimental results show the performance of our proposed system.
Constant
Frame Quality Control for H.264/AVC
Po-Chyi Su National Central
University, Ching-Yu Wu National Central University,
Long-Wang Huang National Central University, Chia-Yang
Chiou National Central University
A frame quality control mechanism for H.264/AVC is proposed in this research. The objective is to ensure that a suitable Quantization Parameter (QP) is assigned to each frame so that the target frame quality can be achieved. One application is maintaining consistent frame quality during encoding to facilitate uses such as video archiving and surveillance. A single-parameter Distortion-Quantization (D-Q) model is derived by training on a large number of frame blocks, and its parameter can be determined from the frame content. Given the target quality for a video frame, we can then select an appropriate QP according to the proposed D-Q model. Model refinement and QP adjustment for subsequent frames are applied according to the coding results of previous data. Structural Similarity (SSIM) is chosen as the quality measure to demonstrate the feasibility of the proposed framework.
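For intuition, QP selection from a one-parameter D-Q model might look like the sketch below, assuming a model of the form D = c * Qstep^2 together with the standard H.264 relation Qstep ≈ 2^((QP − 4)/6); both the model form and the refinement rule are our illustrative assumptions, not the paper's trained model.

```python
import math

def qp_for_target_distortion(target_d, c):
    """Invert D = c * Qstep^2, then map Qstep to an integer QP in [0, 51]."""
    qstep = math.sqrt(target_d / c)
    qp = 4.0 + 6.0 * math.log2(qstep)
    return min(51, max(0, round(qp)))

def refine_c(observed_d, qp):
    """Update the model parameter from the previous frame's (QP, D) pair."""
    return observed_d / 2.0 ** ((qp - 4.0) / 3.0)  # since Qstep^2 = 2^((QP-4)/3)

# c = refine_c(d_prev, qp_prev); qp_next = qp_for_target_distortion(d_target, c)
```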
Peer-assisted
Video-on-Demand in Multi-channel Switching WiFi-based Mobile
Networks
Jongwon Kim GIST,
Hayoung Yoon GIST
The network convergence paradigm will substantially increase the pervasive use of WiFi-enabled smart mobile devices. Although various on-demand streaming services are already available on WiFi-enabled mobile devices, streaming remains a challenging problem due to WiFi's limited communication range, mobility, and user population density issues. In this paper, we enhance our previous work on MOVi (Mobile Opportunistic Video-on-demand) by exploiting multi-channel switching capability at the mobile terminals. We reinvestigate the previous version of MOVi in this environment and propose an improved scheduling algorithm that incorporates collaborative congestion sensing and efficient channel allocation mechanisms. The scalability of the extended MOVi system is verified by extensive simulations: in terms of the number of supported users, an average improvement of 25% can be achieved. In addition, the proposed extension of MOVi provides more tolerance against increased volumes of non-MOVi traffic.
OS.51-WCN.6 Green Wireless Communications
Session Chair: Sumei Sun
Location: Laurel
An
Improved LLR Approximation Algorithm for Low-Complexity MIMO
Detection Towards Green Communications
Ruijuan Ma Xi’an Jiaotong
University, Pinyi Ren Xi’an Jiaotong University,
Chao Zhang Xi’an Jiaotong University, Qinghe Du
Xi’an Jiaotong University
As the number of transmit/receive
antennas gets large in wireless communication systems, the
drastically-increasing complexity in MIMO detection imposes
significant challenges in implementing green communications
while achieving high spectral efficiency. The winner-path-extension (WPE) K-best algorithm is an efficient detection algorithm
in uncoded MIMO systems, known for its stable throughput and
excellent symbol-error-rate (SER) and bit-error-rate (BER)
performances under relatively low complexity. However, when
applying the WPE K-best algorithm to coded MIMO systems, where soft-output information such as the log-likelihood ratio (LLR) is required, the missing counter-hypothesis issue in LLR calculation often degrades the error performance. To solve this problem, in this paper we propose an improved LLR approximation algorithm so that the WPE K-best algorithm is well suited to coded MIMO systems. Specifically, when a counter-hypothesis is missing, we set a metric threshold for it by calculating the metric of the bit-flipping vector, and then randomly choose a value below the threshold as the approximation.
We conduct simulation evaluations for our proposed algorithm
in an 8×8 MIMO multiplexing system employing 16QAM modulation
and Turbo coding. Simulation results show that, compared with other existing LLR approximation schemes, our proposed approach can effectively improve the block-error-rate (BLER) performance while also reducing the complexity of the tree search in the WPE K-best algorithm. Moreover, we use a look-up table method to determine the Schnorr-Euchner (SE) enumeration order, which further decreases the computational complexity of the WPE K-best algorithm.
Green
Wireless Communications: A Power Amplifier Perspective
Jingon Joung Institute
for Infocomm Research, Chin Keong Ho Institute for
Infocomm Research, Sumei Sun Institute for Infocomm
Research
In this paper, we survey two essential
and practical characteristics of radio-frequency power amplifier
(PA), namely, linearity and efficiency. Nonlinear amplification
yields significant distortion of the transmit signals and
strong interference for cochannel users. Imperfect PA efficiency imposes a system overhead, resulting in energy efficiency (EE) degradation. Therefore, the linearity and
efficiency of the PA should be precisely characterized in
the system design. We first survey the linearity and efficiency
models of PA, and then introduce commonly used technologies
for improving the EE according to three approaches: i) transmitter
architecture, ii) signal processing, and iii) network protocols.
We then introduce our recent work on a multiple PA switching
(PAS) architecture, in which one or more PAs are switched
on at any time to maximize the EE while satisfying the required
spectral efficiency (SE). We consider the case where either
full or partial channel state information is available at
the transmitter (CSIT). Since the transmitter selects the
most efficient PA that satisfies a target rate with the least
power consumption, EE is improved, and a Pareto-optimal SE-EE
tradeoff region can be enlarged as verified in the numerical
results with real-life device parameters. For example, we observe around 323% and 50% EE improvements for a single-antenna system with full CSIT and for a transmit antenna selection and maximum ratio combining system with partial CSIT, respectively;
as a result, we can surmise that the PAS is one promising
technology for green, i.e., energy efficient, wireless communication
systems.
Area
Spectral Efficiency of Cooperative Network with DF and AF
Relaying
Lei Zhang University
of Victoria, Hong-Chuan Yang University of Victoria,
Mazen Hasna Qatar University
Most performance metrics for cooperative networks focus on the quantification of either spectral efficiency or link reliability, without considering the spatial effect of radio transmission. Area spectral efficiency (ASE) was first introduced to quantify the spatial spectral utilization efficiency of cellular systems. In this paper, we generalize
the definition of ASE and investigate this performance metric
in a three-node cooperative network with decode-and-forward
(DF) and amplify-and-forward (AF) relaying. We derive the
mathematical expression of ASE with the consideration of path-loss
and fading effects. We show through selected numerical examples
that ASE provides a new perspective on the spectrum utilization
efficiency and transmission power design.
Validation
of a Green Wireless Communication System with ICA based Semi-Blind
Equalization
Teng Ma University of
Liverpool, Xu Zhu University of Liverpool, Yufei
Jiang University of Liverpool, Yi Huang University
of Liverpool
In this paper, we validate a green wireless communication system with independent component analysis (ICA) based, precoding-aided semi-blind equalization on a testbed consisting of a Keithley signal generator and signal analyzer pair connected to antennas. The implemented system requires only a small amount of side information to be transmitted to the receiver and therefore achieves energy- and spectrum-efficient green communications. The system performance is measured in different wireless channels and compared with simulation results. The impact of the precoding weight constant on the BER performance is shown, and the optimal constant value is found. The impact of frame length on the performance of ICA-based equalization is also evaluated.
Energy
Efficient Cooperation for Two-Hop Relay Networks
Ernest Kurniawan Stanford
University, Stefano Rini Technische Universitat Munchen,
Andrea Goldsmith Stanford University
We analyze the impact of cooperation
on the energy efficiency of two-hop relay transmissions. While
cooperation has been demonstrated to improve spectral efficiency,
the benefits in terms of energy consumption are not well characterized.
We show that cooperation is not always beneficial in this
context, since the energy required to facilitate the cooperation
can sometimes outweigh its benefit. In this work, we first
characterize the optimal energy efficient strategy for a single
source-destination transmission aided by multiple relay nodes
as a function of network parameters and the transmission rate.
We then extend this result to a relay-assisted cellular broadcast
network and determine an optimal solution which provides guidelines
on the cooperative strategy that improves energy efficiency
in such networks.
OS.52-SLA.20 Behavioral
Informatics: Enabling Technologies and Applications (II)
Session Chairs: Shri Narayanan,
Panayiotis Georgiou Location: Trousdale Estates
Analyzing
the Language of Therapist Empathy in Motivational Interview
based Psychotherapy
Bo Xiao University of
Southern California, Dogan Can University of Southern
California, Panayiotis Georgiou University of Southern
California, David Atkins University of Washington ,
Shrikanth Narayanan University of Southern California
Empathy is an important aspect of
social communication, especially in medical and psychotherapy
applications. Measures of empathy can offer insights into
the quality of therapy. We use an N-gram language model based
maximum likelihood strategy to classify empathic versus non-empathic
utterances and report the precision and recall of classification
for various parameters. High recall is obtained with unigram
while bigram features achieved the highest F1-score. Based
on the utterance level models, a group of lexical features
are extracted at the therapy session level. The effectiveness
of these features in modeling session level annotator perceptions
of empathy is evaluated through correlation with expert-coded
session level empathy scores. Our combined feature set achieved
a correlation of 0.56 between predicted and expert-coded empathy
scores. Results also suggest that the longer term empathy
perception process may be more related to isolated empathic
salient events.
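The N-gram maximum-likelihood strategy reduces to training one language model per class and labeling an utterance by which model scores it higher. The unigram version below, with add-one smoothing and toy data, is only a schematic of that idea.

```python
import math
from collections import Counter

def train_unigram(utterances):
    """Unigram counts and total token count for one class."""
    counts = Counter(w for u in utterances for w in u.split())
    return counts, sum(counts.values())

def loglik(utterance, model, vocab_size):
    """Add-one smoothed unigram log-likelihood of an utterance."""
    counts, total = model
    return sum(math.log((counts[w] + 1.0) / (total + vocab_size))
               for w in utterance.split())

empathic = train_unigram(["that sounds really hard", "i hear you"])
neutral = train_unigram(["next question please", "time is up"])
vocab_size = len(set(empathic[0]) | set(neutral[0])) + 1  # +1 for unseen words

utt = "that sounds hard"
label = ("empathic" if loglik(utt, empathic, vocab_size) >
         loglik(utt, neutral, vocab_size) else "non-empathic")
```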
Using
Interval Type-2 Fuzzy Logic to Analyze Turkish Emotion Words
Ozan Cakmak Mustafa Kemal
University, Serdar Yildirim Mustafa Kemal University,
Abe Kazemzadeh University of Southern California, Shrikanth
Narayanan University of Southern California
This paper describes a methodology that shows the feasibility of a fuzzy logic (FL) representation of Turkish emotion-related words. We analyzed a set of 197 Turkish emotion words through a web-based survey that prompted users with emotional words and asked them to enter interval-valued valence, activation, and dominance attributes using a double slider. Our previous experimental results indicated a strong correlation between the emotions attributed to Turkish word roots and to Turkish sentences. In this paper, we extend our previous work and analyze Turkish emotion words using interval type-2 fuzzy logic.
Dialogue
Support for Memory Impaired People
Luca Bellodi ASML,
Radu Jasinschi Philips, Gerard De Haan Philips,
Murtaza Bulut Philips Research
People affected by short-term memory loss and cognitive impairment have serious difficulties in communication. This may lead to social isolation and lack of community access, a key barrier to independence for people suffering from Alzheimer's Disease, the most common form of memory and cognitive impairment. We propose Automated Memory Support for Social Interaction (AMSSI), a system that helps memory-impaired people with their social interaction. The system provides active support that may help reduce patients' stress levels. AMSSI recognizes visitors, determines the purpose of the visit, monitors the dialogue, determines whether the patient needs support, and provides feedback. AMSSI is tailored to patient needs; it is computationally fast, fully automated, and can be operated by the patient without supervision. The proposed assistive system can be beneficial for improving the quality of life of patients with mild to moderate cognitive impairments. This paper describes the implementation of the first working prototype of the AMSSI system. Validation user tests are still to be conducted.
Composite-DBN
for Recognition of Environmental Contexts
Selina Chu Oregon State
University, Shrikanth Narayanan University of Southern
California, C.-C. Jay Kuo University of Southern California
People’s behaviors are usually dictated by their surroundings: the local environment often affects the character and disposition of the people within it. The goal of our work is to automatically recognize the type of environment a person might be in. We introduce a hierarchical structure to recognize environmental contexts from the surrounding audio, and use this structure to discover high-level representations for different acoustic environments in a data-driven fashion. Being able to perform this function allows us to better understand how such information could assist in predicting a person's emotion or behavior. To make an accurate and informed decision about behaviors or emotions, it is important to be able to differentiate between different types of environments. Environmental sound exhibits large variance even within a single environment type and is constantly changing; these changes and events are dynamic and inconsistent. The goal is to build models that are robust enough to generalize to different situations. Learning a hierarchy of sound types can mitigate the confusion between multiple acoustic environments with similar characteristics. We propose a framework of composite deep belief networks (composite-DBNs) to represent various levels of representation and to recognize twelve different types of common everyday environments. Experimental results demonstrate promising performance, improving on the state of the art in acoustic environment recognition.
OS.53-IVM.18 3DTV
and Free-viewpoint TV (II)
Session Chairs: Masayuki
Tanimoto, Yo-Sung Ho Location: Franklin Hills
Global
Optimization for Spatio-Temporally Consistent View Synthesis
Hsiao-An Hsu National
Tsing Hua University, Chen-Kuo Chiang National Tsing
Hua University, Shang-Hong Lai National Tsing Hua University
We propose a novel algorithm to generate a virtual-view video from a video-plus-depth sequence. The proposed method enforces spatial and temporal consistency in the disocclusion regions by formulating the problem as an energy minimization problem in a Markov random field (MRF) framework. At the system level, we first recover the depth images and the motion vector maps after image warping with the preprocessed depth map. Then we formulate the energy function for the MRF with additional shift variables for each node. To reduce the high computational complexity of applying belief propagation (BP) to this problem, we present a multi-level BP scheme that uses smaller numbers of label candidates at each level. Finally, Poisson image reconstruction is applied to improve the color consistency along the boundary of the disocclusion region in the synthesized image. Experimental results demonstrate the performance of the proposed method on several publicly available datasets.
Depth
Intra Skip Prediction for 3D Video Coding
Kwan-Jung Oh Samsung
Electronics, Jaejoon Lee Samsung Electronics, Du-Sik
Park Samsung Electronics
Depth image compression plays a key
role in the 3D video system. It can be used as supplementary
data for rendering. In this paper, we present a depth intra
skip prediction for 3D video coding. The depth intra skip
prediction is designed based on the intra 16x16 mode. It exploits
an estimated prediction direction derived from the neighboring pixels and does not encode any residual
data. The proposed depth intra skip prediction is allowed
for I slice as well as P and B slices. The usage of the depth
intra skip prediction is signaled by a newly defined macroblock
level flag. From the experiments, we confirm that the proposed depth intra skip prediction reduces the depth bit rate by up to 18.69% while preserving the same synthesis quality for the virtual views.
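For intuition, a hypothetical sketch of a residual-free, direction-derived intra prediction in the spirit described above (the paper's actual direction derivation may differ) might look like this:

```python
import numpy as np

def intra_skip_predict(top_row, left_col):
    """Hypothetical sketch of a direction-derived intra prediction
    with no residual, in the spirit of an intra 16x16 skip mode.

    top_row  : 16 reconstructed pixels above the block
    left_col : 16 reconstructed pixels to the left of the block
    The direction is chosen from the smoothness of the neighbours:
    a flat top row suggests vertical prediction, and vice versa.
    """
    var_top = np.var(np.diff(top_row.astype(float)))
    var_left = np.var(np.diff(left_col.astype(float)))
    if var_top <= var_left:
        # vertical: copy the top row down through the block
        return np.tile(top_row, (16, 1))
    # horizontal: copy the left column across the block
    return np.tile(left_col.reshape(16, 1), (1, 16))
```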
Depth
Boundary Filtering for View Synthesis in 3D Video
Yunseok Song Gwangju
Institute of Science and Technology, Cheon Lee Gwangju
Institute of Science and Technology, Woo-Seok Jang Gwangju
Institute of Science and Technology, Yo-Sung Ho Gwangju
Institute of Science and Technology
This paper presents a boundary sharpening
method for depth maps to improve synthesis view quality. In
general, coded depth maps exhibit noise and artifacts around
object boundaries, leading to ineffective view synthesis.
In our approach, gradient information is used to extract depth
boundary regions. Afterwards, filtering based on distance,
similarity, and direction is performed on such regions to
replace depth values. The proposed algorithm was implemented on 3DV-ATM v0.3 as post-processing for coded depth maps. Experimental results showed a 5.23% improvement compared to the anchor results of 3DV-ATM v0.3, and subjective quality was improved as well.
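A minimal sketch of the general idea, assuming a gradient threshold for boundary extraction and a bilateral-style combination of distance and similarity weights (the paper's direction term is omitted):

```python
import numpy as np

def sharpen_depth_boundaries(depth, grad_thresh=8, radius=2,
                             sigma_d=2.0, sigma_r=10.0):
    """Sketch: detect boundary pixels from the depth gradient, then
    replace their values with a distance- and similarity-weighted
    average of their neighbours."""
    depth = depth.astype(float)
    gy, gx = np.gradient(depth)
    boundary = np.hypot(gx, gy) > grad_thresh   # gradient-based regions
    out = depth.copy()
    h, w = depth.shape
    for y, x in zip(*np.nonzero(boundary)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        patch = depth[y0:y1, x0:x1]
        yy, xx = np.mgrid[y0:y1, x0:x1]
        w_dist = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_d ** 2))
        w_sim = np.exp(-((patch - depth[y, x]) ** 2) / (2 * sigma_r ** 2))
        wgt = w_dist * w_sim
        out[y, x] = (wgt * patch).sum() / wgt.sum()
    return out
```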
Single-Image
Depth Inference based on Blur Cues
Jingwei Wang University
of Southern California, Hao Xu University of Southern
California, C.-C. Jay Kuo University of Southern California
With the rapid advancement of 3D visual
technology, the technique of depth inference from a single
image has received new attention. In this work, we present several single-image depth inference algorithms based on the degree of blur in different regions of an image. We identify
two major sources of image blur: camera defocus and atmospheric
reflectance. The latter is also known as haziness. We
build models for these two scenarios with the depth information
as a model parameter. Thus, we are able to infer the depth information from the observed image. Experiments are conducted on a large variety of images to demonstrate the robustness of the proposed depth inference method.
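As an illustration of the defocus half of the idea, a crude local blur measure can serve as a relative-depth proxy; the window size and the Laplacian-energy definition below are assumptions, not the paper's models:

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def local_blurriness(gray, win=15):
    """Crude per-pixel blur proxy: local energy of the Laplacian.
    Low high-frequency energy suggests stronger blur, which under a
    defocus model correlates with distance from the focal plane."""
    hp = laplace(gray.astype(float))
    energy = uniform_filter(hp ** 2, size=win)
    # Invert and normalize so larger values mean "blurrier" (farther,
    # under the defocus assumption).
    blur = 1.0 / (1.0 + energy)
    return (blur - blur.min()) / (np.ptp(blur) + 1e-12)
```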
OS.54-SLA.21 Recent
Advances in Speaker Characterization and Recognition
Session Chairs: Changchun
Bao, Jia Min Karen Kua Location: Whitley Heights
An
Investigation into Better Frequency Warping for Time-Varying
Speaker Recognition
Linlin Wang Tsinghua
University, Xiaojun Wu Tsinghua University, Thomas
Fang Zheng Tsinghua National Laboratory for Information
Science and Technology, Chenhao Zhang Tsinghua University
Performance degradation has been observed in the presence of long time intervals between enrollment and test in practical speaker recognition systems. Researchers usually resort to enrollment data augmentation,
speaker model adaptation, and variable verification threshold
to alleviate the time-varying impact. However, in this paper,
efforts have been made in the feature domain, and an investigation into better frequency warping for the target task has been conducted. Two methods to determine the discrimination sensitivity
of frequency bands are explored: an energy-based F-ratio measure
and a performance-driven one. Frequency warping is performed
according to the discrimination sensitivity curves of the
whole frequency range. Experimental results show that the
proposed methods outperform both MFCC and LFCC, and to some
extent, alleviate the time-varying impact on speaker recognition.
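A minimal sketch of the energy-based F-ratio measure, assuming per-utterance band energies and the usual between-/within-speaker variance ratio:

```python
import numpy as np

def band_f_ratio(band_energy, speaker_ids):
    """Energy-based F-ratio per frequency band.

    band_energy : (N, B) per-utterance energies in B frequency bands
    speaker_ids : (N,) speaker label for each row
    Returns a length-B array; larger values indicate bands that
    separate speakers better and would receive finer warping.
    """
    speakers = np.unique(speaker_ids)
    means = np.array([band_energy[speaker_ids == s].mean(axis=0)
                      for s in speakers])
    grand = band_energy.mean(axis=0)
    between = ((means - grand) ** 2).mean(axis=0)   # between-speaker variance
    within = np.array([band_energy[speaker_ids == s].var(axis=0)
                       for s in speakers]).mean(axis=0)
    return between / (within + 1e-12)
```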
A
K-Phoneme-Class based Multi-Model Method for Short Utterance
Speaker Recognition
Chenhao Zhang Tsinghua
University, Thomas Fang Zheng Tsinghua National Laboratory
for Information Science and Technology, Linlin Wang Tsinghua
University, Xiaojun Wu Tsinghua University, Cong
Yin Taiyuan University of Technology
For GMM-UBM based text-independent
speaker recognition, the performance decreases significantly
when the test speech is too short. Considering that the use of text information is helpful, a K-phoneme-class scoring based multiple phoneme class speaker model method (abbreviated as KPCMMM) is proposed. It comprises a phoneme-class speech recognition stage and a phoneme-class-dependent multi-model speaker recognition stage, where K is the number of most likely phoneme classes to be used in the second stage. Two different phoneme class
definitions, expert-knowledge based and data-driven, are compared,
and the performance as a function of K is also studied. Experimental
results show that the data-driven phoneme class definition
outperforms the expert-knowledge based one, and that an appropriate
K value can lead to much better performance. Compared with
the baseline GMM-UBM system, the proposed KPCMMM can achieve
a relative equal error rate (EER) reduction of 38.60% for text-independent speaker recognition with test speech shorter than two seconds.
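A schematic sketch of the two-stage scoring, with all model interfaces assumed (the `score` methods and model containers below are hypothetical), may clarify the role of K:

```python
import numpy as np

def kpcmmm_score(features, class_models, speaker_models_by_class, k=3):
    """Sketch of K-phoneme-class multi-model scoring (names assumed).

    class_models            : dict class_id -> model with score(features)
    speaker_models_by_class : dict class_id -> claimed speaker's model
    Stage 1 ranks phoneme classes on the test utterance; stage 2
    averages the claimed speaker's per-class scores over the top K.
    """
    ranked = sorted(class_models,
                    key=lambda c: class_models[c].score(features),
                    reverse=True)
    top_k = ranked[:k]
    return np.mean([speaker_models_by_class[c].score(features)
                    for c in top_k])
```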
A
Study on Spoofing Attack in State-Of-The-Art Speaker Verification:
The Telephone Speech Case
Zhizheng Wu Nanyang
Technological University, Tomi Kinnunen University
of Eastern Finland, Eng Siong Chng Nanyang Technological
University, Haizhou Li Nanyang Technological University/Institute
for Infocomm Research, Eliathamby Ambikairajah University
of New South Wales
Voice conversion, a technique that modifies one speaker's (source) voice to sound like another speaker's (target), presents a threat to automatic speaker verification.
In this paper, we first present new results of evaluating
the vulnerability of current state-of-the-art speaker verification systems, namely Gaussian mixture model with joint factor analysis (GMM-JFA) and probabilistic linear discriminant analysis (PLDA) systems, to spoofing attacks. The spoofing attacks are
simulated by two voice conversion techniques: Gaussian mixture
model based conversion and unit selection based conversion.
To reduce the false acceptance rate caused by spoofing attacks, we propose a general anti-spoofing framework for speaker verification systems, where a converted speech detector is adopted as a post-processing module for the speaker verification system's acceptance decision. The detector decides whether
the accepted claim is human speech or converted speech. A
subset of the core task in the NIST SRE 2006 corpus is used
to evaluate the vulnerability of the speaker verification systems and the performance of the converted speech detector. The results indicate that both conversion techniques can increase the false acceptance rates of the GMM-JFA and PLDA systems, while the converted speech detector can reduce the false acceptance rate from 31.54% and 41.25% to 1.64% and 1.71% for the GMM-JFA and PLDA systems on unit-selection based converted speech, respectively.
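The post-processing idea reduces to a simple decision cascade; the sketch below assumes hypothetical `score` interfaces and thresholds:

```python
def verify_with_antispoofing(trial, verifier, detector,
                             verif_threshold, detect_threshold):
    """Sketch of the post-processing cascade (interfaces assumed):
    accept a claim only if (1) the speaker verifier accepts it and
    (2) the converted-speech detector judges the audio to be human."""
    if verifier.score(trial) < verif_threshold:
        return False                        # rejected by verification
    # Higher detector score = more likely genuine human speech (assumed).
    return detector.score(trial) >= detect_threshold
```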
PNCC-ivector-SRC
based Speaker Verification
Eliathamby Ambikairajah
University of New South Wales, Jia Min Karen Kua University
of New South Wales, Vidhyasaharan Sethu University
of New South Wales, Haizhou Li Nanyang Technological
University/Institute for Infocomm Research
Most conventional features used in
speaker recognition are based on Mel Frequency Cepstral Coefficients
(MFCC) or Perceptual Linear Prediction (PLP) coefficients.
Recently, the Power Normalised Cepstral Coefficients (PNCC), which are computed based on auditory processing, have been proposed as an alternative feature to MFCC for robust speech
recognition. The objective of this paper is to investigate
the speaker verification performance of PNCC features with
a Sparse Representation Classifier (SRC), using a mixture
of l_1 and l_2 norms. The paper also explores the score level
fusion of both MFCC and PNCC i-vector based speaker verification
systems. Evaluations on the NIST 2010 SRE extended database
show that the fusion of MFCC-SRC and PNCC-SRC gave the best
performance with a DCF of 0.4977. Further, cosine distance
scoring (CDS) based systems were also investigated and the
fusion of MFCC-CDS and PNCC-CDS presented an improvement in
terms of EER, from a 3.99% EER baseline to 3.55%.
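A minimal sketch of SRC-style scoring with a mixed l_1/l_2 penalty, here realized with scikit-learn's ElasticNet as one possible solver (the paper's exact optimizer and normalization are not specified here):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def src_classify(test_ivec, dictionary, labels, alpha=0.01, l1_ratio=0.7):
    """Sparse representation classifier sketch.

    dictionary : (D, N) matrix whose columns are enrolled i-vectors
    labels     : (N,) speaker label of each column
    The test vector is coded over the dictionary with a mixed
    l1/l2 penalty; the identity with the smallest class-restricted
    reconstruction residual wins."""
    net = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False)
    net.fit(dictionary, test_ivec)          # sparse code over all columns
    code = net.coef_
    residuals = {}
    for spk in np.unique(labels):
        # Keep only this speaker's coefficients and measure the residual.
        masked = np.where(labels == spk, code, 0.0)
        residuals[spk] = np.linalg.norm(test_ivec - dictionary @ masked)
    return min(residuals, key=residuals.get)
```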
Speaker
Verification using Lasso based Sparse Total Variability Supervector
with PLDA modeling
Ming Li University of
Southern California, Charley Lu 3M Cogent, Anne
Wang 3M Cogent, Shrikanth Narayanan University of
Southern California
In this paper, we propose a Lasso
based framework to generate the sparse total variability supervectors
(s-vectors). In contrast to the factor analysis framework, which
uses a low dimensional Eigenvoice subspace to represent the
mean supervector, the proposed Lasso approach utilizes the
l1 norm regularized least square estimation to project the
mean supervector on a pre-defined dictionary. The number of
samples in this dictionary is appreciably larger than the
typical Eigenvoice rank but the l1 norm of the Lasso solution
vector is constrained. Only a small number of samples in the
dictionary are selected for representing the mean supervector,
and most of the dictionary coefficients in the Lasso solution
are 0. We denote these sparse dictionary coefficient vectors in the Lasso solutions as the s-vectors and model them using probabilistic linear discriminant analysis (PLDA) for speaker
verification. The proposed approach generates comparable results
to the conventional cosine distance scoring based i-vector
system and improvement is achieved by fusing the proposed
method with either the i-vector system or the joint factor
analysis (JFA) system. Experimental results are reported on
the female part of the NIST SRE 2010 task with common conditions
using equal error rate (EER), norm old minDCF and norm new
minDCF values. The norm new minDCF cost was reduced by 7.5%
and 9.6% relative when fusing the proposed approach with the
baseline JFA and i-vector systems, respectively. Similarly,
8.3% and 10.7% relative norm old minDCF cost reductions were observed in the fusion.
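A minimal sketch of the s-vector extraction step, using scikit-learn's Lasso as one possible l1-regularized least-squares solver (dictionary construction and PLDA modeling are omitted):

```python
import numpy as np
from sklearn.linear_model import Lasso

def extract_s_vector(mean_supervector, dictionary, alpha=0.05):
    """Sketch of s-vector extraction: solve
        min_w ||m - D w||^2 + alpha * ||w||_1
    so that only a few dictionary columns represent the utterance.

    dictionary : (D, N) pre-defined dictionary with N appreciably
    larger than a typical Eigenvoice rank; the sparse coefficient
    vector w is the s-vector, later modeled with PLDA."""
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    lasso.fit(dictionary, mean_supervector)
    return lasso.coef_      # mostly zeros; the nonzeros define the s-vector
```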
PS.6-IVM.19 Selected
Topics in Image Processing and Computer Vision
Session Chair: Jin-Jang
Leou Location: Solano
Image
Inpainting by Block-Based Linear Regression with Optimal Block
Selection
Akira Tanaka Hokkaido
University, Takahiro Ogawa Hokkaido University,
Miki Haseyama Hokkaido University
Estimation of missing entries in multivariate data is one of the classical problems in the field of statistical science. One of the most popular approaches to this problem is linear regression based on the EM algorithm. When applying this approach to block-based image inpainting problems, we have additional information: a target lost pixel could be included in multiple blocks, which implies that we have multiple candidate estimates for the pixel. In such cases, we have to choose a good estimate among the multiple candidates. In this paper, we propose a
novel image inpainting method incorporating optimal block
selection in terms of the expected squared errors among multiple
candidates of the estimate for the target pixel. Results of
numerical examples are also shown to verify the efficacy of
the proposed method.
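The selection step itself is simple once each covering block supplies an estimate and an expected squared error; a vectorized sketch (the error model producing these quantities is assumed, not the paper's):

```python
import numpy as np

def select_best_estimates(candidates, expected_sq_errors):
    """Optimal block selection, simplified: among the candidate
    estimates a lost pixel receives from every block containing it,
    keep the one with the smallest model-based expected squared error.

    candidates, expected_sq_errors : (C, P) arrays — C candidate
    estimates (one per covering block) for each of P lost pixels.
    """
    idx = np.argmin(expected_sq_errors, axis=0)
    return candidates[idx, np.arange(candidates.shape[1])]
```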
Generalized
Histogram Shifting-Based Reversible Data Hiding with an Adaptive
Binary-to-q-ary Converter
Masaaki Fujiyoshi Tokyo
Metropolitan University, Hitoshi Kiya Tokyo Metropolitan
University
This paper increases the flexibility
of generalized histogram shifting-based reversible data hiding
(HS-RDH). An RDH method modifies an original image to hide data in the image, and it not only extracts the hidden data but also restores the original image from the distorted image that conveys the hidden data. A generalized HS-RDH
method increases (or decreases) particular pixel values in
an original image by (q - 1), based on its tonal distribution,
to hide q-ary data symbols, whereas an ordinary HS-RDH method
shifts a part of the histogram by one to embed binary symbols.
This paper introduces an adaptive binary-to-q-ary watermark converter and a tonal distribution analysis to increase the conveyable hidden data size, whereas a conventional generalized HS-RDH method with an arithmetic decoder-based converter cannot always convert the extracted q-ary strings back to the original binary strings correctly, and another method embeds q′-ary symbols instead of q-ary symbols, where q′ is a power of two equal to or less than q. In addition, a histogram packing technique is introduced in this paper to further increase q. Experimental results
show the effectiveness of the proposed method.
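For reference, the ordinary binary case (q = 2) that the paper generalizes can be sketched as follows; the peak/zero bin choice and overflow handling are simplified assumptions:

```python
import numpy as np

def hs_embed(img, bits):
    """Minimal binary histogram-shifting embed (the q = 2 special case
    of the generalized scheme). Pixels between the peak bin and an
    empty bin are shifted by one to vacate the bin next to the peak;
    peak pixels then carry one bit each. Returns the marked image
    plus (peak, zero), which the extractor needs for exact recovery."""
    hist = np.bincount(img.ravel(), minlength=256)
    peak = int(hist.argmax())
    zero = int(hist.argmin())            # ideally an empty bin
    out = img.astype(np.int32).copy()
    if peak < zero:
        out[(out > peak) & (out < zero)] += 1    # shift right by one
        idx = np.flatnonzero(out == peak)[:len(bits)]
        out.ravel()[idx] += np.asarray(bits[:len(idx)])
    else:
        out[(out < peak) & (out > zero)] -= 1    # shift left by one
        idx = np.flatnonzero(out == peak)[:len(bits)]
        out.ravel()[idx] -= np.asarray(bits[:len(idx)])
    return out.astype(np.uint8), (peak, zero)
```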
An
Enhanced Seam Carving Approach for Video Retargeting
Tzu-Hua Chao National
Chung Cheng University, Jin-Jang Leou National Chung
Cheng University, Han-Hui Hsiao National Chung Cheng
University
Video retargeting (resizing) is an
important task for displaying videos on various display devices.
In this study, an enhanced seam carving approach for video
retargeting is proposed, in which a seam may be a non-8-connected
one. Both the search window size and the temporal weight can
be adaptively adjusted according to video contents (motion
information). Additionally, to preserve temporal coherence,
an appearance-based method is employed. The spatial and temporal
costs of a pixel are linearly combined to compute the cumulative
cost with an adaptive temporal weight. Finally, dynamic programming
is used to determine the optimal non-8-connected seam (with
the minimum cumulative cost) to carve out. Based on the experimental results obtained in this study, the performance of the proposed approach is better than that of two comparison approaches.
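The classic 8-connected dynamic-programming seam search underlying such approaches can be sketched as follows; the paper's adaptive search window and non-8-connected seams are not reproduced:

```python
import numpy as np

def min_vertical_seam(cost):
    """Classic dynamic-programming seam search (8-connected).

    cost : (H, W) per-pixel cost (e.g. the combined spatial + temporal
           cost described above). Returns one column index per row."""
    h, w = cost.shape
    M = cost.astype(float).copy()
    for y in range(1, h):
        left = np.r_[np.inf, M[y - 1, :-1]]
        right = np.r_[M[y - 1, 1:], np.inf]
        M[y] += np.minimum(np.minimum(left, M[y - 1]), right)
    seam = np.empty(h, dtype=int)
    seam[-1] = int(M[-1].argmin())
    for y in range(h - 2, -1, -1):           # backtrack the cheapest path
        x = seam[y + 1]
        x0 = max(0, x - 1)
        seam[y] = x0 + int(M[y, x0:min(w, x + 2)].argmin())
    return seam
```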
Adaptive
Reversible Data Hiding in Frequency Domain via Integer-to-Integer
Transform
Yoshida Taichi Keio
University, Yusuke Okamura Keio University, Taizo
Suzuki Nihon University, Masaaki Ikehara Keio University
This paper proposes an adaptive reversible data hiding algorithm for images, which embeds significant information in the frequency domain based on an integer-to-integer transform. Its embedding method modifies a state-of-the-art one according to the transformed coefficients. It overcomes a problem of the conventional method, in which embedding often produces noticeable visual degradation. Our proposed algorithm outperforms the conventional one in the visual quality of embedded images, both objectively and perceptually, while maintaining the embedding capacity.
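As an example of an integer-to-integer transform of the kind mentioned (not necessarily the one used in the paper), a lifting-based integer Haar transform is exactly invertible in integer arithmetic, which is what makes frequency-domain reversible hiding possible:

```python
import numpy as np

def int_haar_forward(x):
    """Integer-to-integer Haar (S-transform) via lifting: exactly
    invertible in integer arithmetic, so data hidden in the
    coefficients can be removed and the original restored."""
    a, b = x[0::2].astype(np.int64), x[1::2].astype(np.int64)
    d = b - a                    # detail (prediction step)
    s = a + (d >> 1)             # approximation (update step)
    return s, d

def int_haar_inverse(s, d):
    a = s - (d >> 1)
    b = d + a
    x = np.empty(a.size + b.size, dtype=np.int64)
    x[0::2], x[1::2] = a, b
    return x
```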
3D
Editing System for Captured Real Scenes
Inwoo Ha Samsung Advanced
Institute of Technology, Yong Beom Lee Samsung Advanced
Institute of Technology, D. K. James Kim Samsung Electronics
This paper presents a complete 3D editing system for real scenes captured with a conventional color-depth camera. From the captured source and target images, the desired source objects and target locations are selected. The source objects are copied and pasted to the desired location of the target image in the color layer, without considering the mutual illumination between the source objects and the target image. To seamlessly composite the source objects into the target image based on their mutual illumination, 3D surface meshes of the source and target real scenes are reconstructed interactively, and a differential rendering framework based on instant radiosity is applied. The final result is a seamlessly mixed image with correct occlusions and mutual illumination.
A
Novel Criterion for Quality Improvement of JPEG Images Based
on Image Database and Re-application of JPEG
Katsuya Kohno Hokkaido
University, Akira Tanaka Hokkaido University, Hideyuki
Imai Hokkaido University
Image compression is one of the most important technologies in the field of image processing, and JPEG has been commonly used for it. Since JPEG is a lossy compression method, decoded images exhibit visually unwanted noise. Techniques for improving the quality of JPEG images are still needed because many JPEG-compressed images exist today. Many methods for improving the quality of JPEG images have been proposed; among them, a method based on re-application of JPEG, i.e., re-compressing and decoding, is recognized as one of the efficient methods. In our previous
study, we improved this method by incorporating an image database
and novel distance measures between two images. In this paper,
we propose a new distance measure between two images to improve
the performance of our previous method. We also show some
results of numerical experiments to verify the efficacy of
the proposed criterion.
Recovery
Method based Particle Filter for Object Tracking in Complex
Environment
Yuhi Shiina Waseda University,
Takeshi Ikenaga Waseda University
Object tracking is a key process for various image recognition applications, and many algorithms have been proposed in this field. In particular, the particle filter can track objects steadily thanks to prediction using many particles. However, other objects of similar color or shape near the target can hijack the tracking region, which is a critical problem. This paper proposes a recovery-method-based particle filter that focuses on a feature region attached to an object. The proposed method tracks both the feature region and the object containing it at once. When the tracking region is hijacked by another object, a recovery method pulls the tracking region back to an appropriate position using the prior frame's distance and angle between the two tracking regions. Several video sequences containing complex environments were tested to evaluate the proposal. The experimental results show that the proposed method can track a specified person in the sequences, while a conventional method cannot. These results indicate that the proposed recovery method works effectively when other objects hijack the tracking region.
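A generic condensation-style particle filter step plus the recovery idea can be sketched as follows; the likelihood model, motion noise, and hijack test are assumptions rather than the paper's specifics:

```python
import numpy as np

rng = np.random.default_rng(1)

def particle_filter_step(particles, weights, likelihood, motion_std=5.0):
    """One generic particle-filter iteration.

    particles : (N, 2) candidate region centers; likelihood(p) should
    compare a candidate region's appearance (e.g. color histogram)
    with the reference model and return a positive score."""
    # Predict: diffuse particles with Gaussian motion noise.
    particles = particles + rng.normal(0, motion_std, particles.shape)
    # Weight: appearance likelihood of each candidate region.
    weights = np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    # Systematic resampling.
    n = len(weights)
    idx = np.searchsorted(np.cumsum(weights),
                          (np.arange(n) + rng.random()) / n)
    return particles[idx], np.full(n, 1.0 / n)

def recover(feature_pos, prior_offset):
    """Recovery sketch: when the object region is judged hijacked,
    pull it back to the position implied by the (more reliable)
    feature region and the previous frame's offset between the two
    tracked regions."""
    return feature_pos + prior_offset
```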
Detection
of Salient Object Using Pixel Blurriness
Yi-Chong Zeng Institute
for Information Industry, Chi-Hung Tsai Institute for
Information Industry
In this paper we propose a method to detect the salient object in still images and in videos with non-slow-motion backgrounds. The key technique is measuring pixel blurriness. Generally speaking, when the salient object is captured in focus, pixels within the salient object should be sharper than those within the background. In the first step, the image intensity is extracted and four average filters of different sizes are applied to it. Subsequently, the variation of intensity differences (VID) is computed among the original intensity and the four blurred versions. The VID is employed to represent the degree of pixel blurriness. Finally, a thresholding method is applied to the pixel blurriness in order to distinguish the salient object from the background; the salient object is composed of low-blurriness pixels. The experimental results demonstrate that the proposed method is effective in detecting salient objects in still images and non-slow-motion-background videos, and it has better detection performance than the two compared methods.
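A minimal sketch of the described pipeline, with assumed filter sizes and a simple global threshold (the paper's exact VID definition and thresholding method may differ):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def blurriness_saliency(gray, sizes=(3, 5, 7, 9), thresh=None):
    """Sketch: blur the intensity with four average filters of
    different sizes, take the variation of the intensity differences
    (VID) at each pixel as a blurriness measure, and threshold it.
    In-focus (salient) pixels change more under blurring, so high
    VID marks low-blurriness, candidate salient pixels."""
    gray = gray.astype(float)
    diffs = np.stack([gray - uniform_filter(gray, s) for s in sizes])
    vid = diffs.var(axis=0)          # variation of intensity differences
    if thresh is None:
        thresh = vid.mean()          # simple global threshold (assumed)
    return vid > thresh              # True = sharp, candidate salient pixel
```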
|