SPEAKER-AWARE TRAINING OF LSTM-RNNS FOR ACOUSTIC MODELLING
Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently it has been shown that LSTM-RNNs can achieve higher recognition accuracy than deep feed-forward neural networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study speaker-aware training of LSTM-RNNs, which incorporates speaker information during model training to normalise the speaker variability. We first present several speaker-aware training architectures, and then empirically evaluate three types of speaker representation: i-vectors, bottleneck speaker vectors and speaking rate. Furthermore, to factorize the variability in the acoustic signals caused by speakers and phonemes respectively, we investigate speaker-aware and phone-aware joint training under the framework of multi-task learning. On the AMI meeting speech transcription task, speaker-aware training of LSTM-RNNs reduces word error rates by 6.5% relative to a very strong LSTM-RNN baseline that uses FMLLR features.
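For concreteness, here is a minimal sketch of one speaker-aware architecture: a per-utterance speaker vector (such as an i-vector) is appended to every input frame before the LSTM layers. The PyTorch framing and all dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpeakerAwareLSTM(nn.Module):
    """Sketch: concatenate a fixed speaker vector to each acoustic frame,
    then run LSTM layers and a per-frame senone classifier."""
    def __init__(self, feat_dim=40, spk_dim=100, hidden=512, n_senones=4000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + spk_dim, hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_senones)

    def forward(self, frames, spk_vec):
        # frames: (batch, time, feat_dim); spk_vec: (batch, spk_dim)
        T = frames.size(1)
        spk = spk_vec.unsqueeze(1).expand(-1, T, -1)  # repeat per frame
        h, _ = self.lstm(torch.cat([frames, spk], dim=-1))
        return self.out(h)  # per-frame senone logits
```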
Categories: Microsoft
PREDICTION-ADAPTATION-CORRECTION RECURRENT NEURAL NETWORKS FOR LOW-RESOURCE LANGUAGE SPEECH RECOGNITION
In this paper, we investigate the use of prediction-adaptation-correction recurrent neural networks (PAC-RNNs) for low-resource speech recognition. A PAC-RNN comprises a pair of neural networks in which a correction network uses auxiliary information given by a prediction network to help estimate the state probability. The information from the correction network is also used by the prediction network in a recurrent loop. Our model outperforms other state-of-the-art neural networks (DNNs, LSTMs) on IARPA-Babel tasks. Moreover, transfer learning from a language that is similar to the target language can further improve performance.
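A rough sketch of the recurrence as the abstract describes it: the prediction network is fed by the correction network's last output, and the correction network consumes the current frame plus the predicted auxiliary information. The cell types, sizes, and wiring details here are assumptions.

```python
import torch
import torch.nn as nn

class PACRNN(nn.Module):
    """Illustrative PAC-RNN loop: predict auxiliary info from the last
    state posterior, correct using the current frame plus that prediction."""
    def __init__(self, feat_dim=40, hidden=256, aux_dim=128, n_states=2000):
        super().__init__()
        self.predict = nn.GRUCell(n_states, aux_dim)            # prediction net
        self.correct = nn.GRUCell(feat_dim + aux_dim, hidden)   # correction net
        self.out = nn.Linear(hidden, n_states)

    def forward(self, frames):                 # frames: (batch, time, feat)
        B, T, _ = frames.shape
        aux = frames.new_zeros(B, self.predict.hidden_size)
        h = frames.new_zeros(B, self.correct.hidden_size)
        post = frames.new_zeros(B, self.out.out_features)
        outputs = []
        for t in range(T):
            aux = self.predict(post.softmax(-1), aux)  # feedback from correction
            h = self.correct(torch.cat([frames[:, t], aux], -1), h)
            post = self.out(h)
            outputs.append(post)
        return torch.stack(outputs, dim=1)     # per-frame state logits
```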
Categories: Microsoft
AN INVESTIGATION INTO USING PARALLEL DATA FOR FAR-FIELD SPEECH RECOGNITION
Far-field speech recognition is an important yet challenging task due to low signal-to-noise ratios. In this paper, three novel deep neural network architectures are explored to improve far-field speech recognition accuracy by exploiting parallel far-field and close-talk recordings. All three architectures use multi-task learning for model optimization but focus on three different ideas.
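As an illustration of multi-task learning on parallel data, here is one plausible variant (an assumption, not necessarily one of the paper's three architectures): shared layers over far-field features with a senone-classification head and a second head regressing the parallel close-talk frames.

```python
import torch
import torch.nn as nn

class MultiTaskFarField(nn.Module):
    """Sketch: shared trunk on far-field input, two task heads."""
    def __init__(self, feat_dim=40, hidden=1024, n_senones=4000):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.senone_head = nn.Linear(hidden, n_senones)
        self.clean_head = nn.Linear(hidden, feat_dim)  # regress close-talk frame

    def forward(self, far_feats):
        h = self.shared(far_feats)
        return self.senone_head(h), self.clean_head(h)

# joint loss: cross-entropy on senones plus L2 to the close-talk frames
def joint_loss(model, far, clean, senones, alpha=0.1):
    logits, denoised = model(far)
    return (nn.functional.cross_entropy(logits, senones)
            + alpha * nn.functional.mse_loss(denoised, clean))
```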
Categories: Microsoft
INTEGRATED ADAPTATION WITH MULTI-FACTOR JOINT-LEARNING FOR FAR-FIELD SPEECH RECOGNITION
Although great progress has been made in automatic speech recognition (ASR), significant performance degradation still exists in distant-talking scenarios due to significantly lower signal power. In this paper, a novel adaptation framework, named integrated adaptation with multi-factor joint-learning, is proposed to improve recognition accuracy for distant speech recognition. We explore and extract speaker, phone and environment factor representations using deep neural networks (DNNs), which are integrated into the main ASR DNN to improve classification accuracy. In addition, the hidden activations in the main ASR DNN are used to improve the factor extraction, which in turn helps the ASR DNN. All the model parameters, including those in the ASR DNN and the factor-extractor DNNs, are jointly optimized under the multi-task learning framework. Furthermore, unlike prior techniques, our approach requires no explicit separate stages for factor extraction and adaptation. Experiments on the AMI single distant microphone (SDM) task show that the proposed architecture can significantly reduce word error rate (WER), and additional improvement can be achieved by combining it with i-vector adaptation. Our best configuration obtained more than 15% and 10% relative WER reduction over the baselines using alignments generated from the SDM and close-talk data, respectively.
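A minimal sketch of the joint idea, under the assumption (illustrative, not the paper's exact topology) that small factor networks feed embeddings into the main network's input and everything trains with one backward pass:

```python
import torch
import torch.nn as nn

class FactorJointDNN(nn.Module):
    """Sketch: factor-extractor subnets produce speaker/environment
    embeddings that are appended to the main ASR DNN's input; all
    parameters are optimized jointly. Sizes and factor count are assumed."""
    def __init__(self, feat_dim=40, fac_dim=32, hidden=1024, n_senones=4000):
        super().__init__()
        self.spk_net = nn.Sequential(nn.Linear(feat_dim, fac_dim), nn.Tanh())
        self.env_net = nn.Sequential(nn.Linear(feat_dim, fac_dim), nn.Tanh())
        self.main = nn.Sequential(
            nn.Linear(feat_dim + 2 * fac_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_senones))

    def forward(self, x):
        factors = torch.cat([self.spk_net(x), self.env_net(x)], dim=-1)
        return self.main(torch.cat([x, factors], dim=-1))
```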
Categories: Microsoft
DEEP BEAMFORMING NETWORKS FOR MULTI-CHANNEL SPEECH RECOGNITION
Despite the significant progress in speech recognition enabled by deep neural networks, poor performance persists in some scenarios. In this work, we focus on far-field speech recognition, which remains challenging due to high levels of noise and reverberation in the captured speech signals. We propose to represent the stages of acoustic processing, including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network. The parameters of a frequency-domain beamformer are first estimated by a network based on features derived from the microphone channels. These filter coefficients are then applied to the array signals to form an enhanced signal. Conventional features are then extracted from this signal and passed to a second network that performs acoustic modeling for classification. The parameters of both the beamforming and acoustic modeling networks are trained jointly using back-propagation with a common cross-entropy objective function. In experiments on the AMI meeting corpus, we observed improvements by pre-training each sub-network with a network-specific objective function before joint training of both networks. The proposed method obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
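A simplified sketch of the unified pipeline follows. For brevity it uses real-valued per-channel, per-frequency weights, whereas the paper operates on complex frequency-domain coefficients; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class DeepBeamformerAM(nn.Module):
    """Sketch: a network predicts filter weights from multi-channel
    spectra, the weighted channels are summed (filter-and-sum), and an
    acoustic-model network classifies the enhanced frame. Both parts are
    trained end-to-end with a common cross-entropy loss."""
    def __init__(self, n_ch=8, n_freq=257, hidden=512, n_senones=4000):
        super().__init__()
        self.filter_net = nn.Sequential(
            nn.Linear(n_ch * n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, n_ch * n_freq))
        self.am = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, n_senones))
        self.n_ch, self.n_freq = n_ch, n_freq

    def forward(self, chans):                 # chans: (batch, n_ch, n_freq)
        w = self.filter_net(chans.flatten(1)).view(-1, self.n_ch, self.n_freq)
        enhanced = (w * chans).sum(dim=1)     # filter-and-sum over channels
        return self.am(enhanced)              # senone logits
```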
Categories: Microsoft
Automatic Discovery of Attribute Synonyms Using Query Logs and Table Corpora
Attribute synonyms are important ingredients for keyword-based search systems. For instance, web search engines recognize queries that seek the value of an entity on a specific attribute (referred to as e+a queries) and provide direct answers for them using a combination of knowledge bases, web tables and documents. However, users often refer to an attribute in their e+a query differently from how it is referred to in the web table or text passage. In such cases, search engines may fail to return relevant answers. To address this problem, we propose to automatically discover all the alternate ways of referring to the attributes of a given class of entities (referred to as attribute synonyms) in order to improve search quality. The state-of-the-art approach, which relies on attribute-name co-occurrence in web tables, suffers from low precision. Our main insight is to combine positive evidence of attribute synonymity from query click logs with negative evidence from attribute-name co-occurrences in web tables. We formalize the problem as an optimization problem on a graph, with attribute names as vertices and the positive and negative evidence from query logs and web table schemas as weighted edges. We develop a linear-programming-based algorithm to solve the problem with bi-criteria approximation guarantees. Our experiments on real-life datasets show that our approach achieves significantly higher precision and recall than the state-of-the-art.
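To make the positive/negative evidence idea concrete, here is a toy simplification that merely nets out edge weights and thresholds the result; the real system instead solves a graph optimization via linear programming with bi-criteria approximation guarantees. All names and weights below are made up.

```python
from collections import defaultdict

def synonym_scores(positive_edges, negative_edges):
    """Toy: positive weight from query-click co-occurrence, negative
    weight from web-table schema co-occurrence, summed per attribute pair."""
    score = defaultdict(float)
    for (a, b), w in positive_edges.items():
        score[frozenset((a, b))] += w
    for (a, b), w in negative_edges.items():
        score[frozenset((a, b))] -= w
    return score

pos = {("population", "number of people"): 3.0,
       ("birthday", "date of birth"): 5.0}
neg = {("population", "area"): 4.0}   # names co-occurring in one table
                                      # schema are unlikely to be synonyms
syn = {pair for pair, s in synonym_scores(pos, neg).items() if s > 1.0}
print(syn)
```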
Categories: Microsoft
Optimizing Distributed Actor Systems for Dynamic Interactive Services
Distributed actor systems are widely used for developing interactive scalable cloud services, such as social networks and on-line games. By modeling an application as a dynamic set of lightweight communicating “actors”, developers can easily build complex distributed applications, while the underlying runtime system deals with low-level complexities of a distributed environment. We present ActOp — a data-driven, application-independent runtime mechanism for optimizing end-to-end service latency of actor-based distributed applications. ActOp targets the two dominant factors affecting latency: the overhead of remote inter-actor communications across servers, and the intra-server queuing delay. ActOp automatically identifies frequently communicating actors and migrates them to the same server transparently to the running application. The migration decisions are driven by a novel scalable distributed graph partitioning algorithm which does not rely on a single server to store the whole communication graph, thereby enabling efficient actor placement even for applications with rapidly changing graphs (e.g., chat services). Further, each server autonomously reduces the queuing delay by learning an internal queuing model and configuring threads according to instantaneous request rate and application demands. We prototype ActOp by integrating it with Orleans – a popular open-source actor system [4, 13]. Experiments with realistic workloads show latency improvements of up to 75% for the 99th percentile, up to 63% for the mean, with up to 2x increase in peak system throughput.
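As a toy, centralized stand-in for the partitioning idea (the actual ActOp algorithm is distributed and never stores the whole communication graph on one server), one can greedily co-locate the most heavily communicating actor pairs under a per-server capacity:

```python
from collections import defaultdict

def greedy_colocation(edges, n_servers, capacity):
    """Toy: place actors of the heaviest-traffic pairs on the same server
    when capacity allows, reducing cross-server messages."""
    placement, load = {}, defaultdict(int)
    for (a, b), msgs in sorted(edges.items(), key=lambda kv: -kv[1]):
        for actor in (a, b):
            if actor not in placement:
                partner = b if actor == a else a
                target = placement.get(partner)   # prefer partner's server
                if target is None or load[target] >= capacity:
                    target = min(range(n_servers), key=lambda s: load[s])
                placement[actor] = target
                load[target] += 1
    return placement

edges = {("u1", "room7"): 900, ("u2", "room7"): 850, ("u3", "u4"): 20}
print(greedy_colocation(edges, n_servers=2, capacity=3))
```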
Categories: Microsoft
Efficient Queue Management for Cluster Scheduling
Job scheduling in Big Data clusters is crucial both for cluster operators’ return on investment and for overall user experience. In this context, we observe several anomalies in how modern cluster schedulers manage queues, and argue that maintaining queues of tasks at worker nodes has significant benefits. On one hand, centralized approaches do not use worker-side queues. Given the inherent feedback delays that these systems incur, they achieve suboptimal cluster utilization, particularly for workloads dominated by short tasks. On the other hand, distributed schedulers typically do employ worker-side queuing, and achieve higher cluster utilization. However, they fail to place tasks at the best possible machine, since they lack cluster-wide information, leading to worse job completion time, especially for heterogeneous workloads. To the best of our knowledge, this is the first work to provide principled solutions to the above problems by introducing queue management techniques, such as appropriate queue sizing, prioritization of task execution via queue reordering, starvation freedom, and careful placement of tasks to queues. We instantiate our techniques by extending both a centralized (YARN) and a distributed (Mercury) scheduler, and evaluate their performance on a wide variety of synthetic and production workloads derived from Microsoft clusters. Our centralized implementation, Yaq-c, achieves a 1.7x improvement in median job completion time compared to YARN, and our distributed one, Yaq-d, achieves a 9.3x improvement over an implementation of Sparrow’s batch sampling on Mercury.
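A minimal sketch of two of the listed queue management techniques, queue sizing and reordering with starvation freedom; the bound, priorities, and class shape are assumptions for illustration, not Yaq's implementation:

```python
import heapq, itertools, time

class WorkerQueue:
    """Toy worker-side queue: bounded length stands in for 'appropriate
    queue sizing'; tasks are reordered by priority, but any task waiting
    longer than `starvation_bound` seconds jumps ahead."""
    def __init__(self, max_len=10, starvation_bound=30.0):
        self.heap, self.counter = [], itertools.count()
        self.max_len, self.bound = max_len, starvation_bound

    def offer(self, task, priority):
        if len(self.heap) >= self.max_len:
            return False               # reject: the queue is sized, not unbounded
        heapq.heappush(self.heap, (priority, next(self.counter), time.time(), task))
        return True

    def take(self):
        starving = [e for e in self.heap if time.time() - e[2] > self.bound]
        if starving:                   # serve the most-starved task first
            entry = min(starving, key=lambda e: e[2])
            self.heap.remove(entry)
            heapq.heapify(self.heap)
            return entry[3]
        return heapq.heappop(self.heap)[3]
```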
Categories: Microsoft
Did You Say U2 or YouTube? Inferring Implicit Transcripts from Voice Search Logs
Categories: Microsoft
Author2Vec: Learning Author Representations by Combining Content and Link Information
In this paper, we consider the problem of learning representations for authors from bibliographic co-authorship networks. Existing methods for deep learning on graphs, such as DeepWalk, suffer from the link sparsity problem, as they model only the link information. We hypothesize that capturing both the content and the link information in a unified way will help mitigate the sparsity problem. To this end, we present a novel model, 'Author2Vec', which learns low-dimensional author representations such that authors who write similar content and share similar network structure are closer in vector space. Such embeddings are useful in a variety of applications such as link prediction, node classification, recommendation and visualization. The author embeddings we learn empirically outperform DeepWalk by 2.35% and 0.83% on the link prediction and clustering tasks, respectively.
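For intuition, here is a much simpler stand-in that combines the two signals by concatenation rather than learning them jointly as Author2Vec does; the feature pipeline and dimensions are assumptions, and `link_vecs` is assumed to come from a structural embedder such as DeepWalk.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def author_embeddings(author_texts, link_vecs, dim=32):
    """Toy: TF-IDF + SVD content embedding per author, L2-normalized and
    concatenated with a normalized structural embedding of the
    co-authorship graph. `dim` must be below the vocabulary size."""
    tfidf = TfidfVectorizer().fit_transform(author_texts)
    content = TruncatedSVD(n_components=dim).fit_transform(tfidf)
    content /= np.linalg.norm(content, axis=1, keepdims=True) + 1e-9
    links = link_vecs / (np.linalg.norm(link_vecs, axis=1, keepdims=True) + 1e-9)
    return np.hstack([content, links])
```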
Categories: Microsoft
TPC: Target-Driven Parallelism Combining Prediction and Correction to Reduce Tail Latency in Interactive Services
Categories: Microsoft
Predicting Post-Operative Visual Acuity for LASIK Surgeries
LASIK (Laser-Assisted in SItu Keratomileusis) surgeries have been quite popular for the treatment of myopia (nearsightedness), hyperopia (farsightedness) and astigmatism over the past two decades. In the past decade, over 10 million LASIK procedures have been performed in the United States alone, at an average cost of approximately USD 2,000 per surgery. While 99% of such surgeries are successful, the most common side effect is a residual refractive error and poor uncorrected visual acuity (UCVA). In this work, we aim to predict UCVA after LASIK surgery. We model the task as a regression problem and use patient demography and pre-operative examination details as features. To the best of our knowledge, this is the first work to systematically explore this critical problem using machine learning methods. Further, LASIK surgery settings are often determined by practitioners using manually designed rules. We explore the possibility of determining such settings automatically, optimizing for the best post-operative UCVA by including the settings as features in our regression model. Our experiments on a dataset of 791 surgeries yield an RMSE (root mean square error) of 0.102, 0.094 and 0.074 for the predicted post-operative UCVA one day, one week and one month after surgery, respectively.
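The regression setup can be sketched in a few lines; the model family and the placeholder feature matrix below are assumptions, since the paper does not fix either here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Sketch: rows would combine patient demography, pre-operative exam
# values, and candidate surgery settings; the target is post-op UCVA.
X = np.random.rand(791, 12)      # placeholder features, 791 surgeries
y = np.random.rand(791)          # placeholder post-operative UCVA
model = GradientBoostingRegressor()
rmse = np.sqrt(-cross_val_score(model, X, y,
                                scoring="neg_mean_squared_error", cv=5))
print("RMSE per fold:", rmse.round(3))
```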
Categories: Microsoft
High-Density Image Storage Using Approximate Memory Cells
This paper proposes tailoring image encoding to an approximate storage substrate, developing an approximation-aware encoding algorithm. We develop a methodology to determine the relative importance of encoded bits and store them in an approximate storage substrate tuned to match their error tolerance. We present a case study with the progressive transform codec (PTC), a precursor to JPEG XR, and a phase-change memory (PCM) storage substrate that is optimized to minimize errors via biasing and tuned via selective error correction to different error-rate levels. This enables effective use of approximate storage for images, resulting in an over 2.7x increase in pixel density per unit of silicon, additive to PTC's own storage savings.
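The core mapping step can be illustrated with a toy sketch: rank encoded bits by how much an error in each would distort the decoded image, then assign the most important bits to the most reliable storage tier. The importance values, tier names, and error rates are invented for illustration.

```python
def assign_bits_to_tiers(bit_importance, tier_error_rates):
    """Toy: most important bits go to the lowest-error-rate tier."""
    tiers = sorted(tier_error_rates, key=tier_error_rates.get)  # reliable first
    ranked = sorted(bit_importance, key=bit_importance.get, reverse=True)
    per_tier = len(ranked) // len(tiers) + 1
    return {bit: tiers[i // per_tier] for i, bit in enumerate(ranked)}

importance = {"dc_coeff_msb": 10.0, "dc_coeff_lsb": 2.0, "hf_coeff_lsb": 0.1}
tiers = {"ecc_protected": 1e-9, "biased_cell": 1e-4, "raw_cell": 1e-2}
print(assign_bits_to_tiers(importance, tiers))
```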
Categories: Microsoft
A DNA-Based Archival Storage System
Demand for data storage is growing exponentially, but the capacity of existing storage media is not keeping up. Using DNA to archive data is an attractive possibility because it is extremely dense, with a raw limit of 1 exabyte per cubic millimeter, and long-lasting, with an observed half-life of over 500 years. This paper presents an architecture for a DNA-backed archival storage system. It is structured as a key-value store, and leverages common biochemical techniques to provide random access. We also propose a new encoding scheme that offers controllable redundancy, trading off reliability for density.
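A heavily simplified sketch of the flavor of encoding involved (not the paper's scheme): payload bytes become base-3 digits, and each digit selects one of the three bases different from the previous one, avoiding homopolymer runs that are hard to sequence; a key prefix stands in for the random-access address.

```python
# Simplified illustration only; the paper's encoding adds controllable
# redundancy and uses PCR primers for random access.
BASES = "ACGT"

def trits(data: bytes):
    for byte in data:
        for _ in range(6):            # 6 base-3 digits cover 0..255 (3^6 = 729)
            yield byte % 3
            byte //= 3

def encode(key: str, value: bytes) -> str:
    strand, prev = [], "A"
    for t in trits(value):
        nxt = [b for b in BASES if b != prev][t]  # rotate away from prev base
        strand.append(nxt)
        prev = nxt
    return key + "".join(strand)      # key acts as the address prefix

print(encode("TAGC", b"hi"))
```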
Categories: Microsoft
Accuracy of depth-sensing recordings in classifying Expanded Disability Status Scale subscores of motor dysfunction in patients with multiple sclerosis
Categories: Microsoft
Where Can I Buy a Boulder? Searching for Offline Retail Locations
People commonly need to purchase things in person, from large garden supplies to home decor. Although modern search systems are very effective at finding online products, little research attention has been paid to helping users find places that sell a specific product offline. For instance, users searching for an apron are not typically directed to a nearby kitchen store by a standard search engine. In this paper, we investigate "where can I buy"-style queries related to in-person purchases of products and services. Answering these queries is challenging since little is known about the range of products sold in many stores, especially those which are smaller in size. To better understand this class of queries, we first present an in-depth analysis of typical offline purchase needs as observed by a major search engine, producing an ontology of such needs. We then propose ranking features for this new problem, and learn a ranking function that returns stores most likely to sell a queried item or service, even if there is very little information available online about some of the stores. Our final contribution is a new evaluation framework that combines distance with store relevance in measuring the effectiveness of such a search system. We evaluate our method using this approach and show that it outperforms a modern web search engine.
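The evaluation idea of combining distance with store relevance suggests a simple scoring form; the exponential decay, its scale, and the data shapes below are assumptions, not the paper's learned ranking function.

```python
import math

def rank_stores(stores, query_relevance, user_loc, dist_scale_km=5.0):
    """Toy: weight the model's probability that a store sells the queried
    item by a distance decay, so a closer store can outrank a slightly
    more relevant but distant one."""
    def score(s):
        d = haversine_km(user_loc, s["loc"])
        return query_relevance[s["id"]] * math.exp(-d / dist_scale_km)
    return sorted(stores, key=score, reverse=True)

def haversine_km(a, b):
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

stores = [{"id": "s1", "loc": (47.61, -122.33)},
          {"id": "s2", "loc": (47.67, -122.12)}]
rel = {"s1": 0.6, "s2": 0.9}
print([s["id"] for s in rank_stores(stores, rel, user_loc=(47.62, -122.35))])
```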
Categories: Microsoft
Exploring limits to prediction in complex social systems: Predicting cascade size on Twitter
How predictable is success in complex social systems? In spite of a recent profusion of prediction studies that exploit online social and information network data, this question remains unanswered, in part because it has not been adequately specified. In this paper we attempt to clarify the question by presenting a simple stylized model of success that attributes prediction error to one of two generic sources: insufficiency of available data and/or models on the one hand; and inherent unpredictability of complex social systems on the other. We then use this model to motivate an illustrative empirical study of information cascade size prediction on Twitter. Despite an unprecedented volume of information about users, content, and past performance, our best performing models can explain less than half of the variance in cascade sizes. In turn, this result suggests that even with unlimited data predictive performance would be bounded well below deterministic accuracy. Finally, we explore this potential bound theoretically using simulations of a diffusion process on a random scale-free network similar to Twitter. We show that although higher predictive power is possible in theory, such performance requires a homogeneous system and perfect ex ante knowledge of it: even a small degree of uncertainty in estimating product quality or slight variation in quality across products leads to substantially more restrictive bounds on predictability. We conclude that realistic bounds on predictive accuracy are not dissimilar from those we have obtained empirically, and that such bounds for other complex social systems for which data is more difficult to obtain are likely even lower.
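The theoretical exercise can be illustrated with a standard independent-cascade simulation on a scale-free graph: rerunning the same seed produces different cascade sizes, showing the irreducible randomness that bounds prediction. The graph size and adoption probability below are illustrative.

```python
import random
import networkx as nx

def simulate_cascade(g, seed, p_adopt):
    """Independent-cascade model: each newly active node activates each
    inactive neighbor once, with probability p_adopt."""
    active, frontier = {seed}, [seed]
    while frontier:
        nxt = []
        for u in frontier:
            for v in g.neighbors(u):
                if v not in active and random.random() < p_adopt:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

g = nx.barabasi_albert_graph(10000, 2)   # heavy-tailed degrees, Twitter-like
sizes = [simulate_cascade(g, seed=0, p_adopt=0.05) for _ in range(5)]
print(sizes)   # same seed node, different outcomes
```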
Categories: Microsoft
Query-Less: Predicting Task Repetition for NextGen Proactive Search and Recommendation Engines
Categories: Microsoft
Toward Full Elasticity in Distributed Static Analysis
In this paper we present the design and implementation of an elastic static analysis framework that scales with the size of its input. Our approach is based on the actor programming model and is deployed in the cloud, providing elasticity for CPU, memory and storage resources. To demonstrate the potential of our technique, we show how a typical call graph analysis can be implemented. We experimentally validate the analysis using a combination of both synthetic and real benchmarks. The results show that our analysis scales well in terms of memory pressure independently of the input size, as we add more machines. Despite using stock hardware and incurring a non-trivial communication overhead, our processing time for projects of close to 1M LOC is about 15 minutes. As the number of machines increases, we show that the analysis time does not suffer. Lastly, we demonstrate that querying the results can be performed with a median latency of well under 20 ms.
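For reference, here is a sequential sketch of the worklist algorithm that an actor-based version would distribute, with each actor owning a shard of methods and exchanging worklist items as messages; the input format is an assumption.

```python
from collections import defaultdict

def call_graph(direct_calls, entry_points):
    """Toy worklist call-graph construction: starting from entry points,
    keep resolving callees and adding newly reachable methods until a
    fixed point is reached."""
    reachable, edges = set(), defaultdict(set)
    worklist = list(entry_points)
    while worklist:
        m = worklist.pop()
        if m in reachable:
            continue
        reachable.add(m)
        for callee in direct_calls.get(m, ()):
            edges[m].add(callee)
            worklist.append(callee)
    return reachable, edges

calls = {"main": ["parse", "run"], "run": ["parse", "emit"], "dead": ["emit"]}
print(call_graph(calls, ["main"]))   # "dead" is never reached
```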
Categories: Microsoft