Microsoft
RETracer: Triaging Crashes by Reverse Execution from Partial Memory Dumps
Many software providers operate crash reporting services to automatically collect crashes from millions of customers and file bug reports. Precisely triaging crashes is necessary and important for software providers because the millions of crashes that may be reported every day are critical in identifying high impact bugs. However, the triaging accuracy of existing systems is limited, as they rely only on the syntactic information of the stack trace at the moment of a crash without analyzing program semantics. In this paper, we present RETracer, the first system to triage software crashes based on program semantics reconstructed from memory dumps. RETracer was designed to meet the requirements of large-scale crash reporting services. RETracer performs binarylevel backward taint analysis without a recorded execution trace to understand how functions on the stack contribute to the crash. The main challenge is that the machine state at an earlier time cannot be recovered completely from a memory dump, since most instructions are information destroying. We have implemented RETracer for x86 and x86-64 native code, and compared it with the existing crash triaging tool used by Microsoft. We found that RETracer eliminates two thirds of triage errors based on a manual analysis of 140 bugs fixed in Microsoft Windows and Office. RETracer has been deployed as the main crash triaging system on Microsoft’s crash reporting service.
Categories: Microsoft
Robust and Efficient Multiple Alignment of Unsynchronized Meeting Recordings
This paper proposes a way to generate a single high-quality audio recording of a meeting using no equipment other than participants’ personal devices. Each participant in the meeting uses their mobile device as a local recording node, and they begin recording whenever they arrive in an unsynchronized fashion. The main problem in generating a single summary recording is to temporally align the various audio recordings in a robust and efficient manner. We propose a way to do this using an adaptive audio fingerprint based on spectrotemporal eigenfilters, where the fingerprint design is learned on-the-fly in a totally unsupervised way to perform well on the data at hand. The adaptive fingerprints require only a few seconds of data to learn a robust design, and they require no tuning. Our method uses an iterative, greedy two-stage alignment algorithm which finds a rough alignment using indexing techniques, and then performs a more fine-grained alignment based on Hamming distance. Our proposed system achieves >99% alignment accuracy on challenging alignment scenarios extracted from the ICSI meeting corpus, and it outperforms five other well-known and state-ofthe-art fingerprint designs. We conduct extensive analyses of the factors that affect the robustness of the adaptive fingerprints, and we provide a simple heuristic that can be used to adjust the fingerprint’s robustness according to the amount of computation we are willing to perform.
Categories: Microsoft
LatticeCrypto
LatticeCrypto is a high-performance and portable software library that implements lattice-based cryptographic algorithms. The first release of the library provides an implementation of lattice-based key exchange with security based on the Ring Learning With Errors (R-LWE) problem using new algorithms for the underlying Number Theoretic Transform (NTT). The chosen parameters provide at least 128 bits of security against attackers running classical and quantum computers.
Categories: Microsoft
SIDH Library
SIDH is a fast and portable software library that implements a new suite of algorithms for Supersingular Isogeny Diffie-Hellman (SIDH) key exchange. The chosen parameters aim to provide 128 bits of security against attackers running a large-scale quantum computer, and 192 bits of security against classical algorithms. SIDH has the option of a hybrid key exchange that combines supersingular isogeny Diffie-Hellman with a high-security classical elliptic curve Diffie-Hellman key exchange at a small overhead.
Categories: Microsoft
Program for TPC-H Data Generation with Skew
The schema and queries of the TPC-H (formerly TPC-D) benchmark are widely used by people in the database community. One of the requirements of the benchmark is that data for columns in the database are generated from a uniform distribution. However, this requirement makes it hard for users to conclude about the robustness/effectiveness of their system since real world data distributions are often non-uniform. We have therefore created a new data generation program for TPC-H that is capable of generating a database where the columns have non-uniform (skewed) data distributions. In particular, the program can generate data from a Zipfian distribution, where the Zipf value (z), which controls the degree of skew in the data, is a parameter that can be specified to the program. In addition, the program allows the generation of a database with “mixed” data distribution where the skew of a column in the database is randomly chosen from the Zipfian values {0,1,2,3,4}. Note that the total number of rows in the tables and the total database size are not affected by our changes.
Categories: Microsoft
RIoT - A Foundation for Trust in the Internet of Things
RIoT (Robust Internet-of-Things) is an architecture for providing foundational trust services to computing devices. The trust services include device identity, sealing, attestation, and data integrity. The term “Robust” is used because the minimal trusted computing base is tiny, and because RIoT capabilities can remotely re-establish trust in devices that have been compromised by malware. The term IoT is used because these services can be provided at low cost on even the tiniest of devices.
Categories: Microsoft
Diverse Algebra Word Problem Dataset with Derivation Annotations
This dataset provides training and testing examples for solving algebra word problems automatically. In addition to have 1000 completely new problems, we augmented the data by annotating the full derivations (template + alignments) for each word problem. We also performed cross-dataset cleaning across all datasets, so that the template annotation across different sets are unified. The instances are coming from the following resources: (1) 1000 new training/testing data with diverse templates and narratives crawled from algebra.com. (Our contribution) (2) http://groups.csail.mit.edu/rbg/code/wordprobs/ (3) http://research.microsoft.com/en-us/projects/dolphin/ (their dataset contains non-linear problems. We took a subset of the problems that are linear there) . Please cite the corresponding report (and the original papers of the other datasets) if you found the dataset useful. We are aware that some of the problems might contain annotation error.
Categories: Microsoft
Ripples of mediatization: Social media and the exposure of the pool interview
During the 2011 UK public sector protests, controversy ignited over the “Miliband Loop”, an unedited video from a pool interview showing Labour leader Ed Miliband to have provided largely the same answer in response to six questions. The interviewer subsequently complained in a TwitLonger that the incident epitomized the clash of public relations and journalism. In this paper we unpack the practical production of the pool interview as a delamination of the interview-as-lived from the interview-as-media-production-mechanism. We then explore professional and public understanding (or lack thereof) of exposure of this delamination issue and its relation to politics. While the controversy did not directly affect Miliband׳s position as leader, it is clear that the Internet is a dangerous place for the old rules of mediatization.
Categories: Microsoft
Table Cell Search for Question Answering
Tables are pervasive on the Web. Informative web tables range across a large variety of topics, which can naturally serve as a significant resource to satisfy user information needs. Driven by such observations, in this paper, we investigate an important yet largely under-addressed problem: Given millions of tables, how to precisely retrieve table cells to answer a user question. This work proposes a novel table cell search framework to attack this problem. We first formulate the concept of a relational chain which connects two cells in a table and represents the semantic relation between them. With the help of search engine snippets, our framework generates a set of relational chains pointing to potentially correct answer cells. We further employ deep neural networks to conduct more fine-grained inference on which relational chains best match the input question and finally extract the corresponding answer cells. Based on millions of tables crawled from the Web, we evaluate our framework in the open-domain question answering (QA) setting, using both the well-known WebQuestions dataset and user queries mined from Bing search engine logs. On WebQuestions, our framework is comparable to state-of-the-art QA systems based on knowledge bases (KBs), while on Bing queries, it outperforms other systems with a 56.7% relative gain. Moreover, when combined with results from our framework, KB-based QA performance can obtain a relative improvement of 28.1% to 66.7%, demonstrating that web tables supply rich knowledge that might not exist or is difficult to be identified in existing KBs.
Categories: Microsoft
Improving Document Ranking with Dual Word Embeddings
This paper investigates the popular neural word embedding method Word2vec as a source of evidence in document ranking. In contrast to NLP applications of word2vec, which tend to use only the input embeddings, we retain both the input and the output embeddings, allowing us to calculate a different word similarity that may be more suitable for document ranking. We map the query words into the input space and the document words into the output space, and compute a relevance score by aggregating the cosine similarities across all the query-document word pairs. We postulate that the proposed Dual Embedding Space Model (DESM) provides evidence that a document is about a query term, in addition to and complementing the traditional term frequency based approach.
Categories: Microsoft
A Software Methodology for Compiling Quantum Programs
Quantum computers promise to transform our notions of computation by offering a completely new paradigm. To achieve scalable quantum computation, optimizing compilers and a corresponding software design flow will be essential. We present a software architecture for compiling quantum programs from a high-level language program to hardware-specific instructions. We describe the necessary layers of abstraction and their differences and similarities to classical layers of a computer-aided design flow. For each layer of the stack, we discuss the underlying methods for compilation and optimization. Our software methodology facilitates more rapid innovation among quantum algorithm designers, quantum hardware engineers, and experimentalists. It enables scalable compilation of complex quantum algorithms and can be targeted to any specific quantum hardware implementation.
Categories: Microsoft
Emerging and Recurring Data-Driven Storytelling Techniques: Analysis of a Curated Collection of Recent Stories
Storytelling with data is becoming an important component of many fields such as graphic design, the advocacy of causes, and journalism. New techniques for integrating data visualization into narrative stories have now become commonplace. Authors are enabling new reader experiences, such as linking textual narrative and data visualizations through dynamic queries embedded in the text. Novel means of communicating position and navigating within the narrative also have merged, such as utilizing scrolling to advance narration and initiate animations. We advance the study of narrative visualization through an analysis of a curated collection of recent data-driven stories shared on the web. Drawing from the results of this analysis, we present a set of techniques being employed in these examples, organized under four high-level categories that help authors to tell stories in creative ways: communicating narrative and explaining data, linking separated story elements, enhancing structure and navigation, and providing controlled exploration. We describe the benefits of each storytelling technique along with a number of example applications of the ideas through recent data-driven stories. Additionally, we discuss the trends we observed as well as how the field has evolved and grown. Finally, we conclude with a discussion of areas for future research.
Categories: Microsoft
Learning to Verify the Heap
We present a data-driven verification framework to automatically prove memory safety and functional correctness of heap programs. For this, we introduce a novel statistical machine learning technique that maps observed program states to (possibly disjunctive) separation logic formulas describing the invariant shape of data structures at relevant program locations. We then attempt to verify these predictions using a theorem prover, where counterexamples to a predicted invariant are used as additional input to the shape predictor in a refinement loop. After obtaining valid shape invariants, we use a second learning algorithm to strengthen them with data invariants, again employing a refinement loop using the underlying theorem prover. We have implemented our techniques in Cricket, an extension of the GRASShopper verification tool. Cricket is able to automatically prove memory safety and correctness of implementations of a variety of classical list-manipulating algorithms such as insertionsort.
Categories: Microsoft
Analyzing Runtime and Size Complexity of Integer Programs
We present a modular approach to automatic complexity analysis of integer programs. Based on a novel alternation between finding symbolic time bounds for program parts and using these to infer bounds on the absolute values of program variables, we can restrict each analysis step to a small part of the program while maintaining a high level of precision. The bounds computed by our method are polynomial or exponential expressions that depend on the absolute values of input parameters. We show how to extend our approach to arbitrary cost measures, allowing to use our technique to find upper bounds for other expended resources, such as network requests or memory consumption. Our contributions are implemented in the open source tool KoAT, and extensive experiments show the performance and power of our implementation in comparison with other tools.
Categories: Microsoft
The Dialog State Tracking Challenge Series: A Review
In a spoken dialog system, dialog state tracking refers to the task of correctly inferring the state of the conversation – such as the user’s goal – given all of the dialog history up to that turn. Dialog state tracking is crucial to the success of a dialog system, yet until recently there were no common resources, hampering progress. The Dialog State Tracking Challenge series of 3 tasks introduced the first shared testbed and evaluation metrics for dialog state tracking, and has underpinned three key advances in dialog state tracking: the move from generative to discriminative models; the adoption of discriminative sequential techniques; and the incorporation of the speech recognition results directly into the dialog state tracker. This paper reviews this research area, covering both the challenge tasks themselves and summarizing the work they have enabled.
Categories: Microsoft
Multimodal Learning for Image Captioning and Visual Question Answering (talk at UC Berkeley, BVLC)
Categories: Microsoft
Inferring Annotations For Device Drivers From Verification Histories
Finding invariants is an important step in automated program analysis. Discovery of precise invariants, however, can be very difficult in practice. The problem can be simplified if one has access to a candidate set of predicates (or annotations) and the search for invariants is limited over the space defined by these annotations. We present an approach that infers program annotations automatically by leveraging the history of verifying related programs. Our algorithm extracts high-quality annotations from previous verification attempts, and then applies them for verifying new programs. We present a case study where we applied our techniques to Microsoft's Static Driver Verifier (SDV). SDV currently uses manually-tuned heuristics for obtaining a set of annotations. Our techniques can not only replace the need for this manual effort, they even out-perform these heuristics and improve the performance of SDV overall.
Categories: Microsoft
Visual Storytelling
We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND1 v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Categories: Microsoft
HIGHWAY LONG SHORT-TERM MEMORY RNNS FOR DISTANT SPEECH RECOGNITION
In this paper, we extend the deep long short-term memory (DLSTM) recurrent neural networks by introducing gated direct connections between memory cells in adjacent layers. These direct links, called highway connections, enable unimpeded information flow across different layers and thus alleviate the gradient vanishing problem when building deeper LSTMs. We further introduce the latency-controlled bidirectional LSTMs (BLSTMs) which can exploit the whole history while keeping the latency under control. Efficient algorithms are proposed to train these novel networks using both frame and sequence discriminative criteria. Experiments on the AMI distant speech recognition (DSR) task indicate that we can train deeper LSTMs and achieve better improvement from sequence training with highway LSTMs (HLSTMs). Our novel model obtains 43 :9=47 :7% WER on AMI (SDM) dev and eval sets, outperforming all previous works. It beats the strong DNN and DLSTM baselines
Categories: Microsoft