Improving the t-SNE data visualisation algorithm via stochastic sampling

Abstract

Improving the t-SNE data visualisation algorithm via stochastic samplingMatthew ChalmersSchool of Computing Science, University of"" Glasgow, UKThe stochastic network embedding (SNE) method was published in 2008, and since then it has become a commonly used metho""d for visualising high dimensional data. SNE~s core aim is to take a high dimensional data set, and create a lower dimensional (usua"lly 2D) layout that visually conveysuseful structures and patterns within the data set. This layout is often called an embedding in" the lower-dimensional space. Wattenberg, Vi~gas and Johnson, of Google Cloud, recently published on the distill.pub web site an int""eractive overview of t-SNE, the predominant SNE technique, that is by far the most accessible and insightful overview of the method"" we have come across. For the reader less familiar with the details of t-SNE, it serves as an ideal introduction, as it points out (""and visually demonstrates) the method~s characteristics, strengths and weaknesses.The full project proposal explores a number of th""ese weaknesses in detail, and presents responses to be carried out in this project. These responses are in two areas: improving the"" display of global structure within visualised data sets, and doing embeddings incrementally so as to allow for quicker and more ste"erable analysis. Core to this is the combination of t-SNE with methods used previouslyby Chalmers and his research group in spring" models, which are a complementary family of methods for visualising high-dimensional data. There, Chalmers and other researchers ha""ve used sampling methods, and dynamically phased and adaptive computation methods, to concentratecomputation more selectively and t"o increase execution speeds significantly. This project will apply these methods to adjust and speed up tSNE. In the course of adapt"ing t-SNE in these ways, several overarching algorithmic aims orrequirements will be used to assess progress. Firstly, structure sh""ould always be preserved when mapping data to a space of same dimension. Beyond translational, rotational, and scaling differences," relative pairwise distances should always be maintained in an N to M embedding when N=M. The examples of [9] show that tSNE often d"oes not achieve this. Secondly, statisticalproperties should be preserved, up to analytical projections. Randomness in high dimensi""ons behaves differently to what one might expect but, for some distributions (e.g. Gaussian) we know analytically what the distribut""ion should look like in a lower dimensional space, so we should be able to check that statistical structure has been retained by the"" mapping. Thirdly, structure should be preserved over runs. Having zero variability of layout quality over repeated runs of the algo""rithms is a hard requirement to satisfy, given that we are using stochastic sampling, but it~s natural to desire that the degree of"" variability should be low. When someone views a layout, they might then have more idea of how much confidence to put into what they"" are seeing. In this work, we propose to use the same data sets used by the tSNE community, e.g. the MNISTdataset of handwritten le""tters (60k data-points, 784 dimensions) and the CIFAR-10 object recognition dataset (50k points, 1024 dimensions). In addition, we w""ould aim for a number of other larger data sets, to see how far our methods can go, including the full NIST dataset of handwritten l""etters (810,000 data points) from which MNIST was taken.

Document Details

Document Type
DoD Grant Award
Publication Date
Mar 26, 2018
Source ID
N629091812076

Entities

People

  • Matthew Chalmers

Organizations

  • Office of Naval Research
  • United States Navy
  • University of Glasgow

Tags

Fields of Study

  • Computer science

Readers

  • Computer Vision.
  • Distributed Systems and Data Platform Development
  • Theoretical Analysis.

Technology Areas

  • Space