Constraint preserving score for automatic hyperparameter tuning of dimensionality reduction methods for visualization


In data analysis, visualization through dimensionality reduction (DR) is one of the most effective ways to understand a dataset. However, the quality of a visualization is hard to evaluate quantitatively, and the hyperparameters of visualization algorithms are often difficult for end-users to tune. This article proposes a score for visualization assessment that eases the choice of hyperparameter values for widely used DR methods such as $t$-distributed stochastic neighbor embedding, LargeVis, and uniform manifold approximation and projection. We present the constraint preserving score, a computationally efficient score to measure visualization quality. The idea is to measure how well a visualization preserves the information encoded in pairwise constraints, such as group information or similarity/dissimilarity relationships between instances. Based on this quantitative measure, we use Bayesian optimization to explore the space of possible visualizations and find the most suitable one. The proposed score is flexible, as it can measure quality in different ways depending on the provided constraints. Experiments demonstrate its usefulness for end-users, its complementarity with existing visualization quality measures, and its flexibility in expressing different quality aspects.

Impact Statement: When working with high-dimensional data, visualization techniques are useful tools that help us understand patterns in the data. Widely used visualization methods such as $t$-distributed stochastic neighbor embedding, LargeVis, and uniform manifold approximation and projection require tuning several hyperparameters, which is a tedious task for end-users. Visualizations are usually assessed qualitatively and subjectively by users, since quantitative measures that fit their needs are lacking. Our work tackles this problem by proposing a novel score, based on user constraints, to measure visualization quality.
This score can thus be used to automatically tune the hyperparameters of visualization methods. Real-world datasets typically contain multiple aspects hidden in the data, in the form of local or global structures or relationships between data groups. One visualization gives us one vantage point on the data and thus reveals one specific aspect of it. Assessing visualization quality is still an open question, and each state-of-the-art visualization quality metric is designed to capture only one specific aspect, such as local neighborhood structure. In contrast, our proposed constraint preserving score can capture other aspects of the visualization, such as the global structure or semantic relationships between groups, according to the information encoded in the input constraints. Our score measures how well the information encoded in the input constraints is preserved in a visualization and suggests the best visualization corresponding to the users' needs. This score can have a large impact since it is very easy to use and works with any visualization method. Domain experts can express their knowledge in the simple form of similar or dissimilar groups of points. If needed, end-users can use a small amount of labeled data to express their constraints.
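
To make the idea of scoring a visualization against pairwise constraints concrete, the following is a minimal sketch and not the definition used in this article. It assumes a Student-t similarity between embedded points (as used in $t$-SNE-style layouts) and rewards layouts where similar pairs have high similarity and dissimilar pairs have low similarity. The function names (`student_t_affinities`, `constraint_score`, `pick_best`) and the candidate labels are hypothetical, and a plain maximum over a few candidate layouts stands in for the Bayesian optimization mentioned above.

```python
import math
from itertools import combinations


def student_t_affinities(embedding):
    """Normalized Student-t similarities q_ij for every pair of embedded points."""
    weights = {}
    for i, j in combinations(range(len(embedding)), 2):
        sq_dist = sum((a - b) ** 2 for a, b in zip(embedding[i], embedding[j]))
        weights[(i, j)] = 1.0 / (1.0 + sq_dist)
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def constraint_score(embedding, similar, dissimilar):
    """Illustrative score (an assumption, not the article's formula):
    mean log q over similar pairs minus mean log q over dissimilar pairs.
    Both constraint lists must be non-empty."""
    q = student_t_affinities(embedding)

    def mean_log_q(pairs):
        return sum(math.log(q[tuple(sorted(p))]) for p in pairs) / len(pairs)

    return mean_log_q(similar) - mean_log_q(dissimilar)


def pick_best(candidates, similar, dissimilar):
    """Return the label of the candidate layout with the highest score.
    This exhaustive choice replaces Bayesian optimization in this sketch."""
    return max(
        candidates,
        key=lambda name: constraint_score(candidates[name], similar, dissimilar),
    )


# Hypothetical 2-D layouts of four points, keyed by a made-up hyperparameter label.
candidates = {
    "setting_a": [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)],
    "setting_b": [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)],
}
similar = [(0, 1), (2, 3)]      # pairs the user considers alike
dissimilar = [(0, 2), (1, 3)]   # pairs the user considers different

best = pick_best(candidates, similar, dissimilar)
print(best)
```

In this toy example the first layout places the similar pairs close together and the dissimilar pairs far apart, so it receives the higher score. In practice the candidate layouts would come from running a DR method under different hyperparameter values.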