Sirius: Exploratory Analysis Python Package

Sirius is an open-source Python package and web application that supports a novel method for analyzing high-dimensional data using mutual information feature networks. The project is a collaboration between the University of Vermont Complex Systems Center and the MassMutual Data Visualization team. The project can be checked out and cloned from its GitHub page.

There are many different approaches for graphically representing feature relationships. From left to right: (A) small multiples, which are advantageous for small numbers of features of homogeneous or heterogeneous data types; (B) matrices, such as correlation or conditional probability matrices, and (C) dendrograms, both advantageous for comparing small or medium numbers of features of a homogeneous data type using summary statistics or similarity/distance measures; and (D) networks, which support a high number of features, usually of a homogeneous data type, but which also support heterogeneous data types through the Sirius mutual information implementation.

Mutual information scores allow us to find feature pairs which are highly dependent. Feature pairs with high mutual information scores are connected in the resultant network graph.
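To make the idea concrete, here is a minimal sketch of how pairwise mutual information scores can be turned into network edges. It is not the Sirius implementation: the function names, the toy data, and the fixed cutoff of 0.5 nats are all assumptions chosen for illustration, and it handles discrete features only.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def feature_network(columns, threshold):
    """Return (feature_a, feature_b, score) for every pair scoring at or above threshold."""
    edges = []
    for a, b in combinations(sorted(columns), 2):
        score = mutual_information(columns[a], columns[b])
        if score >= threshold:
            edges.append((a, b, score))
    return edges


# Hypothetical toy data: "shape" is fully determined by "color",
# while "flag" is only weakly related to either of them.
columns = {
    "color": ["r", "r", "g", "g", "b", "b", "r", "g"],
    "shape": ["sq", "sq", "ci", "ci", "tr", "tr", "sq", "ci"],
    "flag": [0, 1, 0, 1, 0, 1, 1, 0],
}
edges = feature_network(columns, threshold=0.5)
for a, b, score in edges:
    print(f"{a} -- {b}: {score:.3f} nats")
```

In this toy data only the color/shape pair clears the cutoff, so it is the only edge in the resulting network.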

 

The tool is designed to process both continuous and discrete data; mutual information can be calculated for homogeneous or heterogeneous data type pairings. Unlike correlation, mutual information also lets us find non-linear dependence among features.
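The contrast with correlation can be illustrated with a small sketch. The estimator below is an assumption made for illustration, not the one Sirius uses: continuous values are cut into equal-width bins (the hypothetical helper `discretize`) and then scored with a plug-in mutual information estimate. On the relationship y = x², sampled symmetrically around zero, the Pearson correlation is essentially zero while the mutual information is clearly positive. The same binning trick also lets a continuous feature be paired with a discrete label.

```python
from collections import Counter
from math import log, sqrt


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def discretize(values, bins=5):
    """Map continuous values to equal-width bin indices (an illustrative choice)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant input
    return [min(int((v - lo) / width), bins - 1) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Continuous/continuous pair with a purely non-linear dependence.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]
r = pearson(xs, ys)
mi_continuous = mutual_information(discretize(xs), discretize(ys))

# Continuous/discrete pair: a hypothetical categorical label derived from x.
labels = ["outer" if abs(x) > 1 else "inner" for x in xs]
mi_mixed = mutual_information(discretize(xs), labels)

print(f"Pearson r: {r:.3f}")
print(f"MI (continuous vs continuous): {mi_continuous:.3f} nats")
print(f"MI (continuous vs discrete):   {mi_mixed:.3f} nats")
```

Binned estimates depend on the number of bins and the sample size, so the scores here are only meaningful relative to one another.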

 

A backbone method is applied to the feature network to select the most statistically significant edges, which are then displayed in the web-based application.
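This post does not say which backbone method Sirius applies. As an illustration only, the sketch below uses the disparity filter, one widely used backbone technique for weighted networks, and compares it with a single global cutoff. Each edge is tested against a null model in which a node's total weight is spread uniformly at random over its edges, and the edge is kept if it is significant for at least one endpoint. The function name, the `alpha` level, and the toy edge list are assumptions.

```python
from collections import defaultdict


def disparity_backbone(edges, alpha=0.05):
    """Keep edges that are significant for at least one endpoint (disparity filter)."""
    strength = defaultdict(float)  # total edge weight per node
    degree = defaultdict(int)      # number of edges per node
    for u, v, w in edges:
        for node in (u, v):
            strength[node] += w
            degree[node] += 1

    def p_value(node, w):
        k = degree[node]
        if k < 2:
            return 1.0  # convention: a degree-1 node cannot make its edge significant
        # Probability that a uniformly random share is at least w / strength.
        return (1.0 - w / strength[node]) ** (k - 1)

    return [(u, v, w) for u, v, w in edges
            if min(p_value(u, w), p_value(v, w)) < alpha]


# Hypothetical toy network with two hubs operating at very different weight scales.
edges = [
    ("a", "b", 10.0), ("a", "c", 0.1), ("a", "d", 0.1), ("a", "e", 0.1), ("a", "f", 0.1),
    ("x", "y", 0.5), ("x", "p", 0.005), ("x", "q", 0.005), ("x", "r", 0.005), ("x", "s", 0.005),
]

backbone = disparity_backbone(edges, alpha=0.05)
static = [e for e in edges if e[2] >= 1.0]  # one global cutoff, for comparison

print("backbone:", backbone)
print("static:  ", static)
```

The global cutoff keeps only the heavy edge around hub "a", while the disparity filter also keeps the x/y edge, which is small in absolute terms but dominant relative to its neighbors.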

By thresholding the network in a context-aware manner, clusters of dependent features remain in a simplified network visualization. Compare, below, the statically thresholded association network (left) with the dynamically thresholded mutual information network generated by Sirius (right).

These networks show great potential for exploratory data analysis (EDA) of high-dimensional data summaries, while still allowing direct access to pairwise record-level visualizations.

Stay tuned for the paper.
