Dealing with high-dimensionality in large data sets

Wilmott Magazine, May 2014 (Early version)

Summary:

Many financial data sets are characterized by a large number of dimensions. High-dimensional data sets increase the complexity of analysis and require sophisticated techniques to process. Whether it is stock data for individual companies or economic data used for macro-economic modeling, high-dimensional data sets present unique challenges. When building predictive models, quants typically have to deploy statistical methods that reduce data complexity and the number of dimensions so that processing becomes tractable. Traditional techniques either choose the important dimensions (variable selection methods, where a subset of the dimensions is kept) or reduce dimensions (where the variables are transformed into a smaller set of new variables) to make analysis feasible and practical. However, these traditional techniques are reaching their limits with today's data sets. Technological innovations in data collection and processing over the last decade have made access to large volumes of data possible. In addition, the data collected has high granularity, frequency and complexity, which increases the need for sophisticated data handling techniques. Collectively, quants are seeing the four 'V's of Big Data (volume, velocity, variety and veracity) manifest in financial data sets, which requires rethinking the approaches used to process them (see our prior article from March 2014 for more on this topic). To appreciate the nature of the problem posed by high-dimensional data sets, we need to understand both traditional and modern techniques.

In this two-part article, we cover both traditional and modern techniques for addressing high-dimensional data sets. In Part 1, we lay the foundation by discussing some of the common traditional techniques for handling high-dimensional data. We begin with the problems that arise in high-dimensional data sets, including the famous "Curse of Dimensionality". We then discuss two methods for dealing with high-dimensional data sets. The goal of the first method is to reduce the number of variables through variable selection; the goal of the second is to reduce the number of variables by deriving new variables. We illustrate these methods through sample techniques (regression, decision trees and principal component analysis) and give pointers on implementing them in MATLAB. We also include sample applications of these techniques in finance and economics as part of this discussion. In Part 2, we discuss some of the challenges of dealing with high dimensionality in the context of Big Data problems, along with methodologies and innovations for processing these large data sets. We review some of the proposed methodologies and provide guidance on choosing approaches to handle large high-dimensional data sets.
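As a rough illustration of the "Curse of Dimensionality" mentioned above, the sketch below uses a standard textbook argument rather than anything taken from the article itself. It assumes data spread uniformly over a unit hypercube and asks how long the edge of a sub-cube must be to contain a given fraction of the points. The function name edge_length is hypothetical, and the sketch is written in Python rather than the MATLAB the article refers to.

```python
def edge_length(fraction: float, dims: int) -> float:
    """Edge length of a sub-cube of the unit hypercube [0, 1]^dims
    that contains `fraction` of uniformly distributed points.

    Assumption (not from the article): points are uniform, so the
    sub-cube's volume equals the fraction, i.e. edge ** dims == fraction.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if dims < 1:
        raise ValueError("dims must be a positive integer")
    return fraction ** (1.0 / dims)


if __name__ == "__main__":
    # To capture 10% of the data, the required edge approaches the
    # full range of every variable as the dimension grows.
    for dims in (1, 2, 10, 100):
        print(f"dims={dims:>3}  edge needed for 10% of data: "
              f"{edge_length(0.10, dims):.3f}")
```

In one dimension, 10% of the data sits within 10% of the range, but in ten dimensions the same share requires roughly 79% of the range of each variable. "Local" neighbourhoods stop being local, which is one reason variable selection or dimension reduction is usually applied before modeling.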
