Algorithmic Techniques for Denoising Data
What do we aim for with data reduction?
Data analysis problems where datasets have a large number of features are very common. Data scientists often come across data that suffers from the curse of dimensionality: the dataset is described by too many variables, which introduces noise. In some cases the number of explanatory variables (p) exceeds the number of data samples (n); data scientists classify these as (p>n) problems. This makes training a machine learning algorithm difficult, because the algorithm may struggle to identify which features are useful for classifying new data.
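To make the (p>n) case concrete, here is a minimal synthetic example using scikit-learn (an assumed tool, not named in the post): a dataset with far more candidate features than samples, where only a handful of features carry signal.

```python
from sklearn.datasets import make_classification

# Construct a (p > n) dataset: 50 samples but 200 candidate features,
# only 10 of which actually carry signal; the rest are pure noise.
X, y = make_classification(
    n_samples=50, n_features=200, n_informative=10,
    n_redundant=0, random_state=0,
)
print(X.shape)  # (50, 200): p = 200 explanatory variables, n = 50 samples
```

With 190 of the 200 columns being noise, a learner fitted directly on this data has many opportunities to latch onto spurious patterns, which is exactly the motivation for data reduction.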
Data reduction during EDA.
During feature selection, the goal is to select a subset of features that are relevant and can be used to train a machine learning model. There are many approaches for assessing the usefulness of each feature. During exploratory data analysis, the features with the highest range can be identified; these are thought to be useful because they are the most likely to hold most of the dataset's signal.
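As a quick sketch of this EDA step, the per-feature range (max minus min) can be computed directly with NumPy on synthetic data (the data and column scales below are invented for illustration):

```python
import numpy as np

# Synthetic data: 8 standard-normal features rescaled so that
# columns 1, 6 and 3 have much larger spread than the rest.
rng = np.random.default_rng(0)
scales = np.array([0.1, 5.0, 0.2, 3.0, 0.1, 0.1, 4.0, 0.1])
X = rng.normal(size=(100, 8)) * scales

# Per-feature range as a rough proxy for how much signal a column holds.
feature_range = X.max(axis=0) - X.min(axis=0)

# Keep the 3 features with the largest range.
top3 = np.argsort(feature_range)[::-1][:3]
print(sorted(int(i) for i in top3))  # columns with the widest spread
```

In practice the range is sensitive to outliers, so variance or an interquartile range is often preferred; the idea is the same.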
Apart from exploratory data analysis techniques, special techniques can be used to find the optimal features:
Categories of data reduction techniques
Feature selection techniques
Feature selection techniques identify a smaller set of features from the original set. Ideally, this subset retains most of the signal of the original set while carrying less noise.
There are three types of feature selection algorithms:
Wrapper methods are used in conjunction with a learning algorithm: they essentially treat the learning algorithm as a function to be called in order to evaluate the predictive importance of a subset of features.
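A common wrapper method is recursive feature elimination (RFE), sketched here with scikit-learn (an assumed library choice): the learner is fitted repeatedly, and the weakest features are dropped at each round.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 20 features, 5 of them informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# RFE repeatedly fits the estimator and eliminates the weakest features,
# using the learner itself to judge each candidate subset.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_.sum())  # exactly 5 features are kept
```

Because the learner is re-fitted for every candidate subset, wrapper methods tend to be the most expensive of the three families.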
Embedded methods: these embed the feature selection within the learning process itself. They use the same criteria as the wrapper techniques to rank the usefulness of features.
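A classic embedded example is L1-regularized regression (the Lasso), sketched below with scikit-learn (an assumed choice; the post does not name a specific algorithm): selection happens as a side effect of fitting, because the penalty drives uninformative coefficients to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data: 30 features, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=30,
                       n_informative=5, noise=1.0, random_state=0)

# The L1 penalty shrinks coefficients of uninformative features to zero,
# so fitting the model and selecting features happen in one step.
model = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(model.coef_)  # indices of features that survived
print(len(kept))
```

No separate search over feature subsets is needed, which makes embedded methods much cheaper than wrappers while still using the model's own criterion.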
Filter methods: a simple metric (criterion) is used to show how good the predictive capability of each feature is. All features are scored against the criterion, and those with the best results are used to build the model. It is up to the user to define a minimum score below which a feature is deemed to be of little to no use.
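A minimal filter-method sketch, assuming scikit-learn: each feature is scored independently with the ANOVA F-statistic, and only the k highest-scoring features are kept.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, 4 of them informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)

# Score every feature independently with the ANOVA F-statistic,
# then keep the 4 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)
X_new = selector.transform(X)
print(X_new.shape)  # (200, 4)
```

Because each feature is scored on its own, filters are very fast, but they can miss features that are only useful in combination with others.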
Feature Generation/Transformation
These techniques aim to create a new, more compact dataset while retaining as much signal as possible. They can be further sub-categorized as:
1. Feature extraction techniques:
Like feature selection, these techniques aim to reduce the number of features in a dataset. The key difference is that extraction techniques do not retain any of the original features; instead they transform the data so that it can be projected onto a new feature space, again with the goal of preserving the most relevant information. The most common technique is Principal Component Analysis (PCA), an unsupervised dimensionality reduction algorithm.
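A short PCA sketch with scikit-learn (an assumed library choice), projecting the four Iris measurements onto two principal components; note that the resulting columns are linear combinations of the originals, not original features:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# The Iris dataset: 150 samples described by 4 measurements.
X, _ = load_iris(return_X_y=True)

# Project the 4 original measurements onto 2 principal components.
pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)

print(X_proj.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)     # variance captured per component
```

Here the first component alone captures most of the variance, which is why PCA can halve (or better) the dimensionality with little loss of signal.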
2. Feature generation techniques:
These techniques aim to build additional features that can reveal hidden information in the data. The artificial features are created by combining the original ones. While this approach adds dimensionality to the dataset, the goal is to build descriptive features that increase the accuracy of the model; at the end of the process, feature selection techniques can then be applied to keep only the features that contribute most to that accuracy. These techniques are of great use in NLP settings.
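One simple way to combine original features into new ones is polynomial expansion, sketched here with scikit-learn's `PolynomialFeatures` (an assumed choice; the post does not name a specific generator):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples, two original features a and b.
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])

# Degree-2 expansion generates the columns [1, a, b, a^2, a*b, b^2]:
# products and powers of the originals become new candidate features.
poly = PolynomialFeatures(degree=2)
X_gen = poly.fit_transform(X)
print(X_gen[0])  # [1. 2. 3. 4. 6. 9.]
```

The dataset grows from 2 to 6 columns, so in practice this step is typically followed by one of the selection techniques above to prune the expanded set.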
Next time we’ll see how data reduction is achieved in traditional machine learning NLP settings.