• Legal Utopia

Algorithmic Techniques Denoising Data

What do we aim for with data reduction?

Data analysis problems in which the datasets have a large number of features to consider are very common. Data scientists often come across data that suffers from the curse of dimensionality: the dataset is described by so many variables that noise is introduced. In some cases the number of explanatory variables (p) exceeds the number of data samples (n); data scientists call these p > n problems. This makes training a machine learning algorithm harder, because the algorithm may struggle to work out which features are useful for classifying new data.

Data reduction during exploratory data analysis (EDA)

During feature selection, the goal is to select a subset of relevant features that can be used to train a machine learning model. There are many ways to assess the usefulness of each feature. During exploratory data analysis, the features with the widest range of values can be identified; these are thought to be useful because they are the most likely to hold most of the signal in the dataset.
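As a rough illustration of that EDA step, the sketch below ranks features by their range and also reports their variance. The dataset, the feature names and the helper `spread_summary` are hypothetical and exist only for this example; it uses only the Python standard library.

```python
from statistics import pvariance

# Hypothetical toy dataset: each row is a sample, each column a feature.
feature_names = ["age", "income_k", "constant_flag"]
rows = [
    [25, 30.0, 1],
    [32, 45.5, 1],
    [47, 80.0, 1],
    [51, 62.0, 1],
]


def spread_summary(rows, names):
    """Return (name, range, variance) per feature, widest range first."""
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    summary = [
        (name, max(col) - min(col), pvariance(col))
        for name, col in zip(names, columns)
    ]
    return sorted(summary, key=lambda item: item[1], reverse=True)


summary = spread_summary(rows, feature_names)
for name, value_range, variance in summary:
    print(f"{name}: range={value_range}, variance={variance:.2f}")
```

Here `constant_flag` never changes, so its range and variance are both zero and it carries no signal. Range depends on the unit of each feature, so in practice the columns would usually be rescaled before such a comparison.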

Apart from exploratory data analysis techniques, special techniques can be used to find the optimal features:

Categories of data reduction techniques

Feature selection techniques

Feature selection techniques are used to identify a smaller set of features from the original set. This set of features will ideally contain most of the signal of the original set, but less of the noise.

There are three types of feature selection algorithms:

· Wrappers

Wrapper methods are used in conjunction with a learning algorithm. Wrapper methods essentially treat the learning algorithm as a function to be called in order to evaluate the predictive importance of a subset of features.

· Embedded

These methods embed the feature selection within the learning process. They use the same criteria as the wrapper techniques to rank the usefulness of features.

· Filters

A simple metric (criterion) is used to show how good the predictive capability of each feature is. All features are scored against the criterion, and the ones with the best results are used to build the model. It is up to the user to define the minimum score below which a predictor is deemed to be of little or no use.
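The difference between a filter and a wrapper can be sketched on a toy problem. This is only an illustration under several assumptions: the data are made up, the filter criterion is the absolute Pearson correlation with the label, the "learning algorithm" is a leave-one-out 1-nearest-neighbour classifier, and the wrapper search is a greedy forward selection. All function names are hypothetical. Embedded methods are left out because they depend on the internals of a particular learner.

```python
from math import sqrt

# Hypothetical toy data: 8 samples, 3 features, binary labels.
# Feature 0 separates the classes, feature 1 is noise, feature 2 is weakly related.
X = [
    [1.0, 5.0, 0.0], [1.2, 1.0, 0.0], [0.9, 4.0, 1.0], [1.1, 2.0, 0.0],
    [3.0, 4.5, 1.0], [3.2, 1.5, 1.0], [2.9, 3.5, 0.0], [3.1, 2.5, 1.0],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def filter_select(X, y, min_score):
    """Filter: score each feature on its own and keep those at or above min_score."""
    scores = [abs(pearson(col, y)) for col in zip(*X)]
    kept = [j for j, s in enumerate(scores) if s >= min_score]
    return kept, scores


def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `subset`."""
    correct = 0
    for i, row in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((row[j] - X[k][j]) ** 2 for j in subset),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def wrapper_select(X, y, evaluate):
    """Wrapper: greedy forward selection that calls `evaluate` as a black box."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(evaluate(X, y, selected + [j]), j) for j in remaining]
        top_score = max(s for s, _ in scored)
        if top_score <= best:
            break  # no candidate strictly improves the score
        top_feature = next(j for s, j in scored if s == top_score)
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


kept, scores = filter_select(X, y, min_score=0.3)
selected, accuracy = wrapper_select(X, y, loo_accuracy)
print("filter keeps features", kept, "with scores", [round(s, 3) for s in scores])
print("wrapper keeps features", selected, "with accuracy", accuracy)
```

On this data the filter keeps features 0 and 2, because both clear the user-defined threshold, while the wrapper stops after feature 0, because adding anything else does not improve the classifier it is wrapped around.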

Feature Generation/Transformation

These techniques aim to create a new dataset that retains as much signal as possible. They can be further sub-categorized as:

1. Feature extraction techniques:

Like feature selection, these techniques aim to reduce the number of features in a dataset. The key difference is that extraction techniques do not keep any of the original features; instead, they transform the data so that it can be projected onto a new feature space. These techniques also aim to retain the most relevant information. The most common technique is Principal Component Analysis (PCA), an unsupervised dimensionality reduction algorithm.

2. Feature generation techniques:

These techniques aim to build additional features that can be used to discover hidden information in the data. The artificial features are created by combining the original ones. Although this approach adds dimensionality to a dataset, the goal is to build descriptive features that can increase the accuracy of the model. At the end of the process, feature selection techniques can be used to keep the features that contribute most to the model's accuracy. These techniques are of great use in NLP settings.
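Both directions can be sketched in a few lines of standard-library Python. The example below is a minimal illustration rather than a production implementation: `pca_1d` approximates the first principal component with power iteration on the covariance matrix (one of several ways to compute PCA), and `add_interactions` generates new features as pairwise products, which is just one possible way of combining the original ones. The data and both function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pca_1d(rows, iterations=100):
    """Feature extraction: project the rows onto their first principal component.

    Returns (direction, explained_variance_ratio, scores).
    Assumes the data are not all identical, so the covariance is non-zero.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Population covariance matrix of the centered data (p x p).
    cov = [
        [sum(r[a] * r[b] for r in centered) / n for b in range(p)]
        for a in range(p)
    ]
    # Power iteration: apply cov and renormalise to approach the top eigenvector.
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(p) for b in range(p))
    total_variance = sum(cov[a][a] for a in range(p))
    # The new single feature: each sample's coordinate along the direction v.
    scores = [sum(c * d for c, d in zip(row, v)) for row in centered]
    return v, eigenvalue / total_variance, scores


def add_interactions(rows):
    """Feature generation: append the product of every pair of original features."""
    return [
        list(row) + [a * b for a, b in combinations(row, 2)]
        for row in rows
    ]


# Hypothetical data: two strongly correlated features.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.0]]

direction, ratio, scores = pca_1d(data)
print("direction:", [round(d, 3) for d in direction])
print("share of variance kept by one component:", round(ratio, 4))

generated = add_interactions(data)
print("columns before:", len(data[0]), "after generation:", len(generated[0]))
```

Extraction replaces the two original columns with a single new one that keeps almost all of the variance, whereas generation widens the dataset, which is why a selection step usually follows it.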

Next time we’ll see how data reduction is achieved in traditional machine learning NLP settings.



Legal Utopia, Legal Utopia - The A.I Way and LegalCrowd are the trademarks and trading names of Legal Utopia Limited, a company registered in England and Wales under company number 10909418 operating from and registered address Level 30, The Leadenhall Building, 122 Leadenhall Street, London, EC3V 4AB.

(C) Legal Utopia Limited 2019-2020. All Rights Reserved.