Deep Learning Cheat Sheet




The question that I get the most from new and experienced machine learning engineers is “how can I get higher accuracy?”

Cheat sheets covering everything about convolutional neural networks, recurrent neural networks, and the deep learning tips and tricks to keep in mind when training a model: all of the above combined into one set of ideas to keep with you at all times.

Transfer learning: training a deep learning model requires a lot of data and, more importantly, a lot of time. It is often useful to take advantage of weights pre-trained on huge datasets that took days or weeks to train, and to leverage them for your own use case. How best to do this depends on how much data you have at hand.

This makes a lot of sense, since the most valuable part of machine learning for a business is often its predictive capability. Improving prediction accuracy is an easy way to squeeze more value from existing systems.

The guide will be broken up into four different sections with some strategies in each.

  • Data Optimization
  • Algorithm Tuning
  • Hyper-Parameter Optimization
  • Ensembles, Ensembles, Ensembles

Not all of these ideas will boost model performance, and you will see limited returns the more of them you apply to the same problem.

Still stuck after trying a few of these? That is a sign you should rethink the core solution to your business problem. This article is just a deep learning performance cheat sheet, so I'm linking to more detailed sources of information in each section.

Data Optimization

Balance your data set

One of the easiest ways to improve an underperforming deep learning model is to balance your dataset, if your problem is classification. Real-world datasets are often skewed, and if you want the best accuracy you want your deep learning system to learn to pick between classes based on their characteristics, not by copying the skewed class distribution.

Common methods include (both sketched below):

  • Subsample Majority Class: You can balance the class distributions by subsampling the majority class.
  • Oversample Minority Class: Sampling with replacement can be used to increase your minority class proportion.
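
Here is a rough sketch of both approaches using scikit-learn's resample on a made-up toy dataset; the column names are placeholders.

  import pandas as pd
  from sklearn.utils import resample

  # Toy skewed dataset: 90 negatives, 10 positives.
  df = pd.DataFrame({"feature": range(100),
                     "label": [0] * 90 + [1] * 10})

  majority = df[df["label"] == 0]
  minority = df[df["label"] == 1]

  # Subsample the majority class down to the size of the minority class.
  downsampled = pd.concat([resample(majority, replace=False,
                                    n_samples=len(minority), random_state=42),
                           minority])

  # Or oversample the minority class (sampling with replacement) up to the majority size.
  upsampled = pd.concat([majority,
                         resample(minority, replace=True,
                                  n_samples=len(majority), random_state=42)])

  print(downsampled["label"].value_counts())
  print(upsampled["label"].value_counts())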

More Data

Many of us are familiar with the graph that shows the relationship between the amount of data and performance for deep learning versus classical machine learning approaches. If you are not, the lesson is clear and straightforward: if you want better performance from your model, you need more data. Depending on your budget, you might opt for creating more labeled data, or for collecting more unlabeled data and training your feature extraction sub-model further.

Open Source Labeling Software

Generate More Data

Or fake it till you make it. An often ignored method of improving accuracy is creating new data from what you already have. Take photos, for example: engineers often create more images by rotating and randomly shifting existing ones. Such transformations also reduce overfitting to the training set.
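
As a rough sketch, Keras's ImageDataGenerator (one of several options for this) can apply these random transformations on the fly; the image arrays below are random placeholders standing in for real data.

  import numpy as np
  from tensorflow.keras.preprocessing.image import ImageDataGenerator

  # Dummy stand-ins for real image data: 100 fake 28x28 grayscale images.
  x_train = np.random.rand(100, 28, 28, 1)
  y_train = np.random.randint(0, 2, size=100)

  augmenter = ImageDataGenerator(rotation_range=20,       # random rotations up to 20 degrees
                                 width_shift_range=0.1,   # random horizontal shifts
                                 height_shift_range=0.1,  # random vertical shifts
                                 horizontal_flip=True)    # random mirroring

  # Yields endless batches of freshly transformed images during training,
  # e.g. model.fit(augmenter.flow(x_train, y_train, batch_size=32), epochs=10)
  batch_x, batch_y = next(augmenter.flow(x_train, y_train, batch_size=32))
  print(batch_x.shape)   # (32, 28, 28, 1)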

Algorithm Tuning

Copy The Researchers

Are you working on a problem that has lots of research behind it? You are in luck, because hundreds of engineers may already have put a lot of thought into how to get the best accuracy for this problem. Read some research papers on the topic and take note of the different methods they used to get their results. They might even have a GitHub repository of their code for you to sink your teeth into.

Google Scholar is an excellent place to start your search. They offer many tools to help you find related research as well.

For storing and organizing research papers, I use Mendeley.

Algorithm spot check

Don't let your ego get the best of you. It's impossible to know in advance which machine learning algorithm will work best for your problem. Whenever I attack a new problem without much research behind it, I look at the methods available and try several of them: deep learning (CNNs, RNNs, etc.) and classical machine learning approaches (random forests, gradient boosting, etc.).

Rank the results of all your experiments and double down on the algorithms that perform the best.
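
A minimal sketch of a spot check with scikit-learn on a synthetic dataset; the candidate list and hyper-parameters here are just placeholders.

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.neural_network import MLPClassifier
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

  candidates = {
      "logistic_regression": LogisticRegression(max_iter=1000),
      "random_forest": RandomForestClassifier(),
      "gradient_boosting": GradientBoostingClassifier(),
      "small_neural_net": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
  }

  # Rank candidates by mean cross-validated accuracy.
  scores = {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in candidates.items()}
  for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
      print(f"{name}: {score:.3f}")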

Hyper-Parameter Optimization

Learning rates

The Adam optimization algorithm is tried and true, often giving excellent results across deep learning problems. Even with its strong performance, it can still leave you stuck in a local minimum. An even better algorithm, which offers the benefits of Adam while helping you avoid getting stuck in a local minimum, is Stochastic Gradient Descent with Warm Restarts (SGDR).
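
In Keras, one way to approximate SGDR is a cosine-decay-with-restarts learning rate schedule attached to plain SGD; the numbers below are placeholder settings, not recommendations.

  import tensorflow as tf

  # Cosine-annealed learning rate with periodic warm restarts.
  schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
      initial_learning_rate=0.1,  # peak learning rate at the start of each cycle
      first_decay_steps=1000,     # length (in steps) of the first annealing cycle
      t_mul=2.0,                  # each subsequent cycle is twice as long
      m_mul=0.9)                  # the peak shrinks a little after every restart

  optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
  # model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])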

Batch size and number of epochs

A standard procedure for modern deep learning implementations is to use large batch sizes with a large number of epochs, but common strategies yield common results. Experiment with the size of your batches and the number of training epochs.
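
A sketch of what that experimentation might look like; build_model, x_train/y_train, and x_val/y_val are hypothetical names assumed to exist, and build_model is assumed to compile its model with an accuracy metric.

  # Hypothetical helper: build_model() returns a freshly compiled Keras model.
  results = {}
  for batch_size in (16, 64, 256):
      for epochs in (10, 30):
          model = build_model()
          history = model.fit(x_train, y_train,
                              batch_size=batch_size, epochs=epochs,
                              validation_data=(x_val, y_val), verbose=0)
          results[(batch_size, epochs)] = max(history.history["val_accuracy"])

  best = max(results, key=results.get)
  print("best (batch_size, epochs):", best, "val accuracy:", results[best])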

Early Stopping

This is an excellent method for reducing the generalization error of your deep learning system. Continual training might keep improving accuracy on your training set, but at a certain point it starts to reduce the model's accuracy on data the model has not yet seen. To improve real-world performance, try early stopping.
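
In Keras this is a one-line callback; a minimal sketch, with the fit call left commented out since it depends on your model and data.

  from tensorflow.keras.callbacks import EarlyStopping

  # Stop once validation loss has not improved for 5 epochs,
  # then roll back to the best weights seen so far.
  early_stop = EarlyStopping(monitor="val_loss", patience=5,
                             restore_best_weights=True)

  # model.fit(x_train, y_train, validation_split=0.2,
  #           epochs=200, callbacks=[early_stop])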

Network Architecture

If you want to try something a little more interesting, you can give Efficient Neural Architecture Search (ENAS) a try. This algorithm will create a custom network design that will maximize accuracy on your dataset and is way more efficient than the standard Neural architecture search that cloud ML uses.

Regularization

A robust way to prevent overfitting is to use regularization. There are a couple of different forms of regularization you can use in your deep learning project, and if you haven't tried them yet I would start including them in every project you do; both are sketched after the list below.

  • Dropout: Dropout layers randomly turn off a percentage of neurons during training, which helps prevent groups of neurons from co-adapting and overfitting together.
  • Weight penalty (L1 and L2): Weights that explode in size can be a real problem in deep learning and reduce accuracy. One way to combat this is to add weight decay, which tries to keep all of the weights in the network as small as possible unless there are large gradients to counteract it. On top of often increasing performance, it has the benefit of making the model easier to interpret.
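
A minimal Keras sketch combining both ideas; the layer sizes, input shape, and penalty strength are placeholders.

  from tensorflow.keras import layers, models, regularizers

  model = models.Sequential([
      layers.Input(shape=(100,)),                             # assumed 100 input features
      layers.Dense(64, activation="relu",
                   kernel_regularizer=regularizers.l2(1e-4)), # L2 weight penalty (use l1 for L1)
      layers.Dropout(0.5),                                    # drop 50% of units during training
      layers.Dense(32, activation="relu",
                   kernel_regularizer=regularizers.l2(1e-4)),
      layers.Dense(1, activation="sigmoid"),
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])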

Ensembles, Ensembles, Ensembles

Having trouble picking the best model to use? Often you can combine the outputs from the different models and get better accuracy. There are two steps for every one of these algorithms.

  • Producing a distribution of simple ML models on subsets of the original data
  • Combining the distribution into one “Aggregated” model

Combined Models/Views (Bagging)

In this method, you train a few different models on the same data, each different in some way, and average their outputs to create the final output. Bagging has the effect of reducing variance in the model. Intuitively, you can think of it as having multiple people with different backgrounds thinking about the same problem from different starting positions; just as on a team, this can be a potent tool for getting to the right answer.
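
A quick sketch with scikit-learn's BaggingClassifier on a synthetic dataset; by default each base learner is a decision tree fit on a bootstrap sample of the data.

  from sklearn.datasets import make_classification
  from sklearn.ensemble import BaggingClassifier
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

  # 25 base models, each trained on a bootstrap subset of the data,
  # with their predictions aggregated into a single output.
  bagged = BaggingClassifier(n_estimators=25, random_state=0)
  print("bagged accuracy:", cross_val_score(bagged, X, y, cv=5).mean())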

Stacking

It's similar to bagging; the difference is that you don't have a fixed formula for your combined output. Instead, you train a meta-level learner that, based on the input data, chooses how to weight the answers from your different models to produce the final output.
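
A sketch with scikit-learn's StackingClassifier; the base models and meta-learner here are arbitrary choices.

  from sklearn.datasets import make_classification
  from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                                GradientBoostingClassifier)
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

  # Base models produce out-of-fold predictions; the logistic regression
  # meta-learner learns how to weight them into the final answer.
  stacked = StackingClassifier(
      estimators=[("rf", RandomForestClassifier()),
                  ("gb", GradientBoostingClassifier())],
      final_estimator=LogisticRegression())
  print("stacked accuracy:", cross_val_score(stacked, X, y, cv=5).mean())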

Still having issues?

Reframe Your Problem

Take a break from looking at your screen and get a coffee. This solution is all about rethinking your problem from the beginning. I find it helps to sit down and brainstorm different ways that you could solve the problem. Start by asking yourself some simple questions:

  • Can my classification problem become a regression problem or the reverse?
  • Can you break your problem down into smaller pieces?
  • Are there any observations that you have collected about your data that could change the scope of the problem?
  • Can your binary output become a softmax output or vice versa?
  • Are you looking at this problem in the most efficient way?

Rethinking your problem can be the hardest of these methods, but it is often the one that yields the best results. It helps to chat with someone who has experience in deep learning and can give you a fresh take on your problem.


If you would like to chat with someone, I am making myself available for the next month to have a 30-minute conversation with you about your project. I'm charging 5 dollars for this 30-minute call as a barrier to keep those who are not serious from wasting our time. Sign up for a time slot.

Also read: Top 6 Cheat Sheets Novice Machine Learning Engineers Need

Thanks for reading! If you enjoyed the post, share it with anyone you think needs it. Let's also connect on Twitter or LinkedIn, or follow me on Medium.

Part 0: Intro

Why

Deep Learning is a powerful toolset, but it also involves a steep learning curve and a radical paradigm shift.

For those new to deep learning, there are many levers to learn and different approaches to try out. Even more frustratingly, designing deep learning architectures can be equal parts art and science, without some of the rigorous backing found in longer-studied, linear models.

In this article, we’ll work through some of the basic principles of deep learning, by discussing the fundamental building blocks in this exciting field. Take a look at some of the primary ingredients of getting started below, and don’t forget to bookmark this page as your Deep Learning cheat sheet!

FAQ

What is a layer?

A layer is an atomic unit within a deep learning architecture. Networks are generally built by stacking successive layers.

What properties do all layers have?

Almost all layers will have:

  • Weights (free parameters), which create a linear combination of the outputs from the previous layer.
  • An activation, which allows for non-linearities
  • A bias node, equivalent to one extra incoming variable that is always set to 1
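
A small Keras sketch that makes those three pieces visible on a single dense layer; the sizes are arbitrary.

  import numpy as np
  from tensorflow.keras import layers

  # One dense layer: a weight matrix, a bias vector, and an activation.
  dense = layers.Dense(units=3, activation="relu", use_bias=True)
  output = dense(np.zeros((1, 4), dtype="float32"))  # build it on a batch with 4 features

  weights, bias = dense.get_weights()
  print(weights.shape)  # (4, 3): linear combination of the 4 incoming values
  print(bias.shape)     # (3,): one bias per unit, like an extra input fixed at 1
  print(output.shape)   # (1, 3): the result after the ReLU non-linearity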

What changes between layer types?

There are many different layers for many different use cases. Different layers may allow for combining adjacent inputs (convolutional layers), or dealing with multiple timesteps in a single observation (RNN layers).

Difference between DL book and Keras Layers

Frustratingly, there is some inconsistency in how layers are referred to and utilized. The Deep Learning Book, for example, commonly refers to architectures (whole networks) rather than specific layers; its discussion of a convolutional neural network treats the convolutional layer as a sub-component of the network.

1D vs 2D


Some layers have 1D and 2D varieties. A good rule of thumb is:

  • 1D: Temporal (time series, text)
  • 2D: Spatial (image)

Cheat sheet


Part 1: Standard layers

Input

  • Simple pass through
  • Needs to align w/ shape of upcoming layers


Embedding

  • Categorical / text to vector
  • Vector can be used with other (linear) algorithms
  • Can use transfer learning / pre-trained embeddings (see the sketch below)
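
A minimal sketch in Keras; the vocabulary size and vector length are placeholders.

  import numpy as np
  from tensorflow.keras import layers

  # Map integer token IDs into dense 8-dimensional vectors.
  embedding = layers.Embedding(input_dim=1000,  # vocabulary size (assumed)
                               output_dim=8)    # length of each embedding vector

  token_ids = np.array([[4, 25, 7]])            # a batch of one 3-token sequence
  print(embedding(token_ids).shape)             # (1, 3, 8)

  # Pre-trained vectors could be injected via an initializer and frozen, e.g.
  # layers.Embedding(1000, 8, embeddings_initializer=Constant(matrix), trainable=False)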

Dense layers

  • Vanilla, default layer
  • Many different activations
  • Probably want to use the ReLU activation

Dropout

  • Helpful for regularization
  • Generally should not be used after input layer
  • Can select the fraction of units (rate p) to be dropped
  • Remaining outputs are scaled at train / test time, so the expected activation is the same for both
  • Nothing is dropped at test time
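
A tiny sketch showing that behavior directly on a Keras Dropout layer.

  import numpy as np
  from tensorflow.keras import layers

  drop = layers.Dropout(rate=0.5)           # fraction of units to drop
  x = np.ones((1, 4), dtype="float32")

  print(drop(x, training=True))   # roughly half the values zeroed, the rest scaled up
  print(drop(x, training=False))  # unchanged: nothing is dropped at test time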

Part 2: Specialized layers

Convolutional layers

  • Take a subset of input
  • Create a linear combination of the elements in that subset
  • Replace subset (multiple values) with the linear combination (single value)
  • Weights for linear combination are learned
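
A sketch of a 2D convolutional layer in Keras; the image size and filter count are arbitrary.

  import numpy as np
  from tensorflow.keras import layers

  # Each 3x3 patch of the image is replaced by 16 learned linear
  # combinations of its pixels (one per filter), followed by ReLU.
  conv = layers.Conv2D(filters=16, kernel_size=(3, 3), activation="relu")

  images = np.zeros((1, 28, 28, 1), dtype="float32")  # one 28x28 grayscale image
  print(conv(images).shape)                           # (1, 26, 26, 16)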

Time series & text layers

  • Helpful when input has a specific order
    • Time series (e.g. stock closing prices for 1 week)
    • Text (e.g. words on a page, given in a certain order)
  • Text data is generally preceded by an embedding layer
  • Generally should be paired w/ RMSprop optimizer

Simple RNN

  • Each time step's input is concatenated with the previous time step's output
  • This concatenated input is fed into a dense layer equivalent
  • The output of the dense layer equivalent is this time step's output
  • Generally, only the output from the last time step is used
  • Special handling is needed for the first time step (there is no previous output, so a zero state is typically used)
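
A small Keras sketch; the sequence length, feature count, and unit count are placeholders.

  import numpy as np
  from tensorflow.keras import layers

  # Ten time steps, each with 4 features; only the last step's output is returned.
  rnn = layers.SimpleRNN(units=8)                 # return_sequences=False by default
  sequence = np.zeros((1, 10, 4), dtype="float32")
  print(rnn(sequence).shape)                      # (1, 8)

  # Set return_sequences=True to get every time step's output instead.
  print(layers.SimpleRNN(8, return_sequences=True)(sequence).shape)  # (1, 10, 8)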

LSTM

  • Improvement on Simple RNN, with internal 'memory state'
  • Avoids the issue of exploding / vanishing gradients
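
As a sketch, an LSTM is a drop-in replacement for SimpleRNN in Keras; the shapes here are placeholders.

  from tensorflow.keras import layers, models

  # The internal cell state helps gradients survive across long sequences.
  model = models.Sequential([
      layers.Input(shape=(10, 4)),        # 10 time steps of 4 features (assumed)
      layers.LSTM(units=8),
      layers.Dense(1, activation="sigmoid"),
  ])
  model.summary()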

Utility layers

  • There for utility use!