Working Notes: a commonplace notebook for recording & exploring ideas.
DL Intro
- 2016
- 08
- 03
- Convolutional neural nets
- Details
- Layers
- Pixels at each layer
- ReLU or some non-linearity
- Into output features
 
 
 
 
- 07
- 26
- Softmax for a probability vector: z_i = e^(-βx_i) / Σ_j e^(-βx_j)
- β is a temperature that scales x
- Used a cross-entropy cost function (see the sketch below)
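- A minimal numpy sketch of these two notes (my own illustration, keeping the e^(-βx) sign convention written above; names are mine):
  import numpy as np

  def softmax(x, beta=1.0):
      """Softmax over scores x with temperature beta, using the e^(-beta*x) form above."""
      u = -beta * x
      z = np.exp(u - np.max(u))          # subtract the max for numerical stability
      return z / z.sum()

  def cross_entropy(p, target):
      """Cross-entropy cost for a probability vector p and an integer class label."""
      return -np.log(p[target])

  scores = np.array([1.0, 2.0, 0.5])
  p = softmax(scores)
  print(p, p.sum())                      # probabilities that sum to 1
  print(cross_entropy(p, target=2))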
 
- Tips
- ReLU
- Cross entropy loss
- SGD on minibatches
- Shuffle training samples to make sure models don't memorize order
- Normalize to zero mean, unit variance; whiten the data
- the network should then only have to unpick the higher-order dependencies
- otherwise the network first has to learn to remove these low-order statistics itself (see the sketch below)
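- A sketch of the zero-mean / unit-variance step (my own; full whitening would additionally decorrelate the features, e.g. with PCA):
  import numpy as np

  def standardize(X, eps=1e-8):
      """Normalize each feature column to zero mean and unit variance."""
      mean = X.mean(axis=0)
      std = X.std(axis=0)
      return (X - mean) / (std + eps), mean, std

  X = np.random.randn(1000, 5) * 3.0 + 7.0   # fake data with the wrong mean and scale
  Xn, mean, std = standardize(X)
  print(Xn.mean(axis=0).round(3), Xn.std(axis=0).round(3))
  # at test time, reuse the mean/std computed on the training set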
 
- Learning rate: decides how far you step
- Efficient BackProp, Yann LeCun
 
- Setting learning rate
- if it's too large, will oscillate
- different layers will have different curvatures
- finding a single learning rate that advances in all dimensions is hard
- that's why having a variable learning rate helps
 
- Automatic learning rate
- under research
- fairly expensive
 
- Momentum
- Add speed to movement across a surface
- Nesterov momentum
- update the weights with the momentum vector
- apply the momentum term first, then take the gradient step (see the sketch below)
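- A sketch of the two update rules as I understand them (my own code; grad_fn, lr and mu are my names):
  import numpy as np

  def sgd_momentum(w, v, grad_fn, lr=0.01, mu=0.9):
      """Classical momentum: accumulate a velocity vector, then step along it."""
      v = mu * v - lr * grad_fn(w)
      return w + v, v

  def sgd_nesterov(w, v, grad_fn, lr=0.01, mu=0.9):
      """Nesterov: apply the momentum term first, then evaluate the gradient there."""
      lookahead = w + mu * v
      v = mu * v - lr * grad_fn(lookahead)
      return w + v, v

  # toy quadratic bowl f(w) = 0.5 * ||w||^2, whose gradient is w
  grad = lambda w: w
  w, v = np.array([5.0, -3.0]), np.zeros(2)
  for _ in range(100):
      w, v = sgd_nesterov(w, v, grad)
  print(w)                               # ends up close to the minimum at 0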
 
 
- Batch Normalization
- normalize activations within the network
- available as a pluggable layer in Torch (see the sketch below)
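- A bare-bones sketch of the training-time forward pass (my own; it ignores the backward pass and the running statistics a real layer keeps for test time):
  import numpy as np

  def batchnorm_forward(x, gamma, beta, eps=1e-5):
      """Normalize a (batch, features) activation matrix, then rescale and shift."""
      mu = x.mean(axis=0)
      var = x.var(axis=0)
      x_hat = (x - mu) / np.sqrt(var + eps)
      return gamma * x_hat + beta

  x = np.random.randn(64, 10) * 4 + 2    # a batch of activations with the wrong statistics
  out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
  print(out.mean(axis=0).round(3), out.std(axis=0).round(3))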
 
- Local Minima
- in practice, once you have lots of layers, the local minima tend to look similar
 
- DropOut
- add noise into the network
- randomly set half of the dimensions to zero during training
- forces redundancy: the remaining units should still be able to do the job (see the sketch below)
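- A sketch of (inverted) dropout during training (my own; the lecture may have described the non-inverted variant), with the keep probability at 0.5 to match "half of the dimensions":
  import numpy as np

  def dropout(x, p_keep=0.5, train=True):
      """Randomly zero units during training; scale so the expected activation is unchanged."""
      if not train:
          return x                       # nothing to do at test time with inverted scaling
      mask = (np.random.rand(*x.shape) < p_keep) / p_keep
      return x * mask

  h = np.random.randn(4, 8)              # some hidden activations
  print(dropout(h))                      # roughly half the entries are zeroed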
 
- Debugging
- backprop is broken: do a numerical gradient check (see the sketch below)
- parameters collapse / accuracy is low
- check the loss function
- may have hit a degenerate solution
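- A sketch of that numerical gradient check (my own): compare centered finite differences against the analytic gradient.
  import numpy as np

  def numerical_gradient(f, w, eps=1e-5):
      """Centered finite differences of a scalar function f at the point w."""
      grad = np.zeros_like(w)
      for i in range(w.size):
          step = np.zeros_like(w)
          step[i] = eps
          grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
      return grad

  f = lambda w: np.sum(w ** 2)           # example loss; analytic gradient is 2w
  w = np.random.randn(5)
  diff = np.max(np.abs(numerical_gradient(f, w) - 2 * w))
  print(diff)                            # should be tiny; a large gap means backprop is broken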
 
- Underperforming
- Slow
- it's possible to have small bugs and still get meaningful-looking results
- write sanity checks
- inspect hidden units
- if there are strong correlations, then something is probably wrong
 
 
- Big model + regularize
- controls what the model can do without having to shrink it
- prevent overfitting
- weight sharing
- data augmentation
- dropout
- weight decay
- sparsity in hidden units
- multi-task learning
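- One of these knobs as a sketch (my own, not from the lecture): weight decay, i.e. an L2 penalty folded into the SGD update; the learning rate and decay constant are arbitrary.
  import numpy as np

  def sgd_step_with_weight_decay(w, grad, lr=0.1, wd=0.01):
      """Gradient step plus weight decay, equivalent to adding 0.5*wd*||w||^2 to the loss."""
      return w - lr * (grad + wd * w)

  w = np.random.randn(10)
  grad = np.zeros(10)                    # with a zero data gradient, the weights just shrink
  for _ in range(5000):
      w = sgd_step_with_weight_decay(w, grad)
  print(np.abs(w).max())                 # decays toward zero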
 
 
- ConvNet
- some signals have structure that can be exploited
- Images have a lot of local dependencies
- Pixels captured tend to be simple
- capture structure with windows around position
- eg. photos
- other things that have some empirical structure
- spectrograms of speech
 
- unlike an arbitrary weight matrix, the convolutional weight matrix has a very defined structure (local windows, shared weights)
- the filter is run across the input, and its weights are learned during training (see the sketch below)
- good for GPUs
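- A sketch of the local-window idea (my own; plain loops, one filter, a grayscale image, no padding or stride):
  import numpy as np

  def conv2d_valid(image, kernel):
      """Each output is the dot product of a local window with the same shared kernel weights."""
      H, W = image.shape
      kh, kw = kernel.shape
      out = np.zeros((H - kh + 1, W - kw + 1))
      for i in range(out.shape[0]):
          for j in range(out.shape[1]):
              out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
      return out

  image = np.random.rand(8, 8)
  edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple vertical-edge detector
  print(conv2d_valid(image, edge_filter).shape)    # (6, 6)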
 
 
- 24
- Neural networks, continued
- can't make out anything from the lecture =/
 
- 23
- Neural networks
- Additional Resources
- Book: Neural Networks for Pattern Recognition, Christopher Bishop
- Karpathy Neural Networks: cs231n.github.io
- Yann LeCun: deeplearning
 
- Layers of artificial neurons - history
- each layer computes a function of the layer beneath it
- mapped as /feed forward/ to the output
- began in the forties
- single layer perceptron was realized to be limited
- but multi layer models can work
- backprop then became important
- very usable for digit recognition
- GPUs
 
- Neuron
- n x 1 vector
- multiply by weights
- y = f(Σ x_i w_i + b), where f is a pointwise non-linear function
- typically called /units/
 
- Single layer net
- think of this as a matrix and a bias vector
- output is a non-linearity applied to Wx + b (see the sketch below)
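- A minimal sketch of that forward pass (my own; tanh picked arbitrarily as the non-linearity):
  import numpy as np

  def layer_forward(x, W, b, nonlinearity=np.tanh):
      """y = f(Wx + b): one layer of units applied to an n x 1 input."""
      return nonlinearity(W @ x + b)

  n_in, n_out = 4, 3
  x = np.random.randn(n_in)
  W = np.random.randn(n_out, n_in) * 0.1
  b = np.zeros(n_out)
  print(layer_forward(x, W, b))          # the 3 unit activations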
 
- Non linearity function
- This introduces non-linearity into the network:
otherwise linear combinations of linear functions are still linear
and can't model non-linear behaviour
- Sigmoid 1 / (1 + e^-x) -- not used in practice
- the network locks up because the gradient gets too small (saturation)
- doesn't work well in practice: MNIST doesn't count
 
- tanh: bounded to (-1, +1)
- preferable to sigmoid
- very little theory to decide which non linearity to use
- more of general practice
 
- Rectified Linear: Relu
- max(x, 0)
- efficient to implement
- the default in practice
- if a unit stays in the negative region, it's pretty much stuck (zero gradient)
 
- Leaky rectified linear: a small slope for x < 0 instead of zero
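- The non-linearities above as short numpy definitions (my sketch; 0.01 is a common but arbitrary leaky slope):
  import numpy as np

  sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))      # saturates; gradients vanish for large |x|
  tanh       = np.tanh                                 # bounded to (-1, +1)
  relu       = lambda x: np.maximum(x, 0.0)            # cheap; zero gradient for x < 0
  leaky_relu = lambda x: np.where(x > 0, x, 0.01 * x)  # small slope instead of zero for x < 0

  x = np.linspace(-3, 3, 7)
  for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu), ("leaky", leaky_relu)]:
      print(name, f(x).round(2))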
- Architecture
- No good answer to picking an architecture (# layers, # units)
- 2-3 layers for fully connected models that can be trained
- ... verify using Validation set
- Too many units lead to overfitting
 
 
- Representational power
- 1 layer == linear
- 2+ layers: /any/ function
- Very wide 2 layers or a narrower deep model?
- Beyond 3-4 layers, extra layers don't help much
 
- Training
- Choose x, y and a cost function
- Forward pass examples
- calculate error using cost function
- back prop to pass error back through model, adjusting parameters to minimize energy
- Chain rule of derivatives back through model
 
- Once gradients are obtained, use Stochastic Gradient Descent
 
- Gradient Descent
- Remember to scale gradient vector depending on weights
- Want to take many small steps
- Compute the gradient on a small batch of data and update (see the training-loop sketch below)
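- The training recipe above as a toy sketch (entirely my own code, not the lecture's): a two-layer tanh network fit to fake data, gradients via the chain rule, minibatch SGD.
  import numpy as np

  rng = np.random.default_rng(0)

  # fake regression data: y = sin(x), one input, one output
  X = rng.uniform(-3, 3, size=(256, 1))
  Y = np.sin(X)

  # two-layer net: tanh hidden layer, linear output, mean-squared-error cost
  H = 16
  W1, b1 = rng.normal(0, 0.5, (1, H)), np.zeros(H)
  W2, b2 = rng.normal(0, 0.5, (H, 1)), np.zeros(1)
  lr, batch = 0.05, 32

  for epoch in range(200):
      perm = rng.permutation(len(X))                 # shuffle each epoch
      for i in range(0, len(X), batch):
          xb, yb = X[perm[i:i + batch]], Y[perm[i:i + batch]]
          # forward pass
          h = np.tanh(xb @ W1 + b1)
          pred = h @ W2 + b2
          err = pred - yb                            # d(cost)/d(pred), up to a constant
          # backward pass (chain rule)
          dW2 = h.T @ err / len(xb)
          db2 = err.mean(axis=0)
          dh = err @ W2.T * (1 - h ** 2)             # tanh' = 1 - tanh^2
          dW1 = xb.T @ dh / len(xb)
          db1 = dh.mean(axis=0)
          # SGD update
          W1 -= lr * dW1; b1 -= lr * db1
          W2 -= lr * dW2; b2 -= lr * db2

  print("final mse:", np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2))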
 
 
 
- 14 (attachment: lecture1%20(1).pdf)
- 12 (attachments: lecture1.pdf, lecun-06.pdf, notes_1018.pdf, pml-intro-22may12.pdf)
 
 
- Rob Fergus, Alexander Miller, Christian Puhrsch
- Binary classification: predict one of two discrete outcomes
- Simple regression: predict a numerical value
- Notes
The dataset is just a small sample of the underlying function; we try to find
the function.
A hyperplane is a subspace one dimension lower than the space it sits in
(e.g. a line in a plane).
  - Perceptrons:
1 if wx + b >= 0; 0 if wx + b < 0
w & b define the perceptron
- Criterion: count the number of mistaken predictions, averaged.
- Cannot distinguish finely (it's just a count of mistakes)
Optimization Algo
- choose random w & b
- will find a solution if one exists
- if the prediction is wrong, add the input vector to the weight vector;
otherwise continue (see the sketch below)
- epoch: one pass over the data
- beyond a point overfitting happens and validation error goes up
- underfitting: the opposite; the model can't capture the underlying function
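- A sketch of this perceptron training loop (my own code; the bias is folded into w via a constant-1 input):
  import numpy as np

  def train_perceptron(X, y, epochs=20):
      """X: (n, d) inputs, y: labels in {0, 1}. On a mistake, add or subtract the input vector."""
      Xb = np.hstack([X, np.ones((len(X), 1))])      # append a 1 so the bias lives inside w
      w = np.random.randn(Xb.shape[1]) * 0.01        # random initial w & b
      for _ in range(epochs):                        # each pass over the data is an epoch
          for xi, yi in zip(Xb, y):
              pred = 1 if xi @ w >= 0 else 0
              if pred != yi:
                  w += (yi - pred) * xi              # +x for a missed 1, -x for a missed 0
      return w

  # linearly separable toy data: class 1 whenever x0 + x1 > 0
  X = np.random.randn(200, 2)
  y = (X.sum(axis=1) > 0).astype(int)
  w = train_perceptron(X, y)
  preds = (np.hstack([X, np.ones((200, 1))]) @ w >= 0).astype(int)
  print("training accuracy:", (preds == y).mean())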
 
  - Supervised learning
- discovering the true function from samples (which may be corrupted)
- might be insane outside the training domain
- restrict to a function space: e.g. perceptrons
  - Perceptron limitations:
 
- 0-1 loss can't distinguish between different non-zero errors
- only terminates when the data is separable
  - Linear regression
f(x) = wx + b
- use mean squared error as the loss
- can come up with the best w & b for the mean squared error
- map (-inf, inf) to (0, 1) using softmax
  - Softmax
 
- can choose which class we'd like to fit
- converts scores to a probability
- TODO why not squares?
  - Logistic regression
f(x) = 1 / (1 + e^(-(wx + b)))
evaluate f(x) as a probability
the likelihood is the product of the per-example probabilities
differentiable, but doesn't have a closed-form solution
use the log likelihood
  - Gradient descent
Go through each example, and calculate a gradient based on the loss
function and the difference between the prediction and the target;
can also be used for problems that have a closed-form solution.
/Stochastic gradient descent/ -- update per example, instead of over the whole dataset
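- A sketch tying the last two headings together (my own code): logistic regression fit by stochastic gradient descent on the negative log likelihood; the toy data and learning rate are arbitrary.
  import numpy as np

  rng = np.random.default_rng(1)

  # toy data: the label is 1 with probability sigmoid(2*x0 - x1)
  X = rng.normal(size=(500, 2))
  true_w = np.array([2.0, -1.0])
  y = (rng.random(500) < 1 / (1 + np.exp(-(X @ true_w)))).astype(float)

  w, b, lr = np.zeros(2), 0.0, 0.1
  for epoch in range(50):
      for i in rng.permutation(500):                 # stochastic: one example at a time
          p = 1 / (1 + np.exp(-(X[i] @ w + b)))      # f(x) = 1 / (1 + e^-(wx + b))
          # the per-example gradient of the negative log likelihood is (p - y) * x
          w -= lr * (p - y[i]) * X[i]
          b -= lr * (p - y[i])

  print("learned w:", w.round(2), "b:", round(b, 2)) # should land near [2, -1] and 0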
  - Regularization
Apart from reducing training error, minimize regularization term by
including magnitude of weight vector in the loss function.
Controlled by a lambda.
  - Hyperparameters
Not directly optimized by the learning process
generally sweep over different combinations of hyperparameters
  - Stop training once validation error increases (early stopping)
  - Cross validation
keep trying different partitions: expensive for a large dataset
  - Lua
http://tylerneylon.com/a/learn-lua
  - Additional Resources
https://www.facebook.com/groups/987689104683098/permalink/989801941138481/
— Kunal