Working Notes: a commonplace notebook for recording & exploring ideas.
DL Intro
- 2016
- 08
- 03
- Convolutional neural nets
- Details
- Layers
- Pixels at each layer
- ReLU or some non-linearity
- Into output features
- 07
- 26
- Softmax for a probability vector: z_i = e^(-βx_i) / Σ_j e^(-βx_j)
- β is a temperature parameter that scales x
- Used a cross entropy cost function
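A minimal numpy sketch of the softmax and cross-entropy above (not from the lecture). It keeps the notes' sign convention e^(-βx), so x behaves like an energy: lower values get higher probability; the more common score-based form just drops the minus sign. Names like beta are illustrative.
#+BEGIN_SRC python
import numpy as np

def softmax(x, beta=1.0):
    # Notes' convention: e^(-beta*x); subtract the max for numerical stability.
    z = -beta * x
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, target_index):
    # Negative log probability of the correct class.
    return -np.log(probs[target_index])

x = np.array([2.0, 1.0, 0.1])
p = softmax(x, beta=1.0)
print(p, p.sum())            # probabilities summing to 1
print(cross_entropy(p, 2))   # loss if class 2 (lowest "energy") is the target
#+END_SRC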
- Tips
- ReLU
- Cross entropy loss
- SGD on minibatches
- Shuffle training samples to make sure models don't memorize order
- Normalize to zero mean, unit variance; whiten the data
- the network should then only have to learn the higher-order dependencies
- otherwise the network wastes capacity trying to remove these itself
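A small sketch of that preprocessing step, assuming the data is a (num_samples, num_features) numpy array: shift each feature to zero mean and scale to unit variance. Full whitening would additionally decorrelate the features; that part is omitted here.
#+BEGIN_SRC python
import numpy as np

def normalize(X, eps=1e-8):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)   # per-feature zero mean, unit variance

X = np.random.randn(100, 5) * 3.0 + 7.0
Xn = normalize(X)
print(Xn.mean(axis=0).round(6), Xn.std(axis=0).round(6))
#+END_SRC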
- Learning rate; decides how far you take a step
- "Efficient BackProp", Yann LeCun
- Setting learning rate
- if it's too large, will oscillate
- different layers will have different curvatures
- a single learning rate that makes progress in all dimensions is hard to find
- that's why having a variable learning rate helps
- Automatic learning rate
- under research
- fairly expensive
- Momentum
- Add speed to movement across a surface
- Nesterov momentum
- update weights with momentum vector
- apply momentum term first
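A rough sketch of the update rules with classical and Nesterov momentum, written against a generic grad(w) function (an assumption, not from the notes). The Nesterov variant "applies the momentum term first" by evaluating the gradient at the look-ahead point w + mu*v.
#+BEGIN_SRC python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    v = mu * v - lr * grad(w)          # accumulate velocity
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    lookahead = w + mu * v             # apply the momentum term first
    v = mu * v - lr * grad(lookahead)  # gradient at the look-ahead point
    return w + v, v

# toy quadratic: minimize 0.5 * ||w||^2, whose gradient is w
grad = lambda w: w
w, v = np.ones(3), np.zeros(3)
for _ in range(100):
    w, v = nesterov_step(w, v, grad)
print(w)  # close to zero
#+END_SRC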
- Batch Normalization
- normalize data within network
- pluggable layer in torch
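A minimal, training-mode-only sketch of batch normalization as described above: normalize each feature over the minibatch, then apply a learned scale (gamma) and shift (beta). Running statistics for inference are omitted to keep the sketch short.
#+BEGIN_SRC python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                 # per-feature mean over the minibatch
    var = x.var(axis=0)                 # per-feature variance over the minibatch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 10) * 4 + 2     # a minibatch of activations
out = batch_norm_forward(x, np.ones(10), np.zeros(10))
print(out.mean(axis=0).round(4), out.var(axis=0).round(4))
#+END_SRC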
- Local Minima
- in practice once you have lots of layers then the minima look similar
- DropOut
- add noise into the network
- randomly set half of the dimensions to zero during training
- forces redundancy: the remaining units should still be able to make the prediction
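A sketch of (inverted) dropout as described above: during training, zero out roughly half the activations at random and rescale the survivors so the expected activation stays the same; at test time, do nothing.
#+BEGIN_SRC python
import numpy as np

def dropout(x, p=0.5, training=True):
    if not training:
        return x
    # keep each unit with probability 1-p, rescale so expectations match
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask

h = np.random.randn(4, 8)
print(dropout(h, p=0.5))            # about half the entries are zero
print(dropout(h, training=False))   # unchanged at test time
#+END_SRC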
- Debugging
- backprop is broken: do a numerical gradient check (see the sketch after this list)
- parameters collapse / accuracy is low
- check loss function
- hit a degenerate solution
- Underperforming
- Slow
- it's possible to have small bugs and still get plausible-looking results
- write sanity checks
- inspect hidden units
- if there are strong correlations, then something is probably wrong
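A sketch of the numerical gradient check mentioned above: compare an analytic gradient against a centred finite-difference estimate. Here f and grad_f are toy stand-ins for a loss and its backprop gradient.
#+BEGIN_SRC python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e.flat[i] = eps
        # centred finite difference along dimension i
        g.flat[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

f = lambda w: np.sum(w ** 3)      # toy "loss"
grad_f = lambda w: 3 * w ** 2     # its analytic gradient
w = np.random.randn(5)
num = numerical_gradient(f, w)
print(np.max(np.abs(num - grad_f(w))))  # should be very small (~1e-8 or less)
#+END_SRC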
- Big model + regularization
- regularization constrains what the model can do
- prevents overfitting
- weight sharing
- data augmentation
- dropout
- weight decay
- sparsity in hidden units
- multi-task learning
- ConvNet
- some signals have structure that can be exploited
- Images have a lot of local dependencies
- Pixels captured tend to be simple
- capture structure with windows around position
- eg. photos
- other things that have some empirical structure
- spectrograms of speech
- compared to an arbitrary weight matrix, the convolutional weights have a very well-defined structure
- the filter is run during training
- good for GPUs
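A naive sketch of the idea above: a single 2D convolution (strictly, cross-correlation) that slides one small filter over an image, so the same few weights are reused at every position instead of an arbitrary full weight matrix. The example filter is an arbitrary illustrative choice.
#+BEGIN_SRC python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i:i + kh, j:j + kw]   # local window around (i, j)
            out[i, j] = np.sum(window * kernel)  # same shared weights everywhere
    return out

image = np.random.rand(8, 8)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple vertical-edge filter
print(conv2d(image, edge_filter).shape)          # (6, 6) feature map
#+END_SRC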
- 24
- Neural networks, continued
- can't make out anything from the lecture =/
- 23
- Neural networks
- Additional Resources
- Book: Neural Networks for Pattern Recognition, Christopher Bishop :ATTACH:
:Attachments: Neural_Networks_for_Pattern_Recognition_-_Christopher_Bishop.pdf
:ID: 4EE53424-BA92-4023-96D8-A55900114F4A
- Karpathy Neural Networks: cs231n.github.io
- Yann LeCun: deeplearning
- Layers of artificial neurons - history
- each layer computes a function of the layer beneath it
- mapped as /feed forward/ to the output
- began in the forties
- single layer perceptron was realized to be limited
- but multi layer models can work
- backprop then became important
- very usable for digit recognition
- GPUs
- Neuron
- n x 1 vector
- multiply by weights
- y = f(Σ_i x_i·w_i + b), where f is a pointwise non-linear function
- typically called /units/
- Single layer net
- think of this as a matrix and a bias vector
- output will be non linear function
- Non linearity function
- This allows the network to be non-linear:
otherwise compositions of linear functions are still linear and can't
model non-linear behaviour
- Sigmoid: σ(x) = 1 / (1 + e^(-x)) -- not used in practice
- the network locks up because the gradient is too small
- doesn't work well (MNIST doesn't count)
- tanh: bounded to (-1, +1)
- preferable to sigmoid
- very little theory to decide which non linearity to use
- more of general practice
- Rectified Linear: ReLU
- max(x, 0)
- efficient to implement
- the default in practice
- if a unit's input stays negative, its gradient is zero and it's pretty much stuck
- Leaky rectified linear
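The non-linearities discussed above, as plain numpy functions. The leaky ReLU slope (0.01) is a common choice, not something specified in these notes.
#+BEGIN_SRC python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturates; gradients vanish

def tanh(x):
    return np.tanh(x)                      # bounded in (-1, 1)

def relu(x):
    return np.maximum(x, 0.0)              # cheap; zero gradient for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope keeps negative inputs alive

x = np.linspace(-3, 3, 7)
print(relu(x))
print(leaky_relu(x))
#+END_SRC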
- Architecture
- No good answer to picking an architecture (# layers, # units)
- 2-3 layers for fully connected models that can be trained
- ... verify using Validation set
- Too many units leads to overfitting
- Representational power
- 1 layer == linear
- 2+ layers: /any/ function
- Very wide 2 layers or a narrower deep model?
- Beyond 3-4 layers, adding more doesn't help much
- Training
- Choose x, y and a cost function
- Forward-pass the examples
- calculate error using cost function
- back prop to pass error back through model, adjusting parameters to minimize energy
- Chain rule of derivatives back through model
- Once gradients are obtained, use Stochastic Gradient Descent
- Gradient Descent
- Remember to scale gradient vector depending on weights
- Want to take many small steps
- Compute gradient on a small set of data and update
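A rough end-to-end sketch of this training loop (not from the lecture; layer sizes, toy target, and learning rate are arbitrary choices): a 2-layer net with a tanh hidden layer, an MSE cost, gradients by the chain rule, and minibatch SGD.
#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (8, 2)), np.zeros(8)
W2, b2 = rng.normal(0, 0.1, (1, 8)), np.zeros(1)
lr = 0.1

X = rng.normal(size=(256, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # a toy target

for step in range(500):
    idx = rng.integers(0, len(X), 32)                # sample a minibatch
    x, t = X[idx], y[idx]
    # forward pass
    h = np.tanh(x @ W1.T + b1)
    out = h @ W2.T + b2
    # backward pass (chain rule), MSE cost
    d_out = 2 * (out - t) / len(x)
    dW2 = d_out.T @ h;  db2 = d_out.sum(0)
    d_h = (d_out @ W2) * (1 - h ** 2)                # tanh' = 1 - tanh^2
    dW1 = d_h.T @ x;    db1 = d_h.sum(0)
    # SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(((out > 0.5) == (t > 0.5)).mean())             # accuracy on the last minibatch
#+END_SRC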
- 14 :ATTACH:
:Attachments: lecture1%20(1).pdf
:ID: 4AA674E1-1979-46C2-81AD-C8EBBEF18220
- 12 :ATTACH:
:Attachments: lecture1.pdf lecun-06.pdf notes_1018.pdf pml-intro-22may12.pdf
:ID: 32C6805C-5647-40AA-AE4D-0A1B0068E790
- Rob Fergus, Alexander Miller, Christian Puhrsch
- Binary classification
- predict one of two discrete outcomes
- Simple regression
- predict a numerical value
- Notes
The dataset is just a small sample of the underlying function; we try to
recover that function.
A hyperplane is a subspace one dimension lower than the ambient space
(eg. a line in a plane).
- Perceptrons:
f(x) = 1 if wx + b >= 0; 0 if wx + b < 0
w & b define the perceptron
- Criterion: the number of predictions that don't match the labels, averaged.
- Cannot distinguish finely
Optimization Algo
- choose random w & b
- will find a solution if one exists
- if the prediction is wrong, add the input vector to the weight vector; otherwise continue
- epoch: one pass over the data
- beyond a point overfitting happens and validation error goes up
- underfitting: the opposite, the model can't capture the underlying function
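A sketch of the perceptron algorithm described above: start from random w and b, and whenever an example is misclassified, nudge the weights by the input vector. The sign handling for the two classes is a standard detail not spelled out above. It only terminates on linearly separable data.
#+BEGIN_SRC python
import numpy as np

def train_perceptron(X, y, epochs=50):
    # y is expected in {0, 1}
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=X.shape[1]), 0.0
    for _ in range(epochs):                     # one epoch = one pass over data
        mistakes = 0
        for x, t in zip(X, y):
            pred = 1.0 if x @ w + b >= 0 else 0.0
            if pred != t:
                sign = 1.0 if t == 1 else -1.0  # add or subtract the input vector
                w += sign * x
                b += sign
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])                  # AND: linearly separable
w, b = train_perceptron(X, y)
print([(1.0 if x @ w + b >= 0 else 0.0) for x in X])
#+END_SRC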
- Supervised learning
- discovering the true function from samples (which may be corrupted)
- might be wildly wrong outside the training domain
- restrict to a function space: eg perceptron
- Perceptron limitations:
- the 0-1 loss can't distinguish between different non-zero errors
- only terminates when the data is separable
- Linear regression
f(x) = wx + b
- use mean squared error to determine the loss
- can compute the best w & b for the mean squared error in closed form
- map (-inf, inf) to (0, 1) using softmax
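A sketch of linear regression with mean squared error, showing that the best w and b have a closed-form (least squares) solution; the synthetic data and "true" parameters are just for illustration.
#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + 0.01 * rng.normal(size=100)

# append a constant column so the bias is just another weight
Xb = np.hstack([X, np.ones((100, 1))])
params, *_ = np.linalg.lstsq(Xb, y, rcond=None)   # closed-form least squares
w, b = params[:-1], params[-1]

mse = np.mean((Xb @ params - y) ** 2)
print(w.round(2), round(b, 2), mse)   # recovers roughly (2, -1, 0.5) and 3
#+END_SRC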
- Softmax
- can choose which class we'd like to fit
- converts scores to a probability
- TODO why not squares?
- Logistic regression
f(x) = 1 / (1 + e^(-(wx + b)))
evaluate f(x) as a probability
take the product of the per-example probabilities as the likelihood
differentiable, but doesn't have a closed form solution
use the log likelihood
- Gradient descent
Go through each example, and calculate a gradient based on the loss
function and the difference between prediction and label.
Can also be used even where a closed form solution exists.
/Stochastic gradient descent/ -- per example, instead of the whole dataset
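A sketch of logistic regression trained with stochastic gradient descent on the negative log likelihood, as outlined above. The toy data and learning rate are arbitrary; the simplification that the log-loss gradient for the sigmoid model is (prediction - label)·x is a standard derivation.
#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)    # a linearly separable toy label

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):            # shuffle each epoch
        x, t = X[i], y[i]
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted probability
        grad = p - t                             # d(log loss)/d(wx + b)
        w -= lr * grad * x                       # SGD step, one example at a time
        b -= lr * grad

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
print((preds == y).mean())                        # training accuracy
#+END_SRC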
- Regularization
Apart from reducing training error, also minimize a regularization term by
including the magnitude of the weight vector in the loss function.
Controlled by a lambda.
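A sketch of adding an L2 regularization term to a loss, controlled by lambda (written lam below, since lambda is a Python keyword): the penalty grows with the magnitude of the weight vector, and its gradient simply adds 2·lam·w to the data-loss gradient.
#+BEGIN_SRC python
import numpy as np

def regularized_loss(w, data_loss, lam=0.01):
    # data loss plus lam * ||w||^2
    return data_loss + lam * np.sum(w ** 2)

def regularized_grad(w, data_grad, lam=0.01):
    # gradient of the penalty is 2 * lam * w
    return data_grad + 2 * lam * w

w = np.array([3.0, -2.0, 0.5])
print(regularized_loss(w, data_loss=1.25))
print(regularized_grad(w, data_grad=np.zeros(3)))
#+END_SRC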
- Hyperparameters
Not directly optimized by the learning process
generally sweep over different combinations of hyperparameters
- Stop training once validation error increases (early stopping)
- Cross validation
keep trying different partitions: expensive for large datasets
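A sketch of k-fold cross validation as described above: split the data into k partitions, hold each out in turn as the validation set, and average the validation scores. The evaluate() argument is a placeholder for "train on the training split, score on the validation split".
#+BEGIN_SRC python
import numpy as np

def k_fold_scores(X, y, k, evaluate):
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]                                       # held-out partition
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(evaluate(X[train], y[train], X[val], y[val]))
    return np.mean(scores)

X, y = np.random.randn(100, 3), np.random.randint(0, 2, 100)
# trivial placeholder evaluator: predict the training set's majority class
majority = lambda Xtr, ytr, Xv, yv: np.mean(yv == (ytr.mean() > 0.5))
print(k_fold_scores(X, y, k=5, evaluate=majority))
#+END_SRC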
- Lua
http://tylerneylon.com/a/learn-lua
- Additional Resources
https://www.facebook.com/groups/987689104683098/permalink/989801941138481/
— Kunal