PyTorch: Completely Uninformed Image Modelling
1 PyTorch
PyTorch is a framework for GPU computation in Python. It's a re-implementation of Torch, a Lua project which apparently everyone hates 1. This document reviews PyTorch, and (perhaps more importantly) shows you how to apply some pretty simple CNNs and RNNs to new data.
I think this is pretty important, as many tutorials use the same data, and I was pretty suspicious about the insane accuracies I saw on various data-sets.
1.1 How do I install it?
You can go here for instructions. I like running my stuff locally (and have a GPU), so those are the instructions I used.
I use conda, so that's what my instructions are based on.
conda create -n pytorch
source activate pytorch
conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
conda install -c conda-forge autopep8 yapf flake8
conda install scipy numpy sklearn pandas seaborn requests jupyter
conda install -c conda-forge scikit-image
This is for CUDA 9.0 (which must be installed) and Python 3, and it does require Conda.
If you use Python, you should already be using conda, as it's a pretty OK package manager. It's also nice that it can handle compiled dependencies (i.e. C++ and whatnot). It is more general than Python (you can get R and various Linux tools through it, which is handy if you don't have root or are working on an out-of-date system).
Note that the above takes a while… long enough that it's annoying me as I write this, but I guess that reproducibility is a harsh mistress :/. Note that I add a bunch of other packages. These go into the virtualenv, so that I get all of my IDE necessities 2.
1.2 Basic Usage
import torch
import torchvision as tv
Torch is the main package, which has loads of subpackages (autograd, nn and more). Torchvision is a collection of utilities for image learning. It has transforms, a dataset library and a model library, all of which are really useful.
x = torch.Tensor(5, 3)
y = torch.rand(5, 3)
x + y
1.3 Overall API
The main pieces of the API:
- torch: the Tensor (i.e. ndarray) library
- torch.autograd: automatic differentiation
- torch.nn: a neural network library
- torch.optim: standard optimisation methods (SGD, RMSProp, etc.)
- torch.multiprocessing: magical memory sharing
- torch.utils: data loading and training utility functions
For more, check out their about page. We'll review a bunch of this stuff throughout this document.
1.4 Sizes and Shapes
y = torch.randn(2, 3)
print(y.size())  # like np.shape for ndarray
So the size function is essentially equivalent to np.shape, returning a torch.Size object that lists the dimensions (it behaves like a tuple).
torch.Size([2, 3])
1.5 Addition Ops
x + y
z = torch.zeros([5, 3])
torch.add(x, y)
print(torch.add(x, y))
torch.add(z, y, out=z)
x.add_(y)
Addition can be done in multiple ways. We can write the result into an out tensor, as in the second-to-last example; alternatively, methods ending with an underscore (like add_) mutate the tensor they're called on, a convention that holds throughout the entire library.
I like the underscore convention, as long as it's consistent. Possibly a recipe for disaster if people don't know what they're doing.
1.6 Indexing
print(x[:, 1])  # x[row, column]
PyTorch uses row, column indices, which are fairly standard across languages now. Note that 3D tensors (i.e. for images) gain an extra minibatch dimension when batched, giving the number of observations in each minibatch; this is normally the first dimension.
- Numpy uses Height * Width * Colour,
- Torch uses Colour * Height * Width,
- This means that we require conversion to transpose the matrices (examples below)
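A minimal sketch of that conversion (the 224×224 image here is just a made-up stand-in): permute reorders the dimensions, which is what we want here, whereas a plain view would reinterpret the memory and scramble the image.

import numpy as np

img_np = np.zeros((224, 224, 3))    # numpy/skimage layout: Height * Width * Colour
img_t = torch.from_numpy(img_np)
img_chw = img_t.permute(2, 0, 1)    # reorder to Colour * Height * Width
print(img_chw.size())               # torch.Size([3, 224, 224])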
1.7 Numpy Interaction
a = torch.ones(5)
print(a)
b = a.numpy()
print(b)
Conversion from and to numpy is pretty seamless (as is conversion from gpu to cpu).
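Going the other way is just as easy; as far as I know, from_numpy shares memory with the original array rather than copying it, as this small sketch (my own toy example) shows.

import numpy as np

a = np.ones(5)
t = torch.from_numpy(a)   # shares memory with the numpy array
a[0] = 7
print(t)                  # the change shows up in the tensor as well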
1.8 CUDA (yay!)
if torch.cuda.is_available():
    x = x.cuda()
    y = y.cuda()
    z = x + y
    z_cpu = z.cpu()
If you can get CUDA to work, the above is really easy. However, getting CUDA to work can be an exercise in frustration. You'll need all the drivers (available from the apt repos, assuming you're using Ubuntu). Then you need to install CUDA globally, set LD_LIBRARY_PATH so that the conda libraries can find your CUDA, and then pray.
It's gotten better in the last year or two, but it's still painful. It is unlikely to completely brick your machine these days 3.
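A quick sanity check I tend to run once everything is installed (all standard calls, nothing exotic):

print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # should name your GPU, not throw an error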
1.9 Autograd
Once upon a time, Pytorch had a class called Variable, which represented a variable with a history (for the differentiation).
Creating one of those used to look like this, below.
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
print(y)
z = y * y * 3
out = z.mean()
print(z, out)
out.backward()
This doesn't work the same way anymore (since 0.4).
Now, the API for tensors and variables has been unified, which means that the above is done like so.
x = torch.ones((2, 2), requires_grad=True)
y = x + 2
print(y)
z = y * y * 3
out = z.mean()
print(z, out)
out.backward()
This is probably the coolest thing about PyTorch. It implements full reverse-mode auto-differentiation. This is done efficiently with a combination of memoisation and recursive applications of the chain rule. These tensors are inherently stateful: gradients accumulate in .grad across repeated backward passes rather than being overwritten, which is why training loops zero the gradients before each step.
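As a small illustration of that statefulness (my own toy example): gradients accumulate in .grad across separate backward passes until you zero them.

x = torch.ones(2, requires_grad=True)
for _ in range(2):
    y = (x * x).sum()    # build a fresh graph each time
    y.backward()
print(x.grad)            # tensor([4., 4.]): two lots of 2*x have accumulated
x.grad.zero_()           # this is what optimiser.zero_grad() does for you later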
2 Gradients
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.data.norm() < 1000:
    y = y * 2
print(y)

gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x, x.grad)
The gradients are stored in the .grad attribute of the Tensor object.
len(dir(x))
Wow, they implement essentially all of the standard dunder methods, and add on a whole host more for free. Clearly, FAIR's programmers and researchers are not standing still while they're not plotting to take over the world ;).
3 Loading Data
data_dir = 'new_photos'
dsets = {x: datasets.ImageFolder(os.path.join(data_dir, x), transform[x])
         for x in ['train', 'val']}
dset_loaders = {x: torch.utils.data.DataLoader(dsets[x], batch_size=6,
                                               shuffle=True, num_workers=4)
                for x in ['train', 'val']}
dset_sizes = {x: len(dsets[x]) for x in ['train', 'val']}
dset_classes = dsets['train'].classes
So we just point ImageFolder at our folder, wrap it in a DataLoader, and use dictionary comprehensions to keep the train and validation splits (and their sizes) together.
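For reference, ImageFolder infers the class labels from the sub-directory names; the layout below uses the directory names that appear elsewhere in this document, with made-up file names.

# ImageFolder expects one sub-directory per class:
#   new_photos/train/high/img001.jpg
#   new_photos/train/medium/img002.jpg
#   new_photos/train/low/img003.jpg
#   new_photos/val/high/...   (and so on for the other classes)
print(dset_classes)   # ['high', 'low', 'medium'], taken from the folder names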
3.1 Dataloaders
When I first looked at the code above, I was horrified. It seemed far too complicated for what it did, so I replaced it with this:
from scipy import misc
test = misc.imread("new_photos/train/high/6813074_....jpg")
However, the DataLoader solves way more problems (a usage sketch follows this list):
- It implements lazy-loading which is good because each image is reasonably large
- it shuffles the data
- It handles batching (and the batch size can make a big difference)
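A minimal usage sketch (batch size 6, as configured above; H and W depend on the transforms):

imgs, labels = next(iter(dset_loaders['train']))
print(imgs.size())     # torch.Size([6, 3, H, W]): minibatch first, then channels
print(labels.size())   # torch.Size([6])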
3.2 Better dataloading
- Torch provides a Dataset class
- Implement __len__ and __getitem__
3.3 DataLoading
from skimage import io, transform
from torch.utils.data import Dataset, DataLoader


class RentalsDataSetOriginal(Dataset):
    def __init__(self, csv_file, image_path, transform=None):
        self.data_file = pd.read_csv(csv_file)
        self.image_dir = image_path
        self.transform = transform

    def __len__(self):
        return len(self.data_file)

    def __getitem__(self, idx):
        row = self.data_file.iloc[idx, :]
        dclass, listing, im, split = row
        image = io.imread(os.path.join(self.image_dir, split, dclass, im)).astype('float')
        img_torch = torch.from_numpy(image)
        # reorder Height * Width * Colour to Colour * Height * Width
        # (view() would just reinterpret the memory and scramble the image)
        img_rs = img_torch.permute(2, 0, 1)
        return (img_rs, dclass)
3.4 Getitem
    def __getitem__(self, idx):
        row = self.data_file.iloc[idx, :]
        dclass, listing, im, split = row
        image = io.imread(os.path.join(self.image_dir, split, dclass, im)).astype('float')
        img_torch = torch.from_numpy(image)
        img_rs = img_torch.permute(2, 0, 1)  # Colour * Height * Width
        return (img_rs, dclass)
3.5 Actual Neural Networks
You must implement an __init__ method to define the structure of the net, and a forward method, which applies the layers and non-linearities to the input. PyTorch then handles all the backward differentiation for you (which is nice).
3.6 Minimal Neural Network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 48, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(48, 64, 5)
        self.conv3 = nn.Dropout2d()
        # honestly, I just made up these numbers
        self.fc1 = nn.Linear(64 * 29 * 29, 300)
        self.fc2 = nn.Linear(300, 120)
        self.fc3 = nn.Linear(120, 3)
So the __init__ method creates the structure of the net, but you need to provide the input and output sizes yourself. If (when) you mess this up (what do you mean you don't do those kinds of sums in your head?), comment out all of the layers after the error and use x.size() to decide what to do. All nets must inherit from nn.Module (or a more specific subclass).
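Concretely, that debugging trick looks something like this: truncate forward, print the size, and only then fill in the Linear dimensions.

def forward(self, x):
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    print(x.size())   # e.g. torch.Size([N, 64, 29, 29]) -> fc1 needs 64 * 29 * 29 inputs
    return x          # stop here until the sizes line up, then restore the rest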
3.7 The Forward Pass
def forward(self, x):
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    x = x.view(-1, 64 * 29 * 29)  # -1 lets view infer the minibatch dimension
    x = F.dropout(x, training=self.training)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x
The forward method contains the non-linearities and the pooling operators. Note the training argument to dropout. The tricksy part with -1 tells view to infer that dimension, which here is the minibatch size (as explained above). Note also the ReLU, otherwise known as the simplest, most complicated thing in deep learning. It performs the princely computation of this dire incantation
max(x, 0)
Amazing, isn't it? To be fair, ReLUs are way better than sigmoid functions (which may be familiar to you from such movies as logistic regression), in that they don't saturate at 1. The logistic function never goes above 1, whereas max(x, 0) can (theoretically, at least) go to infinity 4.
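A quick illustration of that difference (the numbers are mine):

import torch.nn.functional as F

x = torch.tensor([-5.0, 0.0, 5.0, 50.0])
print(torch.sigmoid(x))   # squashed into (0, 1) and flat (saturated) for large values
print(F.relu(x))          # max(x, 0): zero below zero, unbounded above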
3.8 Training the Model
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimiser = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
tr = dset_loaders['train']

for epoch in range(10):
    for i, data in enumerate(tr, 0):
        inputs, labels = data
        inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
        optimiser.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        _, preds = torch.max(outputs.data, 1)
        loss.backward()
        optimiser.step()
4 Saving Model State
dtime = str(datetime.datetime.now())
outfilename = 'train' + "_" + str(epoch) + "_" + dtime + ".tar"
torch.save(net.state_dict(), outfilename)
- Useful to resume training
- Current model state can be restored into a net of exactly the same shape (a sketch of this follows the list)
- Not as important for my smaller models
- These files are huuuuuggggeeee
- So you may wish to only save whichever performs best
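Restoring is the mirror image; a minimal sketch (assuming the same Net class and the outfilename written above):

net = Net()                                    # must be exactly the same architecture
net.load_state_dict(torch.load(outfilename))
net.eval()                                     # switch off dropout etc. before evaluating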
5 Testing the Model
for epoch in range(5):
    val_loss = 0.0
    val_corrects = 0
    for i, data in enumerate(val, 0):
        inputs, labels = data
        inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        _, preds = torch.max(outputs.data, 1)
        val_loss += loss.data[0]
        val_corrects += torch.sum(preds == labels.data)
    phase = 'val'
    val_epoch_loss = val_loss / dset_sizes['val']
    val_epoch_acc = val_corrects / dset_sizes['val']
    print('{} Loss: {:.4f} Acc: {:.4f}'.format(
        phase, val_epoch_loss, val_epoch_acc))
6 Playing with the Net
params = list(net.parameters())
print(len(params))
print(params[0].size())
input = Variable(torch.randn(3, 3, 128, 128))  # 128x128 gives the 29x29 feature maps fc1 expects
out = net(input)
print(out)
7 How did it do?
train Loss: 0.1360 Acc: 0.6742
train Loss: 0.1355 Acc: 0.6750
...
train Loss: 0.1202 Acc: 0.6966
val Loss: 0.1432 Acc: 0.6816
...
val Loss: 0.1440 Acc: 0.6810
- Training Accuracy 69% (10 epochs)
- Test Accuracy 68%
- This is OK, but given the data and the lack of any meaningful domain knowledge, I'm reasonably impressed.
- I guess what we actually need to know is what the incremental value of the image data is, relative to the rest of the data.
8 Text Data
- Fortunately, the rentals dataset also has some text data
import pandas as pd

text = pd.read_csv("rentals_sample_text_only.csv")
first = text.iloc[0, :]
print(list(first))
>>> >>> ["This location is one of the most sought after areas
in Manhattan Building is located on an amazing quiet tree lined block located just steps from transportation, restaurants, boutique shops, grocery stores*** For more info on this unit and/or others like it please contact Bryan 449-593-7152 / [email protected] <br /><br /> Bond New York is a real estate broker that supports equal housing opportunity.<p><a website_redacted "]
9 Characters vs Words?
- Most NLP that I traditionally saw used words (and bigrams, trigrams etc) as the unit of observation
- Many deep learning approaches instead rely on characters
- Characters are much less sparse than words
- We have way more character tokens to learn from (even though the character vocabulary is much smaller)
- We don't understand a word as a function of its characters, so should a machine?
10 Characters
- They are much less sparse
- The representation is pretty cool also
- We represent each character as a 1×N one-hot tensor, where N is the size of the character universe
- Each word is represented as a matrix of these characters
11 Preprocessing
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)


def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )
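As a quick check (the example string is mine): the NFD normalisation splits off the accents, and the combining marks get dropped.

print(unicode_to_ascii('Ślusàrski'))   # 'Slusarski'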
- Cultural Imperialism rocks!
- More seriously, we reduce the dimension from 90+ to 32
- This means we can handle more words and longer descriptions
12 Apply to the text data
first = text['description']
first2 = []
char_ascii = {}
for word in first:
    for char in word:
        char = unicode_to_ascii(char.lower())
        if char not in char_ascii:
            char_ascii[char] = 1
        else:
            pass
- We need the set of characters that actually occur to create a mapping from characters to a 1-hot matrix
- This is necessitated by the disappointing lack of R's model.matrix
- This code was also used to assess the impact of removing non-ascii chars
13 Character to Index
import torch

all_letters = char_ascii.keys()
letter_idx = {}
for letter in all_letters:
    if letter not in letter_idx:
        letter_idx[letter] = len(letter_idx)


def letter_to_index(letter):
    return letter_idx[letter]
- Create a dict mapping each letter to an index (the number of letters seen before it)
- Use this to represent the letter as a number
14 Letter/Words to Tensor
def letter_to_tensor(letter):
    tensor = torch.zeros(1, len(char_ascii))
    tensor[0][letter_to_index(letter)] = 1
    return tensor


def line_to_tensor(line):
    tensor = torch.zeros(len(line), 1, len(char_ascii))
    for li, letter in enumerate(line):
        letter = unicode_to_ascii(letter.lower())
        tensor[li][0][letter_to_index(letter)] = 1
    return tensor
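A quick shape check on a toy string (assuming its characters made it into char_ascii):

t = line_to_tensor("hi")
print(t.size())   # torch.Size([2, 1, len(char_ascii)]): one time-step per character
print(t.sum())    # tensor(2.): exactly one non-zero entry per character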
- Code implementation for the character and word to tensor functions
- Note that these are going to be really sparse vectors (one non-zero entry per row)
- torch has sparse matrix support (but it's marked as experimental)
15 Bespoke Rentals Code
all_categories = ['low', 'medium', 'high']


def category_from_output(output):
    top_n, top_i = output.data.topk(1)
    category_i = top_i[0][0]
    return all_categories[category_i], category_i
- We need to be able to map back from a matrix of probabilities to a class prediction
16 Different Get Data Implementation
import pandas as pd

textdf = pd.read_csv('rentals_text_data.csv').dropna(axis=0)

cat_to_ix = {}
for cat in all_categories:
    if cat not in cat_to_ix:
        cat_to_ix[cat] = len(cat_to_ix)
    else:
        pass


def random_row(df):
    rowrange = df.shape[0] - 1
    return df.iloc[random.randint(0, rowrange)]
17 Shuffling Training Examples
import random
from torch.autograd import Variable


def random_training_example(df):
    row = random_row(df)
    target = row['interest_level']
    text = row['description']
    catlen = len(all_categories)
    target_tensor = Variable(torch.zeros(catlen))
    idx_cat = cat_to_ix[target]
    target_tensor[idx_cat] = 1
    words_tensor = Variable(line_to_tensor(text))
    return target, text, target_tensor, words_tensor


target, text, t_tensor, w_tensor = random_training_example(textdf)
- We return the class, the actual text
- And also the matrix representation of these two parts
18 Our RNN
import torch.nn as nn
from torch.autograd import Variable


class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(1, self.hidden_size))


n_hidden = 128
n_letters = len(char_ascii)
rnn = RNN(len(char_ascii), n_hidden, 3)
- Pretty simple
- Absolutely no tuning applied
19 Train on one example
optimiser = optim.SGD(rnn.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
learning_rate = 0.005


def train(target_tensor, words_tensor):
    hidden = rnn.init_hidden()
    rnn.zero_grad()
    for i in range(words_tensor.size()[0]):
        output, hidden = rnn(words_tensor[i], hidden)
    loss = criterion(output.squeeze(), target_tensor.type(torch.LongTensor))
    loss.backward()  # magic
    optimiser.step()
    for p in rnn.parameters():  # need to figure out why this is necessary
        p.data.add_(-learning_rate, p.grad.data)
    return output, loss.data[0]
20 Training in a Loop
n_iters = 10000
current_loss = 0

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = random_training_example(textdf)
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss
21 Inspecting the Running Model
# Print iter number, loss, name and guess
if iter % print_every == 0:
    guess, guess_i = category_from_output(output)
    correct = 'Y' if guess == category else 'N (%s)' % category
    print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100,
                                            time_since(start), loss, line, guess, correct))

# Add current loss avg to list of losses
if iter % plot_every == 0:
    all_losses.append(current_loss / plot_every)
    current_loss = 0
22 Problems
- This loops through the data in a non-deterministic order (one possible fix is sketched after this list)
- We should probably ensure that we go through the data N*epoch times
- Additionally, we need some test data
- Fortunately, we have all of the text data available
- Unfortunately it's late Monday night now, and I won't sleep if I don't stop working :(
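One possible fix, sketched but not actually run: shuffle the row indices once per epoch, so every description is seen exactly once per pass.

import numpy as np

for epoch in range(3):                               # e.g. three full passes
    for idx in np.random.permutation(len(textdf)):   # every row exactly once, random order
        row = textdf.iloc[idx]
        target_tensor = Variable(torch.zeros(len(all_categories)))
        target_tensor[cat_to_ix[row['interest_level']]] = 1
        words_tensor = Variable(line_to_tensor(row['description']))
        output, loss = train(target_tensor, words_tensor)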
23 Future Work
- Implement deconvolutional nets/other visualisation tools to understand how the models work
- Solve the actual Kaggle problem by using an RNN over my CNN
- Add the text data, image data and structured data to an ensemble and examine overall performance
- Learn more Python
24 Conclusions
- PyTorch is a powerful framework for matrix computation on the GPU
- It is deeply integrated with Python
- It's not just a set of bindings to C/C++ code
- It is really easy to install (by the standards of DL frameworks)
- You can inspect each stage of the Net really easily (as it's just Python objects)
- No weirdass errors caused by compilation!
25 Further Reading
- My repo with code (currently non-functional because I need to upload)
- PyTorch tutorials and examples
- the Docs (these are unbelievably large)
- The Book (seriously, even if you never use deep learning there's a lot of really good material there)
- Completely unrelated, but this is an amazing book on Python
- You should definitely read it
26 Papers (horribly incomplete)
- AlexNet - it's amazing how many new things this paper did
- Deconvolutional Nets
- Generative Adversarial Networks
- Rethinking Generalisation and Deep Learning
- Deep Reinforcement Learning