```
#import libraries
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split
```

# Fully Connected Network in PyTorch

Implementing a fully connected neural network in PyTorch

This page contains my notes on how to implement a fully-connected neural network (FCN) in PyTorch. It’s meant to be a bare-bones, basic implementation of an FCN.

## What is an FCN?

A fully connected network (FCN) is an architecture with a bunch of densely-connected layers stacked on top of one another. In this architecture, each node in layer *i* is connected to each node in layer *i+1*, and the output of each layer is passed through an activation function before being sent to the next layer. Here’s what this looks like:

In PyTorch parlance, these are linear layers, and they can be specified via `nn.Linear(in_size, out_size)`. Other frameworks might call them other things (e.g. Flux.jl calls them Dense layers).
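To make the linear layer concrete, here is a minimal sketch of what a single `nn.Linear` does to a batch of inputs. The sizes (10 in, 100 out, batch of 4) and the variable names are arbitrary choices for illustration.

```python
import torch
from torch import nn

# a single fully connected layer mapping 10 input features to 100 outputs
layer = nn.Linear(10, 100)

# a fake batch of 4 observations with 10 features each
x = torch.randn(4, 10)
out = layer(x)

# the layer stores a (100, 10) weight matrix and a length-100 bias,
# and computes x @ W.T + b
manual = x @ layer.weight.T + layer.bias

print(out.shape)                                # torch.Size([4, 100])
print(torch.allclose(out, manual, atol=1e-6))
```

Every output node gets its own weight for every input node, which is exactly the "each node connects to each node" picture described above.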

### Strengths and Weaknesses

FCNs are flexible models that can do well at all sorts of “typical” machine-learning tasks (i.e. classification and regression), and can work with various types of inputs (images, text, etc.). They’re basically the linear regression of the deep learning world (in that they’re foundational models, not in the sense that they’re linear…because they aren’t).

Since FCNs connect all nodes in layer *i* to all nodes in layer *i+1*, they can be computationally expensive when used to model large datasets (those with lots of predictors) or when the models themselves have lots of parameters. Another issue is that they don’t have any mechanisms to capture dependencies in the data, so in cases where the inputs have dependencies (image data -> spatial dependencies, time-series data -> temporal dependencies, clustered data -> group dependencies), FCNs may not be the best choice.
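To put a rough number on the cost, the sketch below counts the weights and biases in a stack of fully connected layers. `count_fcn_params` is a hypothetical helper written for this note, not a PyTorch function. The first set of layer sizes matches the example model later on this page, and the second is an arbitrary wider variant.

```python
def count_fcn_params(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per output node
    return total

small = count_fcn_params([10, 100, 100, 1])
wide = count_fcn_params([10, 1000, 1000, 1])
print(small, wide)  # 11301 1013001
```

Making the hidden layers 10 times wider makes the model roughly 90 times larger, because the hidden-to-hidden weight matrix grows with the square of the width.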

## Example Model in PyTorch

The code below has a basic implementation of a fully-connected network in PyTorch. I’m using fake data and an arbitrary model architecture, so it’s not supposed to be a great model. It’s more intended to demonstrate a general workflow.

### Import Libraries

The main library I need here is `torch`. Then I’m also loading the `nn` module to create the FCN as well as some utility functions to work with the data.

### Generate Fake Data

In this step, I’m generating some fake data:

- a design matrix, `X`,
- a set of ground-truth betas,
- the result of \(X \beta\), `y_true`,
- `y_true` with some noise added to it, `y_noisy`

```
# generate some true data
n = 10000
m = 10

beta = torch.randn(m)
X = torch.randn(n, m)
y_true = torch.matmul(X, beta)
y_noisy = y_true + torch.randn(n)
y_noisy = y_noisy.unsqueeze(1)
```

In the last step, I have to call `unsqueeze()` on `y_noisy` to ensure it’s formatted as a tensor with the correct number of dimensions: shape `(n, 1)`, matching the `(batch, 1)` output the model will produce, rather than `(n,)`.
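To see why the extra dimension matters, here is a small sketch with toy tensors (8 observations, names chosen for illustration). If the targets stay one-dimensional, subtracting them from a column of predictions silently broadcasts into a square matrix instead of an element-wise difference.

```python
import torch

y = torch.randn(8)            # shape (8,), like y_noisy before unsqueeze
pred = torch.randn(8, 1)      # shape (8, 1), like the model's output

y_col = y.unsqueeze(1)        # shape (8, 1)

# without unsqueeze, (8, 1) minus (8,) broadcasts to (8, 8)
print((pred - y).shape)       # torch.Size([8, 8])

# with unsqueeze, the shapes line up one-to-one
print((pred - y_col).shape)   # torch.Size([8, 1])
```

A loss computed on the `(8, 8)` version would compare every prediction against every target, which is not what we want.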

### Process Data

Here, I’ll create a Dataset using `X` and `y_noisy`, then I’ll do some train/test split stuff and create a Dataloader for the train and test sets.

I should probably create a set of notes on datasets and dataloaders in PyTorch, but for now we’ll just say that dataloaders are utility tools that help feed data into PyTorch models in batches. These are usually useful when we’re working with big data that we can’t (or don’t want to) process all at once.

```
batch_size = 64

ds = TensorDataset(X, y_noisy)

#splitting into train and test
trn_size = int(.8 * len(X))
tst_size = len(X) - trn_size

trn, tst = random_split(ds, [trn_size, tst_size])

trn_dl = DataLoader(trn, batch_size=batch_size)
tst_dl = DataLoader(tst, batch_size=batch_size)
```
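To make the batching concrete, here is a minimal sketch of what iterating over a dataloader yields. It uses a separate toy dataset of 150 rows (an arbitrary size, chosen so the last batch comes out short) rather than the data above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 150 fake observations with 10 features, plus a column of targets
features = torch.randn(150, 10)
targets = torch.randn(150, 1)

demo_dl = DataLoader(TensorDataset(features, targets), batch_size=64)

# each iteration yields one (features, targets) batch
batch_shapes = [tuple(xb.shape) for xb, yb in demo_dl]

print(len(demo_dl))    # 3 batches
print(batch_shapes)    # [(64, 10), (64, 10), (22, 10)]
```

The final batch holds whatever is left over (150 = 64 + 64 + 22), so code that consumes batches shouldn’t assume every batch has exactly `batch_size` rows.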

### Defining the Model

Now that the data’s set up, it’s time to define the model. FCNs are pretty straightforward and are composed of alternating `nn.Linear()` and activation function (e.g. `nn.ReLU()`) calls. These can be wrapped in `nn.Sequential()`, which makes it easier to refer to the whole stack of layers as a single module.

I’m also using CUDA if it’s available.

The model definition has 2 parts:

- defining an `__init__()` method;
- defining a `forward()` method.

`__init__()` defines the structure/components of the model. The very first line of the class definition (`class FCN(nn.Module)`) tells us that the class we’re creating, `FCN`, is a subclass of (or inherits from) the `nn.Module` class. We then call the `nn.Module` init function (`super().__init__()`) and define what our model looks like. In this case, it’s a fully-connected sequential model. There are 10 inputs into the first layer since there are 10 columns in our `X` matrix. The choice to have 100 output features is arbitrary here, as is the size of the second linear layer (`nn.Linear(100, 100)`). The output size of the final layer is 1 since this is a regression problem, and we want our output to be a single number (in a classification problem, this would be size *k* where *k* is the number of classes in the y variable).

The `forward()` method defines the order we should call the model components in. In the current case, this is very straightforward, since we’ve already wrapped all of the individual layers in `nn.Sequential()` and assigned that sequential model to an object called `linear_stack`. So in `forward()`, all we need to do is call `linear_stack()`.

```
device = (
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)

#define a fully connected model
class FCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_stack = nn.Sequential(
            nn.Linear(10, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1)
        )

    def forward(self, x):
        ret = self.linear_stack(x)
        return ret

model = FCN().to(device)
print(model)
```

```
FCN(
  (linear_stack): Sequential(
    (0): Linear(in_features=10, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=100, bias=True)
    (3): ReLU()
    (4): Linear(in_features=100, out_features=1, bias=True)
  )
)
```
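As a quick illustration of what `nn.Sequential()` is doing, the sketch below rebuilds the same stack of layers on its own (named `stack` here so it doesn’t clash with the model above) and checks that calling the container gives the same result as passing the input through each layer in order.

```python
import torch
from torch import nn

# same layer sizes as the model above, rebuilt so this snippet runs on its own
stack = nn.Sequential(
    nn.Linear(10, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 1),
)

x = torch.randn(5, 10)

# calling the Sequential container...
out = stack(x)

# ...is the same as feeding x through each layer in order
manual = x
for layer in stack:
    manual = layer(manual)

print(out.shape)                 # torch.Size([5, 1])
print(torch.equal(out, manual))  # True
print(stack[0])                  # layers can be looked up by index
```

The integer indices used here are the same `(0)` through `(4)` labels that show up when the model is printed.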

We could also define the model like this:

```
#F.relu lives in torch.nn.functional
import torch.nn.functional as F

class FCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(10, 100)
        self.l2 = nn.Linear(100, 100)
        self.l3 = nn.Linear(100, 1)

    def forward(self, x):
        x = self.l1(x)
        x = F.relu(x)
        x = self.l2(x)
        x = F.relu(x)
        ret = self.l3(x)
        return ret
```

But that doesn’t feel quite as good to me.

### Define a Loss Function and Optimizer

These are fairly straightforward. The loss function is going to be mean squared error (MSE) since it’s a regression problem. There are lots of optimizers we could use, but I don’t think it actually matters all that much here, so I’ll just use stochastic gradient descent (SGD).

```
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
```
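For reference, here is a small sketch of what `nn.MSELoss()` computes, using three hand-picked predictions and targets (toy values, not from the data above) so the arithmetic can be checked by eye.

```python
import torch
from torch import nn

pred = torch.tensor([[1.0], [2.0], [4.0]])
target = torch.tensor([[1.5], [2.0], [2.0]])

mse = nn.MSELoss()

# MSE is the mean of the squared differences:
# ((-0.5)^2 + 0^2 + 2^2) / 3 = 4.25 / 3
by_hand = ((pred - target) ** 2).mean()

print(mse(pred, target).item())
print(by_hand.item())
```

Both lines print the same value, roughly 1.4167. Note that the loss is a single number averaged over the whole batch, which comes up again when computing the test loss.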

### Define a Train Method

Now we have a model architecture specified, we have a dataloader, we have a loss function, and we have an optimizer. These are the pieces we need to train a model. So we can write a `train()` function that takes these components as arguments. Here’s what this function could look like:

```
def train(dataloader, model, loss_fn, optimizer):
    #just getting the size for printing stuff
    size = len(dataloader.dataset)

    #note that model.train() puts the model in 'training mode', which switches
    #layers like dropout and batch norm to their training behavior
    #model.eval() is its contrasting mode
    model.train()

    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        #compute error
        pred = model(X)
        loss = loss_fn(pred, y)

        #backprop
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 10 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>5f} [{current:>5d}/{size:>5d}]")
```

Let’s walk through this:

- We get `size` just to help with printing progress
- `model.train()` puts the model in “train” mode, which makes layers such as dropout and batch norm use their training behavior (this model has neither, so here it’s mostly good practice; gradients are tracked either way)
- We then iterate over all of the batches in our dataloader…
  - We move `X` and `y` to the GPU (if it’s available)
  - Make predictions from the model
  - Calculate the loss (using the specified loss function)
  - Calculate the gradient of the loss function for all model parameters (via `loss.backward()`)
  - Update the model parameters by applying the optimizer’s rules (`optimizer.step()`)
  - Zero out the gradients so they can be calculated again (`optimizer.zero_grad()`)
- Then we do some printing at the end of the loop to show our progress.
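The backward / step / zero-out sequence above can be sketched on a toy problem where the gradient is easy to work out by hand. Here `w` is a hypothetical two-element parameter and the loss is just the sum of its squares, so the gradient is `2 * w`.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

# loss = sum(w^2), so the gradient with respect to w is 2w
(w ** 2).sum().backward()
grad_once = w.grad.clone()       # [2., 4.]

# a second backward() without zeroing adds to the stored gradient
(w ** 2).sum().backward()
grad_twice = w.grad.clone()      # [4., 8.]

# clear the stale gradient, recompute it, then take one SGD step:
# w <- w - lr * grad = [1, 2] - 0.1 * [2, 4] = [0.8, 1.6]
opt.zero_grad()
(w ** 2).sum().backward()
opt.step()

print(grad_once, grad_twice, w.detach())
```

Gradients accumulate across `backward()` calls, which is why each batch needs a `zero_grad()` somewhere between one `step()` and the next `backward()`. Calling it right after `step()`, as in `train()`, is one valid place.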

### Define the Test Method

Unlike the `train()` function we just defined, `test()` doesn’t do parameter optimization – it simply shows how well the model performs on a holdout (test) set of data. This is why we don’t need to include the optimizer as a function argument – we aren’t doing any optimizing.
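Skipping optimization is also why an evaluation function can put the model in evaluation mode and turn off gradient tracking. These are two separate switches. Below is a minimal sketch with toy layers; the `nn.Dropout` layer is only there to make the mode change visible and is not part of the model on this page.

```python
import torch
from torch import nn

layer = nn.Linear(3, 1)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 3)

# eval mode: dropout is a no-op, so the input passes through unchanged
drop.eval()
same = torch.equal(drop(x), x)

# train mode: each element is either zeroed or scaled by 1 / (1 - p) = 2
drop.train()
dropped = drop(x)

# no_grad: the forward pass runs, but no autograd graph is recorded
tracked = layer(x)
with torch.no_grad():
    untracked = layer(x)

print(same, tracked.requires_grad, untracked.requires_grad)  # True True False
```

So `eval()` changes how certain layers behave, while `torch.no_grad()` is what actually stops gradients from being tracked.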

Here’s the code for our `test()` function:

```
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)

    #set model into eval mode
    model.eval()
    test_loss = 0

    #set up so we're not calculating gradients
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item() * X.size(0)

    avg_loss = test_loss / size
    print(f"Avg Loss: {avg_loss:>7f}\n")
```

And we can walk through it:

- In this function, `size` is actually useful in calculating our loss. We have to do some kinda wonky sleight-of-hand when estimating the model loss here. We are using mean-squared error (MSE) as our loss function, and for each batch in the dataloader, it will give us the mean-squared error (a single number per batch). We then multiply this average loss by the size of the batch to get the “total” loss per batch, and we sum up all of the total batch losses to get the total overall loss (`test_loss` in the function). Since this is the total loss, we then have to divide by the number of observations to get us back to the mean squared error.
- `model.eval()` puts the model into evaluation mode, signaling that we’re evaluating rather than training (it switches layers like dropout to their inference behavior; turning off gradient tracking is handled separately by `torch.no_grad()`).
- We initialize `test_loss` to 0
- Then for the remainder, we make predictions (just like we did in the `train()` function) and calculate loss as described in the first bullet point above.
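The multiply-then-divide step matters because the last batch is usually smaller than the rest. Here is a small sketch in plain Python with made-up squared errors for 5 observations and a batch size of 2, comparing the size-weighted average against a naive average of the per-batch means.

```python
# squared errors for 5 observations, split into batches of size 2
squared_errors = [1.0, 3.0, 2.0, 2.0, 10.0]
batches = [squared_errors[i:i + 2] for i in range(0, len(squared_errors), 2)]

# what an MSE loss would return for each batch: [2.0, 2.0, 10.0]
batch_means = [sum(b) / len(b) for b in batches]

# weight each batch mean by its size, then divide by the total count
weighted = sum(m * len(b) for m, b in zip(batch_means, batches)) / len(squared_errors)

# naively averaging the batch means over-weights the short final batch
naive = sum(batch_means) / len(batch_means)

overall = sum(squared_errors) / len(squared_errors)
print(weighted, naive, overall)
```

The weighted version matches the true overall mean (3.6), while the naive version comes out higher (about 4.67) because the single observation in the last batch counts as much as two observations elsewhere.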

### Train the Model

Now we can finally train the model. We’ll train over 5 “epochs”, i.e. 5 passes through the full dataset. This is arbitrary here – in real-world contexts this is a number we probably want to tune for or at least choose carefully.

We do the training with a simple `for` loop, and during each iteration through the loop we:

- train the model;
- show the performance on the test set

Since we included some print statements in our `train()` and `test()` functions, we can monitor the progress of our model’s training.

```
epochs = 5

for i in range(epochs):
    print(f"Epoch {i+1}\n------------------")
    train(trn_dl, model, loss_fn, opt)
    test(tst_dl, model, loss_fn)

print("Done!")
```

```
Epoch 1
------------------
loss: 11.559361 [ 64/ 8000]
loss: 10.150082 [ 704/ 8000]
loss: 12.431366 [ 1344/ 8000]
loss: 8.781148 [ 1984/ 8000]
loss: 11.217168 [ 2624/ 8000]
loss: 8.060587 [ 3264/ 8000]
loss: 9.669640 [ 3904/ 8000]
loss: 9.624810 [ 4544/ 8000]
loss: 9.808482 [ 5184/ 8000]
loss: 10.177684 [ 5824/ 8000]
loss: 8.251064 [ 6464/ 8000]
loss: 7.369075 [ 7104/ 8000]
loss: 8.711451 [ 7744/ 8000]
Avg Loss: 9.769837
Epoch 2
------------------
loss: 10.468843 [ 64/ 8000]
loss: 9.218903 [ 704/ 8000]
loss: 11.100779 [ 1344/ 8000]
loss: 7.746999 [ 1984/ 8000]
loss: 9.873242 [ 2624/ 8000]
loss: 7.149283 [ 3264/ 8000]
loss: 8.435396 [ 3904/ 8000]
loss: 8.367870 [ 4544/ 8000]
loss: 8.155600 [ 5184/ 8000]
loss: 8.412899 [ 5824/ 8000]
loss: 6.608739 [ 6464/ 8000]
loss: 5.931441 [ 7104/ 8000]
loss: 6.928466 [ 7744/ 8000]
Avg Loss: 7.608491
Epoch 3
------------------
loss: 7.947855 [ 64/ 8000]
loss: 7.037781 [ 704/ 8000]
loss: 8.046132 [ 1344/ 8000]
loss: 5.375796 [ 1984/ 8000]
loss: 6.869387 [ 2624/ 8000]
loss: 4.945308 [ 3264/ 8000]
loss: 5.555494 [ 3904/ 8000]
loss: 5.423590 [ 4544/ 8000]
loss: 4.637858 [ 5184/ 8000]
loss: 4.751605 [ 5824/ 8000]
loss: 3.392872 [ 6464/ 8000]
loss: 2.943655 [ 7104/ 8000]
loss: 3.505284 [ 7744/ 8000]
Avg Loss: 3.615225
Epoch 4
------------------
loss: 3.360711 [ 64/ 8000]
loss: 3.129750 [ 704/ 8000]
loss: 3.220852 [ 1344/ 8000]
loss: 2.028703 [ 1984/ 8000]
loss: 2.811476 [ 2624/ 8000]
loss: 2.152989 [ 3264/ 8000]
loss: 2.169516 [ 3904/ 8000]
loss: 2.104460 [ 4544/ 8000]
loss: 1.674014 [ 5184/ 8000]
loss: 1.971039 [ 5824/ 8000]
loss: 1.329351 [ 6464/ 8000]
loss: 1.101499 [ 7104/ 8000]
loss: 1.521148 [ 7744/ 8000]
Avg Loss: 1.489543
Epoch 5
------------------
loss: 1.216901 [ 64/ 8000]
loss: 1.191105 [ 704/ 8000]
loss: 1.300725 [ 1344/ 8000]
loss: 0.935616 [ 1984/ 8000]
loss: 1.545732 [ 2624/ 8000]
loss: 1.388198 [ 3264/ 8000]
loss: 1.273956 [ 3904/ 8000]
loss: 1.170510 [ 4544/ 8000]
loss: 1.217884 [ 5184/ 8000]
loss: 1.544257 [ 5824/ 8000]
loss: 1.180285 [ 6464/ 8000]
loss: 0.944844 [ 7104/ 8000]
loss: 1.244212 [ 7744/ 8000]
Avg Loss: 1.219065
Done!
```
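The counters in the progress lines above can be sanity-checked with a bit of arithmetic. This sketch assumes the same settings used on this page: 8,000 training rows, a batch size of 64, and a print every 10 batches.

```python
import math

n_train = 8000
batch_size = 64
log_every = 10

n_batches = math.ceil(n_train / batch_size)     # batches per epoch
logged = list(range(0, n_batches, log_every))   # batch indices that print

# the counter printed by train() is (batch + 1) * batch_size
counters = [(b + 1) * batch_size for b in logged]

print(n_batches, len(counters), counters[0], counters[-1])  # 125 13 64 7744
```

That gives 125 batches per epoch and 13 printed lines, running from 64 to 7744, which lines up with the log for each epoch.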

And that’s a complete step-by-step for a fully-connected neural net!