Introduction to Statistical Learning with R

A Lesson in Statistical Learning

This post follows along with Springer’s Statistical Learning with R, with some changes to the book’s R code to make it more reproducible and to introduce good R practices.

We will mainly focus on statistical learning in this lesson, but what is statistical learning? Essentially, it is what I have described in other posts where I referred to Machine Learning. Statistical Learning can be used interchangeably with Machine Learning.

Let us pretend we have a dataset with predictors for weather. Dewpoint, temperature, and wind speed could all be input variables (aka independent variables, predictors, or features). These independent variables are often denoted as $X_n$. For example, $X_1 = Temperature$, $X_2 = Dewpoint$, etc… These are used to predict whether it will be sunny or rainy, also known as the output variable (aka dependent variable or response). The output variable is normally denoted with $Y$.

Suppose we observe a quantitative response $Y$ and $p$ different predictors. We will use $X$ in our formula to represent all of the predictors, so $X = (X_1, X_2, … , X_p)$. Our formula becomes…

$$ \begin{equation} Y = f(X) + \epsilon \end{equation} $$

where $f$ is a fixed but unknown function of $X_1$, $X_2$, … , $X_p$ and $\epsilon$ is a random error term, which is independent of $X$ and has mean zero.
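To make the formula concrete, here is a small simulated sketch in R. The true $f$ and the noise are invented purely for illustration:

```r
### A hypothetical true f and noisy observations of it (numbers invented)
set.seed(42)
f <- function(x) 2 + 3 * x          # the fixed but unknown f
x <- runif(100)                     # a single predictor X_1
epsilon <- rnorm(100)               # random error: independent of X, mean zero
y <- f(x) + epsilon                 # Y = f(X) + epsilon
mean(epsilon)                       # close to zero
```

Each observed $y$ is the true $f(x)$ plus a random wobble, and those wobbles average out to roughly zero.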

Taking a look at some advertising sample data

In the following sections I will use some $R$ code to show an example of the statistical learning concepts we are discussing, as well as some basic $R$ programming skills.

```r
### Load the libraries
### If you have never installed an R package before, then this will cause an error
### To install the following packages you can either do them one at a time with:
### install.packages("DBI")
### Or you can do several at a time using the `c()` function (combine)
### install.packages(c("DBI", "RPostgres",...))
library(DBI)
library(RPostgres)
library(getPass)
library(ggplot2)
library(gridExtra)
library(rlang)

### In this block I am connecting to my personal Postgres database
### If you subscribe to my blog you can get access to this data, too, and follow along more easily

### Using `readline()` I can enter my personal information every time without needing to hard-code it as an argument
### `getPass()` lets me hide sensitive data
dbname <- readline("Database: ")
host <- readline("Host: ")
port <- 5432
user <- readline("Username: ")
pwd <- getPass("Password: ")
### The function to connect to the Postgres instance
con <- dbConnect(Postgres(),
                 dbname = dbname,
                 host = host,
                 port = port,
                 user = user,
                 password = pwd)
### Using SQL to pull the information I need
adv <- dbGetQuery(con, "SELECT * FROM machine_learning.advertising")
```

```r
### View the first couple of rows with the `head()` function
head(adv)
```
| index | tv  | radio | newspaper | sales |
| --- | --- | --- | --- | --- |
| 1   | 230.1 | 37.8 | 69.2 | 22.1 |
| 2   | 44.5 | 39.3 | 45.1 | 10.4 |
| 3   | 17.2 | 45.9 | 69.3 | 9.3 |
| 4   | 151.5 | 41.3 | 58.5 | 18.5 |
| 5   | 180.8 | 10.8 | 58.4 | 12.9 |
| 6   | 8.7 | 48.9 | 75.0 | 7.2 |
```r
### R has a really fantastic plotting package (dare I say better than Python's Matplotlib...) called ggplot
### ggplot will be the basis of many of our visualizations

### Since I will be repeating a lot of typing with similar functionality, I decided to create a function
### that will repeat the same scatterplot making as many times as I use the function

### I want to define the x input (the independent variable), the y input (dependent variable) and
### the dataset that I will use
scatter_plot_sales <- function(x, y, data) {
    # ggplot requires that the variables x, y be passed as arguments
    # but the arguments don't technically exist in our local function
    # yet since we have not defined our dataset
    # This will cause R to throw an error, because there is no 'sales' variable etc...
    # With `enquo()` we take the parameter of a function as an expression
    # That means we do not evaluate it; we just take the expression, leave it on hold, and use it later on
    x <- enquo(x)
    y <- enquo(y)
    # The "!!" we see in the following line unquotes those x, y expressions
    # so that ggplot evaluates them inside `data`
    plot <- ggplot(data=data, aes(x=!!x, y=!!y)) +
        # Add the points as shape=1 (donuts) and the color red with a stroke of 1.5
        geom_point(shape=1, color="red", stroke=1.5) +
        # Add the linear model line and remove the standard error band (se)
        geom_smooth(method=lm, se=FALSE, size=2)
    # Return the plot so it can be stored and arranged later
    plot
}

tv <- scatter_plot_sales(tv, sales, adv)
radio <- scatter_plot_sales(radio, sales, adv)
newspaper <- scatter_plot_sales(newspaper, sales, adv)

### Make sure the graph is wide and not all scrunched together
options(repr.plot.width=15, repr.plot.height=6)

### `grid.arrange()` allows us to plot several graphs in the same row
grid.arrange(tv, radio, newspaper, nrow=1)
```

Estimating $f$

There are two main reasons why we are trying to estimate $f$:

  • Prediction - trying to estimate future values based on input variables
  • Inference - we are not looking to predict $Y$, but instead understand the relationship between $X$ and $Y$


In most cases, $X$ (the input variables) is widely available, but $Y$ (the output variable) is not as easily obtained. Since the error term $\epsilon$ averages to zero, we can predict $Y$ without it. I was initially confused about why $\epsilon$ averages to zero, since I had not done statistical modeling in a while. The reason is that $\epsilon$ is, by definition, independent of $X$ with mean zero, so while any single prediction will be off by some random amount, those errors cancel out on average and there is nothing systematic left for the model to capture. In our previous model, $\epsilon$ was the distance each point fell from our linear model line: think of the red dots' distance away from the blue line in the charts above. We can now predict $Y$ using the following function.

$$ \begin{equation} \hat{Y} = \hat{f}({X}) \end{equation} $$

$\hat{f}$ is our estimate of $f$, and $\hat{Y}$ represents the resulting prediction for $Y$. $\hat{f}$ is normally treated as a black box when we are predicting, because we are focused on getting an accurate prediction for $Y$ rather than on the exact form of $\hat{f}$.

Reducible and Irreducible Error

Let us assume that $X_1, … , X_p$ are the characteristics of certain weather measurements, and $Y$ is the chance of precipitation. If we are planning a day at the beach we are very interested in using $X$ to predict $Y$, but how do we measure the accuracy of $\hat{Y}$ as a prediction for $Y$? This depends on two quantities, reducible error and irreducible error.

Reducible Error

Mostly, $\hat{f}$ will not be a perfect estimate for $f$, and this imperfection will in turn create some error. We consider this error reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate technique to estimate $f$.

Irreducible Error

Let’s consider that even if it were possible to form a perfect estimate for $f$ so that $\hat{Y} = f(X)$, our prediction would still have some error in it (think back to the charts above where $Y$ and $f(X)$ are known, but there is still some error when trying to get $f(X)$ to equal $Y$). This is because $Y$ is also a function of $\epsilon$, which by definition cannot be predicted using $X$. This means that the variability associated with $\epsilon$ affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate $f$ we will not be able to reduce the error introduced by $\epsilon$.

The reason that the irreducible error always exists is because $\epsilon$ may contain unmeasured variables that are useful in predicting the outcome variable. Any influence that is not captured in $f$ ends up in $\epsilon$.

Consider the following formula, $E(Y-\hat{Y})^2$, which represents the average (or expected) squared difference between the predicted and actual values. Given that we know $Y = f(X) + \epsilon$ and $\hat{Y} = \hat{f}(X)$, we can substitute these expressions to decompose the error into its reducible and irreducible parts.

$$ \begin{equation} E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 \end{equation} $$

$$ \begin{equation} \hspace{3.3cm} = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible error}} + \underbrace{Var(\epsilon)}_{\text{irreducible error}} \end{equation} $$
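We can see the irreducible floor in a quick simulation. Even when we cheat and use the true $f$ as our estimate, the MSE stays near $Var(\epsilon)$; the functions and noise level below are made up for illustration:

```r
set.seed(1)
f <- function(x) 2 + 3 * x
x <- runif(10000)
y <- f(x) + rnorm(10000, sd = 1)       # Var(epsilon) = 1
### Perfect estimate: f_hat is exactly f, so the reducible error is zero,
### yet the MSE still sits near Var(epsilon) = 1
mse_perfect <- mean((y - f(x))^2)
### Imperfect estimate: reducible error stacks on top of the irreducible floor
f_hat <- function(x) 2.5 + 2 * x
mse_imperfect <- mean((y - f_hat(x))^2)
c(mse_perfect, mse_imperfect)
```

No choice of $\hat{f}$ can push the error below the first number; better estimation only shrinks the gap between the two.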


Inference

Where in prediction we were solely focused on predicting $Y$, in inference we are curious about how $Y$ is affected as $X_1, … , X_p$ change. Our goal is still to estimate $f$, but we are not necessarily interested in making predictions for $Y$; we are more focused on the relationship between $X$ and $Y$.

We are essentially interested in answering 3 questions:

  • Which predictors are associated with the response?

    • Often, only a few predictors are significantly associated with $Y$, and identifying those few is very valuable.
  • What is the relationship between the response and each predictor?

    • Some predictors have positive relationships with the response, some have negative relationships, and some are interwoven with other predictors
  • Can you develop a linear equation from $Y$ and the predictors, or is a more complex model required?

    • Sometimes models need to be more complex than a simple linear equation

A good example is the charts we created earlier: we may be more interested in which advertising medium increases sales (i.e. focusing on our predictors) than we are in predicting our sales.

How do we Estimate $f$?

Let $x_{ij}$ represent the value of the $j$th predictor, or input, for observation $i$, where $i = 1,2,…,n$ and $j = 1,2,…,p$. Correspondingly, let $y_i$ represent the training data response variable for the $i$th observation. Then our training data consist of:

$$ \begin{equation} \{(x_1, y_1), (x_2, y_2), … , (x_n, y_n)\} \hspace{1cm}\text{where}\hspace{1cm} x_i = (x_{i1}, x_{i2}, … , x_{ip})^T \end{equation} $$

The ultimate goal is to apply a statistical learning method to the training data in order to estimate the unknown function $f$.

Parametric Methods

The parametric method relies on estimating a set of parameters that best fit $f$, i.e. it looks at how your inputs affect $f$. If your inputs (predictors) for predicting $\hat{Y}$ (predicted fuel burn) are engine size and miles traveled, it will assign coefficient values to your $\beta_p$. For example:

$$ \begin{equation} \beta_0 = 10 \hspace{.2cm} (\text{the intercept}) \end{equation} $$

$$ \begin{equation} \beta_1 = .5 \hspace{.2cm} (\text{multiply this by engine size}) \end{equation} $$

$$ \begin{equation} \beta_2 = .75 \hspace{.2cm} (\text{multiply this by miles traveled}) \end{equation} $$

$$ \begin{equation} \hat{Y} = 10 + .5X_1 + .75X_2 \end{equation} $$
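Plugging the made-up coefficients above into the formula, with hypothetical values for the two inputs:

```r
### The made-up coefficients from above
beta_0 <- 10; beta_1 <- 0.5; beta_2 <- 0.75
engine_size <- 2.0       # X_1 (hypothetical value)
miles_traveled <- 12     # X_2 (hypothetical value)
y_hat <- beta_0 + beta_1 * engine_size + beta_2 * miles_traveled
y_hat                    # 10 + 0.5*2 + 0.75*12 = 20
```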

An issue we will run into with the parametric approach is that our chosen form will almost never match the true unknown form of $f$. We can try to minimize this by selecting more flexible models that estimate several more parameters, but then we run into the problem of overfitting the data.

When we are using a parametric approach, we follow a two-step process.

  1. We need to take a guess or make an assumption about the functional form or shape of $f$. For example, if the model is very simple (think of plotting engine size versus miles per gallon) we can assume that the model is most likely linear, i.e.

$$ \begin{equation} f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p \end{equation} $$

Once we have assumed that $f$ is linear, we only need to estimate the $p + 1$ coefficients, eliminating the need to estimate an arbitrary $p$-dimensional function.

  2. Once the model is selected, we need a procedure to train it. Since in step 1 we determined that we have a linear model, we need to find the coefficients, better known as the parameters $\beta_0, \beta_1, … , \beta_p$.
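In R, step 2 is typically done with ordinary least squares via `lm()`. A sketch on the advertising data; the six rows below are just the `head(adv)` rows shown earlier, so the fitted coefficients are only illustrative:

```r
### A tiny stand-in built from the six advertising rows shown earlier
adv <- data.frame(tv = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
                  radio = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
                  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
                  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2))
### Step 2: ordinary least squares finds beta_0 ... beta_p
fit <- lm(sales ~ tv + radio + newspaper, data = adv)
coef(fit)
```

`coef(fit)` returns the estimated intercept $\beta_0$ plus one coefficient per predictor, which is exactly the $p + 1$ numbers step 1 left us to find.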

Non-Parametric Methods

Non-parametric methods are indifferent to the functional form of $f$; they focus on fitting the data points themselves while trying not to be too rough or wiggly. Because no assumptions are made about the form of $f$, we have the ability to fit a wider range of shapes. Non-parametric methods do need a significant amount of data, precisely because they try to fit the data points without being guided by a small set of parameters.

Prediction Accuracy vs Model Interpretability Trade-Off

In the figure above, there are some model type examples that show the trade-off between interpretability and flexibility. So when would I want interpretability and when would I want flexibility? Think back to our two reasons for estimating $f$ in the first place: prediction and inference. If we are trying to predict, we may not care how $f$ is estimated or how the inputs drive the result, so we can rely on a more flexible model. If we are worried about inference, i.e. understanding the inputs and how they affect $\hat{Y}$, we want to create a more interpretable model.

Supervised vs. Unsupervised Learning

Since supervised vs unsupervised learning is covered quite extensively in other posts, this will be an abbreviated refresher. The quickest way to remember is that supervised learning has labels (outputs are known). Consider our linear regression model where the miles per gallon are known. Since we have the labels of mpg to test against, our model is supervised, as we can drop in and look at the true outputs. In unsupervised learning, the labels are not known. Consider if we did not have mpg available: we could still see how engine size and miles traveled relate to each other, and we could then make the assumption that the bigger the engine, the less distance the vehicle can travel.

Regression vs Classification

Again, since regression vs classification is covered quite extensively in other posts, this will be an abbreviated refresher. Essentially, you will use regression when your output variable is numerical (double, float, decimal, or any other quantitative measure), like in our mpg model we have been discussing. Otherwise, if your model has classes or objects (i.e. Honda, Toyota, Nissan), then you will need to build a classification model. In the post I linked above, there are examples of how you can turn categorical variables into quantitative ones with ranking and one-hot encoding, but this is outside the scope of this post.

Measuring Model Accuracy

It is important in statistical learning to determine which model has the best fit, given the goal you are trying to achieve. Keep in mind, some models may be more accurate than others, but you cannot draw inferences from them since they are highly complex. No one model is inherently better than any other. It is up to the data scientist to determine what is best for their circumstance.

Quality of Fit

When determining how well our model performs, we want to see how closely the data fit the model. When building regression models, the usual quality measurement is the Mean Squared Error (MSE), where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$th observation. Remember that $\hat{f}(x_i)$ is our predicted value of $Y$ at point $i$ and $y_i$ is our actual value of $Y$. We square all of the differences and then divide by $n$ (the total number of observations).

$$ \begin{equation} MSE=\frac{1}{n}\sum_{i=1}^{n}\big(y_i-\hat{f}(x_i)\big)^2 \end{equation} $$

Ideally, we would want the MSE to be as small as possible. The issue is that if the training MSE is too small, there is a chance we have overfit our data. This is a good reason why we split our data into train (normally 80% of the data) and test (the remaining 20%) sets.
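A minimal sketch of both ideas, the MSE formula and an 80/20 split, on simulated data (the generating function and noise level are invented for illustration):

```r
set.seed(7)
### Mean Squared Error: average squared distance between actual and predicted
mse <- function(y, y_hat) mean((y - y_hat)^2)
### Simulated data with a known linear f and noise sd = 0.3
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.3)
### 80% train / 20% test split
train_idx <- sample(n, size = 0.8 * n)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
fit <- lm(y ~ x, data = train)
### Test MSE should land near Var(epsilon) = 0.09
mse(test$y, predict(fit, newdata = test))
```

Evaluating on the held-out 20% is what lets us notice overfitting: a model that memorized the training set would show a tiny training MSE but a much larger test MSE.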

Bias-Variance Trade-Off

In order to achieve the lowest possible test error we need to build a model with the lowest possible variance and the lowest possible bias. The expected test error can be viewed as the sum of 3 separate parts: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term $\epsilon$:

$$ \begin{equation} E\big(y_0-\hat{f}(x_0)\big)^2=Var\big(\hat{f}(x_0)\big)+[Bias\big(\hat{f}(x_0)\big)]^2+Var(\epsilon) \end{equation} $$


Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training set. If we use different data, then we would expect our $\hat{f}$ to change, since the data is different, even if ever so slightly. If we substituted one data point in a million-point dataset, it has the potential to change $\hat{f}$. Our goal in building a model is to make sure that it is robust enough that $\hat{f}$ does not change very much between training datasets. If a method has high variance, then small changes in the training data will drastically change $\hat{f}$. Generally, the more flexible the model, the higher the variance.


Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, a linear regression model is one of the simplest models that can be used in machine learning, and it is unlikely that any real world problem can easily be solved by a straight line. Yet, we tried to fit our car model to a regression model. In this case, we introduced a high bias.

In general, as we simplify our model the variance becomes lower but the bias increases, and vice-versa as we increase the complexity of our model. Granted, the trade-off may not be proportionately inverse; see the figure above.
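The variance part of the trade-off is easy to see empirically. Here is a small sketch (generating process invented for illustration) that refits the same model on many fresh training sets and measures how much the slope estimate wobbles:

```r
set.seed(5)
### Refit the same linear model on 100 fresh training sets drawn from the
### same data-generating process (true slope = 2, noise sd = 0.5)
slopes <- replicate(100, {
  x <- runif(50)
  y <- 1 + 2 * x + rnorm(50, sd = 0.5)
  coef(lm(y ~ x))[2]
})
### The spread of these estimates is the variance of f_hat across training sets
var(slopes)
```

A more flexible method fit to the same 100 training sets would show a larger spread, which is exactly the "high variance" half of the trade-off.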


In the previous section we mentioned classification vs regression, and since we have discussed regression at length, let’s talk a little about classification. In classification, we look to assign a class to a prediction, not a value. Let $\hat{y}_i$ be the predicted class label for the $i$th observation using $\hat{f}$, and let $I(y_i \neq \hat{y}_i)$ be an indicator variable that equals $1$ if $y_i \neq \hat{y}_i$ and $0$ if $y_i = \hat{y}_i$. Thus, the error rate can be represented as follows:

$$ \begin{equation} \frac{1}{n}\sum_{i=1}^{n}I(y_i \neq \hat{y}_i) \end{equation} $$

If we get a prediction correct, its error contribution is $0$; otherwise it is $1$. That is, $I(y_i \neq \hat{y}_i) = 1$ for a misclassification and $0$ for a correct classification.
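In R the indicator-sum formula collapses to a single `mean()`; the labels here are made up for illustration:

```r
### Actual and predicted class labels (made up for illustration)
y     <- c("rain", "sun", "sun", "rain", "sun")
y_hat <- c("rain", "sun", "rain", "rain", "sun")
### `y != y_hat` gives the indicator I(y_i != y_hat_i) as TRUE/FALSE,
### and mean() averages those 0/1 values into the error rate
error_rate <- mean(y != y_hat)
error_rate   # 1 mistake out of 5 = 0.2
```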

Bayes Classifier

We minimize the error rate when we assign each observation to its most likely class, given its predictor values. That is, we assign an observation with predictor vector $x_0$ to the class $j$ for which the following conditional probability is largest:

$$ \begin{equation} Pr(Y=j|X=x_0) \end{equation} $$

This very simple classifier is called the Bayes classifier. In a two-class problem there are only two possible classes, class 1 and class 2, and we predict class 1 whenever:

$$ \begin{equation} Pr(Y=1|X=x_0) > 0.5 \end{equation} $$

The Bayes Classifier produces the lowest possible error rate called the Bayes Error Rate. The overall error rate is given by:

$$ \begin{equation} 1-E\big(\max_{j}Pr(Y=j|X)\big) \end{equation} $$
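As a toy sketch of the two-class rule above (the probabilities are invented for illustration):

```r
### Two-class Bayes rule: predict class 1 whenever Pr(Y = 1 | X = x_0) > 0.5,
### otherwise predict class 2
bayes_predict <- function(p1) ifelse(p1 > 0.5, 1, 2)
bayes_predict(0.7)   # class 1
bayes_predict(0.3)   # class 2
```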

K-Nearest Neighbors

In theory, we would always like to predict qualitative responses using the Bayes classifier, but we do not know the conditional distribution of $Y$ given $X$, so computing the Bayes classifier is impossible; it is an unattainable gold standard. Many approaches attempt to estimate the conditional distribution of $Y$ given $X$, and then classify a given observation to the class with the highest estimated probability. This is what the K-nearest neighbors (KNN) classifier attempts to do. Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$, and then estimates the conditional probability as:

$$ \begin{equation} Pr(Y=j|X=x_0)=\frac{1}{K}\sum_{i \in N_0}I(y_i=j) \end{equation} $$

For example, if we choose $K=3$, the KNN classifier will look at the 3 closest neighbors of each point and assign the majority class to the point. So, if two of the three neighbors are blue and one is red, the model will assign blue to the point.
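The `class` package that ships with R implements this as `knn()`. A sketch on simulated two-class data (clusters and labels invented for illustration):

```r
library(class)  # ships with R; provides knn()
set.seed(3)
### Two simulated clusters of 20 points each, labelled "blue" and "red"
train_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 3), ncol = 2))
train_y <- factor(rep(c("blue", "red"), each = 20))
### Classify a new point at the origin by majority vote among its
### K = 3 nearest training points
pred <- knn(train = train_x, test = matrix(c(0, 0), ncol = 2),
            cl = train_y, k = 3)
pred
```

Since the origin sits inside the "blue" cluster, all three of its nearest neighbors are blue and the vote is unanimous.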

Summary and Examples

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

  • (a) The sample size n is extremely large, and the number of predictors p is small. - We would expect a flexible model to perform better because we could fit the large amount of data more closely than an inflexible method could.
  • (b) The number of predictors p is extremely large, and the number of observations n is small. - We would expect an inflexible method to perform better; since the sample size is so small, we would risk overfitting with a flexible method.
  • (c) The relationship between the predictors and response is highly non-linear. - We would expect a flexible method to perform better since the data is not easily fit by a linear model.
  • (d) The variance of the error terms, i.e. $\sigma^2 = Var(\epsilon)$, is extremely high. - We would expect an inflexible method to perform better, because a flexible method would try to fit the noise introduced by the high variance.

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.

  • (a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary. - In this scenario, we would build a regression model with the predictors (p) [profit, number of employees, industry (can be one-hot encoded)] and a sample size (n) of 500. We are interested in what factors affect our label (output variable) of CEO salary, which means we are interested in inference.
  • (b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables. - In this scenario, we would build a classification model since our output is binary (success or failure), with our predictors (p) being [price, marketing budget, competition price, and 10 other variables] and our sample size (n) being 20 (the similar products). We are interested in prediction, because all we care about is the chance of success for the product.
  • (c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market. - In this scenario, we would build a regression model since we want a quantitative output, with our predictors (p) being [% change in the US/British/German markets] and a sample size (n) of 52 (weeks in a year), and we are looking for a predictive output.

```r
### Import the `college` dataset
### NOTE: the table name below is assumed to live in the same schema as the advertising data
college <- dbGetQuery(con, "SELECT * FROM machine_learning.college")
### Convert the `college` column to rownames, then remove the `college` column
rownames(college) <- college[,1]
college <- college[, -1]
### Convert the `private` column to a factor
college$private <- as.factor(college$private)
### Since we pulled `college` from a Postgres DB, the class of each integer
### column is integer64 and we need to convert it to int for R to understand
### Loop over the column names
for (i in colnames(college)){
    ### if the column has a class of integer64 then...
    if (class(college[,i]) == "integer64") {
        ### convert it to integer
        college[,i] <- as.integer(college[,i])
    }
    ### if it doesn't have a class of integer64, leave it alone
}
```

```r
### Take a look at the data
head(college)
```
|   idx  | private | apps | accept | enroll | top10perc | top25perc | f_undergrad | p_undergrad | outstate | room_board | books | personal | phd | terminal | sf_ratio | perc_alumni | expend | grad_rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abilene Christian University | Yes | 1660 | 1232 | 721 | 23  | 52  | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70  | 78  | 18.1 | 12  | 7041 | 60  |
| Adelphi University | Yes | 2186 | 1924 | 512 | 16  | 29  | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29  | 30  | 12.2 | 16  | 10527 | 56  |
| Adrian College | Yes | 1428 | 1097 | 336 | 22  | 50  | 1036 | 99  | 11250 | 3750 | 400 | 1165 | 53  | 66  | 12.9 | 30  | 8735 | 54  |
| Agnes Scott College | Yes | 417 | 349 | 137 | 60  | 89  | 510 | 63  | 12960 | 5450 | 450 | 875 | 92  | 97  | 7.7 | 37  | 19016 | 59  |
| Alaska Pacific University | Yes | 193 | 146 | 55  | 16  | 44  | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76  | 72  | 11.9 | 2   | 10922 | 15  |
| Albertson College | Yes | 587 | 479 | 158 | 38  | 62  | 678 | 41  | 13500 | 3335 | 500 | 675 | 67  | 73  | 9.4 | 11  | 9727 | 55  |
```r
### Check the summary statistics for `college`
summary(college)
```

```
 private        apps           accept          enroll       top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   top25perc      f_undergrad     p_undergrad         outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   room_board       books           personal         phd        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    terminal        sf_ratio      perc_alumni        expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   grad_rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00  
```
```r
### Create a pairs plot for the first 10 columns of `college`
pairs(college[,1:10], col="green", col.axis="white")
```

```r
### Create a boxplot of Out of State tuition vs Private
ggplot(data = college, mapping = aes(x = private, y = outstate)) +
    geom_boxplot()
```

```r
### Create a new column in the `college` dataset that flags the Elite
### status of the university
### Create a vector of "No" repeated once per row of `college`
Elite <- rep("No", nrow(college))
### If more than 50% of the students came from the top 10% of their high
### school class, flip the flag to "Yes"
Elite[college$top10perc > 50] <- "Yes"
### Convert it to a factor
Elite <- as.factor(Elite)
### Add it to the `college` dataframe
college$Elite <- Elite
ggplot(data = college, mapping = aes(x = Elite, y = outstate)) +
    geom_boxplot()
```

```r
### Plot four histograms in a 2x2 grid
par(mfrow = c(2,2))
hist(college$top10perc, col="green", col.axis="white", col.main="white")
hist(college$top25perc, col="green", col.axis="white", col.main="white")
hist(college$expend, col="green", col.axis="white", col.main="white")
hist(college$accept/college$apps, col="green", col.axis="white", col.main="white")
```