---
title: "Regression_lab"
author: "Kathleen"
date: "3/26/2019"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


Load the necessary packages

```{r}
library(tidyverse)
library(e1071)
library(psych)
library(corrplot)
```


Let's start with the cereal data and get a sense of what it contains.

```{r cereal}
cereal_df <- read_csv("cereal.csv")
```


1. Make distribution plots for each feature 


```{r}
ggplot(cereal_df) + geom_bar(aes(x = Shelf))

```

```{r}

```


```{r visualize}
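# Histogram of calories with 10 equal-width bins
ggplot(data = cereal_df,
       mapping = aes(x = Calories)) + geom_histogram(bins = 10)
# Breaks must span the observed calorie range; fixed breaks of -3..3 only suit standardized data
breaks <- pretty(range(cereal_df$Calories, na.rm = TRUE), n = 10)
# Density-scaled histogram with a kernel density curve on top
ggplot(cereal_df, aes(x = Calories)) +
  geom_histogram(aes(y = after_stat(density)), breaks = breaks, position = "identity") +
  geom_density()
# Numeric summary of every column
summary(cereal_df)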

```

```{r}

```

Make sure you review your data for skew and kurtosis.

Skew: a measure of the asymmetry of the distribution. Rules of thumb for skewness:

- A normal distribution has skewness = 0.
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.
- If the skewness is less than -1 or greater than 1, the data are highly skewed.

Kurtosis: a measure of the weight in the combined tails of the distribution. Kurtosis decreases as the tails become lighter and increases as the tails become heavier. The values below refer to excess kurtosis (kurtosis minus 3), so that a normal distribution sits at 0:

- Excess kurtosis = 0: same tail weight as a normal distribution.
- Excess kurtosis less than 0: less weight in the tails than a normal distribution.
- Excess kurtosis greater than 0: more weight in the tails than a normal distribution.
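
As a rough illustration of the rules of thumb above, the sketch below computes moment-based skewness and excess kurtosis in base R. The helper names `skew_moment` and `excess_kurtosis_moment` and the two small vectors are made up for this example. The `e1071` package loaded earlier provides `skewness()` and `kurtosis()`, whose default estimator (`type = 3`) differs slightly from this moment-based version (`type = 1`).

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over the squared second, minus 3
excess_kurtosis_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Hypothetical example vectors (not taken from cereal.csv)
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 1, 2, 2, 3, 10)

skew_moment(symmetric)              # symmetric data: skewness of 0
skew_moment(right_skewed)           # long right tail: positive skewness
excess_kurtosis_moment(symmetric)   # flat, light-tailed data: negative excess kurtosis
```

With the lab data, the same functions could be applied to a numeric column such as `cereal_df$Calories`, assuming that column is present and numeric.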

```{r}

```


```{r}

```


```{r}

```

2. Check out the correlations between the variables 
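
One possible way to approach this is to build a correlation matrix from the numeric columns only. The sketch below uses a small made-up data frame (`toy`, with invented values and a hypothetical `Brand` column) rather than the real cereal data, so the numbers mean nothing beyond the illustration.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  Calories = c(70, 110, 120, 100, 150, 90),
  Sugar    = c(1, 8, 12, 6, 14, 3),
  Fiber    = c(10, 2, 0, 3, 1, 5),
  Brand    = c("A", "B", "A", "C", "B", "C")
)

# Keep only numeric columns, since cor() cannot handle character data
num_cols <- toy[sapply(toy, is.numeric)]

# Pairwise Pearson correlations, ignoring missing values pair by pair
cor_mat <- cor(num_cols, use = "pairwise.complete.obs")
round(cor_mat, 2)
```

A matrix like `cor_mat` can then be passed to `corrplot::corrplot()` from the package loaded above to draw it as a heat map.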


```{r}

```


```{r}
pairs.panels(cereal_df)
```

4. Let's say we are trying to predict calories given the other variables. Which variables would you consider as potential features?


5. Build a linear model and use backward elimination to remove insignificant terms.
```{r}

```
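
A minimal sketch of p-value-based backward elimination, assuming a 0.05 significance cutoff (the lab does not state one). It runs on simulated data with hypothetical columns `x1`, `x2`, and `noise`, not on the cereal data. At each pass, `drop1()` tests every remaining term with an F-test and the least significant one is dropped until all remaining terms pass the cutoff.

```r
set.seed(42)

# Simulated data: y depends on x1 and x2 but not on noise (all names are hypothetical)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

alpha <- 0.05  # assumed cutoff
fit <- lm(y ~ x1 + x2 + noise, data = toy)

repeat {
  tab <- drop1(fit, test = "F")      # F-test for removing each term
  pvals <- tab[["Pr(>F)"]][-1]       # first row is "<none>", so skip it
  term_names <- rownames(tab)[-1]
  if (length(pvals) == 0 || max(pvals) <= alpha) break
  worst <- term_names[which.max(pvals)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

summary(fit)
```

Base R also offers `step(fit, direction = "backward")`, but that selects by AIC rather than by p-values, so the two approaches can keep different terms.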


6. Determine the predicted caloric range for a cereal that has Manufacturer = Nabisco, Sodium = 80, Fiber = 20, Sugar = 7, Carbs = 8, and Shelf = 1.
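
One way to read "caloric range" is as a prediction interval for a single new cereal. The sketch below shows the mechanics with `predict()` on simulated data. The model, its coefficients, and the two-predictor `new_cereal` row are assumptions made for illustration; with the real data, `newdata` would need one column for every predictor left in the final model.

```r
set.seed(1)

# Simulated stand-in for the cereal data (values are invented)
n <- 100
toy <- data.frame(Sugar = runif(n, 0, 15), Fiber = runif(n, 0, 10))
toy$Calories <- 80 + 3 * toy$Sugar - 2 * toy$Fiber + rnorm(n, sd = 5)

fit <- lm(Calories ~ Sugar + Fiber, data = toy)

# Hypothetical new observation; column names must match the model's predictors
new_cereal <- data.frame(Sugar = 7, Fiber = 2)

# Prediction interval: plausible range for one new cereal's calories
pred <- predict(fit, newdata = new_cereal, interval = "prediction", level = 0.95)

# Confidence interval: range for the mean calories of such cereals (narrower)
conf <- predict(fit, newdata = new_cereal, interval = "confidence", level = 0.95)

pred
conf
```

The prediction interval is the wider of the two because it includes the residual variation of an individual cereal on top of the uncertainty in the fitted mean.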


```{r}

```


7. Let's create a variable that represents low fiber. Based on the distribution of fiber, define the values 0, 1, and 2 as low fiber (target value = 1); all other values are high fiber (target value = 0).
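
The rule above can be sketched as a simple membership test. The `toy` data frame and its `Fiber` values are made up for this example; the column name `low_fiber` follows the next question.

```r
# Hypothetical fiber values, for illustration only
toy <- data.frame(Fiber = c(0, 1, 2, 3, 5, 10))

# 1 when Fiber is exactly 0, 1, or 2; 0 for every other value
toy$low_fiber <- as.integer(toy$Fiber %in% c(0, 1, 2))

toy
table(toy$low_fiber)
```

If the real column holds fractional values such as 1.5, a threshold like `Fiber <= 2` may match the intent better than an exact membership test.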


```{r}

```

8. Create a logistic model to predict low_fiber. Is the model predictive?
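
A minimal sketch of fitting and checking a logistic model with `glm()`, using simulated data in which a hypothetical `Sugar` column drives `low_fiber`. Two simple checks of whether the model is predictive are shown: a likelihood-ratio test against the intercept-only model, and classification accuracy compared with always guessing the majority class. The 0.5 cutoff is an assumption.

```r
set.seed(7)

# Simulated data: probability of low fiber rises with Sugar (invented relationship)
n <- 300
toy <- data.frame(Sugar = runif(n, 0, 15))
toy$low_fiber <- rbinom(n, 1, plogis(-2 + 0.4 * toy$Sugar))

fit <- glm(low_fiber ~ Sugar, data = toy, family = binomial)
summary(fit)$coefficients

# Likelihood-ratio test: does the model beat the intercept-only model?
lr_stat <- fit$null.deviance - fit$deviance
lr_p <- pchisq(lr_stat, df = fit$df.null - fit$df.residual, lower.tail = FALSE)

# In-sample accuracy at a 0.5 cutoff versus the majority-class baseline
pred_class <- as.integer(fitted(fit) > 0.5)
accuracy <- mean(pred_class == toy$low_fiber)
baseline <- max(mean(toy$low_fiber), 1 - mean(toy$low_fiber))

c(lr_p = lr_p, accuracy = accuracy, baseline = baseline)
```

In-sample accuracy tends to be optimistic, so a held-out test set would give a fairer answer. Note also that using the fiber column itself as a predictor would separate the classes perfectly, since low_fiber is defined from it.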

```{r}

```


9. Any other plan of action? 

```{r}

```