---
title: "Regression_lab"
author: "Kathleen"
date: "3/26/2019"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Load the necessary packages

```{r}
library(tidyverse)
library(e1071)
library(psych)
library(corrplot)
```

Let's start with the cereal data and get a sense of the data

```{r cereal}
cereal_df <- read_csv("cereal.csv")
```

1. Make distribution plots for each feature

```{r}
ggplot(cereal_df) +
  geom_bar(aes(x = Shelf))
```

```{r}

```

```{r visualize}
# Histogram of raw Calories
ggplot(data = cereal_df, mapping = aes(x = Calories)) +
  geom_histogram(bins = 10)

# The breaks below are in standard-deviation units, so plot z-scored Calories
breaks <- c(-3, -2, -1, 0, 1, 2, 3)
ggplot(cereal_df, aes(x = as.numeric(scale(Calories)))) +
  geom_histogram(aes(y = after_stat(density)), breaks = breaks, position = "identity") +
  geom_density() +
  labs(x = "Calories (z-score)")

summary(cereal_df)
```

```{r}

```

Make sure you review your data for skew and kurtosis.

Skew: a measure of the symmetry of the distribution.

Rules of thumb for skewness:

- A Normal distribution has skewness = 0.
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.
- If the skewness is less than -1 or greater than 1, the data are highly skewed.

Kurtosis: a measure of how much of the distribution sits in the combined tails. Kurtosis decreases as the tails become lighter and increases as the tails become heavier.

Rules of thumb for kurtosis (excess kurtosis, i.e. kurtosis minus 3):

- A Normal distribution has kurtosis = 0.
- Kurtosis < 0: less weight in the tails than a Normal distribution.
- Kurtosis > 0: more weight in the tails than a Normal distribution.

```{r}

```

```{r}

```

```{r}

```

2. Check out the correlations between the variables

```{r}

```

```{r}
pairs.panels(cereal_df)
```

4. Let's say we are trying to predict calories given the other variables. Which variables would you consider as potential features?

5. Build a linear model and use backward elimination to remove insignificant terms

```{r}

```

6. Determine the caloric range for a cereal that has Manufacturer = Nabisco, Sodium = 80, Fiber = 20, Sugar = 7, Carbs = 8, Shelf = 1

```{r}

```

7. Let's create a variable that represents low fiber. Looking at the distribution of fiber, let's define values 0, 1, and 2 as low fiber (target value = 1); all other values are high fiber (target value = 0).

```{r}

```

8. Create a logistic model to predict `low_fiber`. Is the model predictive?

```{r}

```

9. Any other plan of action?

```{r}

```
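
As a reference for the skewness and kurtosis review earlier in this lab, here is a minimal sketch of how those rules of thumb could be applied in code. It uses only base R and a simulated vector, so `x`, `moment_skewness`, `moment_excess_kurtosis`, and `describe_skew` are hypothetical names chosen for illustration; they are not part of the cereal data or of `e1071`. The `skewness()` and `kurtosis()` functions from `e1071` may return slightly different values than these plain moment formulas because of small-sample adjustments.

```{r}
# Plain moment-based estimators (an assumption for illustration;
# not necessarily the exact formulas e1071 uses by default)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

moment_excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Apply the skewness rules of thumb from the lab text
describe_skew <- function(s) {
  if (abs(s) <= 0.5) {
    "fairly symmetrical"
  } else if (abs(s) <= 1) {
    "moderately skewed"
  } else {
    "highly skewed"
  }
}

set.seed(42)
x <- rexp(500)  # simulated right-skewed data, not the cereal data

s <- moment_skewness(x)
k <- moment_excess_kurtosis(x)
describe_skew(s)
```

With the lab data, the same helpers could be called on a numeric column, for example `moment_skewness(cereal_df$Calories)`, assuming that column was read in as numeric.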