Introduction to R

R and Python are the dominant programming languages in data science. This class is aimed at non-programmers who have a statistics background. You will learn how to install and configure R, read data into R, use R packages, and write, debug, and profile R code. Statistical topics serve as the working examples as you get to know R. This class can be taught in two days for groups without a programming background, or in one day for groups with one.

Required Software: R and RStudio Open Source Edition (the free version). We recommend Microsoft R Open, an enhanced distribution of R that is multithreaded and uses the Intel MKL on Windows and Linux and the Accelerate framework on macOS.

Introduction to R
Control Structures (If, For, While)
R Functions
Scoping Rules
Dates and Times
Loop Functions
Debugging
Profiling
Programming Exercise

Data Munging - Getting and Preparing Data

While most people associate data science with model building, much of the work goes into getting the data and preparing it for analysis. This class covers the basics of obtaining data from the internet, from various file formats, and from databases. It also covers the basics of cleaning data in preparation for statistical analysis. This class can be taught in two days for groups without a programming background, or in one day for groups with one.

Reading Data From Files (CSV, Tab, Excel, XML, JSON)
Obtaining Data from Databases (MySQL, SQL Server, AWS, Azure)
Organizing Data using dplyr
Date Manipulation
Exercise

Exploratory Data Analysis - Graphing and Summarizing

Exploratory data analysis falls between data munging and model building. In this class we cover the different plotting systems in R along with summarization techniques. This class can be taught in two days for groups without a programming background, or in one day for groups with one.

Plotting Systems in R
Base Plotting System
Graphic Devices
Lattice Plotting System
ggplot2
Summarizing Data
Hierarchical Clustering
K-Means Clustering

Basic Statistical Analysis in R

This class focuses on inferential statistics in R. Once the data has been munged, it is ready for basic statistical analysis such as hypothesis testing. If the group has a background in both statistics and programming, this class can be taught in one day. Allow an additional half day for those without a programming background and another half day for those without a statistics background.

Distributions in R
Confidence Intervals
Hypothesis Testing
Power and Sample Size
Introduction to Bootstrapping
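
As a rough illustration of the kind of analysis this class covers, the sketch below runs a two-sample t-test, reads off the confidence interval and p-value, and does a simple sample size calculation. It is not taken from the course materials; the built-in mtcars data set and the effect size used in power.t.test are assumptions chosen purely for illustration.

```r
# Welch two-sample t-test on the built-in mtcars data:
# does mean mpg differ between automatic (am = 0) and manual (am = 1) cars?
result <- t.test(mpg ~ am, data = mtcars, conf.level = 0.95)

ci <- result$conf.int   # 95% confidence interval for the difference in means
p  <- result$p.value    # p-value for H0: the two group means are equal

print(ci)
print(p)

# Sample size per group needed to detect a difference of one standard
# deviation with 80% power at the 5% significance level.
# (delta and sd are illustrative values, not taken from the course.)
n_needed <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)$n
print(ceiling(n_needed))
```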

Regression Modelling in R

Regression models are typically the first step in what statistics calls “models” and data science calls “classifiers”. Despite the media attention given to more complex methods such as Deep Learning, regression models are more parsimonious (think Occam’s razor for models) and often provide excellent predictive capability. In our R sequence, this course shifts from a programming focus to a statistics/analytics focus. Allow an additional day for those without a statistics background.

Univariate Least Squares Regression
Coding in R
Residual Analysis
Prediction
Multivariate Regression
Multivariate Residuals and Diagnostics
Logistic Regression
Introduction to Poisson Regression
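
The topics above might look something like the following in practice. This is a minimal sketch using only base R and the built-in mtcars data set, both of which are assumptions made for illustration rather than the course's actual examples.

```r
# Least squares regression of fuel economy on weight (one predictor).
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))          # intercept and slope

# Residual analysis: residuals from a least squares fit with an
# intercept sum to (numerically) zero.
res <- residuals(fit)
print(summary(res))

# Prediction for a hypothetical car weighing 3,000 lbs (wt is in 1000 lbs),
# with a 95% prediction interval.
new_car <- data.frame(wt = 3)
pred <- predict(fit, newdata = new_car, interval = "prediction")
print(pred)

# Logistic regression: probability that a car has a manual transmission
# (am = 1) as a function of its weight.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
prob <- predict(logit_fit, newdata = new_car, type = "response")
print(prob)
```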

Machine Learning in R

Machine learning is a statistical technique that gives computer software the ability to improve its performance on a task (or “learn”) from data. In this course, we will cover the basics of machine learning, including training and test datasets, overfitting, underfitting, and error rates. The models (classifiers) used include regression, classification trees, and random forests.

Prediction, Cross Validation, and ROC Curves
Using R’s caret Package
Predicting with Trees
Introduction to Random Forest
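
To make the idea of training and test datasets concrete, here is a minimal sketch in base R. It deliberately avoids caret so that it runs without extra packages; the mtcars data set, the 70/30 split, the seed, and the helper function rmse are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the random split is reproducible

# Hold out roughly 30% of the rows as a test set.
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Hypothetical helper: root mean squared error of a model on a data set.
rmse <- function(model, data) {
  sqrt(mean((data$mpg - predict(model, newdata = data))^2))
}

# A simple model and a much more flexible one, both fit on the training set.
simple_fit   <- lm(mpg ~ wt, data = train)
flexible_fit <- lm(mpg ~ poly(wt, 5), data = train)

# The flexible model can never do worse on the data it was trained on,
# but only the test error shows how well each model generalizes.
errors <- data.frame(
  model      = c("simple", "flexible"),
  train_rmse = c(rmse(simple_fit, train), rmse(flexible_fit, train)),
  test_rmse  = c(rmse(simple_fit, test),  rmse(flexible_fit, test))
)
print(errors)
```

A training error that is much lower than the test error is the usual sign of overfitting.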

Creating Data Products

Data products bridge the gap between the person who created the analysis and the person who needs to consume the information. The ability to present your findings in a way that is easily understood by the receiver is key to being a good data scientist. This course adds Shiny, googleVis, Plotly, R Markdown, and Leaflet to your toolbox for presenting your results.

A good example of a data product in Shiny is this Hypergeometric Sample Size Calculator.

Introduction to Data Products
Shiny