The first approach to analyzing data is to visually analyze it. The objectives at doing this are normally finding relations between variables and univariate descriptions of the variables. We can divide these strategies as −
Univariate is a statistical term. In practice, it means we want to analyze a variable independently from the rest of the data. The plots that allow to do this efficiently are −
Box-Plots are normally used to compare distributions. It is a great way to visually inspect if there are differences between distributions. We can see if there are differences between the price of diamonds for different cut.
# We will be using the ggplot2 library for plotting library(ggplot2) data("diamonds") # We will be using the diamonds dataset to analyze distributions of numeric variables head(diamonds) # carat cut color clarity depth table price x y z # 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 # 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 # 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 # 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 # 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 # 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ### Box-Plots p = ggplot(diamonds, aes(x = cut, y = price, fill = cut)) + geom_box-plot() + theme_bw() print(p)
We can see in the plot there are differences in the distribution of diamonds price in different types of cut.
source('01_box_plots.R') # We can plot histograms for each level of the cut factor variable using facet_grid p = ggplot(diamonds, aes(x = price, fill = cut)) + geom_histogram() + facet_grid(cut ~ .) + theme_bw() p # the previous plot doesn’t allow to visuallize correctly the data because of the differences in scale # we can turn this off using the scales argument of facet_grid p = ggplot(diamonds, aes(x = price, fill = cut)) + geom_histogram() + facet_grid(cut ~ ., scales = 'free') + theme_bw() p png('02_histogram_diamonds_cut.png') print(p) dev.off()
The output of the above code will be as follows −
Multivariate graphical methods in exploratory data analysis have the objective of finding relationships among different variables. There are two ways to accomplish this that are commonly used: plotting a correlation matrix of numeric variables or simply plotting the raw data as a matrix of scatter plots.
In order to demonstrate this, we will use the diamonds dataset. To follow the code, open the script bda/part2/charts/03_multivariate_analysis.R.
library(ggplot2) data(diamonds) # Correlation matrix plots keep_vars = c('carat', 'depth', 'price', 'table') df = diamonds[, keep_vars] # compute the correlation matrix M_cor = cor(df) # carat depth price table # carat 1.00000000 0.02822431 0.9215913 0.1816175 # depth 0.02822431 1.00000000 -0.0106474 -0.2957785 # price 0.92159130 -0.01064740 1.0000000 0.1271339 # table 0.18161755 -0.29577852 0.1271339 1.0000000 # plots heat-map(M_cor)
The code will produce the following output −
This is a summary, it tells us that there is a strong correlation between price and caret, and not much among the other variables.
A correlation matrix can be useful when we have a large number of variables in which case plotting the raw data would not be practical. As mentioned, it is possible to show the raw data also −
library(GGally) ggpairs(df)
We can see in the plot that the results displayed in the heat-map are confirmed, there is a 0.922 correlation between the price and carat variables.
It is possible to visualize this relationship in the price-carat scatterplot located in the (3, 1) index of the scatterplot matrix.