A Beginner Template for Descriptive Statistics in R
Régis KLA, May 2017
Introduction
This article presents a very simple R program template to quickly produce useful descriptive statistics on your current dataset. It is part of a series of articles that provide ready-to-customize program templates for junior Data Scientists who want to quickly obtain a runnable program. Templates in this series are available in several programming languages:
- R,
- Python, and
- C/C++
However, the current article focuses on an R template, described below.
Why is it important to understand your data?
Descriptive statistics will provide you with one important thing: understanding your data.
Understanding the data is the only way to make sound choices in the next steps of the analysis process. Indeed, knowing even minimal information about the data will allow you, for instance, to correctly prepare the preprocessing phase:
- Which columns should be (re)encoded (e.g. from string factors to numerical values)?
- Which data can be thrown away because they bring no added value?
- What distribution rules the data, what is its skewness, and so forth?
Descriptive statistics bring a lot of answers to these questions.
Take away?
After reading this article, the reader will be provided with:
- a well-documented template to obtain a runnable R program that calculates the essential metrics described in the section What Are Descriptive Statistics?, and
- a solid understanding of all the calculated metrics (e.g. what, why, how).
What Are Descriptive Statistics?
Before diving into the heart of the subject, we introduce a pragmatic definition of descriptive statistics from the Data Science perspective. A purist statistician may find our definition largely incomplete; but we are Data Scientists, and we are interested in a pragmatic definition of concepts.
While presenting some practical concepts of descriptive statistics, we also highlight the chunks of the template that implement these concepts.
The article [Qui09] has been used as a reference for this section.
Preliminaries
It may seem trivial to insist on this step; but believe me when I say that it is very easy to forget: display a sample or a subset of your raw data:
ncol(dataset)
nrow(dataset)
head(dataset)
When the dataset points to the Ionosphere [Rep89] data, then the result is something like:
35
351
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 1 0 0.99539 -0.05889 0.85243 0.02306 0.83398 -0.37708 1.00000 0.03760
2 1 0 1.00000 -0.18829 0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549
3 1 0 1.00000 -0.03365 1.00000 0.00485 1.00000 -0.12062 0.88965 0.01198
...
V27 V28 V29 V30 V31 V32 V33 V34 Class
1 0.41078 -0.46168 0.21266 -0.34090 0.42267 -0.54487 0.18641 -0.45300 good
2 -0.20468 -0.18401 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447 bad
3 0.58984 -0.22145 0.43100 -0.17365 0.60436 -0.24180 0.56045 -0.38238 good
...
On the console, everything looks fine with the data; but... not so fast. Since we are doing Data Science and not pure Statistics, we are constrained by the computer’s laws. This means that one should, first of all, discover the data type of each column (feature, or variable), because “What You See (on the console) Is Not What You Get (in real life)!”:
sapply(dataset, levels)
The command produces:
$V1
[1] "0" "1"
$V2
[1] "0"
...
$Class
[1] "bad" "good"
The previous result illustrates a frequent trap that is easy to avoid. By simply displaying the data, one could reasonably think that the column V1 is numeric... but it is not! In R parlance it is a factor column: a set of string labels, exactly like the Class column. In addition, this command reveals the real nature of V2: a constant column with a single value for all lines (i.e. observations). Since V2 is a constant column, it brings no added value at all from the machine learning perspective, and should be removed to save space.
Thus one can now apply some early preprocessing actions:
- transform column V1 from strings to numerics, and
- drop column V2, which is useless.
dataset$V1 <- as.numeric(as.character(dataset$V1))
dataset <- dataset[, -2]  # column index of V2 is 2
Descriptive Statistics
The Descriptive Statistics [Yau] of a dataset can be defined as the first set of figures used to represent that dataset. These figures SHOULD at least contain the following metrics.
The value ranges
For each column, it is good to know the range of its values.
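As a minimal sketch (the toy data frame below is a hypothetical stand-in; in the template, `dataset` is already loaded), base R’s `range()` gives the minimum and maximum of each numeric column:

```r
# Hypothetical stand-in for the loaded dataset
dataset <- data.frame(V3 = c(0.99539, 1.00000, 1.00000),
                      V4 = c(-0.05889, -0.18829, -0.03365))

# min and max of every numeric column; non-numeric columns are filtered out
sapply(Filter(is.numeric, dataset), range)
```

The result is a two-row matrix (minimum on the first row, maximum on the second), with one column per numeric variable.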
The percentiles
A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found [Per] [Rum].
quantile_vector <- c(.1, .25, .5, .75)  # 10%, 25%, 50%, 75%
sapply(Filter(is.numeric, dataset), quantile, probs = quantile_vector)
The mean
The mean (i.e. the arithmetic mean): trivial. It can be called the expected value in some contexts.
The standard deviation
Also known as the écart type in French, it is a measure of the dispersion of a column’s values around the mean. A low standard deviation means that the data are grouped around the mean; a high one means the data are spread out far from the mean. It is expressed in the same units as the data.
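Both metrics are one-liners in base R. A minimal sketch on a hypothetical numeric sample (note that `sd()` computes the sample standard deviation, dividing by n - 1):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical sample

mean(x)  # arithmetic mean: 5
sd(x)    # sample standard deviation, in the same units as x
```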
The standard error of the mean
It depicts the relationship between the dispersion of individual observations around the population mean (the standard deviation) and the dispersion of sample means around the population mean (the standard error). In other words, it is not a new kind of dispersion: it is derived from the standard deviation and the sample size.
The relationship with the standard deviation is defined such that, for a given sample size, the standard error equals the standard deviation divided by the square root of the sample size. As the sample size increases, the dispersion of the sample means clusters more closely around the population mean and the standard error decreases. It can also be viewed as the standard deviation of the error in the sample mean with respect to the true mean, since the sample mean is an unbiased estimator [err].
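Base R has no built-in function for the standard error of the mean, but the relationship above translates directly; a minimal sketch on a hypothetical sample:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical sample

# standard error of the mean: standard deviation / square root of the sample size
sem <- sd(x) / sqrt(length(x))
sem
```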
The R Program Template
The template file is available [here] for download. The reader can download it and run it directly by following the install instructions provided in the associated Readme.md file. It was originally inspired by the following article [Bro16]; but I have deeply modified and customized the original template. You are invited to do the same in order to meet your own requirements.
In the rest of the document we describe each part of this template and provide the reader with ways to improve it and to go further with descriptive statistics on the data.
Libraries
The first action consists of loading the required libraries. It is difficult to guess a priori how many of them will be required. Thus, if you don’t know which ones to import, leave this section empty for now and come back later to import the right ones.
# a) Load the required libraries
# Example:
# library(mlbench)

# b) Set the seed
# Example:
# seed <- 1234
# set.seed(seed)
In the previous listing, the mlbench [FL12] package is imported as an example. Indeed, this package is a good data provider, among many other things. The reader is invited to read the internal documentation about this package. Finally, the seed is set to guarantee the reproducibility of the program when the same conditions hold.
Functions
Functions can be defined here. They are optional, but they can increase the level of code reuse.
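For instance, here is a small hypothetical helper (not part of the original template) that summarizes one column and can be reused over a whole data frame with `sapply`:

```r
# Hypothetical helper: per-column summary that skips non-numeric columns
describe_column <- function(x) {
  if (!is.numeric(x)) return(c(mean = NA, sd = NA))
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

describe_column(c(1, 2, 3))  # mean = 2, sd = 1
```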
Data loading
Now it is time to load the dataset and place it into a data frame object.
# Example: the Ionosphere dataset from the mlbench package
# data(Ionosphere)
# dataset <- Ionosphere

# Example: from a CSV file
# dataset <- read.table(...)
Here, the developer can update the parameters lib.loc or package to provide more information about the location of the data. Once again, please refer to the internal help system for more information about the data() function.
Data summary
This part is where you effectively implement the descriptive statistics steps already described and illustrated with examples.
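As a sketch of what this step can look like (the toy data frame below is a hypothetical stand-in for the cleaned dataset), base R’s `summary()` already covers ranges, quartiles, and means in one call, and the remaining metrics can be computed explicitly:

```r
# Hypothetical stand-in for the cleaned dataset
dataset <- data.frame(V3 = c(0.99539, 1.00000, 1.00000),
                      Class = factor(c("good", "bad", "good")))

summary(dataset)  # numeric columns: min, quartiles, mean, max; factors: level counts

# mean, standard deviation, and standard error of the mean for numeric columns
stats <- sapply(Filter(is.numeric, dataset),
                function(x) c(mean = mean(x), sd = sd(x),
                              sem = sd(x) / sqrt(length(x))))
stats
```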
Conclusion
With this article you have been provided with the minimal knowledge to quickly implement descriptive statistics in R and interpret the results. This insight into your data will help you make the right decisions for the next steps of your journey.
Finally, an R program template is provided for you to modify and adapt to your specific data.
Bibliography
| [Bro16] | Jason Brownlee. Better understand your data in R using descriptive statistics (8 recipes you can use today). 2016. http://machinelearningmastery.com/descriptive-statistics-examples-with-r/ |
| [err] | Wikipedia. Standard error. https://en.wikipedia.org/wiki/Standard_error |
| [FL12] | Friedrich Leisch and Evgenia Dimitriadou. mlbench: machine learning benchmark problems. July 2012. https://cran.r-project.org/web/packages/mlbench/index.html |
| [Per] | Wikipedia. Percentile. https://en.wikipedia.org/wiki/Percentile |
| [Qui09] | John M. Quick. R tutorial series: summary and descriptive statistics. November 2009. http://rtutorialseries.blogspot.fr/2009/11/r-tutorial-series-summary-and.html |
| [Rep89] | UCI Machine Learning Repository. Ionosphere data set. January 1989. https://archive.ics.uci.edu/ml/datasets/Ionosphere |
| [Rum] | Deborah J. Rumsey. How to calculate percentiles in statistics. http://www.dummies.com/education/math/statistics/how-to-calculate-percentiles-in-statistics/ |
| [Yau] | Chi Yau. R tutorial - an introduction to statistics - numerical measures. http://www.r-tutor.com/elementary-statistics/numerical-measures |
Régis KLA, Ph.D.
I'm currently a Data Scientist based in Paris, Europe. In my day job I'm working on applying Data Science
and Big Data techniques for a Financial Institution. Read more...