Prepare, analyze, and visualize data with R

Arithmetic Artist

R base or Alternative Packages?

Typical applications in R can often be implemented either with the base packages of the core distribution or with newer packages. In most cases, packages from the core distribution offer maximum stability, whereas some newer packages are much more convenient and elegant to use and have more sensible default values. However, some of these packages are still under development, so you should always keep in mind that a package update can change basic functionality. A quick look at the version number helps you avoid unpleasant surprises.
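Checking a version number from within R takes a single call. The following minimal sketch uses the stats package only because it ships with every R installation; substitute the name of the package you are interested in. The threshold "3.0.0" is an arbitrary value chosen for illustration.

```r
# Version of the running R installation
r_version <- getRversion()

# Version of an installed package ("stats" is a placeholder;
# replace it with, for example, "dplyr" or "ggplot2")
pkg_version <- packageVersion("stats")

# Version objects can be compared directly with version strings
# ("3.0.0" is an arbitrary threshold for illustration)
is_recent <- pkg_version >= "3.0.0"

print(r_version)
print(pkg_version)
print(is_recent)
```

A comparison like this at the top of a script is one way to notice early that a package has changed underneath you.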

As already mentioned, you can create graphics with the graphics package from the core distribution or the newer ggplot2 (current version 3.1), which comes from the tidyverse collection of packages. Tidyverse is the home of a group of practical packages for data preparation, analysis, and visualization that follow a uniform interface.

Typical data preparation steps (changing data columns, calculating new columns, data aggregation, etc.) can be implemented both with the base package and with the tidyr (version 0.8) and dplyr (version 0.8) packages. The example in Listing 4 creates a small data frame with the columns a, b, and colour. The c and sum columns are then computed twice: once with base R and once with dplyr.

Listing 4

Two Packages, One Task

> # base
> dat <- data.frame(a = c(10, 11, 12),
+                   b = c(4, 5, 6),
+                   colour = c("blue", "green", "yellow"),
+                   stringsAsFactors = FALSE)
> dat$c <- 2 * dat$a
> dat$sum <- dat$a + dat$b
>
> # dplyr
> library(dplyr)
> dat <- tibble(a = c(10, 11, 12),
+               b = c(4, 5, 6),
+               colour = c("blue", "green", "yellow")) %>%
+        mutate(c = 2 * a,
+               sum = a + b)

The trademark feature of the dplyr package immediately stands out: The pipe operator (%>%), which dplyr takes from the magrittr package, joins expressions like the links of a chain and passes the result of the expression on its left as the first argument to the function on its right. Whereas complicated expressions often have to be nested in base R and then read from the inside out, these constructs can be broken down with the pipe and converted into logically consecutive steps.
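The data aggregation step mentioned earlier shows the difference between nesting and chaining quite well. The following sketch assumes that dplyr is installed; the sales data frame and its columns are invented for illustration and do not come from the listings in this article.

```r
library(dplyr)

# Hypothetical example data (not from the article)
sales <- data.frame(region = c("north", "south", "north", "south", "north"),
                    amount = c(10, 20, 30, 40, 50),
                    stringsAsFactors = FALSE)

# base: the row filter is nested inside the aggregation call
agg_base <- aggregate(amount ~ region,
                      data = sales[sales$amount > 10, ],
                      FUN = sum)

# dplyr: the same steps, read from top to bottom
agg_dplyr <- sales %>%
  filter(amount > 10) %>%
  group_by(region) %>%
  summarise(amount = sum(amount))

print(agg_base)
print(agg_dplyr)
```

Both variants return one row per region with the summed amounts; the base version returns a data frame, the dplyr version a tibble.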

The example in Listing 5 illustrates how the pipe operator improves the readability of the code. The share of TRUE values in a logical vector, which also contains a missing value, is computed and formatted as a percentage. Even if the second example is made a bit longer by the pipe operator (and takes a few microseconds longer to execute), it is easier to see which arguments belong to which function.

Listing 5

Pipe Operator

> x <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, NA)
>
> # base
> paste0(round(100 * mean(x, na.rm = TRUE)), "%")
[1] "57%"
>
> # dplyr pipe operator (requires library(dplyr))
> x %>% mean(na.rm = TRUE) %>% `*`(100) %>% round() %>% paste0("%")
[1] "57%"

Development Environments

The R language has several development environments, of which RStudio (Figure 3) is by far the most common. In addition to the console, history, and editor with syntax highlighting, code folding, and automatic indenting, the RStudio interface includes Git integration and helpful tools for the development of R packages.

Figure 3: The RStudio development environment. The script is at top left and the console at bottom left. The top right window displays all objects currently present in the environment.

Alternatives to RStudio include Atom with the Hydrogen extension or Emacs with the ESS (Emacs Speaks Statistics) extension. Both open up more possibilities for individual configuration and expansion through plugins. Both are also suitable for many different languages, so you do not need to change editors when you switch languages. On the other hand, RStudio is often more convenient, especially for beginners, because it requires no configuration.

Strengths and Weaknesses

One of R's greatest strengths is certainly its overwhelming number of packages. You will find an R package suited to almost any application that has to do with data. However, the wide range of packages is also one of R's weaknesses, because it is often difficult to make a selection from among the large number available. Moreover, the stability and quality of rarely used packages sometimes do not meet the high level of the core distribution, not least because the authors often have only limited programming experience.

Further advantages and disadvantages are the result of R being an interpreted language. Because the code is not compiled first, it can be executed line by line in the console; this is what makes interactive work possible, and interactive work is almost indispensable for data analysis. During explorative data analysis or the development of a statistical forecast model, you can look at your data, graphics, and results after every step. The intermediate step of error-free compilation would complicate data analysis enormously. The disadvantage, however, is that R is fairly slow in certain applications.

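The speed problem can often be mitigated by parallelization, for which the parallel package from the core distribution is recommended. On Linux, the implementation is very uncomplicated, because worker processes can simply be forked from the running session. Forking is not available on Windows, so the code there is somewhat more complicated: You first have to set up a cluster of separate R processes. Also, the memory requirement for parallelization on Windows is greater than for Linux, because each of these processes needs its own copy of the data, so the limiting factor is often not the number of cores, but memory (especially with large datasets).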
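The difference between the two platforms can be sketched with the parallel package as follows. The slow_square function and the choice of two worker processes are assumptions made for illustration; in practice, you would replace them with your own function and a core count that suits your machine.

```r
library(parallel)

# Hypothetical stand-in for an expensive computation
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

n_cores <- 2  # assumption: two worker processes

if (.Platform$OS.type == "unix") {
  # Linux and other Unix-like systems: fork the current session
  res <- mclapply(1:8, slow_square, mc.cores = n_cores)
} else {
  # Windows: start separate R processes and always shut them down again
  cl <- makeCluster(n_cores)
  res <- tryCatch(parLapply(cl, 1:8, slow_square),
                  finally = stopCluster(cl))
}

res <- unlist(res)
print(res)
```

The cluster variant also works on Linux, so code that has to run on both platforms can use it exclusively, at the price of the higher memory footprint.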

Another way to accelerate code is to outsource performance-critical sections to C or C++; for example, the Rcpp package is available for integrating C++ code. In practice, some users also follow the strategy of using R to develop a model interactively and finally implement the result for operational use in another language, such as Java.
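As a minimal sketch of the Rcpp approach, the following example compiles a small C++ function from within an R session and calls it like any other R function. It assumes that the Rcpp package and a working C++ compiler are installed; the sum_squares function is invented for illustration.

```r
library(Rcpp)

# Compile a C++ function and make it available in the R session
# (sum_squares is a hypothetical example, not part of Rcpp)
cppFunction('
double sum_squares(NumericVector x) {
  int n = x.size();
  double total = 0;
  for (int i = 0; i < n; ++i) {
    total += x[i] * x[i];
  }
  return total;
}')

x <- c(1, 2, 3)
res_cpp <- sum_squares(x)  # C++ loop
res_r   <- sum(x^2)        # vectorized base R equivalent

print(res_cpp)
print(res_r)
```

For a task this simple, the vectorized base R version is already fast; a C++ implementation tends to pay off for loops that cannot be vectorized.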
