Prepare, analyze, and visualize data with R
Arithmetic Artist
Object Orientation
In R you can go quite a long way without ever knowing what a class is. Nevertheless, object-oriented programming is possible. R has three common systems of object orientation: S3 classes, S4 classes, and Reference Classes. The first two systems are the most widely used but implement a rather unusual concept. In contrast to the class systems of most other languages, methods belong to generic functions rather than to classes: a generic function decides which method is called according to the class of the object to which it is applied (a process called method dispatch). Therefore, the same function can be applied to objects of different classes and return class-specific output.
For example, the summary() function applied to a dataset gives an overview of each variable it contains, such as quantiles for numeric columns and counts for factors. If you apply it to a numeric vector, R outputs the minimum, maximum, mean, and quartiles. S3 and S4 classes are fairly similar, but the S4 system is more formalized than the S3 system, and its method dispatch can take into account the classes of multiple input objects.
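The dispatch mechanism described above can be sketched in a few lines. The generic describe() and the S4 generic join_values() are hypothetical names invented for this illustration and are not part of R; mtcars is one of the sample datasets shipped with R.

```r
library(methods)  # S4 support; attached by default in most sessions

# S3: the generic hands the call to a method chosen by the class of x.
describe <- function(x, ...) UseMethod("describe")

describe.numeric <- function(x, ...) {
  sprintf("numeric vector, mean %.1f", mean(x))
}

describe.data.frame <- function(x, ...) {
  sprintf("data frame with %d rows and %d columns", nrow(x), ncol(x))
}

# Fallback for every class without a method of its own
describe.default <- function(x, ...) {
  sprintf("object of class %s", class(x)[1])
}

describe(c(1, 2, 3))    # uses describe.numeric
describe(head(mtcars))  # uses describe.data.frame
describe("text")        # falls back to describe.default

# S4: dispatch can consider the classes of several arguments at once.
setGeneric("join_values", function(a, b) standardGeneric("join_values"))

setMethod("join_values", signature("numeric", "numeric"),
          function(a, b) a + b)

setMethod("join_values", signature("character", "character"),
          function(a, b) paste(a, b))

join_values(1, 2)      # numeric method: addition
join_values("a", "b")  # character method: concatenation
```

Note that the S3 methods are ordinary functions whose names follow the pattern generic.class, whereas the S4 methods have to be registered explicitly with setMethod().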
Reference Classes are more in line with the concept familiar from other languages, in which methods belong to classes rather than to functions. Object-oriented programming plays a fairly minor role in R partly because, for preparing and evaluating data, it generally generates more overhead than benefit.
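As a minimal sketch of how Reference Classes bundle data and methods, consider the following counter. The class name Counter and its fields and methods are made up for this example.

```r
library(methods)

# Hypothetical class for illustration: a counter with mutable state.
Counter <- setRefClass(
  "Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function(by = 1) {
      # "<<-" modifies the field of the object itself
      count <<- count + by
      invisible(.self)
    },
    show_count = function() {
      cat("Current count:", count, "\n")
    }
  )
)

clicks <- Counter$new(count = 0)
clicks$increment()        # methods are called on the object
clicks$increment(by = 2)
clicks$show_count()
```

Unlike S3 and S4 objects, the object is changed in place; no reassignment such as clicks <- increment(clicks) is needed.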
Standard Library and Extensions
R provides all functions through packages, some of which are already included in the R source code and are thus automatically installed and available. All other packages need to be downloaded and installed before their functionality is available. In addition to basic methods, the standard packages include functions for statistical data analysis, the creation of graphics, and sample datasets.
The packages base, datasets, and stats, among others, belong to the standard library. The base package provides, as the name suggests, really basic functions like mean(), length(), or print(). The datasets package contains small datasets that programmers can experiment with to test code and learn methods.
Also helpful is the stats package, which supports full-blown statistical analyses. It provides functions such as lm() (linear model), anova() (analysis of variance), and t.test() (test for a significant difference between means).
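The following sketch shows how these functions might be combined on mtcars, one of the sample datasets from the datasets package. The choice of variables (fuel economy mpg, weight wt, horsepower hp, transmission type am) is only an assumption made for illustration.

```r
# mtcars ships with the datasets package, so nothing has to be installed.
head(mtcars)

# Linear model: fuel economy as a function of weight
fit_small <- lm(mpg ~ wt, data = mtcars)
coef(fit_small)

# Analysis of variance: does adding horsepower improve the model?
fit_large <- lm(mpg ~ wt + hp, data = mtcars)
model_comparison <- anova(fit_small, fit_large)
print(model_comparison)

# t-test: do the two transmission types differ in mean fuel economy?
mpg_test <- t.test(mpg ~ am, data = mtcars)
mpg_test$p.value
```

All three functions return objects that can be inspected further, for example with summary(fit_small).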
The main source for add-on packages is the Comprehensive R Archive Network (CRAN), a network of mirrored servers that provides developers with a platform on which to publish their packages. Packages available on CRAN must meet certain requirements that are primarily concerned with the architecture of the package, rather than the quality of the content. In addition to this official approach, packages can also be provided on GitHub or similar services.
To install the dplyr data manipulation package, for example, enter:
> install.packages("dplyr")
For the functions of the package to be usable, you first have to load the package:
> library("dplyr")
This step usually happens at the beginning of a session.
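In scripts that run repeatedly, installation and loading can be combined so that a package is only downloaded when it is missing. The helper use_package() below is a hypothetical convenience function written for this sketch, not something R provides.

```r
# Hypothetical helper (not part of R): install a package only if it is
# missing, then attach it.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

use_package("stats")    # already installed, so nothing is downloaded
# use_package("dplyr")  # would fetch dplyr from CRAN on first use
```

A single function can also be called without attaching the whole package by prefixing it with the package name, as in stats::median(c(1, 5, 9)).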
Graphics
The overhead for creating visually appealing graphics is greater in R than, for example, in Excel, but everything can be configured down to the last detail. The easiest way to generate graphics is to use the graphics package from the core distribution. The plot() function it includes automatically generates the appropriate graph type according to the input.
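How plot() adapts to its input can be tried out with a few calls. The vectors values and answers are invented sample data; mtcars comes from the datasets package.

```r
values  <- c(3, 1, 4, 1, 5, 9, 2, 6)
answers <- factor(c("yes", "no", "yes", "yes", "no"))

plot(values)                             # numeric vector: points against the index
plot(answers)                            # factor: bar chart of the counts
plot(mtcars$wt, mtcars$mpg)              # two numeric vectors: scatter plot
plot(mtcars[, c("mpg", "wt", "hp")])     # data frame: scatter plot matrix
plot(mpg ~ factor(cyl), data = mtcars)   # numeric by factor: box plots
```

Like summary(), plot() is a generic function, so the class of the input decides which plotting method runs.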
The strongest competitor to graphics is the slightly newer ggplot2 package. The syntax initially takes some getting used to; for example, the various elements like data, axes, and legends are connected by a plus sign. However, the learning curve is worth your while, because even with the default settings, ggplot2 graphics look much more professional than those from the graphics package.
A wide range of themes is easy to apply to the graphics as well, making it possible to change the appearance of a plot after it has been stored as an object. If the themes offered directly by ggplot2 are not to your taste, you can look around in the ggthemes package or even adjust the font, color, and size of the individual elements to suit your needs. Because it is not particularly difficult to define a theme yourself, you can create graphics that correspond exactly to the corporate design of an organization.
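A short sketch illustrates both the plus-sign syntax and the use of themes. It assumes that ggplot2 is installed; theme_house() is a hypothetical custom theme, and its colors and sizes are placeholders rather than any real corporate design.

```r
library(ggplot2)

# Store the plot as an object; the elements are connected with "+".
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Apply a built-in theme to the stored plot afterwards
p_minimal <- p + theme_minimal()

# Hypothetical house style: start from a built-in theme and override
# individual elements (all values are placeholders).
theme_house <- function() {
  theme_bw(base_size = 12) +
    theme(
      plot.title       = element_text(face = "bold", colour = "#003366"),
      panel.grid.minor = element_blank(),
      legend.position  = "bottom"
    )
}

p_house <- p + ggtitle("Fuel economy by weight") + theme_house()
print(p_house)
```

Because the theme is wrapped in a function, it can be reused for every plot in a report by appending + theme_house().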
In the next example, the same plot is created first with graphics (Figure 1) and then with ggplot2 (Figure 2). To begin, you need a sample dataset with exactly one column (Listing 3) containing the responses by some imaginary people to a question about their favorite development environment in R (more about this later).
Listing 3
Sample Data Record
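> library("ggplot2")
> # stringsAsFactors = TRUE is needed as of R 4.0 so that plot() receives a factor
> dat <- data.frame(environment = c("RStudio", "Atom", "none", "RStudio", "Emacs", "RStudio", "RStudio",
+   "RStudio", "Emacs", "RStudio", "Atom", "RStudio", "RStudio"), stringsAsFactors = TRUE)
> plot(dat$environment)
> ggplot(dat, aes(x = environment)) + geom_bar()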