Prepare, analyze, and visualize data with R
Arithmetic Artist
The R statistical programming language offers machine learning methods, dashboards, descriptive analyses, t-tests, cluster analyses, various regression methods, interactive graphs, and more. If you have ever wondered whether it would be worthwhile to immerse yourself in R, the following easy-to-grasp taster seeks to clarify who is likely to benefit.
Yesterday and Today
R [1] is based on the S programming language developed at Bell Laboratories in 1975-1976, which then split in the late 1980s and early 1990s into a commercial version named S-PLUS and the GNU project R. The name, which admittedly takes some getting used to, can be traced back to the first names of the developers, Ross Ihaka and Robert Gentleman, and alludes to the previous project name, as well.
R is often referred to as a programming environment, a term intended to emphasize the open package concept and to indicate that R differs from common monolithic statistical software. The basic functions are provided by eight packages included in the R source code. Many thousands of additional packages offer extensions.
In its early years, R was more of a niche player and was mainly used by statisticians and biometricians at universities. In the meantime, however, it has gained a firm place in the corporate world with the increasing entry of data science into many companies.
Syntax
An overview of the most important properties of the syntax eases any introduction to R. R syntax is built around expressions and is case sensitive: An object named modelFit cannot be called as modelfit, for example. The assignment operator <- creates an object; the arrow points to the object that receives the value of the expression.
To assign the numbers from 1 to 5 to the vector numbers, use the following expression:
> numbers <- c(1, 2, 3, 4, 5)
The c() function (the c stands for "concatenate") combines the individual elements listed in parentheses into a vector. An equals sign can be used as an alternative for assignments, in line with the conventions of other programming languages. However, this practice is controversial in the R community. The assignment operator and equals sign are also not fully equivalent: The equals sign assigns only at the top level or inside a braced block of expressions, whereas within a function call it is interpreted as naming an argument.
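The difference between the two operators can be sketched as follows. This is a minimal illustration, assuming a fresh session; the object names x, y, z, m, and m2 are hypothetical and chosen only for this example.

```r
# At the top level, both forms create an object.
x <- 10
y = 20

# Inside a function call, = names an argument instead of assigning:
# 1:10 is passed to the argument x of mean(); the top-level x stays 10.
m <- mean(x = 1:10)

# With <-, the same position creates z in the calling environment
# and then passes its value on as the first argument of mean().
m2 <- mean(z <- 1:10)
```

After these lines, x still holds 10, whereas z exists as a new vector. This side effect is one reason many style guides discourage assignments inside function calls.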
Another member of the assignment operator group, <<-, is an extension of the assignment operator. Known as the superassignment operator, it is used inside functions: It looks for an existing variable of that name in the enclosing environments and overwrites it there. If no such variable is found, it creates the variable in the global environment.
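As a hedged sketch of superassignment, the following example uses two hypothetical helper functions, make_counter() and set_flag(), which are not part of R and serve only as illustration.

```r
# make_counter() returns a function that remembers how often it was called.
make_counter <- function() {
  count <- 0
  function() {
    # <<- finds count in the enclosing function environment and updates it there
    count <<- count + 1
    count
  }
}

counter <- make_counter()
first <- counter()
second <- counter()

# No variable named flag exists in an enclosing function here,
# so <<- creates it in the global environment.
set_flag <- function() {
  flag <<- TRUE
}
set_flag()
```

A plain <- inside the inner function would create a new local count on every call instead, and the counter would never advance.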
A special feature compared with many common programming languages is indexing, which starts at 1 rather than 0. To retrieve the first element of the numbers vector, you would write numbers[1]. The output shows the element at the first position of numbers, preceded by [1], which is the index of the first element printed on that output line.
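A few common indexing forms, sketched on the numbers vector from above; the result names first, middle, without_first, and large are hypothetical.

```r
numbers <- c(1, 2, 3, 4, 5)

first <- numbers[1]              # element at position 1
middle <- numbers[2:4]           # elements at positions 2 to 4
without_first <- numbers[-1]     # a negative index drops that position
large <- numbers[numbers > 3]    # a logical index keeps the matching elements
```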
As you will see from the code examples, no semicolon or the like is used to complete an expression. R uses line breaks for this purpose.
The interpreter helps out here: If the end of a line clearly does not complete the expression (e.g., because a closing parenthesis is missing or the line ends with a comma), the interpreter assumes that the expression continues on the next line and prompts you for the rest with a plus sign:
> rep(numbers,
+ each = 3)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Comments in code are indicated by a hash (#) and can occupy a line of their own or be added to the end of a line:
# This is a comment
> names <- c("Anna", "Rudolf", "Edith",
+ "Jason", "Maria")
Data Structures
The most important data structures in R are vectors, matrices, lists, and data frames. Vectors are one-dimensional data structures and the smallest possible building block, because R has no scalars. Vectors whose elements all share one basic type are known as atomic vectors. For example, an object containing only one string or one number is treated by R as a vector with a length of 1. The five atomic vector types in common use are listed below (a sixth type, raw, holds bytes and is rarely needed):
logical
integer
double (numeric)
character
complex
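To check which type a value has, the typeof() function can be used. The following sketch collects one example per type; the names types and single are hypothetical.

```r
types <- c(
  logical   = typeof(TRUE),
  integer   = typeof(1L),   # the L suffix marks an integer literal
  double    = typeof(1),    # plain numbers are stored as doubles
  character = typeof("a"),
  complex   = typeof(1i)
)

# A single number is a vector of length 1, not a scalar.
single <- 42
len <- length(single)
```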
A characteristic of vectors in R is that all their elements must be of the same type. If you try to combine elements of different types (e.g., strings and numbers), you will not see an error message. Instead, R automatically converts all elements to a common type in a process known as coercion. The following example tries to create a vector from a number, a logical constant, and a character string. R automatically converts all elements to character strings:
> misc <- c(43, TRUE, "Hello")
> misc
[1] "43"    "TRUE"  "Hello"
> class(misc)
[1] "character"
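Coercion always moves toward the more general type, in the order logical, integer, double, character. A short sketch, with hypothetical result names, shows the individual steps and an explicit conversion back to a number:

```r
t1 <- typeof(c(TRUE, 2L))    # logical combined with integer
t2 <- typeof(c(2L, 3.5))     # integer combined with double
t3 <- typeof(c(3.5, "a"))    # double combined with character

# Explicit conversion in the other direction
n <- as.numeric("43")

# Logical values count as 1 (TRUE) and 0 (FALSE) in arithmetic
hits <- sum(c(TRUE, FALSE, TRUE))
```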
Data frames are table-like data structures in R and are used in almost every data analysis. Each column contains a vector. All columns must have the same length, but each can be of a different type (Listing 1).
Listing 1
Data Frame
> data.frame(numbers, names)
  numbers  names
1       1   Anna
2       2 Rudolf
3       3  Edith
4       4  Jason
5       5  Maria
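Once a data frame is stored in an object, its parts can be inspected and extracted. The following is a minimal sketch; the object name people and the result names are hypothetical.

```r
numbers <- c(1, 2, 3, 4, 5)
names <- c("Anna", "Rudolf", "Edith", "Jason", "Maria")
people <- data.frame(numbers, names)

str(people)                      # compact overview of columns and types

name_column <- people$names      # one column as a vector
second_row <- people[2, ]        # one row as a data frame
rows <- nrow(people)
cols <- ncol(people)
```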
If you add an additional element to the names vector and again try to create a data frame from numbers and names, an error occurs because numbers has a length of 5 and names has a length of 6 (Listing 2).
Listing 2
Faulty Frame
> names[6] <- "Henry"
> data.frame(numbers, names)
Error in data.frame(numbers, names) :
  arguments imply differing number of rows: 5, 6
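In a script, a mismatch like this can be detected before or while building the data frame. The following sketch assumes the two vectors from Listing 2; same_length and result are hypothetical names.

```r
numbers <- c(1, 2, 3, 4, 5)
names <- c("Anna", "Rudolf", "Edith", "Jason", "Maria", "Henry")

# Compare the lengths before building the data frame.
same_length <- length(numbers) == length(names)

# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  data.frame(numbers, names),
  error = function(e) conditionMessage(e)
)
```

Here result holds the error message as a character string rather than a data frame.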
Strictly speaking, lists are also vectors, but they are recursive vectors. Any conceivable object can be a component of a list, even other lists. The list data structure in R is readily compared with a dictionary (dict) in Python or a structure (struct) in C. Lists in Python, on the other hand, are more similar to vectors in R, except that Python lists can contain different data types, whereas the elements of an R vector must all be of the same type.
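A short sketch of a list with named components, including a nested list. The object person and all of its contents are hypothetical values chosen only for illustration.

```r
person <- list(
  name = "Anna",
  scores = c(1, 2, 3),
  address = list(city = "Berlin", zip = "10115")
)

n <- person$name                 # access a component by name
s <- person[["scores"]]          # double brackets return the component itself
sub <- person["name"]            # single brackets return a list of length 1
city <- person$address$city      # components of nested lists can be chained
```

The distinction between single and double brackets matters in practice: Single brackets always return a list, whereas double brackets return the component stored inside it.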