Pandas: Data analysis with Python
Data Panda
In an age when laptops are more powerful and offer more features than high-performance servers of only a few years ago, whole groups of developers are discovering new opportunities in their data. However, companies without a large development department still lack the manpower to develop their own software and tailor it to suit their data. The pandas [1] Python library provides pre-built methods for many applications.
Panda Analysis
The Pandas acronym comes from a combination of panel data, an econometric term, and Python data analysis. It targets five typical steps in the processing and analysis of data, regardless of the data origin: load, prepare, manipulate, model, and analyze.
The tools supplied by Pandas save time when loading data. The library can read records in CSV (comma-separated values), Excel, HDF, SQL, JSON, HTML, and Stata formats. Pandas places much emphasis on flexibility, for example, in handling disparate cell separators. Moreover, it reads directly from the clipboard or loads Python objects serialized in files by the Python pickle module.
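As a minimal sketch of the separator handling described above, the following reads a small CSV snippet that uses semicolons instead of commas. The sample data and column names are made up for illustration; an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical sample data; a semicolon serves as the cell separator.
raw = "name;score\nalice;90\nbob;85\n"

# sep tells read_csv which delimiter to expect (the default is a comma).
df = pd.read_csv(io.StringIO(raw), sep=";")

print(df.shape)    # (2, 2)
print(df.columns.tolist())  # ['name', 'score']
```

In practice, a file path would be passed in place of the StringIO buffer.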
The preparation of the loaded data then follows. Records with erroneous entries are deleted or set to default values; the data is also normalized, grouped, sorted, transformed, and otherwise adapted for further processing. This preparatory work usually involves labor-intensive activities that are very much worth standardizing before you start interpreting the content.
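The preparation steps might look roughly like the following sketch. The DataFrame contents and the column names city and sales are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical records; one sales value is missing.
df = pd.DataFrame({
    "city": ["Berlin", "Munich", "Berlin", "Munich"],
    "sales": [10.0, None, 30.0, 20.0],
})

cleaned = df.dropna()                   # delete records with missing entries
filled = df.fillna({"sales": 0.0})      # or set them to a default value
ordered = filled.sort_values("sales")   # sort by a column
totals = filled.groupby("city")["sales"].sum()  # group and aggregate

print(len(cleaned))       # 3
print(totals["Berlin"])   # 40.0
```

Whether to delete or fill erroneous records depends on the data; the sketch simply shows both options side by side.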
The interesting part of the Big Data business starts next: computing statistical models that, for example, use algorithms from the field of machine learning to predict future input.
NumPy Arrays
For a long time, the main disadvantage of interpreted languages like Python was the lack of speed when dealing with large volumes of data and complex mathematical operations. The Python NumPy (Numerical Python) [2] library in particular takes the wind out of the sails of this allegation. It stores its data efficiently in memory and relies on precompiled C code for its core operations.
The most important data structure in NumPy is the N-dimensional array, ndarray. In the one-dimensional case, ndarrays are vectors. Unlike Python lists, NumPy arrays have a fixed size, and their elements share a single type that is set during initialization (floating-point numbers by default for creation functions such as np.zeros()).
The internal structure of the array allows the computation of vector and matrix operations at considerably higher speeds than in a native Python implementation.
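To illustrate the difference in approach (not to benchmark it), the following sketch computes the same sum of squares once with a Python generator expression and once with a single NumPy call. The input range is arbitrary.

```python
import numpy as np

values = list(range(1000))

# Pure Python: the interpreter handles every element individually.
loop_total = sum(v * v for v in values)

# NumPy: one call that runs in compiled code over the whole array.
arr = np.array(values, dtype=np.int64)
vector_total = int(np.dot(arr, arr))

print(loop_total == vector_total)  # True
```

Both variants return the same result; the speed advantage of the NumPy version grows with the size of the array.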
The easiest approach is to generate NumPy arrays from existing Python lists:
np.array([1, 2, 3])
Here, np is the conventional (but not mandatory) alias for the NumPy module, which is imported using:
import numpy as np
Multidimensional matrices are created in a similar way, that is, with nested lists:
np.array([[1, 2, 3], [4, 5, 6]])
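A brief sketch of how the nesting of the lists maps to the dimensions of the resulting array; the variable name matrix is chosen for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.ndim)    # 2 (number of dimensions)
print(matrix.shape)   # (2, 3): two rows, three columns
print(matrix[1, 2])   # 6 (second row, third column)
```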
If the content is still unknown when you create an array, np.zeros() generates a zero-filled structure of a predetermined size. The argument used here is an integer tuple in which each entry represents the length of one array dimension. For one-dimensional arrays, a simple integer suffices:
array2d = np.zeros((5, 5))
array1d = np.zeros(5)
If you prefer 1 as the initial element, you can create an array in the same way using np.ones().
The use of np.empty() is slightly faster because it does not initialize the resulting data structure with content. The result therefore contains whatever values happen to be at the memory locations used. These leftover values are not suitable for use as random numbers.
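The three creation functions can be compared in a short sketch. Because the content of an np.empty() array is undefined, the example fills it before reading from it; the shapes and the fill value are arbitrary.

```python
import numpy as np

zeros2d = np.zeros((2, 3))   # 2x3 array filled with 0.0
ones1d = np.ones(4)          # four elements, all 1.0
scratch = np.empty((2, 3))   # contents undefined until written

scratch[:] = 7.0             # fill every element before using the array

print(zeros2d.dtype)         # float64
print(ones1d.sum())          # 4.0
```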
The syntax of np.empty() is the same as that of np.zeros() and np.ones(). All three functions also have a counterpart with the suffix _like (e.g., np.zeros_like()). These functions take an existing array as an argument, copy its shape, and create a new data structure of the same dimensions with the desired initial values.
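A minimal sketch of the _like variants follows; the template array is an arbitrary example. Note that, unless told otherwise, these functions also adopt the data type of the template, so the results here are integer arrays.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])

zeros = np.zeros_like(template)   # same shape and dtype, filled with 0
ones = np.ones_like(template)     # same shape and dtype, filled with 1

print(zeros.shape)                    # (2, 3)
print(zeros.dtype == template.dtype)  # True
```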
The functions mentioned also accept an optional dtype argument. As a value, it expects a NumPy data type (e.g., np.int32, np.bytes_, or np.bool_), which it assigns to the resulting array instead of the standard floating-point number. In the case of np.empty(), this again results in arbitrary content.
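The effect of the dtype argument can be sketched as follows; the variable names are illustrative only.

```python
import numpy as np

counts = np.zeros(3, dtype=np.int32)   # integer zeros instead of 0.0
flags = np.ones(3, dtype=np.bool_)     # every element is True

print(counts.dtype)    # int32
print(flags.all())     # True
```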
Finally, the NumPy arange() function works much like the Python range() function. If you specify a single integer argument, it creates an array of that length, initializing the values with a sequence that starts at 0 and increases in steps of 1:
In: np.arange(3)
Out: array([0, 1, 2])
The arange() function optionally takes additional arguments, like its Python counterpart range(). With two arguments, the first is the start value of the sequence and the second is the end value, which is itself excluded from the result. A third argument optionally changes the step size. For example, use
In: np.arange(3, 10, 2)
Out: array([3, 5, 7, 9])
to generate a sequence from 3 up to (but not including) 10 with a step size of 2.
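The exclusive end value is easiest to see by varying it. The sketch below also shows that, unlike range(), arange() accepts a non-integer step; the specific values are chosen for illustration.

```python
import numpy as np

a = np.arange(3, 10, 2)     # end value 10 is not reached
b = np.arange(3, 9, 2)      # end value 9 is excluded
c = np.arange(0, 1, 0.25)   # floating-point step size

print(a.tolist())   # [3, 5, 7, 9]
print(b.tolist())   # [3, 5, 7]
print(c.tolist())   # [0.0, 0.25, 0.5, 0.75]
```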
Basic Arithmetic Operations
NumPy lets you apply many operations to all elements of an array without having to write Python-style loops. The familiar mathematical operators are used (e.g., + for simple addition). The basic rule is that, for two arrays of the same shape, the operator combines the elements at the same position in both arrays; if you add a scalar (i.e., a number) to an array, however, NumPy adds that number to each array element:
In: np.array([1,2,3]) + np.array([3,2,1])
Out: array([4, 4, 4])
In: np.array([1,2,3]) + 1
Out: array([2, 3, 4])
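For contrast, the following sketch shows how the same + operator behaves on plain Python lists, where it concatenates rather than adding element by element.

```python
import numpy as np

list_result = [1, 2, 3] + [3, 2, 1]                      # concatenation
array_result = np.array([1, 2, 3]) + np.array([3, 2, 1]) # elementwise sum

print(list_result)            # [1, 2, 3, 3, 2, 1]
print(array_result.tolist())  # [4, 4, 4]
```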
Multiplication, division, subtraction, and power calculations with ** work in the same way. Additionally, NumPy has some universal functions for further calculations, such as sqrt() and square(), which compute the square root and the square, respectively, of each element of an array.
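The remaining operators and the two universal functions can be tried out in a short sketch. The input arrays are arbitrary, and the results are converted to lists for compact output.

```python
import numpy as np

a = np.array([1, 4, 9])
b = np.array([1, 2, 3])

print((a - b).tolist())        # [0, 2, 6]
print((a * b).tolist())        # [1, 8, 27]
print((a / b).tolist())        # [1.0, 2.0, 3.0]
print((b ** 2).tolist())       # [1, 4, 9]
print(np.sqrt(a).tolist())     # [1.0, 2.0, 3.0]
print(np.square(b).tolist())   # [1, 4, 9]
```

Division with / always yields floating-point results, even when both input arrays contain integers.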