Pandas: Data analysis with Python

Data Panda

List and Dictionary Methods

Access to the elements of a NumPy array works with indexes and slices, as in Python. The first element is returned by array[0], whereas array[:2] returns the first two. For multidimensional arrays, a comma-separated argument list addresses the individual dimensions, as in array[0,2]. Here, too, slices let you extract ranges.
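The indexing rules above can be tried out with a short sketch. The array contents and the variable names (array, matrix, block) are made up for illustration only:

```python
import numpy as np

# A small one-dimensional array (values chosen only for illustration)
array = np.array([10, 20, 30, 40])

first = array[0]        # the first element
first_two = array[:2]   # a slice containing the first two elements

# A two-dimensional array: one index per dimension, separated by a comma
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

element = matrix[0, 2]  # row 0, column 2
block = matrix[:, 1:]   # all rows, columns 1 onward

print(first, first_two, element)
print(block)
```

Note that a slice such as array[:2] returns an array, whereas a single index such as array[0] returns a scalar.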

In addition to the list functions, NumPy also supports set operations. The unique() function returns only the distinct elements of an array and thus, in practice, creates a set. Intersections and unions of one-dimensional arrays are available through intersect1d() and union1d().
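A minimal sketch of these set operations, assuming two small example arrays a and b that are not part of the original text:

```python
import numpy as np

# Example data (hypothetical): a contains duplicates, b overlaps partly with a
a = np.array([3, 1, 2, 3, 1])
b = np.array([2, 3, 4])

distinct = np.unique(a)         # sorted distinct values of a
common = np.intersect1d(a, b)   # values that occur in both arrays
combined = np.union1d(a, b)     # values that occur in either array

print(distinct, common, combined)
```

All three functions return sorted arrays, so the original order of the elements is not preserved.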

Serial Pandas

Pandas introduces further data structures that are based, directly or indirectly (as of Pandas v0.13 [3]), on NumPy arrays and that combine the efficiency of NumPy with ease of use. First up is the Series object, a one-dimensional NumPy array with some additional methods and attributes. Creating a Series object is much like creating a NumPy array:

s = pd.Series([1, 2, 3])

One of the enhancements compared with NumPy arrays is the index that every Series object carries. If the index is not defined explicitly, it consists of consecutive integers. The index labels can also be strings or any other data type:

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

Now you can retrieve the elements much like a Python dictionary (e.g., with s['a']). Pandas takes this into account and allows the initialization of a Series object directly from a Python dictionary:

Series({'a': 1, 'b': 2, 'c': 3})

In this use case, too, you can pass in a separate list as the index argument, so that only those dictionary elements whose keys appear in the index make their way into the resulting Series object. Conversely, Pandas initializes index labels that are missing from the dictionary as not available (NaN). In the following case, the entry for 'd' is missing from the result, whereas 'c' is initialized without a value:

In: Series({'a': 1, 'b': 2, 'd': 4}, index=['a', 'b', 'c'])
Out:
a     1
b     2
c   NaN
dtype: float64
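The same behavior can be checked programmatically. This sketch reuses the dictionary from the example above and assumes a recent Pandas release imported under the usual pd alias:

```python
import pandas as pd

# Same data as in the example: 'd' is not in the index, 'c' is not in the dictionary
data = {'a': 1, 'b': 2, 'd': 4}
s = pd.Series(data, index=['a', 'b', 'c'])

print(s.index.tolist())   # the labels come from the index argument
print(pd.isna(s['c']))    # 'c' has no value in the dictionary
print('d' in s)           # membership tests check the index labels
print(s.dtype)            # the NaN entry forces a floating-point dtype
```

Because NaN is a floating-point value, the integer values from the dictionary are converted to floats, which is why the output above reports dtype float64.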

Multiple index levels are also allowed: Instead of a list of single elements, you pass a list of tuples to the index argument, and the elements of each tuple form the labels on the individual levels. In practice, such structures are mainly used to group records by a first index level, whereas the second level then uniquely identifies the elements within each group.
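A two-level index might look like the following sketch. The labels are hypothetical, and the index is built explicitly with pd.MultiIndex.from_tuples() rather than by passing the tuples straight to the index argument, so that the intent is unambiguous:

```python
import pandas as pd

# Hypothetical labels: the first level groups the records,
# the second level identifies an element within its group
labels = [('fruit', 'apple'), ('fruit', 'pear'), ('veg', 'leek')]
index = pd.MultiIndex.from_tuples(labels)
s = pd.Series([1, 2, 3], index=index)

fruit = s.loc['fruit']            # all elements of the group 'fruit'
pear = s.loc[('fruit', 'pear')]   # a single element, addressed by both levels

print(fruit)
print(pear)
```

Selecting by the first level alone returns a Series that is indexed by the remaining second level.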

Indexes are separate Pandas data objects that are generally immutable. However, the reindex() method lets you replace them; it returns a new object that conforms to the new index. Like the index argument used when initializing a Series, it accepts a list of labels.

Again, Pandas pads nonexistent values with NaN and removes values whose labels no longer exist in the new index. Instead of NaN, you can use the fill_value argument to specify a different default value. To fill empty rows with 0, use:

s.reindex(['d', 'e', 'f'], fill_value=0)

Here, s refers to a previously created Series object.
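To see both effects of reindex() at once, the following sketch uses a new index that overlaps only partly with the old one. The labels are chosen for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# 'a' is dropped because it is not in the new index; 'd' is new
padded = s.reindex(['b', 'c', 'd'])                 # 'd' becomes NaN
filled = s.reindex(['b', 'c', 'd'], fill_value=0)   # 'd' becomes 0

print(padded)
print(filled)
print(s)   # the original Series is unchanged
```

Because reindex() returns a new object, the result has to be assigned to a name if you want to keep working with it.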

Framed

Pandas uses the DataFrame class to implement two-dimensional structures. A DataFrame object is initialized in much the same way as a Series, by defining the columns via a dictionary in which each key is a column name and each value is a list of that column's elements:

DataFrame({'a': [1, 2], 'b': [3, 4]})

An optional index list determines the indices, as for a Series.

The DataFrame constructor also takes an optional columns argument, which works like an index but defines the column names instead of the rows:

In: DataFrame({'a': [1, 2], 'b': [3, 4]}, columns=['a', 'c'], index=['top', 'bottom'])
Out:
        a    c
top     1  NaN
bottom  2  NaN

Dictionary entries that are not in the columns list are dropped, and Pandas again initializes columns that are listed but undefined with NaN. Access to a column in a DataFrame is again via dataframe['a'], as for a dictionary. Additionally, the columns can be accessed as attributes of a DataFrame object: dataframe.a. If you instead want to address a row, the DataFrame attribute ix lets you do so: dataframe.ix['top']. (Later Pandas releases deprecated ix and removed it in version 1.0; the label-based equivalent there is dataframe.loc['top'].)
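The column and row access described above can be sketched as follows. The example reuses the DataFrame from the output above, but addresses the row with loc instead of ix, on the assumption that a current Pandas release (1.0 or later, where ix no longer exists) is installed:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]},
                  columns=['a', 'c'], index=['top', 'bottom'])

col_by_key = df['a']    # dictionary-style column access
col_by_attr = df.a      # attribute-style column access
row = df.loc['top']     # label-based row access (ix in the Pandas 0.13 era)

print(col_by_key.tolist())
print(row)
```

Attribute access only works for column names that are valid Python identifiers and do not collide with existing DataFrame attributes, so the dictionary style is the safer general choice.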

Like Series, the DataFrame object also supports the reindex() method. By default, it applies to the row labels, but the columns argument reindexes the columns in the same way. For both Series and DataFrame objects, the drop() method removes one or more rows. To remove a single row, you pass its index label as the argument; to delete multiple rows, you pass a list: s.drop(['b', 'c']).
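A brief sketch of drop() and of reindex() with the columns argument, using small example objects that follow the earlier examples (the column name 'z' is made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['top', 'bottom'])

one_less = s.drop('a')          # remove a single element by its label
two_less = s.drop(['b', 'c'])   # remove several elements at once
no_top = df.drop('top')         # remove a row from a DataFrame

# Reindexing the columns: 'a' is dropped, the unknown column 'z' is filled with NaN
recol = df.reindex(columns=['b', 'z'])

print(two_less)
print(no_top)
print(recol)
```

Like reindex(), drop() returns a new object and leaves the original Series or DataFrame untouched.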
