Pandas: Data analysis with Python
Data Panda
List and Dictionary Methods
Access to the elements of a NumPy array involves indexes and slices, as in Python. The first element is returned by array[0], whereas array[:2] returns the first two. For multidimensional arrays, a comma-separated argument list accesses the individual dimensions, such as array[0,2]. Again, slices allow the extraction of areas.
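The indexing rules above can be sketched briefly as follows. The array names and values are made up for illustration and do not come from the article.

```python
import numpy as np

# Hypothetical sample data for illustration
arr = np.array([10, 20, 30, 40])
first = arr[0]         # a single element: the first one
first_two = arr[:2]    # a slice: the first two elements

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
cell = grid[0, 2]      # row 0, column 2
block = grid[:, 1:]    # all rows, columns 1 onward
```

Note that a slice of a NumPy array is a view on the original data rather than a copy, so writing to first_two would also change arr.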
In addition to the list functions, NumPy also supports set operations. The unique() function returns only the distinct elements of an array and, in practice, creates a set. Intersections and unions of one-dimensional arrays are also at hand with intersect1d() and union1d().
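A minimal sketch of these set operations, using two small arrays invented for this example:

```python
import numpy as np

# Hypothetical sample data for illustration
a = np.array([3, 1, 2, 3, 1])
b = np.array([2, 3, 4])

distinct = np.unique(a)          # sorted distinct values of a
common = np.intersect1d(a, b)    # values present in both arrays
either = np.union1d(a, b)        # values present in at least one array
```

All three functions return sorted arrays, so the order of the input elements is not preserved.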
Serial Pandas
Pandas introduces other data structures based, directly or indirectly (Pandas v0.13 [3]), on NumPy arrays that combine the efficiency of NumPy with ease of handling. First up is the Series object, a one-dimensional NumPy array; however, it does have some additional methods and attributes. Creating a Series object is much like creating a NumPy array:
s = pd.Series([1, 2, 3])
One of the enhancements compared with NumPy arrays involves the indices that Series objects contain. If they are not defined explicitly, they exist as a sequence of consecutive numbers starting at 0. The indices can also be strings or any other data type:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
Now you can retrieve the elements much like a Python dictionary (e.g., with s['a']
). Pandas takes this into account and allows the initialization of a Series object directly from a Python dictionary:
pd.Series({'a': 1, 'b': 2, 'c': 3})
In this use case, too, you can pass in a list separately as an index argument so that only those elements that exist in the index make their way from the dictionary to the resulting Series object. Conversely, Pandas initializes values for index labels that are missing in the dictionary as nonexistent (NaN). In the following case, the entry for 'd' is missing from the results, whereas 'c' is initialized without a value.
In: pd.Series({'a': 1, 'b': 2, 'd': 4}, index=['a', 'b', 'c'])
Out:
a     1
b     2
c   NaN
dtype: float64
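The same behavior can be checked programmatically. This sketch reuses the dictionary from the listing above; the variable names are chosen here for illustration.

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'd': 4}

# Only labels listed in index survive; 'c' has no value in the dictionary
s = pd.Series(data, index=['a', 'b', 'c'])

has_d = 'd' in s.index        # False: 'd' was filtered out
c_missing = pd.isna(s['c'])   # True: 'c' was padded with NaN
```

Because NaN is a floating-point value, the integer inputs are converted and the resulting Series has dtype float64.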
Multilevel indices are also allowed: Instead of a list of single elements, you pass the index argument a list of tuples whose elements form the levels of the index. Such structures are mainly used in practice to group records based on a first index, whereas the second index then uniquely identifies the elements within such a group.
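As a sketch of such a two-level index, the following example builds the index explicitly with pd.MultiIndex.from_tuples() rather than relying on the constructor to convert the tuples. That choice, along with the labels, level names, and values, is an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical labels: (group, item) pairs
labels = [('x', 1), ('x', 2), ('y', 1)]
index = pd.MultiIndex.from_tuples(labels, names=['group', 'item'])

s = pd.Series([10, 20, 30], index=index)

group_x = s.loc['x']        # all records in group 'x', indexed by item
single = s.loc[('y', 1)]    # one record, identified by both levels
```

Selecting by the first level alone returns a smaller Series, whereas a full tuple picks out a single value.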
Indexes are separate Pandas data objects that are generally immutable. However, the reindex() method returns a new object with a replacement index. It accepts a list as an argument, just like the index argument when initializing a Series.
Again, Pandas pads nonexistent values with NaN and removes values that no longer exist in the new index. Instead of NaN, you can use the fill_value argument to specify a different default value. To fill empty rows with 0, use:
s.reindex(['d', 'e', 'f'], fill_value=0)
The s
indicates a previously generated Series object.
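To see padding and removal side by side, a sketch with a partly overlapping index may help. The Series and the new labels are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# 'a' is dropped, 'b' and 'c' are kept, 'd' is new
padded = s.reindex(['b', 'c', 'd'])                 # 'd' becomes NaN
filled = s.reindex(['b', 'c', 'd'], fill_value=0)   # 'd' becomes 0
```

reindex() returns a new Series in both calls; s itself keeps its original index.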
Framed
Pandas uses the DataFrame class to implement two-dimensional structures. The DataFrame object is again initialized in the same way as a Series, by defining the columns via a dictionary in which each key maps to a list of elements:
pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
An optional index
list determines the indices, as for a Series.
The DataFrame constructor also takes an optional columns argument, which works like index but defines the column labels instead of the row labels:
In: pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, columns=['a', 'c'], index=['top', 'bottom'])
Out:
        a    c
top     1  NaN
bottom  2  NaN
Columns that are not in the columns list are dropped. However, Pandas again initializes undefined columns with NaN. Access to a column in a DataFrame is again via dataframe['a'], as for a dictionary. Additionally, the columns can be accessed as attributes of a DataFrame object: dataframe.a. If you instead want to address a row, the DataFrame attribute ix lets you do so: dataframe.ix['top']. Because ix has been removed from later Pandas releases, dataframe.loc['top'] is the label-based equivalent there.
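The access paths described above can be sketched as follows. The example uses loc for row access on the assumption that a current Pandas release is installed, where ix is no longer available; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]},
                  columns=['a', 'c'],
                  index=['top', 'bottom'])

col_by_key = df['a']     # dictionary-style column access
col_by_attr = df.a       # attribute-style access to the same column
row = df.loc['top']      # label-based row access
```

Attribute access only works for column names that are valid Python identifiers and do not collide with existing DataFrame attributes, so the dictionary style is the more general of the two.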
Like Series, the DataFrame object also supports the reindex() method. By default, it operates on the row labels, but the columns argument reindexes the columns in the same way. For both Series and DataFrame objects, the drop() method removes one or more rows. To remove a single row, you state the desired index label as an argument. A list is used to delete multiple rows: s.drop(['b', 'c']).
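A short sketch of both methods follows, using a small Series and DataFrame invented for this example:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s_small = s.drop(['b', 'c'])      # remove several rows by label

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['top', 'bottom'])
df_rows = df.drop('top')          # remove a single row by label

# Reindex the columns: 'a' is dropped, 'b' is kept, 'c' is padded with NaN
df_cols = df.reindex(columns=['b', 'c'])
```

Both drop() and reindex() return new objects here, so s and df remain unchanged.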