Pandas: Data analysis with Python
Data Panda
Files
In the real world of Big Data, the data to be analyzed does not usually originate with the application that analyzes it. Pandas thus comes with some auxiliary functions that read popular file formats and transfer their contents directly into Pandas data structures: read_csv(), read_table(), and read_fwf(). Figure 1 shows an example of a session with the advanced Python shell, IPython [4], and a call to read_csv(); Figure 2 shows a curtailed record.

These methods expect data sources in tabular form (i.e., one record per line, with comma- or tab-separated cells). Arbitrary field separators can be defined with the sep argument, in the form of simple strings or regular expressions.
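For example, a semicolon-separated file could be read as follows; the file name data.csv and the separator are just assumptions for this sketch:

import pandas

# sep accepts plain strings as well as regular expressions
frame = pandas.read_csv('data.csv', sep=';')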
For read_fwf() only, fixed field widths remove the need for field separators; instead, you pass in either widths, a list of field widths stated as numbers of characters, or colspecs, a list of tuples stating the absolute start and end positions of each column. As the data source, the read methods always expect the first argument to be a file name, a path, or a URL.
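A minimal sketch of both variants, assuming a file data.txt with three columns of 10, 5, and 8 characters:

import pandas

# Variant 1: field widths as character counts
frame = pandas.read_fwf('data.txt', widths=[10, 5, 8])
# Variant 2: the same columns as (start, end) tuples
frame = pandas.read_fwf('data.txt', colspecs=[(0, 10), (10, 15), (15, 23)])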
By default, the Pandas read methods interpret the first line of a file as a header that contains the column names. If you set the argument header=None when calling the method, the first line becomes the first record. In this case, it makes sense to pass in the column names as a list using the names argument.
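A short sketch; the file name and column names are invented for the example:

import pandas

# Treat the first line as data and supply the column names manually
frame = pandas.read_csv('data.csv', header=None, names=['id', 'name', 'value'])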
To save memory and time when processing very large files, the iterator=True argument instructs all the read functions to do their reading chunkwise. Instead of returning the complete file contents, they then return a TextParser object. The size of the read chunks is specified by the chunksize argument; if this argument is set, Pandas automatically sets iterator to True. Using a TextParser, you can read and process the data chunk by chunk in a for loop. The get_chunk() method directly returns the next chunk of the file.
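A sketch of chunked processing, assuming a large file big.csv and an arbitrary chunk size of 1,000 records:

import pandas

parser = pandas.read_csv('big.csv', chunksize=1000)
for chunk in parser:
    # Each chunk arrives as an ordinary DataFrame
    print(len(chunk))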
The Series and DataFrame structures make it just as easy to write their contents to files. Both have a to_csv() method that expects the output file as an argument; if you instead specify sys.stdout, it writes the data directly to standard output. The default field separator is the comma, but you can declare an alternative with the sep argument.
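For example (the file name out.csv and the sample data are assumptions):

import sys
import pandas

s = pandas.Series([1, 2, 3])
s.to_csv('out.csv', sep=';')  # write to a file, semicolon-separated
s.to_csv(sys.stdout)          # or print the CSV output to the terminal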
Various Formats
Pandas can even process Excel files using the ExcelFile class. Its constructor expects the file path; the resulting ExcelFile object uses the parse() method to return DataFrame objects for the individual sheets:
excelfile = pandas.ExcelFile('file.xls')
dataframe = excelfile.parse('Sheet1')
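More recent Pandas versions also offer the read_excel() function, which combines both steps into a single call; a sketch with the same assumed file:

dataframe = pandas.read_excel('file.xls', 'Sheet1')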
If so desired, Pandas uses the pickle module to store objects on disk in binary format. Series and DataFrames, like all other Pandas structures, support the save() helper method for this; it simply expects the output file as an argument. Conversely, the Pandas load() function reads such a file and returns the corresponding object.
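In current Pandas releases, these helpers go by the names to_pickle() and read_pickle(); a minimal sketch with the newer names and an assumed file name:

import pandas

s = pandas.Series([1, 2, 3])
s.to_pickle('series.pkl')              # serialize the object to disk
s2 = pandas.read_pickle('series.pkl')  # read it back in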
The library also adds support for HDF5 (Hierarchical Data Format), which is used by the Matlab mathematics software, among other programs. Its advantage is that it can be read efficiently in chunks, even when using compression, which makes it particularly suitable for very large data sets.
Pandas uses the HDFStore class to read HDF5 files; the class constructor expects a file name. The resulting object can be read much like a dictionary:
hdf = HDFStore('file.h5')
s1 = hdf['s1']
These calls read the HDF file file.h5, whose data structure contains a Series object named s1, and store that object in the s1 variable.
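Writing works the same way in reverse; continuing the snippet above, assigning to a key stores an object in the file (the key s2 and the sample data are assumptions):

hdf['s2'] = pandas.Series([4, 5, 6])  # store a new Series under the key 's2'
hdf.close()                           # flush and close the HDF5 file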
Data!
After reading the data, Pandas provides numerous auxiliary functions for shaping it. First, merge() merges two DataFrame objects,

pandas.merge(dataframe1, dataframe2)

here, by default combining the two DataFrames on the basis of identically named columns. If you instead want to use a specific column to identify the records to be merged, you can use the on argument to specify the relevant name. This of course only works if both data frames contain a like-named column.
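A self-contained sketch with invented column names and toy data:

import pandas

df1 = pandas.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
df2 = pandas.DataFrame({'key': ['a', 'c'], 'y': [3, 4]})
merged = pandas.merge(df1, df2, on='key')  # join on the shared 'key' column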
Instead of merging records in two objects, concat() concatenates Series or DataFrames. In the simplest case, the concatenation of two Series objects creates a new Series object that lists all the entries of the two source objects in succession. Alternatively, the line

concat([series1, series2, series3], axis=1)

generates a DataFrame from multiple Series objects. In this example, the function concatenates the sources column-wise (axis=1) instead of row by row (axis=0).
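A self-contained version of the call above, with invented sample data:

import pandas

series1 = pandas.Series([1, 2])
series2 = pandas.Series([3, 4])
series3 = pandas.Series([5, 6])
# Column-wise concatenation yields a DataFrame with three columns
frame = pandas.concat([series1, series2, series3], axis=1)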
SQL database users will recognize the merge() functionality from joins. By default, the inner method is used; it generates an intersection of the keys involved. Alternatively, the how argument selects the outer (union), left, or right method. With left and right, the result of a merge contains only the keys of the left or right source object, respectively.
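Reusing the toy frames df1 and df2 from the merge sketch above:

pandas.merge(df1, df2, on='key', how='outer')  # union of both key sets
pandas.merge(df1, df2, on='key', how='left')   # only keys from df1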