By Jay Parmar

Pandas is a Python library to deal with sequential and tabular data. It includes many tools to manage, analyze and manipulate data in a convenient and efficient manner. We can think of its data structures as akin to database tables or spreadsheets.

Pandas is built on top of the Numpy library and has two primary data structures viz. Series (1-dimensional) and DataFrame (2-dimensional). It can handle both homogeneous and heterogeneous data, and some of its many capabilities are:

ETL tools (Extraction, Transformation and Load tools)
Dealing with missing data (NaN)
Dealing with data files (csv, xls, db, hdf5, etc.)
Time-series manipulation tools

In the Python ecosystem, Pandas is the best choice to retrieve, manipulate, analyze and transform financial data.

In this Python Pandas tutorial, we will learn the basics of the Python Pandas module as well as understand some of the codes. This blog article is an excerpt from the Python Basics Handbook created for the simple purpose of making the reader understand the beauty and simplicity of the Python language.

The contents of this article include:

Installing Python Pandas
What problem does Python Pandas solve?
Python Pandas Series
Python Pandas DataFrame
Importing data in Python Pandas
Indexing and Subsetting
Manipulating a Python Pandas DataFrame
Statistical Exploratory data analysis
Filtering Python Pandas DataFrame
Iterating Python Pandas DataFrame
Merge, Append and Concat Python Pandas DataFrame
Time-Series in Pandas

Installing Python Pandas

In this section of the Python Pandas tutorial, we will start by going through the official documentation which has a detailed explanation here on installing Pandas. We summarize it below.

With pip

The simplest way to install Pandas is from PyPI.

In a terminal window, run the following command.

pip install pandas

In your code, you can use the escape character '!' to install pandas directly from your Python console.

!pip install pandas

Pip is a useful tool to manage Python's packages and it is worth investing some time in knowing it better.

pip help

With Conda environments

For advanced users, who like to work with Python environments for each project, you can create a new environment and install pandas as shown below.

conda create -n EPAT python
source activate EPAT
conda install pandas

Testing Pandas installation

To check the installation, Pandas comes with a test suite to test almost all of the codebase and verify that everything is working.

import pandas as pd
pd.test()

What problem does Python Pandas solve?

In this section of the Python Pandas tutorial, we will understand the use of Python Pandas.

Python Pandas works with homogeneous data series (1-Dimension) and heterogeneous tabular data series (2-Dimensions). It includes a multitude of tools to work with these data types, such as:

Indexes and labels.
Searching of elements.
Insertion, deletion and modification of elements.
Apply set techniques, such as grouping, joining, selecting, etc.
Data processing and cleaning.
Work with time series.
Make statistical calculations
Draw graphics
Connectors for multiple data file formats, such as, csv, xlsx, hdf5, etc.

Pandas Documentation: 10 minutes with Pandas

Python Pandas Series

The first data structure we will go through in the Python Pandas tutorial is the Series.

Python Pandas Series are homogeneous one-dimensional objects, that is, all data are of the same type and are implicitly labelled with an index.

For example, we can have a Series of integers, real numbers, characters, strings, dictionaries, etc. We can conveniently manipulate these series performing operations like adding, deleting, ordering, joining, filtering, vectorized operations, statistical analysis, plotting, etc.

Let's see some examples of how to create and manipulate a Python Pandas Series:

We will start by creating an empty Python Pandas Series:

import pandas as pd
s = pd.Series()
print(s)

Out[]: 

Series([], dtype: float64)

Let's create a Python Pandas Series of integers:

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5, 6, 7])
print(s)

Out[]:
        0    1
        1    2
        2    3
        3    4
        4    5
        5    6
        6    7
        dtype: int64

Let's create a Python Pandas Series of characters:

import pandas as pd
s = pd.Series(['a', 'b', 'c', 'd', 'e'])
print(s)

Out[]: 
        0    1
        1    2
        2    3
        3    4
        4    5
        5    6
        6    7
        dtype: int64

Let's create a random Python Pandas Series of float numbers:

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print(s)

Out[]:  
        0    0.383567
        1    0.869761
        2    1.100957
        3   -0.259689
        4    0.704537
        dtype: float64

In all these examples we saw in the Python Pandas tutorial, we have allowed the index label to appear by default (without explicitly programming it). It starts at 0, and we can check the index as:

In[]: 

s.index

Out[]: 

RangeIndex(start=0, stop=5, step=1)

But we can also specify the index we need, for example:

In[]: 

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

Out[]:
      a    1.392051
      b    0.515690
      c   -0.432243
      d   -0.803225
      e    0.832119
      dtype: float64

Let's create a Python Pandas Series from a dictionary:

import pandas as pd
dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
s = pd.Series(dictionary)
print(s)

Out[]:

        a    1
        b    2
        c    3
        d    4
        e    5
        dtype: int64

In this case, the Python Pandas Series is created with the dictionary keys as index unless we specify any other index.

Simple operations with Python Pandas Series

When we have a Python Pandas Series, we can perform several simple operations on it.

For example, in this section of the Python Pandas tutorial, let's create two Series. One from a dictionary and the other from an array of integers:

In[]: 
      import pandas as pd
      dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
      s1 = pd.Series(dictionary)

      array = [1, 2, 3, 4, 5]
      s2 = pd.Series(array)


Out[]:  
        a    1
        b    2
        c    3
        d    4
        e    5
        dtype: int64

        0    1
        1    2
        2    3
        3    4
        4    5
        dtype: int64

We can perform operations similar to Numpy arrays:

Selecting one item from the Python Pandas Series by means of its index:

In[]: 
s1[0] # Select the first element

Out[]: 
1

In[]: 
s1['a'] 

Out[]: 
1

In[]: 
s2[0] 

Out[]: 
1

Selecting several items from the Python Pandas Series by means of its index:

In[]: 
s1[[1,4]]

Out[]:
        b    2
        e    5
        dtype: int64

In[]:
s1[['b', 'e']]

Out[]:  
        b    2
        e    5
        dtype: int64

In[]: 
s2[[1,4]]

Out[]:  
        b    2
        e    5
        dtype: int64

Get the Python Pandas series starting from an element:

In[]: 
s1[2:]

Out[]:  
        c    3
        d    4
        e    5
        dtype: int64

In[]: 
s2[2:]

Out[]:  
        2    3
        3    4
        4    5
        dtype: int64

Get the Python Pandas series up to one element:

In[]: 
s1[:2]

Out[]:  
        c    3
        d    4
        e    5
        dtype: int64

In[]: 
s2[:2]

Out[]:  
        2    3
        3    4
        4    5
        dtype: int64

We can perform operations like a dictionary:

Assign a value:

In[]: 
      s1[1] = 99
      s1['a'] = 99

Out[]:  
        a     1
        b    99
        c     3
        d     4
        e     5
        dtype: int64

In[]: 
s2[1] = 99
print(s2)

Out[]:  
        0     1
        1    99
        2     3
        3     4
        4     5
        dtype: int64

Get a value by index (like dictionary key):

In[]: 
s.get('b')

Out[]: 
2

Here are some powerful vectorized operations that let us perform quickly calculations, for example:

Add, subtract, multiply, divide, power, and almost any NumPy function that accepts NumPy arrays.

s1 + 2
s1 - 2
s1 * 2
s1 / 2
s1 ** 2
np.exp(s1)

We can perform the same operations over two Python Pandas Series although these must be aligned, that is, to have the same index, in other cases, perform a Union operation.

In[]: 
s1 + s1 # The indices are aligned

Out[]:  
        a     2
        b     4
        c     6
        d     8
        e    10
        dtype: int64

In[]: 
s1 + s2 # The indices are unaligned

Out[]:  
        a   NaN
        b   NaN
        c   NaN
        d   NaN
        e   NaN
        0   NaN
        1   NaN
        2   NaN
        3   NaN
        4   NaN
        dtype: float64

Python Pandas DataFrame

The second data structure in Python Pandas that we are going to see is the DataFrame.

Python Pandas DataFrame is a heterogeneous two-dimensional object, that is, the data are of the same type within each column but it could be a different data type for each column and are implicitly or explicitly labelled with an index.

We can think of a Python Pandas DataFrame as a database table, in which we store heterogeneous data. For example, a Python Pandas DataFrame with one column for the first name, another for the last name and a third column for the phone number, or a Python Pandas dataframe with columns to store the opening price, close price, high, low, volume, and so on.

The index can be implicit, starting with zero or we can specify it ourselves, even working with dates and times as indexes as well.

Let's see some examples of how to create and manipulate a Python Pandas DataFrame.

Creating an empty Python Pandas DataFrame:

In[]: 
      import pandas as pd
      s = pd.DataFrame()
      print(s)

Out[]:  
        Empty DataFrame
        Columns: []
        Index: []

Creating an empty structure Python Pandas DataFrame:

In[]: 
      import pandas as pd
      s = pd.DataFrame(columns=['A','B','C','D','E'])
      print(s)

Out[]:  
        Empty DataFrame
        Columns: [A, B, C, D, E]
        Index: []

In[]: 
      import pandas as pd
      s = pd.DataFrame(columns=['A','B','C','D','E'], index=range(1, 6))
      print(s)

Out[]:
             A    B    C    D    E
        1  NaN  NaN  NaN  NaN  NaN
        2  NaN  NaN  NaN  NaN  NaN
        3  NaN  NaN  NaN  NaN  NaN
        4  NaN  NaN  NaN  NaN  NaN
        5  NaN  NaN  NaN  NaN  NaN

Creating a Python Pandas DataFrame passing a NumPy array:

In[]: 
      array =  {'A' : [1, 2, 3, 4],
                'B' : [4, 3, 2, 1]}

      pd.DataFrame(array)

Out[]: 
        A   B
      0 1   4
      1 2   3
      2 3   2
      3 4   1

Creating a Python Pandas DataFrame passing a NumPy array, with datetime index:

In[]: 
      import pandas as pd
      array =  {'A' : [1, 2, 3, 4],'B' : [4, 3, 2, 1]}
      index = pd.DatetimeIndex(['2018-12-01', '2018-12-02','2018-12-03', '2018-12-04'])
      pd.DataFrame(array, index=index)

Out[]:
                 A B
    2018-12-01  1   4
    2018-12-02  2   3
    2018-12-03  3   2
    2018-12-04  4   1

Creating a Python Pandas DataFrame passing a Dictionary:

In[]: 
      import pandas as pd
      dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
      pd.DataFrame([dictionary])

Out[]:
        a   b   c   d   e
      0 1   2   3   4   5

Viewing a Python Pandas DataFrame: We can use some methods to explore the Pandas DataFrame:

First, we go to create a Python Pandas DataFrame to work with it.

In[]: 
      import pandas as pd
      pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

Out[]:
            A         B           C
    0   0.164358    1.689183    1.745963
    1   -1.830385   0.035618    0.047832
    2   1.304339    2.236809    0.920484
    3   0.365616    1.877610    -0.287531
    4   -0.741372   -1.443922   -1.566839
    5   -0.119836   -1.249112   -0.134560
    6   -0.848425   -0.569149   -1.222911
    7   -1.172688   0.515443    1.492492
    8   0.765836    0.307303    0.788815
    9   0.761520    -0.409206   1.298350

Get the first three rows:

In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.head(3)

Out[]:
            A         B           C
    0   0.164358    1.689183    1.745963
    1   -1.830385   0.035618    0.047832
    2   1.304339    2.236809    0.920484

Get the last three rows:

In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.tail(3)

Out[]:
           A         B           C
    7   -1.172688   0.515443    1.492492
    8   0.765836    0.307303    0.788815
    9   0.761520    -0.409206   1.298350

Get the Python Pandas DataFrame's index:

In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.index

Out[]: 
RangeIndex(start=0, stop=10, step=1)

Get the Python Pandas DataFrame's columns:

In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.columns

Out[]: 
Index(['A', 'B', 'C'], dtype='object')

Get the Python Pandas DataFrame's values:

In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.values

Out[]: 
array([[ 0.6612966 , -0.60985049,  1.11955054],
       [-0.74105636,  1.42532491, -0.74883362],
       [ 0.10406892,  0.5511436 ,  2.63730671],
       [-0.73027121, -0.11088373, -0.19143175],
       [ 0.11676573,  0.27582786, -0.38271609],
       [ 0.51073858, -0.3313141 ,  0.20516165],
       [ 0.23917755,  0.55362   , -0.62717194],
       [ 0.25565784, -1.4960713 ,  0.58886377],
       [ 1.20284041,  0.21173483,  2.0331718 ],
       [ 0.62247283,  2.18407105,  0.02431867]])

Importing data in Python Pandas

Python Pandas DataFrame is able to read several data formats, some of the most used are: CSV, JSON, Excel, HDF5, SQL, etc.

CSV to DataFrame

One of the most useful functions is read_csv that allows us to read csv files with almost any format and load it into our Python Pandas DataFrame to work with it. Let's see how to work with csv files:

import pandas as pd
df=pd.read_csv('Filename.csv')
type(df)

Out[]: 
pandas.core.frame.DataFrame

This simple operation loads the csv file into the Python Pandas DataFrame after which we can explore it as we have seen before.

Customizing pandas import

Sometimes the format of the csv file comes with a particular separator or we need specific columns or rows. In this section of the Python Pandas tutorial, we will now see some ways to deal with this.

In this example, we want to load a csv file with blank space as separator:

import pandas as pd
df=pd.read_csv('Filename.csv', sep=' ')

In this example, we want to load columns from 0 and 5 and the first 100 rows:

import pandas as pd
df=pd.read_csv('Filename.csv', usecols=[0,1,2,3,4,5], nrows=100)

It's possible to customize the headers, convert the columns or rows names and carry out a good number of other operations.

Check the IO tools from Pandas documentation.

Importing sheets from Excel files

In the same way that we have worked with csv files, we can work with Excel file with the read_excel function, let's see some examples:

In this example, we want to load the sheet 1 from an Excel file:

import pandas as pd
df=pd.read_excel('Filename.xls', sheet_name='Sheet1')

This simple operation loads Sheet 1 from the Excel file into the Pandas DataFrame.

Indexing and Subsetting

Once we have the Python Pandas DataFrame prepared, independent of the source of our data (csv, Excel, hdf5, etc.) we can work with it, as if it were a database table, selecting the elements that interest us. We will work with some examples of how to index and extract subsets of data.

Let's begin with loading a csv file having details of a market instrument.

In[]: 
      import pandas as pd
      df=pd.read_csv('MSFT.csv' , usecols=[0,1,2,3,4,5])
      df.head()
      df.shape

Out[]:
          Date        Open  High     Low    Close   Volume
    0   2008-12-29  19.1500 19.21   18.64   18.96   58512800.0
    1   2008-12-30  19.0100 19.49   19.00   19.34   43224100.0
    2   2008-12-31  19.3100 19.68   19.27   19.44   46419000.0
    3   2009-01-02  19.5328 20.40   19.37   20.33   50084000.0
    4   2009-01-05  20.2000 20.67   20.06   20.52   61475200.0

    (1000, 6)

Here, we have read a csv file, of which we only need the columns of date, opening, closing, high, low and volume (the first 6 columns) and we check the form of the DataFrame that has 1000 rows and 6 columns.

Selecting a single column

In the previous code, we have read directly the first 6 columns from the csv file. This is a filter that we applied because we were only interested in those columns.

We can apply selection filters to the Python Pandas DataFrame itself, to select one column to work with. For example, we could need the Close column:

In[]: 
close=df['Close']
close.head()

Out[]:
        Close
    0   18.96   
    1   19.34   
    2   19.44   
    3   20.33   
    4   20.52

Selecting multiple columns

We can select multiple columns too:

In[]: 
closevol=df[['Close', 'Volume']]
closevol.head()

Out[]:
        Close  Volume
    0   18.96   58512800.0
    1   19.34   43224100.0
    2   19.44   46419000.0
    3   20.33   50084000.0
    4   20.52   61475200.0

Selecting rows via [].

We can select a set of rows by index:

In[]: 
      import pandas as pd
      df=pd.read_csv('TSLA.csv' )

      df[100:110]

Out[]:

	Date	Open	High	Low	Close	Volume	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
100	2017-10-30	319.18	323.7800	317.25	320.08	4236029.0	1.0	319.18	323.7800	317.25	320.08	4236029.0	TSLA
101	2017-10-27	319.75	324.5900	316.66	320.87	6942493.0	1.0	319.75	324.5900	316.66	320.87	6942493.0	TSLA
102	2017-10-26	327.78	330.2300	323.20	326.17	4980316.0	1.0	327.78	330.2300	323.20	326.17	4980316.0	TSLA
103	2017-10-25	336.70	337.5000	323.56	325.84	8547764.0	1.0	336.70	337.5000	323.56	325.84	8547764.0	TSLA
104	2017-10-24	338.80	342.8000	336.16	337.34	4463807.0	1.0	338.80	342.8000	336.16	337.34	4463807.0	TSLA
105	2017-10-23	349.88	349.9500	336.25	337.02	5715817.0	1.0	349.88	349.9500	336.25	337.02	5715817.0	TSLA
106	2017-10-20	352.69	354.5500	344.34	345.10	4888221.0	1.0	352.69	354.5500	344.34	345.10	4888221.0	TSLA
107	2017-10-19	355.56	357.1465	348.20	351.81	5032884.0	1.0	355.56	357.1465	348.20	351.81	5032884.0	TSLA
108	2017-10-18	355.97	363.0000	354.13	359.65	4898808.0	1.0	355.97	363.0000	354.13	359.65	4898808.0	TSLA
109	2017-10-17	350.91	356.2200	350.07	355.75	3280670.0	1.0	350.91	356.2200	350.07	355.75	3280670.0	TSLA

Or we can select a set of rows and columns:

In[]: 
df[100:110][['Close', 'Volume']]

Out[]: 
        Close   Volume
100 320.08  4236029.0
101 320.87  6942493.0
102 326.17  4980316.0
103 325.84  8547764.0
104 337.34  4463807.0
105 337.02  5715817.0
106 345.10  4888221.0
107 351.81  5032884.0
108 359.65  4898808.0
109 355.75  3280670.0

Selecting via .loc[] (By label)

With df.loc we can do the same selections using labels:

To select a set of rows, we can code the following using the index number as label:

In[]: 
df.loc[100:110]

Out[]:

	Date	Open	High	Low	Close	Volume	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
100	2017-10-30	319.18	323.7800	317.25	320.08	4236029.0	1.0	319.18	323.7800	317.25	320.08	4236029.0	TSLA
101	2017-10-27	319.75	324.5900	316.66	320.87	6942493.0	1.0	319.75	324.5900	316.66	320.87	6942493.0	TSLA
102	2017-10-26	327.78	330.2300	323.20	326.17	4980316.0	1.0	327.78	330.2300	323.20	326.17	4980316.0	TSLA
103	2017-10-25	336.70	337.5000	323.56	325.84	8547764.0	1.0	336.70	337.5000	323.56	325.84	8547764.0	TSLA
104	2017-10-24	338.80	342.8000	336.16	337.34	4463807.0	1.0	338.80	342.8000	336.16	337.34	4463807.0	TSLA
105	2017-10-23	349.88	349.9500	336.25	337.02	5715817.0	1.0	349.88	349.9500	336.25	337.02	5715817.0	TSLA
106	2017-10-20	352.69	354.5500	344.34	345.10	4888221.0	1.0	352.69	354.5500	344.34	345.10	4888221.0	TSLA
107	2017-10-19	355.56	357.1465	348.20	351.81	5032884.0	1.0	355.56	357.1465	348.20	351.81	5032884.0	TSLA
108	2017-10-18	355.97	363.0000	354.13	359.65	4898808.0	1.0	355.97	363.0000	354.13	359.65	4898808.0	TSLA
109	2017-10-17	350.91	356.2200	350.07	355.75	3280670.0	1.0	350.91	356.2200	350.07	355.75	3280670.0	TSLA

Or we can select a set of rows and columns like before:

In[]: 
df.loc[100:110, ['Close', 'Volume']]

Out[]:
    Close   Volume
100 320.08  4236029.0
101 320.87  6942493.0
102 326.17  4980316.0
103 325.84  8547764.0
104 337.34  4463807.0
105 337.02  5715817.0
106 345.10  4888221.0
107 351.81  5032884.0
108 359.65  4898808.0
109 355.75  3280670.0
110 350.60  5353262.0

Selecting via .iloc[] (By position)

With df.iloc we can do the same selections using integer position:

In[]: 
df.iloc[100:110]

Out[]:

	Date	Open	High	Low	Close	Volume	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
100	2017-10-30	319.18	323.7800	317.25	320.08	4236029.0	1.0	319.18	323.7800	317.25	320.08	4236029.0	TSLA
101	2017-10-27	319.75	324.5900	316.66	320.87	6942493.0	1.0	319.75	324.5900	316.66	320.87	6942493.0	TSLA
102	2017-10-26	327.78	330.2300	323.20	326.17	4980316.0	1.0	327.78	330.2300	323.20	326.17	4980316.0	TSLA
103	2017-10-25	336.70	337.5000	323.56	325.84	8547764.0	1.0	336.70	337.5000	323.56	325.84	8547764.0	TSLA
104	2017-10-24	338.80	342.8000	336.16	337.34	4463807.0	1.0	338.80	342.8000	336.16	337.34	4463807.0	TSLA
105	2017-10-23	349.88	349.9500	336.25	337.02	5715817.0	1.0	349.88	349.9500	336.25	337.02	5715817.0	TSLA
106	2017-10-20	352.69	354.5500	344.34	345.10	4888221.0	1.0	352.69	354.5500	344.34	345.10	4888221.0	TSLA
107	2017-10-19	355.56	357.1465	348.20	351.81	5032884.0	1.0	355.56	357.1465	348.20	351.81	5032884.0	TSLA
108	2017-10-18	355.97	363.0000	354.13	359.65	4898808.0	1.0	355.97	363.0000	354.13	359.65	4898808.0	TSLA
109	2017-10-17	350.91	356.2200	350.07	355.75	3280670.0	1.0	350.91	356.2200	350.07	355.75	3280670.0	TSLA

In the last example, we used the index as an integer position rather than by label.

We can select a set of rows and columns like before:

In[]: 
df.iloc[100:110, [3, 4]]

Out[]:
      Low   Close
100 317.25  320.08
101 316.66  320.87
102 323.20  326.17
103 323.56  325.84
104 336.16  337.34
105 336.25  337.02
106 344.34  345.10
107 348.20  351.81
108 354.13  359.65
109 350.07  355.75

Boolean indexing

So far in the Python Pandas tutorial, we have sliced subsets of data by label or by position. Now let's see how to select data that meet some criteria. We do this with Boolean indexing. We can use the same criteria similar to what we have seen with Numpy arrays. We show you just two illustrative examples here. This is by no means enough to get comfortable with it and so would encourage you to check the documentation and further readings at the end of this chapter to learn more.

We can filter data that is greater (less) than a number.

In[]: 
df[df.Close > 110]

Out[]:

	Date	Open	High	Low	Close	Volume	ExDividend	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
0	2018-03-27	304.00	304.2700	277.1800	279.1800	13696168.0	0.0	1.0	304.00	304.2700	277.1800	279.1800	13696168.0	TSLA
1	2018-03-26	307.34	307.5900	291.3600	304.1800	8324639.0	0.0	1.0	307.34	307.5900	291.3600	304.1800	8324639.0	TSLA
2	2018-03-23	311.25	311.6100	300.4500	301.5400	6600538.0	0.0	1.0	311.25	311.6100	300.4500	301.5400	6600538.0	TSLA
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1080	2013-12-09	137.00	141.7000	134.2100	141.6000	9061500.0	0.0	1.0	137.00	141.7000	134.2100	141.6000	9061500.0	TSLA
1081	2013-12-06	141.51	142.4900	136.3000	137.3600	7909600.0	0.0	1.0	141.51	142.4900	136.3000	137.3600	7909600.0	TSLA
1082	2013-12-05	140.15	143.3500	139.5000	140.4800	9288400.0	0.0	1.0	140.15	143.3500	139.5000	140.4800	9288400.0	TSLA

1083 rows × 14 columns

In[]: 
df[(df['Close'] > 110) | (df['Close'] < 120)]

Out[]:
`

	Date	Open	High	Low	Close	Volume	ExDividend	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
0	2018-03-27	304.00	304.2700	277.1800	279.1800	13696168.0	0.0	1.0	304.00	304.2700	277.1800	279.1800	13696168.0	TSLA
1	2018-03-26	307.34	307.5900	291.3600	304.1800	8324639.0	0.0	1.0	307.34	307.5900	291.3600	304.1800	8324639.0	TSLA
2	2018-03-23	311.25	311.6100	300.4500	301.5400	6600538.0	0.0	1.0	311.25	311.6100	300.4500	301.5400	6600538.0	TSLA
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1080	2013-12-09	137.00	141.7000	134.2100	141.6000	9061500.0	0.0	1.0	137.00	141.7000	134.2100	141.6000	9061500.0	TSLA
1081	2013-12-06	141.51	142.4900	136.3000	137.3600	7909600.0	0.0	1.0	141.51	142.4900	136.3000	137.3600	7909600.0	TSLA
1082	2013-12-05	140.15	143.3500	139.5000	140.4800	9288400.0	0.0	1.0	140.15	143.3500	139.5000	140.4800	9288400.0	TSLA

1083 rows × 14 columns

Manipulating a Python Pandas DataFrame

When we are working with data, the most common structure is the DataFrame. We have seen how to create them, make selections and find data in the Python Pandas tutorial. We are now going to see how to manipulate the Python Pandas DataFrame to transform it into another Python Pandas DataFrame that has the form that our problem requires.

We'll see how to sort it, re-index it, eliminate unwanted (or spurious) data, add or remove columns and update values.

.T (Transposing)

The Python Pandas DataFrame transpose function T allows us to transpose the rows as columns, and logically the columns as rows:

In[]: import pandas as pd
      df=pd.read_csv('TSLA.csv' )
      df2=df[100:110][['Close', 'Volume']]

      df2.T

Out[]:

	100	101	102	103	104	105	106	107	108	109
Close	320.08	320.87	326.17	325.84	337.34	337.02	345.1	351.81	359.65	355.75
Volume	4236029.00	6942493.00	4980316.00	8547764.00	4463807.00	5715817.00	4888221.0	5032884.00	4898808.00	3280670.00

.sort_index()

When we are working with Python Pandas Dataframe it is usual to add or remove rows, order by columns, etc. That's why it's important to have a function that allows us to easily and comfortably sort the Python Pandas DataFrame by its index. We do this with the sort_index function of Pandas DataFrame.

In[]:
 df.sort_index()

Out[]:

	Date	Open	High	Low	Close	Volume	ExDividend	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
0	2018-03-27	304.00	304.2700	277.1800	279.1800	13696168.0	0.0	1.0	304.00	304.2700	277.1800	279.1800	13696168.0	TSLA
1	2018-03-26	307.34	307.5900	291.3600	304.1800	8324639.0	0.0	1.0	307.34	307.5900	291.3600	304.1800	8324639.0	TSLA
2	2018-03-23	311.25	311.6100	300.4500	301.5400	6600538.0	0.0	1.0	311.25	311.6100	300.4500	301.5400	6600538.0	TSLA
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

.sort_values()

Sometimes, we may be interested in sorting the Python Pandas DataFrame by some column or even with several columns as criteria. For example, sort the column by first names and the second criterion by last names. We do this with the sort_values function of Python Pandas DataFrame.

In[]: 
df.sort_values(by='Close')

Out[]:

	Date	Open	High	Low	Close	Volume	ExDividend	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
1081	2013-12-06	141.510	142.4900	136.3000	137.3600	7909600.0	0.0	1.0	141.510	142.4900	136.3000	137.3600	7909600.0	TSLA
1057	2014-01-13	145.780	147.0000	137.8200	139.3400	6316100.0	0.0	1.0	145.780	147.0000	137.8200	139.3400	6316100.0	TSLA
1078	2013-12-11	141.880	143.0500	139.4900	139.6500	7137800.0	0.0	1.0	141.880	143.0500	139.4900	139.6500	7137800.0	TSLA
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

In[]: 
df.sort_values(by=['Open', 'Close'])

Out[]:

	Date	Open	High	Low	Close	Volume	ExDividend	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
1080	2013-12-09	137.000	141.7000	134.2100	141.6000	9061500.0	0.0	1.0	137.000	141.7000	134.2100	141.6000	9061500.0	TSLA
1077	2013-12-12	139.700	148.2400	138.5300	147.4700	10767800.0	0.0	1.0	139.700	148.2400	138.5300	147.4700	10767800.0	TSLA
1079	2013-12-10	140.050	145.8700	139.8600	142.1900	10748200.0	0.0	1.0	140.050	145.8700	139.8600	142.1900	10748200.0	TSLA
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

.reindex()

The Python Pandas' reindex function let us realign the index of the Series or DataFrame, it's useful when we need to reorganize the index to meet some criteria.

For example, we can play with the Series or DataFrame that we create before to alter the original index. For example, when the index is a label, we can reorganize as we need:

In[]: import pandas as pd
      import numpy as np
      df = pd.DataFrame(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

      df

Out[]:

	0
a	-0.134133
b	-0.586051
c	1.179358
d	0.433142
e	-0.365686

Now, we can reorganize the index as follows:

In[]: 
df.reindex(['b', 'a', 'd', 'c', 'e'])

Out[]:

	0
b	-0.586051
a	-0.134133
d	0.433142
c	1.179358
e	-0.365686

When the index is numeric we can use the same function to order by hand the index:

In[]: 
      import pandas as pd
      import numpy as np
      df = pd.DataFrame(np.random.randn(5))
      df.reindex([4,3,2,1,0])

Out[]:

	0
4	1.058589
3	1.194400
2	-0.645806
1	0.836606
0	1.288102

Later in this section, we'll see how to work and reorganize date and time indices.

Adding a new column. For e.g. df[‘new_column’] = 1

Another interesting feature of Python Pandas DataFrames is the possibility of adding new columns to an existing DataFrame.

For example, we can add a new column to the random DataFrame that we have created before:

In[]: 
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5))

Out[]:

	0
0	0.238304
1	2.068558
2	1.015650
3	0.506208
4	0.214760

To add a new column, we only need to include the new column name in the DataFrame and assign an initialization value, or assign to the new column a Pandas Series or another column from other DataFrame.

In[]: 
df['new']=1
df

Out[]:

	0	new
0	0.238304	1
1	2.068558	1
2	1.015650	1
3	0.506208	1
4	0.214760	1

Delete existing column

Likewise, we can remove one or more columns from the Python Pandas DataFrame. Let's create a DataFrame with 5 rows and 4 columns with random values to delete one column.

In[]: 

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4))
df

Out[]:

	0	1	2	3
0	-1.171562	-0.086348	-1.971855	1.168017
1	-0.408317	-0.061397	-0.542212	-1.412755
2	-0.365539	-0.587147	1.494690	1.756105
3	0.642882	0.924202	0.517975	-0.914366
4	0.777869	-0.431151	-0.401093	0.145646

Now, we can delete the column that we specify by index or by the label if any:

In[]: 
del df[0]

Out[]:

	1	2	3
0	-0.086348	-1.971855	1.168017
1	-0.061397	-0.542212	-1.412755
2	-0.587147	1.494690	1.756105
3	0.924202	0.517975	-0.914366
4	-0.431151	-0.401093	0.145646

In[]: 
df['new']=1
df

Out[]:

	0	new
0	0.238304	1
1	2.068558	1
2	1.015650	1
3	0.506208	1

In[]: 
del df['new']

Out[]:

		0
	0	0.238304
	1	2.068558
	2	1.015650
	3	0.506208

.at[] (By label)

With at we can locate a specific value by row and column labels as follows:

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

df.at['a', 'A']

Out[]:

	A	B	C	D
a	0.996496	-0.165002	0.727912	0.564858
b	-0.388169	1.171039	-0.231934	-1.124595
c	-1.385129	0.299195	0.573570	-1.736860
d	1.222447	-0.312667	0.957139	-0.054156
e	1.188335	0.679921	1.508681	-0.677776

In[]: 
df.at['a', 'A']

Out[]: 

0.9964957014209125

It is possible to assign a new value with the same function too:

In[]: 
df.at['a', 'A'] = 0

Out[]:

	A	B	C	D
a	0.000000	-0.165002	0.727912	0.564858
b	-0.388169	1.171039	-0.231934	-1.124595
c	-1.385129	0.299195	0.573570	-1.736860
d	1.222447	-0.312667	0.957139	-0.054156
e	1.188335	0.679921	1.508681	-0.677776

.iat[] (By position)

With iat we can locate a specific value by row and column index as follows:

In[]: 
      import pandas as pd
      import numpy as np
      df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df)

      df.iat[0, 0]

Out[]: 
0.996496

It is possible to assign a new value with the same function too:

In[]: 
df.iat[0, 0] = 0

Out[]:

	A	B	C	D
a	0.000000	-0.165002	0.727912	0.564858
b	-0.388169	1.171039	-0.231934	-1.124595
c	-1.385129	0.299195	0.573570	-1.736860
d	1.222447	-0.312667	0.957139	-0.054156
e	1.188335	0.679921	1.508681	-0.677776

Conditional updating of values. For e.g. df[df > 0] = 1

Another useful function is to update values that meet some criteria, for example, update values whose values are greater than 0:

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5, 4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

df[df > 0] = 1
df

Out[]:

	A	B	C	D
a	1.000000	-0.082466	1.000000	-0.728372
b	-0.784404	-0.663096	-0.595112	1.000000
c	-1.460702	-1.072931	-0.761314	1.000000
d	1.000000	1.000000	1.000000	-0.302310
e	-0.488556	1.000000	-0.798716	-0.590920

We can also update the values of a specific column that meet some criteria, or even work with several columns as criteria and update a specific column.

In[]: 
df['A'][df['A'] < 0] = 1
print(df)

Out[]:

	A	B	C	D
a	1.0	-0.082466	1.000000	-0.728372
b	1.0	-0.663096	-0.595112	1.000000
c	1.0	-1.072931	-0.761314	1.000000
d	1.0	1.000000	1.000000	-0.302310
e	1.0	1.000000	-0.798716	-0.590920

In[]: 
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = 9
print(df)

Out[]:

	A	B	C	D
a	1.0	-0.082466	1.000000	-0.728372
b	9.0	-0.663096	-0.595112	1.000000
c	9.0	-1.072931	-0.761314	1.000000
d	1.0	1.000000	1.000000	-0.302310
e	1.0	1.000000	-0.798716	-0.590920

.dropna()

Occasionally, we may have a Python Pandas DataFrame that, for whatever reason, includes NA values. This type of values is usually problematic when we are making calculations or operations and must be treated properly before proceeding with them.

The easiest way to eliminate NA values is to remove the row that contains it.

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:

	A	B	C	D
a	1.272361	1.799535	-0.593619	1.152889
b	-0.318368	-0.190419	0.129420	1.551332
c	0.166951	1.669034	-1.653618	0.656313
d	0.219999	0.951074	0.442325	-0.170177
e	0.312319	-0.765930	-1.641234	-1.388924

In[]: 
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = np.nan
print(df)

Out[]:

	A	B	C	D
a	1.272361	1.799535	-0.593619	1.152889
b	-0.318368	-0.190419	0.129420	1.551332
c	0.166951	1.669034	-1.653618	0.656313
d	0.219999	0.951074	0.442325	-0.170177
e	NaN	-0.765930	-1.641234	-1.388924

In[]: 
df=df.dropna()
print(df)

Out[]:

	A	B	C	D
a	1.272361	1.799535	-0.593619	1.152889
b	-0.318368	-0.190419	0.129420	1.551332
c	0.166951	1.669034	-1.653618	0.656313
d	0.219999	0.951074	0.442325	-0.170177

Here we are deleting the whole row that has, in any of its columns, a NaN value, but we can also specify that it deletes the column that any of its values is NaN:

df=df.dropna(axis=1)
print(df)

We can specify if a single NaN value is enough to delete the row or column, or if the whole row or column must have NaN to delete it.

```python
df=df.dropna(how='all')
print(df)

.fillna()

With the previous function, we have seen how to eliminate a complete row or column that contains one or all the values to NaN, this operation can be a little drastic if we have valid values in the row or column.

For this, it is interesting to use the fillna function that substitutes the NaN values with some fixed value.

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:

	A	B	C	D
a	1.272361	1.799535	-0.593619	1.152889
b	-0.318368	-0.190419	0.129420	1.551332
c	0.166951	1.669034	-1.653618	0.656313
d	0.219999	0.951074	0.442325	-0.170177
e	0.312319	-0.765930	-1.641234	-1.388924

In[]: 
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = np.nan
print(df)

Out[]:

	A	B	C	D
a	1.272361	1.799535	-0.593619	1.152889
b	-0.318368	-0.190419	0.129420	1.551332
c	0.166951	1.669034	-1.653618	0.656313
d	0.219999	0.951074	0.442325	-0.170177
e	NaN	-0.765930	-1.641234	-1.388924

In[]: 
df=df.fillna(999)
print(df)

Out[]:

	A	B	C	D
a	1.272361	1.799535	-0.593619	1.152889
b	-0.318368	-0.190419	0.129420	1.551332
c	0.166951	1.669034	-1.653618	0.656313
d	0.219999	0.951074	0.442325	-0.170177
e	999	-0.765930	-1.641234	-1.388924

.apply()

apply is a very useful way to use functions or methods in a DataFrame without having to loop through it.

We can apply the Apply function to a Series or DataFrame to apply a function to all rows or columns of the DataFrame. Let's see some examples.

Suppose we are working with the randomly generated Python Pandas DataFrame and need to apply a function. In this example, for simplicity's sake, we're going to create a custom function to square a number.

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:
`

	A	B	C	D
a	-0.633249	-2.699088	0.574052	0.652742
b	0.060295	-0.150527	0.149123	-0.701216
c	-0.052515	0.469481	0.899180	-0.608409
d	-1.352912	0.103302	0.457878	-1.897170
e	0.088279	0.418317	-1.102989	0.582455

def square_number(number):
  return number**2

# Test the function
In[]: 
square_number(2)

Out[]: 
4

Now, let's use the custom function through Apply:

In[]: 
df.apply(square_number, axis=1)

Out[]:

	A	B	C	D
a	0.401005	7.285074	0.329536	0.426073
b	0.003636	0.022658	0.022238	0.491704
c	0.002758	0.220412	0.808524	0.370161
d	1.830372	0.010671	0.209652	3.599253
e	0.007793	0.174989	1.216586	0.339254

This method apply the function square_number to all rows of the DataFrame.

.shift()

The shift function allows us to move a row to the right or left and/or to move a column up or down. Let's look at some examples.

First, we are going to move the values of a column downwards:

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:

	A	B	C	D
a	-0.633249	-2.699088	0.574052	0.652742
b	0.060295	-0.150527	0.149123	-0.701216
c	-0.052515	0.469481	0.899180	-0.608409
d	-1.352912	0.103302	0.457878	-1.897170
e	0.088279	0.418317	-1.102989	0.582455

In[]: 
df['D'].shift(1)

Out[]:

	A	B	C	D
a	-0.633249	-2.699088	0.574052	NaN
b	0.060295	-0.150527	0.149123	0.652742
c	-0.052515	0.469481	0.899180	-0.701216
d	-1.352912	0.103302	0.457878	-0.608409
e	0.088279	0.418317	-1.102989	-1.897170

We are going to move the values of a column upwards

In[]: 
df['shift'] = df['D'].shift(-1)

Out[]:

	A	B	C	D
a	-0.633249	-2.699088	0.574052	-0.701216
b	0.060295	-0.150527	0.149123	-0.608409
c	-0.052515	0.469481	0.899180	-1.897170
d	-1.352912	0.103302	0.457878	0.582455
e	0.088279	0.418317	-1.102989	NaN

This is very useful for comparing the current value with the previous value.

Statistical Exploratory data analysis

Python Pandas DataFrame allows us to make some descriptive statistics calculations, which are very useful to make a first analysis of the data we are handling. Let's see some useful functions.

info()

It is a good practice to know the structure and format of our DataFrame, the Info function offers us just that:

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:

	A	B	C	D
a	-0.633249	-2.699088	0.574052	0.652742
b	0.060295	-0.150527	0.149123	-0.701216
c	-0.052515	0.469481	0.899180	-0.608409
d	-1.352912	0.103302	0.457878	-1.897170
e	0.088279	0.418317	-1.102989	0.582455

In[]: 
df.info()

Out[]:
       <class 'pandas.core.frame.DataFrame'>
      Index: 5 entries, a to e
      Data columns (total 5 columns):
      A        5 non-null float64
      B        5 non-null float64
      C        5 non-null float64
      D        5 non-null float64
      shift    4 non-null float64
      dtypes: float64(5)
      memory usage: 240.0+ bytes

describe()

We can obtain a statistical overview of the DataFrame with the `describe function, which gives us the mean, median, standard deviation, maximum, minimum, quartiles, etc. of each DataFrame column.

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:
`

	A	B	C	D
a	-0.633249	-2.699088	0.574052	0.652742
b	0.060295	-0.150527	0.149123	-0.701216
c	-0.052515	0.469481	0.899180	-0.608409
d	-1.352912	0.103302	0.457878	-1.897170
e	0.088279	0.418317	-1.102989	0.582455

In[]: 
df.describe()

Out[]:

	A	B	C	D
count	5.000000	5.000000	5.000000	5.000000
mean	-0.378020	-0.371703	0.195449	-0.394319
std	0.618681	1.325046	0.773876	1.054633
min	-1.352912	-2.699088	-1.102989	-1.897170
25%	-0.633249	-0.150527	0.149123	-0.701216
50%	-0.052515	0.103302	0.457878	-0.608409
75%	0.060295	0.418317	0.574052	0.582455
max	0.088279	0.469481	0.899180	0.652742

.value_counts()

The function value_counts counts the repeated values of the specified column:

In[]: 
df['A'].value_counts()

Out[]: 
       0.088279    1
      -0.052515    1
       0.060295    1
      -0.633249    1
      -1.352912    1
      Name: A, dtype: int64

.mean()

We can obtain the mean of a specific column or row by means of the mean function.

In[]:
df['A'].mean() # specifying a column

Out[]: 
-0.3780203497252693
`

In[]: 
df.mean() # by column
df.mean(axis=0) # by column

Out[]:
        A       -0.378020
        B       -0.371703
        C        0.195449
        D       -0.394319
        shift   -0.638513
        dtype: float64
`

In[]:
df.mean(axis=1) # by row

Out[]:
        a   -0.526386
        b    0.002084
        c    0.001304
        d   -0.659462
        e   -0.382222
        dtype: float64

.std()

We can obtain the standard deviation of a specific column or row by means of the std function.

In[]: 
df['A'].std() # specifying a column

Out[]: 
0.6186812554819784
`

In[]: 
df.std() # by column
df.std(axis=0) # by column

Out[]:
        A        0.618681
        B        1.325046
        C        0.773876
        D        1.054633
        shift    1.041857
        dtype: float64
`

In[]: 
df.std(axis=1) # by row

Out[]:  
        a    1.563475
        b    0.491499
        c    0.688032
        d    0.980517
        e    1.073244
        dtype: float64

Filtering Python Pandas DataFrame

We have already seen how to filter data in a Python Pandas DataFrame, including logical statements to filter rows or columns with some logical criteria. Let's remember some examples:

For example, we will filter rows whose column 'A' is greater than zero:

In[]: 
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:
`

	A	B	C	D
a	-0.633249	-2.699088	0.574052	0.652742
b	0.060295	-0.150527	0.149123	-0.701216
c	-0.052515	0.469481	0.899180	-0.608409
d	-1.352912	0.103302	0.457878	-1.897170
e	0.088279	0.418317	-1.102989	0.582455

In[]:
df_filtered = df[df['A'] > 0]
print(df_filtered)

Out[]:

	A	B	C	D
b	0.060295	-0.150527	0.149123	-0.701216
e	0.088279	0.418317	-1.102989	0.582455

We can also combine logical statements, we will filter all rows whose column 'A' and 'B' have their values greater than zero.

In[]: 
df_filtered = df[(df['A'] > 0) & (df['B'] > 0)] 
print(df_filtered)

Out[]:

	A	B	C	D
e	0.088279	0.418317	-1.102989	0.582455

Iterating Python Pandas DataFrame

We can go through the Python Pandas DataFrame row by row to do operations in each iteration, let's see some examples.

In[]: 
for item in df.iterrows():
          print(item)

Out[]:
('a', A       -0.633249
B       -2.699088
C        0.574052
D        0.652742
shift         NaN
Name: a, dtype: float64)
('b', A        0.060295
B       -0.150527
C        0.149123
D       -0.701216
shift    0.652742
Name: b, dtype: float64)
('c', A       -0.052515
B        0.469481
C        0.899180
D       -0.608409
shift   -0.701216
Name: c, dtype: float64)
('d', A       -1.352912
B        0.103302
C        0.457878
D       -1.897170
shift   -0.608409
Name: d, dtype: float64)
('e', A        0.088279
B        0.418317
C       -1.102989
D        0.582455
shift   -1.897170
Name: e, dtype: float64)

Merge, Append and Concat operation on a Python Pandas DataFrame

Another interesting feature of Python Pandas DataFrames is that we can merge, concatenate them and add new values, let's see how to do each of these operations in this section of Python Pandas tutorial.

merge function allows us to merge two Python Pandas DataFrame by rows:

In[]: 
      import numpy as np
      df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df1)

Out[]:
`

	A	B	C	D
a	1.179924	-1.512124	0.767557	0.019265
b	0.019969	-1.351649	0.665298	-0.989025
c	0.351921	-0.792914	0.455174	0.170751
d	-0.150499	0.151942	-0.628074	-0.347300
e	-1.307590	0.185759	0.175967	-0.170334

df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df2)
`

	A	B	C	D
a	2.030462	-0.337738	-0.894440	-0.757323
b	0.475807	1.350088	-0.514070	-0.843963
c	0.948164	-0.155052	-0.618893	1.319999
d	1.433736	-0.455008	1.445698	-1.051454
e	0.565345	1.802485	-0.167189	-0.227519

In[]: 
      df3 = pd.merge(df1, df2)
      print(df3)

Out[]:  
        Empty DataFrame
        Columns: [A, B, C, D]
        Index: []

append function allows us to append rows from one Python Pandas DataFrame to another Python Pandas DataFrame by rows:

In[] 
      import numpy as np
      df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df1)

      df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df2)

      df3 = df1.append(df2)
      print(df3)

Out[]

	A	B	C	D
a	1.179924	-1.512124	0.767557	0.019265
b	0.019969	-1.351649	0.665298	-0.989025
c	0.351921	-0.792914	0.455174	0.170751
d	-0.150499	0.151942	-0.628074	-0.347300
e	-1.307590	0.185759	0.175967	-0.170334
a	2.030462	-0.337738	-0.894440	-0.757323
b	0.475807	1.350088	-0.514070	-0.843963
c	0.948164	-0.155052	-0.618893	1.319999
d	1.433736	-0.455008	1.445698	-1.051454
e	0.565345	1.802485	-0.167189	-0.227519

concat function allows us to merge two Python Pandas DataFrame by rows or columns:

In[]: 
      import numpy as np
      df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df1)

      df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df2)

      df3 = pd.concat([df1, df2]) # concat by row
      print(df3)

Out[]:
`

	A	B	C	D
a	1.179924	-1.512124	0.767557	0.019265
b	0.019969	-1.351649	0.665298	-0.989025
c	0.351921	-0.792914	0.455174	0.170751
d	-0.150499	0.151942	-0.628074	-0.347300
e	-1.307590	0.185759	0.175967	-0.170334
a	2.030462	-0.337738	-0.894440	-0.757323
b	0.475807	1.350088	-0.514070	-0.843963
c	0.948164	-0.155052	-0.618893	1.319999
d	1.433736	-0.455008	1.445698	-1.051454
e	0.565345	1.802485	-0.167189	-0.227519

In[]: 
df3 = pd.concat([df1, df2], axis=0) # concat by row
print(df3)

Out[]:
`

	A	B	C	D
a	1.179924	-1.512124	0.767557	0.019265
b	0.019969	-1.351649	0.665298	-0.989025
c	0.351921	-0.792914	0.455174	0.170751
d	-0.150499	0.151942	-0.628074	-0.347300
e	-1.307590	0.185759	0.175967	-0.170334
a	2.030462	-0.337738	-0.894440	-0.757323
b	0.475807	1.350088	-0.514070	-0.843963
c	0.948164	-0.155052	-0.618893	1.319999
d	1.433736	-0.455008	1.445698	-1.051454
e	0.565345	1.802485	-0.167189	-0.227519

In[]: 
df3 = pd.concat([df1, df2], axis=1) # concat by column
print(df3)

Out[]:

	A	B	C	D	A	B	C	D
a	1.179924	-1.512124	0.767557	0.019265	2.030462	-0.337738	-0.894440	-0.757323
b	0.019969	-1.351649	0.665298	-0.989025	0.475807	1.350088	-0.514070	-0.843963
c	0.351921	-0.792914	0.455174	0.170751	0.948164	-0.155052	-0.618893	1.319999
d	-0.150499	0.151942	-0.628074	-0.347300	1.433736	-0.455008	1.445698	-1.051454
e	-1.307590	0.185759	0.175967	-0.170334	0.565345	1.802485	-0.167189	-0.227519

TimeSeries in Python Pandas

Python Pandas TimeSeries includes a set of tools to work with Series or DataFrames indexed in time.

Usually, the series of financial data are of this type and therefore, knowing these tools will make our work much more comfortable.

In this section of the Python Pandas tutorial, we are going to start creating time series from scratch and then we will see how to manipulate them and convert them to different frequencies.

Indexing Python Pandas TimeSeries

With date_range Panda's method, we can create a time range with a certain frequency. For example, create a range starting on December 1st, 2018, with 30 occurrences with an hourly frequency.

In[]: 
rng = pd.date_range('12/1/2018', periods=30, freq='H')
print(rng)

Out[]: 
DatetimeIndex(['2018-12-01 00:00:00', '2018-12-01 01:00:00',
               '2018-12-01 02:00:00', '2018-12-01 03:00:00',
               '2018-12-01 04:00:00', '2018-12-01 05:00:00',
               '2018-12-01 06:00:00', '2018-12-01 07:00:00',
               '2018-12-01 08:00:00', '2018-12-01 09:00:00',
               '2018-12-01 10:00:00', '2018-12-01 11:00:00',
               '2018-12-01 12:00:00', '2018-12-01 13:00:00',
               '2018-12-01 14:00:00', '2018-12-01 15:00:00',
               '2018-12-01 16:00:00', '2018-12-01 17:00:00',
               '2018-12-01 18:00:00', '2018-12-01 19:00:00',
               '2018-12-01 20:00:00', '2018-12-01 21:00:00',
               '2018-12-01 22:00:00', '2018-12-01 23:00:00',
               '2018-12-02 00:00:00', '2018-12-02 01:00:00',
               '2018-12-02 02:00:00', '2018-12-02 03:00:00',
               '2018-12-02 04:00:00', '2018-12-02 05:00:00'],
              dtype='datetime64[ns]', freq='H')

We can do the same to get a daily frequency (or any other, see the documentation). We can use the freq parameter to adjust this.

In[]: 
rng = pd.date_range('12/1/2018', periods=30, freq='D')
print(rng)

Out[]: 
DatetimeIndex(['2018-12-01', '2018-12-02', '2018-12-03', '2018-12-04',
               '2018-12-05', '2018-12-06', '2018-12-07', '2018-12-08',
               '2018-12-09', '2018-12-10', '2018-12-11', '2018-12-12',
               '2018-12-13', '2018-12-14', '2018-12-15', '2018-12-16',
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-22', '2018-12-23', '2018-12-24',
               '2018-12-25', '2018-12-26', '2018-12-27', '2018-12-28',
               '2018-12-29', '2018-12-30'],
              dtype='datetime64[ns]', freq='D')

Now, we have a DateTimeIndex in the rng object and we can use it to create a Series or DataFrame:

In[]: 
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)

Out[]:
`

	A	B	C	D
2018-12-01	0.048603	0.968522	0.408213	0.921774
2018-12-02	-2.301373	-2.310408	-0.559381	-0.652291
2018-12-03	-2.337844	0.329954	0.289221	0.259132
2018-12-04	1.357521	0.969808	1.341875	0.767797
2018-12-05	-1.212355	-0.077457	-0.529564	0.375572
2018-12-06	-0.673065	0.527754	0.006344	-0.533316
2018-12-07	0.226145	0.235027	0.945678	-1.766167
2018-12-08	1.735185	-0.604229	0.274809	0.841128

In[]: 
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)

Out[]:  
        2018-12-01    0.349234
        2018-12-02   -1.807753
        2018-12-03    0.112777
        2018-12-04    0.421516
        2018-12-05   -0.992449
        2018-12-06    1.254999
        2018-12-07   -0.311152
        2018-12-08    0.331584
        2018-12-09    0.196904
        2018-12-10   -1.619186
        2018-12-11    0.478510
        2018-12-12   -1.036074

Sometimes, we read the data from internet sources or from csv files and we need to convert the date column into the index to work properly with the Series or DataFrame.

In[]: 
import pandas as pd
df=pd.read_csv('TSLA.csv')
df.tail()

Out[]:

	Date	Open	High	Low	Close	Volume	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
1078	2013-12-11	141.88	143.05	139.49	139.65	7137800.0	1.0	141.88	143.05	139.49	139.65	7137800.0	TSLA
1079	2013-12-10	140.05	145.87	139.86	142.19	10748200.0	1.0	140.05	145.87	139.86	142.19	10748200.0	TSLA
1080	2013-12-09	137.00	141.70	134.21	141.60	9061500.0	1.0	137.00	141.70	134.21	141.60	9061500.0	TSLA
1081	2013-12-06	141.51	142.49	136.30	137.36	7909600.0	1.0	141.51	142.49	136.30	137.36	7909600.0	TSLA
1082	2013-12-05	140.15	143.35	139.50	140.48	9288400.0	1.0	140.15	143.35	139.50	140.48	9288400.0	TSLA

Here, we can see the index as numeric and a Date column, let's convert this column into the index to indexing our DataFrame, read from a csv file, in time. For this, we are going to use the Python Pandas set_index method

In[]: 
df = df.set_index('Date')
df.tail()

Out[]:

Date	Open	High	Low	Close	Volume	SplitRatio	AdjOpen	AdjHigh	AdjLow	AdjClose	AdjVolume	Name
2013-12-11	141.88	143.05	139.49	139.65	7137800.0	1.0	141.88	143.05	139.49	139.65	7137800.0	TSLA
2013-12-10	140.05	145.87	139.86	142.19	10748200.0	1.0	140.05	145.87	139.86	142.19	10748200.0	TSLA
2013-12-09	137.00	141.70	134.21	141.60	9061500.0	1.0	137.00	141.70	134.21	141.60	9061500.0	TSLA
2013-12-06	141.51	142.49	136.30	137.36	7909600.0	1.0	141.51	142.49	136.30	137.36	7909600.0	TSLA
2013-12-05	140.15	143.35	139.50	140.48	9288400.0	1.0	140.15	143.35	139.50	140.48	9288400.0	TSLA

Now, we have Pandas TimeSeries ready to work.

Resampling Pandas TimeSeries

A very useful feature of Python Pandas TimeSeries is the resample capacity, this allows us to pass the current frequency to another higher frequency (we can't pass to lower frequencies, because we don't know the data).

As it can be supposed, when we pass from one frequency to another data could be lost, for this, we must use some function that treats the values of each frequency interval, for example, if we pass from an hourly frequency to daily, we must specify what we want to do with the group of data that fall inside each frequency, we can do a mean, a sum, we can get the maximum or the minimum, etc.

In[]: 
rng = pd.date_range('12/1/2018', periods=30, freq='H')
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)

Out[]:
`

	A	B	C	D
2018-12-01 00:00:00	0.048603	0.968522	0.408213	0.921774
2018-12-01 01:00:00	-2.301373	-2.310408	-0.559381	-0.652291
2018-12-01 02:00:00	-2.337844	0.329954	0.289221	0.259132
2018-12-01 03:00:00	1.357521	0.969808	1.341875	0.767797
2018-12-01 04:00:00	-1.212355	-0.077457	-0.529564	0.375572
2018-12-01 05:00:00	-0.673065	0.527754	0.006344	-0.533316

In[]: 
ts = ts.resample("1D").mean()
print(ts)

Out[]:

	A	B	C	D
2018-12-01	0.449050	0.127412	-0.154179	-0.358324
2018-12-02	-0.539007	-0.855894	0.000010	0.454623

Manipulating TimeSeries

We can manipulate the Python Pandas TimeSeries in the same way that we have done until now, since they offer us the same capacity that the Pandas Series and the Pandas DataFrames.

Additionally, we can work comfortably with all jobs related to handling dates. For example, to obtain all the data from a date, to obtain the data in a range of dates, etc.

In[]: 
rng = pd.date_range('12/1/2018', periods=30, freq='D')
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)

Out[]:

	A	B	C	D
2018-12-01	0.048603	0.968522	0.408213	0.921774
2018-12-02	-2.301373	-2.310408	-0.559381	-0.652291
2018-12-03	-2.337844	0.329954	0.289221	0.259132
2018-12-04	1.357521	0.969808	1.341875	0.767797
2018-12-05	-1.212355	-0.077457	-0.529564	0.375572
2018-12-06	-0.673065	0.527754	0.006344	-0.533316

Getting all values from a specific date:

In[]: 
ts['2018-12-15':]

Out[]:

	A	B	C	D
2018-12-02	0.324689	-0.413723	0.019163	0.385233
2018-12-03	-2.198937	0.536600	-0.540934	-0.603858
2018-12-04	-1.195148	2.191311	-0.981604	-0.942440
2018-12-05	0.621298	-1.435266	-0.761886	-1.787730
2018-12-06	0.635679	0.683265	0.351140	-1.451903

Getting all values inside a date range:

In[]: 
ts['2018-12-15':'2018-12-20']

Out[]:

	A	B	C	D
2018-12-15	0.605576	0.584369	-1.520749	-0.242630
2018-12-16	-0.105561	-0.092124	0.385085	0.918222
2018-12-17	0.337416	-1.367549	0.738320	2.413522
2018-12-18	-0.011610	-0.339228	-0.218382	-0.070349
2018-12-19	0.027808	-0.422975	-0.622777	0.730926
2018-12-20	0.188822	-1.016637	0.470874	0.674052

Summary

Let us summarise a few points we learnt in the Python Pandas tutorial

Python Pandas DataFrame and Python Pandas Series are some of the most important data structures. It is a must to acquire fluency in its handling because we will find them in practically all the problems that we handle.
A Python Pandas DataFrame is a data structure formed by rows and columns and has an index.
We must think of them as if they were data tables (for the Array with a single column) with which we can select, filter, sort, add and delete elements, either by rows or columns.
Help in ETL processes (Extraction, Transformation and Loading)
We can select, insert, delete and update elements with simple functions.
We can perform computations by rows or columns.
Has the ability to run vectorized computations.
We can work with several DataFrames at the same time.
Indexing and subsetting are the most important features of Python Pandas.
Facilitates the statistical exploration of the data.
It offers us a variety of options for handling NaN data.
Another additional advantage is the ability to read & write multiple data formats (CSV, Excel, HDF5, etc.).
Retrieve data from external sources (Yahoo, Google, Quandl Python API, etc.)
Finally, it has the ability to work with date and time indexes and offers
us a set of functions to work with dates.

Conclusion

To reiterate, this article is an excerpt from the Python Basics Handbook, which was created for both; beginners who are starting out in Python as well as accomplished traders who can use it as a handy reference while coding their strategy.

Do let us know if you loved the article and any other feedback in the comments below.

Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.

EPAT Walkthrough & Live Q&A