Python Pandas Tutorial: Installation, Series and DataFrame

40 min read

By Jay Parmar

Pandas is a Python library to deal with sequential and tabular data. It includes many tools to manage, analyze and manipulate data in a convenient and efficient manner. We can think of its data structures as akin to database tables or spreadsheets.

Pandas is built on top of the Numpy library and has two primary data structures viz. Series (1-dimensional) and DataFrame (2-dimensional). It can handle both homogeneous and heterogeneous data, and some of its many capabilities are:

  • ETL tools (Extraction, Transformation and Load tools)
  • Dealing with missing data (NaN)
  • Dealing with data files (csv, xls, db, hdf5, etc.)
  • Time-series manipulation tools

In the Python ecosystem, Pandas is the best choice to retrieve, manipulate, analyze and transform financial data.

In this Python Pandas tutorial, we will learn the basics of the Python Pandas module as well as understand some of the codes. This blog article is an excerpt from the Python Basics Handbook created for the simple purpose of making the reader understand the beauty and simplicity of the Python language.

The contents of this article include:

Installing Python Pandas

In this section of the Python Pandas tutorial, we will start by going through the official documentation which has a detailed explanation here on installing Pandas. We summarize it below.

With pip

The simplest way to install Pandas is from PyPI.

In a terminal window, run the following command.

pip install pandas

In your code, you can use the escape character '!' to install pandas directly from your Python console.

!pip install pandas

Pip is a useful tool to manage Python's packages and it is worth investing some time in knowing it better.

pip help

With Conda environments

For advanced users, who like to work with Python environments for each project, you can create a new environment and install pandas as shown below.

conda create -n EPAT python
source activate EPAT
conda install pandas

Testing Pandas installation

To check the installation, Pandas comes with a test suite to test almost all of the codebase and verify that everything is working.

import pandas as pd
pd.test()

What problem does Python Pandas solve?

In this section of the Python Pandas tutorial, we will understand the use of Python Pandas.

Python Pandas works with homogeneous data series (1-Dimension) and heterogeneous tabular data series (2-Dimensions). It includes a multitude of tools to work with these data types, such as:

  • Indexes and labels.
  • Searching of elements.
  • Insertion, deletion and modification of elements.
  • Apply set techniques, such as grouping, joining, selecting, etc.
  • Data processing and cleaning.
  • Work with time series.
  • Make statistical calculations
  • Draw graphics
  • Connectors for multiple data file formats, such as, csv, xlsx, hdf5, etc.

Pandas Documentation: 10 minutes with Pandas

Python Pandas Series

The first data structure we will go through in the Python Pandas tutorial is the Series.

Python Pandas Series are homogeneous one-dimensional objects, that is, all data are of the same type and are implicitly labelled with an index.

For example, we can have a Series of integers, real numbers, characters, strings, dictionaries, etc. We can conveniently manipulate these series performing operations like adding, deleting, ordering, joining, filtering, vectorized operations, statistical analysis, plotting, etc.

Let's see some examples of how to create and manipulate a Python Pandas Series:

  • We will start by creating an empty Python Pandas Series:
import pandas as pd
s = pd.Series()
print(s)

Out[]: 

Series([], dtype: float64)
  • Let's create a Python Pandas Series of integers:
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5, 6, 7])
print(s)

Out[]:
        0    1
        1    2
        2    3
        3    4
        4    5
        5    6
        6    7
        dtype: int64
  • Let's create a Python Pandas Series of characters:
import pandas as pd
s = pd.Series(['a', 'b', 'c', 'd', 'e'])
print(s)

Out[]: 
        0    1
        1    2
        2    3
        3    4
        4    5
        5    6
        6    7
        dtype: int64
  • Let's create a random Python Pandas Series of float numbers:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print(s)

Out[]:  
        0    0.383567
        1    0.869761
        2    1.100957
        3   -0.259689
        4    0.704537
        dtype: float64

In all these examples we saw in the Python Pandas tutorial, we have allowed the index label to appear by default (without explicitly programming it). It starts at 0, and we can check the index as:

In[]: 

s.index

Out[]: 

RangeIndex(start=0, stop=5, step=1)

But we can also specify the index we need, for example:

In[]: 

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

Out[]:
      a    1.392051
      b    0.515690
      c   -0.432243
      d   -0.803225
      e    0.832119
      dtype: float64
  • Let's create a Python Pandas Series from a dictionary:
import pandas as pd
dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
s = pd.Series(dictionary)
print(s)

Out[]:

        a    1
        b    2
        c    3
        d    4
        e    5
        dtype: int64

In this case, the Python Pandas Series is created with the dictionary keys as index unless we specify any other index.

Simple operations with Python Pandas Series

When we have a Python Pandas Series, we can perform several simple operations on it.

For example, in this section of the Python Pandas tutorial, let's create two Series. One from a dictionary and the other from an array of integers:

In[]: 
      import pandas as pd
      dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
      s1 = pd.Series(dictionary)

      array = [1, 2, 3, 4, 5]
      s2 = pd.Series(array)


Out[]:  
        a    1
        b    2
        c    3
        d    4
        e    5
        dtype: int64

        0    1
        1    2
        2    3
        3    4
        4    5
        dtype: int64

We can perform operations similar to Numpy arrays:

  • Selecting one item from the Python Pandas Series by means of its index:
In[]: 
s1[0] # Select the first element

Out[]: 
1

In[]: 
s1['a'] 

Out[]: 
1

In[]: 
s2[0] 

Out[]: 
1
  • Selecting several items from the Python Pandas Series by means of its index:
In[]: 
s1[[1,4]]

Out[]:
        b    2
        e    5
        dtype: int64

In[]:
s1[['b', 'e']]

Out[]:  
        b    2
        e    5
        dtype: int64

In[]: 
s2[[1,4]]

Out[]:  
        b    2
        e    5
        dtype: int64
  • Get the Python Pandas series starting from an element:
In[]: 
s1[2:]

Out[]:  
        c    3
        d    4
        e    5
        dtype: int64

In[]: 
s2[2:]

Out[]:  
        2    3
        3    4
        4    5
        dtype: int64
  • Get the Python Pandas series up to one element:
In[]: 
s1[:2]

Out[]:  
        c    3
        d    4
        e    5
        dtype: int64

In[]: 
s2[:2]

Out[]:  
        2    3
        3    4
        4    5
        dtype: int64

We can perform operations like a dictionary:

  • Assign a value:
In[]: 
      s1[1] = 99
      s1['a'] = 99

Out[]:  
        a     1
        b    99
        c     3
        d     4
        e     5
        dtype: int64

In[]: 
s2[1] = 99
print(s2)

Out[]:  
        0     1
        1    99
        2     3
        3     4
        4     5
        dtype: int64
  • Get a value by index (like dictionary key):
In[]: 
s.get('b')

Out[]: 
2

Here are some powerful vectorized operations that let us perform quickly calculations, for example:

  • Add, subtract, multiply, divide, power, and almost any NumPy function that accepts NumPy arrays.
s1 + 2
s1 - 2
s1 * 2
s1 / 2
s1 ** 2
np.exp(s1)
  • We can perform the same operations over two Python Pandas Series although these must be aligned, that is, to have the same index, in other cases, perform a Union operation.
In[]: 
s1 + s1 # The indices are aligned

Out[]:  
        a     2
        b     4
        c     6
        d     8
        e    10
        dtype: int64

In[]: 
s1 + s2 # The indices are unaligned

Out[]:  
        a   NaN
        b   NaN
        c   NaN
        d   NaN
        e   NaN
        0   NaN
        1   NaN
        2   NaN
        3   NaN
        4   NaN
        dtype: float64

Python Pandas DataFrame

The second data structure in Python Pandas that we are going to see is the DataFrame.

Python Pandas DataFrame is a heterogeneous two-dimensional object, that is, the data are of the same type within each column but it could be a different data type for each column and are implicitly or explicitly labelled with an index.

We can think of a Python Pandas DataFrame as a database table, in which we store heterogeneous data. For example, a Python Pandas DataFrame with one column for the first name, another for the last name and a third column for the phone number, or a Python Pandas dataframe with columns to store the opening price, close price, high, low, volume, and so on.

The index can be implicit, starting with zero or we can specify it ourselves, even working with dates and times as indexes as well.

Let's see some examples of how to create and manipulate a Python Pandas DataFrame.

  • Creating an empty Python Pandas DataFrame:
In[]: 
      import pandas as pd
      s = pd.DataFrame()
      print(s)

Out[]:  
        Empty DataFrame
        Columns: []
        Index: []
  • Creating an empty structure Python Pandas DataFrame:
In[]: 
      import pandas as pd
      s = pd.DataFrame(columns=['A','B','C','D','E'])
      print(s)

Out[]:  
        Empty DataFrame
        Columns: [A, B, C, D, E]
        Index: []

In[]: 
      import pandas as pd
      s = pd.DataFrame(columns=['A','B','C','D','E'], index=range(1, 6))
      print(s)

Out[]:
             A    B    C    D    E
        1  NaN  NaN  NaN  NaN  NaN
        2  NaN  NaN  NaN  NaN  NaN
        3  NaN  NaN  NaN  NaN  NaN
        4  NaN  NaN  NaN  NaN  NaN
        5  NaN  NaN  NaN  NaN  NaN
  • Creating a Python Pandas DataFrame passing a NumPy array:
In[]: 
      array =  {'A' : [1, 2, 3, 4],
                'B' : [4, 3, 2, 1]}

      pd.DataFrame(array)

Out[]: 
        A   B
      0 1   4
      1 2   3
      2 3   2
      3 4   1
  • Creating a Python Pandas DataFrame passing a NumPy array, with datetime index:
In[]: 
      import pandas as pd
      array =  {'A' : [1, 2, 3, 4],'B' : [4, 3, 2, 1]}
      index = pd.DatetimeIndex(['2018-12-01', '2018-12-02','2018-12-03', '2018-12-04'])
      pd.DataFrame(array, index=index)

Out[]:
                 A B
    2018-12-01  1   4
    2018-12-02  2   3
    2018-12-03  3   2
    2018-12-04  4   1
  • Creating a Python Pandas DataFrame passing a Dictionary:
In[]: 
      import pandas as pd
      dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
      pd.DataFrame([dictionary])

Out[]:
        a   b   c   d   e
      0 1   2   3   4   5
  • Viewing a Python Pandas DataFrame: We can use some methods to explore the Pandas DataFrame:

First, we go to create a Python Pandas DataFrame to work with it.

In[]: 
      import pandas as pd
      pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

Out[]:
            A         B           C
    0   0.164358    1.689183    1.745963
    1   -1.830385   0.035618    0.047832
    2   1.304339    2.236809    0.920484
    3   0.365616    1.877610    -0.287531
    4   -0.741372   -1.443922   -1.566839
    5   -0.119836   -1.249112   -0.134560
    6   -0.848425   -0.569149   -1.222911
    7   -1.172688   0.515443    1.492492
    8   0.765836    0.307303    0.788815
    9   0.761520    -0.409206   1.298350
  • Get the first three rows:
In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.head(3)

Out[]:
            A         B           C
    0   0.164358    1.689183    1.745963
    1   -1.830385   0.035618    0.047832
    2   1.304339    2.236809    0.920484
  • Get the last three rows:
In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.tail(3)

Out[]:
           A         B           C
    7   -1.172688   0.515443    1.492492
    8   0.765836    0.307303    0.788815
    9   0.761520    -0.409206   1.298350
  • Get the Python Pandas DataFrame's index:
In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.index

Out[]: 
RangeIndex(start=0, stop=10, step=1)
  • Get the Python Pandas DataFrame's columns:
In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.columns

Out[]: 
Index(['A', 'B', 'C'], dtype='object')
  • Get the Python Pandas DataFrame's values:
In[]: 
      import pandas as pd
      df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})

      df.values

Out[]: 
array([[ 0.6612966 , -0.60985049,  1.11955054],
       [-0.74105636,  1.42532491, -0.74883362],
       [ 0.10406892,  0.5511436 ,  2.63730671],
       [-0.73027121, -0.11088373, -0.19143175],
       [ 0.11676573,  0.27582786, -0.38271609],
       [ 0.51073858, -0.3313141 ,  0.20516165],
       [ 0.23917755,  0.55362   , -0.62717194],
       [ 0.25565784, -1.4960713 ,  0.58886377],
       [ 1.20284041,  0.21173483,  2.0331718 ],
       [ 0.62247283,  2.18407105,  0.02431867]])

Importing data in Python Pandas

Python Pandas DataFrame is able to read several data formats, some of the most used are: CSV, JSON, Excel, HDF5, SQL, etc.

CSV to DataFrame

One of the most useful functions is read_csv that allows us to read csv files with almost any format and load it into our Python Pandas DataFrame to work with it. Let's see how to work with csv files:

import pandas as pd
df=pd.read_csv('Filename.csv')
type(df)

Out[]: 
pandas.core.frame.DataFrame

This simple operation loads the csv file into the Python Pandas DataFrame after which we can explore it as we have seen before.

Customizing pandas import

Sometimes the format of the csv file comes with a particular separator or we need specific columns or rows. In this section of the Python Pandas tutorial, we will now see some ways to deal with this.

In this example, we want to load a csv file with blank space as separator:

import pandas as pd
df=pd.read_csv('Filename.csv', sep=' ')

In this example, we want to load columns from 0 and 5 and the first 100 rows:

import pandas as pd
df=pd.read_csv('Filename.csv', usecols=[0,1,2,3,4,5], nrows=100)

It's possible to customize the headers, convert the columns or rows names and carry out a good number of other operations.

Check the IO tools from Pandas documentation.

Importing sheets from Excel files

In the same way that we have worked with csv files, we can work with Excel file with the read_excel function, let's see some examples:

In this example, we want to load the sheet 1 from an Excel file:

import pandas as pd
df=pd.read_excel('Filename.xls', sheet_name='Sheet1')

This simple operation loads Sheet 1 from the Excel file into the Pandas DataFrame.

Indexing and Subsetting

Once we have the Python Pandas DataFrame prepared, independent of the source of our data (csv, Excel, hdf5, etc.) we can work with it, as if it were a database table, selecting the elements that interest us. We will work with some examples of how to index and extract subsets of data.

Let's begin with loading a csv file having details of a market instrument.

In[]: 
      import pandas as pd
      df=pd.read_csv('MSFT.csv' , usecols=[0,1,2,3,4,5])
      df.head()
      df.shape

Out[]:
          Date        Open  High     Low    Close   Volume
    0   2008-12-29  19.1500 19.21   18.64   18.96   58512800.0
    1   2008-12-30  19.0100 19.49   19.00   19.34   43224100.0
    2   2008-12-31  19.3100 19.68   19.27   19.44   46419000.0
    3   2009-01-02  19.5328 20.40   19.37   20.33   50084000.0
    4   2009-01-05  20.2000 20.67   20.06   20.52   61475200.0

    (1000, 6)

Here, we have read a csv file, of which we only need the columns of date, opening, closing, high, low and volume (the first 6 columns) and we check the form of the DataFrame that has 1000 rows and 6 columns.

Selecting a single column

In the previous code, we have read directly the first 6 columns from the csv file. This is a filter that we applied because we were only interested in those columns.

We can apply selection filters to the Python Pandas DataFrame itself, to select one column to work with. For example, we could need the Close column:

In[]: 
close=df['Close']
close.head()

Out[]:
        Close
    0   18.96   
    1   19.34   
    2   19.44   
    3   20.33   
    4   20.52

Selecting multiple columns

We can select multiple columns too:

In[]: 
closevol=df[['Close', 'Volume']]
closevol.head()

Out[]:
        Close  Volume
    0   18.96   58512800.0
    1   19.34   43224100.0
    2   19.44   46419000.0
    3   20.33   50084000.0
    4   20.52   61475200.0

Selecting rows via [].

We can select a set of rows by index:

In[]: 
      import pandas as pd
      df=pd.read_csv('TSLA.csv' )

      df[100:110]

Out[]:

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
1002017-10-30319.18323.7800317.25320.084236029.00.01.0319.18323.7800317.25320.084236029.0TSLA
1012017-10-27319.75324.5900316.66320.876942493.00.01.0319.75324.5900316.66320.876942493.0TSLA
1022017-10-26327.78330.2300323.20326.174980316.00.01.0327.78330.2300323.20326.174980316.0TSLA
1032017-10-25336.70337.5000323.56325.848547764.00.01.0336.70337.5000323.56325.848547764.0TSLA
1042017-10-24338.80342.8000336.16337.344463807.00.01.0338.80342.8000336.16337.344463807.0TSLA
1052017-10-23349.88349.9500336.25337.025715817.00.01.0349.88349.9500336.25337.025715817.0TSLA
1062017-10-20352.69354.5500344.34345.104888221.00.01.0352.69354.5500344.34345.104888221.0TSLA
1072017-10-19355.56357.1465348.20351.815032884.00.01.0355.56357.1465348.20351.815032884.0TSLA
1082017-10-18355.97363.0000354.13359.654898808.00.01.0355.97363.0000354.13359.654898808.0TSLA
1092017-10-17350.91356.2200350.07355.753280670.00.01.0350.91356.2200350.07355.753280670.0TSLA

Or we can select a set of rows and columns:

In[]: 
df[100:110][['Close', 'Volume']]

Out[]: 
        Close   Volume
100 320.08  4236029.0
101 320.87  6942493.0
102 326.17  4980316.0
103 325.84  8547764.0
104 337.34  4463807.0
105 337.02  5715817.0
106 345.10  4888221.0
107 351.81  5032884.0
108 359.65  4898808.0
109 355.75  3280670.0

Selecting via .loc[] (By label)

With df.loc we can do the same selections using labels:

To select a set of rows, we can code the following using the index number as label:

In[]: 
df.loc[100:110]

Out[]:

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
1002017-10-30319.18323.7800317.25320.084236029.00.01.0319.18323.7800317.25320.084236029.0TSLA
1012017-10-27319.75324.5900316.66320.876942493.00.01.0319.75324.5900316.66320.876942493.0TSLA
1022017-10-26327.78330.2300323.20326.174980316.00.01.0327.78330.2300323.20326.174980316.0TSLA
1032017-10-25336.70337.5000323.56325.848547764.00.01.0336.70337.5000323.56325.848547764.0TSLA
1042017-10-24338.80342.8000336.16337.344463807.00.01.0338.80342.8000336.16337.344463807.0TSLA
1052017-10-23349.88349.9500336.25337.025715817.00.01.0349.88349.9500336.25337.025715817.0TSLA
1062017-10-20352.69354.5500344.34345.104888221.00.01.0352.69354.5500344.34345.104888221.0TSLA
1072017-10-19355.56357.1465348.20351.815032884.00.01.0355.56357.1465348.20351.815032884.0TSLA
1082017-10-18355.97363.0000354.13359.654898808.00.01.0355.97363.0000354.13359.654898808.0TSLA
1092017-10-17350.91356.2200350.07355.753280670.00.01.0350.91356.2200350.07355.753280670.0TSLA

Or we can select a set of rows and columns like before:

In[]: 
df.loc[100:110, ['Close', 'Volume']]

Out[]:
    Close   Volume
100 320.08  4236029.0
101 320.87  6942493.0
102 326.17  4980316.0
103 325.84  8547764.0
104 337.34  4463807.0
105 337.02  5715817.0
106 345.10  4888221.0
107 351.81  5032884.0
108 359.65  4898808.0
109 355.75  3280670.0
110 350.60  5353262.0

Selecting via .iloc[] (By position)

With df.iloc we can do the same selections using integer position:

In[]: 
df.iloc[100:110]

Out[]:

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
1002017-10-30319.18323.7800317.25320.084236029.00.01.0319.18323.7800317.25320.084236029.0TSLA
1012017-10-27319.75324.5900316.66320.876942493.00.01.0319.75324.5900316.66320.876942493.0TSLA
1022017-10-26327.78330.2300323.20326.174980316.00.01.0327.78330.2300323.20326.174980316.0TSLA
1032017-10-25336.70337.5000323.56325.848547764.00.01.0336.70337.5000323.56325.848547764.0TSLA
1042017-10-24338.80342.8000336.16337.344463807.00.01.0338.80342.8000336.16337.344463807.0TSLA
1052017-10-23349.88349.9500336.25337.025715817.00.01.0349.88349.9500336.25337.025715817.0TSLA
1062017-10-20352.69354.5500344.34345.104888221.00.01.0352.69354.5500344.34345.104888221.0TSLA
1072017-10-19355.56357.1465348.20351.815032884.00.01.0355.56357.1465348.20351.815032884.0TSLA
1082017-10-18355.97363.0000354.13359.654898808.00.01.0355.97363.0000354.13359.654898808.0TSLA
1092017-10-17350.91356.2200350.07355.753280670.00.01.0350.91356.2200350.07355.753280670.0TSLA

In the last example, we used the index as an integer position rather than by label.

We can select a set of rows and columns like before:

In[]: 
df.iloc[100:110, [3, 4]]

Out[]:
      Low   Close
100 317.25  320.08
101 316.66  320.87
102 323.20  326.17
103 323.56  325.84
104 336.16  337.34
105 336.25  337.02
106 344.34  345.10
107 348.20  351.81
108 354.13  359.65
109 350.07  355.75

Boolean indexing

So far in the Python Pandas tutorial, we have sliced subsets of data by label or by position. Now let's see how to select data that meet some criteria. We do this with Boolean indexing. We can use the same criteria similar to what we have seen with Numpy arrays. We show you just two illustrative examples here. This is by no means enough to get comfortable with it and so would encourage you to check the documentation and further readings at the end of this chapter to learn more.

  • We can filter data that is greater (less) than a number.
In[]: 
df[df.Close > 110]

Out[]:

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
02018-03-27304.00304.2700277.1800279.180013696168.00.01.0304.00304.2700277.1800279.180013696168.0TSLA
12018-03-26307.34307.5900291.3600304.18008324639.00.01.0307.34307.5900291.3600304.18008324639.0TSLA
22018-03-23311.25311.6100300.4500301.54006600538.00.01.0311.25311.6100300.4500301.54006600538.0TSLA

.............................................
10802013-12-09137.00141.7000134.2100141.60009061500.00.01.0137.00141.7000134.2100141.60009061500.0TSLA
10812013-12-06141.51142.4900136.3000137.36007909600.00.01.0141.51142.4900136.3000137.36007909600.0TSLA
10822013-12-05140.15143.3500139.5000140.48009288400.00.01.0140.15143.3500139.5000140.48009288400.0TSLA

1083 rows × 14 columns

In[]: 
df[(df['Close'] > 110) | (df['Close'] < 120)]

Out[]:
`

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
02018-03-27304.00304.2700277.1800279.180013696168.00.01.0304.00304.2700277.1800279.180013696168.0TSLA
12018-03-26307.34307.5900291.3600304.18008324639.00.01.0307.34307.5900291.3600304.18008324639.0TSLA
22018-03-23311.25311.6100300.4500301.54006600538.00.01.0311.25311.6100300.4500301.54006600538.0TSLA

.............................................
10802013-12-09137.00141.7000134.2100141.60009061500.00.01.0137.00141.7000134.2100141.60009061500.0TSLA
10812013-12-06141.51142.4900136.3000137.36007909600.00.01.0141.51142.4900136.3000137.36007909600.0TSLA
10822013-12-05140.15143.3500139.5000140.48009288400.00.01.0140.15143.3500139.5000140.48009288400.0TSLA

1083 rows × 14 columns

Manipulating a Python Pandas DataFrame

When we are working with data, the most common structure is the DataFrame. We have seen how to create them, make selections and find data in the Python Pandas tutorial. We are now going to see how to manipulate the Python Pandas DataFrame to transform it into another Python Pandas DataFrame that has the form that our problem requires.

We'll see how to sort it, re-index it, eliminate unwanted (or spurious) data, add or remove columns and update values.

.T (Transposing)

The Python Pandas DataFrame transpose function T allows us to transpose the rows as columns, and logically the columns as rows:

In[]: import pandas as pd
      df=pd.read_csv('TSLA.csv' )
      df2=df[100:110][['Close', 'Volume']]

      df2.T

Out[]:

100101102103104105106107108109
Close320.08320.87326.17325.84337.34337.02345.1351.81359.65355.75
Volume4236029.006942493.004980316.008547764.004463807.005715817.004888221.05032884.004898808.003280670.00

.sort_index()

When we are working with Python Pandas Dataframe it is usual to add or remove rows, order by columns, etc. That's why it's important to have a function that allows us to easily and comfortably sort the Python Pandas DataFrame by its index. We do this with the sort_index function of Pandas DataFrame.

In[]:
 df.sort_index()

Out[]:

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
02018-03-27304.00304.2700277.1800279.180013696168.00.01.0304.00304.2700277.1800279.180013696168.0TSLA
12018-03-26307.34307.5900291.3600304.18008324639.00.01.0307.34307.5900291.3600304.18008324639.0TSLA
22018-03-23311.25311.6100300.4500301.54006600538.00.01.0311.25311.6100300.4500301.54006600538.0TSLA

.............................................

.sort_values()

Sometimes, we may be interested in sorting the Python Pandas DataFrame by some column or even with several columns as criteria. For example, sort the column by first names and the second criterion by last names. We do this with the sort_values function of Python Pandas DataFrame.

In[]: 
df.sort_values(by='Close')

Out[]:

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
10812013-12-06141.510142.4900136.3000137.36007909600.00.01.0141.510142.4900136.3000137.36007909600.0TSLA
10572014-01-13145.780147.0000137.8200139.34006316100.00.01.0145.780147.0000137.8200139.34006316100.0TSLA
10782013-12-11141.880143.0500139.4900139.65007137800.00.01.0141.880143.0500139.4900139.65007137800.0TSLA

.............................................
In[]: 
df.sort_values(by=['Open', 'Close'])

Out[]:

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
10802013-12-09137.000141.7000134.2100141.60009061500.00.01.0137.000141.7000134.2100141.60009061500.0TSLA
10772013-12-12139.700148.2400138.5300147.470010767800.00.01.0139.700148.2400138.5300147.470010767800.0TSLA
10792013-12-10140.050145.8700139.8600142.190010748200.00.01.0140.050145.8700139.8600142.190010748200.0TSLA

.............................................

.reindex()

The Python Pandas' reindex function let us realign the index of the Series or DataFrame, it's useful when we need to reorganize the index to meet some criteria.

For example, we can play with the Series or DataFrame that we create before to alter the original index. For example, when the index is a label, we can reorganize as we need:

In[]: import pandas as pd
      import numpy as np
      df = pd.DataFrame(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

      df

Out[]:

0
a-0.134133
b-0.586051
c1.179358
d0.433142
e-0.365686

Now, we can reorganize the index as follows:

In[]: 
df.reindex(['b', 'a', 'd', 'c', 'e'])

Out[]:

0
b-0.586051
a-0.134133
d0.433142
c1.179358
e-0.365686

When the index is numeric we can use the same function to order by hand the index:

In[]: 
      import pandas as pd
      import numpy as np
      df = pd.DataFrame(np.random.randn(5))
      df.reindex([4,3,2,1,0])

Out[]:

0
41.058589
31.194400
2-0.645806
10.836606
01.288102

Later in this section, we'll see how to work and reorganize date and time indices.

Adding a new column. For e.g. df[‘new_column’] = 1

Another interesting feature of Python Pandas DataFrames is the possibility of adding new columns to an existing DataFrame.

For example, we can add a new column to the random DataFrame that we have created before:

In[]: 
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5))

Out[]:

0
00.238304
12.068558
21.015650
30.506208
40.214760

To add a new column, we only need to include the new column name in the DataFrame and assign an initialization value, or assign to the new column a Pandas Series or another column from other DataFrame.

In[]: 
df['new']=1
df

Out[]:


0new

00.2383041

12.0685581

21.0156501

30.5062081

40.2147601

Delete existing column

Likewise, we can remove one or more columns from the Python Pandas DataFrame. Let's create a DataFrame with 5 rows and 4 columns with random values to delete one column.

In[]: 

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4))
df

Out[]:


0123

0-1.171562-0.086348-1.9718551.168017

1-0.408317-0.061397-0.542212-1.412755

2-0.365539-0.5871471.4946901.756105

30.6428820.9242020.517975-0.914366

40.777869-0.431151-0.4010930.145646

Now, we can delete the column that we specify by index or by the label if any:

In[]: 
del df[0]

Out[]:


123

0-0.086348-1.9718551.168017

1-0.061397-0.542212-1.412755

2-0.5871471.4946901.756105

30.9242020.517975-0.914366

4-0.431151-0.4010930.145646
In[]: 
df['new']=1
df

Out[]:


0new

00.2383041

12.0685581

21.0156501

30.5062081
In[]: 
del df['new']

Out[]:


0

00.238304

12.068558

21.015650

30.506208

.at[] (By label)

With at we can locate a specific value by row and column labels as follows:

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

df.at['a', 'A']

Out[]:

ABCD
a0.996496-0.1650020.7279120.564858
b-0.3881691.171039-0.231934-1.124595
c-1.3851290.2991950.573570-1.736860
d1.222447-0.3126670.957139-0.054156
e1.1883350.6799211.508681-0.677776
In[]: 
df.at['a', 'A']

Out[]: 

0.9964957014209125

It is possible to assign a new value with the same function too:

In[]: 
df.at['a', 'A'] = 0

Out[]:

ABCD
a0.000000-0.1650020.7279120.564858
b-0.3881691.171039-0.231934-1.124595
c-1.3851290.2991950.573570-1.736860
d1.222447-0.3126670.957139-0.054156
e1.1883350.6799211.508681-0.677776

.iat[] (By position)

With iat we can locate a specific value by row and column index as follows:

In[]: 
      import pandas as pd
      import numpy as np
      df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df)

      df.iat[0, 0]

Out[]: 
0.996496

It is possible to assign a new value with the same function too:

In[]: 
df.iat[0, 0] = 0

Out[]:

ABCD
a0.000000-0.1650020.7279120.564858
b-0.3881691.171039-0.231934-1.124595
c-1.3851290.2991950.573570-1.736860
d1.222447-0.3126670.957139-0.054156
e1.1883350.6799211.508681-0.677776

Conditional updating of values. For e.g. df[df > 0] = 1

Another useful function is to update values that meet some criteria, for example, update values whose values are greater than 0:

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5, 4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

df[df > 0] = 1
df

Out[]:

ABCD
a1.000000-0.0824661.000000-0.728372
b-0.784404-0.663096-0.5951121.000000
c-1.460702-1.072931-0.7613141.000000
d1.0000001.0000001.000000-0.302310
e-0.4885561.000000-0.798716-0.590920

We can also update the values of a specific column that meet some criteria, or even work with several columns as criteria and update a specific column.

In[]: 
df['A'][df['A'] < 0] = 1
print(df)

Out[]:

ABCD
a1.0-0.0824661.000000-0.728372
b1.0-0.663096-0.5951121.000000
c1.0-1.072931-0.7613141.000000
d1.01.0000001.000000-0.302310
e1.01.000000-0.798716-0.590920
In[]: 
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = 9
print(df)

Out[]:

ABCD
a1.0-0.0824661.000000-0.728372
b9.0-0.663096-0.5951121.000000
c9.0-1.072931-0.7613141.000000
d1.01.0000001.000000-0.302310
e1.01.000000-0.798716-0.590920

.dropna()

Occasionally, we may have a Python Pandas DataFrame that, for whatever reason, includes NA values. This type of values is usually problematic when we are making calculations or operations and must be treated properly before proceeding with them.

The easiest way to eliminate NA values is to remove the row that contains it.

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:

ABCD
a1.2723611.799535-0.5936191.152889
b-0.318368-0.1904190.1294201.551332
c0.1669511.669034-1.6536180.656313
d0.2199990.9510740.442325-0.170177
e0.312319-0.765930-1.641234-1.388924
In[]: 
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = np.nan
print(df)

Out[]:

ABCD
a1.2723611.799535-0.5936191.152889
b-0.318368-0.1904190.1294201.551332
c0.1669511.669034-1.6536180.656313
d0.2199990.9510740.442325-0.170177
eNaN-0.765930-1.641234-1.388924
In[]: 
df=df.dropna()
print(df)

Out[]:

ABCD
a1.2723611.799535-0.5936191.152889
b-0.318368-0.1904190.1294201.551332
c0.1669511.669034-1.6536180.656313
d0.2199990.9510740.442325-0.170177

Here we are deleting the whole row that has, in any of its columns, a NaN value, but we can also specify that it deletes the column that any of its values is NaN:

df=df.dropna(axis=1)
print(df)

We can specify if a single NaN value is enough to delete the row or column, or if the whole row or column must have NaN to delete it.

```python
df=df.dropna(how='all')
print(df)

.fillna()

With the previous function, we have seen how to eliminate a complete row or column that contains one or all the values to NaN, this operation can be a little drastic if we have valid values in the row or column.

For this, it is interesting to use the fillna function that substitutes the NaN values with some fixed value.

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:

ABCD
a1.2723611.799535-0.5936191.152889
b-0.318368-0.1904190.1294201.551332
c0.1669511.669034-1.6536180.656313
d0.2199990.9510740.442325-0.170177
e0.312319-0.765930-1.641234-1.388924
In[]: 
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = np.nan
print(df)

Out[]:

ABCD
a1.2723611.799535-0.5936191.152889
b-0.318368-0.1904190.1294201.551332
c0.1669511.669034-1.6536180.656313
d0.2199990.9510740.442325-0.170177
eNaN-0.765930-1.641234-1.388924
In[]: 
df=df.fillna(999)
print(df)

Out[]:

ABCD
a1.2723611.799535-0.5936191.152889
b-0.318368-0.1904190.1294201.551332
c0.1669511.669034-1.6536180.656313
d0.2199990.9510740.442325-0.170177
e999-0.765930-1.641234-1.388924

.apply()

apply is a very useful way to use functions or methods in a DataFrame without having to loop through it.

We can apply the Apply function to a Series or DataFrame to apply a function to all rows or columns of the DataFrame. Let's see some examples.

Suppose we are working with the randomly generated Python Pandas DataFrame and need to apply a function. In this example, for simplicity's sake, we're going to create a custom function to square a number.

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:
`

ABCD
a-0.633249-2.6990880.5740520.652742
b0.060295-0.1505270.149123-0.701216
c-0.0525150.4694810.899180-0.608409
d-1.3529120.1033020.457878-1.897170
e0.0882790.418317-1.1029890.582455
def square_number(number):
  return number**2

# Test the function
In[]: 
square_number(2)

Out[]: 
4

Now, let's use the custom function through Apply:

In[]: 
df.apply(square_number, axis=1)

Out[]:

ABCD
a0.4010057.2850740.3295360.426073
b0.0036360.0226580.0222380.491704
c0.0027580.2204120.8085240.370161
d1.8303720.0106710.2096523.599253
e0.0077930.1749891.2165860.339254

This method apply the function square_number to all rows of the DataFrame.

.shift()

The shift function allows us to move a row to the right or left and/or to move a column up or down. Let's look at some examples.

First, we are going to move the values of a column downwards:

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:

ABCD
a-0.633249-2.6990880.5740520.652742
b0.060295-0.1505270.149123-0.701216
c-0.0525150.4694810.899180-0.608409
d-1.3529120.1033020.457878-1.897170
e0.0882790.418317-1.1029890.582455
In[]: 
df['D'].shift(1)

Out[]:

ABCD
a-0.633249-2.6990880.574052NaN
b0.060295-0.1505270.1491230.652742
c-0.0525150.4694810.899180-0.701216
d-1.3529120.1033020.457878-0.608409
e0.0882790.418317-1.102989-1.897170

We are going to move the values of a column upwards

In[]: 
df['shift'] = df['D'].shift(-1)

Out[]:

ABCD
a-0.633249-2.6990880.574052-0.701216
b0.060295-0.1505270.149123-0.608409
c-0.0525150.4694810.899180-1.897170
d-1.3529120.1033020.4578780.582455
e0.0882790.418317-1.102989NaN

This is very useful for comparing the current value with the previous value.

Statistical Exploratory data analysis

Python Pandas DataFrame allows us to make some descriptive statistics calculations, which are very useful to make a first analysis of the data we are handling. Let's see some useful functions.

info()

It is a good practice to know the structure and format of our DataFrame, the Info function offers us just that:

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:

ABCD
a-0.633249-2.6990880.5740520.652742
b0.060295-0.1505270.149123-0.701216
c-0.0525150.4694810.899180-0.608409
d-1.3529120.1033020.457878-1.897170
e0.0882790.418317-1.1029890.582455
In[]: 
df.info()

Out[]:
       <class 'pandas.core.frame.DataFrame'>
      Index: 5 entries, a to e
      Data columns (total 5 columns):
      A        5 non-null float64
      B        5 non-null float64
      C        5 non-null float64
      D        5 non-null float64
      shift    4 non-null float64
      dtypes: float64(5)
      memory usage: 240.0+ bytes

describe()

We can obtain a statistical overview of the DataFrame with the `describe function, which gives us the mean, median, standard deviation, maximum, minimum, quartiles, etc. of each DataFrame column.

In[]: 
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:
`

ABCD
a-0.633249-2.6990880.5740520.652742
b0.060295-0.1505270.149123-0.701216
c-0.0525150.4694810.899180-0.608409
d-1.3529120.1033020.457878-1.897170
e0.0882790.418317-1.1029890.582455
In[]: 
df.describe()

Out[]:

ABCD
count5.0000005.0000005.0000005.000000
mean-0.378020-0.3717030.195449-0.394319
std0.6186811.3250460.7738761.054633
min-1.352912-2.699088-1.102989-1.897170
25%-0.633249-0.1505270.149123-0.701216
50%-0.0525150.1033020.457878-0.608409
75%0.0602950.4183170.5740520.582455
max0.0882790.4694810.8991800.652742

.value_counts()

The function value_counts counts the repeated values of the specified column:

In[]: 
df['A'].value_counts()

Out[]: 
       0.088279    1
      -0.052515    1
       0.060295    1
      -0.633249    1
      -1.352912    1
      Name: A, dtype: int64

.mean()

We can obtain the mean of a specific column or row by means of the mean function.

In[]:
df['A'].mean() # specifying a column

Out[]: 
-0.3780203497252693
`
In[]: 
df.mean() # by column
df.mean(axis=0) # by column

Out[]:
        A       -0.378020
        B       -0.371703
        C        0.195449
        D       -0.394319
        shift   -0.638513
        dtype: float64
`
In[]:
df.mean(axis=1) # by row

Out[]:
        a   -0.526386
        b    0.002084
        c    0.001304
        d   -0.659462
        e   -0.382222
        dtype: float64

.std()

We can obtain the standard deviation of a specific column or row by means of the std function.

In[]: 
df['A'].std() # specifying a column

Out[]: 
0.6186812554819784
`
In[]: 
df.std() # by column
df.std(axis=0) # by column

Out[]:
        A        0.618681
        B        1.325046
        C        0.773876
        D        1.054633
        shift    1.041857
        dtype: float64
`
In[]: 
df.std(axis=1) # by row

Out[]:  
        a    1.563475
        b    0.491499
        c    0.688032
        d    0.980517
        e    1.073244
        dtype: float64

Filtering Python Pandas DataFrame

We have already seen how to filter data in a Python Pandas DataFrame, including logical statements to filter rows or columns with some logical criteria. Let's remember some examples:

For example, we will filter rows whose column 'A' is greater than zero:

In[]: 
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)

Out[]:
`

ABCD
a-0.633249-2.6990880.5740520.652742
b0.060295-0.1505270.149123-0.701216
c-0.0525150.4694810.899180-0.608409
d-1.3529120.1033020.457878-1.897170
e0.0882790.418317-1.1029890.582455
In[]:
df_filtered = df[df['A'] > 0]
print(df_filtered)

Out[]:

ABCD
b0.060295-0.1505270.149123-0.701216
e0.0882790.418317-1.1029890.582455

We can also combine logical statements, we will filter all rows whose column 'A' and 'B' have their values greater than zero.

In[]: 
df_filtered = df[(df['A'] > 0) & (df['B'] > 0)] 
print(df_filtered)

Out[]:

ABCD
e0.0882790.418317-1.1029890.582455

Iterating Python Pandas DataFrame

We can go through the Python Pandas DataFrame row by row to do operations in each iteration, let's see some examples.

In[]: 
for item in df.iterrows():
          print(item)

Out[]:
('a', A       -0.633249
B       -2.699088
C        0.574052
D        0.652742
shift         NaN
Name: a, dtype: float64)
('b', A        0.060295
B       -0.150527
C        0.149123
D       -0.701216
shift    0.652742
Name: b, dtype: float64)
('c', A       -0.052515
B        0.469481
C        0.899180
D       -0.608409
shift   -0.701216
Name: c, dtype: float64)
('d', A       -1.352912
B        0.103302
C        0.457878
D       -1.897170
shift   -0.608409
Name: d, dtype: float64)
('e', A        0.088279
B        0.418317
C       -1.102989
D        0.582455
shift   -1.897170
Name: e, dtype: float64)

Merge, Append and Concat operation on a Python Pandas DataFrame

Another interesting feature of Python Pandas DataFrames is that we can merge, concatenate them and add new values, let's see how to do each of these operations in this section of Python Pandas tutorial.

  • merge function allows us to merge two Python Pandas DataFrame by rows:
In[]: 
      import numpy as np
      df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df1)

Out[]:
`

ABCD
a1.179924-1.5121240.7675570.019265
b0.019969-1.3516490.665298-0.989025
c0.351921-0.7929140.4551740.170751
d-0.1504990.151942-0.628074-0.347300
e-1.3075900.1857590.175967-0.170334
df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df2)
`

ABCD
a2.030462-0.337738-0.894440-0.757323
b0.4758071.350088-0.514070-0.843963
c0.948164-0.155052-0.6188931.319999
d1.433736-0.4550081.445698-1.051454
e0.5653451.802485-0.167189-0.227519
In[]: 
      df3 = pd.merge(df1, df2)
      print(df3)

Out[]:  
        Empty DataFrame
        Columns: [A, B, C, D]
        Index: []
  • append function allows us to append rows from one Python Pandas DataFrame to another Python Pandas DataFrame by rows:
In[] 
      import numpy as np
      df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df1)

      df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df2)

      df3 = df1.append(df2)
      print(df3)

Out[]

ABCD
a1.179924-1.5121240.7675570.019265
b0.019969-1.3516490.665298-0.989025
c0.351921-0.7929140.4551740.170751
d-0.1504990.151942-0.628074-0.347300
e-1.3075900.1857590.175967-0.170334
a2.030462-0.337738-0.894440-0.757323
b0.4758071.350088-0.514070-0.843963
c0.948164-0.155052-0.6188931.319999
d1.433736-0.4550081.445698-1.051454
e0.5653451.802485-0.167189-0.227519
  • concat function allows us to merge two Python Pandas DataFrame by rows or columns:
In[]: 
      import numpy as np
      df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df1)

      df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
      print(df2)

      df3 = pd.concat([df1, df2]) # concat by row
      print(df3)

Out[]:
`

ABCD
a1.179924-1.5121240.7675570.019265
b0.019969-1.3516490.665298-0.989025
c0.351921-0.7929140.4551740.170751
d-0.1504990.151942-0.628074-0.347300
e-1.3075900.1857590.175967-0.170334
a2.030462-0.337738-0.894440-0.757323
b0.4758071.350088-0.514070-0.843963
c0.948164-0.155052-0.6188931.319999
d1.433736-0.4550081.445698-1.051454
e0.5653451.802485-0.167189-0.227519
In[]: 
df3 = pd.concat([df1, df2], axis=0) # concat by row
print(df3)

Out[]:
`

ABCD
a1.179924-1.5121240.7675570.019265
b0.019969-1.3516490.665298-0.989025
c0.351921-0.7929140.4551740.170751
d-0.1504990.151942-0.628074-0.347300
e-1.3075900.1857590.175967-0.170334
a2.030462-0.337738-0.894440-0.757323
b0.4758071.350088-0.514070-0.843963
c0.948164-0.155052-0.6188931.319999
d1.433736-0.4550081.445698-1.051454
e0.5653451.802485-0.167189-0.227519
In[]: 
df3 = pd.concat([df1, df2], axis=1) # concat by column
print(df3)

Out[]:

ABCDABCD
a1.179924-1.5121240.7675570.0192652.030462-0.337738-0.894440-0.757323
b0.019969-1.3516490.665298-0.9890250.4758071.350088-0.514070-0.843963
c0.351921-0.7929140.4551740.1707510.948164-0.155052-0.6188931.319999
d-0.1504990.151942-0.628074-0.3473001.433736-0.4550081.445698-1.051454
e-1.3075900.1857590.175967-0.1703340.5653451.802485-0.167189-0.227519

TimeSeries in Python Pandas

Python Pandas TimeSeries includes a set of tools to work with Series or DataFrames indexed in time.

Usually, the series of financial data are of this type and therefore, knowing these tools will make our work much more comfortable.

In this section of the Python Pandas tutorial, we are going to start creating time series from scratch and then we will see how to manipulate them and convert them to different frequencies.

Indexing Python Pandas TimeSeries

With date_range Panda's method, we can create a time range with a certain frequency. For example, create a range starting on December 1st, 2018, with 30 occurrences with an hourly frequency.

In[]: 
rng = pd.date_range('12/1/2018', periods=30, freq='H')
print(rng)

Out[]: 
DatetimeIndex(['2018-12-01 00:00:00', '2018-12-01 01:00:00',
               '2018-12-01 02:00:00', '2018-12-01 03:00:00',
               '2018-12-01 04:00:00', '2018-12-01 05:00:00',
               '2018-12-01 06:00:00', '2018-12-01 07:00:00',
               '2018-12-01 08:00:00', '2018-12-01 09:00:00',
               '2018-12-01 10:00:00', '2018-12-01 11:00:00',
               '2018-12-01 12:00:00', '2018-12-01 13:00:00',
               '2018-12-01 14:00:00', '2018-12-01 15:00:00',
               '2018-12-01 16:00:00', '2018-12-01 17:00:00',
               '2018-12-01 18:00:00', '2018-12-01 19:00:00',
               '2018-12-01 20:00:00', '2018-12-01 21:00:00',
               '2018-12-01 22:00:00', '2018-12-01 23:00:00',
               '2018-12-02 00:00:00', '2018-12-02 01:00:00',
               '2018-12-02 02:00:00', '2018-12-02 03:00:00',
               '2018-12-02 04:00:00', '2018-12-02 05:00:00'],
              dtype='datetime64[ns]', freq='H')

We can do the same to get a daily frequency (or any other, see the documentation). We can use the freq parameter to adjust this.

In[]: 
rng = pd.date_range('12/1/2018', periods=30, freq='D')
print(rng)

Out[]: 
DatetimeIndex(['2018-12-01', '2018-12-02', '2018-12-03', '2018-12-04',
               '2018-12-05', '2018-12-06', '2018-12-07', '2018-12-08',
               '2018-12-09', '2018-12-10', '2018-12-11', '2018-12-12',
               '2018-12-13', '2018-12-14', '2018-12-15', '2018-12-16',
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-22', '2018-12-23', '2018-12-24',
               '2018-12-25', '2018-12-26', '2018-12-27', '2018-12-28',
               '2018-12-29', '2018-12-30'],
              dtype='datetime64[ns]', freq='D')

Now, we have a DateTimeIndex in the rng object and we can use it to create a Series or DataFrame:

In[]: 
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)

Out[]:
`

ABCD
2018-12-010.0486030.9685220.4082130.921774
2018-12-02-2.301373-2.310408-0.559381-0.652291
2018-12-03-2.3378440.3299540.2892210.259132
2018-12-041.3575210.9698081.3418750.767797
2018-12-05-1.212355-0.077457-0.5295640.375572
2018-12-06-0.6730650.5277540.006344-0.533316
2018-12-070.2261450.2350270.945678-1.766167
2018-12-081.735185-0.6042290.2748090.841128
In[]: 
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)

Out[]:  
        2018-12-01    0.349234
        2018-12-02   -1.807753
        2018-12-03    0.112777
        2018-12-04    0.421516
        2018-12-05   -0.992449
        2018-12-06    1.254999
        2018-12-07   -0.311152
        2018-12-08    0.331584
        2018-12-09    0.196904
        2018-12-10   -1.619186
        2018-12-11    0.478510
        2018-12-12   -1.036074

Sometimes, we read the data from internet sources or from csv files and we need to convert the date column into the index to work properly with the Series or DataFrame.

In[]: 
import pandas as pd
df=pd.read_csv('TSLA.csv')
df.tail()

Out[]:

DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
10782013-12-11141.88143.05139.49139.657137800.00.01.0141.88143.05139.49139.657137800.0TSLA
10792013-12-10140.05145.87139.86142.1910748200.00.01.0140.05145.87139.86142.1910748200.0TSLA
10802013-12-09137.00141.70134.21141.609061500.00.01.0137.00141.70134.21141.609061500.0TSLA
10812013-12-06141.51142.49136.30137.367909600.00.01.0141.51142.49136.30137.367909600.0TSLA
10822013-12-05140.15143.35139.50140.489288400.00.01.0140.15143.35139.50140.489288400.0TSLA

Here, we can see the index as numeric and a Date column, let's convert this column into the index to indexing our DataFrame, read from a csv file, in time. For this, we are going to use the Python Pandas set_index method

In[]: 
df = df.set_index('Date')
df.tail()

Out[]:
DateOpenHighLowCloseVolumeExDividendSplitRatioAdjOpenAdjHighAdjLowAdjCloseAdjVolumeName
2013-12-11141.88143.05139.49139.657137800.00.01.0141.88143.05139.49139.657137800.0TSLA
2013-12-10140.05145.87139.86142.1910748200.00.01.0140.05145.87139.86142.1910748200.0TSLA
2013-12-09137.00141.70134.21141.609061500.00.01.0137.00141.70134.21141.609061500.0TSLA
2013-12-06141.51142.49136.30137.367909600.00.01.0141.51142.49136.30137.367909600.0TSLA
2013-12-05140.15143.35139.50140.489288400.00.01.0140.15143.35139.50140.489288400.0TSLA

Now, we have Pandas TimeSeries ready to work.

Resampling Pandas TimeSeries

A very useful feature of Python Pandas TimeSeries is the resample capacity, this allows us to pass the current frequency to another higher frequency (we can't pass to lower frequencies, because we don't know the data).

As it can be supposed, when we pass from one frequency to another data could be lost, for this, we must use some function that treats the values of each frequency interval, for example, if we pass from an hourly frequency to daily, we must specify what we want to do with the group of data that fall inside each frequency, we can do a mean, a sum, we can get the maximum or the minimum, etc.

In[]: 
rng = pd.date_range('12/1/2018', periods=30, freq='H')
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)

Out[]:
`

ABCD
2018-12-01 00:00:000.0486030.9685220.4082130.921774
2018-12-01 01:00:00-2.301373-2.310408-0.559381-0.652291
2018-12-01 02:00:00-2.3378440.3299540.2892210.259132
2018-12-01 03:00:001.3575210.9698081.3418750.767797
2018-12-01 04:00:00-1.212355-0.077457-0.5295640.375572
2018-12-01 05:00:00-0.6730650.5277540.006344-0.533316
In[]: 
ts = ts.resample("1D").mean()
print(ts)

Out[]:

ABCD
2018-12-010.4490500.127412-0.154179-0.358324
2018-12-02-0.539007-0.8558940.0000100.454623

Manipulating TimeSeries

We can manipulate the Python Pandas TimeSeries in the same way that we have done until now, since they offer us the same capacity that the Pandas Series and the Pandas DataFrames.

Additionally, we can work comfortably with all jobs related to handling dates. For example, to obtain all the data from a date, to obtain the data in a range of dates, etc.

In[]: 
rng = pd.date_range('12/1/2018', periods=30, freq='D')
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)

Out[]:

ABCD
2018-12-010.0486030.9685220.4082130.921774
2018-12-02-2.301373-2.310408-0.559381-0.652291
2018-12-03-2.3378440.3299540.2892210.259132
2018-12-041.3575210.9698081.3418750.767797
2018-12-05-1.212355-0.077457-0.5295640.375572
2018-12-06-0.6730650.5277540.006344-0.533316

Getting all values from a specific date:

In[]: 
ts['2018-12-15':]

Out[]:

ABCD
2018-12-020.324689-0.4137230.0191630.385233
2018-12-03-2.1989370.536600-0.540934-0.603858
2018-12-04-1.1951482.191311-0.981604-0.942440
2018-12-050.621298-1.435266-0.761886-1.787730
2018-12-060.6356790.6832650.351140-1.451903

Getting all values inside a date range:

In[]: 
ts['2018-12-15':'2018-12-20']

Out[]:

ABCD
2018-12-150.6055760.584369-1.520749-0.242630
2018-12-16-0.105561-0.0921240.3850850.918222
2018-12-170.337416-1.3675490.7383202.413522
2018-12-18-0.011610-0.339228-0.218382-0.070349
2018-12-190.027808-0.422975-0.6227770.730926
2018-12-200.188822-1.0166370.4708740.674052

Summary

Let us summarise a few points we learnt in the Python Pandas tutorial

  • Python Pandas DataFrame and Python Pandas Series are some of the most important data structures. It is a must to acquire fluency in its handling because we will find them in practically all the problems that we handle.
  • A Python Pandas DataFrame is a data structure formed by rows and columns and has an index.
  • We must think of them as if they were data tables (for the Array with a single column) with which we can select, filter, sort, add and delete elements, either by rows or columns.
  • Help in ETL processes (Extraction, Transformation and Loading)
  • We can select, insert, delete and update elements with simple functions.
  • We can perform computations by rows or columns.
  • Has the ability to run vectorized computations.
  • We can work with several DataFrames at the same time.
  • Indexing and subsetting are the most important features of Python Pandas.
  • Facilitates the statistical exploration of the data.
  • It offers us a variety of options for handling NaN data.
  • Another additional advantage is the ability to read & write multiple data formats (CSV, Excel, HDF5, etc.).
  • Retrieve data from external sources (Yahoo, Google, Quandl Python API, etc.)
  • Finally, it has the ability to work with date and time indexes and offers
    us a set of functions to work with dates.

Conclusion

To reiterate, this article is an excerpt from the Python Basics Handbook, which was created for both; beginners who are starting out in Python as well as accomplished traders who can use it as a handy reference while coding their strategy.

Do let us know if you loved the article and any other feedback in the comments below.

Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.

Live Webinar: EPAT Curriculum