By Jay Parmar
Pandas is a Python library to deal with sequential and tabular data. It includes many tools to manage, analyze and manipulate data in a convenient and efficient manner. We can think of its data structures as akin to database tables or spreadsheets.
Pandas is built on top of the Numpy library and has two primary data structures viz. Series (1-dimensional) and DataFrame (2-dimensional). It can handle both homogeneous and heterogeneous data, and some of its many capabilities are:
- ETL tools (Extraction, Transformation and Load tools)
- Dealing with missing data (NaN)
- Dealing with data files (csv, xls, db, hdf5, etc.)
- Time-series manipulation tools
In the Python ecosystem, Pandas is the best choice to retrieve, manipulate, analyze and transform financial data.
In this Python Pandas tutorial, we will learn the basics of the Python Pandas module as well as understand some of the codes. This blog article is an excerpt from the Python Basics Handbook created for the simple purpose of making the reader understand the beauty and simplicity of the Python language.
The contents of this article include:
- Installing Python Pandas
- What problem does Python Pandas solve?
- Python Pandas Series
- Python Pandas DataFrame
- Importing data in Python Pandas
- Indexing and Subsetting
- Manipulating a Python Pandas DataFrame
- Statistical Exploratory data analysis
- Filtering Python Pandas DataFrame
- Iterating Python Pandas DataFrame
- Merge, Append and Concat Python Pandas DataFrame
- Time-Series in Pandas
Installing Python Pandas
In this section of the Python Pandas tutorial, we will start by going through the official documentation which has a detailed explanation here on installing Pandas. We summarize it below.
With pip
The simplest way to install Pandas is from PyPI.
In a terminal window, run the following command.
pip install pandas
In your code, you can use the escape character '!' to install pandas directly from your Python console.
!pip install pandas
Pip is a useful tool to manage Python's packages and it is worth investing some time in knowing it better.
pip help
With Conda environments
For advanced users, who like to work with Python environments for each project, you can create a new environment and install pandas as shown below.
conda create -n EPAT python
source activate EPAT
conda install pandas
Testing Pandas installation
To check the installation, Pandas comes with a test suite to test almost all of the codebase and verify that everything is working.
import pandas as pd
pd.test()
What problem does Python Pandas solve?
In this section of the Python Pandas tutorial, we will understand the use of Python Pandas.
Python Pandas works with homogeneous data series (1-Dimension) and heterogeneous tabular data series (2-Dimensions). It includes a multitude of tools to work with these data types, such as:
- Indexes and labels.
- Searching of elements.
- Insertion, deletion and modification of elements.
- Apply set techniques, such as grouping, joining, selecting, etc.
- Data processing and cleaning.
- Work with time series.
- Make statistical calculations
- Draw graphics
- Connectors for multiple data file formats, such as, csv, xlsx, hdf5, etc.
Pandas Documentation: 10 minutes with Pandas
Python Pandas Series
The first data structure we will go through in the Python Pandas tutorial is the Series.
Python Pandas Series are homogeneous one-dimensional objects, that is, all data are of the same type and are implicitly labelled with an index.
For example, we can have a Series of integers, real numbers, characters, strings, dictionaries, etc. We can conveniently manipulate these series performing operations like adding, deleting, ordering, joining, filtering, vectorized operations, statistical analysis, plotting, etc.
Let's see some examples of how to create and manipulate a Python Pandas Series:
- We will start by creating an empty Python Pandas Series:
import pandas as pd
s = pd.Series()
print(s)
Out[]:
Series([], dtype: float64)
- Let's create a Python Pandas Series of integers:
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5, 6, 7])
print(s)
Out[]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64
- Let's create a Python Pandas Series of characters:
import pandas as pd
s = pd.Series(['a', 'b', 'c', 'd', 'e'])
print(s)
Out[]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64
- Let's create a random Python Pandas Series of float numbers:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print(s)
Out[]:
0 0.383567
1 0.869761
2 1.100957
3 -0.259689
4 0.704537
dtype: float64
In all these examples we saw in the Python Pandas tutorial, we have allowed the index label to appear by default (without explicitly programming it). It starts at 0, and we can check the index as:
In[]:
s.index
Out[]:
RangeIndex(start=0, stop=5, step=1)
But we can also specify the index we need, for example:
In[]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
Out[]:
a 1.392051
b 0.515690
c -0.432243
d -0.803225
e 0.832119
dtype: float64
- Let's create a Python Pandas Series from a dictionary:
import pandas as pd
dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
s = pd.Series(dictionary)
print(s)
Out[]:
a 1
b 2
c 3
d 4
e 5
dtype: int64
In this case, the Python Pandas Series is created with the dictionary keys as index unless we specify any other index.
Simple operations with Python Pandas Series
When we have a Python Pandas Series, we can perform several simple operations on it.
For example, in this section of the Python Pandas tutorial, let's create two Series. One from a dictionary and the other from an array of integers:
In[]:
import pandas as pd
dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
s1 = pd.Series(dictionary)
array = [1, 2, 3, 4, 5]
s2 = pd.Series(array)
Out[]:
a 1
b 2
c 3
d 4
e 5
dtype: int64
0 1
1 2
2 3
3 4
4 5
dtype: int64
We can perform operations similar to Numpy arrays:
- Selecting one item from the Python Pandas Series by means of its index:
In[]:
s1[0] # Select the first element
Out[]:
1
In[]:
s1['a']
Out[]:
1
In[]:
s2[0]
Out[]:
1
- Selecting several items from the Python Pandas Series by means of its index:
In[]:
s1[[1,4]]
Out[]:
b 2
e 5
dtype: int64
In[]:
s1[['b', 'e']]
Out[]:
b 2
e 5
dtype: int64
In[]:
s2[[1,4]]
Out[]:
b 2
e 5
dtype: int64
- Get the Python Pandas series starting from an element:
In[]:
s1[2:]
Out[]:
c 3
d 4
e 5
dtype: int64
In[]:
s2[2:]
Out[]:
2 3
3 4
4 5
dtype: int64
- Get the Python Pandas series up to one element:
In[]:
s1[:2]
Out[]:
c 3
d 4
e 5
dtype: int64
In[]:
s2[:2]
Out[]:
2 3
3 4
4 5
dtype: int64
We can perform operations like a dictionary:
- Assign a value:
In[]:
s1[1] = 99
s1['a'] = 99
Out[]:
a 1
b 99
c 3
d 4
e 5
dtype: int64
In[]:
s2[1] = 99
print(s2)
Out[]:
0 1
1 99
2 3
3 4
4 5
dtype: int64
- Get a value by index (like dictionary key):
In[]:
s.get('b')
Out[]:
2
Here are some powerful vectorized operations that let us perform quickly calculations, for example:
- Add, subtract, multiply, divide, power, and almost any NumPy function that accepts NumPy arrays.
s1 + 2
s1 - 2
s1 * 2
s1 / 2
s1 ** 2
np.exp(s1)
- We can perform the same operations over two Python Pandas Series although these must be aligned, that is, to have the same index, in other cases, perform a Union operation.
In[]:
s1 + s1 # The indices are aligned
Out[]:
a 2
b 4
c 6
d 8
e 10
dtype: int64
In[]:
s1 + s2 # The indices are unaligned
Out[]:
a NaN
b NaN
c NaN
d NaN
e NaN
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
Python Pandas DataFrame
The second data structure in Python Pandas that we are going to see is the DataFrame.
Python Pandas DataFrame is a heterogeneous two-dimensional object, that is, the data are of the same type within each column but it could be a different data type for each column and are implicitly or explicitly labelled with an index.
We can think of a Python Pandas DataFrame as a database table, in which we store heterogeneous data. For example, a Python Pandas DataFrame with one column for the first name, another for the last name and a third column for the phone number, or a Python Pandas dataframe with columns to store the opening price, close price, high, low, volume, and so on.
The index can be implicit, starting with zero or we can specify it ourselves, even working with dates and times as indexes as well.
Let's see some examples of how to create and manipulate a Python Pandas DataFrame.
- Creating an empty Python Pandas DataFrame:
In[]:
import pandas as pd
s = pd.DataFrame()
print(s)
Out[]:
Empty DataFrame
Columns: []
Index: []
- Creating an empty structure Python Pandas DataFrame:
In[]:
import pandas as pd
s = pd.DataFrame(columns=['A','B','C','D','E'])
print(s)
Out[]:
Empty DataFrame
Columns: [A, B, C, D, E]
Index: []
In[]:
import pandas as pd
s = pd.DataFrame(columns=['A','B','C','D','E'], index=range(1, 6))
print(s)
Out[]:
A B C D E
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
- Creating a Python Pandas DataFrame passing a NumPy array:
In[]:
array = {'A' : [1, 2, 3, 4],
'B' : [4, 3, 2, 1]}
pd.DataFrame(array)
Out[]:
A B
0 1 4
1 2 3
2 3 2
3 4 1
- Creating a Python Pandas DataFrame passing a NumPy array, with datetime index:
In[]:
import pandas as pd
array = {'A' : [1, 2, 3, 4],'B' : [4, 3, 2, 1]}
index = pd.DatetimeIndex(['2018-12-01', '2018-12-02','2018-12-03', '2018-12-04'])
pd.DataFrame(array, index=index)
Out[]:
A B
2018-12-01 1 4
2018-12-02 2 3
2018-12-03 3 2
2018-12-04 4 1
- Creating a Python Pandas DataFrame passing a Dictionary:
In[]:
import pandas as pd
dictionary = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
pd.DataFrame([dictionary])
Out[]:
a b c d e
0 1 2 3 4 5
- Viewing a Python Pandas DataFrame: We can use some methods to explore the Pandas DataFrame:
First, we go to create a Python Pandas DataFrame to work with it.
In[]:
import pandas as pd
pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})
Out[]:
A B C
0 0.164358 1.689183 1.745963
1 -1.830385 0.035618 0.047832
2 1.304339 2.236809 0.920484
3 0.365616 1.877610 -0.287531
4 -0.741372 -1.443922 -1.566839
5 -0.119836 -1.249112 -0.134560
6 -0.848425 -0.569149 -1.222911
7 -1.172688 0.515443 1.492492
8 0.765836 0.307303 0.788815
9 0.761520 -0.409206 1.298350
- Get the first three rows:
In[]:
import pandas as pd
df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})
df.head(3)
Out[]:
A B C
0 0.164358 1.689183 1.745963
1 -1.830385 0.035618 0.047832
2 1.304339 2.236809 0.920484
- Get the last three rows:
In[]:
import pandas as pd
df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})
df.tail(3)
Out[]:
A B C
7 -1.172688 0.515443 1.492492
8 0.765836 0.307303 0.788815
9 0.761520 -0.409206 1.298350
- Get the Python Pandas DataFrame's index:
In[]:
import pandas as pd
df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})
df.index
Out[]:
RangeIndex(start=0, stop=10, step=1)
- Get the Python Pandas DataFrame's columns:
In[]:
import pandas as pd
df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})
df.columns
Out[]:
Index(['A', 'B', 'C'], dtype='object')
- Get the Python Pandas DataFrame's values:
In[]:
import pandas as pd
df=pd.DataFrame({'A' : np.random.randn(10), 'B' : np.random.randn(10), 'C' : np.random.randn(10)})
df.values
Out[]:
array([[ 0.6612966 , -0.60985049, 1.11955054],
[-0.74105636, 1.42532491, -0.74883362],
[ 0.10406892, 0.5511436 , 2.63730671],
[-0.73027121, -0.11088373, -0.19143175],
[ 0.11676573, 0.27582786, -0.38271609],
[ 0.51073858, -0.3313141 , 0.20516165],
[ 0.23917755, 0.55362 , -0.62717194],
[ 0.25565784, -1.4960713 , 0.58886377],
[ 1.20284041, 0.21173483, 2.0331718 ],
[ 0.62247283, 2.18407105, 0.02431867]])
Importing data in Python Pandas
Python Pandas DataFrame is able to read several data formats, some of the most used are: CSV, JSON, Excel, HDF5, SQL, etc.
CSV to DataFrame
One of the most useful functions is read_csv
that allows us to read csv files with almost any format and load it into our Python Pandas DataFrame to work with it. Let's see how to work with csv files:
import pandas as pd
df=pd.read_csv('Filename.csv')
type(df)
Out[]:
pandas.core.frame.DataFrame
This simple operation loads the csv file into the Python Pandas DataFrame after which we can explore it as we have seen before.
Customizing pandas import
Sometimes the format of the csv file comes with a particular separator or we need specific columns or rows. In this section of the Python Pandas tutorial, we will now see some ways to deal with this.
In this example, we want to load a csv file with blank space as separator:
import pandas as pd
df=pd.read_csv('Filename.csv', sep=' ')
In this example, we want to load columns from 0 and 5 and the first 100 rows:
import pandas as pd
df=pd.read_csv('Filename.csv', usecols=[0,1,2,3,4,5], nrows=100)
It's possible to customize the headers, convert the columns or rows names and carry out a good number of other operations.
Check the IO tools from Pandas documentation.
Importing sheets from Excel files
In the same way that we have worked with csv files, we can work with Excel file with the read_excel
function, let's see some examples:
In this example, we want to load the sheet 1 from an Excel file:
import pandas as pd
df=pd.read_excel('Filename.xls', sheet_name='Sheet1')
This simple operation loads Sheet 1 from the Excel file into the Pandas DataFrame.
Indexing and Subsetting
Once we have the Python Pandas DataFrame prepared, independent of the source of our data (csv, Excel, hdf5, etc.) we can work with it, as if it were a database table, selecting the elements that interest us. We will work with some examples of how to index and extract subsets of data.
Let's begin with loading a csv file having details of a market instrument.
In[]:
import pandas as pd
df=pd.read_csv('MSFT.csv' , usecols=[0,1,2,3,4,5])
df.head()
df.shape
Out[]:
Date Open High Low Close Volume
0 2008-12-29 19.1500 19.21 18.64 18.96 58512800.0
1 2008-12-30 19.0100 19.49 19.00 19.34 43224100.0
2 2008-12-31 19.3100 19.68 19.27 19.44 46419000.0
3 2009-01-02 19.5328 20.40 19.37 20.33 50084000.0
4 2009-01-05 20.2000 20.67 20.06 20.52 61475200.0
(1000, 6)
Here, we have read a csv file, of which we only need the columns of date, opening, closing, high, low and volume (the first 6 columns) and we check the form of the DataFrame that has 1000 rows and 6 columns.
Selecting a single column
In the previous code, we have read directly the first 6 columns from the csv file. This is a filter that we applied because we were only interested in those columns.
We can apply selection filters to the Python Pandas DataFrame itself, to select one column to work with. For example, we could need the Close
column:
In[]:
close=df['Close']
close.head()
Out[]:
Close
0 18.96
1 19.34
2 19.44
3 20.33
4 20.52
Selecting multiple columns
We can select multiple columns too:
In[]:
closevol=df[['Close', 'Volume']]
closevol.head()
Out[]:
Close Volume
0 18.96 58512800.0
1 19.34 43224100.0
2 19.44 46419000.0
3 20.33 50084000.0
4 20.52 61475200.0
Selecting rows via [].
We can select a set of rows by index:
In[]:
import pandas as pd
df=pd.read_csv('TSLA.csv' )
df[100:110]
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | 2017-10-30 | 319.18 | 323.7800 | 317.25 | 320.08 | 4236029.0 | 0.0 | 1.0 | 319.18 | 323.7800 | 317.25 | 320.08 | 4236029.0 | TSLA |
101 | 2017-10-27 | 319.75 | 324.5900 | 316.66 | 320.87 | 6942493.0 | 0.0 | 1.0 | 319.75 | 324.5900 | 316.66 | 320.87 | 6942493.0 | TSLA |
102 | 2017-10-26 | 327.78 | 330.2300 | 323.20 | 326.17 | 4980316.0 | 0.0 | 1.0 | 327.78 | 330.2300 | 323.20 | 326.17 | 4980316.0 | TSLA |
103 | 2017-10-25 | 336.70 | 337.5000 | 323.56 | 325.84 | 8547764.0 | 0.0 | 1.0 | 336.70 | 337.5000 | 323.56 | 325.84 | 8547764.0 | TSLA |
104 | 2017-10-24 | 338.80 | 342.8000 | 336.16 | 337.34 | 4463807.0 | 0.0 | 1.0 | 338.80 | 342.8000 | 336.16 | 337.34 | 4463807.0 | TSLA |
105 | 2017-10-23 | 349.88 | 349.9500 | 336.25 | 337.02 | 5715817.0 | 0.0 | 1.0 | 349.88 | 349.9500 | 336.25 | 337.02 | 5715817.0 | TSLA |
106 | 2017-10-20 | 352.69 | 354.5500 | 344.34 | 345.10 | 4888221.0 | 0.0 | 1.0 | 352.69 | 354.5500 | 344.34 | 345.10 | 4888221.0 | TSLA |
107 | 2017-10-19 | 355.56 | 357.1465 | 348.20 | 351.81 | 5032884.0 | 0.0 | 1.0 | 355.56 | 357.1465 | 348.20 | 351.81 | 5032884.0 | TSLA |
108 | 2017-10-18 | 355.97 | 363.0000 | 354.13 | 359.65 | 4898808.0 | 0.0 | 1.0 | 355.97 | 363.0000 | 354.13 | 359.65 | 4898808.0 | TSLA |
109 | 2017-10-17 | 350.91 | 356.2200 | 350.07 | 355.75 | 3280670.0 | 0.0 | 1.0 | 350.91 | 356.2200 | 350.07 | 355.75 | 3280670.0 | TSLA |
Or we can select a set of rows and columns:
In[]:
df[100:110][['Close', 'Volume']]
Out[]:
Close Volume
100 320.08 4236029.0
101 320.87 6942493.0
102 326.17 4980316.0
103 325.84 8547764.0
104 337.34 4463807.0
105 337.02 5715817.0
106 345.10 4888221.0
107 351.81 5032884.0
108 359.65 4898808.0
109 355.75 3280670.0
Selecting via .loc[] (By label)
With df.loc
we can do the same selections using labels:
To select a set of rows, we can code the following using the index number as label:
In[]:
df.loc[100:110]
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | 2017-10-30 | 319.18 | 323.7800 | 317.25 | 320.08 | 4236029.0 | 0.0 | 1.0 | 319.18 | 323.7800 | 317.25 | 320.08 | 4236029.0 | TSLA |
101 | 2017-10-27 | 319.75 | 324.5900 | 316.66 | 320.87 | 6942493.0 | 0.0 | 1.0 | 319.75 | 324.5900 | 316.66 | 320.87 | 6942493.0 | TSLA |
102 | 2017-10-26 | 327.78 | 330.2300 | 323.20 | 326.17 | 4980316.0 | 0.0 | 1.0 | 327.78 | 330.2300 | 323.20 | 326.17 | 4980316.0 | TSLA |
103 | 2017-10-25 | 336.70 | 337.5000 | 323.56 | 325.84 | 8547764.0 | 0.0 | 1.0 | 336.70 | 337.5000 | 323.56 | 325.84 | 8547764.0 | TSLA |
104 | 2017-10-24 | 338.80 | 342.8000 | 336.16 | 337.34 | 4463807.0 | 0.0 | 1.0 | 338.80 | 342.8000 | 336.16 | 337.34 | 4463807.0 | TSLA |
105 | 2017-10-23 | 349.88 | 349.9500 | 336.25 | 337.02 | 5715817.0 | 0.0 | 1.0 | 349.88 | 349.9500 | 336.25 | 337.02 | 5715817.0 | TSLA |
106 | 2017-10-20 | 352.69 | 354.5500 | 344.34 | 345.10 | 4888221.0 | 0.0 | 1.0 | 352.69 | 354.5500 | 344.34 | 345.10 | 4888221.0 | TSLA |
107 | 2017-10-19 | 355.56 | 357.1465 | 348.20 | 351.81 | 5032884.0 | 0.0 | 1.0 | 355.56 | 357.1465 | 348.20 | 351.81 | 5032884.0 | TSLA |
108 | 2017-10-18 | 355.97 | 363.0000 | 354.13 | 359.65 | 4898808.0 | 0.0 | 1.0 | 355.97 | 363.0000 | 354.13 | 359.65 | 4898808.0 | TSLA |
109 | 2017-10-17 | 350.91 | 356.2200 | 350.07 | 355.75 | 3280670.0 | 0.0 | 1.0 | 350.91 | 356.2200 | 350.07 | 355.75 | 3280670.0 | TSLA |
Or we can select a set of rows and columns like before:
In[]:
df.loc[100:110, ['Close', 'Volume']]
Out[]:
Close Volume
100 320.08 4236029.0
101 320.87 6942493.0
102 326.17 4980316.0
103 325.84 8547764.0
104 337.34 4463807.0
105 337.02 5715817.0
106 345.10 4888221.0
107 351.81 5032884.0
108 359.65 4898808.0
109 355.75 3280670.0
110 350.60 5353262.0
Selecting via .iloc[] (By position)
With df.iloc
we can do the same selections using integer position:
In[]:
df.iloc[100:110]
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | 2017-10-30 | 319.18 | 323.7800 | 317.25 | 320.08 | 4236029.0 | 0.0 | 1.0 | 319.18 | 323.7800 | 317.25 | 320.08 | 4236029.0 | TSLA |
101 | 2017-10-27 | 319.75 | 324.5900 | 316.66 | 320.87 | 6942493.0 | 0.0 | 1.0 | 319.75 | 324.5900 | 316.66 | 320.87 | 6942493.0 | TSLA |
102 | 2017-10-26 | 327.78 | 330.2300 | 323.20 | 326.17 | 4980316.0 | 0.0 | 1.0 | 327.78 | 330.2300 | 323.20 | 326.17 | 4980316.0 | TSLA |
103 | 2017-10-25 | 336.70 | 337.5000 | 323.56 | 325.84 | 8547764.0 | 0.0 | 1.0 | 336.70 | 337.5000 | 323.56 | 325.84 | 8547764.0 | TSLA |
104 | 2017-10-24 | 338.80 | 342.8000 | 336.16 | 337.34 | 4463807.0 | 0.0 | 1.0 | 338.80 | 342.8000 | 336.16 | 337.34 | 4463807.0 | TSLA |
105 | 2017-10-23 | 349.88 | 349.9500 | 336.25 | 337.02 | 5715817.0 | 0.0 | 1.0 | 349.88 | 349.9500 | 336.25 | 337.02 | 5715817.0 | TSLA |
106 | 2017-10-20 | 352.69 | 354.5500 | 344.34 | 345.10 | 4888221.0 | 0.0 | 1.0 | 352.69 | 354.5500 | 344.34 | 345.10 | 4888221.0 | TSLA |
107 | 2017-10-19 | 355.56 | 357.1465 | 348.20 | 351.81 | 5032884.0 | 0.0 | 1.0 | 355.56 | 357.1465 | 348.20 | 351.81 | 5032884.0 | TSLA |
108 | 2017-10-18 | 355.97 | 363.0000 | 354.13 | 359.65 | 4898808.0 | 0.0 | 1.0 | 355.97 | 363.0000 | 354.13 | 359.65 | 4898808.0 | TSLA |
109 | 2017-10-17 | 350.91 | 356.2200 | 350.07 | 355.75 | 3280670.0 | 0.0 | 1.0 | 350.91 | 356.2200 | 350.07 | 355.75 | 3280670.0 | TSLA |
In the last example, we used the index as an integer position rather than by label.
We can select a set of rows and columns like before:
In[]:
df.iloc[100:110, [3, 4]]
Out[]:
Low Close
100 317.25 320.08
101 316.66 320.87
102 323.20 326.17
103 323.56 325.84
104 336.16 337.34
105 336.25 337.02
106 344.34 345.10
107 348.20 351.81
108 354.13 359.65
109 350.07 355.75
Boolean indexing
So far in the Python Pandas tutorial, we have sliced subsets of data by label or by position. Now let's see how to select data that meet some criteria. We do this with Boolean indexing. We can use the same criteria similar to what we have seen with Numpy arrays. We show you just two illustrative examples here. This is by no means enough to get comfortable with it and so would encourage you to check the documentation and further readings at the end of this chapter to learn more.
- We can filter data that is greater (less) than a number.
In[]:
df[df.Close > 110]
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-03-27 | 304.00 | 304.2700 | 277.1800 | 279.1800 | 13696168.0 | 0.0 | 1.0 | 304.00 | 304.2700 | 277.1800 | 279.1800 | 13696168.0 | TSLA | |
1 | 2018-03-26 | 307.34 | 307.5900 | 291.3600 | 304.1800 | 8324639.0 | 0.0 | 1.0 | 307.34 | 307.5900 | 291.3600 | 304.1800 | 8324639.0 | TSLA | |
2 | 2018-03-23 | 311.25 | 311.6100 | 300.4500 | 301.5400 | 6600538.0 | 0.0 | 1.0 | 311.25 | 311.6100 | 300.4500 | 301.5400 | 6600538.0 | TSLA | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
1080 | 2013-12-09 | 137.00 | 141.7000 | 134.2100 | 141.6000 | 9061500.0 | 0.0 | 1.0 | 137.00 | 141.7000 | 134.2100 | 141.6000 | 9061500.0 | TSLA | |
1081 | 2013-12-06 | 141.51 | 142.4900 | 136.3000 | 137.3600 | 7909600.0 | 0.0 | 1.0 | 141.51 | 142.4900 | 136.3000 | 137.3600 | 7909600.0 | TSLA | |
1082 | 2013-12-05 | 140.15 | 143.3500 | 139.5000 | 140.4800 | 9288400.0 | 0.0 | 1.0 | 140.15 | 143.3500 | 139.5000 | 140.4800 | 9288400.0 | TSLA |
1083 rows × 14 columns
In[]:
df[(df['Close'] > 110) | (df['Close'] < 120)]
Out[]:
`
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-03-27 | 304.00 | 304.2700 | 277.1800 | 279.1800 | 13696168.0 | 0.0 | 1.0 | 304.00 | 304.2700 | 277.1800 | 279.1800 | 13696168.0 | TSLA | |
1 | 2018-03-26 | 307.34 | 307.5900 | 291.3600 | 304.1800 | 8324639.0 | 0.0 | 1.0 | 307.34 | 307.5900 | 291.3600 | 304.1800 | 8324639.0 | TSLA | |
2 | 2018-03-23 | 311.25 | 311.6100 | 300.4500 | 301.5400 | 6600538.0 | 0.0 | 1.0 | 311.25 | 311.6100 | 300.4500 | 301.5400 | 6600538.0 | TSLA | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
1080 | 2013-12-09 | 137.00 | 141.7000 | 134.2100 | 141.6000 | 9061500.0 | 0.0 | 1.0 | 137.00 | 141.7000 | 134.2100 | 141.6000 | 9061500.0 | TSLA | |
1081 | 2013-12-06 | 141.51 | 142.4900 | 136.3000 | 137.3600 | 7909600.0 | 0.0 | 1.0 | 141.51 | 142.4900 | 136.3000 | 137.3600 | 7909600.0 | TSLA | |
1082 | 2013-12-05 | 140.15 | 143.3500 | 139.5000 | 140.4800 | 9288400.0 | 0.0 | 1.0 | 140.15 | 143.3500 | 139.5000 | 140.4800 | 9288400.0 | TSLA |
1083 rows × 14 columns
Manipulating a Python Pandas DataFrame
When we are working with data, the most common structure is the DataFrame. We have seen how to create them, make selections and find data in the Python Pandas tutorial. We are now going to see how to manipulate the Python Pandas DataFrame to transform it into another Python Pandas DataFrame that has the form that our problem requires.
We'll see how to sort it, re-index it, eliminate unwanted (or spurious) data, add or remove columns and update values.
.T (Transposing)
The Python Pandas DataFrame transpose function T
allows us to transpose the rows as columns, and logically the columns as rows:
In[]: import pandas as pd
df=pd.read_csv('TSLA.csv' )
df2=df[100:110][['Close', 'Volume']]
df2.T
Out[]:
100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | |
---|---|---|---|---|---|---|---|---|---|---|
Close | 320.08 | 320.87 | 326.17 | 325.84 | 337.34 | 337.02 | 345.1 | 351.81 | 359.65 | 355.75 |
Volume | 4236029.00 | 6942493.00 | 4980316.00 | 8547764.00 | 4463807.00 | 5715817.00 | 4888221.0 | 5032884.00 | 4898808.00 | 3280670.00 |
.sort_index()
When we are working with Python Pandas Dataframe it is usual to add or remove rows, order by columns, etc. That's why it's important to have a function that allows us to easily and comfortably sort the Python Pandas DataFrame by its index. We do this with the sort_index
function of Pandas DataFrame.
In[]:
df.sort_index()
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-03-27 | 304.00 | 304.2700 | 277.1800 | 279.1800 | 13696168.0 | 0.0 | 1.0 | 304.00 | 304.2700 | 277.1800 | 279.1800 | 13696168.0 | TSLA | |
1 | 2018-03-26 | 307.34 | 307.5900 | 291.3600 | 304.1800 | 8324639.0 | 0.0 | 1.0 | 307.34 | 307.5900 | 291.3600 | 304.1800 | 8324639.0 | TSLA | |
2 | 2018-03-23 | 311.25 | 311.6100 | 300.4500 | 301.5400 | 6600538.0 | 0.0 | 1.0 | 311.25 | 311.6100 | 300.4500 | 301.5400 | 6600538.0 | TSLA | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
.sort_values()
Sometimes, we may be interested in sorting the Python Pandas DataFrame by some column or even with several columns as criteria. For example, sort the column by first names and the second criterion by last names. We do this with the sort_values
function of Python Pandas DataFrame.
In[]:
df.sort_values(by='Close')
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1081 | 2013-12-06 | 141.510 | 142.4900 | 136.3000 | 137.3600 | 7909600.0 | 0.0 | 1.0 | 141.510 | 142.4900 | 136.3000 | 137.3600 | 7909600.0 | TSLA | |
1057 | 2014-01-13 | 145.780 | 147.0000 | 137.8200 | 139.3400 | 6316100.0 | 0.0 | 1.0 | 145.780 | 147.0000 | 137.8200 | 139.3400 | 6316100.0 | TSLA | |
1078 | 2013-12-11 | 141.880 | 143.0500 | 139.4900 | 139.6500 | 7137800.0 | 0.0 | 1.0 | 141.880 | 143.0500 | 139.4900 | 139.6500 | 7137800.0 | TSLA | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
In[]:
df.sort_values(by=['Open', 'Close'])
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1080 | 2013-12-09 | 137.000 | 141.7000 | 134.2100 | 141.6000 | 9061500.0 | 0.0 | 1.0 | 137.000 | 141.7000 | 134.2100 | 141.6000 | 9061500.0 | TSLA | |
1077 | 2013-12-12 | 139.700 | 148.2400 | 138.5300 | 147.4700 | 10767800.0 | 0.0 | 1.0 | 139.700 | 148.2400 | 138.5300 | 147.4700 | 10767800.0 | TSLA | |
1079 | 2013-12-10 | 140.050 | 145.8700 | 139.8600 | 142.1900 | 10748200.0 | 0.0 | 1.0 | 140.050 | 145.8700 | 139.8600 | 142.1900 | 10748200.0 | TSLA | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
.reindex()
The Python Pandas' reindex function let us realign the index of the Series or DataFrame, it's useful when we need to reorganize the index to meet some criteria.
For example, we can play with the Series or DataFrame that we create before to alter the original index. For example, when the index is a label, we can reorganize as we need:
In[]: import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df
Out[]:
0 | |
---|---|
a | -0.134133 |
b | -0.586051 |
c | 1.179358 |
d | 0.433142 |
e | -0.365686 |
Now, we can reorganize the index as follows:
In[]:
df.reindex(['b', 'a', 'd', 'c', 'e'])
Out[]:
0 | |
---|---|
b | -0.586051 |
a | -0.134133 |
d | 0.433142 |
c | 1.179358 |
e | -0.365686 |
When the index is numeric we can use the same function to order by hand the index:
In[]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5))
df.reindex([4,3,2,1,0])
Out[]:
0 | |
---|---|
4 | 1.058589 |
3 | 1.194400 |
2 | -0.645806 |
1 | 0.836606 |
0 | 1.288102 |
Later in this section, we'll see how to work and reorganize date and time indices.
Adding a new column. For e.g. df[‘new_column’] = 1
Another interesting feature of Python Pandas DataFrames is the possibility of adding new columns to an existing DataFrame.
For example, we can add a new column to the random DataFrame that we have created before:
In[]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5))
Out[]:
0 | |
---|---|
0 | 0.238304 |
1 | 2.068558 |
2 | 1.015650 |
3 | 0.506208 |
4 | 0.214760 |
To add a new column, we only need to include the new column name in the DataFrame and assign an initialization value, or assign to the new column a Pandas Series or another column from other DataFrame.
In[]:
df['new']=1
df
Out[]:
0 | new | ||
---|---|---|---|
0 | 0.238304 | 1 | |
1 | 2.068558 | 1 | |
2 | 1.015650 | 1 | |
3 | 0.506208 | 1 | |
4 | 0.214760 | 1 |
Delete existing column
Likewise, we can remove one or more columns from the Python Pandas DataFrame. Let's create a DataFrame with 5 rows and 4 columns with random values to delete one column.
In[]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4))
df
Out[]:
0 | 1 | 2 | 3 | ||
---|---|---|---|---|---|
0 | -1.171562 | -0.086348 | -1.971855 | 1.168017 | |
1 | -0.408317 | -0.061397 | -0.542212 | -1.412755 | |
2 | -0.365539 | -0.587147 | 1.494690 | 1.756105 | |
3 | 0.642882 | 0.924202 | 0.517975 | -0.914366 | |
4 | 0.777869 | -0.431151 | -0.401093 | 0.145646 |
Now, we can delete the column that we specify by index or by the label if any:
In[]:
del df[0]
Out[]:
1 | 2 | 3 | ||
---|---|---|---|---|
0 | -0.086348 | -1.971855 | 1.168017 | |
1 | -0.061397 | -0.542212 | -1.412755 | |
2 | -0.587147 | 1.494690 | 1.756105 | |
3 | 0.924202 | 0.517975 | -0.914366 | |
4 | -0.431151 | -0.401093 | 0.145646 |
In[]:
df['new']=1
df
Out[]:
0 | new | ||
---|---|---|---|
0 | 0.238304 | 1 | |
1 | 2.068558 | 1 | |
2 | 1.015650 | 1 | |
3 | 0.506208 | 1 |
In[]:
del df['new']
Out[]:
0 | ||
---|---|---|
0 | 0.238304 | |
1 | 2.068558 | |
2 | 1.015650 | |
3 | 0.506208 |
.at[] (By label)
With at
we can locate a specific value by row and column labels as follows:
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
df.at['a', 'A']
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 0.996496 | -0.165002 | 0.727912 | 0.564858 |
b | -0.388169 | 1.171039 | -0.231934 | -1.124595 |
c | -1.385129 | 0.299195 | 0.573570 | -1.736860 |
d | 1.222447 | -0.312667 | 0.957139 | -0.054156 |
e | 1.188335 | 0.679921 | 1.508681 | -0.677776 |
In[]:
df.at['a', 'A']
Out[]:
0.9964957014209125
It is possible to assign a new value with the same function too:
In[]:
df.at['a', 'A'] = 0
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 0.000000 | -0.165002 | 0.727912 | 0.564858 |
b | -0.388169 | 1.171039 | -0.231934 | -1.124595 |
c | -1.385129 | 0.299195 | 0.573570 | -1.736860 |
d | 1.222447 | -0.312667 | 0.957139 | -0.054156 |
e | 1.188335 | 0.679921 | 1.508681 | -0.677776 |
.iat[] (By position)
With iat
we can locate a specific value by row and column index as follows:
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
df.iat[0, 0]
Out[]:
0.996496
It is possible to assign a new value with the same function too:
In[]:
df.iat[0, 0] = 0
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 0.000000 | -0.165002 | 0.727912 | 0.564858 |
b | -0.388169 | 1.171039 | -0.231934 | -1.124595 |
c | -1.385129 | 0.299195 | 0.573570 | -1.736860 |
d | 1.222447 | -0.312667 | 0.957139 | -0.054156 |
e | 1.188335 | 0.679921 | 1.508681 | -0.677776 |
Conditional updating of values. For e.g. df[df > 0] = 1
Another useful function is to update values that meet some criteria, for example, update values whose values are greater than 0:
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5, 4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
df[df > 0] = 1
df
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.000000 | -0.082466 | 1.000000 | -0.728372 |
b | -0.784404 | -0.663096 | -0.595112 | 1.000000 |
c | -1.460702 | -1.072931 | -0.761314 | 1.000000 |
d | 1.000000 | 1.000000 | 1.000000 | -0.302310 |
e | -0.488556 | 1.000000 | -0.798716 | -0.590920 |
We can also update the values of a specific column that meet some criteria, or even work with several columns as criteria and update a specific column.
In[]:
df['A'][df['A'] < 0] = 1
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.0 | -0.082466 | 1.000000 | -0.728372 |
b | 1.0 | -0.663096 | -0.595112 | 1.000000 |
c | 1.0 | -1.072931 | -0.761314 | 1.000000 |
d | 1.0 | 1.000000 | 1.000000 | -0.302310 |
e | 1.0 | 1.000000 | -0.798716 | -0.590920 |
In[]:
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = 9
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.0 | -0.082466 | 1.000000 | -0.728372 |
b | 9.0 | -0.663096 | -0.595112 | 1.000000 |
c | 9.0 | -1.072931 | -0.761314 | 1.000000 |
d | 1.0 | 1.000000 | 1.000000 | -0.302310 |
e | 1.0 | 1.000000 | -0.798716 | -0.590920 |
.dropna()
Occasionally, we may have a Python Pandas DataFrame that, for whatever reason, includes NA values. This type of values is usually problematic when we are making calculations or operations and must be treated properly before proceeding with them.
The easiest way to eliminate NA values is to remove the row that contains it.
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.272361 | 1.799535 | -0.593619 | 1.152889 |
b | -0.318368 | -0.190419 | 0.129420 | 1.551332 |
c | 0.166951 | 1.669034 | -1.653618 | 0.656313 |
d | 0.219999 | 0.951074 | 0.442325 | -0.170177 |
e | 0.312319 | -0.765930 | -1.641234 | -1.388924 |
In[]:
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = np.nan
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.272361 | 1.799535 | -0.593619 | 1.152889 |
b | -0.318368 | -0.190419 | 0.129420 | 1.551332 |
c | 0.166951 | 1.669034 | -1.653618 | 0.656313 |
d | 0.219999 | 0.951074 | 0.442325 | -0.170177 |
e | NaN | -0.765930 | -1.641234 | -1.388924 |
In[]:
df=df.dropna()
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.272361 | 1.799535 | -0.593619 | 1.152889 |
b | -0.318368 | -0.190419 | 0.129420 | 1.551332 |
c | 0.166951 | 1.669034 | -1.653618 | 0.656313 |
d | 0.219999 | 0.951074 | 0.442325 | -0.170177 |
Here we are deleting the whole row that has, in any of its columns, a NaN value, but we can also specify that it deletes the column that any of its values is NaN:
df=df.dropna(axis=1)
print(df)
We can specify if a single NaN value is enough to delete the row or column, or if the whole row or column must have NaN to delete it.
```python
df=df.dropna(how='all')
print(df)
.fillna()
With the previous function, we have seen how to eliminate a complete row or column that contains one or all the values to NaN, this operation can be a little drastic if we have valid values in the row or column.
For this, it is interesting to use the fillna
function that substitutes the NaN values with some fixed value.
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.272361 | 1.799535 | -0.593619 | 1.152889 |
b | -0.318368 | -0.190419 | 0.129420 | 1.551332 |
c | 0.166951 | 1.669034 | -1.653618 | 0.656313 |
d | 0.219999 | 0.951074 | 0.442325 | -0.170177 |
e | 0.312319 | -0.765930 | -1.641234 | -1.388924 |
In[]:
df['A'][(df['B'] < 0 )& (df['C'] < 0)] = np.nan
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.272361 | 1.799535 | -0.593619 | 1.152889 |
b | -0.318368 | -0.190419 | 0.129420 | 1.551332 |
c | 0.166951 | 1.669034 | -1.653618 | 0.656313 |
d | 0.219999 | 0.951074 | 0.442325 | -0.170177 |
e | NaN | -0.765930 | -1.641234 | -1.388924 |
In[]:
df=df.fillna(999)
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 1.272361 | 1.799535 | -0.593619 | 1.152889 |
b | -0.318368 | -0.190419 | 0.129420 | 1.551332 |
c | 0.166951 | 1.669034 | -1.653618 | 0.656313 |
d | 0.219999 | 0.951074 | 0.442325 | -0.170177 |
e | 999 | -0.765930 | -1.641234 | -1.388924 |
.apply()
apply
is a very useful way to use functions or methods in a DataFrame without having to loop through it.
We can apply the Apply function to a Series or DataFrame to apply a function to all rows or columns of the DataFrame. Let's see some examples.
Suppose we are working with the randomly generated Python Pandas DataFrame and need to apply a function. In this example, for simplicity's sake, we're going to create a custom function to square a number.
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
Out[]:
`
A | B | C | D | |
---|---|---|---|---|
a | -0.633249 | -2.699088 | 0.574052 | 0.652742 |
b | 0.060295 | -0.150527 | 0.149123 | -0.701216 |
c | -0.052515 | 0.469481 | 0.899180 | -0.608409 |
d | -1.352912 | 0.103302 | 0.457878 | -1.897170 |
e | 0.088279 | 0.418317 | -1.102989 | 0.582455 |
def square_number(number):
return number**2
# Test the function
In[]:
square_number(2)
Out[]:
4
Now, let's use the custom function through Apply:
In[]:
df.apply(square_number, axis=1)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | 0.401005 | 7.285074 | 0.329536 | 0.426073 |
b | 0.003636 | 0.022658 | 0.022238 | 0.491704 |
c | 0.002758 | 0.220412 | 0.808524 | 0.370161 |
d | 1.830372 | 0.010671 | 0.209652 | 3.599253 |
e | 0.007793 | 0.174989 | 1.216586 | 0.339254 |
This method apply
the function square_number
to all rows of the DataFrame.
.shift()
The shift
function allows us to move a row to the right or left and/or to move a column up or down. Let's look at some examples.
First, we are going to move the values of a column downwards:
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | -0.633249 | -2.699088 | 0.574052 | 0.652742 |
b | 0.060295 | -0.150527 | 0.149123 | -0.701216 |
c | -0.052515 | 0.469481 | 0.899180 | -0.608409 |
d | -1.352912 | 0.103302 | 0.457878 | -1.897170 |
e | 0.088279 | 0.418317 | -1.102989 | 0.582455 |
In[]:
df['D'].shift(1)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | -0.633249 | -2.699088 | 0.574052 | NaN |
b | 0.060295 | -0.150527 | 0.149123 | 0.652742 |
c | -0.052515 | 0.469481 | 0.899180 | -0.701216 |
d | -1.352912 | 0.103302 | 0.457878 | -0.608409 |
e | 0.088279 | 0.418317 | -1.102989 | -1.897170 |
We are going to move the values of a column upwards
In[]:
df['shift'] = df['D'].shift(-1)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | -0.633249 | -2.699088 | 0.574052 | -0.701216 |
b | 0.060295 | -0.150527 | 0.149123 | -0.608409 |
c | -0.052515 | 0.469481 | 0.899180 | -1.897170 |
d | -1.352912 | 0.103302 | 0.457878 | 0.582455 |
e | 0.088279 | 0.418317 | -1.102989 | NaN |
This is very useful for comparing the current value with the previous value.
Statistical Exploratory data analysis
Python Pandas DataFrame allows us to make some descriptive statistics calculations, which are very useful to make a first analysis of the data we are handling. Let's see some useful functions.
info()
It is a good practice to know the structure and format of our DataFrame, the Info
function offers us just that:
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
Out[]:
A | B | C | D | |
---|---|---|---|---|
a | -0.633249 | -2.699088 | 0.574052 | 0.652742 |
b | 0.060295 | -0.150527 | 0.149123 | -0.701216 |
c | -0.052515 | 0.469481 | 0.899180 | -0.608409 |
d | -1.352912 | 0.103302 | 0.457878 | -1.897170 |
e | 0.088279 | 0.418317 | -1.102989 | 0.582455 |
In[]:
df.info()
Out[]:
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, a to e
Data columns (total 5 columns):
A 5 non-null float64
B 5 non-null float64
C 5 non-null float64
D 5 non-null float64
shift 4 non-null float64
dtypes: float64(5)
memory usage: 240.0+ bytes
describe()
We can obtain a statistical overview of the DataFrame with the `describe function, which gives us the mean, median, standard deviation, maximum, minimum, quartiles, etc. of each DataFrame column.
In[]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
Out[]:
`
A | B | C | D | |
---|---|---|---|---|
a | -0.633249 | -2.699088 | 0.574052 | 0.652742 |
b | 0.060295 | -0.150527 | 0.149123 | -0.701216 |
c | -0.052515 | 0.469481 | 0.899180 | -0.608409 |
d | -1.352912 | 0.103302 | 0.457878 | -1.897170 |
e | 0.088279 | 0.418317 | -1.102989 | 0.582455 |
In[]:
df.describe()
Out[]:
A | B | C | D | |
---|---|---|---|---|
count | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
mean | -0.378020 | -0.371703 | 0.195449 | -0.394319 |
std | 0.618681 | 1.325046 | 0.773876 | 1.054633 |
min | -1.352912 | -2.699088 | -1.102989 | -1.897170 |
25% | -0.633249 | -0.150527 | 0.149123 | -0.701216 |
50% | -0.052515 | 0.103302 | 0.457878 | -0.608409 |
75% | 0.060295 | 0.418317 | 0.574052 | 0.582455 |
max | 0.088279 | 0.469481 | 0.899180 | 0.652742 |
.value_counts()
The function value_counts
counts the repeated values of the specified column:
In[]:
df['A'].value_counts()
Out[]:
0.088279 1
-0.052515 1
0.060295 1
-0.633249 1
-1.352912 1
Name: A, dtype: int64
.mean()
We can obtain the mean of a specific column or row by means of the mean
function.
In[]:
df['A'].mean() # specifying a column
Out[]:
-0.3780203497252693
`
In[]:
df.mean() # by column
df.mean(axis=0) # by column
Out[]:
A -0.378020
B -0.371703
C 0.195449
D -0.394319
shift -0.638513
dtype: float64
`
In[]:
df.mean(axis=1) # by row
Out[]:
a -0.526386
b 0.002084
c 0.001304
d -0.659462
e -0.382222
dtype: float64
.std()
We can obtain the standard deviation of a specific column or row by means of the std
function.
In[]:
df['A'].std() # specifying a column
Out[]:
0.6186812554819784
`
In[]:
df.std() # by column
df.std(axis=0) # by column
Out[]:
A 0.618681
B 1.325046
C 0.773876
D 1.054633
shift 1.041857
dtype: float64
`
In[]:
df.std(axis=1) # by row
Out[]:
a 1.563475
b 0.491499
c 0.688032
d 0.980517
e 1.073244
dtype: float64
Filtering Python Pandas DataFrame
We have already seen how to filter data in a Python Pandas DataFrame, including logical statements to filter rows or columns with some logical criteria. Let's remember some examples:
For example, we will filter rows whose column 'A' is greater than zero:
In[]:
import numpy as np
df=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df)
Out[]:
`
A | B | C | D | |
---|---|---|---|---|
a | -0.633249 | -2.699088 | 0.574052 | 0.652742 |
b | 0.060295 | -0.150527 | 0.149123 | -0.701216 |
c | -0.052515 | 0.469481 | 0.899180 | -0.608409 |
d | -1.352912 | 0.103302 | 0.457878 | -1.897170 |
e | 0.088279 | 0.418317 | -1.102989 | 0.582455 |
In[]:
df_filtered = df[df['A'] > 0]
print(df_filtered)
Out[]:
A | B | C | D | |
---|---|---|---|---|
b | 0.060295 | -0.150527 | 0.149123 | -0.701216 |
e | 0.088279 | 0.418317 | -1.102989 | 0.582455 |
We can also combine logical statements, we will filter all rows whose column 'A' and 'B' have their values greater than zero.
In[]:
df_filtered = df[(df['A'] > 0) & (df['B'] > 0)]
print(df_filtered)
Out[]:
A | B | C | D | |
---|---|---|---|---|
e | 0.088279 | 0.418317 | -1.102989 | 0.582455 |
Iterating Python Pandas DataFrame
We can go through the Python Pandas DataFrame row by row to do operations in each iteration, let's see some examples.
In[]:
for item in df.iterrows():
print(item)
Out[]:
('a', A -0.633249
B -2.699088
C 0.574052
D 0.652742
shift NaN
Name: a, dtype: float64)
('b', A 0.060295
B -0.150527
C 0.149123
D -0.701216
shift 0.652742
Name: b, dtype: float64)
('c', A -0.052515
B 0.469481
C 0.899180
D -0.608409
shift -0.701216
Name: c, dtype: float64)
('d', A -1.352912
B 0.103302
C 0.457878
D -1.897170
shift -0.608409
Name: d, dtype: float64)
('e', A 0.088279
B 0.418317
C -1.102989
D 0.582455
shift -1.897170
Name: e, dtype: float64)
Merge, Append and Concat operation on a Python Pandas DataFrame
Another interesting feature of Python Pandas DataFrames is that we can merge, concatenate them and add new values, let's see how to do each of these operations in this section of Python Pandas tutorial.
merge
function allows us to merge two Python Pandas DataFrame by rows:
In[]:
import numpy as np
df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df1)
Out[]:
`
A | B | C | D | |
---|---|---|---|---|
a | 1.179924 | -1.512124 | 0.767557 | 0.019265 |
b | 0.019969 | -1.351649 | 0.665298 | -0.989025 |
c | 0.351921 | -0.792914 | 0.455174 | 0.170751 |
d | -0.150499 | 0.151942 | -0.628074 | -0.347300 |
e | -1.307590 | 0.185759 | 0.175967 | -0.170334 |
df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df2)
`
A | B | C | D | |
---|---|---|---|---|
a | 2.030462 | -0.337738 | -0.894440 | -0.757323 |
b | 0.475807 | 1.350088 | -0.514070 | -0.843963 |
c | 0.948164 | -0.155052 | -0.618893 | 1.319999 |
d | 1.433736 | -0.455008 | 1.445698 | -1.051454 |
e | 0.565345 | 1.802485 | -0.167189 | -0.227519 |
In[]:
df3 = pd.merge(df1, df2)
print(df3)
Out[]:
Empty DataFrame
Columns: [A, B, C, D]
Index: []
append
function allows us to append rows from one Python Pandas DataFrame to another Python Pandas DataFrame by rows:
In[]
import numpy as np
df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df1)
df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df2)
df3 = df1.append(df2)
print(df3)
Out[]
A | B | C | D | |
---|---|---|---|---|
a | 1.179924 | -1.512124 | 0.767557 | 0.019265 |
b | 0.019969 | -1.351649 | 0.665298 | -0.989025 |
c | 0.351921 | -0.792914 | 0.455174 | 0.170751 |
d | -0.150499 | 0.151942 | -0.628074 | -0.347300 |
e | -1.307590 | 0.185759 | 0.175967 | -0.170334 |
a | 2.030462 | -0.337738 | -0.894440 | -0.757323 |
b | 0.475807 | 1.350088 | -0.514070 | -0.843963 |
c | 0.948164 | -0.155052 | -0.618893 | 1.319999 |
d | 1.433736 | -0.455008 | 1.445698 | -1.051454 |
e | 0.565345 | 1.802485 | -0.167189 | -0.227519 |
concat
function allows us to merge two Python Pandas DataFrame by rows or columns:
In[]:
import numpy as np
df1=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df1)
df2=pd.DataFrame(np.random.randn(5,4), index=['a','b','c','d','e'], columns=['A', 'B', 'C', 'D'])
print(df2)
df3 = pd.concat([df1, df2]) # concat by row
print(df3)
Out[]:
`
A | B | C | D | |
---|---|---|---|---|
a | 1.179924 | -1.512124 | 0.767557 | 0.019265 |
b | 0.019969 | -1.351649 | 0.665298 | -0.989025 |
c | 0.351921 | -0.792914 | 0.455174 | 0.170751 |
d | -0.150499 | 0.151942 | -0.628074 | -0.347300 |
e | -1.307590 | 0.185759 | 0.175967 | -0.170334 |
a | 2.030462 | -0.337738 | -0.894440 | -0.757323 |
b | 0.475807 | 1.350088 | -0.514070 | -0.843963 |
c | 0.948164 | -0.155052 | -0.618893 | 1.319999 |
d | 1.433736 | -0.455008 | 1.445698 | -1.051454 |
e | 0.565345 | 1.802485 | -0.167189 | -0.227519 |
In[]:
df3 = pd.concat([df1, df2], axis=0) # concat by row
print(df3)
Out[]:
`
A | B | C | D | |
---|---|---|---|---|
a | 1.179924 | -1.512124 | 0.767557 | 0.019265 |
b | 0.019969 | -1.351649 | 0.665298 | -0.989025 |
c | 0.351921 | -0.792914 | 0.455174 | 0.170751 |
d | -0.150499 | 0.151942 | -0.628074 | -0.347300 |
e | -1.307590 | 0.185759 | 0.175967 | -0.170334 |
a | 2.030462 | -0.337738 | -0.894440 | -0.757323 |
b | 0.475807 | 1.350088 | -0.514070 | -0.843963 |
c | 0.948164 | -0.155052 | -0.618893 | 1.319999 |
d | 1.433736 | -0.455008 | 1.445698 | -1.051454 |
e | 0.565345 | 1.802485 | -0.167189 | -0.227519 |
In[]:
df3 = pd.concat([df1, df2], axis=1) # concat by column
print(df3)
Out[]:
A | B | C | D | A | B | C | D | |
---|---|---|---|---|---|---|---|---|
a | 1.179924 | -1.512124 | 0.767557 | 0.019265 | 2.030462 | -0.337738 | -0.894440 | -0.757323 |
b | 0.019969 | -1.351649 | 0.665298 | -0.989025 | 0.475807 | 1.350088 | -0.514070 | -0.843963 |
c | 0.351921 | -0.792914 | 0.455174 | 0.170751 | 0.948164 | -0.155052 | -0.618893 | 1.319999 |
d | -0.150499 | 0.151942 | -0.628074 | -0.347300 | 1.433736 | -0.455008 | 1.445698 | -1.051454 |
e | -1.307590 | 0.185759 | 0.175967 | -0.170334 | 0.565345 | 1.802485 | -0.167189 | -0.227519 |
TimeSeries in Python Pandas
Python Pandas TimeSeries includes a set of tools to work with Series or DataFrames indexed in time.
Usually, the series of financial data are of this type and therefore, knowing these tools will make our work much more comfortable.
In this section of the Python Pandas tutorial, we are going to start creating time series from scratch and then we will see how to manipulate them and convert them to different frequencies.
Indexing Python Pandas TimeSeries
With date_range
Panda's method, we can create a time range with a certain frequency. For example, create a range starting on December 1st, 2018, with 30 occurrences with an hourly frequency.
In[]:
rng = pd.date_range('12/1/2018', periods=30, freq='H')
print(rng)
Out[]:
DatetimeIndex(['2018-12-01 00:00:00', '2018-12-01 01:00:00',
'2018-12-01 02:00:00', '2018-12-01 03:00:00',
'2018-12-01 04:00:00', '2018-12-01 05:00:00',
'2018-12-01 06:00:00', '2018-12-01 07:00:00',
'2018-12-01 08:00:00', '2018-12-01 09:00:00',
'2018-12-01 10:00:00', '2018-12-01 11:00:00',
'2018-12-01 12:00:00', '2018-12-01 13:00:00',
'2018-12-01 14:00:00', '2018-12-01 15:00:00',
'2018-12-01 16:00:00', '2018-12-01 17:00:00',
'2018-12-01 18:00:00', '2018-12-01 19:00:00',
'2018-12-01 20:00:00', '2018-12-01 21:00:00',
'2018-12-01 22:00:00', '2018-12-01 23:00:00',
'2018-12-02 00:00:00', '2018-12-02 01:00:00',
'2018-12-02 02:00:00', '2018-12-02 03:00:00',
'2018-12-02 04:00:00', '2018-12-02 05:00:00'],
dtype='datetime64[ns]', freq='H')
We can do the same to get a daily frequency (or any other, see the documentation). We can use the freq
parameter to adjust this.
In[]:
rng = pd.date_range('12/1/2018', periods=30, freq='D')
print(rng)
Out[]:
DatetimeIndex(['2018-12-01', '2018-12-02', '2018-12-03', '2018-12-04',
'2018-12-05', '2018-12-06', '2018-12-07', '2018-12-08',
'2018-12-09', '2018-12-10', '2018-12-11', '2018-12-12',
'2018-12-13', '2018-12-14', '2018-12-15', '2018-12-16',
'2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
'2018-12-21', '2018-12-22', '2018-12-23', '2018-12-24',
'2018-12-25', '2018-12-26', '2018-12-27', '2018-12-28',
'2018-12-29', '2018-12-30'],
dtype='datetime64[ns]', freq='D')
Now, we have a DateTimeIndex
in the rng
object and we can use it to create a Series or DataFrame:
In[]:
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)
Out[]:
`
A | B | C | D | |
---|---|---|---|---|
2018-12-01 | 0.048603 | 0.968522 | 0.408213 | 0.921774 |
2018-12-02 | -2.301373 | -2.310408 | -0.559381 | -0.652291 |
2018-12-03 | -2.337844 | 0.329954 | 0.289221 | 0.259132 |
2018-12-04 | 1.357521 | 0.969808 | 1.341875 | 0.767797 |
2018-12-05 | -1.212355 | -0.077457 | -0.529564 | 0.375572 |
2018-12-06 | -0.673065 | 0.527754 | 0.006344 | -0.533316 |
2018-12-07 | 0.226145 | 0.235027 | 0.945678 | -1.766167 |
2018-12-08 | 1.735185 | -0.604229 | 0.274809 | 0.841128 |
In[]:
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
Out[]:
2018-12-01 0.349234
2018-12-02 -1.807753
2018-12-03 0.112777
2018-12-04 0.421516
2018-12-05 -0.992449
2018-12-06 1.254999
2018-12-07 -0.311152
2018-12-08 0.331584
2018-12-09 0.196904
2018-12-10 -1.619186
2018-12-11 0.478510
2018-12-12 -1.036074
Sometimes, we read the data from internet sources or from csv files and we need to convert the date column into the index to work properly with the Series or DataFrame.
In[]:
import pandas as pd
df=pd.read_csv('TSLA.csv')
df.tail()
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1078 | 2013-12-11 | 141.88 | 143.05 | 139.49 | 139.65 | 7137800.0 | 0.0 | 1.0 | 141.88 | 143.05 | 139.49 | 139.65 | 7137800.0 | TSLA |
1079 | 2013-12-10 | 140.05 | 145.87 | 139.86 | 142.19 | 10748200.0 | 0.0 | 1.0 | 140.05 | 145.87 | 139.86 | 142.19 | 10748200.0 | TSLA |
1080 | 2013-12-09 | 137.00 | 141.70 | 134.21 | 141.60 | 9061500.0 | 0.0 | 1.0 | 137.00 | 141.70 | 134.21 | 141.60 | 9061500.0 | TSLA |
1081 | 2013-12-06 | 141.51 | 142.49 | 136.30 | 137.36 | 7909600.0 | 0.0 | 1.0 | 141.51 | 142.49 | 136.30 | 137.36 | 7909600.0 | TSLA |
1082 | 2013-12-05 | 140.15 | 143.35 | 139.50 | 140.48 | 9288400.0 | 0.0 | 1.0 | 140.15 | 143.35 | 139.50 | 140.48 | 9288400.0 | TSLA |
Here, we can see the index as numeric and a Date column, let's convert this column into the index to indexing our DataFrame, read from a csv file, in time. For this, we are going to use the Python Pandas set_index
method
In[]:
df = df.set_index('Date')
df.tail()
Out[]:
Date | Open | High | Low | Close | Volume | ExDividend | SplitRatio | AdjOpen | AdjHigh | AdjLow | AdjClose | AdjVolume | Name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013-12-11 | 141.88 | 143.05 | 139.49 | 139.65 | 7137800.0 | 0.0 | 1.0 | 141.88 | 143.05 | 139.49 | 139.65 | 7137800.0 | TSLA |
2013-12-10 | 140.05 | 145.87 | 139.86 | 142.19 | 10748200.0 | 0.0 | 1.0 | 140.05 | 145.87 | 139.86 | 142.19 | 10748200.0 | TSLA |
2013-12-09 | 137.00 | 141.70 | 134.21 | 141.60 | 9061500.0 | 0.0 | 1.0 | 137.00 | 141.70 | 134.21 | 141.60 | 9061500.0 | TSLA |
2013-12-06 | 141.51 | 142.49 | 136.30 | 137.36 | 7909600.0 | 0.0 | 1.0 | 141.51 | 142.49 | 136.30 | 137.36 | 7909600.0 | TSLA |
2013-12-05 | 140.15 | 143.35 | 139.50 | 140.48 | 9288400.0 | 0.0 | 1.0 | 140.15 | 143.35 | 139.50 | 140.48 | 9288400.0 | TSLA |
Now, we have Pandas TimeSeries ready to work.
Resampling Pandas TimeSeries
A very useful feature of Python Pandas TimeSeries is the resample capacity, this allows us to pass the current frequency to another higher frequency (we can't pass to lower frequencies, because we don't know the data).
As it can be supposed, when we pass from one frequency to another data could be lost, for this, we must use some function that treats the values of each frequency interval, for example, if we pass from an hourly frequency to daily, we must specify what we want to do with the group of data that fall inside each frequency, we can do a mean, a sum, we can get the maximum or the minimum, etc.
In[]:
rng = pd.date_range('12/1/2018', periods=30, freq='H')
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)
Out[]:
`
A | B | C | D | |
---|---|---|---|---|
2018-12-01 00:00:00 | 0.048603 | 0.968522 | 0.408213 | 0.921774 |
2018-12-01 01:00:00 | -2.301373 | -2.310408 | -0.559381 | -0.652291 |
2018-12-01 02:00:00 | -2.337844 | 0.329954 | 0.289221 | 0.259132 |
2018-12-01 03:00:00 | 1.357521 | 0.969808 | 1.341875 | 0.767797 |
2018-12-01 04:00:00 | -1.212355 | -0.077457 | -0.529564 | 0.375572 |
2018-12-01 05:00:00 | -0.673065 | 0.527754 | 0.006344 | -0.533316 |
In[]:
ts = ts.resample("1D").mean()
print(ts)
Out[]:
A | B | C | D | |
---|---|---|---|---|
2018-12-01 | 0.449050 | 0.127412 | -0.154179 | -0.358324 |
2018-12-02 | -0.539007 | -0.855894 | 0.000010 | 0.454623 |
Manipulating TimeSeries
We can manipulate the Python Pandas TimeSeries in the same way that we have done until now, since they offer us the same capacity that the Pandas Series and the Pandas DataFrames.
Additionally, we can work comfortably with all jobs related to handling dates. For example, to obtain all the data from a date, to obtain the data in a range of dates, etc.
In[]:
rng = pd.date_range('12/1/2018', periods=30, freq='D')
ts = pd.DataFrame(np.random.randn(len(rng), 4), index=rng, columns=['A', 'B', 'C', 'D'])
print(ts)
Out[]:
A | B | C | D | |
---|---|---|---|---|
2018-12-01 | 0.048603 | 0.968522 | 0.408213 | 0.921774 |
2018-12-02 | -2.301373 | -2.310408 | -0.559381 | -0.652291 |
2018-12-03 | -2.337844 | 0.329954 | 0.289221 | 0.259132 |
2018-12-04 | 1.357521 | 0.969808 | 1.341875 | 0.767797 |
2018-12-05 | -1.212355 | -0.077457 | -0.529564 | 0.375572 |
2018-12-06 | -0.673065 | 0.527754 | 0.006344 | -0.533316 |
Getting all values from a specific date:
In[]:
ts['2018-12-15':]
Out[]:
A | B | C | D | |
---|---|---|---|---|
2018-12-02 | 0.324689 | -0.413723 | 0.019163 | 0.385233 |
2018-12-03 | -2.198937 | 0.536600 | -0.540934 | -0.603858 |
2018-12-04 | -1.195148 | 2.191311 | -0.981604 | -0.942440 |
2018-12-05 | 0.621298 | -1.435266 | -0.761886 | -1.787730 |
2018-12-06 | 0.635679 | 0.683265 | 0.351140 | -1.451903 |
Getting all values inside a date range:
In[]:
ts['2018-12-15':'2018-12-20']
Out[]:
A | B | C | D | |
---|---|---|---|---|
2018-12-15 | 0.605576 | 0.584369 | -1.520749 | -0.242630 |
2018-12-16 | -0.105561 | -0.092124 | 0.385085 | 0.918222 |
2018-12-17 | 0.337416 | -1.367549 | 0.738320 | 2.413522 |
2018-12-18 | -0.011610 | -0.339228 | -0.218382 | -0.070349 |
2018-12-19 | 0.027808 | -0.422975 | -0.622777 | 0.730926 |
2018-12-20 | 0.188822 | -1.016637 | 0.470874 | 0.674052 |
Summary
Let us summarise a few points we learnt in the Python Pandas tutorial
- Python Pandas DataFrame and Python Pandas Series are some of the most important data structures. It is a must to acquire fluency in its handling because we will find them in practically all the problems that we handle.
- A Python Pandas DataFrame is a data structure formed by rows and columns and has an index.
- We must think of them as if they were data tables (for the Array with a single column) with which we can select, filter, sort, add and delete elements, either by rows or columns.
- Help in ETL processes (Extraction, Transformation and Loading)
- We can select, insert, delete and update elements with simple functions.
- We can perform computations by rows or columns.
- Has the ability to run vectorized computations.
- We can work with several DataFrames at the same time.
- Indexing and subsetting are the most important features of Python Pandas.
- Facilitates the statistical exploration of the data.
- It offers us a variety of options for handling NaN data.
- Another additional advantage is the ability to read & write multiple data formats (CSV, Excel, HDF5, etc.).
- Retrieve data from external sources (Yahoo, Google, Quandl Python API, etc.)
- Finally, it has the ability to work with date and time indexes and offers
us a set of functions to work with dates.
Conclusion
To reiterate, this article is an excerpt from the Python Basics Handbook, which was created for both; beginners who are starting out in Python as well as accomplished traders who can use it as a handy reference while coding their strategy.
Do let us know if you loved the article and any other feedback in the comments below.
Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.