Pandas cheatsheet

Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.

#Getting Started

#inroduction

Pandas is used to analyze data, and it is built on top of the Python programming language.
Pandas is faster than Excel and has more features than Excel in terms of data analysis.
Pandas is also an open source library, which means it is free to use.

#Installation

If you have Python and PIP already installed on a system, then installation of Pandas is very easy.
Install it using this command:

pip install pandas

#Import Pandas

Once Pandas is installed, import it in your applications by adding the import keyword:

import pandas as pd

#Pandas Data Structures

#Series

A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Create a simple Pandas Series from a list:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

#DataFrame

A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns.
Create a simple Pandas DataFrame:

import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)

#Panel

A Panel is a 3D container of data.
The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s.
Create an Empty Panel

import pandas as pd
myvar = pd.Panel()
print(myvar)

#Read files

#Read CSV Files

Pandas is used to read a CSV file and display the content:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())

#Read JSON Files

Pandas is used to read a JSON file and display the content:

import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())

#Read Excel Files

Pandas is used to read an Excel file and display the content:

import pandas as pd
df = pd.read_excel('data.xlsx')
print(df.to_string())

#Read HTML

Pandas is used to read an HTML file and display the content:

import pandas as pd
df = pd.read_html('data.html')
print(df.to_string())

#Read SQL

Pandas is used to read a SQL table and display the content:

import pandas as pd
import mysql.connector
mydb = mysql.connector.connect(
  host="localhost",
  user="yourusername",
  password="yourpassword",
  database="mydatabase"
)
df = pd.read_sql("SELECT * FROM customers", con = mydb)
print(df.to_string())

#Pandas DataFrames

#Viewing the Data

One of the most used method for getting a quick overview of the DataFrame, is the head() method.
The head() method returns the headers and a specified number of rows, starting from the top.

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))

#Pandas tail()

The tail() method returns the headers and a specified number of rows, starting from the bottom.

import pandas as pd
df = pd.read_csv('data.csv')
print(df.tail(10))

#Pandas set_index

The set_index() method is used to set a DataFrame index into a column of the DataFrame.

import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('Date', inplace = True)
print(df)

#Pandas loc

The loc attribute access a group of rows and columns by label(s) or a boolean array.

import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('Date', inplace = True)
print(df.loc["2019-01-31"])

#Pandas iloc

The iloc attribute access a group of rows and columns by integer position(s).

import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('Date', inplace = True)
print(df.iloc[3])

#Pandas info

The info() method prints information about the DataFrame including the index dtype and column dtypes, non-null values and memory usage.

import pandas as pd
df = pd.read_csv('data.csv')
print(df.info())

#Pandas shape

The shape attribute returns a tuple representing the dimensionality of the DataFrame.

import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape)

#Pandas columns

The columns attribute returns the column labels of the DataFrame.

import pandas as pd
df = pd.read_csv('data.csv')
print(df.columns)

#Pandas dtypes

The dtypes attribute returns the dtypes in this object.

import pandas as pd
df = pd.read_csv('data.csv')
print(df.dtypes)

#Pandas astype

The astype() method is used to cast a pandas object to a specified dtype.

import pandas as pd
df = pd.read_csv('data.csv')
df["Date"] = pd.to_datetime(df["Date"])
df["Date"] = df["Date"].astype("datetime64")
print(df.dtypes)

#Pandas to_datetime

The to_datetime() method is used to convert argument to datetime.

import pandas as pd
df = pd.read_csv('data.csv')
df["Date"] = pd.to_datetime(df["Date"])
print(df["Date"])

#Pandas unique

The unique() method is used to get a Series with unique values.

import pandas as pd
df = pd.read_csv('data.csv')
x = df["Date"].unique()
print(x)

#Pandas nunique

The nunique() method is used to get a Series with number of unique values.

import pandas as pd
df = pd.read_csv('data.csv')
x = df["Date"].nunique()
print(x)

#Pandas count

The count() method is used to count non-NA cells for each column or row.

import pandas as pd
df = pd.read_csv('data.csv')
x = df["Date"].count()
print(x)

#Pandas value_counts

The value_counts() method is used to get a Series containing counts of unique values.

import pandas as pd
df = pd.read_csv('data.csv')
x = df["Date"].value_counts()
print(x)

#Pandas sort_values

The sort_values() method is used to sort a Series in ascending or descending order by some criterion.

import pandas as pd
df = pd.read_csv('data.csv')
df.sort_values("Date", inplace = True)
print(df)

#Pandas isnull

The isnull() method is used to detect missing values for an array-like object.

import pandas as pd
df = pd.read_csv('data.csv')
x = df.isnull()
print(x)

#Pandas notnull

The notnull() method is used to detect non-missing values for an array-like object.

import pandas as pd
df = pd.read_csv('data.csv')
x = df.notnull()
print(x)

#Pandas dropna

The dropna() method is used to remove missing values.


import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df)

#Pandas fillna

The fillna() method is used to fill NA/NaN values using the specified method.

import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
print(df)

#Pandas drop_duplicates

The drop_duplicates() method is used to remove duplicates from the DataFrame.

import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df)

#Pandas drop

The drop() method is used to drop specified labels from rows or columns.

import pandas as pd
df = pd.read_csv('data.csv')
df.drop("Date", axis = 1, inplace = True)
print(df)

#Pandas iterrows

The iterrows() method is used to iterate over DataFrame rows as (index, Series) pairs.

import pandas as pd
df = pd.read_csv('data.csv')
for index, row in df.iterrows():
    print(index, row)

#Pandas itertuples

The itertuples() method is used to iterate over DataFrame rows as namedtuples.

import pandas as pd
df = pd.read_csv('data.csv')
for row in df.itertuples():
    print(row)

#Pandas copy

The copy() method is used to make a copy of this object’s indices and data.

import pandas as pd
df = pd.read_csv('data.csv')
x = df.copy()
print(x)

#Pandas append

The append() method is used to append rows of other to the end of caller, returning a new object.

import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
x = df.append(df2)
print(x)

#Pandas merge

The merge() method is used to merge with another DataFrame, thus performing a join operation.

import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
x = pd.merge(df, df2, on = "Date")
print(x)

#Pandas join

The join() method is used to join columns of another DataFrame.

import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
x = df.join(df2)
print(x)

#Pandas concat

The concat() method is used to concatenate pandas objects along a particular axis with optional set logic along the other axes.

import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
x = pd.concat([df, df2])
print(x)

#Pandas pivot

The pivot() method is used to produce pivot table based on 3 columns of the DataFrame.

import pandas as pd
df = pd.read_csv('data.csv')
x = df.pivot(index = "Date", columns = "Calories", values = "Duration")
print(x)

#Pandas pivot_table

The pivot_table() method is used to create a spreadsheet-style pivot table as a DataFrame.

import pandas as pd
df = pd.read_csv('data.csv')
x = df.pivot_table(index = "Date", columns = "Calories", values = "Duration")
print(x)

#Pandas stack

The stack() method is used to stack the prescribed level(s) from columns to index.

import pandas as pd
df = pd.read_csv('data.csv')
x = df.stack()
print(x)

#Pandas unstack

The unstack() method is used to pivot a level of the (necessarily hierarchical) index labels.

import pandas as pd
df = pd.read_csv('data.csv')
x = df.unstack()
print(x)

#Replace Only For a Specified Columns

Replace NULL values in the "Calories" columns with the number 130:

import pandas as pd
df = pd.read_csv('data.csv')
df["Calories"].fillna(130, inplace = True)

#Replace Using Mean, Median, or Mode

A common way to replace empty cells, is to calculate the mean, median or mode value of the column.
Pandas uses the mean(), median() and mode() methods to calculate the respective values for a specified column:

import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)

#Cleaning Data of Wrong Format

Convert to a correct format:

import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])

#Wrong Data

"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if someone registered "199" instead of "1.99".
When you have made the changes in the dataset, you can convert it to a CSV file, and use it in your program.
Replacing Values
- Replace all "John" values with "Sean":

import pandas as pd
df = pd.read_csv('data.csv')
df.loc[df["Duration"] <= 60, "Duration"] = 45

#Removing Duplicates

To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean values for each row:

import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)

#Data Correlations

A great aspect of the Pandas module is the corr() method.
The corr() method calculates the relationship between each column in your data set.
The corr() method ignores "not numeric" columns.
The closer the value is to 1, the higher the correlation:

import pandas as pd
df = pd.read_csv('data.csv')
df.corr()

#Transpose

Pandas uses the T attribute or the transpose() method to transpose a dataframe.

import pandas as pd
df = pd.read_csv('data.csv')
df.transpose()

#Iterate Rows

Iterate over rows in a DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')
for index, row in df.iterrows():
  print(index, row)

#Sorting

Sort the result ascending or descending by the column "Duration":

import pandas as pd
df = pd.read_csv('data.csv')
df.sort_values('Duration', inplace = True)

#Pandas Plotting

#Introduction

Pandas uses the plot() method to create diagrams.
The kind parameter allows you to specify the type of diagram to use.
The following example uses the plot() method to create a diagram:

import pandas as pd
df = pd.read_csv('data.csv')
df.plot()

#Scatter Plot

Scatter Plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute.
Create a scatter plot of the "Duration" and "Calories" column:

import pandas as pd
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

#Histogram

Histograms are used to show distributions of variables while bar charts are used to compare variables.
Create a histogram of the "Duration" column:

import pandas as pd
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')

#Bar Chart

Create a bar chart of the "Duration" column:

import pandas as pd
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'bar')

#Pie Chart

Create a pie chart of the "Duration" column:

import pandas as pd
df = pd.read_csv('data.csv')
df.plot(kind = 'pie')

#Subplots

With the subplots() method you can draw multiple plots in one figure:

import pandas as pd
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
df.plot(kind = 'bar', x = 'Duration', y = 'Calories')