Pandas

Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.

#Getting Started

#inroduction

  • Pandas is used to analyze data, and it is built on top of the Python programming language.
  • Pandas is faster than Excel and has more features than Excel in terms of data analysis.
  • Pandas is also an open source library, which means it is free to use.

#Installation

  • If you have Python and PIP already installed on a system, then installation of Pandas is very easy.
  • Install it using this command:
pip install pandas

#Import Pandas

  • Once Pandas is installed, import it in your applications by adding the import keyword:
import pandas as pd

#Pandas Data Structures

#Series

  • A Pandas Series is like a column in a table.
  • It is a one-dimensional array holding data of any type.
  • Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

#DataFrame

  • A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns.
  • Create a simple Pandas DataFrame:
import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)

#Panel

  • A Panel is a 3D container of data.
  • The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s.
  • Create an Empty Panel
import pandas as pd
myvar = pd.Panel()
print(myvar)

#Read files

#Read CSV Files

  • Pandas is used to read a CSV file and display the content:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())

#Read JSON Files

  • Pandas is used to read a JSON file and display the content:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())

#Read Excel Files

  • Pandas is used to read an Excel file and display the content:
import pandas as pd
df = pd.read_excel('data.xlsx')
print(df.to_string())

#Read HTML

  • Pandas is used to read an HTML file and display the content:
import pandas as pd
df = pd.read_html('data.html')
print(df.to_string())

#Read SQL

  • Pandas is used to read a SQL table and display the content:
import pandas as pd
import mysql.connector
mydb = mysql.connector.connect(
  host="localhost",
  user="yourusername",
  password="yourpassword",
  database="mydatabase"
)
df = pd.read_sql("SELECT * FROM customers", con = mydb)
print(df.to_string())

#Pandas DataFrames

#Viewing the Data

  • One of the most used method for getting a quick overview of the DataFrame, is the head() method.
  • The head() method returns the headers and a specified number of rows, starting from the top.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))

#Pandas tail()

  • The tail() method returns the headers and a specified number of rows, starting from the bottom.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.tail(10))

#Pandas set_index

  • The set_index() method is used to set a DataFrame index into a column of the DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('Date', inplace = True)
print(df)

#Pandas loc

  • The loc attribute access a group of rows and columns by label(s) or a boolean array.
import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('Date', inplace = True)
print(df.loc["2019-01-31"])

#Pandas iloc

  • The iloc attribute access a group of rows and columns by integer position(s).
import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('Date', inplace = True)
print(df.iloc[3])

#Pandas info

  • The info() method prints information about the DataFrame including the index dtype and column dtypes, non-null values and memory usage.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.info())

#Pandas shape

  • The shape attribute returns a tuple representing the dimensionality of the DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape)

#Pandas columns

  • The columns attribute returns the column labels of the DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.columns)

#Pandas dtypes

  • The dtypes attribute returns the dtypes in this object.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.dtypes)

#Pandas astype

  • The astype() method is used to cast a pandas object to a specified dtype.
import pandas as pd
df = pd.read_csv('data.csv')
df["Date"] = pd.to_datetime(df["Date"])
df["Date"] = df["Date"].astype("datetime64")
print(df.dtypes)

#Pandas to_datetime

  • The to_datetime() method is used to convert argument to datetime.
import pandas as pd
df = pd.read_csv('data.csv')
df["Date"] = pd.to_datetime(df["Date"])
print(df["Date"])

#Pandas unique

  • The unique() method is used to get a Series with unique values.
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Date"].unique()
print(x)

#Pandas nunique

  • The nunique() method is used to get a Series with number of unique values.
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Date"].nunique()
print(x)

#Pandas count

  • The count() method is used to count non-NA cells for each column or row.
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Date"].count()
print(x)

#Pandas value_counts

  • The value_counts() method is used to get a Series containing counts of unique values.
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Date"].value_counts()
print(x)

#Pandas sort_values

  • The sort_values() method is used to sort a Series in ascending or descending order by some criterion.
import pandas as pd
df = pd.read_csv('data.csv')
df.sort_values("Date", inplace = True)
print(df)

#Pandas isnull

  • The isnull() method is used to detect missing values for an array-like object.
import pandas as pd
df = pd.read_csv('data.csv')
x = df.isnull()
print(x)

#Pandas notnull

  • The notnull() method is used to detect non-missing values for an array-like object.
import pandas as pd
df = pd.read_csv('data.csv')
x = df.notnull()
print(x)

#Pandas dropna

  • The dropna() method is used to remove missing values.

import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df)

#Pandas fillna

  • The fillna() method is used to fill NA/NaN values using the specified method.
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
print(df)

#Pandas drop_duplicates

  • The drop_duplicates() method is used to remove duplicates from the DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df)

#Pandas drop

  • The drop() method is used to drop specified labels from rows or columns.
import pandas as pd
df = pd.read_csv('data.csv')
df.drop("Date", axis = 1, inplace = True)
print(df)

#Pandas iterrows

  • The iterrows() method is used to iterate over DataFrame rows as (index, Series) pairs.
import pandas as pd
df = pd.read_csv('data.csv')
for index, row in df.iterrows():
    print(index, row)

#Pandas itertuples

  • The itertuples() method is used to iterate over DataFrame rows as namedtuples.
import pandas as pd
df = pd.read_csv('data.csv')
for row in df.itertuples():
    print(row)

#Pandas copy

  • The copy() method is used to make a copy of this object’s indices and data.
import pandas as pd
df = pd.read_csv('data.csv')
x = df.copy()
print(x)

#Pandas append

  • The append() method is used to append rows of other to the end of caller, returning a new object.
import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
x = df.append(df2)
print(x)

#Pandas merge

  • The merge() method is used to merge with another DataFrame, thus performing a join operation.
import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
x = pd.merge(df, df2, on = "Date")
print(x)

#Pandas join

  • The join() method is used to join columns of another DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
x = df.join(df2)
print(x)

#Pandas concat

  • The concat() method is used to concatenate pandas objects along a particular axis with optional set logic along the other axes.
import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('data2.csv')
x = pd.concat([df, df2])
print(x)

#Pandas pivot

  • The pivot() method is used to produce pivot table based on 3 columns of the DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
x = df.pivot(index = "Date", columns = "Calories", values = "Duration")
print(x)

#Pandas pivot_table

  • The pivot_table() method is used to create a spreadsheet-style pivot table as a DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
x = df.pivot_table(index = "Date", columns = "Calories", values = "Duration")
print(x)

#Pandas stack

  • The stack() method is used to stack the prescribed level(s) from columns to index.
import pandas as pd
df = pd.read_csv('data.csv')
x = df.stack()
print(x)

#Pandas unstack

  • The unstack() method is used to pivot a level of the (necessarily hierarchical) index labels.
import pandas as pd
df = pd.read_csv('data.csv')
x = df.unstack()
print(x)

#Replace Only For a Specified Columns

  • Replace NULL values in the "Calories" columns with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df["Calories"].fillna(130, inplace = True)

#Replace Using Mean, Median, or Mode

  • A common way to replace empty cells, is to calculate the mean, median or mode value of the column.
  • Pandas uses the mean(), median() and mode() methods to calculate the respective values for a specified column:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)

#Cleaning Data of Wrong Format

  • Convert to a correct format:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])

#Wrong Data

  • "Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if someone registered "199" instead of "1.99".
  • When you have made the changes in the dataset, you can convert it to a CSV file, and use it in your program.
  • Replacing Values
    • Replace all "John" values with "Sean":
import pandas as pd
df = pd.read_csv('data.csv')
df.loc[df["Duration"] <= 60, "Duration"] = 45

#Removing Duplicates

  • To discover duplicates, we can use the duplicated() method.
  • The duplicated() method returns a Boolean values for each row:
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)

#Data Correlations

  • A great aspect of the Pandas module is the corr() method.
  • The corr() method calculates the relationship between each column in your data set.
  • The corr() method ignores "not numeric" columns.
  • The closer the value is to 1, the higher the correlation:
import pandas as pd
df = pd.read_csv('data.csv')
df.corr()

#Transpose

  • Pandas uses the T attribute or the transpose() method to transpose a dataframe.
import pandas as pd
df = pd.read_csv('data.csv')
df.transpose()

#Iterate Rows

  • Iterate over rows in a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
for index, row in df.iterrows():
  print(index, row)

#Sorting

  • Sort the result ascending or descending by the column "Duration":
import pandas as pd
df = pd.read_csv('data.csv')
df.sort_values('Duration', inplace = True)

#Pandas Plotting

#Introduction

  • Pandas uses the plot() method to create diagrams.
  • The kind parameter allows you to specify the type of diagram to use.
  • The following example uses the plot() method to create a diagram:
import pandas as pd
df = pd.read_csv('data.csv')
df.plot()

#Scatter Plot

  • Scatter Plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute.
  • Create a scatter plot of the "Duration" and "Calories" column:
import pandas as pd
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

#Histogram

  • Histograms are used to show distributions of variables while bar charts are used to compare variables.
  • Create a histogram of the "Duration" column:
import pandas as pd
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')

#Bar Chart

  • Create a bar chart of the "Duration" column:
import pandas as pd
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'bar')

#Pie Chart

  • Create a pie chart of the "Duration" column:
import pandas as pd
df = pd.read_csv('data.csv')
df.plot(kind = 'pie')

#Subplots

  • With the subplots() method you can draw multiple plots in one figure:
import pandas as pd
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
df.plot(kind = 'bar', x = 'Duration', y = 'Calories')