A Complete Guide to Pandas in Python

Pandas is one of the most widely used Python libraries among data scientists, serving as a foundational tool for data manipulation and analysis. Whatever kind of data you work with, Pandas is the go-to tool for cleaning, transforming, and exploring complex datasets. In this article, we will cover the basics of data exploration and some of the most commonly used Pandas functions.

Introduction to Pandas in Python

Pandas in Python serves as a bridge between the raw data you have and the insights you seek. Think of it as a data powerhouse that lets you clean, transform, and explore your datasets with ease. At the core of the Pandas library are two fundamental data structures: Series and DataFrame. A Series is like a single column in an Excel spreadsheet: it holds one-dimensional data of a single type, such as numbers or strings. A DataFrame, on the other hand, is like the entire spreadsheet: a two-dimensional table made up of multiple rows and columns. Here are some of the benefits Pandas offers:

Versatile Data Types

A DataFrame can combine a wide range of data types within a single structure, including integers, floats, and strings; its columns can even hold more complex Python objects such as lists and dictionaries.
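For instance, a single DataFrame can mix several dtypes at once. A small made-up sketch:

```python
import pandas as pd

# A tiny hypothetical DataFrame mixing several data types
mixed = pd.DataFrame({
    'count': [1, 2],              # integers -> int64
    'ratio': [0.5, 0.25],         # floats -> float64
    'label': ['a', 'b'],          # strings -> object
    'tags': [['x', 'y'], ['z']],  # Python lists -> object
})

# dtypes shows the type inferred for each column
print(mixed.dtypes)
```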

Data Cleaning and Transformation

By converting a dataset into Pandas' DataFrame format, we can use the functions Pandas provides to manipulate the data and apply mathematical operations to it.

Compatibility with Libraries

Pandas is compatible with a wide range of Python libraries and tools, making it an integral part of end-to-end data analysis workflows. For example, it seamlessly integrates with popular visualization libraries like Matplotlib and Seaborn.

Community and Documentation

Most importantly, it has a huge community, which means there is a wealth of resources, tutorials, and documentation available for learning and troubleshooting.
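Before diving into the functions, here is a minimal sketch of the two core structures described above (the fruit data is invented purely for illustration):

```python
import pandas as pd

# A Series: one-dimensional, single-typed data, like one spreadsheet column
prices = pd.Series([1.5, 0.75, 3.0], name='Price')

# A DataFrame: a two-dimensional table of rows and columns
fruits = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango'],
    'Price': [1.5, 0.75, 3.0],
})

print(type(prices).__name__)  # Series
print(fruits.shape)           # (3, 2) -> 3 rows, 2 columns
```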

Top Pandas Data Exploration Functions

In this section, I will demonstrate some useful Pandas functions on a small dataset of an online store's sales.

ID  Product     Category     Price  Quantity
1   Laptop      Electronics    800        10
2   Tablet      Electronics    300        15
3   Shirt       Clothing        25        50
4   Jeans       Clothing        40        30
5   Book        Books           15       100
6   Headphones  Electronics     50        25
7   Dress       Clothing        35        40
8   Phone       Electronics    600         8
# Python code to generate the online store sales dataset
import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Product': ['Laptop', 'Tablet', 'Shirt', 'Jeans', 'Book', 'Headphones', 'Dress', 'Phone'],
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Books', 'Electronics', 'Clothing', 'Electronics'],
    'Price': [800, 300, 25, 40, 15, 50, 35, 600],
    'Quantity': [10, 15, 50, 30, 100, 25, 40, 8]
}

df = pd.DataFrame(data)

1. head() and tail()

  • head(): Returns the first few rows of a DataFrame. By default, it shows the first five rows, but you can specify the number of rows you want to display.
  • tail(): Returns the last few rows of a DataFrame. Like head(), it shows the last five rows by default.
# Display the first few rows of the DataFrame
print("First 3 rows:")
print(df.head(3))
   ID Product     Category  Price  Quantity
0   1  Laptop  Electronics    800        10
1   2  Tablet  Electronics    300        15
2   3   Shirt     Clothing     25        50
# Display the last few rows of the DataFrame
print("Last 3 rows:")
print(df.tail(3))
   ID     Product     Category  Price  Quantity
5   6  Headphones  Electronics     50        25
6   7       Dress     Clothing     35        40
7   8       Phone  Electronics    600         8

2. info()

info() provides a concise summary of a DataFrame, including data types, non-null counts, and memory usage. This is useful to quickly understand the structure of your dataset.

# Get a summary of the DataFrame
print("DataFrame summary:")
df.info()  # info() prints its summary directly and returns None
 #   Column    Non-Null Count  Dtype
 0   ID        8 non-null      int64
 1   Product   8 non-null      object
 2   Category  8 non-null      object
 3   Price     8 non-null      int64
 4   Quantity  8 non-null      int64

3. describe()

describe() generates descriptive statistics for numerical columns in a DataFrame. It provides information like mean, standard deviation, minimum, maximum, and quartile values.

# Generate descriptive statistics for numerical columns
print("Descriptive statistics:")
print(df.describe())
            ID       Price    Quantity
count  8.00000    8.000000    8.000000
mean   4.50000  233.125000   34.750000
std    2.44949  307.384560   30.127112
min    1.00000   15.000000    8.000000
25%    2.75000   32.500000   13.750000
50%    4.50000   45.000000   27.500000
75%    6.25000  375.000000   42.500000
max    8.00000  800.000000  100.000000

Note: by default, describe() summarizes only the numeric columns; the non-numeric ones (such as Product and Category) are left out unless you ask for them.
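To summarize the non-numeric columns as well, you can pass include='object'. A sketch on the two categorical columns of the same sales data:

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Shirt', 'Jeans',
                'Book', 'Headphones', 'Dress', 'Phone'],
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing',
                 'Books', 'Electronics', 'Clothing', 'Electronics'],
})

# For object columns, describe() reports count, unique, top, and freq
summary = df.describe(include='object')
print(summary)
```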


4. value_counts()

value_counts() returns a Series with the count of unique values in a specified column. It’s particularly useful for understanding the distribution of categorical data.

# Count the occurrences of each category in the 'Category' column
print("Value counts for 'Category' column:")
print(df['Category'].value_counts())
Category
Electronics    4
Clothing       3
Books          1
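value_counts() also accepts normalize=True, which returns each value's share of the total instead of a raw count. A quick sketch on the same Category values:

```python
import pandas as pd

categories = pd.Series(['Electronics', 'Electronics', 'Clothing', 'Clothing',
                        'Books', 'Electronics', 'Clothing', 'Electronics'],
                       name='Category')

# normalize=True converts counts into proportions that sum to 1
shares = categories.value_counts(normalize=True)
print(shares)  # Electronics 0.5, Clothing 0.375, Books 0.125
```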

5. groupby()

groupby() splits the data into groups based on a specified criterion, typically a column. This function is often followed by an aggregation operation, like sum(), mean(), or count(), to compute summary statistics within each group.

# Group the data by 'Category' and calculate the total quantity in each category
print("\nGrouped by 'Category' with total quantity:")
print(df.groupby('Category')['Quantity'].sum())
Category
Books          100
Clothing       120
Electronics     58
Name: Quantity, dtype: int64
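groupby() is not limited to a single statistic; with agg() you can compute several summaries per group at once. A sketch reusing the same sales columns (the output column names are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing',
                 'Books', 'Electronics', 'Clothing', 'Electronics'],
    'Price': [800, 300, 25, 40, 15, 50, 35, 600],
    'Quantity': [10, 15, 50, 30, 100, 25, 40, 8],
})

# Named aggregation: each output column pairs a source column with a function
stats = df.groupby('Category').agg(
    total_quantity=('Quantity', 'sum'),
    avg_price=('Price', 'mean'),
)
print(stats)
```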

6. pivot_table()

pivot_table() creates a pivot table that summarizes data in a more structured format. You can define rows, columns, and values to create a multidimensional summary of the data.

# Create a pivot table to show average price for each category
pivot_table = df.pivot_table(index='Category', values='Price', aggfunc='mean')
print("Pivot table for average price:")
print(pivot_table)
                  Price
Category
Books         15.000000
Clothing      33.333333
Electronics  437.500000
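pivot_table() can also apply a different aggregation to each value column by passing a dictionary to aggfunc. A sketch on the same sales data:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing',
                 'Books', 'Electronics', 'Clothing', 'Electronics'],
    'Price': [800, 300, 25, 40, 15, 50, 35, 600],
    'Quantity': [10, 15, 50, 30, 100, 25, 40, 8],
})

# Average price but total quantity, per category, in one table
table = df.pivot_table(index='Category',
                       values=['Price', 'Quantity'],
                       aggfunc={'Price': 'mean', 'Quantity': 'sum'})
print(table)
```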

7. isnull()

isnull() returns a DataFrame of the same shape as the input, with True for missing values (NaN) and False otherwise.

# Check for missing values
print("\nMissing values check:")
print(df.isnull())
      ID  Product  Category  Price  Quantity
0  False    False     False  False     False
1  False    False     False  False     False
2  False    False     False  False     False
3  False    False     False  False     False
4  False    False     False  False     False
5  False    False     False  False     False
6  False    False     False  False     False
7  False    False     False  False     False

You can also chain isnull() with sum() for a more compact view of the missing values per column (isna() is simply an alias for isnull()).

# Return the count of null values in every column
print(df.isna().sum())
ID          0
Product     0
Category    0
Price       0
Quantity    0
dtype: int64

8. fillna()

fillna() replaces missing values in a DataFrame with a specified value or a computed one, such as a column's mean or median for numeric features, or its mode for categorical features.

# Fill missing values in 'Price' with the mean price
df['Price'] = df['Price'].fillna(df['Price'].mean())
print("DataFrame after filling missing values:")
print(df)

Since this table doesn’t contain any missing values, the output is identical to the original DataFrame.
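To see fillna() actually change something, here is a tiny sketch where one price is deliberately blanked out (the NaN is injected purely for illustration):

```python
import pandas as pd
import numpy as np

# A small price Series with one missing value injected for the demo
prices = pd.Series([800.0, 300.0, np.nan, 40.0], name='Price')

# Replace the NaN with the mean of the non-missing values: (800+300+40)/3 = 380
filled = prices.fillna(prices.mean())
print(filled)
```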


9. plot()

plot() creates basic plots like line, bar, scatter, and more directly from a DataFrame. It simplifies the process of generating visualizations to aid in data exploration.

import matplotlib.pyplot as plt
df.groupby('Category')['Quantity'].sum().plot(kind='bar')
plt.title('Total Quantity by Category')
plt.xlabel('Category')
plt.ylabel('Total Quantity')
plt.show()