Pandas is one of the most widely used Python libraries among data scientists, serving as a foundational tool for data manipulation and analysis. If you plan to work with any kind of data, Pandas is the go-to tool for cleaning, transforming, and exploring complex datasets. In this article, we will cover the basics of data exploration and some commonly used Pandas functions.
Introduction to Pandas in Python
Pandas serves as a bridge between the raw data you have and the insights you seek. Think of it as a data powerhouse that lets you clean, transform, and explore your datasets with ease. Two fundamental data structures sit at the core of the Pandas library: Series and DataFrame. A Series is like a single column of a spreadsheet: it holds data of a single type, such as numbers or strings. A DataFrame, on the other hand, is like the entire spreadsheet: a two-dimensional table containing multiple rows and columns. Here are some benefits Pandas offers:
Versatile Data Types
A DataFrame can combine a wide range of data types within a single structure, including integers, floats, strings, and even more complex objects like lists and dictionaries.
Data Cleaning and Transformation
By converting a dataset into a Pandas DataFrame, we can use the functions Pandas provides to manipulate the data and apply mathematical operations.
Compatibility with Libraries
Pandas is compatible with a wide range of Python libraries and tools, making it an integral part of end-to-end data analysis workflows. For example, it seamlessly integrates with popular visualization libraries like Matplotlib and Seaborn.
Community and Documentation
Most importantly, it has a huge community, which means there is a wealth of resources, tutorials, and documentation available for learning and troubleshooting.
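The two core structures mentioned above can be created directly from plain Python objects. A minimal sketch (the variable names are illustrative):

```python
import pandas as pd

# A Series: one-dimensional, holds data of a single type
prices = pd.Series([800, 300, 25], name='Price')

# A DataFrame: two-dimensional, multiple columns that may have different types
products = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Shirt'],
    'Price': [800, 300, 25]
})

print(prices.dtype)    # the single dtype shared by the Series
print(products.shape)  # (rows, columns)
```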
Top Pandas Data Exploration Functions
In this section, I will demonstrate some useful Pandas functions using a small online store sales dataset.
| ID | Product | Category | Price | Quantity |
|---|---|---|---|---|
| 1 | Laptop | Electronics | 800 | 10 |
| 2 | Tablet | Electronics | 300 | 15 |
| 3 | Shirt | Clothing | 25 | 50 |
| 4 | Jeans | Clothing | 40 | 30 |
| 5 | Book | Books | 15 | 100 |
| 6 | Headphones | Electronics | 50 | 25 |
| 7 | Dress | Clothing | 35 | 40 |
| 8 | Phone | Electronics | 600 | 8 |
# Python code to generate the online store sales
import pandas as pd
data = {
'ID': [1, 2, 3, 4, 5, 6, 7, 8],
'Product': ['Laptop', 'Tablet', 'Shirt', 'Jeans', 'Book', 'Headphones', 'Dress', 'Phone'],
'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Books', 'Electronics', 'Clothing', 'Electronics'],
'Price': [800, 300, 25, 40, 15, 50, 35, 600],
'Quantity': [10, 15, 50, 30, 100, 25, 40, 8]
}
df = pd.DataFrame(data)
1. head() and tail()

head(): Returns the first few rows of a DataFrame. By default, it shows the first five rows, but you can specify the number of rows you want to display.

tail(): Returns the last few rows of a DataFrame. Like head(), it shows the last five rows by default.
# Display the first few rows of the DataFrame
print("First 3 rows:")
print(df.head(3))
| ID | Product | Category | Price | Quantity |
|---|---|---|---|---|
| 1 | Laptop | Electronics | 800 | 10 |
| 2 | Tablet | Electronics | 300 | 15 |
| 3 | Shirt | Clothing | 25 | 50 |
# Display the last few rows of the DataFrame
print("Last 3 rows:")
print(df.tail(3))
| ID | Product | Category | Price | Quantity |
|---|---|---|---|---|
| 6 | Headphones | Electronics | 50 | 25 |
| 7 | Dress | Clothing | 35 | 40 |
| 8 | Phone | Electronics | 600 | 8 |
2. info()

info() provides a concise summary of a DataFrame, including data types, non-null counts, and memory usage. This is useful for quickly understanding the structure of your dataset.
# Get a summary of the DataFrame
print("DataFrame summary:")
df.info()  # info() prints its summary directly and returns None, so no print() is needed
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | ID | 8 non-null | int64 |
| 1 | Product | 8 non-null | object |
| 2 | Category | 8 non-null | object |
| 3 | Price | 8 non-null | int64 |
| 4 | Quantity | 8 non-null | int64 |
3. describe()

describe() generates descriptive statistics for the numerical columns in a DataFrame, including the count, mean, standard deviation, minimum, maximum, and quartile values.
# Generate descriptive statistics for numerical columns
print("Descriptive statistics:")
print(df.describe())
|  | ID | Price | Quantity |
|---|---|---|---|
| count | 8.00000 | 8.00000 | 8.000000 |
| mean | 4.50000 | 233.12500 | 34.750000 |
| std | 2.44949 | 307.38456 | 30.127112 |
| min | 1.00000 | 15.00000 | 8.000000 |
| 25% | 2.75000 | 32.50000 | 13.750000 |
| 50% | 4.50000 | 45.00000 | 27.500000 |
| 75% | 6.25000 | 375.00000 | 42.500000 |
| max | 8.00000 | 800.00000 | 100.000000 |
Note: Non-numeric columns (here, Product and Category) are excluded from describe() by default.
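If you also want summary statistics for non-numeric columns, you can pass include='all', which adds the count, number of unique values, most frequent value (top), and its frequency (freq) for object columns. A minimal sketch on a small stand-in dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Books'],
    'Price': [800, 300, 25, 15]
})

# include='all' adds count/unique/top/freq rows for object columns
summary = df.describe(include='all')
print(summary)
```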
4. value_counts()

value_counts() returns a Series with the count of each unique value in a specified column. It is particularly useful for understanding the distribution of categorical data.
# Count the occurrences of each category in the 'Category' column
print("Value counts for 'Category' column:")
print(df['Category'].value_counts())
| Category | Count |
|---|---|
| Electronics | 4 |
| Clothing | 3 |
| Books | 1 |
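To see the distribution as proportions rather than raw counts, value_counts() accepts a normalize=True argument. A sketch using the same category values as the dataset above:

```python
import pandas as pd

categories = pd.Series(['Electronics', 'Electronics', 'Clothing', 'Books',
                        'Electronics', 'Clothing', 'Clothing', 'Electronics'])

# normalize=True returns each value's share of the total instead of a count
shares = categories.value_counts(normalize=True)
print(shares)
```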
5. groupby()

groupby() splits the data into groups based on a specified criterion, typically a column. It is usually followed by an aggregation operation, such as sum(), mean(), or count(), to compute summary statistics within each group.
# Group the data by 'Category' and calculate the total quantity in each category
print("\nGrouped by 'Category' with total quantity:")
print(df.groupby('Category')['Quantity'].sum())
| Category | Quantity |
|---|---|
| Books | 100 |
| Clothing | 120 |
| Electronics | 58 |
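groupby() is not limited to a single aggregation: the .agg() method can compute several statistics per group in one call, with names of your choosing (the column names avg_price and total_qty below are illustrative). A sketch on a subset of the sales data:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Books'],
    'Price': [800, 300, 25, 40, 15],
    'Quantity': [10, 15, 50, 30, 100]
})

# Named aggregations: several statistics per group in one pass
stats = df.groupby('Category').agg(
    avg_price=('Price', 'mean'),
    total_qty=('Quantity', 'sum')
)
print(stats)
```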
6. pivot_table()

pivot_table() creates a pivot table that summarizes data in a more structured format. You can define the rows, columns, and values to build a multidimensional summary of the data.
# Create a pivot table to show average price for each category
pivot_table = df.pivot_table(index='Category', values='Price', aggfunc='mean')
print("Pivot table for average price:")
print(pivot_table)
| Category | Price |
|---|---|
| Books | 15.00000 |
| Clothing | 33.33333 |
| Electronics | 437.50000 |
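pivot_table() can also summarize several value columns at once, with a different aggregation function for each, by passing a dict to aggfunc. A minimal sketch on a subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Books'],
    'Price': [800, 300, 25, 15],
    'Quantity': [10, 15, 50, 100]
})

# Average price but total quantity, per category
pivot = df.pivot_table(index='Category',
                       aggfunc={'Price': 'mean', 'Quantity': 'sum'})
print(pivot)
```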
7. isnull()

isnull() returns a DataFrame of the same shape as the input, with True for missing values (NaN) and False otherwise.
# Check for missing values
print("\nMissing values check:")
print(df.isnull())
|  | ID | Product | Category | Price | Quantity |
|---|---|---|---|---|---|
| 0 | False | False | False | False | False |
| 1 | False | False | False | False | False |
| 2 | False | False | False | False | False |
| 3 | False | False | False | False | False |
| 4 | False | False | False | False | False |
| 5 | False | False | False | False | False |
| 6 | False | False | False | False | False |
| 7 | False | False | False | False | False |
You can also chain isnull() (or its alias isna()) with sum() for a more compact, per-column view.
# Return the count of null values in every column
print(df.isna().sum())
| Column | Count |
|---|---|
| ID | 0 |
| Product | 0 |
| Category | 0 |
| Price | 0 |
| Quantity | 0 |
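Because isnull() produces booleans, chaining it with mean() instead of sum() gives the fraction of missing values per column, which is often easier to interpret on large datasets. A sketch with some deliberately missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Price': [800, np.nan, 25, np.nan],
    'Quantity': [10, 15, np.nan, 8]
})

# mean() over booleans = fraction of True values, i.e. share of missing data
missing_share = df.isnull().mean()
print(missing_share)
```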
8. fillna()

fillna() replaces missing values in a DataFrame with a specified value or a computed statistic, such as the mean, median, or mode (the mode is most common for categorical features).
# Fill missing values in 'Price' with the mean price
# (assigning back is preferred over inplace=True on a column selection,
# which raises a FutureWarning in recent pandas versions)
df['Price'] = df['Price'].fillna(df['Price'].mean())
print("DataFrame after filling missing values:")
print(df)
Since this dataset doesn't contain any missing values, fillna() returns the exact same table.
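To see fillna() actually change something, here is a small sketch with a deliberately missing price:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Shirt'],
    'Price': [800, np.nan, 25]
})

# Replace the missing price with the mean of the non-missing prices
df['Price'] = df['Price'].fillna(df['Price'].mean())
print(df)
```

The missing Tablet price becomes (800 + 25) / 2 = 412.5, the mean of the values that were present.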
9. plot()

plot() creates basic charts, such as line, bar, and scatter plots, directly from a DataFrame or Series. It simplifies the process of generating visualizations to aid in data exploration.
import matplotlib.pyplot as plt

# Bar chart of total quantity sold per category
df.groupby('Category')['Quantity'].sum().plot(kind='bar')
plt.title('Total Quantity by Category')
plt.xlabel('Category')
plt.ylabel('Total Quantity')
plt.show()