Pandas is one of the most widely used Python libraries among data scientists, serving as a foundational tool for data manipulation and analysis. If you plan to work with any kind of data, Pandas is the go-to tool for cleaning, transforming, and exploring complex datasets. In this article, we will cover the basics of data exploration and some commonly used Pandas functions.
Introduction to Pandas in Python
Pandas serves as a bridge between the raw data you have and the insights you seek. Think of it as a data powerhouse that lets you clean, transform, and explore your datasets with ease. Two fundamental data structures sit at the core of the Pandas library: Series and DataFrame. A Series is like a single column of a spreadsheet: it holds data of a single type, such as numbers or strings. A DataFrame, on the other hand, is like the entire spreadsheet: a two-dimensional table containing multiple rows and columns. Here are some benefits Pandas offers:
Versatile Data Types
A DataFrame can combine a wide range of data types within a single structure, including integers, floats, strings, and even more complex objects like lists and dictionaries.
Data Cleaning and Transformation
By converting a dataset into a Pandas DataFrame, we can use the functions Pandas provides to manipulate the data and apply mathematical operations.
Compatibility with Libraries
Pandas is compatible with a wide range of Python libraries and tools, making it an integral part of end-to-end data analysis workflows. For example, it seamlessly integrates with popular visualization libraries like Matplotlib and Seaborn.
Community and Documentation
Most importantly, it has a huge community, which means there is a wealth of resources, tutorials, and documentation available for learning and troubleshooting.
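The two core structures mentioned above can be created directly from plain Python objects. A minimal sketch (the variable names are illustrative):

```python
import pandas as pd

# A Series: one-dimensional, holds data of a single type
prices = pd.Series([800, 300, 25], name='Price')

# A DataFrame: two-dimensional, multiple columns that may have different types
products = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Shirt'],
    'Price': [800, 300, 25]
})

print(prices.dtype)    # the single dtype shared by the Series
print(products.shape)  # (rows, columns)
```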
Top Pandas Data Exploration Functions
In this section, I will demonstrate some useful Pandas functions using a small online store sales dataset.
| ID | Product | Category | Price | Quantity |
|---|---|---|---|---|
| 1 | Laptop | Electronics | 800 | 10 |
| 2 | Tablet | Electronics | 300 | 15 |
| 3 | Shirt | Clothing | 25 | 50 |
| 4 | Jeans | Clothing | 40 | 30 |
| 5 | Book | Books | 15 | 100 |
| 6 | Headphones | Electronics | 50 | 25 |
| 7 | Dress | Clothing | 35 | 40 |
| 8 | Phone | Electronics | 600 | 8 |
# Python code to generate the online store sales
import pandas as pd
data = {
'ID': [1, 2, 3, 4, 5, 6, 7, 8],
'Product': ['Laptop', 'Tablet', 'Shirt', 'Jeans', 'Book', 'Headphones', 'Dress', 'Phone'],
'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Books', 'Electronics', 'Clothing', 'Electronics'],
'Price': [800, 300, 25, 40, 15, 50, 35, 600],
'Quantity': [10, 15, 50, 30, 100, 25, 40, 8]
}
df = pd.DataFrame(data)
1. head() and tail()

head(): Returns the first few rows of a DataFrame. By default, it shows the first five rows, but you can specify the number of rows you want to display.

tail(): Returns the last few rows of a DataFrame. Like head(), it shows the last five rows by default.
# Display the first few rows of the DataFrame
print("First 3 rows:")
print(df.head(3))
| ID | Product | Category | Price | Quantity |
|---|---|---|---|---|
| 1 | Laptop | Electronics | 800 | 10 |
| 2 | Tablet | Electronics | 300 | 15 |
| 3 | Shirt | Clothing | 25 | 50 |
# Display the last few rows of the DataFrame
print("Last 3 rows:")
print(df.tail(3))
| ID | Product | Category | Price | Quantity |
|---|---|---|---|---|
| 6 | Headphones | Electronics | 50 | 25 |
| 7 | Dress | Clothing | 35 | 40 |
| 8 | Phone | Electronics | 600 | 8 |
2. info()

info() provides a concise summary of a DataFrame, including data types, non-null counts, and memory usage. This is useful for quickly understanding the structure of your dataset.
# Get a summary of the DataFrame
print("DataFrame summary:")
df.info()  # info() prints its summary directly and returns None, so no print() is needed
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | ID | 8 non-null | int64 |
| 1 | Product | 8 non-null | object |
| 2 | Category | 8 non-null | object |
| 3 | Price | 8 non-null | int64 |
| 4 | Quantity | 8 non-null | int64 |
3. describe()

describe() generates descriptive statistics for the numerical columns in a DataFrame, including the count, mean, standard deviation, minimum, maximum, and quartile values.
# Generate descriptive statistics for numerical columns
print("Descriptive statistics:")
print(df.describe())
|  | ID | Price | Quantity |
|---|---|---|---|
| count | 8.00000 | 8.00000 | 8.000000 |
| mean | 4.50000 | 233.12500 | 34.750000 |
| std | 2.44949 | 307.38456 | 30.127112 |
| min | 1.00000 | 15.00000 | 8.000000 |
| 25% | 2.75000 | 32.50000 | 13.750000 |
| 50% | 4.50000 | 45.00000 | 27.500000 |
| 75% | 6.25000 | 375.00000 | 42.500000 |
| max | 8.00000 | 800.00000 | 100.000000 |
Note: Non-numeric columns (here, Product and Category) are excluded from describe() by default.
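If you also want summary statistics for non-numeric columns, you can pass include='all', which adds the count, number of unique values, most frequent value (top), and its frequency (freq) for object columns. A minimal sketch on a small stand-in dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Books'],
    'Price': [800, 300, 25, 15]
})

# include='all' adds count/unique/top/freq rows for object columns
summary = df.describe(include='all')
print(summary)
```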
4. value_counts()

value_counts() returns a Series with the count of each unique value in a specified column. It is particularly useful for understanding the distribution of categorical data.
# Count the occurrences of each category in the 'Category' column
print("Value counts for 'Category' column:")
print(df['Category'].value_counts())
| Category | Count |
|---|---|
| Electronics | 4 |
| Clothing | 3 |
| Books | 1 |
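To see the distribution as proportions rather than raw counts, value_counts() accepts a normalize=True argument. A sketch using the same category values as the dataset above:

```python
import pandas as pd

categories = pd.Series(['Electronics', 'Electronics', 'Clothing', 'Books',
                        'Electronics', 'Clothing', 'Clothing', 'Electronics'])

# normalize=True returns each value's share of the total instead of a count
shares = categories.value_counts(normalize=True)
print(shares)
```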
5. groupby()

groupby() splits the data into groups based on a specified criterion, typically a column. It is usually followed by an aggregation operation, such as sum(), mean(), or count(), to compute summary statistics within each group.
# Group the data by 'Category' and calculate the total quantity in each category
print("\nGrouped by 'Category' with total quantity:")
print(df.groupby('Category')['Quantity'].sum())
| Category | Quantity |
|---|---|
| Books | 100 |
| Clothing | 120 |
| Electronics | 58 |
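groupby() is not limited to a single aggregation: the .agg() method can compute several statistics per group in one call, with names of your choosing (the column names avg_price and total_qty below are illustrative). A sketch on a subset of the sales data:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Books'],
    'Price': [800, 300, 25, 40, 15],
    'Quantity': [10, 15, 50, 30, 100]
})

# Named aggregations: several statistics per group in one pass
stats = df.groupby('Category').agg(
    avg_price=('Price', 'mean'),
    total_qty=('Quantity', 'sum')
)
print(stats)
```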
6. pivot_table()

pivot_table() creates a pivot table that summarizes data in a more structured format. You can define the rows, columns, and values to build a multidimensional summary of the data.
# Create a pivot table to show average price for each category
pivot_table = df.pivot_table(index='Category', values='Price', aggfunc='mean')
print("Pivot table for average price:")
print(pivot_table)
| Category | Price |
|---|---|
| Books | 15.00000 |
| Clothing | 33.33333 |
| Electronics | 437.50000 |
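pivot_table() can also summarize several value columns at once, with a different aggregation function for each, by passing a dict to aggfunc. A minimal sketch on a subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Books'],
    'Price': [800, 300, 25, 15],
    'Quantity': [10, 15, 50, 100]
})

# Average price but total quantity, per category
pivot = df.pivot_table(index='Category',
                       aggfunc={'Price': 'mean', 'Quantity': 'sum'})
print(pivot)
```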
7. isnull()

isnull() returns a DataFrame of the same shape as the input, with True for missing values (NaN) and False otherwise.
# Check for missing values
print("\nMissing values check:")
print(df.isnull())
|  | ID | Product | Category | Price | Quantity |
|---|---|---|---|---|---|
| 0 | False | False | False | False | False |
| 1 | False | False | False | False | False |
| 2 | False | False | False | False | False |
| 3 | False | False | False | False | False |
| 4 | False | False | False | False | False |
| 5 | False | False | False | False | False |
| 6 | False | False | False | False | False |
| 7 | False | False | False | False | False |
You can also chain isnull() (or its alias isna()) with sum() for a more compact, per-column view.
# Return the count of null values in every column
print(df.isna().sum())
| Column | Count |
|---|---|
| ID | 0 |
| Product | 0 |
| Category | 0 |
| Price | 0 |
| Quantity | 0 |
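Because isnull() produces booleans, chaining it with mean() instead of sum() gives the fraction of missing values per column, which is often easier to interpret on large datasets. A sketch with some deliberately missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Price': [800, np.nan, 25, np.nan],
    'Quantity': [10, 15, np.nan, 8]
})

# mean() over booleans = fraction of True values, i.e. share of missing data
missing_share = df.isnull().mean()
print(missing_share)
```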
8. fillna()

fillna() replaces missing values in a DataFrame with a specified value or a computed statistic, such as the mean, median, or mode (the mode is most common for categorical features).
# Fill missing values in 'Price' with the mean price
# (assigning back is preferred over inplace=True on a column selection,
# which raises a FutureWarning in recent pandas versions)
df['Price'] = df['Price'].fillna(df['Price'].mean())
print("DataFrame after filling missing values:")
print(df)
Since this dataset doesn't contain any missing values, fillna() returns the exact same table.
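To see fillna() actually change something, here is a small sketch with a deliberately missing price:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Shirt'],
    'Price': [800, np.nan, 25]
})

# Replace the missing price with the mean of the non-missing prices
df['Price'] = df['Price'].fillna(df['Price'].mean())
print(df)
```

The missing Tablet price becomes (800 + 25) / 2 = 412.5, the mean of the values that were present.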
9. plot()

plot() creates basic charts, such as line, bar, and scatter plots, directly from a DataFrame or Series. It simplifies the process of generating visualizations to aid in data exploration.
import matplotlib.pyplot as plt

# Bar chart of total quantity sold per category
df.groupby('Category')['Quantity'].sum().plot(kind='bar')
plt.title('Total Quantity by Category')
plt.xlabel('Category')
plt.ylabel('Total Quantity')
plt.show()