What is the Difference Between axis=0 and axis=1 When Working with Pandas Dataframes?
Sometimes, functions ask you to specify an axis. The documentation can often feel vague and/or technical.
For instance, here’s a quote from the apply function’s documentation:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:
- 0 or ‘index’: apply function to each column.
- 1 or ‘columns’: apply function to each row.
Uuuum… right.
So what’s the difference? Here’s an example…
Load a CSV file to play with
Prerequisites (if you want to practice)
- Install the Pandas library for your Python environment
- Cells in this notebook expect the Car Sales.csv file to be in the same directory as your notebook
- Resources to help you practice
First Things First
import pandas as pd
# Read the CSV file
# This assumes "Car Sales.csv" is in the same directory as your notebook
car_sales_data = pd.read_csv("Car Sales.csv")
# Show the first 5 rows
first_five = car_sales_data.head(5)
display(first_five)
 | DealershipName | RedCars | SilverCars | BlackCars | BlueCars | MonthSold | YearSold |
---|---|---|---|---|---|---|---|
0 | Clyde's Clunkers | 902.0 | 650.0 | 754.0 | 792.0 | 1.0 | 2018.0 |
1 | Clyde's Clunkers | 710.0 | 476.0 | 518.0 | 492.0 | 2.0 | 2018.0 |
2 | Clyde's Clunkers | 248.0 | 912.0 | 606.0 | 350.0 | 3.0 | 2018.0 |
3 | Clyde's Clunkers | 782.0 | 912.0 | 858.0 | 446.0 | 4.0 | 2018.0 |
4 | Clyde's Clunkers | 278.0 | 354.0 | 482.0 | 752.0 | 5.0 | 2018.0 |
Each row of the car sales data appears to summarize one dealership’s sales for one month of one year, broken out by car color.
To state the “grain” of the data frame another way, the data frame contains one row per dealership, month, and year combination, and each row reports the number of cars sold in each color.
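If you want to sanity-check a grain like that, here is a minimal sketch. It re-types the five rows shown above into a small data frame (the name `sample` is just for illustration) so it runs without the CSV file, and it only proves the point for those five rows, not for the whole file.

```python
import pandas as pd

# The grain columns from the five rows shown above, re-typed by hand
sample = pd.DataFrame({
    "DealershipName": ["Clyde's Clunkers"] * 5,
    "MonthSold": [1.0, 2.0, 3.0, 4.0, 5.0],
    "YearSold": [2018.0] * 5,
})

# If the grain really is one row per dealership/month/year,
# no combination of these three columns should appear twice
grain = ["DealershipName", "MonthSold", "YearSold"]
has_duplicates = bool(sample.duplicated(subset=grain).any())
print(has_duplicates)
```

With the real file loaded, the same check could be run on car_sales_data instead of the hand-typed sample.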
Choose Your Scenario
Suppose that two people come to you and ask separate questions about average car sales.
- Lucy asks, “Can you calculate the average number of cars sold for each color?”
- Zack asks, “Can you calculate the average number of cars sold (regardless of color) for each dealership in each month & year?” (so basically the average of red, silver, black, and blue cars for each row)
Start with Lucy.
Think about what you’d do to answer Lucy’s question by hand, if you didn’t have Pandas to do the work for you. Here’s what I’d do:
1. Start with the RedCars column.
2. Add up 902, 710, 248, 782, 278, and so on.
3. Divide that sum by the total number of values. Boom. RedCars average.
4. Rinse and repeat steps 1-3 for SilverCars, BlackCars, and BlueCars.
This is an axis=0 scenario in Pandas.
first_five[['RedCars', 'SilverCars', 'BlackCars', 'BlueCars']].mean(axis=0)
RedCars 584.0
SilverCars 660.8
BlackCars 643.6
BlueCars 566.4
dtype: float64
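To convince yourself that axis=0 really is Lucy's by-hand recipe, here is a small sketch that does the adding and dividing in a plain loop and compares it with Pandas. The `sample` data frame is the five rows from above re-typed by hand (an assumption made so the cell runs without the CSV), and `by_hand` is just an illustrative name.

```python
import pandas as pd

# The five rows shown above, re-typed by hand so this runs without the CSV
sample = pd.DataFrame({
    "RedCars": [902.0, 710.0, 248.0, 782.0, 278.0],
    "SilverCars": [650.0, 476.0, 912.0, 912.0, 354.0],
    "BlackCars": [754.0, 518.0, 606.0, 858.0, 482.0],
    "BlueCars": [792.0, 492.0, 350.0, 446.0, 752.0],
})

# Lucy's steps by hand: pick a column, add up its values going down,
# divide by how many values there are, then move on to the next column
by_hand = {}
for color in sample.columns:
    values = sample[color].tolist()
    by_hand[color] = sum(values) / len(values)

# The same thing with Pandas doing the work
by_pandas = sample.mean(axis=0)

print(by_hand)
print(by_pandas.to_dict())
```

Both approaches give one answer per column, which is why the result is labeled with the column names.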
What about Zack?
What would you do to answer his question by hand, without Pandas? How’s this…
1. Start with the first row of data (Row 0), since his question matches the “grain” of the data frame… one row per dealership per month & year.
2. Add up the RedCars, SilverCars, BlackCars, and BlueCars values for Row 0 and divide by 4. So (902 + 650 + 754 + 792)/4
3. Rinse and repeat steps 1 & 2 for every row in the data frame. Boom. Average cars sold by dealer/month/year.
This is an axis=1 scenario.
first_five[['RedCars', 'SilverCars', 'BlackCars', 'BlueCars']].mean(axis=1)
0 774.5
1 549.0
2 529.0
3 749.5
4 466.5
dtype: float64
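The same kind of check works for Zack's by-hand recipe. This sketch walks across each row with a plain loop and compares the result with axis=1. As before, `sample` is the five rows from above re-typed by hand (an assumption so the cell runs on its own), and `by_hand` is an illustrative name.

```python
import pandas as pd

# The five rows shown above, re-typed by hand so this runs without the CSV
sample = pd.DataFrame({
    "RedCars": [902.0, 710.0, 248.0, 782.0, 278.0],
    "SilverCars": [650.0, 476.0, 912.0, 912.0, 354.0],
    "BlackCars": [754.0, 518.0, 606.0, 858.0, 482.0],
    "BlueCars": [792.0, 492.0, 350.0, 446.0, 752.0],
})

# Zack's steps by hand: pick a row, add up its values going across,
# divide by the number of colors, then move on to the next row
by_hand = []
for _, row in sample.iterrows():
    values = row.tolist()
    by_hand.append(sum(values) / len(values))

# The same thing with Pandas doing the work
by_pandas = sample.mean(axis=1)

print(by_hand)
print(by_pandas.tolist())
```

This time there is one answer per row, so the result is labeled with the row index (0 through 4) instead of the column names.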
Summarizing the Findings
Specifying an axis to a function in Pandas helps answer one of the following questions:
- Should I (Pandas) start with a column and make this function do its job downward on all the “cells” for that column, and then continue doing the same thing for all the rest of the columns in the data frame? (axis=0)
or
- Should I (Pandas) start with the first row of data in the data frame and make this function do its job horizontally on all of the “cells” for that row, and then continue doing the same thing for all the rest of the rows in the data frame? (axis=1)
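That also makes the apply documentation quoted at the top easier to read. Here is a rough sketch, again using the five rows from above re-typed by hand as `sample`. The `spread` function (largest value minus smallest value) is a made-up helper for illustration, not something from Pandas or from the original notebook.

```python
import pandas as pd

# The five rows shown above, re-typed by hand so this runs without the CSV
sample = pd.DataFrame({
    "RedCars": [902.0, 710.0, 248.0, 782.0, 278.0],
    "SilverCars": [650.0, 476.0, 912.0, 912.0, 354.0],
    "BlackCars": [754.0, 518.0, 606.0, 858.0, 482.0],
    "BlueCars": [792.0, 492.0, 350.0, 446.0, 752.0],
})

def spread(values):
    """Hypothetical helper: gap between the largest and smallest value."""
    return values.max() - values.min()

# axis=0: apply hands spread one column at a time (work downward)
per_color = sample.apply(spread, axis=0)

# axis=1: apply hands spread one row at a time (work across)
per_row = sample.apply(spread, axis=1)

print(per_color)
print(per_row)
```

per_color comes back with four values, one per color column, and per_row comes back with five values, one per row. That is the “apply function to each column” versus “apply function to each row” wording from the documentation.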