Client Report - Flights: Column Aggregation

Show the code

import polars as pl
import requests #new for reading in JSON
import io       #new for reading in JSON
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)

Show the code

# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here
url = "https://raw.githubusercontent.com/byuidatascience/data4missing/master/data-raw/flights_missing/flights_missing.json"
response = requests.get(url)
df = pl.read_json(io.BytesIO(response.content))

QUESTION 1

Calculate the sample size (i.e. number of rows in the dataset) for each airport in each year. What do you notice?

Each airport has one row for each month in each year, so most airport/year groups should have 12 rows. If any groups have fewer than 12 rows, that means some months are missing for that airport and year.

Show the code

# Include and execute your code here
sample_size = (
    df
    .group_by(["airport_code", "year"])
    .agg(
        pl.len().alias("sample_size")
    )
    .sort(["airport_code", "year"])
)

sample_size

shape: (78, 3)

airport_code	year	sample_size
str	i64	u32
"ATL"	null	23
"ATL"	2005	10
"ATL"	2006	11
"ATL"	2007	10
"ATL"	2008	9
…	…	…
"SLC"	2011	12
"SLC"	2012	12
"SLC"	2013	12
"SLC"	2014	12
"SLC"	2015	12

QUESTION 2

Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

I used the proportion of flights delayed as the main metric because airports have different numbers of total flights. Using a proportion makes the airports easier to compare fairly. Based on this metric, the airport at the top of the table has the worst delays.

Show the code

# Include and execute your code here

numeric_columns = [
    "num_of_flights_total",
    "num_of_delays_total",
    "minutes_delayed_total"
]

delay_clean = (
    df
    .filter(
        pl.all_horizontal([
            ~pl.col(col).cast(pl.Utf8).str.contains(r"\+")
            for col in numeric_columns
        ])
    )
    .with_columns([
        pl.col(col).cast(pl.Float64)
        for col in numeric_columns
    ])
    .filter(
        pl.all_horizontal([
            pl.col(col) != -999
            for col in numeric_columns
        ])
    )
)

airport_summary = (
    delay_clean
    .group_by(["airport_code", "airport_name"])
    .agg(
        pl.col("num_of_flights_total").sum().alias("total_flights"),
        pl.col("num_of_delays_total").sum().alias("total_delayed_flights"),
        pl.col("minutes_delayed_total").sum().alias("total_delay_minutes")
    )
    .with_columns(
        (pl.col("total_delayed_flights") / pl.col("total_flights")).alias("proportion_delayed"),
        (pl.col("total_delay_minutes") / pl.col("total_delayed_flights") / 60).alias("avg_delay_hours")
    )
    .select([
        "airport_code",
        "airport_name",
        "total_flights",
        "total_delayed_flights",
        "proportion_delayed",
        "avg_delay_hours"
    ])
    .sort("proportion_delayed", descending=True)
)

airport_summary

shape: (14, 6)

airport_code	airport_name	total_flights	total_delayed_flights	proportion_delayed	avg_delay_hours
str	str	f64	f64	f64	f64
"ORD"	""	197556.0	57703.0	0.292084	1.210507
"SFO"	"San Francisco, CA: San Francis…	1.565257e6	408631.0	0.261063	1.039595
"SFO"	""	65688.0	16973.0	0.258388	1.042692
"ORD"	"Chicago, IL: Chicago O'Hare In…	3.400032e6	773122.0	0.227387	1.124556
"ATL"	"Atlanta, GA: Hartsfield-Jackso…	4.235114e6	870910.0	0.20564	0.997328
…	…	…	…	…	…
"ATL"	""	194933.0	31533.0	0.161763	0.98783
"SAN"	""	47701.0	7385.0	0.154819	0.798501
"DEN"	""	190598.0	28555.0	0.149818	0.883043
"SLC"	"Salt Lake City, UT: Salt Lake …	1.293072e6	190733.0	0.147504	0.826714
"SLC"	""	110312.0	14427.0	0.130784	0.765313

QUESTION 3

Create a barplot showing the proportion of all flights that are delayed at each airport. Describe what you learn from this graph.

The bar chart shows which airports have the highest share of delayed flights. The tallest bar represents the airport with the worst delay rate. Airports with shorter bars have a lower proportion of delayed flights, so they are generally better choices for avoiding delays.

Show the code

# Include and execute your code here
(
    ggplot(airport_summary, aes(x="airport_code", y="proportion_delayed"))
    + geom_bar(stat="identity")
    + labs(
        title="Proportion of Flights Delayed by Airport",
        x="Airport",
        y="Proportion of Flights Delayed"
    )
    + scale_y_continuous(format=".0%")
)