Client Report - Flights: Column Aggregation

Unit 2 Task 2

Author

Jay Underwood

Show the code
import polars as pl
import requests #new for reading in JSON
import io       #new for reading in JSON
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)
Show the code
# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here
url = "https://raw.githubusercontent.com/byuidatascience/data4missing/master/data-raw/flights_missing/flights_missing.json"
response = requests.get(url)
df = pl.read_json(io.BytesIO(response.content))

QUESTION 1

Calculate the sample size (i.e. number of rows in the dataset) for each airport in each year. What do you notice?

Each airport has one row for each month in each year, so most airport/year groups should have 12 rows. If any groups have fewer than 12 rows, that means some months are missing for that airport and year.

Show the code
# Include and execute your code here
sample_size = (
    df
    .group_by(["airport_code", "year"])
    .agg(
        pl.len().alias("sample_size")
    )
    .sort(["airport_code", "year"])
)

sample_size
shape: (78, 3)
airport_code year sample_size
str i64 u32
"ATL" null 23
"ATL" 2005 10
"ATL" 2006 11
"ATL" 2007 10
"ATL" 2008 9
"SLC" 2011 12
"SLC" 2012 12
"SLC" 2013 12
"SLC" 2014 12
"SLC" 2015 12

QUESTION 2

Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

I used the proportion of flights delayed as the main metric because airports have different numbers of total flights. Using a proportion makes the airports easier to compare fairly. Based on this metric, the airport at the top of the table has the worst delays.

Show the code
# Include and execute your code here

numeric_columns = [
    "num_of_flights_total",
    "num_of_delays_total",
    "minutes_delayed_total"
]

delay_clean = (
    df
    .filter(
        pl.all_horizontal([
            ~pl.col(col).cast(pl.Utf8).str.contains(r"\+")
            for col in numeric_columns
        ])
    )
    .with_columns([
        pl.col(col).cast(pl.Float64)
        for col in numeric_columns
    ])
    .filter(
        pl.all_horizontal([
            pl.col(col) != -999
            for col in numeric_columns
        ])
    )
)

airport_summary = (
    delay_clean
    .group_by(["airport_code", "airport_name"])
    .agg(
        pl.col("num_of_flights_total").sum().alias("total_flights"),
        pl.col("num_of_delays_total").sum().alias("total_delayed_flights"),
        pl.col("minutes_delayed_total").sum().alias("total_delay_minutes")
    )
    .with_columns(
        (pl.col("total_delayed_flights") / pl.col("total_flights")).alias("proportion_delayed"),
        (pl.col("total_delay_minutes") / pl.col("total_delayed_flights") / 60).alias("avg_delay_hours")
    )
    .select([
        "airport_code",
        "airport_name",
        "total_flights",
        "total_delayed_flights",
        "proportion_delayed",
        "avg_delay_hours"
    ])
    .sort("proportion_delayed", descending=True)
)

airport_summary
shape: (14, 6)
airport_code airport_name total_flights total_delayed_flights proportion_delayed avg_delay_hours
str str f64 f64 f64 f64
"ORD" "" 197556.0 57703.0 0.292084 1.210507
"SFO" "San Francisco, CA: San Francis… 1.565257e6 408631.0 0.261063 1.039595
"SFO" "" 65688.0 16973.0 0.258388 1.042692
"ORD" "Chicago, IL: Chicago O'Hare In… 3.400032e6 773122.0 0.227387 1.124556
"ATL" "Atlanta, GA: Hartsfield-Jackso… 4.235114e6 870910.0 0.20564 0.997328
"ATL" "" 194933.0 31533.0 0.161763 0.98783
"SAN" "" 47701.0 7385.0 0.154819 0.798501
"DEN" "" 190598.0 28555.0 0.149818 0.883043
"SLC" "Salt Lake City, UT: Salt Lake … 1.293072e6 190733.0 0.147504 0.826714
"SLC" "" 110312.0 14427.0 0.130784 0.765313

QUESTION 3

Create a barplot showing the proportion of all flights that are delayed at each airport. Describe what you learn from this graph.

The bar chart shows which airports have the highest share of delayed flights. The tallest bar represents the airport with the worst delay rate. Airports with shorter bars have a lower proportion of delayed flights, so they are generally better choices for avoiding delays.

Show the code
# Include and execute your code here
(
    ggplot(airport_summary, aes(x="airport_code", y="proportion_delayed"))
    + geom_bar(stat="identity")
    + labs(
        title="Proportion of Flights Delayed by Airport",
        x="Airport",
        y="Proportion of Flights Delayed"
    )
    + scale_y_continuous(format=".0%")
)