Show the code
import polars as pl
import requests #new for reading in JSON
import io #new for reading in JSON
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)import polars as pl
import requests #new for reading in JSON
import io #new for reading in JSON
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
# Include and execute your code here
url = "https://raw.githubusercontent.com/byuidatascience/data4missing/master/data-raw/flights_missing/flights_missing.json"
response = requests.get(url)
df = pl.read_json(io.BytesIO(response.content))Calculate the sample size (i.e. number of rows in the dataset) for each airport in each year. What do you notice?
Each airport has one row for each month in each year, so most airport/year groups should have 12 rows. If any groups have fewer than 12 rows, that means some months are missing for that airport and year.
# Include and execute your code here
sample_size = (
df
.group_by(["airport_code", "year"])
.agg(
pl.len().alias("sample_size")
)
.sort(["airport_code", "year"])
)
sample_size| airport_code | year | sample_size |
|---|---|---|
| str | i64 | u32 |
| "ATL" | null | 23 |
| "ATL" | 2005 | 10 |
| "ATL" | 2006 | 11 |
| "ATL" | 2007 | 10 |
| "ATL" | 2008 | 9 |
| … | … | … |
| "SLC" | 2011 | 12 |
| "SLC" | 2012 | 12 |
| "SLC" | 2013 | 12 |
| "SLC" | 2014 | 12 |
| "SLC" | 2015 | 12 |
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
I used the proportion of flights delayed as the main metric because airports have different numbers of total flights. Using a proportion makes the airports easier to compare fairly. Based on this metric, the airport at the top of the table has the worst delays.
# Include and execute your code here
numeric_columns = [
"num_of_flights_total",
"num_of_delays_total",
"minutes_delayed_total"
]
delay_clean = (
df
.filter(
pl.all_horizontal([
~pl.col(col).cast(pl.Utf8).str.contains(r"\+")
for col in numeric_columns
])
)
.with_columns([
pl.col(col).cast(pl.Float64)
for col in numeric_columns
])
.filter(
pl.all_horizontal([
pl.col(col) != -999
for col in numeric_columns
])
)
)
airport_summary = (
delay_clean
.group_by(["airport_code", "airport_name"])
.agg(
pl.col("num_of_flights_total").sum().alias("total_flights"),
pl.col("num_of_delays_total").sum().alias("total_delayed_flights"),
pl.col("minutes_delayed_total").sum().alias("total_delay_minutes")
)
.with_columns(
(pl.col("total_delayed_flights") / pl.col("total_flights")).alias("proportion_delayed"),
(pl.col("total_delay_minutes") / pl.col("total_delayed_flights") / 60).alias("avg_delay_hours")
)
.select([
"airport_code",
"airport_name",
"total_flights",
"total_delayed_flights",
"proportion_delayed",
"avg_delay_hours"
])
.sort("proportion_delayed", descending=True)
)
airport_summary| airport_code | airport_name | total_flights | total_delayed_flights | proportion_delayed | avg_delay_hours |
|---|---|---|---|---|---|
| str | str | f64 | f64 | f64 | f64 |
| "ORD" | "" | 197556.0 | 57703.0 | 0.292084 | 1.210507 |
| "SFO" | "San Francisco, CA: San Francis… | 1.565257e6 | 408631.0 | 0.261063 | 1.039595 |
| "SFO" | "" | 65688.0 | 16973.0 | 0.258388 | 1.042692 |
| "ORD" | "Chicago, IL: Chicago O'Hare In… | 3.400032e6 | 773122.0 | 0.227387 | 1.124556 |
| "ATL" | "Atlanta, GA: Hartsfield-Jackso… | 4.235114e6 | 870910.0 | 0.20564 | 0.997328 |
| … | … | … | … | … | … |
| "ATL" | "" | 194933.0 | 31533.0 | 0.161763 | 0.98783 |
| "SAN" | "" | 47701.0 | 7385.0 | 0.154819 | 0.798501 |
| "DEN" | "" | 190598.0 | 28555.0 | 0.149818 | 0.883043 |
| "SLC" | "Salt Lake City, UT: Salt Lake … | 1.293072e6 | 190733.0 | 0.147504 | 0.826714 |
| "SLC" | "" | 110312.0 | 14427.0 | 0.130784 | 0.765313 |
Create a barplot showing the proportion of all flights that are delayed at each airport. Describe what you learn from this graph.
The bar chart shows which airports have the highest share of delayed flights. The tallest bar represents the airport with the worst delay rate. Airports with shorter bars have a lower proportion of delayed flights, so they are generally better choices for avoiding delays.
# Include and execute your code here
(
ggplot(airport_summary, aes(x="airport_code", y="proportion_delayed"))
+ geom_bar(stat="identity")
+ labs(
title="Proportion of Flights Delayed by Airport",
x="Airport",
y="Proportion of Flights Delayed"
)
+ scale_y_continuous(format=".0%")
)