diff --git a/how-vehicle-and-engine-characteristics-influence-fuel-efficiency.ipynb b/how-vehicle-and-engine-characteristics-influence-fuel-efficiency.ipynb
new file mode 100644
index 0000000..f49ff5c
--- /dev/null
+++ b/how-vehicle-and-engine-characteristics-influence-fuel-efficiency.ipynb
@@ -0,0 +1,5595 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "c923f833-196d-4f39-ab7a-d8968f690b89",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "authors:\n",
+ " - name: Zachary Hanthorn\n",
+ " affiliation: University of Washington\n",
+ " email: zhanthor@uw.edu \n",
+ " github: https://github.com/zhanthor\n",
+ " linkedin: https://www.linkedin.com/in/zacharyhanthorn/\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebf52349-3294-4698-b783-ba72e43ff06d",
+ "metadata": {},
+ "source": [
+ "# How Vehicle and Engine Characteristics Influence Fuel Efficiency Over Four Decades\n",
+ "\n",
+ "Research Questions:\n",
+ "## 1. How has the average fuel economy changed across different vehicle types? Do trends differ between foreign and domestic vehicle manufacturers?\n",
+ "\n",
+ " Average fuel economy has changed at different rates across vehicle types. Wagons have seen much larger increases in fuel economy, while trucks and vans have improved very little. Trends differ slightly between foreign and domestic manufacturers. Manufacturers from Korea and Japan, which are typically known for commuter vehicles, have higher average fuel economy than European manufacturers, which produce more luxury cars.\n",
+ "\n",
+ "## 2. What is the relationship between engine characteristics (displacement, cylinders, transmissions) and fuel efficiency? How has this relationship changed with the recent shift to hybrid vehicles?\n",
+ "\n",
+ " There is a noticeable relationship between cylinder count and fuel economy. Fewer cylinders generally indicate better fuel economy, and 4-cylinder vehicles seem to show the most growth in MPG over the span of the dataset. There is also a noticeable relationship between engine displacement and fuel economy. Vehicles with smaller engine displacement and hybrid vehicles have significantly higher MPG than other vehicles. For transmissions, cars with a continuously variable transmission (CVT) show a noticeable improvement in MPG. The next highest MPG group is the 5-speed manual. The other transmission categories do not show any noticeable differences from each other.\n",
+ " \n",
+ "\n",
+ "## 3. Can we predict a vehicle’s MPG based on its engine displacement, number of cylinders, and other identifying characteristics?\n",
+ "\n",
+ " Yes, we can predict a vehicle's MPG. I trained an XGBoost regressor on data from the dataset, and it can predict past and present MPG with single-digit error."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4c762e35-1fc8-4ff0-bd2a-517bf12a66d9",
+ "metadata": {},
+ "source": [
+ "## Motivation\n",
+ "\n",
+ "I have a strong passion for cars and the technology behind them, especially older cars from the 90s and 2000s. I have always wondered how advancements in technology and changes in societal vehicle standards have affected fuel economy. I often see larger SUVs and pickup trucks on the road, whereas smaller cars like sedans seemed more common when I was younger. So I wonder whether the advancements in engine efficiency have outweighed the negatives that come from people driving heavier vehicles. I also think that climate change is a real problem, so I wonder whether the advancements in fuel efficiency have made it viable for people to continue driving gasoline-powered vehicles instead of electric vehicles. Knowing the answers to these questions would help me form a better opinion about whether people can continue to drive heavy cars because of modern engine efficiency or whether they should shift to smaller cars to reduce their carbon footprint. It would also help me better understand whether the shift to electric cars is actually better for the environment, as some argue that it is not."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e3a9116b-b1c6-424a-a790-037dd3a8a55b",
+ "metadata": {},
+ "source": [
+ "## Data Setting\n",
+ "\n",
+ "The dataset is available here: https://catalog.data.gov/dataset/vehicle-fuel-economy.\n",
+ "This dataset contains detailed information about fuel efficiency and emissions for vehicles sold in the United States since the 1980s. This data was published by the California Air Resources Board. It contains many vehicle attributes such as make, model, engine displacement, transmission type, fuel type, miles per gallon (city and highway), and tailpipe emissions. The analysis might be complicated by changes in testing standards since the 1980s: older MPG ratings may not be directly comparable with modern ratings because testing procedures have changed over the last 40 years. Another difficulty is the many technology shifts that have occurred since data collection began, such as turbochargers (which existed in the 80s but were far less common than they are today), hybrids, and electric vehicles. These innovations fundamentally alter how efficiency is measured (for example, MPG vs. MPGe), which can deepen the analysis but will require some form of separation or normalization. A third difficulty is that the dataset only contains vehicles sold in the U.S., so its trends in consumer demand, government incentives, and regulation differ from those in other regions of the world. This may introduce a bias towards certain brands or U.S.-specific vehicle types (for example, pickup trucks are far less common outside of America)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b256b1da-1275-4486-87a2-1d83f7fb8ae2",
+ "metadata": {},
+ "source": [
+ "## Method\n",
+ "\n",
+ "### Step 1: Load and clean data\n",
+ "I will write a data manipulation function that loads the data into a pandas DataFrame and cleans it by standardizing column names. I will test this function on a smaller sample CSV to verify that missing data and data types are interpreted correctly. The outcome is a clean DataFrame that is ready to use in later stages.\n",
+ "\t\n",
+ "### Step 2: Explore the data and summarize it\n",
+ "This step provides better insight into MPG distributions over time. I will write a data manipulation function that calculates the average MPG by year, vehicle type, and vehicle make. I will also write a function that plots these trends in an interactive chart using Plotly or Altair. To test these functions, I will verify that the statistics they return match a manual computation on a subset of the data. The outcome will be initial context for MPG trends, which directly supports answering the first question.\n",
+ "\n",
+ "### Step 3: Analyze relationships between different engine characteristics and fuel efficiency\n",
+ "This step examines how displacement, cylinders, and weight affect the fuel efficiency of a vehicle. I will write a data manipulation function that computes the correlation between these numerical variables and the vehicle's MPG. Next, I will create a scatter plot using Plotly or Altair that allows the user to choose different variables to plot against MPG. For testing, I will confirm that the computed correlations have the expected sign (for example, a heavier vehicle with a larger displacement should have a lower MPG than a lighter one with a smaller displacement). The outcome will be a visual understanding of how these variables affect MPG.\n",
+ "\t\n",
+ "### Step 4: Build and compare predictive models\n",
+ "In this step I will use machine learning to predict a vehicle's MPG and find the most influential factors. One data manipulation function will split the dataset into training and testing subsets. A second function will train the models on the training data, and a third will evaluate them on the test data. For testing purposes I will use newer cars that aren't in the dataset and a synthetic dataset that has predictable results. The outcome will be a model that can predict a vehicle's MPG, satisfying the challenge goal.\n",
+ "\t\n",
+ "### Step 5: Interpret final results \n",
+ "In this step I will write a function that compiles a summary of all results found throughout the project into a final table. For testing I will confirm that the summary accurately represents the rest of the data found throughout the project. The outcome will be a cohesive explanation of how vehicle design and manufacturing trends have combined with technological achievements to influence the fuel economy of our cars."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "33cecd7f-eb42-4ef6-bde8-d1283ceb792a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# imports\n",
+ "!pip install --quiet pandas numpy\n",
+ "!pip install --quiet plotly\n",
+ "!pip install --quiet xgboost\n",
+ "!pip install --quiet scikit-learn\n",
+ "\n",
+ "import pandas as pd\n",
+ "import doctest\n",
+ "import plotly.io as pio\n",
+ "import plotly.express as px\n",
+ "import numpy as np\n",
+ "from IPython.display import Image\n",
+ "\n",
+ "from sklearn.metrics import mean_squared_error\n",
+ "from sklearn.metrics import root_mean_squared_error\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from xgboost import XGBRegressor\n",
+ "\n",
+ "pio.renderers.default = 'notebook'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "e75a58ad-6760-43c2-b28b-2a806506d51b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def load_data(csv):\n",
+ " \"\"\"\n",
+ " Load data from a given csv.\n",
+ " Takes a path (str) to a CSV file and returns a new DataFrame.\n",
+ " \"\"\"\n",
+ " data = pd.read_csv(csv, low_memory=False) # low_memory=False avoids the mixed-dtype warning on this file\n",
+ " \n",
+ " return data\n",
+ "\n",
+ "vehicles = load_data(\"vehicles.csv\")\n",
+ "vehicles_test = load_data(\"vehicles_test.csv\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7c7bf82d-e690-45f0-a034-cc40b7d9c927",
+ "metadata": {},
+ "source": [
+ "### How large is your dataset? Write code to show the number of rows and columns. What do the rows represent? What about the columns?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "49d26ea7-adbf-4901-b342-c997996aab99",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Number of rows: 40704\n",
+ "Number of columns: 83\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f\"Number of rows: {len(vehicles)}\")\n",
+ "print(f\"Number of columns: {len(vehicles.columns)}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "f546ac43-05de-4c90-947b-413898a6ea3b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Index(['barrels08', 'barrelsA08', 'charge120', 'charge240', 'city08',\n",
+ " 'city08U', 'cityA08', 'cityA08U', 'cityCD', 'cityE', 'cityUF', 'co2',\n",
+ " 'co2A', 'co2TailpipeAGpm', 'co2TailpipeGpm', 'comb08', 'comb08U',\n",
+ " 'combA08', 'combA08U', 'combE', 'combinedCD', 'combinedUF', 'cylinders',\n",
+ " 'displ', 'drive', 'engId', 'eng_dscr', 'feScore', 'fuelCost08',\n",
+ " 'fuelCostA08', 'fuelType', 'fuelType1', 'ghgScore', 'ghgScoreA',\n",
+ " 'highway08', 'highway08U', 'highwayA08', 'highwayA08U', 'highwayCD',\n",
+ " 'highwayE', 'highwayUF', 'hlv', 'hpv', 'id', 'lv2', 'lv4', 'make',\n",
+ " 'model', 'mpgData', 'phevBlended', 'pv2', 'pv4', 'range', 'rangeCity',\n",
+ " 'rangeCityA', 'rangeHwy', 'rangeHwyA', 'trany', 'UCity', 'UCityA',\n",
+ " 'UHighway', 'UHighwayA', 'VClass', 'year', 'youSaveSpend', 'guzzler',\n",
+ " 'trans_dscr', 'tCharger', 'sCharger', 'atvType', 'fuelType2', 'rangeA',\n",
+ " 'evMotor', 'mfrCode', 'c240Dscr', 'charge240b', 'c240bDscr',\n",
+ " 'createdOn', 'modifiedOn', 'startStop', 'phevCity', 'phevHwy',\n",
+ " 'phevComb'],\n",
+ " dtype='object')"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vehicles.columns"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d0a78e27-d00e-4493-be0c-e9319bf30ce1",
+ "metadata": {},
+ "source": [
+ "Each row represents a single vehicle entry in the dataset.\n",
+ "Each column represents a variable recorded for that vehicle."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "40f5f9bc-592f-4184-b5f8-f4de3194eae5",
+ "metadata": {},
+ "source": [
+ "### Does your dataset have any missing data? Write code to show all missing data (or an empty result if there is no missing data). If your dataset has missing data, describe the missing data (e.g., number of rows or columns; percent missingness; etc.) and your plan for working with it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "fc9ee0ad-f842-4dc4-8062-4d81d11b705c",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Columns with more than 75% missing values: 46\n",
+ "Columns with more than 90% missing values: 39\n",
+ "Columns with more than 99% missing values: 22\n"
+ ]
+ }
+ ],
+ "source": [
+ "def check_missing_data(data):\n",
+ " \"\"\"\n",
+ " Prints information about missing data within a given dataset.\n",
+ "\n",
+ " Reports how many columns are more than 75%, 90%, and 99% missing.\n",
+ "\n",
+ " >>> check_missing_data(vehicles_test)\n",
+ " Columns with more than 75% missing values: 48\n",
+ " Columns with more than 90% missing values: 39\n",
+ " Columns with more than 99% missing values: 39\n",
+ " \"\"\"\n",
+ " # treat the strings \"NaN\"/\"nan\" and the placeholder value 0 as missing\n",
+ " data = data.replace([\"NaN\", \"nan\", 0], np.nan)\n",
+ "\n",
+ " # fraction (0 to 1) of missing values in each column\n",
+ " nan_percentage = data.isna().sum() / len(data)\n",
+ "\n",
+ " print(f\"Columns with more than 75% missing values: {len(nan_percentage[nan_percentage > 0.75])}\")\n",
+ " print(f\"Columns with more than 90% missing values: {len(nan_percentage[nan_percentage > 0.9])}\")\n",
+ " print(f\"Columns with more than 99% missing values: {len(nan_percentage[nan_percentage > 0.99])}\")\n",
+ "\n",
+ "check_missing_data(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3c8ec0e2-3944-482b-a31a-d8d6ea5ff583",
+ "metadata": {},
+ "source": [
+ "Yes, the dataset has a substantial amount of missing data. Some columns are NaN for most rows. This is due to a number of factors, but the main ones are the differences in testing over the four decades in which this data was collected and how cars have changed over time. Columns relating to electric motors and modern features like start-stop do not apply to older cars. Columns like fuelType2 are also empty for many vehicles because they only use one fuel type. Many of these columns are not applicable to my research questions, so I will remove them from the dataset before beginning my analysis. Other columns might be applicable (like turbocharger, supercharger, and start-stop features). For these columns I will replace the NaNs with False or 0, or exclude them from the visualization or machine learning training if they skew the results significantly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "6d751196-4bea-4ed3-b381-9911c005b8fb",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Columns with more than 75% missing values: 4\n",
+ "Columns with more than 90% missing values: 2\n",
+ "Columns with more than 99% missing values: 0\n",
+ "16\n"
+ ]
+ }
+ ],
+ "source": [
+ "def clean_data(data):\n",
+ " \"\"\"\n",
+ " Clean the data of variables that we will not use.\n",
+ " Takes a DataFrame, returns a modified version of that DataFrame.\n",
+ "\n",
+ " >>> vehicles_test = vehicles_test.copy()\n",
+ " >>> cleaned_data = clean_data(vehicles_test)\n",
+ " >>> len(cleaned_data.columns)\n",
+ " 16\n",
+ " >>> cleaned_data.columns.tolist() == ['city08', 'comb08', 'cylinders', 'displ', 'drive', 'fuelType1', 'fuelType2', 'highway08', 'make', 'model', 'trany', 'VClass', 'year', 'sCharger', 'tCharger', 'startStop']\n",
+ " True\n",
+ " \"\"\"\n",
+ "\n",
+ " data = data[[\"city08\", \"comb08\", \n",
+ " \"cylinders\", \"displ\", \"drive\", \n",
+ " \"fuelType1\", \"fuelType2\", \"highway08\", \n",
+ " \"make\", \"model\", \"trany\", \n",
+ " \"VClass\", \"year\", \"sCharger\", \n",
+ " \"tCharger\", \"startStop\"]].copy() # copy so later column assignments do not modify a view\n",
+ "\n",
+ " return data\n",
+ "\n",
+ "vehicles = clean_data(vehicles)\n",
+ "\n",
+ "check_missing_data(vehicles)\n",
+ "print(len(vehicles.columns))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bc3ce5a9-2301-4ede-9204-2d766a9a97b1",
+ "metadata": {},
+ "source": [
+ "### What are the variables of interest for your research questions? Explain what each variable of interest (e.g., column in a data frame) represents. Connect them back to your research questions. Why are you using these specific columns or attributes?\n",
+ "\n",
+ "I have identified 16 variables that I believe will be of interest for my specific research questions. \n",
+ "1. city08 - City miles per gallon (rounded to the nearest whole number)\n",
+ "2. comb08 - Combined miles per gallon (city and highway, rounded to the nearest whole number)\n",
+ "3. cylinders - Number of cylinders in the engine\n",
+ "4. displ - Engine displacement (in liters)\n",
+ "5. drive - Drivetrain type (categorical)\n",
+ "6. fuelType1 - Fuel type (categorical)\n",
+ "7. fuelType2 - Second fuel type. Will be NaN if the vehicle only has one fuel type. This will be used to identify hybrid vehicles for RQ2.\n",
+ "8. highway08 - Highway miles per gallon (rounded to the nearest whole number)\n",
+ "9. make - Vehicle make (manufacturer)\n",
+ "10. model - Vehicle model (name)\n",
+ "11. trany - Transmission type (categorical)\n",
+ "12. VClass - Vehicle class (categorical)\n",
+ "13. year - Production year\n",
+ "14. sCharger - Vehicle equipped with supercharger (boolean)\n",
+ "15. tCharger - Vehicle equipped with turbocharger (boolean)\n",
+ "16. startStop - Vehicle equipped with start-stop feature (boolean)\n",
+ "\n",
+ "For research question 1, I will use the MPG variables along with the variables that describe the car (engine characteristics, class, make/model). The production year will be used to plot the trend over time. For research question 2, I will use the same variables, plus fuelType2 to identify whether a vehicle is a hybrid for the second part of the question. I will use the sCharger, tCharger, and startStop columns to see whether those features have affected the relationship and how cars with them differ from cars without them. For research question 3, I will use all of these variables except the MPG variables to try to predict MPG. I filtered out many columns that duplicated rounded or unrounded data, were not applicable to my research questions (like emission data and EV-specific data), or had so many NaN values that they would be unusable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "a6986405-8954-417a-89f0-3a02c0393d48",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def simplify_vehicle_class(data):\n",
+ " \"\"\"\n",
+ " Simplifies the vehicle class column by grouping into broader\n",
+ " categories to make the data better for analysis.\n",
+ "\n",
+ " Ex. 'Sport Utility Vehicle - 2WD' & 'Standard Sport Utility Vehicle 4WD' become 'SUV'\n",
+ "\n",
+ " Takes a DataFrame and returns a modified version of that DataFrame.\n",
+ "\n",
+ " >>> vehicles_test = vehicles_test.copy()\n",
+ " >>> simplified_data = simplify_vehicle_class(vehicles_test)\n",
+ " >>> simplified_data['VClass'].unique().tolist()\n",
+ " ['Coupe', 'Sedan', 'SUV', 'Van']\n",
+ " >>> simplified_data.loc[simplified_data['VClass'] == 'Coupe', 'model'].iloc[0]\n",
+ " '456M GT'\n",
+ " \"\"\"\n",
+ " vclass_buckets = {\n",
+ " \"SUV\": [\n",
+ " \"Sport Utility Vehicle - 2WD\",\n",
+ " \"Sport Utility Vehicle - 4WD\",\n",
+ " \"Small Sport Utility Vehicle 2WD\",\n",
+ " \"Small Sport Utility Vehicle 4WD\",\n",
+ " \"Standard Sport Utility Vehicle 2WD\",\n",
+ " \"Standard Sport Utility Vehicle 4WD\",\n",
+ " ],\n",
+ " \"Sedan\": [\n",
+ " \"Minicompact Cars\",\n",
+ " \"Subcompact Cars\",\n",
+ " \"Compact Cars\",\n",
+ " \"Midsize Cars\",\n",
+ " \"Large Cars\",\n",
+ " ],\n",
+ " \"Wagon\": [\n",
+ " \"Small Station Wagons\",\n",
+ " \"Midsize Station Wagons\",\n",
+ " \"Midsize-Large Station Wagons\",\n",
+ " ],\n",
+ " \"Truck\": [\n",
+ " \"Small Pickup Trucks 2WD\",\n",
+ " \"Small Pickup Trucks 4WD\",\n",
+ " \"Standard Pickup Trucks 2WD\",\n",
+ " \"Standard Pickup Trucks 4WD\",\n",
+ " \"Small Pickup Trucks\",\n",
+ " \"Standard Pickup Trucks\",\n",
+ " \"Standard Pickup Trucks/2wd\",\n",
+ " ],\n",
+ " \"Van\": [\n",
+ " \"Vans, Cargo Type\",\n",
+ " \"Vans, Passenger Type\",\n",
+ " \"Vans Passenger\",\n",
+ " \"Vans\",\n",
+ " \"Minivan - 2WD\",\n",
+ " \"Minivan - 4WD\",\n",
+ " \"Minivan\",\n",
+ " ],\n",
+ " \"Special Purpose\": [\n",
+ " \"Special Purpose Vehicle\",\n",
+ " \"Special Purpose Vehicle 2WD\",\n",
+ " \"Special Purpose Vehicle 4WD\",\n",
+ " \"Special Purpose Vehicles\",\n",
+ " \"Special Purpose Vehicles/2wd\",\n",
+ " \"Special Purpose Vehicles/4wd\",\n",
+ " ],\n",
+ " \"Coupe\": [\n",
+ " \"Two Seaters\"\n",
+ " ],\n",
+ " }\n",
+ "\n",
+ " vclass_map = {original: category\n",
+ " for category, originals in vclass_buckets.items()\n",
+ " for original in originals\n",
+ " }\n",
+ "\n",
+ " data[\"VClass\"] = data[\"VClass\"].map(vclass_map) # classes not listed above become NaN\n",
+ "\n",
+ " return data\n",
+ "\n",
+ "vehicles = simplify_vehicle_class(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "1dbc377a-4f2b-43ab-88fe-087719090c05",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def categorize_transmissions(data):\n",
+ " \"\"\"\n",
+ " Splits transmission data into categories for easier analysis.\n",
+ " Replaces the single 'trany' column with two new columns:\n",
+ " 'trans' indicating if the vehicle is automatic or manual,\n",
+ " 'gears' indicating the number of gears the vehicle has.\n",
+ "\n",
+ " Takes a DataFrame and returns a modified version of that DataFrame.\n",
+ "\n",
+ " >>> vehicles_test = vehicles_test.copy()\n",
+ " >>> categorized = categorize_transmissions(vehicles_test)\n",
+ " >>> categorized[['trans', 'gears']].head().to_dict(orient='list')\n",
+ " {'trans': ['Automatic', 'Automatic', 'Automatic', 'Automatic', 'Manual'], 'gears': [4, 1, 1, 9, 5]}\n",
+ " >>> 'trany' in categorized.columns\n",
+ " False\n",
+ " >>> categorized.loc[0, 'trans']\n",
+ " 'Automatic'\n",
+ " >>> categorized.loc[0, 'gears']\n",
+ " 4\n",
+ " >>> categorized.loc[categorized['trans'] == 'Manual', 'gears'].iloc[0]\n",
+ " 5\n",
+ " \"\"\"\n",
+ "\n",
+ " transmission_groups = {\n",
+ " \"Manual\": {\n",
+ " 3: [\"Manual 3-spd\"],\n",
+ " 4: [\"Manual 4-spd\", \"Manual 4-spd Doubled\"],\n",
+ " 5: [\"Manual 5-spd\"],\n",
+ " 6: [\"Manual 6-spd\"],\n",
+ " 7: [\"Manual 7-spd\"],\n",
+ " },\n",
+ "\n",
+ " \"Automatic\": {\n",
+ " 1: [\n",
+ " \"Automatic (variable gear ratios)\",\n",
+ " \"Automatic (CVT)\",\n",
+ " \"Automatic (A1)\"\n",
+ " ],\n",
+ " 3: [\"Automatic 3-spd\", \"Automatic (L3)\"],\n",
+ " 4: [\"Automatic 4-spd\", \"Automatic (S4)\", \"Automatic (L4)\"],\n",
+ " 5: [\"Automatic 5-spd\", \"Automatic (S5)\", \"Automatic (AM5)\"],\n",
+ " 6: [\n",
+ " \"Automatic 6-spd\", \"Automatic (S6)\", \"Automatic (AM6)\", \n",
+ " \"Automatic (AM-S6)\", \"Automatic (AV-S6)\"\n",
+ " ],\n",
+ " 7: [\n",
+ " \"Automatic 7-spd\", \"Automatic (S7)\", \"Automatic (AM7)\", \n",
+ " \"Automatic (AM-S7)\", \"Automatic (AV-S7)\"\n",
+ " ],\n",
+ " 8: [\n",
+ " \"Automatic 8-spd\", \"Automatic (S8)\", \"Automatic (AM8)\", \n",
+ " \"Automatic (AM-S8)\", \"Automatic (AV-S8)\"\n",
+ " ],\n",
+ " 9: [\"Automatic 9-spd\", \"Automatic (S9)\", \"Automatic (AM-S9)\"],\n",
+ " 10: [\"Automatic 10-spd\", \"Automatic (S10)\", \"Automatic (AV-S10)\"],\n",
+ " }\n",
+ " }\n",
+ "\n",
+ " # flatten the nested dict into a map: trany string -> (\"Automatic\" or \"Manual\", gears)\n",
+ " transmission_map = {trany: (category, gears)\n",
+ " for category, gear_map in transmission_groups.items()\n",
+ " for gears, trany_list in gear_map.items()\n",
+ " for trany in trany_list\n",
+ " }\n",
+ "\n",
+ " \n",
+ " data[[\"trans\", \"gears\"]] = data[\"trany\"].map(transmission_map).apply(pd.Series)\n",
+ "\n",
+ " return data.drop(\"trany\", axis=1)\n",
+ "\n",
+ "vehicles = categorize_transmissions(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "824fe050-360f-4ab3-9959-6ad4d907fd74",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def categorize_makes(data):\n",
+ " \"\"\"\n",
+ " Categorizes vehicle makes by their origin.\n",
+ "\n",
+ " (Ex. Chevrolet is American, BMW is German, Honda is Japanese)\n",
+ "\n",
+ " Takes a DataFrame and returns a modified version of that DataFrame.\n",
+ "\n",
+ " >>> vehicles_test = vehicles_test.copy()\n",
+ " >>> categorized = categorize_makes(vehicles_test)\n",
+ " >>> sorted(categorized['origin'].unique().tolist())\n",
+ " ['American', 'European', 'German', 'Japanese']\n",
+ " >>> categorized.loc[categorized['make'] == 'Chevrolet', 'origin'].iloc[0]\n",
+ " 'American'\n",
+ " >>> categorized.loc[categorized['make'] == 'Ferrari', 'origin'].iloc[0]\n",
+ " 'European'\n",
+ " >>> categorized.loc[categorized['make'] == 'Volkswagen', 'origin'].iloc[0]\n",
+ " 'German'\n",
+ " >>> categorized.loc[categorized['make'] == 'Honda', 'origin'].iloc[0]\n",
+ " 'Japanese'\n",
+ " \"\"\"\n",
+ " \n",
+ " regions = {\n",
+ " \"American\": [\n",
+ " \"Chevrolet\", \"Ford\", \"Cadillac\", \"Buick\", \"GMC\", \"Chrysler\",\n",
+ " \"Dodge\", \"Jeep\", \"Ram\", \"Lincoln\", \"Tesla\"\n",
+ " ],\n",
+ "\n",
+ " \"German\": [\n",
+ " \"Audi\", \"BMW\", \"BMW Alpina\", \"Mercedes-Benz\", \"Maybach\",\n",
+ " \"Porsche\", \"Volkswagen\", \"smart\"\n",
+ " ],\n",
+ "\n",
+ " \"European\": [\n",
+ " \"Alfa Romeo\", \"Aston Martin\", \"Bentley\", \"Ferrari\", \"Lamborghini\", \"Lotus\", \n",
+ " \"Maserati\", \"Rolls-Royce\", \"Jaguar\", \"Land Rover\", \"MINI\",\"McLaren Automotive\", \n",
+ " \"Pagani\", \"Koenigsegg\", \"Volvo\", \"Fiat\", \"Bugatti\"\n",
+ " ],\n",
+ "\n",
+ " \"Japanese\": [\n",
+ " \"Toyota\", \"Honda\", \"Acura\", \"Nissan\", \"Infiniti\", \"Mazda\",\n",
+ " \"Lexus\", \"Mitsubishi\", \"Subaru\", \"Suzuki\", \"Isuzu\", \"Daihatsu\"\n",
+ " ],\n",
+ "\n",
+ " \"Korean\": [\n",
+ " \"Hyundai\", \"Kia\", \"Genesis\", \"Daewoo\"\n",
+ " ],\n",
+ "\n",
+ " \"Defunct/Low-volume\": [\n",
+ " \"Pontiac\", \"Oldsmobile\", \"Mercury\", \"Plymouth\", \"Saturn\", \"Hummer\", \"American Motors Corporation\", \n",
+ " \"Avanti Motor Corporation\", \"Geo\", \"Eagle\", \"Panoz Auto-Development\", \"Vector\", \"Saleen Performance\",\n",
+ " \"Fisker\", \"Karma\", \"Mobility Ventures LLC\", \"VPG\", \"Bitter Gmbh and Co. Kg\", \"Shelby\", \"Saleen\", \"SRT\", \n",
+ " \"Roush Performance\", \"Ruf Automobile Gmbh\", \"Spyker\", \"Morgan\", \"Peugeot\", \"Renault\", \"Saab\", \"Dacia\", \n",
+ " \"Kenyon Corporation Of America\", \"Bill Dovell Motor Car Company\", \"Import Foreign Auto Sales Inc\", \n",
+ " \"Import Trade Services\", \"Lambda Control Systems\", \"London Coach Co Inc\", \"Vixen Motor Company\", \n",
+ " \"CX Automotive\", \"Red Shift Ltd.\", \"ASC Incorporated\", \"CCC Engineering\", \"Grumman Olson\", \n",
+ " \"S and S Coach Company E.p. Dutton\", \"Superior Coaches Div E.p. Dutton\", \"PAS, Inc\", \"PAS Inc - GMC\",\n",
+ " \"Quantum Technologies\", \"Tecstar, LP\", \"Azure Dynamics\", \"CODA Automotive\", \"Dabryan Coach Builders Inc\", \n",
+ " \"Federal Coach\", \"Environmental Rsch and Devp Corp\", \"Evans Automobiles\", \"Consulier Industries Inc\", \n",
+ " \"Laforza Automobile Inc\", \"Goldacre\", \"J.K. Motors\", \"Isis Imports Ltd\", \"Wallace Environmental\",\n",
+ " \"Autokraft Limited\", \"Panther Car Company Limited\", \"TVR Engineering Ltd\", \"Aurora Cars Ltd\", \"Sterling\", \n",
+ " \"Mcevoy Motors\", \"Yugo\", \"London Taxi\", \"Bertone\", \"Merkur\", \"E. P. Dutton, Inc.\", \"Excalibur Autos\",\n",
+ " \"JBA Motorcars, Inc.\", \"Grumman Allied Industries\", \"Qvale\", \"Texas Coach Company\", \"BYD\", \"Pininfarina\", \n",
+ " \"AM General\", \"Volga Associated Automobile\", \"General Motors\", \"Panos\", \"Scion\", \"Mahindra\"\n",
+ " ]\n",
+ " }\n",
+ "\n",
+ " # invert the dictionary into a map: make -> region\n",
+ " make_map = {make: region \n",
+ " for region, makes in regions.items() \n",
+ " for make in makes}\n",
+ "\n",
+ " data[\"origin\"] = data[\"make\"].map(make_map)\n",
+ "\n",
+ " return data\n",
+ " \n",
+ "vehicles = categorize_makes(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "d74a80b3-f199-4ff4-9e3d-d785fbcb4135",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def convert_data(data):\n",
+ " \"\"\"\n",
+ " This function will convert categorical data into booleans.\n",
+ "\n",
+ " Example: 'sCharger' is currently 'S' or 'NaN', but True or False would make more sense.\n",
+ "\n",
+ " Takes a DataFrame and returns a modified version of that DataFrame.\n",
+ "\n",
+ " >>> vehicles_test = vehicles_test.copy()\n",
+ " >>> converted = convert_data(vehicles_test)\n",
+ " >>> converted[['sCharger', 'tCharger', 'startStop']].dtypes.tolist()\n",
+ " [dtype('bool'), dtype('bool'), dtype('bool')]\n",
+ "\n",
+ " >>> converted.loc[0, 'sCharger']\n",
+ " False\n",
+ " >>> converted.loc[0, 'tCharger']\n",
+ " True\n",
+ " >>> converted.loc[converted['startStop'] == True, 'startStop'].iloc[0]\n",
+ " True\n",
+ " \"\"\"\n",
+ "\n",
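+ " data[\"sCharger\"] = data[\"sCharger\"].eq(\"S\") # True where the value is \"S\", False otherwise (including NaN)\n",
+ " data[\"tCharger\"] = data[\"tCharger\"].eq(\"T\") # True where the value is \"T\", False otherwise (including NaN)\n",
+ " data[\"startStop\"] = data[\"startStop\"].eq(\"Y\") # True where the value is \"Y\", False otherwise (including NaN)\n",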
+ "\n",
+ " return data\n",
+ "\n",
+ "vehicles = convert_data(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "45166d2e-19fe-46cd-baa8-c6b63d4f81ad",
+ "metadata": {},
+ "source": [
+ "### Give a summary of each variable of interest.\n",
+ "\n",
+ " For quantitative variables, this will be a seven-number summary (mean, standard deviation, minimum, first quartile, median, third quartile, maximum) from DataFrame.describe \n",
+ "\n",
+ "For categorical variables, this will be a list of all the unique values that the variable can take, along with a count for each value.\n",
+ "For variables that don't quite suit automated Pandas summarization, write your own summary of the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "4c2635a0-83e5-479d-afbb-b75ecc433959",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "<thead>\n",
+ "<tr style=\"text-align: right;\"><th></th><th>city08</th><th>comb08</th><th>cylinders</th><th>displ</th><th>highway08</th><th>year</th><th>gears</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>mean</th><td>18.276533</td><td>20.525550</td><td>5.720610</td><td>3.299097</td><td>24.416618</td><td>2001.342890</td><td>4.919740</td></tr>\n",
+ "<tr><th>std</th><td>7.555437</td><td>7.369281</td><td>1.755344</td><td>1.359130</td><td>7.489017</td><td>11.046497</td><td>1.435478</td></tr>\n",
+ "<tr><th>min</th><td>6.000000</td><td>7.000000</td><td>2.000000</td><td>0.000000</td><td>9.000000</td><td>1984.000000</td><td>1.000000</td></tr>\n",
+ "<tr><th>25%</th><td>15.000000</td><td>17.000000</td><td>4.000000</td><td>2.200000</td><td>20.000000</td><td>1991.000000</td><td>4.000000</td></tr>\n",
+ "<tr><th>50%</th><td>17.000000</td><td>20.000000</td><td>6.000000</td><td>3.000000</td><td>24.000000</td><td>2002.000000</td><td>5.000000</td></tr>\n",
+ "<tr><th>75%</th><td>20.000000</td><td>23.000000</td><td>6.000000</td><td>4.300000</td><td>28.000000</td><td>2011.000000</td><td>6.000000</td></tr>\n",
+ "<tr><th>max</th><td>150.000000</td><td>136.000000</td><td>16.000000</td><td>8.400000</td><td>123.000000</td><td>2019.000000</td><td>10.000000</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " city08 comb08 cylinders displ highway08 year \\\n",
+ "mean 18.276533 20.525550 5.720610 3.299097 24.416618 2001.342890 \n",
+ "std 7.555437 7.369281 1.755344 1.359130 7.489017 11.046497 \n",
+ "min 6.000000 7.000000 2.000000 0.000000 9.000000 1984.000000 \n",
+ "25% 15.000000 17.000000 4.000000 2.200000 20.000000 1991.000000 \n",
+ "50% 17.000000 20.000000 6.000000 3.000000 24.000000 2002.000000 \n",
+ "75% 20.000000 23.000000 6.000000 4.300000 28.000000 2011.000000 \n",
+ "max 150.000000 136.000000 16.000000 8.400000 123.000000 2019.000000 \n",
+ "\n",
+ " gears \n",
+ "mean 4.919740 \n",
+ "std 1.435478 \n",
+ "min 1.000000 \n",
+ "25% 4.000000 \n",
+ "50% 5.000000 \n",
+ "75% 6.000000 \n",
+ "max 10.000000 "
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vehicles.describe()[1:] # remove count"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "0f9bc32c-93f7-4d4f-b985-8049f8f29611",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "drive [nan, 2-Wheel Drive, Rear-Wheel Drive, 4-Wheel...\n",
+ "fuelType1 [Regular Gasoline, Diesel, Premium Gasoline, N...\n",
+ "fuelType2 [nan, Natural Gas, E85, Propane, Electricity]\n",
+ "make [Alfa Romeo, Bertone, Chevrolet, Nissan, Ford,...\n",
+ "model [Spider Veloce 2000, X1/9, Corvette, 300ZX, EX...\n",
+ "VClass [Coupe, Sedan, Wagon, Truck, Van, Special Purp...\n",
+ "sCharger [False, True]\n",
+ "tCharger [False, True]\n",
+ "startStop [False, True]\n",
+ "trans [Manual, Automatic, nan]\n",
+ "origin [European, Defunct/Low-volume, American, Japan...\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def categorical_summary(data):\n",
+ " cat_data = data.select_dtypes(exclude=\"number\") # select non-numeric columns\n",
+ " summary = cat_data.apply(pd.Series.unique) # get unique values per column\n",
+ " \n",
+ " return summary\n",
+ "\n",
+ "categorical_summary(vehicles)"
+ ]
+ },
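+ {
+ "cell_type": "markdown",
+ "id": "categorical-counts-sketch",
+ "metadata": {},
+ "source": [
+ "The unique-value listing above does not show how often each value occurs. Below is a minimal sketch of one way to add counts, using a small stand-in DataFrame (`demo`) rather than the real `vehicles` data. The names `categorical_counts` and `max_levels` are hypothetical and not part of the original analysis.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "def categorical_counts(data, max_levels=20):\n",
+ "    # Return {column: value counts} for non-numeric columns with few distinct values\n",
+ "    cat_data = data.select_dtypes(exclude='number')\n",
+ "    counts = {}\n",
+ "    for col in cat_data.columns:\n",
+ "        # dropna=False keeps NaN as its own entry so missing data stays visible\n",
+ "        vc = cat_data[col].value_counts(dropna=False)\n",
+ "        if len(vc) <= max_levels:  # skip high-cardinality columns such as model names\n",
+ "            counts[col] = vc\n",
+ "    return counts\n",
+ "\n",
+ "# Stand-in data for illustration only\n",
+ "demo = pd.DataFrame({\n",
+ "    'trans': ['Manual', 'Automatic', 'Automatic', None],\n",
+ "    'year': [1984, 1985, 1985, 1986],\n",
+ "})\n",
+ "result = categorical_counts(demo)\n",
+ "print(result['trans'])\n",
+ "```\n",
+ "\n",
+ "Capping the number of levels keeps columns like `model`, which has a very large number of distinct values, from flooding the summary."
+ ]
+ },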
+ {
+ "cell_type": "markdown",
+ "id": "a7516488-3f56-434a-99fa-f05d4c24f3db",
+ "metadata": {},
+ "source": [
+ "## EDA Results\n",
+ "\n",
+ "1. The most helpful steps I took during my EDA were organizing the data and grouping it into sub-categories. This helped with training my machine learning model and with making simpler visualizations that made sense. Cleaning the data by dropping columns that were not important to my research questions and removing NaN entries also saved time later in the project.\n",
+ "2. I learned more about what the different columns of my data represent. For example, several kinds of MPG columns are present in the dataset, including rounded versions of both city and highway MPG. I also learned that there are many outliers, which makes sense because many different kinds of vehicles were produced. Finally, because of changes in vehicle technology and features, many vehicles have NaN values in columns that were added for modern testing (for example, start-stop features).\n",
+ "3. I gained a better understanding of the data setting because I learned that MPG trends generally increase over the 40 years of vehicles. Seeing that vehicles evolved over time showed me that the dataset is not uniform across years, because early vehicles lack many of the features that newer vehicles have. The EDA helped shift my perspective from seeing this as just a table of data to seeing it as a historical dataset that represents how vehicles have changed over the past 40 years."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e20a8a52-e215-4267-8054-1467217db3f9",
+ "metadata": {},
+ "source": [
+ "## Research Question 1. \n",
+ "### How has the average fuel economy changed across different vehicle types? Do trends differ between foreign and domestic vehicle manufacturers?\n",
+ "\n",
+ "To answer this question, I will present plots of the data, plotting MPG against different vehicle variables. This is also meant to meet the challenge goal of using interactive plots from another library (Plotly). To interact with the plots, hover over the graph and it will show the data for the X coordinate that you are hovering over. You can also touch items in the legend to remove them from the graph, and the axes will adjust accordingly.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "07348ec4-622e-4fff-aad2-9115f76003a4",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ " \n",
+ " \n",
+ " "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def mpg_by_origin(data):\n",
+ " \"\"\"\n",
+ " Returns a line graph that compares the average mpg for different vehicle country of origin\n",
+ "\n",
+ " Takes a DataFrame to plot and returns the plot.\n",
+ " \"\"\"\n",
+ " \n",
+ " grouped = data.groupby([\"year\", \"origin\"], as_index=False)[\"comb08\"].mean()\n",
+ "\n",
+ " # drop defunct/low-volume makes\n",
+ " grouped = grouped[grouped[\"origin\"] != \"Defunct/Low-volume\"]\n",
+ "\n",
+ " \n",
+ " # create plot\n",
+ " fig = px.line(\n",
+ " grouped,\n",
+ " x=\"year\",\n",
+ " y=\"comb08\",\n",
+ " color=\"origin\",\n",
+ " title=\"Comparing Combined MPG for Different Vehicle Origins Over Time\"\n",
+ " )\n",
+ "\n",
+ " # labels and hover\n",
+ " fig.update_layout(\n",
+ " title_font=dict(size=20, family=\"Arial\", color=\"#333\"),\n",
+ " xaxis_title=\"Model Year\",\n",
+ " yaxis_title=\"Combined Avg. MPG (city & highway)\",\n",
+ " legend_title=\"Origin\",\n",
+ " template=\"plotly_white\", # background\n",
+ " hovermode=\"x unified\",\n",
+ " )\n",
+ "\n",
+ " # Make lines thicker and markers more visible\n",
+ " fig.update_traces(mode=\"lines+markers\", line=dict(width=2))\n",
+ "\n",
+ " return fig\n",
+ "\n",
+ "mpg_by_origin(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19bed509-8672-4dc8-8535-12644715f4e9",
+ "metadata": {},
+ "source": [
+ "**Figure 1 — Comparing Combined MPG for Different Vehicle Origins Over Time.** This plot shows the average combined MPG for each vehicle origin in each model year. Origins were assigned by manually sorting each vehicle make into an origin category. The graph shows that vehicle MPG has gone up, with the most drastic changes beginning after 2005. Some origins, like Japanese and Korean, have been near the higher end of MPG throughout the dataset, while others, like European, have been near the lower end (this could be because many of the luxury vehicles and supercars in the dataset are European). In the last year of the dataset, Korean vehicles have a significantly higher MPG than other origins."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "40b9a112-575b-4323-9d6f-a3b85743b29b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def mpg_by_vclass(data):\n",
+ " \"\"\"\n",
+ " Returns a line graph that compares the average mpg for different vehicle classes\n",
+ "\n",
+ " Takes a DataFrame to plot and returns the plot.\n",
+ " \"\"\"\n",
+ " \n",
+ " grouped = data.groupby([\"year\", \"VClass\"], as_index=False)[\"comb08\"].mean()\n",
+ "\n",
+ " \n",
+ " # create plot\n",
+ " fig = px.line(\n",
+ " grouped,\n",
+ " x=\"year\",\n",
+ " y=\"comb08\",\n",
+ " color=\"VClass\",\n",
+ " title=\"Comparing Combined MPG for Different Vehicle Classes Over Time\"\n",
+ " )\n",
+ "\n",
+ " # labels and hover\n",
+ " fig.update_layout(\n",
+ " title_font=dict(size=20, family=\"Arial\", color=\"#333\"),\n",
+ " xaxis_title=\"Model Year\",\n",
+ " yaxis_title=\"Combined Avg. MPG (city & highway)\",\n",
+ " legend_title=\"Vehicle Class\",\n",
+ " template=\"plotly_white\", # background\n",
+ " )\n",
+ "\n",
+ " # Show a marker at each year along the lines\n",
+ " fig.update_traces(mode=\"lines+markers\")\n",
+ "\n",
+ " return fig\n",
+ "\n",
+ "mpg_by_vclass(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b3342ffe-7b7c-4b7e-9206-215328da6f72",
+ "metadata": {},
+ "source": [
+ "**Figure 2 — Comparing Combined MPG for Different Vehicle Classes Over Time** This plots the average combined MPG for different vehicle classes each year. It was produced by sorting the highly specific vehicle categories into more general categories. Comparing the MPG trends for vehicle classes, we can see that wagons have significantly higher MPG than other types of vehicles, with sedans being the next most efficient. Wagons also had much larger fuel economy increases than other vehicle classes, while some classes, like trucks and vans, have barely made any MPG increases in the last 40 years!\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "efce715e-2dde-442d-8e05-d484c0b3381e",
+ "metadata": {},
+ "source": [
+ "## Research Question 2.\n",
+ "### What is the relationship between engine characteristics (displacement, cylinders, transmissions) and fuel efficiency? How has this relationship changed with the recent shift to hybrid vehicles?\n",
+ "To answer this question, I will present interactive plots that show how changes in these characteristics affect fuel efficiency in vehicles. I will also compare hybrid vehicles to gasoline vehicles, starting from the first year that hybrids were produced."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "d0735f9e-ffe4-4984-bf42-c907494a3c2c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def mpg_by_cylinder(data):\n",
+ " \"\"\"\n",
+ " Returns a plot comparing the combined MPG over time for cars with different amounts of \n",
+ " cylinders in their engine. Takes a DataFrame containing vehicle data as input.\n",
+ " \"\"\"\n",
+ " data=data[(data[\"cylinders\"] >= 4) & (data[\"cylinders\"] <= 10)]\n",
+ "\n",
+ " grouped = (\n",
+ " data.groupby([\"year\", \"cylinders\"], as_index=False)[\"comb08\"]\n",
+ " .mean()\n",
+ " )\n",
+ "\n",
+ " # create plot\n",
+ " fig = px.line(\n",
+ " grouped,\n",
+ " x=\"year\",\n",
+ " y=\"comb08\",\n",
+ " color=\"cylinders\",\n",
+ " title=\"Comparing Combined MPG for Different Cylinder Counts Over Time\"\n",
+ " )\n",
+ "\n",
+ " # labels and hover\n",
+ " fig.update_layout(\n",
+ " title_font=dict(size=20, family=\"Arial\", color=\"#333\"),\n",
+ " xaxis_title=\"Model Year\",\n",
+ " yaxis_title=\"Combined Avg. MPG (city & highway)\",\n",
+ " legend_title=\"Cylinders\",\n",
+ " template=\"plotly_white\", # background\n",
+ " hovermode=\"x unified\",\n",
+ " )\n",
+ "\n",
+ " # Make lines thicker and markers more visible\n",
+ " fig.update_traces(mode=\"lines+markers\")\n",
+ "\n",
+ " return fig\n",
+ "\n",
+ "mpg_by_cylinder(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "084ddd96-70a4-4f85-ae10-6d15fd6db449",
+ "metadata": {},
+ "source": [
+ "**Figure 3 — Comparing Combined MPG for Different Cylinder Counts Over Time** This plots the combined average MPG for different cylinder counts over time. It was produced by filtering out uncommon cylinder counts (fewer than 4 or more than 10). We can see that MPG has improved at different rates for different cylinder counts. It also shows that most of the increases in fuel efficiency occurred between 2005 and 2015, with smaller increases before and after those dates (especially in the 4 cylinder category). This graph also shows that MPG has not increased steadily over time, and that average MPG decreased in certain years, which might be surprising to some. 4 cylinder vehicles seem to have made the most improvement in MPG, but not by much. It seems that the more cylinders a vehicle has, the less its MPG has improved since the beginning of the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "60e7a545-f57f-4b36-bb5a-a4f1a69b808e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def mpg_by_displ_range_hybrid(data):\n",
+ " bins = [0, 1.5, 2.5, 3.5, 5.0, 20]\n",
+ " labels = [\"≤1.5L\", \"1.6–2.5L\", \"2.6–3.5L\", \"3.6–5.0L\", \"≥5.1L\"]\n",
+ " data[\"displ_range\"] = pd.cut(x=data[\"displ\"], bins=bins, labels=labels)\n",
+ "\n",
+ " # hybrid indicator\n",
+ " data[\"is_hybrid\"] = data[\"fuelType2\"].str.contains(\"Electricity\", na=False) # no second fuel type counts as non-hybrid\n",
+ "\n",
+ " # remove years before first hybrid\n",
+ " data = data[data[\"year\"] >= 2000]\n",
+ "\n",
+ " # passing observed explicitly avoids the pandas deprecation warning for categorical groupers\n",
+ " grouped = (\n",
+ " data.groupby([\"year\", \"displ_range\", \"is_hybrid\"], as_index=False, observed=False)[\"comb08\"]\n",
+ " .mean()\n",
+ " )\n",
+ "\n",
+ " fig = px.line(\n",
+ " grouped,\n",
+ " x=\"year\",\n",
+ " y=\"comb08\",\n",
+ " color=\"displ_range\",\n",
+ " line_dash=\"is_hybrid\",\n",
+ " title=\"Comparing Combined MPG for Different Engine Displacements Over Time\"\n",
+ " )\n",
+ "\n",
+ "\n",
+ " fig.update_layout(\n",
+ " xaxis_title=\"Model Year\",\n",
+ " yaxis_title=\"Combined MPG\",\n",
+ " legend_title=\"Displacement Range / Is Hybrid?\",\n",
+ " template=\"plotly_white\",\n",
+ " )\n",
+ "\n",
+ " fig.update_traces(mode=\"lines+markers\")\n",
+ "\n",
+ " return fig\n",
+ "\n",
+ "mpg_by_displ_range_hybrid(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "34853cc8-b05e-448c-a422-3d44ebab1494",
+ "metadata": {},
+ "source": [
+ "**Figure 4 — Comparing Combined MPG for Different Engine Displacements Over Time** This plots average combined MPG for different displacement categories over time. Lines are dashed for categories that are hybrid (second fuel type is electricity). We can see in this graph that hybrid vehicles generally have a much higher MPG than non-hybrid vehicles do, as expected. The graph also shows that hybrid vehicles have not made major improvements in MPG since they were first released, and that improvements in MPG are less pronounced once vehicles are split into displacement categories. This might mean that the general rise in MPG is partly due to the average displacement of vehicles getting smaller.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "a0768cf7-1af5-48b3-b641-80ad5dc463ef",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def mpg_by_gears(data):\n",
+ " \"\"\"\n",
+ " Returns a plot comparing the combined MPG over time for cars with different numbers of \n",
+ " transmission gears, split into automatic and manual. Takes a DataFrame containing vehicle data as input.\n",
+ " \"\"\"\n",
+ "\n",
+ " data[\"is_automatic\"] = data[\"trans\"].str.contains(\"Automatic\")\n",
+ "\n",
+ " \n",
+ " grouped = (\n",
+ " data.groupby([\"year\", \"gears\", \"is_automatic\"], as_index=False, observed=False)[\"comb08\"]\n",
+ " .mean()\n",
+ " )\n",
+ " \n",
+ " # create plot\n",
+ " fig = px.line(\n",
+ " grouped,\n",
+ " x=\"year\",\n",
+ " y=\"comb08\",\n",
+ " color=\"gears\",\n",
+ " line_dash=\"is_automatic\",\n",
+ " title=\"Comparing Combined MPG for Different Transmission Types Over Time\"\n",
+ " )\n",
+ "\n",
+ " # labels and legend\n",
+ " fig.update_layout(\n",
+ " xaxis_title=\"Model Year\",\n",
+ " yaxis_title=\"Combined MPG\",\n",
+ " legend_title=\"Transmission Gears / Is Automatic\",\n",
+ " template=\"plotly_white\" # background color\n",
+ " )\n",
+ "\n",
+ " # make more visible at each year marker\n",
+ " fig.update_traces(mode=\"lines+markers\")\n",
+ "\n",
+ " return fig\n",
+ "\n",
+ "mpg_by_gears(vehicles)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "20dc728d-d958-416b-938a-311c5de8bcf4",
+ "metadata": {},
+ "source": [
+ "**Figure 5 — Comparing Combined MPG for Different Transmission Types Over Time** This plots combined average MPG for different numbers of transmission gears. A line is dashed if it represents an automatic transmission. This plot shows that transmissions with one gear (CVT) have significantly higher fuel economy than other transmission types. The next highest is the 5 speed manual transmission. This might be due to the smaller sample size, and to the possibility that modern vehicles with manuals tend to be cheaper economy cars, whereas the averages of other transmission categories could be brought down by inefficient vehicles like trucks.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d6730dff-7b2c-4de9-8591-0d261d7687c7",
+ "metadata": {},
+ "source": [
+ "## Research Question 3.\n",
+ "### Can we predict a vehicle’s MPG based on its engine displacement, number of cylinders, and other identifying characteristics?\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "78da0323-7ea8-428e-bc4f-e48c2d975394",
+ "metadata": {},
+ "source": [
+ "Here we begin to train and test a machine learning model that can predict the MPG of a vehicle."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "848ecc1a-f253-4e69-86d9-cb8b355c3ee5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def create_model_dataset(data, critical_cols, drop_cols, target_col):\n",
+ " \"\"\"\n",
+ " Creates dummy and target DataFrames for training a machine learning model.\n",
+ " Takes a DataFrame to get data.\n",
+ " critical_cols indicating the critical columns that cannot contain missing data.\n",
+ " drop_cols indicating the columns that should be dropped (data to predict \n",
+ " or data that would make predicting too obvious)\n",
+ " target_col indicating the column that has data that will be predicted. \n",
+ " \"\"\"\n",
+ " # drop rows missing critical data\n",
+ " cleaned = data.dropna(subset=critical_cols)\n",
+ "\n",
+ " # set target before dropping\n",
+ " target = cleaned[target_col]\n",
+ "\n",
+ " # drop columns without features (and target)\n",
+ " features = cleaned.drop(drop_cols, axis=1)\n",
+ "\n",
+ " # turns categorical data into T/F\n",
+ " dummies = pd.get_dummies(features)\n",
+ "\n",
+ " return dummies, target\n",
+ "\n",
+ "critical_cols = [\"city08\", \"highway08\", \"comb08\", \"fuelType1\", \"cylinders\", \"trans\", \"gears\", \"VClass\"]\n",
+ "drop_cols = [\"city08\", \"highway08\", \"comb08\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d214746-8fde-47be-a60f-c0ba39f6f7d7",
+ "metadata": {},
+ "source": [
+ "First, create the dataset that will be used for training. We choose these critical columns because they will be key factors in predicting MPG. We drop comb08 because it is the value we are trying to predict, and we drop highway08 and city08 as well because comb08 is computed directly from them (the EPA combined figure weights city driving at 55% and highway driving at 45%), so keeping them would make the prediction trivial."
+ ]
+ },
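+ {
+ "cell_type": "markdown",
+ "id": "combined-mpg-sketch",
+ "metadata": {},
+ "source": [
+ "To illustrate why city08 and highway08 would leak the target, here is a minimal sketch of how a combined figure can be derived from the other two. It assumes the EPA's 55% city / 45% highway weighting applied as a harmonic mean. The function name `combined_mpg` and the example inputs are illustrative only and are not taken from the dataset.\n",
+ "\n",
+ "```python\n",
+ "def combined_mpg(city, highway, city_share=0.55):\n",
+ "    # Fuel use adds up in gallons per mile, so invert, weight, and invert back\n",
+ "    return 1.0 / (city_share / city + (1.0 - city_share) / highway)\n",
+ "\n",
+ "# Hypothetical vehicle: 20 MPG city, 30 MPG highway\n",
+ "example = combined_mpg(20, 30)\n",
+ "print(round(example, 1))\n",
+ "```\n",
+ "\n",
+ "Applied to the column means from the summary table (about 18.28 city and 24.42 highway), this gives roughly 20.6, close to the mean comb08 of 20.53. An exact match is not expected, since the published values are rounded and a formula applied to averages is not the same as the average of the formula."
+ ]
+ },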
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "03812018-a6c6-4f1e-9d50-53213d51a720",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def train_model(X, y):\n",
+ " \"\"\"\n",
+ " Trains a machine learning model (XGBRegressor) with fixed hyperparameters.\n",
+ " Takes X data used for predicting, y data which is the column to predict.\n",
+ " Returns the fitted model along with the held-out testing X and y data\n",
+ " (a 20% split that was not used for training at all,\n",
+ " so it can be used to evaluate the model).\n",
+ " \"\"\"\n",
+ " # splits training and testing data using tuple unpacking\n",
+ " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n",
+ "\n",
+ " model = XGBRegressor(\n",
+ " n_estimators=300,\n",
+ " max_depth=3,\n",
+ " learning_rate=0.05\n",
+ " )\n",
+ " model.fit(X_train, y_train)\n",
+ " \n",
+ "\n",
+ " return model, X_test, y_test\n",
+ "\n",
+ "X, y = create_model_dataset(vehicles, critical_cols, drop_cols, \"comb08\")\n",
+ "\n",
+ "predictor, X_test, y_test = train_model(X, y)\n",
+ "\n",
+ "training_columns = X.columns"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3865f4be-b6fc-4412-aca7-f860cc797908",
+ "metadata": {},
+ "source": [
+ "Here we train an XGBRegressor model using the model dataset returned by our previously written function. The data is split 80/20, and the model is fit on the 80% training portion. The function returns the fitted model along with the 20% of the data held out for testing. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c7572fec-4a5c-4533-a21a-5d4117888772",
+ "metadata": {},
+ "source": [
+ "#### Testing Process:\n",
+ "I initially trained a deeper model with a max_depth of 5 and 700 estimators. This model achieved a very low error on the 20% testing split from the same dataset, but produced a much higher error on modern data from after 2020, which suggests it was overfitting to the older vehicles. The model was likely learning detailed patterns from older vehicles in the dataset that no longer hold true for modern vehicles. I switched to a lower max depth and fewer estimators so that the model would focus on broader patterns instead of memorizing specific ones. This reduces overfitting to the training-era data and should help the model adapt to newer vehicles with modern technology like hybrid systems and new transmission types. The resulting model performs worse on the old data but is much better at predicting data from newer model years."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "6942f113-905e-4451-815b-7fb3ace38f4b",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Error: 1.645411634919229 MPG\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f\"Error: {root_mean_squared_error(y_test, predictor.predict(X_test))} MPG\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "9faafbf2-6129-40ed-813b-d4aba4d4fd59",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def align_columns(test, train):\n",
+ "    \"\"\"\n",
+ "    Aligns the columns of a new testing dataset with the columns used for training:\n",
+ "    columns missing from the testing dataset are added and filled with 0, and\n",
+ "    extra columns that weren't present in the training dataset are removed.\n",
+ "\n",
+ "    Takes a new testing DataFrame and the training column labels.\n",
+ "    Returns a new DataFrame with columns in training order (the input is not modified).\n",
+ "    \"\"\"\n",
+ "    # reindex adds missing columns (filled with 0), drops extras, and reorders in one step\n",
+ "    return test.reindex(columns=train, fill_value=0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9233ec44-9ab0-4b5a-a63d-ebabc04bbdde",
+ "metadata": {},
+ "source": [
+ "I ran into some problems with mismatched columns in the newer dataset that I am using for testing. This function fixes that. It allows us to test on a newer dataset of vehicles released after 2020, \n",
+ "which does not contain certain makes or features that older cars have. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "815066be-90fa-4d44-b5ac-2858ff24878a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def load_new_data(csv):\n",
+ " \"\"\"\n",
+ " Loads a new dataset using all previously used data manipulation functions.\n",
+ " Takes a path to a CSV as input. Returns a dataframe with functions applied to it.\n",
+ " \"\"\"\n",
+ " testing_data = load_data(csv)\n",
+ " testing_data = clean_data(testing_data)\n",
+ " testing_data = simplify_vehicle_class(testing_data)\n",
+ " testing_data = categorize_transmissions(testing_data)\n",
+ " testing_data = categorize_makes(testing_data)\n",
+ " testing_data = convert_data(testing_data)\n",
+ "\n",
+ " return testing_data\n",
+ "\n",
+ "new_vehicles = load_new_data(\"vehicles-through-2026.csv\")\n",
+ "new_X, new_y = create_model_dataset(new_vehicles, critical_cols, drop_cols, \"comb08\")\n",
+ "\n",
+ "# keep only model years after 2020, since the training data ends in 2019 (newer cars also lack some older categories, such as natural gas fuel)\n",
+ "new_X = new_X[new_X[\"year\"] > 2020]\n",
+ "new_y = new_y.loc[new_X.index]\n",
+ "\n",
+ "# aligning the columns to fix any missing or extra data\n",
+ "new_X = align_columns(new_X, training_columns)\n",
+ "new_y = new_y.loc[new_X.index]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "630d2aec-afa2-47e8-ada6-c9e48852c645",
+ "metadata": {},
+ "source": [
+ "This loads our new dataset and keeps only cars newer than model year 2020. It then aligns the columns with the training data by calling the previous function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "2e186a8f-920b-4dcc-83f7-7556e5a07e46",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Actual Avg MPG: [15, 28, 25, 54, 21, 21, 31, 24, 20, 17]\n",
+ "Predicted Avg MPG: [19.653326 25.775482 24.99107 40.519886 20.297852 19.846128 25.437662\n",
+ " 25.410978 19.131628 19.101501]\n",
+ "Error: 4.982376422436265 MPG\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def test_modern_model(model, X, y, samples=10):\n",
+ " \"\"\"\n",
+ " Tests a model on a given dataset. Samples a specified amount of rows from the dataset (default = 10).\n",
+ " Prints two lists, one containing the actual MPG, and one containing the MPG predicted by the model.\n",
+ " Also prints the error between the lists.\n",
+ " \"\"\"\n",
+ " random_rows = X.sample(n=samples)\n",
+ " mpg_actual = list(y.loc[random_rows.index])\n",
+ " mpg_pred = model.predict(random_rows)\n",
+ " \n",
+ " if samples <= 25:\n",
+ "     print(f\"Actual Avg MPG: {mpg_actual}\")\n",
+ "     print(f\"Predicted Avg MPG: {mpg_pred}\")\n",
+ " print(f\"Error: {root_mean_squared_error(mpg_actual, mpg_pred)} MPG\")\n",
+ "\n",
+ " \n",
+ " # create new dataframe with predicted and actual data\n",
+ " plot_df = pd.DataFrame({\n",
+ " \"sample\": list(range(1, samples + 1)),\n",
+ " \"Actual MPG\": mpg_actual,\n",
+ " \"Predicted MPG\": mpg_pred,\n",
+ " })\n",
+ "\n",
+ " fig = px.line(plot_df, x=\"sample\", y=[\"Actual MPG\", \"Predicted MPG\"], markers=True, title=\"Predicted MPG Compared to Actual MPG\")\n",
+ "\n",
+ " fig.update_layout(\n",
+ " xaxis_title=\"Sample Index\",\n",
+ " yaxis_title=\"MPG\",\n",
+ " legend_title=\"Data Type\",\n",
+ " template=\"plotly_white\",\n",
+ " hovermode=\"x unified\"\n",
+ " )\n",
+ "\n",
+ " return fig\n",
+ "\n",
+ "test_modern_model(predictor, new_X, new_y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "60aac3de-8236-4e9f-9f89-9a7153717c3a",
+ "metadata": {},
+ "source": [
+ "**Figure 6 — Predicted MPG Compared to Actual MPG** This plots the actual MPG and the MPG predicted by the model for each index of the sample. We sample 10 random vehicles from the dataset.\n"
+ ]
+ },
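+ {
+ "cell_type": "markdown",
+ "id": "rmse-worked-example",
+ "metadata": {},
+ "source": [
+ "To make the printed error easier to interpret, the sketch below recomputes the root mean squared error by hand from the ten actual and predicted values printed in the recorded output above. The sample is drawn randomly, so a fresh run would produce different numbers. The variable names are illustrative.\n",
+ "\n",
+ "```python\n",
+ "import math\n",
+ "\n",
+ "# Values copied from the printed output of the recorded run\n",
+ "actual = [15, 28, 25, 54, 21, 21, 31, 24, 20, 17]\n",
+ "predicted = [19.653326, 25.775482, 24.99107, 40.519886, 20.297852,\n",
+ "             19.846128, 25.437662, 25.410978, 19.131628, 19.101501]\n",
+ "\n",
+ "squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]\n",
+ "rmse = math.sqrt(sum(squared_errors) / len(squared_errors))\n",
+ "\n",
+ "# Share of the total squared error that comes from the single worst prediction\n",
+ "worst_share = max(squared_errors) / sum(squared_errors)\n",
+ "\n",
+ "print(f'RMSE: {rmse:.3f} MPG')\n",
+ "print(f'Largest single contribution: {worst_share:.0%}')\n",
+ "```\n",
+ "\n",
+ "In this recorded sample, the 54 MPG vehicle (predicted at about 40.5) accounts for roughly 73% of the total squared error. Because RMSE squares each miss, a few high-MPG vehicles can dominate the reported figure even when most predictions are within a few MPG."
+ ]
+ },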
+ {
+ "cell_type": "markdown",
+ "id": "afbb315b-34b2-4a19-8a67-7b8f96c24754",
+ "metadata": {},
+ "source": [
+ "## Implications and Limitations\n",
+ "\n",
+ "1. People who are interested in car-buying decisions, environmental policy, or vehicle technology trends could benefit from my analysis. Consumers can use my findings to pick the most fuel efficient vehicle for their needs and see how different engine characteristics impact MPG. Policymakers and vehicle manufacturers can also benefit from seeing where improvements were most significant over the last four decades. One drawback is that modified cars and models newer than 2019 are not covered by the trend analysis. The analysis could also cause harm if the results are interpreted too broadly, for example by assuming that 5 speed manuals are always the second most fuel efficient transmission type even though that might not be the case.\n",
+ "2. The data setting impacted my results because the data was collected over many decades. This means that some features are only present in modern vehicles (like start-stop). These differences affected my analysis because some comparisons were less direct. The shift toward hybrid vehicles and other modern efficiency features also changed MPG patterns, which had an effect on the model's predictions for newer vehicles. The dataset reflects not just vehicle characteristics, but also changes in consumer preferences and government regulation.\n",
+ "3. One limitation is that the dataset only contains vehicles that were officially tested by the EPA, so it ignores modified vehicles and less common or custom vehicles. Another limitation is that the dataset has many missing values and features that were added at different times over the past four decades. This can bias comparisons between certain years. A third limitation is that my model is trained on vehicles through model year 2019, so it might not generalize to future vehicles with newer technology. Others should treat my conclusions as general trends and not strict rules. The relationships between engine characteristics and fuel economy can help build an understanding of broad patterns, but should not be used as an exact prediction for individual vehicles or future technology."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "84e331f5-20f9-442c-ab65-364d4efb118c",
+ "metadata": {},
+ "source": [
+ "## Challenge Goals\n",
+ "\n",
+ "Interactive Visualizations: I plan to meet the challenge goal of creating interactive visualizations using altair or plotly to visualize the data. I think that visualizing the data will help to better understand my first three questions and can be useful to see the trends over time more so than a table of numbers would. My project would meet this goal because it does not overlap with my other challenge goal and would use a library that is not used in class. I plan to design an interactive scatter plot that can filter by vehicle type, model year, and other engine characteristics to see how MPG changes across decades. The interactivity of the visualizations will make it easier to identify complex patterns.\n",
+ "Advanced Machine Learning: I plan to meet the challenge goal of using advanced machine learning to answer my third question of predicting a car's MPG. I would like to feed the data from my data set into the machine learning algorithm, give it the specifications of cars made after 2019 (when the data set ends), and see if it is able to predict the MPG. My project would meet this goal by using different algorithms from scikit-learn and including vehicle characteristics like displacement, cylinders, vehicle weight, transmission, and fuel type. This will also reveal the factors that most strongly influence MPG, as well as test the prediction performance.\n",
+ "\n",
+ "\n",
+ "The interactive visualizations challenge goal was scaled back by having multiple plots instead of a single plot that could be changed. Having a singular plot made it harder to explain what readers should take away from the plot. I also used line plots instead of scatter plots because most data was better displayed as a line because it is time dependent.\n",
+ "\n",
+ "The advanced machine learning challenge goal was not changed. I used an XGBRegressor which is a model not included in scikit-learn to predict vehicle MPG using variables from the original data. I also imported a second dataset containing modern vehicles and used my model to predict the MPG of vehicles from that dataset."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d9d76493-85a2-42e0-b718-8d9264027674",
+ "metadata": {},
+ "source": [
+ "## Plan Evaluation\n",
+ "\n",
+ "### Original Plan\n",
+ "Data preparation and cleaning:\n",
+ "I don’t expect this to take that long, but I will estimate 1 hour to rename some columns and ensure the data is prepared and cleaned.\n",
+ "\n",
+ "Initial analysis of MPG trends:\n",
+ "I estimate that this will take 5 hours to complete, as I will have to analyze different vehicle types, manufacturers, and engine characteristics to find the correlations between these values.\n",
+ "\n",
+ "Visualization using Plotly or Altair:\n",
+ "I expect this step to take a bit longer, as I have no experience with either of these libraries and they have not been taught in class. I estimate 7 hours to complete this interactive visualization and to adjust the data as necessary to make the graphs make sense.\n",
+ "\n",
+ "Machine learning modeling:\n",
+ "This is the step that I expect to take the longest. I have no experience with machine learning and I hear it is very complicated. I would also like to dive deeper into this research goal, as I believe it will be good to have machine learning experience and to understand how it works. I estimate 12 hours for this part so that I can take my time to understand everything.\n",
+ "\n",
+ "Documentation and final report writing:\n",
+ "I expect this step to take 3 hours as I wrap up everything from the previous steps and proofread my work.\n",
+ "\n",
+ "### Reflection\n",
+ "My proposed plan was not very accurate. Data preparation and cleaning took a lot longer than I estimated because the data was not categorized correctly and was much too specific for what I needed. The initial analysis also took longer because I had to sort the data more than I expected, but it didn't take much longer than 5 hours. The visualization estimate was fairly accurate. The Plotly documentation was very helpful and made the library easy to use in my project. The machine learning estimate was also accurate: the modeling took me about 12 hours to complete. I had the most trouble choosing a model that wasn't taught in class, and filtering the data because of NaN values was also quite challenging. The final step took much longer than I predicted. I planned for only 3 hours, but it took about 8 hours to complete. Organizing all of my work in a format that was easy to read was very time consuming."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "121a3326-7b35-4ef9-b2e8-21202e02cbfc",
+ "metadata": {},
+ "source": [
+ "## Testing\n",
+ "\n",
+ "I used doctests in all of my data modification functions. I created a smaller test dataset with 10 rows of vehicles taken from the modern dataset I found, and wrote doctests with the expected results of each function on that dataset. I manually checked the results for this smaller dataset, compared them to the expected results in the doctests, and left the doctests in the functions for the reader to view. I used interactive visualizations to display my data. I tested the machine learning model by trying different parameters in my function call to see which had the lowest error. I also loaded a modern vehicle dataset and tested how the model performed on modern vehicles compared to the older vehicles. I also used visualizations for my machine learning predictions so that the results of each specific prediction could be viewed. My results are trustworthy because they have been thoroughly tested, and the testing can be seen throughout this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "9027d3d6-c58b-4c9e-bca0-3853bc237f79",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "TestResults(failed=0, attempted=29)\n"
+ ]
+ }
+ ],
+ "source": [
+ "test_results = doctest.testmod()\n",
+ "print(test_results)\n",
+ "assert test_results.failed == 0, \"There are failed doctests\""
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}