From bdff72deaa500424dd85b7168dea224edf44bdd1 Mon Sep 17 00:00:00 2001 From: Derik Vo Date: Wed, 1 Nov 2023 15:31:56 -0700 Subject: [PATCH 1/4] Added a sql equivalent (cte and union) to the concatenation of movie dataframe --- lessons/lesson3/Lesson 3.ipynb | 1714 +++++++++++++++++++++++++++++++- 1 file changed, 1713 insertions(+), 1 deletion(-) diff --git a/lessons/lesson3/Lesson 3.ipynb b/lessons/lesson3/Lesson 3.ipynb index 4a5b860..07e3280 100644 --- a/lessons/lesson3/Lesson 3.ipynb +++ b/lessons/lesson3/Lesson 3.ipynb @@ -1 +1,1713 @@ -{"cells":[{"attachments":{},"cell_type":"markdown","metadata":{"id":"gPZMglnlttO8"},"source":["\n"," \n","# Basic Elementary Exploratory Data Analysis using Pandas\n","\n","_Author: Christopher Chan_"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"cdSXNUkhttO9"},"source":["### Objective\n","\n","Upon completion of this lesson you should be able to understand the following:\n","\n","1. Pandas library\n","2. Dataframes\n","3. Data selection\n","4. Data manipulation\n","5. Handling of missing data\n","\n","This is arguably the most important part of analysis. This is also referred to as \"cleaning the data\". Data must be usable for the analysis to be valid. Otherwise it would be garbage in, garbage out."]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"1c9iQdrTttO-"},"source":["##### ==================================================================================================\n","## Data Selection and Inspection\n","\n","\n","### Pandas Library\n","\n","`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,\n","built on top of the Python programming language.\n","\n","A `pandas` DataFrame can be created by loading data from existing external storage such as a SQL database or a CSV file. A DataFrame can also be created from Python lists, dictionaries, etc. For simplicity, we will use `.csv` files. 
One of the ways to create a pandas data frame is shown below:\n","\n","### DataFrames\n","A data frame is a structured representation of data.\n","##### =================================================================================================="]},{"cell_type":"code","execution_count":null,"metadata":{"id":"0v8znxdlttO-"},"outputs":[],"source":["import pandas as pd"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Q-fUhePhttO-"},"outputs":[],"source":["data = {'Name':['John', 'Tiffany', 'Chris', 'Winnie', 'David'],\n"," 'Age': [24, 23, 22, 19, 10], \n"," 'Salary': [60000,120000,1000000,75000,80000]}\n","\n","people_df = pd.DataFrame(data)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"_fi2q8cuttO-"},"source":["##### ==================================================================================================\n","We can call on the dataframe we labeled `people_df` by applying the `.head()` function that would display the first five rows of the dataframe. 
Similarly, the `.tail()` function would return the last five rows of a dataframe."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Muo9Gs_xttO_"},"outputs":[],"source":["people_df.head()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"ZtcR6GJ2ttO_"},"source":["##### ==================================================================================================\n","We can also modify the number of rows we would like to display by inserting the integer into the `.head()` function.\n","\n","Example: Select the first 2 rows of the dataframe"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"k-lZOSuGttO_"},"outputs":[],"source":["people_df.head(2)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"C_UTdB6IWiG_"},"source":["Example: Select the last 2 rows of the dataframe"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tfNVLk_tWU52"},"outputs":[],"source":["people_df.tail(2)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"Q8nzMIscttO_"},"source":["##### ==================================================================================================\n","Another way to create a dataframe would be to load an existing CSV file by using the `read_csv` function built into `pandas` onto the desired file path as shown below:\n","\n","`dataframe = pd.read_csv(\".../file_location/file_name.csv\")`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"IbdNygPgttO_"},"outputs":[],"source":["# Saving the file location to our data into a variable\n","pixar_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Pixar_Movies.csv\"\n","# Passing our file location to the read_csv function to locate and read our data into a DataFrame\n","movies_df = pd.read_csv(pixar_url)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"4r7sM285ttPA"},"source":["##### 
=================================================================================================="]},{"cell_type":"code","execution_count":null,"metadata":{"id":"u-4v6IISttPA"},"outputs":[],"source":["movies_df.head(10)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"5JISZYZHttPA"},"source":["#### The above python code is equivalent to SQL's\n","\n","```sql\n","SELECT * \n","FROM Movies\n","LIMIT 10\n","```\n","##### =================================================================================================="]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"uB-nxMi8ttPA"},"source":["`.shape` shows the number of rows and columns"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tuyP3rLKttPA"},"outputs":[],"source":["movies_df.shape"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"sCnSo2HBttPA"},"source":["This shows us how many rows and columns are in the entire dataframe, 14 rows, 5 columns\n","\n","##### =================================================================================================="]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"zMhYN4Z1ttPA"},"source":["`.dtypes` shows the data types"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"UOImUY3attPA"},"outputs":[],"source":["movies_df.dtypes"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"244Ux8N_XWmo"},"source":["`.describe()` can be used to help summarize numerical data in our dataframe. 
It summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"-MPTc3c6YjMp"},"outputs":[],"source":["movies_df.describe()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"uNXiyTCWYwl8"},"source":["You may optionally include categorical data in the `describe` method like so:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ITxGRSqQY8oX"},"outputs":[],"source":["movies_df.describe(include='all')"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"s3f7AzxJv_L5"},"outputs":[],"source":["movies_df.info()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"QHrP16DdttPA"},"source":["##### ==================================================================================================\n","\n","### Row and Column Selection\n","\n","There are two common ways to select rows and columns in a dataframe: `.loc` and `.iloc`\n","\n","`.loc` selects rows and columns by label/name\n","\n","`.iloc` selects rows and columns by integer position\n","\n","Example: using `.loc` to select every row in the dataframe by using `:` and filtering the columns to just Title, Director and Year"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"_H7tY4X8ttPA"},"outputs":[],"source":["movies_df.loc[:, ['Title','Director','Year'] ]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"0qwe2OwyttPA"},"source":["##### ==================================================================================================\n","\n","Similarly, we obtain the same results using `.iloc` by selecting the columns at positions 1, 2, and 3, which correspond to Title, Director and Year respectively, as shown below:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"3rPyj7J1ttPA"},"outputs":[],"source":["movies_df.iloc[ :, [1,2,3] ]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"OigUAB8ottPB"},"source":["#### The two python 
codes above are equivalent to SQL's\n","\n","```sql\n","SELECT Title, Director, Year\n","FROM Movies\n","```\n","\n","##### =================================================================================================="]},{"cell_type":"code","execution_count":null,"metadata":{"id":"VrxiA9oittPB"},"outputs":[],"source":["movies_df.iloc[0:3,[1,2,3]]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"W8Rpe-SBttPB"},"source":["#### The above python code is equivalent to SQL's\n","\n","```sql\n","SELECT Title, Director, Year\n","FROM Movies\n","LIMIT 3\n","```\n","##### =================================================================================================="]},{"cell_type":"code","execution_count":null,"metadata":{"id":"JYZXjJ7zttPB"},"outputs":[],"source":["movies_df.iloc[2:5, [1,2,3]]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"xuF4sFtRttPB"},"source":["#### The above python code is equivalent to SQL's\n","\n","```sql\n","SELECT Title, Director, Year\n","FROM movies\n","LIMIT 3\n","OFFSET 2\n","```\n","##### =================================================================================================="]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"qoAct0MgZq2Y"},"source":["The `value_counts()` method returns the count of unique values in a given `Series`/column. 
For example, let's look at the number of entries each Director has in `movies_df`:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"mAaANUmittPB"},"outputs":[],"source":["movies_df.loc[:,'Director'].value_counts()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"UUKE7FJkttPB"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT Director, COUNT(*)\n","FROM Movies\n","GROUP BY Director\n","```\n"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"dqwCWeGUdDOO"},"source":["##### ==================================================================================================\n","\n","We can use the `mean()` method to help us find the average of a column or group of columns."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"85Yx7Q8MdXDp"},"outputs":[],"source":["movies_df.loc[:, 'Length_minutes'].mean()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"hxFxYRlVgy8D"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT AVG(Length_minutes)\n","FROM Movies\n","```"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"CDzWn6ZYhdjl"},"source":["Using the `groupby()` method, we can perform operations that are similar to the `GROUP BY` clause in SQL.\n","\n","For example, let's get the average `Length_minutes` by `Director` to see the average number of minutes for each Director's movies:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"1Pc8Bk75ePoi"},"outputs":[],"source":["movies_df.loc[:, ['Director', 'Length_minutes']].groupby('Director').mean()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"jbDuTSGwiCmq"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT Director, AVG(Length_minutes) AS Length_minutes\n","FROM Movies\n","GROUP BY Director\n","```"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"cKaf4n4ycypo"},"source":["##### 
==================================================================================================\n","### Filtering Data\n","Using comparison operators on columns lets us return only the rows that meet our desired conditions\n","\n","Example: Suppose we want to return movie information only if the movie is longer than 100 minutes."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Zgl74_zjttPB"},"outputs":[],"source":["# Create the filter \n","movie_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n","# Use the filter in the `.loc` selector\n","movies_df.loc[movie_filter, :]\n","\n","# An example showing everything in a single step \n","movies_df.loc[movies_df.loc[:, \"Length_minutes\"] > 100, :]\n"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"1RAY_qWtttPB"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT *\n","FROM Movies\n","WHERE Length_minutes > 100\n","```\n","##### ==================================================================================================\n","\n","#### Multiple Conditional Filtering\n","\n","Suppose we want to return movie information only if the movie is longer than 100 minutes and was created before the year 2005"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Dp1-vQ3mttPB"},"outputs":[],"source":["movie_len_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n","movie_year_filter = movies_df.loc[:, \"Year\"] < 2005\n","\n","movies_df.loc[(movie_len_filter) & (movie_year_filter), :]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"lQksNrTkttPB"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT *\n","FROM Movies\n","WHERE Length_minutes > 100\n","AND Year < 2005\n","```\n","##### =================================================================================================="]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"GVTfOPhottPB"},"source":["##### 
==================================================================================================\n","### Sorting Data\n","The `sort_values()` method sorts the list ascending by default. To sort by descending order, you must apply `ascending = False`. \n","\n","The `.reset_index(drop=True)` will re-index the index after sorting."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"KFQjjjOxttPC"},"outputs":[],"source":["movies_df.loc[:,\"Title\"].sort_values().reset_index(drop=True)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"1fEi_PBfttPC"},"source":["#### The above python code is equivalent to SQL's\n","\n","```sql\n","SELECT Title\n","FROM Movies\n","ORDER BY Title\n","```\n","##### ==================================================================================================\n","\n","Sort the entire dataframe by a single column:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"stgM1BXxttPC"},"outputs":[],"source":["movies_df.sort_values(\"Title\").reset_index(drop=True)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"V5j8FDwuttPC"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT *\n","FROM Movies\n","ORDER BY Title\n","```\n","##### ==================================================================================================\n","\n","We can also sort using multiple columns.\n","Example: We can sort by Director first, then within each Director, sort the Title of the films."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"q6UqfJacttPC"},"outputs":[],"source":["movies_df.sort_values([\"Director\",\"Title\"], ascending=[True, False]).reset_index(drop=True)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"5wlURoWy2eYC"},"source":["```sql\n","SELECT Director, Title\n","FROM Movies\n","ORDER BY\n"," Director ASC,\n"," Title DESC\n","```"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"AWFTUYNVttPC"},"source":["##### 
==================================================================================================\n","### Merging DataFrames\n","\n","In pandas, the `pd.concat` function combines dataframes. They can be stacked one on top of another or placed side by side.\n","\n","But first let us introduce a new dataset:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"3C3P14EvttPC"},"outputs":[],"source":["other_movies_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Other_Movies.csv\"\n","other_movies_df = pd.read_csv(other_movies_url)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"fjEx1V8vttPC"},"outputs":[],"source":["other_movies_df.head()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"DiyckWV1ttPC"},"source":["##### ==================================================================================================\n","Now let's combine the two dataframes, `movies_df` and `other_movies_df`, using the `pd.concat` function and call the new dataframe `all_movies_df`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"pjvZ8wGFttPC"},"outputs":[],"source":["all_movies_df = pd.concat([movies_df,other_movies_df]).reset_index(drop=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"DwGVaXWxttPC"},"outputs":[],"source":["all_movies_df.head(-1) # Passing -1 to head() shows every row except the last one"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"uwtTP015ttPD"},"source":["##### ==================================================================================================\n","Now let's introduce another dataframe containing the scores the movies received"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"wntotCmBttPD"},"outputs":[],"source":["movie_scores_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Movie_Scores.csv\"\n","scores_df = 
pd.read_csv(movie_scores_url)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":143,"status":"ok","timestamp":1680135575621,"user":{"displayName":"Martin Arroyo","userId":"00023833307036255373"},"user_tz":240},"id":"9xeFCBz5ttPD","outputId":"d749036a-deb4-4141-c95d-0428415de7a3"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
   Score
0    8.3
1    7.2
2    7.9
3    8.1
4    8.2
\n","
\n"," \n"," \n"," \n","\n"," \n","
\n","
\n"," "],"text/plain":[" Score\n","0 8.3\n","1 7.2\n","2 7.9\n","3 8.1\n","4 8.2"]},"execution_count":35,"metadata":{},"output_type":"execute_result"}],"source":["scores_df.head()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"Yarl7-KPttPD"},"source":["##### ==================================================================================================\n","Now we can combine the two dataframes side by side"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"W2zOhxPcttPD"},"outputs":[],"source":["movies_and_scores_df = pd.concat([all_movies_df,scores_df], axis = \"columns\").reset_index(drop=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"VBMdQiRettPD"},"outputs":[],"source":["movies_and_scores_df.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"xoQ3fB8SttPD"},"source":["##### ==================================================================================================\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ZITCa9qYttPD"},"outputs":[],"source":["managers = pd.DataFrame(\n"," {\n"," 'Id': [1,2,3],\n"," 'Manager':['Chris','Maritza','Jamin']\n"," }\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"MX9spfihttPD"},"outputs":[],"source":["managers.head()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"LtD1zQJuttPD"},"outputs":[],"source":["captains = pd.DataFrame(\n"," {\n"," 'Id': [2,2,3,1,1,3,2,3,1,1,3,3],\n"," 'Captain':['Derick','Shane','Becca','Anna','Christine','Melody','Tom','Eric','Naomi','Angelina','Nancy','Richard'],\n"," 'Title':['C','C','SC','C','SC','C','C','SC','C','EC','C','SC']\n"," }\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"0xOS-Bu4ttPD"},"outputs":[],"source":["captains.head(12)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"3c478mlSttPD"},"outputs":[],"source":["roster = captains.merge(managers,left_on = 'Id', right_on = 
'Id')\n","roster.head(-1)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"rJ9K1BPGXxzE"},"outputs":[],"source":["test_roster = pd.concat([captains, managers], axis=\"columns\").reset_index(drop=True)\n","test_roster.head()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"2hro1V6XttPD"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT *\n","FROM Captains\n","INNER JOIN Managers\n","ON Captains.Id = Managers.Id\n","```\n","##### ==================================================================================================\n","## Column Renaming\n","\n","We can use the `.rename` function in python to relabel the columns of a dataframe. Suppose we want to rename `Id` to `Cohort` and `Title` to `Captain Rank`."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"-nELWGyPttPD"},"outputs":[],"source":["roster = roster.rename(columns = {\"Id\":\"Cohort\",\"Title\":\"Captain Rank\"})\n","roster.head(-1)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":144,"status":"ok","timestamp":1680136209238,"user":{"displayName":"Martin Arroyo","userId":"00023833307036255373"},"user_tz":240},"id":"zrKc31ukYw3i","outputId":"40c853d2-aa1f-4f1d-ae91-3e803b90c20f"},"outputs":[{"data":{"text/plain":["Index(['Cohort', 'Captain', 'Captain Rank', 'Manager'], dtype='object')"]},"execution_count":45,"metadata":{},"output_type":"execute_result"}],"source":["roster.columns"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"N5H7HamottPE"},"source":["If we would like to replace all columns, we must use a list of equal length"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"eCTo6V3UttPE"},"outputs":[],"source":["roster.columns = ['Cohort Num','Capt','Capt Rank','Manager']\n","roster.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"Wp5nb6skttPE"},"source":["##### 
==================================================================================================\n","### Drop Columns"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"doploOj9ttPE"},"outputs":[],"source":["#df.drop([\"column1\",\"column2\"], axis = \"columns\")\n","\n","roster = roster.drop(\"Cohort Num\", axis = \"columns\")\n","roster.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"u-SBCempttPE"},"source":["##### ==================================================================================================\n","### Missing Values / NaN Values\n","\n","There are various causes of missing data. Most commonly, the data was never collected, the data was handled incorrectly, or a null value was entered.\n","\n","Missing data can be remedied by the following:\n","1. Removing the row with the missing/NaN values\n","2. Removing the column with the missing/NaN values\n","3. Filling in the missing data\n","\n","For simplicity, we will only focus on the first two methods. The third method can be done with value interpolation, using information from other rows or columns of the dataset. This process requires knowledge outside of the scope of this lesson. 
There are entire studies dedicated to this topic alone."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"yV1RhRDNttPE"},"outputs":[],"source":["cars_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Cars.csv\"\n","cars = pd.read_csv(cars_url)\n","cars.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"zT2P3Mq9ttPE"},"source":["##### ==================================================================================================\n","Now lets sort the companies in alphabetical order"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"W4xHJumrttPE"},"outputs":[],"source":["cars = cars.sort_values(\"Company\").reset_index(drop=True)\n","cars.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"tFLokzyvttPE"},"source":["##### ==================================================================================================\n","Now lets check how many entry points are missing. 
As we can see, 4 entries are missing in the Location column and 5 entries are missing in the Year column."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":163,"status":"ok","timestamp":1680136659347,"user":{"displayName":"Martin Arroyo","userId":"00023833307036255373"},"user_tz":240},"id":"q33En74DttPE","outputId":"5b5675c2-4396-49ee-ef44-cf50d5c03b14"},"outputs":[{"data":{"text/plain":["Company 0\n","Location 4\n","Year 5\n","dtype: int64"]},"execution_count":54,"metadata":{},"output_type":"execute_result"}],"source":["cars.isna().sum()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"cGKKoYTpttPE"},"source":["##### ==================================================================================================\n","Let's inspect all the rows with missing Location entries"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dRmT5-TvttPE"},"outputs":[],"source":["missing_car_info_filter = cars.loc[:, \"Location\"].isna()\n","cars.loc[missing_car_info_filter, :]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"tvT4mHb5ttPE"},"source":["##### ==================================================================================================\n","Let's inspect all the rows with missing Year entries"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"64m7mIH0ttPF"},"outputs":[],"source":["cars.loc[cars.loc[:, \"Year\"].isna(), :]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"Kl_wIVHCttPF"},"source":["##### ==================================================================================================\n","For simplicity, we can fill all the missing Location entries with \"NA\""]},{"cell_type":"code","execution_count":null,"metadata":{"id":"k0NDuUMhttPF"},"outputs":[],"source":["cars.loc[:, \"Location\"] = cars.loc[:, 
\"Location\"].fillna(value=\"NA\")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"KXC45KtFttPF"},"outputs":[],"source":["cars.head(-1)\n","cars.isna().sum()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"nB__rivattPF"},"source":["##### ==================================================================================================\n","Now lets drop any rows with missing entries"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":161,"status":"ok","timestamp":1680136931394,"user":{"displayName":"Martin Arroyo","userId":"00023833307036255373"},"user_tz":240},"id":"Ft1XTWOGttPF","outputId":"2a06ac7b-978b-4720-9cc9-52925c26c448"},"outputs":[{"data":{"text/plain":["Company 0\n","Location 0\n","Year 0\n","dtype: int64"]},"execution_count":61,"metadata":{},"output_type":"execute_result"}],"source":["cars = cars.dropna().reset_index(drop=True)\n","cars.head(-1)\n","cars.isna().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"MoUYqyzSeK9n"},"outputs":[],"source":["cars.info()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"lbaxA3zrttPF"},"source":["##### ==================================================================================================\n","## Summary\n","\n","- `pandas` provides `Series` and `DataFrame` classes that with tabular style data.\n","- `.loc` selects rows and columns based on their index values.\n","- `.iloc` selects rows and columns based on their position values.\n","- Calling a DataFrame method with `axis=\"rows\"` or `axis=0` causes it to operate along the row axis.\n","- Calling a DataFrame method with `axis=\"columns\"` or `axis=1` causes it to operate along the columns axis.\n","- `sort_values` reorders rows based on condition\n","- `.rename()` can rename columns in DataFrames. 
You can also rewrite the `.columns` attribute to rename columns.\n","- `.isna()` detects missing values\n","- `.fillna()` replaces NULL values with a specified value\n","- `.dropna()` removes all rows that contain NULL values\n","- `.merge()` joins two DataFrames on one or more shared key columns, similar to a SQL `JOIN`"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"k8I532SRttPF"},"source":["##### ==================================================================================================\n","### Exercise 1:\n","Create a new DataFrame called `cohort` by inner joining the two DataFrames `roster` and `exam`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"t3G0XkmittPF"},"outputs":[],"source":["#solution\n","roster = pd.DataFrame(\n","{\n"," \"Name\" : [\"James\",\"Greg\",\"Patrick\",\"Chris\",\"Cynthia\",\"Chandra\", \"John\",\"David\",\"Tiffany\",\"Peter\"],\n"," \"Id\": [\"1\",\"2\",\"3\",\"4\",\"5\",\"6\",\"7\",\"8\",\"9\",\"10\"],\n"," \n","})\n","\n","exam = pd.DataFrame({\n"," \"Exam 1\" : [89,78,81,90,93,76,66,87,42,55],\n"," \"Exam 2\" : [100,74,20,86,60,76,92,97,88,90],\n"," \"Exam 3\" : [85,60,90,90,88,76,55,None,64,79],\n"," \"Id\" : [\"4\",\"2\",\"1\",\"7\",\"5\",\"10\",\"6\",\"3\",\"9\",\"8\"]\n","})\n","\n","# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"rMRopV2FttPF"},"source":["##### ==================================================================================================\n","### Exercise 2:\n","Fill all missing grades with 0."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"DA8C74TLttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"6a_N8JEEttPG"},"source":["##### ==================================================================================================\n","### Exercise 3:\n","Update James's Exam 2 score from 20 to 85 and update Tiffany's Exam 1 score from 42 to 
88"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Mzka5Y3_ttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"DkuO3tIPttPG"},"source":["##### ==================================================================================================\n","### Exercise 4:\n","\n","Create a series called `Average` that takes the average of Exam 1, Exam 2 and Exam 3 scores"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"QWXVYTj0ttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"96hRtey9ttPG"},"source":["##### ==================================================================================================\n","### Exercise 5:\n","Incorporate the newly created `Average` column into the DataFrame `cohort`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"wEysGqYyttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"QHk1lZiDttPG"},"source":["##### ==================================================================================================\n","### Exercise 6:\n","Sort the dataset by Average in **descending** order and reindex the DataFrame"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"9azLYMHPttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"yyWST6gUttPG"},"source":["##### ==================================================================================================\n","### Exercise 7:\n","Drop columns Exam 1, 2, and 3"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PgD_KqCkttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"OHg6AiIYttPG"},"source":["##### ==================================================================================================\n","### Exercise 8:\n","Select only the top 3 **Name, 
Id and Average only*** based on highest Average grade"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"MmHW3ki9ttPG"},"outputs":[],"source":["# YOUR CODE HERE"]}],"metadata":{"colab":{"provenance":[{"file_id":"11txmjQA_zWvA1kWLMxvhr2VEhNoUk9uC","timestamp":1679880071146}]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.5"}},"nbformat":4,"nbformat_minor":0} +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "gPZMglnlttO8" + }, + "source": [ + "\n", + " \n", + "# Basic Elementary Exploratory Data Analysis using Pandas\n", + "\n", + "_Author: Christopher Chan_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cdSXNUkhttO9" + }, + "source": [ + "### Objective\n", + "\n", + "Upon completion of this lesson you should be able to understand the following:\n", + "\n", + "1. Pandas library\n", + "2. Dataframes\n", + "3. Data selection\n", + "4. Data manipulation\n", + "5. Handling of missing data\n", + "\n", + "This is arguably the most important part of analysis. It is also referred to as \"cleaning the data\". Data must be usable for it to yield a valid analysis. Otherwise it would be garbage in, garbage out." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1c9iQdrTttO-" + }, + "source": [ + "##### ==================================================================================================\n", + "## Data Selection and Inspection\n", + "\n", + "\n", + "### Pandas Library\n", + "\n", + "`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,\n", + "built on top of the Python programming language.\n", + "\n", + "`pandas` data frame can be created by loading the data from the external, existing storage like a database, SQL, or CSV files. But the Pandas Data Frame can also be created from the lists, dictionary, etc. For simplicity, we will use `.csv` files. One of the ways to create a pandas data frame is shown below:\n", + "\n", + "### DataFrames\n", + "A data frame is a structured representation of data.\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0v8znxdlttO-" + }, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Q-fUhePhttO-" + }, + "outputs": [], + "source": [ + "data = {'Name':['John', 'Tiffany', 'Chris', 'Winnie', 'David'],\n", + " 'Age': [24, 23, 22, 19, 10], \n", + " 'Salary': [60000,120000,1000000,75000,80000]}\n", + "\n", + "people_df = pd.DataFrame(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_fi2q8cuttO-" + }, + "source": [ + "##### ==================================================================================================\n", + "We can call on the dataframe we labeled `people_df` by applying the `.head()` function that would display the first five rows of the dataframe. Similarly, the `.tail()` function would return the last five rows of a dataframe." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Muo9Gs_xttO_" + }, + "outputs": [], + "source": [ + "people_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZtcR6GJ2ttO_" + }, + "source": [ + "##### ==================================================================================================\n", + "We can also modify the number of rows we would like to display by inserting the integer into the `.head()` function.\n", + "\n", + "Example: Select the first 2 rows of the dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k-lZOSuGttO_" + }, + "outputs": [], + "source": [ + "people_df.head(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_UTdB6IWiG_" + }, + "source": [ + "Example: Select the last 2 rows of the dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tfNVLk_tWU52" + }, + "outputs": [], + "source": [ + "people_df.tail(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q8nzMIscttO_" + }, + "source": [ + "##### ==================================================================================================\n", + "Another way to create a dataframe would be to load an existing CSV file by using the `read_csv` function built into `pandas` onto the desired file path as shown below:\n", + "\n", + "`dataframe = pd.read_csv(\".../file_location/file_name.csv\")`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IbdNygPgttO_" + }, + "outputs": [], + "source": [ + "# Saving the file location to our data into a variable\n", + "pixar_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Pixar_Movies.csv\"\n", + "# Passing our file location to the read_csv function to locate and read our data into a DataFrame\n", + "movies_df = pd.read_csv(pixar_url)" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "4r7sM285ttPA" + }, + "source": [ + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u-4v6IISttPA" + }, + "outputs": [], + "source": [ + "movies_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5JISZYZHttPA" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT * \n", + "FROM Movies\n", + "LIMIT 10\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uB-nxMi8ttPA" + }, + "source": [ + "`.shape` shows the number of rows and columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tuyP3rLKttPA" + }, + "outputs": [], + "source": [ + "movies_df.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sCnSo2HBttPA" + }, + "source": [ + "This shows us how many rows and columns are in the entire dataframe, 14 rows, 5 columns\n", + "\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zMhYN4Z1ttPA" + }, + "source": [ + "`.dtypes` shows the data types" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UOImUY3attPA" + }, + "outputs": [], + "source": [ + "movies_df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "244Ux8N_XWmo" + }, + "source": [ + "`.describe()` can be used to help summarize numerical data in our dataframe. It summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-MPTc3c6YjMp" + }, + "outputs": [], + "source": [ + "movies_df.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uNXiyTCWYwl8" + }, + "source": [ + "You may optionally include categorical data in the `describe` method like so:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ITxGRSqQY8oX" + }, + "outputs": [], + "source": [ + "movies_df.describe(include='all')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "s3f7AzxJv_L5" + }, + "outputs": [], + "source": [ + "movies_df.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QHrP16DdttPA" + }, + "source": [ + "##### ==================================================================================================\n", + "\n", + "### Row and Column Selection\n", + "\n", + "There are two common ways to select rows and columns in a dataframe: `.loc` and `.iloc`.\n", + "\n", + "`.loc` selects rows and columns by label/name\n", + "\n", + "`.iloc` selects rows and columns by integer position\n", + "\n", + "Example: using `.loc` to select every row in the dataframe by using `:` and filtering the columns to just Title, Director and Year" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_H7tY4X8ttPA" + }, + "outputs": [], + "source": [ + "movies_df.loc[:, ['Title','Director','Year'] ]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0qwe2OwyttPA" + }, + "source": [ + "##### ==================================================================================================\n", + "\n", + "Similarly, we obtain the same results using `.iloc` by selecting the columns at positions 1, 2, and 3, which correspond to Title, Director and Year respectively, as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3rPyj7J1ttPA" + }, + 
"outputs": [], + "source": [ + "movies_df.iloc[ :, [1,2,3] ]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OigUAB8ottPB" + }, + "source": [ + "#### The two python codes above are equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM Movies\n", + "```\n", + "\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VrxiA9oittPB" + }, + "outputs": [], + "source": [ + "movies_df.iloc[0:3,[1,2,3]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W8Rpe-SBttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM Movies\n", + "LIMIT 3\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JYZXjJ7zttPB" + }, + "outputs": [], + "source": [ + "movies_df.iloc[2:5, [1,2,3]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xuF4sFtRttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM movies\n", + "LIMIT 3\n", + "OFFSET 2\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qoAct0MgZq2Y" + }, + "source": [ + "The `value_counts()` method returns the count of unique values in a given `Series`/column. 
For example, let's look at the number of entries each Director has in `movies_df`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mAaANUmittPB" + }, + "outputs": [], + "source": [ + "movies_df.loc[:,'Director'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UUKE7FJkttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT Director, COUNT(*)\n", + "FROM Movies\n", + "GROUP BY Director\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dqwCWeGUdDOO" + }, + "source": [ + "##### ==================================================================================================\n", + "\n", + "We can use the `mean()` method to help us find the average of a column or group of columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "85Yx7Q8MdXDp" + }, + "outputs": [], + "source": [ + "movies_df.loc[:, 'Length_minutes'].mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hxFxYRlVgy8D" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT AVG(Length_minutes)\n", + "FROM Movies\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CDzWn6ZYhdjl" + }, + "source": [ + "Using the `groupby()` method, we can perform operations that are similar to the `GROUP BY` clause in SQL.\n", + "\n", + "For example, let's get the average `Length_minutes` by `Director` to see the average number of minutes for each Director's movies:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1Pc8Bk75ePoi" + }, + "outputs": [], + "source": [ + "movies_df.loc[:, ['Director', 'Length_minutes']].groupby('Director').mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jbDuTSGwiCmq" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + 
"SELECT Director, AVG(Length_minutes) AS Length_minutes\n", + "FROM Movies\n", + "GROUP BY Director\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cKaf4n4ycypo" + }, + "source": [ + "##### ==================================================================================================\n", + "### Filtering Data\n", + "Using operator comparisons on columns returns information based on our desired conditions\n", + "\n", + "Example: Suppose we want to return movie information if it is only longer than 100 minutes long." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Zgl74_zjttPB" + }, + "outputs": [], + "source": [ + "# Create the filter \n", + "movie_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n", + "# Use the filter in the `.loc` selector\n", + "movies_df.loc[movie_filter, :]\n", + "\n", + "# An example showing everything in a single step \n", + "movies_df.loc[movies_df.loc[:, \"Length_minutes\"] > 100, :]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1RAY_qWtttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "WHERE Length_minutes > 100\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "#### Multiple Conditional Filtering\n", + "\n", + "Supposed we want to return movie information only if it is longer than 100 minutes and was created before the year 2005" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Dp1-vQ3mttPB" + }, + "outputs": [], + "source": [ + "movie_len_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n", + "movie_year_filter = movies_df.loc[:, \"Year\"] < 2005\n", + "\n", + "movies_df.loc[(movie_len_filter) & (movie_year_filter), :]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lQksNrTkttPB" + }, + "source": [ + "#### The 
above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "WHERE Length_minutes > 100\n", + "AND Year < 2005\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GVTfOPhottPB" + }, + "source": [ + "##### ==================================================================================================\n", + "### Sorting Data\n", + "The `sort_values()` method sorts the list ascending by default. To sort by descending order, you must apply `ascending = False`. \n", + "\n", + "The `.reset_index(drop=True)` will re-index the index after sorting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KFQjjjOxttPC" + }, + "outputs": [], + "source": [ + "movies_df.loc[:,\"Title\"].sort_values().reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1fEi_PBfttPC" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title\n", + "FROM Movies\n", + "ORDER BY Title\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "Sort the entire dataframe by a single column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "stgM1BXxttPC" + }, + "outputs": [], + "source": [ + "movies_df.sort_values(\"Title\").reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V5j8FDwuttPC" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "ORDER BY Title\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "We can also sort using multiple columns.\n", + "Example: We can sort by Director first, then 
within each Director, sort the Title of the films." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "q6UqfJacttPC" + }, + "outputs": [], + "source": [ + "movies_df.sort_values([\"Director\",\"Title\"], ascending=[True, False]).reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5wlURoWy2eYC" + }, + "source": [ + "```sql\n", + "SELECT Director, Title\n", + "FROM Movies\n", + "ORDER BY\n", + " Director ASC,\n", + " Title DESC\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AWFTUYNVttPC" + }, + "source": [ + "##### ==================================================================================================\n", + "### Merging DataFrames\n", + "\n", + "In python the `.concat` function combines dataframes together. This can be either one on top of another dataframe or side by side.\n", + "\n", + "But first let us introduce a new dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3C3P14EvttPC" + }, + "outputs": [], + "source": [ + "other_movies_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Other_Movies.csv\"\n", + "other_movies_df = pd.read_csv(other_movies_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fjEx1V8vttPC" + }, + "outputs": [], + "source": [ + "other_movies_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DiyckWV1ttPC" + }, + "source": [ + "##### ==================================================================================================\n", + "Now lets combine the two dataframes, that being `movies_df` and `other_movies_df` using the `.concat` function and call this new dataframe `all_movies_df`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pjvZ8wGFttPC" + }, + "outputs": [], + "source": [ + "all_movies_df = 
pd.concat([movies_df,other_movies_df]).reset_index(drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DwGVaXWxttPC" + }, + "outputs": [], + "source": [ + "all_movies_df.head(-1) # A negative n returns every row except the last |n| rows, so head(-1) shows all rows but the last one" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "#### The `pd.concat` call above is equivalent to SQL's\n", + "\n", + "```sql\n", + "WITH all_movies_df AS (\n", + " SELECT *\n", + " FROM movies_df\n", + " UNION ALL\n", + " SELECT *\n", + " FROM other_movies_df\n", + ")\n", + "SELECT *\n", + "FROM all_movies_df;\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uwtTP015ttPD" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's introduce another dataframe containing the scores the movies received" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wntotCmBttPD" + }, + "outputs": [], + "source": [ + "movie_scores_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Movie_Scores.csv\"\n", + "scores_df = pd.read_csv(movie_scores_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "executionInfo": { + "elapsed": 143, + "status": "ok", + "timestamp": 1680135575621, + "user": { + "displayName": "Martin Arroyo", + "userId": "00023833307036255373" + }, + "user_tz": 240 + }, + "id": "9xeFCBz5ttPD", + "outputId": "d749036a-deb4-4141-c95d-0428415de7a3" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " " + ], + "text/plain": [ + " Score\n", + "0 8.3\n", + "1 7.2\n", + "2 7.9\n", + "3 8.1\n", + "4 8.2" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scores_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Yarl7-KPttPD" + }, + "source": [ + "##### ==================================================================================================\n", + "Now we can combine the two dataframes side by side" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W2zOhxPcttPD" + }, + "outputs": [], + "source": [ + "movies_and_scores_df = pd.concat([all_movies_df,scores_df], axis = \"columns\").reset_index(drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VBMdQiRettPD" + }, + "outputs": [], + "source": [ + "movies_and_scores_df.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xoQ3fB8SttPD" + }, + "source": [ + "##### ==================================================================================================\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZITCa9qYttPD" + }, + "outputs": [], + "source": [ + "managers = pd.DataFrame(\n", + " {\n", + " 'Id': [1,2,3],\n", + " 'Manager':['Chris','Maritza','Jamin']\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MX9spfihttPD" + }, + "outputs": [], + "source": [ + "managers.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LtD1zQJuttPD" + }, + "outputs": [], + "source": [ + "captains = pd.DataFrame(\n", + " {\n", + " 'Id': [2,2,3,1,1,3,2,3,1,1,3,3],\n", + " 'Captain':['Derick','Shane','Becca','Anna','Christine','Melody','Tom','Eric','Naomi','Angelina','Nancy','Richard'],\n", + " 'Title':['C','C','SC','C','SC','C','C','SC','C','EC','C','SC']\n", + " }\n", + ")" + ] + 
}, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0xOS-Bu4ttPD" + }, + "outputs": [], + "source": [ + "captains.head(12)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3c478mlSttPD" + }, + "outputs": [], + "source": [ + "roster = captains.merge(managers,left_on = 'Id', right_on = 'Id')\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rJ9K1BPGXxzE" + }, + "outputs": [], + "source": [ + "test_roster = pd.concat([captains, managers], axis=\"columns\").reset_index(drop=True)\n", + "test_roster.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2hro1V6XttPD" + }, + "source": [ + "#### The `captains.merge(managers, ...)` call above (not the `pd.concat` cell, which only places the frames side by side) is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Captains\n", + "INNER JOIN Managers\n", + "ON Captains.Id = Managers.Id\n", + "```\n", + "##### ==================================================================================================\n", + "## Column Renaming\n", + "\n", + "We can use the `.rename()` method to relabel the columns of a dataframe. Suppose we want to rename `Id` to `Cohort` and `Title` to `Captain Rank`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-nELWGyPttPD" + }, + "outputs": [], + "source": [ + "roster = roster.rename(columns = {\"Id\":\"Cohort\",\"Title\":\"Captain Rank\"})\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 144, + "status": "ok", + "timestamp": 1680136209238, + "user": { + "displayName": "Martin Arroyo", + "userId": "00023833307036255373" + }, + "user_tz": 240 + }, + "id": "zrKc31ukYw3i", + "outputId": "40c853d2-aa1f-4f1d-ae91-3e803b90c20f" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['Cohort', 'Captain', 'Captain Rank', 'Manager'], dtype='object')" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "roster.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N5H7HamottPE" + }, + "source": [ + "If we would like to replace all columns, we must use a list of equal length" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eCTo6V3UttPE" + }, + "outputs": [], + "source": [ + "roster.columns = ['Cohort Num','Capt','Capt Rank','Manager']\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wp5nb6skttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "### Drop Columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "doploOj9ttPE" + }, + "outputs": [], + "source": [ + "#df.drop([\"column1\",\"column2\"], axis = \"columns\")\n", + "\n", + "roster = roster.drop(\"Cohort Num\", axis = \"columns\")\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u-SBCempttPE" + }, + "source": [ + "##### 
==================================================================================================\n", + "### Missing Values / NaN Values\n", + "\n", + "There are various types of missing data. Most commonly it could just be data was never collected, the data was handled incorrectly or null valued entry.\n", + "\n", + "Missing data can be remedied by the following:\n", + "1. Removing the row with the missing/NaN values\n", + "2. Removing the column with the missing/NaN values\n", + "3. Filling in the missing data\n", + "\n", + "For simplicity, we will only focus on the first two methods. The third method can be resolved with value interpolation by use of information from other rows or columns of the dataset. This process requires knowledge outside of the scope of this lesson. There are entire studies dedicated to this topic alone." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yV1RhRDNttPE" + }, + "outputs": [], + "source": [ + "cars_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Cars.csv\"\n", + "cars = pd.read_csv(cars_url)\n", + "cars.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zT2P3Mq9ttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Now lets sort the companies in alphabetical order" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W4xHJumrttPE" + }, + "outputs": [], + "source": [ + "cars = cars.sort_values(\"Company\").reset_index(drop=True)\n", + "cars.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tFLokzyvttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Now lets check how many entry points are missing. 
As we can see, there are 4 missing entries in the Location column and 5 missing entries in the Year column." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 163, + "status": "ok", + "timestamp": 1680136659347, + "user": { + "displayName": "Martin Arroyo", + "userId": "00023833307036255373" + }, + "user_tz": 240 + }, + "id": "q33En74DttPE", + "outputId": "5b5675c2-4396-49ee-ef44-cf50d5c03b14" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Company 0\n", + "Location 4\n", + "Year 5\n", + "dtype: int64" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cars.isna().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cGKKoYTpttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Let's inspect all the rows with missing Location entries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dRmT5-TvttPE" + }, + "outputs": [], + "source": [ + "missing_car_info_filter = cars.loc[:, \"Location\"].isna()\n", + "cars.loc[missing_car_info_filter, :]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvT4mHb5ttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Let's inspect all the rows with missing Year entries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "64m7mIH0ttPF" + }, + "outputs": [], + "source": [ + "cars.loc[cars.loc[:, \"Year\"].isna(), :]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kl_wIVHCttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "For simplicity we can fill all the missing Location 
entries with \"NA\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k0NDuUMhttPF" + }, + "outputs": [], + "source": [ + "cars.loc[:, \"Location\"] = cars.loc[:, \"Location\"].fillna(value=\"NA\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KXC45KtFttPF" + }, + "outputs": [], + "source": [ + "cars.head(-1)\n", + "cars.isna().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nB__rivattPF" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's drop any rows that still have missing entries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 161, + "status": "ok", + "timestamp": 1680136931394, + "user": { + "displayName": "Martin Arroyo", + "userId": "00023833307036255373" + }, + "user_tz": 240 + }, + "id": "Ft1XTWOGttPF", + "outputId": "2a06ac7b-978b-4720-9cc9-52925c26c448" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Company 0\n", + "Location 0\n", + "Year 0\n", + "dtype: int64" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cars = cars.dropna().reset_index(drop=True)\n", + "cars.head(-1)\n", + "cars.isna().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MoUYqyzSeK9n" + }, + "outputs": [], + "source": [ + "cars.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lbaxA3zrttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "## Summary\n", + "\n", + "- `pandas` provides `Series` and `DataFrame` classes that work with tabular-style data.\n", + "- `.loc` selects rows and columns based on their index labels.\n", + "- `.iloc` selects rows and columns 
based on their integer positions.\n", + "- Calling a DataFrame method with `axis=\"rows\"` or `axis=0` causes it to operate along the row axis.\n", + "- Calling a DataFrame method with `axis=\"columns\"` or `axis=1` causes it to operate along the columns axis.\n", + "- `.sort_values()` reorders rows based on the values in one or more columns\n", + "- `.rename()` can rename columns in DataFrames. You can also rewrite the `.columns` attribute to rename columns.\n", + "- `.isna()` detects missing values\n", + "- `.fillna()` replaces missing (NULL) values with a specified value\n", + "- `.dropna()` removes rows that contain missing (NULL) values\n", + "- `.merge()` combines two DataFrames by joining them on shared columns or indexes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k8I532SRttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 1:\n", + "Create a new DataFrame called `cohort` by inner joining the two DataFrames `roster` and `exam`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "t3G0XkmittPF" + }, + "outputs": [], + "source": [ + "# Setup data for the exercises\n", + "roster = pd.DataFrame(\n", + "{\n", + " \"Name\" : [\"James\",\"Greg\",\"Patrick\",\"Chris\",\"Cynthia\",\"Chandra\", \"John\",\"David\",\"Tiffany\",\"Peter\"],\n", + " \"Id\": [\"1\",\"2\",\"3\",\"4\",\"5\",\"6\",\"7\",\"8\",\"9\",\"10\"],\n", + " \n", + "})\n", + "\n", + "exam = pd.DataFrame({\n", + " \"Exam 1\" : [89,78,81,90,93,76,66,87,42,55],\n", + " \"Exam 2\" : [100,74,20,86,60,76,92,97,88,90],\n", + " \"Exam 3\" : [85,60,90,90,88,76,55,None,64,79],\n", + " \"Id\" : [\"4\",\"2\",\"1\",\"7\",\"5\",\"10\",\"6\",\"3\",\"9\",\"8\"]\n", + "})\n", + "\n", + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rMRopV2FttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 
2:\n", + "Fill all missing grades with 0." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DA8C74TLttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6a_N8JEEttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 3:\n", + "Update James's Exam 2 score from 20 to 85 and Tiffany's Exam 1 score from 42 to 88" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Mzka5Y3_ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DkuO3tIPttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 4:\n", + "\n", + "Create a Series called `Average` that holds each student's average of the Exam 1, Exam 2 and Exam 3 scores" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QWXVYTj0ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "96hRtey9ttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 5:\n", + "Add the newly created `Average` Series as a column in the DataFrame `cohort`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wEysGqYyttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QHk1lZiDttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 6:\n", + "Sort the dataset by Average in **descending** order and reset the index of the DataFrame" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "metadata": { + "id": "9azLYMHPttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yyWST6gUttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 7:\n", + "Drop columns Exam 1, 2, and 3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PgD_KqCkttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHg6AiIYttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 8:\n", + "Select only the top 3 **Name, Id and Average only*** based on highest Average grade" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MmHW3ki9ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + } + ], + "metadata": { + "colab": { + "provenance": [ + { + "file_id": "11txmjQA_zWvA1kWLMxvhr2VEhNoUk9uC", + "timestamp": 1679880071146 + } + ] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From b47c7d28e4a1f5d5cd6e2c13e666231b7e9bfc84 Mon Sep 17 00:00:00 2001 From: Derik Vo Date: Wed, 1 Nov 2023 17:50:13 -0700 Subject: [PATCH 2/4] added code to allow older notebooks on colab to work --- lessons/lesson3/Lesson 3.ipynb | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/lessons/lesson3/Lesson 3.ipynb b/lessons/lesson3/Lesson 3.ipynb index 
07e3280..bad33cd 100644 --- a/lessons/lesson3/Lesson 3.ipynb +++ b/lessons/lesson3/Lesson 3.ipynb @@ -1,5 +1,36 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "''" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\"\"\n", + "For Captains using OLDER Colab notebooks: \n", + "Run this code if you're loading .csv as \"pd.read_csv(\"/content/Pixar_Movies.csv\")\"\n", + "\n", + "This will download a folder with the CSVs and will move the files into your current working directory, so you can use the prewritten code.\n", + "Newer notebooks will directly link to the CSV, so you don't need to take any additional steps.\n", + "\"\"\"\n", + "# !gdown --folder https://drive.google.com/drive/folders/1DyiVBtUKIMg311TQTUoGt0kV2vcFcseM\n", + "# !cd Content; mv *.csv ..; rmdir /content/Content\n", + ";" + ] + }, + { "cell_type": "markdown", "metadata": { From b9f4f454d9a25c9538779e46badc6ad32534dd18 Mon Sep 17 00:00:00 2001 From: DerikVo Date: Sat, 27 Apr 2024 15:48:00 -0700 Subject: [PATCH 3/4] Added a script to download CSV files needed to run the notebook's examples. 
--- Python_Lesson_3.ipynb | 12054 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 12054 insertions(+) create mode 100644 Python_Lesson_3.ipynb diff --git a/Python_Lesson_3.ipynb b/Python_Lesson_3.ipynb new file mode 100644 index 0000000..19d23ca --- /dev/null +++ b/Python_Lesson_3.ipynb @@ -0,0 +1,12054 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Must run this to prepare all the CSVs to be loaded\n", + "The current notebook in Google Classroom assumes COOPers already have the data in the content folder, but that is not the case.\n", + "I added a script to download the files into the content folder. Unhide the script for more explanation.\n", + "\n", + "\n", + "If you have any questions, please contact \"[Derik Vo](https://www.linkedin.com/in/derik-vo/)\" on Slack or LinkedIn\n", + "\n", + "Last updated:\n", + "20240427" + ], + "metadata": { + "id": "twfwKM5msch0" + } + }, + { + "cell_type": "code", + "source": [ + "\"\"\"\n", + "This script is designed to download all the CSV files needed to use this notebook.\n", + "The notebook assumes the program participants already have the files in the virtual working directory,\n", + "but that is not the case.\n", + "\n", + "This script simply downloads Google Sheets files as CSVs into the current working directory, content,\n", + "which is the path used throughout the notebook.\n", + "\"\"\"\n", + "\n", + "import importlib.util\n", + "# Check if gdown package is installed, so you can re-run this without downloading the package each time\n", + "if importlib.util.find_spec(\"gdown\") is None:\n", + " !pip install gdown\n", + "\n", + "# File ID of google sheet URLs e.g. the thing between .../d/... 
and /edit?...\n", + "file_id = ['1Jk5SlYcHsdklUgxMdV_4AVomxKcgOduiMw0TO-XKBJs',\n", + " '1krLyXgH0ZhMsrh5fHOTGb5qRzQQhbZM8QAV00R9uTlU',\n", + " '1cjJJ8_b4Du8AQaY0QB5a2DE-eaLa1kQ5WmeYS32nlQ8',\n", + " '15HXIdDfVSrkGt_ef7UyCrDRnDbfZICVYTtlTk9YmuSM']\n", + "\n", + "# Specify the file names, index must match file_id\n", + "output_file = ['Cars.csv',\n", + " 'Movie_Scores.csv',\n", + " 'Other_Movies.csv',\n", + " 'Pixar_Movies.csv'\n", + " ]\n", + "# Download using gdown, specify the export as csv\n", + "for i in range(0, len(file_id)):\n", + " download_url = f'https://docs.google.com/spreadsheets/d/{file_id[i]}/export?format=csv'\n", + " !gdown {download_url} --output {output_file[i]}" + ], + "metadata": { + "id": "luc8W-65kZr7", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "91c2998e-ae89-45cf-e0ec-0a5525b10916" + }, + "execution_count": 69, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "/usr/local/lib/python3.10/dist-packages/gdown/parse_url.py:48: UserWarning: You specified a Google Drive link that is not the correct link to download a file. You might want to try `--fuzzy` option or the following url: https://drive.google.com/uc?id=None\n", + " warnings.warn(\n", + "Downloading...\n", + "From: https://docs.google.com/spreadsheets/d/1Jk5SlYcHsdklUgxMdV_4AVomxKcgOduiMw0TO-XKBJs/export?format=csv\n", + "To: /content/Cars.csv\n", + "918B [00:00, 2.35MB/s]\n", + "/usr/local/lib/python3.10/dist-packages/gdown/parse_url.py:48: UserWarning: You specified a Google Drive link that is not the correct link to download a file. 
You might want to try `--fuzzy` option or the following url: https://drive.google.com/uc?id=None\n", + " warnings.warn(\n", + "Downloading...\n", + "From: https://docs.google.com/spreadsheets/d/1krLyXgH0ZhMsrh5fHOTGb5qRzQQhbZM8QAV00R9uTlU/export?format=csv\n", + "To: /content/Movie_Scores.csv\n", + "101B [00:00, 319kB/s]\n", + "/usr/local/lib/python3.10/dist-packages/gdown/parse_url.py:48: UserWarning: You specified a Google Drive link that is not the correct link to download a file. You might want to try `--fuzzy` option or the following url: https://drive.google.com/uc?id=None\n", + " warnings.warn(\n", + "Downloading...\n", + "From: https://docs.google.com/spreadsheets/d/1cjJJ8_b4Du8AQaY0QB5a2DE-eaLa1kQ5WmeYS32nlQ8/export?format=csv\n", + "To: /content/Other_Movies.csv\n", + "319B [00:00, 940kB/s]\n", + "/usr/local/lib/python3.10/dist-packages/gdown/parse_url.py:48: UserWarning: You specified a Google Drive link that is not the correct link to download a file. You might want to try `--fuzzy` option or the following url: https://drive.google.com/uc?id=None\n", + " warnings.warn(\n", + "Downloading...\n", + "From: https://docs.google.com/spreadsheets/d/15HXIdDfVSrkGt_ef7UyCrDRnDbfZICVYTtlTk9YmuSM/export?format=csv\n", + "To: /content/Pixar_Movies.csv\n", + "542B [00:00, 896kB/s]\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gPZMglnlttO8" + }, + "source": [ + "\n", + "\n", + "# Basic Elementary Exploratory Data Analysis using Pandas\n", + "\n", + "_Author: Christopher Chan_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cdSXNUkhttO9" + }, + "source": [ + "### Objective\n", + "\n", + "Upon completion of this lesson you should be able to understand the following:\n", + "\n", + "1. Pandas library\n", + "2. Dataframes\n", + "3. Data selection\n", + "4. Data manipulation\n", + "5. Handling of missing data\n", + "\n", + "This is arguably the most important part of analysis. 
This is also referred to as \"cleaning the data\". Data must be usable for an analysis to be valid. Otherwise it would be garbage in, garbage out." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1c9iQdrTttO-" + }, + "source": [ + "##### ==================================================================================================\n", + "## Data Selection and Inspection\n", + "\n", + "\n", + "### Pandas Library\n", + "\n", + "`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,\n", + "built on top of the Python programming language.\n", + "\n", + "A `pandas` DataFrame can be created by loading data from existing external storage such as a SQL database or CSV files. A DataFrame can also be created from Python lists, dictionaries, etc. For simplicity, we will use `.csv` files. One of the ways to create a pandas DataFrame is shown below:\n", + "\n", + "### DataFrames\n", + "A data frame is a structured representation of data.\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "0v8znxdlttO-" + }, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "Q-fUhePhttO-" + }, + "outputs": [], + "source": [ + "data = {'Name':['John', 'Tiffany', 'Chris', 'Winnie', 'David'],\n", + " 'Age': [24, 23, 22, 19, 10],\n", + " 'Salary': [60000,120000,1000000,75000,80000]}\n", + "\n", + "people_df = pd.DataFrame(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_fi2q8cuttO-" + }, + "source": [ + "##### ==================================================================================================\n", + "We can inspect the dataframe we named `people_df` by calling the `.head()` method, which displays the first five rows of the dataframe. 
Similarly, the `.tail()` function would return the last five rows of a dataframe." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "Muo9Gs_xttO_", + "outputId": "05ac9d53-000a-4d97-a299-b8a766c78531", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Name Age Salary\n", + "0 John 24 60000\n", + "1 Tiffany 23 120000\n", + "2 Chris 22 1000000\n", + "3 Winnie 19 75000\n", + "4 David 10 80000" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "people_df", + "summary": "{\n \"name\": \"people_df\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tiffany\",\n \"David\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 10,\n \"max\": 24,\n \"num_unique_values\": 5,\n \"samples\": [\n 23,\n 10,\n 22\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Salary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 410359,\n \"min\": 60000,\n \"max\": 1000000,\n \"num_unique_values\": 5,\n \"samples\": [\n 120000,\n 80000,\n 1000000\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 11 + } + ], + "source": [ + "people_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZtcR6GJ2ttO_" + }, + "source": [ + "##### ==================================================================================================\n", + "We can also modify the number of rows we would like to display by inserting the integer into the `.head()` function.\n", + "\n", + "Example: Select the first 2 rows of the dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "k-lZOSuGttO_", + "outputId": "16624df3-6df0-400c-9708-eebc63c390da", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Name Age Salary\n", + "0 John 24 60000\n", + "1 Tiffany 23 120000" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "people_df", + "summary": "{\n \"name\": \"people_df\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tiffany\",\n \"David\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 10,\n \"max\": 24,\n \"num_unique_values\": 5,\n \"samples\": [\n 23,\n 10,\n 22\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Salary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 410359,\n \"min\": 60000,\n \"max\": 1000000,\n \"num_unique_values\": 5,\n \"samples\": [\n 120000,\n 80000,\n 1000000\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 12 + } + ], + "source": [ + "people_df.head(2)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Example: Select the last 2 rows of the dataframe" + ], + "metadata": { + "id": "C_UTdB6IWiG_" + } + }, + { + "cell_type": "code", + "source": [ + "people_df.tail(2)" + ], + "metadata": { + "id": "tfNVLk_tWU52", + "outputId": "8a683d3c-3ff9-4e9b-9bf3-fda099413cb4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + } + }, + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Name Age Salary\n", + "3 Winnie 19 75000\n", + "4 David 10 80000" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"people_df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"David\",\n \"Winnie\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 10,\n \"max\": 19,\n \"num_unique_values\": 2,\n \"samples\": [\n 10,\n 19\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Salary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3535,\n \"min\": 75000,\n \"max\": 80000,\n \"num_unique_values\": 2,\n \"samples\": [\n 80000,\n 75000\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q8nzMIscttO_" + }, + "source": [ + "##### ==================================================================================================\n", + "Another way to create a dataframe would be to load an existing CSV file by using the `read_csv` function built into `pandas` onto the desired file path as shown below:\n", + "\n", + "`dataframe = pd.read_csv(\".../file_location/file_name.csv\")`" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "IbdNygPgttO_" + }, + "outputs": [], + "source": [ + "movies_df = pd.read_csv(\"/content/Pixar_Movies.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4r7sM285ttPA" + }, + "source": [ + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "u-4v6IISttPA", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 363 + }, + "outputId": 
"9125fe6b-77a3-4251-c969-c0947f5d7e01" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Title Director Year Length_minutes\n", + "0 1 Toy Story John Lasseter 1995 81\n", + "1 2 A Bug's Life John Lasseter 1998 95\n", + "2 3 Toy Story 2 John Lasseter 1999 93\n", + "3 4 Monsters, Inc. Pete Docter 2001 92\n", + "4 5 Finding Nemo Andrew Stanton 2003 107\n", + "5 6 The Incredibles Brad Bird 2004 116\n", + "6 7 Cars John Lasseter 2006 117\n", + "7 8 Ratatouille Brad Bird 2007 115\n", + "8 9 WALL-E Andrew Stanton 2008 104\n", + "9 10 Up Pete Docter 2009 101" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "movies_df", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 14,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4,\n \"min\": 1,\n \"max\": 14,\n \"num_unique_values\": 14,\n \"samples\": [\n 10,\n 12,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14,\n \"samples\": [\n \"Up\",\n \"Cars 2\",\n \"Toy Story\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"John Lasseter\",\n \"Pete Docter\",\n \"Brenda Chapman\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1995,\n \"max\": 2013,\n \"num_unique_values\": 14,\n \"samples\": [\n 2009,\n 2011,\n 1995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11,\n \"min\": 81,\n \"max\": 120,\n \"num_unique_values\": 14,\n \"samples\": [\n 101,\n 120,\n 81\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 15 + } + ], + "source": [ + "movies_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5JISZYZHttPA" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "LIMIT 10\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uB-nxMi8ttPA" + }, + "source": [ + "`.shape` shows the 
number of rows and columns" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "tuyP3rLKttPA", + "outputId": "e814c6e6-08db-42a7-ed51-acebc0e11df0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(14, 5)" + ] + }, + "metadata": {}, + "execution_count": 16 + } + ], + "source": [ + "movies_df.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sCnSo2HBttPA" + }, + "source": [ + "This shows us how many rows and columns are in the entire dataframe, 14 rows, 5 columns\n", + "\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zMhYN4Z1ttPA" + }, + "source": [ + "`.dtypes` shows the data types" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "UOImUY3attPA", + "outputId": "205c0d6d-3e13-40c1-b872-d7107d20b6cf", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Id int64\n", + "Title object\n", + "Director object\n", + "Year int64\n", + "Length_minutes int64\n", + "dtype: object" + ] + }, + "metadata": {}, + "execution_count": 17 + } + ], + "source": [ + "movies_df.dtypes" + ] + }, + { + "cell_type": "markdown", + "source": [ + "`.describe()` can be used to help summarize numerical data in our dataframe. It summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values." 
+ ], + "metadata": { + "id": "244Ux8N_XWmo" + } + }, + { + "cell_type": "code", + "source": [ + "movies_df.describe()" + ], + "metadata": { + "id": "-MPTc3c6YjMp", + "outputId": "5e72034a-69a7-4a62-8966-b3c92d2684ce", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + } + }, + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Year Length_minutes\n", + "count 14.0000 14.000000 14.000000\n", + "mean 7.5000 2005.428571 104.000000\n", + "std 4.1833 5.598273 11.176899\n", + "min 1.0000 1995.000000 81.000000\n", + "25% 4.2500 2001.500000 96.500000\n", + "50% 7.5000 2006.500000 103.500000\n", + "75% 10.7500 2009.750000 113.750000\n", + "max 14.0000 2013.000000 120.000000" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.745054915279273,\n \"min\": 1.0,\n \"max\": 14.0,\n \"num_unique_values\": 6,\n \"samples\": [\n 14.0,\n 7.5,\n 10.75\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 923.7077335637653,\n \"min\": 5.598272889084574,\n \"max\": 2013.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 2005.4285714285713,\n 2006.5,\n 14.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 43.476190740521304,\n \"min\": 11.176899253508413,\n \"max\": 120.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 104.0,\n 103.5,\n 14.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "You may optionally include categorical data in the `describe` method like so:" + ], + "metadata": { + "id": "uNXiyTCWYwl8" + } + }, + { + "cell_type": "code", + "source": [ + "movies_df.describe(include='all')" + ], + "metadata": { + "id": "ITxGRSqQY8oX", + "outputId": "2aff9745-e64d-401c-fc80-5e53cfaf81c0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 394 + } + }, + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Title Director Year Length_minutes\n", + "count 14.0000 14 14 14.000000 14.000000\n", + "unique NaN 14 7 NaN NaN\n", + "top NaN Toy Story John Lasseter NaN NaN\n", + "freq NaN 1 5 NaN NaN\n", + "mean 7.5000 NaN NaN 2005.428571 104.000000\n", + "std 4.1833 NaN NaN 5.598273 11.176899\n", + "min 1.0000 NaN NaN 1995.000000 81.000000\n", + "25% 
4.2500 NaN NaN 2001.500000 96.500000\n", + "50% 7.5000 NaN NaN 2006.500000 103.500000\n", + "75% 10.7500 NaN NaN 2009.750000 113.750000\n", + "max 14.0000 NaN NaN 2013.000000 120.000000" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 11,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.745054915279273,\n \"min\": 1.0,\n \"max\": 14.0,\n \"num_unique_values\": 6,\n \"samples\": [\n 14.0,\n 7.5,\n 10.75\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"14\",\n \"Toy Story\",\n \"1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 4,\n \"samples\": [\n 7,\n \"5\",\n \"14\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 923.7077335637653,\n \"min\": 5.598272889084574,\n \"max\": 2013.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 2005.4285714285713,\n 2006.5,\n 14.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 43.476190740521304,\n \"min\": 11.176899253508413,\n \"max\": 120.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 104.0,\n 103.5,\n 14.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 19 + } + ] + }, + { + "cell_type": "code", + "source": [ + "movies_df.info()" + ], + "metadata": { + "id": "s3f7AzxJv_L5", + "outputId": "9eded4ed-a6ae-48d0-ff9e-33be78411b2b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": 20, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "RangeIndex: 14 entries, 0 to 13\n", + "Data columns (total 5 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ 
-------------- ----- \n", + " 0 Id 14 non-null int64 \n", + " 1 Title 14 non-null object\n", + " 2 Director 14 non-null object\n", + " 3 Year 14 non-null int64 \n", + " 4 Length_minutes 14 non-null int64 \n", + "dtypes: int64(3), object(2)\n", + "memory usage: 688.0+ bytes\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QHrP16DdttPA" + }, + "source": [ + "##### ==================================================================================================\n", + "\n", + "### Row and Column Selection\n", + "\n", + "There are two common ways to select rows and columns in a dataframe: `.loc` and `.iloc`.\n", + "\n", + "`.loc` selects rows and columns by label/name\n", + "\n", + "`.iloc` selects rows and columns by integer position (starting at 0)\n", + "\n", + "Example: using `.loc` to select the rows labeled 2 through 4 (unlike Python slices, `.loc` label slices include both endpoints) and filtering the columns to just Title, Director and Year" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "id": "_H7tY4X8ttPA", + "outputId": "85648979-5889-4331-c0b5-4f55c4fe69ca", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Title Director Year\n", + "2 Toy Story 2 John Lasseter 1999\n", + "3 Monsters, Inc. Pete Docter 2001\n", + "4 Finding Nemo Andrew Stanton 2003" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TitleDirectorYear
2Toy Story 2John Lasseter1999
3Monsters, Inc.Pete Docter2001
4Finding NemoAndrew Stanton2003
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Toy Story 2\",\n \"Monsters, Inc.\",\n \"Finding Nemo\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"John Lasseter\",\n \"Pete Docter\",\n \"Andrew Stanton\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1999,\n \"max\": 2003,\n \"num_unique_values\": 3,\n \"samples\": [\n 1999,\n 2001,\n 2003\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 21 + } + ], + "source": [ + "movies_df.loc[2:4, ['Title','Director','Year'] ]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0qwe2OwyttPA" + }, + "source": [ + "##### ==================================================================================================\n", + "\n", + "Similarly, we can select the same columns with `.iloc` by passing the integer positions 1, 2 and 3, which correspond to Title, Director and Year respectively. Here `:` selects every row, as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "3rPyj7J1ttPA", + "outputId": "68a08106-5da8-4d74-9c48-cf0924545c85", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 488 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Title Director Year\n", + "0 Toy Story John Lasseter 1995\n", + "1 A Bug's Life John Lasseter 1998\n", + "2 Toy Story 2 John Lasseter 1999\n", + "3 Monsters, Inc. 
Pete Docter 2001\n", + "4 Finding Nemo Andrew Stanton 2003\n", + "5 The Incredibles Brad Bird 2004\n", + "6 Cars John Lasseter 2006\n", + "7 Ratatouille Brad Bird 2007\n", + "8 WALL-E Andrew Stanton 2008\n", + "9 Up Pete Docter 2009\n", + "10 Toy Story 3 Lee Unkrich 2010\n", + "11 Cars 2 John Lasseter 2011\n", + "12 Brave Brenda Chapman 2012\n", + "13 Monsters University Dan Scanlon 2013" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TitleDirectorYear
0Toy StoryJohn Lasseter1995
1A Bug's LifeJohn Lasseter1998
2Toy Story 2John Lasseter1999
3Monsters, Inc.Pete Docter2001
4Finding NemoAndrew Stanton2003
5The IncrediblesBrad Bird2004
6CarsJohn Lasseter2006
7RatatouilleBrad Bird2007
8WALL-EAndrew Stanton2008
9UpPete Docter2009
10Toy Story 3Lee Unkrich2010
11Cars 2John Lasseter2011
12BraveBrenda Chapman2012
13Monsters UniversityDan Scanlon2013
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 14,\n \"fields\": [\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14,\n \"samples\": [\n \"Up\",\n \"Cars 2\",\n \"Toy Story\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"John Lasseter\",\n \"Pete Docter\",\n \"Brenda Chapman\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1995,\n \"max\": 2013,\n \"num_unique_values\": 14,\n \"samples\": [\n 2009,\n 2011,\n 1995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 22 + } + ], + "source": [ + "movies_df.iloc[ :, [1,2,3] ]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OigUAB8ottPB" + }, + "source": [ + "#### The `.iloc` code above is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM Movies\n", + "```\n", + "\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "VrxiA9oittPB", + "outputId": "3010b408-ed54-4c6c-ce92-97434951c9a5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Title Director Year\n", + "0 Toy Story John Lasseter 1995\n", + "1 A Bug's Life John Lasseter 1998\n", + "2 Toy Story 2 John Lasseter 1999" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TitleDirectorYear
0Toy StoryJohn Lasseter1995
1A Bug's LifeJohn Lasseter1998
2Toy Story 2John Lasseter1999
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Toy Story\",\n \"A Bug's Life\",\n \"Toy Story 2\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"John Lasseter\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1995,\n \"max\": 1999,\n \"num_unique_values\": 3,\n \"samples\": [\n 1995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 23 + } + ], + "source": [ + "movies_df.iloc[0:3,[1,2,3]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W8Rpe-SBttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM Movies\n", + "LIMIT 3\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "JYZXjJ7zttPB", + "outputId": "96fc2c14-aed4-47cf-dee7-f1084670fb09", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Title Director Year\n", + "2 Toy Story 2 John Lasseter 1999\n", + "3 Monsters, Inc. Pete Docter 2001\n", + "4 Finding Nemo Andrew Stanton 2003" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TitleDirectorYear
2Toy Story 2John Lasseter1999
3Monsters, Inc.Pete Docter2001
4Finding NemoAndrew Stanton2003
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Toy Story 2\",\n \"Monsters, Inc.\",\n \"Finding Nemo\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"John Lasseter\",\n \"Pete Docter\",\n \"Andrew Stanton\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1999,\n \"max\": 2003,\n \"num_unique_values\": 3,\n \"samples\": [\n 1999,\n 2001,\n 2003\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 24 + } + ], + "source": [ + "movies_df.iloc[2:5, [1,2,3]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xuF4sFtRttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM Movies\n", + "LIMIT 3\n", + "OFFSET 2\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "source": [ + "The `value_counts()` method returns how many times each unique value appears in a given `Series`/column, sorted from most to least frequent. 
For example, let's look at the number of entries each Director has in `movies_df`:" + ], + "metadata": { + "id": "qoAct0MgZq2Y" + } + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "mAaANUmittPB", + "outputId": "785fa941-046c-4201-d01d-37b6f1c6decc", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Director\n", + "John Lasseter 5\n", + "Pete Docter 2\n", + "Andrew Stanton 2\n", + "Brad Bird 2\n", + "Lee Unkrich 1\n", + "Brenda Chapman 1\n", + "Dan Scanlon 1\n", + "Name: count, dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 25 + } + ], + "source": [ + "movies_df.loc[:,'Director'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UUKE7FJkttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT Director, COUNT(*)\n", + "FROM Movies\n", + "GROUP BY Director\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "source": [ + "##### ==================================================================================================\n", + "\n", + "We can use the `mean()` method to help us find the average of a column or group of columns." 
+ ], + "metadata": { + "id": "dqwCWeGUdDOO" + } + }, + { + "cell_type": "code", + "source": [ + "movies_df.loc[:, 'Length_minutes'].mean()" + ], + "metadata": { + "id": "85Yx7Q8MdXDp", + "outputId": "5bbd0b42-99a8-450c-f306-42d051eefce0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": 26, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "104.0" + ] + }, + "metadata": {}, + "execution_count": 26 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT AVG(Length_minutes)\n", + "FROM Movies\n", + "```" + ], + "metadata": { + "id": "hxFxYRlVgy8D" + } + }, + { + "cell_type": "markdown", + "source": [ + "Using the `groupby()` method, we can perform operations that are similar to the `GROUP BY` clause in SQL.\n", + "\n", + "For example, let's get the average `Length_minutes` by `Director` to see the average number of minutes for each Director's movies:" + ], + "metadata": { + "id": "CDzWn6ZYhdjl" + } + }, + { + "cell_type": "code", + "source": [ + "movies_df.loc[:, ['Director', 'Length_minutes']].groupby('Director').mean()" + ], + "metadata": { + "id": "1Pc8Bk75ePoi", + "outputId": "8c744dc8-2701-4111-82a7-3a6a0d28848d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + } + }, + "execution_count": 27, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Length_minutes\n", + "Director \n", + "Andrew Stanton 105.5\n", + "Brad Bird 115.5\n", + "Brenda Chapman 102.0\n", + "Dan Scanlon 110.0\n", + "John Lasseter 101.2\n", + "Lee Unkrich 103.0\n", + "Pete Docter 96.5" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Length_minutes
Director
Andrew Stanton105.5
Brad Bird115.5
Brenda Chapman102.0
Dan Scanlon110.0
John Lasseter101.2
Lee Unkrich103.0
Pete Docter96.5
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 7,\n \"fields\": [\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Andrew Stanton\",\n \"Brad Bird\",\n \"Lee Unkrich\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6.257642945877883,\n \"min\": 96.5,\n \"max\": 115.5,\n \"num_unique_values\": 7,\n \"samples\": [\n 105.5,\n 115.5,\n 103.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 27 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT Director, AVG(Length_minutes) AS Length_minutes\n", + "FROM Movies\n", + "GROUP BY Director\n", + "```" + ], + "metadata": { + "id": "jbDuTSGwiCmq" + } + }, + { + "cell_type": "markdown", + "source": [ + "##### ==================================================================================================\n", + "### Filtering Data\n", + "Applying a comparison operator to a column produces a boolean mask that we can use to keep only the rows meeting our conditions.\n", + "\n", + "Example: Suppose we want to return only the movies that are longer than 100 minutes."
+ ], + "metadata": { + "id": "cKaf4n4ycypo" + } + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "Zgl74_zjttPB", + "outputId": "77bbc64a-a1a2-4e8a-973a-0dcfb927e8da", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 363 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Title Director Year Length_minutes\n", + "4 5 Finding Nemo Andrew Stanton 2003 107\n", + "5 6 The Incredibles Brad Bird 2004 116\n", + "6 7 Cars John Lasseter 2006 117\n", + "7 8 Ratatouille Brad Bird 2007 115\n", + "8 9 WALL-E Andrew Stanton 2008 104\n", + "9 10 Up Pete Docter 2009 101\n", + "10 11 Toy Story 3 Lee Unkrich 2010 103\n", + "11 12 Cars 2 John Lasseter 2011 120\n", + "12 13 Brave Brenda Chapman 2012 102\n", + "13 14 Monsters University Dan Scanlon 2013 110" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdTitleDirectorYearLength_minutes
45Finding NemoAndrew Stanton2003107
56The IncrediblesBrad Bird2004116
67CarsJohn Lasseter2006117
78RatatouilleBrad Bird2007115
89WALL-EAndrew Stanton2008104
910UpPete Docter2009101
1011Toy Story 3Lee Unkrich2010103
1112Cars 2John Lasseter2011120
1213BraveBrenda Chapman2012102
1314Monsters UniversityDan Scanlon2013110
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3,\n \"min\": 5,\n \"max\": 14,\n \"num_unique_values\": 10,\n \"samples\": [\n 13,\n 6,\n 10\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Brave\",\n \"The Incredibles\",\n \"Up\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Andrew Stanton\",\n \"Brad Bird\",\n \"Brenda Chapman\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3,\n \"min\": 2003,\n \"max\": 2013,\n \"num_unique_values\": 10,\n \"samples\": [\n 2012,\n 2004,\n 2009\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 7,\n \"min\": 101,\n \"max\": 120,\n \"num_unique_values\": 10,\n \"samples\": [\n 102,\n 116,\n 101\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 28 + } + ], + "source": [ + "# Create the filter\n", + "movie_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n", + "# Use the filter in the `.loc` selector\n", + "movies_df.loc[movie_filter, :]\n", + "\n", + "# An example showing everything in a single step\n", + "movies_df.loc[movies_df.loc[:, \"Length_minutes\"] > 100, :]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1RAY_qWtttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "WHERE 
Length_minutes > 100\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "#### Multiple Conditional Filtering\n", + "\n", + "Suppose we want to return movie information only if the movie is longer than 100 minutes and was released before the year 2005. We combine the two conditions with `&`:" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "id": "Dp1-vQ3mttPB", + "outputId": "5ee97816-cc89-4bb1-e7fc-a222915c26e8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Title Director Year Length_minutes\n", + "4 5 Finding Nemo Andrew Stanton 2003 107\n", + "5 6 The Incredibles Brad Bird 2004 116" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdTitleDirectorYearLength_minutes
45Finding NemoAndrew Stanton2003107
56The IncrediblesBrad Bird2004116
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 5,\n \"max\": 6,\n \"num_unique_values\": 2,\n \"samples\": [\n 6,\n 5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"The Incredibles\",\n \"Finding Nemo\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Brad Bird\",\n \"Andrew Stanton\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 2003,\n \"max\": 2004,\n \"num_unique_values\": 2,\n \"samples\": [\n 2004,\n 2003\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 107,\n \"max\": 116,\n \"num_unique_values\": 2,\n \"samples\": [\n 116,\n 107\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 29 + } + ], + "source": [ + "movie_len_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n", + "movie_year_filter = movies_df.loc[:, \"Year\"] < 2005\n", + "\n", + "movies_df.loc[(movie_len_filter) & (movie_year_filter), :]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lQksNrTkttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "WHERE Length_minutes > 100\n", + "AND Year < 2005\n", + "```\n", + "##### 
==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GVTfOPhottPB" + }, + "source": [ + "##### ==================================================================================================\n", + "### Sorting Data\n", + "The `sort_values()` method sorts values in ascending order by default. To sort in descending order, pass `ascending=False`.\n", + "\n", + "Calling `.reset_index(drop=True)` afterwards replaces the reordered index with a fresh 0-based index and discards the old one." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "id": "KFQjjjOxttPC", + "outputId": "17dbeed8-b178-496d-8f22-c6932dbde3dc", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 A Bug's Life\n", + "1 Brave\n", + "2 Cars\n", + "3 Cars 2\n", + "4 Finding Nemo\n", + "5 Monsters University\n", + "6 Monsters, Inc.\n", + "7 Ratatouille\n", + "8 The Incredibles\n", + "9 Toy Story\n", + "10 Toy Story 2\n", + "11 Toy Story 3\n", + "12 Up\n", + "13 WALL-E\n", + "Name: Title, dtype: object" + ] + }, + "metadata": {}, + "execution_count": 30 + } + ], + "source": [ + "movies_df.loc[:,\"Title\"].sort_values().reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1fEi_PBfttPC" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title\n", + "FROM Movies\n", + "ORDER BY Title\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "Sort the entire dataframe by a single column:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "id": "stgM1BXxttPC", + "outputId": "58c62ab4-0d0f-4485-8d96-7366c97a9be1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 488 
+ } + }, + "outputs": [ + { + "output_type": 
"execute_result", + "data": { + "text/plain": [ + " Id Title Director Year Length_minutes\n", + "0 2 A Bug's Life John Lasseter 1998 95\n", + "1 13 Brave Brenda Chapman 2012 102\n", + "2 7 Cars John Lasseter 2006 117\n", + "3 12 Cars 2 John Lasseter 2011 120\n", + "4 5 Finding Nemo Andrew Stanton 2003 107\n", + "5 14 Monsters University Dan Scanlon 2013 110\n", + "6 4 Monsters, Inc. Pete Docter 2001 92\n", + "7 8 Ratatouille Brad Bird 2007 115\n", + "8 6 The Incredibles Brad Bird 2004 116\n", + "9 1 Toy Story John Lasseter 1995 81\n", + "10 3 Toy Story 2 John Lasseter 1999 93\n", + "11 11 Toy Story 3 Lee Unkrich 2010 103\n", + "12 10 Up Pete Docter 2009 101\n", + "13 9 WALL-E Andrew Stanton 2008 104" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 14,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4,\n \"min\": 1,\n \"max\": 14,\n \"num_unique_values\": 14,\n \"samples\": [\n 1,\n 11,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14,\n \"samples\": [\n \"Toy Story\",\n \"Toy Story 3\",\n \"A Bug's Life\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"John Lasseter\",\n \"Brenda Chapman\",\n \"Brad Bird\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1995,\n \"max\": 2013,\n \"num_unique_values\": 14,\n \"samples\": [\n 1995,\n 2010,\n 1998\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11,\n \"min\": 81,\n \"max\": 120,\n \"num_unique_values\": 14,\n \"samples\": [\n 81,\n 103,\n 95\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 31 + } + ], + "source": [ + "movies_df.sort_values(\"Title\").reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V5j8FDwuttPC" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "ORDER BY Title\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "We can also sort using multiple columns.\n", + "Example: We can sort by Director first, then 
within each Director, sort the Title of the films." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "id": "q6UqfJacttPC", + "outputId": "ee97eac3-eb25-491c-d574-1860e8af3a32", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 488 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Title Director Year Length_minutes\n", + "0 9 WALL-E Andrew Stanton 2008 104\n", + "1 5 Finding Nemo Andrew Stanton 2003 107\n", + "2 6 The Incredibles Brad Bird 2004 116\n", + "3 8 Ratatouille Brad Bird 2007 115\n", + "4 13 Brave Brenda Chapman 2012 102\n", + "5 14 Monsters University Dan Scanlon 2013 110\n", + "6 3 Toy Story 2 John Lasseter 1999 93\n", + "7 1 Toy Story John Lasseter 1995 81\n", + "8 12 Cars 2 John Lasseter 2011 120\n", + "9 7 Cars John Lasseter 2006 117\n", + "10 2 A Bug's Life John Lasseter 1998 95\n", + "11 11 Toy Story 3 Lee Unkrich 2010 103\n", + "12 10 Up Pete Docter 2009 101\n", + "13 4 Monsters, Inc. Pete Docter 2001 92" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 14,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4,\n \"min\": 1,\n \"max\": 14,\n \"num_unique_values\": 14,\n \"samples\": [\n 7,\n 11,\n 9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14,\n \"samples\": [\n \"Cars\",\n \"Toy Story 3\",\n \"WALL-E\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Andrew Stanton\",\n \"Brad Bird\",\n \"Lee Unkrich\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1995,\n \"max\": 2013,\n \"num_unique_values\": 14,\n \"samples\": [\n 2006,\n 2010,\n 2008\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11,\n \"min\": 81,\n \"max\": 120,\n \"num_unique_values\": 14,\n \"samples\": [\n 117,\n 103,\n 104\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 32 + } + ], + "source": [ + "movies_df.sort_values([\"Director\",\"Title\"], ascending=[True, False]).reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "ORDER BY\n", + " Director ASC,\n", + " Title DESC\n", + "```" + ], + "metadata": { + "id": "5wlURoWy2eYC" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AWFTUYNVttPC" + }, + "source": [ + "#####
==================================================================================================\n", + "### Merging DataFrames\n", + "\n", + "In pandas, the `pd.concat` function combines dataframes, either stacked one on top of another or placed side by side.\n", + "\n", + "But first let us introduce a new dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "id": "3C3P14EvttPC" + }, + "outputs": [], + "source": [ + "other_movies_df = pd.read_csv(\"Other_Movies.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "id": "fjEx1V8vttPC", + "outputId": "e66e4d1c-e519-4505-f814-642c854a96c6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Title Director Year \\\n", + "0 15 The Fast and the Furious Rob Cohen 2001 \n", + "1 16 A Beautiful Mind Ron Howard 2001 \n", + "2 17 Good Will Hunting Gus Van Sant 1997 \n", + "3 18 Shang-Chi and the Legend of the Ten Rings Destin Daniel Cretton 2021 \n", + "4 19 The Martian Ridley Scott 2015 \n", + "\n", + " Length_minutes \n", + "0 106 \n", + "1 135 \n", + "2 126 \n", + "3 132 \n", + "4 141 " + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "other_movies_df", + "summary": "{\n \"name\": \"other_movies_df\",\n \"rows\": 6,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 15,\n \"max\": 20,\n \"num_unique_values\": 6,\n \"samples\": [\n 15,\n 16,\n 20\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 6,\n \"samples\": [\n \"The Fast and the Furious\",\n \"A Beautiful Mind\",\n \"Fast Five\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 6,\n \"samples\": [\n \"Rob Cohen\",\n \"Ron Howard\",\n \"Justin Lin\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 9,\n \"min\": 1997,\n \"max\": 2021,\n \"num_unique_values\": 5,\n \"samples\": [\n 1997,\n 2011,\n 2021\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 106,\n \"max\": 141,\n \"num_unique_values\": 6,\n \"samples\": [\n 106,\n 135,\n 130\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 34 + } + ], + "source": [ + "other_movies_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DiyckWV1ttPC" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's combine the two dataframes, `movies_df` and `other_movies_df`, using the `pd.concat` function and call the new dataframe `all_movies_df`." + ] + }, + { + "cell_type": "code", + "execution_count": 35, +
"metadata": { + "id": "pjvZ8wGFttPC" + }, + "outputs": [], + "source": [ + "all_movies_df = pd.concat([movies_df,other_movies_df]).reset_index(drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "id": "DwGVaXWxttPC", + "outputId": "60acd983-0c9b-4cb9-e46b-6102b2b8a5a2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 645 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Title Director \\\n", + "0 1 Toy Story John Lasseter \n", + "1 2 A Bug's Life John Lasseter \n", + "2 3 Toy Story 2 John Lasseter \n", + "3 4 Monsters, Inc. Pete Docter \n", + "4 5 Finding Nemo Andrew Stanton \n", + "5 6 The Incredibles Brad Bird \n", + "6 7 Cars John Lasseter \n", + "7 8 Ratatouille Brad Bird \n", + "8 9 WALL-E Andrew Stanton \n", + "9 10 Up Pete Docter \n", + "10 11 Toy Story 3 Lee Unkrich \n", + "11 12 Cars 2 John Lasseter \n", + "12 13 Brave Brenda Chapman \n", + "13 14 Monsters University Dan Scanlon \n", + "14 15 The Fast and the Furious Rob Cohen \n", + "15 16 A Beautiful Mind Ron Howard \n", + "16 17 Good Will Hunting Gus Van Sant \n", + "17 18 Shang-Chi and the Legend of the Ten Rings Destin Daniel Cretton \n", + "18 19 The Martian Ridley Scott \n", + "\n", + " Year Length_minutes \n", + "0 1995 81 \n", + "1 1998 95 \n", + "2 1999 93 \n", + "3 2001 92 \n", + "4 2003 107 \n", + "5 2004 116 \n", + "6 2006 117 \n", + "7 2007 115 \n", + "8 2008 104 \n", + "9 2009 101 \n", + "10 2010 103 \n", + "11 2011 120 \n", + "12 2012 102 \n", + "13 2013 110 \n", + "14 2001 106 \n", + "15 2001 135 \n", + "16 1997 126 \n", + "17 2021 132 \n", + "18 2015 141 " + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "all_movies_df", + "summary": "{\n \"name\": \"all_movies_df\",\n \"rows\": 20,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1,\n \"max\": 20,\n \"num_unique_values\": 20,\n \"samples\": [\n 1,\n 18,\n 16\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 20,\n \"samples\": [\n \"Toy Story\",\n \"Shang-Chi and the Legend of the Ten Rings\",\n \"A Beautiful Mind\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 13,\n \"samples\": [\n \"Ridley Scott\",\n \"Gus Van Sant\",\n \"John Lasseter\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 1995,\n \"max\": 2021,\n \"num_unique_values\": 17,\n \"samples\": [\n 1995,\n 1998,\n 2004\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15,\n \"min\": 81,\n \"max\": 141,\n \"num_unique_values\": 20,\n \"samples\": [\n 81,\n 132,\n 135\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 36 + } + ], + "source": [ + "all_movies_df.head(-1) # head(-1) returns every row except the last one" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uwtTP015ttPD" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's introduce another dataframe that holds the score each movie received." + ] + }, + { + "cell_type": "code", + "execution_count":
37, + "metadata": { + "id": "wntotCmBttPD" + }, + "outputs": [], + "source": [ + "scores_df = pd.read_csv(\"Movie_Scores.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "id": "9xeFCBz5ttPD", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "7ad7b5c2-8c43-4a86-bb98-4613fcf88b90" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Score\n", + "0 8.3\n", + "1 7.2\n", + "2 7.9\n", + "3 8.1\n", + "4 8.2" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "scores_df", + "summary": "{\n \"name\": \"scores_df\",\n \"rows\": 20,\n \"fields\": [\n {\n \"column\": \"Score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.6190825129938148,\n \"min\": 6.2,\n \"max\": 8.4,\n \"num_unique_values\": 12,\n \"samples\": [\n 6.8,\n 7.3,\n 8.3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 38 + } + ], + "source": [ + "scores_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Yarl7-KPttPD" + }, + "source": [ + "##### ==================================================================================================\n", + "Now we can combine the two dataframes side by side" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "id": "W2zOhxPcttPD" + }, + "outputs": [], + "source": [ + "movies_and_scores_df = pd.concat([all_movies_df,scores_df], axis = \"columns\").reset_index(drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "id": "VBMdQiRettPD", + "outputId": "8f1d854c-e1c0-4a8b-fe3e-a7cb1317b69b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 645 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Title Director \\\n", + "0 1 Toy Story John Lasseter \n", + "1 2 A Bug's Life John Lasseter \n", + "2 3 Toy Story 2 John Lasseter \n", + "3 4 Monsters, Inc. 
Pete Docter \n", + "4 5 Finding Nemo Andrew Stanton \n", + "5 6 The Incredibles Brad Bird \n", + "6 7 Cars John Lasseter \n", + "7 8 Ratatouille Brad Bird \n", + "8 9 WALL-E Andrew Stanton \n", + "9 10 Up Pete Docter \n", + "10 11 Toy Story 3 Lee Unkrich \n", + "11 12 Cars 2 John Lasseter \n", + "12 13 Brave Brenda Chapman \n", + "13 14 Monsters University Dan Scanlon \n", + "14 15 The Fast and the Furious Rob Cohen \n", + "15 16 A Beautiful Mind Ron Howard \n", + "16 17 Good Will Hunting Gus Van Sant \n", + "17 18 Shang-Chi and the Legend of the Ten Rings Destin Daniel Cretton \n", + "18 19 The Martian Ridley Scott \n", + "\n", + " Year Length_minutes Score \n", + "0 1995 81 8.3 \n", + "1 1998 95 7.2 \n", + "2 1999 93 7.9 \n", + "3 2001 92 8.1 \n", + "4 2003 107 8.2 \n", + "5 2004 116 8.0 \n", + "6 2006 117 7.2 \n", + "7 2007 115 8.1 \n", + "8 2008 104 8.4 \n", + "9 2009 101 8.3 \n", + "10 2010 103 8.3 \n", + "11 2011 120 6.2 \n", + "12 2012 102 7.1 \n", + "13 2013 110 7.3 \n", + "14 2001 106 6.8 \n", + "15 2001 135 8.2 \n", + "16 1997 126 8.3 \n", + "17 2021 132 7.4 \n", + "18 2015 141 8.0 " + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "movies_and_scores_df", + "summary": "{\n \"name\": \"movies_and_scores_df\",\n \"rows\": 20,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1,\n \"max\": 20,\n \"num_unique_values\": 20,\n \"samples\": [\n 1,\n 18,\n 16\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 20,\n \"samples\": [\n \"Toy Story\",\n \"Shang-Chi and the Legend of the Ten Rings\",\n \"A Beautiful Mind\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 13,\n \"samples\": [\n \"Ridley Scott\",\n \"Gus Van Sant\",\n \"John Lasseter\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 1995,\n \"max\": 2021,\n \"num_unique_values\": 17,\n \"samples\": [\n 1995,\n 1998,\n 2004\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15,\n \"min\": 81,\n \"max\": 141,\n \"num_unique_values\": 20,\n \"samples\": [\n 81,\n 132,\n 135\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.6190825129938148,\n \"min\": 6.2,\n \"max\": 8.4,\n \"num_unique_values\": 12,\n \"samples\": [\n 6.8,\n 7.3,\n 8.3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 40 + } + ], + "source": [ + "movies_and_scores_df.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xoQ3fB8SttPD" + }, + "source": [ + "##### 
==================================================================================================\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "id": "ZITCa9qYttPD" + }, + "outputs": [], + "source": [ + "managers = pd.DataFrame(\n", + " {\n", + " 'Id': [1,2,3],\n", + " 'Manager':['Chris','Maritza','Jamin']\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "id": "MX9spfihttPD", + "outputId": "9968b9fc-74e5-418d-e643-c6fad9e9d0ee", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Manager\n", + "0 1 Chris\n", + "1 2 Maritza\n", + "2 3 Jamin" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "managers", + "summary": "{\n \"name\": \"managers\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 1,\n 2,\n 3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Chris\",\n \"Maritza\",\n \"Jamin\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 42 + } + ], + "source": [ + "managers.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "id": "LtD1zQJuttPD" + }, + "outputs": [], + "source": [ + "captains = pd.DataFrame(\n", + " {\n", + " 'Id': [2,2,3,1,1,3,2,3,1,1,3,3],\n", + " 'Captain':['Derick','Shane','Becca','Anna','Christine','Melody','Tom','Eric','Naomi','Angelina','Nancy','Richard'],\n", + " 'Title':['C','C','SC','C','SC','C','C','SC','C','EC','C','SC']\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "id": "0xOS-Bu4ttPD", + "outputId": "f5d8f2f8-ae4f-4146-9691-9d7beee65172", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Captain Title\n", + "0 2 Derick C\n", + "1 2 Shane C\n", + "2 3 Becca SC\n", + "3 1 Anna C\n", + "4 1 Christine SC\n", + "5 3 Melody C\n", + "6 2 Tom C\n", + "7 3 Eric SC\n", + "8 1 Naomi C\n", + "9 1 Angelina EC\n", + "10 3 Nancy C\n", + "11 3 Richard SC" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "captains", + "summary": "{\n \"name\": \"captains\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Nancy\",\n \"Angelina\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 44 + } + ], + "source": [ + "captains.head(12)" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "id": "3c478mlSttPD", + "outputId": "7acc418b-9454-4c7a-972f-6808c928937c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 394 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Captain Title Manager\n", + "0 2 Derick C Maritza\n", + "1 2 Shane C Maritza\n", + "2 2 Tom C Maritza\n", + "3 3 Becca SC Jamin\n", + "4 3 Melody C Jamin\n", + "5 3 Eric SC Jamin\n", + "6 3 Nancy C Jamin\n", + "7 3 Richard SC Jamin\n", + "8 1 Anna C Chris\n", + "9 1 Christine SC Chris\n", + "10 1 Naomi C Chris" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "roster", + "summary": "{\n \"name\": \"roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Naomi\",\n \"Christine\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Maritza\",\n \"Jamin\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 45 + } + ], + "source": [ + "roster = captains.merge(managers,left_on = 'Id', right_on = 'Id')\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "code", + "source": [ + "test_roster = pd.concat([captains, managers], axis=\"columns\").reset_index(drop=True)\n", + "test_roster.head()" + ], + "metadata": { + "id": "rJ9K1BPGXxzE", + "outputId": "45a65eb1-94ea-4fa3-d442-535fa5346b7f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "execution_count": 46, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Id Captain Title Id Manager\n", + "0 2 Derick C 1.0 Chris\n", + "1 2 Shane C 2.0 Maritza\n", + "2 3 Becca SC 3.0 Jamin\n", + "3 1 Anna C NaN NaN\n", + "4 1 Christine SC NaN NaN" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdCaptainTitleIdManager
02DerickC1.0Chris
12ShaneC2.0Maritza
23BeccaSC3.0Jamin
31AnnaCNaNNaN
41ChristineSCNaNNaN
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "test_roster", + "summary": "{\n \"name\": \"test_roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Nancy\",\n \"Angelina\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.0,\n \"min\": 1.0,\n \"max\": 3.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 1.0,\n 2.0,\n 3.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Chris\",\n \"Maritza\",\n \"Jamin\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 46 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2hro1V6XttPD" + }, + "source": [ + "#### The `captains.merge(managers, ...)` call above is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Captains\n", + "INNER JOIN Managers\n", + "ON Captains.Id = Managers.Id\n", + "```\n", + "##### ==================================================================================================\n", + "## Column Renaming\n", + "\n", + "We can use the `.rename()` method in pandas to relabel the columns of a dataframe. 
Suppose we want to rename `Id` to `Cohort` and `Title` to `Captain Rank`." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "id": "-nELWGyPttPD", + "outputId": "c9ba481a-05ed-44b8-e5ec-bb4a7ba4ff6f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 394 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Cohort Captain Captain Rank Manager\n", + "0 2 Derick C Maritza\n", + "1 2 Shane C Maritza\n", + "2 2 Tom C Maritza\n", + "3 3 Becca SC Jamin\n", + "4 3 Melody C Jamin\n", + "5 3 Eric SC Jamin\n", + "6 3 Nancy C Jamin\n", + "7 3 Richard SC Jamin\n", + "8 1 Anna C Chris\n", + "9 1 Christine SC Chris\n", + "10 1 Naomi C Chris" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CohortCaptainCaptain RankManager
02DerickCMaritza
12ShaneCMaritza
22TomCMaritza
33BeccaSCJamin
43MelodyCJamin
53EricSCJamin
63NancyCJamin
73RichardSCJamin
81AnnaCChris
91ChristineSCChris
101NaomiCChris
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "roster", + "summary": "{\n \"name\": \"roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Cohort\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Naomi\",\n \"Christine\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain Rank\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Maritza\",\n \"Jamin\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 47 + } + ], + "source": [ + "roster = roster.rename(columns = {\"Id\":\"Cohort\",\"Title\":\"Captain Rank\"})\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "code", + "source": [ + "roster.columns" + ], + "metadata": { + "id": "zrKc31ukYw3i", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "1a3fa73c-5ba1-4f43-8ee8-9d71f3a52e9d" + }, + "execution_count": 48, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Index(['Cohort', 'Captain', 'Captain Rank', 'Manager'], dtype='object')" + ] + }, + "metadata": {}, + "execution_count": 48 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N5H7HamottPE" + }, + "source": [ + "If we would like to rename all columns at once, we can assign a list to `.columns`; the list must contain exactly one name 
per existing column." + ] + }, + { + "cell_type": "code", + 
"execution_count": 49, + "metadata": { + "id": "eCTo6V3UttPE", + "outputId": "82b0a175-8201-4f65-b516-11e23e42f1e6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 394 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Cohort Num Capt Capt Rank Manager\n", + "0 2 Derick C Maritza\n", + "1 2 Shane C Maritza\n", + "2 2 Tom C Maritza\n", + "3 3 Becca SC Jamin\n", + "4 3 Melody C Jamin\n", + "5 3 Eric SC Jamin\n", + "6 3 Nancy C Jamin\n", + "7 3 Richard SC Jamin\n", + "8 1 Anna C Chris\n", + "9 1 Christine SC Chris\n", + "10 1 Naomi C Chris" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Cohort NumCaptCapt RankManager
02DerickCMaritza
12ShaneCMaritza
22TomCMaritza
33BeccaSCJamin
43MelodyCJamin
53EricSCJamin
63NancyCJamin
73RichardSCJamin
81AnnaCChris
91ChristineSCChris
101NaomiCChris
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "roster", + "summary": "{\n \"name\": \"roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Cohort Num\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Capt\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Naomi\",\n \"Christine\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Capt Rank\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Maritza\",\n \"Jamin\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 49 + } + ], + "source": [ + "roster.columns = ['Cohort Num','Capt','Capt Rank','Manager']\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wp5nb6skttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "### Drop Columns" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "id": "doploOj9ttPE", + "outputId": "90c4c6be-da1d-4ab6-b19e-504fa0db6cdf", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 394 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Capt Capt Rank Manager\n", + "0 Derick C Maritza\n", + "1 Shane C Maritza\n", + "2 Tom C Maritza\n", + "3 Becca SC Jamin\n", + "4 Melody C Jamin\n", + "5 Eric SC 
Jamin\n", + "6 Nancy C Jamin\n", + "7 Richard SC Jamin\n", + "8 Anna C Chris\n", + "9 Christine SC Chris\n", + "10 Naomi C Chris" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CaptCapt RankManager
0DerickCMaritza
1ShaneCMaritza
2TomCMaritza
3BeccaSCJamin
4MelodyCJamin
5EricSCJamin
6NancyCJamin
7RichardSCJamin
8AnnaCChris
9ChristineSCChris
10NaomiCChris
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "roster", + "summary": "{\n \"name\": \"roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Capt\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Naomi\",\n \"Christine\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Capt Rank\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Maritza\",\n \"Jamin\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 50 + } + ], + "source": [ + "#df.drop([\"column1\",\"column2\"], axis = \"columns\")\n", + "\n", + "roster = roster.drop(\"Cohort Num\", axis = \"columns\")\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u-SBCempttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "### Missing Values / NaN Values\n", + "\n", + "There are various types of missing data. Most commonly, the data was never collected, was handled incorrectly, or was entered as a null value.\n", + "\n", + "Missing data can be remedied by the following:\n", + "1. Removing the row with the missing/NaN values\n", + "2. Removing the column with the missing/NaN values\n", + "3. Filling in the missing data\n", + "\n", + "For simplicity, we will only focus on the first two methods. The third method can be carried out by interpolating values using information from other rows or columns of the dataset. This process requires knowledge outside of the scope of this lesson. 
There are entire studies dedicated to this topic alone." + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "id": "yV1RhRDNttPE", + "outputId": "79c2c6cb-d707-4643-a129-8badfc1d9267", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Company Location Year\n", + "0 ALFA ROMEO Italy 1910.0\n", + "1 Aston Martin Lagonda Ltd UK 1913.0\n", + "2 Audi Germany 1909.0\n", + "3 BMW Germany 1916.0\n", + "4 Chevrolet NaN NaN\n", + "5 Dodge USA 1900.0\n", + "6 Ferrari Italy 1947.0\n", + "7 Honda Japan 1948.0\n", + "8 Jaguar UK 1922.0\n", + "9 Lamborghini Italy 1963.0\n", + "10 MAZDA Japan 1920.0\n", + "11 McLaren UK 1985.0\n", + "12 Mercedes-Benz Germany NaN\n", + "13 NISSAN Japan 1933.0\n", + "14 Pagani Automobili S.p.A. Italy 1992.0\n", + "15 Porsche Germany 1931.0\n", + "16 FIAT NaN 1899.0\n", + "17 Mini Germany 1969.0\n", + "18 SCION NaN NaN\n", + "19 Subaru Japan 1953.0\n", + "20 Bentley UK 1919.0\n", + "21 Buick USA 1899.0\n", + "22 Ford USA 1903.0\n", + "23 HYUNDAI MOTOR COMPANY South Korea 1967.0\n", + "24 LEXUS Japan 1989.0\n", + "25 MASERATI Italy 1914.0\n", + "26 Roush NaN NaN\n", + "27 Volkswagen Germany 1937.0\n", + "28 Acura Japan 1986.0\n", + "29 Cadillac USA 1902.0\n", + "30 INFINITI Hong Kong 1989.0\n", + "31 KIA MOTORS CORPORATION South Korea 1944.0\n", + "32 Mitsubishi Motors Corporation Japan 1970.0\n", + "33 Rolls-Royce Motor Cars Limited UK 1904.0\n", + "34 TOYOTA Japan 1937.0\n", + "35 Volvo Sweden 1927.0\n", + "36 Chrysler USA 1925.0\n", + "37 Lincoln USA 1917.0\n", + "38 GMC USA 1911.0\n", + "39 RAM USA NaN\n", + "40 CHEVROLET USA 1911.0\n", + "41 Jeep USA 1943.0" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CompanyLocationYear
0ALFA ROMEOItaly1910.0
1Aston Martin Lagonda LtdUK1913.0
2AudiGermany1909.0
3BMWGermany1916.0
4ChevroletNaNNaN
5DodgeUSA1900.0
6FerrariItaly1947.0
7HondaJapan1948.0
8JaguarUK1922.0
9LamborghiniItaly1963.0
10MAZDAJapan1920.0
11McLarenUK1985.0
12Mercedes-BenzGermanyNaN
13NISSANJapan1933.0
14Pagani Automobili S.p.A.Italy1992.0
15PorscheGermany1931.0
16FIATNaN1899.0
17MiniGermany1969.0
18SCIONNaNNaN
19SubaruJapan1953.0
20BentleyUK1919.0
21BuickUSA1899.0
22FordUSA1903.0
23HYUNDAI MOTOR COMPANYSouth Korea1967.0
24LEXUSJapan1989.0
25MASERATIItaly1914.0
26RoushNaNNaN
27VolkswagenGermany1937.0
28AcuraJapan1986.0
29CadillacUSA1902.0
30INFINITIHong Kong1989.0
31KIA MOTORS CORPORATIONSouth Korea1944.0
32Mitsubishi Motors CorporationJapan1970.0
33Rolls-Royce Motor Cars LimitedUK1904.0
34TOYOTAJapan1937.0
35VolvoSweden1927.0
36ChryslerUSA1925.0
37LincolnUSA1917.0
38GMCUSA1911.0
39RAMUSANaN
40CHEVROLETUSA1911.0
41JeepUSA1943.0
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "cars", + "summary": "{\n \"name\": \"cars\",\n \"rows\": 43,\n \"fields\": [\n {\n \"column\": \"Company\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 43,\n \"samples\": [\n \"Lincoln\",\n \"LEXUS\",\n \"MASERATI\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 8,\n \"samples\": [\n \"UK\",\n \"South Korea\",\n \"Italy\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 28.994530312650504,\n \"min\": 1899.0,\n \"max\": 1992.0,\n \"num_unique_values\": 33,\n \"samples\": [\n 1911.0,\n 1969.0,\n 1970.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 51 + } + ], + "source": [ + "cars = pd.read_csv(\"Cars.csv\")\n", + "cars.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zT2P3Mq9ttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's sort the companies in alphabetical order. Note that `sort_values` is case-sensitive by default, so all-uppercase names such as CHEVROLET sort ahead of mixed-case names such as Cadillac." + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "id": "W4xHJumrttPE", + "outputId": "65bc3d88-1a90-4123-f2f7-195781fe8b4c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Company Location Year\n", + "0 ALFA ROMEO Italy 1910.0\n", + "1 Acura Japan 1986.0\n", + "2 Aston Martin Lagonda Ltd UK 1913.0\n", + "3 Audi Germany 1909.0\n", + "4 BMW Germany 1916.0\n", + "5 Bentley UK 1919.0\n", + "6 Buick USA 1899.0\n", + "7 CHEVROLET USA 1911.0\n", + "8 Cadillac USA 1902.0\n", + "9 Chevrolet NaN NaN\n", + "10 Chrysler 
USA 1925.0\n", + "11 Dodge USA 1900.0\n", + "12 FIAT NaN 1899.0\n", + "13 Ferrari Italy 1947.0\n", + "14 Ford USA 1903.0\n", + "15 GMC USA 1911.0\n", + "16 HYUNDAI MOTOR COMPANY South Korea 1967.0\n", + "17 Honda Japan 1948.0\n", + "18 INFINITI Hong Kong 1989.0\n", + "19 Jaguar UK 1922.0\n", + "20 Jeep USA 1943.0\n", + "21 KIA MOTORS CORPORATION South Korea 1944.0\n", + "22 LEXUS Japan 1989.0\n", + "23 Lamborghini Italy 1963.0\n", + "24 Land Rover UK 1948.0\n", + "25 Lincoln USA 1917.0\n", + "26 MASERATI Italy 1914.0\n", + "27 MAZDA Japan 1920.0\n", + "28 McLaren UK 1985.0\n", + "29 Mercedes-Benz Germany NaN\n", + "30 Mini Germany 1969.0\n", + "31 Mitsubishi Motors Corporation Japan 1970.0\n", + "32 NISSAN Japan 1933.0\n", + "33 Pagani Automobili S.p.A. Italy 1992.0\n", + "34 Porsche Germany 1931.0\n", + "35 RAM USA NaN\n", + "36 Rolls-Royce Motor Cars Limited UK 1904.0\n", + "37 Roush NaN NaN\n", + "38 SCION NaN NaN\n", + "39 Subaru Japan 1953.0\n", + "40 TOYOTA Japan 1937.0\n", + "41 Volkswagen Germany 1937.0" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CompanyLocationYear
0ALFA ROMEOItaly1910.0
1AcuraJapan1986.0
2Aston Martin Lagonda LtdUK1913.0
3AudiGermany1909.0
4BMWGermany1916.0
5BentleyUK1919.0
6BuickUSA1899.0
7CHEVROLETUSA1911.0
8CadillacUSA1902.0
9ChevroletNaNNaN
10ChryslerUSA1925.0
11DodgeUSA1900.0
12FIATNaN1899.0
13FerrariItaly1947.0
14FordUSA1903.0
15GMCUSA1911.0
16HYUNDAI MOTOR COMPANYSouth Korea1967.0
17HondaJapan1948.0
18INFINITIHong Kong1989.0
19JaguarUK1922.0
20JeepUSA1943.0
21KIA MOTORS CORPORATIONSouth Korea1944.0
22LEXUSJapan1989.0
23LamborghiniItaly1963.0
24Land RoverUK1948.0
25LincolnUSA1917.0
26MASERATIItaly1914.0
27MAZDAJapan1920.0
28McLarenUK1985.0
29Mercedes-BenzGermanyNaN
30MiniGermany1969.0
31Mitsubishi Motors CorporationJapan1970.0
32NISSANJapan1933.0
33Pagani Automobili S.p.A.Italy1992.0
34PorscheGermany1931.0
35RAMUSANaN
36Rolls-Royce Motor Cars LimitedUK1904.0
37RoushNaNNaN
38SCIONNaNNaN
39SubaruJapan1953.0
40TOYOTAJapan1937.0
41VolkswagenGermany1937.0
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "cars", + "summary": "{\n \"name\": \"cars\",\n \"rows\": 43,\n \"fields\": [\n {\n \"column\": \"Company\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 43,\n \"samples\": [\n \"Roush\",\n \"Land Rover\",\n \"Lincoln\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 8,\n \"samples\": [\n \"Japan\",\n \"South Korea\",\n \"Italy\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 28.994530312650504,\n \"min\": 1899.0,\n \"max\": 1992.0,\n \"num_unique_values\": 33,\n \"samples\": [\n 1937.0,\n 1989.0,\n 1933.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 52 + } + ], + "source": [ + "cars = cars.sort_values(\"Company\").reset_index(drop=True)\n", + "cars.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tFLokzyvttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's check how many entries are missing. As we can see, 4 entries are missing in the Location column and 5 entries are missing in the Year column." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "id": "q33En74DttPE", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "1090d9fa-fd70-42da-b4a1-33722d5bc001" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Company 0\n", + "Location 4\n", + "Year 5\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 53 + } + ], + "source": [ + "cars.isna().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cGKKoYTpttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Let's inspect all the rows with a missing Location entry" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "id": "dRmT5-TvttPE", + "outputId": "484a1c7f-376a-4503-d01b-403d7f8787cc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 175 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Company Location Year\n", + "9 Chevrolet NaN NaN\n", + "12 FIAT NaN 1899.0\n", + "37 Roush NaN NaN\n", + "38 SCION NaN NaN" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CompanyLocationYear
9ChevroletNaNNaN
12FIATNaN1899.0
37RoushNaNNaN
38SCIONNaNNaN
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "repr_error": "0" + } + }, + "metadata": {}, + "execution_count": 54 + } + ], + "source": [ + "missing_car_info_filter = cars.loc[:, \"Location\"].isna()\n", + "cars.loc[missing_car_info_filter, :]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvT4mHb5ttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Let's inspect all the rows with a missing Year entry" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": { + "id": "64m7mIH0ttPF", + "outputId": "acc3f119-af77-4999-a7b8-4f3ea273eda4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Company Location Year\n", + "9 Chevrolet NaN NaN\n", + "29 Mercedes-Benz Germany NaN\n", + "35 RAM USA NaN\n", + "37 Roush NaN NaN\n", + "38 SCION NaN NaN" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"cars\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Company\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Mercedes-Benz\",\n \"SCION\",\n \"RAM\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"USA\",\n \"Germany\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": null,\n \"max\": null,\n \"num_unique_values\": 0,\n \"samples\": [],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 55 + } + ], + "source": [ + "cars.loc[cars.loc[:, \"Year\"].isna(), :]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kl_wIVHCttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "For simplicity we can fill all the missing Location entries with \"NA\"" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "id": "k0NDuUMhttPF" + }, + "outputs": [], + "source": [ + "cars.loc[:, \"Location\"] = cars.loc[:, \"Location\"].fillna(value=\"NA\")" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": { + "id": "KXC45KtFttPF", + "outputId": "56beee41-cfc0-40c3-ccb1-88e933218f96", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Company 0\n", + "Location 0\n", + "Year 5\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 57 + } + ], + "source": [ + "cars.head(-1)\n", + "cars.isna().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": 
{ + "id": "nB__rivattPF" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's drop any rows with missing entries" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": { + "id": "Ft1XTWOGttPF", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "87130665-7134-4fa2-f824-b316d40b6c4e" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Company 0\n", + "Location 0\n", + "Year 0\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 58 + } + ], + "source": [ + "cars = cars.dropna().reset_index(drop=True)\n", + "cars.head(-1)\n", + "cars.isna().sum()" + ] + }, + { + "cell_type": "code", + "source": [ + "cars.info()" + ], + "metadata": { + "id": "MoUYqyzSeK9n", + "outputId": "cfde1a76-289e-42aa-c7ad-2ae30ba36514", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": 59, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "<class 'pandas.core.frame.DataFrame'>\n", + "RangeIndex: 38 entries, 0 to 37\n", + "Data columns (total 3 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 Company 38 non-null object \n", + " 1 Location 38 non-null object \n", + " 2 Year 38 non-null float64\n", + "dtypes: float64(1), object(2)\n", + "memory usage: 1.0+ KB\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lbaxA3zrttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "## Summary\n", + "\n", + "- `pandas` provides `Series` and `DataFrame` classes that work with tabular-style data.\n", + "- `.loc` selects rows and columns based on their index labels.\n", + "- `.iloc` selects rows and columns based on their integer positions.\n", + "- Calling a DataFrame method with `axis=\"rows\"` or `axis=0` causes it to operate along the row axis.\n", 
+ "- Calling a DataFrame method with `axis=\"columns\"` or `axis=1` causes it to operate along the columns axis.\n", + "- `.sort_values()` reorders rows by the values in one or more columns\n", + "- `.rename()` can rename columns in DataFrames. You can also rewrite the `.columns` attribute to rename columns.\n", + "- `.isna()` detects missing values\n", + "- `.fillna()` replaces NULL values with a specified value\n", + "- `.dropna()` removes all rows that contain NULL values\n", + "- `.merge()` combines two DataFrames by joining them on shared columns or indexes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k8I532SRttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 1:\n", + "Create a new DataFrame called `cohort` by inner joining the two DataFrames `roster` and `exam`" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": { + "id": "t3G0XkmittPF" + }, + "outputs": [], + "source": [ + "# Setup data for the exercises\n", + "roster = pd.DataFrame(\n", + "{\n", + " \"Name\" : [\"James\",\"Greg\",\"Patrick\",\"Chris\",\"Cynthia\",\"Chandra\", \"John\",\"David\",\"Tiffany\",\"Peter\"],\n", + " \"Id\": [\"1\",\"2\",\"3\",\"4\",\"5\",\"6\",\"7\",\"8\",\"9\",\"10\"],\n", + "\n", + "})\n", + "\n", + "exam = pd.DataFrame({\n", + " \"Exam 1\" : [89,78,81,90,93,76,66,87,42,55],\n", + " \"Exam 2\" : [100,74,20,86,60,76,92,97,88,90],\n", + " \"Exam 3\" : [85,60,90,90,88,76,55,None,64,79],\n", + " \"Id\" : [\"4\",\"2\",\"1\",\"7\",\"5\",\"10\",\"6\",\"3\",\"9\",\"8\"]\n", + "})\n", + "\n", + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rMRopV2FttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 2:\n", + "Fill all missing grades with 0." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": { + "id": "DA8C74TLttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6a_N8JEEttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 3:\n", + "Update James's Exam 2 score from 20 to 85 and update Tiffany's Exam 1 score from 42 to 88" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": { + "id": "Mzka5Y3_ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DkuO3tIPttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 4:\n", + "\n", + "Create a series called `Average` that takes the average of Exam 1, Exam 2 and Exam 3 scores" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": { + "id": "QWXVYTj0ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "96hRtey9ttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 5:\n", + "Incorporate the newly created `Average` column into the DataFrame `cohort`" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": { + "id": "wEysGqYyttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QHk1lZiDttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 6:\n", + "Sort the dataset by Average in **descending** order and reset the index of the DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": { + "id": 
"9azLYMHPttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yyWST6gUttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 7:\n", + "Drop columns Exam 1, 2, and 3" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": { + "id": "PgD_KqCkttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHg6AiIYttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 8:\n", + "Select only the top 3 **Name, Id and Average only*** based on highest Average grade" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": { + "id": "MmHW3ki9ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + }, + "colab": { + "provenance": [], + "include_colab_link": true + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file From e0cfc18883bd207b4e53deb7782f17e2992c4010 Mon Sep 17 00:00:00 2001 From: DerikVo Date: Sat, 27 Apr 2024 15:50:19 -0700 Subject: [PATCH 4/4] Delete Python_Lesson_3.ipynb file added to wrong repo --- Python_Lesson_3.ipynb | 12054 ---------------------------------------- 1 file changed, 12054 deletions(-) delete mode 100644 Python_Lesson_3.ipynb diff --git a/Python_Lesson_3.ipynb b/Python_Lesson_3.ipynb deleted file mode 100644 index 19d23ca..0000000 --- a/Python_Lesson_3.ipynb +++ /dev/null @@ 
-1,12054 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "view-in-github", - "colab_type": "text" - }, - "source": [ - "\"Open" - ] - }, - { - "cell_type": "markdown", - "source": [ - "# Must run this to prepare all the CSVs to be loaded\n", - "The current notebook in Google Classroom assumes COOPers already have the Data in the content folder, but that is not the case.\n", - "I added a script to download the files into the content folder. Unhide the script for more explanation.\n", - "\n", - "\n", - "If there are any questions Please contact \"[Derik Vo](https://www.linkedin.com/in/derik-vo/)\" on slack or LinkedIn\n", - "\n", - "Last updated:\n", - "20240427" - ], - "metadata": { - "id": "twfwKM5msch0" - } - }, - { - "cell_type": "code", - "source": [ - "\"\"\"\n", - "This script is designed to download all the appropriate CSV files to utilize this notebook.\n", - "The notebook assumes the program participants already have the files in the virtual working directory,\n", - "directory but that is not the case.\n", - "\n", - "This notebook simply downloads Google Sheets files as CSVs in the current working directory, content.\n", - "Which is the pathway used through out the notebook.\n", - "\"\"\"\n", - "\n", - "import importlib\n", - "# Check if gdown package is installed, so you can re-run this without downloading the package each time\n", - "if importlib.util.find_spec(\"gdown\") is None:\n", - " !pip install gdown\n", - "\n", - "# File ID of google sheet URLs e.g. the thing between .../d/... 
and /edit?...\n", - "file_id = ['1Jk5SlYcHsdklUgxMdV_4AVomxKcgOduiMw0TO-XKBJs',\n", - " '1krLyXgH0ZhMsrh5fHOTGb5qRzQQhbZM8QAV00R9uTlU',\n", - " '1cjJJ8_b4Du8AQaY0QB5a2DE-eaLa1kQ5WmeYS32nlQ8',\n", - " '15HXIdDfVSrkGt_ef7UyCrDRnDbfZICVYTtlTk9YmuSM']\n", - "\n", - "# Specify the file names, index must match file_id\n", - "output_file = ['Cars.csv',\n", - " 'Movie_Scores.csv',\n", - " 'Other_Movies.csv',\n", - " 'Pixar_Movies.csv'\n", - " ]\n", - "# Download using gdown, specify the export as csv\n", - "for i in range(0, len(file_id)):\n", - " download_url = f'https://docs.google.com/spreadsheets/d/{file_id[i]}/export?format=csv'\n", - " !gdown {download_url} --output {output_file[i]}" - ], - "metadata": { - "id": "luc8W-65kZr7", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "91c2998e-ae89-45cf-e0ec-0a5525b10916" - }, - "execution_count": 69, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "/usr/local/lib/python3.10/dist-packages/gdown/parse_url.py:48: UserWarning: You specified a Google Drive link that is not the correct link to download a file. You might want to try `--fuzzy` option or the following url: https://drive.google.com/uc?id=None\n", - " warnings.warn(\n", - "Downloading...\n", - "From: https://docs.google.com/spreadsheets/d/1Jk5SlYcHsdklUgxMdV_4AVomxKcgOduiMw0TO-XKBJs/export?format=csv\n", - "To: /content/Cars.csv\n", - "918B [00:00, 2.35MB/s]\n", - "/usr/local/lib/python3.10/dist-packages/gdown/parse_url.py:48: UserWarning: You specified a Google Drive link that is not the correct link to download a file. 
You might want to try `--fuzzy` option or the following url: https://drive.google.com/uc?id=None\n", - " warnings.warn(\n", - "Downloading...\n", - "From: https://docs.google.com/spreadsheets/d/1krLyXgH0ZhMsrh5fHOTGb5qRzQQhbZM8QAV00R9uTlU/export?format=csv\n", - "To: /content/Movie_Scores.csv\n", - "101B [00:00, 319kB/s]\n", - "/usr/local/lib/python3.10/dist-packages/gdown/parse_url.py:48: UserWarning: You specified a Google Drive link that is not the correct link to download a file. You might want to try `--fuzzy` option or the following url: https://drive.google.com/uc?id=None\n", - " warnings.warn(\n", - "Downloading...\n", - "From: https://docs.google.com/spreadsheets/d/1cjJJ8_b4Du8AQaY0QB5a2DE-eaLa1kQ5WmeYS32nlQ8/export?format=csv\n", - "To: /content/Other_Movies.csv\n", - "319B [00:00, 940kB/s]\n", - "/usr/local/lib/python3.10/dist-packages/gdown/parse_url.py:48: UserWarning: You specified a Google Drive link that is not the correct link to download a file. You might want to try `--fuzzy` option or the following url: https://drive.google.com/uc?id=None\n", - " warnings.warn(\n", - "Downloading...\n", - "From: https://docs.google.com/spreadsheets/d/15HXIdDfVSrkGt_ef7UyCrDRnDbfZICVYTtlTk9YmuSM/export?format=csv\n", - "To: /content/Pixar_Movies.csv\n", - "542B [00:00, 896kB/s]\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gPZMglnlttO8" - }, - "source": [ - "\n", - "\n", - "# Basic Elementary Exploratory Data Analysis using Pandas\n", - "\n", - "_Author: Christopher Chan_" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cdSXNUkhttO9" - }, - "source": [ - "### Objective\n", - "\n", - "Upon completion of this lesson you should be able to understand the following:\n", - "\n", - "1. Pandas library\n", - "2. Dataframes\n", - "3. Data selection\n", - "4. Data manipulation\n", - "5. Handling of missing data\n", - "\n", - "This is arguably the most important part of analysis. 
This is also referred to as the \"cleaning the data\". Data must be usable for it to a valid analysis. Otherwise it would be garbage in, garbage out." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1c9iQdrTttO-" - }, - "source": [ - "##### ==================================================================================================\n", - "## Data Selection and Inspection\n", - "\n", - "\n", - "### Pandas Library\n", - "\n", - "`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,\n", - "built on top of the Python programming language.\n", - "\n", - "`pandas` data frame can be created by loading the data from the external, existing storage like a database, SQL, or CSV files. But the Pandas Data Frame can also be created from the lists, dictionary, etc. For simplicity, we will use `.csv` files. One of the ways to create a pandas data frame is shown below:\n", - "\n", - "### DataFrames\n", - "A data frame is a structured representation of data.\n", - "##### ==================================================================================================" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "0v8znxdlttO-" - }, - "outputs": [], - "source": [ - "import pandas as pd" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "Q-fUhePhttO-" - }, - "outputs": [], - "source": [ - "data = {'Name':['John', 'Tiffany', 'Chris', 'Winnie', 'David'],\n", - " 'Age': [24, 23, 22, 19, 10],\n", - " 'Salary': [60000,120000,1000000,75000,80000]}\n", - "\n", - "people_df = pd.DataFrame(data)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_fi2q8cuttO-" - }, - "source": [ - "##### ==================================================================================================\n", - "We can call on the dataframe we labeled `people_df` by applying the `.head()` function that would display the first five rows of the dataframe. 
Similarly, the `.tail()` function would return the last five rows of a dataframe." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "Muo9Gs_xttO_", - "outputId": "05ac9d53-000a-4d97-a299-b8a766c78531", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 206 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Name Age Salary\n", - "0 John 24 60000\n", - "1 Tiffany 23 120000\n", - "2 Chris 22 1000000\n", - "3 Winnie 19 75000\n", - "4 David 10 80000" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "people_df", - "summary": "{\n \"name\": \"people_df\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tiffany\",\n \"David\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 10,\n \"max\": 24,\n \"num_unique_values\": 5,\n \"samples\": [\n 23,\n 10,\n 22\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Salary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 410359,\n \"min\": 60000,\n \"max\": 1000000,\n \"num_unique_values\": 5,\n \"samples\": [\n 120000,\n 80000,\n 1000000\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 11 - } - ], - "source": [ - "people_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZtcR6GJ2ttO_" - }, - "source": [ - "##### ==================================================================================================\n", - "We can also modify the number of rows we would like to display by inserting the integer into the `.head()` function.\n", - "\n", - "Example: Select the first 2 rows of the dataframe" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "id": "k-lZOSuGttO_", - "outputId": "16624df3-6df0-400c-9708-eebc63c390da", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 112 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Name Age Salary\n", - "0 John 24 60000\n", - "1 Tiffany 23 120000" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "people_df", - "summary": "{\n \"name\": \"people_df\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tiffany\",\n \"David\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 10,\n \"max\": 24,\n \"num_unique_values\": 5,\n \"samples\": [\n 23,\n 10,\n 22\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Salary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 410359,\n \"min\": 60000,\n \"max\": 1000000,\n \"num_unique_values\": 5,\n \"samples\": [\n 120000,\n 80000,\n 1000000\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 12 - } - ], - "source": [ - "people_df.head(2)" - ] - }, - { - "cell_type": "markdown", - "source": [ - "Example: Select the last 2 rows of the dataframe" - ], - "metadata": { - "id": "C_UTdB6IWiG_" - } - }, - { - "cell_type": "code", - "source": [ - "people_df.tail(2)" - ], - "metadata": { - "id": "tfNVLk_tWU52", - "outputId": "8a683d3c-3ff9-4e9b-9bf3-fda099413cb4", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 112 - } - }, - "execution_count": 13, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Name Age Salary\n", - "3 Winnie 19 75000\n", - "4 David 10 80000" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"people_df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"David\",\n \"Winnie\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 10,\n \"max\": 19,\n \"num_unique_values\": 2,\n \"samples\": [\n 10,\n 19\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Salary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3535,\n \"min\": 75000,\n \"max\": 80000,\n \"num_unique_values\": 2,\n \"samples\": [\n 80000,\n 75000\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 13 - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Q8nzMIscttO_" - }, - "source": [ - "##### ==================================================================================================\n", - "Another way to create a dataframe would be to load an existing CSV file by using the `read_csv` function built into `pandas` onto the desired file path as shown below:\n", - "\n", - "`dataframe = pd.read_csv(\".../file_location/file_name.csv\")`" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "id": "IbdNygPgttO_" - }, - "outputs": [], - "source": [ - "movies_df = pd.read_csv(\"/content/Pixar_Movies.csv\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4r7sM285ttPA" - }, - "source": [ - "##### ==================================================================================================" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "id": "u-4v6IISttPA", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - }, - "outputId": 
"9125fe6b-77a3-4251-c969-c0947f5d7e01" - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Title Director Year Length_minutes\n", - "0 1 Toy Story John Lasseter 1995 81\n", - "1 2 A Bug's Life John Lasseter 1998 95\n", - "2 3 Toy Story 2 John Lasseter 1999 93\n", - "3 4 Monsters, Inc. Pete Docter 2001 92\n", - "4 5 Finding Nemo Andrew Stanton 2003 107\n", - "5 6 The Incredibles Brad Bird 2004 116\n", - "6 7 Cars John Lasseter 2006 117\n", - "7 8 Ratatouille Brad Bird 2007 115\n", - "8 9 WALL-E Andrew Stanton 2008 104\n", - "9 10 Up Pete Docter 2009 101" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "movies_df", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 14,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4,\n \"min\": 1,\n \"max\": 14,\n \"num_unique_values\": 14,\n \"samples\": [\n 10,\n 12,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14,\n \"samples\": [\n \"Up\",\n \"Cars 2\",\n \"Toy Story\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"John Lasseter\",\n \"Pete Docter\",\n \"Brenda Chapman\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1995,\n \"max\": 2013,\n \"num_unique_values\": 14,\n \"samples\": [\n 2009,\n 2011,\n 1995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11,\n \"min\": 81,\n \"max\": 120,\n \"num_unique_values\": 14,\n \"samples\": [\n 101,\n 120,\n 81\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 15 - } - ], - "source": [ - "movies_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5JISZYZHttPA" - }, - "source": [ - "#### The above python code is equivalent to SQL's\n", - "\n", - "```sql\n", - "SELECT *\n", - "FROM Movies\n", - "LIMIT 10\n", - "```\n", - "##### ==================================================================================================" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uB-nxMi8ttPA" - }, - "source": [ - "`.shape` shows the 
number of rows and columns" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": { - "id": "tuyP3rLKttPA", - "outputId": "e814c6e6-08db-42a7-ed51-acebc0e11df0", - "colab": { - "base_uri": "https://localhost:8080/" - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "(14, 5)" - ] - }, - "metadata": {}, - "execution_count": 16 - } - ], - "source": [ - "movies_df.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sCnSo2HBttPA" - }, - "source": [ - "This shows us how many rows and columns are in the entire dataframe, 14 rows, 5 columns\n", - "\n", - "##### ==================================================================================================" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zMhYN4Z1ttPA" - }, - "source": [ - "`.dtypes` shows the data types" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "id": "UOImUY3attPA", - "outputId": "205c0d6d-3e13-40c1-b872-d7107d20b6cf", - "colab": { - "base_uri": "https://localhost:8080/" - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "Id int64\n", - "Title object\n", - "Director object\n", - "Year int64\n", - "Length_minutes int64\n", - "dtype: object" - ] - }, - "metadata": {}, - "execution_count": 17 - } - ], - "source": [ - "movies_df.dtypes" - ] - }, - { - "cell_type": "markdown", - "source": [ - "`.describe()` can be used to help summarize numerical data in our dataframe. It summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values." 
- ], - "metadata": { - "id": "244Ux8N_XWmo" - } - }, - { - "cell_type": "code", - "source": [ - "movies_df.describe()" - ], - "metadata": { - "id": "-MPTc3c6YjMp", - "outputId": "5e72034a-69a7-4a62-8966-b3c92d2684ce", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 300 - } - }, - "execution_count": 18, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Year Length_minutes\n", - "count 14.0000 14.000000 14.000000\n", - "mean 7.5000 2005.428571 104.000000\n", - "std 4.1833 5.598273 11.176899\n", - "min 1.0000 1995.000000 81.000000\n", - "25% 4.2500 2001.500000 96.500000\n", - "50% 7.5000 2006.500000 103.500000\n", - "75% 10.7500 2009.750000 113.750000\n", - "max 14.0000 2013.000000 120.000000" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.745054915279273,\n \"min\": 1.0,\n \"max\": 14.0,\n \"num_unique_values\": 6,\n \"samples\": [\n 14.0,\n 7.5,\n 10.75\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 923.7077335637653,\n \"min\": 5.598272889084574,\n \"max\": 2013.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 2005.4285714285713,\n 2006.5,\n 14.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 43.476190740521304,\n \"min\": 11.176899253508413,\n \"max\": 120.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 104.0,\n 103.5,\n 14.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 18 - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "You may optionally include categorical data in the `describe` method like so:" - ], - "metadata": { - "id": "uNXiyTCWYwl8" - } - }, - { - "cell_type": "code", - "source": [ - "movies_df.describe(include='all')" - ], - "metadata": { - "id": "ITxGRSqQY8oX", - "outputId": "2aff9745-e64d-401c-fc80-5e53cfaf81c0", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 394 - } - }, - "execution_count": 19, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Title Director Year Length_minutes\n", - "count 14.0000 14 14 14.000000 14.000000\n", - "unique NaN 14 7 NaN NaN\n", - "top NaN Toy Story John Lasseter NaN NaN\n", - "freq NaN 1 5 NaN NaN\n", - "mean 7.5000 NaN NaN 2005.428571 104.000000\n", - "std 4.1833 NaN NaN 5.598273 11.176899\n", - "min 1.0000 NaN NaN 1995.000000 81.000000\n", - "25% 
4.2500 NaN NaN 2001.500000 96.500000\n", - "50% 7.5000 NaN NaN 2006.500000 103.500000\n", - "75% 10.7500 NaN NaN 2009.750000 113.750000\n", - "max 14.0000 NaN NaN 2013.000000 120.000000" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdTitleDirectorYearLength_minutes
count14.0000141414.00000014.000000
uniqueNaN147NaNNaN
topNaNToy StoryJohn LasseterNaNNaN
freqNaN15NaNNaN
mean7.5000NaNNaN2005.428571104.000000
std4.1833NaNNaN5.59827311.176899
min1.0000NaNNaN1995.00000081.000000
25%4.2500NaNNaN2001.50000096.500000
50%7.5000NaNNaN2006.500000103.500000
75%10.7500NaNNaN2009.750000113.750000
max14.0000NaNNaN2013.000000120.000000
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 11,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.745054915279273,\n \"min\": 1.0,\n \"max\": 14.0,\n \"num_unique_values\": 6,\n \"samples\": [\n 14.0,\n 7.5,\n 10.75\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"14\",\n \"Toy Story\",\n \"1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 4,\n \"samples\": [\n 7,\n \"5\",\n \"14\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 923.7077335637653,\n \"min\": 5.598272889084574,\n \"max\": 2013.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 2005.4285714285713,\n 2006.5,\n 14.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 43.476190740521304,\n \"min\": 11.176899253508413,\n \"max\": 120.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 104.0,\n 103.5,\n 14.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 19 - } - ] - }, - { - "cell_type": "code", - "source": [ - "movies_df.info()" - ], - "metadata": { - "id": "s3f7AzxJv_L5", - "outputId": "9eded4ed-a6ae-48d0-ff9e-33be78411b2b", - "colab": { - "base_uri": "https://localhost:8080/" - } - }, - "execution_count": 20, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "\n", - "RangeIndex: 14 entries, 0 to 13\n", - "Data columns (total 5 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ 
-------------- ----- \n", - " 0 Id 14 non-null int64 \n", - " 1 Title 14 non-null object\n", - " 2 Director 14 non-null object\n", - " 3 Year 14 non-null int64 \n", - " 4 Length_minutes 14 non-null int64 \n", - "dtypes: int64(3), object(2)\n", - "memory usage: 688.0+ bytes\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QHrP16DdttPA" - }, - "source": [ - "##### ==================================================================================================\n", - "\n", - "### Row and Column Selection\n", - "\n", - "There are two common ways to select rows and columns in a dataframe: `.loc` and `.iloc`\n", - "\n", - "`.loc` selects rows and columns by label/name\n", - "\n", - "`.iloc` selects rows and columns by integer position\n", - "\n", - "Example: using `.loc` to select the rows labeled 2 through 4 (a `.loc` slice includes both endpoints) and filtering the columns to just Title, Director and Year" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": { - "id": "_H7tY4X8ttPA", - "outputId": "85648979-5889-4331-c0b5-4f55c4fe69ca", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 143 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Title Director Year\n", - "2 Toy Story 2 John Lasseter 1999\n", - "3 Monsters, Inc. Pete Docter 2001\n", - "4 Finding Nemo Andrew Stanton 2003" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
TitleDirectorYear
2Toy Story 2John Lasseter1999
3Monsters, Inc.Pete Docter2001
4Finding NemoAndrew Stanton2003
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Toy Story 2\",\n \"Monsters, Inc.\",\n \"Finding Nemo\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"John Lasseter\",\n \"Pete Docter\",\n \"Andrew Stanton\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1999,\n \"max\": 2003,\n \"num_unique_values\": 3,\n \"samples\": [\n 1999,\n 2001,\n 2003\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 21 - } - ], - "source": [ - "movies_df.loc[2:4, ['Title','Director','Year'] ]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0qwe2OwyttPA" - }, - "source": [ - "##### ==================================================================================================\n", - "\n", - "Similarly, `.iloc` selects the same columns by integer position: positions 1, 2, and 3 correspond to Title, Director and Year respectively. Here `:` selects every row, as shown below:" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": { - "id": "3rPyj7J1ttPA", - "outputId": "68a08106-5da8-4d74-9c48-cf0924545c85", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 488 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Title Director Year\n", - "0 Toy Story John Lasseter 1995\n", - "1 A Bug's Life John Lasseter 1998\n", - "2 Toy Story 2 John Lasseter 1999\n", - "3 Monsters, Inc. 
Pete Docter 2001\n", - "4 Finding Nemo Andrew Stanton 2003\n", - "5 The Incredibles Brad Bird 2004\n", - "6 Cars John Lasseter 2006\n", - "7 Ratatouille Brad Bird 2007\n", - "8 WALL-E Andrew Stanton 2008\n", - "9 Up Pete Docter 2009\n", - "10 Toy Story 3 Lee Unkrich 2010\n", - "11 Cars 2 John Lasseter 2011\n", - "12 Brave Brenda Chapman 2012\n", - "13 Monsters University Dan Scanlon 2013" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
TitleDirectorYear
0Toy StoryJohn Lasseter1995
1A Bug's LifeJohn Lasseter1998
2Toy Story 2John Lasseter1999
3Monsters, Inc.Pete Docter2001
4Finding NemoAndrew Stanton2003
5The IncrediblesBrad Bird2004
6CarsJohn Lasseter2006
7RatatouilleBrad Bird2007
8WALL-EAndrew Stanton2008
9UpPete Docter2009
10Toy Story 3Lee Unkrich2010
11Cars 2John Lasseter2011
12BraveBrenda Chapman2012
13Monsters UniversityDan Scanlon2013
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 14,\n \"fields\": [\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14,\n \"samples\": [\n \"Up\",\n \"Cars 2\",\n \"Toy Story\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"John Lasseter\",\n \"Pete Docter\",\n \"Brenda Chapman\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1995,\n \"max\": 2013,\n \"num_unique_values\": 14,\n \"samples\": [\n 2009,\n 2011,\n 1995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 22 - } - ], - "source": [ - "movies_df.iloc[ :, [1,2,3] ]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OigUAB8ottPB" - }, - "source": [ - "#### The `.iloc` code above is equivalent to SQL's\n", - "\n", - "```sql\n", - "SELECT Title, Director, Year\n", - "FROM Movies\n", - "```\n", - "\n", - "##### ==================================================================================================" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": { - "id": "VrxiA9oittPB", - "outputId": "3010b408-ed54-4c6c-ce92-97434951c9a5", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 143 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Title Director Year\n", - "0 Toy Story John Lasseter 1995\n", - "1 A Bug's Life John Lasseter 1998\n", - "2 Toy Story 2 John Lasseter 1999" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
TitleDirectorYear
0Toy StoryJohn Lasseter1995
1A Bug's LifeJohn Lasseter1998
2Toy Story 2John Lasseter1999
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Toy Story\",\n \"A Bug's Life\",\n \"Toy Story 2\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"John Lasseter\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1995,\n \"max\": 1999,\n \"num_unique_values\": 3,\n \"samples\": [\n 1995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 23 - } - ], - "source": [ - "movies_df.iloc[0:3,[1,2,3]]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "W8Rpe-SBttPB" - }, - "source": [ - "#### The above python code is equivalent to SQL's\n", - "\n", - "```sql\n", - "SELECT Title, Director, Year\n", - "FROM Movies\n", - "LIMIT 3\n", - "```\n", - "##### ==================================================================================================" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": { - "id": "JYZXjJ7zttPB", - "outputId": "96fc2c14-aed4-47cf-dee7-f1084670fb09", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 143 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Title Director Year\n", - "2 Toy Story 2 John Lasseter 1999\n", - "3 Monsters, Inc. Pete Docter 2001\n", - "4 Finding Nemo Andrew Stanton 2003" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
TitleDirectorYear
2Toy Story 2John Lasseter1999
3Monsters, Inc.Pete Docter2001
4Finding NemoAndrew Stanton2003
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Toy Story 2\",\n \"Monsters, Inc.\",\n \"Finding Nemo\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"John Lasseter\",\n \"Pete Docter\",\n \"Andrew Stanton\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1999,\n \"max\": 2003,\n \"num_unique_values\": 3,\n \"samples\": [\n 1999,\n 2001,\n 2003\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 24 - } - ], - "source": [ - "movies_df.iloc[2:5, [1,2,3]]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xuF4sFtRttPB" - }, - "source": [ - "#### The above python code is equivalent to SQL's\n", - "\n", - "```sql\n", - "SELECT Title, Director, Year\n", - "FROM movies\n", - "LIMIT 3\n", - "OFFSET 2\n", - "```\n", - "##### ==================================================================================================" - ] - }, - { - "cell_type": "markdown", - "source": [ - "The `value_counts()` method returns the count of unique values in a given `Series`/column. 
For example, let's look at the number of entries each Director has in `movies_df`:" - ], - "metadata": { - "id": "qoAct0MgZq2Y" - } - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": { - "id": "mAaANUmittPB", - "outputId": "785fa941-046c-4201-d01d-37b6f1c6decc", - "colab": { - "base_uri": "https://localhost:8080/" - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "Director\n", - "John Lasseter 5\n", - "Pete Docter 2\n", - "Andrew Stanton 2\n", - "Brad Bird 2\n", - "Lee Unkrich 1\n", - "Brenda Chapman 1\n", - "Dan Scanlon 1\n", - "Name: count, dtype: int64" - ] - }, - "metadata": {}, - "execution_count": 25 - } - ], - "source": [ - "movies_df.loc[:,'Director'].value_counts()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UUKE7FJkttPB" - }, - "source": [ - "#### The above python code is equivalent to SQL's\n", - "```sql\n", - "SELECT Director, COUNT(*)\n", - "FROM Movies\n", - "GROUP BY Director\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "source": [ - "##### ==================================================================================================\n", - "\n", - "We can use the `mean()` method to help us find the average of a column or group of columns." 
- ], - "metadata": { - "id": "dqwCWeGUdDOO" - } - }, - { - "cell_type": "code", - "source": [ - "movies_df.loc[:, 'Length_minutes'].mean()" - ], - "metadata": { - "id": "85Yx7Q8MdXDp", - "outputId": "5bbd0b42-99a8-450c-f306-42d051eefce0", - "colab": { - "base_uri": "https://localhost:8080/" - } - }, - "execution_count": 26, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "104.0" - ] - }, - "metadata": {}, - "execution_count": 26 - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "#### The above python code is equivalent to SQL's\n", - "```sql\n", - "SELECT AVG(Length_minutes)\n", - "FROM Movies\n", - "```" - ], - "metadata": { - "id": "hxFxYRlVgy8D" - } - }, - { - "cell_type": "markdown", - "source": [ - "Using the `groupby()` method, we can perform operations that are similar to the `GROUP BY` clause in SQL.\n", - "\n", - "For example, let's get the average `Length_minutes` by `Director` to see the average number of minutes for each Director's movies:" - ], - "metadata": { - "id": "CDzWn6ZYhdjl" - } - }, - { - "cell_type": "code", - "source": [ - "movies_df.loc[:, ['Director', 'Length_minutes']].groupby('Director').mean()" - ], - "metadata": { - "id": "1Pc8Bk75ePoi", - "outputId": "8c744dc8-2701-4111-82a7-3a6a0d28848d", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 300 - } - }, - "execution_count": 27, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Length_minutes\n", - "Director \n", - "Andrew Stanton 105.5\n", - "Brad Bird 115.5\n", - "Brenda Chapman 102.0\n", - "Dan Scanlon 110.0\n", - "John Lasseter 101.2\n", - "Lee Unkrich 103.0\n", - "Pete Docter 96.5" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Length_minutes
Director
Andrew Stanton105.5
Brad Bird115.5
Brenda Chapman102.0
Dan Scanlon110.0
John Lasseter101.2
Lee Unkrich103.0
Pete Docter96.5
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 7,\n \"fields\": [\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Andrew Stanton\",\n \"Brad Bird\",\n \"Lee Unkrich\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6.257642945877883,\n \"min\": 96.5,\n \"max\": 115.5,\n \"num_unique_values\": 7,\n \"samples\": [\n 105.5,\n 115.5,\n 103.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 27 - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "#### The above python code is equivalent to SQL's\n", - "```sql\n", - "SELECT Director, AVG(Length_minutes) AS Length_minutes\n", - "FROM Movies\n", - "GROUP BY Director\n", - "```" - ], - "metadata": { - "id": "jbDuTSGwiCmq" - } - }, - { - "cell_type": "markdown", - "source": [ - "##### ==================================================================================================\n", - "### Filtering Data\n", - "Applying a comparison operator to a column produces a boolean filter that keeps only the rows meeting our condition\n", - "\n", - "Example: Suppose we want to return only the movies that are longer than 100 minutes." 
- ], - "metadata": { - "id": "cKaf4n4ycypo" - } - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": { - "id": "Zgl74_zjttPB", - "outputId": "77bbc64a-a1a2-4e8a-973a-0dcfb927e8da", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Title Director Year Length_minutes\n", - "4 5 Finding Nemo Andrew Stanton 2003 107\n", - "5 6 The Incredibles Brad Bird 2004 116\n", - "6 7 Cars John Lasseter 2006 117\n", - "7 8 Ratatouille Brad Bird 2007 115\n", - "8 9 WALL-E Andrew Stanton 2008 104\n", - "9 10 Up Pete Docter 2009 101\n", - "10 11 Toy Story 3 Lee Unkrich 2010 103\n", - "11 12 Cars 2 John Lasseter 2011 120\n", - "12 13 Brave Brenda Chapman 2012 102\n", - "13 14 Monsters University Dan Scanlon 2013 110" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdTitleDirectorYearLength_minutes
45Finding NemoAndrew Stanton2003107
56The IncrediblesBrad Bird2004116
67CarsJohn Lasseter2006117
78RatatouilleBrad Bird2007115
89WALL-EAndrew Stanton2008104
910UpPete Docter2009101
1011Toy Story 3Lee Unkrich2010103
1112Cars 2John Lasseter2011120
1213BraveBrenda Chapman2012102
1314Monsters UniversityDan Scanlon2013110
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3,\n \"min\": 5,\n \"max\": 14,\n \"num_unique_values\": 10,\n \"samples\": [\n 13,\n 6,\n 10\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Brave\",\n \"The Incredibles\",\n \"Up\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Andrew Stanton\",\n \"Brad Bird\",\n \"Brenda Chapman\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3,\n \"min\": 2003,\n \"max\": 2013,\n \"num_unique_values\": 10,\n \"samples\": [\n 2012,\n 2004,\n 2009\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 7,\n \"min\": 101,\n \"max\": 120,\n \"num_unique_values\": 10,\n \"samples\": [\n 102,\n 116,\n 101\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 28 - } - ], - "source": [ - "# Create the filter\n", - "movie_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n", - "# Use the filter in the `.loc` selector\n", - "movies_df.loc[movie_filter, :]\n", - "\n", - "# An example showing everything in a single step\n", - "movies_df.loc[movies_df.loc[:, \"Length_minutes\"] > 100, :]\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1RAY_qWtttPB" - }, - "source": [ - "#### The above python code is equivalent to SQL's\n", - "```sql\n", - "SELECT *\n", - "FROM Movies\n", - "WHERE 
Length_minutes > 100\n", - "```\n", - "##### ==================================================================================================\n", - "\n", - "#### Multiple Conditional Filtering\n", - "\n", - "Suppose we want to return movie information only if the movie is longer than 100 minutes and was released before the year 2005" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": { - "id": "Dp1-vQ3mttPB", - "outputId": "5ee97816-cc89-4bb1-e7fc-a222915c26e8", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 112 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Title Director Year Length_minutes\n", - "4 5 Finding Nemo Andrew Stanton 2003 107\n", - "5 6 The Incredibles Brad Bird 2004 116" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdTitleDirectorYearLength_minutes
45Finding NemoAndrew Stanton2003107
56The IncrediblesBrad Bird2004116
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 5,\n \"max\": 6,\n \"num_unique_values\": 2,\n \"samples\": [\n 6,\n 5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"The Incredibles\",\n \"Finding Nemo\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Brad Bird\",\n \"Andrew Stanton\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 2003,\n \"max\": 2004,\n \"num_unique_values\": 2,\n \"samples\": [\n 2004,\n 2003\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 107,\n \"max\": 116,\n \"num_unique_values\": 2,\n \"samples\": [\n 116,\n 107\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 29 - } - ], - "source": [ - "movie_len_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n", - "movie_year_filter = movies_df.loc[:, \"Year\"] < 2005\n", - "\n", - "movies_df.loc[(movie_len_filter) & (movie_year_filter), :]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lQksNrTkttPB" - }, - "source": [ - "#### The above python code is equivalent to SQL's\n", - "```sql\n", - "SELECT *\n", - "FROM Movies\n", - "WHERE Length_minutes > 100\n", - "AND Year < 2005\n", - "```\n", - "##### 
==================================================================================================" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GVTfOPhottPB" - }, - "source": [ - "##### ==================================================================================================\n", - "### Sorting Data\n", - "The `sort_values()` method sorts values in ascending order by default. To sort in descending order, pass `ascending=False`.\n", - "\n", - "Chaining `.reset_index(drop=True)` renumbers the index from 0 after sorting and discards the old index." - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": { - "id": "KFQjjjOxttPC", - "outputId": "17dbeed8-b178-496d-8f22-c6932dbde3dc", - "colab": { - "base_uri": "https://localhost:8080/" - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "0 A Bug's Life\n", - "1 Brave\n", - "2 Cars\n", - "3 Cars 2\n", - "4 Finding Nemo\n", - "5 Monsters University\n", - "6 Monsters, Inc.\n", - "7 Ratatouille\n", - "8 The Incredibles\n", - "9 Toy Story\n", - "10 Toy Story 2\n", - "11 Toy Story 3\n", - "12 Up\n", - "13 WALL-E\n", - "Name: Title, dtype: object" - ] - }, - "metadata": {}, - "execution_count": 30 - } - ], - "source": [ - "movies_df.loc[:,\"Title\"].sort_values().reset_index(drop=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1fEi_PBfttPC" - }, - "source": [ - "#### The above python code is equivalent to SQL's\n", - "\n", - "```sql\n", - "SELECT Title\n", - "FROM Movies\n", - "ORDER BY Title\n", - "```\n", - "##### ==================================================================================================\n", - "\n", - "Sort the entire dataframe by a single column:" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": { - "id": "stgM1BXxttPC", - "outputId": "58c62ab4-0d0f-4485-8d96-7366c97a9be1", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 488 - } - }, - "outputs": [ - { - 
"output_type": 
"execute_result", - "data": { - "text/plain": [ - " Id Title Director Year Length_minutes\n", - "0 2 A Bug's Life John Lasseter 1998 95\n", - "1 13 Brave Brenda Chapman 2012 102\n", - "2 7 Cars John Lasseter 2006 117\n", - "3 12 Cars 2 John Lasseter 2011 120\n", - "4 5 Finding Nemo Andrew Stanton 2003 107\n", - "5 14 Monsters University Dan Scanlon 2013 110\n", - "6 4 Monsters, Inc. Pete Docter 2001 92\n", - "7 8 Ratatouille Brad Bird 2007 115\n", - "8 6 The Incredibles Brad Bird 2004 116\n", - "9 1 Toy Story John Lasseter 1995 81\n", - "10 3 Toy Story 2 John Lasseter 1999 93\n", - "11 11 Toy Story 3 Lee Unkrich 2010 103\n", - "12 10 Up Pete Docter 2009 101\n", - "13 9 WALL-E Andrew Stanton 2008 104" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdTitleDirectorYearLength_minutes
02A Bug's LifeJohn Lasseter199895
113BraveBrenda Chapman2012102
27CarsJohn Lasseter2006117
312Cars 2John Lasseter2011120
45Finding NemoAndrew Stanton2003107
514Monsters UniversityDan Scanlon2013110
64Monsters, Inc.Pete Docter200192
78RatatouilleBrad Bird2007115
86The IncrediblesBrad Bird2004116
91Toy StoryJohn Lasseter199581
103Toy Story 2John Lasseter199993
1111Toy Story 3Lee Unkrich2010103
1210UpPete Docter2009101
139WALL-EAndrew Stanton2008104
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 14,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4,\n \"min\": 1,\n \"max\": 14,\n \"num_unique_values\": 14,\n \"samples\": [\n 1,\n 11,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14,\n \"samples\": [\n \"Toy Story\",\n \"Toy Story 3\",\n \"A Bug's Life\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"John Lasseter\",\n \"Brenda Chapman\",\n \"Brad Bird\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1995,\n \"max\": 2013,\n \"num_unique_values\": 14,\n \"samples\": [\n 1995,\n 2010,\n 1998\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11,\n \"min\": 81,\n \"max\": 120,\n \"num_unique_values\": 14,\n \"samples\": [\n 81,\n 103,\n 95\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 31 - } - ], - "source": [ - "movies_df.sort_values(\"Title\").reset_index(drop=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "V5j8FDwuttPC" - }, - "source": [ - "#### The above python code is equivalent to SQL's\n", - "```sql\n", - "SELECT *\n", - "FROM Movies\n", - "ORDER BY Title\n", - "```\n", - "##### ==================================================================================================\n", - "\n", - "We can also sort using multiple columns.\n", - "Example: We can sort by Director first, then 
within each Director, sort the Title of the films." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": { - "id": "q6UqfJacttPC", - "outputId": "ee97eac3-eb25-491c-d574-1860e8af3a32", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 488 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Title Director Year Length_minutes\n", - "0 9 WALL-E Andrew Stanton 2008 104\n", - "1 5 Finding Nemo Andrew Stanton 2003 107\n", - "2 6 The Incredibles Brad Bird 2004 116\n", - "3 8 Ratatouille Brad Bird 2007 115\n", - "4 13 Brave Brenda Chapman 2012 102\n", - "5 14 Monsters University Dan Scanlon 2013 110\n", - "6 3 Toy Story 2 John Lasseter 1999 93\n", - "7 1 Toy Story John Lasseter 1995 81\n", - "8 12 Cars 2 John Lasseter 2011 120\n", - "9 7 Cars John Lasseter 2006 117\n", - "10 2 A Bug's Life John Lasseter 1998 95\n", - "11 11 Toy Story 3 Lee Unkrich 2010 103\n", - "12 10 Up Pete Docter 2009 101\n", - "13 4 Monsters, Inc. Pete Docter 2001 92" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdTitleDirectorYearLength_minutes
09WALL-EAndrew Stanton2008104
15Finding NemoAndrew Stanton2003107
26The IncrediblesBrad Bird2004116
38RatatouilleBrad Bird2007115
413BraveBrenda Chapman2012102
514Monsters UniversityDan Scanlon2013110
63Toy Story 2John Lasseter199993
71Toy StoryJohn Lasseter199581
812Cars 2John Lasseter2011120
97CarsJohn Lasseter2006117
102A Bug's LifeJohn Lasseter199895
1111Toy Story 3Lee Unkrich2010103
1210UpPete Docter2009101
134Monsters, Inc.Pete Docter200192
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"movies_df\",\n \"rows\": 14,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4,\n \"min\": 1,\n \"max\": 14,\n \"num_unique_values\": 14,\n \"samples\": [\n 7,\n 11,\n 9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14,\n \"samples\": [\n \"Cars\",\n \"Toy Story 3\",\n \"WALL-E\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Andrew Stanton\",\n \"Brad Bird\",\n \"Lee Unkrich\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1995,\n \"max\": 2013,\n \"num_unique_values\": 14,\n \"samples\": [\n 2006,\n 2010,\n 2008\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11,\n \"min\": 81,\n \"max\": 120,\n \"num_unique_values\": 14,\n \"samples\": [\n 117,\n 103,\n 104\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 32 - } - ], - "source": [ - "movies_df.sort_values([\"Director\",\"Title\"], ascending=[True, False]).reset_index(drop=True)" - ] - }, - { - "cell_type": "markdown", - "source": [ - "```sql\n", - "SELECT *\n", - "FROM Movies\n", - "ORDER BY\n", - " Director ASC,\n", - " Title DESC\n", - "```" - ], - "metadata": { - "id": "5wlURoWy2eYC" - } - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AWFTUYNVttPC" - }, - "source": [ - "##### 
==================================================================================================\n", - "### Merging DataFrames\n", - "\n", - "In pandas, the `pd.concat` function combines dataframes, either stacked one on top of another (the default) or placed side by side.\n", - "\n", - "But first let us introduce a new dataset:" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": { - "id": "3C3P14EvttPC" - }, - "outputs": [], - "source": [ - "other_movies_df = pd.read_csv(\"Other_Movies.csv\")" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": { - "id": "fjEx1V8vttPC", - "outputId": "e66e4d1c-e519-4505-f814-642c854a96c6", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 206 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Title Director Year \\\n", - "0 15 The Fast and the Furious Rob Cohen 2001 \n", - "1 16 A Beautiful Mind Ron Howard 2001 \n", - "2 17 Good Will Hunting Gus Van Sant 1997 \n", - "3 18 Shang-Chi and the Legend of the Ten Rings Destin Daniel Cretton 2021 \n", - "4 19 The Martian Ridley Scott 2015 \n", - "\n", - " Length_minutes \n", - "0 106 \n", - "1 135 \n", - "2 126 \n", - "3 132 \n", - "4 141 " - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdTitleDirectorYearLength_minutes
015The Fast and the FuriousRob Cohen2001106
116A Beautiful MindRon Howard2001135
217Good Will HuntingGus Van Sant1997126
318Shang-Chi and the Legend of the Ten RingsDestin Daniel Cretton2021132
419The MartianRidley Scott2015141
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "other_movies_df", - "summary": "{\n \"name\": \"other_movies_df\",\n \"rows\": 6,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 15,\n \"max\": 20,\n \"num_unique_values\": 6,\n \"samples\": [\n 15,\n 16,\n 20\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 6,\n \"samples\": [\n \"The Fast and the Furious\",\n \"A Beautiful Mind\",\n \"Fast Five\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 6,\n \"samples\": [\n \"Rob Cohen\",\n \"Ron Howard\",\n \"Justin Lin\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 9,\n \"min\": 1997,\n \"max\": 2021,\n \"num_unique_values\": 5,\n \"samples\": [\n 1997,\n 2011,\n 2021\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 106,\n \"max\": 141,\n \"num_unique_values\": 6,\n \"samples\": [\n 106,\n 135,\n 130\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 34 - } - ], - "source": [ - "other_movies_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DiyckWV1ttPC" - }, - "source": [ - "##### ==================================================================================================\n", - "Now let's combine the two dataframes, `movies_df` and `other_movies_df`, using the `pd.concat` function and call the new dataframe `all_movies_df`." - ] - }, - { - "cell_type": "code", - "execution_count": 35, - 
"metadata": { - "id": "pjvZ8wGFttPC" - }, - "outputs": [], - "source": [ - "all_movies_df = pd.concat([movies_df,other_movies_df]).reset_index(drop=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": { - "id": "DwGVaXWxttPC", - "outputId": "60acd983-0c9b-4cb9-e46b-6102b2b8a5a2", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 645 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Title Director \\\n", - "0 1 Toy Story John Lasseter \n", - "1 2 A Bug's Life John Lasseter \n", - "2 3 Toy Story 2 John Lasseter \n", - "3 4 Monsters, Inc. Pete Docter \n", - "4 5 Finding Nemo Andrew Stanton \n", - "5 6 The Incredibles Brad Bird \n", - "6 7 Cars John Lasseter \n", - "7 8 Ratatouille Brad Bird \n", - "8 9 WALL-E Andrew Stanton \n", - "9 10 Up Pete Docter \n", - "10 11 Toy Story 3 Lee Unkrich \n", - "11 12 Cars 2 John Lasseter \n", - "12 13 Brave Brenda Chapman \n", - "13 14 Monsters University Dan Scanlon \n", - "14 15 The Fast and the Furious Rob Cohen \n", - "15 16 A Beautiful Mind Ron Howard \n", - "16 17 Good Will Hunting Gus Van Sant \n", - "17 18 Shang-Chi and the Legend of the Ten Rings Destin Daniel Cretton \n", - "18 19 The Martian Ridley Scott \n", - "\n", - " Year Length_minutes \n", - "0 1995 81 \n", - "1 1998 95 \n", - "2 1999 93 \n", - "3 2001 92 \n", - "4 2003 107 \n", - "5 2004 116 \n", - "6 2006 117 \n", - "7 2007 115 \n", - "8 2008 104 \n", - "9 2009 101 \n", - "10 2010 103 \n", - "11 2011 120 \n", - "12 2012 102 \n", - "13 2013 110 \n", - "14 2001 106 \n", - "15 2001 135 \n", - "16 1997 126 \n", - "17 2021 132 \n", - "18 2015 141 " - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdTitleDirectorYearLength_minutes
01Toy StoryJohn Lasseter199581
12A Bug's LifeJohn Lasseter199895
23Toy Story 2John Lasseter199993
34Monsters, Inc.Pete Docter200192
45Finding NemoAndrew Stanton2003107
56The IncrediblesBrad Bird2004116
67CarsJohn Lasseter2006117
78RatatouilleBrad Bird2007115
89WALL-EAndrew Stanton2008104
910UpPete Docter2009101
1011Toy Story 3Lee Unkrich2010103
1112Cars 2John Lasseter2011120
1213BraveBrenda Chapman2012102
1314Monsters UniversityDan Scanlon2013110
1415The Fast and the FuriousRob Cohen2001106
1516A Beautiful MindRon Howard2001135
1617Good Will HuntingGus Van Sant1997126
1718Shang-Chi and the Legend of the Ten RingsDestin Daniel Cretton2021132
1819The MartianRidley Scott2015141
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "all_movies_df", - "summary": "{\n \"name\": \"all_movies_df\",\n \"rows\": 20,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1,\n \"max\": 20,\n \"num_unique_values\": 20,\n \"samples\": [\n 1,\n 18,\n 16\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 20,\n \"samples\": [\n \"Toy Story\",\n \"Shang-Chi and the Legend of the Ten Rings\",\n \"A Beautiful Mind\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 13,\n \"samples\": [\n \"Ridley Scott\",\n \"Gus Van Sant\",\n \"John Lasseter\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 1995,\n \"max\": 2021,\n \"num_unique_values\": 17,\n \"samples\": [\n 1995,\n 1998,\n 2004\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15,\n \"min\": 81,\n \"max\": 141,\n \"num_unique_values\": 20,\n \"samples\": [\n 81,\n 132,\n 135\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 36 - } - ], - "source": [ - "all_movies_df.head(-1) # head(-1) returns every row except the last one" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uwtTP015ttPD" - }, - "source": [ - "##### ==================================================================================================\n", - "Now let's introduce another dataframe containing the scores each movie received" - ] - }, - { - "cell_type": "code", - "execution_count": 
37, - "metadata": { - "id": "wntotCmBttPD" - }, - "outputs": [], - "source": [ - "scores_df = pd.read_csv(\"Movie_Scores.csv\")" - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": { - "id": "9xeFCBz5ttPD", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 206 - }, - "outputId": "7ad7b5c2-8c43-4a86-bb98-4613fcf88b90" - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Score\n", - "0 8.3\n", - "1 7.2\n", - "2 7.9\n", - "3 8.1\n", - "4 8.2" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Score
08.3
17.2
27.9
38.1
48.2
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "scores_df", - "summary": "{\n \"name\": \"scores_df\",\n \"rows\": 20,\n \"fields\": [\n {\n \"column\": \"Score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.6190825129938148,\n \"min\": 6.2,\n \"max\": 8.4,\n \"num_unique_values\": 12,\n \"samples\": [\n 6.8,\n 7.3,\n 8.3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 38 - } - ], - "source": [ - "scores_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Yarl7-KPttPD" - }, - "source": [ - "##### ==================================================================================================\n", - "Now we can combine the two dataframes side by side" - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": { - "id": "W2zOhxPcttPD" - }, - "outputs": [], - "source": [ - "movies_and_scores_df = pd.concat([all_movies_df,scores_df], axis = \"columns\").reset_index(drop=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": { - "id": "VBMdQiRettPD", - "outputId": "8f1d854c-e1c0-4a8b-fe3e-a7cb1317b69b", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 645 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Title Director \\\n", - "0 1 Toy Story John Lasseter \n", - "1 2 A Bug's Life John Lasseter \n", - "2 3 Toy Story 2 John Lasseter \n", - "3 4 Monsters, Inc. 
Pete Docter \n", - "4 5 Finding Nemo Andrew Stanton \n", - "5 6 The Incredibles Brad Bird \n", - "6 7 Cars John Lasseter \n", - "7 8 Ratatouille Brad Bird \n", - "8 9 WALL-E Andrew Stanton \n", - "9 10 Up Pete Docter \n", - "10 11 Toy Story 3 Lee Unkrich \n", - "11 12 Cars 2 John Lasseter \n", - "12 13 Brave Brenda Chapman \n", - "13 14 Monsters University Dan Scanlon \n", - "14 15 The Fast and the Furious Rob Cohen \n", - "15 16 A Beautiful Mind Ron Howard \n", - "16 17 Good Will Hunting Gus Van Sant \n", - "17 18 Shang-Chi and the Legend of the Ten Rings Destin Daniel Cretton \n", - "18 19 The Martian Ridley Scott \n", - "\n", - " Year Length_minutes Score \n", - "0 1995 81 8.3 \n", - "1 1998 95 7.2 \n", - "2 1999 93 7.9 \n", - "3 2001 92 8.1 \n", - "4 2003 107 8.2 \n", - "5 2004 116 8.0 \n", - "6 2006 117 7.2 \n", - "7 2007 115 8.1 \n", - "8 2008 104 8.4 \n", - "9 2009 101 8.3 \n", - "10 2010 103 8.3 \n", - "11 2011 120 6.2 \n", - "12 2012 102 7.1 \n", - "13 2013 110 7.3 \n", - "14 2001 106 6.8 \n", - "15 2001 135 8.2 \n", - "16 1997 126 8.3 \n", - "17 2021 132 7.4 \n", - "18 2015 141 8.0 " - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdTitleDirectorYearLength_minutesScore
01Toy StoryJohn Lasseter1995818.3
12A Bug's LifeJohn Lasseter1998957.2
23Toy Story 2John Lasseter1999937.9
34Monsters, Inc.Pete Docter2001928.1
45Finding NemoAndrew Stanton20031078.2
56The IncrediblesBrad Bird20041168.0
67CarsJohn Lasseter20061177.2
78RatatouilleBrad Bird20071158.1
89WALL-EAndrew Stanton20081048.4
910UpPete Docter20091018.3
1011Toy Story 3Lee Unkrich20101038.3
1112Cars 2John Lasseter20111206.2
1213BraveBrenda Chapman20121027.1
1314Monsters UniversityDan Scanlon20131107.3
1415The Fast and the FuriousRob Cohen20011066.8
1516A Beautiful MindRon Howard20011358.2
1617Good Will HuntingGus Van Sant19971268.3
1718Shang-Chi and the Legend of the Ten RingsDestin Daniel Cretton20211327.4
1819The MartianRidley Scott20151418.0
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "movies_and_scores_df", - "summary": "{\n \"name\": \"movies_and_scores_df\",\n \"rows\": 20,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 1,\n \"max\": 20,\n \"num_unique_values\": 20,\n \"samples\": [\n 1,\n 18,\n 16\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 20,\n \"samples\": [\n \"Toy Story\",\n \"Shang-Chi and the Legend of the Ten Rings\",\n \"A Beautiful Mind\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 13,\n \"samples\": [\n \"Ridley Scott\",\n \"Gus Van Sant\",\n \"John Lasseter\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6,\n \"min\": 1995,\n \"max\": 2021,\n \"num_unique_values\": 17,\n \"samples\": [\n 1995,\n 1998,\n 2004\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Length_minutes\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15,\n \"min\": 81,\n \"max\": 141,\n \"num_unique_values\": 20,\n \"samples\": [\n 81,\n 132,\n 135\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.6190825129938148,\n \"min\": 6.2,\n \"max\": 8.4,\n \"num_unique_values\": 12,\n \"samples\": [\n 6.8,\n 7.3,\n 8.3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 40 - } - ], - "source": [ - "movies_and_scores_df.head(-1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xoQ3fB8SttPD" - }, - "source": [ - "##### 
==================================================================================================\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": { - "id": "ZITCa9qYttPD" - }, - "outputs": [], - "source": [ - "managers = pd.DataFrame(\n", - " {\n", - " 'Id': [1,2,3],\n", - " 'Manager':['Chris','Maritza','Jamin']\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": { - "id": "MX9spfihttPD", - "outputId": "9968b9fc-74e5-418d-e643-c6fad9e9d0ee", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 143 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Manager\n", - "0 1 Chris\n", - "1 2 Maritza\n", - "2 3 Jamin" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IdManager
01Chris
12Maritza
23Jamin
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "managers", - "summary": "{\n \"name\": \"managers\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 1,\n 2,\n 3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Chris\",\n \"Maritza\",\n \"Jamin\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 42 - } - ], - "source": [ - "managers.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "metadata": { - "id": "LtD1zQJuttPD" - }, - "outputs": [], - "source": [ - "captains = pd.DataFrame(\n", - " {\n", - " 'Id': [2,2,3,1,1,3,2,3,1,1,3,3],\n", - " 'Captain':['Derick','Shane','Becca','Anna','Christine','Melody','Tom','Eric','Naomi','Angelina','Nancy','Richard'],\n", - " 'Title':['C','C','SC','C','SC','C','C','SC','C','EC','C','SC']\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": { - "id": "0xOS-Bu4ttPD", - "outputId": "f5d8f2f8-ae4f-4146-9691-9d7beee65172", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 425 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Captain Title\n", - "0 2 Derick C\n", - "1 2 Shane C\n", - "2 3 Becca SC\n", - "3 1 Anna C\n", - "4 1 Christine SC\n", - "5 3 Melody C\n", - "6 2 Tom C\n", - "7 3 Eric SC\n", - "8 1 Naomi C\n", - "9 1 Angelina EC\n", - "10 3 Nancy C\n", - "11 3 Richard SC" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "captains", - "summary": "{\n \"name\": \"captains\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Nancy\",\n \"Angelina\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 44 - } - ], - "source": [ - "captains.head(12)" - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": { - "id": "3c478mlSttPD", - "outputId": "7acc418b-9454-4c7a-972f-6808c928937c", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 394 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Captain Title Manager\n", - "0 2 Derick C Maritza\n", - "1 2 Shane C Maritza\n", - "2 2 Tom C Maritza\n", - "3 3 Becca SC Jamin\n", - "4 3 Melody C Jamin\n", - "5 3 Eric SC Jamin\n", - "6 3 Nancy C Jamin\n", - "7 3 Richard SC Jamin\n", - "8 1 Anna C Chris\n", - "9 1 Christine SC Chris\n", - "10 1 Naomi C Chris" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "roster", - "summary": "{\n \"name\": \"roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Naomi\",\n \"Christine\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Maritza\",\n \"Jamin\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 45 - } - ], - "source": [ - "roster = captains.merge(managers,left_on = 'Id', right_on = 'Id')\n", - "roster.head(-1)" - ] - }, - { - "cell_type": "code", - "source": [ - "test_roster = pd.concat([captains, managers], axis=\"columns\").reset_index(drop=True)\n", - "test_roster.head()" - ], - "metadata": { - "id": "rJ9K1BPGXxzE", - "outputId": "45a65eb1-94ea-4fa3-d442-535fa5346b7f", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 206 - } - }, - "execution_count": 46, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Id Captain Title Id Manager\n", - "0 2 Derick C 1.0 Chris\n", - "1 2 Shane C 2.0 Maritza\n", - "2 3 Becca SC 3.0 Jamin\n", - "3 1 Anna C NaN NaN\n", - "4 1 Christine SC NaN NaN" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "test_roster", - "summary": "{\n \"name\": \"test_roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Nancy\",\n \"Angelina\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.0,\n \"min\": 1.0,\n \"max\": 3.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 1.0,\n 2.0,\n 3.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Chris\",\n \"Maritza\",\n \"Jamin\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 46 - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2hro1V6XttPD" - }, - "source": [ - "#### The `captains.merge(managers, ...)` call above is equivalent to SQL's\n", - "```sql\n", - "SELECT *\n", - "FROM Captains\n", - "INNER JOIN Managers\n", - "ON Captains.Id = Managers.Id\n", - "```\n", - "##### ==================================================================================================\n", - "## Column Renaming\n", - "\n", - "We can use the `.rename` method in pandas to relabel the columns of a dataframe. 
Suppose we want to rename `Id` to `Cohort` and `Title` to `Captain Rank`." - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": { - "id": "-nELWGyPttPD", - "outputId": "c9ba481a-05ed-44b8-e5ec-bb4a7ba4ff6f", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 394 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Cohort Captain Captain Rank Manager\n", - "0 2 Derick C Maritza\n", - "1 2 Shane C Maritza\n", - "2 2 Tom C Maritza\n", - "3 3 Becca SC Jamin\n", - "4 3 Melody C Jamin\n", - "5 3 Eric SC Jamin\n", - "6 3 Nancy C Jamin\n", - "7 3 Richard SC Jamin\n", - "8 1 Anna C Chris\n", - "9 1 Christine SC Chris\n", - "10 1 Naomi C Chris" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "roster", - "summary": "{\n \"name\": \"roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Cohort\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Naomi\",\n \"Christine\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Captain Rank\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Maritza\",\n \"Jamin\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 47 - } - ], - "source": [ - "roster = roster.rename(columns = {\"Id\":\"Cohort\",\"Title\":\"Captain Rank\"})\n", - "roster.head(-1)" - ] - }, - { - "cell_type": "code", - "source": [ - "roster.columns" - ], - "metadata": { - "id": "zrKc31ukYw3i", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "1a3fa73c-5ba1-4f43-8ee8-9d71f3a52e9d" - }, - "execution_count": 48, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "Index(['Cohort', 'Captain', 'Captain Rank', 'Manager'], dtype='object')" - ] - }, - "metadata": {}, - "execution_count": 48 - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "N5H7HamottPE" - }, - "source": [ - "If we would like to rename all columns at once, we must assign a list whose length equals the number of columns." 
- ] - }, - { - "cell_type": "code", - 
"execution_count": 49, - "metadata": { - "id": "eCTo6V3UttPE", - "outputId": "82b0a175-8201-4f65-b516-11e23e42f1e6", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 394 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Cohort Num Capt Capt Rank Manager\n", - "0 2 Derick C Maritza\n", - "1 2 Shane C Maritza\n", - "2 2 Tom C Maritza\n", - "3 3 Becca SC Jamin\n", - "4 3 Melody C Jamin\n", - "5 3 Eric SC Jamin\n", - "6 3 Nancy C Jamin\n", - "7 3 Richard SC Jamin\n", - "8 1 Anna C Chris\n", - "9 1 Christine SC Chris\n", - "10 1 Naomi C Chris" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "roster", - "summary": "{\n \"name\": \"roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Cohort Num\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 3,\n \"num_unique_values\": 3,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Capt\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Naomi\",\n \"Christine\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Capt Rank\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Maritza\",\n \"Jamin\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 49 - } - ], - "source": [ - "roster.columns = ['Cohort Num','Capt','Capt Rank','Manager']\n", - "roster.head(-1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Wp5nb6skttPE" - }, - "source": [ - "##### ==================================================================================================\n", - "### Drop Columns" - ] - }, - { - "cell_type": "code", - "execution_count": 50, - "metadata": { - "id": "doploOj9ttPE", - "outputId": "90c4c6be-da1d-4ab6-b19e-504fa0db6cdf", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 394 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Capt Capt Rank Manager\n", - "0 Derick C Maritza\n", - "1 Shane C Maritza\n", - "2 Tom C Maritza\n", - "3 Becca SC Jamin\n", - "4 Melody C Jamin\n", - "5 Eric SC 
Jamin\n", - "6 Nancy C Jamin\n", - "7 Richard SC Jamin\n", - "8 Anna C Chris\n", - "9 Christine SC Chris\n", - "10 Naomi C Chris" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "roster", - "summary": "{\n \"name\": \"roster\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Capt\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Naomi\",\n \"Christine\",\n \"Derick\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Capt Rank\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"C\",\n \"SC\",\n \"EC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Manager\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Maritza\",\n \"Jamin\",\n \"Chris\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 50 - } - ], - "source": [ - "#df.drop([\"column1\",\"column2\"], axis = \"columns\")\n", - "\n", - "roster = roster.drop(\"Cohort Num\", axis = \"columns\")\n", - "roster.head(-1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "u-SBCempttPE" - }, - "source": [ - "##### ==================================================================================================\n", - "### Missing Values / NaN Values\n", - "\n", - "Data can be missing for several reasons. Most commonly, the data was never collected, it was handled incorrectly, or the entry holds a null value.\n", - "\n", - "Missing data can be remedied by the following:\n", - "1. Removing the row with the missing/NaN values\n", - "2. Removing the column with the missing/NaN values\n", - "3. Filling in the missing data\n", - "\n", - "For simplicity, we will only focus on the first two methods. The third method can be done by interpolating values from other rows or columns of the dataset. This process requires knowledge outside of the scope of this lesson. 
There are entire studies dedicated to this topic alone." - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": { - "id": "yV1RhRDNttPE", - "outputId": "79c2c6cb-d707-4643-a129-8badfc1d9267", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Company Location Year\n", - "0 ALFA ROMEO Italy 1910.0\n", - "1 Aston Martin Lagonda Ltd UK 1913.0\n", - "2 Audi Germany 1909.0\n", - "3 BMW Germany 1916.0\n", - "4 Chevrolet NaN NaN\n", - "5 Dodge USA 1900.0\n", - "6 Ferrari Italy 1947.0\n", - "7 Honda Japan 1948.0\n", - "8 Jaguar UK 1922.0\n", - "9 Lamborghini Italy 1963.0\n", - "10 MAZDA Japan 1920.0\n", - "11 McLaren UK 1985.0\n", - "12 Mercedes-Benz Germany NaN\n", - "13 NISSAN Japan 1933.0\n", - "14 Pagani Automobili S.p.A. Italy 1992.0\n", - "15 Porsche Germany 1931.0\n", - "16 FIAT NaN 1899.0\n", - "17 Mini Germany 1969.0\n", - "18 SCION NaN NaN\n", - "19 Subaru Japan 1953.0\n", - "20 Bentley UK 1919.0\n", - "21 Buick USA 1899.0\n", - "22 Ford USA 1903.0\n", - "23 HYUNDAI MOTOR COMPANY South Korea 1967.0\n", - "24 LEXUS Japan 1989.0\n", - "25 MASERATI Italy 1914.0\n", - "26 Roush NaN NaN\n", - "27 Volkswagen Germany 1937.0\n", - "28 Acura Japan 1986.0\n", - "29 Cadillac USA 1902.0\n", - "30 INFINITI Hong Kong 1989.0\n", - "31 KIA MOTORS CORPORATION South Korea 1944.0\n", - "32 Mitsubishi Motors Corporation Japan 1970.0\n", - "33 Rolls-Royce Motor Cars Limited UK 1904.0\n", - "34 TOYOTA Japan 1937.0\n", - "35 Volvo Sweden 1927.0\n", - "36 Chrysler USA 1925.0\n", - "37 Lincoln USA 1917.0\n", - "38 GMC USA 1911.0\n", - "39 RAM USA NaN\n", - "40 CHEVROLET USA 1911.0\n", - "41 Jeep USA 1943.0" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "cars", - "summary": "{\n \"name\": \"cars\",\n \"rows\": 43,\n \"fields\": [\n {\n \"column\": \"Company\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 43,\n \"samples\": [\n \"Lincoln\",\n \"LEXUS\",\n \"MASERATI\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 8,\n \"samples\": [\n \"UK\",\n \"South Korea\",\n \"Italy\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 28.994530312650504,\n \"min\": 1899.0,\n \"max\": 1992.0,\n \"num_unique_values\": 33,\n \"samples\": [\n 1911.0,\n 1969.0,\n 1970.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 51 - } - ], - "source": [ - "cars = pd.read_csv(\"Cars.csv\")\n", - "cars.head(-1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zT2P3Mq9ttPE" - }, - "source": [ - "##### ==================================================================================================\n", - "Now let's sort the companies in alphabetical order. Note that the sort is case-sensitive, so uppercase letters come before lowercase ones (`ALFA ROMEO` sorts before `Acura`)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": { - "id": "W4xHJumrttPE", - "outputId": "65bc3d88-1a90-4123-f2f7-195781fe8b4c", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Company Location Year\n", - "0 ALFA ROMEO Italy 1910.0\n", - "1 Acura Japan 1986.0\n", - "2 Aston Martin Lagonda Ltd UK 1913.0\n", - "3 Audi Germany 1909.0\n", - "4 BMW Germany 1916.0\n", - "5 Bentley UK 1919.0\n", - "6 Buick USA 1899.0\n", - "7 CHEVROLET USA 1911.0\n", - "8 Cadillac USA 1902.0\n", - "9 Chevrolet NaN NaN\n", - "10 Chrysler 
USA 1925.0\n", - "11 Dodge USA 1900.0\n", - "12 FIAT NaN 1899.0\n", - "13 Ferrari Italy 1947.0\n", - "14 Ford USA 1903.0\n", - "15 GMC USA 1911.0\n", - "16 HYUNDAI MOTOR COMPANY South Korea 1967.0\n", - "17 Honda Japan 1948.0\n", - "18 INFINITI Hong Kong 1989.0\n", - "19 Jaguar UK 1922.0\n", - "20 Jeep USA 1943.0\n", - "21 KIA MOTORS CORPORATION South Korea 1944.0\n", - "22 LEXUS Japan 1989.0\n", - "23 Lamborghini Italy 1963.0\n", - "24 Land Rover UK 1948.0\n", - "25 Lincoln USA 1917.0\n", - "26 MASERATI Italy 1914.0\n", - "27 MAZDA Japan 1920.0\n", - "28 McLaren UK 1985.0\n", - "29 Mercedes-Benz Germany NaN\n", - "30 Mini Germany 1969.0\n", - "31 Mitsubishi Motors Corporation Japan 1970.0\n", - "32 NISSAN Japan 1933.0\n", - "33 Pagani Automobili S.p.A. Italy 1992.0\n", - "34 Porsche Germany 1931.0\n", - "35 RAM USA NaN\n", - "36 Rolls-Royce Motor Cars Limited UK 1904.0\n", - "37 Roush NaN NaN\n", - "38 SCION NaN NaN\n", - "39 Subaru Japan 1953.0\n", - "40 TOYOTA Japan 1937.0\n", - "41 Volkswagen Germany 1937.0" - ], - "text/html": [ - "\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "cars", - "summary": "{\n \"name\": \"cars\",\n \"rows\": 43,\n \"fields\": [\n {\n \"column\": \"Company\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 43,\n \"samples\": [\n \"Roush\",\n \"Land Rover\",\n \"Lincoln\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 8,\n \"samples\": [\n \"Japan\",\n \"South Korea\",\n \"Italy\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 28.994530312650504,\n \"min\": 1899.0,\n \"max\": 1992.0,\n \"num_unique_values\": 33,\n \"samples\": [\n 1937.0,\n 1989.0,\n 1933.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 52 - } - ], - "source": [ - "cars = cars.sort_values(\"Company\").reset_index(drop=True)\n", - "cars.head(-1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tFLokzyvttPE" - }, - "source": [ - "##### ==================================================================================================\n", - "Now let's check how many entries are missing. The output below shows 4 missing entries in the Location column and 5 in the Year column." 
- ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": { - "id": "q33En74DttPE", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "1090d9fa-fd70-42da-b4a1-33722d5bc001" - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "Company 0\n", - "Location 4\n", - "Year 5\n", - "dtype: int64" - ] - }, - "metadata": {}, - "execution_count": 53 - } - ], - "source": [ - "cars.isna().sum()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cGKKoYTpttPE" - }, - "source": [ - "##### ==================================================================================================\n", - "Let's inspect all the rows with a missing Location entry" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": { - "id": "dRmT5-TvttPE", - "outputId": "484a1c7f-376a-4503-d01b-403d7f8787cc", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 175 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Company Location Year\n", - "9 Chevrolet NaN NaN\n", - "12 FIAT NaN 1899.0\n", - "37 Roush NaN NaN\n", - "38 SCION NaN NaN" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    Company    Location  Year
9   Chevrolet  NaN       NaN
12  FIAT       NaN       1899.0
37  Roush      NaN       NaN
38  SCION      NaN       NaN
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "repr_error": "0" - } - }, - "metadata": {}, - "execution_count": 54 - } - ], - "source": [ - "missing_car_info_filter = cars.loc[:, \"Location\"].isna()\n", - "cars.loc[missing_car_info_filter, :]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tvT4mHb5ttPE" - }, - "source": [ - "##### ==================================================================================================\n", - "Let's inspect all the rows with a missing Year entry" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "metadata": { - "id": "64m7mIH0ttPF", - "outputId": "acc3f119-af77-4999-a7b8-4f3ea273eda4", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 206 - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " Company Location Year\n", - "9 Chevrolet NaN NaN\n", - "29 Mercedes-Benz Germany NaN\n", - "35 RAM USA NaN\n", - "37 Roush NaN NaN\n", - "38 SCION NaN NaN" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    Company        Location  Year
9   Chevrolet      NaN       NaN
29  Mercedes-Benz  Germany   NaN
35  RAM            USA       NaN
37  Roush          NaN       NaN
38  SCION          NaN       NaN
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"cars\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Company\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Mercedes-Benz\",\n \"SCION\",\n \"RAM\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"USA\",\n \"Germany\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": null,\n \"max\": null,\n \"num_unique_values\": 0,\n \"samples\": [],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } - }, - "metadata": {}, - "execution_count": 55 - } - ], - "source": [ - "cars.loc[cars.loc[:, \"Year\"].isna(), :]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Kl_wIVHCttPF" - }, - "source": [ - "##### ==================================================================================================\n", - "For simplicity we can fill all the missing Location entries with \"NA\"" - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": { - "id": "k0NDuUMhttPF" - }, - "outputs": [], - "source": [ - "cars.loc[:, \"Location\"] = cars.loc[:, \"Location\"].fillna(value=\"NA\")" - ] - }, - { - "cell_type": "code", - "execution_count": 57, - "metadata": { - "id": "KXC45KtFttPF", - "outputId": "56beee41-cfc0-40c3-ccb1-88e933218f96", - "colab": { - "base_uri": "https://localhost:8080/" - } - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "Company 0\n", - "Location 0\n", - "Year 5\n", - "dtype: int64" - ] - }, - "metadata": {}, - "execution_count": 57 - } - ], - "source": [ - "cars.head(-1)\n", - "cars.isna().sum()" - ] - }, - { - "cell_type": "markdown", - "metadata": 
{ - "id": "nB__rivattPF" - }, - "source": [ - "##### ==================================================================================================\n", - "Now let's drop any rows that still have missing entries" - ] - }, - { - "cell_type": "code", - "execution_count": 58, - "metadata": { - "id": "Ft1XTWOGttPF", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "87130665-7134-4fa2-f824-b316d40b6c4e" - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "Company 0\n", - "Location 0\n", - "Year 0\n", - "dtype: int64" - ] - }, - "metadata": {}, - "execution_count": 58 - } - ], - "source": [ - "cars = cars.dropna().reset_index(drop=True)\n", - "cars.head(-1)\n", - "cars.isna().sum()" - ] - }, - { - "cell_type": "code", - "source": [ - "cars.info()" - ], - "metadata": { - "id": "MoUYqyzSeK9n", - "outputId": "cfde1a76-289e-42aa-c7ad-2ae30ba36514", - "colab": { - "base_uri": "https://localhost:8080/" - } - }, - "execution_count": 59, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "\n", - "RangeIndex: 38 entries, 0 to 37\n", - "Data columns (total 3 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 Company 38 non-null object \n", - " 1 Location 38 non-null object \n", - " 2 Year 38 non-null float64\n", - "dtypes: float64(1), object(2)\n", - "memory usage: 1.0+ KB\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lbaxA3zrttPF" - }, - "source": [ - "##### ==================================================================================================\n", - "## Summary\n", - "\n", - "- `pandas` provides `Series` and `DataFrame` classes that work with tabular-style data.\n", - "- `.loc` selects rows and columns based on their index values.\n", - "- `.iloc` selects rows and columns based on their position values.\n", - "- Calling a DataFrame method with `axis=\"rows\"` or `axis=0` causes it to operate along the row axis.\n", 
- "- Calling a DataFrame method with `axis=\"columns\"` or `axis=1` causes it to operate along the columns axis.\n", - "- `.sort_values()` reorders rows by the values in one or more columns\n", - "- `.rename()` can rename columns in DataFrames. You can also rewrite the `.columns` attribute to rename columns.\n", - "- `.isna()` detects missing values\n", - "- `.fillna()` replaces NULL values with a specified value\n", - "- `.dropna()` removes all rows that contain NULL values\n", - "- `.merge()` combines two DataFrames by joining them on shared columns or indexes" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "k8I532SRttPF" - }, - "source": [ - "##### ==================================================================================================\n", - "### Exercise 1:\n", - "Create a new DataFrame called `cohort` by inner joining the two DataFrames `roster` and `exam`" - ] - }, - { - "cell_type": "code", - "execution_count": 60, - "metadata": { - "id": "t3G0XkmittPF" - }, - "outputs": [], - "source": [ - "#solution\n", - "roster = pd.DataFrame(\n", - "{\n", - " \"Name\" : [\"James\",\"Greg\",\"Patrick\",\"Chris\",\"Cynthia\",\"Chandra\", \"John\",\"David\",\"Tiffany\",\"Peter\"],\n", - " \"Id\": [\"1\",\"2\",\"3\",\"4\",\"5\",\"6\",\"7\",\"8\",\"9\",\"10\"],\n", - "\n", - "})\n", - "\n", - "exam = pd.DataFrame({\n", - " \"Exam 1\" : [89,78,81,90,93,76,66,87,42,55],\n", - " \"Exam 2\" : [100,74,20,86,60,76,92,97,88,90],\n", - " \"Exam 3\" : [85,60,90,90,88,76,55,None,64,79],\n", - " \"Id\" : [\"4\",\"2\",\"1\",\"7\",\"5\",\"10\",\"6\",\"3\",\"9\",\"8\"]\n", - "})\n", - "\n", - "# YOUR CODE HERE" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rMRopV2FttPF" - }, - "source": [ - "##### ==================================================================================================\n", - "### Exercise 2:\n", - "Fill all missing grades with 0." 
- ] - }, - { - "cell_type": "code", - "execution_count": 61, - "metadata": { - "id": "DA8C74TLttPG" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6a_N8JEEttPG" - }, - "source": [ - "##### ==================================================================================================\n", - "### Exercise 3:\n", - "Update James's Exam 2 score from 20 to 85 and update Tiffany's Exam 1 score from 42 to 88" - ] - }, - { - "cell_type": "code", - "execution_count": 62, - "metadata": { - "id": "Mzka5Y3_ttPG" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DkuO3tIPttPG" - }, - "source": [ - "##### ==================================================================================================\n", - "### Exercise 4:\n", - "\n", - "Create a series called `Average` that takes the average of Exam 1, Exam 2 and Exam 3 scores" - ] - }, - { - "cell_type": "code", - "execution_count": 63, - "metadata": { - "id": "QWXVYTj0ttPG" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "96hRtey9ttPG" - }, - "source": [ - "##### ==================================================================================================\n", - "### Exercise 5:\n", - "Incorporate the newly created `Average` column into the DataFrame `cohort`" - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": { - "id": "wEysGqYyttPG" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QHk1lZiDttPG" - }, - "source": [ - "##### ==================================================================================================\n", - "### Exercise 6:\n", - "Sort the dataset by Average in **descending** order and reset the index of the DataFrame" - ] - }, - { - "cell_type": "code", - "execution_count": 65, - "metadata": { - "id": 
"9azLYMHPttPG" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yyWST6gUttPG" - }, - "source": [ - "##### ==================================================================================================\n", - "### Exercise 7:\n", - "Drop columns Exam 1, 2, and 3" - ] - }, - { - "cell_type": "code", - "execution_count": 66, - "metadata": { - "id": "PgD_KqCkttPG" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OHg6AiIYttPG" - }, - "source": [ - "##### ==================================================================================================\n", - "### Exercise 8:\n", - "Select only the **Name, Id and Average** columns for the top 3 rows by highest Average grade" - ] - }, - { - "cell_type": "code", - "execution_count": 67, - "metadata": { - "id": "MmHW3ki9ttPG" - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - }, - "colab": { - "provenance": [], - "include_colab_link": true - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file