diff --git "a/C\303\263pia_de_Notebooks_NB01_02__Condicionais_hs.ipynb" "b/C\303\263pia_de_Notebooks_NB01_02__Condicionais_hs.ipynb" new file mode 100644 index 000000000..dfefafb07 --- /dev/null +++ "b/C\303\263pia_de_Notebooks_NB01_02__Condicionais_hs.ipynb" @@ -0,0 +1,411 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Cópia de Notebooks/NB01_02__Condicionais.ipynb", + "provenance": [], + "collapsed_sections": [ + "n8BIbzQbNWUo", + "7eS94uQ4NhVR", + "SYOgJpGYVLUu", + "CaHFxk98W5if", + "ReWUyWiHXCnc", + "CqszHxaKHr2h", + "tXgF1Wl9gHKY", + "Fotx7XUquAo8", + "36kmLUYDvsUI", + "SWO2GdNovxAp", + "vpN54l4vxze5", + "u4HOf9SNytSq", + "6BQ9oZiD9hg5", + "tz5-QdrX9vct", + "p1muBgMX8NK4", + "FxTC2-U88ajk", + "z8EYn0pP25Rh" + ], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Y-QMrzHhpcu" + }, + "source": [ + "

CONDITIONALS - IF

\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wYGZ0eGlv--6" + }, + "source": [ + "# **AGENDA**:\n", + "> See the **index** of the items covered in this chapter." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q3FpTG0dh47M" + }, + "source": [ + "___\n", + "# **REFERENCES**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LWuIj53sVSnA" + }, + "source": [ + "___\n", + "# **CONDITIONALS**\n", + "> Used to decide whether a given statement or block of statements is executed: the block runs only if its condition is true." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NyG1l3awJzEq" + }, + "source": [ + "# Do not run the following code:\n", + "if condicao1:\n", + " \n", + "elif condicao2:\n", + " \n", + "elif condicao3:\n", + " \n", + " ...\n", + "elif condicaoN:\n", + " \n", + "else:\n", + " " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FCJBMTh5WX5C" + }, + "source": [ + "## Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vn5u7CEaWZjH" + }, + "source": [ + "def mensagem(i_idade, i_limite):\n", + " if i_idade > i_limite:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite}'\n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lW0ME_nVXU4M" + }, + "source": [ + "mensagem(35, 40)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EBBU8Yw2XxUo" + }, + "source": [ + "No message? What now?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xQ23cAjMX1kx", + "outputId": "3612d39b-3f92-40fd-af14-2dfbca6b0697", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "mensagem(45, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "45 é maior que 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BeHU0tPuWK4s" + }, + "source": [ + "## Example 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gSzCnjS0Fk-d" + }, + "source": [ + "def mensagem2(i_idade, i_limite):\n", + " if i_idade > i_limite:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite}'\n", + " else:\n", + " s_mensagem= f'{i_idade} é menor ou igual a {i_limite}'\n", + " \n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KxbmxuDwYFX_", + "outputId": "8f1faff1-de34-4967-865f-17453f7992af", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "mensagem2(35, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "35 é menor ou igual a 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lToDO6pzWPGL" + }, + "source": [ + "## Example 3" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "a1NlziSbGrIl", + "outputId": "ffed270b-c16f-4d30-cdaf-80ae96898a94", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + } + }, + "source": [ + "def mensagem3(i_idade, i_limite1, i_limite2, i_limite3, i_limite4):\n", + " if ((i_idade > i_limite1) and (i_idade < i_limite2)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite1} e menor que {i_limite2}'\n", + " \n", + " elif ((i_idade > i_limite3) and (i_idade < i_limite4)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite3} e menor que {i_limite4}'\n", + " \n", + " 
else:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite4}'\n", + " \n", + "print(s_mensagem)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0ms_mensagem\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0;34mf'{i_idade} é maior que {i_limite4}'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms_mensagem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 's_mensagem' is not defined" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V8FF3lFLYqui" + }, + "source": [ + "Why do we get an error in this function?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y5F09RKGYyoX" + }, + "source": [ + "**Answer**: because of the indentation! 
The correct form is:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vR-oFyzAY5UC" + }, + "source": [ + "def mensagem3(i_idade, i_limite1, i_limite2, i_limite3, i_limite4):\n", + " if ((i_idade > i_limite1) and (i_idade < i_limite2)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite1} e menor que {i_limite2}'\n", + " elif ((i_idade > i_limite3) and (i_idade < i_limite4)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite3} e menor que {i_limite4}'\n", + " else:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite4}'\n", + " \n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QgkBOGKdYgGU", + "outputId": "701f4620-817f-41e0-e9d7-f6b06adf6b3d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "mensagem3(35, 10, 20, 30, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "35 é maior que 30 e menor que 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LLk7bhjSwZch" + }, + "source": [ + "___\n", + "# **Wrap Up**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lJvjcjm8NQ85" + }, + "source": [ + "___\n", + "# Exercises\n", + "## **Exercise 1**: \n", + "Write a Python function that takes an integer i_limite and then prints the integers from 0 to i_limite;\n", + "\n", + "## **Other exercises**: \n", + "The sites below offer Python exercises:\n", + "### https://pynative.com/python-if-else-and-for-loop-exercise-with-solutions/;\n", + "### https://www.w3resource.com/python-exercises/" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gi091pZrwbnY" + }, + "source": [ + "# Exercise 1\n", + "def imprime_inteiros(i_limite):\n", + " for i_inteiro in range(i_limite+1):\n", + " print(i_inteiro, end=' ')\n" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "Vq2andamdPsZ", + "outputId": "7b7f7864-9402-4b13-86eb-c14aa95b55e5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "imprime_inteiros(10)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "0 1 2 3 4 5 6 7 8 9 10 " + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MG_rtFd0eGgb" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/3DP_1_Feature Engineering_Fase1-TESTE.ipynb b/Notebooks/3DP_1_Feature Engineering_Fase1-TESTE.ipynb new file mode 100644 index 000000000..13a846cc6 --- /dev/null +++ b/Notebooks/3DP_1_Feature Engineering_Fase1-TESTE.ipynb @@ -0,0 +1,4856 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + }, + "colab": { + "name": "Copy of 10. 
Feature Selection Techniques.ipynb", + "provenance": [], + "toc_visible": true, + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ngLc7b9XiKxN" + }, + "source": [ + "# 3DP_FEATURE ENGINEERING - PHASE 1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JNj9RdXbXmWq" + }, + "source": [ + "## Load the libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T9JCQatsiKxR" + }, + "source": [ + "from sklearn import feature_selection\n", + "import pandas as pd\n", + "import numpy as np\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ], + "execution_count": 1, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_ebv3nAzU2ac" + }, + "source": [ + "## Load the dataframe\n", + "* The main attributes/features of the dataframe are:\n", + " * **PassengerID**: Passenger ID;\n", + " * **Survived**: Indicator, where 1 = passenger survived and 0 = passenger died;\n", + " * **Pclass**: Class in which the passenger travels (1st class, 2nd class, 3rd class);\n", + " * **Age**: Age of the passenger;\n", + " * **SibSp**: Number of siblings/spouses aboard;\n", + " * **Parch**: Number of parents/children aboard;\n", + " * **Fare**: Fare paid for the trip;\n", + " * **Cabin**: Passenger's cabin;\n", + " * **Embarked**: Port where the passenger embarked.\n", + " * **Name**: Name of the passenger;\n", + " * **Sex**: Sex of the passenger." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8M5uO9r-Vtze" + }, + "source": [ + "url_train= 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/Titanic_With_MV.csv'\n", + "url_test= 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/Titanic_test.csv'\n", + "\n", + "# Load the training and test dataframes and set 'PassengerId' as the index\n", + "df_train= pd.read_csv(url_train, index_col='PassengerId')\n", + "df_test= pd.read_csv(url_test, index_col='PassengerId')\n", + "\n", + "# Keep a copy of the original response variable 'Survived'\n", + "df_train_Survived = df_train[\"Survived\"].copy()\n", + "\n", + "# Concatenate train and test (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)\n", + "df = pd.concat([df_train, df_test], sort= False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QvbVxLs4ZZ0B" + }, + "source": [ + "## Understanding the dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yr4kSh-vZcam", + "outputId": "8b2389b7-ad00-4a86-8748-3104a3ec5996", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Number of rows (instances) and columns of the dataframe\n", + "df.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1309, 11)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 229 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "icC1tMH1ZhlA", + "outputId": "64a5866c-06ec-4401-bfcc-2d753e070ac5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "# Columns of the dataframe\n", + "df.columns" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',\n", + " 'Fare', 'Cabin', 'Embarked'],\n", + " dtype='object')" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 230 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fsVfiqfjwXcX", + "outputId": "2390399f-b753-448b-eba7-3817c8b8e35d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "# Normalize the column names to lowercase\n", + "df.columns= [cols.lower() for cols in df.columns]\n", + "\n", + "# Check that the variable names are ok\n", + "df.columns" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Index(['survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',\n", + " 'fare', 'cabin', 'embarked'],\n", + " dtype='object')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 231 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2opFNkvZ2sf", + "outputId": "e3a5efc3-4e57-4d31-d909-db35cff48129", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 289 + } + }, + "source": [ + "# General information about the dataframe\n", + "df.info()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 1309 entries, 1 to 1309\n", + "Data columns (total 11 columns):\n", + "survived 891 non-null float64\n", + "pclass 1309 non-null int64\n", + "name 1309 non-null object\n", + "sex 1309 non-null object\n", + "age 1046 non-null float64\n", + "sibsp 1309 non-null int64\n", + "parch 1309 non-null int64\n", + "ticket 1309 non-null object\n", + "fare 1308 non-null float64\n", + "cabin 295 non-null object\n", + "embarked 1307 non-null object\n", + "dtypes: float64(3), int64(3), object(5)\n", + "memory usage: 122.7+ KB\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MeHf-pPtaFfM" + }, + "source": [ + "What would you say about the output above? What information can you extract from it?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GAM06rMgaMnZ", + "outputId": "ae88c0f3-7f46-48e2-fb0a-8494ff4ed1b3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 354 + } + }, + "source": [ + "# Viewing part of the dataframe\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n",
+ "<tr><th>PassengerId</th><th>survived</th><th>pclass</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>ticket</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
+ "<tr><th>1</th><td>0.0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
+ "<tr><th>2</th><td>1.0</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
+ "<tr><th>3</th><td>1.0</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
+ "<tr><th>4</th><td>1.0</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
+ "<tr><th>5</th><td>0.0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
+ "</table>
" + ], + "text/plain": [ + " survived pclass ... cabin embarked\n", + "PassengerId ... \n", + "1 0.0 3 ... NaN S\n", + "2 1.0 1 ... C85 C\n", + "3 1.0 3 ... NaN S\n", + "4 1.0 1 ... C123 S\n", + "5 0.0 3 ... NaN S\n", + "\n", + "[5 rows x 11 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 233 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ngoNsaGgaXsJ" + }, + "source": [ + "## Delete attributes/features that are not of interest\n", + "* A priori, I see no value in the variable 'ticket', so I will delete it from the dataframe." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BIihFnbTaj6K", + "outputId": "c4220be8-0f4c-455c-d087-4404e8360790", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + } + }, + "source": [ + "df= df.drop(['ticket'], axis=1) # axis= 1 means the operation applies to the columns of the dataframe. Remember: axis= 0 applies it to the rows.\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n",
+ "<tr><th>PassengerId</th><th>survived</th><th>pclass</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n",
+ "<tr><th>1</th><td>0.0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
+ "<tr><th>2</th><td>1.0</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
+ "<tr><th>3</th><td>1.0</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
+ "<tr><th>4</th><td>1.0</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
+ "<tr><th>5</th><td>0.0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
+ "</table>
" + ], + "text/plain": [ + " survived pclass ... cabin embarked\n", + "PassengerId ... \n", + "1 0.0 3 ... NaN S\n", + "2 1.0 1 ... C85 C\n", + "3 1.0 3 ... NaN S\n", + "4 1.0 1 ... C123 S\n", + "5 0.0 3 ... NaN S\n", + "\n", + "[5 rows x 10 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 234 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C8Sb5kOJasHr" + }, + "source": [ + "Note that the 'ticket' column was indeed deleted from the dataframe." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3bbuq8rOawGN" + }, + "source": [ + "Next, I create the variable 'survived2' to make the data easier to understand:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AyyfGIGya1bw", + "outputId": "13c90a85-66e7-4b87-9ea6-56dc2f1241ec", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 371 + } + }, + "source": [ + "df['survived2'] = df['survived']\n", + "df['survived2'] = df['survived2'].map({0:'Died',1:'Survived'})\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n",
+ "<tr><th>PassengerId</th><th>survived</th><th>pclass</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th><th>survived2</th></tr>\n",
+ "<tr><th>1</th><td>0.0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>NaN</td><td>S</td><td>Died</td></tr>\n",
+ "<tr><th>2</th><td>1.0</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C85</td><td>C</td><td>Survived</td></tr>\n",
+ "<tr><th>3</th><td>1.0</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>NaN</td><td>S</td><td>Survived</td></tr>\n",
+ "<tr><th>4</th><td>1.0</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>C123</td><td>S</td><td>Survived</td></tr>\n",
+ "<tr><th>5</th><td>0.0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>NaN</td><td>S</td><td>Died</td></tr>\n",
+ "</table>
" + ], + "text/plain": [ + " survived pclass ... embarked survived2\n", + "PassengerId ... \n", + "1 0.0 3 ... S Died\n", + "2 1.0 1 ... C Survived\n", + "3 1.0 3 ... S Survived\n", + "4 1.0 1 ... S Survived\n", + "5 0.0 3 ... S Died\n", + "\n", + "[5 rows x 11 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 235 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cegg0IcQa6NL" + }, + "source": [ + "## Understanding the original variables of the dataframe\n", + "* Let's check how the variables are filled in so that we can fix possible data-entry problems." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_WCbklv0bDlp" + }, + "source": [ + "The following function will help us with data visualization by crossing the response variable 'survived2' with any other variable passed to the function:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "epxI-F2UbGGS" + }, + "source": [ + "def Avalia_Taxa_Sobrevivencia(df, column):\n", + " title_xt = pd.crosstab(df[column], df['survived2'])\n", + " print(pd.crosstab(df[column], df['survived2'], margins=True))\n", + " title_xt_pct = title_xt.div(title_xt.sum(1).astype(float), axis=0)\n", + " \n", + " title_xt_pct.plot(kind='bar', stacked=True, title='Taxa de Sobrevivência dos Passageiros', \n", + " color= ['r', 'g'])\n", + " plt.xlabel(column)\n", + " plt.ylabel('Taxa de Sobrevivência')\n", + " plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),shadow=True, ncol=2)\n", + " plt.show()\n", + "\n", + "def Catplot_Graph(x, y, hue= 'survived2', col= None):\n", + " plt.rcdefaults()\n", + " g= sns.catplot(x= x, y= y, hue= hue, palette={'Died':'red','Survived':'blue'}, col= col, data=df, kind= 'bar', height=4, aspect=.7)\n", + " plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a11nwzJKbNE-" + }, + "source": [ + "### Variable 'sex'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"j56d6Z6ZbQ2m" + }, + "source": [ + "Let's evaluate how this variable is filled in." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5X-0G4xNbU_b", + "outputId": "4f602695-5eb0-4b4e-a7bf-ad996fdac034", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 272 + } + }, + "source": [ + "df['sex'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "male 833\n", + "female 458\n", + "m 4\n", + "M 3\n", + "f 2\n", + "W 1\n", + "MALE 1\n", + "w 1\n", + "Woman 1\n", + "F 1\n", + "fEMALE 1\n", + "Men 1\n", + "mALE 1\n", + "Female 1\n", + "Name: sex, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 237 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eS2lsALObZwX" + }, + "source": [ + "What is your opinion of these values?\n", + "\n", + "Any problems?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AYQCZ9HObhXk" + }, + "source": [ + "There are several problems here... Looking at these results, do you agree that 'male', 'm', 'MALE', 'M', 'mALE' and 'Men' all carry the same information?\n", + "\n", + "Likewise, do 'female', 'f', 'F', 'Female', 'fEMALE', 'Woman', 'w' and 'W' also carry the same information?\n", + "\n", + "So, let's do the following:\n", + "\n", + "Whenever I find one of these values: ['m', 'MALE', 'M', 'mALE', 'Men'], I will replace it with 'male'; whenever I find one of these values: ['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W'], I will replace it with 'female'. 
The following command performs these substitutions:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hQx_tNBQblst" + }, + "source": [ + "Defining the dictionary used to replace the inconsistent values:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LPZzQwRfbnSi", + "outputId": "b8f78b06-59b2-4843-e7cd-9378f86a1df1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 255 + } + }, + "source": [ + "dSex= {}\n", + "dSex.update(dict.fromkeys(['m', 'MALE', 'M', 'mALE', 'Men', 'male'], 'male'))\n", + "dSex.update(dict.fromkeys(['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W', 'female'], 'female'))\n", + "dSex" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'F': 'female',\n", + " 'Female': 'female',\n", + " 'M': 'male',\n", + " 'MALE': 'male',\n", + " 'Men': 'male',\n", + " 'W': 'female',\n", + " 'Woman': 'female',\n", + " 'f': 'female',\n", + " 'fEMALE': 'female',\n", + " 'female': 'female',\n", + " 'm': 'male',\n", + " 'mALE': 'male',\n", + " 'male': 'male',\n", + " 'w': 'female'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 238 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ3lwKRKbsx0" + }, + "source": [ + "Apply the transformation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "idBwRNI7bvCC", + "outputId": "4b83067f-3096-4425-cf46-e7eb365c5e56", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "df['sex2']= df['sex'].map(dSex)\n", + "df['sex2'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "male 843\n", + "female 466\n", + "Name: sex2, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 239 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FzDl78rfb3p5" + }, + "source": [ + "What is the conclusion? 
Do these values make more sense than the previous ones?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YPOpGyCpb_Yy" + }, + "source": [ + "**Attention:** The commands below are an alternative to the map() applied earlier to fix the values of the variable 'sex' (the column names are lowercase at this point):\n", + "\n", + "```\n", + "df['sex2'] = df['sex'].replace(['m', 'MALE', 'M', 'mALE', 'Men'], 'male')\n", + "df['sex3'] = df['sex2'].replace(['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W'], 'female') \n", + "df.sex3.value_counts()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "muDUjFZecMAK" + }, + "source": [ + "Ok, we have indeed fixed the data-entry problems in the variable 'sex'. So, let's rename our variable back to what we had before:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iWiYCTj8b1Dq", + "outputId": "9f36189e-58a0-4b47-bbb2-a123a8a6ee11", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 371 + } + }, + "source": [ + "# Delete the original variable 'sex':\n", + "df= df.drop(columns= ['sex'], axis= 1)\n", + "\n", + "# Rename the auxiliary variable 'sex2' to 'sex':\n", + "df= df.rename(columns= {'sex2': 'sex'})\n", + "\n", + "# Show the data:\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n",
+ "<tr><th>PassengerId</th><th>survived</th><th>pclass</th><th>name</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>cabin</th><th>embarked</th><th>survived2</th><th>sex</th></tr>\n",
+ "<tr><th>1</th><td>0.0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>NaN</td><td>S</td><td>Died</td><td>male</td></tr>\n",
+ "<tr><th>2</th><td>1.0</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C85</td><td>C</td><td>Survived</td><td>female</td></tr>\n",
+ "<tr><th>3</th><td>1.0</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>NaN</td><td>S</td><td>Survived</td><td>female</td></tr>\n",
+ "<tr><th>4</th><td>1.0</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>C123</td><td>S</td><td>Survived</td><td>female</td></tr>\n",
+ "<tr><th>5</th><td>0.0</td><td>3</td><td>Allen, Mr. William Henry</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>NaN</td><td>S</td><td>Died</td><td>male</td></tr>\n",
+ "</table>
" + ], + "text/plain": [ + " survived pclass ... survived2 sex\n", + "PassengerId ... \n", + "1 0.0 3 ... Died male\n", + "2 1.0 1 ... Survived female\n", + "3 1.0 3 ... Survived female\n", + "4 1.0 1 ... Survived female\n", + "5 0.0 3 ... Died male\n", + "\n", + "[5 rows x 11 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 240 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XxKmFfe9cSxE", + "outputId": "22eed0ae-4d82-4d36-984d-c34ea9987709", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 386 + } + }, + "source": [ + "sns.catplot(x=\"sex\", kind=\"count\", data=df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 241 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAFgCAYAAACbqJP/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAFTNJREFUeJzt3X+0nVV95/H3RwJSUQngNYNJWDA1\nSxdjy69bhNrpssa2QGcMwyjFsUOkWZPODMU6TGfKtGtqW7XVsQ4Vp2VWVlGD41gRpaQOxWEF7My0\nggZBfmq5RTFJA7kgoJWlNvU7f5wdPcQbuGl47r375v1a66yz937289zvXTn55Mk+z3lOqgpJUj+e\nNd8FSJL2jcEtSZ0xuCWpMwa3JHXG4JakzhjcktQZg1uSOmNwS1JnDG5J6syS+S5gf5xxxhl1/fXX\nz3cZkvRMyWwmdX3G/fDDD893CZI057oObkk6EBncktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMG\ntyR1xuCWpM4Y3JLUGYNbkjpjcEtSZ7q+O+D+OuU/XDnfJWiO3Pqu8+e7BOkZ4xm3JHXG4Jakzhjc\nktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y3JLUmUGDO8m/S3J3kruSfDjJoUmO\nS3JLkqkkH0lySJv77NafatuPHbI2SerVYMGdZDnwJmCyql4GHAScB7wTuLSqXgw8Cqxru6wDHm3j\nl7Z5kqQ9DL1UsgT4gSRLgOcAO4BXAVe37RuBs1t7TevTtq9OkoHrk6TuDBbcVbUd+F3gK4wC+3Hg\nVuCxqtrVpm0Dlrf2cmBr23dXm3/UnsdNsj7JliRbpqenhypfkhasIZdKjmB0Fn0c8CLgMOCM/T1u\nVW2oqsmqmpyYmNjfw0lSd4ZcKnk18KWqmq6qvwU+DrwCWNqWTgBWANtbezuwEqBtPxx4ZMD6JKlL\nQwb3V4DTkjynrVWvBu4BbgJe2+asBa5t7U2tT9t+Y1XVgPVJUpeGXOO+hdGbjJ8D7mw/awPwK8DF\nSaYYrWFf0Xa5Aj
iqjV8MXDJUbZLUs0G/c7Kq3gK8ZY/h+4FTZ5j7TeB1Q9YjSYuBn5yUpM4Y3JLU\nGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3BLUmcMbknqjMEtSZ0x\nuCWpMwa3JHXG4JakzhjcktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4MFtxJXpLk\n9rHH15K8OcmRSW5Icl97PqLNT5LLkkwluSPJyUPVJkk9Gyy4q+qLVXViVZ0InAI8AVwDXAJsrqpV\nwObWBzgTWNUe64HLh6pNkno2V0slq4G/qqoHgDXAxja+ETi7tdcAV9bIzcDSJEfPUX2S1I25Cu7z\ngA+39rKq2tHaDwLLWns5sHVsn21t7EmSrE+yJcmW6enpoeqVpAVr8OBOcgjwGuCje26rqgJqX45X\nVRuqarKqJicmJp6hKiWpH3Nxxn0m8Lmqeqj1H9q9BNKed7bx7cDKsf1WtDFJ0pi5CO7X871lEoBN\nwNrWXgtcOzZ+fru65DTg8bElFUlSs2TIgyc5DPhJ4BfGht8BXJVkHfAAcG4bvw44C5hidAXKBUPW\nJkm9GjS4q+obwFF7jD3C6CqTPecWcOGQ9UjSYuAnJyWpMwa3JHXG4JakzhjcktQZg1uSOmNwS1Jn\nDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y3JLUGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdcbg\nlqTOGNyS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3JHVm0OBOsjTJ1Um+kOTeJKcnOTLJDUnu\na89HtLlJclmSqSR3JDl5yNokqVdDn3G/B7i+ql4KnADcC1wCbK6qVcDm1gc4E1jVHuuByweuTZK6\nNFhwJzkc+HHgCoCq+nZVPQasATa2aRuBs1t7DXBljdwMLE1y9FD1SVKvhjzjPg6YBt6f5LYkf5jk\nMGBZVe1ocx4ElrX2cmDr2P7b2tiTJFmfZEuSLdPT0wOWL0kL05DBvQQ4Gbi8qk4CvsH3lkUAqKoC\nal8OWlUbqmqyqiYnJiaesWIlqRdDBvc2YFtV3dL6VzMK8od2L4G0551t+3Zg5dj+K9qYJGnMYMFd\nVQ8CW5O8pA2tBu4BNgFr29ha4NrW3gSc364uOQ14fGxJRZLULBn4+BcBH0pyCHA/cAGjfyyuSrIO\neAA4t829DjgLmAKeaHMlSXsYNLir6nZgcoZNq2eYW8CFQ9YjSYuBn5yUpM4Y3JLUGYNbkjpjcEtS\nZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3JHXG\n4JakzhjcktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4MGtxJvpzkziS3J9nSxo5M\nckOS+9rzEW08SS5LMpXkjiQnD1mbJPVqLs64f6KqTqyqyda/BNhcVauAza0PcCawqj3WA5fPQW2S\n1J35WCpZA2xs7Y3A2WPjV9bIzcDSJEfPQ32StKANHdwF/O8ktyZZ38aWVdWO1n4QWNbay4GtY/tu\na2NPkmR9ki1JtkxPTw9VtyQtWEsGPv6PVdX2JC8EbkjyhfGNVVVJal8OWFUbgA0Ak5OT+7SvJC0G\ng55xV9X29rwTuAY4FXho9xJIe97Zpm8HVo7tvqKNSZLGDBbcSQ5L8rzdbeCngLuATcDaNm0tcG1r\nbwLOb1eXnAY8PrakIklqhlwqWQZck2T3z/mfVXV9ks8CVyVZBzwAnNvmXwecBUwBTwAXDFibJHVr\nsOCuqvuBE2YYfwRYPcN4ARcOVY8kLRZ+clKSOmNwS1JnDG5J6sysgjvJ5tmMSZKG95RvTiY5FHgO\n8IJ2M6i0Tc9nhk81SpKG93RXlfwC8GbgRcCtfC+4vwb8twHrkiTtxVMGd1W9B3hP
kouq6r1zVJO0\nqHzlt35ovkvQHDrm1+8c/GfM6jruqnpvkh8Fjh3fp6quHKguSdJezCq4k3wQ+EHgduDv2nABBrck\nzbHZfnJyEji+fbpRkjSPZnsd913APxiyEEnS7Mz2jPsFwD1JPgN8a/dgVb1mkKokSXs12+D+jSGL\nkCTN3myvKvmzoQuRJM3ObK8q+Tqjq0gADgEOBr5RVc8fqjBJ0sxme8b9vN3tjL4ZYQ1w2lBFSZL2\nbp/vDlgjfwz89AD1SJKexmyXSs4Z6z6L0XXd3xykIknSU5rtVSX/dKy9C/gyo+USSdIcm+0at1/c\nK0kLxGy/SGFFkmuS7GyPjyVZMXRxkqTvN9s3J98PbGJ0X+4XAX/SxiRJc2y2wT1RVe+vql3t8QFg\nYsC6JEl7MdvgfiTJzyU5qD1+DnhkyMIkSTObbXD/PHAu8CCwA3gt8MaBapIkPYXZBvdvAWuraqKq\nXsgoyH9zNju2M/Tbknyi9Y9LckuSqSQfSXJIG39260+17cfu+68jSYvfbIP7h6vq0d2dqvoqcNIs\n9/0l4N6x/juBS6vqxcCjwLo2vg54tI1f2uZJkvYw2+B+VpIjdneSHMksrgFvlwz+DPCHrR/gVcDV\nbcpG4OzWXtP6tO2r23xJ0pjZfnLy3cCnk3y09V8HvH0W+/0e8B+B3TepOgp4rKp2tf42YHlrLwe2\nAlTVriSPt/kPjx8wyXpgPcAxxxwzy/IlafGY1Rl3+zb3c4CH2uOcqvrgU+2T5J8AO6vq1v2u8sm1\nbKiqyaqanJjwikRJB57ZnnFTVfcA9+zDsV8BvCbJWcChwPOB9wBLkyxpZ90rgO1t/nZgJbAtyRLg\ncLzkUJK+zz7f1nW2quo/VdWKqjoWOA+4sareANzE6HJCgLXAta29qfVp22/0W+Ul6fsNFtxP4VeA\ni5NMMVrDvqKNXwEc1cYvBi6Zh9okacGb9VLJ/qiqTwGfau37gVNnmPNNRm96SpKewnyccUuS9oPB\nLUmdMbglqTMGtyR1xuCWpM4Y3JLUGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS\n1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3JHXG4JakzhjcktQZg1uSOmNwS1JnDG5J6ozBLUmd\nGSy4kxya5DNJPp/k7iS/2caPS3JLkqkkH0lySBt/dutPte3HDlWbJPVsyDPubwGvqqoTgBOBM5Kc\nBrwTuLSqXgw8Cqxr89cBj7bxS9s8SdIeBgvuGvmb1j24PQp4FXB1G98InN3aa1qftn11kgxVnyT1\natA17iQHJbkd2AncAPwV8FhV7WpTtgHLW3s5sBWgbX8cOGqGY65PsiXJlunp6SHLl6QFadDgrqq/\nq6oTgRXAqcBLn4FjbqiqyaqanJiY2O8aJak3c3JVSVU9BtwEnA4sTbKkbVoBbG/t7cBKgLb9cOCR\nuahPknoy5FUlE0mWtvYPAD8J3MsowF/bpq0Frm3tTa1P235jVdVQ9UlSr5Y8/ZS/t6OBjUkOYvQP\nxFVV9Ykk9wB/lORtwG3AFW3+FcAHk0wBXwXOG7A2SerWYMFdVXcAJ80wfj+j9e49x78JvG6oeiRp\nsfCTk5LUGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3BLUmcMbknq\njMEtSZ0xuCWpMwa3JHXG4JakzhjcktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y\n3JLUmcGCO8nKJDcluSfJ3Ul+qY0fmeSGJPe15yPaeJJclmQqyR1JTh6qNknq2ZBn3LuAf19VxwOn\nARcmOR64BNhcVauAza0PcCawqj3WA5cPWJskdWuw4K6qHVX1udb+OnAvsBxYA2xs0zYCZ7f2GuDK\nGrkZWJrk6KHqk6Rezckad5JjgZOAW4BlVbWj
bXoQWNbay4GtY7tta2N7Hmt9ki1JtkxPTw9WsyQt\nVIMHd5LnAh8D3lxVXxvfVlUF1L4cr6o2VNVkVU1OTEw8g5VKUh8GDe4kBzMK7Q9V1cfb8EO7l0Da\n8842vh1YObb7ijYmSRoz5FUlAa4A7q2q/zq2aROwtrXXAteOjZ/fri45DXh8bElFktQsGfDYrwD+\nJXBnktvb2K8C7wCuSrIOeAA4t227DjgLmAKeAC4YsDZJ6tZgwV1V/w/IXjavnmF+ARcOVY8kLRZ+\nclKSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y3JLUGYNbkjpjcEtSZwxuSeqMwS1JnTG4\nJakzBrckdcbglqTOGNyS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3JHXG4JakzhjcktQZg1uS\nOjNYcCd5X5KdSe4aGzsyyQ1J7mvPR7TxJLksyVSSO5KcPFRdktS7Ic+4PwCcscfYJcDmqloFbG59\ngDOBVe2xHrh8wLokqWuDBXdV/R/gq3sMrwE2tvZG4Oyx8Str5GZgaZKjh6pNkno212vcy6pqR2s/\nCCxr7eXA1rF529rY90myPsmWJFump6eHq1SSFqh5e3Oyqgqov8d+G6pqsqomJyYmBqhMkha2uQ7u\nh3YvgbTnnW18O7BybN6KNiZJ2sNcB/cmYG1rrwWuHRs/v11dchrw+NiSiiRpzJKhDpzkw8ArgRck\n2Qa8BXgHcFWSdcADwLlt+nXAWcAU8ARwwVB1SVLvBgvuqnr9XjatnmFuARcOVYskLSZ+clKSOmNw\nS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y3JLUGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrck\ndcbglqTOGNyS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3JHXG4JakzhjcktQZg1uSOmNwS1Jn\nFlRwJzkjyReTTCW5ZL7rkaSFaMEEd5KDgN8HzgSOB16f5Pj5rUqSFp4FE9zAqcBUVd1fVd8G/ghY\nM881SdKCs2S+CxizHNg61t8GvHzPSUnWA+tb92+SfHEOaltMXgA8PN9FzLX87tr5LuFAdEC+1nhL\n9mfv66vqjKebtJCCe1aqagOwYb7r6FWSLVU1Od91aPHztTachbRUsh1YOdZf0cYkSWMWUnB/FliV\n5LgkhwDnAZvmuSZJWnAWzFJJVe1K8ovAJ4GDgPdV1d3zXNZi5DKT5oqvtYGkqua7BknSPlhISyWS\npFkwuCWpMwb3AS7JK5N8Yr7r0MKT5E1J7k3yoYGO/xtJfnmIYy92C+bNSUkLzr8FXl1V2+a7ED2Z\nZ9yLQJJjk3whyQeS/GWSDyV5dZI/T3JfklPb49NJbkvyF0leMsNxDkvyviSfafO85cABKsl/B/4h\n8KdJfm2m10WSNyb54yQ3JPlykl9McnGbc3OSI9u8f5Xks0k+n+RjSZ4zw8/7wSTXJ7k1yf9N8tK5\n/Y37YnAvHi8G3g28tD3+BfBjwC8Dvwp8AfjHVXUS8OvAb89wjF8DbqyqU4GfAN6V5LA5qF0LTFX9\na+CvGb0ODmPvr4uXAecAPwK8HXiivcY+DZzf5ny8qn6kqk4A7gXWzfAjNwAXVdUpjF6zfzDMb7Y4\nuFSyeHypqu4ESHI3sLmqKsmdwLHA4cDGJKuAAg6e4Rg/BbxmbN3xUOAYRn/ZdODa2+sC4Kaq+jrw\n9SSPA3/Sxu8Efri1X5bkbcBS4LmMPqvxXUmeC/wo8NHku/f5ePYQv8hiYXAvHt8aa39nrP8dRn/O\nb2X0l+yfJTkW+NQMxwjwz6vKG3dp3IyviyQv5+lfdwAfAM6uqs8neSPwyj2O/yzgsao68Zkte/Fy\nqeTAcTjfu/fLG/cy55PARWmnPUlOmoO6tPDt7+viecCOJAcDb9hzY1V9DfhSkte14yfJCftZ86Jm\ncB84/gvw
O0luY+//03oroyWUO9pyy1vnqjgtaPv7uvjPwC3AnzN6r2UmbwDWJfk8cDfei/8p+ZF3\nSeqMZ9yS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3JHXG4Jb47p0R/1e7g91dSX42ySlJ/qzd\nse6TSY5OsqTd6e6Vbb/fSfL2eS5fBxjvVSKNnAH8dVX9DECSw4E/BdZU1XSSnwXeXlU/3+63cXWS\ni9p+L5+vonVgMrilkTuBdyd5J/AJ4FFGtyy9od2i4yBgB0BV3Z3kg23e6VX17fkpWQcqg1sCquov\nk5wMnAW8DbgRuLuqTt/LLj8EPAa8cI5KlL7LNW4JSPIiRl8C8D+AdzFa/phIcnrbfnCSf9Ta5wBH\nAj8OvDfJ0nkqWwcobzIlAUl+mlFgfwf4W+DfALuAyxjdEncJ8HvANcBfAKuramuSNwGnVNXaeSlc\nBySDW5I641KJJHXG4JakzhjcktQZg1uSOmNwS1JnDG5J6ozBLUmd+f+oDB5uBhggxwAAAABJRU5E\nrkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-tYbYZW1V_eO" + }, + "source": [ + "### Variável 'cabin'\n", + "* No caso da variável 'cabin', vamos construir as variáveis 'deck' e 'seat'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GCflGdANp4jU", + "outputId": "1bc2cf22-9273-4a5f-d78f-4a4f83b341b2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "set(df['cabin'])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'A10',\n", + " 'A11',\n", + " 'A14',\n", + " 'A16',\n", + " 'A18',\n", + " 'A19',\n", + " 'A20',\n", + " 'A21',\n", + " 'A23',\n", + " 'A24',\n", + " 'A26',\n", + " 'A29',\n", + " 'A31',\n", + " 'A32',\n", + " 'A34',\n", + " 'A36',\n", + " 'A5',\n", + " 'A6',\n", + " 'A7',\n", + " 'A9',\n", + " 'B10',\n", + " 'B101',\n", + " 'B102',\n", + " 'B11',\n", + " 'B18',\n", + " 'B19',\n", + " 'B20',\n", + " 'B22',\n", + " 'B24',\n", + " 'B26',\n", + " 'B28',\n", + " 'B3',\n", + " 'B30',\n", + " 'B35',\n", + " 'B36',\n", + " 'B37',\n", + " 'B38',\n", + " 'B39',\n", + " 'B4',\n", + " 'B41',\n", + " 'B42',\n", + " 'B45',\n", + " 'B49',\n", + " 'B5',\n", + " 'B50',\n", + " 'B51 B53 B55',\n", + " 'B52 B54 B56',\n", + " 'B57 B59 B63 B66',\n", + " 'B58 B60',\n", + " 'B61',\n", + " 'B69',\n", + " 'B71',\n", + " 'B73',\n", + " 'B77',\n", + " 'B78',\n", + " 'B79',\n", + " 'B80',\n", + " 'B82 B84',\n", + " 'B86',\n", + " 'B94',\n", + " 'B96 B98',\n", + " 'C101',\n", + " 'C103',\n", + " 'C104',\n", + " 'C105',\n", + " 'C106',\n", + " 'C110',\n", + " 'C111',\n", + " 'C116',\n", + " 'C118',\n", + " 'C123',\n", + " 'C124',\n", + " 'C125',\n", + " 'C126',\n", + " 'C128',\n", + " 'C130',\n", + " 'C132',\n", + " 'C148',\n", + " 'C2',\n", + " 'C22 C26',\n", + " 'C23 C25 C27',\n", + " 'C28',\n", + " 'C30',\n", + " 'C31',\n", + " 'C32',\n", + " 'C39',\n", + " 'C45',\n", + " 'C46',\n", + 
" 'C47',\n", + " 'C49',\n", + " 'C50',\n", + " 'C51',\n", + " 'C52',\n", + " 'C53',\n", + " 'C54',\n", + " 'C55 C57',\n", + " 'C6',\n", + " 'C62 C64',\n", + " 'C65',\n", + " 'C68',\n", + " 'C7',\n", + " 'C70',\n", + " 'C78',\n", + " 'C80',\n", + " 'C82',\n", + " 'C83',\n", + " 'C85',\n", + " 'C86',\n", + " 'C87',\n", + " 'C89',\n", + " 'C90',\n", + " 'C91',\n", + " 'C92',\n", + " 'C93',\n", + " 'C95',\n", + " 'C97',\n", + " 'C99',\n", + " 'D',\n", + " 'D10 D12',\n", + " 'D11',\n", + " 'D15',\n", + " 'D17',\n", + " 'D19',\n", + " 'D20',\n", + " 'D21',\n", + " 'D22',\n", + " 'D26',\n", + " 'D28',\n", + " 'D30',\n", + " 'D33',\n", + " 'D34',\n", + " 'D35',\n", + " 'D36',\n", + " 'D37',\n", + " 'D38',\n", + " 'D40',\n", + " 'D43',\n", + " 'D45',\n", + " 'D46',\n", + " 'D47',\n", + " 'D48',\n", + " 'D49',\n", + " 'D50',\n", + " 'D56',\n", + " 'D6',\n", + " 'D7',\n", + " 'D9',\n", + " 'E10',\n", + " 'E101',\n", + " 'E12',\n", + " 'E121',\n", + " 'E17',\n", + " 'E24',\n", + " 'E25',\n", + " 'E31',\n", + " 'E33',\n", + " 'E34',\n", + " 'E36',\n", + " 'E38',\n", + " 'E39 E41',\n", + " 'E40',\n", + " 'E44',\n", + " 'E45',\n", + " 'E46',\n", + " 'E49',\n", + " 'E50',\n", + " 'E52',\n", + " 'E58',\n", + " 'E60',\n", + " 'E63',\n", + " 'E67',\n", + " 'E68',\n", + " 'E77',\n", + " 'E8',\n", + " 'F',\n", + " 'F E46',\n", + " 'F E57',\n", + " 'F E69',\n", + " 'F G63',\n", + " 'F G73',\n", + " 'F2',\n", + " 'F33',\n", + " 'F38',\n", + " 'F4',\n", + " 'G6',\n", + " 'T',\n", + " nan}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 242 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7E6yje89u7KF" + }, + "source": [ + "Como podemos ver, trata-se de uma variável categórica com vários níveis. Portanto, vamos capturar somente a primeira letra da variável 'cabin'. 
To do so, we will use the pandas string method str.slice().\n", + "\n", + "> Series.str.slice(start, stop) - slice a substring from each element of the Series;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wmZLlSaArR6F" + }, + "source": [ + "Next, we capture the first letter of the 'cabin' variable:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hUZTJU0MvVxP", + "outputId": "1b05c1a5-65af-4d90-9a51-696ee929af48", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 170 + } + }, + "source": [ + "# define the 'deck' variable, which holds the first letter of the 'cabin' variable\n", + "df[\"deck\"] = df[\"cabin\"].str.slice(0,1) # slice(start, stop): keeps characters from index 0 up to, but not including, index 1\n", + "df['deck'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "C 94\n", + "B 65\n", + "D 46\n", + "E 41\n", + "A 22\n", + "F 21\n", + "G 5\n", + "T 1\n", + "Name: deck, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 243 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6myhrth0rZ6t" + }, + "source": [ + "Next, we extract the numeric part of the 'cabin' variable using regular expressions:\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8UXkACPmsfwN" + }, + "source": [ + "# Import the regular expressions library\n", + "import re" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QKk-fnW4rf4o", + "outputId": "1c9c1b59-19ce-4e10-80ea-61b80a9443fd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 235 + } + }, + "source": [ + "# First, we use split() to separate the contents of the variable into columns: \n", + "new = df[\"cabin\"].str.split(\" \", n= 3, expand = True) \n", + "new.head(5)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
PassengerId
1NaNNaNNaNNaN
2C85NoneNoneNone
3NaNNaNNaNNaN
4C123NoneNoneNone
5NaNNaNNaNNaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "PassengerId \n", + "1 NaN NaN NaN NaN\n", + "2 C85 None None None\n", + "3 NaN NaN NaN NaN\n", + "4 C123 None None None\n", + "5 NaN NaN NaN NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 245 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dFqoR-Xew9gX" + }, + "source": [ + "Observe acima que o comando gera quantos splits da variável eu quiser. No entanto, por simplicidade, me interessa somente o primeiro split." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_M7vA6WoVG05" + }, + "source": [ + "Agora, vou extrair o número do assento do passageiro usando Expressões Regulares:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rVH5o9KT_IH3", + "outputId": "fbb20adc-123a-4e37-f163-2c980568598f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 136 + } + }, + "source": [ + "# Aqui está o conteúdo de new[0]:\n", + "new[0].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "PassengerId\n", + "1 NaN\n", + "2 C85\n", + "3 NaN\n", + "4 C123\n", + "5 NaN\n", + "Name: 0, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 246 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P7NTcsGOxxSX", + "outputId": "61481b94-bf2c-4259-894b-95e34abc7483", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 235 + } + }, + "source": [ + "new2= new[0].str.extract('(\\d+)')\n", + "new2.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
PassengerId
1NaN
285
3NaN
4123
5NaN
\n", + "
" + ], + "text/plain": [ + " 0\n", + "PassengerId \n", + "1 NaN\n", + "2 85\n", + "3 NaN\n", + "4 123\n", + "5 NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 247 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bf8vw2Mc18bQ" + }, + "source": [ + "Por fim, vou carregar esta informação ao dataframe df:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6l6EoRvsxRXn", + "outputId": "760b25ed-ee3a-4a5e-da60-61eaee32c5a8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "df[\"seat\"]= new2\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassnameagesibspparchfarecabinembarkedsurvived2sexdeckseat
PassengerId
10.03Braund, Mr. Owen Harris22.0107.2500NaNSDiedmaleNaNNaN
21.01Cumings, Mrs. John Bradley (Florence Briggs Th...38.01071.2833C85CSurvivedfemaleC85
31.03Heikkinen, Miss. Laina26.0007.9250NaNSSurvivedfemaleNaNNaN
41.01Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01053.1000C123SSurvivedfemaleC123
50.03Allen, Mr. William Henry35.0008.0500NaNSDiedmaleNaNNaN
\n", + "
" + ], + "text/plain": [ + " survived pclass ... deck seat\n", + "PassengerId ... \n", + "1 0.0 3 ... NaN NaN\n", + "2 1.0 1 ... C 85\n", + "3 1.0 3 ... NaN NaN\n", + "4 1.0 1 ... C 123\n", + "5 0.0 3 ... NaN NaN\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 248 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LK4V61uy3N9s" + }, + "source": [ + "Por fim, excluir a variável 'cabin':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4uAr55J43NY7" + }, + "source": [ + "df= df.drop(columns= [\"cabin\"], axis=1, errors=\"ignore\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZuH7YJXZCgY" + }, + "source": [ + "### Variável 'embarked'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nTPikhrIZGya", + "outputId": "fd84dd6a-7289-40d1-feab-9e5191b91258", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "df['embarked'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "S 914\n", + "C 270\n", + "Q 123\n", + "Name: embarked, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 250 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ixbZsuqOZsOc", + "outputId": "f7ace000-ed9d-455a-a4ce-a8722ecb9dd3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 386 + } + }, + "source": [ + "sns.catplot(x=\"embarked\", kind=\"count\", data=df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 251 + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAW4AAAFgCAYAAACbqJP/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEUdJREFUeJzt3Xvw5XVdx/HnS1YUNAVxQ93dglHS\n0FBxB1EqS3RCM0EDs1FBpegP73YRs1Fzcspb5i2LJIPGvOQNbIw0kGZ0FF0UQSBjIxUYkIXwHir4\n7o/zQX8uu8tZ4fs7v/dvn4+Z3+z3ds7vrYd57ne+e873pKqQJPVxu0UPIEnaOYZbkpox3JLUjOGW\npGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzaxY9wK1xxBFH1BlnnLHoMSTptpJ5Dmp9xn3NNdcsegRJ\nWnatwy1JuyLDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOt\n7w44r4f84amLHmHVOPc1xy56BGmX5xm3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRm\nDLckNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1Iz\nhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRmJg13khckuTDJ\nF5K8M8kdk+yf5Jwkm5O8O8nu49g7jPXNY/9+U84mSV1NFu4k64DnAhur6gHAbsCTgVcBr6+q+wDX\nAcePhxwPXDe2v34cJ0naytSXStYAeyRZA+wJXAk8Enjv2H8KcNRYPnKsM/YfniQTzydJ7UwW7qq6\nAngt8BVmwf46cC7wtaq6YRx2ObBuLK8DLhuPvWEcv8/Wz5vkhCSbkmzasmXLVONL0oo15aWSvZmd\nRe8P3Au4E3DErX3eqjqpqjZW1ca1a9fe2qeTpHamvFTyKOB/qmpLVX0feD9wGLDXuHQCsB64Yixf\nAWwAGPvvClw74XyS1NKU4f4KcGiSPce16sOBi4CPAUePY44DThvLp491xv6zqqomnE+SWpryGvc5\nzP6R8bPABeN3nQS8CHhhks3MrmGfPB5yMrDP2P5C4MSpZpOkztbc8iE/uap6GfCyrTZfChyyjWOv\nB46Zch5JWg385KQkNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnN\nGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRm\nDLckNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1Iz\nhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRmDLckNWO4JakZ\nwy1JzRhuSWrGcEtSM4Zbkpox3JLUzKThTrJXkvcm+c8kFyd5WJK7JflokkvGn3uPY5PkjUk2Jzk/\nycFTziZJXU19xv0G4Iyquh/wQOBi4ETgzKo6ADhzrAM8Bjhg/JwAvHXi2SSppcnCneSuwC8DJwNU\n1feq6mvAkcAp47BTgKPG8pHAqTXzKWCvJPecaj5J6mrKM+79gS3A25N8LsnbktwJ2LeqrhzHXAXs\nO5bXAZctefzlY9uPSXJCkk1JNm3ZsmXC8SVpZZoy3GuAg4G3VtWDgW/zo8siAFRVAbUzT1pVJ1XV\nxqrauHbt2ttsWEnqYspwXw5cXlXnjPX3Mgv5V2+6BDL+vHrsvwLYsOTx68c2SdISk4W7qq4CLkty\n37HpcOAi4HTguLHtOOC0sXw6cOx4d8mhwNeXXFKRJA1rJn7+5
wDvSLI7cCnwDGZ/WbwnyfHAl4En\njWM/DDwW2Ax8ZxwrSdrKpOGuqvOAjdvYdfg2ji3gWVPOI0mrgZ+clKRmDLckNWO4JakZwy1JzRhu\nSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3\nJDVjuCWpGcMtSc0YbklqZq5wJzlznm2SpOmt2dHOJHcE9gTunmRvIGPXXYB1E88mSdqGHYYb+D3g\n+cC9gHP5Ubi/Abx5wrkkSduxw3BX1RuANyR5TlW9aZlmkiTtwC2dcQNQVW9K8nBgv6WPqapTJ5pL\nkrQdc4U7yT8C9wbOA24cmwsw3JK0zOYKN7AROLCqasphJEm3bN73cX8BuMeUg0iS5jPvGffdgYuS\nfBr47k0bq+rxk0wlSdquecP98imHkCTNb953lfzH1INIkuYz77tKvsnsXSQAuwO3B75dVXeZajBJ\n0rbNe8b9UzctJwlwJHDoVENJkrZvp+8OWDMfBH5tgnkkSbdg3kslT1yyejtm7+u+fpKJJEk7NO+7\nSn5jyfINwJeYXS6RJC2zea9xP2PqQSRJ85n3ixTWJ/lAkqvHz/uSrJ96OEnSzc37j5NvB05ndl/u\newEfGtskScts3nCvraq3V9UN4+cfgLUTziVJ2o55w31tkqcm2W38PBW4dsrBJEnbNm+4nwk8CbgK\nuBI4Gnj6RDNJknZg3rcDvgI4rqquA0hyN+C1zIIuSVpG855xH3RTtAGq6n+BB08zkiRpR+YN9+2S\n7H3TyjjjnvdsXZJ0G5o3vq8DPpnkn8f6McArpxlJkrQj835y8tQkm4BHjk1PrKqLphtLkrQ9c1/u\nGKE21pK0YDt9W1dJ0mIZbklqxnBLUjOGW5KaMdyS1Mzk4R43pfpckn8Z6/snOSfJ5iTvTrL72H6H\nsb557N9v6tkkqaPlOON+HnDxkvVXAa+vqvsA1wHHj+3HA9eN7a8fx0mStjJpuMe35Pw68LaxHmYf\n4nnvOOQU4KixfORYZ+w/fBwvSVpi6jPuvwL+CPjBWN8H+FpV3TDWLwfWjeV1wGUAY//Xx/E/JskJ\nSTYl2bRly5YpZ5ekFWmycCd5HHB1VZ17Wz5vVZ1UVRurauPatX4Jj6Rdz5R3+DsMeHySxwJ3BO4C\nvAHYK8macVa9HrhiHH8FsAG4PMka4K74LTuSdDOTnXFX1Yuran1V7Qc8GTirqp4CfIzZN+gAHAec\nNpZPH+uM/WdVVU01nyR1tYj3cb8IeGGSzcyuYZ88tp8M7DO2vxA4cQGzSdKKtyxfhlBVZwNnj+VL\ngUO2ccz1zO7zLUnaAT85KUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0Ybklq\nxnBLUjOGW5KaMdyS1IzhlqRmDLckNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1\nY7glqRnDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjNrFj2A\ndm1fecUvLHqEVeNnXnrBokfQMvGMW5KaMdyS1IzhlqRmDLckNWO4JakZwy1JzRhuSWrGcEtSM4Zb\nkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHcktTMZOFOsiHJx5JclOTCJM8b\n2++W5KNJLhl/7j22J8kbk2xOcn6Sg6eaTZI6m/KM+wbg96vqQOBQ4FlJDgROBM6sqgOAM8c6wGOA\nA8bPCcBbJ5xNktqaLNxVdWVVfXYsfxO4GFgHHAmcMg47BThqLB8JnFoznwL2SnLPqeaTpK6W5Rp3\nkv2ABwPnAPtW1ZVj11XAvmN5HXDZkoddPrZt/VwnJNmUZNOWLVsmm1mSVqrJw53kzsD7gOdX1TeW\n7quqAmpnnq+qTqqqjVW1c
e3atbfhpJLUw6ThTnJ7ZtF+R1W9f2z+6k2XQMafV4/tVwAbljx8/dgm\nSVpiyneVBDgZuLiq/nLJrtOB48byccBpS7YfO95dcijw9SWXVCRJw5oJn/sw4GnABUnOG9v+GPgL\n4D1Jjge+DDxp7Psw8FhgM/Ad4BkTziZJbU0W7qr6OJDt7D58G8cX8Kyp5pGk1cJPTkpSM4Zbkpox\n3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0Y\nbklqxnBLUjOGW5KaMdyS1MyUXxYsqbHD3nTYokdYNT7xnE/cps/nGbckNWO4JakZwy1JzRhuSWrG\ncEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVj\nuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRmDLckNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox\n3JLUjOGWpGYMtyQ1Y7glqRnDLUnNrKhwJzkiyReTbE5y4qLnkaSVaMWEO8luwFuAxwAHAr+d5MDF\nTiVJK8+KCTdwCLC5qi6tqu8B7wKOXPBMkrTipKoWPQMASY4Gjqiq3xnrTwMeWlXP3uq4E4ATxup9\ngS8u66DTujtwzaKH0Db52qxcq+m1uaaqjrilg9YsxyS3pao6CThp0XNMIcmmqtq46Dl0c742K9eu\n+NqspEslVwAblqyvH9skSUuspHB/Bjggyf5JdgeeDJy+4JkkacVZMZdKquqGJM8G/g3YDfj7qrpw\nwWMtt1V5CWiV8LVZuXa512bF/OOkJGk+K+lSiSRpDoZbkpox3CtAkpckuTDJ+UnOS/LQRc+kmST3\nSPKuJP+d5NwkH07yc4uea1eXZH2S05JckuTSJG9OcodFz7VcDPeCJXkY8Djg4Ko6CHgUcNlipxJA\nkgAfAM6uqntX1UOAFwP7LnayXdt4Xd4PfLCqDgAOAPYAXr3QwZbRinlXyS7snsw+LfVdgKpaLZ8A\nWw1+Ffh+Vf3NTRuq6vMLnEczjwSur6q3A1TVjUleAHw5yUuq6luLHW96nnEv3keADUn+K8lfJ3nE\nogfSDz0AOHfRQ+hm7s9Wr0tVfQP4EnCfRQy03Az3go2zg4cwu//KFuDdSZ6+0KEkrWiGewWoqhur\n6uyqehnwbOA3Fz2TALiQ2V+qWlkuYqvXJcldgHuwum46t12Ge8GS3DfJAUs2PQj48qLm0Y85C7jD\nuCMlAEkOSvJLC5xJcCawZ5Jj4Yf38n8d8Oaq+r+FTrZMDPfi3Rk4JclFSc5n9iUSL1/sSAKo2ceK\nnwA8arwd8ELgz4GrFjvZrm3J63J0kkuAa4EfVNUrFzvZ8vEj75JaS/Jw4J3AE6rqs4ueZzkYbklq\nxkslktSM4ZakZgy3JDVjuCWpGcOtXVqSpyd58618ji8lufuifr92PYZbuhXGhz+kZWW4tSokeWqS\nT4/7mf9tkt2SfCvJa8a9zv89ySFJzh73b378kodvGNsvSfKyJc/5wXEP7gu3+vTkt5K8LsnngYct\n2b5Hkn9N8rvbm2lsf8a4qdingcMm/z9Hq47hVntJfh74LeCwqnoQcCPwFOBOwFlVdX/gm8CfAY9m\n9qm7Vyx5ikOY3R/mIOCYJBvH9meOe3BvBJ6bZJ+x/U7AOVX1wKr6+Nh2Z+BDwDur6u+2N1OSewJ/\nyizYv8jsk7LSTvF+3FoNDmd206HPzO6xzx7A1cD3gDPGMRcA362q7ye5ANhvyeM/WlXXAiR5P7Og\nbmIW6yeMYzYwu2H/tcwi/L6tZjgNeHVVveMWZnoosy9m2DJ+37sBv1FHO8VwazUIcEpVvfjHNiZ/\nUD/6aPAPgJu+rOIHSZb+t7/1x4crya8w+zaih1XVd5KcDdxx7L++qm7c6jGfAI5I8k/jd25
vpqN+\nov+F0hJeKtFqcCazGw79NECSuyX52Z14/KPHY/YAjmIW4bsC141o3w849Bae46XAdcBbbmGmc4BH\nJNknye2BY3ZiTgkw3FoFquoi4E+Aj4w7LH6U2VfCzevTzC59nA+8r6o2MbvEsibJxcBfAJ+a43me\nB+yR5NXbm6mqrmR298dPMvsL4uKdmFMCvMmUJLXjGbckNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox\n3JLUzP8DH9rOtI0QD4kAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VvdU8aAwZNvG" + }, + "source": [ + "Não vejo problemas com esta variável. Vamos em frente..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QBWfecCEF9ie" + }, + "source": [ + "### Variável 'pclass'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5TajFI92F_UI", + "outputId": "b762b729-d4c2-4e80-9dd2-334f67ea72ec", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "df['pclass'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "3 709\n", + "1 323\n", + "2 277\n", + "Name: pclass, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 252 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "atG9HFsYGNHE" + }, + "source": [ + "Algum problema com esta variável?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hdBdLzIPg3xD", + "outputId": "c6e49200-d76b-4025-9331-4ba999a09b3f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 386 + } + }, + "source": [ + "sns.catplot(x=\"pclass\", kind=\"count\", data=df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 253 + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAW4AAAFgCAYAAACbqJP/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAE6RJREFUeJzt3X+s3fV93/HnKxhKSxMMyZ3n2kZk\nq0XE2obQW0pGV3WwdEDXGHUpSdoGl3lyJ5EokaatbJXatc3aZF2bAZ2QrJLUTmlSSpLhUURnOfRX\nFiAmOEAwGbeozPYMvoQf+cHSiuzdP87nlpMb2xw6f++5H9/nQzo63+/nfM/hbV3p6S9fn3tOqgpJ\nUj9eMe0BJEkvj+GWpM4YbknqjOGWpM4YbknqjOGWpM4YbknqjOGWpM4YbknqzKppD/D/49JLL607\n77xz2mNI0vGSSQ7q+oz7qaeemvYIkrTkug63JK1EhluSOmO4JakzhluSOmO4JakzhluSOmO4Jakz\nhluSOmO4JakzhluSOmO4JakzhluSOtP1pwNKGs5FN1w07RFOGJ9616eO6+sNdsad5Jwke8duX0ry\nniRnJtmV5NF2f0Y7PkmuTzKX5IEk5w81myT1bLBwV9UXquq8qjoP+F7geeATwLXA7qraCOxu+wCX\nARvbbStw41CzSVLPluoa9yXAn1fV48AmYHtb3w5c0bY3ATtq5G5gdZK1SzSfJHVjqcL9NuAjbXtN\nVR1q208Aa9r2OmD/2HMOtLVvkGRrkj1J9szPzw81ryQtW4OHO8kpwJuB31/8WFUVUC/n9apqW1XN\nVtXszMzMcZpSkvqxFGfclwGfraon2/6TC5dA2v3htn4Q2DD2vPVtTZI0ZinC/XZevEwCsBPY3LY3\nA7eNrV/V3l1yIfDc2CUVSVIz6Pu4k5wGvAn4mbHl9wG3JNkCPA5c2dbvAC4H5hi9A+XqIWeTpF4N\nGu6q+irw6kVrX2T0LpPFxxZwzZDzSNKJwF95l6TOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTO\nGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J\n6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6syg\n4U6yOsmtSR5Jsi/JG5OcmWRXkkfb/Rnt2CS5PslckgeSnD/kbJLUq6HPuK8D7qyq1wGvB/YB1wK7\nq2ojsLvtA1wGbGy3rcCNA88mSV0aLNxJTgd+ELgJoKr+qqqeBTYB29th24Er2vYmYEeN3A2sTrJ2\nqPkkqVdDnnG/FpgHPpTk/iS/leQ0YE1VHWrHPAGsadvrgP1jzz/Q1r5Bkq1J9iTZMz8/P+D4krQ8\nDRnuVcD5wI1V9Qbgq7x4WQSAqiqgXs6LVtW2qpqtqtmZmZnjNqwk9WLIcB8ADlTVPW3/VkYhf3Lh\nEki7P9wePwhsGHv++rYmSRozWLir6glgf5Jz2tIlwMPATmBzW9sM3Na2dwJXtXeXXAg8N3ZJRZLU\nrBr49d8F3JzkFOAx4GpGf1nckmQL8DhwZTv2DuByYA54vh0rSVpk0HBX1V5g9ggPXXKEYwu4Zsh5\nJOlE4G9OSlJnDLckdcZwS1JnDLckdcZwS1JnDLckdcZwS1JnDLckdcZwS1JnDLckdcZwS1JnDLck\ndcZwS1JnDLckdcZwS1JnDLckdcZwS1JnDLckdcZwS1JnDLckdcZwS1JnDLckdcZwS1JnDLckdcZw\nS1JnDLckdcZwS1JnDLckdcZwS1JnDLckdcZwS1JnBg13kr9I8mCSvUn2tLUzk+xK8mi7P6OtJ8n1\nSeaSPJDk/CFnk6ReLcUZ9z+uqvOqarbtXwvsrqqNwO62D3AZs
LHdtgI3LsFsktSdaVwq2QRsb9vb\ngSvG1nfUyN3A6iRrpzCfJC1rQ4e7gP+R5L4kW9vamqo61LafANa07XXA/rHnHmhrkqQxqwZ+/R+o\nqoNJ/g6wK8kj4w9WVSWpl/OC7S+ArQBnnXXW8ZtUkjox6Bl3VR1s94eBTwAXAE8uXAJp94fb4QeB\nDWNPX9/WFr/mtqqararZmZmZIceXpGVpsHAnOS3JKxe2gR8GHgJ2ApvbYZuB29r2TuCq9u6SC4Hn\nxi6pSJKaIS+VrAE+kWThv/O7VXVnks8AtyTZAjwOXNmOvwO4HJgDngeuHnA2SerWYOGuqseA1x9h\n/YvAJUdYL+CaoeaRpBOFvzkpSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x\n3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLU\nGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUmcHDneSkJPcnub3t\nvzbJPUnmkvxeklPa+re0/bn2+NlDzyZJPVqKM+53A/vG9t8PfKCqvhN4BtjS1rcAz7T1D7TjJEmL\nDBruJOuBHwF+q+0HuBi4tR2yHbiibW9q+7THL2nHS5LGDH3G/V+Afwv8v7b/auDZqnqh7R8A1rXt\ndcB+gPb4c+14SdKYicKdZPcka4se/2fA4aq6728529Fed2uSPUn2zM/PH8+XlqQurDrWg0lOBb4N\neE2SM4CFSxev4sUz5aO5CHhzksuBU9tzrgNWJ1nVzqrXAwfb8QeBDcCBJKuA04EvLn7RqtoGbAOY\nnZ2tl/wTStIJ5qXOuH8GuA94XbtfuN0G/OaxnlhV/66q1lfV2cDbgE9W1U8CdwFvaYdtbq8FsLPt\n0x7/ZFUZZkla5Jhn3FV1HXBdkndV1Q3H6b/5s8BHk7wXuB+4qa3fBHw4yRzwNKPYS5IWOWa4F1TV\nDUn+IXD2+HOqaseEz/8j4I/a9mPABUc45mvAj0/yepK0kk0U7iQfBv4+sBf4elsuYKJwS5KOn4nC\nDcwC53rNWZKmb9L3cT8E/N0hB5EkTWbSM+7XAA8nuRf4y4XFqnrzIFNJko5q0nD/hyGHkCRNbtJ3\nlfzx0INIkiYz6btKvszoXSQApwAnA1+tqlcNNZgk6cgmPeN+5cJ2+8S+TcCFQw0lSTq6l/3pgDXy\n34B/OsA8kqSXMOmlkh8b230Fo/d1f22QiSRJxzTpu0p+dGz7BeAvGF0ukSQtsUmvcV899CCSpMlM\n+kUK65N8IsnhdvtY+1oySdISm/QfJz/E6POyv6Pd/ntbkyQtsUnDPVNVH6qqF9rtt4GZAeeSJB3F\npOH+YpKfSnJSu/0UR/haMUnS8CYN978ArgSeAA4x+mqxnx5oJknSMUz6dsBfAjZX1TMASc4E/jOj\noEuSltCkZ9zfsxBtgKp6GnjDMCNJko5l0nC/IskZCzvtjHvSs3VJ0nE0aXx/Hfh0kt9v+z8O/Mdh\nRpIkHcukvzm5I8ke4OK29GNV9fBwY0mSjmbiyx0t1MZakqbsZX+sqyRpugy3JHXGcEtSZ1bEW/q+\n99/smPYIJ4z7fu2qaY8grXiecUtSZwy3JHXGcEtSZwy3JHXGcEtSZwYLd5JTk9yb5HNJPp/kF9v6\na5Pck2Quye8lOaWtf0vbn2uPnz3UbJLUsyHPuP8SuLiqXg+cB1ya5ELg/cAHquo7gWeALe34LcAz\nbf0D7ThJ0iKDhbtGvtJ2T263YvRBVbe29e3AFW17U9unPX5Jkgw1nyT1atBr3O37KfcCh4FdwJ8D\nz1bVC+2QA8C6tr0O2A/QHn8OePURXnNrkj1J9szPzw85viQtS4OGu6q+XlXnAeuBC4DXHYfX3FZV\ns1U1OzPjF81LWnmW5F0lV
fUscBfwRmB1koVftV8PHGzbB4ENAO3x0/Gb5CXpmwz5rpKZJKvb9rcC\nbwL2MQr4W9phm4Hb2vbOtk97/JNVVUPNJ0m9GvJDptYC25OcxOgviFuq6vYkDwMfTfJe4H7gpnb8\nTcCHk8wBTwNvG3A2SerWYOGuqgc4wjfBV9VjjK53L17/GqPvspQkHYO/OSlJnTHcktQZwy1JnTHc\nktSZFfHVZVq+/vcvffe0RzhhnPXzD057BC0Rz7glqTOGW5I6Y7glqTOGW5I6Y7glqTOGW5I6Y7gl\nqTOGW5I6Y7glqTOGW5I6Y7glqTOGW5I6Y7glqTOGW5I6Y7glqTOGW5I6Y7glqTOGW5I6Y7glqTOG\nW5I6Y7glqTOGW5I6Y7glqTOGW5I6Y7glqTOGW5I6Y7glqTODhTvJhiR3JXk4yeeTvLutn5lkV5JH\n2/0ZbT1Jrk8yl+SBJOcPNZsk9WzIM+4XgH9dVecCFwLXJDkXuBbYXVUbgd1tH+AyYGO7bQVuHHA2\nSerWYOGuqkNV9dm2/WVgH7AO2ARsb4dtB65o25uAHTVyN7A6ydqh5pOkXi3JNe4kZwNvAO4B1lTV\nofbQE8Catr0O2D/2tANtbfFrbU2yJ8me+fn5wWaWpOVq8HAn+XbgY8B7qupL449VVQH1cl6vqrZV\n1WxVzc7MzBzHSSWpD4OGO8nJjKJ9c1V9vC0/uXAJpN0fbusHgQ1jT1/f1iRJY4Z8V0mAm4B9VfUb\nYw/tBDa37c3AbWPrV7V3l1wIPDd2SUWS1Kwa8LUvAt4BPJhkb1v798D7gFuSbAEeB65sj90BXA7M\nAc8DVw84myR1a7BwV9WfATnKw5cc4fgCrhlqHkk6Ufibk5LUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMt\nSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x\n3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLU\nGcMtSZ0ZLNxJPpjkcJKHxtbOTLIryaPt/oy2niTXJ5lL8kCS84eaS5J6N+QZ928Dly5auxbYXVUb\ngd1tH+AyYGO7bQVuHHAuSeraYOGuqj8Bnl60vAnY3ra3A1eMre+okbuB1UnWDjWbJPVsqa9xr6mq\nQ237CWBN214H7B877kBb+yZJtibZk2TP/Pz8cJNK0jI1tX+crKoC6m/xvG1VNVtVszMzMwNMJknL\n21KH+8mFSyDt/nBbPwhsGDtufVuTJC2y1OHeCWxu25uB28bWr2rvLrkQeG7skookacyqoV44yUeA\nHwJek+QA8AvA+4BbkmwBHgeubIffAVwOzAHPA1cPNZck9W6wcFfV24/y0CVHOLaAa4aaRZJOJP7m\npCR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1\nxnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBL\nUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmeWVbiTXJrkC0nmklw77XkkaTlaNuFOchLwX4HL\ngHOBtyc5d7pTSdLys2zCDVwAzFXVY1X1V8BHgU1TnkmSlp1U1bRnACDJW4BLq+pftv13AN9fVe9c\ndNxWYGvbPQf4wpIOOqzXAE9NewgdkT+b5etE+tk8VVWXvtRBq5ZikuOpqrYB26Y9xxCS7Kmq2WnP\noW/mz2b5Wok/m+V0qeQgsGFsf31bkySNWU7h/gywMclrk5wCvA3YOeWZJGnZWTaXSqrqhSTvBP4Q\nOAn4YFV9fspjLbUT8hLQCcKfzfK14n42y+YfJyVJk1lOl0okSRMw3JLUGcO9DCT5YJLDSR6
a9ix6\nUZINSe5K8nCSzyd597Rn0kiSU5Pcm+Rz7Wfzi9OeaSl5jXsZSPKDwFeAHVX1XdOeRyNJ1gJrq+qz\nSV4J3AdcUVUPT3m0FS9JgNOq6itJTgb+DHh3Vd095dGWhGfcy0BV/Qnw9LTn0DeqqkNV9dm2/WVg\nH7BuulMJoEa+0nZPbrcVcxZquKUJJDkbeANwz3Qn0YIkJyXZCxwGdlXVivnZGG7pJST5duBjwHuq\n6kvTnkcjVfX1qjqP0W9ZX5BkxVxmNNzSMbTrpx8Dbq6qj097Hn2zqnoWuAt4yQ9nOlEYbuko2j+A\n3QTsq6rfmPY8elGSmSSr2/a3Am8CHpnuVEvHcC8DST4CfBo4J8mBJFumPZMAuAh4B3Bxkr3tdvm0\nhxIAa4G7kjzA6HOOdlXV7VOeacn4dkBJ6oxn3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtHUGSH0qy\nYt5epr4YbknqjOHWipHk7CSPJLk5yb4ktyb5tiTfl+R/ts92vrd9hOv48y5I8ukk97fjzmnr/6Ad\nvzfJA0k2JjktyR+013ooyVun86fViWzZfFmwtETOAbZU1aeSfBB4J/CvgLdW1WeSvAr4v4ue8wjw\nj9oXWv8T4FeAf96ed11V3ZzkFEZfcn058H+q6kcAkpy+NH8srSSGWyvN/qr6VNv+HeDngENV9RmA\nhU//G31Myd84HdieZCOjz3w+ua1/Gvi5JOuBj1fVo0keBH49yfuB26vqTwf/E2nF8VKJVprFn/Ew\nyce0/jJwV/t2oh8FTgWoqt8F3szoDP2OJBdX1f8CzgceBN6b5OeP2+RSY7i10pyV5I1t+yeAu4G1\nSb4PIMkrkyz+P9HTgYNt+6cXFpP8PeCxqroeuA34niTfATxfVb8D/BqjiEvHleHWSvMF4Jok+4Az\ngBuAtwI3JPkcsIt2Rj3mPwG/muR+vvHy4pXAQ+1bWL4L2AF8N3BvW/sF4L1D/mG0MvnpgFox2teP\n3e4XMqt3nnFLUmc845akznjGLUmdMdyS1BnDLUmdMdyS1BnDLUmd+WtB5MCuB4eMpAAAAABJRU5E\nrkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qrnc6VUKSTNp" + }, + "source": [ + "### Variable 'parch'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2i4ed-0zSvJc", + "outputId": "029427d3-2436-4fe8-f957-9b8ec2baf4ec", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 170 + } + }, + "source": [ + "df['parch'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 1002\n", + "1 170\n", + "2 113\n", + "3 8\n", + "5 6\n", + "4 6\n", + "9 2\n", + "6 2\n", + "Name: parch, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 254 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qd7u__6KZ6DM", + "outputId": "b9638d31-849f-4888-fe06-a39852a1a9ce", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 386 + } + }, + "source": [ + "sns.catplot(x=\"parch\", kind=\"count\", data=df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 255 + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAW4AAAFgCAYAAACbqJP/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAE7FJREFUeJzt3X+w3XV95/HnSyJFaAWEuywmOGFa\nxmrddqG3lJYu7cKujdQaxqJju2qWZSe7W3SxdFppO1NYd5zR2VpFbZlJCTZU6o+CltRxUAYQW2el\nBqWCRNcMrSYZMBcF/DXWYt/7x/kEjpgbTuo953s+3Odj5s79fr/ne895R5lnvvme8/3eVBWSpH48\nZegBJEmHxnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1xnBLUmcMtyR1Zs3QA0zDhg0b6sYbbxx6\nDEk6VJlkpyflEfcDDzww9AiSNDVPynBL0pOZ4ZakzhhuSeqM4ZakzhhuSeqM4ZakzhhuSeqM4Zak\nzhhuSeqM4Zakzkwt3EmuTrIvyd1j256R5KYkn2/fj23bk+StSXYl+XSS08Z+ZlPb//NJNk1rXknq\nxTSPuP8U2PC4bZcCN1fVKcDNbR3gBcAp7WszcCWMQg9cBvw0cDpw2f7YS9JqNbW7A1bVR5Osf9zm\njcAvtOVtwEeA17bt11RVAR9PckySE9u+N1XVVwCS3MToL4N3HcosP/lb1/yL/gzfrzv+zysHeV1J\nT26zPsd9QlXd15bvB05oy2uB3WP77Wnbltv+PZJsTrIjyY6lpaWVnVqS5shgb062o+tawefbUlWL\nVbW4sLCwUk8rSXNn1uH+UjsFQvu+r23fC5w0tt+6tm257ZK0as063NuB/Z8M2QTcMLb9le3TJWcA\nD7dTKh8Cnp/k2Pam5PPbNklatab25mSSdzF6c/H4JHsYfTrkDcB7k1wIfAF4adv9g8C5wC7gm8AF\nAFX1lST/G/hE2+91+9+olKTVapqfKvnVZR465wD7FnDRMs9zNXD1Co4mSV3zyklJ6ozhlqTOGG5J\n6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozh\nlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTO\nGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J\n6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6swg4U7yG0k+k+TuJO9KckSSk5Pc\nnmRXkvckObzt+wNtfVd7fP0QM0vSvJh5uJOsBf4nsFhVzwMOA14GvBF4c1X9CPAgcGH7kQuBB9v2\nN7f9JGnVGupUyRrgaUnWAEcC9wFnA9e1x7cB57XljW2d9vg5STLDWSVprsw83FW1F/gD4IuMgv0w\ncAfwUFU90nbbA6xty2uB3e1nH2n7H/f4502yOcmOJDuWlpam+4eQpAENcarkWEZH0ScDzwSOAjZ8\nv89bVVuqarGqFhcWFr7fp5OkuTXEqZL/APx9VS1V1T8B7wPOBI5pp04A1gF72/Je4CSA9vjRwJdn\nO7IkzY8hwv1F4IwkR7Zz1ecA9wC3Aue3fTYBN7Tl7W2d9vgtVVUznFeS5soQ57hvZ/Qm4yeBu9oM\nW4DXApck2cXoHPbW9iNbgePa9kuAS2c9syTNkzVPvMvKq6rLgMset/le4PQD7Pst4CWzmEuSeuCV\nk5LUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLU\nGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUG
cMtSZ0x3JLUGcMtSZ0x3JLUGcMt\nSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x\n3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0ZJNxJjklyXZLPJtmZ\n5GeSPCPJTUk+374f2/ZNkrcm2ZXk00lOG2JmSZoXQx1xXwHcWFU/CvwEsBO4FLi5qk4Bbm7rAC8A\nTmlfm4ErZz+uJM2PmYc7ydHAWcBWgKr6dlU9BGwEtrXdtgHnteWNwDU18nHgmCQnznhsSZobQxxx\nnwwsAe9I8qkkVyU5Cjihqu5r+9wPnNCW1wK7x35+T9v2XZJsTrIjyY6lpaUpji9Jwxoi3GuA04Ar\nq+pU4Bs8dloEgKoqoA7lSatqS1UtVtXiwsLCig0rSfNmiHDvAfZU1e1t/TpGIf/S/lMg7fu+9vhe\n4KSxn1/XtknSqjTzcFfV/cDuJM9um84B7gG2A5vatk3ADW15O/DK9umSM4CHx06pSNKqs2ag1301\ncG2Sw4F7gQsY/SXy3iQXAl8AXtr2/SBwLrAL+GbbV5JWrUHCXVV3AosHeOicA+xbwEVTH0qSOuGV\nk5LUGcMtSZ0x3JLUGcMtSZ0x3JLUmYnCneTmSbZJkqbvoB8HTHIEcCRwfLvNatpDT+cA9wuRJE3f\nE32O+78BrwGeCdzBY+H+KvD2Kc4lSVrGQcNdVVcAVyR5dVW9bUYzSZIOYqIrJ6vqbUl+Flg//jNV\ndc2U5pIkLWOicCf5M+CHgTuB77TNBRhuSZqxSe9Vsgg8t903RJI0oEk/x3038K+nOYgkaTKTHnEf\nD9yT5G+Bf9y/sapeNJWpJEnLmjTcl09zCEnS5Cb9VMlt0x5EkjSZST9V8jUe++W9hwNPBb5RVU+f\n1mCSpAOb9Ij7h/YvJwmwEThjWkNJkpZ3yHcHrJG/BH5xCvNIkp7ApKdKXjy2+hRGn+v+1lQmkiQd\n1KSfKvnlseVHgH9gdLpEkjRjk57jvmDag0iSJjPpL1JYl+T9Sfa1r+uTrJv2cJKk7zXpm5PvALYz\nui/3M4G/atskSTM2abgXquodVfVI+/pTYGGKc0mSljFpuL+c5OVJDmtfLwe+PM3BJEkHNmm4/wvw\nUuB+4D7gfOA/T2kmSdJBTPpxwNcBm6rqQYAkzwD+gFHQJUkzNOkR94/vjzZAVX0FOHU6I0mSDmbS\ncD8lybH7V9oR96RH65KkFTRpfN8E/N8kf9HWXwK8fjojSZIOZtIrJ69JsgM4u216cVXdM72xJEnL\nmfh0Rwu1sZakgR3ybV0lScMy3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLU\nGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUmcHCneSwJJ9K8oG2fnKS25PsSvKeJIe37T/Q1ne1\nx9cPNbMkzYMhj7gvBnaOrb8ReHNV/QjwIHBh234h8GDb/ua2nyStWoOEO8k64JeAq9p6GP0+y+va\nLtuA89ryxrZOe/yctr8krUpDHXG/Bfht4J/b+nHAQ1X1SFvfA6xty2uB3QDt8Yfb/t8lyeYkO5Ls\nWFpamubskjSomYc7yQuBfVV1x0o+b1VtqarFqlpcWFhYyaeWpLky8W95X0FnAi9Kci5wBPB04Arg\nmCRr2lH1OmBv238vcBKwJ8ka4Gjgy7MfW5Lmw8yPuKvqd6pqXVWtB14G3FJV/wm4FTi/7bYJuKEt\nb2/rtMdvqaqa4ciSNFfm6XPcrwUuSbKL0TnsrW37VuC4tv0S4NKB5pOkuTDEqZJHVdVHgI+05XuB\n0w+wz7eAl8x0MEmaY/N0xC1JmoDhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozh\nlqTOGG5J6ozhlqTOGG5J6
ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTO\nGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J\n6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozhlqTOGG5J6ozh\nlqTOGG5J6szMw53kpCS3JrknyWeSXNy2PyPJTUk+374f27YnyVuT7Ery6SSnzXpmSZonQxxxPwL8\nZlU9FzgDuCjJc4FLgZur6hTg5rYO8ALglPa1Gbhy9iNL0vyYebir6r6q+mRb/hqwE1gLbAS2td22\nAee15Y3ANTXyceCYJCfOeGxJmhuDnuNOsh44FbgdOKGq7msP3Q+c0JbXArvHfmxP2/b459qcZEeS\nHUtLS1ObWZKGNli4k/wgcD3wmqr66vhjVVVAHcrzVdWWqlqsqsWFhYUVnFSS5ssg4U7yVEbRvraq\n3tc2f2n/KZD2fV/bvhc4aezH17VtkrQqDfGpkgBbgZ1V9YdjD20HNrXlTcANY9tf2T5dcgbw8Ngp\nFUladdYM8JpnAq8A7kpyZ9v2u8AbgPcmuRD4AvDS9tgHgXOBXcA3gQtmO64kzZeZh7uq/gbIMg+f\nc4D9C7hoqkNJUke8clKSOmO4JakzhluSOjPEm5Nqvvi6fzPI6z7r9+8a5HUlrQyPuCWpM4Zbkjpj\nuCWpM4ZbkjpjuCWpM4ZbkjpjuCWpM4ZbkjpjuCWpM4ZbkjpjuCWpM4ZbkjpjuCWpM4ZbkjpjuCWp\nM96PW9/jzLedOcjrfuzVHxvkdaXeeMQtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLU\nGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMt\nSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ0x3JLUGcMtSZ1ZM/QAk0qyAbgCOAy4qqreMPBI\nmrHbzvr5QV735z962yCvKy2ni3AnOQz4I+A/AnuATyTZXlX3DDuZVru3/+ZfDfbar3rTLy/72Otf\nfv4MJ/luv/fO6wZ77dWii3ADpwO7qupegCTvBjYChlvqyM7X3zLI6z7n984+6OOXX375bAZZoddN\nVa3sJFOQ5HxgQ1X917b+CuCnq+pVY/tsBja31WcDn1uhlz8eeGCFnmslOdehca7JzeNMsDrmeqCq\nNjzRTr0ccT+hqtoCbFnp502yo6oWV/p5v1/OdWica3LzOBM417hePlWyFzhpbH1d2yZJq04v4f4E\ncEqSk5McDrwM2D7wTJI0iC5OlVTVI0leBXyI0ccBr66qz8zo5Vf89MsKca5D41yTm8eZwLke1cWb\nk5Kkx/RyqkSS1BhuSeqM4T6IJBuSfC7JriSXDj0PQJKrk+xLcvfQs4xLclKSW5Pck+QzSS6eg5mO\nSPK3Sf6uzfS/hp5pXJLDknwqyQeGnmW/JP+Q5K4kdybZMfQ8+yU5Jsl1ST6bZGeSnxl6JoAkFye5\nu/339ZqZva7nuA+sXWb//xi7zB741aEvs09yFvB14Jqqet6Qs4xLciJwYlV9MskPAXcA5w35v1eS\nAEdV1deTPBX4G+Diqvr4UDONS3IJsAg8vapeOPQ8MAo3sFhVc3WhS5JtwF9X1VXtk2VHVtVDA8/0\nPODdjK7s/jZwI/Dfq2rXtF/bI+7lPXqZfVV9m9H/QRsHnomq+ijwlaHneLyquq+qPtmWvwbsBNYO\nPFNV1dfb6lPb11wcqSRZB/wScNXQs8y7JEcDZwFbAarq20NHu3kOcHtVfbOqHgFuA148ixc23Mtb\nC+weW9/DwCHqRZL1wKnA7cNO8ujpiDuBfcBNVTX4TM1bgN8G/nnoQR6ngA8nuaPdRmIenAw
sAe9o\np5auSnLU0EMBdwP/LslxSY4EzuW7LxScGsOtFZXkB4HrgddU1VeHnqeqvlNV/5bR1bant3/eDirJ\nC4F9VXXH0LMcwM9V1WnAC4CL2qm5oa0BTgOurKpTgW8Ag7/nVFU7gTcCH2Z0muRO4DuzeG3DvTwv\nsz9E7Tzy9cC1VfW+oecZ1/5pfSvwhDfwmYEzgRe188nvBs5O8s5hRxqpqr3t+z7g/YxOGQ5tD7Bn\n7F9L1zEK+eCqamtV/WRVnQU8yOh9sakz3MvzMvtD0N4I3ArsrKo/HHoegCQLSY5py09j9EbzZ4ed\nCqrqd6pqXVWtZ/Tf1S1V9fKBxyLJUe2NZdqpiOczOh0wqKq6H9id5Nlt0znMyS2dk/yr9v1ZjM5v\n//ksXreLS96HMPBl9stK8i7gF4Djk+wBLquqrcNOBYyOIl8B3NXOKQP8blV9cMCZTgS2tU8IPQV4\nb1XNzUfv5tAJwPtHfwezBvjzqrpx2JEe9Wrg2nYQdS9wwcDz7Hd9kuOAfwIumtWbpn4cUJI646kS\nSeqM4ZakzhhuSeqM4ZakzhhuSeqM4ZZWUJL183bnRj35GG7pXyCJ10BoMIZbq1Y7Ov5skmvbPZ6v\nS3Jkkt9P8ol2n+Ut7apQknwkyVvafaovTnJCkve3+33/XZKfbU99WJI/afdo/nC7alNaMYZbq92z\ngT+uqucAXwV+HXh7Vf1Uu9/504Dxe2UfXlWLVfUm4K3AbVX1E4zunbH/ytpTgD+qqh8DHgJ+ZUZ/\nFq0Shlur3e6q+lhbfifwc8C/T3J7kruAs4EfG9v/PWPLZwNXwqN3IXy4bf/7qtp/2f8dwPppDa/V\nyfN0Wu0ef8+HAv6Y0W+B2Z3kcuCIsce/McFz/uPY8ncYHbVLK8Yjbq12zxr7/YW/xujXmwE80O4t\nfv5BfvZm4H/Ao7+w4ejpjSk9xnBrtfsco18YsBM4ltGpjz9hdDvTDzG6ve9yLmZ0WuUuRqdEnjvl\nWSXAuwNqFWu/Yu0D8/RLl6VJeMQtSZ3xiFuSOuMRtyR1xnBLUmcMtyR1xnBLUmcMtyR15v8DHTPO\ndbcumhYAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z9vM3vktC7BG" + }, + "source": [ + "#### Exercise:\n", + "* Create the attribute 'sozinho_parch', where 1 means the passenger is travelling alone and 0 otherwise." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Nd4TyOYjs-HW" + }, + "source": [ + "# Return 1 when variavel is 0 (passenger travelling alone) and 0 otherwise\n", + "def sozinho(variavel):\n", + "    if variavel == 0:\n", + "        return 1\n", + "    else:\n", + "        return 0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5oByiBuos_B3", + "outputId": "ca493249-7147-4273-e3ac-cf22ff8ecec7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "df['sozinho_parch'] = df['parch'].map(sozinho)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassnameagesibspparchfareembarkedsurvived2sexdeckseatsozinho_parch
PassengerId
10.03Braund, Mr. Owen Harris22.0107.2500SDiedmaleNaNNaN1
21.01Cumings, Mrs. John Bradley (Florence Briggs Th...38.01071.2833CSurvivedfemaleC851
31.03Heikkinen, Miss. Laina26.0007.9250SSurvivedfemaleNaNNaN1
41.01Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01053.1000SSurvivedfemaleC1231
50.03Allen, Mr. William Henry35.0008.0500SDiedmaleNaNNaN1
\n", + "
" + ], + "text/plain": [ + " survived pclass ... seat sozinho_parch\n", + "PassengerId ... \n", + "1 0.0 3 ... NaN 1\n", + "2 1.0 1 ... 85 1\n", + "3 1.0 3 ... NaN 1\n", + "4 1.0 1 ... 123 1\n", + "5 0.0 3 ... NaN 1\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 257 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C1ICby1oSd41" + }, + "source": [ + "### Variável 'sibsp'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5n7JNEQqTNjz", + "outputId": "dc13b210-2928-488d-84e9-22a36929848a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 153 + } + }, + "source": [ + "df['sibsp'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 891\n", + "1 319\n", + "2 42\n", + "4 22\n", + "3 20\n", + "8 9\n", + "5 6\n", + "Name: sibsp, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 258 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NLfMhiy0x4u5" + }, + "source": [ + "* Algum problema?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nayYFRK9g8iV", + "outputId": "feb5e2e5-a924-49ee-8f1c-7d3f56745e5f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 386 + } + }, + "source": [ + "sns.catplot(x=\"sibsp\", kind=\"count\", data=df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 259 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAFgCAYAAACbqJP/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEflJREFUeJzt3Xuw53Vdx/HnC1ZEMAFxh2SXBqYY\ni7xBO0ZRWmC1XtdpSK1UMopq8JZOSZdRuziT44XQGmcYUEEc09CC0iEdQB0dQxclQFZzxwx2RVkM\n8Jbl5rs/fp/Vsxu7+9vke37nvft8zJw539vvt2+YnSdfvuf7+55UFZKkPg5a9ACSpH1juCWpGcMt\nSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRmDLckNbNq0QN8L9avX19XXXXVoseQpPtK5jmo9Rn3nXfe\nuegRJGnZtQ63JB2IDLckNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnD\nLUnNtH464K5+7PcuXfQIO7n+1c9Z9AiS9kOecUtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnN\nGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRm\nDLckNWO4JakZwy1JzRhuSWrGcEtSM4ZbkpqZNNxJfjfJp5LcnOTtSQ5NckKS65JsTvKOJIeMY+8/\n1jeP/cdPOZskdTVZuJOsAV4ArKuqhwMHA88EXgWcX1U/BNwFnD1ecjZw19h+/jhOkrSLqS+VrAIe\nkGQVcBhwO3A6cPnYfwnwtLG8Yawz9p+RJBPPJ0ntTBbuqtoKvAa4lVmw7wGuB+6uqu3jsC3AmrG8\nBrhtvHb7OP7oXd83yTlJNibZuG3btqnGl6QVa8pLJUcxO4s+ATgWOBxY/72+b1VdWFXrqmrd6tWr\nv9e3k6R2prxU8njg36pqW1V9C3g3cBpw5Lh0ArAW2DqWtwLHAYz9RwBfnnA+SWppynDfCpya5LBx\nrfoM4BbgWuDMccxZwBVj+cqxzth/TVXVhPNJUktTXuO+jtkPGT8B3DT+rAuBlwIvTrKZ2TXsi8dL\nLgaOHttfDJw31WyS1NmqvR/y/1dVLwdevsvmzwGPuZdjvwn80pTzSNL+wE9OSlIzhluSmjHcktSM\n4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRmDLckNWO4JakZwy1JzRhuSWrG\ncEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVj\nuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRmDLckNWO4JakZwy1JzRhu
SWrGcEtSM4Zbkpox\n3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0Y\nbklqZtJwJzkyyeVJPp1kU5KfSPLgJO9P8tnx/ahxbJK8PsnmJDcmOWXK2SSpq6nPuC8ArqqqHwYe\nBWwCzgOurqoTgavHOsATgBPH1znAGyeeTZJamizcSY4AHgtcDFBV/11VdwMbgEvGYZcATxvLG4BL\na+afgSOTPHSq+SSpqynPuE8AtgFvTvLJJBclORw4pqpuH8d8EThmLK8Bblvy+i1j206SnJNkY5KN\n27Ztm3B8SVqZpgz3KuAU4I1VdTLwdb57WQSAqiqg9uVNq+rCqlpXVetWr159nw0rSV1MGe4twJaq\num6sX84s5F/acQlkfL9j7N8KHLfk9WvHNknSEpOFu6q+CNyW5GFj0xnALcCVwFlj21nAFWP5SuA5\n4+6SU4F7llxSkSQNqyZ+/+cDb0tyCPA54LnM/mPxziRnA/8OPH0c+17gicBm4BvjWEnSLiYNd1Xd\nAKy7l11n3MuxBZw75TyStD/wk5OS1IzhlqRmDLckNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox3JLU\njOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0Ybklq\nxnBLUjOGW5KaMdyS1IzhlqRm5gp3kqvn2SZJmt6qPe1McihwGPCQJEcBGbseBKyZeDZJ0r3YY7iB\n3wJeBBwLXM93w/0V4K8mnEuStBt7DHdVXQBckOT5VfWGZZpJkrQHezvjBqCq3pDkJ4Hjl76mqi6d\naC5J0m7MFe4kbwV+ELgB+J+xuQDDLUnLbK5wA+uAk6qqphxGkrR3897HfTPw/VMOIkmaz7xn3A8B\nbknyMeC/dmysqqdOMpUkabfmDfcrphxCkjS/ee8q+eDUg0iS5jPvXSVfZXYXCcAhwP2Ar1fVg6Ya\nTJJ07+Y94/6+HctJAmwATp1qKEnS7u3z0wFr5u+BX5hgHknSXsx7qeQXl6wexOy+7m9OMpEkaY/m\nvavkKUuWtwOfZ3a5RJK0zOa9xv3cqQeRJM1n3l+ksDbJ3yW5Y3y9K8naqYeTJP1f8/5w8s3Alcye\ny30s8A9jmyRpmc0b7tVV9eaq2j6+3gKsnnAuSdJuzBvuLyd5VpKDx9ezgC9POZgk6d7NG+5fB54O\nfBG4HTgT+LWJZpIk7cG8twP+KXBWVd0FkOTBwGuYBV2StIzmPeN+5I5oA1TVfwAnTzOSJGlP5g33\nQUmO2rEyzrjnPVuXJN2H5o3va4GPJvnbsf5LwCunGUmStCfzfnLy0iQbgdPHpl+sqlumG0uStDtz\nX+4YoTbWkrRg+/xYV0nSYhluSWpm8nCPT1p+Msk/jvUTklyXZHOSdyQ5ZGy//1jfPPYfP/VsktTR\ncpxxvxDYtGT9VcD5VfVDwF3A2WP72cBdY/v54zhJ0i4mDfd49OuTgIvGepjdmXL5OOQS4GljecNY\nZ+w/YxwvSVpi6jPuvwR+H/j2WD8auLuqto/1LcCasbwGuA1g7L9nHL+TJOck2Zhk47Zt26acXZJW\npMnCneTJwB1Vdf19+b5VdWFVrauqdatX+2RZSQeeKT+2fhrw1CRPBA4FHgRcAByZZNU4q14LbB3H\nbwWOA7YkWQUcgY+OlaT/Y7Iz7qr6g6paW1XHA88ErqmqXwWuZfZYWICzgCvG8pVjnbH/mqqqqeaT\npK4WcR/3S4EXJ9nM7Br2xWP7xcDRY/uLgfMWMJskrXjL8oS/qvoA8IGx/DngMfdyzDeZPbxKkrQH\nfnJSkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHcktSM4ZakZgy3JDVjuCWp\nGcMtSc0YbklqxnBLUjOGW5KaMdyS1Izh
lqRmDLckNbMsvyxYu3frnz5i0SN8xw+87KZFjyBpDp5x\nS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRmDLckNWO4\nJakZwy1JzRhuSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHc\nktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1Mxk4U5yXJJrk9yS5FNJXji2PzjJ\n+5N8dnw/amxPktcn2ZzkxiSnTDWbJHU25Rn3duAlVXUScCpwbpKTgPOAq6vqRODqsQ7wBODE8XUO\n8MYJZ5OktiYLd1XdXlWfGMtfBTYBa4ANwCXjsEuAp43lDcClNfPPwJFJHjrVfJLU1bJc405yPHAy\ncB1wTFXdPnZ9EThmLK8Bblvysi1j267vdU6SjUk2btu2bbKZJWmlmjzcSR4IvAt4UVV9Zem+qiqg\n9uX9qurCqlpXVetWr159H04qST1MGu4k92MW7bdV1bvH5i/tuAQyvt8xtm8Fjlvy8rVjmyRpiSnv\nKglwMbCpql63ZNeVwFlj+SzgiiXbnzPuLjkVuGfJJRVJ0rBqwvc+DXg2cFOSG8a2PwT+AnhnkrOB\nfweePva9F3gisBn4BvDcCWeTpLYmC3dVfRjIbnafcS/HF3DuVPNI0v7CT05KUjOGW5KaMdyS1Izh\nlqRmDLckNWO4JakZwy1JzRhuSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZw\nS1IzhluSmjHcktSM4ZakZgy3JDVjuCWpGcMtSc0YbklqxnBLUjOGW5KaMdyS1IzhlqRmDLckNWO4\nJakZwy1JzRhuSWrGcEtSM4Zbkpox3JLUjOGWpGYMtyQ1Y7glqRnDLUnNGG5JasZwS1IzhluSmjHc\nktSM4ZakZgy3JDWzatEDqJfT3nDaokf4jo88/yOLHkFaCM+4JakZwy1JzRhuSWrGcEtSM4Zbkpox\n3JLUjLcDar/2wcc+btEjfMfjPvTBvR7zVy/5h2WYZD7Pe+1TFj2CdsMzbklqxnBLUjOGW5KaWVHX\nuJOsBy4ADgYuqqq/WPBIkvYjm155zaJH2MmP/NHp/6/XrZhwJzkY+Gvg54AtwMeTXFlVtyx2Mkm7\n88pnnbnoEXbyR5ddvugRlsVKulTyGGBzVX2uqv4b+Btgw4JnkqQVJ1W16BkASHImsL6qfmOsPxv4\n8ap63i7HnQOcM1YfBnxmgnEeAtw5wftOxXmn121m553WVPPeWVXr93bQirlUMq+quhC4cMo/I8nG\nqlo35Z9xX3Le6XWb2Xmnteh5V9Klkq3AcUvW145tkqQlVlK4Pw6cmOSEJIcAzwSuXPBMkrTirJhL\nJVW1PcnzgH9idjvgm6rqUwsaZ9JLMRNw3ul1m9l5p7XQeVfMDyclSfNZSZdKJElzMNyS1Izh3kWS\n9Uk+k2RzkvMWPc+eJHlTkjuS3LzoWeaR5Lgk1ya5Jcmnkrxw0TPtSZJDk3wsyb+Mef9k0TPNI8nB\nST6Z5B8XPcs8knw+yU1JbkiycdHz7E2S3x1/H25O8vYkhy73DIZ7iSUfu38CcBLwy0lOWuxUe/QW\nYK83668g24GXVNVJwKnAuSv83+9/AadX1aOARwPrk5y64Jnm8UJg06KH2Ec/W1WPXun3cidZA7wA\nWFdVD2d2I8Uzl3sOw72zVh+7r6oPAf+x6DnmVVW3V9UnxvJXmcVlzWKn2r2a+dpYvd/4WtE/zU+y\nFngScNGiZ9mPrQIekGQVcBjwheUewHDvbA1w25L1LazgsHSW5HjgZOC6xU6yZ+Oyww3AHcD7q2pF\nzwv8JfD7wLcXPcg+KOB9Sa4fj7RYsapqK/Aa4FbgduCeqnrfcs9huLXskjwQeBfwoqr6yqLn2ZOq\n+p+q
ejSzT/I+JsnDFz3T7iR5MnBHVV2/6Fn20U9V1SnMLlGem+Sxix5od5Icxez/wk8AjgUOT/Ks\n5Z7DcO/Mj91PLMn9mEX7bVX17kXPM6+quhu4lpX9M4XTgKcm+Tyzy3ynJ7lssSPt3TiLparuAP6O\n2SXLlerxwL9V1baq+hbwbuAnl3sIw70zP3Y/oSQBLgY2VdXrFj3P3iRZneTIsfwAZs+K//Rip9q9\nqvqDqlpbVccz+7t7TVUt+9ngvkhyeJLv27EM/Dywku+SuhU4Nclh4+/zGSzgB8GGe4mq2g7s+Nj9\nJuCdC/zY/V4leTvwUeBhSbYkOXvRM+3FacCzmZ0J3jC+nrjoofbgocC1SW5k9h/191dVi1vsGjkG\n+HCSfwE+Brynqq5a8Ey7NX7GcTnwCeAmZg1d9o+/+5F3SWrGM25JasZwS1IzhluSmjHcktSM4Zak\nZgy3BCS5aMcDr5J8bW/HS4vk7YDSLpJ8raoeuOg5pN3xjFsHnPFpvfeM52zfnOQZST6QZN2SY84f\nz1y+Osnqse0F41niNyb5m7HtFUnemuSjST6b5DcX9c+lA4fh1oFoPfCFqnrUeKbyrp/UOxzYWFU/\nCnwQePnYfh5wclU9EvjtJcc/Ejgd+AngZUmOnXR6HfAMtw5ENwE/l+RVSX66qu7ZZf+3gXeM5cuA\nnxrLNwJvG0+D277k+Cuq6j+r6k5mD6JayQ9J0n7AcOuAU1X/CpzCLOB/nuRle3vJ+P4kZr8h6RTg\n4+NB+kv373q8NAnDrQPOuJTxjaq6DHg1sxAvdRBw5lj+FWYPQToIOK6qrgVeChwB7PgB5obx+ymP\nBn6G2QOppMms2vsh0n7nEcCrk3wb+BbwO8x+q8kOX2f2SxP+mNlvvnkGs98teFmSI4AAr6+qu2dP\n9uRGZpdIHgL8WVUt+6+y0oHF2wGl70GSVwBfq6rX7O1Y6b7ipRJJasYzbklqxjNuSWrGcEtSM4Zb\nkpox3JLUjOGWpGb+F4naF+JNWkajAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_58rZqMaDzf-" + }, + "source": [ + "#### Exercise:\n", + "* Create the attribute 'sozinho_sibsp', where 1 means the passenger is travelling alone and 0 otherwise." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HUrJ4IywrEoA", + "outputId": "c2e6a80d-a2ba-4a47-ed3c-6e7c501eeb28", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "df['sozinho_sibsp'] = df['sibsp'].map(sozinho)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassnameagesibspparchfareembarkedsurvived2sexdeckseatsozinho_parchsozinho_sibsp
PassengerId
10.03Braund, Mr. Owen Harris22.0107.2500SDiedmaleNaNNaN10
21.01Cumings, Mrs. John Bradley (Florence Briggs Th...38.01071.2833CSurvivedfemaleC8510
31.03Heikkinen, Miss. Laina26.0007.9250SSurvivedfemaleNaNNaN11
41.01Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01053.1000SSurvivedfemaleC12310
50.03Allen, Mr. William Henry35.0008.0500SDiedmaleNaNNaN11
\n", + "
" + ], + "text/plain": [ + " survived pclass ... sozinho_parch sozinho_sibsp\n", + "PassengerId ... \n", + "1 0.0 3 ... 1 0\n", + "2 1.0 1 ... 1 0\n", + "3 1.0 3 ... 1 1\n", + "4 1.0 1 ... 1 0\n", + "5 0.0 3 ... 1 1\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 260 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0MO9jj2NvGp_" + }, + "source": [ + "### Variável 'fare'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UuWlMV6XvQHs" + }, + "source": [ + "Transformações: arredondar variável Fare." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "boAj64RHvQHu" + }, + "source": [ + "df['fare']= round(df['fare'], 0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3CIqHUJpvcPa" + }, + "source": [ + "### Variável 'age'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VULFXjvap3qZ" + }, + "source": [ + "Transformações: arredondar variável 'age'." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kpNhCRxcp7h9" + }, + "source": [ + "df['age']= round(df['age'], 0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B0fZMKKpdHIl" + }, + "source": [ + "## Derivar outros atributos/features" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H6n6PzWjoSYf" + }, + "source": [ + "### Variável 'mv_age':\n", + "* Variável (dummy) que assume os valores 1, se o valor de age> 0 e 0, caso contrário." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QluOnZD7kHFW", + "outputId": "26077a35-f1ea-4d12-bf39-3733787d9168", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassnameagesibspparchfareembarkedsurvived2sexdeckseatsozinho_parchsozinho_sibsp
PassengerId
10.03Braund, Mr. Owen Harris22.0107.0SDiedmaleNaNNaN10
21.01Cumings, Mrs. John Bradley (Florence Briggs Th...38.01071.0CSurvivedfemaleC8510
31.03Heikkinen, Miss. Laina26.0008.0SSurvivedfemaleNaNNaN11
41.01Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01053.0SSurvivedfemaleC12310
50.03Allen, Mr. William Henry35.0008.0SDiedmaleNaNNaN11
\n", + "
" + ], + "text/plain": [ + " survived pclass ... sozinho_parch sozinho_sibsp\n", + "PassengerId ... \n", + "1 0.0 3 ... 1 0\n", + "2 1.0 1 ... 1 0\n", + "3 1.0 3 ... 1 1\n", + "4 1.0 1 ... 1 0\n", + "5 0.0 3 ... 1 1\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 263 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qKMVIXGDsNkh" + }, + "source": [ + "Para construir a variável 'mv_age', vamos utilizar a função pd.isna(). Por exemplo, o comando abaixo verifica se cada linha/observação da variável 'age' é um NaN." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UHzKFytXsNkh", + "outputId": "45bc64e2-5708-493a-9e2e-3f2ac06c37ab", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df['age'].isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "263" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 264 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lW4NYZrjsNkk" + }, + "source": [ + "A seguir, criamos uma variável auxiliar intitulada 'mv_aux', que receberá 'True', caso 'age' seja NaN e 'False', caso contrário.\n", + "\n", + "Veja abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-bTvVuVpsNkl", + "outputId": "3efb4a54-5d14-40f1-b620-5fb9cdbeff72", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "df['mv_aux']= df['age'].isna()\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassnameagesibspparchfareembarkedsurvived2sexdeckseatsozinho_parchsozinho_sibspmv_aux
PassengerId
10.03Braund, Mr. Owen Harris22.0107.0SDiedmaleNaNNaN10False
21.01Cumings, Mrs. John Bradley (Florence Briggs Th...38.01071.0CSurvivedfemaleC8510False
31.03Heikkinen, Miss. Laina26.0008.0SSurvivedfemaleNaNNaN11False
41.01Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01053.0SSurvivedfemaleC12310False
50.03Allen, Mr. William Henry35.0008.0SDiedmaleNaNNaN11False
\n", + "
" + ], + "text/plain": [ + " survived pclass ... sozinho_sibsp mv_aux\n", + "PassengerId ... \n", + "1 0.0 3 ... 0 False\n", + "2 1.0 1 ... 0 False\n", + "3 1.0 3 ... 1 False\n", + "4 1.0 1 ... 0 False\n", + "5 0.0 3 ... 1 False\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 265 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "R_gk5S4nsNko", + "outputId": "0f69dcb5-75e6-4280-da3d-bab15059d47c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "# Adiciona a nova coluna baseado no dicionario \n", + "df['mv_age'] = df['mv_aux'].map({True: 1, False: 0})\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassnameagesibspparchfareembarkedsurvived2sexdeckseatsozinho_parchsozinho_sibspmv_auxmv_age
PassengerId
10.03Braund, Mr. Owen Harris22.0107.0SDiedmaleNaNNaN10False0
21.01Cumings, Mrs. John Bradley (Florence Briggs Th...38.01071.0CSurvivedfemaleC8510False0
31.03Heikkinen, Miss. Laina26.0008.0SSurvivedfemaleNaNNaN11False0
41.01Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01053.0SSurvivedfemaleC12310False0
50.03Allen, Mr. William Henry35.0008.0SDiedmaleNaNNaN11False0
\n", + "
" + ], + "text/plain": [ + " survived pclass ... mv_aux mv_age\n", + "PassengerId ... \n", + "1 0.0 3 ... False 0\n", + "2 1.0 1 ... False 0\n", + "3 1.0 3 ... False 0\n", + "4 1.0 1 ... False 0\n", + "5 0.0 3 ... False 0\n", + "\n", + "[5 rows x 16 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 266 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IHzMcHl8sNkq" + }, + "source": [ + "Deleta a variável auxiliar 'mv_aux':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DYKh0uMYsNks", + "outputId": "ddcc284c-af6f-4761-8dc4-bc25e4ac94b2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "df= df.drop(columns= ['mv_aux'], axis=1)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassnameagesibspparchfareembarkedsurvived2sexdeckseatsozinho_parchsozinho_sibspmv_age
PassengerId
10.03Braund, Mr. Owen Harris22.0107.0SDiedmaleNaNNaN100
21.01Cumings, Mrs. John Bradley (Florence Briggs Th...38.01071.0CSurvivedfemaleC85100
31.03Heikkinen, Miss. Laina26.0008.0SSurvivedfemaleNaNNaN110
41.01Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01053.0SSurvivedfemaleC123100
50.03Allen, Mr. William Henry35.0008.0SDiedmaleNaNNaN110
\n", + "
" + ], + "text/plain": [ + " survived pclass ... sozinho_sibsp mv_age\n", + "PassengerId ... \n", + "1 0.0 3 ... 0 0\n", + "2 1.0 1 ... 0 0\n", + "3 1.0 3 ... 1 0\n", + "4 1.0 1 ... 0 0\n", + "5 0.0 3 ... 1 0\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 267 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "34-Qbd_QrC8W" + }, + "source": [ + "Qual a relação entre a variável 'mv_age' e a variável-resposta?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bhY8-UjyrC8Z", + "outputId": "f9c29d9b-3ba4-4eb4-ffa6-1f8b6d55a264", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 383 + } + }, + "source": [ + "Avalia_Taxa_Sobrevivencia(df, 'mv_age')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "survived2 Died Survived All\n", + "mv_age \n", + "0 424 290 714\n", + "1 125 52 177\n", + "All 549 342 891\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYUAAAEZCAYAAAB4hzlwAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3XmYHVWZx/HvrzsBgmHThJCFLMiu\nQDDNoghGcFgcbAYFkiBxRpgJy4ABHBxGCRCEGUHGBWSA6DABJYRNpEVHGBEQ2bsBgSQgESQJJCYB\nEkhMJEm/80edLi5N9+0b6Lo36fw+z9NP13ar3rq3br33nFN1ShGBmZkZQF2tAzAzs3WHk4KZmeWc\nFMzMLOekYGZmOScFMzPLOSmYmVnOScFykg6VNLsG210g6ZNV2tZvJI1Zi+VPl/SKpD1qsf0u1vWw\npOO6Y109iaSNJS2TNKjWsayPnBRqIB2wbX+tklaUjH+x1vG9F5KGS/qZpMWSlkp6al3cl4g4MCJu\nrGTZlAj2AxqAb0vqU83t10JK0H9Jx+ICST+StGmt41obEfHXiOgbEa/UOpb1kZNCDaQDtm9E9AXm\nAJ8rmXZ9reN7j24AngO2BfoBXwYWdfdGJPXq7nV2JiJ+HxFHR8QrEXFwRKyo1rZr7OB0bO4N7A98\nrcbxdCtJ9bWOYV3mpLAOkrSfpEckLUlVF99tOxlK+rSkhZIGpvG9JL0u6cNp/FxJL0p6U9Izkv62\nzHY+IOn6tJ2ngT3bzd9W0u3p1/8Lkk7qZD0i+zX9PxGxIiJWRURLRNxVsswXJM1M2/q1pB3areYT\nkp6V9JqkKZI2Tq87VNJsSZMk/Rm4Mk0/MpVGlki6X9Kuafp5kn7SLr6rJV2Shh+WdJykTdOv4e1L\nlhucSm1bldtGmrdA0hnpPV6a3seNSuYfnV77pqTnJR1Uuv00vLOke9M+L5J0raTNynxef5vWtUTS\nd9rNq5c0WdIcSX+WdE3butLnPD1tZ0k6trbqbDttImIOcBfw0bSeE9Nn9Gb6TI4v2f42kn6V1v+q\npN+UzJskab6kNyTNkrR/mt7pcd7B/n5P7arLUjzPpf36haTBafomkkLSkDQ+XdJlku6StBz4uKQP\nSpqW3vcXJX0tHcdtn8vv0ue6SNJ1Xb1XPUpE+K+Gf8CfgM+0m7Y3sBdQD3wYmA2cVDL/P4H/BTYF\nngX+sWTeGGAgWcIfD7wJ9Otk298D7ga2BEaQ/dKfnebVA08D/wpsBOxIVqr5VCfr+h1wH3AMMKTd\nvN1SHKPTuiYBM4Feaf4C4AlgENAfeAw4J807FFgNXJBe2wfYF5gPjEpxTgD+APRKcb4J9Emv7w28\nCoxM4w8Dx6XhacCkkji/CvwsDXe6jZKYHwAGpJhnA/+Q5h0AvA58On0OQ4EdO9j+zsCBab+2SfO+\n1cn7OxBYDjSmffq39L60resUYBYwDNgcuAP4YZo3EbglvXe9yI6tD3SynQXAJ9Pw8LTP30jjjek4\nEfAZYAXwkTTvu8D30/o3Ag5I0/cAXkjvk4DtgBFdHedpf5cBh6f9/RqwqmR/x6T93THNvxC4J83b\nBAjScQhMB14D9kmfx8bATcDNQF9ge+BF4Itp+duAf0nx9gH2q/V5oqrnpFoHsKH/0UFS6GCZs4Eb\nSsY3JjthPw3c3sVrnwUO6WTeK8DokvGv8HZS+BTwfLvlJwNXdrKufsC30xe1FWgG9kzzLgKuK1m2\nnqxqad80voB0Qk3jnwdmpOFDyU6GvUvm/0/biapk2kvAPmm4GTgmDX8OmFmyXOlJ+fB281pKXtfV\nNhYAR5XMuwz4Xhq+FviPTt6nfPsdzBsLPNTJvAnAve3ew4Ul+/IAcHzJ/D2Av6QT2ylkCfujFRyP\nC8iS6pJ0bF4GbNzJsr8CTkzDl5CdZLdrt8xHyJLrp0kJtZLjP
O3vPSXz6trt7z2kk3ga702WNAbQ\ncVKY0u77s6Y0VrLE+as0fBPwA2Dg2n6fe8Kfq4/WQZJ2lfS/qRrgDeBcspMukDWkAdeRFesvbffa\nE0qqPJaQ/QrqRzuS6sh+nc4tmfxSyfAwYHjbetK6zkyveZeIWBwRZ0XELmmZPwA/TbMHla47ItYA\nLwODS1bRPo7SK0cWRMSqdrF9vV1s/UvWNw0Yl4aPBTprp7kTGCBpD0k7ATsAP69wG5CdQNv8hexX\nJ2TtKn/sZJs5SYMk3Szp5fQ5/4gOPqtkECXvUcl7WDq/9PN7iexX7geB/yZLCrdImifp31W+Xv2w\niNgyIoZHxFfS8YakRkmPtlVDkZVy2uK9iOxHxj2paunMFOcMspP9RcDCVM02IK2v3HHefn9b2+3v\nMOCqks9mEVnJaUgn+1R6fG1DlmTmtHu/2j7bM8hK4U+k79IGdYWXk8K66YfA48CHI2JzsqoTtc2U\nNJys+uBaoLS9YUfgcrJfWR+MiC3JiuSinfQl+zPZCazN0JLhucCz6eTQ9rdZRBzZVfARsRD4DllS\n+QDZyWJYSfz1ZF/A0i95+zhKrxxp35XvXODcdrFtGhFtSehG4JBUx/w5siTRUZyryKpVxpElj9vi\n7cbkrrZRzlyy6pCufJusFPTR9Dn/Ix18Vsl8St6jlNRLE9Q73mOy93AF8FpkV+OcGxE7k1VtHU1W\nKqlY+hxvBr4JbJ2Ord+0xRsRSyNiYkQMA74AnCNpvzTv2oj4BFnV0SZkVT1Q/jifT8kJvoP9nUtW\nuiz9fPpEREsnu1B6DC0gK82WHu9DScdjRLwcEceTVWF9BbhGUumyPZqTwrppM2BpRCyT9BHgn9pm\npC/HdWQn/+PJ6l3PTbP7kh3si4A6ZQ3D29O5m4BvSNpC0jCyaoY2v0vbOz013PWStLukj3W0IkmX\npl9+9ZK2AE4CnomI5WQn6SMlHSCpN9kvx1fJqnnafEXSQEn90vxyl21OAU6T1KBM3/QrdlPIvtTA\nI8BU4OmIeLHMuqaRnSDH8c7kUXYbXfgRcGLa3zplDfY7drDcZmSf3xvppHNmmXU2AXtJOjy9h2eR\nlQLa3AD8i6ShyhqYLwSmRURI+kz6bOqAN8h+UbdWsB+l+pBV0SwEWiU1krURAXkpYrvUWLuUrHqm\nNW33U8ouHFiR/tq23elxnvZ3H0mfTT96zgRKG8evIks8O6XtbyXpC5XsSCr53Ab8u7JG+A+TVR/9\nJK1rjKRBkdUlLUkvW1PJunsCJ4V10xnAP0paBlzBO0+QZ5F9Qb+Zfu3/PfDPkvaJiMfJvizNZL+0\nRvDOE2975wCLyYrRvyBLNkD+K/qzwCfIitaLyK786fvu1QBZ42YT2QlhNllVy+fTup4CTgCuTus5\nCDgiIlaXvH46WT3x82RtJZd0FnREPED2C+5qsi/tH8h+6Zf+GpxG1hjaYSmhxG/J6ue3AH69ltvo\nLL77yZLif5G9H3fTcbXGucAn0zK3AbeWWed8suT1PbL3cADv/GyvJKuue5Cs6uo13k4yg4HbydoK\nngF+Sfmk29H2F5M1vv6cLKH/XVpPm13IPr83yd7TSyPiIbJj9T/JjrP5ZMfPpPSaTo/ztL/jyNo0\nFpO9f08Df03zbyCr9/9pqnp6EvibtdilE9P/l8hKPD/i7WrGjwMtKa6bgQnph8YGQVkyNDNbd6XS\nwgKye3oeqnU8PZlLCma2TpJ0WKra3AQ4j6wxv7M2A+smTgpmtq46gOz+gYVkVY5HRsRbtQ2p53P1\nkZmZ5VxSMDOzXNU6F+su/fr1i+HDh9c6DDOz9UpLS8viiOjf1XLrXVIYPnw4zc3lrrI0M7P2JL3U\n9VKuPjIzsxJOCmZmlnNSMDOznJOCmZnlnBTMzCznpGBmZrnCkoKyZ8QulPRMJ/Ol7Lmps9ODLDrs\nktnMzKqnyJLCVLJHKXbmM
LInXe1A9lCYKwuMxczMKlBYUoiI35L16d6ZI8ie2xsR8TCwpaSBRcVj\nZmZdq+UdzYN553NT56Vp89svKGkCWWmCoUPXj6fiaXJnT1W09yLOc8eN3cXHZvfqacfmetHQHBFT\nIqIhIhr69++y6w4zM3uPapkUXuadD2sfwjsf5G5mZlVWy6TQBHwpXYW0L9kDvN9VdWRmZtVTWJuC\npBuA0UA/SfPIHqfXGyAiriJ76PdnyR7y/hfgy0XFYmZmlSksKUTEuC7mB/DPRW3fzMzW3nrR0Gxm\nZtXhpGBmZjknBTMzyzkpmJlZzknBzMxyTgpmZpZzUjAzs5yTgpmZ5ZwUzMws56RgZmY5JwUzM8s5\nKZiZWc5JwczMck4KZmaWc1IwM7Ock4KZmeWcFMzMLOekYGZmOScFMzPLOSmYmVnOScHMzHJOCmZm\nlnNSMDOznJOCmZnlnBTMzCznpGBmZjknBTMzyzkpmJlZzknBzMxyTgpmZpZzUjAzs5yTgpmZ5QpN\nCpIOlfScpNmSzu5g/lBJ90h6QtJTkj5bZDxmZlZeYUlBUj1wBXAYsCswTtKu7RY7B7gpIvYExgL/\nVVQ8ZmbWtSJLCnsDsyPihYh4C5gOHNFumQA2T8NbAK8UGI+ZmXWhyKQwGJhbMj4vTSt1PnCcpHnA\nL4HTOlqRpAmSmiU1L1q0qIhYzcyM2jc0jwOmRsQQ4LPAjyW9K6aImBIRDRHR0L9//6oHaWa2oSgy\nKbwMbFsyPiRNK3UCcBNARDwEbAL0KzAmMzMro1clC0namayxeJO2aRExrYuXPQbsIGkEWTIYCxzb\nbpk5wEHAVEm7pPW7fsjMrEa6TAqSzgEOBnYG7gQOAX4HlE0KEbFa0qnpNfXANRExQ9IFQHNENAFf\nBX4o6QyyRud/iIh4PztkZmbvXSUlhTHASODxiBgvaSAwtZKVR8QvyRqQS6edWzI8E9iv4mjNzKxQ\nlbQprIiINcBqSZsBC4BhxYZlZma1UElJ4QlJWwLXAM3AG8CjhUZlZmY10WVSiIgT0+AVku4ENo+I\nx4sNy8zMaqHTpCBph4h4XtLu7WatlrR7RDxVcGxmZlZl5UoKZ5PdR3BFB/MCOKCQiMzMrGY6TQoR\ncUL6v3/1wjEzs1rq8uojSSelhua28a0kTSg2LDMzq4VKLkk9KSKWtI1ExOvAycWFZGZmtVJJUqgv\nHUkd1vUuJhwzM6ulSu5T+D9JNwBXpfGTgF8XF5KZmdVKJUnhLOAU4Iw0/n/A1YVFZGZmNVPJzWtr\ngMvTn5mZ9WCV9JK6L3AeWX9H+fIRsWOBcZmZWQ1UUn30P8DXgBZgTbHhmJlZLVWSFN6IiJ8XHomZ\nVUWcX+sIepjzah1A96okKfxG0n8APwX+2jbRfR+ZmfU8lSSFT7b7D+77yMysR6rk6iP3fWRmtoGo\npO+j/pKulnRHGt9V0j8UHpmZmVVdp0lB0plpcCpwH7BtGn8e+GqxYZmZWS10mBQkTQT+lEa3johp\nQCtARKxqGzYzs56ls5LCrcDhaXi5pA+SNS4jaS+y5zSbmVkP02FDc0TMk3RSGj0L+DmwnaT7gMHA\nUVWKz8zMqqjck9feSv8fk/RpYBdAwMy2eWZm1rNUcvXR48BEYGlEPOmEYGbWc1XykJ2jyR6qc7uk\nhySdLmlQwXGZmVkNdJkUIuKPEfHvEbEHcDzwMWBO4ZGZmVnVVdLNBZKGAMcAY9JrvlFkUGZmVhuV\nPE/hQaAvcDNwXEQ8X3hUZmZWE5WUFP4pImYUHomZmdVcp0lB0riIuAE4SNJB7edHxGWFRmZmZlVX\nrqSwVfrfvxqBmJlZ7ZW7ee2/0uB3I+K197JySYcC3wfqgR9FxLc6WOYY4HyybjR+HxHHvpdtmZnZ\n+1dJm8Jjkp4DbgRui4iK+j2SVA9cAfwNMC+tpykiZpYsswPwb8B+EfG6pK3Xeg/MzKzbVHK
fwoeB\nC4FRwFOSfiZpbAXr3huYHREvpLugpwNHtFvmn4ArIuL1tK2FaxW9mZl1q0ruaCYiHoyIr5DduPYG\ncH0FLxsMzC0Zn5emldoR2FHSA5IeTtVNZmZWI5X0fdRX0hcl/Rx4FFgEfKKbtt8L2AEYDYwDfihp\nyw5imCCpWVLzokWLumnTZmbWXiVtCs+QdZ19SUTcvxbrfpm3n9YGMCRNKzUPeCQ9uOdFSX8gSxKP\nlS4UEVOAKQANDQ2xFjHUTJxf6wh6mPNqHYDZhqGS6qPtIuI0slLC2ngM2EHSCEkbAWOBpnbL/Iys\nlICkfmTVSS+s5XbMzKybVJIUGiQ9TfZsZiTtIenyrl4UEauBU4E7gVnATRExQ9IFkhrTYncCr0qa\nCdwDnBURr76XHTEzs/dPEeVrYyQ9TNYR3s8iYs807ZmI+GgV4nuXhoaGaG5ursWm145U6wh6li6O\nU1sLPja713pybEpqiYiGrparpKRQFxEvtZu25r2FZWZm67JKGprnStobiHRD2mnAH4oNy8zMaqGS\nksLJwJnAUODPwL5pmpmZ9TBlSwqpZDA2Iiq5g9nMzNZzZUsKEbEGOK5KsZiZWY1V0qbwO0nfI+sQ\nb3nbxIh4qrCozMysJipJCnul/6NKpgVwQPeHY2ZmtdRlUoiI/asRiJmZ1V4lHeJtJek7kh6V9Iik\n/5S0VVevMzOz9U8ll6ROB94EvkjW6PwGWfuCmZn1MJW0KQyOiNI+KidLeqaogMzMrHYqKSncLemo\nthFJnwf+r7iQzMysVjotKUh6newqIwGnSVpd8polwBnFh2dmZtVUrvqoX9WiMDOzdUKnSSHdzQyA\npM/y9n0J90bEr4oOzMzMqq+SS1IvAr5G9kS0F4CvSbqw6MDMzKz6Krn66HPAnm0lB0nXAI8D5xQZ\nmJmZVV8lVx8BbF4yvFkRgZiZWe1VUlK4BHhc0t1kVyKNBiYVGZSZmdVGJX0f/UTSPcA+ZJeonhsR\nLxcemZmZVV2n1UeStpW0OUBKAouBTwJfkNS7SvGZmVkVlWtTuJnUliBpD+A2YCGwN3BF8aGZmVm1\nlas+2jQi5qXh44BrIuJiSXXA74sPzczMqq1cSUElwwcCdwNERCtZ24KZmfUw5UoK90maBswHPgT8\nBkDSNsCqKsRmZmZVVq6k8BXgl8ACYP+IeCtNH4QvSTUz65HK9X3UCvykg+mPFxqRmZnVTKV3NJuZ\n2QbAScHMzHIVJQVJG0navuhgzMystirpOvtvgadJj+CUNFLSbUUHZmZm1VdJSeECsn6PlgBExJOA\nSw1mZj1QJUlhVUQsaTfNN6+ZmfVAlSSFWZKOAeokjZD0XeDhSlYu6VBJz0maLensMst9QVJIaqgw\nbjMzK0AlSeFUYBTQStYp3lvA6V29SFI9Wcd5hwG7AuMk7drBcpsBE4FHKg/bzMyK0GVSiIjlEfGv\nEbFnRIxMw3+pYN17A7Mj4oV0N/R04IgOlvsmcDGwcq0iNzOzbtfpHc3pCqNO2w4i4vNdrHswMLdk\nfB5Zg3XpNj4GbBsRv5B0VplYJgATAIYOHdrFZs3M7L0qV1L4AVn1zzyyqqMfp7/VwJz3u+HUBfd3\ngK92tWxETImIhoho6N+///vdtJmZdaJc30d3A0i6OCLyBmBJPwMerWDdLwPblowPSdPabAZ8FLhX\nEsA2QJOkxohorngPzMys21TS0NxX0vCS8aFA3wpe9xiwQ7piaSNgLNDUNjMilkZEv4gYHhHDya5o\nckIwM6uhcs9TaPNV4H5Jz5E9eGd74KSuXhQRqyWdCtwJ1JM9uW2GpAuA5ohoKr8GMzOrNkV0fR+a\npD5kl5UCzIyIFYVGVUZDQ0M0N68HhQmp62WschUcp1YhH5vdaz05NiW1lDYFdKaSkgIpCbS876jM\nzGyd5q6zzcws56RgZma5iqqPJG0BfBjYpG1aRDxYVFB
mZlYbXSYFSceTXYE0mOy5CnuRXT46utDI\nzMys6iqpPjoDaAD+FBH7k3WO92qhUZmZWU1UkhRWtl2CKmmjiJgB7FRsWGZmVguVtCnMl7Ql8HPg\nTkmvkfWHZGZmPUyXSSEiGtPgJEkHAVsAvyg0KjMzq4kuq48kjW4bjoi7I+KnwFFFBmVmZrVRSZvC\nRZIul9RHUv/0nIWjiw7MzMyqr5KksD9Zl9dPAA8CP42Ivys0KjMzq4lKksLmwB5kjcurgAGSe9Qy\nM+uJKkkKjwL3RMRnyG5c2w64v9CozMysJiq5JPXgiPgTQEQsB06RdGChUZmZWU1Ucknqnzro+2hl\noVGZmVlNuO8jMzPLue8jMzPLue8jMzPLdVp9JKlXRKzGfR+ZmW0wyrUpPAp8zH0fmZltOMolhXfd\noBYRdxcYi5mZ1Vi5pNBf0pmdzYyI7xQQj5mZ1VC5pFAP9KWDEoOZmfVM5ZLC/Ii4oGqRmJlZzZW7\nJNUlBDOzDUy5pHBQ1aIwM7N1QqdJISJeq2YgZmZWe5Xc0WxmZhsIJwUzM8s5KZiZWc5JwczMcoUm\nBUmHSnpO0mxJZ3cw/0xJMyU9JeluScOKjMfMzMorLClIqgeuAA4DdgXGSdq13WJPAA0RsTtwC3BJ\nUfGYmVnXiiwp7A3MjogXIuItYDpwROkCEXFPRPwljT4MDCkwHjMz60KRSWEwMLdkfF6a1pkTgP8t\nMB4zM+tCl89orgZJx5E98vNTncyfAEwAGDp0aBUjMzPbsBRZUngZ2LZkfEia9g6SPgN8A2iMiL92\ntKKImBIRDRHR0L9//0KCNTOzYpPCY8AOkkZI2ggYCzSVLiBpT+BqsoSwsMBYzMysAoUlhfR851OB\nO4FZwE0RMUPSBZLaHvH5bbJnNtws6UlJTZ2szszMqqDQNoWI+CXwy3bTzi0Z/kyR2zczs7XjO5rN\nzCznpGBmZjknBTMzyzkpmJlZzknBzMxyTgpmZpZzUjAzs5yTgpmZ5ZwUzMws56RgZmY5JwUzM8s5\nKZiZWc5JwczMck4KZmaWc1IwM7Ock4KZmeWcFMzMLOekYGZmOScFMzPLOSmYmVnOScHMzHJOCmZm\nlnNSMDOznJOCmZnlnBTMzCznpGBmZjknBTMzyzkpmJlZzknBzMxyTgpmZpZzUjAzs5yTgpmZ5ZwU\nzMwsV2hSkHSopOckzZZ0dgfzN5Z0Y5r/iKThRcZjZmblFZYUJNUDVwCHAbsC4yTt2m6xE4DXI2J7\n4LvAxUXFY2ZmXSuypLA3MDsiXoiIt4DpwBHtljkCuDYN3wIcJEkFxmRmZmX0KnDdg4G5JePzgH06\nWyYiVktaCnwIWFy6kKQJwIQ0ukzSc4VEvGHqR7v3e53k3wobIh+b3WtYJQsVmRS6TURMAabUOo6e\nSFJzRDTUOg6z9nxs1kaR1UcvA9uWjA9J0zpcRlIvYAvg1QJjMjOzMopMCo8BO0gaIWkjYCzQ1G6Z\nJuDv0/BRwG8iIgqMyczMyiis+ii1EZwK3AnUA9dExAxJFwDNEdEE/DfwY0mzgdfIEodVl6vlbF3l\nY7MG5B/mZmbWxnc0m5lZzknBzMxyTgpmZpZbL+5TsO4haWeyu8gHp0kvA00RMat2UZnZusQlhQ2E\npH8l62pEwKPpT8ANHXVWaLaukPTlWsewIfHVRxsISX8APhIRq9pN3wiYERE71CYys/IkzYmIobWO\nY0Ph6qMNRyswCHip3fSBaZ5ZzUh6qrNZwIBqxrKhc1LYcJwO3C3ped7uqHAosD1was2iMssMAA4B\nXm83XcCD1Q9nw+WksIGIiF9J2pGsS/PShubHImJN7SIzA+AOoG9EPNl+hqR7qx/OhsttCmZmlnNJ\nwcpqaWkZUldXd1dra+vOZEV5s/cr6urqnm1tbT141KhR82odjL2Tk4KVVVdXd9c222yzw4ABA1RX\n5yuY7f1rbW3V/Pn
zd5ozZ84jjY2NDU1NTfNrHZO9zd9yK6u1tXXnAQMG9HJCsO5SV1fHwIED63r3\n7j0I+LfGxsZ+tY7J3uZvunXFJQTrdnV1daTHsW9GdgWcrSP8bTezWutd6wDsbW5TsLXT3Q8pr+Dq\nt/r6enbbbTdWrVpFr169+NKXvsQZZ5xBXV0dzc3NXHfddVx22WUVb3L06NFceumlNDR07+N/Nbl7\n35s4r7IrAy+66CKmTZtGfX09dXV1XH311eyzzz7va9tNTU3MnDmTs89+/z2g9O3bl2XLlr3v9Vh1\nOCnYOq9Pnz48+WR2+frChQs59thjeeONN5g8eTINDQ3dfnJfnzz00EPccccdPP7442y88cYsXryY\nt956q6LXrl69ml69Oj4FNDY20tjY2J2h2nrC1Ue2Xtl6662ZMmUKP/jBD4gI7r33Xg4//HAAli9f\nzvHHH8/ee+/Nnnvuye233w7AihUrGDt2LLvssgtHHnkkK1asqOUudKv58+fTr18/Nt54YwD69evH\noEGDGD58OIsXLwagubmZ0aNHA3D++eczfvx49ttvP8aPH8++++7LjBkz8vWNHj2a5uZmpk6dyqmn\nnsrSpUsZNmwYra1ZTyjLly9n2223ZdWqVfzxj3/k0EMPZdSoUey///48++yzALz44ot8/OMfZ7fd\nduOcc86p4rth3cFJwdY72223HWvWrGHhwoXvmH7RRRdx4IEH8uijj3LPPfdw1llnsXz5cq688ko2\n3XRTZs2axeTJk2lpaalR5N3v4IMPZu7cuey4446ccsop3HfffV2+ZubMmfz617/mhhtuYMyYMdx0\n001AlmDmz5//jpLXFltswciRI/P13nHHHRxyyCH07t2bCRMmcPnll9PS0sKll17KKaecAsDEiRM5\n+eSTefrppxk4cGABe21FclKwHuOuu+7iW9/6FiNHjmT06NGsXLmSOXPm8Nvf/pbjjjsOgN13353d\nd9+9xpF2n759+9LS0sKUKVPo378/Y8aMYerUqWVf09jYSJ8+fQA45phjuOWWWwC46aabOOqoo961\n/JgxY7jxxhsBmD59OmPGjGHZsmU8+OCDHH300YwcOZITTzyR+fOz2w0eeOABxo0bB8D48eO7a1et\nStymYOudF154gfr6erbeemtmzXr7+UARwa233spOO+1Uw+iqr76+ntGjRzN69Gh22203rr32Wnr1\n6pVX+axcufIdy3/gAx/IhwcPHsyHPvQhnnrqKW688Uauuuqqd62/sbGRr3/967z22mu0tLRw4IEH\nsnz5crbccsu8rac9dfcFCVY1LinYemXRokWcdNJJnHrqqe868RxyyCFcfvnltPXn9cQTTwBwwAEH\nMG3aNACeeeYZnnqqs16a1z+SNyn6AAAB/0lEQVTPPfcczz//fD7+5JNPMmzYMIYPH55Xk916661l\n1zFmzBguueQSli5d2mEpqm/fvuy1115MnDiRww8/nPr6ejbffHNGjBjBzTffDGQJ+fe//z0A++23\nH9OnTwfg+uuv75b9tOpxScHWTg06UFyxYgUjR47ML0kdP348Z5555ruWmzRpEqeffjq77747ra2t\njBgxgjvuuIOTTz6ZL3/5y+yyyy7ssssujBo1qpA4K72EtDstW7aM0047jSVLltCrVy+23357pkyZ\nwqxZszjhhBOYNGlS3sjcmaOOOoqJEycyadKkTpcZM2YMRx99NPfee28+7frrr+fkk0/mwgsvZNWq\nVYwdO5Y99tiD73//+xx77LFcfPHFHHHEEd20p1Yt7iXVymppaYmiTqK2YWtpaWHy5MlTgWuampru\nr3U8lnH1kZmZ5ZwUzMws56RgXYm2q1jMuktrayuuul43OSlYWXV1dc8uWLBgjRODdZfW1lbmz5/f\nunLlysW1jsXezVcfWVmtra0Hz5s37/5XXnlluK89t+4QEaxcufK166677sdAP+CNW
sdkb3NSsLJG\njRo1r7GxcSdgAjAKWFPjkKzn+CDwO2BGVwta9fiSVKtIY2NjL2AwsEmtY7EeYwXwclNTk39orEOc\nFMzMLOeGZjMzyzkpmJlZzknBzMxy/w91op44Q8PhVAAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "umQ3fIKlsNlN" + }, + "source": [ + "### Variável 'age_category'\n", + "* Construir a variável 'age_category' baseado na variável 'age'." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "W66GkyuKkhFe" + }, + "source": [ + "def Age_Category(age):\n", + " if (age <= 1):\n", + " return 1\n", + " elif (age <= 5):\n", + " return 2\n", + " elif(age <= 10):\n", + " return 3\n", + " elif (age <= 15):\n", + " return 4\n", + " elif (age <= 20):\n", + " return 5\n", + " elif (age <= 25):\n", + " return 6\n", + " elif(age < 30):\n", + " return 7\n", + " elif(age < 35):\n", + " return 8\n", + " elif(age < 40):\n", + " return 9\n", + " elif(age < 45):\n", + " return 10\n", + " elif(age < 50):\n", + " return 11\n", + " elif(age < 60):\n", + " return 12\n", + " elif(age < 70):\n", + " return 13\n", + " elif(age < 80):\n", + " return 14\n", + " else:\n", + " return 15" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TnLzC6hCkuBL" + }, + "source": [ + "df['age_category'] = df['age'].map(Age_Category)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kG8td6HPsNlP", + "outputId": "a1debdae-ae3e-41e4-fc8f-52696809fc11", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "set(df['age_category']) # Esse comando mostra os NaN's da variável" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 271 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B_3s5cgxfNKQ" + }, + "source": [ + "### Variável 'title'\n", + "\n", + "* Para fins de Data Manipulation, vamos capturar o tratamento dos passageiros contido na variável 
'Name'. Ou seja, 'Mr.', 'Mrs.', 'Miss' e etc...\n", + "\n", + "> Fonte: As funções get_title e title_map foram extraídas de https://www.kaggle.com/tjsauer/titanic-survival-python-solution" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gslSjRdDoJFY", + "outputId": "68f27f6d-d52b-48e8-c2e3-ad86acf44792", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassnameagesibspparchfareembarkedsurvived2sexdeckseatsozinho_parchsozinho_sibspmv_ageage_category
PassengerId
10.03Braund, Mr. Owen Harris22.0107.0SDiedmaleNaNNaN1006
21.01Cumings, Mrs. John Bradley (Florence Briggs Th...38.01071.0CSurvivedfemaleC851009
31.03Heikkinen, Miss. Laina26.0008.0SSurvivedfemaleNaNNaN1107
41.01Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01053.0SSurvivedfemaleC1231009
50.03Allen, Mr. William Henry35.0008.0SDiedmaleNaNNaN1109
\n", + "
" + ], + "text/plain": [ + " survived pclass ... mv_age age_category\n", + "PassengerId ... \n", + "1 0.0 3 ... 0 6\n", + "2 1.0 1 ... 0 9\n", + "3 1.0 3 ... 0 7\n", + "4 1.0 1 ... 0 9\n", + "5 0.0 3 ... 0 9\n", + "\n", + "[5 rows x 16 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 272 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nfIG6toGfhd5" + }, + "source": [ + "def get_title(name):\n", + " if '.' in name:\n", + " return name.split(',')[1].split('.')[0].strip()\n", + " else:\n", + " return 'Unknown'\n", + "\n", + "def title_map(title):\n", + " if title in ['Mr', 'Ms']:\n", + " return 1\n", + " elif title in ['Master']:\n", + " return 2\n", + " elif title in ['Ms','Mlle','Miss']:\n", + " return 3\n", + " elif title in [\"Mme\", \"Ms\", \"Mrs\"]:\n", + " return 4\n", + " elif title in [\"Jonkheer\", \"Don\", \"Sir\", \"the Countess\", \"Dona\", \"Lady\"]:\n", + " return 5\n", + " elif title in [\"Capt\", \"Col\", \"Major\", \"Dr\", \"Rev\"]:\n", + " return 6\n", + " else:\n", + " return 7" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7qNUwnCepe_x" + }, + "source": [ + "Captura o tratamento dos passageiros:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r-Ltf33vgJ6Q", + "outputId": "930abc58-2b12-434e-9e93-c3da788dc000", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df['title'] = df['name'].apply(get_title).apply(title_map) \n", + "set(df['title']) # Esse comando mostra os NaN's da variável" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1, 2, 3, 4, 5, 6}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 274 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D3hY0WVhpRYK" + }, + "source": [ + "Drop a coluna 'Name', pois não vamos mais precisar dela em nossas análises:" + ] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "Y8i3xKCes5WF" + }, + "source": [ + "df= df.drop(columns= [\"name\"], axis=1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Sl1uFdwpW3y" + }, + "source": [ + "Apresenta o conteúdo do dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2uFnw-pZpan-", + "outputId": "a72224f4-41de-406a-da4c-07be3c824184", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 410 + } + }, + "source": [ + "df.head(10)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclassagesibspparchfareembarkedsurvived2sexdeckseatsozinho_parchsozinho_sibspmv_ageage_categorytitle
PassengerId
10.0322.0107.0SDiedmaleNaNNaN10061
21.0138.01071.0CSurvivedfemaleC8510094
31.0326.0008.0SSurvivedfemaleNaNNaN11073
41.0135.01053.0SSurvivedfemaleC12310094
50.0335.0008.0SDiedmaleNaNNaN11091
60.03NaN008.0QDiedmaleNaNNaN111151
70.0154.00052.0SDiedmaleE46110121
80.032.03121.0SDiedmaleNaNNaN00022
91.0327.00211.0SSurvivedfemaleNaNNaN01074
101.0214.01030.0CSurvivedfemaleNaNNaN10044
\n", + "
" + ], + "text/plain": [ + " survived pclass age ... mv_age age_category title\n", + "PassengerId ... \n", + "1 0.0 3 22.0 ... 0 6 1\n", + "2 1.0 1 38.0 ... 0 9 4\n", + "3 1.0 3 26.0 ... 0 7 3\n", + "4 1.0 1 35.0 ... 0 9 4\n", + "5 0.0 3 35.0 ... 0 9 1\n", + "6 0.0 3 NaN ... 1 15 1\n", + "7 0.0 1 54.0 ... 0 12 1\n", + "8 0.0 3 2.0 ... 0 2 2\n", + "9 1.0 3 27.0 ... 0 7 4\n", + "10 1.0 2 14.0 ... 0 4 4\n", + "\n", + "[10 rows x 16 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 276 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ci5FKBazGidH" + }, + "source": [ + "### Variável 'family_size'\n", + "* As variáveis 'sibsp' e 'parch' estão relacionadas ao grupo familiar. Portanto, vamos criar a variável 'family_size', da seguinte forma:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DICRTPxhGvt5" + }, + "source": [ + "df['family_size']= df['sibsp']+df['parch']+1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6tLLmTrZhDvG", + "outputId": "6d120e96-a9b6-4291-d7ad-f64b3dba1c6a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 386 + } + }, + "source": [ + "sns.catplot(x=\"family_size\", kind=\"count\", data=df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 278 + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAW4AAAFgCAYAAACbqJP/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAGBdJREFUeJzt3XvUXXWd3/H3RwJeGIcAPk0xoYVW\nimU5wy2lcZg6legIjALjAgZbJTJMM+3C22jrMDNdo2O1Szs6COqiiyVqUIuDeCFeirICaGcqOAGR\nq44RRRKBPHLzwnhBvv3j/FIPISEPkn3O80ver7XOOr/927+z95eQfJ79/M6+pKqQJPXjCdMuQJL0\n2BjcktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4smHYBj8fRRx9dl1566bTLkKTt\nJXMZ1PUR9/e+971plyBJE9d1cEvSzsjglqTOGNyS1BmDW5I6Y3BLUmcGDe4kf5TkpiQ3JrkwyZOS\n7J/k6iTrkvx1kt3a2Ce25XVt/X5D1iZJvRosuJMsBl4FLK2qZwG7AKcAbwPOqqpnAPcCp7ePnA7c\n2/rPauMkSZsZeqpkAfDkJAuApwB3AEcBF7f1q4ATWvv4tkxbvzzJnE5Gl6SdyWDBXVUbgLcD32EU\n2PcD1wD3VdWDbdh6YHFrLwZub599sI3fe6j6JKlXQ06V7MnoKHp/4OnA7sDR22G7K5OsTbJ2dnb2\n8W5Okroz5FTJ84BvVdVsVf0M+DhwJLCwTZ0ALAE2tPYGYF+Atn4P4O7NN1pV51XV0qpaOjMzM2D5\nkjQ/DRnc3wGWJXlKm6teDtwMXAGc2MasAC5p7dVtmbb+8qqqAeuTpC4NdnfAqro6ycXAtcCDwFeA\n84DPAB9J8ubWd377yPnAB5OsA+5hdAbKY3L4f7lge5Q+J9f85akT25ckjRv0tq5V9QbgDZt13woc\nsYWxPwZOGrIeSdoReOWkJHXG4JakzhjcktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCW\npM4Y3JLUGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3BLUmcMbknq\njMEtSZ0xuCWpMwa3JHXG4JakzhjcktQZg1uSOjNYcCc5MMl1Y6/vJ3lNkr2SXJbkG+19zzY+Sc5J\nsi7J9UkOG6o2SerZYMFdVV+vqkOq6hDgcOAB4BPAmcCaqjoAWNOWAY4BDmivlcC5Q9UmST2b1FTJ\ncuCbVXUbcDywqvWvAk5o7eOBC2rkKmBhkn0mVJ8kdWNSwX0KcGFrL6qqO1r7TmBRay8Gbh/7zPrW\n9zBJViZZm2Tt7OzsUPVK0rw1eHAn2Q04Dvjo5uuqqoB6LNurqvOqamlVLZ2ZmdlOVUpSPyZxxH0M\ncG1V3dWW79o0BdLeN7b+DcC+Y59b0vokSWMmEdwv4RfTJACrgRWtvQK4ZKz/1HZ2yTLg/rEpFUlS\ns2DIjSfZHXg+8Idj3W8FLkpyOnAbcHLr/yxwLLCO0Rkopw1ZmyT1atDgrqofAXtv1nc3o7NMNh9b\nwBlD1iNJOwKvnJSkzhjcktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y3JLUGYNb\nkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWp\nMwa3JHXG4JakzhjcktQZg1uSOmNwS1JnBg3uJAuTXJzka0luSfLsJHsluSzJN9r7nm1skpyTZF2S\n65McNmRtktSroY+4zwYurapnAgcDtwBnAmuq6gBgTVsGOAY4oL1WAucOXJskdWmw4E6yB/Ac4HyA\nqvppVd0HHA+sasNWASe09vHABTVyFbAwyT5D1SdJvRryiHt/Y
BZ4f5KvJHlvkt2BRVV1RxtzJ7Co\ntRcDt499fn3re5gkK5OsTbJ2dnZ2wPIlaX4aMrgXAIcB51bVocCP+MW0CABVVUA9lo1W1XlVtbSq\nls7MzGy3YiWpF0MG93pgfVVd3ZYvZhTkd22aAmnvG9v6DcC+Y59f0vokSWMGC+6quhO4PcmBrWs5\ncDOwGljR+lYAl7T2auDUdnbJMuD+sSkVSVKzYODtvxL4cJLdgFuB0xj9sLgoyenAbcDJbexngWOB\ndcADbawkaTODBndVXQcs3cKq5VsYW8AZQ9YjSTsCr5yUpM4Y3JLUGYNbkjpjcEtSZwxuSeqMwS1J\nnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3JHXG4JakzhjcktQZ\ng1uSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y3JLUGYNbkjozaHAn+XaSG5Jcl2Rt69sr\nyWVJvtHe92z9SXJOknVJrk9y2JC1SVKvJnHE/dyqOqSqlrblM4E1VXUAsKYtAxwDHNBeK4FzJ1Cb\nJHVnGlMlxwOrWnsVcMJY/wU1chWwMMk+U6hPkua1oYO7gM8nuSbJyta3qKruaO07gUWtvRi4feyz\n61vfwyRZmWRtkrWzs7ND1S1J89aCgbf/m1W1Ick/Ai5L8rXxlVVVSeqxbLCqzgPOA1i6dOlj+qwk\n7QgGPeKuqg3tfSPwCeAI4K5NUyDtfWMbvgHYd+zjS1qfJGnMYMGdZPckT93UBn4buBFYDaxow1YA\nl7T2auDUdnbJMuD+sSkVSVIz5FTJIuATSTbt539V1aVJ/g64KMnpwG3AyW38Z4FjgXXAA8BpA9Ym\nSd0aLLir6lbg4C303w0s30J/AWcMVY8k7Si8clKSOmNwS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCW\npM4Y3JLUGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdWZOwZ1kzVz6JEnDe9T7cSd5EvAU4GlJ\n9gTSVv0qW3iQryRpeNt6kMIfAq8Bng5cwy+C+/vAuwesS5K0FY8a3FV1NnB2kldW1bsmVJMk6VHM\n6dFlVfWuJL8B7Df+maq6YKC6JElbMafgTvJB4J8D1wE/b90FGNySNGFzfVjwUuCg9kBfSdIUzfU8\n7huBfzxkIZKkuZnrEffTgJuTfBn4yabOqjpukKokSVs11+B+45BFSJLmbq5nlXxh6EIkSXMz17NK\nfsDoLBKA3YBdgR9V1a8OVZgkacvmesT91E3tJAGOB5YNVZQkaese890Ba+STwAsGqEeStA1znSp5\n8djiExid1/3jOX52F2AtsKGqXphkf+AjwN6M7n/ysqr6aZInMrqg53DgbuD3qurbc/0PkaSdxVyP\nuF809noB8ANG0yVz8WrglrHltwFnVdUzgHuB01v/6cC9rf+sNk6StJm5znGf9stsPMkS4HeAtwCv\nbfPjRwH/rg1ZxehUw3MZ/SB4Y+u/GHh3kni1piQ93FwfpLAkySeSbGyvj7VQ3pZ3Aq8HHmrLewP3\nVdWDbXk9v7iv92LgdoC2/v42fvNaViZZm2Tt7OzsXMqXpB3KXKdK3g+sZnRf7qcDn2p9W5XkhcDG\nqrrmcVW4mao6r6qWVtXSmZmZ7blpSerCXIN7pqreX1UPttcHgG2l5pHAcUm+zejLyKOAs4GFSTZN\n0SwBNrT2BmBfgLZ+D0ZfUkqSxsw1uO9O8tIku7TXS9lGqFbVn1TVkqraDzgFuLyq/j1wBXBiG7YC\nuKS1V7dl2vrLnd+WpEeaa3D/PnAycCdwB6Ngffkvuc8/ZvRF5TpGc9jnt/7zgb1b/2uBM3/J7UvS\nDm2uN5l6E7Ciqu4FSLIX8HZGgb5NVXUlcGVr3wocsYUxPwZOmmM9krTTmusR969vCm2AqroHOHSY\nkiRJj2auwf2EJHtuWmhH3
HM9WpckbUdzDd93AF9K8tG2fBKji2okSRM21ysnL0iyltEpfQAvrqqb\nhytLkrQ1c57uaEFtWEvSlD3m27pKkqbL4JakzhjcktQZg1uSOmNwS1JnDG5J6ozBLUmdMbglqTMG\ntyR1xuCWpM4Y3JLUGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3BL\nUmcMbknqzGDBneRJSb6c5KtJbkryF61//yRXJ1mX5K+T7Nb6n9iW17X1+w1VmyT1bMgj7p8AR1XV\nwcAhwNFJlgFvA86qqmcA9wKnt/GnA/e2/rPaOEnSZgYL7hr5YVvctb0KOAq4uPWvAk5o7ePbMm39\n8iQZqj5J6tWgc9xJdklyHbARuAz4JnBfVT3YhqwHFrf2YuB2gLb+fmDvLWxzZZK1SdbOzs4OWb4k\nzUuDBndV/byqDgGWAEcAz9wO2zyvqpZW1dKZmZnHXaMk9WYiZ5VU1X3AFcCzgYVJFrRVS4ANrb0B\n2Begrd8DuHsS9UlST4Y8q2QmycLWfjLwfOAWRgF+Yhu2AriktVe3Zdr6y6uqhqpPknq1YNtDfmn7\nAKuS7MLoB8RFVfXpJDcDH0nyZuArwPlt/PnAB5OsA+4BThmwNknq1mDBXVXXA4duof9WRvPdm/f/\nGDhpqHokaUfhlZOS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3JHXG4JakzhjcktQZg1uSOmNw\nS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y3JLUGYNbkjoz5DMnd1rfedOvTWxf/+TPb5jYviTN\nDx5xS1JnDG5J6ozBLUmdMbglqTMGtyR1xuCWpM4Y3JLUGYNbkjpjcEtSZwYL7iT7Jrkiyc1Jbkry\n6ta/V5LLknyjve/Z+pPknCTrklyf5LChapOkng15xP0g8LqqOghYBpyR5CDgTGBNVR0ArGnLAMcA\nB7TXSuDcAWuTpG4NFtxVdUdVXdvaPwBuARYDxwOr2rBVwAmtfTxwQY1cBSxMss9Q9UlSryYyx51k\nP+BQ4GpgUVXd0VbdCSxq7cXA7WMfW9/6Nt/WyiRrk6ydnZ0drGZJmq8GD+4kvwJ8DHhNVX1/fF1V\nFVCPZXtVdV5VLa2qpTMzM9uxUknqw6DBnWRXRqH94ar6eOu+a9MUSHvf2Po3APuOfXxJ65MkjRny\nrJIA5wO3VNVfja1aDaxo7RXAJWP9p7azS5YB949NqUiSmiEfpHAk8DLghiTXtb4/Bd4KXJTkdOA2\n4OS27rPAscA64AHgtAFrk6RuDRbcVfU3QLayevkWxhdwxlD1SNKOwisnJakzBrckdcbglqTOGNyS\n1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpM0Ne8q4pO/JdR050f3/7yr+d6P6knZVH3JLUGYNbkjpj\ncEtSZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3BLUmcMbknqjMEtSZ0xuCWpMwa3\nJHXG4JakzhjcktSZwYI7yfuSbExy41jfXkkuS/KN9r5n60+Sc5KsS3J9ksOGqkuSejfkEfcHgKM3\n6zsTWFNVBwBr2jLAMcAB7bUSOHfAuiSpa4MFd1V9Ebhns+7jgVWtvQo4Yaz/ghq5CliYZJ+hapOk\nnk16jntRVd3R2ncCi1p7MXD72Lj1re8RkqxMsjbJ2tnZ2eEqlaR5ampfTlZVAfVLfO68qlpaVUtn\nZmYGqEyS5rdJB/ddm6ZA2vvG1r8B2Hds3JLWJ0nazKSDezWworVXAJeM9Z/azi5ZBtw/NqUiSRqz\nYKgNJ7kQ+LfA05KsB94AvBW4KMnpwG3AyW34Z4FjgXXAA8BpQ9UlSb0bLLir6iVbWbV8C2MLOGOo\nWiRpR+KVk5LUGYNbkjpjcEtSZwxuSeqMwS1JnTG4JakzBrckdcbglqTOGNyS1BmDW5I6Y3B
\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sxwSYcLBY1gy", + "outputId": "0b4ad32e-8d61-49fd-b022-adab64863a26", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "set(df['family_size']) # Distinct values of the column; a NaN would show up here if one were present" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1, 2, 3, 4, 5, 6, 7, 8, 11}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 279 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s5LrvxqXo2uL" + }, + "source": [ + "# DataViz - Data Visualization" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rJlcGd49vmkk", + "outputId": "8bd4559e-4ec7-499d-f664-cec23cab9e89", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 255 + } + }, + "source": [ + "df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table>\n",
+ " <thead>\n",
+ " <tr><th></th><th>survived</th><th>pclass</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked</th><th>survived2</th><th>sex</th><th>deck</th><th>seat</th><th>sozinho_parch</th><th>sozinho_sibsp</th><th>mv_age</th><th>age_category</th><th>title</th><th>family_size</th></tr>\n",
+ " <tr><th>PassengerId</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>1</th><td>0.0</td><td>3</td><td>22.0</td><td>1</td><td>0</td><td>7.0</td><td>S</td><td>Died</td><td>male</td><td>NaN</td><td>NaN</td><td>1</td><td>0</td><td>0</td><td>6</td><td>1</td><td>2</td></tr>\n",
+ " <tr><th>2</th><td>1.0</td><td>1</td><td>38.0</td><td>1</td><td>0</td><td>71.0</td><td>C</td><td>Survived</td><td>female</td><td>C</td><td>85</td><td>1</td><td>0</td><td>0</td><td>9</td><td>4</td><td>2</td></tr>\n",
+ " <tr><th>3</th><td>1.0</td><td>3</td><td>26.0</td><td>0</td><td>0</td><td>8.0</td><td>S</td><td>Survived</td><td>female</td><td>NaN</td><td>NaN</td><td>1</td><td>1</td><td>0</td><td>7</td><td>3</td><td>1</td></tr>\n",
+ " <tr><th>4</th><td>1.0</td><td>1</td><td>35.0</td><td>1</td><td>0</td><td>53.0</td><td>S</td><td>Survived</td><td>female</td><td>C</td><td>123</td><td>1</td><td>0</td><td>0</td><td>9</td><td>4</td><td>2</td></tr>\n",
+ " <tr><th>5</th><td>0.0</td><td>3</td><td>35.0</td><td>0</td><td>0</td><td>8.0</td><td>S</td><td>Died</td><td>male</td><td>NaN</td><td>NaN</td><td>1</td><td>1</td><td>0</td><td>9</td><td>1</td><td>1</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>" + ], + "text/plain": [ + " survived pclass age ... age_category title family_size\n", + "PassengerId ... \n", + "1 0.0 3 22.0 ... 6 1 2\n", + "2 1.0 1 38.0 ... 9 4 2\n", + "3 1.0 3 26.0 ... 7 3 1\n", + "4 1.0 1 35.0 ... 9 4 2\n", + "5 0.0 3 35.0 ... 9 1 1\n", + "\n", + "[5 rows x 17 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 282 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "htQ1dODRwfHw", + "outputId": "e7ea64a5-ff9e-4a2e-e4d9-d224192a4a7b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 275 + } + }, + "source": [ + "df.plot.scatter('age','fare', s= 50, c= 'survived')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 291 + }, + { + "output_type": "display_data", + "data": { + "image/png": "
IymQyPPfaYn6VhGPfw1JCXCCH+FEKcEUKcFEIcteZVFELsEEJctv6tYM0vspCrTMmidu3aaNq0\nqcdyKpUKkyZNKgKJGMY5UpUAKwJHOhNRy3wxNyYC2EVEkQB2WY+BIgy5ypQ8jhw5giZNmjjk57mS\nVqlSBfHx8WjevHkxSMcw/6MsKoLisBH0gWU1HQB8C2AvgH8iX8hVAIeFEKFCiHB/LalmShYqlQpn\nz57F5cuXER8fj8qVK6N37944efIkdDod2rZt66tl+wxTKEqbIVgK/lYEBGC7NXTrAmsI1qr5Ove/\nAVS1fpYUclUIMRyWEQMiIiL8KDpTHERGRmLChAm24yeeeKIYpWEYe0rj274U/K0IHiOim0KIKgB2\nCCEu5j9JRORlfG9YlclCAIiOji50eD+GYRhvKIuKwK9jbSK6af2bBCAeQFsAt/N22bH+TbIWL7KQ\nqwzDMAWlLNoI/KYIhBBaIURQ3mcA3QCchSW06svWYi8D2Gj9XGQhVxmGYQpKWVQE/pwaqgog3vpA\nAgB8T0Q/CSGOAFgthBgG4DqAAdbyRRZylWEYpqCUtk5eCn5TBER0FUALJ/n3AHRxkl9kIVcZhmEK\nQl6IibIGh5hgGIbxAh4RMAzDlHNYETAMw5RjSqMhWAqsCBiGYbyAFQHDMEw5h43FDMMw5RieGmIY\nhmFYETAMw5R3WBEwDMOUc8piOHRWBAzDMBJhGwHDMAzDXkMMwzDlHR4RMAzDlGOEEGwjYBiGKe/w\niIBhGKacw4qAYRimHMNTQwzDMAx7DTEMw5R3eGqIYRimHMNTQwzDMEyZHBGUPdXGMAzjR/LCTHhK\nEurpIYS4JIS4IoSY6OR8hBBijxDihBDitBCip19uCDwiYBiGkYwQwifGYiGEHMA8AF0B/AXgiBBi\nExGdz1fsfQCriegrIURjAFsA1Cl0405gRcAwDOMFPrIRtAVwhYiuAoAQYhWAPgDyKwICEGz9HALg\nli8adgYrAoZhGC/wwkZQSQhxNN/xQiJaaP1cA0BCvnN/AXjkgeunANguhHgDgBbAE95LKw1WBAzD\nMBLxMgz1XSKKLkRzAwF8Q0RfCCHaAVguhGhKROZC1OkUVgQMwzBe4KOpoZsAauU7rmnNy88wAD0A\ngIgOCSFUACoBSPKFAPlhryGGYRgv8JHX0BEAkUKIukKIQADPA9j0QJkbALpY22wEQAXgjo9vBwCP\nCBiGYSTjK68hIsoVQowBsA2AHMBSIjonhPgIwFEi2gRgPIBFQoixsBiOhxARFbpxJ7AiYBiG8QJf\nLSgjoi2wuITmz5uc7/N5AB180pgH/D41JISQWxdE/GA9riuE+NW6iOK/1mERhBBK6/EV6/k6/paN\nYRjGW2QymaRUmigKad8CcCHf8QwAs4moAYAUWAwisP5NsebPtpZjGIYpMUi1D5S2MBR+VQRCiJoA\negFYbD0WAB4HsNZa5FsAT1s/97Eew3q+iyhtT5NhmDIPjwi8598AJgDI83sNA3CfiHKtx3/BsrAC\nyLfAwno+1VreDiHEcCHEUSHE0Tt3/GJAZxiGcQmPCLxACPEUgCQiOubLeoloIRFFE1F05cqVfVk1\nwzCMW/K8hqSk0oQ/vYY6AIi1RsxTwRIz40sAoUKIAOtbf/5FFHkLLP4SQgTAElvjnh/lYxiG8ZrS\n9rYvBb+NCIjoXSKqSUR1YFkssZuIBgPYA6CftdjLADZaP2+yHsN6fre/fGYZhmEKCk8N+YZ/Ahgn\nhLgCiw1giTV/CYAwa/44AA7xuRmGYYqTsuo1VCQLyohoL4C91s9XYQnB+mCZLAD9i0IehmGYglLa\nOnkp8MpihmEYLyhtrqFSYEXAMAzjBTwiYBiGKccIIXhEwDAMU97hEQHDMEw5hxUBwzBMOYcVAcMw\nTDmGbQQMwzAMjwgYhmHKO6wIGIZhyjmsCBiGYcoxpTGOkBRYETAM
w3gBKwKGYZhyDnsNMQzDlHN4\nRMAwDFOOYRsBwzAMw4qAYRimvMOKgGEYppzDxmKGYZhyDNsIGIZhGFYEDMMw5R1WBAzDMOUcVgQM\nwzDlHFYEDMMw5RjemIZhGIYpk4qg7N0RwzCMH8lzIfWUJNTTQwhxSQhxRQgx0UWZAUKI80KIc0KI\n731+M1Z4RMAwDCMRX60jEELIAcwD0BXAXwCOCCE2EdH5fGUiAbwLoAMRpQghqhS6YRfwiIBhGMYL\nfDQiaAvgChFdJaJsAKsA9HmgzGsA5hFRCgAQUZLPb8YKKwKGYRgvkMlkkhKASkKIo/nS8HzV1ACQ\nkO/4L2tefh4C8JAQ4qAQ4rAQooe/7omnhhiGYbzAi6mhu0QUXYimAgBEAugEoCaA/UKIZkR0vxB1\nOsVvIwIhhEoI8ZsQ4pTV0DHVml9XCPGr1UDyXyFEoDVfaT2+Yj1fx1+yMQzDFASp00ISlMVNALXy\nHde05uXnLwCbiCiHiK4B+B0WxeBz/Dk1ZATwOBG1ANASQA8hxKMAZgCYTUQNAKQAGGYtPwxAijV/\ntrUcwzBMicJHiuAIgEjri3EggOcBbHqgzAZYRgMQQlSCZaroqm/vxoLfFAFZyLAeKqyJADwOYK01\n/1sAT1s/97Eew3q+iyiLS/gYhinV+EIREFEugDEAtgG4AGA1EZ0TQnwkhIi1FtsG4J4Q4jyAPQDe\nIaJ7TuRJF0KkuUpS7smvNgKri9QxAA1gcZX6A8B960MA7A0kNuMJEeUKIVIBhAG4+0CdwwEMB4CI\niAh/is8wDOOAr95PiWgLgC0P5E3O95kAjLMmd/UEWeWaBiARwHIAAsBgAOFSZPGr1xARmYioJSzz\nX20BNPRBnQuJKJqIoitXrlxoGRmGYaQihIBcLpeUioFYIppPROlElEZEX8HRJdUpReI+arVy7wHQ\nDkCoECJvJJLfQGIznljPhwBwGAYxDMMUJ75aWewHMoUQg4UQciGETAgxGECmlAv96TVUWQgRav2s\nhmUF3QVYFEI/a7GXAWy0ft5kPYb1/G7r0IhhGKbEUIIVwSAAAwDctqb+1jyP+NNGEA7gW6udQAaL\nMeQHq+FjlRDiYwAnACyxll8CYLkQ4gqAZFis6AzDMCWKkurDQkR/QuJU0IP4TREQ0WkArZzkX4XF\nXvBgfhYsGoxhGKZEUoxv+x4RQjwE4CsAVYmoqRCiOSx2g489XcshJhiGYbygBE8NLYIlSF0OYHsZ\nlzSzwiEmGIZhvKAE70egIaLfHlBCua4K54cVAcMwjBeU1KkhAHeFEPVhWbgLIUQ/WNYVeIQVAcMw\njEREyd6qcjSAhQAaCiFuArgGy6Iyj0hSBEKIxwBEEtEyIURlADprECSmmMnJycGRI0dgMpkQHR0N\ntVrtUMZsNuPYsWPIzMxEq1atEBISAgD47bffsHjxYty+fRsNGjRAcnIykpOT8dhjj2Ho0KEICwsr\n6tthmBJPCR4RXCeiJ4QQWgAyIkqXeqFHRSCE+BBANICHASyDJWbQdwA6FFBYxkd89913eOONN2A2\nmwFYOvypU6di3Lj/rUjftWsXXnzxRaSnp0MmkyE7OxujR49GTk4OFi9ejKysLNv1eezYsQMff/wx\n9u7di1atHBy/GKZcU4IVwTUhxE8A/gtgtzcXShkRPAOLG+hxACCiW0KIIK9FZHzKtm3bMGLECOj1\nerv8Dz74ABUqVMArr7yCM2fOIDY21qHMf/7zH5hMJuTmOrcjGQwGGAwGxMbG4vr16yV5KMwwRUoJ\nnxpqCOApWKaIlgghfgCwioh+9nShlDvKtq7wzTNAaAsjKeMbJk2a5NDBA4Ber8f7778PIsL06dOR\nlZXlUMZoNLpUAvlJTU3FL7/84hN5GaasUFLdR4lIT0SriehZWF7egwHsk3KtFEWwWgixAJYYQa8B\n2AmLvypTjJw6dcrlubt37+L+
/fs4cOCAw7SPNwgh8Pfffxf4eoYpi5RURWCVraMQYj4sUZ9VsISc\n8IjHqSEi+lwI0RVAGix2gslEtKMwwjKFR6PRIC3NeahxIoJarUZISAhu3nxw0yPp5OTkoEmTJgW+\nnmHKIiXVRiCE+BOWsD2rYdm7QFLAOcDDiMAaxW4PEe0goneI6G1WAiWDl156CYGBgQ75MpkM3bt3\nh0qlwsiRI6HRaJxe7+nLrFAoEBUVhUaNGvlE3rIKEWHlypWIjo5GtWrV8Pjjj2PHDv6JlGVK8Iig\nORE9Q0QrvVECgAdFQEQmAGYhREihxGN8ztSpU1GzZk07d1GVSoWwsDDMmzcPADB8+HA0b97cThko\nFAoEBQVh5syZ0Gq1CAkJwfTp06HT6RAUFASNRgOtVouWLVtiw4YNRX5fpY3XXnsNr732Go4dO4bb\nt29jz549ePrppzFr1qziFo3xA1KVQFEqAiHEBOvHT4QQcx5MUuqQ4jWUAeCMEGIH8sW2JqI3vReZ\n8RUVK1bEyZMnsWjRIsTFxcFkMqFfv34YPXo0KlWqBABQKpXYt28f4uLisGjRIqSnp6Nbt24YN24c\nIiIiMGrUKJw8eRJt2rTB2LFjsXXrVty9exetWrVCdHR0Md9hyefIkSNYuXKlg9Fer9fjvffew4sv\nvgjePKnsUUybzrjjgvXv0YJWIDyF/BdCvOwsn4i+dZZflERHR9PRowW+d4YpFG+++SbmzZvn1CCv\n0Wjw73//G6+99loxSMY4QwhxjIgK9YZTu3ZtmjRpkqSyI0aMKHR73iCEaE1ExwtyrRRjcbF3+AxT\nEklLS3PplWUymZCZ6dU0LVMKKMlhqAF8IYSoBmAtgP8S0VmpF3p0HxVCRAoh1gohzgshrualwkjL\nMGWB7t27Q6fTOT0nl8sRExNTxBIxRYFMJpOUihoi6gygM4A7ABYIIc4IId6Xcq0UaZfBstlBrrWR\nOFhCTDBMuaZv376oXLkyAgLsB9YqlQqPPPIIWrduXUySMf6kpBmL80NEfxPRHAAjAZwEMFnKdVIU\ngZqIdsFiT7hORFMA9CqwpIzPuXXrFhISElBetngmIty4cQOJiZIi7AIAEhMTcePGDZ8+o8DAQBw6\ndAidO3eGSqVCcHAwVCoV+vXrh82bN/usHaZkUVIVgRCikRBiihDiDIC5AH4BUFPKtVIUgVEIIQNw\nWQgxRgjxDADn42GmSNm7dy8aNWqE+vXr4+GHH0adOnUQHx9f3GL5lVWrVqFmzZpo2LAh6tati2bN\nmrkNg3Hw4EE0bdoUdevWRcOGDREREYE1a9b4TJ6qVati+/btuHbtGvbt24dbt25h+fLl0Go5EktZ\nJC/WUEmcGgKwFEAKgO5E1ImIviKiJCkXujQWCyGWE9GLADYA0AB4E8A0AI8DcOpJxBQdv/32G3r1\n6mXnunjjxg288MIL+O9//4unnnqqGKXzD2vWrMGwYcPs7vns2bPo2rUrfvnlF7Ro0cKu/IkTJ9Ct\nWze78n/99ReGDBkCmUyGvn37+ky2atWqoVq1aj6rjym5lMSgc0IIOYBrRPRlQa53d0dRQojqsGxs\noACgBzAewKsAfi9IY4zvePfdd10GnRs/fnyZmyYiIowfP97pPRsMBnzwwQcO+e+//z4MBoNDvl6v\nx9tvv13mnhHjf0rigjLAtvi3lhDCMdyABNy5j34NYBeAerAEMBKwRCDN+1uvIA0yvuHAgQMuz127\ndg2pqakIDQ0tQon8S2JiIu7cueP0HBFh927H8Ot79+512dknJiYiKSkJVatW9amcTNmnBLuPXgNw\nUAixCfaLfz0uc3epCKyW5zlCiK+IaJRPxGR8hkKhQE5OjtNzRASFQlHEEvkXpVLpNpKqs7hL7p4B\nETm9hmE8UYIVwR/WJAPg1Z4xUhaUsRIogTz77LNYuXIlTCaTw7kOHTqUOWNlWFgYmjVrhmPHjj
mc\nCwgIwPPPP++QP2DAACxbtszp3gutWrVChQoV/CIrU3YRQpTEEBMAACKaWtBrS57VgwFg2Xbyp59+\nwpgxYzB+/Hj8+uuvdtMc06dPR2hoqJ0Pu1wuR1BQEObOnVuksiYnJ+PLL7/EiBEj8MUXX7icwiks\nCxYsgE6nszPWKRQKhIWFYfJkR3fpKVOmoGLFinYjAyEEFAoFIiIisHnzZqeKlGHcURJtBFa59ggh\ndj+YJF1bmg1mZTXWUEZGBjp37oyLFy8iIyMDQghoNBo8+eSTWLVqle2N5ObNm/j444+xZs0amEwm\n9OzZE1OmTEFkZGSRybpnzx7ExsbCbDZDr9dDrVZDCIG1a9fiySef9Hl7Fy5cwJQpU7Bt2zYoFAo8\n//zzmDRpkkuPncTEREyfPh0rV660hYQgIpjNZuh0OtStWxf79+8vU/YUxjm+iDVUv359+vTTTyWV\nHTBgQFHHGorKd6gC0BdALhFNcHHJ/65lRVDyePXVV/Hdd9/BaDTa5Ws0GkyfPh1vvfVWMUlmT0ZG\nBsLDw5GRkeFwTqPRICEhARUrViwGyRxZtmwZ3njjDYf4P4GBgejTpw9Wr15dTJIxRYWvFMGMGTMk\nle3fv3+RKgJnCCF+I6K2nsrx1FAJIysrCytWrHBQAoDF7XH27NnFIJVz1qxZ49Irh4jw3XclJxLJ\nzJkznQaBy87OxqZNm5CamloMUjGljZK8oEwIUTFfqiSE6AFA0l4yUvYjYIqQlJQUt/OLJWkP4evX\nr7uMsGkwGPDHH38UsUSuuXXrlstzCoUCSUlJCAnh/ZcYz5Rgr6Fj+J+Lfw6APwEMk3IhjwhKGGFh\nYW7fJiIiIopQGvc89NBDLqNvarXaErXfcd26dV2ey83NRXh4eBFKw5RmSuqIAMA/AbQkoroAlsOy\nlsBxBaYT/CatEKKW1Yp9XghxTgjxljW/ohBihxDisvVvBWu+sG6tdkUIcVoIUS5DNwYGBmLEiBF2\nW1DmodFo8O677xaDVM559tlnXfriy+VyDBw4sIglcs2kSZOcutSqVCoMHjzYpUJjmPz4cmpICNFD\nCHHJ2udNdFOurxCChBCe7A3vE1GaEOIxWEIBLYYlcrRH/Km2cgGMJ6LGAB4FMFoI0RjARAC7iCgS\nlpXLeQ/gSQCR1jQcEm+gLPKvf/0LXbp0gUajgUKhgFKphEqlwquvvoohQ4YUt3g2VCoVdu7cibCw\nMAQFBUEul0On0yE0NBTbtm1DUJBXa1r8Sr9+/fDWW29BpVJBpVIhICAAGo0G7du3L3J3W6Z04wv3\nUWtsoHmw9HuNAQy09o8PlgsC8BaAXyWIlucL3QvAIiL6EYCkVZN+sxEQUSKAROvndCHEBQA1APQB\n0Mla7FsAe2EZ0vQBEEcW6+NhIUSoECLcWk+5IjAwEJs3b8apU6dsbpJPP/202+mN4qJVq1a4desW\nNm7ciMuXL6Nu3bp45plnoFKpils0Bz755BOMHDkS8fHxMBgM6NKlC+/NzHiNj2wEbQFcIaKr1jpX\nwdIHnn+g3DQAMwC8I6HOm0KIBQC6ApghhFBC4st+kRiLhRB1ALSCRatVzde5/w0gL9hLDQAJ+S77\ny5pnpwiEEMNhGTGUqPlyf9CiRQuHiJolkcDAQPTv37+4xZBErVq18Oabbxa3GEwpxgtFUEkIkd+/\nfSERLbR+dtbfPfJAO60B1CKiH4UQUhTBAAA9AHxORPeFEOGQpkD8rwiEEDoA6wD8n3X+ynaOiEgI\n4dVCBuuDXAhY1hH4UlaGYRhPeKEI7hZ0HYF1D5hZAIZIvYaI9ADW5zu2zcp4wq+KQAihgEUJrCCi\nPAFv5035WDVW3sYJNwHUynd5TWsewzBMicCHsYY89XdBAJoC2GtVPNUAbBJCxBKRz1fR+tNrSABY\nAuDCA2FQN+F/G9u8DGBjvvyXrN5DjwJILY/2AYZhSjY+ij
V0BECkEKKudQ+B52HpAwEARJRKRJWI\nqA4R1QFwGIBflADg3xFBBwAvAjgjhDhpzXsPwKcAVgshhgG4Dsu8FgBsAdATwBVYfF9f8aNsDMMw\nBcIXxmIiyhVCjAGwDYAcwFIiOieE+AjAUSLa5L4G3+JPr6GfYVnh5owuTsoTgNH+kodhGKaw5K0j\n8AVEtAWWF+D8eY5hdC35nXzSqAs4xATDMIwXlOAQEwWGFQHDMIwXsCJgGIYp55RFRcBB5xi/8euv\nv+KZZ55BvXr1EBMTg/Xr17sMW12cGI1GfP3112jVqhXq16+P4cOH48qVK8UtFlMCkeoxVNqUBY8I\nGL/w7bff4vXXX4fBYAAR4dq1azh+/DgGDRqEhQsXeq6giDAajejcuTNOnToFvd4SqPHGjRv4/vvv\nsWPHDrRr166YJWRKGqWtk5cCjwgYn5OWloZRo0ZBr9fbjQAyMzOxYsUKHD58uBils2fp0qV2SgCw\nhKXOzMzE4MGDS+QIhileyuKIgBUB43M2b97scvVlVlYWli1bVsQSuWbBggV2SiA/SUlJOHfuXBFL\nxJR0yqIi4Kkhxuekp6fDZDI5PWc2m5GcnOyXds1mM37//XcIIRAZGSnJ3zs9Pd3lOblc7vZ8UZKV\nlYXLly8jJCSkzAdbLOmUtk5eCuVqRGA2m7FmzRrExMTg4YcfxosvvoizZ88Wt1hFRk5ODpYtW4ZH\nHnkEDRs2xKhRo/yynWSHDh1c/lh0Oh26du3q8zbXrVuHmjVrIjo6GlFRUahVqxbi4+M9Xte5c2cE\nBDh/H8rJyUHTpk19LapXmM1mTJ06FZUrV0aHDh3w8MMPo1WrVjhz5kyxylVeKavGYhBRqU1RUVEk\nFbPZTM899xxptVqCZV9PksvlpNFoaPPmzZLrKa1kZ2dTx44d7e4/ICCAtFot/fLLLz5vr0uXLqRS\nqWxtASCZTEZVq1aljIwMn7a1detW0mg0dm0BII1GQ9u3b3d77e+//273TPJf+8477/hUzoIwceJE\np/cWHBxMCQkJxS1eqQKW0A2F6nMaNmxIhw4dkpR80V5RpXIzIti+fTt++OEHu83WTSYT9Ho9Xnjh\nBWRnZxejdP7nu+++w9GjR+3uP88oOnDgQJ8bRTdu3IjevXtDpVIhJCQEarUa0dHROHz4sNMtIwvD\nhAkTnM7z6/V6TJgwwe21kZGR2L59O+rUqQOtVmuTdfTo0fj00099Kqe3pKen48svv3R6b1lZWZg9\ne3YxSMWUxRFBubERLFy40K4TzA8RYffu3ejRo0cRS1V0zJ8/3+X93717F2fOnEHz5s191p5Wq8Xq\n1auRlJSEy5cvIzw8HPXq1fNZ/Xnk5OS4nd47deoUcnNzXU7/AED79u1x9epVnDt3DmlpaWjWrFmJ\n2Gbz+PHjLu0c2dnZ+PHHH/HFF18UsVRMaevkpVBuFMHdu3ddniMipKamFqE0RY+7+wsICPDb/Vep\nUgVVqlTxS92AxaArl8uRm5vr9HxAQICk+PFCiGK3BzyISqVy6dEEAAaDoQilYQDfBp0rSZS9O3JB\nt27dXO6jm5OTg7Zt2xaxREWLO6Oo0WhEs2bNilgi3yCTydC7d2+nP06ZTIann366TL7BMcVHWZwa\nKjeKYPjw4U4VgUqlQs+ePUvkxvC+5J133oFSqXTI12g0eO211xAaGloMUtmj1+uxdOlSDBw4ECNH\njsTUqVMxaNAgjBo1CocPH3Zpx5g1axYqVKgAhUJhy1MoFKhQoQJmzpxZVOL7HIPBALVa7fK8u3MM\n4w3lZmqocuXKOHDgAPr3748bN25AoVDAaDTi2WefxeLFi4tbPL/ToEEDbNu2DYMGDUJycjLkcjmM\nRiOGDRuGWbNmea7Az/z5559o164dMjIykJGRYXdOJpNh+fLlGDBgAJYsWeLwtlWnTh3s3LkTw4YN\nw6lTp0BECAsLg0wmQ0
xMDJ555hlMmDAB1atX97ncRITVq1fjiy++wM2bN9GoUSO8++676NLFYcsN\nr4mKinKp/BQKBXr16lXoNhjvKW1v+1IQvvYWKUqio6Pp6FHvd247d+4c7ty5g8aNG/t1/rokQkQ4\ndeoUUlNT0aJFixIxEgCAtm3b4tixYzCbzS7LaLVaLFq0CAMHDrTLv3HjBqKiopCWlubU+yswMBA6\nnQ5HjhzxucF6+PDh+P777+0M8RqNBtOnT8dbb71V6PrfffddzJkzx8FWEBwcjHPnzqFmzZqFbqO8\nIIQ4RgXcTD6Pxo0b04oVKyTLPLTgAAAgAElEQVSVbd26daHbKyrKzdRQfpo0aYJOnTqVOyUAWN5m\nWrZsiY4dO5YYJXDlyhWcPXvWrRIALLGKnI1exo0bh+TkZJcuwNnZ2bh//z7eeOMNn8ibx9GjR7Fi\nxQoHbyy9Xo+JEyfizp07hW7jk08+wYQJE6DT6RAUFASVSoUWLVrgwIEDrASKCbYRMIwfuHXrFgID\nAyWXzY/ZbMamTZs8KhGz2YwdO3bAaDQWWM4HiYuLQ1ZWltNzcrkcGzduLHQbMpkMH374Ie7cuYOD\nBw/i0qVLOHnypE9dfRnp5HkNSUmliXJjI2BKLpGRkZI76CZNmtgdm0wml66jD0JEyM7Odmo0Lwhp\naWkuFVBubq6DraMwqFSqUuvZVdYobW/7UihdaqsUQ0Q4fPgwvvzyS3zzzTe4f/++03Lnzp3D3Llz\nsXDhQty+fbuIpSwewsPD0aNHD48dtEajwXvvvWeXp1Ao0KhRI0ntREREQKfTFVjOB+nWrZvL+uRy\nOWJiYnzWFlNy4KkhpkCkpKTg0UcfxRNPPIF//vOfGDNmDKpXr464uDhbGaPRiNjYWLRp0wYTJkzA\n2LFjUadOHXz88cfFKHnRERcXh7Zt20Kj0di5geYnMjISnTp1csj/7LPPPLpSajQafPbZZz79gfbt\n2xdhYWEO6zNUKhXatm2L1q1b+6wtpuTAioApEM8//zxOnjyJzMxMGI1GZGZmwmAwYNSoUTh+/DgA\nYPz48di5cycMBgOysrKg1+uRlZWFf/3rX9iwYUMx34H/CQoKwv79+7F//36X4R1OnTrlNKRCr169\nsGTJElSuXBk6nQ5KpRIymQxKpRJBQUGoVKkSvvrqK/Tt29enMiuVShw6dAj/+Mc/oFKpEBwcDJVK\nhWeeeQabN2/2aVsM40/KpftoUXL9+nU0bNjQqVFRJpPZfOMrVarkMmRAVFQUfH2fRIQffvgBX331\nFe7cuYNOnTrhzTffRK1atXzWxsWLFzF79mwcO3YMtWvXxptvvomOHTu6veaPP/5AgwYNXJ6vUKGC\ny/0MTCYTzp07ByEEGjVqhIsXL8JkMqFp06aSwkwUhlu3biExMRF169ZFxYoV/doWUzB84T7atGlT\nWr16taSyTZo0YfdRxsLvv//ucu7bbDbj9OnTuHXrltuO6vLly7bPly5dwjPPPAOVSgWVSoU+ffrg\nwoULXslkNpsxcOBADBw4EFu3bsXRo0cxZ84cNG7c2GfbSG7cuBFRUVFYunQpjh07hvj4ePTq1cth\njv9BDh065PZ8/phIZ86cQa9evaBSqaDRaDBo0CBotVo0a9YMAQEBaNq0KVq0aOF3JQAA1atXR1RU\nFCuBckBZ9BoqXdKWQmrUqIGcnByX52vXro3KlSu7LVOtWjUAwIULF9CqVSts2LABRqMRRqMRmzdv\nRtu2bb3aqGTjxo0OIbmzs7ORkZGB/v37FzokdWZmJgYNGgS9Xm/z6CEiZGZm4ssvv8SJEydcXtu4\ncWO3deeFCTl27BjatWuHrVu3wmg0wmAwYO3atYiKivLLZjsMkwfbCBivady4MRo0aOD0DUGr1WLs\n2LEICQlBz549nfrSazQajBs3DgDw0ksvOUwfEREyMjJsZaTgLiT1/fv38dtvv0muKz9m
sxkzZsxA\neHi4y6iZRqMRS5YscVlH69at3S50e/XVVwEAY8aMQWZmpp3SMpvNSE9Px7vvvlsg+RlGCqwImAKx\nfv16myETsLgWqtVqvP7667ZtGxctWoQ6derYygghoNVq0bNnT7z22mswm81u7QR79uyR7E+flJTk\n8pxcLi/wnsJDhw7FRx995HafX5PJhMTERLf17Nmzx+l0zkMPPYTZs2cjMzPT5bMwm81sqGX8Rlnd\nqpIXlBUB9evXx9WrV7Fq1Srs3r0blStXxpAhQ9CiRQtbmbCwMJw9exYbNmzADz/8AI1GgxdeeAHt\n27eHEMLt1BHwvy1HpRATE4MLFy44rTMrK8tOLqlcvnwZq1ev9hgjX6PROHUBzU/Lli2RnJyMiRMn\n4qeffoJOp8PYsWPxyiuvALAoE3c/NE+rjBmGscdvikAIsRTAUwCSiKipNa8igP8CqAPgTwADiChF\nWH7VXwLoCUAPYAgRHfeXbMWBRqPB0KFDMXToUJdlFAoF+vfvj/79+zucCwgIgBDCZWffpk0bl/73\nDzJ27FgsW7bMQRHkGZ8LEqVzy5YtHjtgIQSUSiVeeuklj/UFBwdj/vz5Ls89/PDDLncm69y5s2eB\nGaaAlDZDsBT8eUffAHhw78eJAHYRUSSAXdZjAHgSQKQ1DQfwlR/lKpUIIfD44487PafRaLzasrBe\nvXr48ccfUbVqVQQFBSE4OBhKpRK9e/fGsmXLCiyfu7f0gIAA1KtXDwcOHEBISEiB2sjPv//9b6eL\nyDQaTbHvNZyH2WzGzp07sWDBAuzYsYNHKkzJpaC73ktJsLz5n813fAlAuPVzOIBL1s8LAAx0Vs5d\nioqKovLEjRs3KDQ0lAAQAAoICKDmzZvTvn37iIjIbDZTTk4OHTt2jPR6vcf6TCYTHThwgDZu3EgJ\nCQl25y5evEj79u2j27dvS5LtypUrpFQqbbLlT4GBgTRv3jwym83e37Qbdu7cSU2bNiWFQkEKhYLa\ntm1Lv/76q0/bKCgXLlygmjVrUlBQEKnVagoKCqIaNWrQ2bNni1u0cguAo1TIPq1p06b0xx9/SEq+\naK+oUlErgvv5Pou8YwA/AHgs37ldAKJd1DkcwFEARyMiIiR9AcoSf//9N7399ttUs2ZNatKkCe3Y\nsYNycnIoMzOTFi1aRA0bNqTg4GDS6XT06aefet35Xrx4kZo1a0ZqtZpCQkJIpVLRc889R5mZmW6v\ny83NJZ1O51QRyOVyun//vq3s4cOH6amnnqKqVatS48aNaf78+ZSdnU2nTp2i/v37U7Vq1SgyMpJm\nzpzpoNCuXr1Kw4cPp+rVq1Pt2rXp/fffp6tXr1JqaqpX91lY0tLSKC4ujhISEig3N9fuXFZWFlWp\nUoWEEA7PolKlSmQwGIpUVsaCrxTB1atXJSVWBBIUgfU4hbxUBPlTeRoRpKam0g8//EBbtmyhjIwM\nu3MHDhwgtVrt0OloNBr6z3/+I7mNlJQUCgsLc+jAVCoV9erVy+21W7dudakI8suxevVq0mg0dufV\najW1bt2aNBoNyWQyh/y8jvPMmTMUHBxMAQEBtjJKpZIiIiLozp07Xj7RgrNlyxbSarWk0+lIrVbT\nzJkz6f79+zalu2rVKpfPQqfT0fLly4tMVuZ/+KJjbtasGV27dk1SKk2KoKitHreFEOEAYP2b58d4\nE0D+2AY1rXnlHiLCtGnTUK1aNQwaNAjPP/88qlSpgrlz59rKTJ482am3jl6vx5QpU2AymSS1NWfO\nHAfffMDiSbR79278/vvvLq89c+aMy9j8er0eR48etW2N+eAaA4PBgOPHj0Ov19vNoxsMBly8eBFL\nly4FAIwYMQJpaWl2brJGoxGJiYmYMGGCg9z+4K+//kK/fv2QmZmJjIwMGAwGvPPOO6hYsSIiIyNh\nNptx6tQplyGoMzIycPr0aZf1Z2Vl4dKlSz7Z1IYp
2QghegghLgkhrgghJjo5P04IcV4IcVoIsUsI\nUdtfshS1ItgE4GXr55cBbMyX/5Kw8CiAVCJy72zuR/bt24fY2Fg0bdoUzz33HI4cOVJcomDRokWY\nMWMGDAYD0tLSkJaWZtsBa926dQDgdgFYZmamR7/9P//8E506dcKUKVPcbrTiLvxEtWrV3BpDq1ev\njr1793p0L30QvV6PRYsWITk52eXagZycHCxbtgwNGjTAzp07varfWxYuXOhUsZrNZty+fRv79u1D\njRo1XEZDVavVqFGjhkO+yWTCxIkTUalSJURHR6NWrVro3Lkzrl+/XmBZ8972GN/iixATQgg5gHmw\nOMo0BjBQCPHgsvoTsMyMNAewFsBnfrgdC/4aagBYCSARQA6AvwAMAxAGy7TPZQA7AVS0lhXWh/IH\ngDOQMC1Efpoamjp1Kmm1WttQXiaTkUajoYULF/q8LU+YzWYKDw93OsUAgOrVq0e9e/d2eR5WQ+29\ne/fozJkzdOjQIYdppZSUFKpcubLdlIyzFBAQQC+//DLdvXvXqaxbtmxxe/2MGTNo/vz5bsvkT0ql\nkl544QVavnw5xcXF0e3bt51Ofz2YVCoVHTx40G//k9jYWJdtq9Vq+vrrr+n06dMuDedqtZqSkpLo\n9u3bdPDgQbp27RoREQ0fPtxhykwul1PVqlXt7Cv5ycrKohUrVtALL7xAI0eOpIMHD5LZbKbffvuN\nOnbsSHK5nBQKBfXu3ZvWrl1LR48epZycHL89m5IOfDQ1dP36dUnJXXsA2gHYlu/4XQDvuinfCsDB\nwsrvsn5/VVwUydeK4NKlSy47G5VKVaTz0EQWg2T++XBnyZlBMv+5Fi1aUJ06dUir1VJwcDBpNBr6\n4IMPyGQyERHRZ599JqmDBUAKhYJkMhlVqlSJnnjiCdq5c6dN1pkzZ5JcLnd57ZAhQ2js2LGS2gkP\nD6c///yT0tLSiMhiiDabzbR8+XJJ10dHR/v8f5Gbm0vLli2jKlWquGxXq9VS8+bNSaVSkUqlcjgv\nk8lo7ty51L9/f1IqlTZjfFRUFAUGBjqtU6PR0OzZsx3k+fvvv6lu3bo2W4QQgrRaLXXt2tVBoeSX\nLywsjFasWOHz51Ma8JUiuHHjhqQEy1qpo/nS8Lx6APQDsDjf8YsA/uOqXQD/AfB+YeV3Wb+/Ki6K\n5GtF8P7777vseDUaDX311Vc+bc8T2dnZLt8sPaXAwEAKDg52aUSeMmUKERE99thjBIDat29PcXFx\ntH37dpo8eTJVrVrVYxsajYa++OILIiJaunSp3UgqfwoICKBGjRrR8OHDJSmdXbt2UXZ2tsPzyMjI\noOeee87j9UIIn/4fzGYzXbt2zeX95W/XneIOCAggjUbj0OnLZDK3Cr1Tp04OMj355JNOFa+7evL/\n37Zu3erTZ1Qa8JUiSEhIkJTcteeNIgDwAoDDAJSFld9VKntL5ArBnTt3XMbrMRqNSElJKVJ5FAoF\nnn32WYcdsKQwatQodO3aFdnZ2Q7n9Ho9Zs6cCYPBAI1Gg88//xzbt2/HoEGD0LVrV0ycOBG///47\n2rZt67YNvV6PSZMmISkpCc8++6xLG0G7du3w66+/YtasWS43nckjPDwc7dq1c7pKWqvVYvz48W6v\nB5D34/EZQghUrlwZERERHtt1F+8pNzcXer3e4X9iNpvdyqzVau2O7969i507dzq1VUi59zwbE+M9\nPow1JMlBRgjxBIBJAGKJSNrG3gWAFUE+YmJiXO5Bq1ar8eijjxaxRJYVtDVq1IBGo7HlaTQaj6t4\nP/roIxw+fNilx5BcLseFCxcwceJEjBw5Elqt1hboTa1WIzg4GPHx8R6/0DKZDOvXr0dISAiWLFkC\ntVptqycvcF58fDyCgoKg1WqxZs0aaLVau0irWq3WZlwNDw93u5G9lPAX/ggBkJOT49TI62+0Wq0t\nxlIet27dkhxg
0BWnT5/mlc7FyxEAkUKIukKIQADPw+I0Y0MI0QqWxbaxROQ6UqQPYEWQj759+yIk\nJMQh8qVCoUDdunU9BkvzB1WqVMGZM2fw6aefon379vjHP/6B2bNno3v37i476SZNmiA4ONjt23da\nWhpeeuklPPzwwy49XHQ6HR577DG38uXk5NhcJQcOHIjU1FR8//33eOSRRzBgwADMmzfP7u0+JiYG\n586dw+jRoxEdHY1HHnkEa9euxbp166DRaPDnn3+63cjn/PnzbuUB4Jf/k1KpdOs+6wseHPlpNBq0\nbdsWTz/9tF1+Tk5OoUc9SqWy1EXILCn4wmuIiHIBjAGwDcAFAKuJ6JwQ4iMhRKy12EwAOgBrhBAn\nhRCbXFRXePw151QUyR9eQ9evX7ctbgoJCSG1Wk0dO3akpKQkn7dVGM6cOeN00ZJaraa9e/cSEdGs\nWbM8zsm787C5f/8+9e3b1+31Wq2WDh8+7LKOc+fOUXp6usvz58+ft30+evQode3alZYvX+40REZG\nRgZ16tTJrTxqtZpOnz4t5RFKJicnh7Zv314gW43UpNFo6Ouvv6bOnTtThQoVqH79+jR79mwyGo0O\n8ly+fNmtYd5TyvMAK2/ABzaC5s2bU2JioqTki/aKKhW7AIVJ/lxZfOHCBfrpp5/ojz/+8FsbheXE\niRM2N0G5XE5RUVG2uENERHq93qbUXHUKM2fOpKysLKf1Z2ZmUpMmTVyuklUqldShQwe3YSxMJpPL\n8BR6vd4hltHRo0epWrVqtH37dsrIyCC9Xk9paWmk1+tpzJgxbju4jh070okTJwrwJN2TlZVFFSpU\ncNu2SqWievXquTUoa7Va6tWrF2k0GjujrlarpeHDh0uWx2w2U61atZy2IZPJaMCAAdS6dWubS3D+\ntpRKJYWHh1NiYqLPn1NJhxUBK4JSz7Fjx2jatGn0ySef0NmzZ+nvv/+mxMREMpvNZDQaXcav0ev1\nNGfOHJfeR7Vr13b6xm4wGGjnzp00dOhQWrNmDU2bNo1atmxJKpXK5vbYv39/m4vng+TJlZWVRRcu\nXKB79+45lHH2tnv79m2brC1atKAxY8bQkCFDqGLFim474gYNGnj5RImuXbtGM2fOpClTptCePXtc\nKrQvvvjCqTtoXgoMDKSXX36ZjEYj/fDDD9SpUyeqXbs2RUVFUevWral27drUpUsX+umnn4jI4rIb\nGhpKMpmM1Go1jRw50iFekSe2bt3qMNqTy+VUpUoV+vvvv4nI8j9MSEig9957jxo0aECRkZH04Ycf\nulwLUtbxlSL4+++/JSVWBKwIfEZ2djb17t2bNBoNyeVym6uhTCYjpVJJkZGRtG3bNo/1NGzY0GVH\n1qlTJ7p37x7dv3+f0tLSKDMzk/bs2UMhISH0zDPPUL9+/Sg6OpqGDx9O+/btoyNHjnhcU2E0GqlT\np052LpP16tWj5s2b01NPPUVbtmyxrWV4kN69ezv1qw8ICCCFQuH0LTgyMpIWLVokKeoqEdEHH3xA\nKpWKAgMDSQhBOp2OoqOjnS7eeuKJJ9wqIaVSSf/85z8lLdb67LPPHEZoWq2W+vfv73WAwAMHDlCH\nDh1sz3jIkCF08+ZNr+ooT/iiY27RogXdvn1bUmJFwIrAZ7z33nse5/nVajXt2bPHbT0ffPCBRx/3\n7t270+DBg6lx48ZOy8jlcgoICKCuXbvS3LlzKSUlxWV7/fr187haOSwsjCZPnmybfjMajRQXF0ex\nsbFUsWJFW4epUCgoMDCQWrVqRb169aIWLVq4nHpp0qSJXSRSs9lMhw4dov/7v/+j119/nbZu3Uob\nNmxwOoUTGBhIffv2tbuPK1euuFWieUmj0dDgwYPd/g+SkpJcjiy0Wi3t37/fw7eBKQy+UgRJSUmS\nEisCVgQ+wWQyUXBwsMdOCAB5ehZ79+6VVI/UpNFoKCgoiH7++WeHts6cOWN7o4
+JiaFRo0ZRnz59\nnL7Ny+VyUqlU9MEHH1BISIjD+WbNmlGVKlVsSiEv5Ed4eLhTRaNUKmn8+PFEZFkN3K9fP9Jqtbay\nOp3O7Ty+Uqm0TZ3MnDmT1Gq1ZMOsSqWiq1evOn3+Bw4coBYtWrhc8CWE8MpOwHiPrxTBnTt3JCVW\nBKwIfEJqaqrTztNZkslkTlfj5jF48GCfKoK8FBISQgaDgbKysigrK4vS09Opf//+VL9+fbp48aJt\nqik1NZWSk5NtK5mddYSu7kvqM8hLoaGhREQ0Z84ch2kYmUxGbdu2pZiYGKdG9ODgYDpx4gQdPnzY\nrZHdlXJcvHixw7OfO3eupLoGDRrkt+8S4ztFcPfuXUmpNCkCXkdQggkKCnLwIXeFEMKt7/K+fft8\nJZYdZrMZGzZswJIlS1C3bl2EhobiwIED2Lp1K+rXr4+goCBoNBoEBwejQoUK2LJlCypXruxQj+V3\n6rz+B/dW9kR6ejoA4IsvvrALef3UU08hMTERO3bswMaNG5GUlITJkyfbXWs0GlGzZk18+eWXXkdK\nlclkdgvlACApKQnvvPOOQ+jtB9HpdHjqqae8ao9hfAUrAitEhMOHD2Pt2rWSFi0VBUIIfPfdd4iJ\nifFYrlu3bg4L4fLjqSNyhlwu9xhWwWAw4MaNG4iNjUVKSgpMJhPeeOMNVKtWzWloDLlcjqFDh3ot\nizc0bmyJ5nv79m1b3iOPPIJVq1ahSpUqCA4ORmhoKLRaLd555x28/fbbAIDAwEB069YNlSpVwpUr\nV1wqJ1fk5uaiZ8+ednnr1q3zuLhIoVCgSpUq6Nu3r1ftMcWDj0JMlChYEcCiBD799FPExsZi2LBh\naNOmDdq0aYP9+/cjNTW1WGVTKBT4/PPP7UJM5EculyMoKAizZs3yWI+3aDQaTJ8+HaNGjXJZRq1W\n46GHHkLNmjUxduxYaLVaVK9e3eXqYI1GgzZt2nhsWyaToXfv3pg9ezY++ugjNGrUyGmZB1GpVBg1\nahSICHPnzsXMmTPRsWNHTJkyxekKap1Oh/feew+hoaGIjIzEN998AwBo3ry5W8Xq7L6mTJmCsLAw\nu/y0tDSn8Z7y30P37t1x6NAhh9EEU/LwYayhkkVxz00VJhXERpCVlUXx8fG0cOFCOnLkiC1fr9fT\nnTt3qGnTpnbzySqVil588UWHOP75MZvN9Msvv9CCBQto06ZNbufqC8KqVauoYsWKpFAoKCAggFQq\nlW3l8+DBg+nKlSse64iOjvZ6/l+n09HPP/9MN2/edOm5VLVqVZvbpNlsppUrV9KECRNcukIajUaa\nOXOm23bDwsLowoULNu+f7OxsyszMpM8//9xuPv65554jlUpFwcHBpFKpSAhBgYGBpNFoqEaNGrRh\nwwYymUyUlpbm9n+SlZVFe/futXNnPXPmjGQbQYsWLSg+Pt5p3fv373e5IE+r1dKiRYscrskLu834\nFvhgzr5ly5aUkpIiKfmivaJKxS5AYZI3isBoNNL27dspJCSEgoKCSKPRkFarpejoaJtPvMlkooSE\nBKd79sbExDj9cd68eZOaNm1KWq2WNBoNBQcHU8WKFenAgQOSZXPHpk2bnHZIGo3GFkpCCp5CMzhL\nERERtnvevXs36XQ60ul0JJfLKSgoiKpUqUJnz561tSGl88rOzqZHH32UtFotBQQEkFarJbVaTePG\njbN59vz4449OF5ulp6fTgAEDSKVS0eTJk4nIEgZj6tSpThfMqdVq2rFjBxGRyzUL7vjuu+9IrVbb\nKUGFQkFt27alsLAwm9Jyd99ms5lat27tsC4iICCAIiIi7O4zLi6O6tata1MSr7/+ustNaRjv8ZUi\nuH//vqTEiqAEKoL58+c77VAVCgW1a9fOVi41NZW6dOlCCoWCKlWqZHMd1Gg0NHPmTOrUqRM1adKE\n2rdvT6tXr6YmTZo49c/X6XS2FZ6F4aGHHn
LZUT/66KOS62nQoIHLegIDA0mpVNrda2hoKJ08edKu\njszMTFq+fDlNmzaNDh065NC55nVqOTk5lJyc7LKDNJlMtH37dvrXv/5FCxcupM2bN9OMGTNo8uTJ\n1Lt3b5erpImIbt26RQkJCZScnGxb1dysWTOX99aqVSvbtQV5y05OTqaOHTtSlSpVaMGCBZSWlkYp\nKSmk1+spLi6OKlWq5FHpJycn05NPPmlbla1Wq6l9+/aUkJBgK/PJJ584fD8DAwOpcePGbp8HIx1W\nBKwIqFatWm43nckLVJaenk67d++2xblJTU2l6dOnk1qtphkzZlBqairp9XrKyMigr776yubnX79+\nfYqNjaU2bdrYRhHTpk2TLJ8zMjIy3C4Ck8vltrI5OTm0evVq6tWrF3Xp0oXmzZtnFzqiXbt2LuvR\naDS0fv16euONN6hv3740a9Ys22KxI0eO0Msvv0wxMTE0bNgw6tKlC3399dcO8YnMZjPduXOHxo4d\nS0FBQaRUKqlWrVq0bt06u2kZg8FAK1eupF9++YWSkpKoWbNmpNPpbKtj27Vr53Q0kEdubi6NHj2a\nVCoVde/enX7//Xfq2bMndevWzen0lRCi0NszxsbG0t27dx0UX2ZmJm3atImWL1/u9Lrz58/TqFGj\nKCYmhkaOHEm7du2iffv2Oaw1SElJcbvQbNmyZYWSvyyRmJhIkydPpk6dOtFzzz3nNjTIg/hKEaSm\npkpKrAhKoCJwF6smKCiIVq5cSUSWjubBTi4jI4Nu3rzpEDxNr9fToUOHaMeOHZSZmWl7UzQYDLZj\nqTj7MmdlZblVBCqVylauffv2dgultFot1axZk27dukVERN9//73LhVQPPfQQJSQk0MKFC20hGu7e\nvUtjx44ltVptt3BLCOFyPtxoNDqMYJRKJUVHR9t1omvXrnW56lin09Err7xCEydOdBnwLy0tjY4e\nPUpGo5HMZjOlpaVRamoqpaen04gRIxymYEwmE924cYMuX75sUwq5ubm0a9cuu7fytLQ0Sk5OdlAc\nW7ZssXsuly5dIoPBQPfu3aNTp07RwYMHKSUlhS5dukQJCQk0b9486tGjBykUCtsoSy6Xk1qtpqVL\nlzrcz/r1690uHHziiSfs7DAFmeYqC/z2228UFBRkU5p523OOGjVKkjJgRcCKwO0bcVBQkO3NwtWP\nzNUXLTc31+0brFS2bt3qNE5Ot27dnC62ksvltpAGH3/8scsYPE899RQRWaZjnn32WTtlkFdvo0aN\nbDGMPvnkE+ratatbBRQUFEQZGRm0Z88eWrp0Ke3fv5/MZjMZDAYaP3680859y5YtRGRRFvmNvu6S\nEILGjBnj9H/g6v9kMBiod+/etvuPiYmhhx9+mFQqlW3P5ddff52ysrLIbDbb/udms9k2cklLS7ML\nAmc2mykhIYG6dOlCSqWStFqtLe6TVqu11ZvXQblbiRwcHGxzPMi7n3Xr1rlVBF26dHG4z/KmDEwm\nE9WoUcPlqGn79u0e6+/gBsYAAA+qSURBVPBFx9yqVStKS0uTlFgRlEBFMGTIEJc/tOrVq9t+WK46\nfG/zveHevXsUEhJCe/bsofT0dMrJySGDwUB6vZ4uXLhAoaGhdh1znv3i8OHDFBsb67YzDQgIsM2l\nm0wm2rRpE/Xq1YseeeQRW2jlvLdztVpN1apV8xhSQavVUqVKlWzhGnQ6HdWvX58uXbpEkydPdnpN\nXvgEvV5v55kVFBTkUSF4u9n65cuXSaFQ2LyJHqyvfv36Hufd83e0aWlpFB4eXqg9APLSpk2baP78\n+RQREUFCCKpYsSK9/fbbLqPDarVaWrJkiVMZy5Nn0c8//+z2u9KnTx+PdfhKEaSnp0tKrAhKoCKI\niYlx+gVSqVR06NAhIrL8sFyFA/aXIjAYDNS/f3/b23nbtm3pvffeo3HjxlGdOnUoNTWVrl+/TiNG\njKBq1a
pReHg4vfXWW5SYmEiHDh3yuFm5EIKuX7/u0O758+dtHU1eBxcZGel1OIf87YSHh9M//vEP\np+dGjRpFmZmZ9Pbbb1P79u1t5x5//HFav36927pr167t1TM1m800ZswYl4b2JUuWSHLxzfvf/uc/\n//E63ISz1LBhQxozZoxDXWq1miIiIhxsHIGBgfTQQw9JjqhalomPj3c7amrTpo3HOlgRsCKgPn36\nuPwSde/enYiITp486XI3LX8oApPJRG+//bbLN02NRuPW0JmbmyupAzp27JjtGqPRSGlpaRQVFWXr\nZPPKeYoW6im56ixVKhW9+eab1KhRIwJgUxYqlYqmTJlCc+bMcVuvTqcr0PN1Vd9ff/3lVT1du3Yt\ntBIAQC+99JLLN3+dTkdvvfUW1axZ0zbNNGzYMEpOTnYqU3kaDRAR/fHHHy4N6gqFgt566y2PdfhK\nEWRkZEhKpUkRlJuVxSkpKS7Pbdu2DTdu3MCyZcuwc+dOZGZm2p3PyMjAuXPnbHvz5pGZmVnoTcQP\nHTrksMF8QEAAOnbsiKpVq+Kjjz7CtGnTcO7cOYdriQhNmzbFc88953LlcIMGDRAREQGDwQCTyQSZ\nTIb4+HicOHECgGUP4bxQEFJX0ioUCqcrh/V6vVM5srKyMGfOHFy4cAFCCNSrVw9XrlxBy5YtMWLE\nCDRs2BCPP/446tev77S9OnXqSJLrQd5//318+OGHDiuZjUajy2uys7Oxf/9+u43dXa3qlopSqUSX\nLl1w/fp1l+EmMjIykJqaioSEBGRmZiIzMxOLFy9GaGhoob9jZYF69eqhU6dOTr93gYGBePPNN4tB\nqjJEcWuiwiRvRgTujJ8A6I033qCRI0eSXC6ncePG0Zo1a2jZsmW0fv16GjZsGAGWee6rV69SdnY2\nJSQk0NixY6l3796UkZFhMxh786aWm5tLK1eudPkWnbelYZ7HydChQ+3mrk0mE2VlZZHJZKL58+eT\nWq22eysPCwtzuoOYwWCgY8eOUU5ODo0aNYrq1KlDP/30E61bt86lZ1Heqt3WrVvTH3/8QRkZGTR3\n7ly7hVJ5MfnzjzIeTBqNhnbv3k0mk4lycnIoOzvbZrQ1m82UnJxsN42nVqvpyy+/dLgHKc85JyeH\n9u3bR40aNbKb8urXr5/L6ZbLly+TTqejRo0a2fapjo+Pdxu62lOqUKEC6fV6mjt3rtu9JV588UUH\neUwmE929e5eysrJsRuz8z6s8kZaWRl27diW1Wk1BQUEUFBREYWFhtGvXLknXw0cjgszMTEnJF+0V\nVSp2AQqTpCqCixcvevyx9ujRg1555ZUC/dAjIyNp4cKFXtsXiIjOnj3rsf5mzZrRjz/+SFlZWXZT\nRSaTifbu3UujR4+mESNG0Lp162jdunW0cuVKunjxotsOIy//559/plu3blF2djaZTCbq1q2bwxSP\nSqWizz//nH777Te7OkwmE5lMJrp+/TqNHj2aNBqNbUvKuLg4qlevHoWGhlLr1q2pevXqFBwcTN26\ndbNTmg/KlpeXkZFBCxYsoA4dOrjc89jVs86r8/jx4y6nq1asWGHXdm5uLmVkZNiUUEBAAD3xxBO2\n+9yyZQuNHDnSZUeu0Who9OjRdvsd6HQ6Cg0NtT23q1evupze0Ol0FBcXZ1PcDz4Xo9FIycnJkmwb\nx48fp27dulFgYCCpVCrq27cvzZkzh+7evevz8CfFwcWLF+n777+nbdu2ebVGxFeKQK/XS0qsCEqQ\nIvj444897vBV2FStWjV68cUXvQ4HYDKZaPHixW7rzjMYfvLJJ3Y7b2VnZ9OTTz5JWq2WhBDUtWtX\nunHjBun1eq/eFnNycux+TNnZ2bRlyxY6fvw4/fnnn7R79246f/68x/oyMjJo27ZtNldMov/5vKem\nplJOTo7N1VaqbM7ceb3Z27dnz54ujem1a9cmk8lEd+7coXv37tG6deuo
efPmdmUmTZpkJ6vRaKS7\nd+9SdHQ01a9fn2rWrEmhoaHUrl07m3tsWloaLVmyhCZNmkRxcXEOI4+hQ4c6KCelUkktW7akgwcP\n2txaXT0PT/z6668ulV/VqlVp3rx5lJSUVOhFdqURVgTlVBH8/PPPPvH28NRRT506lU6cOGHXUT+I\nsx/x/7d3tiFRtWkc/19z5s0zii+jaLbWKj4QveBktvFEUCG4bgRPhcTuh2WJlvlgfuiFIgOhlw+y\nKS1+2IJgF8peNlj6oCia9IDUB3MbkMTiwSjbXSnR1XkwmxecufaDztnRxpzRedO5fnDjnON97pcz\nZ87/nPu+7ut69uxZRNGviouLtSfuwGpnYM5i4ltO8VZDJMMP8RiuCGUBtRTf+u5NJhN/+vSJu7u7\nQ8ZH3rNnz7JvIiux5ff5fNzY2MhWq5UNBgOrqsq1tbWa2fBqCce5IBFp60tSiWjcmMvLy9nlcoWV\nRAiSRAhqamqWNa+MRqqtrWWPx7OkbfrMzAx7vV7taTbgTTM/Pz9i0QlYRxQUFGj7u7q6kmKBUaxF\nIODGIlxChb4MPpeTk5PscrlCWi3dv38/rLePlfY54BU1UEc0zp3T6Qzb/Fev16fcHIMIwdJpXVsN\nvXv3bk7tYkhGRgb27dsHo9GI8fHxryyOZmdn4fF4oNfrMTExgXv37qGlpQU2m21B4JRw8Hq9aG1t\nBQBMTExo+/fu3bts8JP1ABEhNzc37PzHjx8PGRwHAHbu3Ins7GyYzWacPHnyq1gFpaWlEcUjiBSd\nToeMjIyo1uH3+8P2gx9sFSUISXX3IKJqIvqJiN4S0cXVlldWVhbRDy3SYBKBH/OxY8cAAEVFRXC7\n3XC73fjy5Qu8Xi+mpqaQlZUFIkJ+fj6am5tx4cIFDA8PR1RXgECksaKiIm2f2+1eUVnrncuXLyMn\nJ2eBSatOp0N6ejpu3ryp7VMUBSUlJVo+i8WCkZGRsB4iVhuAJBCKMxrfYXZ2NkpLS8PKu3///rUX\nPCVJWI+BaZJGCIhIAfAXAL8BsBXA74ho62rKPHPmzJKRsnJycpCXlweLxYLMzEyYzWZUVVWhubk5\npC282WyGTqcDEWlxeLdt24bnz58vqMNqtcJsNkNVVRiNRuTl5S24KB49eoScnJyQ0bLCIWATf/Hi\nRc2+/e7du3EVA2YO+USZbBd/YWEhBgYGYLfbYbVakZmZiZqaGvT396O8vFzLZzKZ0NHRgStXruD8\n+fPYsmULxsfHI65vJW+fOp0OBoNhxdfDYlpaWpYtS1VVXL9+PSr1pSLrUQgSPjYVSAC+B9AdtF0P\noP5bx4RjNRQILhJwDpaWlsaZmZlcVVXFDoeD+/v7ub29fYFr4MHBQbbb7Xzw4EE+ffo0d3Z2cltb\nGw8NDfGbN2+4ra1Nc1u9EiYnJ7mpqYlLSkpCThYHOzALTmlpadzb28vMc9YzdXV1bDKZeMOGDfz+\n/fsFFirB47+BSdzFNuiL8yyHz+fjz58/86lTp/jGjRs8PT2tWQEtZaoaaR2JItD2EydOsKIoXFxc\nHJGFEjNrayMSTU9PD2/fvp31ej0riqIFAdLr9bx7927NpUqqgSjNEXg8nrBSNOqLVyKO8Rh6uBBR\nDYBqZv7j/PbvAexh5rpF+ewA7ACwadOmXR8+fFi2bKfTicePH2NychIVFRVJ81rMzLh69Sqampqg\nKAr8fj+ys7Nx584d9PX1obGxUcunqipu3bqFI0eOaG1//fo1RkZGMDw8DL1ej8rKShQVFUFRFAwN\nDSE3NxcGgwG9vb3IyspCdXU1pqen4XQ64fP5QETYvHkziAizs7Pw+/3fjJs7NTWFc+fO4cGDB/D7\n/Th06BBaW1vhdrvR2dmJsbEx2O12ZGVlacfMzs7C7XbD6/XCYrFob0/B55+ZQUTa33gRuPb9fj9G\nR0fR0NCAp0+f4sCBA7h06RK2bo3s
hTQwRp8M1xYwFy9ZURRYLBZMT0+DiJCenp7oZiUMInIwc8Vq\nyti1axf39fWFlddoNK66vnix5oQgmIqKCn758mW8mhgzZmZmMDg4CFVVsWPHDu1G4vF4MDAwAIPB\nAJvNFnJC2OfzYWBgAD6fDzabTbuRMzNevXoFl8uFsrIybbjA7/ejvb0dY2NjOHz4MAoLC7WyOjo6\n0NXVhY6ODhw9ehQ2mw2tra3o6elBWloaPn78CKfTidHRUZSUlKCgoCAOZ0cQooMIwdIkkxB8D+Ay\nM/96frseAJi5calj1osQJAs+nw+VlZXo7++Hy+XS9quqimvXruHs2bMJbJ0grI5oCcGLFy/Cymsw\nGNaMECTNZDGAfwL4joiKicgI4LcA2hLcppRCURQ8efIEDQ0N2LhxI1RVRXl5OR4+fCgiIAjrmNBG\n1gmAmWeJqA5ANwAFwN+Y+WuXm0JMMRqNqK+vR319faKbIghJRzLNAUWTpBECAGDmTgCdiW6HIAjC\nUqxHIUimoSFBEAQhAYgQCIIgREC0FpQt50mBiExE9Gj+/y+I6Jcx6A4AEQJBEIS4E6YnhZMAppi5\nFMCfAfwpVu0RIRAEQYiAKL0R/ArAW2Z+x8xeAH8H8MOiPD8AuDP/+R8AKilGExRJNVkcKQ6HY4KI\nll9a/H82AfhXrNqTpEifUwPp8/JsXm2FDoejm4jCdYFrJqLghU63mfn2/OeNAP4d9L//ANiz6Hgt\nz7xV5c8ArAAmEGXWtBAwc14k+YlofK0s8IgW0ufUQPocH5i5Op71xYtUGxpyJroBCUD6nBpIn9cW\nowCKgrZ/Mb8vZB4i0gPIBPDfWDQm1YTg50Q3IAFIn1MD6fPaIhxPCm0A/jD/uQbAjxwjn0Bremho\nBdxePsu6Q/qcGkif1xBLeVIgoquYc1/dBuCvAFqJ6C2AScyJRUxIGqdzgiAIQmJItaEhQRAEYREi\nBIIgCCmOCIEgCEKKI0IgCIKQ4ogQCIIgpDgiBIIgCCmOCIEgCEKK8z8526VlXBQ7GwAAAABJRU5E\nrkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p6KuSmVpeo_6" + }, + "source": [ + "# Conclusão" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GxVLsfqXesFC" + }, + "source": [ + "df.head(50)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aidvIN0ZyLx2" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oAIrbEJ5nsiz" + }, + "source": [ + "# Salvar cópia do dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f2ktaBhAnxPi" + }, + "source": [ + "df.to_csv(\"df_3DP_FE1.csv\", sep= ',', index = True, header=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GhB-30cIXtJZ" + }, + "source": [ + "# Exercícios\n", + "* Para cada dataframe a seguir, avalie o que necessita ser feito em termos de qualidade de dados." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "caFkC6oCmUKK" + }, + "source": [ + "## Exercício 1 - Predict Breast Cancer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vhOM-Z9zmf-f" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "cancer = load_breast_cancer()\n", + "X= cancer['data']\n", + "y= cancer['target']\n", + "\n", + "df_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n", + "df_cancer['target'] = df_cancer['target'].map({0: 'malign', 1: 'benign'})\n", + "df_cancer.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1qruqUDqnvMc" + }, + "source": [ + "## Exercício 2 - Predict Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "trxK8YXNnsam" + }, + "source": [ + "from sklearn.datasets import load_boston\n", + "\n", + "boston = 
load_boston()\n", + "X= boston['data']\n", + "y= boston['target']\n", + "\n", + "df_boston = pd.DataFrame(np.c_[X, y], columns= np.append(boston['feature_names'], ['target']))\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-CawPH2nb5cl" + }, + "source": [ + "## Exercícios 3 - Diabetes\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lVjBS7QcZuT" + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "diabetes = load_diabetes()\n", + "X= diabetes['data']\n", + "y= diabetes['target']\n", + "\n", + "df_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))\n", + "df_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB01_02__Condicionais: if, elif, else_hs.ipynb b/Notebooks/NB01_02__Condicionais: if, elif, else_hs.ipynb new file mode 100644 index 000000000..e27c4913c --- /dev/null +++ b/Notebooks/NB01_02__Condicionais: if, elif, else_hs.ipynb @@ -0,0 +1,373 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Notebooks/NB01_02__Condicionais.ipynb", + "provenance": [], + "collapsed_sections": [ + "n8BIbzQbNWUo", + "7eS94uQ4NhVR", + "SYOgJpGYVLUu", + "CaHFxk98W5if", + "ReWUyWiHXCnc", + "CqszHxaKHr2h", + "tXgF1Wl9gHKY", + "Fotx7XUquAo8", + "36kmLUYDvsUI", + "SWO2GdNovxAp", + "vpN54l4vxze5", + "u4HOf9SNytSq", + "6BQ9oZiD9hg5", + "tz5-QdrX9vct", + "p1muBgMX8NK4", + "FxTC2-U88ajk", + "z8EYn0pP25Rh" + ], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Y-QMrzHhpcu" + }, + "source": [ + "

CONDICIONAIS - IF

\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wYGZ0eGlv--6" + }, + "source": [ + "# **AGENDA**:\n", + "> Veja o **índice** dos itens que serão abordados neste capítulo." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q3FpTG0dh47M" + }, + "source": [ + "___\n", + "# **REFERÊNCIAS**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LWuIj53sVSnA" + }, + "source": [ + "___\n", + "# **CONDICIONAIS**\n", + "> Usado para decidir se uma determinada instrução ou bloco de instruções será executada ou não, isto é, se uma determinada condição for verdadeira, um bloco de instrução será executado." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NyG1l3awJzEq" + }, + "source": [ + "# Não executar o código a seguir:\n", + "if condicao1:\n", + " \n", + "elif condicao2:\n", + " \n", + "elif condicao3:\n", + " \n", + " ...\n", + "elif condicaoN:\n", + " \n", + "else:\n", + " " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FCJBMTh5WX5C" + }, + "source": [ + "## Exemplo 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vn5u7CEaWZjH" + }, + "source": [ + "def mensagem(i_idade, i_limite):\n", + " if i_idade > i_limite:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite}'\n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lW0ME_nVXU4M" + }, + "source": [ + "mensagem(35, 40)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EBBU8Yw2XxUo" + }, + "source": [ + "Nenhuma mensagem? E agora?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xQ23cAjMX1kx", + "outputId": "3612d39b-3f92-40fd-af14-2dfbca6b0697", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "mensagem(45, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "45 é maior que 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BeHU0tPuWK4s" + }, + "source": [ + "## Exemplo 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gSzCnjS0Fk-d" + }, + "source": [ + "def mensagem2(i_idade, i_limite):\n", + " if i_idade > i_limite:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite}'\n", + " else:\n", + " s_mensagem= f'{i_idade} é menor ou igual a {i_limite}'\n", + " \n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KxbmxuDwYFX_", + "outputId": "8f1faff1-de34-4967-865f-17453f7992af", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "mensagem2(35, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "35 é menor ou igual a 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lToDO6pzWPGL" + }, + "source": [ + "## Exemplo 3" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "a1NlziSbGrIl", + "outputId": "ffed270b-c16f-4d30-cdaf-80ae96898a94", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + } + }, + "source": [ + "def mensagem3(i_idade, i_limite1, i_limite2, i_limite3, i_limite4):\n", + " if ((i_idade > i_limite1) and (i_idade < i_limite2)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite1} e menor que {i_limite2}'\n", + " \n", + " elif ((i_idade > i_limite3) and (i_idade < i_limite4)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite3} e menor que {i_limite4}'\n", + " \n", + " 
else:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite4}'\n", + " \n", + "print(s_mensagem)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0ms_mensagem\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0;34mf'{i_idade} é maior que {i_limite4}'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms_mensagem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 's_mensagem' is not defined" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V8FF3lFLYqui" + }, + "source": [ + "Porque temos um erro nesta função?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y5F09RKGYyoX" + }, + "source": [ + "**Resposta**: por causa da indentação! 
A forma correta é:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vR-oFyzAY5UC" + }, + "source": [ + "def mensagem3(i_idade, i_limite1, i_limite2, i_limite3, i_limite4):\n", + " if ((i_idade > i_limite1) and (i_idade < i_limite2)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite1} e menor que {i_limite2}'\n", + " elif ((i_idade > i_limite3) and (i_idade < i_limite4)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite3} e menor que {i_limite4}'\n", + " else:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite4}'\n", + " \n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QgkBOGKdYgGU", + "outputId": "701f4620-817f-41e0-e9d7-f6b06adf6b3d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "mensagem3(35, 10, 20, 30, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "35 é maior que 30 e menor que 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LLk7bhjSwZch" + }, + "source": [ + "___\n", + "# **Wrap Up**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lJvjcjm8NQ85" + }, + "source": [ + "___\n", + "# Exercícios\n", + "## **Exercício 1**: \n", + "Escreva uma função em Python que receba um número inteiro i_limite e, na sequência, imprime os números inteiros de 0 a i_limite;\n", + "\n", + "## **outros exercícios**: \n", + "Nos sites abaixo você vai encontrar exercícios de Python:\n", + "### https://pynative.com/python-if-else-and-for-loop-exercise-with-solutions/;\n", + "### https://www.w3resource.com/python-exercises/" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gi091pZrwbnY" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB01_02__Condicionais_hs.ipynb b/Notebooks/NB01_02__Condicionais_hs.ipynb new file mode 
100644 index 000000000..bacbbc5eb --- /dev/null +++ b/Notebooks/NB01_02__Condicionais_hs.ipynb @@ -0,0 +1,373 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Notebooks/NB01_02__Condicionais.ipynb", + "provenance": [], + "collapsed_sections": [ + "n8BIbzQbNWUo", + "7eS94uQ4NhVR", + "SYOgJpGYVLUu", + "CaHFxk98W5if", + "ReWUyWiHXCnc", + "CqszHxaKHr2h", + "tXgF1Wl9gHKY", + "Fotx7XUquAo8", + "36kmLUYDvsUI", + "SWO2GdNovxAp", + "vpN54l4vxze5", + "u4HOf9SNytSq", + "6BQ9oZiD9hg5", + "tz5-QdrX9vct", + "p1muBgMX8NK4", + "FxTC2-U88ajk", + "z8EYn0pP25Rh" + ], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Y-QMrzHhpcu" + }, + "source": [ + "

CONDICIONAIS - IF

\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wYGZ0eGlv--6" + }, + "source": [ + "# **AGENDA**:\n", + "> Veja o **índice** dos itens que serão abordados neste capítulo." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q3FpTG0dh47M" + }, + "source": [ + "___\n", + "# **REFERÊNCIAS**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LWuIj53sVSnA" + }, + "source": [ + "___\n", + "# **CONDICIONAIS**\n", + "> Usado para decidir se uma determinada instrução ou bloco de instruções será executada ou não, isto é, se uma determinada condição for verdadeira, um bloco de instrução será executado." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NyG1l3awJzEq" + }, + "source": [ + "# Não executar o código a seguir:\n", + "if condicao1:\n", + " \n", + "elif condicao2:\n", + " \n", + "elif condicao3:\n", + " \n", + " ...\n", + "elif condicaoN:\n", + " \n", + "else:\n", + " " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FCJBMTh5WX5C" + }, + "source": [ + "## Exemplo 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vn5u7CEaWZjH" + }, + "source": [ + "def mensagem(i_idade, i_limite):\n", + " if i_idade > i_limite:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite}'\n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lW0ME_nVXU4M" + }, + "source": [ + "mensagem(35, 40)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EBBU8Yw2XxUo" + }, + "source": [ + "Nenhuma mensagem? E agora?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xQ23cAjMX1kx", + "outputId": "3612d39b-3f92-40fd-af14-2dfbca6b0697", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "mensagem(45, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "45 é maior que 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BeHU0tPuWK4s" + }, + "source": [ + "## Exemplo 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gSzCnjS0Fk-d" + }, + "source": [ + "def mensagem2(i_idade, i_limite):\n", + " if i_idade > i_limite:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite}'\n", + " else:\n", + " s_mensagem= f'{i_idade} é menor ou igual a {i_limite}'\n", + " \n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KxbmxuDwYFX_", + "outputId": "8f1faff1-de34-4967-865f-17453f7992af", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "mensagem2(35, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "35 é menor ou igual a 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lToDO6pzWPGL" + }, + "source": [ + "## Exemplo 3" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "a1NlziSbGrIl", + "outputId": "ffed270b-c16f-4d30-cdaf-80ae96898a94", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + } + }, + "source": [ + "def mensagem3(i_idade, i_limite1, i_limite2, i_limite3, i_limite4):\n", + " if ((i_idade > i_limite1) and (i_idade < i_limite2)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite1} e menor que {i_limite2}'\n", + " \n", + " elif ((i_idade > i_limite3) and (i_idade < i_limite4)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite3} e menor que {i_limite4}'\n", + " \n", + " 
else:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite4}'\n", + " \n", + "print(s_mensagem)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0ms_mensagem\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0;34mf'{i_idade} é maior que {i_limite4}'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms_mensagem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 's_mensagem' is not defined" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V8FF3lFLYqui" + }, + "source": [ + "Porque temos um erro nesta função?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y5F09RKGYyoX" + }, + "source": [ + "**Resposta**: por causa da indentação! 
A forma correta é:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vR-oFyzAY5UC" + }, + "source": [ + "def mensagem3(i_idade, i_limite1, i_limite2, i_limite3, i_limite4):\n", + " if ((i_idade > i_limite1) and (i_idade < i_limite2)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite1} e menor que {i_limite2}'\n", + " elif ((i_idade > i_limite3) and (i_idade < i_limite4)):\n", + " s_mensagem= f'{i_idade} é maior que {i_limite3} e menor que {i_limite4}'\n", + " else:\n", + " s_mensagem= f'{i_idade} é maior que {i_limite4}'\n", + " \n", + " print(s_mensagem)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QgkBOGKdYgGU", + "outputId": "701f4620-817f-41e0-e9d7-f6b06adf6b3d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "mensagem3(35, 10, 20, 30, 40)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "35 é maior que 30 e menor que 40\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LLk7bhjSwZch" + }, + "source": [ + "___\n", + "# **Wrap Up**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lJvjcjm8NQ85" + }, + "source": [ + "___\n", + "# Exercícios\n", + "## **Exercício 1**: \n", + "Escreva uma função em Python que receba um número inteiro i_limite e, na sequência, imprime os números inteiros de 0 a i_limite;\n", + "\n", + "## **outros exercícios**: \n", + "Nos sites abaixo você vai encontrar exercícios de Python:\n", + "### https://pynative.com/python-if-else-and-for-loop-exercise-with-solutions/;\n", + "### https://www.w3resource.com/python-exercises/" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gi091pZrwbnY" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB02__Numpy_hs.ipynb b/Notebooks/NB02__Numpy_hs.ipynb new file mode 100644 index 
000000000..f71a98b7e --- /dev/null +++ b/Notebooks/NB02__Numpy_hs.ipynb @@ -0,0 +1,6176 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB02__Numpy.ipynb", + "provenance": [], + "collapsed_sections": [ + "n8BIbzQbNWUo", + "7eS94uQ4NhVR", + "SYOgJpGYVLUu", + "CaHFxk98W5if", + "ReWUyWiHXCnc", + "CqszHxaKHr2h", + "tXgF1Wl9gHKY", + "Fotx7XUquAo8", + "36kmLUYDvsUI", + "SWO2GdNovxAp", + "vpN54l4vxze5", + "u4HOf9SNytSq", + "6BQ9oZiD9hg5", + "tz5-QdrX9vct", + "p1muBgMX8NK4", + "FxTC2-U88ajk", + "z8EYn0pP25Rh" + ], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6QhLXoatkvKR" + }, + "source": [ + "

# **NUMPY**

\n", + "\n", + "> NumPy is a Python package for scientific computing and linear algebra.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b8EZupp68vW8" + }, + "source": [ + "# **AGENDA**:\n", + "> In this chapter, we cover the following topics:\n", + "\n", + "* NumPy\n", + "* Creating arrays\n", + "* Creating multidimensional arrays\n", + "* Selecting items\n", + "* Applying functions such as max() and min()\n", + "* Computing descriptive statistics: mean and variance\n", + "* Reshaping\n", + "* Transpose of an array\n", + "* Eigenvalues and eigenvectors\n", + "* Wrap Up\n", + "* Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cO5t3xCO8kyK" + }, + "source": [ + "___\n", + "# **NOTES AND REMARKS**\n", + "\n", + "* Our goal with NumPy is to make Pandas easier to use." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z2IFUG4GSB0Z" + }, + "source": [ + "___\n", + "# **CHEAT SHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jYLeDVH-SNCg" + }, + "source": [ + "![Numpy](https://github.com/MathMachado/Materials/blob/master/numpy_basics-1.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0mKvExmgUFOk" + }, + "source": [ + "# **SCALARS, VECTORS, MATRICES AND TENSORS**\n", + "\n", + "![Tensor](https://github.com/MathMachado/Materials/blob/master/tensor.png?raw=true)\n", + "\n", + "Source: [PyTorch for Deep Learning: A Quick Guide for Starters](https://towardsdatascience.com/pytorch-for-deep-learning-a-quick-guide-for-starters-5b60d2dbb564)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o00pYRIkXiAU" + }, + "source": [ + "## Import Statement - First examples\n", + "> As an example, consider drawing a random sample of size 10 from the Normal(0, 1) distribution:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l_XuvcUDWNDk" + }, + "source": [ + "## Importing the NumPy library" + ] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "am_ZTIGaapCo" + }, + "source": [ + "### **Option 1**: Import the NumPy library WITH an alias" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b4irLw6BWVVZ" + }, + "source": [ + "import numpy as np" + ], + "execution_count": 2, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JK54ga7dXnJu" + }, + "source": [ + "# Set the number of decimal places NumPy prints:\n", + "np.set_printoptions(precision = 2, suppress = True)\n", + "\n", + "# Set the seed for reproducibility, so that\n", + "# everyone generates the same random numbers\n", + "np.random.seed(seed = 20111974)\n", + "\n", + "# Draw 10 random numbers from the Normal(media, desvio_padrao) distribution\n", + "media = 0\n", + "desvio_padrao = 1\n", + "a_numeros1 = np.random.normal(media, desvio_padrao, size = 10) # 1D array with size = 10\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-0934isZUm6" + }, + "source": [ + "**Note**: Change the value of [precision] to 4, 2 and 0 and observe what happens." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ob_8S_bYYa2" + }, + "source": [ + "### **Option 2**: Import the NumPy library WITHOUT an alias" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NcGd1ho_XDXU" + }, + "source": [ + "import numpy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zFYH6J5-Ydjl" + }, + "source": [ + "# Set the number of decimal places NumPy prints:\n", + "numpy.set_printoptions(precision = 2, suppress = True)\n", + "\n", + "# Set the seed for reproducibility, so that\n", + "# everyone generates the same random numbers\n", + "numpy.random.seed(seed = 20111974)\n", + "\n", + "# Draw 10 random numbers from the Normal(media, desvio_padrao) distribution\n", + "media = 0\n", + "desvio_padrao = 1\n", + "numpy.random.normal(media, desvio_padrao, size = 10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AwWSzYrZWfvA" + }, + "source": [ + "### **Option 3**: Import specific functions from the NumPy library" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bfYJzcqRa5eu" + }, + "source": [ + "from numpy import set_printoptions\n", + "from numpy.random import seed, normal" + ], + "execution_count": 3, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xj6fbpvubH_p", + "outputId": "35bb3776-a9c6-4352-f00d-02552a987e47", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Set the number of decimal places NumPy prints:\n", + "set_printoptions(precision = 2, suppress = True)\n", + "\n", + "# Set the seed for reproducibility, so that\n", + "# everyone generates the same random numbers\n", + "seed(seed = 20111974)\n", + "\n", + "# Draw 10 random numbers from the Normal(media, desvio_padrao) distribution,\n", + "# calling the imported function directly (no np. prefix needed)\n", + "media = 0\n", + "desvio_padrao = 1\n", + 
"normal(media, desvio_padrao, size = 10)" + ], + "execution_count": 4, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06, 1.14,\n", + " 1.38])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 4 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7nC6S2hpGIRF" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "00RerJPChnuP" + }, + "source": [ + "___\n", + "# **Descriptive Statistics with NumPy**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qa6ro1VJlShd" + }, + "source": [ + "## Example 1\n", + "> Let's return to the previous example, this time using Option 1 (with an alias):\n", + "\n", + "* Draw a random sample of size 10 from the Normal(0, 1) distribution." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "31dSBU8khvFk", + "outputId": "b1e515e2-6f14-4d72-e859-18d2b0ad14db", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Set the number of decimal places NumPy prints:\n", + "np.set_printoptions(precision = 2, suppress = True)\n", + "\n", + "# Set the seed\n", + "np.random.seed(seed = 20111974)\n", + "\n", + "# Draw 10 random numbers from the Normal(media, desvio_padrao) distribution\n", + "media = 0\n", + "desvio_padrao = 1\n", + "\n", + "a_numeros1 = np.random.normal(media, desvio_padrao, size = 10) # 1D array with size = 10\n", + "a_numeros1" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06, 1.14,\n", + " 1.38])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aT7LNdLyG7Mf", + "outputId": "a0f0f45d-0d0a-4292-f8a6-816d8c1cd36f", + "colab": { + "base_uri": 
"https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.mean(a_numeros1)" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1.1043374540652753" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vu3qg9LKHV7H", + "outputId": "143c7e96-09e0-48d4-9882-1a2464be3889", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.std(a_numeros1)" + ], + "execution_count": 8, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.735246705657231" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wa2t0P3nevTh" + }, + "source": [ + "Checking the mean and standard deviation of the generated array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "drUyk3f5ekDq", + "outputId": "a550f2ef-f2ab-4867-f55c-79ddc3ec4552", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "f'Distribution N({np.mean(a_numeros1)}, {np.std(a_numeros1)})'" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Distribution N(1.1043374540652753, 0.735246705657231)'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XSp7Hd-Gib67" + }, + "source": [ + "We expected mean = 0 and standard deviation = 1, right? Why did that not happen?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HP_8VSgygXOF" + }, + "source": [ + "## **Lab 1**\n", + "> Change [size] to 100, 1,000, 10,000, 100,000 and 1,000,000 and report what happens to the mean and standard deviation." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4TbmVbdcg6iU" + }, + "source": [ + "## **My solution**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-qdiqBVHg-gd", + "outputId": "52e5582c-e6bd-4f58-8c92-cc1190464723", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 153 + } + }, + "source": [ + "# Set the mean and the standard deviation\n", + "media = 0\n", + "desvio_padrao = 1\n", + "\n", + "# Set the seed\n", + "np.random.seed(seed = 20111974)\n", + "# Sample sizes up to 10^8; 10^9 float64 samples would need about 8 GB of RAM\n", + "l_lista_numeros = [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]\n", + "\n", + "for i_size in l_lista_numeros:\n", + "    a_numeros1 = np.random.normal(media, desvio_padrao, size = i_size)\n", + "    print(f'Size: {i_size}--> Distribution: N({np.mean(a_numeros1)}, {np.std(a_numeros1)})')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Size: 10--> Distribution: N(1.1043374540652753, 0.735246705657231)\n", + "Size: 100--> Distribution: N(-0.14020525697186714, 0.9254100654233511)\n", + "Size: 1000--> Distribution: N(0.021644923462910873, 1.0054417533501039)\n", + "Size: 10000--> Distribution: N(0.015499353804764507, 0.9970905566844254)\n", + "Size: 100000--> Distribution: N(0.002039323041103302, 0.9960906293570095)\n", + "Size: 1000000--> Distribution: N(-1.1062145143945444e-06, 0.999473966169304)\n", + "Size: 10000000--> Distribution: N(0.0002892972723094128, 1.0001202837422036)\n", + "Size: 100000000--> Distribution: N(0.00011967896623555603, 0.999944390106086)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bp-YuviQwWqE" + }, + "source": [ + "For the Normal($\\mu, \\sigma$) distribution, we have:\n", + "\n", + "![NormalDistribution](https://github.com/MathMachado/Materials/blob/master/NormalDistribution.PNG?raw=true)\n", + "\n", + "Source: [Normal 
Distribution](https://towardsdatascience.com/understanding-the-68-95-99-7-rule-for-a-normal-distribution-b7b7cbf760c2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KwHBY3Enk04N" + }, + "source": [ + "## Strong Law of Large Numbers - SLLN\n", + "> Please read the [Law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) article. --> 3 minutes.\n", + "\n", + "* What did you learn from it?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BhwmSkAjlszT" + }, + "source": [ + "## Example 2\n", + "> Let's look more closely at what the SLLN says by simulating rolls of a die. A fair die has 6 faces numbered 1 to 6, each with equal probability.\n", + "\n", + "The SLLN says that as N (the sample size, here the number of rolls) grows, the sample mean converges to the expected value. For a fair die, the expected value is:\n", + "\n", + "$$\\frac{1+2+3+4+5+6}{6}= \\frac{21}{6}= 3.5$$\n", + "\n", + "In other words, as N (the sample size) grows, the mean of the rolls should approach 3.5.\n", + "\n", + "Let's see whether that holds..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-QcJXf6roj0D" + }, + "source": [ + "We will use np.random.randint (the randint function defined in the np.random module) below:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A2u0RzLOrRE2" + }, + "source": [ + "What does the result below mean, and how should it be interpreted?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B3-X_VBerUfa", + "outputId": "c55bbf0c-4ff6-46ec-c54d-15bfea2d7b20", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "# Set the seed\n", + "import numpy as np\n", + "np.random.seed(seed = 20111974)\n", + "\n", + "# Simulate 100 rolls of a die:\n", + "a_dados_simulados = np.random.randint(1, 7, size = 100)\n", + "a_dados_simulados" + ], + "execution_count": 1, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([4, 5, 3, 1, 1, 4, 3, 1, 2, 2, 1, 1, 6, 4, 5, 3, 1, 4, 1, 6, 2, 4,\n", + " 6, 2, 4, 3, 2, 6, 3, 6, 2, 6, 1, 3, 1, 2, 4, 2, 4, 6, 3, 2, 6, 1,\n", + " 4, 3, 6, 5, 2, 3, 3, 3, 3, 2, 1, 6, 2, 1, 2, 3, 1, 5, 6, 6, 6, 6,\n", + " 5, 6, 6, 5, 6, 3, 3, 2, 4, 2, 6, 1, 2, 3, 4, 5, 5, 3, 1, 6, 6, 5,\n", + " 5, 1, 4, 6, 2, 2, 4, 3, 6, 1, 5, 5])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 1 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "m8Of2MMIrbF3", + "outputId": "d362e33e-4b05-4a06-d231-31c9517341bd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 136 + } + }, + "source": [ + "# Import pandas: we need the Series.value_counts() method\n", + "# (the top-level pd.value_counts() is deprecated in recent pandas versions):\n", + "import pandas as pd\n", + "pd.Series(a_dados_simulados).value_counts()" + ], + "execution_count": 2, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "6 22\n", + "3 18\n", + "2 18\n", + "1 17\n", + "4 13\n", + "5 12\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "54VwED8Br8rx" + }, + "source": [ + "**Interpretation**: We simulated 100 rolls of a die. The output above shows how often each face came up.\n", + "\n", + "I was expecting roughly the same frequency for each face, that is, around 16 or 17. 
That is:\n", + "\n", + "$$\\frac{100}{6} \\approx 16.67$$\n", + "\n", + "But OK, let's continue with our experiment..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HT_Dak-umC6I", + "outputId": "45e74f5f-7c2c-4a85-daf8-ff37831f790e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 170 + } + }, + "source": [ + "# Set the seed\n", + "np.random.seed(20111974)\n", + "\n", + "for i_size in [10, 30, 50, 75, 100, 1000, 10000, 100000, 1000000]:\n", + "    a_dados_simulados = np.random.randint(1, 7, size = i_size)\n", + "    print(f'Size: {i_size} --> Mean: {np.mean(a_dados_simulados)}')" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Size: 10 --> Mean: 2.6\n", + "Size: 30 --> Mean: 3.3666666666666667\n", + "Size: 50 --> Mean: 3.72\n", + "Size: 75 --> Mean: 3.2666666666666666\n", + "Size: 100 --> Mean: 3.42\n", + "Size: 1000 --> Mean: 3.461\n", + "Size: 10000 --> Mean: 3.5259\n", + "Size: 100000 --> Mean: 3.50794\n", + "Size: 1000000 --> Mean: 3.50151\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "edWNNOnXtbtd" + }, + "source": [ + "And now, how do you interpret these results?" 
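+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A minimal sketch, not part of the original notebook, of one way to watch the convergence: instead of drawing a new sample for each size, it tracks the running mean of a single sequence of rolls. The names `rng`, `a_lancamentos` and `a_media_acumulada` are illustrative assumptions, and `np.random.default_rng` is used here in place of `np.random.seed`, so the numbers will differ from the output above." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "\n", + "# Hypothetical illustration: running mean of 100000 rolls of a fair die\n", + "rng = np.random.default_rng(20111974)\n", + "a_lancamentos = rng.integers(1, 7, size = 100000)  # the upper bound 7 is exclusive\n", + "\n", + "# Mean of the first k rolls, for k = 1, ..., 100000\n", + "a_media_acumulada = np.cumsum(a_lancamentos) / np.arange(1, a_lancamentos.size + 1)\n", + "\n", + "for i_n in [10, 100, 1000, 10000, 100000]:\n", + "    print(f'N: {i_n} --> running mean: {a_media_acumulada[i_n - 1]}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the law holds, the printed means should drift toward 3.5 as N grows, although small samples can still land noticeably above or below it."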
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eL6gXThkYcSf" + }, + "source": [ + "## Computing percentiles\n", + "> Boxplot" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jlGOQfXfPf0D" + }, + "source": [ + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "grtEXG2BoNRt" + }, + "source": [ + "Consider the following array of (simulated) returns:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DjPKKq01YjF9", + "outputId": "71a8f79c-54d0-4d00-faef-27d2be4ccf89", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "import numpy as np\n", + "np.random.seed(20111974)\n", + "\n", + "# Simulate returns of financial assets using the Normal(0, 1) distribution:\n", + "a_retornos = np.random.normal(0, 1, 100)\n", + "print(f'Mean: {np.mean(a_retornos)}')" + ], + "execution_count": 4, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Mean: -0.016996335492713833\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ajjlfqgssLVO", + "outputId": "955f617e-2a59-403d-c1cf-f6e3fb5d76c7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "a_retornos" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632, 1.13807032, 1.37966044,\n", + " -2.05995563, 0.67474814, 0.72722843, -0.33923852, 0.43613107,\n", + " 0.59135489, -1.29281877, 1.17712036, -0.98644163, -1.79034143,\n", + " -1.08913605, -0.90712825, -1.02291108, -1.36445713, -0.29429164,\n", + " 0.06343709, -1.14196185, -0.50706079, -0.83539436, -1.41492946,\n", + " -0.2159062 , -1.16519474, -0.60767518, -0.61510925, 1.0771542 ,\n", + " 0.5043687 , 0.02674197, 1.83494644, 0.34728874, 
-1.14671885,\n", + " -0.59841423, -0.42698353, 0.10901983, -0.75168457, 0.71689294,\n", + " -0.50810299, 0.47524103, -0.38248511, -1.37491973, 1.5355728 ,\n", + " -0.27356178, 0.68072592, -1.80454873, 1.16995833, -0.37988822,\n", + " 0.19305861, 1.53792436, -0.11802807, -0.97621103, -1.23463994,\n", + " 1.0504434 , 1.91481015, 0.80359454, 0.35869561, 1.03409992,\n", + " -0.37200685, 0.32947575, 0.70038627, -0.98085533, -1.21072144,\n", + " 0.74366412, 0.18372348, 0.10430302, -0.78160841, -0.0423915 ,\n", + " 1.67094293, -1.07256479, -0.5493723 , -1.83082917, 0.11510819,\n", + " 1.3911365 , -0.28940563, 0.31904722, -0.70009623, -0.4353552 ,\n", + " -2.0301258 , -0.14205882, 1.66292963, -0.57691495, -0.78963384,\n", + " -0.80660503, 0.05581487, 0.8715663 , -0.3499477 , 1.37366912,\n", + " 0.88027638, -1.47925906, -0.40657104, -0.18789895, 0.47475142])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XZ3m06gv9lei" + }, + "source": [ + "Below is the boxplot of the array a_retornos:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QtuwJP449tBQ", + "outputId": "4e8bf67a-7227-4287-f757-4f079c21d3b2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + } + }, + "source": [ + "# Import seaborn, one of the main libraries for data visualization (another one is matplotlib)\n", + "import seaborn as sns\n", + "\n", + "sns.boxplot(y = a_retornos)" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXIAAADrCAYAAAB0Oh02AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAIPUlEQVR4nO3d3YtdVxnH8d/TxJeIikiHCqMxyogiIgiDIF4I6kXtjSgIeiGIQvTCYQRBlP4JghAGbwKKN6I3WhSM+AKCCCpORKS1UQ6C2MGX0YIWEpXY5YUR25pmzuRsZ5+n+XwgkLNnZu2HkHxZWdknU2OMANDXXXMPAMBqhBygOSEHaE7IAZoTcoDmhBygudNz3PTuu+8e586dm+PWAG1dvnz5T2OMjadenyXk586dy/7+/hy3Bmirqn5zs+uOVgCaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoLlZniNnfezt7WWxWMw9xlo4ODhIkmxubs48yXrY2trKzs7O3GOwBCGHG65duzb3CHBbhPwOZ8f1X7u7u0mSCxcuzDwJHI8zcoDmhBygOSEHaE7IAZoTcoDmhBygOSEHaE7IAZoTcoDmhBygOSEHaE7IAZoTcoDmhBygOSEHaE7IAZpbOeRV9bKq+l5V/aKqHqqq3SkGA2A5U3yHoOtJPj7G+GlVvSDJ5ar6zhjjFxOsDcARVt6RjzF+N8b46Y2fP5bk4SS+ey3ACZn0jLyqziV5Q5IfT7kuAE9vspBX1fOTfCXJx8YYf73Jx89X1X5V7R8eHk51W4A73iQhr6pn5d8R/+IY46s3+5wxxsUxxvYYY3tjY2OK2wKQaZ5aqSSfS/LwGOMzq48EwHFMsSN/c5L3J3lrVf3sxo/7JlgXgCWs/PjhGOMHSWqCWQC4Dd7ZCdCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0N0nIq+rzVfXHqnpwivUAWN5UO/IvJLl3orUAOIZJQj7G+H6SR6dYC4DjcUYO0NyJhbyqzlfVflXtHx4entRtAZ7xTizkY4yLY4ztMcb2xsbGSd0W4BnP0QpAc1M9fvilJD9M8uqqeqSqPjTFugAc7fQUi4wx3jfFOgAcn6MVgOaEHKA5IQdoTsgBmhNygOaEHKC5SR4/7GZvby+LxWLuMVgz//k9sbu7O/MkrJutra3s7OzMPcbTuiNDvlgs8rMHH84/n/fiuUdhjdz1j5EkufzrP8w8Cevk1NX1/49d78iQJ8k/n/fiXHvNfXOPAay5M1cuzT3CkZyRAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNTRLyqrq3qn5ZVYuq+uQUawKwnJVDXlWnknw2yTuSvDbJ+6rqtauuC8ByptiRvzHJYozx6zHGP5J8Ock7J1gXgCVMEfLNJL99wutHblx7kqo6X1X7VbV/eHg4wW0BSE7wHzvHGBfHGNtjjO2NjY2Tui3AM94UIT9I8rInvH7pjWsAnIApQv6TJK+qqldU1bOTvDfJ1ydYF4AlnF51gTHG9ar6aJJvJTmV5PNjjIdWngyApawc8iQZY1xKcmmKtU7CwcFBTl39S85caTMyMJNTV/+cg4Prc49xS97ZCdDcJDvybjY3N/P7v5/OtdfcN/cowJo7c+VSNjfvmXuMW7IjB2hOyAGaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoDkhB2hOyAG
aE3KA5oQcoDkhB2ju9NwDzOXU1Udz5sqlucdgjdz1t78mSR5/7gtnnoR1curqo0numXuMW7ojQ761tTX3CKyhxeKxJMnWK9f7Dy0n7Z61b8YdGfKdnZ25R2AN7e7uJkkuXLgw8yRwPM7IAZoTcoDmhBygOSEHaG6lkFfVe6rqoap6vKq2pxoKgOWtuiN/MMm7k3x/glkAuA0rPX44xng4SapqmmkAODZn5ADNHbkjr6rvJnnJTT50/xjja8veqKrOJzmfJGfPnl16QABu7ciQjzHePsWNxhgXk1xMku3t7THFmgA4WgFob9XHD99VVY8keVOSb1TVt6YZC4BlrfrUygNJHphoFgBug6MVgOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKC5lUJeVZ+uqitV9fOqeqCqXjTVYAAsZ9Ud+XeSvG6M8fokv0ryqdVHAuA4Vgr5GOPbY4zrN17+KMlLVx8JgOOY8oz8g0m+OeF6ACzh9FGfUFXfTfKSm3zo/jHG1258zv1Jrif54i3WOZ/kfJKcPXv2toYF4H/VGGO1Bao+kOTDSd42xri6zNdsb2+P/f39le7LNPb29rJYLOYeYy3859dha2tr5knWw9bWVnZ2duYegyeoqstjjO2nXj9yR37Eovcm+USStywbcVhXZ86cmXsEuC0r7cirapHkOUn+fOPSj8YYHznq6+zIAY7v/7IjH2P4OyjAzLyzE6A5IQdoTsgBmhNygOaEHKA5IQdoTsgBmlv5Lfq3ddOqwyS/OfEbw9HuTvKnuYeAp/HyMcbGUy/OEnJYV1W1f7N3zsE6c7QC0JyQAzQn5PBkF+ceAI7LGTlAc3bkAM0JOUBzQg7QnJADNCfkAM39C46PaZwmexaoAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o9ujdjxNY6qE" + }, + "source": [ + "# We will use the function np.percentile(array, q = [p1, p2, p3, ..., p99])\n", + "percentis = np.percentile(a_retornos, q = [1, 5, 25, 50, 55, 75, 99])\n", + "\n", + "# First quartile\n", + "q1 = percentis[2]" + ], + "execution_count": 7, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AbOH4m6JR_lR", + "outputId": "f0d97c35-cc96-48a4-e354-95db5dbb8528", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "percentis" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-2.0304241 , -1.49481318, -0.78361477, -0.12205087, 0.08182676,\n", + " 0.71947681, 2.06016123])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c75g2Egco2lc" + }, + "source": [ + "At which position of the array percentis is Q3?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nZr-A82Zo8Kb", + "outputId": "49f16b21-0184-42e5-d732-d8d68d495ef0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "q3 = percentis[5]\n", + "\n", + "# or, counting from the end of the array:\n", + "q3_2 = percentis[-2]\n", + "print(q3, q3_2)" + ], + "execution_count": 10, + "outputs": [ + { + "output_type": "stream", + "text": [ + "0.7194768106252311 0.7194768106252311\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sWrnESPQT4JM" + }, + "source": [ + "# lim_inferior and lim_superior are the lower and upper limits for outlier detection\n", + "lim_inferior = q1 - 1.5 * (q3 - q1)\n", + "lim_superior = q3 + 1.5 * (q3 - q1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Yb4-ZJlUUYsi" + }, + "source": [ + "f'Lower limit: {lim_inferior}; Upper limit: {lim_superior}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jr6oXIHlUxOe" + }, + "source": [ + "np.min(a_retornos)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "UxE47cN0U54X" + }, + "source": [ + "np.max(a_retornos)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OTB9HnIac499" + }, + "source": [ + "___\n", + "# **Sorting the items of an array**\n", + "> Consider the following array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jgj8Yw46dBMx" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.random.random(10)\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cC9272GFdRln" + }, + "source": [ + "Sorting the items of a_numeros1..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YUP90nBVdUeF" + }, + "source": [ + "np.sort(a_numeros1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lG763cDGj-yB" + }, + "source": [ + "___\n", + "# **Getting help**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ehxPlD3EkEYL" + }, + "source": [ + "help(np.random.normal)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Q_konJVaBsV" + }, + "source": [ + "___\n", + "# **Creating 1D arrays**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DddZT5kadYJ7" + }, + "source": [ + "import numpy as np\n", + "np.set_printoptions(precision = 2, suppress = True)\n", + "np.random.seed(seed = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jaqd-VnF3yIt" + }, + "source": [ + "Create the 1D array a_numeros1 with the following numbers:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E3niz_zHaF3e" + }, + "source": [ + "a_numeros1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DyfXbW_ZKJBS" + }, + "source": [ + "How many dimensions does a_numeros1 have?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gbHlydALKB3R" + }, + "source": [ + "# Number of dimensions of the array\n", + "a_numeros1.ndim" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "am9otElpKNPa" + }, + "source": [ + "What is the shape (the size along each dimension) of the array a_numeros1?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "juJJ74d2wale" + }, + "source": [ + "# Number of items along each dimension of the array\n", + "a_numeros1.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BHg4Rre3GwPy" + }, + "source": [ + "The array a_numeros1 could also have been created with the function np.arange(start, stop, step):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I3fyusN7G5Zn" + }, + "source": [ + "# Remember that the stop value 10 is exclusive.\n", + "a_numeros2 = np.arange(start = 0, stop = 10, step = 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IHCEpmUxXsaK" + }, + "source": [ + "Another alternative is np.linspace(start = 0, stop = 9, num = 10). See below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JB9Y_x3RX1GX" + }, + "source": [ + "# With np.linspace, the stop value 9 is inclusive.\n", + "a_numeros3 = np.linspace(0, 9, 10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P6MR8MPeYOZm" + }, + "source": [ + "Compare the results of a_numeros1, a_numeros2 and a_numeros3 below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tWEzge6HYSFu" + }, + "source": [ + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lUNlFVKYYT9f" + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xo8Lid5fYVPW" + }, + "source": [ + "a_numeros3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V9aW7C4vHAcF" + }, + "source": [ + "In other words, a_numeros1, a_numeros2 and a_numeros3 hold the same values (np.linspace returns floats, while the other two hold integers). 
OK?\n", + "\n", + "**NOTE**: The syntax used to create a_numeros3 is slightly different from the syntax used to create a_numeros1 and a_numeros2. Below is the syntax of np.linspace:\n", + "\n", + "![](https://github.com/MathMachado/Materials/blob/master/linspace_sintaxe.PNG?raw=true)\n", + "\n", + "Source: [HOW TO USE THE NUMPY LINSPACE FUNCTION](https://www.sharpsightlabs.com/blog/numpy-linspace/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KNnwZa3uvYqE" + }, + "source": [ + "Add 2 to each item of a_numeros1:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jt2KVyviw0bp" + }, + "source": [ + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "arROkhWXbdTW" + }, + "source": [ + "a_numeros2 = a_numeros1 + 2\n", + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZJx2vG86vdVi" + }, + "source": [ + "Multiply each item of a_numeros1 by 10:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vm7abO6Ebkun" + }, + "source": [ + "a_numeros1 = a_numeros1*10\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0Ev1xnBwaYJG" + }, + "source": [ + "___\n", + "# **Creating Multidimensional Arrays**\n", + "> A 2D array, for example, is what we call a matrix." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gHaeAug5vjjd" + }, + "source": [ + "Create an array with 2 rows and 3 columns filled with random numbers:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VDi0vIPSYR4F" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.random.randn(2, 3)\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DIdd-nA3tJjV" + }, + "source": [ + "## Shape of an array\n", + "> For a matrix, the shape is the number of rows and columns." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pKvjjnkrK-v7" + }, + "source": [ + "a_numeros1.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-DHS5jXELCfa" + }, + "source": [ + "a_numeros1 is a 2D array (a matrix) with 2 rows, where each row has 3 elements." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HJI6X1wvv4Bg" + }, + "source": [ + "Create an array with 3 rows and 3 columns:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hXPbWh3Tv26T" + }, + "source": [ + "a_numeros2 = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])\n", + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "we6ZJOICc7bQ" + }, + "source": [ + "# Number of rows and columns of a_numeros1:\n", + "a_numeros1.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "f0ocwuI1dED6" + }, + "source": [ + "# Number of rows and columns of a_numeros2:\n", + "a_numeros2.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CApPtnW0YuRP" + }, + "source": [ + "# Add 2 to each element of a_numeros2\n", + "a_numeros2 = a_numeros2+2\n", + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + 
"id": "M87aGmxRY3RW" + }, + "source": [ + "# Multiply each element of a_numeros2 by 10\n", + "a_numeros2 = a_numeros2*10\n", + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZt93y1IL_v7" + }, + "source": [ + "___\n", + "# **Copying arrays**\n", + "> Consider the array below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sH2FTXj5MRRC" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.random.randn(2, 3)\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VtgKeMt6MYrr" + }, + "source": [ + "Making a copy of a_numeros1..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "K0hOHR3IMa-o" + }, + "source": [ + "a_numeros1_copia = a_numeros1.copy()\n", + "a_numeros1_copia" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lFpmcR0HkCar" + }, + "source": [ + "___\n", + "# **Operations on arrays**\n", + "> Consider an array of temperatures in Fahrenheit given by:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VnagcUqVkLhW" + }, + "source": [ + "# Set the seed\n", + "np.random.seed(20111974)\n", + "\n", + "a_temperatura_farenheit = np.array(np.random.randint(0, 100, 10))\n", + "a_temperatura_farenheit " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VrjNKfXxk1yv" + }, + "source": [ + "type(a_temperatura_farenheit)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o1STejhrk0kZ" + }, + "source": [ + "Converting the temperature from Fahrenheit to Celsius..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E_jXflR_lNy3" + }, + "source": [ + "a_temperatura_celsius = 5*a_temperatura_farenheit/9 - 5*32/9\n", + "a_temperatura_celsius" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U4pCv0pNqPZI" + }, + "source": [ + "# The same result, written in a different way:\n", + "a_temperatura_celsius = (5/9)*a_temperatura_farenheit - (160/9)\n", + "a_temperatura_celsius" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UT4YD2FawUA" + }, + "source": [ + "___\n", + "# **Selecting items**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pqOv8P1za1m8" + }, + "source": [ + "# Select the second item of a_numeros1 (remember that Python indexing starts at 0).\n", + "# a_numeros1 is a 2D array at this point, so this returns its second row.\n", + "a_numeros1[1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TIwVKk6AyRv6" + }, + "source": [ + "Given a_numeros2 below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zoDmbXo6bCeu" + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iJXSPp-0yb4w" + }, + "source": [ + "... select the item in row 2, column 3 of the array a_numeros2:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sJiVfnlzcjRv" + }, + "source": [ + "a_numeros2[1, 2]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xl5HwJIMcv2e" + }, + "source": [ + "# Select the last item of a_numeros1 --> a_numeros1 is a 2D array, so its last item is its last row.\n", + "a_numeros1[-1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ezTH0HsyrnAl" + }, + "source": [ + "See..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OBv9EM54rYX3" + }, + "source": [ + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Po3WLFC-rod8" + }, + "source": [ + "a_temperatura_celsius[-1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4qJJ2HCedW4h" + }, + "source": [ + "___\n", + "# **Applying functions such as max(), min(), etc.**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_meTJdUsda4e" + }, + "source": [ + "f'The maximum of a_numeros1 is: {np.max(a_numeros1)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "m-wiBkAidnhN" + }, + "source": [ + "f'The minimum of a_numeros1 is: {np.min(a_numeros1)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lmupnRHQdtwh" + }, + "source": [ + "f'The maximum of a_numeros2 is: {np.max(a_numeros2)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "H2z7oB6Bd786" + }, + "source": [ + "f'The maximum of each ROW of a_numeros2 is: {np.max(a_numeros2, axis = 1)}' # axis = 1 tells NumPy that we want one result per row" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gj2ZBDsWeMyk" + }, + "source": [ + "f'The maximum of each COLUMN of a_numeros2 is: {np.max(a_numeros2, axis = 0)}' # axis = 0 tells NumPy that we want one result per column" 
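+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The meaning of `axis` is easier to see on a non-square array, where the per-row and per-column results have different lengths. The cell below is a small sketch that is not part of the original notebook; the array `a_exemplo` is an illustrative assumption." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "\n", + "# Hypothetical illustration: an array with 2 rows and 3 columns\n", + "a_exemplo = np.array([[1, 2, 3], [4, 5, 6]])\n", + "\n", + "print(a_exemplo.shape)              # (2, 3)\n", + "print(np.max(a_exemplo, axis = 1))  # one maximum per row: [3 6]\n", + "print(np.max(a_exemplo, axis = 0))  # one maximum per column: [4 5 6]"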
+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7_tEfm2IecIU" + }, + "source": [ + "___\n", + "# **Computing Descriptive Statistics: mean and standard deviation**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lIY5jx3ueh7q" + }, + "source": [ + "f'The mean of a_numeros1 is: {np.mean(a_numeros1)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VmqSELRReuAW" + }, + "source": [ + "f'The mean of a_numeros2 is: {np.mean(a_numeros2)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gxap-Wg5e2_H" + }, + "source": [ + "f'The standard deviation of a_numeros2 is: {np.std(a_numeros2)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R0GcljGtfBvP" + }, + "source": [ + "___\n", + "# **Reshaping**\n", + "> Very useful in Machine Learning." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vfEmw01j8zux" + }, + "source": [ + "## Example 1\n", + "* The array a_numeros2 looks like this:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-Lb3VZCCfK_a" + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YWN_nN-4fD7u" + }, + "source": [ + "# Reshape to 9 rows and 1 column:\n", + "a_numeros2.reshape(9, 1) # a_numeros2.reshape(9, -1) produces the same result." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "id9ILRRt7SwY" + }, + "source": [ + "## Another Reshape example\n", + "> Given the 1D array below, reshape it into a 2D array with 2 columns." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9RA9Ht2b7Swd", + "outputId": "eadedfd5-fd6c-49c8-db5c-6f8f30d45f36", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Set the seed\n", + "np.random.seed(20111974)\n", + "a_numeros1 = np.array(np.random.randint(1, 10, size = 15))\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([9, 9, 3, 9, 2, 9, 1, 5, 3, 1, 9, 4, 8, 2, 4])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8KxR4xZT7cRv" + }, + "source": [ + "### Solution\n", + "> We have 15 elements in a_numeros1 to reshape into a 2D array with 2 columns.\n", + "\n", + "At first glance, the solution would be..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VMdHl1Il7wLw", + "outputId": "d51c7263-f523-4af8-9606-ee93cab66f1c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 162 + } + }, + "source": [ + "a_numeros1.reshape(-1, 2) # The value \"-1\" in the row position asks NumPy to compute the number of rows automatically." 
+ ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "ValueError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0ma_numeros1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# The value \"-1\" in the row position asks NumPy to compute the number of rows automatically.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mValueError\u001b[0m: cannot reshape array of size 15 into shape (2)" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pZS4b4-y708q" + }, + "source": [ + "Why do we get this error?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4disywvR8HeH" + }, + "source": [ + "What if we do..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3oEAAXTp8I7Z", + "outputId": "e8c8a90f-c34a-4304-d9b4-fd7f04ce224f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Set the seed\n", + "np.random.seed(20111974)\n", + "a_numeros1 = np.array(np.random.randint(1, 10, size = 16)) # Note that we now have 16 elements\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([9, 9, 3, 9, 2, 9, 1, 5, 3, 1, 9, 4, 8, 2, 4, 3])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iUhth0QV8Rpt" + }, + "source": [ + "Reshaping..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9D1y7uD88Qip", + "outputId": "e7d22bcd-c10f-4ea3-e41b-03f6f98a054f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 151 + } + }, + "source": [ + "a_numeros1.reshape(-1, 2) # The value \"-1\" in the row position asks NumPy to compute the number of rows automatically." + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[9, 9],\n", + " [3, 9],\n", + " [2, 9],\n", + " [1, 5],\n", + " [3, 1],\n", + " [9, 4],\n", + " [8, 2],\n", + " [4, 3]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ALh-sq7DMnN5", + "outputId": "db373349-7910-4f1f-93f3-8ac8f67da8b8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 151 + } + }, + "source": [ + "# OR --> In this case, we reshape the array into 8 rows and let NumPy compute the 2 columns\n", + "a_numeros1.reshape(8, -1)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[9, 9],\n", + " [3, 9],\n", + " [2, 9],\n", + " [1, 5],\n", + " [3, 1],\n", + " [9, 4],\n", + " [8, 2],\n", + " [4, 3]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yvTnrszn8Yk0" + }, + "source": [ + "Why did it work this time?" 
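+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One way to check the answer: `reshape(-1, 2)` can only infer the number of rows when the total number of elements is a multiple of 2. The cell below is a small sketch that is not part of the original notebook; the name `a_exemplo` is an illustrative assumption." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "\n", + "# Hypothetical illustration: compare an odd and an even number of elements\n", + "for i_n in [15, 16]:\n", + "    a_exemplo = np.arange(i_n)\n", + "    if a_exemplo.size % 2 == 0:\n", + "        # An even size can be split into rows of 2 columns\n", + "        print(f'{i_n} elements --> shape {a_exemplo.reshape(-1, 2).shape}')\n", + "    else:\n", + "        # An odd size leaves one element over, so reshape(-1, 2) would raise ValueError\n", + "        print(f'{i_n} elements --> cannot be arranged in 2 columns')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With 16 elements the inferred shape is (8, 2), which matches the 8 rows shown above."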
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LeQ9LqIE8baG" + }, + "source": [ + "## Last reshape example\n", + "> Consider the following array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OQOC9iiN8hZT" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.random.randn(2, 3)\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cvce8qBl9Cvq" + }, + "source": [ + "We now want to turn it into an array with 3 rows and 2 columns." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QDDsYoVt9Klz" + }, + "source": [ + "a_numeros1.reshape(-1, 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AdwU5ygt9Svq" + }, + "source": [ + "It could also be..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5uBeokKc9Uo-" + }, + "source": [ + "a_numeros1.reshape(3, -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OeRBsobc9aKj" + }, + "source": [ + "And finally, it could also be..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MDt8UYYH9dBw" + }, + "source": [ + "a_numeros1.reshape(3, 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "91o5vycQfdKW" + }, + "source": [ + "___\n", + "# **Transpose**\n", + "* The array a_numeros2 looks like this:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RsZwyuhoffjb" + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "A3MzTVoGfiyO" + }, + "source": [ + "# The transpose of the array a_numeros2 is given by:\n", + "a_numeros2.T" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ij-ZW5IyzXIb" + }, + "source": [ + "In other words, rows became columns. OK?" 
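+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because a_numeros2 is square, its shape does not change under transposition. A small sketch with a non-square array may make the swap of rows and columns easier to see. It is not part of the original notebook, and the names `a_exemplo` and `a_transposta` are illustrative assumptions." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "\n", + "# Hypothetical illustration: transpose an array with 2 rows and 3 columns\n", + "a_exemplo = np.array([[1, 2, 3], [4, 5, 6]])\n", + "a_transposta = a_exemplo.T\n", + "\n", + "print(a_exemplo.shape, a_transposta.shape)    # (2, 3) (3, 2)\n", + "print(a_exemplo[0, 2] == a_transposta[2, 0])  # True" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Element [i, j] of the original array becomes element [j, i] of the transpose."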
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qLy6ajgpt3lU" + }, + "source": [ + "# **Inverse of a square matrix**\n", + "> If a matrix is non-singular, then its inverse exists.\n", + "\n", + "* If the determinant of a matrix is not equal to 0, then the matrix is non-singular." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-u7jRq34t9_x" + }, + "source": [ + "import numpy as np\n", + "\n", + "a_numeros1 = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])\n", + "a_numeros2 = np.array([[6, 2], [5, 3]])\n", + "a_numeros3 = np.array([[1, 3, 5],[2, 5, 1],[2, 3, 8]])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7zmHHWWlvaYB" + }, + "source": [ + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3fHKyhOJvcak" + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vQG7yyfjwLg9" + }, + "source": [ + "a_numeros3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qa2Yre2rwgRk" + }, + "source": [ + "## Determinants of square matrices" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N6jwuC6twkyc" + }, + "source": [ + "np.linalg.det(a_numeros1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QSvViNwzwnhI" + }, + "source": [ + "np.linalg.det(a_numeros2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "o8jwsnccw5id" + }, + "source": [ + "np.linalg.det(a_numeros3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kkVaTgzgw_XJ" + }, + "source": [ + "Next, we compute the inverses of the matrices defined above..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b9FgWvTYvpik" + }, + "source": [ + "np.linalg.inv(a_numeros2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KsdEt1kIvsM_" + }, + "source": [ + "np.linalg.inv(a_numeros1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VA_F7_7kccpn" + }, + "source": [ + "Why is there no inverse for a_numeros1?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ANPBCnmVwOf4" + }, + "source": [ + "np.linalg.inv(a_numeros3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XAf9k1egxcdF" + }, + "source": [ + "# **Solving systems of linear equations**\n", + "> Consider the system of linear equations below:\n", + "\n", + "\\begin{equation}\n", + "x + 3y + 5z = 10\\\\\n", + "2x+ 5y + z = 8 \\\\\n", + "2x + 3y + 8z= 3\n", + "\\end{equation}\n", + "\n", + "That is, $Ax = b$. The solution of this system of equations is given by $x = A^{-1}b$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oNf5nqaLxhBY" + }, + "source": [ + "In other words, we only need to find the inverse of A and multiply it by b." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "omzC5dGA0btc" + }, + "source": [ + "A= np.array([[1, 3, 5], [2, 5, 1], [2, 3, 8]])\n", + "np.linalg.inv(A)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AiXI3oxB05iE" + }, + "source": [ + "Now we just multiply the inverse matrix $A^{-1}$ above by b. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XoGebKDa2Fcd" + }, + "source": [ + "A_Inv = np.linalg.inv(A)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sKaP0a1QZG-P" + }, + "source": [ + "b= np.array([10, 8, 3]).reshape(3, -1)\n", + "b" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3dAVq8dg19VI" + }, + "source": [ + "A_Inv.dot(b)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zso6hTnB17cm" + }, + "source": [ + "An easier way to do this is to use the expression below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ptQHIVll1E4P" + }, + "source": [ + "b= np.array([[10], [8], [3]])\n", + "b" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "X4VL8lyY1Xus" + }, + "source": [ + "np.linalg.solve(A, b)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fJKmwTS59-Bc" + }, + "source": [ + "# **Stacking arrays**\n", + "\n", + "## Example 1\n", + "\n", + "![Empilhar1](https://github.com/MathMachado/Materials/blob/master/Empilhar1.PNG?raw=true)\n", + "\n", + "## Example 2\n", + "\n", + "![Empilhar2](https://github.com/MathMachado/Materials/blob/master/Empilhar2.PNG?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rhPTt3EwXden" + }, + "source": [ + "## Generate the arrays for Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zEI-yBy3-E46" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.random.randn(5, 8)\n", + "\n", + "np.random.seed(19741120)\n", + "a_numeros2 = np.random.randn(8, 8)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UYsAqBRp--79" + }, + "source": [ + "## Method 1 - np.concatenate([A, B])" + ] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "HgO1ujvhObyE", + "outputId": "c40e7ed9-255b-4886-dddf-3b17f2b1be2f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632],\n", + " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n", + " -0.33923852, 0.43613107, 0.59135489],\n", + " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n", + " -0.90712825, -1.02291108, -1.36445713],\n", + " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n", + " -1.41492946, -0.2159062 , -1.16519474],\n", + " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n", + " 1.83494644, 0.34728874, -1.14671885]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2aQY_klZOeg9", + "outputId": "14eb3d9c-d0fc-4b6a-fe19-1790695c838f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 289 + } + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n", + " -0.75255725, -2.1529949 , -0.33017773],\n", + " [-1.10465731, 0.32889675, 0.01010198, -1.33213633, -0.33945805,\n", + " -0.01299007, 0.05342823, -0.18641201],\n", + " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n", + " -1.20536871, 1.20184886, 0.51160897],\n", + " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n", + " 2.29956497, 0.16657022, 0.71357415],\n", + " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n", + " 1.25326 , -0.37039248, 1.43855202],\n", + " [ 0.85646091, -0.11257239, 
-0.35400297, 0.94136671, -0.08696163,\n", + " -1.49000701, 0.00848666, 0.86705275],\n", + " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n", + " -0.04716069, -2.27337435, 0.95318738],\n", + " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n", + " -0.29760341, -0.73424207, -0.55703223]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 34 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bK70vaq8_KMH", + "outputId": "f6d400cf-4b54-4990-815b-052f5224aadd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 459 + } + }, + "source": [ + "np.concatenate([a_numeros1, a_numeros2], axis = 0) # axis=0 tells NumPy to stack the rows (one array on top of the other)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632],\n", + " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n", + " -0.33923852, 0.43613107, 0.59135489],\n", + " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n", + " -0.90712825, -1.02291108, -1.36445713],\n", + " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n", + " -1.41492946, -0.2159062 , -1.16519474],\n", + " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n", + " 1.83494644, 0.34728874, -1.14671885],\n", + " [-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n", + " -0.75255725, -2.1529949 , -0.33017773],\n", + " [-1.10465731, 0.32889675, 0.01010198, -1.33213633, -0.33945805,\n", + " -0.01299007, 0.05342823, -0.18641201],\n", + " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n", + " -1.20536871, 1.20184886, 0.51160897],\n", + " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n", + " 2.29956497, 0.16657022, 0.71357415],\n", + " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n", + " 1.25326 , 
-0.37039248, 1.43855202],\n", + " [ 0.85646091, -0.11257239, -0.35400297, 0.94136671, -0.08696163,\n", + " -1.49000701, 0.00848666, 0.86705275],\n", + " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n", + " -0.04716069, -2.27337435, 0.95318738],\n", + " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n", + " -0.29760341, -0.73424207, -0.55703223]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CpaXBkm8_BF8" + }, + "source": [ + "## Method 2 - np.r_[A, B]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3QnVUzAY_teZ", + "outputId": "e8adfd85-e760-40f5-d9ac-48353d24ccd2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 459 + } + }, + "source": [ + "np.r_[a_numeros1, a_numeros2]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632],\n", + " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n", + " -0.33923852, 0.43613107, 0.59135489],\n", + " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n", + " -0.90712825, -1.02291108, -1.36445713],\n", + " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n", + " -1.41492946, -0.2159062 , -1.16519474],\n", + " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n", + " 1.83494644, 0.34728874, -1.14671885],\n", + " [-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n", + " -0.75255725, -2.1529949 , -0.33017773],\n", + " [-1.10465731, 0.32889675, 0.01010198, -1.33213633, -0.33945805,\n", + " -0.01299007, 0.05342823, -0.18641201],\n", + " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n", + " -1.20536871, 1.20184886, 0.51160897],\n", + " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n", + " 
2.29956497, 0.16657022, 0.71357415],\n", + " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n", + " 1.25326 , -0.37039248, 1.43855202],\n", + " [ 0.85646091, -0.11257239, -0.35400297, 0.94136671, -0.08696163,\n", + " -1.49000701, 0.00848666, 0.86705275],\n", + " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n", + " -0.04716069, -2.27337435, 0.95318738],\n", + " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n", + " -0.29760341, -0.73424207, -0.55703223]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XmSPbDP6_20W" + }, + "source": [ + "**Note**: I prefer this method!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dzVKW_wX_Dzw" + }, + "source": [ + "## Method 3 - np.vstack([A, B]) = np.r_[A, B]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uL7lEN_mABID", + "outputId": "d1ea4d86-2cc1-4e2d-af72-b3a292ef15fd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 459 + } + }, + "source": [ + "np.vstack([a_numeros1, a_numeros2])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632],\n", + " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n", + " -0.33923852, 0.43613107, 0.59135489],\n", + " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n", + " -0.90712825, -1.02291108, -1.36445713],\n", + " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n", + " -1.41492946, -0.2159062 , -1.16519474],\n", + " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n", + " 1.83494644, 0.34728874, -1.14671885],\n", + " [-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n", + " -0.75255725, -2.1529949 , -0.33017773],\n", + " [-1.10465731, 0.32889675, 
0.01010198, -1.33213633, -0.33945805,\n", + " -0.01299007, 0.05342823, -0.18641201],\n", + " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n", + " -1.20536871, 1.20184886, 0.51160897],\n", + " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n", + " 2.29956497, 0.16657022, 0.71357415],\n", + " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n", + " 1.25326 , -0.37039248, 1.43855202],\n", + " [ 0.85646091, -0.11257239, -0.35400297, 0.94136671, -0.08696163,\n", + " -1.49000701, 0.00848666, 0.86705275],\n", + " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n", + " -0.04716069, -2.27337435, 0.95318738],\n", + " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n", + " -0.29760341, -0.73424207, -0.55703223]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "68icJ-2ZAdRj" + }, + "source": [ + "# Concatenating arrays\n", + "\n", + "## Example 1\n", + "\n", + "![Concatenar1](https://github.com/MathMachado/Materials/blob/master/Concatenar1.PNG?raw=true)\n", + "\n", + "## Example 2\n", + "\n", + "![Concatenar2](https://github.com/MathMachado/Materials/blob/master/Concatenar2.PNG?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OplgK9YoQi9o" + }, + "source": [ + "## Concatenate the elements of two arrays - np.c_[A, B]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lpdsbTEKQ9EY" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.random.randint(0, 10, 100).reshape(-1, 10)\n", + "a_numeros2 = np.random.randint(0, 2, 10).reshape(-1, 1)" + ], + "execution_count": 11, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JPxhGsaSSMk2", + "outputId": "700ec21b-ab45-4b58-d302-2a5861349678", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "a_numeros1" + ], + "execution_count": 12, + "outputs": [ 
+ { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[8, 8, 2, 8, 9, 1, 8, 0, 4, 2],\n", + " [0, 8, 9, 3, 7, 1, 3, 2, 9, 7],\n", + " [7, 9, 5, 6, 8, 7, 0, 9, 3, 9],\n", + " [3, 1, 8, 6, 3, 5, 4, 1, 2, 9],\n", + " [8, 6, 6, 1, 0, 9, 2, 0, 7, 5],\n", + " [5, 4, 4, 2, 7, 2, 7, 9, 3, 1],\n", + " [5, 0, 1, 2, 3, 8, 7, 5, 4, 0],\n", + " [5, 9, 6, 6, 1, 3, 6, 0, 4, 9],\n", + " [2, 1, 0, 9, 1, 4, 2, 9, 7, 9],\n", + " [5, 3, 7, 6, 3, 9, 8, 4, 3, 0]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9ZyUPfybTfej", + "outputId": "ac27a20e-1622-4cb9-d6f6-74ee467bdb72", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[1],\n", + " [0],\n", + " [0],\n", + " [0],\n", + " [0],\n", + " [1],\n", + " [0],\n", + " [0],\n", + " [0],\n", + " [1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nS1cPG3aRug1", + "outputId": "c70cf891-ae8f-445d-c271-c6b7f7da1738", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "# placing the array a_numeros2 next to a_numeros1, as an extra column.\n", + "np.c_[a_numeros1, a_numeros2]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[8, 8, 2, 8, 9, 1, 8, 0, 4, 2, 1],\n", + " [0, 8, 9, 3, 7, 1, 3, 2, 9, 7, 0],\n", + " [7, 9, 5, 6, 8, 7, 0, 9, 3, 9, 0],\n", + " [3, 1, 8, 6, 3, 5, 4, 1, 2, 9, 0],\n", + " [8, 6, 6, 1, 0, 9, 2, 0, 7, 5, 0],\n", + " [5, 4, 4, 2, 7, 2, 7, 9, 3, 1, 1],\n", + " [5, 0, 1, 2, 3, 8, 7, 5, 4, 0, 0],\n", + " [5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 0],\n", + " [2, 1, 0, 9, 1, 4, 2, 9, 7, 9, 0],\n", + " [5, 3, 7, 6, 3, 9, 8, 4, 3, 0, 1]])" + ] + }, + "metadata": { + 
"tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kIgU1YBw0OeM" + }, + "source": [ + "___\n", + "# **Selecting items that satisfy conditions**\n", + "> Consider the following array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e2pL5anBV0DI", + "outputId": "3f88ada5-6035-4a4c-ba3c-75c50e70535b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1 = np.arange(10, 0, -1)\n", + "a_numeros1" + ], + "execution_count": 15, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i9HuZZAfV302" + }, + "source": [ + "Select only the items > 7:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZCESvr7iXMkV" + }, + "source": [ + "## Using np.where()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BdrAQLHkTS-v", + "outputId": "18f6bed6-13f8-4c21-8e3c-cc7a2f016349", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O_ZBaWxfWA9o", + "outputId": "d0828963-049a-405e-d909-d1ddcd4571cf", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Indices of the array that satisfy the condition\n", + "l_indices = np.where(a_numeros1 > 7)\n", + "l_indices" + ], + "execution_count": 17, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(array([0, 1, 2]),)" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EdWlfPOZWPME" + }, + "source": [ + "**Note**: We captured the indices. To select the items themselves, just do:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tOxs3iYQWWxu", + "outputId": "83e88f0a-8dbf-434f-c5ca-f561119c7c28", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros2 = a_numeros1[l_indices]\n", + "a_numeros2" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PGsENqkaXRjh" + }, + "source": [ + "## Alternative: using []" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YbdRNk1WXTLT", + "outputId": "0788cf3d-33c0-4acf-9c91-83ba2d60a5c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1[a_numeros1 > 7]" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jijpzFxcSQC8" + }, + "source": [ + "It is worth breaking this solution down to better understand how it works:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rujhP2LQSWsq" + }, + "source": [ + "# First, evaluate the result of a_numeros1 > 7:" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FYZaBsasSb3N", + "outputId": "07162754-7d8e-4999-8313-3fdd20d6cb54", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_numeros1 > 7" + ], + "execution_count": 20, + "outputs": [ + { + "output_type": "execute_result", + "data": 
{ + "text/plain": [ + "array([ True, True, True, False, False, False, False, False, False,\n", + " False])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mvEof-UKaaVG", + "outputId": "cdd88df9-5224-420c-e653-047e02e779b4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1[a_numeros1 > 7]" + ], + "execution_count": 21, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nO4FiBmDUZOT", + "outputId": "53a4b692-3214-4c79-dabd-089ad82ed2d4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1" + ], + "execution_count": 22, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ci5lT9nmSfsX" + }, + "source": [ + "With this result, it becomes easy to understand how NumPy selects the elements: only the positions marked True are kept. Can you explain it?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1v5Lfin0GGKD" + }, + "source": [ + "# Replacing items based on conditions\n", + "> Replace the negative values in the array below with 0." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CLY_u0ePWdN7" + }, + "source": [ + "## Generate the example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NUANFy-fNXf5", + "outputId": "ed6284d2-bf8f-4c2b-8e25-5c9df5258e7d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.array(np.random.randint(0, 10, size = 100))\n", + "\n", + "# Random list of indices to be modified\n", + "np.random.seed(20111974)\n", + "l_indices= np.random.randint(0, 99, 9)\n", + "\n", + "for i in l_indices:\n", + " a_numeros1[i] = -1*a_numeros1[i]\n", + "\n", + "a_numeros2 = a_numeros1.copy()\n", + "a_numeros2" + ], + "execution_count": 23, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 8, 8, -2, 8, 9, 1, 8, 0, -4, 2, 0, 8, 9, 3, 7, 1, 3,\n", + " 2, 9, 7, 7, 9, 5, 6, 8, 7, 0, -9, 3, 9, 3, 1, 8, 6,\n", + " 3, 5, 4, 1, 2, 9, -8, 6, -6, 1, 0, 9, -2, 0, 7, 5, 5,\n", + " 4, 4, 2, 7, 2, 7, 9, 3, 1, -5, 0, 1, 2, 3, 8, 7, 5,\n", + " 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, -1, 0, 9, 1,\n", + " 4, 2, 9, -7, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 23 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dWVyI40uN2d2", + "outputId": "037f708d-ffad-462b-f7e2-5f0779d18d1f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Indices whose values were multiplied by -1:\n", + "l_indices" + ], + "execution_count": 24, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([60, 42, 40, 8, 27, 2, 46, 88, 81])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Whuu854OJDZ" + }, + "source": [ + "## Replace the negative values with 0" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sr268Rp8b-Se", + 
"outputId": "69b8f95c-8624-46db-fa98-3636db6d54d8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 221 + } + }, + "source": [ + "a_numeros2 < 0" + ], + "execution_count": 25, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([False, False, True, False, False, False, False, False, True,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " True, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, True, False, True, False, False,\n", + " False, True, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, True, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " True, False, False, False, False, False, False, True, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C-eKqPrfOQF6", + "outputId": "c685b6d3-bc94-483d-dc1a-bc5fd4cb2d3a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "a_numeros2[a_numeros2 < 0] = 0\n", + "a_numeros2" + ], + "execution_count": 26, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([8, 8, 0, 8, 9, 1, 8, 0, 0, 2, 0, 8, 9, 3, 7, 1, 3, 2, 9, 7, 7, 9,\n", + " 5, 6, 8, 7, 0, 0, 3, 9, 3, 1, 8, 6, 3, 5, 4, 1, 2, 9, 0, 6, 0, 1,\n", + " 0, 9, 0, 0, 7, 5, 5, 4, 4, 2, 7, 2, 7, 9, 3, 1, 0, 0, 1, 2, 3, 8,\n", + " 7, 5, 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, 0, 0, 9, 1, 4, 2, 9,\n", + " 0, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "markdown", + "metadata": 
{ + "id": "eDLM0_JSZlfB" + }, + "source": [ + "Note above that the negative values were replaced with 0, as intended." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AEHJ0rA3dHHU" + }, + "source": [ + "## Replace the negative values with 0 and the positive values with 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y32J8SRNZwRF", + "outputId": "7eb344d0-c969-4868-f349-ade25dce94d6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "a_numeros2 = a_numeros1.copy()\n", + "a_numeros2" + ], + "execution_count": 27, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 8, 8, -2, 8, 9, 1, 8, 0, -4, 2, 0, 8, 9, 3, 7, 1, 3,\n", + " 2, 9, 7, 7, 9, 5, 6, 8, 7, 0, -9, 3, 9, 3, 1, 8, 6,\n", + " 3, 5, 4, 1, 2, 9, -8, 6, -6, 1, 0, 9, -2, 0, 7, 5, 5,\n", + " 4, 4, 2, 7, 2, 7, 9, 3, 1, -5, 0, 1, 2, 3, 8, 7, 5,\n", + " 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, -1, 0, 9, 1,\n", + " 4, 2, 9, -7, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 27 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1bSD9Fs6P5wW", + "outputId": "cf6756c5-52e0-447b-cd1f-3fbc8c79e8f0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "a_numeros2 = np.where(a_numeros2 <= 0, 0, 1)\n", + "a_numeros2" + ], + "execution_count": 28, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", + " 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,\n", + " 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,\n", + " 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,\n", + " 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 28 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"i027scjl0qkm" + }, + "source": [ + "___\n", + "# Outliers\n", + "> Any point/observation that is unusual when compared with all the other points/observations." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UnDTqRnZHQ3W" + }, + "source": [ + "## Z-Score\n", + "\n", + "* The Z-Score can be used to detect outliers.\n", + "* It is the difference between a value and the sample mean, expressed as a number of standard deviations. \n", + "* If the z-score is less than -2.5 or greater than 2.5, the value lies in the extreme tails of the distribution (for normally distributed data, roughly 1.2% of the values in total, about 0.6% at each end). In practice, however, it is common to use 3 instead of 2.5.\n", + "\n", + "![Z_Score](https://github.com/MathMachado/Materials/blob/master/Z_Score.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N7gb2zhtd0uM" + }, + "source": [ + "## IQR Score\n", + "\n", + "* The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the 75th percentile (Q3) and the 25th percentile (Q1), that is, between the upper and lower quartiles: IQR = Q3 - Q1." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lMmWOKNvghI7" + }, + "source": [ + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z3VZdU8rICZA" + }, + "source": [ + "## Challenge\n", + "> Replace the outliers in the array with:\n", + "1. Q1-1.5\\*(Q3 - Q1), if the point < Q1-1.5\\*IQR\n", + "2. Q3+1.5\\*(Q3 - Q1), if the point > Q3+1.5\\*IQR" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DUw_a-MjWvBc" + }, + "source": [ + "### Challenge for you to solve\n", + "> Goal: Randomly simulate the salaries of 1,000 people following an N(1045; 100) distribution. 
\n", + "* Identify the outliers in the distribution we just simulated;\n", + "* What is the mean of the simulated distribution?\n", + "* What is its standard deviation?\n", + "* Plot a boxplot of the data distribution;\n", + "* How many people have a salary > Q3 + 1.5*(Q3-Q1)?\n", + "\n", + "Note: Use np.random.seed(20111974)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RL0Zb0fyDory", + "outputId": "12b72d2b-5a07-476c-b682-b43458dad537", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 442 + } + }, + "source": [ + "np.random.seed(19741120)\n", + "a_numeros1 = np.array(np.random.normal(100, 10, size = 100))\n", + "\n", + "# Random list of indices to be modified\n", + "np.random.seed(20111974)\n", + "l_indices = np.random.randint(0, 99, 10)\n", + "np.sort(l_indices)\n", + "\n", + "a_numeros1_copia = a_numeros1.copy()\n", + "for i in l_indices:\n", + " a_numeros1_copia[i] = 2*a_numeros1_copia[i]\n", + "\n", + "a_numeros1_copia" + ], + "execution_count": 29, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 92.26622483, 88.94525348, 202.01256141, 88.54282712,\n", + " 78.47337726, 92.47442751, 78.47005101, 96.69822268,\n", + " 177.90685389, 103.28896747, 100.10101983, 86.67863666,\n", + " 96.60541955, 99.87009928, 100.53428231, 98.13587989,\n", + " 103.94738054, 91.06457686, 94.93326767, 92.53390871,\n", + " 118.35863649, 87.94631286, 112.01848858, 105.1160897 ,\n", + " 94.30477141, 90.66561289, 97.50274717, 219.69742669,\n", + " 111.93333668, 122.99564969, 101.66570222, 107.13574148,\n", + " 95.47489218, 109.21639184, 107.3421263 , 121.78111913,\n", + " 99.43447875, 112.53259996, 96.29607519, 114.38552019,\n", + " 217.12921824, 98.87427612, 192.91994052, 109.41366709,\n", + " 99.13038373, 85.09992988, 200.16973311, 108.67052746,\n", + " 116.34090601, 113.63210628, 99.78246389, 95.46983552,\n", + " 96.28887641, 99.52839308, 77.2662565 , 109.53187379,\n", + " 107.10054804, 92.01167315, 96.83422097, 
84.16471762,\n", + " 192.44970327, 97.02396587, 92.65757933, 94.42967769,\n", + " 94.02193121, 96.78511833, 113.48163484, 113.01668438,\n", + " 105.02843445, 107.58835306, 110.94932036, 99.23942748,\n", + " 93.74530061, 98.11132421, 84.62965767, 101.61893629,\n", + " 97.35568508, 94.27312327, 105.55501746, 105.45183177,\n", + " 91.55201318, 207.3344281 , 103.85618579, 98.93911584,\n", + " 119.4216084 , 91.53859352, 79.86676575, 106.45458816,\n", + " 311.17594425, 106.44862258, 113.52153879, 111.68130043,\n", + " 100.4550085 , 93.80661302, 98.87736992, 122.53185044,\n", + " 112.21271855, 101.34656943, 104.99984125, 94.74563688])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 29 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZnmykyahLWX9", + "outputId": "3bbfbafd-8715-4f72-de0e-693f2dcd9887", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "# Some descriptive statistics:\n", + "f'Média: {np.mean(a_numeros1)}; Mediana: {np.median(a_numeros1)}; STD: {np.std(a_numeros1)}'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Média: 100.18633633362035; Mediana: 99.33695311387913; STD: 10.028092450008492'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 55 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ILhNe80xW5C6" + }, + "source": [ + "### Challenge solution" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U993i1GJg2hk", + "outputId": "86faf16c-d671-4ebc-b695-7dc204185f8d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + } + }, + "source": [ + "# Import the seaborn library:\n", + "import seaborn as sns\n", + "sns.boxplot(y = a_numeros1_copia)" + ], + "execution_count": 30, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] 
+ }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAADrCAYAAACSE9ZyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAOlUlEQVR4nO3db2xdd3nA8e8TOyMtbGvrWlHnRjPDnVB5sYAs1Ik3CGJwK0FB2lB5Qa5QpfCikCLxYoAqwaQiMWlQNZVWKaiImwnBqoFEGK1Vp2NCvADksq7pHxB3kKqxSmpuoRCldNh+9sInxTS+1/c6ds69P74fyfK9v3Nu8rhqvjk5Pvc4MhNJUll21T2AJGn7GXdJKpBxl6QCGXdJKpBxl6QCGXdJKtBo3QMAXH311Tk5OVn3GJI0VB555JFfZOb4RtsGIu6Tk5MsLCzUPYYkDZWIeLrTNk/LSFKBjLskFci4S1KBjLskFci4S120220OHz5Mu92uexSpL8Zd6qLZbHLy5EmOHTtW9yhSX4y71EG73WZubo7MZG5uzqN3DRXjLnXQbDZZXV0FYGVlxaN3DRXjLnVw4sQJlpeXAVheXmZ+fr7miaTeGXepgwMHDjA6uvYm7tHRUWZmZmqeSOqdcZc6aDQa7Nq19kdkZGSEgwcP1jyR1DvjLnUwNjbG7OwsEcHs7CxjY2N1jyT1bCBuHCYNqkajwalTpzxq19Ax7lIXY2NjHDlypO4xpL5telomIvZExA8i4n8i4omI+Mdq/bUR8f2IaEXEv0XEn1Trr6qet6rtkzv7JUiSXqmXc+4vAW/LzL8B9gOzEXED8E/AXZk5BfwSuLXa/1bgl9X6XdV+kqRLaNO455qz1dPd1UcCbwP+vVpvAu+pHt9cPafa/vaIiG2bWJK0qZ6ulomIkYh4FHgOmAf+F/hVZi5Xu5wGJqrHE8AzANX2F4ALLjOIiEMRsRARC0tLSxf3VUiS/kBPcc/MlczcD1wLvBl4/cX+xpl5NDOnM3N6fHzDHwEoSdqivq5zz8xfAd8G/ha4IiLOX21zLbBYPV4E9gFU2/8c8I5LknQJ9XK1zHhEXFE9vgyYAZ5iLfJ/V+3WAL5RPT5ePafa/p+Zmds5tCSpu16uc78GaEbECGt/Gdyfmf8REU8CX42IO4H/Bu6r9r8P+NeIaAHPA7fswNySpC42jXtmPga8cYP1n7J2/v2V678F/n5bppMkbYn3lpGkAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAm0a94jYFxHfjognI+KJiLi9Wv90RCxGxKPVx03rXvOJiGhFxI8j4p07+QVIki402sM+y8DHMvOHEfGnwCMRMV9tuysz/3n9zhFxPXAL8AbgL4ATEfHXmbmynYNLkjrb9Mg9M5/NzB9Wj38DPAVMdHnJzcBXM/OlzPwZ0ALevB3DSpJ609c594iYBN4IfL9a+nBEPBYRX4yIK6u1CeCZdS87zQZ/GUTEoYhYiIiFpaWlvgeXJHXWc9wj4jXA14CPZuavgXuB1wH7gWeBz/XzG2fm0cyczszp8fHxfl4qSdpET3GPiN2shf3Lmfl1gMw8k5krmbkKfIHfn3pZBPate/m11Zok6RLp5WqZAO4DnsrMz69bv2bdbu8FHq8eHwduiYhXRcRrgeuAH2zfyJKkzfRytcxbgA8AJyPi0Wr
tk8D7I2I/kMAp4EMAmflERNwPPMnalTa3eaWMJF1am8Y9M78LxAabHujyms8An7mIuSRJF8F3qEpSgYy7JBXIuEtSgYy7JBXIuEtSgYy7JBXIuEtSgYy7JBXIuEtSgYy7JBXIuEtSgYy7JBXIuEtSgYy7JBXIuEtSgYy7JBXIuEtdtNttDh8+TLvdrnsUqS/GXeqi2Wxy8uRJjh07VvcoUl+Mu9RBu93mwQcfJDN58MEHPXrXUDHuUgfNZpPl5WUAfve733n0rqFi3KUO5ufnyUwAMpOHHnqo5omk3hl3qYO9e/d2fS4NMuMudXDmzJmuz6VBZtylDmZmZogIACKCd7zjHTVPJPXOuEsdNBoNdu/eDcDu3bs5ePBgzRNJvTPuUgdjY2PMzs4SEdx4442MjY3VPZLUM+MudfHud7+byy+/nHe96111jyL1xbhLXRw/fpxz587xzW9+s+5RpL4Yd6mDdrvN3Nwcmcnc3JzvUNVQMe5SB81mk9XVVQBWVlZ8h6qGinGXOjhx4sTLtx9YXl5mfn6+5omk3hl3qYMDBw4wOjoKwOjoKDMzMzVPJPXOuEsdNBqNl0/LrK6uep27hsqmcY+IfRHx7Yh4MiKeiIjbq/WrImI+In5Sfb6yWo+IOBIRrYh4LCLetNNfhLRTzt84TBo2vRy5LwMfy8zrgRuA2yLieuDjwMOZeR3wcPUc4EbguurjEHDvtk8tXQLNZvPluK+urvoNVQ2VTeOemc9m5g+rx78BngImgJuBZrVbE3hP9fhm4Fiu+R5wRURcs+2TSzvsld9A9Za/GiZ9nXOPiEngjcD3gb2Z+Wy16efA+fuhTgDPrHvZ6WpNGire8lfDrOe4R8RrgK8BH83MX6/flmv/du3r5GREHIqIhYhYWFpa6uel0iXhLX81zHqKe0TsZi3sX87Mr1fLZ86fbqk+P1etLwL71r382mrtD2Tm0cyczszp8fHxrc4v7Rhv+ath1svVMgHcBzyVmZ9ft+k40KgeN4BvrFs/WF01cwPwwrrTN9LQaDQajIyMADAyMuKlkBoqoz3s8xbgA8DJiHi0Wvsk8Fng/oi4FXgaeF+17QHgJqAFnAM+uK0TS5fI2NgYe/bs4ezZs+zZs8db/mqobBr3zPwuEB02v32D/RO47SLnkmrXarU4e/YsAGfPnqXVajE1NVXzVFJvfIeq1MGdd97Z9bk0yIy71MGpU6e6PpcGmXGXOpicnOz6XBpkxl3q4I477uj6XBpkxl3qYGpq6uWj9cnJSb+ZqqFi3KUu7rjjDl796ld71K6h08t17tIframpKb71rW/VPYbUN4/cJalAxl2SCmTcpS7a7TaHDx+m3W7XPYrUF+MuddFsNjl58qQ/hUlDx7hLHbTbbebm5shM5ubmPHrXUDHuUgfNZpPV1VUAVlZWPHrXUDHuUgcnTpxgeXkZgOXl5Qt+pqo0yIy71MGBAwcYHV17K8jo6CgzMzM1TyT1zrhLHTQaDXbtWvsj4k9i0rAx7lIHY2NjzM7OEhHMzs76k5g0VLz9gNRFo9Hg1KlTHrVr6Bh3qYuxsTGOHDlS9xhS3zwtI0kFMu6SVCDjLkkFMu6SVCDjLkkFMu6SVCDjLkkFMu6SVCDjLkkFMu6SVCDjLkkFMu6SVCDjLkkFMu6SVKBN4x4RX4yI5yLi8XVrn46IxYh4tPq4ad22T0REKyJ+HBHv3KnBJUmd9XLk/iVgdoP1uzJzf/XxAEBEXA/cAryhes2/RMTIdg0rSerNpnHPzO8Az/f4690MfDUzX8rMnwEt4M0XMZ8kaQsu5pz7hyPiseq0zZXV2gTwzLp9TldrkqRLaKtxvxd4HbAfeBb4XL+/QEQcioiFiFhYWlra4hiSpI1sKe6ZeSYzVzJzFfgCvz/1sgjsW7frtdXaRr/G0cyczszp8fHxrYwhSepgS3GPiGvWPX0vcP5KmuPALRHxqoh4LXAd8IOLG1GS1K/RzXaIiK8AbwWujojTwKeAt0b
EfiCBU8CHADLziYi4H3gSWAZuy8yVnRldktRJZGbdMzA9PZ0LCwt1jyFJQyUiHsnM6Y22+Q5VSSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSqQcZekAhl3SSrQpj9mT3987rnnHlqtVt1jDITFxbWf7z4xMVHzJINhamqKj3zkI3WPoR4Yd6mLF198se4RpC0x7rqAR2a/d/vttwNw99131zyJ1B/PuUtSgYy7JBXIuEtSgYy7JBXIuEtSgYy7JBXIuEtSgbzOveK7MrWR8/9PnL/eXTpv0N+ta9wrrVaLRx9/ipXLr6p7FA2QXf+XADzy0zM1T6JBMnLu+bpH2JRxX2fl8qt48fU31T2GpAF32Y8eqHuETXnOXZIKZNwlqUDGXZIKZNwlqUCbxj0ivhgRz0XE4+vWroqI+Yj4SfX5ymo9IuJIRLQi4rGIeNNODi9J2lgvR+5fAmZfsfZx4OHMvA54uHoOcCNwXfVxCLh3e8aUJPVj00shM/M7ETH5iuWbgbdWj5vAfwH/UK0fy8wEvhcRV0TENZn57HYNvFMWFxcZOffCUFziJKleI+faLC4u1z1GV1s95753XbB/DuytHk8Az6zb73S1doGIOBQRCxGxsLS0tMUxJEkbueg3MWVmRkRu4XVHgaMA09PTfb9+u01MTPDzl0Z9E5OkTV32oweYmNi7+Y412uqR+5mIuAag+vxctb4I7Fu337XVmiTpEtpq3I8DjepxA/jGuvWD1VUzNwAvDMP5dkkqzaanZSLiK6x98/TqiDgNfAr4LHB/RNwKPA28r9r9AeAmoAWcAz64AzPvmJFzz/sNVf2BXb/9NQCre/6s5kk0SNZuHDbYp2V6uVrm/R02vX2DfRO47WKHqsPU1FTdI2gAtVq/AWDqrwb7D7Iutb0D3wzvClkZ5Psyqz7n7+N+99131zyJ1B9vPyBJBTLuklQg4y5JBTLuklQg4y5JBTLuklQg4y5JBTLuklQg4y5JBTLuklQg4y5JBfLeMrrAPffcQ6vVqnuMgXD+v8P5e8z8sZuamvI+TEPCuEtdXHbZZXWPIG2JcdcFPDKThp/n3CWpQMZdkgpk3CWpQMZdkgpk3CWpQMZdkgpk3CWpQMZdkgoUmVn3DETEEvB03XNIHVwN/KLuIaQN/GVmjm+0YSDiLg2yiFjIzOm655D64WkZSSqQcZekAhl3aXNH6x5A6pfn3CWpQB65S1KBjLskFci4S1KBjLskFci4S1KB/h9mcr6bwEF81QAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VtenLK1uK1Pi" + }, + "source": [ + "Consegue identificar os outliers do array?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e3sHuGVGFBdW" + }, + "source": [ + "## Objetivo\n", + "> Substituir os outliers por mediana. \n", + "\n", + "* Como fazer isso?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSegPNKCI-dS" + }, + "source": [ + "### Siga os passos a seguir\n", + "1. Calcule estatísticas descritivas antes das transformações par avaliar o impacto;\n", + " * Calcule média, mediana e desvio-padrão dos dados originais;\n", + "2. Calcule os valores a seguir:\n", + " * Q1, Q3\n", + " * IQR = Q3-Q1\n", + " * lim_inferior = Q1-1.5\\*IQR\n", + " * lim_superior = Q3+1.5\\*IQR\n", + "3. Proceda à substituição:\n", + " * Se a_numeros1_copia[i] < lim_inferior então a_numeros1_copia[i]= Mediana\n", + " * Se a_numeros1_copia[i] > lim_superior então a_numeros1_copia[i]= Mediana\n", + "4. Calcule as estatísticas descritivas após as substituições e compare com os valores antes das transformações." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9DQ7YnWaFn4v" + }, + "source": [ + "### Minha solução\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RBXJbTeGLC7Q" + }, + "source": [ + "1. 
Descriptive statistics before the transformations:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QueKYn7MLG12", + "outputId": "c410c7bb-76f4-4f8a-de71-1bbe0dfa6a21", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "# Some descriptive statistics:\n", + "f'Média: {np.mean(a_numeros1_copia)}; Mediana: {np.median(a_numeros1_copia)}; STD: {np.std(a_numeros1_copia)}'" + ], + "execution_count": 31, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Média: 110.56825524166379; Mediana: 99.98555955648851; STD: 35.484921348581274'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oOBJ8INWL5fo" + }, + "source": [ + "Note how far these statistics have drifted from those of the original data (mean of about 100.19 and standard deviation of about 10.03)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MX-fJeh2MBTD" + }, + "source": [ + "2. 
Compute Q1, Q3, the IQR and the outlier limits" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JlsPiQeGMGeU" + }, + "source": [ + "# q is given as a list, so each result is a 1-element array (hence the [0] indexing in the replacement step)\n", + "Q1= np.percentile(a_numeros1_copia, q = [25])\n", + "Q3= np.percentile(a_numeros1_copia, q = [75])\n", + "Q2= np.percentile(a_numeros1_copia, q = [50])\n", + "IQR = Q3-Q1\n", + "lim_inferior = Q1-1.5*IQR\n", + "lim_superior = Q3+1.5*IQR" + ], + "execution_count": 32, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VF2NJ3rCeI1_", + "outputId": "7b94b22f-a175-4f06-9ac1-1872aa52cfd8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "f'Q1: {Q1}; Q3: {Q3}; lim_inferior: {lim_inferior}; lim_superior: {lim_superior}'" + ], + "execution_count": 33, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Q1: [94.39845112]; Q3: [111.13231538]; lim_inferior: [69.29765473]; lim_superior: [136.23311177]'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JjnwJ7HwMxcl" + }, + "source": [ + "3. 
Replace (working on a_numeros2, a copy of a_numeros1_copia)\n", + "* If a_numeros2[i] < lim_inferior then a_numeros2[i] = median (Q2)\n", + "* If a_numeros2[i] > lim_superior then a_numeros2[i] = median (Q2)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcAn-IwVfbcI" + }, + "source": [ + "a_numeros2 = a_numeros1_copia.copy()" + ], + "execution_count": 35, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J3SSE45oM9oh", + "outputId": "b6fa4f55-88ac-4864-d4b5-08d642438aed", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 442 + } + }, + "source": [ + "a_numeros2[a_numeros2 < lim_inferior[0]] = Q2[0]\n", + "a_numeros2[a_numeros2 > lim_superior[0]] = Q2[0]\n", + "a_numeros2" + ], + "execution_count": 36, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 92.26622483, 88.94525348, 99.98555956, 88.54282712,\n", + " 78.47337726, 92.47442751, 78.47005101, 96.69822268,\n", + " 99.98555956, 103.28896747, 100.10101983, 86.67863666,\n", + " 96.60541955, 99.87009928, 100.53428231, 98.13587989,\n", + " 103.94738054, 91.06457686, 94.93326767, 92.53390871,\n", + " 118.35863649, 87.94631286, 112.01848858, 105.1160897 ,\n", + " 94.30477141, 90.66561289, 97.50274717, 99.98555956,\n", + " 111.93333668, 122.99564969, 101.66570222, 107.13574148,\n", + " 95.47489218, 109.21639184, 107.3421263 , 121.78111913,\n", + " 99.43447875, 112.53259996, 96.29607519, 114.38552019,\n", + " 99.98555956, 98.87427612, 99.98555956, 109.41366709,\n", + " 99.13038373, 85.09992988, 99.98555956, 108.67052746,\n", + " 116.34090601, 113.63210628, 99.78246389, 95.46983552,\n", + " 96.28887641, 99.52839308, 77.2662565 , 109.53187379,\n", + " 107.10054804, 92.01167315, 96.83422097, 84.16471762,\n", + " 99.98555956, 97.02396587, 92.65757933, 94.42967769,\n", + " 94.02193121, 96.78511833, 113.48163484, 113.01668438,\n", + " 105.02843445, 107.58835306, 110.94932036, 99.23942748,\n", + " 93.74530061, 98.11132421, 84.62965767, 
101.61893629,\n", + " 97.35568508, 94.27312327, 
105.55501746, 105.45183177,\n", + " 91.55201318, 99.98555956, 103.85618579, 98.93911584,\n", + " 119.4216084 , 91.53859352, 79.86676575, 106.45458816,\n", + " 99.98555956, 106.44862258, 113.52153879, 111.68130043,\n", + " 100.4550085 , 93.80661302, 98.87736992, 122.53185044,\n", + " 112.21271855, 101.34656943, 104.99984125, 94.74563688])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VEGFio0Nfj7O" + }, + "source": [ + "4. Descriptive statistics to assess the impact:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gX1LZHFqfjFQ", + "outputId": "b98ed302-57bb-4f81-9a70-e2c695f0c029", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "# Some descriptive statistics:\n", + "f'Média: {np.mean(a_numeros2)}; Mediana: {np.median(a_numeros2)}; STD: {np.std(a_numeros2)}'" + ], + "execution_count": 37, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Média: 100.35899750691053; Mediana: 99.92782941793924; STD: 9.602142763141016'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-xnguZ7XgyvK", + "outputId": "992beaa4-f2e9-4462-c344-dafb124a4a27", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + } + }, + "source": [ + "# Import the seaborn library:\n", + "import seaborn as sns\n", + "sns.boxplot(y = a_numeros2)" + ], + "execution_count": 38, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 38 + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXcAAADrCAYAAACSE9ZyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAK5klEQVR4nO3df4jcd17H8eerCfYSQfoja6jbaiobOO+Kgi7lQE6KFRrLYcuhR4tgvSsGoayr/qFX/CN/Fe5QkBg4IdDSCtqzqEcLVrlSxP7VHhspZ9qmd8OVXrOkzd7lrgqJ9ZJ7+0fmcNlsMrs7m8703ecDws585jO7b0L77JfPzHRTVUiSerlm0gNIkrafcZekhoy7JDVk3CWpIeMuSQ0Zd0lqaOekBwDYs2dP7du3b9JjSNIHyrFjx75TVTPrPTYVcd+3bx9LS0uTHkOSPlCSvHm5xzyWkaSGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhqaive5a7ocOXKEwWAw6TGmwvLyMgCzs7MTnmQ6zM3NsbCwMOkxtAHGXbqCc+fOTXoEaUuMuy7hldn/W1xcBODw4cMTnkTaHM/cJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIZGxj3JY0lOJzm+au3Pk5xI8vUkX0ly3arHHk4ySPJ6kruu1uCSpMvbyJX748CBNWvPAbdV1c8D3wAeBkjyMeA+4OPD53wpyY5tm1aStCEj415VLwBn1qx9tarOD+++CNw8vH0P8OWqeq+q3gAGwO3bOK8kaQO248z9c8C/DG/PAm+teuzkcE2S9D4aK+5J/gw4D/ztFp57MMlSkqWVlZVxxpAkrbHluCf5XeBTwG9XVQ2Xl4FbVm27ebh2iao6WlXzVTU/MzOz1TEkSevYUtyTHAD+BPiNqjq76qFngPuSXJvkVmA/8LXxx5QkbcbI/+VvkieBO4A9SU4Ch7j47phrgeeSALxYVb9fVa8keQp4lYvHNQ9V1YWrNbwkaX0j415V96+z/OgV9j8CPDLOUJKk8fgJVUlqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQyPjnuSxJKeTHF+19ltJXknywyTza/Y/nGSQ5PUkd12NoSVJV7aRK/fHgQNr1o4DnwZeWL2Y5GPAfcDHh8/5UpId448pSdqMkXGvqheAM2vWXquq19fZfg/w5ap6r6reAAbA7dsyqSRpw7b7zH0WeGvV/ZPDNUnS+2hiL6gmOZhkKcnSysrKpMaQpJa2O+7LwC2r7t88XLtEVR2tqvmqmp+ZmdnmMSTpw2274/4McF+Sa5PcCuwHvrbNP0OSNMLOURuSPAncAexJchI4xMUXWI8AM8A/J3m5qu6qqleSPAW8CpwHHqqqC1dteknSukbGvaruv8xDX7nM/keAR8YZSpI0Hj+hKkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ2N/ITqh8WRI0cYDAaTHkNT5kf/TCwuLk54Ek2bubk5FhYWJj3GZRn3ocFgwMvHX+PC7hsmPYqmyDX/WwAc+9Y7E55E02TH2TOjN02YcV/lwu4bOPfRuyc9hqQpt+vEs5MeYSTP3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktTQyLgneSzJ6STHV63dkOS5JN8cfr1+uJ4kf5VkkOTrSX7xag4vSVrfRq7cHwcOrFn7PPB8Ve0Hnh/eB/h1YP/
wz0Hgr7dnTEnSZoyMe1W9AJxZs3wP8MTw9hPAvavW/6YuehG4LslN2zWsJGljdm7xeXur6tTw9tvA3uHtWeCtVftODtdOMeWWl5fZcfZddp14dtKjSJpyO85+l+Xl85Me44rGfkG1qgqozT4vycEkS0mWVlZWxh1DkrTKVq/c30lyU1WdGh67nB6uLwO3rNp383DtElV1FDgKMD8/v+n/OGy32dlZ3n5vJ+c+evekR5E05XadeJbZ2b2jN07QVq/cnwEeGN5+AHh61frvDN818wng3VXHN5Kk98nIK/ckTwJ3AHuSnAQOAV8AnkryIPAm8Jnh9meBu4EBcBb47FWYWZI0wsi4V9X9l3noznX2FvDQuENJksbjJ1QlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDY38NXsfJjvOnmHXiWcnPYamyDX/818A/PAjPzHhSTRNdpw9A+yd9BhXZNyH5ubmJj2CptBg8N8AzP3sdP+LrPfb3qlvhnEfWlhYmPQImkKLi4sAHD58eMKTSJvjmbskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGhor7kkWkxxP8kqSPxyu3ZDkuSTfHH69fntGlSRt1JbjnuQ24PeA24FfAD6VZA74PPB8Ve0Hnh/elyS9j8a5cv854KWqOltV54F/Bz4N3AM8MdzzBHDveCNKkjZrnLgfBz6Z5MYku4G7gVuAvVV1arjnbS7ziwaTHEyylGRpZWVljDEkSWttOe5V9RrwReCrwL8CLwMX1uwpoC7z/KNVNV9V8zMzM1sdQ5K0jrFeUK2qR6vql6rqV4DvAd8A3klyE8Dw6+nxx5Qkbca475b5yeHXn+biefvfAc8ADwy3PAA8Pc7PkCRt3s4xn/+PSW4EfgA8VFXfT/IF4KkkDwJvAp8Zd0hJ0uaMFfeq+uQ6a98F7hzn+0qSxuMnVCWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDY0V9yR/lOSVJMeTPJnkI0luTfJSkkGSv0/yY9s1rCRpY7Yc9ySzwB8A81V1G7ADuA/4IvCXVTUHfA94cDsGlSRt3LjHMjuBXUl2AruBU8CvAv8wfPwJ4N4xf4YkaZO2HPeqWgb+Avg2F6P+LnAM+H5VnR9uOwnMjjukJGlzxjmWuR64B7gV+Cngx4EDm3j+wSRLSZZWVla2OoYkaR3jHMv8GvBGVa1U1Q+AfwJ+GbhueEwDcDOwvN6Tq+poVc1X1fzMzMwYY0iS1hon7t8GPpFkd5IAdwKvAv8G/OZwzwPA0+ONKEnarHHO3F/i4gun/wH85/B7HQX+FPjjJAPgRuDRbZhTkrQJO0dvubyqOgQcWrP8LeD2cb6vJGk8fkJVkhoy7pLUkHGXpIaMuyQ1ZNwlqaGx3i2jno4cOcJgMJj0GFPhR38Pi4uLE55kOszNzbGwsDDpMbQBxl26gl27dk16BGlLjLsu4ZWZ9MHnmbskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIZSVZOegSQrwJuTnkO6jD3AdyY9hLSOn6mqdX8J9VTEXZpmSZaqan7Sc0ib4bGMJDVk3CWpIeMujXZ00gNIm+WZuyQ15JW7JDVk3CWpIeMuSQ0Zd0lqyLhLUkP/B55vMcLXeCEVAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uEPFcBjFhETQ" + }, + "source": [ + "Como podem ver, os outliers desapareceram, como queríamos." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tHfzjW_ymKuR" + }, + "source": [ + "___\n", + "# **Valores únicos**\n", + "> Considere o array a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HzmQgWZVmUUD", + "outputId": "5ceea152-492f-4cbc-e67b-8a1c9a3aec84", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "import numpy as np\n", + "np.random.seed(20111974)\n", + "a_numeros1 = np.random.randint(0, 100, 100)\n", + "a_numeros1" + ], + "execution_count": 39, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([60, 42, 40, 8, 27, 2, 46, 88, 81, 88, 80, 13, 30, 82, 96, 63, 79,\n", + " 91, 72, 13, 89, 67, 93, 33, 99, 73, 77, 42, 55, 45, 41, 21, 22, 8,\n", + " 62, 10, 0, 94, 15, 9, 67, 89, 35, 42, 97, 93, 8, 83, 26, 5, 68,\n", + " 90, 74, 57, 40, 22, 45, 6, 81, 95, 0, 25, 50, 80, 76, 29, 7, 21,\n", + " 5, 95, 52, 93, 31, 78, 61, 50, 50, 7, 41, 3, 33, 47, 5, 16, 33,\n", + " 19, 92, 60, 56, 55, 53, 28, 84, 16, 27, 85, 22, 38, 49, 90])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dm9ky1F1mrNA" + }, + "source": [ + "Quem são os valores únicos do array?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "G-LPRqc-mS5j", + "outputId": "68a139d6-59fe-46ce-a36f-dcf2d50c9cef", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "np.unique(a_numeros1)" + ], + "execution_count": 40, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 0, 2, 3, 5, 6, 7, 8, 9, 10, 13, 15, 16, 19, 21, 22, 25, 26,\n", + " 27, 28, 29, 30, 31, 33, 35, 38, 40, 41, 42, 45, 46, 47, 49, 50, 52,\n", + " 53, 55, 56, 57, 60, 61, 62, 63, 67, 68, 72, 73, 74, 76, 77, 78, 79,\n", + " 80, 81, 82, 83, 84, 85, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uXZZoTd6nMuq" + }, + "source": [ + "___\n", + "# **Difference between two arrays**\n", + "> The result is an array with the **unique values of A that are not in B**. In set theory we write $A - B = A - (A \\cap B)$.\n", + "\n", + "![Difference](https://github.com/MathMachado/Materials/blob/master/set_Difference.PNG?raw=true)\n", + "\n", + "Source: [Python Set](https://www.learnbyexample.org/python-set/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uW6i3m9q1ZNs" + }, + "source": [ + "\n", + "* Let's see how this works in practice:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vw05sfe22mfk" + }, + "source": [ + "## Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Qqw2do90nQ7k" + }, + "source": [ + "a_numeros1 = np.array([0, 1, 2, 4, 5, 7, 8, 8]) # a_numeros2 (below) holds the values to remove from a_numeros1. 
Note that 3 (like 6) does not belong to a_numeros1.\n", + "a_numeros2 = np.array([1, 6, 7, 3])" + ], + "execution_count": 41, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zXJ00pOMorM-", + "outputId": "48b60273-7035-4083-811a-0672427ebd12", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.setdiff1d(a_numeros1, a_numeros2)" + ], + "execution_count": 42, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 2, 4, 5, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 42 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8GXZNgjfo8lO" + }, + "source": [ + "Note that the result contains the unique elements of a_numeros1 that do not belong to a_numeros2. But what about the 3?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aJSu6VKb2oc_" + }, + "source": [ + "## Example 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N1wahElXTqoB", + "outputId": "95a40bbf-533b-4e5e-cec3-4365a591861a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1 = np.arange(10)\n", + "a_numeros1" + ], + "execution_count": 43, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 43 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nxDpCMg7T7Rj", + "outputId": "96e4285c-0f00-4153-d0aa-d2e766032786", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros2 = np.array([1, 5, 7])\n", + "a_numeros2" + ], + "execution_count": 44, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 5, 7])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 44 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"3LU3qYyiUXqm", + 
"outputId": "cc0bc36f-a91e-449a-d0f2-14a381049097", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.setdiff1d(a_numeros1, a_numeros2)" + ], + "execution_count": 45, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 2, 3, 4, 6, 8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 45 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mzZEytrRUioU" + }, + "source": [ + "Observe que os elementos de a_numeros2 foram deletados de a_numeros1. Ok?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gJRcoVRUnaY9" + }, + "source": [ + "___\n", + "# Diferença Simétrica\n", + "* Em teoria de conjuntos, chamamos de Diferença Simétrica e escrevemos $(A \\cup B)- (A \\cap B)$.\n", + "\n", + "![DifferenceSymetric](https://github.com/MathMachado/Materials/blob/master/set_DifferenceSymetric.PNG?raw=true)\n", + "\n", + "Fonte: [Python Set](https://www.learnbyexample.org/python-set/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2Uzzm85Kup3H" + }, + "source": [ + "* Vamos ver como isso funciona na prática:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1z5wZ8VwpsWN" + }, + "source": [ + "import numpy as np\n", + "a_numeros1 = np.array([0, 1, 2, 4, 5, 7, 8]) # Observe que [1, 4, 7] pertencem a a_numeros1, mas 3, não. 
Therefore:\n", + "a_numeros2 = np.array([1, 4, 7, 3])" + ], + "execution_count": 46, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Tqd_9XO5p7bo", + "outputId": "4fa4d87e-1858-437c-a9f1-c589a05fe24c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.setxor1d(a_numeros1, a_numeros2)" + ], + "execution_count": 47, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 2, 3, 5, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 47 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_meurG3mqS5Y" + }, + "source": [ + "How do we explain or interpret this result? The values 0, 2, 5 and 8 appear only in a_numeros1 and 3 appears only in a_numeros2, while 1, 4 and 7 appear in both arrays and are therefore dropped." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kc8JoKe2nj2n" + }, + "source": [ + "___\n", + "# **Union of two arrays**\n", + "> Returns the **unique** values of the two arrays. In set theory, we write:\n", + "\n", + "$$A \\cup B$$\n", + "\n", + "![Union](https://github.com/MathMachado/Materials/blob/master/set_Union.PNG?raw=true)\n", + "\n", + "Source: [Python Set](https://www.learnbyexample.org/python-set/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1LZxorw2p2mg" + }, + "source": [ + "a_numeros1 = np.array([0, 1, 2, 4, 5, 7, 8, 8])\n", + "\n", + "# Note that [1, 4, 7] belong to a_numeros1, but 3 does not. 
Therefore:\n", + "a_numeros2 = np.array([1, 4, 7, 3])" + ], + "execution_count": 48, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "COsZEmSwuY5L", + "outputId": "de10c400-956a-417c-82b6-435eb67d811b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.union1d(a_numeros1, a_numeros2)" + ], + "execution_count": 49, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 5, 7, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 49 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b53bR-GYRu_3" + }, + "source": [ + "___\n", + "# **Select the items common to arrays X and Y**\n", + "* In set theory this is called the intersection, written $X \\cap Y$ (here X = a_numeros1 and Y = a_numeros2).\n", + "\n", + "![Intersection](https://github.com/MathMachado/Materials/blob/master/set_Intersection.PNG?raw=true)\n", + "\n", + "Source: [Python Set](https://www.learnbyexample.org/python-set/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n2ec2tqqR1Gw" + }, + "source": [ + "* Consider the following arrays:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rXVQQvBqR4J-", + "outputId": "2cd7bc3f-a2ce-4cc2-c4b3-f2c2ba91ce74", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1 = np.arange(10)\n", + "a_numeros1" + ], + "execution_count": 50, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 50 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pZTHhHxGSRfB", + "outputId": "9ac004d4-e9fe-4078-d651-0bc323996738", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros2 = np.arange(8, 18)\n", + "a_numeros2" + ], + "execution_count": 51, + "outputs": 
[ + { + "output_type": 
"execute_result", + "data": { + "text/plain": [ + "array([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 51 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MxB2_qHpScMB" + }, + "source": [ + "Quais são os elementos comuns à X e Y?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e-rncJHtSfw0", + "outputId": "181798e4-5d52-4d1d-a9f1-66efb0bb3361", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.intersect1d(a_numeros1, a_numeros2)" + ], + "execution_count": 52, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 52 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Bb39sWdfqaF" + }, + "source": [ + "___\n", + "# **Autovalores e Autovetores**\n", + "> Autovetor e Autovalor são um dos tópicos mais importantes em Machine Learning.\n", + "\n", + "Por definição, o escalar $\\lambda$ e o vetor $v$ são autovalor e autovetor da matriz $A$ se\n", + "\n", + "$$Av = \\lambda v$$\n", + "\n", + "## Leitura Adicional:\n", + "\n", + "* [Machine Learning & Linear Algebra — Eigenvalue and eigenvector](https://medium.com/@jonathan_hui/machine-learning-linear-algebra-eigenvalue-and-eigenvector-f8d0493564c9)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XZBKq8nGCUbL" + }, + "source": [ + "* O array a_numeros2 tem a seguinte forma:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iYlZGKFUfw-R" + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6EfvIbBNf02Z" + }, + "source": [ + "# Calcula autovalores e autovetores:\n", + "a_Autovalores, a_Autovetores= np.linalg.eig(a_numeros2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v3GtQQvAz9QU" 
+ }, + "source": [ + "Os autovalores do array a_numeros2 são:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WvZGyBR1f9vP" + }, + "source": [ + "a_Autovalores" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AuuDRJVh0FC8" + }, + "source": [ + "Os autovetores do array a_numeros2 são:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6m4YFAwsf_rA" + }, + "source": [ + "a_Autovetores" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DASn2Un9ZNV-" + }, + "source": [ + "___\n", + "# **Encontrar Missing Values (NaN)**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TKilWBsSXtR4" + }, + "source": [ + "## Gerar o exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lqLI2ER_ZUMY" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.random.random(100)\n", + "\n", + "# Inserindo 15 NaN's no array:\n", + "np.random.seed(20111974)\n", + "l_indices_aleatorios= np.random.randint(0, 100, size = 15)\n", + "\n", + "for i_indices in l_indices_aleatorios:\n", + " #print(i_indices)\n", + " a_numeros1[i_indices] = np.nan" + ], + "execution_count": 53, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2ZkbMPXMawYh", + "outputId": "029308be-52e1-4e3d-cf1f-ec1bb64a2530", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "a_numeros1" + ], + "execution_count": 54, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.53097233, 0.56965626, nan, 0.65478409, 0.85708456,\n", + " 0.60174181, 0.87298309, 0.45573342, nan, 0.64300912,\n", + " 0.54808035, 0.35321428, 0.32005665, nan, 0.85159044,\n", + " 0.75930202, 0.65675987, 0.3278323 , 0.34592275, 0.41510657,\n", + " 0.30635652, 0.26750355, 0.30663224, 0.35503537, 0.60299892,\n", + " 0.0221767 , 0.36265947, nan, 0.28077438, 0.37056609,\n", + " 
nan, 0.43587362, 0.20494254, 0.20850854, 0.64886762,\n", + " 0.81792888, 0.71541492, 0.50313939, 0.1657674 , 0.60122378,\n", + " nan, 0.14442301, nan, 0.70671296, 0.07163699,\n", + " 0.56212721, nan, 0.83632274, 0.21435895, 0.85491145,\n", + " 0.62878505, 0.38468856, 0.90553087, 0.33703023, 0.06707729,\n", + " 0.1023552 , 0.84821523, 0.12156391, 0.94423963, 0.15835682,\n", + " nan, 0.91080887, 0.58558559, 0.36799242, 0.71647196,\n", + " 0.0740405 , 0.47889268, 0.77503169, 0.96720855, 0.71575223,\n", + " 0.28887146, 0.33306388, 0.95399002, 0.23557899, 0.97714605,\n", + " 0.85188315, 0.63303051, 0.57297905, 0.66792818, 0.87621361,\n", + " nan, nan, nan, 0.68323127, 0.28826713,\n", + " 0.32846648, 0.98334327, 0.17156066, nan, 0.91917489,\n", + " 0.98381602, 0.75915187, 0.31400247, 0.97074481, 0.07574498,\n", + " 0.55661541, nan, 0.4936932 , 0.07351232, 0.11418944])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 54 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z7Bs75NvbSjx" + }, + "source": [ + "OK, we drew 15 random indices, but one of them was drawn twice, so a_numeros1 contains 14 NaNs rather than 15. Now let's count the NaNs (we already know the answer!)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hL1Wn0vdX8ur" + }, + "source": [ + "## Identify the NaNs" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5R-n3H0xbd6d", + "outputId": "5eef40ef-4f0c-4678-8177-c150ca9b2278", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.isnan(a_numeros1).sum()" + ], + "execution_count": 55, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "14" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 55 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y7hh5uowoa3U" + }, + "source": [ + "OK, there are 14 NaNs in a_numeros1." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iVLQf_bqbyNU" + }, + "source": [ + "Now I want to know the indices of those NaNs." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kJHxjZiwb5HM", + "outputId": "d92df3ef-e4f8-48bc-9dba-c53a8a83e27d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "i_indices= np.where(np.isnan(a_numeros1))\n", + "i_indices" + ], + "execution_count": 56, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(array([ 2, 8, 13, 27, 30, 40, 42, 46, 60, 80, 81, 82, 88, 96]),)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 56 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "W_jHGNImok7L", + "outputId": "78a1c58a-772a-4937-ddc1-18b24123bf6e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Checking...\n", + "a_numeros1[2]" + ], + "execution_count": 57, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "nan" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 57 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iPhHAhDYcMWO" + }, + "source": [ + "To check that this is correct, compare these indices with the distinct values in l_indices_aleatorios." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gxQYslRCe11G" + }, + "source": [ + "___\n", + "# **Delete NaNs from an array**\n", + "> Consider the same array we have just worked with. Now I want to remove the NaNs we identified." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AeBARFqNfNnN", + "outputId": "b98e2229-54c3-4d6c-ffa1-7e859bb2cdfc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "a_numeros1" + ], + "execution_count": 58, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.53097233, 0.56965626, nan, 0.65478409, 0.85708456,\n", + " 0.60174181, 0.87298309, 0.45573342, nan, 0.64300912,\n", + " 0.54808035, 0.35321428, 0.32005665, nan, 0.85159044,\n", + " 0.75930202, 0.65675987, 0.3278323 , 0.34592275, 0.41510657,\n", + " 0.30635652, 0.26750355, 0.30663224, 0.35503537, 0.60299892,\n", + " 0.0221767 , 0.36265947, nan, 0.28077438, 0.37056609,\n", + " nan, 0.43587362, 0.20494254, 0.20850854, 0.64886762,\n", + " 0.81792888, 0.71541492, 0.50313939, 0.1657674 , 0.60122378,\n", + " nan, 0.14442301, nan, 0.70671296, 0.07163699,\n", + " 0.56212721, nan, 0.83632274, 0.21435895, 0.85491145,\n", + " 0.62878505, 0.38468856, 0.90553087, 0.33703023, 0.06707729,\n", + " 0.1023552 , 0.84821523, 0.12156391, 0.94423963, 0.15835682,\n", + " nan, 0.91080887, 0.58558559, 0.36799242, 0.71647196,\n", + " 0.0740405 , 0.47889268, 0.77503169, 0.96720855, 0.71575223,\n", + " 0.28887146, 0.33306388, 0.95399002, 0.23557899, 0.97714605,\n", + " 0.85188315, 0.63303051, 0.57297905, 0.66792818, 0.87621361,\n", + " nan, nan, nan, 0.68323127, 0.28826713,\n", + " 0.32846648, 0.98334327, 0.17156066, nan, 0.91917489,\n", + " 0.98381602, 0.75915187, 0.31400247, 0.97074481, 0.07574498,\n", + " 0.55661541, nan, 0.4936932 , 0.07351232, 0.11418944])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 58 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e497B492fFru", + "outputId": "98168450-f277-4834-f342-29fececcc27f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 323 + } + }, + "source": [ + "a_numeros1[~np.isnan(a_numeros1)]" + ], + "execution_count": 59, + "outputs": [ + { 
+ "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.53097233, 0.56965626, 0.65478409, 0.85708456, 0.60174181,\n", + " 0.87298309, 0.45573342, 0.64300912, 0.54808035, 0.35321428,\n", + " 0.32005665, 0.85159044, 0.75930202, 0.65675987, 0.3278323 ,\n", + " 0.34592275, 0.41510657, 0.30635652, 0.26750355, 0.30663224,\n", + " 0.35503537, 0.60299892, 0.0221767 , 0.36265947, 0.28077438,\n", + " 0.37056609, 0.43587362, 0.20494254, 0.20850854, 0.64886762,\n", + " 0.81792888, 0.71541492, 0.50313939, 0.1657674 , 0.60122378,\n", + " 0.14442301, 0.70671296, 0.07163699, 0.56212721, 0.83632274,\n", + " 0.21435895, 0.85491145, 0.62878505, 0.38468856, 0.90553087,\n", + " 0.33703023, 0.06707729, 0.1023552 , 0.84821523, 0.12156391,\n", + " 0.94423963, 0.15835682, 0.91080887, 0.58558559, 0.36799242,\n", + " 0.71647196, 0.0740405 , 0.47889268, 0.77503169, 0.96720855,\n", + " 0.71575223, 0.28887146, 0.33306388, 0.95399002, 0.23557899,\n", + " 0.97714605, 0.85188315, 0.63303051, 0.57297905, 0.66792818,\n", + " 0.87621361, 0.68323127, 0.28826713, 0.32846648, 0.98334327,\n", + " 0.17156066, 0.91917489, 0.98381602, 0.75915187, 0.31400247,\n", + " 0.97074481, 0.07574498, 0.55661541, 0.4936932 , 0.07351232,\n", + " 0.11418944])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 59 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RpvKfJU_fmA6" + }, + "source": [ + "Note that the NaNs have been removed from the result. The expression returns a new array; a_numeros1 itself is unchanged because the result was not assigned back to it." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_Dv8MmNYg8zN" + }, + "source": [ + "___\n", + "# **Convert a list into an array**\n", + "> Consider the list below (a list holding a single NumPy array):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "but6T9dVhFYb", + "outputId": "fadc10e9-3767-4061-f839-711837b06002", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "l_Lista = [np.random.randint(0, 10, 10)]\n", + "l_Lista" + ], + "execution_count": 60, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[array([8, 9, 3, 7, 1, 3, 2, 9, 7, 7])]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 60 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xytj4Eo4hTh9", + "outputId": "e7874289-2ad3-4e8c-fa2a-1f4bb238b333", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "type(l_Lista)" + ], + "execution_count": 61, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "list" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 61 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qrINdcruhWcH" + }, + "source": [ + "Converting the list to an array (the result is 2D, with shape (1, 10), because the list wraps a single array):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RoSyaX0OhZSE", + "outputId": "810f770f-beb4-43a9-bd85-ae8ed6c5c32f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros = np.asarray(l_Lista)\n", + "a_numeros" + ], + "execution_count": 62, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[8, 9, 3, 7, 1, 3, 2, 9, 7, 7]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 62 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dMjTdbBUhlrk", + "outputId": "f8d483d1-b233-448e-90d9-c29813af67ec", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + 
"type(a_numeros)" + ], + "execution_count": 63, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "numpy.ndarray" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 63 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mbm3ZP9DhxDI" + }, + "source": [ + "___\n", + "# Convert a tuple into an array\n", + "> Consider the tuple below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cZxEFYLAh3S_", + "outputId": "1e267cf3-068c-464d-d282-2d3c8416a0fd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.random.seed(20111974)\n", + "t_numeros = ([np.random.randint(0, 10, 3)], [np.random.randint(0, 10, 3)], [np.random.randint(0, 10, 3)])\n", + "t_numeros" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "([array([8, 8, 2])], [array([8, 9, 1])], [array([8, 0, 4])])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vlTXUJviiAml", + "outputId": "9962557d-f255-41d7-864d-55d521783332", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "type(t_numeros)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "tuple" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 27 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yEaOlq8oh3oh", + "outputId": "21edc300-270a-49f1-d044-8916038f8033", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "a_numeros = np.asarray(t_numeros)\n", + "a_numeros" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[[8, 8, 2]],\n", + "\n", + " [[8, 9, 1]],\n", + "\n", + " [[8, 0, 4]]])" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 28 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PSgQDmRWh3g5", + "outputId": "70328871-4dd7-4d28-e76c-53ba48ae049a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "type(a_numeros)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "numpy.ndarray" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 29 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pH-Ht6yMiqJN" + }, + "source": [ + "___\n", + "# Append elements to an array\n", + "> Consider the array below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dFaDZInZiwoo", + "outputId": "3c98416a-07d4-499c-a08a-7e7a37c327e2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1 = np.arange(5)\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "d3zrlf_Ci73Z", + "outputId": "997f288a-994f-418c-9110-bc78e0eb92cd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.append(a_numeros1, [np.random.randint(0, 10, 3), np.random.randint(0, 10, 3), np.random.randint(0, 10, 3)])\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 8, 8, 2, 8, 9, 1, 8, 0, 4])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eFRhtk13ojqA" + }, + "source": [ + "___\n", + "# **Convert 1D arrays into a 2D array**\n", + "> Consider the arrays below:" + ] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "wYhBgW9Zu6ZP" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.array(np.random.randint(0, 10, 6))\n", + "\n", + "np.random.seed(19741120)\n", + "a_numeros2 = np.array(np.random.randint(0, 10, 6))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "febs9AUHvs6n" + }, + "source": [ + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "C9OEd-iavvBm" + }, + "source": [ + "a_numeros2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KJWjtaWKv0MJ" + }, + "source": [ + "np.column_stack((a_numeros1, a_numeros2)) # Note the parentheses in (a_numeros1, a_numeros2): the arrays are passed as a single tuple." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xr_WZXJ7pi2D" + }, + "source": [ + "___\n", + "# **Delete specific elements from an array using indices**\n", + "> Consider the array below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tS0ZzOs8w0dw" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.array(np.random.randint(0, 10, 6))\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7bOJiKDKxEsC" + }, + "source": [ + "Suppose I want to delete the values '8' from a_numeros1. The indices of the values '8' are: [0, 1, 3]. 
Therefore:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SSjueEvjxTJO" + }, + "source": [ + "a_numeros1 = np.delete(a_numeros1, [0, 1, 3])\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mZkGZ2Rgp--5" + }, + "source": [ + "___\n", + "# **Frequency of the unique values in an array**\n", + "> Consider the array below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z2BWKfH0xvQ8", + "outputId": "7712eab3-15ea-4064-dcfa-8121cf51f2a0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "np.random.seed(20111974)\n", + "a_numeros1 = np.array(np.random.randint(0, 10, 100))\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([8, 8, 2, 8, 9, 1, 8, 0, 4, 2, 0, 8, 9, 3, 7, 1, 3, 2, 9, 7, 7, 9,\n", + " 5, 6, 8, 7, 0, 9, 3, 9, 3, 1, 8, 6, 3, 5, 4, 1, 2, 9, 8, 6, 6, 1,\n", + " 0, 9, 2, 0, 7, 5, 5, 4, 4, 2, 7, 2, 7, 9, 3, 1, 5, 0, 1, 2, 3, 8,\n", + " 7, 5, 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, 1, 0, 9, 1, 4, 2, 9,\n", + " 7, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s_tdQBsax4rQ" + }, + "source": [ + "Suppose I want to know how many times the value '2' appears in a_numeros1." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6yIlk7pWyAtf", + "outputId": "1dde1573-d81b-43b5-adc7-809674303603", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "l_itens_unicos, i_count = np.unique(a_numeros1, return_counts=True)\n", + "l_itens_unicos" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DyvrIwS9yZIR" + }, + "source": [ + "What does the output above mean?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uO-MPMhXyV9H", + "outputId": "42892cc5-0e5e-4c61-cb4a-72add57cc618", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "i_count" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 10, 10, 11, 8, 8, 8, 10, 10, 15])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zwoezXrPyofK" + }, + "source": [ + "How do you interpret the output above?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HgYycSG7yr5e", + "outputId": "b87844ae-1ef3-4c99-fa77-84ef657020ae", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "np.asarray((l_itens_unicos, i_count))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],\n", + " [10, 10, 10, 11, 8, 8, 8, 10, 10, 15]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SwIZiJAiy06T" + }, + "source": [ + "How do you interpret the output above?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JpNRpN2Dql3N" + }, + "source": [ + "___\n", + "# **All possible combinations of elements from other arrays**\n", + "> Consider the example below:\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BUr89dH4zLXD" + }, + "source": [ + "a_numeros1 = [2, 4, 6]\n", + "a_numeros2 = [0, 8]\n", + "a_numeros4 = [1, 5]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cEZH6l-Czx7y" + }, + "source": [ + "np.meshgrid(a_numeros1, a_numeros2, a_numeros4)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "btvmDkEcz0tH" + }, + "source": [ + "np.array(np.meshgrid(a_numeros1, a_numeros2, a_numeros4))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z0xhO7rGz059" + }, + "source": [ + "np.array(np.meshgrid(a_numeros1, a_numeros2, a_numeros4)).T" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eMv4lFnD0Enn" + }, + "source": [ + "# Final result\n", + "a_numeros3 = np.array(np.meshgrid(a_numeros1, a_numeros2, a_numeros4)).T.reshape(-1,3)\n", + "a_numeros3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Rz80YANfAh2k" + }, + "source": [ + "___\n", + "# **Wrap Up**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_cyhMsAVXxGC" + }, + "source": [ + "___\n", + "# **Exercises**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kNjovMw3uJ3R" + }, + "source": [ + "## Exercise 1 - Select the even numbers\n", + "> Given the 1D array below, select only the even numbers." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "isDzQjwjBX3V", + "outputId": "eff14fb7-c4f7-4113-f568-01b4ba21c9b3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kq1zt-uO1HXv" + }, + "source": [ + "### **My solution**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YFmK_n2M1Ks9", + "outputId": "11dbc1b6-10b6-42b1-a1bc-b8bc8d464d42", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1[a_numeros1 % 2 == 0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 2, 4, 6, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sScYG0hp05vb" + }, + "source": [ + "___\n", + "## Exercise 2 - Replace with the median\n", + "> Given the 1D array below, replace the even numbers with the median of a_numeros1." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XLZ-DIWU1WFs", + "outputId": "52bad0fc-24e7-43b7-cec0-6f2cfbffddff", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9c4QWJno1WVB" + }, + "source": [ + "### **My solution**\n", + "* First, we need to compute the median.\n", + "* Then we replace the even values of a_numeros1 with that median. Because a_numeros1 has an integer dtype, the median 4.5 is truncated to 4 when it is assigned." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rx7NGAO01Wfb", + "outputId": "9667049f-ff4c-45f9-be1b-013a2e03d464", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_numeros1[a_numeros1 % 2 == 0] = np.median(a_numeros1)\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([4, 1, 4, 3, 4, 5, 4, 7, 4, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2c_AphX82qp8" + }, + "source": [ + "Checking..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9kVta0Cr13Z9", + "outputId": "adac6c2d-f9d9-42a6-f13a-c41ae6ae8d09", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "f'The median of a_numeros1 is: {np.median(a_numeros1)}'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'The median of a_numeros1 is: 4.0'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 23 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L9O-Hf5x26TY" + }, + "source": [ + "___\n", + "## Exercise 3 - Reshape\n", + "> Given the 1D array below, reshape it into a 2D array with 3 columns." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0_laUvtB4Wl-", + "outputId": "92df1bde-642a-455d-ddef-619c29cf29dc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Set the seed\n", + "np.random.seed(20111974)\n", + "a_numeros1 = np.array(np.random.randint(1, 10, size = 15))\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([9, 9, 3, 9, 2, 9, 1, 5, 3, 1, 9, 4, 8, 2, 4])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dKzEX8TK5b4Z" + }, + "source": [ + "### **My solution**\n", + "* The 1D array a_numeros1 above has 15 elements. Since we want to turn it into a 2D array with 3 columns, each column will have 5 elements, that is, the result has 5 rows." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I-j5yVD04249", + "outputId": "0a2877e5-79af-4258-bb77-bf665892a112", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "a_numeros1.reshape(5, 3) \n", + "# This could also be a_numeros1.reshape(-1, 3), where \"-1\" asks NumPy to compute the number of rows. " + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[9, 9, 3],\n", + " [9, 2, 9],\n", + " [1, 5, 3],\n", + " [1, 9, 4],\n", + " [8, 2, 4]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F1vfS8jE6L0_" + }, + "source": [ + "___\n", + "## Exercise 4 - Reshape\n", + "> Given the 1D array below, reshape it into a 2D array with 2 columns." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xcN-bez56L1D", + "outputId": "a3f2336e-fd0f-438f-93f6-74a67d26d939", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Set the seed\n", + "np.random.seed(20111974)\n", + "a_numeros1 = np.array(np.random.randint(1, 10, size = 16))\n", + "a_numeros1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([9, 9, 3, 9, 2, 9, 1, 5, 3, 1, 9, 4, 8, 2, 4, 3])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_JTbsHSGjaYS", + "outputId": "7f9ab328-1b90-4f12-bfa7-eff24af344ce", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 153 + } + }, + "source": [ + "a_numeros1.reshape(-1,2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[9, 9],\n", + " [3, 9],\n", + " [2, 9],\n", + " [1, 5],\n", + " [3, 1],\n", + " [9, 4],\n", + " [8, 2],\n", + " [4, 3]])" + ] + }, + "metadata": { + "tags": [] 
+ }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7iICnOyG6fcj" + }, + "source": [ + "### **My solution**\n", + "* The 1D array a_numeros1 above has 16 elements. We want to turn it into a 2D array with 2 columns, so the result has 8 rows." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vdq5ybuD6fcn" + }, + "source": [ + "a_numeros1.reshape(-1, 2) # The value \"-1\" in the row position asks NumPy to compute the number of rows automatically." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "haQfWPcCs_H0" + }, + "source": [ + "## Exercise 5\n", + "For more exercises on arrays, visit the page [Python: Array Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/array/)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LQQL0JS2tnc0" + }, + "source": [ + "## Exercise 6\n", + "For more exercises on mathematics, visit the page [Python Math: - Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/math/index.php)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qNskKFy9t4D5" + }, + "source": [ + "## Exercise 7\n", + "For more exercises on NumPy in general, visit the page [NumPy Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/numpy/index.php)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qqc1AiHXuKZ5" + }, + "source": [ + "## Exercício 8\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jYrgc3KvtmLy" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB02__Numpy_hs2.ipynb b/Notebooks/NB02__Numpy_hs2.ipynb new file mode 100644 index 000000000..ba74fa73f --- /dev/null +++ b/Notebooks/NB02__Numpy_hs2.ipynb @@ -0,0 +1,5840 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB02__Numpy.ipynb", + "provenance": [], + "collapsed_sections": [ + "n8BIbzQbNWUo", + "7eS94uQ4NhVR", + "SYOgJpGYVLUu", + "CaHFxk98W5if", + "ReWUyWiHXCnc", + "CqszHxaKHr2h", + "tXgF1Wl9gHKY", + "Fotx7XUquAo8", + "36kmLUYDvsUI", + "SWO2GdNovxAp", + "vpN54l4vxze5", + "u4HOf9SNytSq", + "6BQ9oZiD9hg5", + "tz5-QdrX9vct", + "p1muBgMX8NK4", + "FxTC2-U88ajk", + "z8EYn0pP25Rh" + ], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6QhLXoatkvKR" + }, + "source": [ + "

NUMPY

\n", + "\n", + "> NumPy is a package for scientific computing and linear algebra in Python.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b8EZupp68vW8" + }, + "source": [ + "# **AGENDA**:\n", + "> In this chapter, we cover the following topics:\n", + "\n", + "* NumPy\n", + "* Creating arrays\n", + "* Creating multidimensional arrays\n", + "* Selecting items\n", + "* Applying functions such as max(), min(), etc.\n", + "* Computing descriptive statistics: mean and variance\n", + "* Reshaping\n", + "* Transpose of an array\n", + "* Eigenvalues and eigenvectors\n", + "* Wrap Up\n", + "* Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cO5t3xCO8kyK" + }, + "source": [ + "___\n", + "# **NOTES AND REMARKS**\n", + "\n", + "* Our focus with NumPy is to make it easier to use Pandas." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z2IFUG4GSB0Z" + }, + "source": [ + "___\n", + "# **CHEATSHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jYLeDVH-SNCg" + }, + "source": [ + "![Numpy](https://github.com/MathMachado/Materials/blob/master/numpy_basics-1.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0mKvExmgUFOk" + }, + "source": [ + "# **SCALARS, VECTORS, MATRICES AND TENSORS**\n", + "\n", + "![Tensor](https://github.com/MathMachado/Materials/blob/master/tensor.png?raw=true)\n", + "\n", + "Source: [PyTorch for Deep Learning: A Quick Guide for Starters](https://towardsdatascience.com/pytorch-for-deep-learning-a-quick-guide-for-starters-5b60d2dbb564)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o00pYRIkXiAU" + }, + "source": [ + "## Import Statement - First examples\n", + "> As an example, consider drawing a random sample of size 10 from the Normal(0, 1) distribution:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l_XuvcUDWNDk" + }, + "source": [ + "## Import the NumPy library" + ] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "am_ZTIGaapCo" + }, + "source": [ + "### **Option 1**: Import the NumPy library WITH an alias" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b4irLw6BWVVZ" + }, + "source": [ + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JK54ga7dXnJu", + "outputId": "1a31527c-f8b6-44d5-ecbd-9f08abc5f8d6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "# Set the number of decimal places NumPy prints:\n", + "np.set_printoptions(precision = 2, suppress = True)\n", + "\n", + "'''\n", + "Set the seed for reproducibility, that is, \n", + "make sure we all generate the same random numbers\n", + "'''\n", + "np.random.seed(seed = 20111974)\n", + "\n", + "# Generate 10 random numbers from the Normal(media, desvio_padrao) distribution\n", + "media = 0\n", + "desvio_padrao = 1\n", + "a_conjunto1 = np.random.normal(media, desvio_padrao, size = 10) # 1D array of size = 10\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06, 1.14,\n", + " 1.38])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-0934isZUm6" + }, + "source": [ + "**Note**: Change the value of [precision] to 4, 2 and 0 and observe what happens." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ob_8S_bYYa2" + }, + "source": [ + "### **Opção 2**: Importar a biblioteca NumPy SEM alias" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NcGd1ho_XDXU" + }, + "source": [ + "import numpy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zFYH6J5-Ydjl" + }, + "source": [ + "# Set up o número de casas decimais para o NumPy:\n", + "numpy.set_printoptions(precision = 2, suppress = True)\n", + "\n", + "'''\n", + "Define seed por questões de reproducibilidade, ou seja, \n", + "garante que todos vamos gerar os mesmos números aleatórios\n", + "'''\n", + "numpy.random.seed(seed = 20111974)\n", + "\n", + "# Gera 10 números aleatórios a partir da Distribuição Normal(media, desvio_padrao)\n", + "media = 0\n", + "desvio_padrao = 1\n", + "numpy.random.normal(media, desvio_padrao, size = 10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AwWSzYrZWfvA" + }, + "source": [ + "### **Opção 3**: Importar funções específicas da biblioteca NumPy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bfYJzcqRa5eu" + }, + "source": [ + "from numpy import set_printoptions\n", + "from numpy.random import seed, normal" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xj6fbpvubH_p" + }, + "source": [ + "# Set up o número de casas decimais para o NumPy:\n", + "set_printoptions(precision = 2, suppress = True)\n", + "\n", + "'''\n", + "Define seed por questões de reproducibilidade, ou seja, \n", + "garante que todos vamos gerar os mesmos números aleatórios\n", + "'''\n", + "seed(seed = 20111974)\n", + "\n", + "# Gera 10 números aleatórios a partir da Distribuição Normal(media, desvio_padrao),\n", + "# usando a função normal importada diretamente (sem o prefixo np.random)\n", + "media = 0\n", + "desvio_padrao = 1\n", + "normal(media, desvio_padrao, 
size = 10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { 
+ "id": "00RerJPChnuP" + }, + "source": [ + "___\n", + "# **Estatísticas Descritivas com NumPy**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qa6ro1VJlShd" + }, + "source": [ + "## Exemplo 1\n", + "> Vamos voltar ao mesmo exemplo anterior, mas desta vez, usando a opção 1 (com alias):\n", + "\n", + "* Gerar uma amostra aleatória de tamanho 10 da Distribuição Normal(0, 1)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "31dSBU8khvFk" + }, + "source": [ + "# Set up o número de casas decimais para o NumPy:\n", + "np.set_printoptions(precision = 2, suppress = True)\n", + "\n", + "# Define seed\n", + "np.random.seed(seed = 20111974)\n", + "\n", + "# Gera 10 números aleatórios a partir da Distribuição Normal(media, desvio_padrao)\n", + "media = 0\n", + "desvio_padrao = 1\n", + "\n", + "a_conjunto1 = np.random.normal(media, desvio_padrao, size = 10) # Array 1D de size = 10\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wa2t0P3nevTh" + }, + "source": [ + "Conferindo a média e desvio-padrão do array gerado:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "drUyk3f5ekDq" + }, + "source": [ + "f'Distribuição N({np.mean(a_conjunto1)}, {np.std(a_conjunto1)})'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XSp7Hd-Gib67" + }, + "source": [ + "Estávamos à espera de média = 0 e desvio-padrão = 1, certo? Por que isso não aconteceu?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HP_8VSgygXOF" + }, + "source": [ + "## **Laboratório 1**\n", + "> Altere os valores de [size] para 100, 1.000, 10.000, 100.000 e 1.000.000 e relate o que acontece com a média e o desvio-padrão." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4TbmVbdcg6iU" + }, + "source": [ + "## **Minha solução**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-qdiqBVHg-gd" + }, + "source": [ + "# Define a média e o desvio-padrão\n", + "media = 0\n", + "desvio_padrao = 1\n", + "\n", + "# Define seed\n", + "np.random.seed(seed = 20111974)\n", + "# Tamanhos muito maiores (10^8, 10^9) podem esgotar a memória RAM do ambiente\n", + "l_lista_conjunto = [10, 100, 1000, 10000, 100000, 1000000, 10000000]\n", + "\n", + "for i_size in l_lista_conjunto:\n", + "    a_conjunto1 = np.random.normal(media, desvio_padrao, size = i_size)\n", + "    print(f'Size: {i_size} --> Distribuição: N({np.mean(a_conjunto1)}, {np.std(a_conjunto1)})')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bp-YuviQwWqE" + }, + "source": [ + "Com relação à Distribuição Normal($\\mu, \\sigma$), temos que:\n", + "\n", + "![NormalDistribution](https://github.com/MathMachado/Materials/blob/master/NormalDistribution.PNG?raw=true)\n", + "\n", + "Fonte: [Normal Distribution](https://towardsdatascience.com/understanding-the-68-95-99-7-rule-for-a-normal-distribution-b7b7cbf760c2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KwHBY3Enk04N" + }, + "source": [ + "## Lei Forte dos Grandes Números - LFGN\n", + "> Por favor, leia o que diz a [Law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers). --> 3 minutos.\n", + "\n", + "* O que você aprendeu com isso?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BhwmSkAjlszT" + }, + "source": [ + "## Exemplo 2\n", + "> Vamos nos aprofundar um pouco mais no que diz a LFGN. Para isso, vamos simular o lançamento de dados. Como sabemos, os dados possuem 6 lados numerados de 1 a 6, com igual probabilidade. Certo?\n", + "\n", + "A LFGN nos diz que, à medida que N (o tamanho da amostra ou número de lançamentos) cresce, a média dos resultados converge para o valor esperado. 
Isso quer dizer que:\n", + "\n", + "$$\\frac{1+2+3+4+5+6}{6}= \\frac{21}{6}= 3,5$$\n", + "\n", + "Ou seja, à medida que N (o tamanho da amostra) cresce, espera-se que a média dos dados se aproxime de 3,5. Ok?\n", + "\n", + "Vamos ver se isso é verdade..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-QcJXf6roj0D" + }, + "source": [ + "Vamos usar a função np.random.randint (= função randint definida no módulo np.random), a seguir:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A2u0RzLOrRE2" + }, + "source": [ + "O que significa ou qual é a interpretação do resultado abaixo?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B3-X_VBerUfa" + }, + "source": [ + "# Define seed\n", + "import numpy as np\n", + "np.random.seed(seed = 20111974)\n", + "\n", + "# Simular 100 lançamentos de um dado (o limite superior 7 é exclusivo):\n", + "a_dados_simulados = np.random.randint(1, 7, size = 100)\n", + "a_dados_simulados" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "m8Of2MMIrbF3" + }, + "source": [ + "# Importar o pandas, pois vamos precisar do método value_counts() de pd.Series\n", + "# (a função pd.value_counts() está obsoleta nas versões recentes do pandas):\n", + "import pandas as pd\n", + "pd.Series(a_dados_simulados).value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "54VwED8Br8rx" + }, + "source": [ + "**Interpretação**: Isso quer dizer que fizemos a simulação de lançamento de um dado 100 vezes. Acima, a frequência com que cada lado do dado aparece.\n", + "\n", + "Eu estava à espera de frequência igual para cada um dos lados, isto é, por volta dos 16 ou 17. Ou seja:\n", + "\n", + "$$\\frac{100}{6} \\approx 16,67$$\n", + "\n", + "Mas ok, vamos continuar com nosso experimento..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HT_Dak-umC6I" + }, + "source": [ + "# Definir a semente\n", + "np.random.seed(20111974)\n", + "\n", + "for i_size in [10, 30, 50, 75, 100, 1000, 10000, 100000, 1000000]:\n", + " a_dados_simulados = np.random.randint(1, 7, size = i_size)\n", + " print(f'Size: {i_size} --> Média: {np.mean(a_dados_simulados)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "edWNNOnXtbtd" + }, + "source": [ + "E agora, como você interpreta esses resultados?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eL6gXThkYcSf" + }, + "source": [ + "## Calcular percentis\n", + "> Boxplot" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jlGOQfXfPf0D" + }, + "source": [ + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)\n", + "\n", + "Fonte: [Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "grtEXG2BoNRt" + }, + "source": [ + "Considere o array de retornos (simulados) a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DjPKKq01YjF9" + }, + "source": [ + "import numpy as np\n", + "np.random.seed(20111974)\n", + "\n", + "# Simulando Retornos de ativos financeiros com a distribuição Normal(0, 1):\n", + "a_retornos = np.random.normal(0, 1, 100)\n", + "print(f'Média: {np.mean(a_retornos)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ajjlfqgssLVO" + }, + "source": [ + "a_retornos" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XZ3m06gv9lei" + }, + "source": [ + "A seguir, o boxplot do array a_retornos:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QtuwJP449tBQ" + }, + "source": [ + "# Import da biblioteca seaborn: Uma das principais libraries 
para Data Visualization (outras: matplotlib)\n", + "import seaborn as sns\n", + "\n", + "sns.boxplot(y = a_retornos)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "o9ujdjxNY6qE" + }, + "source": [ + "# Vamos usar o método np.percentile(array, q = [p1, p2, p3, ..., p99])\n", + "percentis = np.percentile(a_retornos, q = [1, 5, 25, 50, 55, 75, 99])\n", + "\n", + "# Primeiro Quartil (percentil 25, índice 2 do array percentis)\n", + "q1 = percentis[2]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c75g2Egco2lc" + }, + "source": [ + "Em qual posição do array percentis se encontra Q3?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nZr-A82Zo8Kb" + }, + "source": [ + "q3 = percentis[5]\n", + "\n", + "# ou, contando de trás para a frente no array percentis:\n", + "q3_2 = percentis[-2]\n", + "print(q3, q3_2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sWrnESPQT4JM" + }, + "source": [ + "# Limites inferior e superior para detecção de outliers (1.5 vezes o intervalo interquartil q3 - q1)\n", + "lim_inferior_outlier = q1 - 1.5 * (q3 - q1)\n", + "lim_superior_outlier = q3 + 1.5 * (q3 - q1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Yb4-ZJlUUYsi" + }, + "source": [ + "f'Limite Inferior: {lim_inferior_outlier}; Limite Superior: {lim_superior_outlier}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jr6oXIHlUxOe" + }, + "source": [ + "np.min(a_retornos)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "UxE47cN0U54X" + }, + "source": [ + "np.max(a_retornos)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OTB9HnIac499" + }, + "source": [ + "___\n", + "# 
**Ordenar itens de um array**\n", + "> Considere o array a seguir:" + ] + }, + { 
+ "cell_type": "code", + "metadata": { + "id": "Jgj8Yw46dBMx" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.random.random(10)\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cC9272GFdRln" + }, + "source": [ + "Ordenando os itens de a_conjunto1..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YUP90nBVdUeF" + }, + "source": [ + "np.sort(a_conjunto1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lG763cDGj-yB" + }, + "source": [ + "___\n", + "# **Obter ajuda**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ehxPlD3EkEYL" + }, + "source": [ + "help(np.random.normal)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Q_konJVaBsV" + }, + "source": [ + "___\n", + "# **Criar arrays 1D**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DddZT5kadYJ7" + }, + "source": [ + "import numpy as np\n", + "np.set_printoptions(precision = 2, suppress = True)\n", + "np.random.seed(seed = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jaqd-VnF3yIt" + }, + "source": [ + "Criar o array 1D a_conjunto1, com os seguintes números:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E3niz_zHaF3e" + }, + "source": [ + "a_conjunto1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DyfXbW_ZKJBS" + }, + "source": [ + "Qual a dimensão de a_conjunto1?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gbHlydALKB3R" + }, + "source": [ + "# Número de dimensões (eixos) do array\n", + "a_conjunto1.ndim" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "am9otElpKNPa" + }, + "source": [ + "Qual o shape (tamanho de cada dimensão) do array a_conjunto1?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "juJJ74d2wale" + }, + "source": [ + "# Número de itens em cada dimensão do array\n", + "a_conjunto1.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BHg4Rre3GwPy" + }, + "source": [ + "O array a_conjunto1 poderia ter sido criado usando a função np.arange(start, stop, step):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I3fyusN7G5Zn" + }, + "source": [ + "# Lembre-se que o valor de stop (10) é exclusivo, isto é, não entra no array.\n", + "a_conjunto2 = np.arange(start = 0, stop = 10, step = 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IHCEpmUxXsaK" + }, + "source": [ + "Outra alternativa seria usar np.linspace(start = 0, stop = 9, num = 10). 
Acompanhe a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JB9Y_x3RX1GX" + }, + "source": [ + "# Com np.linspace, o valor 9 é inclusive.\n", + "a_conjunto3 = np.linspace(0, 9, 10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P6MR8MPeYOZm" + }, + "source": [ + "Compare os resultados de a_conjunto1, a_conjunto2 e a_conjunto3 a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tWEzge6HYSFu" + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lUNlFVKYYT9f" + }, + "source": [ + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xo8Lid5fYVPW" + }, + "source": [ + "a_conjunto3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V9aW7C4vHAcF" + }, + "source": [ + "Ou seja, a_conjunto1 é igual a a_conjunto2 que também é igual a a_conjunto3. Ok?\n", + "\n", + "**ATENÇÃO**: Observe que a sintaxe para criar a_conjunto3 é ligeiramente diferente da sintaxe usada para criar a_conjunto1 e a_conjunto2. 
Abaixo, a sintaxe do comando np.linspace:\n", + "\n", + "![](https://github.com/MathMachado/Materials/blob/master/linspace_sintaxe.PNG?raw=true)\n", + "\n", + "Source: [HOW TO USE THE NUMPY LINSPACE FUNCTION](https://www.sharpsightlabs.com/blog/numpy-linspace/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KNnwZa3uvYqE" + }, + "source": [ + "Somar 2 a cada item de a_conjunto1:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jt2KVyviw0bp" + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "arROkhWXbdTW" + }, + "source": [ + "a_conjunto2 = a_conjunto1 + 2\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZJx2vG86vdVi" + }, + "source": [ + "Multiplicar por 10 cada item de a_conjunto1:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vm7abO6Ebkun" + }, + "source": [ + "a_conjunto1 = a_conjunto1*10\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0Ev1xnBwaYJG" + }, + "source": [ + "___\n", + "# **Criar Arrays Multidimensionais**\n", + "> Um array 2D, por exemplo, é chamado de matriz." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gHaeAug5vjjd" + }, + "source": [ + "Criar um array com 2 linhas e 3 colunas usando números aleatórios:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VDi0vIPSYR4F" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.random.randn(2, 3)\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DIdd-nA3tJjV" + }, + "source": [ + "## Dimensão de um array\n", + "> Aqui, dimensão se refere ao shape: o número de linhas e colunas da matriz." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pKvjjnkrK-v7" + }, + "source": [ + "a_conjunto1.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-DHS5jXELCfa" + }, + "source": [ + "a_conjunto1 é um array 2D (ou matriz), ou seja, 2 linhas, onde cada linha tem 3 elementos." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HJI6X1wvv4Bg" + }, + "source": [ + "Criar um array com 3 linhas e 3 colunas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hXPbWh3Tv26T" + }, + "source": [ + "a_conjunto2 = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "we6ZJOICc7bQ" + }, + "source": [ + "# Número de linhas e colunas de a_conjunto1:\n", + "a_conjunto1.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "f0ocwuI1dED6" + }, + "source": [ + "# Número de linhas e colunas de a_conjunto2\n", + "a_conjunto2.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CApPtnW0YuRP" + }, + "source": [ + "# Somar 2 à cada elemento de a_conjunto2\n", + "a_conjunto2 = a_conjunto2+2\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M87aGmxRY3RW" + }, + "source": [ + "# Multiplicar por 10 cada elemento de a_conjunto2\n", + "a_conjunto2 = a_conjunto2*10\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZt93y1IL_v7" + }, + "source": [ + "___\n", + "# **Copiar arrays**\n", + "> Considere o array abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sH2FTXj5MRRC" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.random.randn(2, 3)\n", + "a_conjunto1" + ], + "execution_count": 
null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VtgKeMt6MYrr" + }, + "source": [ + "Fazendo a cópia de a_conjunto1..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "K0hOHR3IMa-o" + }, + "source": [ + "a_salarios_copia = a_conjunto1.copy()\n", + "a_salarios_copia" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lFpmcR0HkCar" + }, + "source": [ + "___\n", + "# **Operações com arrays**\n", + "> Considere um array com temperaturas em Fahrenheit dado por:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VnagcUqVkLhW" + }, + "source": [ + "# Define a seed\n", + "np.random.seed(20111974)\n", + "\n", + "a_temperatura_farenheit = np.array(np.random.randint(0, 100, 10))\n", + "a_temperatura_farenheit " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VrjNKfXxk1yv" + }, + "source": [ + "type(a_temperatura_farenheit)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o1STejhrk0kZ" + }, + "source": [ + "Transformando a temperatura Fahrenheit em Celsius..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E_jXflR_lNy3" + }, + "source": [ + "a_temperatura_celsius = 5*a_temperatura_farenheit/9 - 5*32/9\n", + "a_temperatura_celsius" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U4pCv0pNqPZI" + }, + "source": [ + "# O mesmo resultado, porém, escrito de forma diferente:\n", + "a_temperatura_celsius = (5/9)*a_temperatura_farenheit - (160/9)\n", + "a_temperatura_celsius" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UT4YD2FawUA" + }, + "source": [ + "___\n", + "# **Selecionar itens**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pqOv8P1za1m8" + }, + "source": [ + "# Selecionar o segundo item de a_conjunto1 (lembre-se que no Python arrays começam com indice = 0)\n", + "a_conjunto1[1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TIwVKk6AyRv6" + }, + "source": [ + "Dado a_conjunto2 abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zoDmbXo6bCeu" + }, + "source": [ + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iJXSPp-0yb4w" + }, + "source": [ + "... selecionar o item da linha 2, coluna 3 do array a_conjunto2:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sJiVfnlzcjRv" + }, + "source": [ + "a_conjunto2[1, 2]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xl5HwJIMcv2e" + }, + "source": [ + "# Selecionar o último elemento de a_conjunto1 --> Lembre-se que a_conjunto1 é um array. Desta forma, teremos o último elemento do array!\n", + "a_conjunto1[-1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ezTH0HsyrnAl" + }, + "source": [ + "Veja..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OBv9EM54rYX3" + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Po3WLFC-rod8" + }, + "source": [ + "a_temperatura_celsius[-1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4qJJ2HCedW4h" + }, + "source": [ + "___\n", + "# **Aplicar funções como max(), min() e etc**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_meTJdUsda4e" + }, + "source": [ + "f'O máximo de a_conjunto1 é: {np.max(a_conjunto1)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "m-wiBkAidnhN" + }, + "source": [ + "f'O mínimo de a_conjunto1 é: {np.min(a_conjunto1)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lmupnRHQdtwh" + }, + "source": [ + "f'O máximo de a_conjunto2 é: {np.max(a_conjunto2)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "H2z7oB6Bd786" + }, + "source": [ + "f'O máximo de cada LINHA de a_conjunto2 é: {np.max(a_conjunto2, axis = 1)}' # Aqui, axis = 1 é que diz ao numpy que estamos interessados nas linhas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gj2ZBDsWeMyk" + }, + "source": [ + "f'O máximo de cada COLUNA de a_conjunto2 é: {np.max(a_conjunto2, axis = 0)}' # axis = 0, diz ao numpy que estamos interessados nas colunas." 
+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7_tEfm2IecIU" + }, + "source": [ + "___\n", + "# **Calcular Estatísticas Descritivas: média e variância**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lIY5jx3ueh7q" + }, + "source": [ + "f'A média de a_conjunto1 é: {np.mean(a_conjunto1)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VmqSELRReuAW" + }, + "source": [ + "f'A média de a_conjunto2 é: {np.mean(a_conjunto2)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gxap-Wg5e2_H" + }, + "source": [ + "f'O Desvio Padrão de a_conjunto2 é: {np.std(a_conjunto2)}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R0GcljGtfBvP" + }, + "source": [ + "___\n", + "# **Reshaping**\n", + "> Muito útil em Machine Learning." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vfEmw01j8zux" + }, + "source": [ + "## Exemplo 1\n", + "* O array a_conjunto2 tem a seguinte forma:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-Lb3VZCCfK_a" + }, + "source": [ + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YWN_nN-4fD7u" + }, + "source": [ + "# reshaping para 9 linhas e 1 coluna:\n", + "a_conjunto2.reshape(9, 1) # a_conjunto2.reshape(9,-1) produz o mesmo resultado." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "id9ILRRt7SwY" + }, + "source": [ + "## Mais um exemplo de Reshape\n", + "> Dado o array 1D abaixo, reshape para um array 2D com 2 colunas." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9RA9Ht2b7Swd", + "outputId": "eadedfd5-fd6c-49c8-db5c-6f8f30d45f36", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Define seed\n", + "np.random.seed(20111974)\n", + "a_conjunto1 = np.array(np.random.randint(1, 10, size = 15))\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([9, 9, 3, 9, 2, 9, 1, 5, 3, 1, 9, 4, 8, 2, 4])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8KxR4xZT7cRv" + }, + "source": [ + "### Solução\n", + "> Temos 15 elementos em a_conjunto1 para construir (\"reshape\") um array 2D com 2 colunas.\n", + "\n", + "A princípio, a solução seria..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VMdHl1Il7wLw", + "outputId": "d51c7263-f523-4af8-9606-ee93cab66f1c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 163 + } + }, + "source": [ + "a_conjunto1.reshape(-1, 2) # O valor \"-1\" na posição das linhas pede ao NumPy para calcular o número de linhas automaticamente." 
+ ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "ValueError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0ma_numeros1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# O valor \"-1\" na posição das linhas pede ao NumPy para calcular o número de linhas automaticamente.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mValueError\u001b[0m: cannot reshape array of size 15 into shape (2)" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pZS4b4-y708q" + }, + "source": [ + "Porque temos esse erro?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4disywvR8HeH" + }, + "source": [ + "E se fizermos..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3oEAAXTp8I7Z", + "outputId": "e8c8a90f-c34a-4304-d9b4-fd7f04ce224f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Define seed\n", + "np.random.seed(20111974)\n", + "a_conjunto1 = np.array(np.random.randint(1, 10, size = 16)) # Observe que agora temos 16 elementos\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([9, 9, 3, 9, 2, 9, 1, 5, 3, 1, 9, 4, 8, 2, 4, 3])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iUhth0QV8Rpt" + }, + "source": [ + "Reshapping..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9D1y7uD88Qip", + "outputId": "e7d22bcd-c10f-4ea3-e41b-03f6f98a054f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 153 + } + }, + "source": [ + "a_conjunto1.reshape(-1, 2) # O valor \"-1\" na posição das linhas pede ao NumPy para calcular o número de linhas automaticamente." + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[9, 9],\n", + " [3, 9],\n", + " [2, 9],\n", + " [1, 5],\n", + " [3, 1],\n", + " [9, 4],\n", + " [8, 2],\n", + " [4, 3]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ALh-sq7DMnN5", + "outputId": "db373349-7910-4f1f-93f3-8ac8f67da8b8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 153 + } + }, + "source": [ + "# OU --> Neste caso, estamos reshaping o array em 8 linhas e 2 colunas\n", + "a_conjunto1.reshape(8, -1)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[9, 9],\n", + " [3, 9],\n", + " [2, 9],\n", + " [1, 5],\n", + " [3, 1],\n", + " [9, 4],\n", + " [8, 2],\n", + " [4, 3]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yvTnrszn8Yk0" + }, + "source": [ + "Porque agora deu certo?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LeQ9LqIE8baG" + }, + "source": [ + "## Último exemplo com reshape\n", + "> Considere o array a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OQOC9iiN8hZT" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.random.randn(2, 3)\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cvce8qBl9Cvq" + }, + "source": [ + "Queremos agora transformá-la num array de 3 linhas e 2 colunas." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QDDsYoVt9Klz" + }, + "source": [ + "a_conjunto1.reshape(-1, 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AdwU5ygt9Svq" + }, + "source": [ + "Poderia ser..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5uBeokKc9Uo-" + }, + "source": [ + "a_conjunto1.reshape(3, -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OeRBsobc9aKj" + }, + "source": [ + "E por fim, também poderia ser..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MDt8UYYH9dBw" + }, + "source": [ + "a_conjunto1.reshape(3, 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "91o5vycQfdKW" + }, + "source": [ + "___\n", + "# **Transposta**\n", + "* O array a_conjunto2 tem a seguinte forma:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RsZwyuhoffjb" + }, + "source": [ + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "A3MzTVoGfiyO" + }, + "source": [ + "# Transposta do array a_conjunto2 é dado por:\n", + "a_conjunto2.T" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ij-ZW5IyzXIb" + }, + "source": [ + "Ou seja, linha virou coluna. 
OK?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qLy6ajgpt3lU" + }, + "source": [ + "# **Inverse of a square matrix**\n", + "> If a matrix is non-singular, then its inverse exists.\n", + "\n", + "* If the determinant of a matrix is not equal to 0, then the matrix is non-singular." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-u7jRq34t9_x" + }, + "source": [ + "import numpy as np\n", + "\n", + "a_conjunto1 = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])\n", + "a_conjunto2 = np.array([[6, 2], [5, 3]])\n", + "a_conjunto3 = np.array([[1, 3, 5],[2, 5, 1],[2, 3, 8]])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7zmHHWWlvaYB" + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3fHKyhOJvcak" + }, + "source": [ + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vQG7yyfjwLg9" + }, + "source": [ + "a_conjunto3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qa2Yre2rwgRk" + }, + "source": [ + "## Determinants of the square matrices" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N6jwuC6twkyc" + }, + "source": [ + "np.linalg.det(a_conjunto1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QSvViNwzwnhI" + }, + "source": [ + "np.linalg.det(a_conjunto2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "o8jwsnccw5id" + }, + "source": [ + "np.linalg.det(a_conjunto3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kkVaTgzgw_XJ" + }, + "source": [ + "Next, we compute the inverses of the matrices defined above..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b9FgWvTYvpik" + }, + "source": [ + "np.linalg.inv(a_conjunto2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KsdEt1kIvsM_" + }, + "source": [ + "np.linalg.inv(a_conjunto1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VA_F7_7kccpn" + }, + "source": [ + "Why doesn't a_conjunto1 have a valid inverse?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ANPBCnmVwOf4" + }, + "source": [ + "np.linalg.inv(a_conjunto3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XAf9k1egxcdF" + }, + "source": [ + "# **Solve systems of linear equations**\n", + "> Consider the system of linear equations below:\n", + "\n", + "\\begin{equation}\n", + "x + 3y + 5z = 10\\\\\n", + "2x+ 5y + z = 8 \\\\\n", + "2x + 3y + 8z= 3\n", + "\\end{equation}\n", + "\n", + "Or $Ax = b$. The solution of this system of equations is given by $A^{-1}b$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oNf5nqaLxhBY" + }, + "source": [ + "In other words, we only need to find the inverse of A and multiply it by b." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "omzC5dGA0btc" + }, + "source": [ + "A= np.array([[1, 3, 5], [2, 5, 1], [2, 3, 8]])\n", + "np.linalg.inv(A)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AiXI3oxB05iE" + }, + "source": [ + "Now just multiply the inverse matrix $A^{-1}$ above by b. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XoGebKDa2Fcd" + }, + "source": [ + "A_Inv = np.linalg.inv(A)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sKaP0a1QZG-P" + }, + "source": [ + "b= np.array([10, 8, 3]).reshape(3, -1)\n", + "b" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3dAVq8dg19VI" + }, + "source": [ + "A_Inv.dot(b)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zso6hTnB17cm" + }, + "source": [ + "An easier (and numerically more stable) way to do this is to call np.linalg.solve, as below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ptQHIVll1E4P" + }, + "source": [ + "b= np.array([[10], [8], [3]])\n", + "b" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "X4VL8lyY1Xus" + }, + "source": [ + "np.linalg.solve(A, b)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fJKmwTS59-Bc" + }, + "source": [ + "# **Stack arrays**\n", + "\n", + "## Example 1\n", + "\n", + "![Empilhar1](https://github.com/MathMachado/Materials/blob/master/Empilhar1.PNG?raw=true)\n", + "\n", + "## Example 2\n", + "\n", + "![Empilhar2](https://github.com/MathMachado/Materials/blob/master/Empilhar2.PNG?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rhPTt3EwXden" + }, + "source": [ + "## Generate the arrays for Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zEI-yBy3-E46" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.random.randn(5, 8)\n", + "\n", + "np.random.seed(19741120)\n", + "a_conjunto2 = np.random.randn(8, 8)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UYsAqBRp--79" + }, + "source": [ + "## Method 1 - np.concatenate([A, B])" + ] + }, + { 
+ "cell_type": "code", + "metadata": { + "id": "HgO1ujvhObyE", + "outputId": "c40e7ed9-255b-4886-dddf-3b17f2b1be2f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632],\n", + " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n", + " -0.33923852, 0.43613107, 0.59135489],\n", + " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n", + " -0.90712825, -1.02291108, -1.36445713],\n", + " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n", + " -1.41492946, -0.2159062 , -1.16519474],\n", + " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n", + " 1.83494644, 0.34728874, -1.14671885]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2aQY_klZOeg9", + "outputId": "14eb3d9c-d0fc-4b6a-fe19-1790695c838f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 289 + } + }, + "source": [ + "a_conjunto2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n", + " -0.75255725, -2.1529949 , -0.33017773],\n", + " [-1.10465731, 0.32889675, 0.01010198, -1.33213633, -0.33945805,\n", + " -0.01299007, 0.05342823, -0.18641201],\n", + " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n", + " -1.20536871, 1.20184886, 0.51160897],\n", + " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n", + " 2.29956497, 0.16657022, 0.71357415],\n", + " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n", + " 1.25326 , -0.37039248, 1.43855202],\n", + " [ 0.85646091, -0.11257239, 
-0.35400297, 0.94136671, -0.08696163,\n", + " -1.49000701, 0.00848666, 0.86705275],\n", + " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n", + " -0.04716069, -2.27337435, 0.95318738],\n", + " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n", + " -0.29760341, -0.73424207, -0.55703223]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 34 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bK70vaq8_KMH", + "outputId": "f6d400cf-4b54-4990-815b-052f5224aadd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 459 + } + }, + "source": [ + "np.concatenate([a_conjunto1, a_conjunto2], axis = 0) # axis=0 tells NumPy to stack along the rows" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632],\n", + " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n", + " -0.33923852, 0.43613107, 0.59135489],\n", + " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n", + " -0.90712825, -1.02291108, -1.36445713],\n", + " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n", + " -1.41492946, -0.2159062 , -1.16519474],\n", + " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n", + " 1.83494644, 0.34728874, -1.14671885],\n", + " [-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n", + " -0.75255725, -2.1529949 , -0.33017773],\n", + " [-1.10465731, 0.32889675, 0.01010198, -1.33213633, -0.33945805,\n", + " -0.01299007, 0.05342823, -0.18641201],\n", + " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n", + " -1.20536871, 1.20184886, 0.51160897],\n", + " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n", + " 2.29956497, 0.16657022, 0.71357415],\n", + " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n", + " 1.25326 
, -0.37039248, 1.43855202],\n", + " [ 0.85646091, -0.11257239, -0.35400297, 0.94136671, -0.08696163,\n", + " -1.49000701, 0.00848666, 0.86705275],\n", + " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n", + " -0.04716069, -2.27337435, 0.95318738],\n", + " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n", + " -0.29760341, -0.73424207, -0.55703223]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CpaXBkm8_BF8" + }, + "source": [ + "## Method 2 - np.r_[A, B]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3QnVUzAY_teZ", + "outputId": "e8adfd85-e760-40f5-d9ac-48353d24ccd2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 459 + } + }, + "source": [ + "np.r_[a_conjunto1, a_conjunto2]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632],\n", + " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n", + " -0.33923852, 0.43613107, 0.59135489],\n", + " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n", + " -0.90712825, -1.02291108, -1.36445713],\n", + " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n", + " -1.41492946, -0.2159062 , -1.16519474],\n", + " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n", + " 1.83494644, 0.34728874, -1.14671885],\n", + " [-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n", + " -0.75255725, -2.1529949 , -0.33017773],\n", + " [-1.10465731, 0.32889675, 0.01010198, -1.33213633, -0.33945805,\n", + " -0.01299007, 0.05342823, -0.18641201],\n", + " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n", + " -1.20536871, 1.20184886, 0.51160897],\n", + " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n", + " 
2.29956497, 0.16657022, 0.71357415],\n", + " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n", + " 1.25326 , -0.37039248, 1.43855202],\n", + " [ 0.85646091, -0.11257239, -0.35400297, 0.94136671, -0.08696163,\n", + " -1.49000701, 0.00848666, 0.86705275],\n", + " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n", + " -0.04716069, -2.27337435, 0.95318738],\n", + " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n", + " -0.29760341, -0.73424207, -0.55703223]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XmSPbDP6_20W" + }, + "source": [ + "**Note**: I prefer this method!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dzVKW_wX_Dzw" + }, + "source": [ + "## Method 3 - np.vstack([A, B]) = np.r_[A, B]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uL7lEN_mABID", + "outputId": "d1ea4d86-2cc1-4e2d-af72-b3a292ef15fd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 459 + } + }, + "source": [ + "np.vstack([a_conjunto1, a_conjunto2])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n", + " 1.04930857, -0.12607366, 1.06227632],\n", + " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n", + " -0.33923852, 0.43613107, 0.59135489],\n", + " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n", + " -0.90712825, -1.02291108, -1.36445713],\n", + " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n", + " -1.41492946, -0.2159062 , -1.16519474],\n", + " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n", + " 1.83494644, 0.34728874, -1.14671885],\n", + " [-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n", + " -0.75255725, -2.1529949 , -0.33017773],\n", + " [-1.10465731, 0.32889675, 
0.01010198, -1.33213633, -0.33945805,\n", + " -0.01299007, 0.05342823, -0.18641201],\n", + " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n", + " -1.20536871, 1.20184886, 0.51160897],\n", + " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n", + " 2.29956497, 0.16657022, 0.71357415],\n", + " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n", + " 1.25326 , -0.37039248, 1.43855202],\n", + " [ 0.85646091, -0.11257239, -0.35400297, 0.94136671, -0.08696163,\n", + " -1.49000701, 0.00848666, 0.86705275],\n", + " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n", + " -0.04716069, -2.27337435, 0.95318738],\n", + " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n", + " -0.29760341, -0.73424207, -0.55703223]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "68icJ-2ZAdRj" + }, + "source": [ + "# Concatenate arrays\n", + "\n", + "## Example 1\n", + "\n", + "![Concatenar1](https://github.com/MathMachado/Materials/blob/master/Concatenar1.PNG?raw=true)\n", + "\n", + "## Example 2\n", + "\n", + "![Concatenar2](https://github.com/MathMachado/Materials/blob/master/Concatenar2.PNG?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OplgK9YoQi9o" + }, + "source": [ + "## Concatenate the elements of two arrays - np.c_[A, B]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lpdsbTEKQ9EY" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.random.randint(0, 10, 100).reshape(-1, 10)\n", + "a_conjunto2 = np.random.randint(0, 2, 10).reshape(-1, 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JPxhGsaSSMk2", + "outputId": "47727fe9-05f1-4ff7-ec0a-04579120cf78", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + 
"outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[8, 8, 2, 8, 9, 1, 8, 0, 4, 2],\n", + " [0, 8, 9, 3, 7, 1, 3, 2, 9, 7],\n", + " [7, 9, 5, 6, 8, 7, 0, 9, 3, 9],\n", + " [3, 1, 8, 6, 3, 5, 4, 1, 2, 9],\n", + " [8, 6, 6, 1, 0, 9, 2, 0, 7, 5],\n", + " [5, 4, 4, 2, 7, 2, 7, 9, 3, 1],\n", + " [5, 0, 1, 2, 3, 8, 7, 5, 4, 0],\n", + " [5, 9, 6, 6, 1, 3, 6, 0, 4, 9],\n", + " [2, 1, 0, 9, 1, 4, 2, 9, 7, 9],\n", + " [5, 3, 7, 6, 3, 9, 8, 4, 3, 0]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9ZyUPfybTfej", + "outputId": "ac27a20e-1622-4cb9-d6f6-74ee467bdb72", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "a_conjunto2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[1],\n", + " [0],\n", + " [0],\n", + " [0],\n", + " [0],\n", + " [1],\n", + " [0],\n", + " [0],\n", + " [0],\n", + " [1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nS1cPG3aRug1", + "outputId": "c70cf891-ae8f-445d-c271-c6b7f7da1738", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "# Place the array a_conjunto2 side by side with a_conjunto1 (as an extra column).\n", + "np.c_[a_conjunto1, a_conjunto2]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[8, 8, 2, 8, 9, 1, 8, 0, 4, 2, 1],\n", + " [0, 8, 9, 3, 7, 1, 3, 2, 9, 7, 0],\n", + " [7, 9, 5, 6, 8, 7, 0, 9, 3, 9, 0],\n", + " [3, 1, 8, 6, 3, 5, 4, 1, 2, 9, 0],\n", + " [8, 6, 6, 1, 0, 9, 2, 0, 7, 5, 0],\n", + " [5, 4, 4, 2, 7, 2, 7, 9, 3, 1, 1],\n", + " [5, 0, 1, 2, 3, 8, 7, 5, 4, 0, 0],\n", + " [5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 0],\n", + " [2, 1, 0, 9, 1, 4, 2, 9, 7, 9, 0],\n", + " [5, 3, 7, 6, 3, 9, 8, 4, 3, 0, 1]])" + ] + }, + 
"metadata": { + "tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kIgU1YBw0OeM" + }, + "source": [ + "___\n", + "# **Select items that satisfy conditions**\n", + "> Consider the following array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e2pL5anBV0DI", + "outputId": "f37cd827-ee00-49ba-994d-77cab3a24421", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_conjunto1 = np.arange(10, 0, -1)\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 42 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i9HuZZAfV302" + }, + "source": [ + "Select only the items > 7:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZCESvr7iXMkV" + }, + "source": [ + "## Using np.where()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BdrAQLHkTS-v", + "outputId": "44a6e480-1b6c-4dad-ee29-2fcb4ada5097", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 45 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O_ZBaWxfWA9o", + "outputId": "fae44244-ff29-4b04-cd2d-a4c768487e75", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Indices of the array that satisfy the condition\n", + "l_indices = np.where(a_conjunto1 > 7)\n", + "l_indices" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(array([0, 1, 2]),)" + ] + }, + "metadata": { + "tags": 
[] + }, + "execution_count": 44 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EdWlfPOZWPME" + }, + "source": [ + "**Note**: This captures the indices, not the items. To select the items themselves, index the array with them:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tOxs3iYQWWxu", + "outputId": "b402fdfd-c6e0-4170-b35c-c7c5cd2ca85e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_conjunto2 = a_conjunto1[l_indices]\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 46 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PGsENqkaXRjh" + }, + "source": [ + "## Alternative: using []" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YbdRNk1WXTLT", + "outputId": "062b157c-00fb-4f8f-d207-a0c8e9871e48", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_conjunto1[a_conjunto1 > 7]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 9, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 47 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jijpzFxcSQC8" + }, + "source": [ + "It is worth breaking this solution down to better understand how it works:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rujhP2LQSWsq" + }, + "source": [ + "# First, evaluate the result of a_conjunto1 > 7:" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FYZaBsasSb3N", + "outputId": "0a190896-249c-4d7c-ea0d-a20a53536446", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_conjunto1 > 7" + ], + "execution_count": null, + "outputs": [ + { + "output_type": 
"execute_result", + "data": { + "text/plain": [ + "array([ True, True, True, False, False, False, False, False, False,\n", + " False])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 48 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mvEof-UKaaVG" + }, + "source": [ + "a_conjunto1[a_conjunto1 > 7]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nO4FiBmDUZOT", + "outputId": "9f54e601-d95a-444c-bd59-28947e332248", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-1, -1, -1, 7, 6, 5, 4, 3, 2, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 52 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ci5lT9nmSfsX" + }, + "source": [ + "With this result, it becomes easy to understand how NumPy selects the elements. Can you explain it?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1v5Lfin0GGKD" + }, + "source": [ + "# Replace items based on conditions\n", + "> Replace the negative values in the array below with 0." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CLY_u0ePWdN7" + }, + "source": [ + "## Generate the example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NUANFy-fNXf5" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.array(np.random.randint(0, 10, size = 100))\n", + "\n", + "# Random list of indices to be changed\n", + "np.random.seed(20111974)\n", + "l_indices= np.random.randint(0, 99, 9)\n", + "\n", + "for i in l_indices:\n", + " a_conjunto1[i] = -1*a_conjunto1[i]\n", + "\n", + "a_conjunto2 = a_conjunto1.copy()\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dWVyI40uN2d2" + }, + "source": [ + "# Indices that were multiplied by -1:\n", + "l_indices" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Whuu854OJDZ" + }, + "source": [ + "## Replace the negative values with 0" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sr268Rp8b-Se", + "outputId": "82514805-b350-45c4-a3fc-7cb24c847b7f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_conjunto2 < 0" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([False, False, False])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 50 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C-eKqPrfOQF6" + }, + "source": [ + "a_conjunto2[a_conjunto2 < 0] = 0\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eDLM0_JSZlfB" + }, + "source": [ + "Note above that the negative values were replaced with 0, as intended." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AEHJ0rA3dHHU" + }, + "source": [ + "## Replace the non-positive values with 0 and the positive values with 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y32J8SRNZwRF" + }, + "source": [ + "a_conjunto2 = a_conjunto1.copy()\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1bSD9Fs6P5wW" + }, + "source": [ + "a_conjunto2 = np.where(a_conjunto2 <= 0, 0, 1)\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i027scjl0qkm" + }, + "source": [ + "___\n", + "# Outliers\n", + "> Any point/observation that is unusual when compared with all the other points/observations." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UnDTqRnZHQ3W" + }, + "source": [ + "## Z-Score\n", + "\n", + "* The Z-Score can be used to detect outliers.\n", + "* It is the difference between the value and the sample mean, expressed as a number of standard deviations. \n", + "* If the z-score is below -2.5 or above 2.5, the value lies in the extreme tails of the distribution (for normally distributed data, roughly 1.2% of the values fall outside this range, about 0.6% in each tail). However, it is common practice to use 3 instead of 2.5.\n", + "\n", + "![Z_Score](https://github.com/MathMachado/Materials/blob/master/Z_Score.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N7gb2zhtd0uM" + }, + "source": [ + "## IQR Score\n", + "\n", + "* The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the 75th (Q3) and 25th (Q1) percentiles, that is, between the upper and lower quartiles: IQR = Q3 - Q1." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lMmWOKNvghI7" + }, + "source": [ + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DUw_a-MjWvBc" + }, + "source": [ + "### Challenge\n", + "> **Goal**: Randomly simulate the salaries of 1,000 people following a normal distribution with mean 1,045 and standard deviation 100, N(1045, 100). \n", + "* Identify the _outliers_ of the distribution we have just simulated;\n", + "* What is the mean of the simulated distribution?\n", + "* What is the standard deviation?\n", + "* Plot the boxplot of the data distribution;\n", + "* How many people have a salary > Q3 + 1.5*(Q3-Q1)?\n", + "* Replace the outliers in the array with:\n", + " * Q1-1.5*(Q3 - Q1), if the point < Q1-1.5*(Q3-Q1)\n", + " * Q3+1.5*(Q3 - Q1), if the point > Q3+1.5*(Q3-Q1)\n", + "\n", + "Note: use np.random.seed(20111974)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L9ntAdS_oOAh" + }, + "source": [ + "### Random generation of the array a_salarios with distribution $N(\\mu, \\sigma)$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RL0Zb0fyDory", + "outputId": "b66caa11-c8a4-4f9b-c5d0-23ace3f54a89", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "import numpy as np\n", + "np.random.seed(20111974)\n", + "np.set_printoptions(precision = 2, suppress = True)\n", + "\n", + "media = 1045\n", + "desvio_padrao = 100\n", + "i_tamanho = 1000\n", + "\n", + "a_salarios = np.array(np.random.normal(media, desvio_padrao, size = i_tamanho))\n", + "a_salarios[:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1295.63, 1156.44, 1250.57, 1101.48, 1074.9 , 1149.93, 1032.39,\n", + " 1151.23, 1158.81, 1182.97, 839. 
, 1112.47, 1117.72, 1011.08,\n", + " 1088.61, 1104.14, 915.72, 1162.71, 946.36, 865.97, 936.09,\n", + " 954.29, 942.71, 908.55, 1015.57, 1051.34, 930.8 , 994.29,\n", + " 961.46, 903.51])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 1 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fc3a-yhViCTs" + }, + "source": [ + "### Random generation of the indices that will be (manually) changed" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Iakt6i1cgEcB", + "outputId": "fc266c05-ffa5-457f-df79-1ac19df60878", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Random list of indices to be changed\n", + "np.random.seed(19741120)\n", + "l_indices = np.random.randint(0, i_tamanho, 10)\n", + "\n", + "# These are the positions that will be (manually) changed\n", + "np.sort(l_indices)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 14, 105, 208, 349, 484, 567, 615, 616, 622, 847])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oXwME1rciHkw" + }, + "source": [ + "### Copy of the salaries to compare BEFORE and AFTER" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BEtnua7sgp_y", + "outputId": "a581c437-ba16-4768-8129-49f7f67bbff6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "# Copies of the array a_salarios\n", + "a_salarios_copia = a_salarios.copy()\n", + "a_salarios_copia2 = a_salarios.copy()\n", + "\n", + "a_salarios[:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1295.63, 1156.44, 1250.57, 1101.48, 1074.9 , 1149.93, 1032.39,\n", + " 1151.23, 1158.81, 1182.97, 839. 
, 1112.47, 1117.72, 1011.08,\n", + " 1088.61, 1104.14, 915.72, 1162.71, 946.36, 865.97, 936.09,\n", + " 954.29, 942.71, 908.55, 1015.57, 1051.34, 930.8 , 994.29,\n", + " 961.46, 903.51])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "So8qj3Yrh-Az" + }, + "source": [ + "### Alteração (manual dos salários): 2 alternativas\n", + "> Vamos medir o tempo para avaliarmos o que é mais rápido. Qual solução é mais rápida?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z0613on8z5VH" + }, + "source": [ + "from timeit import default_timer as timer\n", + "from datetime import timedelta" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NpvvholVxMhs", + "outputId": "06c8c1ac-5a4c-42e8-a2c2-0840b15baf25", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Índices a serem alterados\n", + "l_indices" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([567, 14, 616, 484, 208, 105, 349, 615, 622, 847])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BqXsmMdm1yF-" + }, + "source": [ + "#### Solução 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FiiOrlnbgKOD", + "outputId": "728d7500-6870-4321-8f81-9b425d1fbcaa", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Alteração dos salários dos índices propostos\n", + "start = timer()\n", + "for i_indice in l_indices:\n", + " a_salarios_copia[i_indice] = 2*a_salarios[i_indice] # Loop para os índices a serem alterados (manualmente)\n", + "\n", + "a_salarios_copia[:30]\n", + "end = timer()\n", + "print(timedelta(seconds=end-start))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ 
+ "0:00:00.000118\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FgvKC-aFzWpZ" + }, + "source": [ + "#### Solução 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XWlQC5Jazt26", + "outputId": "3265351c-57b1-4bd4-864b-a614d9129d60", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "start = timer()\n", + "a_salarios_copia2[l_indices] = 2*a_salarios_copia2[l_indices] # Indexação vetorizada (fancy indexing): altera todos os índices de uma vez, sem loop\n", + "a_salarios_copia2[:30]\n", + "end = timer()\n", + "\n", + "print(timedelta(seconds=end-start))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "0:00:00.000251\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U92w03afhrmC" + }, + "source": [ + "### Compare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ls-jCFCYhtD8", + "outputId": "8b1dff92-91e3-4cc0-ecae-3c337dd53799", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Antes\n", + "a_salarios[l_indices]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 826.43, 1088.61, 1121.95, 833.96, 1165.97, 1081.13, 1078.51,\n", + " 1094.67, 904.32, 1128.66])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nwwU06OahzD2", + "outputId": "b5b5ca1f-a355-4604-96c5-a7d1736c8fc6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Depois\n", + "a_salarios_copia[l_indices]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1652.85, 2177.23, 2243.89, 1667.93, 2331.93, 2162.26, 2157.02,\n", + " 2189.34, 1808.63, 2257.32])" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 9 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qyUUdHmtisJS", + "outputId": "9f926c6b-d17e-4bb8-ac0e-60008a40b2c4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "# 30 primeiros elementos de a_salarios\n", + "a_salarios[:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1295.63, 1156.44, 1250.57, 1101.48, 1074.9 , 1149.93, 1032.39,\n", + " 1151.23, 1158.81, 1182.97, 839. , 1112.47, 1117.72, 1011.08,\n", + " 1088.61, 1104.14, 915.72, 1162.71, 946.36, 865.97, 936.09,\n", + " 954.29, 942.71, 908.55, 1015.57, 1051.34, 930.8 , 994.29,\n", + " 961.46, 903.51])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CJ1FEjlCi0-n", + "outputId": "caeb29f2-5c48-419e-c3fb-539dc1ded4fd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "# 30 primeiros elementos de a_salarios_copia\n", + "a_salarios_copia[:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1295.63, 1156.44, 1250.57, 1101.48, 1074.9 , 1149.93, 1032.39,\n", + " 1151.23, 1158.81, 1182.97, 839. 
, 1112.47, 1117.72, 1011.08,\n", + " 2177.23, 1104.14, 915.72, 1162.71, 946.36, 865.97, 936.09,\n", + " 954.29, 942.71, 908.55, 1015.57, 1051.34, 930.8 , 994.29,\n", + " 961.46, 903.51])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wKbSUgxxiOUL" + }, + "source": [ + "### Algumas Estatísticas Descritivas:\n", + "#### Antes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZnmykyahLWX9", + "outputId": "288c4212-b78d-42a0-9f2d-e3676b738654", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "f'Média: {np.mean(a_salarios)}; Mediana: {np.median(a_salarios)}; STD: {np.std(a_salarios)}'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Média: 1047.150212238584; Mediana: 1047.631166829137; STD: 101.18708333868835'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ow7MHjgmPIty" + }, + "source": [ + "#### Depois" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5iO-BAikieHJ", + "outputId": "3e3f1ec5-47d1-4954-8362-c3d8d266d07e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "f'Média: {np.mean(a_salarios_copia)}; Mediana: {np.median(a_salarios_copia)}; STD: {np.std(a_salarios_copia)}'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Média: 1057.4744151862524; Mediana: 1048.089607774499; STD: 144.64306489539533'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ILhNe80xW5C6" + }, + "source": [ + "### 
Solução do desafio" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OyFbWs-APowd", + "outputId": "bee45538-605a-4bc3-9508-2f3bd4b18883", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + } + }, + "source": [ + "# Import a biblioteca seaborn:\n", + "import seaborn as sns\n", + "\n", + "# Boxplot antes dos \"outliers\"\n", + "sns.boxplot(y = a_salarios)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAADrCAYAAACFMUa7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAReUlEQVR4nO3df2xd93nf8fdHEpJaATbLFOu5lDy5k5AtDbzAJRwDQ4dslm3ZaKygWwMbA8RlxuRijqyuAzpnBSoggYEWHRBYWmuAgwXLQObM+xFE2TQ7jDbM/8yp6cCV5cSp75w4FuEfLOUqw5SmlfXsDx4tNzQp8vJSvGTP+wVc3HOe8733PtcwPzz6nu/lTVUhSWqHDYNuQJK0egx9SWoRQ1+SWsTQl6QWMfQlqUUMfUlqkU2DbuBytm7dWjt27Bh0G5K0rrzwwgt/UlXD8x1b06G/Y8cOJicnB92GJK0rSV5f6JjTO5LUIouGfpKjSd5JcnqeY/8iSSXZ2uwnyeEknSSnktzUNXYsyavNbWxl34YkaSmWcqb/OLBnbjHJduB24Add5TuBXc1tP/BoM/Ya4BDwceBm4FCSLf00Lknq3aKhX1XPAmfnOfRF4DeB7j/esxd4omY9B1yd5DrgDmCiqs5W1bvABPP8IpEkXVnLmtNPsheYqqo/mnNoBHija/9MU1uoLq1LMzMzPPjgg8zMzAy6FaknPYd+ks3AvwJ+e+XbgST7k0wmmZyenr4SLyH17dixY7z00ks88cQTg25F6slyzvT/BnAD8EdJvg9sA76V5K8BU8D2rrHbmtpC9fepqvGqGq2q0eHheZeZSgM1MzPD008/TVXx9NNPe7avdaXn0K+ql6rqZ6tqR1XtYHaq5qaqegs4DuxrVvHcApyrqjeBZ4Dbk2xpLuDe3tSkdefYsWNcvHgRgPfee8+zfa0rS1my+STwv4APJzmT5L7LDD8BvAZ0gH8L/DOAqjoLfAF4vrl9vqlJ6843vvENLly4AMCFCxeYmJgYcEfS0i36idyquneR4zu6tgt4YIFxR4GjPfYnrTm7d+/mxIkTXLhwgU2bNnHbbbcNuiVpyfxErtSjsbExNmyY/dHZuHEj+/btG3BH0tIZ+lKPhoaG2LNnD0nYs2cPQ0NDg25JWjJDX1qGu+++m82bN/PJT35y0K1IPTH0pWU4fvw458+f52tf+9qgW5F6YuhLPXKdvtYzQ1/qkev0tZ4Z+lKPXKev9czQl3q0e/duNm2a/YiL6/S13hj6Uo9cp6/1zNCXeuQ6fa1na/qL0aW1amxsjO9///ue5WvdMfSlZRgaGuLw4cODbkPqmdM7
ktQinulryY4cOUKn0xl0G2vC1NTsdwCNjPitnwA7d+7kwIEDg25DS2DoS8vwox/9aNAtSMti6GvJPJP7iYMHDwLwyCOPDLgTqTfO6UtSixj6ktQihr4ktYihL0ktsmjoJzma5J0kp7tqX0hyKsmLSb6e5OeaepIcTtJpjt/U9ZixJK82t7Er83YkSZezlDP9x4E9c2q/V1U3VtXHgP8C/HZTvxPY1dz2A48CJLkGOAR8HLgZOJRkS9/dS5J6smjoV9WzwNk5tR927X4IqGZ7L/BEzXoOuDrJdcAdwERVna2qd4EJ3v+LRJJ0hS17nX6Sh4F9wDng7zXlEeCNrmFnmtpCdUnSKlr2hdyq+q2q2g58CfjsSjWUZH+SySST09PTK/W0kiRWZvXOl4B/0GxPAdu7jm1ragvV36eqxqtqtKpGh4eHV6A9SdIlywr9JLu6dvcCrzTbx4F9zSqeW4BzVfUm8Axwe5ItzQXc25uaJGkVLTqnn+RJ4BPA1iRnmF2Fc1eSDwMXgdeBX2uGnwDuAjrAeeAzAFV1NskXgOebcZ+vqp+6OCxJuvIWDf2qunee8mMLjC3ggQWOHQWO9tSdJGlF+YlcSWoRQ1+SWsTQl6QWMfQlqUUMfUlqEUNfklrE0JekFjH0JalFDH1JahFDX5JaxNCXpBYx9CWpRQx9SWoRQ1+SWsTQl6QWMfQlqUUMfUlqEUNfklrE0JekFjH0JalFDH1JapFFQz/J0STvJDndVfu9JK8kOZXkK0mu7jr2uSSdJN9NckdXfU9T6yR5aOXfiiRpMUs5038c2DOnNgF8tKpuBP4Y+BxAko8A9wC/0DzmD5JsTLIR+H3gTuAjwL3NWEnSKlo09KvqWeDsnNrXq+pCs/scsK3Z3gt8uap+XFXfAzrAzc2tU1WvVdWfA19uxkqSVtFKzOn/E+C/NdsjwBtdx840tYXqkqRV1FfoJ/kt4ALwpZVpB5LsTzKZZHJ6enqlnlaSRB+hn+QfA78M/KOqqqY8BWzvGratqS1Uf5+qGq+q0aoaHR4eXm57kqR5LCv0k+wBfhO4u6rOdx06DtyT5INJbgB2AX8IPA/sSnJDkg8we7H3eH+tS5J6tWmxAUmeBD4BbE1yBjjE7GqdDwITSQCeq6pfq6qXkzwFfJvZaZ8Hquq95nk+CzwDbASOVtXLV+D9SJIuY9HQr6p75yk/dpnxDwMPz1M/AZzoqTtJ0oryE7mS1CKGviS1iKEvSS1i6EtSixj6ktQihr4ktYihL0ktYuhLUosY+pLUIoa+JLWIoS9JLWLoS1KLGPqS1CKGviS1iKEvSS1i6EtSixj6ktQihr4ktciiX5fYdkeOHKHT6Qy6Da0xl/6fOHjw4IA70Vqzc+dODhw4MOg2FmToL6LT6fDi6e/w3uZrBt2K1pANf14AvPDa2wPuRGvJxvNnB93Cogz9JXhv8zX86G/eNeg2JK1xV71yYtAtLGrROf0kR5O8k+R0V+1Xk7yc5GKS0TnjP5ekk+S7Se7oqu9pap0kD63s25AkLcVSLuQ+DuyZUzsN/ArwbHcxyUeAe4BfaB7zB0k2JtkI/D5wJ/AR4N5mrCRpFS06vVNVzybZMaf2HYAkc4fvBb5cVT8GvpekA9zcHOtU1WvN477cjP12P81Lknqz0ks2R4A3uvbPNLWF6u+TZH+SySST09PTK9yeJLXbmlunX1XjVTVaVaPDw8ODbkeS/lJZ6dU7U8D2rv1tTY3L1CVJq2Slz/SPA/ck+WCSG4BdwB8CzwO7ktyQ5APMXuw9vsKvLUlaxKJn+kmeBD4BbE1yBjgEnAWOAMPAf03yYlXdUVUvJ3mK2Qu0F4AHquq95nk+CzwDbASOVtXLV+INSZIWtpTVO/cucOgrC4x/GHh4nvoJYO1/ckGS/hJbcxdyJUlXjqEvSS1i6EtSixj6ktQihr4ktYihL0ktYuhLUosY+pLUIoa+JLWIX5e4iKmpKTaeP7cuvgZN0mBtPD/D1NSF
QbdxWZ7pS1KLeKa/iJGREd768Sa/GF3Soq565QQjI9cOuo3L8kxfklrE0JekFjH0JalFDH1JahFDX5JaxNCXpBYx9CWpRQx9SWqRRUM/ydEk7yQ53VW7JslEkleb+y1NPUkOJ+kkOZXkpq7HjDXjX00ydmXejiTpcpZypv84sGdO7SHgZFXtAk42+wB3Arua237gUZj9JQEcAj4O3AwcuvSLQpK0ehYN/ap6Fjg7p7wXONZsHwM+1VV/omY9B1yd5DrgDmCiqs5W1bvABO//RSJJusKWO6d/bVW92Wy/BVz6YxMjwBtd4840tYXqkqRV1PeF3KoqoFagFwCS7E8ymWRyenp6pZ5WksTyQ//tZtqG5v6dpj4FbO8at62pLVR/n6oar6rRqhodHh5eZnuSpPksN/SPA5dW4IwBX+2q72tW8dwCnGumgZ4Bbk+ypbmAe3tTkyStokX/nn6SJ4FPAFuTnGF2Fc7vAE8luQ94Hfh0M/wEcBfQAc4DnwGoqrNJvgA834z7fFXNvTgsSbrCFg39qrp3gUO3zjO2gAcWeJ6jwNGeupMkrSg/kStJLWLoS1KL+B25S7Dx/FmueuXEoNvQGrLhz34IwMWf+SsD7kRrycbzZ/nJx5bWJkN/ETt37hx0C1qDOp3/A8DOn1/bP+Babdeu+cww9Bdx4MCBQbegNejgwYMAPPLIIwPuROqNc/qS1CKGviS1iKEvSS1i6EtSixj6ktQihr4ktYihL0ktYuhLUosY+pLUIoa+JLWIoS9JLWLoS1KLGPqS1CKGviS1iKEvSS1i6EtSi/QV+kkOJjmd5OUkv97UrkkykeTV5n5LU0+Sw0k6SU4luWkl3oAkaemWHfpJPgr8U+Bm4G8Dv5xkJ/AQcLKqdgEnm32AO4FdzW0/8GgffUuSlqGfM/2/BXyzqs5X1QXgfwK/AuwFjjVjjgGfarb3Ak/UrOeAq5Nc18frS5J61E/onwZ+KclQks3AXcB24NqqerMZ8xY/+Wr4EeCNrsefaWqSpFWy7C9Gr6rvJPld4OvA/wVeBN6bM6aSVC/Pm2Q/s9M/XH/99cttT5I0j74u5FbVY1X1i1X1d4F3gT8G3r40bdPcv9MMn2L2XwKXbGtqc59zvKpGq2p0eHi4n/YkSXP0u3rnZ5v765mdz/93wHFgrBkyBny12T4O7GtW8dwCnOuaBpIkrYJlT+80/lOSIeAvgAeq6k+T/A7wVJL7gNeBTzdjTzA7798BzgOf6fO1JUk96iv0q+qX5qnNALfOUy/ggX5eT5LUHz+RK0ktYuhLUosY+pLUIoa+JLWIoS9JLWLoS1KLGPqS1CKGviS1iKEvSS1i6EtSixj6ktQihr4ktYihL0ktYuhLUosY+pLUIoa+JLWIoS9JLWLoS1KLGPqS1CKGviS1iKEvSS3SV+gn+edJXk5yOsmTSX4myQ1Jvpmkk+TfJ/lAM/aDzX6nOb5jJd6AJGnplh36SUaAB4HRqvoosBG4B/hd4ItVtRN4F7ivech9wLtN/YvNOEnSKup3emcTcFWSTcBm4E3g7wP/sTl+DPhUs7232ac5fmuS9Pn6kqQeLDv0q2oK+NfAD5gN+3PAC8CfVtWFZtgZYKTZHgHeaB57oRk/NPd5k+xPMplkcnp6erntSZLm0c/0zhZmz95vAH4O+BCwp9+Gqmq8qkaranR4eLjfp5Mkdelnemc38L2qmq6qvwD+M/B3gKub6R6AbcBUsz0FbAdojv9VYKaP15ck9aif0P8BcEuSzc3c/K3At4H/AfzDZswY8NVm+3izT3P8v1dV9fH6kqQe9TOn/01mL8h+C3ipea5x4F8Cv5Gkw+yc/WPNQx4Dhpr6bwAP9dG3JGkZNi0+ZGFVdQg4NKf8GnDzPGP/DPjVfl5PktQfP5ErSS1i6EtSixj6ktQihr4ktYihL0kt0tfqHbXLkSNH6HQ6g25jTbj03+HgwYMD7mRt2LlzJwcOHBh0G1oCQ19ahquuumrQLUjL
YuhryTyTk9Y/5/QlqUUMfUlqEUNfklrE0JekFjH0pWWYmZnhwQcfZGbGr4TQ+mLoS8swPj7OqVOnGB8fH3QrUk8MfalHMzMzTExMADAxMeHZvtYVQ1/q0fj4OBcvXgTg4sWLnu1rXTH0pR6dPHnysvvSWmboSz2a+9XOftWz1hNDX+rRrbfe+lP7u3fvHlAnUu8MfalH999/Pxs2zP7obNiwgf379w+4I2nplh36ST6c5MWu2w+T/HqSa5JMJHm1ud/SjE+Sw0k6SU4luWnl3oa0eoaGhv7/2f1tt93G0NDQgDuSlm7ZoV9V362qj1XVx4BfBM4DXwEeAk5W1S7gZLMPcCewq7ntBx7tp3FpkO6//35uvPFGz/K17qzU9M6twP+uqteBvcCxpn4M+FSzvRd4omY9B1yd5LoVen1pVQ0NDXH48GHP8rXurFTo3wM82WxfW1VvNttvAdc22yPAG12POdPUJEmrpO/QT/IB4G7gP8w9VrNr2Xpaz5Zkf5LJJJPT09P9tidJ6rISZ/p3At+qqreb/bcvTds09+809Slge9fjtjW1n1JV41U1WlWjw8PDK9CeJOmSlQj9e/nJ1A7AcWCs2R4DvtpV39es4rkFONc1DSRJWgXp59OEST4E/AD4+ao619SGgKeA64HXgU9X1dkkAf4NsIfZlT6fqarJRZ5/unkOaS3aCvzJoJuQ5vHXq2reqZK+Ql9qsySTVTU66D6kXviJXElqEUNfklrE0JeWzz+kr3XHOX1JahHP9CWpRQx9SWoRQ1+SWsTQl6QWMfQlqUX+Hw7RzvrAsf4dAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U993i1GJg2hk", + "outputId": "6f8719c1-df46-42d4-f816-b3a1ada3d7b5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 271 + } + }, + "source": [ + "# Boxplot do array a_salarios_copia depois dos \"outliers\"\n", + "sns.boxplot(y = a_salarios_copia)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAADtCAYAAABTaKWmAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAWIklEQVR4nO3df4zcdZ3H8eeL3QUBT2m3a8Vtua1OvQv+OCUjkBhyKG3ZEmP9485gLnZQco0CpRgSA9jYAHrx1EhoT0l6oWF7IXCc4tmEdsvW846YXLFbDigtaCdQaNdS1imgpgjd7fv+mE9xWPbHzOx2Zybf1yOZ9Pt9fz8z8/4S9jWffL/fma8iAjMzy4bTGt2AmZnNHoe+mVmGOPTNzDLEoW9mliEOfTOzDHHom5llyJShL2mhpF9I2idpr6Q1Y7bfKCkkzUvrkrReUlHSk5IuqBhbkLQ/PQozvztmZjaZ9irGjAA3RsRjkv4C2C1pICL2SVoILANeqBi/HFicHhcBdwEXSZoLrAPyQKTX2RIRL8/g/piZ2SSmDP2IOAwcTst/kPQ00A3sA+4Avg78rOIpK4DNUf7W105J50g6F7gUGIiIowCSBoBe4L6J3nvevHnR09NTx26ZmWXX7t27fxcRXeNtq2am/yZJPcDHgUclrQCGIuIJSZXDuoGDFeuHUm2i+tj3WAWsAjjvvPMYHByspUUzs8yT9PxE26o+kSvpncBPgBsoH/K5BfjmtLsbIyI2RkQ+IvJdXeN+UJmZWZ2qCn1JHZQD/96IeBD4ALAIeELSAWAB8Jik9wJDwMKKpy9ItYnqZmY2S6q5ekfA3cDTEfEDgIjYExHviYieiOihfKjmgoh4EdgCrExX8VwMvJrOC2wHlkmaI2kO5RPA20/NbpmZ2XiqOab/SeCLwB5Jj6faLRGxdYLxW4ErgCJwDPgSQEQclXQ7sCuNu+3kSV0zM5sd1Vy980tAU4zpqVgO4NoJxm0CNtXWolnzKZVK3Hrrraxbt47Ozs5Gt2NWNX8j16wOfX197Nmzh82bNze6FbOaOPTNalQqlejv7yci6O/vp1QqNbols6o59M1q1NfXx4kTJwAYHR31bN9aikPfrEY7duxgZGQEgJGREQYGBhrckVn1HPpmNVqyZAnt7eVrINrb21m6dGmDOzKrnkPfrEaFQoHTTiv/6bS1tbFy5coGd2RWPYe+WY06Ozvp7e1FEr29vb5k01pKTT+4ZmZlhUKBAwcOeJZvLcczfTOzDHHom9XBX86yVuXQN6tR5Zeztm3b5i9nWUtx6JvVqK+vj+PHjwNw/Phxz/a
tpTj0zWo0MDBA+XcFISJ4+OGHG9yRWfUc+mY1mj9//qTrZs3MoW9WoyNHjky6btbMHPpmNVq6dCnlG8qBJJYtW9bgjsyqV83tEhdK+oWkfZL2SlqT6t+T9IykJyX9VNI5Fc+5WVJR0q8lXV5R7021oqSbTs0umZ1ahULhLb+94y9oWSupZqY/AtwYEecDFwPXSjofGAA+HBEfBX4D3AyQtl0JfAjoBX4kqU1SG/BDYDlwPvCFNNaspXR2dtLd3Q1Ad3e3f4bBWsqUoR8RhyPisbT8B+BpoDsiHo6IkTRsJ7AgLa8A7o+I1yPiOcr3yr0wPYoR8WxEvAHcn8aatZRSqcTQ0BAAQ0NDvk7fWkpNx/Ql9QAfBx4ds+nLwLa03A0crNh2KNUmqo99j1WSBiUNDg8P19Ke2azo6+t7y+/p+zp9ayVVh76kdwI/AW6IiN9X1L9B+RDQvTPRUERsjIh8ROS7urpm4iXNZpSv07dWVlXoS+qgHPj3RsSDFfWrgM8A/xAn/wpgCFhY8fQFqTZR3ayl+Dp9a2XVXL0j4G7g6Yj4QUW9F/g68NmIOFbxlC3AlZLOkLQIWAz8CtgFLJa0SNLplE/2bpm5XTGbHYcPH5503ayZVfN7+p8EvgjskfR4qt0CrAfOAAbSNcs7I+IrEbFX0gPAPsqHfa6NiFEASdcB24E2YFNE7J3RvTGbBR0dHbz++utvWTdrFVOGfkT8EtA4m7ZO8pxvA98ep751sueZtYI//vGPk66bNTN/I9esRj09PZOumzUzh75ZjdauXTvpulkzc+ib1SiXy705u+/p6SGXyzW2IbMaOPTN6rB27VrOPvtsz/Kt5VRz9Y6ZjZHL5XjooYca3YZZzTzTNzPLEIe+mVmGOPTNzDLEoW9mliEOfTOzDHHom5lliEPfzCxDHPpmZhni0DczyxCHvplZhlRz56yFkn4haZ+kvZLWpPpcSQOS9qd/56S6JK2XVJT0pKQLKl6rkMbvl1Q4dbtlZmbjqWamPwLcGBHnAxcD10o6H7gJ+HlELAZ+ntYBllO+ReJiYBVwF5Q/JIB1wEXAhcC6kx8UZmY2O6YM/Yg4HBGPpeU/AE8D3cAKoC8N6wM+l5ZXAJujbCdwjqRzgcuBgYg4GhEvAwNA74zujZmZTaqmY/qSeoCPA48C8yPi5B2hXwTmp+Vu4GDF0w6l2kR1MzObJVWHvqR3Aj8BboiI31dui4gAYiYakrRK0qCkweHh4Zl4STMzS6oKfUkdlAP/3oh4MJWPpMM2pH9fSvUhYGHF0xek2kT1t4iIjRGRj4h8V1dXLftiZmZTqObqHQF3A09HxA8qNm0BTl6BUwB+VlFfma7iuRh4NR0G2g4skzQnncBdlmpmZjZLqrlz1ieBLwJ7JD2earcA3wEekHQ18Dzw+bRtK3AFUASOAV8CiIijkm4HdqVxt0XE0RnZCzMzq4rKh+ObUz6fj8HBwUa3YWbWUiTtjoj8eNv8jVwzswxx6JuZZYhD38wsQxz6ZmYZ4tA3M8sQh76ZWYY49M3MMsShb2aWIQ59M7MMceibmWWIQ9/MLEMc+mZmGeLQNzPLEIe+mVmGOPTN6lAqlbj++usplUqNbsWsJg59szr09fWxZ88eNm/e3OhWzGpSze0SN0l6SdJTFbWPSdop6fF0E/MLU12S1ksqSnpS0gUVzylI2p8ehfHey6wVlEol+vv7iQj6+/s927eWUs1M/x6gd0ztu8CtEfEx4JtpHWA5sDg9VgF3AUiaC6wDLgIuBNal++SatZy+vj5OnDgBwOjoqGf71lKmDP2IeAQYey/bAN6Vlt8N/DYtrwA2R9lO4BxJ5wKXAwMRcTQiXgYGePsHiVlL2LFjByMjIwCMjIwwMDDQ4I7MqlfvMf0bgO9JOgh8H7g51buBgxXjDqXaRPW3kbQqHTIaHB4errM9s1NnyZIltLe3A9De3s7SpUsb3JFZ9eoN/a8CX4uIhcDXgLtnqqGI2BgR+YjId3V
1zdTLms2YQqHAaaeV/3Ta2tpYuXJlgzsyq169oV8AHkzL/0H5OD3AELCwYtyCVJuobtZyOjs76e3tRRK9vb10dnY2uiWzqtUb+r8F/jYtfxrYn5a3ACvTVTwXA69GxGFgO7BM0px0AndZqpm1pEKhwEc+8hHP8q3ltE81QNJ9wKXAPEmHKF+F84/AnZLagT9RvlIHYCtwBVAEjgFfAoiIo5JuB3alcbdFxNiTw2Yto7Ozk/Xr1ze6DbOaKSIa3cOE8vl8DA4ONroNM7OWIml3ROTH2+Zv5JqZZYhD38wsQxz6ZmYZ4tA3M8sQh76ZWYY49M3MMsShb1YH30TFWpVD36wOvomKtSqHvlmNSqUS27ZtIyLYtm2bZ/vWUhz6ZjXq6+t78/f0jx8/7tm+tRSHvlmNBgYGOPnzJRHBww8/3OCOzKrn0Der0fz58yddN2tmDn2zGh05cmTSdbNm5tA3q9HSpUuRBIAkli1b1uCOzKrn0DerUaFQoKOjA4COjg7fSMVaypShL2mTpJckPTWmvlrSM5L2SvpuRf1mSUVJv5Z0eUW9N9WKkm6a2d0wmz2Vt0tcvny5b5doLWXKO2cB9wD/Arx5XZqkTwErgL+JiNclvSfVzweuBD4EvA/YIemD6Wk/BJYCh4BdkrZExL6Z2hGz2VQoFDhw4IBn+dZypgz9iHhEUs+Y8leB70TE62nMS6m+Arg/1Z+TVOTPN00vRsSzAJLuT2Md+taSfLtEa1X1HtP/IHCJpEcl/Y+kT6R6N3CwYtyhVJuobmZms6iawzsTPW8ucDHwCeABSe+fiYYkrSLdaP28886biZc0M7Ok3pn+IeDBKPsVcAKYBwwBCyvGLUi1iepvExEbIyIfEfmurq462zMzs/HUG/r/CXwKIJ2oPR34HbAFuFLSGZIWAYuBXwG7gMWSFkk6nfLJ3i3Tbd7MzGoz5eEdSfcBlwLzJB0C1gGbgE3pMs43gEKUf4xkr6QHKJ+gHQGujYjR9DrXAduBNmBTROw9BftjZmaT0MkfjmpG+Xw+BgcHG92GmVlLkbQ7IvLjbfM3cs3MMsShb2aWIQ59M7MMceibmWWIQ9/MLEMc+mZmGeLQNzPLEIe+mVmGOPTNzDLEoW9mliEOfTOzDHHom5lliEPfzCxDHPpmZhni0DczyxCHvplZhkwZ+pI2SXop3SVr7LYbJYWkeWldktZLKkp6UtIFFWMLkvanR2Fmd8PMzKpRzUz/HqB3bFHSQmAZ8EJFeTnl++IuBlYBd6WxcynfZvEi4EJgnaQ502ncrJFKpRLXX389pVKp0a2Y1WTK0I+IR4Cj42y6A/g6UHm/xRXA5ijbCZwj6VzgcmAgIo5GxMvAAON8kJi1ir6+Pvbs2cPmzZsb3YpZTeo6pi9pBTAUEU+M2dQNHKxYP5RqE9XHe+1VkgYlDQ4PD9fTntkpVSqV6O/vJyLo7+/3bN9aSs2hL+ks4BbgmzPfDkTExojIR0S+q6vrVLyF2bT09fVx4sQJAEZHRz3bt5ZSz0z/A8Ai4AlJB4AFwGOS3gsMAQsrxi5ItYnqZi1nx44djIyMADAyMsLAwECDOzKrXs2hHxF7IuI9EdETET2UD9VcEBEvAluAlekqnouBVyPiMLAdWCZpTjqBuyzVzFrOJZdcMum6WTOr5pLN+4D/Bf5K0iFJV08yfCvwLFAE/hW4BiAijgK3A7vS47ZUM2s5ETH1ILMmpWb+Hzifz8fg4GCj2zB7iyuuuIJjx469uX7WWWexdevWBnZk9laSdkdEfrxt/kauWY2WLFlCW1sbAG1tbSxdurTBHZlVr73RDVjr2LBhA8VisdFtNNzx48cZHR0F4MSJE+zfv581a9Y0uKvGyuVyrF69utFtWBU80zerUUdHB+3t5fnS3Llz6ejoaHBHZtXzTN+q5pncn11zzTU8//zzbNy4kc7Ozka3Y1Y1z/TN6tDR0UEul3PgW8tx6JuZZYhD38wsQxz6ZmYZ4tA3M8sQh76
ZWYY49M3MMsShb2aWIQ59M7MMceibmWWIQ9/MLEMc+mZmGVLNnbM2SXpJ0lMVte9JekbSk5J+Kumcim03SypK+rWkyyvqvalWlHTTzO+KmZlNpZqZ/j1A75jaAPDhiPgo8BvgZgBJ5wNXAh9Kz/mRpDZJbcAPgeXA+cAX0lgzM5tFU4Z+RDwCHB1TezgiRtLqTmBBWl4B3B8Rr0fEc5TvlXthehQj4tmIeAO4P401M7NZNBPH9L8MbEvL3cDBim2HUm2i+ttIWiVpUNLg8PDwDLRnZmYnTSv0JX0DGAHunZl2ICI2RkQ+IvJdXV0z9bJmZsY07pwl6SrgM8BlERGpPAQsrBi2INWYpG5mZrOkrpm+pF7g68BnI+JYxaYtwJWSzpC0CFgM/ArYBSyWtEjS6ZRP9m6ZXutmZlarKWf6ku4DLgXmSToErKN8tc4ZwIAkgJ0R8ZWI2CvpAWAf5cM+10bEaHqd64DtQBuwKSL2noL9MTOzSUwZ+hHxhXHKd08y/tvAt8epbwW21tSdmZnNKH8j18wsQxz6ZmYZ4tA3M8uQui/ZzIoNGzZQLBYb3YY1mZP/T6xZs6bBnVizyeVyrF69utFtTMihP4ViscjjTz3N6FlzG92KNZHT3ih/NWX3s0ca3Ik1k7ZjR6ce1GAO/SqMnjWX1/76ika3YWZN7sxnmv8CRR/TNzPLEIe+mVmGOPTNzDLEoW9mliEOfTOzDPHVO1MYGhqi7dirLXFW3swaq+1YiaGhkakHNpBn+mZmGeKZ/hS6u7t58fV2X6dvZlM685mtdHfPb3Qbk/JM38wsQ6YMfUmbJL0k6amK2lxJA5L2p3/npLokrZdUlPSkpAsqnlNI4/dLKpya3TEzs8lUM9O/B+gdU7sJ+HlELAZ+ntYBllO+ReJiYBVwF5Q/JCjfcesi4EJg3ckPCjMzmz1Thn5EPAKM/RWhFUBfWu4DPldR3xxlO4FzJJ0LXA4MRMTRiHgZGODtHyRmZnaK1XtMf35EHE7LLwInz1x0Awcrxh1KtYnqbyNplaRBSYPDw8N1tmdmZuOZ9onciAggZqCXk6+3MSLyEZHv6uqaqZc1MzPqv2TziKRzI+JwOnzzUqoPAQsrxi1ItSHg0jH1/67zvWdd27Gj/nKWvcVpf/o9ACfe8a4Gd2LNpPx7+s19yWa9ob8FKADfSf/+rKJ+naT7KZ+0fTV9MGwH/qni5O0y4Ob62549uVyu0S1YEyoW/wBA7v3N/Qdus21+02fGlKEv6T7Ks/R5kg5RvgrnO8ADkq4Gngc+n4ZvBa4AisAx4EsAEXFU0u3ArjTutoho/lvMQFPf9swa5+RtEu+8884Gd2JWmylDPyK+MMGmy8YZG8C1E7zOJmBTTd2ZmdmM8jdyzcwyxKFvZpYhDn0zswxx6JuZZYhD38wsQxz6ZmYZ4tA3M8sQh76ZWYY49M3MMsShb2aWIQ59M7MMceibmWWIQ9/MLEMc+mZmGeLQNzPLEIe+mVmGTCv0JX1N0l5JT0m6T9I7JC2S9KikoqR/l3R6GntGWi+m7T0zsQNmZla9ukNfUjdwPZCPiA8DbcCVwD8Dd0REDngZuDo95Wrg5VS/I40zM7NZNN3DO+3AmZLagbOAw8CngR+n7X3A59LyirRO2n6ZJE3z/c3MrAZ1h35EDAHfB16gHPavAruBVyJiJA07BHSn5W7gYHruSBrfOfZ1Ja2SNChpcHh4uN72zMxsHNM5vDOH8ux9EfA+4Gygd7oNRcTGiMhHRL6rq2u6L2dmZhWmc3hnCfBcRAxHxHHgQeCTwDnpcA/AAmAoLQ8BCwHS9ncDpWm8v5mZ1Wg6of8CcLGks9Kx+cuAfcAvgL9LYwrAz9LylrRO2v5fERHTeH8zM6tR+9RDxhcRj0r6MfAYMAL8H7AReAi4X9K3Uu3u9JS7gX+TVASOUr7Sx1rIhg0bKBaLjW6jKZz877BmzZoGd9Iccrkcq1evbnQbVoW6Qx8gItYB68aUnwU
uHGfsn4C/n877mTWLjo4OXnnlFV577TXOPPPMRrdjVrVphb5li2dyf3bVVVfxyiuv8MYbb7Bx48ZGt2NWNf8Mg1mNisUiBw4cAODAgQM+5GUtxaFvVqNvfetbk66bNTOHvlmNTs7yJ1o3a2YOfbMa9fT0TLpu1swc+mY1Wrt27aTrZs3MoW9Wo1wu9+bsvqenh1wu19iGzGrg0Derw9q1azn77LM9y7eW4+v0zeqQy+V46KGHGt2GWc080zczyxCHvplZhjj0zcwyxKFvZpYhauaftJc0DDzf6D7MJjAP+F2jmzAbx19GxLi3Hmzq0DdrZpIGIyLf6D7MauHDO2ZmGeLQNzPLEIe+Wf189xRrOT6mb2aWIZ7pm5lliEPfzCxDHPpmZhni0DczyxCHvplZhvw/5I+5LV0j8I0AAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VtenLK1uK1Pi" + }, + "source": [ + "Consegue identificar os outliers do array?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e3sHuGVGFBdW" + }, + "source": [ + "## Objetivo\n", + "> Identificar e substituir os outliers pela mediana dos dados. \n", + "\n", + "* Como fazer isso?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSegPNKCI-dS" + }, + "source": [ + "### Siga os passos a seguir\n", + "1. Calcule estatísticas descritivas antes das transformações para avaliar o impacto;\n", + " * Calcule média, mediana e desvio-padrão dos dados originais;\n", + "2. Calcule os valores a seguir:\n", + " * Q1, Q3\n", + " * IQR = Q3-Q1\n", + " * lim_inferior_outlier = Q1-1.5\\*IQR\n", + " * lim_superior_outlier = Q3+1.5\\*IQR\n", + "3. Proceda à substituição:\n", + " * Se a_salarios_copia[i] < lim_inferior_outlier então a_salarios_copia[i]= Mediana\n", + " * Se a_salarios_copia[i] > lim_superior_outlier então a_salarios_copia[i]= Mediana\n", + "4. Calcule as estatísticas descritivas após as substituições e compare com os valores antes das transformações." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9DQ7YnWaFn4v" + }, + "source": [ + "### Minha solução\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RBXJbTeGLC7Q" + }, + "source": [ + "1. 
Estatísticas Descritivas antes das transformações:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QueKYn7MLG12", + "outputId": "75489f71-3f1e-4819-b5fe-21f134bf2b1e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "# Algumas estatísticas descritivas:\n", + "f'Média: {np.mean(a_salarios_copia)}; Mediana: {np.median(a_salarios_copia)}; STD: {np.std(a_salarios_copia)}'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Média: 1057.4744151862524; Mediana: 1048.089607774499; STD: 144.64306489539533'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oOBJ8INWL5fo" + }, + "source": [ + "Observe o quanto os dados ficaram distorcidos em relação aos valores originais: a média passou de 1047.15 para 1057.47 e o desvio-padrão, de 101.19 para 144.64." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MX-fJeh2MBTD" + }, + "source": [ + "2. 
Calcular Q1, Q3 e IQR" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JlsPiQeGMGeU" + }, + "source": [ + "# Com q em forma de lista, np.percentile devolve um array de 1 elemento; por isso usamos [0] adiante\n", + "Q1 = np.percentile(a_salarios_copia, q = [25]) # 1º quartil\n", + "Q3 = np.percentile(a_salarios_copia, q = [75]) # 3º quartil\n", + "Q2 = np.percentile(a_salarios_copia, q = [50]) # Mediana\n", + "p99 = np.percentile(a_salarios_copia, q = [99]) # Percentil 99 (usado no exercício)\n", + "p95 = np.percentile(a_salarios_copia, q = [95]) # Percentil 95 (usado no exercício)\n", + "\n", + "IQR = Q3-Q1 # Intervalo interquartil\n", + "lim_inferior_outlier = Q1-1.5*IQR\n", + "lim_superior_outlier = Q3+1.5*IQR" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VF2NJ3rCeI1_", + "outputId": "34a2097c-334f-472f-9fe0-7198ea827e47", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "f'Q1: {Q1}; Q3: {Q3}; lim_inferior_outlier: {lim_inferior_outlier}; lim_superior_outlier: {lim_superior_outlier}'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Q1: [974.41]; Q3: [1119.81]; lim_inferior: [756.33]; lim_superior: [1337.89]'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JjnwJ7HwMxcl" + }, + "source": [ + "3. 
Substituir\n", + "* Se a_salarios2[i] < lim_inferior_outlier --> a_salarios2[i] = Mediana\n", + "* Se a_salarios2[i] > lim_superior_outlier --> a_salarios2[i] = Mediana" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcAn-IwVfbcI" + }, + "source": [ + "a_salarios2 = a_salarios_copia.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M4UJY4vbRics", + "outputId": "8792d911-f1f7-4068-ba6e-d9e42674be7a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "Q2[0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1048.089607774499" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "J3SSE45oM9oh", + "outputId": "f6c222d1-3b09-4500-fbde-79a513097aa8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "a_salarios2[a_salarios2 < lim_inferior_outlier[0]] = Q2[0] # Outliers inferiores recebem a mediana\n", + "a_salarios2[a_salarios2 > lim_superior_outlier[0]] = Q2[0] # Outliers superiores recebem a mediana\n", + "a_salarios2[:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1295.63, 1156.44, 1250.57, 1101.48, 1074.9 , 1149.93, 1032.39,\n", + " 1151.23, 1158.81, 1182.97, 839. , 1112.47, 1117.72, 1011.08,\n", + " 1048.09, 1104.14, 915.72, 1162.71, 946.36, 865.97, 936.09,\n", + " 954.29, 942.71, 908.55, 1015.57, 1051.34, 930.8 , 994.29,\n", + " 961.46, 903.51])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VEGFio0Nfj7O" + }, + "source": [ + "4. 
Estatísticas Descritivas para avaliarmos o impacto das alterações na amostra:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gX1LZHFqfjFQ", + "outputId": "5749f50a-88b3-4bbe-8886-74db69c975d4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "# Algumas estatísticas descritivas - Depois do tratamento de outliers:\n", + "f'Média: {np.mean(a_salarios2)}; Mediana: {np.median(a_salarios2)}; STD: {np.std(a_salarios2)}'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Média: 1047.3019702056902; Mediana: 1048.089607774499; STD: 98.3265929249586'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cSXrg2PFSYKY", + "outputId": "771a880e-398f-4765-9d84-f93342b9c34d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "# Algumas estatísticas descritivas - Dados originais (antes da inserção dos outliers), para comparação:\n", + "f'Média: {np.mean(a_salarios)}; Mediana: {np.median(a_salarios)}; STD: {np.std(a_salarios)}'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Média: 1047.150212238584; Mediana: 1047.631166829137; STD: 101.18708333868835'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZVc6_nsGS_J2" + }, + "source": [ + "### Exercício: refaça a substituição e discuta com seus colegas de grupo o que acontece quando trocamos:\n", + "* Q2[0] pela média.\n", + "* Q2[0] pelo valor do percentil 95 ou do percentil 99." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-xnguZ7XgyvK", + "outputId": "95e3ea05-085a-4684-ccd6-deef149e8476", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + } + }, + "source": [ + "# Import a biblioteca seaborn:\n", + "import seaborn as sns\n", + "sns.boxplot(y = a_salarios2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAADrCAYAAACFMUa7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAO5klEQVR4nO3df6zddX3H8eer90YEzQa0d40WGCxt5tC4xTVIsriw8asQsxo3DWRJ7xxZY4Klwz82jMmaYEg0Lhpo1KQJDW3iYGSbsW5dsbBl/IVSFgJFUU5QpA3K9RZxWRW57Xt/3C/x7nJv7z333PZc/Dwfycn5nvf3c77nfQj3dT/9fL/nnlQVkqQ2rBp2A5KkM8fQl6SGGPqS1BBDX5IaYuhLUkMMfUlqyOiwGziVNWvW1MUXXzzsNiTpDeWxxx77cVWNzbVvRYf+xRdfzKFDh4bdhiS9oSR5br59Lu9IUkMMfUlqiKEvSQ0x9CWpIYa+tASTk5PccsstTE5ODrsVqS+GvrQEe/bs4cknn2Tv3r3DbkXqi6Ev9WlycpIDBw5QVRw4cMDZvt5QDH2pT3v27OHkyZMAnDhxwtm+3lAMfalPDz74IFNTUwBMTU1x8ODBIXckLZ6hL/XpqquuYnR0+sPso6OjXH311UPuSFo8Q1/q0/j4OKtWTf/ojIyMsGXLliF3JC3eiv7bO1pZdu7cSa/XG3YbK0ISAN761rdy++23D7mb4Vu/fj3btm0bdhtaBGf60hKsWrWKVatWsXbt2mG3IvXFmb4WzZncL23fvh2AO++8c8idSP1xpi9JDVkw9JPsTvJiksMzap9K8kSSx5N8Pcnbu3qS3JWk1+1/z4znjCd5pruNn563I0k6lcXM9O8BNs2qfbaq3l1Vvwf8K/B3Xf06YEN32wp8CSDJ+cAO4L3AZcCOJOcN3L0kqS8Lhn5VPQwcm1X76YyHbwGq294M7K1pjwDnJnkbcC1wsKqOVdVLwEFe/4tEknSaLflEbpI7gC3Ay8AfdeV1wPMzhh3pavPVJUln0JJP5FbVJ6vqQuDLwMeWq6EkW5McSnJoYmJiuQ4rSWJ5rt75MvCn3fZR4MIZ+y7oavPVX6eqdlXVxqraODY255e5S5KWaEmhn2TDjIebgae77X3Alu4qnsuBl6vqBeAB4Jok53UncK/papKkM2jBNf0k9wJXAGuSHGH6Kpzrk/w2cBJ4DvhoN3w/cD3QA44DHwGoqmNJPgU82o27var+38lhSdLpt2DoV9WNc5TvnmdsATfPs283sLuv7iRJy8pP5EpSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1
JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWrIgqGfZHeSF5McnlH7bJKnkzyR5CtJzp2x7xNJekm+k+TaGfVNXa2X5LblfyuSpIUsZqZ/D7BpVu0g8K6qejfwXeATAEkuBW4A3tk954tJRpKMAF8ArgMuBW7sxkqSzqAFQ7+qHgaOzap9vaqmuoePABd025uB+6rqlar6HtADLutuvap6tqp+AdzXjZUknUHLsab/l8C/d9vrgOdn7DvS1earS5LOoIFCP8kngSngy8vTDiTZmuRQkkMTExPLdVhJEgOEfpK/AN4P/HlVVVc+Clw4Y9gFXW2++utU1a6q2lhVG8fGxpbaniRpDksK/SSbgL8B/qSqjs/YtQ+4IclZSS4BNgDfBB4FNiS5JMmbmD7Zu2+w1iVJ/RpdaECSe4ErgDVJjgA7mL5a5yzgYBKAR6rqo1X1VJL7gW8xvexzc1Wd6I7zMeABYATYXVVPnYb3I0k6hQVDv6punKN89ynG3wHcMUd9P7C/r+4kScvKT+RKUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1JDRYTew0u3cuZNerzfsNrTCvPb/xPbt24fciVaa9evXs23btmG3MS9DfwG9Xo/HD3+bE+ecP+xWtIKs+kUB8NizPxpyJ1pJRo4fG3YLCzL0F+HEOefzs3dcP+w2JK1wZz+9f9gtLGjBNf0ku5O8mOTwjNqHkjyV5GSSjbPGfyJJL8l3klw7o76pq/WS3La8b0OStBiLOZF7D7BpVu0w8EHg4ZnFJJcCNwDv7J7zxSQjSUaALwDXAZcCN3ZjJUln0ILLO1X1cJKLZ9W+DZBk9vDNwH1V9QrwvSQ94LJuX6+qnu2ed1839luDNC9J6s9yX7K5Dnh+xuMjXW2++usk2ZrkUJJDExMTy9yeJLVtxV2nX1W7qmpjVW0cGxsbdjuS9Ctlua/eOQpcOOPxBV2NU9QlSWfIcs/09wE3JDkrySXABuCbwKPAhiSXJHkT0yd79y3za0uSFrDgTD/JvcAVwJokR4AdwDFgJzAG/FuSx6vq2qp6Ksn9TJ+gnQJurqoT3XE+BjwAjAC7q+qp0/GGJEnzW8zVOzfOs+sr84y/A7hjjvp+YOV/ckGSfoWtuBO5kqTTx9CXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BC/GH0BR48eZeT4y2+ILzyWNFwjxyc5enRq2G2ckjN9SWqIM/0FrFu3jh++MsrP3nH9sFuRtMKd/fR+1q1bO+w2TsmZviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1JAFQz/J7iQvJjk8o3Z+koNJnunuz+vqSXJXkl6SJ5K8Z8ZzxrvxzyQZPz1vR5J0KouZ6d8DbJpVuw14qKo2AA91jwGuAzZ0t63Al2D6lwSwA3gvcBmw47VfFJKkM2fB0K+qh4Fjs8qbgT3d9h7gAzPqe2vaI8C5Sd4GXAscrKpjVfUScJDX/yKRJJ1mS13TX1tVL3TbPwRe+1ui64DnZ4w70tXmq0uSzqCBT+RWVQG1DL0AkGRrkkNJDk1MTCzXYSVJLD30f9Qt29Ddv9jVjwIXzhh3QVebr/46VbWrqjZW1caxsbEltidJmstSQ38f8NoVOOPAV2fUt3RX8VwOvNwtAz0AXJPkvO4E7jVdTZJ0Bi34dYlJ7gWuANYkOcL0VTifBu5PchPwHPDhbvh+4HqgBxwHPgJQVceSfAp4tBt3e1XNPjksSTrNFgz9qrpxnl1XzjG2gJvnOc5uYHdf3UmSlpWfyJWkhhj6ktQQQ1+SGrLgmr5g5Pgxzn56/7Db0Aqy6uc/BeD
km39tyJ1oJRk5foxfflZ1ZTL0F7B+/fpht6AVqNf7HwDW/9bK/gHXmbZ2xWeGob+Abdu2DbsFrUDbt28H4M477xxyJ1J/XNOXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDBgr9JNuTHE7yVJK/7mrnJzmY5Jnu/ryuniR3JekleSLJe5bjDUiSFm/JoZ/kXcBfAZcBvwu8P8l64DbgoaraADzUPQa4DtjQ3bYCXxqgb0nSEgwy0/8d4BtVdbyqpoD/Aj4IbAb2dGP2AB/otjcDe2vaI8C5Sd42wOtLkvo0SOgfBt6XZHWSc4DrgQuBtVX1Qjfmh8Dabnsd8PyM5x/papKkM2R0qU+sqm8n+QzwdeB/gceBE7PGVJLq57hJtjK9/MNFF1201PYkSXMY6ERuVd1dVb9fVX8IvAR8F/jRa8s23f2L3fCjTP9L4DUXdLXZx9xVVRurauPY2Ngg7UmSZhn06p3f6O4vYno9/x+AfcB4N2Qc+Gq3vQ/Y0l3Fcznw8oxlIEnSGbDk5Z3OPydZDbwK3FxVP0nyaeD+JDcBzwEf7sbuZ3rdvwccBz4y4GtLkvo0UOhX1fvmqE0CV85RL+DmQV5PkjQYP5ErSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIQOFfpJbkzyV5HCSe5O8OcklSb6RpJfkH5O8qRt7Vve41+2/eDnegCRp8ZYc+knWAbcAG6vqXcAIcAPwGeDzVbUeeAm4qXvKTcBLXf3z3ThJ0hk06PLOKHB2klHgHOAF4I+Bf+r27wE+0G1v7h7T7b8ySQZ8fUlSH5Yc+lV1FPh74AdMh/3LwGPAT6pqqht2BFjXba8Dnu+eO9WNXz37uEm2JjmU5NDExMRS25MkzWGQ5Z3zmJ69XwK8HXgLsGnQhqpqV1VtrKqNY2Njgx5OkjTDIMs7VwHfq6qJqnoV+BfgD4Bzu+UegAuAo932UeBCgG7/rwOTA7y+JKlPg4T+D4DLk5zTrc1fCXwL+E/gz7ox48BXu+193WO6/f9RVTXA60uS+jTImv43mD4h+9/Ak92xdgF/C3w8SY/pNfu7u6fcDazu6h8Hbhugb0nSEowuPGR+VbUD2DGr/Cxw2Rxjfw58aJDXkyQNxk/kSlJDDH1JashAyztqy86dO+n1esNuY0V47b/D9u3bh9zJyrB+/Xq2bds27Da0CM70pSU466yzeOWVV3j11VeH3YrUF2f6WjRncr/0uc99jq997Wts2LCBW2+9ddjtSIvmTF/q0+TkJAcOHKCqOHDgAJOTfsZQbxyGvtSnPXv2cPLkSQBOnDjB3r17h9yRtHiGvtSnBx98kKmp6b8pODU1xcGDB4fckbR4hr7Up6uuuorR0enTYaOjo1x99dVD7khaPENf6tP4+DirVk3/6IyMjLBly5YhdyQtnqEv9Wn16tVs2rSJJGzatInVq1/3tRDSiuUlm9ISjI+P8/3vf99Zvt5wDH1pCVavXs1dd9017Dakvrm8I0kNMfQlqSGGviQ1xNCXpIZkJX9NbZIJ4Llh9yHNYw3w42E3Ic3hN6tqbK4dKzr0pZUsyaGq2jjsPqR+uLwjSQ0x9CWpIYa+tHS7ht2A1C/X9CWpIc70Jakhhr4kNcTQl6SGGPqS1BBDX5Ia8n+y6aH62hucLAAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uEPFcBjFhETQ" + }, + "source": [ + "Como podem ver, os outliers desapareceram, como queríamos." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tHfzjW_ymKuR" + }, + "source": [ + "___\n", + "# **Valores únicos**\n", + "> Considere o array de a_idades a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HzmQgWZVmUUD", + "outputId": "a8fe3bba-2483-4c62-8d46-963f79ae0462", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 121 + } + }, + "source": [ + "np.random.seed(20111974)\n", + "a_idades = np.random.randint(18, 100, 100)\n", + "a_idades" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([78, 60, 58, 26, 45, 20, 64, 99, 98, 31, 48, 81, 97, 90, 31, 85, 51,\n", + " 91, 95, 60, 73, 63, 59, 39, 40, 26, 80, 28, 18, 33, 27, 85, 53, 60,\n", + " 26, 44, 23, 86, 92, 75, 58, 40, 63, 24, 99, 18, 43, 68, 98, 94, 47,\n", + " 25, 39, 23, 70, 49, 96, 79, 68, 68, 25, 59, 21, 51, 65, 23, 34, 51,\n", + " 37, 78, 74, 73, 71, 46, 34, 45, 40, 56, 67, 31, 22, 43, 65, 64, 36,\n", + " 76, 19, 82, 75, 35, 38, 68, 43, 73, 91, 92, 61, 37, 73, 72])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 73 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dm9ky1F1mrNA" + }, + "source": [ + "Quem são os valores únicos do array?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "G-LPRqc-mS5j", + "outputId": "6c61d635-95b9-4cec-e7d8-12ec2b0e0322", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 87 + } + }, + "source": [ + "np.unique(a_idades)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 31, 33, 34, 35, 36, 37,\n", + " 38, 39, 40, 43, 44, 45, 46, 47, 48, 49, 51, 53, 56, 58, 59, 60, 61,\n", + " 63, 64, 65, 67, 68, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82,\n", + " 85, 86, 90, 91, 92, 94, 95, 96, 97, 98, 99])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 74 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uXZZoTd6nMuq" + }, + "source": [ + "___\n", + "# **Difference between two arrays**\n", + "> The result is an array with the **unique values of A that are not in B**. In set theory we write $A - B = A - (A \\cap B)$.\n", + "\n", + "![Difference](https://github.com/MathMachado/Materials/blob/master/set_Difference.PNG?raw=true)\n", + "\n", + "Source: [Python Set](https://www.learnbyexample.org/python-set/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uW6i3m9q1ZNs" + }, + "source": [ + "\n", + "* Let's see how this works in practice:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vw05sfe22mfk" + }, + "source": [ + "## Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Qqw2do90nQ7k" + }, + "source": [ + "a_conjunto1 = np.array([0, 1, 2, 4, 5, 7, 8, 8])\n", + "# Values to be removed from a_conjunto1. Note that 3 and 6 do not belong to a_conjunto1.\n", + "a_conjunto2 = np.array([1, 6, 7, 3])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zXJ00pOMorM-", + "outputId": "c3108557-ad55-45cc-f707-3af35cf456c1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "np.setdiff1d(a_conjunto1, a_conjunto2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 2, 4, 5, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 50 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8GXZNgjfo8lO" + }, + "source": [ + "Note that the result contains the unique elements of a_conjunto1 that do not belong to a_conjunto2. But what happens to '3' (and '6'), which are not in a_conjunto1?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aJSu6VKb2oc_" + }, + "source": [ + "## Example 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N1wahElXTqoB" + }, + "source": [ + "a_conjunto1 = np.arange(10)\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nxDpCMg7T7Rj" + }, + "source": [ + "a_conjunto2 = np.array([1, 5, 7])\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3LU3qYyiUXqm" + }, + "source": [ + "np.setdiff1d(a_conjunto1, a_conjunto2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mzZEytrRUioU" + }, + "source": [ + "Note that the elements of a_conjunto2 are absent from the result. a_conjunto1 itself is not modified: np.setdiff1d returns a new array."
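Returning to the question raised in Example 1 about values such as '3': the following is a minimal sketch, not part of the original notebook, and the names `a`, `b`, `diff` and `diff_via_intersection` are assumptions chosen for illustration. It suggests that values present only in the second array have no effect, and that the result agrees with the set identity $A - B = A - (A \cap B)$.

```python
import numpy as np

# Same values as Example 1 (names are illustrative)
a = np.array([0, 1, 2, 4, 5, 7, 8, 8])
b = np.array([1, 6, 7, 3])

# Unique values of a that are not in b; 3 and 6 are not in a, so they change nothing
diff = np.setdiff1d(a, b)

# Same result obtained by removing only the common elements, A - (A ∩ B)
diff_via_intersection = np.setdiff1d(a, np.intersect1d(a, b))

print(diff)
print(diff_via_intersection)
```

Both printed arrays should match the output shown in Example 1.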
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gJRcoVRUnaY9" + }, + "source": [ + "___\n", + "# Symmetric difference\n", + "* The elements that belong to exactly one of the two arrays. In set theory, this is called the symmetric difference, written $(A \\cup B) - (A \\cap B)$.\n", + "\n", + "![DifferenceSymetric](https://github.com/MathMachado/Materials/blob/master/set_DifferenceSymetric.PNG?raw=true)\n", + "\n", + "Source: [Python Set](https://www.learnbyexample.org/python-set/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2Uzzm85Kup3H" + }, + "source": [ + "* Let's see how this works in practice:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1z5wZ8VwpsWN" + }, + "source": [ + "import numpy as np\n", + "a_conjunto1 = np.array([0, 1, 2, 4, 5, 7, 8]) # Note that 1, 4 and 7 belong to a_conjunto1, but 3 does not.\n", + "a_conjunto2 = np.array([1, 4, 7, 3])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Tqd_9XO5p7bo", + "outputId": "d7670965-e38f-40a1-9864-8ec850143245", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "np.setxor1d(a_conjunto1, a_conjunto2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 2, 3, 5, 8])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 52 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_meurG3mqS5Y" + }, + "source": [ + "How do we explain or interpret this result? The output holds the values that appear in exactly one of the arrays: 0, 2, 5 and 8 come only from a_conjunto1, and 3 comes only from a_conjunto2." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kc8JoKe2nj2n" + }, + "source": [ + "___\n", + "# **Union of two arrays**\n", + "> Returns the sorted **unique** values that appear in either array. In set theory, we write:\n", + "\n", + "$$A \\cup B$$\n", + "\n", + "![Union](https://github.com/MathMachado/Materials/blob/master/set_Union.PNG?raw=true)\n", + "\n", + "Source: [Python Set](https://www.learnbyexample.org/python-set/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1LZxorw2p2mg" + }, + "source": [ + "a_conjunto1 = np.array([0, 1, 2, 4, 5, 7, 8, 8])\n", + "\n", + "# Note that 1, 4 and 7 belong to a_conjunto1, but 3 does not.\n", + "a_conjunto2 = np.array([1, 4, 7, 3])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "COsZEmSwuY5L" + }, + "source": [ + "np.union1d(a_conjunto1, a_conjunto2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b53bR-GYRu_3" + }, + "source": [ + "___\n", + "# **Select the items common to arrays X and Y**\n", + "* In set theory, this is called the intersection, written $X \\cap Y$.\n", + "\n", + "![Intersection](https://github.com/MathMachado/Materials/blob/master/set_Intersection.PNG?raw=true)\n", + "\n", + "Source: [Python Set](https://www.learnbyexample.org/python-set/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n2ec2tqqR1Gw" + }, + "source": [ + "* Consider the following arrays:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rXVQQvBqR4J-", + "outputId": "c1332edd-af01-45cb-d3e1-c6e3ba30e157", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "a_conjunto1 = np.arange(10)\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 53 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pZTHhHxGSRfB", + "outputId": "2c93501a-3ed8-4297-d58e-990c529a5a3d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "a_conjunto2 = np.arange(8, 18)\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 54 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MxB2_qHpScMB" + }, + "source": [ + "Which elements are common to X (a_conjunto1) and Y (a_conjunto2)?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e-rncJHtSfw0", + "outputId": "11f0b85d-c634-419a-cc62-e0899f9cef31", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "np.intersect1d(a_conjunto1, a_conjunto2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 55 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Bb39sWdfqaF" + }, + "source": [ + "___\n", + "# **Eigenvalues and eigenvectors**\n", + "> Eigenvalues and eigenvectors are among the most important linear algebra topics in Machine Learning.\n", + "\n", + "By definition, the scalar $\\lambda$ and the nonzero vector $v$ are an eigenvalue and an eigenvector of the square matrix $A$ if\n", + "\n", + "$$Av = \\lambda v$$\n", + "\n", + "## Further reading:\n", + "\n", + "* [Machine Learning & Linear Algebra — Eigenvalue and eigenvector](https://medium.com/@jonathan_hui/machine-learning-linear-algebra-eigenvalue-and-eigenvector-f8d0493564c9)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XZBKq8nGCUbL" + }, + "source": [ + "* np.linalg.eig requires a square 2D matrix, so a_conjunto2 is redefined here as a small square matrix (an illustrative choice; the 1D array used above would raise an error):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iYlZGKFUfw-R" + }, + "source": [ + "a_conjunto2 = np.array([[2, 0], [0, 3]]) # illustrative square matrix (assumption)\n", + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6EfvIbBNf02Z" + }, + "source": [ + "# Compute eigenvalues and eigenvectors:\n", + "a_autovalores, a_autovetores = np.linalg.eig(a_conjunto2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v3GtQQvAz9QU" + }, + "source": [ + "The eigenvalues of a_conjunto2 are:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WvZGyBR1f9vP" + }, + "source": [ + "a_autovalores" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AuuDRJVh0FC8" + }, + "source": [ + "The eigenvectors of a_conjunto2 are (one per column):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6m4YFAwsf_rA" + }, + "source": [ + "a_autovetores" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DASn2Un9ZNV-" + }, + "source": [ + "___\n", + "# **Finding missing values (NaN)**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TKilWBsSXtR4" + }, + "source": [ + "## Generating the example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lqLI2ER_ZUMY" + }, + "source": [ + "import numpy as np\n", + "np.random.seed(20111974)\n", + "a_conjunto1 = np.random.random(100)\n", + "\n", + "# Draw 15 random indices (with replacement, so repeats are possible) and set those positions to NaN:\n", + "np.random.seed(20111974)\n", + "l_indices_aleatorios = np.random.randint(0, 100, size = 15)\n", + "\n", + "for i_indices in l_indices_aleatorios:\n", + "    #print(i_indices)\n", + "    a_conjunto1[i_indices] = np.nan" + ], + "execution_count": 2, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gw--poMaadv3", + "outputId": "53e3313d-21b5-4bee-b469-c9898dfa00fb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "l_indices_aleatorios" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([60, 42, 40, 8, 27, 2, 46, 88, 81, 88, 80, 13, 30, 82, 96])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, +
{ + "cell_type": "code", + "metadata": { + "id": "2ZkbMPXMawYh", + "outputId": "c23cbf32-34e4-4e15-cf93-b04d6b50ed45", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "a_conjunto1" + ], + "execution_count": 4, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.53097233, 0.56965626, nan, 0.65478409, 0.85708456,\n", + " 0.60174181, 0.87298309, 0.45573342, nan, 0.64300912,\n", + " 0.54808035, 0.35321428, 0.32005665, nan, 0.85159044,\n", + " 0.75930202, 0.65675987, 0.3278323 , 0.34592275, 0.41510657,\n", + " 0.30635652, 0.26750355, 0.30663224, 0.35503537, 0.60299892,\n", + " 0.0221767 , 0.36265947, nan, 0.28077438, 0.37056609,\n", + " nan, 0.43587362, 0.20494254, 0.20850854, 0.64886762,\n", + " 0.81792888, 0.71541492, 0.50313939, 0.1657674 , 0.60122378,\n", + " nan, 0.14442301, nan, 0.70671296, 0.07163699,\n", + " 0.56212721, nan, 0.83632274, 0.21435895, 0.85491145,\n", + " 0.62878505, 0.38468856, 0.90553087, 0.33703023, 0.06707729,\n", + " 0.1023552 , 0.84821523, 0.12156391, 0.94423963, 0.15835682,\n", + " nan, 0.91080887, 0.58558559, 0.36799242, 0.71647196,\n", + " 0.0740405 , 0.47889268, 0.77503169, 0.96720855, 0.71575223,\n", + " 0.28887146, 0.33306388, 0.95399002, 0.23557899, 0.97714605,\n", + " 0.85188315, 0.63303051, 0.57297905, 0.66792818, 0.87621361,\n", + " nan, nan, nan, 0.68323127, 0.28826713,\n", + " 0.32846648, 0.98334327, 0.17156066, nan, 0.91917489,\n", + " 0.98381602, 0.75915187, 0.31400247, 0.97074481, 0.07574498,\n", + " 0.55661541, nan, 0.4936932 , 0.07351232, 0.11418944])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 4 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z7Bs75NvbSjx" + }, + "source": [ + "Ok, although 15 indices were drawn, a_conjunto1 ends up with 14 NaNs, because index 88 was drawn twice. Now let's count the NaNs (we already know the answer!)."
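The count of 14 can be checked directly from the indices. The following is a minimal sketch using the index values printed in the output of l_indices_aleatorios; the names `indices`, `unique_indices`, `values` and `n_nans` are assumptions introduced only for this illustration.

```python
import numpy as np

# The 15 indices shown in the output of l_indices_aleatorios
indices = np.array([60, 42, 40, 8, 27, 2, 46, 88, 81, 88, 80, 13, 30, 82, 96])

# Distinct positions actually touched (88 appears twice)
unique_indices = np.unique(indices)
print(len(indices), len(unique_indices))

# Setting the same position to NaN twice still yields a single NaN
values = np.zeros(100)
values[indices] = np.nan
n_nans = int(np.isnan(values).sum())
print(n_nans)
```

If exactly 15 distinct positions were wanted, drawing without replacement, for example with `np.random.choice(100, size=15, replace=False)`, would be one option.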
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hL1Wn0vdX8ur" + }, + "source": [ + "## Identifying the NaNs" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5R-n3H0xbd6d", + "outputId": "227eef3e-0d3f-4a98-f95a-cbdffff6e77b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "np.isnan(a_conjunto1).sum()" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "14" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y7hh5uowoa3U" + }, + "source": [ + "Ok, there are 14 NaNs in a_conjunto1." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iVLQf_bqbyNU" + }, + "source": [ + "Next, let's find the indices of those NaNs." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kJHxjZiwb5HM", + "outputId": "adf8ae23-3be3-473b-d04e-cb973e145dfa", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "i_indices = np.where(np.isnan(a_conjunto1))\n", + "i_indices" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(array([ 2, 8, 13, 27, 30, 40, 42, 46, 60, 80, 81, 82, 88, 96]),)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "W_jHGNImok7L", + "outputId": "da396cce-64d9-4d31-ed2f-eac04e37ff36", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Checking...\n", + "a_conjunto1[2]" + ], + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "nan" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iPhHAhDYcMWO" + }, + "source": [ + "Is this correct? To check, compare with l_indices_aleatorios: once sorted and with the repeated 88 counted only once, it contains exactly these 14 indices." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gxQYslRCe11G" + }, + "source": [ + "___\n", + "# **Removing NaNs from an array**\n", + "> Consider the same array we have just worked with. Now we want to remove the NaNs we identified." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AeBARFqNfNnN", + "outputId": "7c991ca5-7403-492d-9fef-b03de1e4da34", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "a_conjunto1" + ], + "execution_count": 8, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.53097233, 0.56965626, nan, 0.65478409, 0.85708456,\n", + " 0.60174181, 0.87298309, 0.45573342, nan, 0.64300912,\n", + " 0.54808035, 0.35321428, 0.32005665, nan, 0.85159044,\n", + " 0.75930202, 0.65675987, 0.3278323 , 0.34592275, 0.41510657,\n", + " 0.30635652, 0.26750355, 0.30663224, 0.35503537, 0.60299892,\n", + " 0.0221767 , 0.36265947, nan, 0.28077438, 0.37056609,\n", + " nan, 0.43587362, 0.20494254, 0.20850854, 0.64886762,\n", + " 0.81792888, 0.71541492, 0.50313939, 0.1657674 , 0.60122378,\n", + " nan, 0.14442301, nan, 0.70671296, 0.07163699,\n", + " 0.56212721, nan, 0.83632274, 0.21435895, 0.85491145,\n", + " 0.62878505, 0.38468856, 0.90553087, 0.33703023, 0.06707729,\n", + " 0.1023552 , 0.84821523, 0.12156391, 0.94423963, 0.15835682,\n", + " nan, 0.91080887, 0.58558559, 0.36799242, 0.71647196,\n", + " 0.0740405 , 0.47889268, 0.77503169, 0.96720855, 0.71575223,\n", + " 0.28887146, 0.33306388, 0.95399002, 0.23557899, 0.97714605,\n", + " 0.85188315, 0.63303051, 0.57297905, 0.66792818, 0.87621361,\n", + " nan, nan, nan, 0.68323127, 0.28826713,\n", + " 0.32846648, 0.98334327, 0.17156066, nan, 0.91917489,\n", + " 0.98381602, 0.75915187, 0.31400247, 0.97074481, 0.07574498,\n", + " 0.55661541, nan, 0.4936932 , 0.07351232, 0.11418944])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] +
}, + { + "cell_type": "code", + "metadata": { + "id": "ck1w6_Tvb72M", + "outputId": "af2fc919-3ccf-457a-c493-f775addd9eb6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 221 + } + }, + "source": [ + "np.isnan(a_conjunto1)" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([False, False, True, False, False, False, False, False, True,\n", + " False, False, False, False, True, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " True, False, False, True, False, False, False, False, False,\n", + " False, False, False, False, True, False, True, False, False,\n", + " False, True, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, True, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, True,\n", + " True, True, False, False, False, False, False, True, False,\n", + " False, False, False, False, False, False, True, False, False,\n", + " False])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e497B492fFru", + "outputId": "90ea8d35-7338-46d6-d1bc-1c14a4a7030b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 323 + } + }, + "source": [ + "a_conjunto1[~np.isnan(a_conjunto1)]" + ], + "execution_count": 11, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.53097233, 0.56965626, 0.65478409, 0.85708456, 0.60174181,\n", + " 0.87298309, 0.45573342, 0.64300912, 0.54808035, 0.35321428,\n", + " 0.32005665, 0.85159044, 0.75930202, 0.65675987, 0.3278323 ,\n", + " 0.34592275, 0.41510657, 0.30635652, 0.26750355, 0.30663224,\n", + " 0.35503537, 0.60299892, 0.0221767 , 0.36265947, 0.28077438,\n", + " 0.37056609, 0.43587362, 0.20494254, 0.20850854, 
0.64886762,\n", + " 0.81792888, 0.71541492, 0.50313939, 0.1657674 , 0.60122378,\n", + " 0.14442301, 0.70671296, 0.07163699, 0.56212721, 0.83632274,\n", + " 0.21435895, 0.85491145, 0.62878505, 0.38468856, 0.90553087,\n", + " 0.33703023, 0.06707729, 0.1023552 , 0.84821523, 0.12156391,\n", + " 0.94423963, 0.15835682, 0.91080887, 0.58558559, 0.36799242,\n", + " 0.71647196, 0.0740405 , 0.47889268, 0.77503169, 0.96720855,\n", + " 0.71575223, 0.28887146, 0.33306388, 0.95399002, 0.23557899,\n", + " 0.97714605, 0.85188315, 0.63303051, 0.57297905, 0.66792818,\n", + " 0.87621361, 0.68323127, 0.28826713, 0.32846648, 0.98334327,\n", + " 0.17156066, 0.91917489, 0.98381602, 0.75915187, 0.31400247,\n", + " 0.97074481, 0.07574498, 0.55661541, 0.4936932 , 0.07351232,\n", + " 0.11418944])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RpvKfJU_fmA6" + }, + "source": [ + "Note that the NaNs are gone from the result. a_conjunto1 itself is unchanged, since boolean indexing returns a new array." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "60-l91_ccJxt" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kywe-SmtcLpF" + }, + "source": [ + "### **Exercise**: Replace the NaN values in the sample with the median."
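One possible approach to this exercise, sketched on a small made-up array rather than on a_conjunto1 (the data and the names `sample`, `median` and `filled` are assumptions for illustration): `np.nanmedian` computes the median while ignoring NaNs, and a boolean mask selects the positions to fill.

```python
import numpy as np

# Small illustrative sample with missing values (assumed data)
sample = np.array([1.0, np.nan, 3.0, 5.0, np.nan])

# Median of the non-NaN values only
median = np.nanmedian(sample)

# Work on a copy so the original sample is preserved
filled = sample.copy()
filled[np.isnan(filled)] = median

print(median)
print(filled)
```

The same two steps should carry over to a_conjunto1; working on a copy is a design choice that keeps the original data available for comparison.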
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_Dv8MmNYg8zN" + }, + "source": [ + "___\n", + "# **Converting a list to an array**\n", + "> Consider the following list (it holds a single 10-element array):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "but6T9dVhFYb" + }, + "source": [ + "l_Lista = [np.random.randint(0, 10, 10)]\n", + "l_Lista" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xytj4Eo4hTh9" + }, + "source": [ + "type(l_Lista)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qrINdcruhWcH" + }, + "source": [ + "Converting the list to an array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RoSyaX0OhZSE" + }, + "source": [ + "a_conjunto = np.asarray(l_Lista)\n", + "a_conjunto" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dMjTdbBUhlrk" + }, + "source": [ + "type(a_conjunto)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mbm3ZP9DhxDI" + }, + "source": [ + "___\n", + "# Converting a tuple to an array\n", + "> Consider the following tuple:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cZxEFYLAh3S_" + }, + "source": [ + "np.random.seed(20111974)\n", + "t_numeros = ([np.random.randint(0, 10, 3)], [np.random.randint(0, 10, 3)], [np.random.randint(0, 10, 3)])\n", + "t_numeros" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vlTXUJviiAml" + }, + "source": [ + "type(t_numeros)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yEaOlq8oh3oh" + }, + "source": [ + "a_conjunto = np.asarray(t_numeros)\n", + "a_conjunto" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PSgQDmRWh3g5" + }, + "source": [ + "type(a_conjunto)" + ], + "execution_count": 
null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pH-Ht6yMiqJN" + }, + "source": [ + "___\n", + "# Appending elements to an array\n", + "> Consider the following array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dFaDZInZiwoo" + }, + "source": [ + "a_conjunto1 = np.arange(5)\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "d3zrlf_Ci73Z" + }, + "source": [ + "np.random.seed(20111974)\n", + "# With no axis argument, np.append flattens its inputs, so the result is a 1D array.\n", + "a_conjunto1 = np.append(a_conjunto1, [np.random.randint(0, 10, 3), np.random.randint(0, 10, 3), np.random.randint(0, 10, 3)])\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eFRhtk13ojqA" + }, + "source": [ + "___\n", + "# **Combining 1D arrays into a 2D array**\n", + "> Consider the following arrays:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wYhBgW9Zu6ZP" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.array(np.random.randint(0, 10, 6))\n", + "\n", + "np.random.seed(19741120)\n", + "a_conjunto2 = np.array(np.random.randint(0, 10, 6))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "febs9AUHvs6n" + }, + "source": [ + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "C9OEd-iavvBm" + }, + "source": [ + "a_conjunto2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KJWjtaWKv0MJ" + }, + "source": [ + "np.column_stack((a_conjunto1, a_conjunto2)) # Note the inner parentheses: the arrays are passed as a single tuple."
+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xr_WZXJ7pi2D" + }, + "source": [ + "___\n", + "# **Deleting specific elements from an array by index**\n", + "> Consider the following array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tS0ZzOs8w0dw" + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.array(np.random.randint(0, 10, 6))\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7bOJiKDKxEsC" + }, + "source": [ + "Suppose we want to delete the values '8' from a_conjunto1. The indices of the values '8' are [0, 1, 3]. Therefore:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SSjueEvjxTJO" + }, + "source": [ + "a_conjunto1 = np.delete(a_conjunto1, [0, 1, 3])\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mZkGZ2Rgp--5" + }, + "source": [ + "___\n", + "# **Frequency of the unique values of an array**\n", + "> Consider the following array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z2BWKfH0xvQ8", + "outputId": "8e684b4b-1ceb-4075-e529-1edb95e30edb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "np.random.seed(20111974)\n", + "a_conjunto1 = np.array(np.random.randint(0, 10, 100))\n", + "a_conjunto1" + ], + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([8, 8, 2, 8, 9, 1, 8, 0, 4, 2, 0, 8, 9, 3, 7, 1, 3, 2, 9, 7, 7, 9,\n", + " 5, 6, 8, 7, 0, 9, 3, 9, 3, 1, 8, 6, 3, 5, 4, 1, 2, 9, 8, 6, 6, 1,\n", + " 0, 9, 2, 0, 7, 5, 5, 4, 4, 2, 7, 2, 7, 9, 3, 1, 5, 0, 1, 2, 3, 8,\n", + " 7, 5, 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, 1, 0, 9, 1, 4, 2, 9,\n", + " 7, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] +
}, + { + "cell_type": "markdown", + "metadata": { + "id": "s_tdQBsax4rQ" + }, + "source": [ + "Suppose we want to know how many times the value '2' appears in a_conjunto1." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6yIlk7pWyAtf", + "outputId": "587f9dc5-f92b-4843-c173-4d0aff309bf5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "l_itens_unicos, i_count = np.unique(a_conjunto1, return_counts = True)\n", + "l_itens_unicos" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DyvrIwS9yZIR" + }, + "source": [ + "What does the output above mean? These are the distinct values found in a_conjunto1, in ascending order." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uO-MPMhXyV9H", + "outputId": "5f141140-a6d0-4a68-e037-918286166f54", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "i_count" + ], + "execution_count": 17, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([10, 10, 10, 11, 8, 8, 8, 10, 10, 15])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zwoezXrPyofK" + }, + "source": [ + "How do we interpret the output above? Each count lines up with the unique value at the same position, so the value '2' appears 10 times."
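A minimal sketch of how the two outputs fit together, on a small illustrative array (the data and the names `data`, `values`, `counts` and `frequency` are assumptions for this example): pairing each unique value with its count gives a lookup table of frequencies.

```python
import numpy as np

# Small illustrative array (assumed data)
data = np.array([2, 3, 2, 5, 2, 3])

# Unique values and how many times each one occurs
values, counts = np.unique(data, return_counts=True)

# Map each unique value to its number of occurrences
frequency = {int(v): int(c) for v, c in zip(values, counts)}

print(frequency)
print(frequency[2])  # occurrences of the value 2
```

A dictionary is only one way to pair the two arrays; stacking them into a single 2D array is another.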
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HgYycSG7yr5e", + "outputId": "17c18c3a-55cb-405a-9d9c-c1c9d1e122ae", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "np.asarray((l_itens_unicos, i_count))" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],\n", + " [10, 10, 10, 11, 8, 8, 8, 10, 10, 15]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SwIZiJAiy06T" + }, + "source": [ + "How do we interpret the output above? The first row holds the unique values and the second row holds their frequencies." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JpNRpN2Dql3N" + }, + "source": [ + "___\n", + "# **All possible combinations of elements from several arrays**\n", + "> Consider the following example:\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BUr89dH4zLXD" + }, + "source": [ + "a_conjunto1 = [2, 4, 6]\n", + "a_conjunto2 = [0, 8]\n", + "a_conjunto4 = [1, 5]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cEZH6l-Czx7y" + }, + "source": [ + "np.meshgrid(a_conjunto1, a_conjunto2, a_conjunto4)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "btvmDkEcz0tH" + }, + "source": [ + "np.array(np.meshgrid(a_conjunto1, a_conjunto2, a_conjunto4))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z0xhO7rGz059" + }, + "source": [ + "np.array(np.meshgrid(a_conjunto1, a_conjunto2, a_conjunto4)).T" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eMv4lFnD0Enn" + }, + "source": [ + "# Final result: one row per combination\n", + "a_conjunto3 = np.array(np.meshgrid(a_conjunto1, a_conjunto2, a_conjunto4)).T.reshape(-1,3)\n", + "a_conjunto3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Rz80YANfAh2k" + }, + "source": [ + "___\n", + "# **Wrap Up**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_cyhMsAVXxGC" + }, + "source": [ + "___\n", + "# **Exercises**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kNjovMw3uJ3R" + }, + "source": [ + "## Exercise 1 - Select the even numbers\n", + "> Given the 1D array below, select only the even numbers." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "isDzQjwjBX3V" + }, + "source": [ + "a_conjunto1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kq1zt-uO1HXv" + }, + "source": [ + "### **My solution**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YFmK_n2M1Ks9" + }, + "source": [ + "a_conjunto1[a_conjunto1 % 2 == 0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sScYG0hp05vb" + }, + "source": [ + "___\n", + "## Exercise 2 - Replace with the median\n", + "> Given the 1D array below, replace the even numbers with the median of a_conjunto1." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XLZ-DIWU1WFs" + }, + "source": [ + "a_conjunto1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9c4QWJno1WVB" + }, + "source": [ + "### **My solution**\n", + "* First, compute the median.\n", + "* Then, replace the even values of a_conjunto1 with that median."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rx7NGAO01Wfb" + }, + "source": [ + "f_mediana = np.median(a_conjunto1) # 4.5, computed before the array is modified\n", + "a_conjunto1 = a_conjunto1.astype(float) # an integer array would truncate 4.5 to 4\n", + "a_conjunto1[a_conjunto1 % 2 == 0] = f_mediana\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2c_AphX82qp8" + }, + "source": [ + "Checking..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9kVta0Cr13Z9" + }, + "source": [ + "f'The median of the original a_conjunto1 is: {f_mediana}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L9O-Hf5x26TY" + }, + "source": [ + "___\n", + "## Exercise 3 - Reshape\n", + "> Given the 1D array below, reshape it into a 2D array with 3 columns." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0_laUvtB4Wl-" + }, + "source": [ + "# Set the seed\n", + "np.random.seed(20111974)\n", + "a_conjunto1 = np.array(np.random.randint(1, 10, size = 15))\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dKzEX8TK5b4Z" + }, + "source": [ + "### **My solution**\n", + "* The 1D array a_conjunto1 above has 15 elements. Since we want a 2D array with 3 columns, each column will have 5 elements (that is, the result has 5 rows)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I-j5yVD04249" + }, + "source": [ + "a_conjunto1.reshape(5, 3) \n", + "# This could also be a_conjunto1.reshape(-1, 3), where \"-1\" asks NumPy to compute the number of rows. " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F1vfS8jE6L0_" + }, + "source": [ + "___\n", + "## Exercise 4 - Reshape\n", + "> Given the 1D array below, reshape it into a 3D array with 2 columns." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xcN-bez56L1D" + }, + "source": [ + "# Set the seed\n", + "np.random.seed(20111974)\n", + "a_conjunto1 = np.array(np.random.randint(1, 10, size = 16))\n", + "a_conjunto1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7iICnOyG6fcj" + }, + "source": [ + "### **My solution**\n", + "* The 1D array a_conjunto1 above has 16 elements. We want a 3D array with 2 columns. One possible choice is to also use 2 rows per block, which gives the shape (4, 2, 2)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vdq5ybuD6fcn" + }, + "source": [ + "a_conjunto1.reshape(-1, 2, 2) # 3D result of shape (4, 2, 2): \"-1\" asks NumPy to infer the size of the first axis; each block has 2 rows and 2 columns." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "haQfWPcCs_H0" + }, + "source": [ + "## Exercise 5\n", + "For more exercises on arrays, visit the page [Python: Array Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/array/)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LQQL0JS2tnc0" + }, + "source": [ + "## Exercise 6\n", + "For more exercises on mathematics, visit the page [Python Math: - Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/math/index.php)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qNskKFy9t4D5" + }, + "source": [ + "## Exercise 7\n", + "For more exercises on NumPy in general, visit the page [NumPy Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/numpy/index.php)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qqc1AiHXuKZ5" + }, + "source": [ + "## Exercício 8\n" + ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB07__Dictionaries_hs.ipynb b/Notebooks/NB07__Dictionaries_hs.ipynb new file mode 100644 index 000000000..28d54e5e2 --- /dev/null +++ b/Notebooks/NB07__Dictionaries_hs.ipynb @@ -0,0 +1,3137 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB07__Dictionaries.ipynb", + "provenance": [], + "collapsed_sections": [ + "n8BIbzQbNWUo", + "7eS94uQ4NhVR", + "SYOgJpGYVLUu", + "CaHFxk98W5if", + "ReWUyWiHXCnc", + "CqszHxaKHr2h", + "tXgF1Wl9gHKY", + "Fotx7XUquAo8", + "36kmLUYDvsUI", + "SWO2GdNovxAp", + "vpN54l4vxze5", + "u4HOf9SNytSq", + "6BQ9oZiD9hg5", + "tz5-QdrX9vct", + "p1muBgMX8NK4", + "FxTC2-U88ajk", + "z8EYn0pP25Rh" + ], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iBW6agsvqqAm" + }, + "source": [ + "

DICTIONARIES

\n", + "\n", + "* A mutable collection of items indexed by key (a {key: value} structure); since Python 3.7, a dictionary preserves insertion order;\n", + "* Duplicate keys are not allowed (values may repeat);\n", + "* We use {key: value} to represent the items of a dictionary;\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LFcr_2Xnq2ho" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> See the **table of contents** for the items covered in this chapter.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r8vR-lHJIhgM" + }, + "source": [ + "# **NOTES AND REMARKS**\n", + "* Move the lambda function examples from here to the Lambda Function chapter.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DkxCxjsbE5fL" + }, + "source": [ + "# **CHEATSHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cGUWTualFCOk" + }, + "source": [ + "![DataStructures](https://github.com/MathMachado/Materials/blob/master/PythonDataStructures.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ublDMf3R_qMn" + }, + "source": [ + "Below are the main methods associated with dictionaries. 
Para isso, considere as listas l_frutas e l_precos_frutas que darão origem ao dicionário d_frutas a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FxuJ7Awd8f5a", + "outputId": "b92e247b-99a3-4687-87af-de5ccb28961f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 54 + } + }, + "source": [ + "# Definição da lista l_frutas:\n", + "l_frutas = ['Avocado', 'Apple', 'Apricot', 'Banana', 'Blackcurrant', 'Blackberry', 'Blueberry', 'Cherry', 'Coconut', 'Fig', 'Grape', 'Kiwi', 'Lemon', 'Mango', 'Nectarine', \n", + " 'Orange', 'Papaya', 'Passion Fruit', 'Peach', 'Pineapple', 'Plum', 'Raspberry', 'Strawberry', 'Watermelon']\n", + "\n", + "print(l_frutas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "['Avocado', 'Apple', 'Apricot', 'Banana', 'Blackcurrant', 'Blackberry', 'Blueberry', 'Cherry', 'Coconut', 'Fig', 'Grape', 'Kiwi', 'Lemon', 'Mango', 'Nectarine', 'Orange', 'Papaya', 'Passion Fruit', 'Peach', 'Pineapple', 'Plum', 'Raspberry', 'Strawberry', 'Watermelon']\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jJyxuMQc9Ewy", + "outputId": "4396f10a-5709-45cc-d923-347e4183fb7b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "# Definição da lista l_precos_frutas:\n", + "l_precos_frutas = [0.35, 0.40, 0.25, 0.30, 0.70, 0.55, 0.45, 0.50, 0.75, 0.60, 0.65, 0.20, 0.15, 0.80, 0.75, 0.25, 0.30,0.45,0.55,0.55,0.60,0.40,0.50,0.45]\n", + "l_precos_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[0.35,\n", + " 0.4,\n", + " 0.25,\n", + " 0.3,\n", + " 0.7,\n", + " 0.55,\n", + " 0.45,\n", + " 0.5,\n", + " 0.75,\n", + " 0.6,\n", + " 0.65,\n", + " 0.2,\n", + " 0.15,\n", + " 0.8,\n", + " 0.75,\n", + " 0.25,\n", + " 0.3,\n", + " 0.45,\n", + " 0.55,\n", + " 0.55,\n", + " 0.6,\n", + " 0.4,\n", + " 0.5,\n", + " 0.45]" + ] + }, + "metadata": { 
+ "tags": [] + }, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hXP3kxW4-AI1" + }, + "source": [ + "Observe abaixo o uso das funções dict() e zip() para criarmos o dicionário d_frutas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qT_4sYxA9dyn", + "outputId": "15bd1dd0-81aa-4592-ca39-f8465f89eafe", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "# Definir o dicionário d_frutas: estrutura do tipo {chave1: valor1, chave2: valor2, ..., chaveN: valorN} --> JSON\n", + "d_frutas = dict(zip(l_frutas, l_precos_frutas))\n", + "d_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.4,\n", + " 'Apricot': 0.25,\n", + " 'Avocado': 0.35,\n", + " 'Banana': 0.3,\n", + " 'Blackberry': 0.55,\n", + " 'Blackcurrant': 0.7,\n", + " 'Blueberry': 0.45,\n", + " 'Cherry': 0.5,\n", + " 'Coconut': 0.75,\n", + " 'Fig': 0.6,\n", + " 'Grape': 0.65,\n", + " 'Kiwi': 0.2,\n", + " 'Lemon': 0.15,\n", + " 'Mango': 0.8,\n", + " 'Nectarine': 0.75,\n", + " 'Orange': 0.25,\n", + " 'Papaya': 0.3,\n", + " 'Passion Fruit': 0.45,\n", + " 'Peach': 0.55,\n", + " 'Pineapple': 0.55,\n", + " 'Plum': 0.6,\n", + " 'Raspberry': 0.4,\n", + " 'Strawberry': 0.5,\n", + " 'Watermelon': 0.45}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iHKUaGNT_IDt" + }, + "source": [ + "A seguir, resumo dos principais métodos relacionados à dicionários:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MQLZ1mwW_yiU" + }, + "source": [ + "| Método | Descrição | Exemplo | Resultado |\n", + "|-------------------------|----------------------------------------------------------------------------------------------------|------------------------------------------|--------------------------------------------------------------------------------|\n", 
+ "| d_dicionario.clear() | Removes all items from d_dicionario | d_frutas.clear() | {} |\n", + "| d_dicionario.copy() | Returns a (shallow) copy of d_dicionario | d_frutas2= d_frutas.copy() | d_frutas2 is a copy of d_frutas |\n", + "| d_dicionario.get(key) | Returns the value for key if key is in d_dicionario; otherwise returns None | d_frutas.get('Passion Fruit') | 0.45 |\n", + "| | | d_frutas.get('XPTO') | None (no KeyError is raised, and the notebook displays nothing) |\n", + "| d_dicionario.items() | Returns a view object with the (key, value) tuples of d_dicionario | d_frutas.items() | dict_items([('Avocado', 0.35), ..., ('Watermelon', 0.45)]) |\n", + "| d_dicionario.keys() | Returns a view object with the keys of d_dicionario | d_frutas.keys() | dict_keys(['Avocado', 'Apple', ..., 'Watermelon']) |\n", + "| d_dicionario.values() | Returns a view object with the values of d_dicionario | d_frutas.values() | dict_values([0.35, 0.4, ..., 0.45]) |\n", + "| d_dicionario.popitem() | Removes and returns a (key, value) pair; since Python 3.7, the last one inserted | d_frutas.popitem() | ('Watermelon', 0.45) |\n", + "| | | 'Watermelon' in d_frutas | False |\n", + "| d_dicionario.pop(key[, default]) | Removes key from d_dicionario and returns its value (or default, if given, when key is missing) | d_frutas.pop('Orange') | 0.25 |\n", + "| | | 'Orange' in d_frutas | False |\n", + "| d_dicionario.update(d2) | Adds the items of d2 whose key is not in d_dicionario; for a key already in d_dicionario, replaces its value with the new one | d_frutas.update({'Cherimoya': 1.3}) | Adds the item {'Cherimoya': 1.3} to d_frutas, since the key 'Cherimoya' is not in d_frutas. |\n", + "| | | d_frutas.update({'Orange': 0.55}) | Updates the value of the key 'Orange' to 0.55. 
The previous value was 0.25 |\n", + "| d_dicionario.fromkeys(keys, value) | Returns a new dictionary with the given keys, each mapped to value | tFruits= ('Avocado', 'Apple', 'Apricot') | |\n", + "| | | d_frutas.fromkeys(tFruits, 0) | {'Apple': 0, 'Apricot': 0, 'Avocado': 0} |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uH6cHnctDu2l" + }, + "source": [ + "Next, we present a few more examples of dictionaries and their associated methods:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YeCPxCab4e4k" + }, + "source": [ + "___\n", + "# **EXAMPLE**\n", + "* The days of the week as a dictionary." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N_2J839X4lps", + "outputId": "72c3547d-59ee-4e42-bc85-5e050e3209dd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 136 + } + }, + "source": [ + "d_dia_semana = {'Seg': 'Segunda', 'Ter': 'Terça', 'Qua': 'Quarta', 'Qui': 'Quinta', 'Sex': 'Sexta', 'Sab': 'Sabado', 'Dom': 'Domingo'}\n", + "d_dia_semana" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Dom': 'Domingo',\n", + " 'Qua': 'Quarta',\n", + " 'Qui': 'Quinta',\n", + " 'Sab': 'Sabado',\n", + " 'Seg': 'Segunda',\n", + " 'Sex': 'Sexta',\n", + " 'Ter': 'Terça'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 4 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CnZLR-VX6FV4" + }, + "source": [ + "Note that:\n", + "* the items of the dictionary d_dia_semana follow the {key: value} structure.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eHuvY7BWQKhQ", + "outputId": "e340fc45-e036-4812-fc51-6f941063e588", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "d_dia_semana['Seg'] # The key here is 'Seg'" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + 
"type": 
"string" + }, + "text/plain": [ + "'Segunda'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j65BxhzGG0NA" + }, + "source": [ + "___\n", + "# **DECLARAR OU INICIALIZAR UM DICIONÁRIO VAZIO**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LEGwQ0U-fKtL" + }, + "source": [ + "Por exemplo, o comando abaixo declara um dicionário vazio chamado d_paises:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2iPWXPBLfOlr", + "outputId": "41fc8f1a-70cd-4e56-aeb0-0c0e4aac5571", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises = {} # Também podemos usar a função dict() para criar o dicionário vazio da seguinte forma: d_paises= dict()\n", + "d_paises" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vCxZv-jmG5y0" + }, + "source": [ + "___\n", + "# **OBTER O TIPO DO OBJETO**\n", + "> type(d_dicionario)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "voPYpGIGff3o", + "outputId": "b05165d8-375a-44e9-cacf-876eb0da4071", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "type(d_paises)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dict" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X3MvCkFiG-UO" + }, + "source": [ + "___\n", + "# **ADICIONAR ITENS AO DICIONÁRIO**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fzP8iG5xfi0H" + }, + "source": [ + "Adicionar o valor 'Italy' à key = 1. 
Em outras palavras, estamos a adicionar o item {1: 'Italy'}" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EXZ7eEZofnza", + "outputId": "0976b0e9-c5d3-4341-d444-d6e556c225bb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises[1] = 'Italy'\n", + "d_paises" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1: 'Italy'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rH51ORGHHREE" + }, + "source": [ + "Adicionar o valor 'Denmark' à key= 2. Em outras palavras, estamos a adicionar o item {2: 'Denmark'}" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GAXSzSiufv1u", + "outputId": "19a48880-61f6-4c24-ba7c-e5b81e1cad52", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises[2] = 'Denmark'\n", + "d_paises" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1: 'Italy', 2: 'Denmark'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xqdc_IYoHVVQ" + }, + "source": [ + "Adicionar o valor 'Brazil' à key= 3. 
Em outras palavras, estamos a adicionar o item {3: 'Brazil'}" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FN7km8C9gAjM", + "outputId": "99f3df3e-aa72-4b86-aff7-9383f4f74dcb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises[3]= 'Brazil'\n", + "d_paises" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1: 'Italy', 2: 'Denmark', 3: 'Brazil'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iwU8pJKRHapD" + }, + "source": [ + "___\n", + "# **ATUALIZAR VALORES DO DICIONÁRIO**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CxXUV7TugLXn" + }, + "source": [ + "O que acontece quando eu atribuo à key 3 outro valor, por exemplo, 'France'. Vamos conferir abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rr6DtJnDgU5I", + "outputId": "b2d7aa85-4860-4c0d-b1dd-7f3ae61f9e17", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Adicionar o valor 'France' à key= 3\n", + "d_paises[3]= 'France'\n", + "d_paises" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1: 'Italy', 2: 'Denmark', 3: 'France'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xB9G1l3_ggo-" + }, + "source": [ + "Como a key= 3 existe no dicionário d_paises, então o Python substitui o valor anterior 'Brazil' pelo novo valor, 'France'. \n", + "\n", + "* Lembre-se, os dicionários são mutáveis!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T8JBxySZHiOJ" + }, + "source": [ + "___\n", + "# **OBTER KEYS DO DICIONÁRIO**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FQtAHjJdb0xK", + "outputId": "aad4e409-d6b5-4c3f-a06c-fff1cb7a6688", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1: 'Italy', 2: 'Denmark', 3: 'France'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ALwbHwi4iwky", + "outputId": "82d7b6c6-53a8-4f1d-abdb-d4906a269097", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises.keys()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dict_keys([1, 2, 3])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FIvi0Li1Hng5" + }, + "source": [ + "___\n", + "# **OBTER VALORES DO DICIONÁRIO**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cp0PPtl3jEKo", + "outputId": "de0432c6-240d-4783-b6ff-7692b3d92188", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises.values()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dict_values(['Italy', 'Denmark', 'France'])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JUblZBMjHrwl" + }, + "source": [ + "___\n", + "# **OBTER ITENS (key, value) DO DICIONÁRIO**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LraTwXjdjG3m", + "outputId": "fa2310f2-2a88-4d4c-8799-bacd6b40fcc6", + "colab": { + 
"base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises.items()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dict_items([(1, 'Italy'), (2, 'Denmark'), (3, 'France')])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IJEMg2LKHyGa" + }, + "source": [ + "___\n", + "# **OBTER VALOR PARA UMA KEY ESPECÍFICA**\n", + "* d_dicionario.get(key)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dzgBhsphjSQm" + }, + "source": [ + "Qual o valor para key= 1?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FUfTjqktjW60", + "outputId": "0ce4b014-66c5-4bbe-af35-1b627f877a6f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "d_paises.get(1)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Italy'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tyJ0KsloIBoD" + }, + "source": [ + "___\n", + "# **COPIAR DICIONÁRIO**\n", + "* d_dicionario.copy()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XL17EmvMkkky", + "outputId": "198fd037-5b42-41c5-a959-fcaaa419d35d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises2 = d_paises.copy()\n", + "d_paises2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{1: 'Italy', 2: 'Denmark', 3: 'France'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8V25l2ZoIG4B" + }, + "source": [ + "___\n", + "# **REMOVER TODOS OS ITENS DO 
DICIONÁRIO**\n", + "* d_dicionario.clear()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r-8Gs1gYjqLN" + }, + "source": [ + "d_paises.clear()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ro_42gzDjsdV", + "outputId": "b31b6516-a7af-45ff-85e2-a7e6fa8e40ec", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_paises" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pCzKkKoujv7G" + }, + "source": [ + "As expected, all items were removed from the dictionary d_paises. However, the (now empty) dictionary d_paises still exists!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MKtPwGVsIaLQ" + }, + "source": [ + "___\n", + "# **DELETE THE DICTIONARY**\n", + "* del d_dicionario" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8wvM-o7Lj7A0" + }, + "source": [ + "del d_paises" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wK83ZURYkD_T", + "outputId": "03254461-9939-4ef9-de30-c4b59c920674", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 166 + } + }, + "source": [ + "d_paises" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0md_paises\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 'd_paises' is not defined" + ] + } + ] + }, + 
{ + "cell_type": "markdown", + "metadata": { + "id": "aSe3veUB1lo_" + }, + "source": [ + "Como esperado, pois agora o dicionário já não existe mais. Ok?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "STtkGUvEg7d1" + }, + "source": [ + "___\n", + "# **ITERAR PELO DICIONÁRIO**\n", + "* Considere o dicionário d_frutas a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IG8hKSvcfalZ" + }, + "source": [ + "# Definindo os valores iniciais do dicionário d_frutas:\n", + "d_frutas = {'Avocado': 0.35, \n", + " 'Apple': 0.40, \n", + " 'Apricot': 0.25, \n", + " 'Banana': 0.30, \n", + " 'Blackcurrant': 0.70, \n", + " 'Blackberry': 0.55, \n", + " 'Blueberry': 0.45, \n", + " 'Cherry': 0.50, \n", + " 'Coconut': 0.75, \n", + " 'Fig': 0.60, \n", + " 'Grape': 0.65, \n", + " 'Kiwi': 0.20, \n", + " 'Lemon': 0.15, \n", + " 'Mango': 0.80, \n", + " 'Nectarine': 0.75, \n", + " 'Orange': 0.25, \n", + " 'Papaya': 0.30,\n", + " 'Passion Fruit': 0.45,\n", + " 'Peach': 0.55,\n", + " 'Pineapple': 0.55,\n", + " 'Plum': 0.60,\n", + " 'Raspberry': 0.40,\n", + " 'Strawberry': 0.50,\n", + " 'Watermelon': 0.45}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ppRkK_jJJG6W" + }, + "source": [ + "Mostrando os itens do dicionário d_frutas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bI7Ctf0ohyz8", + "outputId": "3fedf44b-015e-474f-b5d3-1f2aa09d1957", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "d_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.4,\n", + " 'Apricot': 0.25,\n", + " 'Avocado': 0.35,\n", + " 'Banana': 0.3,\n", + " 'Blackberry': 0.55,\n", + " 'Blackcurrant': 0.7,\n", + " 'Blueberry': 0.45,\n", + " 'Cherry': 0.5,\n", + " 'Coconut': 0.75,\n", + " 'Fig': 0.6,\n", + " 'Grape': 0.65,\n", + " 'Kiwi': 0.2,\n", + " 'Lemon': 0.15,\n", + " 'Mango': 
0.8,\n", + " 'Nectarine': 0.75,\n", + " 'Orange': 0.25,\n", + " 'Papaya': 0.3,\n", + " 'Passion Fruit': 0.45,\n", + " 'Peach': 0.55,\n", + " 'Pineapple': 0.55,\n", + " 'Plum': 0.6,\n", + " 'Raspberry': 0.4,\n", + " 'Strawberry': 0.5,\n", + " 'Watermelon': 0.45}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wXFfyiyPtD35" + }, + "source": [ + "Qual o valor para a fruta 'Apple'? Para responder à esta pergunta, basta lembrar que 'Apple' é uma key do dicionário d_frutas. Certo?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JpreyE_LtCcU", + "outputId": "7ff1b31c-9a76-4de8-d38f-a36e77bd30f6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_frutas['Apple']" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.4" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RMWau2TOclHr", + "outputId": "741f0735-17a1-4f4f-ec4c-4df8bc82c987", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 166 + } + }, + "source": [ + "d_frutas['blablabla'] # Isso significa que 'blablabla' não faz parte do dicionário!" 
+ ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0md_frutas\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'blablabla'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;31m# Isso significa que 'blablabla' não faz parte do dicionário!\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m: 'blablabla'" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JBMf8SbAJmiq" + }, + "source": [ + "## Iterar pelas keys do dicionário:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i-6-pNQCcyXY", + "outputId": "fc039905-cff5-4b3a-cc54-139f3a2204c6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "for chave in d_frutas.keys():\n", + " print(chave)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Avocado\n", + "Apple\n", + "Apricot\n", + "Banana\n", + "Blackcurrant\n", + "Blackberry\n", + "Blueberry\n", + "Cherry\n", + "Coconut\n", + "Fig\n", + "Grape\n", + "Kiwi\n", + "Lemon\n", + "Mango\n", + "Nectarine\n", + "Orange\n", + "Papaya\n", + "Passion Fruit\n", + "Peach\n", + "Pineapple\n", + "Plum\n", + "Raspberry\n", + "Strawberry\n", + "Watermelon\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9u4xJ0FfdCxm" + }, + "source": [ + "## Iterar pelos valores do dicionário:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vrFPwQPDdFP3", + "outputId": "0db51495-3826-4c2e-9f06-4cbbc1ec7376", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "for i_valor in 
d_frutas.values():\n", + " print(i_valor)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "0.35\n", + "0.4\n", + "0.25\n", + "0.3\n", + "0.7\n", + "0.55\n", + "0.45\n", + "0.5\n", + "0.75\n", + "0.6\n", + "0.65\n", + "0.2\n", + "0.15\n", + "0.8\n", + "0.75\n", + "0.25\n", + "0.3\n", + "0.45\n", + "0.55\n", + "0.55\n", + "0.6\n", + "0.4\n", + "0.5\n", + "0.45\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yDkOLvRFJxco" + }, + "source": [ + "## Iterar pelos itens (chave, valor) do dicionário" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "H8BCC6qodU6o", + "outputId": "7cf6c927-b433-47d4-af28-b93c51b960cf", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 54 + } + }, + "source": [ + "d_frutas.items()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dict_items([('Avocado', 0.35), ('Apple', 0.4), ('Apricot', 0.25), ('Banana', 0.3), ('Blackcurrant', 0.7), ('Blackberry', 0.55), ('Blueberry', 0.45), ('Cherry', 0.5), ('Coconut', 0.75), ('Fig', 0.6), ('Grape', 0.65), ('Kiwi', 0.2), ('Lemon', 0.15), ('Mango', 0.8), ('Nectarine', 0.75), ('Orange', 0.25), ('Papaya', 0.3), ('Passion Fruit', 0.45), ('Peach', 0.55), ('Pineapple', 0.55), ('Plum', 0.6), ('Raspberry', 0.4), ('Strawberry', 0.5), ('Watermelon', 0.45)])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DpFB1g-3kDSt", + "outputId": "3a9f1b1a-d96b-47b8-b36d-9e306d4c4271", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "for item in d_frutas.items():\n", + " print(item) " + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "('Avocado', 0.35)\n", + "('Apple', 0.4)\n", + "('Apricot', 0.25)\n", + "('Banana', 0.3)\n", + "('Blackcurrant', 0.7)\n", + "('Blackberry', 
0.55)\n", + "('Blueberry', 0.45)\n", + "('Cherry', 0.5)\n", + "('Coconut', 0.75)\n", + "('Fig', 0.6)\n", + "('Grape', 0.65)\n", + "('Kiwi', 0.2)\n", + "('Lemon', 0.15)\n", + "('Mango', 0.8)\n", + "('Nectarine', 0.75)\n", + "('Orange', 0.25)\n", + "('Papaya', 0.3)\n", + "('Passion Fruit', 0.45)\n", + "('Peach', 0.55)\n", + "('Pineapple', 0.55)\n", + "('Plum', 0.6)\n", + "('Raspberry', 0.4)\n", + "('Strawberry', 0.5)\n", + "('Watermelon', 0.45)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-LmEUroVKDUA" + }, + "source": [ + "## Iterate over the dictionary's keys and values" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oRhZ_Zq9oQIg", + "outputId": "02b624d2-eebd-4666-f2b3-972f1defe060", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "for key, value in d_frutas.items():\n", + "    print(\"%s --> %s\" %(key, value))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Avocado --> 0.35\n", + "Apple --> 0.4\n", + "Apricot --> 0.25\n", + "Banana --> 0.3\n", + "Blackcurrant --> 0.7\n", + "Blackberry --> 0.55\n", + "Blueberry --> 0.45\n", + "Cherry --> 0.5\n", + "Coconut --> 0.75\n", + "Fig --> 0.6\n", + "Grape --> 0.65\n", + "Kiwi --> 0.2\n", + "Lemon --> 0.15\n", + "Mango --> 0.8\n", + "Nectarine --> 0.75\n", + "Orange --> 0.25\n", + "Papaya --> 0.3\n", + "Passion Fruit --> 0.45\n", + "Peach --> 0.55\n", + "Pineapple --> 0.55\n", + "Plum --> 0.6\n", + "Raspberry --> 0.4\n", + "Strawberry --> 0.5\n", + "Watermelon --> 0.45\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fotx7XUquAo8" + }, + "source": [ + "___\n", + "# **CHECK WHETHER A SPECIFIC KEY BELONGS TO THE DICTIONARY**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ju__WsSoKXtk" + }, + "source": [ + "Does the fruit 'Apple' (which in our case is a key) exist in the dictionary?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-gkEKNZPTeMp", + "outputId": "30f46535-0f53-450b-b598-a8eab0f04121", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "'Apple' in d_frutas.keys()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "True" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fMzBeFMIusv7" + }, + "source": [ + "Does the fruit 'Coconut' belong to the dictionary d_frutas?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SKtEwmBCuxyi", + "outputId": "cf52b903-90e0-432f-8b7e-bc878d5c5f5f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "'Coconut' in d_frutas.keys()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "True" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 29 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rrH8ArqsK6Bd" + }, + "source": [ + "___\n", + "# **CHECK WHETHER A VALUE BELONGS TO THE DICTIONARY**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DbWpbuLTK9sn", + "outputId": "a97cf9c4-369a-453f-955d-0078dd83dbf0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "0.4 in d_frutas.values()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "True" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "36kmLUYDvsUI" + }, + "source": [ + "## Add new items to the dictionary\n", + "* Consider the dictionary d_frutas2 below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5Rwq4-UG4--u", + "outputId": "e5da55a7-9d26-4a48-e148-8d5b187d8db0", + "colab": { + "base_uri": 
"https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "d_frutas2 = {'Grapefruit': 1.0}\n", + "d_frutas2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Grapefruit': 1.0}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vljceM6_5H9o" + }, + "source": [ + "The command below merges the items of d_frutas2 into d_frutas." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7BD_mYMM5O5o", + "outputId": "b1d770f5-19fa-4f10-9d4e-d98e14811e88", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 442 + } + }, + "source": [ + "d_frutas.update(d_frutas2)\n", + "d_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.4,\n", + " 'Apricot': 0.25,\n", + " 'Avocado': 0.35,\n", + " 'Banana': 0.3,\n", + " 'Blackberry': 0.55,\n", + " 'Blackcurrant': 0.7,\n", + " 'Blueberry': 0.45,\n", + " 'Cherry': 0.5,\n", + " 'Coconut': 0.75,\n", + " 'Fig': 0.6,\n", + " 'Grape': 0.65,\n", + " 'Grapefruit': 1.0,\n", + " 'Kiwi': 0.2,\n", + " 'Lemon': 0.15,\n", + " 'Mango': 0.8,\n", + " 'Nectarine': 0.75,\n", + " 'Orange': 0.25,\n", + " 'Papaya': 0.3,\n", + " 'Passion Fruit': 0.45,\n", + " 'Peach': 0.55,\n", + " 'Pineapple': 0.55,\n", + " 'Plum': 0.6,\n", + " 'Raspberry': 0.4,\n", + " 'Strawberry': 0.5,\n", + " 'Watermelon': 0.45}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ffh-94lo55n4" + }, + "source": [ + "Now consider the dictionary d_frutas3 below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JMAq_jbP5---" + }, + "source": [ + "d_frutas3 = {'Apple': 0.70}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Jd6B2cy-6KmY" + }, + "source": [ + 
"What is the result of the command below?\n", + "\n", + "* Note: the fruit 'Apple' (a key of d_frutas) has the value 0.40 in d_frutas, while in d_frutas3 it has the value 0.70." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E4GKdTw76PXI", + "outputId": "c9f4ccec-90b3-4e91-bf19-976493cc3933", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 442 + } + }, + "source": [ + "d_frutas.update(d_frutas3)\n", + "d_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.7,\n", + " 'Apricot': 0.25,\n", + " 'Avocado': 0.35,\n", + " 'Banana': 0.3,\n", + " 'Blackberry': 0.55,\n", + " 'Blackcurrant': 0.7,\n", + " 'Blueberry': 0.45,\n", + " 'Cherry': 0.5,\n", + " 'Coconut': 0.75,\n", + " 'Fig': 0.6,\n", + " 'Grape': 0.65,\n", + " 'Grapefruit': 1.0,\n", + " 'Kiwi': 0.2,\n", + " 'Lemon': 0.15,\n", + " 'Mango': 0.8,\n", + " 'Nectarine': 0.75,\n", + " 'Orange': 0.25,\n", + " 'Papaya': 0.3,\n", + " 'Passion Fruit': 0.45,\n", + " 'Peach': 0.55,\n", + " 'Pineapple': 0.55,\n", + " 'Plum': 0.6,\n", + " 'Raspberry': 0.4,\n", + " 'Strawberry': 0.5,\n", + " 'Watermelon': 0.45}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HMmDfrln6o0c" + }, + "source": [ + "As expected, because the key 'Apple' already exists in d_frutas, update() overwrote its value with 0.70 instead of adding a new item." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SWO2GdNovxAp" + }, + "source": [ + "## Modify keys and values" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DX9UTy4TwlAw" + }, + "source": [ + "Suppose we want to apply a 10% discount to every fruit in our dictionary.\n", + "\n", + "* How do we do that?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RV-YOkrffa3h", + "outputId": "6a7a1330-7ecd-4924-bf73-f3e0bc16baae", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "d_frutas.keys()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dict_keys(['Avocado', 'Apple', 'Apricot', 'Banana', 'Blackcurrant', 'Blackberry', 'Blueberry', 'Cherry', 'Coconut', 'Fig', 'Grape', 'Kiwi', 'Lemon', 'Mango', 'Nectarine', 'Orange', 'Papaya', 'Passion Fruit', 'Peach', 'Pineapple', 'Plum', 'Raspberry', 'Strawberry', 'Watermelon', 'Grapefruit'])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 34 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tV8k5w2Bf1Oq", + "outputId": "4ee0496f-ab86-4c6d-fc9f-b04863017bb0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "d_frutas.items()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dict_items([('Avocado', 0.35), ('Apple', 0.4), ('Apricot', 0.25), ('Banana', 0.3), ('Blackcurrant', 0.7), ('Blackberry', 0.55), ('Blueberry', 0.45), ('Cherry', 0.5), ('Coconut', 0.75), ('Fig', 0.6), ('Grape', 0.65), ('Kiwi', 0.2), ('Lemon', 0.15), ('Mango', 0.8), ('Nectarine', 0.75), ('Orange', 0.25), ('Papaya', 0.3), ('Passion Fruit', 0.45), ('Peach', 0.55), ('Pineapple', 0.55), ('Plum', 0.6), ('Raspberry', 0.4), ('Strawberry', 0.5), ('Watermelon', 0.45), ('Grapefruit', 1.0)])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZziGmKGmwqwn", + "outputId": "178faaca-bf7d-4542-a5ef-662b96a70362", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 442 + } + }, + "source": [ + "for key, value in d_frutas.items():\n", + "    d_frutas[key] = round((value * 0.9), 2) # Apply a 10% discount to each fruit's price\n", + "\n", + 
"d_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.63,\n", + " 'Apricot': 0.23,\n", + " 'Avocado': 0.32,\n", + " 'Banana': 0.27,\n", + " 'Blackberry': 0.5,\n", + " 'Blackcurrant': 0.63,\n", + " 'Blueberry': 0.41,\n", + " 'Cherry': 0.45,\n", + " 'Coconut': 0.68,\n", + " 'Fig': 0.54,\n", + " 'Grape': 0.59,\n", + " 'Grapefruit': 0.9,\n", + " 'Kiwi': 0.18,\n", + " 'Lemon': 0.14,\n", + " 'Mango': 0.72,\n", + " 'Nectarine': 0.68,\n", + " 'Orange': 0.23,\n", + " 'Papaya': 0.27,\n", + " 'Passion Fruit': 0.41,\n", + " 'Peach': 0.5,\n", + " 'Pineapple': 0.5,\n", + " 'Plum': 0.54,\n", + " 'Raspberry': 0.36,\n", + " 'Strawberry': 0.45,\n", + " 'Watermelon': 0.41}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s1B-yN8lM-C1" + }, + "source": [ + "Show d_frutas with the updated values:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zZLa85knxBtY", + "outputId": "2c7c12f8-8885-4f34-a0d1-1323e98a9437", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 442 + } + }, + "source": [ + "d_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.63,\n", + " 'Apricot': 0.23,\n", + " 'Avocado': 0.32,\n", + " 'Banana': 0.27,\n", + " 'Blackberry': 0.5,\n", + " 'Blackcurrant': 0.63,\n", + " 'Blueberry': 0.41,\n", + " 'Cherry': 0.45,\n", + " 'Coconut': 0.68,\n", + " 'Fig': 0.54,\n", + " 'Grape': 0.59,\n", + " 'Grapefruit': 0.9,\n", + " 'Kiwi': 0.18,\n", + " 'Lemon': 0.14,\n", + " 'Mango': 0.72,\n", + " 'Nectarine': 0.68,\n", + " 'Orange': 0.23,\n", + " 'Papaya': 0.27,\n", + " 'Passion Fruit': 0.41,\n", + " 'Peach': 0.5,\n", + " 'Pineapple': 0.5,\n", + " 'Plum': 0.54,\n", + " 'Raspberry': 0.36,\n", + " 'Strawberry': 0.45,\n", + " 'Watermelon': 0.41}" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 84 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vpN54l4vxze5" + }, + "source": [ + "## Delete keys from the dictionary\n", + "* Deleting a key removes the entire {key: value} item." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eDlthLStNIwR" + }, + "source": [ + "Suppose we want to delete the fruit 'Avocado' from the dictionary d_frutas.\n", + "\n", + "* How do we do that?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fnpzHZU_x5Y1" + }, + "source": [ + "for key in list(d_frutas.keys()): # list() copies the keys, so deleting inside the loop does not raise RuntimeError\n", + "    if key == 'Avocado':\n", + "        del d_frutas[key] # Delete the item whose key is 'Avocado'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VyPUrobONqvI" + }, + "source": [ + "Show the updated dictionary d_frutas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IwnsHejhyT4l", + "outputId": "5062a125-6a3d-48e3-f272-22d89ebbd09e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "d_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.63,\n", + " 'Apricot': 0.23,\n", + " 'Banana': 0.27,\n", + " 'Blackberry': 0.5,\n", + " 'Blackcurrant': 0.63,\n", + " 'Blueberry': 0.41,\n", + " 'Cherry': 0.45,\n", + " 'Coconut': 0.68,\n", + " 'Fig': 0.54,\n", + " 'Grape': 0.59,\n", + " 'Grapefruit': 0.9,\n", + " 'Kiwi': 0.18,\n", + " 'Lemon': 0.14,\n", + " 'Mango': 0.72,\n", + " 'Nectarine': 0.68,\n", + " 'Orange': 0.23,\n", + " 'Papaya': 0.27,\n", + " 'Passion Fruit': 0.41,\n", + " 'Peach': 0.5,\n", + " 'Pineapple': 0.5,\n", + " 'Plum': 0.54,\n", + " 'Raspberry': 0.36,\n", + " 'Strawberry': 0.45,\n", + " 'Watermelon': 0.41}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": 
"u4HOf9SNytSq" + }, + "source": [ + "## Filter/Select items based on conditions\n", + "In some situations you will want to keep only the dictionary items that satisfy one or more conditions.\n", + "\n", + "* Consider the following example: we want to select only the fruits with prices greater than 0.5." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EwqxWiVlyvgH" + }, + "source": [ + "d_frutas_filtro = {}\n", + "\n", + "for key, value in d_frutas.items():\n", + "    if value > 0.5:\n", + "        d_frutas_filtro.update({key: value})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eb0jmAKWOtYt" + }, + "source": [ + "Show the contents of the dictionary d_frutas_filtro:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SsStWM5k1s-Q", + "outputId": "3d0e9a69-0949-49fd-eb1a-b04859607534", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 170 + } + }, + "source": [ + "d_frutas_filtro" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.63,\n", + " 'Blackcurrant': 0.63,\n", + " 'Coconut': 0.68,\n", + " 'Fig': 0.54,\n", + " 'Grape': 0.59,\n", + " 'Grapefruit': 0.9,\n", + " 'Mango': 0.72,\n", + " 'Nectarine': 0.68,\n", + " 'Plum': 0.54}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u1ve6xIGOjrE" + }, + "source": [ + "As you can see, several fruits have prices above 0.5 and therefore satisfy the condition." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KJqpPrfkCk9L" + }, + "source": [ + "## Calculations with the dictionary items" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "exD8HXodCqg6" + }, + "source": [ + "from collections import Counter" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "llCLTysdCuwB" + }, + "source": [ + "Summing the values of all the fruits:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uG0VP1MNCroX", + "outputId": "034db243-6cd7-4782-ea00-69706bd3cabc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "sum(d_frutas.values())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "11.219999999999997" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a5MBNCF-C5-4" + }, + "source": [ + "How many items are in the dictionary:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AkvygR0PC9bT", + "outputId": "9c2e7a2d-e487-489a-9344-1632414c1f0b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "len(d_frutas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "24" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xBNFaklq8OC9" + }, + "source": [ + "## Sort the dictionary items - sorted(d_dicionario.items(), reverse= True/False)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WULJMjHA-mal" + }, + "source": [ + "Alphabetical order (by key):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SH0WIKZ8-Ylr", + "outputId": "cd096812-50e3-4c0d-875c-fd387e0d6be1", + "colab": { + "base_uri": "https://localhost:8080/", + 
"height": 425 + } + }, + "source": [ + "d_frutas_ordenadas = sorted(d_frutas.items(), reverse = False)\n", + "d_frutas_ordenadas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[('Apple', 0.63),\n", + " ('Apricot', 0.23),\n", + " ('Banana', 0.27),\n", + " ('Blackberry', 0.5),\n", + " ('Blackcurrant', 0.63),\n", + " ('Blueberry', 0.41),\n", + " ('Cherry', 0.45),\n", + " ('Coconut', 0.68),\n", + " ('Fig', 0.54),\n", + " ('Grape', 0.59),\n", + " ('Grapefruit', 0.9),\n", + " ('Kiwi', 0.18),\n", + " ('Lemon', 0.14),\n", + " ('Mango', 0.72),\n", + " ('Nectarine', 0.68),\n", + " ('Orange', 0.23),\n", + " ('Papaya', 0.27),\n", + " ('Passion Fruit', 0.41),\n", + " ('Peach', 0.5),\n", + " ('Pineapple', 0.5),\n", + " ('Plum', 0.54),\n", + " ('Raspberry', 0.36),\n", + " ('Strawberry', 0.45),\n", + " ('Watermelon', 0.41)]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T4Li1Q2d-pnZ" + }, + "source": [ + "Reverse order (by key):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PoBOmfpM_A_a", + "outputId": "079b0969-f79d-47ba-a557-afe19a55c528", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "d_frutas_ordenadas_reverse = sorted(d_frutas.items(), reverse = True)\n", + "d_frutas_ordenadas_reverse" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[('Watermelon', 0.41),\n", + " ('Strawberry', 0.45),\n", + " ('Raspberry', 0.36),\n", + " ('Plum', 0.54),\n", + " ('Pineapple', 0.5),\n", + " ('Peach', 0.5),\n", + " ('Passion Fruit', 0.41),\n", + " ('Papaya', 0.27),\n", + " ('Orange', 0.23),\n", + " ('Nectarine', 0.68),\n", + " ('Mango', 0.72),\n", + " ('Lemon', 0.14),\n", + " ('Kiwi', 0.18),\n", + " ('Grapefruit', 0.9),\n", + " ('Grape', 0.59),\n", + " ('Fig', 0.54),\n", + " ('Coconut', 0.68),\n", + " 
('Cherry', 0.45),\n", + " ('Blueberry', 0.41),\n", + " ('Blackcurrant', 0.63),\n", + " ('Blackberry', 0.5),\n", + " ('Banana', 0.27),\n", + " ('Apricot', 0.23),\n", + " ('Apple', 0.63)]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 27 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FxTC2-U88ajk" + }, + "source": [ + "## The filter() function\n", + "* The filter() function applies a condition to an iterable (here, the dictionary's items) and returns only the items that satisfy it." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iJq1clvOHVG2", + "outputId": "4fcf65ea-fff4-4320-d093-79822f54ae12", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + } + }, + "source": [ + "d_frutas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.63,\n", + " 'Apricot': 0.23,\n", + " 'Banana': 0.27,\n", + " 'Blackberry': 0.5,\n", + " 'Blackcurrant': 0.63,\n", + " 'Blueberry': 0.41,\n", + " 'Cherry': 0.45,\n", + " 'Coconut': 0.68,\n", + " 'Fig': 0.54,\n", + " 'Grape': 0.59,\n", + " 'Grapefruit': 0.9,\n", + " 'Kiwi': 0.18,\n", + " 'Lemon': 0.14,\n", + " 'Mango': 0.72,\n", + " 'Nectarine': 0.68,\n", + " 'Orange': 0.23,\n", + " 'Papaya': 0.27,\n", + " 'Passion Fruit': 0.41,\n", + " 'Peach': 0.5,\n", + " 'Pineapple': 0.5,\n", + " 'Plum': 0.54,\n", + " 'Raspberry': 0.36,\n", + " 'Strawberry': 0.45,\n", + " 'Watermelon': 0.41}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 28 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtTKvNeJNycl" + }, + "source": [ + "### Filtering by key:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uIDW5FhwAiSs", + "outputId": "b266365b-417a-4f9f-9fda-c033446472e8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "d_frutas2 = {chave: valor for chave, valor in filter(lambda t: t[0] == 'Apple', d_frutas.items())}\n", + "d_frutas2" + ], + 
"execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.4}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 4 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JtbUrtyaTl-H" + }, + "source": [ + "#### More Pythonic versions!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "l4etRMMEToau", + "outputId": "bbc5e488-9d65-48ec-b9df-152b89e68c72", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 178 + } + }, + "source": [ + "# Option 1:\n", + "d_frutas2 = dict(filter(lambda t: t[0] == 'Apple', d_frutas.items()))\n", + "d_frutas2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Opção 1:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0md_frutas2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0mchavechave\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mvalor\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mchave\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalor\u001b[0m \u001b[0;32min\u001b[0m \u001b[0md_frutas\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mchave\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'Apple'\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 'd_frutas' is not defined" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0y0cfS61Vbgw" + }, + "source": [ + "# Option 2:\n", + "d_frutas2 = {chave: valor for chave, 
valor in d_frutas.items() if chave == 'Apple'}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3XmPlpNqBVMl" + }, + "source": [ + "### The expression above is equivalent to the loop below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_5j19I7tiHgp", + "outputId": "87e3bd82-8ec6-4f59-c8e2-74aaa80858d3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "d_filtro = {}\n", + "\n", + "for chave, valor in d_frutas.items():\n", + "    if chave == 'Apple':\n", + "        d_filtro.update({chave: valor})\n", + "\n", + "d_filtro" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Apple': 0.4}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nUMGIzxeNt_U" + }, + "source": [ + "### Filtering by value:\n", + "\n", + "The cell below is equivalent to:\n", + "\n", + "```\n", + "d_frutas3 = {}\n", + "\n", + "for key, value in d_frutas.items():\n", + "    if value > 0.5:\n", + "        d_frutas3.update({key: value})\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tvHcQatANltL", + "outputId": "8feaf5b1-1db8-4391-8950-248ba8ab46c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "d_frutas3 = {k: v for k, v in filter(lambda t: t[1] > 0.5, d_frutas.items())}\n", + "d_frutas3" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Blackberry': 0.55,\n", + " 'Blackcurrant': 0.7,\n", + " 'Coconut': 0.75,\n", + " 'Fig': 0.6,\n", + " 'Grape': 0.65,\n", + " 'Mango': 0.8,\n", + " 'Nectarine': 0.75,\n", + " 'Peach': 0.55,\n", + " 'Pineapple': 0.55,\n", + " 'Plum': 0.6}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mqOFuiG1WEMG" + }, + "source": [ + 
"#### More Pythonic versions!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n0QV7jEfWEMH", + "outputId": "bbc5e488-9d65-48ec-b9df-152b89e68c72", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 178 + } + }, + "source": [ + "# Option 1:\n", + "d_frutas3 = dict(filter(lambda t: t[1] > 0.5, d_frutas.items()))\n", + "d_frutas3" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Opção 1:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0md_frutas2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0mchavechave\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mvalor\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mchave\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalor\u001b[0m \u001b[0;32min\u001b[0m \u001b[0md_frutas\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mchave\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'Apple'\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 'd_frutas' is not defined" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5BhyxoDvWEMK" + }, + "source": [ + "# Option 2:\n", + "d_frutas3 = {chave: valor for chave, valor in d_frutas.items() if valor > 0.5}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qA_XhCdmA6Gn" + }, + "source": [ + "___\n", + "# **EXERCISES**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSpyl_URgNyE" + }, + 
"source": [ + "## Exercise 1\n", + "* Is it possible to sort the items of a dictionary? Explain your answer." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CXqc9kHch6Mm" + }, + "source": [ + "## Exercise 2\n", + "* Is it possible to have a dictionary like the one below?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0BBWO9Zth_mc", + "outputId": "14be585a-7315-4901-863a-9ba18090b5e8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "d_colaboradores = {'Gerentes': ['A', 'B', 'C'], 'Programadores': ['B', 'D', 'E', 'F', 'G'], 'Gerentes_Projeto': ['A', 'E']}\n", + "d_colaboradores" + ], + "execution_count": 1, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'Gerentes': ['A', 'B', 'C'],\n", + " 'Gerentes_Projeto': ['A', 'E'],\n", + " 'Programadores': ['B', 'D', 'E', 'F', 'G']}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 1 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TNiJSG_uiePb" + }, + "source": [ + "How do we access the manager 'A'?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rGvVgyz7jxwn", + "outputId": "c4e02509-6910-46c5-d906-b7d6f542dfb3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "d_colaboradores['Gerentes']" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['A', 'B', 'C']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 50 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c-VwXvdij3QQ", + "outputId": "f4344858-8ebf-4e0c-b336-e7a6ed4a43a2", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "d_colaboradores['Programadores']" 
+ ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['B', 'D', 'E', 'F', 'G']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 51 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VBLRSQSw6mGB", + "outputId": "ffdc226c-92c9-4096-ea0f-e13f25dc8a1c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "d_colaboradores['Gerentes'][0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'A'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WV0WaGB4kCiP", + "outputId": "171e4ea0-c66f-49c2-f4ea-deb44b315d43", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "s_gerentes = d_colaboradores['Gerentes']\n", + "s_gerentes[0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'A'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 62 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yRrG7wUgkf6K", + "outputId": "122c0ff9-47af-4a50-874e-42779aa3c068", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "s_gerente_A = d_colaboradores.values()\n", + "s_gerente_A" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dict_values([['A', 'B', 'C'], ['B', 'D', 'E', 'F', 'G'], ['A', 'E']])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 55 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ntVcr_3XwaQ-" + }, + "source": [ + "## Exercise 3\n", + "See the page [Python Data 
Types: Dictionary - Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/dictionary/) for more exercises on dictionaries." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7u5-o8dzlryA" + }, + "source": [ + "## **Exercise 4**\n", + "\n", + "From the dictionary d_colaboradores, return only the programmer named 'E'." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lRKtdcA07gax", + "outputId": "f5352b08-2e67-4a1d-d879-12501f486d43", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "d_colaboradores['Programadores'][2]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'E'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7KRNcC_I-zHh", + "outputId": "9a18f79c-5d4c-4c25-82f0-9309e4d7ec20", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "list(filter(lambda t: t == 'E', d_colaboradores['Programadores']))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['E']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 38 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "02t2Uczp8D03", + "outputId": "13e14ec9-c3fd-4b9e-ed60-d5d191054081", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "s_prog = list(filter(lambda t: t == 'E', d_colaboradores['Programadores']))\n", + "s_prog[0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'E'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 42 + 
} + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zfnP-CArmPb4" + }, + "source": [ + "## **Exercise 5**\n", + "\n", + "Return the job title of the employee named 'A' (search across everyone in the organization)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BVbHId2mR9mc", + "outputId": "4831d244-36fb-4f37-f730-fe0429328b05", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "l_A = [item[0] for item in d_colaboradores.items() for nome in item[1] if nome == 'A']\n", + "l_A" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['Gerentes', 'Gerentes_Projeto']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VzjLVeFvmnjk" + }, + "source": [ + "## **Exercise 6**\n", + "\n", + "* Which employees are, at the same time:\n", + "  * Project Manager and (functional) Manager?\n", + "  * Project Manager and Programmer?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wnCi3kWcl8Sb", + "outputId": "b9e78b70-1fc1-45cf-f54d-df64b6241927", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "l_gerentes = [nome for nome in d_colaboradores['Gerentes_Projeto'] if nome in d_colaboradores['Gerentes']]\n", + "l_gerentes" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['A']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3kZ5FQgjW0Rn", + "outputId": "c058767e-9d8a-4651-a3e9-80d8429ed736", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "l_gerentes = [nome for nome in d_colaboradores['Gerentes_Projeto'] if nome in d_colaboradores['Programadores']]\n", + "l_gerentes" + ], + "execution_count": 10, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['E']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TjO_Ol77YMJj" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB09_01__Functions_hs.ipynb b/Notebooks/NB09_01__Functions_hs.ipynb new file mode 100644 index 000000000..4edc3de18 --- /dev/null +++ b/Notebooks/NB09_01__Functions_hs.ipynb @@ -0,0 +1,1471 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB09_01__Functions.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": 
"text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d_YndS20uqkK" + }, + "source": [ + "

FUNCTIONS

\n", + "\n", + "\n", + "\n", + "# **AGENDA**:\n", + "\n", + "> See the **table of contents** for the items covered in this chapter.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e0UKAZQvJ_c2" + }, + "source": [ + "___\n", + "# **INTRODUCTION TO FUNCTIONS**\n", + "> A function is a named sequence of statements that performs a task.\n", + ">> Pay attention to what PEP 8 recommends about writing functions." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z4-gPTjZUP50" + }, + "source": [ + "# Do not run this code! It is only a syntax template.\n", + "def funcao(arg1, arg2, ..., argN):\n", + "    # block of statements" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "etxNlyRYo39A" + }, + "source": [ + "def show_hello_world():\n", + "    print('Hello World!')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "G6I9PFvZpBgR" + }, + "source": [ + "type(show_hello_world)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_meNdNygpIbv" + }, + "source": [ + "show_hello_world()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6zfLd8HwpPpg" + }, + "source": [ + "___\n", + "# **DOCUMENTING FUNCTIONS WITH COMMENTS/DOCSTRINGS**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3yzgBxtNpRi_" + }, + "source": [ + "def show_hello_world():\n", + "    '''\n", + "    This function prints a greeting: 'Hello World!'\n", + "    Inputs: none (the function takes no arguments).\n", + "    '''\n", + "    print('Hello World!')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0rBaxjpmpbm1" + }, + "source": [ + "show_hello_world()" + ], + "execution_count": null, + "outputs": [] + }, + 
{ + "cell_type": "code", + "metadata": { + "id": "6ThOwDQp4TfR" + }, + "source": [ + "# To see the function's documentation, read its __doc__ attribute:\n", + "show_hello_world.__doc__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9YZ2afpNA4st" + }, + "source": [ + "OR..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uSnwA4BVA5_t" + }, + "source": [ + "help(show_hello_world)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "whbnnMA5p1Jw" + }, + "source": [ + "___\n", + "# **FUNCTIONS WITH ARGUMENTS**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O3bSjLA_qTTc" + }, + "source": [ + "Define the function mostra_nome with two parameters, s_primeiro_nome and s_ultimo_nome:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9jWyCCPPp4yS" + }, + "source": [ + "def mostra_nome(s_primeiro_nome, s_ultimo_nome):\n", + "    print(f'Olá, meu nome é {s_primeiro_nome} {s_ultimo_nome}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VOB3Ip63qIzr" + }, + "source": [ + "mostra_nome('Nelio', 'Machado')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Oi0c_GuesfcL" + }, + "source": [ + "In this call, the first parameter (s_primeiro_nome) receives the value 'Nelio' and the second parameter (s_ultimo_nome) receives 'Machado'." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qkMblpnLsITO" + }, + "source": [ + "However, we can also call the function using keyword arguments:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TTli7e6xsMCo" + }, + "source": [ + "mostra_nome(s_ultimo_nome = 'Machado', s_primeiro_nome = 'Nelio')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rmatMmhTsaVc" + }, + "source": [ + "Note that the result is the same. This time, however, we state explicitly which value each parameter receives, so the order of the arguments no longer matters." + ] + }, + 
{ + "cell_type": "markdown", + "metadata": { + "id": "PnNYrgJ6VQo9" + }, + "source": [ + "## PEP 8 + Annotations = Code that is easier to understand and maintain\n", + "\n", + "> See below how we combine PEP 8 and annotations to make Python code more descriptive. The goal of _annotations_ is to make the code clearer without changing the function's behavior. In the example below, the parameters s_primeiro_nome and s_ultimo_nome are annotated as _str_. Because the function only prints and does not return a value, its return annotation is _None_." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aU2Sob37VVmi" + }, + "source": [ + "def mostra_nome2(s_primeiro_nome: str, s_ultimo_nome: str) -> None:\n", + "    print(f'Olá, meu nome é {s_primeiro_nome} {s_ultimo_nome}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iIvqS73mXNam" + }, + "source": [ + "mostra_nome2(s_ultimo_nome = 'Machado', s_primeiro_nome = 'Nelio')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rSnrtFNtXrbN" + }, + "source": [ + "# **\\*args**\n", + "> \\*args lets a function accept any number of positional arguments beyond the formal parameters it declares." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x2rsiSseqHcX" + }, + "source": [ + "What happens when we call mostra_nome2('Nelio', 'Pereira', 'Machado')?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8ap7pqmOqUnP" + }, + "source": [ + "mostra_nome2('Nelio', 'Pereira', 'Machado')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aT0_PeuEvXiP" + }, + "source": [ + "## Example 1\n", + "> Consider a (simple) function that prints a customer's full name." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Npbi_Hy0bUec" + }, + "source": [ + "# We define the function mostra_nome3 as follows:\n", + "def mostra_nome3(*args):\n", + "    nome = ' '.join(args)\n", + "    print(f'Olá, meu nome é {nome}.')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dFzM0gA3_9za" + }, + "source": [ + "mostra_nome3('Nelio', 'Machado')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "370bpgaSvDbJ" + }, + "source": [ + "Now the function accepts any number of arguments." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DQcRFHu4qnc5" + }, + "source": [ + "mostra_nome3('Nelio', 'Pereira', 'Machado')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4kYcu6PEX-Nz" + }, + "source": [ + "mostra_nome3('Pedro', 'de', 'Alcantara', 'Francisco', 'Antonio', 'Joao', 'Carlos', 'Xavier', 'de', 'Paula', 'Miguel', 'Rafael', 'Joaquim', 'Jose', 'Gonzaga', 'Pascoal', 'Cipriano', 'Serafim')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KMgngPmFimxb" + }, + "source": [ + "Note that, defined this way, the function works regardless of how many arguments we pass to it." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y9pDa6ZRjo0U" + }, + "source": [ + "## Example 2\n", + "* Suppose we want to write a function that multiplies two numbers (passed as arguments)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1A-vhsHxv1YE" + }, + "source": [ + "Before looking at the solution that uses \\*args, let's see what our function would look like if \\*args did not exist." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCDwruF8j5i5" + }, + "source": [ + "### The \"Regular\" Way" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_R03BiwLjtwB" + }, + "source": [ + "# Function definition\n", + "def multiplicar_numeros(x1, x2):\n", + "    '''\n", + "    Purpose: This function multiplies TWO numbers passed as arguments.\n", + "    Author: Nelio Machado\n", + "    Date: 04/10/2020\n", + "    '''\n", + "    return x1 * x2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0eVm1Qj9kDtd" + }, + "source": [ + "print(multiplicar_numeros(3, 4))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4h9Nhkickf_8" + }, + "source": [ + "### Using \\*args" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9Kf89meJkjw8" + }, + "source": [ + "def multiplicar_numeros2(*args):\n", + "    '''\n", + "    Purpose: This function multiplies several numbers passed as arguments.\n", + "    Author: Nelio Machado\n", + "    Date: 04/10/2020\n", + "    '''\n", + "    print(args)\n", + "    print(type(args))\n", + "    x = 1\n", + "    for n in args:\n", + "        x *= n # Same as: x = x * n\n", + "\n", + "    return x" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZuIzwitWk7by" + }, + "source": [ + "print(multiplicar_numeros2(1, 2, 3, 4, 5)) # Same as 5! 
(five factorial)" + ], + "execution_count": null, + "outputs": [] + }, + 
{ + "cell_type": "markdown", + "metadata": { + "id": "U5kyPu792gMN" + }, + "source": [ + "We can also unpack a tuple into the call:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oc2NJmJf2s7X" + }, + "source": [ + "args = (1, 2, 3, 4, 5)\n", + "print(multiplicar_numeros2(*args))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GM5NVX3fsaKv" + }, + "source": [ + "# To check the function's result\n", + "import math\n", + "math.factorial(5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "38jVie_IjMXI" + }, + "source": [ + "# \\**kwargs\n", + "\n", + "* \\**kwargs lets a function accept a variable number of keyword arguments, which it receives as a dictionary.\n", + "* Each argument has the form {key: value}.\n", + "\n", + "* To illustrate \\**kwargs, we will use part of the dFruits dictionary defined in the [Dictionaries](Dictionaries.ipynb) section. If in doubt, go back to that chapter to review the main concepts." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yAntQ724nMbv" + }, + "source": [ + "# Define a function that receives its parameters as a dictionary:\n", + "def imprime_frutas(**kwargs):\n", + "    '''\n", + "    Purpose: This function prints the fruits contained in kwargs.\n", + "    Author: Nelio Machado\n", + "    Date: 04/10/2020\n", + "    '''\n", + "    for key, value in kwargs.items():\n", + "        print(f'O valor de {key} é {value}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jpmSk9mfxww3" + }, + "source": [ + "Pay attention to how the items are passed to the function!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "88-1lStInaVs" + }, + "source": [ + "imprime_frutas(Avocado = 0.35, Apple = 0.4, Apricot = 0.25, Banana = 0.30)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-jb_kkLiyQt8" + }, + "source": [ + "However, we can also pass a dictionary written in the usual form, as follows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JZJNiLz7wgCy" + }, + "source": [ + "d_frutas = {'Apple': 0.4, 'Avocado': 0.3, 'Orange': 0.5, 'Lemon': 0.25}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Emjm6xP7cjmV" + }, + "source": [ + "# In general, we assign/add/update dictionary items as follows:\n", + "# d_frutas[chave] = valor   (chave and valor are placeholders)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7WMsKxh0cPpN" + }, + "source": [ + "# Remember that d_frutas.items() yields the (key, value) pairs:\n", + "d_frutas.items()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eUCum4JPEcxD" + }, + "source": [ + "imprime_frutas(**d_frutas) # Note how we pass the dictionary to the function: **dictionary." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iK8-e7a1sXmn" + }, + "source": [ + "___\n", + "# **Python return**\n", + "> A Python function may or may not return a value explicitly; a function without a return statement returns None." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HS0dGA55siWw" + }, + "source": [ + "def par_ou_impar(i_numero1, i_numero2):\n", + "    '''\n", + "    This function only checks whether the sum of two numbers is even or odd.\n", + "    It returns 'Odd' or 'Even'.\n", + "    '''\n", + "    i_soma = i_numero1 + i_numero2\n", + "    i_modulo = i_soma % 2\n", + "    print(f'A soma é {i_soma}')\n", + "    if i_modulo > 0:\n", + "        return 'Odd'\n", + "    else:\n", + "        return 'Even'" + ], + "execution_count": null, + "outputs": [] + }, + 
{ + "cell_type": "code", + "metadata": { + "id": "mZTG2tDJuIZQ" + }, + "source": [ + "i_numero1 = int(input('Por favor, informe o primeiro número: '))\n", + "i_numero2 = int(input('Por favor, informe o segundo número.: '))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7p_9pq3Du18a" + }, + "source": [ + "type(i_numero1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4oO7aAjcvCAe" + }, + "source": [ + "type(i_numero2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Br7yT8UHuKYY" + }, + "source": [ + "s_resultado = par_ou_impar(i_numero1, i_numero2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "601QnggJuhf-" + }, + "source": [ + "print(f'O resultado é {s_resultado}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t6HNf9j9yKcT" + }, + "source": [ + "Now try to display the value of i_modulo or i_soma:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Yu8RsyDAyXne" + }, + "source": [ + "i_modulo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nx3twrLRyaeJ" + }, + "source": [ + "Python reports that i_modulo does not exist.\n", + "Is that correct?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "imkyRO4kyvgV" + }, + "source": [ + "Consider the following example:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kwRiXDA5y19h" + }, + "source": [ + "i_modulo = 0\n", + "\n", + "def par_ou_impar_v2(i_numero1, i_numero2):\n", + "    '''\n", + "    This function only checks whether the sum of two numbers is even or odd.\n", + "    It returns 'Odd' or 'Even'.\n", + "    '''\n", + "    i_soma = i_numero1 + i_numero2\n", + "    i_modulo = i_soma % 2\n", + "    print(f'A soma é {i_soma}')\n", + "    if i_modulo > 0:\n", + "        return 'Odd'\n", + "    else:\n", + "        return 'Even'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GYxLSGQLy_Ai" + }, + "source": [ + "i_numero1 = int(input('Por favor, informe o primeiro número: '))\n", + "i_numero2 = int(input('Por favor, informe o segundo número.: '))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NMtv99fjzHGs" + }, + "source": [ + "s_resultado = par_ou_impar_v2(i_numero1, i_numero2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qjOHnYDVzNGK" + }, + "source": [ + "print(f'O resultado é {s_resultado}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pPTecxRfzQUc" + }, + "source": [ + "Now let's check the value of i_modulo..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jkQb2mQzzTEo" + }, + "source": [ + "i_modulo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oOlyGxBAzjE3" + }, + "source": [ + "Why does Python recognize the variable i_modulo now?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dceSkt9Z0BZh" + }, + "source": [ + "___\n", + "# **VARIABLE SCOPE: LOCAL & GLOBAL**\n", + "* **Local** - A variable assigned inside the function. In other words, it exists only for use within that function.\n", + "\n", + "* **Global** - A variable assigned outside any function. It can be read from anywhere in the program. However, assigning to the same name inside a function creates a new local variable instead of changing the global one. To rebind the global variable from inside a function, declare it there with the reserved word 'global'." + ] + }, + 
{ + "cell_type": "markdown", + "metadata": { + "id": "0tIjI9GScPxu" + }, + "source": [ + "## Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QRojHHJ20iTY" + }, + "source": [ + "def exemplo1():\n", + "    i_valor = 20\n", + "    i_valor += 1\n", + "    print(i_valor)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "RdhElmTs0y1c" + }, + "source": [ + "exemplo1()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tytq7PnH08pz" + }, + "source": [ + "The scope of the variable 'i_valor' is local, that is, restricted to the function." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "299AK0PA1lIg" + }, + "source": [ + "i_valor" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gGP4cx17y8EZ" + }, + "source": [ + "So the error above makes sense: i_valor is local to the function, so outside the function Python does not know this name." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KTV_6Gzxfvpc" + }, + "source": [ + "## Example 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zyi9AyJwfxTm" + }, + "source": [ + "i_valor = 100\n", + "\n", + "def exemplo2():\n", + "    i_valor = 20\n", + "    i_valor += 1\n", + "    print(i_valor)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iEWrboG6gBSs" + }, + "source": [ + "exemplo2()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JPvT0BHG-vxE" + }, + "source": [ + "This looks a bit odd! Outside the function we set i_valor = 100 and, inside the function, we set i_valor = 20. As we saw, exemplo2() prints 21." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N_t8tIDC-149" + }, + "source": [ + "Next, outside the function, we ask for the value of i_valor and get 100." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I46Bn4FlgJLu" + }, + "source": [ + "i_valor" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IQlP5nbngL6E" + }, + "source": [ + "Can you explain what is happening?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h8PHd6rLgtwK" + }, + "source": [ + "## Example 3" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qB7_zPQVgvVT" + }, + "source": [ + "i_valor = 100\n", + "\n", + "def exemplo3():\n", + "    global i_valor\n", + "    i_valor = 20\n", + "    i_valor += 1\n", + "    print(i_valor)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2KgQSbYCg8Eq" + }, + "source": [ + "exemplo3()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y7yWoojrg_9Z" + }, + "source": [ + "i_valor" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cGlmbIJGzWG6" + }, + "source": [ + "Can you explain what happens in this example?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X8qFfIoxhFOp" + }, + "source": [ + "## Example 4" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZM-yTLuO1bFh" + }, + "source": [ + "i_valor = 20\n", + "\n", + "def exemplo4():\n", + "    i_valor += 1\n", + "    print(i_valor)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oLvfPO8w1zwL" + }, + "source": [ + "exemplo4()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2V7QzpZp2QcM" + }, + "source": [ + "What causes this error?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "w9qI8kln1_C7" + }, + "source": [ + "i_valor" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AQFFGqLI1FWn" + }, + "source": [ + "___\n", + "# **DEFAULT ARGUMENTS**\n", + "> Consider the following example: every time you go to the supermarket, you buy 1 pack of milk (containing 4 bottles) and one 5 L jug of water. We can therefore define our function as follows:" + ] + }, + 
{ + "cell_type": "code", + "metadata": { + "id": "HbcSTiBI4nOj" + }, + "source": [ + "# Define a function that takes the parameters arroz, feijao, leite and agua.\n", + "def lista_de_compras(arroz, feijao, leite=1, agua=1):\n", + "    '''\n", + "    Function documentation: purpose, author and date.\n", + "    '''\n", + "    print('Lista de Compras:')\n", + "    print(f'Quantidade de arroz.: {arroz} kilos.')\n", + "    print(f'Quantidade de feijão: {feijao} kilos.')\n", + "    print(f'Quantidade de leite.: {leite} pack com 4.')\n", + "    print(f'Quantidade de água..: {agua} garrafa de 5 litros.')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vwZnDgoq5pgB" + }, + "source": [ + "lista_de_compras(5, 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l7bY5BSO7eJF" + }, + "source": [ + "Because leite=1 and agua=1 are default values, we do not need to pass these arguments: Python uses the defaults given in the definition. However, if in a given week we need 2 packs of milk instead of 1, we must pass the new value:" + ] + }, + 
{ + "cell_type": "code", + "metadata": { + "id": "YY4OrFuH7yXi" + }, + "source": [ + "lista_de_compras(5, 3, 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-nfrZAvN73YT" + }, + "source": [ + "Likewise, if in another week we need 2 jugs of water instead of 1, we tell Python as follows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vpoh6TdM7_xb" + }, + "source": [ + "lista_de_compras(5, 3, 2, 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q3qZn9FuVQly" + }, + "source": [ + "___\n", + "# **map()**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dav8k0JYWi4B" + }, + "source": [ + "## Example 1\n", + "> Suppose we want the square of each number in a list." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "R6NC0i2OVktM" + }, + "source": [ + "l_numeros = [0, 1, 2, 3, 4, 5]\n", + "l_numeros" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AVjYlN44Vw2k" + }, + "source": [ + "def quadrado_do_numero(i_numero):\n", + "    return i_numero ** 2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "i_4CHiehV7lD" + }, + "source": [ + "list(map(quadrado_do_numero, l_numeros))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5tq8QDSPWNf6" + }, + "source": [ + "OR..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZAfkybybWOcG" + }, + "source": [ + "for i in map(quadrado_do_numero, l_numeros):\n", + "    print(i)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c01V5CEzWlGF" + }, + "source": [ + "## Example 2\n", + "> Replace every True value in the list below with 1 and every False with 0." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qH1ackDZWvKp" + }, + "source": [ + "import random\n", + "\n", + "l_dados = []\n", + "for i in range(50):\n", + "    random.seed(i)\n", + "    l_dados.append(random.choice([True, False]))\n", + "\n", + "l_dados" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Dt2UKC-WXsxr" + }, + "source": [ + "def substituir_true(s_String):\n", + "    if s_String == True:\n", + "        return 1\n", + "    else:\n", + "        return 0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BIIkPuDEXaM0" + }, + "source": [ + "list(map(substituir_true, l_dados))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TzkLIH1gYpFQ" + }, + "source": [ + "___\n", + "# **filter()**\n", + "* Filters elements based on a condition." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cjU8YznfZai1" + }, + "source": [ + "Suppose we now want to keep only the True items of the list l_dados." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "a3SeaKJgZlAZ" + }, + "source": [ + "def filtrar_true(item):\n", + "    if item == True:\n", + "        return True\n", + "    else:\n", + "        return False" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1Z1APDQtZyXs" + }, + "source": [ + "list(filter(filtrar_true, l_dados))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xPpFqVUnKEH7" + }, + "source": [ + "___\n", + "# **EXERCISES**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RDgCRPRs0W6C" + }, + "source": [ + "## Exercise 1\n", + "Write a function that returns the day of the week for a given number, where:\n", + "\n", + "* 1 - Sun\n", + "* 2 - Mon\n", + "* 3 - Tue\n", + "* 4 - Wed\n", + "* 5 - Thu\n", + "* 6 - Fri\n", + "* 7 - Sat" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N53NOsZjOv9m" + }, + "source": [ + "## Exercise 2\n", + "* Write a function that returns True if s_palavra occurs in a string and False otherwise. If it does, also return the position of the word." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NJeqwxDjPxub" + }, + "source": [ + "The sentence below was taken from [+ Bíblia + Camões + Legião Urbana - (Guerra) = Monte Castelo](http://compondoletras.blogspot.com/2013/11/biblia-camoes-legiao-urbana-guerra.html)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Dj_n_beIPRBN" + }, + "source": [ + "s_frase = 'O amor é o fogo que arde sem se ver. É ferida que dói e não se sente. É um contentamento descontente. 
É dor que desatina sem doer'\n", + "s_frase" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "s40FJ9iCPPY0" + }, + "source": [ + "s_palavra = 'fogo'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tzc2eaM7QUFE" + }, + "source": [ + "Is the word s_palavra in s_frase?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pMx9E0xMu1lc" + }, + "source": [ + "## Exercise 3\n", + "For more exercises on functions, see [Python functions - Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/python-functions-exercises.php)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Mw6Wg5hFvFMR" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_hs.ipynb b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_hs.ipynb new file mode 100644 index 000000000..a5ad57e24 --- /dev/null +++ b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_hs.ipynb @@ -0,0 +1,2777 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB10_01__Pandas.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8fpUiw8PwC7_" + }, + "source": [ + "

PANDAS FOR DATA ANALYSIS

\n", + "\n", + "\n", + "\n", + "# **Exercise Answers**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wkxQFPPmeKLl" + }, + "source": [ + "![Pandas](https://github.com/MathMachado/Materials/blob/master/Pandas.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eKawOG-neqaD" + }, + "source": [ + "![Pandas](https://github.com/MathMachado/Materials/blob/master/Pandas2.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iwd1lhq9mrD3" + }, + "source": [ + "___\n", + "# **Exercises**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o_cl0kFgQfFh" + }, + "source": [ + "## Exercise 1\n", + "* From the dataframes USA_Abbrev, USA_Area and USA_Population, build the dataframe USA containing the COLUMNS state, abbreviation, area, ages, year, population.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s8rQUo7yHKJ1" + }, + "source": [ + "* Note: The easiest way to read a CSV file (exported from Excel, for example) from GitHub is to click the CSV file in your GitHub repository and then click 'raw'. Next, copy the address shown in the browser and paste it into the variable 'url'. If in doubt, read the following document: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KTun1uSLuJ-A" + }, + "source": [ + "## Exercise 2\n", + "Source: https://github.com/aakankshaws/Pandas-exercises\n", + "\n", + "* Consider the dataframes below and merge the dataframe df_esquerdo with the dataframe df_direito:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Soq7GVZnuREq" + }, + "source": [ + "import pandas as pd\n", + "\n", + "df_esquerdo = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", + "                            'A': ['A0', 'A1', 'A2', 'A3'],\n", + "                            'B': ['B0', 'B1', 'B2', 'B3']})\n", + "\n", + "df_direito = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", + "                           'C': ['C0', 'C1', 'C2', 'C3'],\n", + "                           'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6KEsTARfvM1C" + }, + "source": [ + "## Exercise 3\n", + "Source: https://github.com/aakankshaws/Pandas-exercises\n", + "\n", + "* Consider the dataframes below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hgxE5gZ9vMEg" + }, + "source": [ + "df_esquerdo = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],\n", + "                            'key2': ['K0', 'K1', 'K0', 'K1'],\n", + "                            'A': ['A0', 'A1', 'A2', 'A3'],\n", + "                            'B': ['B0', 'B1', 'B2', 'B3']})\n", + "\n", + "df_direito = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],\n", + "                           'key2': ['K0', 'K0', 'K0', 'K0'],\n", + "                           'C': ['C0', 'C1', 'C2', 'C3'],\n", + "                           'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iv7AmZ1ivm8R" + }, + "source": [ + "### Questions\n", + "* What is the output of each of the following commands, and how do you interpret it?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TWAW_1tuvvSO" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QjM7pBONvzCJ" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, 
how = 'outer', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D1Rr3Ghsv2iS" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'right', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vXQwLjT-v3Iu" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'left', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EIdltTC-t_lF" + }, + "source": [ + "## Exercise 5\n", + "5.1. Identify and delete the attributes of the dataframe df_Titanic that can be dropped at the start of the data analysis." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bMwPLgWclWBq" + }, + "source": [ + "___\n", + "## Exercise 6 - Solved\n", + "* Load the dataframe Titanic_With_MV.csv and inspect it for inconsistencies and Missing Values (NaN)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ej6WjQX90n1E" + }, + "source": [ + "### Identifying and handling Missing Values\n", + "* As a rule of thumb, we delete variables with more than 50% of Missing Values." 
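+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A minimal sketch of the 50% rule on a small made-up dataframe. The name df_exemplo and its values are hypothetical, and the 50% threshold is only the rule of thumb stated above." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Hypothetical dataframe: column 'b' has 3 missing values out of 4\n", + "df_exemplo = pd.DataFrame({'a': [1, 2, 3, 4],\n", + "                           'b': [np.nan, np.nan, np.nan, 1.0],\n", + "                           'c': [1.0, np.nan, 3.0, 4.0]})\n", + "\n", + "# Percentage of missing values per column\n", + "s_mv_percent = 100 * df_exemplo.isnull().mean()\n", + "\n", + "# Drop the columns with more than 50% of missing values\n", + "l_cols_drop = s_mv_percent[s_mv_percent > 50].index.tolist()\n", + "df_exemplo = df_exemplo.drop(columns = l_cols_drop)\n", + "df_exemplo.columns.tolist()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "isnull().mean() gives the fraction of missing values directly, because each True counts as 1 when averaged."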
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V7KUGAX6lilP" + }, + "source": [ + "import pandas as pd\n", + "df_Titanic = pd.read_csv('https://raw.githubusercontent.com/MathMachado/Python4DS/DS_Python/Dataframes/Titanic_With_MV.csv?token=AGDJQ63MNPPPROFNSO2BZW25XSR72', index_col= 'PassengerId')\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nuaM4JKNLeSI" + }, + "source": [ + "df_Titanic.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GaYc-HXNJ1TQ" + }, + "source": [ + "pd.set_option('display.max_rows', 500)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v5s71jcHIGch" + }, + "source": [ + "# Count of missing values per column and the corresponding percentage\n", + "df_missing_values = pd.DataFrame(df_Titanic.isnull().sum())\n", + "df_missing_values['mv_percent'] = 100*df_missing_values[0]/df_Titanic.shape[0]\n", + "df_missing_values.sort_values(by = 'mv_percent', ascending= False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m3UnAPJakCLR" + }, + "source": [ + "* Data dictionary of the Titanic dataframe:\n", + "  * PassengerID: passenger ID;\n", + "  * survived: indicator, where 1 = passenger survived and 0 = passenger died;\n", + "  * Pclass: passenger class;\n", + "  * Age: passenger age;\n", + "  * SibSp: number of siblings/spouses aboard;\n", + "  * Parch: number of parents/children aboard;\n", + "  * Fare: fare paid by the passenger;\n", + "  * Cabin: passenger cabin;\n", + "  * Embarked: port where the passenger embarked;\n", + "  * Name: passenger name;\n", + "  * sex: passenger sex\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_6RvRCXgwomw" + }, + "source": [ + "### Checking the COLUMNS for inconsistencies" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PToomnfRxxI5" + }, + "source": [ + "import seaborn as 
sns\n", + "import pandas as pd\n", + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3nc_iuRR1Tju" + }, + "source": [ + "# Making the COLUMN names uniform (lowercase)\n", + "df_Titanic.columns= [cols.lower() for cols in df_Titanic.columns]\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G9jteCnAxdnK" + }, + "source": [ + "### Column 'pclass'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wUk0YNlxsgvf" + }, + "source": [ + "df_Titanic['pclass'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9vPrB3AAx0Ym" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='pclass', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2n8s9Ad1m7od" + }, + "source": [ + "Nothing looks odd to me in the variable 'pclass'. Or do you spot anything strange?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m8EGM6gSxrzS" + }, + "source": [ + "### Column 'sex'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BRRgcLtinIRz" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='sex', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8SQ8v2Wnspfb" + }, + "source": [ + "df_Titanic['sex'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wpp0iL0kyGTl" + }, + "source": [ + "What is your opinion about how this column is filled in?\n", + "\n", + "Any problems?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jxx06kJFnNrP" + }, + "source": [ + "Oops... We have several problems here... 
Looking at these results, do you agree that 'male', 'm', 'MALE', 'M', 'mALE' and 'Men' all carry the same information?\n", + "\n", + "Likewise, do 'female', 'f', 'F', 'Female', 'fEMALE', 'Woman', 'w' and 'W' also carry the same information?\n", + "\n", + "So, let's do the following:\n", + "\n", + "* Every time I find one of these values: ['m', 'MALE', 'M', 'mALE', 'Men'], I will replace it with 'male';\n", + "* Every time I find one of these values: ['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W'], I will replace it with 'female'." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oQbEVi1t2tfR" + }, + "source": [ + "df_Titanic2= df_Titanic.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "apc-ccODyZ-d" + }, + "source": [ + "#### Fix with Series.replace()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwoyLBK9oME5" + }, + "source": [ + "df_Titanic['sex2'] = df_Titanic['sex'].replace(['m', 'MALE', 'M', 'mALE', 'Men'], 'male')\n", + "df_Titanic['sex3'] = df_Titanic['sex2'].replace(['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W'], 'female')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RC35I-Njp4vh" + }, + "source": [ + "Let's look at the distribution of the data in the plot again:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1eGvEhA9qAN6" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='sex3', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IY3TaKUcszTQ" + }, + "source": [ + "df_Titanic['sex3'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2nOAcv3iqEaK" + }, + "source": [ + "OK, we have indeed fixed the data-entry problems of the variable 'sex'." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dqLqmrTWylY3" + }, + "source": [ + "#### Fix with Series.map()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dRvuNo4E3Ewx" + }, + "source": [ + "df_Titanic= df_Titanic2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3X0_ZdwCyquk" + }, + "source": [ + "d_sexo= {}\n", + "d_sexo.update(dict.fromkeys(['m', 'MALE', 'M', 'mALE', 'Men', 'male'], 'male'))\n", + "d_sexo.update(dict.fromkeys(['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W', 'female'], 'female'))\n", + "d_sexo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ3lwKRKbsx0" + }, + "source": [ + "Apply the transformation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "idBwRNI7bvCC" + }, + "source": [ + "df_Titanic['sex2'] = df_Titanic['sex'].map(d_sexo)\n", + "df_Titanic['sex2'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FzDl78rfb3p5" + }, + "source": [ + "What is the conclusion? Does this result make more sense than the previous one?" 
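+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One difference worth checking when comparing the two approaches: replace() leaves values that are not listed untouched, while map() with a dictionary turns every value missing from the dictionary into NaN. A small sketch with made-up values (s_exemplo and the value 'x' are hypothetical):" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import pandas as pd\n", + "\n", + "s_exemplo = pd.Series(['m', 'female', 'x'])  # 'x' is an unexpected value\n", + "\n", + "# replace(): values that are not listed stay as they are\n", + "print(s_exemplo.replace(['m'], 'male').tolist())\n", + "\n", + "# map(): values missing from the dictionary become NaN\n", + "print(s_exemplo.map({'m': 'male', 'female': 'female'}).tolist())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So map() makes unexpected categories visible as NaN, which can be convenient when auditing a column, while replace() silently keeps them."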
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SvrZtKRpzIDc" + }, + "source": [ + "# Drop the original 'sex' column and rename 'sex2' to 'sex':\n", + "df_Titanic = df_Titanic.drop(columns = ['sex']).rename(columns= {'sex2': 'sex'})\n", + "\n", + "# Show the data:\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6URC6h8xzfc5" + }, + "source": [ + "sns.catplot(x=\"sex\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "k_spkJbmqdRW" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='sex', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bgBNoXUNzoWZ" + }, + "source": [ + "### Feature Engineering\n", + "#### Column 'cabin'\n", + "* Build the COLUMNS:\n", + "  * deck - letter of Cabin;\n", + "  * seat - number of Cabin" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8fHsLrnut6mk" + }, + "source": [ + "Suggestions:\n", + "1) Do not discard any information (Fábio);\n", + "\n", + "2) A column with the number of cabins booked (Thomaz)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "p0NFFxx8z-vq" + }, + "source": [ + "set(df_Titanic['cabin'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7E6yje89u7KF" + }, + "source": [ + "As we can see, this is a categorical variable with many levels. So let's capture only the first letter of the variable 'cabin'. To do that, we will use str.slice().\n", + "\n", + "> str.slice() - Extracts (slices) part of each string." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wmZLlSaArR6F" + }, + "source": [ + "Next, we capture the first letter of the variable 'cabin':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hUZTJU0MvVxP" + }, + "source": [ + "# defining the variable 'deck', which holds the first letter of the variable 'cabin'\n", + "df_Titanic[\"deck\"] = df_Titanic[\"cabin\"].str.slice(0, 1) # slice(start, stop); stop is exclusive\n", + "df_Titanic['deck'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6myhrth0rZ6t" + }, + "source": [ + "Next, let's extract the numeric part of the variable 'cabin' using Regular Expressions:\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8UXkACPmsfwN" + }, + "source": [ + "# Import the library for Regular Expressions\n", + "import re" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QKk-fnW4rf4o" + }, + "source": [ + "# First, we use split() to break the content of the variable into COLUMNS:\n", + "new = df_Titanic[\"cabin\"].str.split(\" \", n = 3, expand = True)\n", + "new.head(5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dFqoR-Xew9gX" + }, + "source": [ + "Note above that the command produces as many splits as I ask for (n sets the maximum number of splits). For simplicity, however, I am only interested in the first one." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_M7vA6WoVG05" + }, + "source": [ + "Now I will extract the passenger's seat number using Regular Expressions:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rVH5o9KT_IH3" + }, + "source": [ + "# Here is the content of new[0]:\n", + "new[0].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "P7NTcsGOxxSX" + }, + "source": [ + "# Raw string so that the backslash reaches the regex engine unchanged\n", + "new2= new[0].str.extract(r'(\\d+)')\n", + "new2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bf8vw2Mc18bQ" + }, + "source": [ + "Finally, I will add this information to the dataframe df_Titanic:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6l6EoRvsxRXn" + }, + "source": [ + "df_Titanic[\"seat\"] = new2\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LK4V61uy3N9s" + }, + "source": [ + "Then, drop the variable 'cabin':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4uAr55J43NY7" + }, + "source": [ + "df_Titanic= df_Titanic.drop(columns= [\"cabin\"], errors=\"ignore\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZuH7YJXZCgY" + }, + "source": [ + "### Column 'embarked'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nTPikhrIZGya" + }, + "source": [ + "df_Titanic['embarked'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ixbZsuqOZsOc" + }, + "source": [ + "sns.catplot(x=\"embarked\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VvdU8aAwZNvG" + }, + "source": [ + "I see no problems with this variable. Let's move on..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "k2SLRAhrub_B" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='embarked', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YRJcWaYkuxK4" + }, + "source": [ + "sns.countplot(x = 'pclass', hue ='embarked', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rzrOUULUu6-P" + }, + "source": [ + "sns.countplot(x = 'sex', hue ='embarked', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DfSMcYYZ5yLV" + }, + "source": [ + "### Column 'pclass'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q2uU0k-G5yLN" + }, + "source": [ + "df_Titanic['pclass'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gue26Y3A5yLL" + }, + "source": [ + "Any problem with this variable?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q3P82wPp5yK8" + }, + "source": [ + "sns.catplot(x=\"pclass\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qrnc6VUKSTNp" + }, + "source": [ + "### Column 'parch'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2i4ed-0zSvJc" + }, + "source": [ + "df_Titanic['parch'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qd7u__6KZ6DM" + }, + "source": [ + "sns.catplot(x=\"parch\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z9vM3vktC7BG" + }, + "source": [ + "### Feature Engineering\n", + "* Create the column 'sozinho_parch', where sozinho_parch = 1 means the passenger travels alone (no parents or children aboard) and 0 otherwise." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Nd4TyOYjs-HW" + }, + "source": [ + "# Returns 1 when variavel is 0 (nobody else aboard) and 0 otherwise\n", + "def sozinho(variavel):\n", + "    if (variavel == 0):\n", + "        return 1\n", + "    else:\n", + "        return 0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5oByiBuos_B3" + }, + "source": [ + "df_Titanic['sozinho_parch'] = df_Titanic['parch'].map(sozinho)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C1ICby1oSd41" + }, + "source": [ + "### Column 'sibsp'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5n7JNEQqTNjz" + }, + "source": [ + "df_Titanic['sibsp'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NLfMhiy0x4u5" + }, + "source": [ + "* Any problem?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nayYFRK9g8iV" + }, + "source": [ + "sns.catplot(x=\"sibsp\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KzCX2MTmE9Tw" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='sibsp', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_58rZqMaDzf-" + }, + "source": [ + "### Feature Engineering:\n", + "* Create the attribute 'sozinho_sibsp', where sozinho_sibsp = 1 means the passenger travels alone (no siblings or spouse aboard) and 0 otherwise." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HUrJ4IywrEoA" + }, + "source": [ + "df_Titanic['sozinho_sibsp'] = df_Titanic['sibsp'].map(sozinho)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0MO9jj2NvGp_" + }, + "source": [ + "### Column 'fare'\n", + "> Discretize the column 'fare' into 10 buckets." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4-qO2Xk76Buz" + }, + "source": [ + "df_Titanic['fare_class'] = pd.qcut(df_Titanic['fare'], 10, labels=False)\n", + "df_Titanic['fare_class'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "boAj64RHvQHu" + }, + "source": [ + "sns.catplot(x=\"fare_class\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3CIqHUJpvcPa" + }, + "source": [ + "### Column 'age'\n", + "> Discretize the column 'age' into 10 buckets." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rCRnbKX57VN-" + }, + "source": [ + "df_Titanic['age_class'] = pd.qcut(df_Titanic['age'], 10, labels=False)\n", + "df_Titanic['age_class'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uFsZLYDi7VOH" + }, + "source": [ + "sns.catplot(x=\"age_class\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DIY-sL337uje" + }, + "source": [ + "#### Alternative way to discretize 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "W66GkyuKkhFe" + }, + "source": [ + "def Age_Category(age):\n", + "    if pd.isna(age):\n", + "        return np.nan  # keep missing ages as NaN instead of sending them to the last category\n", + "    elif (age <= 1):\n", + "        return 1\n", + "    elif (age <= 5):\n", + "        return 2\n", + "    elif(age <= 10):\n", + "        return 3\n", + "    elif (age <= 15):\n", + "        return 4\n", + "    elif (age <= 20):\n", + "        return 5\n", + "    elif (age <= 25):\n", + "        return 6\n", + "    elif(age < 30):\n", + "        return 7\n", + "    elif(age < 35):\n", + "        return 8\n", + "    elif(age < 40):\n", + "        return 9\n", + "    elif(age < 45):\n", + "        return 10\n", + "    elif(age < 50):\n", + "        return 11\n", + "    elif(age < 60):\n", + "        return 12\n", + "    elif(age < 70):\n", + "        return 13\n", + "    elif(age < 80):\n", + "        return 14\n", + "    else:\n", + "        return 15" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TnLzC6hCkuBL" + }, + "source": [ + "df_Titanic['age_class2'] = df_Titanic['age'].map(Age_Category)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kG8td6HPsNlP" + }, + "source": [ + "set(df_Titanic['age_class2']) # Shows the distinct values of the column, including NaN's, if any." 
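+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a sketch of a more compact route, pd.cut with explicit bin edges gives a similar discretization and leaves missing ages as NaN. The bin edges and the sample ages below are assumptions chosen for illustration; they do not reproduce exactly the cut points of Age_Category, which mixes <= and < comparisons." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "s_age = pd.Series([0.5, 18.0, 33.0, np.nan, 71.0])  # made-up ages\n", + "\n", + "# Right-closed bins: (0, 1], (1, 5], (5, 10], ...\n", + "l_bins = [0, 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 120]\n", + "\n", + "# labels=False returns the bin position (starting at 0); add 1 so categories start at 1\n", + "s_age_class = pd.cut(s_age, bins = l_bins, labels = False) + 1\n", + "s_age_class"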
+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B_3s5cgxfNKQ" + }, + "source": [ + "### Column 'title'\n", + "\n", + "* For Data Manipulation purposes, let's capture the passengers' form of address contained in the variable 'name', that is, 'Mr.', 'Mrs.', 'Miss' and so on.\n", + "\n", + "> Source: The functions get_title and title_map were taken from https://www.kaggle.com/tjsauer/titanic-survival-python-solution" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gslSjRdDoJFY" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XjqEVVnr8R4d" + }, + "source": [ + "First, let's understand how it works, step by step..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D6gjWc3XozK7" + }, + "source": [ + "'Allen, Mr. William Henry'.split(',')[1].split('.')[0].strip()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nfIG6toGfhd5" + }, + "source": [ + "def get_title(nome):\n", + "    if '.' in nome:\n", + "        return nome.split(',')[1].split('.')[0].strip()\n", + "    else:\n", + "        return 'Unknown'\n", + "\n", + "def title_map(title):\n", + "    if title in ['Mr', 'Ms']:\n", + "        return 1\n", + "    elif title in ['Master']:\n", + "        return 2\n", + "    elif title in ['Ms','Mlle','Miss']:\n", + "        return 3\n", + "    elif title in [\"Mme\", \"Ms\", \"Mrs\"]:\n", + "        return 4\n", + "    elif title in [\"Jonkheer\", \"Don\", \"Sir\", \"the Countess\", \"Dona\", \"Lady\"]:\n", + "        return 5\n", + "    elif title in [\"Capt\", \"Col\", \"Major\", \"Dr\", \"Rev\"]:\n", + "        return 6\n", + "    else:\n", + "        return 7" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HLQoJwf0rjrf" + }, + "source": [ + "Exercises\n", + "* Improve the function title_map." 
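+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One possible direction for this exercise, as a sketch: in title_map, 'Ms' is listed in three branches and only the first one can ever match. Storing the groups in a dictionary makes each title appear exactly once. The names d_title and title_map2 are hypothetical, the group numbers follow title_map, and placing 'Ms' together with 'Miss' is an assumption." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Each title maps to exactly one group\n", + "d_title = {}\n", + "d_title.update(dict.fromkeys(['Mr'], 1))\n", + "d_title.update(dict.fromkeys(['Master'], 2))\n", + "d_title.update(dict.fromkeys(['Ms', 'Mlle', 'Miss'], 3))\n", + "d_title.update(dict.fromkeys(['Mme', 'Mrs'], 4))\n", + "d_title.update(dict.fromkeys(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 5))\n", + "d_title.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 6))\n", + "\n", + "def title_map2(title):\n", + "    # Titles that are not in the dictionary fall into group 7\n", + "    return d_title.get(title, 7)\n", + "\n", + "title_map2('Mlle'), title_map2('Unknown')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A dictionary also makes it easy to list the known titles (d_title.keys()) and compare them with the titles that actually occur in the data."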
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7qNUwnCepe_x" + }, + "source": [ + "Capture the passengers' titles:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r-Ltf33vgJ6Q" + }, + "source": [ + "# The column is called 'name' because the column names were converted to lowercase\n", + "df_Titanic['title'] = df_Titanic['name'].apply(get_title).apply(title_map)\n", + "set(df_Titanic['title']) # Shows the distinct values of the variable" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D3hY0WVhpRYK" + }, + "source": [ + "Drop the column 'name', since we will no longer need it in our analyses:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y8i3xKCes5WF" + }, + "source": [ + "df_Titanic= df_Titanic.drop(columns= [\"name\"])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Sl1uFdwpW3y" + }, + "source": [ + "Show the content of the dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2uFnw-pZpan-" + }, + "source": [ + "df_Titanic.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B0fZMKKpdHIl" + }, + "source": [ + "## Missing Value\n", + "> Handle the NaN's in the COLUMNS of the dataframe df_Titanic appropriately.\n", + "\n", + "**Question**: In the column 'value', are the values 0 (zero) considered Missing Values?" 
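+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A minimal sketch of one common treatment, on made-up data: fill numeric NaN's with the median and categorical NaN's with the most frequent value. The dataframe df_exemplo and the choice of median/mode are assumptions; other strategies may suit the real dataframe better." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Hypothetical data with one missing value per column\n", + "df_exemplo = pd.DataFrame({'age': [22.0, np.nan, 30.0, 40.0],\n", + "                           'embarked': ['S', 'C', None, 'S']})\n", + "\n", + "# Numeric column: fill with the median (less sensitive to outliers than the mean)\n", + "df_exemplo['age'] = df_exemplo['age'].fillna(df_exemplo['age'].median())\n", + "\n", + "# Categorical column: fill with the most frequent value\n", + "df_exemplo['embarked'] = df_exemplo['embarked'].fillna(df_exemplo['embarked'].mode()[0])\n", + "\n", + "df_exemplo.isna().sum()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Whatever strategy is chosen, it helps to check isna().sum() before and after, to confirm that no NaN's were left behind."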
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UHzKFytXsNkh" + }, + "source": [ + "df_Titanic['age'].isna().sum()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZC1ULWd883t2" + }, + "source": [ + "## Cause --> effect relationship" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_WCbklv0bDlp" + }, + "source": [ + "The function below will help us with Data Visualization, crossing the response variable 'survived' with any other variable passed to the function:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "epxI-F2UbGGS" + }, + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "def taxa_sobrevivencia(df, column):\n", + "    title_xt = pd.crosstab(df[column], df['survived'])\n", + "    print(pd.crosstab(df[column], df['survived'], margins=True))\n", + "    title_xt_pct = title_xt.div(title_xt.sum(1).astype(float), axis =0)\n", + "    \n", + "    title_xt_pct.plot(kind='bar', stacked=True, title='Passenger Survival Rate', \n", + "                      color= ['r', 'g'])\n", + "    plt.xlabel(column)\n", + "    plt.ylabel('Survival Rate')\n", + "    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),shadow=True, ncol=2)\n", + "    plt.show()\n", + "\n", + "def grafico_catplot(x, y, hue = 'survived', col= None):\n", + "    plt.rcdefaults()\n", + "    # 'survived' is coded as 0 (died) and 1 (survived)\n", + "    g= sns.catplot(x= x, y= y, hue = hue, palette={0:'red', 1:'blue'}, col= col, data = df_Titanic, kind= 'bar', height=4, aspect=.7)\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "34-Qbd_QrC8W" + }, + "source": [ + "What is the relationship between the variable 'sex' and the response variable?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bhY8-UjyrC8Z" + }, + "source": [ + "taxa_sobrevivencia(df_Titanic, 'sex')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UbexhGtayV4X" + }, + "source": [ + "## Exercise 7\n", + "See the page [Pandas Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/pandas/index.php) for more exercises related to this topic." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P62MXm3tK8Ty" + }, + "source": [ + "## Exercise 8\n", + "Create the column 'aleatorio' in the dataframe df_Titanic, where each row receives a random value generated with np.random.random()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Du7Y8E4uFmiu" + }, + "source": [ + "i_linhas_Titanic = df_Titanic.shape[0]\n", + "\n", + "df_Titanic['aleatorio'] = np.random.random(i_linhas_Titanic)\n", + "df_Titanic.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LMD3HksDL0PQ" + }, + "source": [ + "## Exercise 9\n", + "\n", + "1. Load the file FIFA.csv (it is in the Dataframes area of the course);\n", + "2. Which columns can be dropped from the analysis up front? Why is it important to identify what can be dropped?\n", + "3. What is the dtype of each variable/attribute of the dataframe?\n", + "4. If a variable/attribute is of type string (object) but should be numeric, how do we change its type?\n", + "5. Normalize the column names, that is, rename the columns to lowercase;\n", + "6. Are there Missing Values in the data? If so, what is your proposal (the group's proposal) for handling them?\n", + "7. What is the distribution of the number of players by country? Present a table with the distribution.\n", + "8. What is the average age of the players by country (variable/attribute 'Nationality')?\n", + "9. What is the number of players by age?\n", + "10. 
How many players does each club have?\n", + "11. What is the average age by club?\n", + "12. What is the average wage by country?\n", + "13. What is the average wage by club?\n", + "14. What is the average wage by age?\n", + "15. How much does each club spend on wages?\n", + "16. What insights (what can you find out) are there regarding the variable 'Potential' (it measures the players' potential)?\n", + "17. What insights are there regarding the variable overall (the athlete's average rating) by age, club and country?\n", + "18. Which are the best clubs if we take the variables Potential and Overall into account?\n", + "19. Present the ranking of goalkeepers (use the variable/attribute 'Preferred Positions') by Potential and Overall. We are looking for 'GK'.\n", + "20. Who are the fastest players (variable/attribute 'Sprint speed')?\n", + "21. Who are the 5 best players in terms of shooting (shot power) (use the variable/attribute 'Shot power')?\n", + "22. Who are the outliers in terms of wage?\n", + "23. Who are the outliers in terms of shot power?\n", + "24. What is the correlation between the variable 'value' and the other numeric variables of the dataframe, and how do you interpret it?\n", + "25. Build dummy variables for the columns preferred_foot and work_rate. 
Example: preferred_foot_left;\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "70Ml5KyZ04mk" + }, + "source": [ + "Below is the meaning of the values of the variable \"Position\":\n", + "* GK = Goalkeeper.\n", + "* RB = Right Back.\n", + "* CB = Center Back.\n", + "* LB = Left Back.\n", + "* SW = Sweeper.\n", + "* RWB = Right Wing Back.\n", + "* LWB = Left Wing Back.\n", + "* CDM = Central Defensive Midfielder.\n", + "* CM = Central Midfielder.\n", + "* CAM = Central Attacking Midfielder.\n", + "* OM = Offensive Midfielder.\n", + "* LOM = Left Offensive Midfielder.\n", + "* ROM = Right Offensive Midfielder.\n", + "* LM = Left Midfielder.\n", + "* RM = Right Midfielder.\n", + "* LWM = Left Wing Midfielder.\n", + "* RWM = Right Wing Midfielder.\n", + "* RW = Right Winger.\n", + "* LW = Left Winger.\n", + "* LF = Left Forward.\n", + "* RF = Right Forward.\n", + "* ST = Striker.\n", + "* CF = Center Forward.\n", + "* RS = Right Striker.\n", + "* LS = Left Striker." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tjHDjj68zawa" + }, + "source": [ + "## 1. 
Load the file FIFA.csv (it is in the Dataframes area of the course);" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wzosi4Ue1vDs" + }, + "source": [ + "### Load the required libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B0fqR6rzMAa3" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vgoLTamaOC50" + }, + "source": [ + "#### Configure the environment" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RRwi_z8JOFrD" + }, + "source": [ + "d_configuracao = {\n", + "    'display.max_columns': 1000,\n", + "    'display.expand_frame_repr': True,\n", + "    'display.max_rows': 10,\n", + "    'display.precision': 2,\n", + "    'display.show_dimensions': True\n", + "    }\n", + "\n", + "for op, value in d_configuracao.items():\n", + "    pd.set_option(op, value)\n", + "    print(op, value)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MdVljEbcMGU9" + }, + "source": [ + "#### Load the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GMivDUHEMFKp" + }, + "source": [ + "df = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/FIFA.csv?token=AGDJQ63GC7SPIHTGNW73QB27RXRN6')\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7pDUpFVLTOfl" + }, + "source": [ + "#### Set the column 'ID' as the dataframe index" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TEue20CbMp9U" + }, + "source": [ + "df.set_index('ID', inplace = True)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G8CDrpI1_wMd" + }, + "source": [ + "### Removing the \"+\" or \"-\" signs from some columns/variables:\n", + "* Did you notice some columns 
with the \"+\" sign in the name?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7zqHkNCsEDpJ" + }, + "source": [ + "Below are some example columns with this problem:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_hUvJbCqCBBl" + }, + "source": [ + "df[['RS', 'LS', 'ST']].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "78QhptWdEIB0" + }, + "source": [ + "Next, we define a dataframe called df_string holding the pieces of each value split on the \"+\" sign. Note that we get at most 2 columns. Why?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DzeSvQMGF4G7" + }, + "source": [ + "df_string = df['RS'].str.split(r'\\+', n = 4, expand = True) # n limits the number of splits in the output.\n", + "df_string.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PEzqRR5CEUru" + }, + "source": [ + "df_string[0] = pd.to_numeric(df_string[0])\n", + "df_string[1] = pd.to_numeric(df_string[1])\n", + "df_string['RS2'] = df_string[0]+df_string[1]\n", + "\n", + "df_string.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2t4rnjRWFPON" + }, + "source": [ + "df_string.dtypes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MAYju4f6GFzw" + }, + "source": [ + "df_string.drop(columns= [0, 1], axis = 1, inplace = True)\n", + "df = pd.merge(df, df_string, how = 'left', on = 'ID')\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sm5lOGrrHoDp" + }, + "source": [ + " **Challenge**: Next step: turn this into a function to handle the remaining variables!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QtmOlKNpzbOz" + }, + "source": [ + "## 2. 
Which columns can be dropped from the analysis up front? Why is it important to identify what can be dropped?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7TzcuD2GxfBP" + }, + "source": [ + "### Columns that could be dropped up front:\n", + "* Photo\n", + "* Flag\n", + "* Club Logo\n", + "* Unnamed: 0" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kXDe_AdEx3DD" + }, + "source": [ + "df2 = df.copy()\n", + "\n", + "l_cols_drop = ['Unnamed: 0', 'Photo', 'Flag', 'Club Logo']\n", + "df2.drop(columns = l_cols_drop, axis = 1, inplace = True)\n", + "df2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m97dcDy9zbSO" + }, + "source": [ + "## 3. What is the dtype of each variable/attribute in the dataframe?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GEbvITXR2U17" + }, + "source": [ + "# Function that prints the dtype of each column:\n", + "def mostra_tipo(df):\n", + " d_tipos = dict(zip(df.columns, df.dtypes))\n", + " for item in d_tipos.items():\n", + " print(item)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3B9vxmbl9HNP" + }, + "source": [ + "mostra_tipo(df2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5XKcxC0Pzshm" + }, + "source": [ + "## 4. If a variable/attribute is of type string (object) but should be numeric, how do we change its type?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A7T31nFiPdDu" + }, + "source": [ + "### Change the type of some columns\n", + "* Example: 'Wage', 'Value' and 'Release Clause'. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VJSsvOpK71n7" + }, + "source": [ + "df4 = df2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xyV-_MY9688C" + }, + "source": [ + "def transforma_monetarias(coluna):\n", + " if 'M' in coluna:\n", + " return int(float(coluna.replace('M', '')) * 1000000)\n", + "\n", + " elif 'K' in coluna:\n", + " return int(float(coluna.replace('K', '')) * 1000)\n", + " \n", + " else:\n", + " return int(coluna) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AJ9-8sVS6MXj" + }, + "source": [ + "Substituindo o símbolo \"€\" por '':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ArgK2NVe6vqz" + }, + "source": [ + "l_colunas_monetarias = ['Value', 'Wage']\n", + "\n", + "for coluna in l_colunas_monetarias:\n", + " df4[coluna] = df4[coluna].str.replace('€', '')\n", + " df4[coluna] = df4[coluna].apply(lambda x: transforma_monetarias(x))\n", + "\n", + "df4.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c_lznTRHzbV9" + }, + "source": [ + "## 5. 
Normalize the column names, that is, rename the columns to lowercase;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "usM674sR8Gv9" + }, + "source": [ + "df5 = df4.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N6LCmJ0QUsJo" + }, + "source": [ + "### Column names --> Replace \" \" with \"_\" in the names" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NWJYqphfUxn1" + }, + "source": [ + "df5.columns = [c.replace(' ', '_') for c in df5.columns]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lXUOzLWmVTNZ" + }, + "source": [ + "### Rename the columns using lower()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZwwLMOYRVXnr" + }, + "source": [ + "df5.columns = [c.lower() for c in df5.columns]\n", + "mostra_tipo(df5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uc12gBThz1nD" + }, + "source": [ + "## 6. Are there Missing values in the data? If so, what is your proposal (the group's proposal) for handling these Missing values?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nYgvxvcT8QIT" + }, + "source": [ + "df6 = df5.copy()\n", + "df6.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9STC9fsWJAHn" + }, + "source": [ + "# Saving a permanent copy of a few df6 columns for future use\n", + "df6[['overall', 'potential', 'value', 'wage', 'nationality', 'position', 'age', 'preferred_foot']].to_csv('FIFA_algumas_features.csv')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ESFYFvOy8XOM" + }, + "source": [ + "Here I will replace the Missing Values with the median. 
Feel free to use other alternatives such as min, max, mean, or the upper and lower outlier limits." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "j7zDrRvi8iay" + }, + "source": [ + "l_colunas_numericas = df6.select_dtypes(np.number).columns.tolist()\n", + "l_colunas_numericas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "mZEM0N2f9vi7" + }, + "source": [ + "# Median before the replacement:\n", + "df6[l_colunas_numericas].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dzfw0kp69dK2" + }, + "source": [ + "# Replace missing values with the median (fillna returns a new Series, so assign it back)\n", + "for coluna in l_colunas_numericas:\n", + " df6[coluna] = df6[coluna].fillna(df6[coluna].median())\n", + "\n", + "# Median after the replacement:\n", + "df6[l_colunas_numericas].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jpQR9zDC-nEj" + }, + "source": [ + "Below, I identified 252 records with value = 0 --> In these cases, I will assign the median as well." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s1Zj3gBJ-Z5c" + }, + "source": [ + "df6[df6['value'] == 0]['value'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HjuNw2u6-7i9" + }, + "source": [ + "# Median before\n", + "df6['value'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VWEp0Tc_-vLD" + }, + "source": [ + "# Assign the median to the rows where 'value' is 0\n", + "df6.loc[df6['value'] == 0, 'value'] = df6['value'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HynCT_Yu_JL-" + }, + "source": [ + "# Median after\n", + "df6['value'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B4O5kw6h_z3H" + }, + "source": [ + "What if we had replaced with the mean instead of the median? Would anything have changed?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eU7ybhA2zbZh" + }, + "source": [ + "## 7. What is the distribution of the number of players by country? Present a table with the distribution." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A34BwvXrXAqU" + }, + "source": [ + "df7 = df6.copy()\n", + "df7.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fu87YSiudcM_" + }, + "source": [ + "df7.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IQQ7AvgBYZmx" + }, + "source": [ + "df_jogadores_por_paises = pd.DataFrame(df7.groupby(by=['nationality']).size())\n", + "df_jogadores_por_paises.columns = ['numero_jogadores']\n", + "df_jogadores_por_paises.sort_values(by = ['numero_jogadores'], ascending = False, inplace= True)\n", + "df_jogadores_por_paises = df_jogadores_por_paises.reset_index()\n", + "df_jogadores_por_paises\n", + "\n", + "# As a single line it would look like this:\n", + "df_jogadores_por_paises2 = pd.DataFrame(df7.groupby(by=['nationality']).size(), columns= ['numero_jogadores']).sort_values(by = ['numero_jogadores'], ascending = False).reset_index()\n", + "df_jogadores_por_paises2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JfyDUEC2zbcv" + }, + "source": [ + "## 8. What is the average age of the players by country (variable/attribute 'Nationality')?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0a9MvyWPcu-C" + }, + "source": [ + "df_media_idade_por_paises = df7.groupby(by = ['nationality']).agg({'age': ['count', 'mean']}).reset_index()\n", + "df_media_idade_por_paises.columns = ['nationality', 'numero_joagadores', 'media_idade']\n", + "df_media_idade_por_paises.sort_values(by = ['media_idade'], ascending = False, inplace = True)\n", + "df_media_idade_por_paises.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vNmu0xyg0CW4" + }, + "source": [ + "## 9. What is the number of players by age?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DRVvPgpRf9vw" + }, + "source": [ + "df_jogadores_por_idade = df7.groupby(by = ['age']).agg({'age': ['count']}).reset_index()\n", + "df_jogadores_por_idade.columns = ['age', 'numero_joagadores']\n", + "df_jogadores_por_idade.sort_values(by = ['numero_joagadores'], ascending = False, inplace = True)\n", + "df_jogadores_por_idade.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8eChi2NW0CZp" + }, + "source": [ + "## 10. How many players does each club have?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JpNI3ZlHgUx1" + }, + "source": [ + "df_jogadores_por_clube = df7.groupby(by = ['club']).size().reset_index()\n", + "df_jogadores_por_clube.columns = ['clube', 'numero_joagadores']\n", + "df_jogadores_por_clube.sort_values(by = ['numero_joagadores'], ascending = False, inplace = True)\n", + "df_jogadores_por_clube.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gMiibNwW0Cck" + }, + "source": [ + "## 11. What is the average age per club?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D9rF9frzgqSr" + }, + "source": [ + "df_media_idade_por_clube = df7.groupby(by = ['club']).agg({'age': ['count', 'mean']}).reset_index()\n", + "df_media_idade_por_clube.columns = ['clube', 'numero_joagadores', 'media_idade']\n", + "df_media_idade_por_clube.sort_values(by = ['media_idade'], ascending = False, inplace = True)\n", + "df_media_idade_por_clube.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uE_o76xH0QU-" + }, + "source": [ + "## 12. What is the average wage per country?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "keQXqnU7hJy4" + }, + "source": [ + "df_media_salario_por_pais = df7.groupby(by = ['nationality']).agg({'wage': ['count', 'mean']}).reset_index()\n", + "df_media_salario_por_pais.columns = ['nationality', 'numero_joagadores', 'media_salario']\n", + "df_media_salario_por_pais.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n", + "df_media_salario_por_pais.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vqT1ozNA0Cfd" + }, + "source": [ + "## 13. What is the average wage per club?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "54_Q2IGchmN-" + }, + "source": [ + "df_media_salario_por_clube = df7.groupby(by = ['club']).agg({'wage': ['count', 'mean']}).reset_index()\n", + "df_media_salario_por_clube.columns = ['clube', 'numero_joagadores', 'media_salario']\n", + "df_media_salario_por_clube.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n", + "df_media_salario_por_clube.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4eflozOo0Cif" + }, + "source": [ + "## 14. What is the average wage by age?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xtq9Am60hwGr" + }, + "source": [ + "df_media_salario_por_idade = df7.groupby(by = ['age']).agg({'wage': ['count', 'mean']}).reset_index()\n", + "df_media_salario_por_idade.columns = ['age', 'numero_joagadores', 'media_salario']\n", + "df_media_salario_por_idade.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n", + "df_media_salario_por_idade.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L0yRSSIb0WYj" + }, + "source": [ + "## 15. How much does each club spend on wages?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C9N7_pLfh_uq" + }, + "source": [ + "df_soma_salario_por_clube = df7.groupby(by = ['club']).agg({'wage': ['count', 'mean', 'sum']}).reset_index()\n", + "df_soma_salario_por_clube.columns = ['clube', 'numero_joagadores', 'media_salario', 'soma_salario']\n", + "df_soma_salario_por_clube.sort_values(by = ['soma_salario'], ascending = False, inplace = True)\n", + "df_soma_salario_por_clube.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1c7NGMg90YMi" + }, + "source": [ + "## 16. What insights (what can you discover) does the 'Potential' variable (which measures the players' potential) provide?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RAU41Iyaihvc" + }, + "source": [ + "df7.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bM_ePTWfiTFq" + }, + "source": [ + "df_potential_por_clube = df7.groupby(by = ['potential', 'club', 'nationality']).agg({'potential': ['count']}).reset_index()\n", + "df_potential_por_clube.columns = ['potential', 'club', 'nationality', 'numero_joagadores']\n", + "df_potential_por_clube.sort_values(by = ['potential'], ascending = False, inplace = True)\n", + "df_potential_por_clube.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HytWPvfvjTON" + }, + "source": [ + "#### Who is the player with potential = 95?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Fk2X1q7LjWJE" + }, + "source": [ + "df7.loc[df7['potential'] == 95]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W2o4oLzujnHj" + }, + "source": [ + "#### Who are the players with potential = 94?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GOCyMr-qjsL7" + }, + "source": [ + "df7.loc[df7['potential'] == 94]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LHDJimdw0ClU" + }, + "source": [ + "## 17. What insights does the overall variable (the athlete's average rating) provide by age, club and country?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FXFp5nxrj9Yc" + }, + "source": [ + "df_overall = df7.groupby(by = ['overall', 'club', 'nationality']).agg({'overall': ['count']}).reset_index()\n", + "df_overall.columns = ['overall', 'club', 'nationality', 'numero_joagadores']\n", + "df_overall.sort_values(by = ['overall'], ascending = False, inplace = True)\n", + "df_overall.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4LTooiIdk1XV" + }, + "source": [ + "#### Who is the player with overall = 94?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QieAKyi7k5Bb" + }, + "source": [ + "df7.loc[df7['overall'] == 94]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JFH54d1D0b5B" + }, + "source": [ + "## 18. Which are the best clubs if we take the Potential and Overall variables into account?\n", + "* To answer this question, I took the average of overall and potential." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0JZ7PTFTle_d" + }, + "source": [ + "df18 = df7.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "s25u8RoplMZ8" + }, + "source": [ + "df18['overall_potential'] = ((df18['potential']+df18['overall'])/2)\n", + "df18[['name', 'overall', 'potential', 'overall_potential']].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8gJFzhhIlDCH" + }, + "source": [ + "df_overall_potential = df18.groupby(by = ['club', 'nationality', 'age']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n", + "df_overall_potential.columns = ['club', 'nationality', 'age', 'numero_jogadores', 'media_overall_potential']\n", + "df_overall_potential.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n", + "df_overall_potential.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "adpxQpWlmvac" + }, + "source": [ + "In general:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fzn_81eomrj2" + }, + "source": [ + "df_overall_potential2 = df18.groupby(by = ['club']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n", + "df_overall_potential2.columns = ['club', 'numero_jogadores', 'media_overall_potential']\n", + "df_overall_potential2.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n", + "df_overall_potential2.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dM8FehYC0df7" + }, + "source": [ + "## 19. Present the ranking of goalkeepers (use the variable/attribute 'Preferred Positions') by Potential and Overall. We are looking for 'GK'." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_967BF6MnD4U" + }, + "source": [ + "df19 = df18.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wXPah5zOmkXc" + }, + "source": [ + "df_goleiros = df19[df19['position'] == 'GK']\n", + "df_goleiros.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "77ehyNmSnTIB" + }, + "source": [ + "df_overall_potential_goleiros = df_goleiros.groupby(by = ['club']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n", + "df_overall_potential_goleiros.columns = ['club', 'numero_jogadores', 'media_overall_potential']\n", + "df_overall_potential_goleiros.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n", + "df_overall_potential_goleiros.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-dEtuBtF0fiZ" + }, + "source": [ + "## 20. Who are the fastest players (variable/attribute 'Sprint speed')?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KWMU1hMMnxTI" + }, + "source": [ + "df20 = df19.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sezEQIjqnwCZ" + }, + "source": [ + "df20.sort_values(by = 'sprintspeed', ascending = False).head(5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aEg0eaFO0lF6" + }, + "source": [ + "## 21. Who are the 5 best players in terms of shooting (shot strength) (use the variable/attribute 'Shot power')?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xXuj-dc7oA-0" + }, + "source": [ + "df21 = df20.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8HGT_dM2oEES" + }, + "source": [ + "df21.sort_values(by = 'shotpower', ascending = False).head(5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bRk42JIf0moZ" + }, + "source": [ + "## 22. Who are the outliers in terms of wage?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qRNaog7y0qI4" + }, + "source": [ + "### Identifying and handling Outliers\n", + "* What is the average Overall of Barcelona, Juventus and Real Madrid?\n", + "* And what is the average overall after handling the outliers?\n", + "* Which athletes are influencing the average?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_bIxG1Sw9OUB" + }, + "source": [ + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)\n", + "\n", + "Source: [Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qEiikIxNoZkl" + }, + "source": [ + "df22 = df21.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lGYYvE0BoOoV" + }, + "source": [ + "q1_salario, q3_salario = df22['wage'].quantile([0.25,0.75]).to_list()\n", + "iqr_salario = q3_salario - q1_salario\n", + "print(q1_salario, q3_salario)\n", + "print(iqr_salario)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PB44VV9pogT1" + }, + "source": [ + "outlier_salario_inferior = q1_salario - 1.5 * iqr_salario\n", + "outlier_salario_superior = q3_salario + 1.5 * iqr_salario\n", + "\n", + "df_outliers_salario = df22[['name', 'club', 'nationality', 'wage', 'overall', 'potential']]\n", + "\n", + "# Wage 
outliers below the lower limit\n", + "df_outliers_salario[df_outliers_salario['wage'] < outlier_salario_inferior]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9867KNNBqG7Z" + }, + "source": [ + "# Top 10 wage outliers above the upper limit\n", + "df_outliers_salario[df_outliers_salario['wage'] > outlier_salario_superior].sort_values(by = ['wage'], ascending = False).head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gT2zGwq90oQ5" + }, + "source": [ + "## 23. Who are the outliers in terms of shot power?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "05uYj7cwqrdW" + }, + "source": [ + "df23 = df22.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GzbVRU9HqrdZ" + }, + "source": [ + "q1_chute, q3_chute = df23['shotpower'].quantile([0.25,0.75]).to_list()\n", + "iqr_chute = q3_chute - q1_chute\n", + "print(q1_chute, q3_chute)\n", + "print(iqr_chute)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "V5TQX_yGqrda" + }, + "source": [ + "outlier_chute_inferior = q1_chute - 1.5 * iqr_chute\n", + "outlier_chute_superior = q3_chute + 1.5 * iqr_chute\n", + "\n", + "df_outliers_chute = df23[['name', 'club', 'nationality', 'shotpower', 'overall', 'potential']]\n", + "\n", + "# Shot-power outliers below the lower limit\n", + "df_outliers_chute[df_outliers_chute['shotpower'] < outlier_chute_inferior]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "URj1SYXxqrdc" + }, + "source": [ + "# Top 10 shot-power outliers above the upper limit\n", + "df_outliers_chute[df_outliers_chute['shotpower'] > outlier_chute_superior].sort_values(by = ['shotpower'], ascending = False).head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eHm1qeHx0pza" + }, + 
"source": [ + "## 24. Qual a correlação e a interpretação entre as variáveis 'value' e as demais variáveis numéricas do dataframe?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OR6MD2GQ0rNq" + }, + "source": [ + "## 25. Construa variáveis dummy para as colunas preferred_foot e work_rate. preferred_foot_left;" + ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_hs2.ipynb b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_hs2.ipynb new file mode 100644 index 000000000..2c4eeaa3c --- /dev/null +++ b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_hs2.ipynb @@ -0,0 +1,2777 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB10_01__Pandas.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8fpUiw8PwC7_" + }, + "source": [ + "

PANDAS FOR DATA ANALYSIS

\n", + "\n", + "\n", + "\n", + "# **Exercise Answers**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wkxQFPPmeKLl" + }, + "source": [ + "![Pandas](https://github.com/MathMachado/Materials/blob/master/Pandas.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eKawOG-neqaD" + }, + "source": [ + "![Pandas](https://github.com/MathMachado/Materials/blob/master/Pandas2.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iwd1lhq9mrD3" + }, + "source": [ + "___\n", + "# **Exercises**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o_cl0kFgQfFh" + }, + "source": [ + "## Exercise 1\n", + "* From the dataframes USA_Abbrev, USA_Area and USA_Population, build the dataframe USA containing the COLUMNS state, abbreviation, area, ages, year, population.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s8rQUo7yHKJ1" + }, + "source": [ + "* Note: The easiest way to read a CSV file (exported from Excel, for example) from GitHub is to click the csv file in your GitHub repository and then click 'raw'. Then copy the address shown in the browser and paste it into the variable 'url'. If in doubt, read the following document: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KTun1uSLuJ-A" + }, + "source": [ + "## Exercise 2\n", + "Source: https://github.com/aakankshaws/Pandas-exercises\n", + "\n", + "* Consider the dataframes below and merge the dataframe df_esquerdo with the dataframe df_direito:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Soq7GVZnuREq" + }, + "source": [ + "df_esquerdo = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", + " 'A': ['A0', 'A1', 'A2', 'A3'],\n", + " 'B': ['B0', 'B1', 'B2', 'B3']})\n", + " \n", + "df_direito = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", + " 'C': ['C0', 'C1', 'C2', 'C3'],\n", + " 'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6KEsTARfvM1C" + }, + "source": [ + "## Exercise 3\n", + "Source: https://github.com/aakankshaws/Pandas-exercises\n", + "\n", + "* Consider the following dataframes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hgxE5gZ9vMEg" + }, + "source": [ + "df_esquerdo = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],\n", + " 'key2': ['K0', 'K1', 'K0', 'K1'],\n", + " 'A': ['A0', 'A1', 'A2', 'A3'],\n", + " 'B': ['B0', 'B1', 'B2', 'B3']})\n", + " \n", + "df_direito = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],\n", + " 'key2': ['K0', 'K0', 'K0', 'K0'],\n", + " 'C': ['C0', 'C1', 'C2', 'C3'],\n", + " 'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iv7AmZ1ivm8R" + }, + "source": [ + "### Questions\n", + "* What is the output of each of the following commands, and how do you interpret it:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TWAW_1tuvvSO" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QjM7pBONvzCJ" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, 
how = 'outer', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D1Rr3Ghsv2iS" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'right', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vXQwLjT-v3Iu" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'left', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EIdltTC-t_lF" + }, + "source": [ + "## Exercise 5\n", + "5.1. Identify and delete the attributes of the dataframe df_Titanic that can be excluded at the start of the data analysis." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bMwPLgWclWBq" + }, + "source": [ + "___\n", + "## Exercise 6 - Solved\n", + "* Load the dataframe Titanic_With_MV.csv and inspect it for inconsistencies and Missing Values (NaN)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ej6WjQX90n1E" + }, + "source": [ + "### Identifying and handling Missing Values\n", + "* In general, we delete variables with more than 50% Missing Values." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nuaM4JKNLeSI" + }, + "source": [ + "df4.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GaYc-HXNJ1TQ" + }, + "source": [ + "pd.set_option('display.max_rows', 500)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v5s71jcHIGch" + }, + "source": [ + "df_missing_values = pd.DataFrame(df4.isnull().sum())\n", + "df_missing_values['mv_percent'] = 100*df_missing_values[0]/df4.shape[0]\n", + "df_missing_values[0].sort_values(ascending= False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "V7KUGAX6lilP" + }, + "source": [ + "import pandas as pd\n", + "df_Titanic = pd.read_csv('https://raw.githubusercontent.com/MathMachado/Python4DS/DS_Python/Dataframes/Titanic_With_MV.csv?token =AGDJQ63MNPPPROFNSO2BZW25XSR72', index_col= 'PassengerId')\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m3UnAPJakCLR" + }, + "source": [ + "* Data dictionary for the Titanic dataframe:\n", + " * PassengerID: Passenger ID;\n", + " * survived: Indicator, where 1 = passenger survived and 0 = passenger died;\n", + " * Pclass: Class;\n", + " * Age: Passenger's age;\n", + " * SibSp: Number of siblings/spouses aboard;\n", + " * Parch: Number of parents/children aboard;\n", + " * Fare: Fare paid by the passenger;\n", + " * Cabin: Passenger's cabin;\n", + " * Embarked: Port where the passenger embarked.\n", + " * Name: Passenger's name;\n", + " * sex: Passenger's sex\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_6RvRCXgwomw" + }, + "source": [ + "### Checking the COLUMNS for inconsistencies" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PToomnfRxxI5" + }, + "source": [ + "import seaborn as 
sns\n", + "import pandas as pd\n", + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3nc_iuRR1Tju" + }, + "source": [ + "# Standardize the COLUMN names (all lowercase)\n", + "df_Titanic.columns= [cols.lower() for cols in df_Titanic.columns]\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G9jteCnAxdnK" + }, + "source": [ + "### Column 'pclass'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wUk0YNlxsgvf" + }, + "source": [ + "df_Titanic['pclass'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9vPrB3AAx0Ym" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='pclass', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2n8s9Ad1m7od" + }, + "source": [ + "Nothing looks odd about the 'pclass' variable. Or do you spot something strange?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m8EGM6gSxrzS" + }, + "source": [ + "### Column 'sex'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BRRgcLtinIRz" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='sex', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8SQ8v2Wnspfb" + }, + "source": [ + "df_Titanic['sex'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wpp0iL0kyGTl" + }, + "source": [ + "What do you think of the way this column is filled in?\n", + "\n", + "Any problems?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jxx06kJFnNrP" + }, + "source": [ + "Oops... We have several problems here... 
Looking at these results, do you agree that 'male', 'm', 'MALE', 'M', 'mALE' and 'Men' all carry the same information?\n", + "\n", + "Likewise, do 'female', 'f', 'F', 'Female', 'fEMALE', 'Woman', 'w' and 'W' also carry the same information?\n", + "\n", + "So, here is the plan:\n", + "\n", + "* Whenever I find one of these values: ['m', 'MALE', 'M', 'mALE', 'Men'], I replace it with 'male';\n", + "* Whenever I find one of these values: ['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W'], I replace it with 'female'." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oQbEVi1t2tfR" + }, + "source": [ + "df_Titanic2= df_Titanic.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "apc-ccODyZ-d" + }, + "source": [ + "#### Fixing with df.replace()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwoyLBK9oME5" + }, + "source": [ + "df_Titanic['sex2'] = df_Titanic['sex'].replace(['m', 'MALE', 'M', 'mALE', 'Men'], 'male')\n", + "df_Titanic['sex3'] = df_Titanic['sex2'].replace(['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W'], 'female') " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RC35I-Njp4vh" + }, + "source": [ + "Let's look at the distribution of the data in the plot again:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1eGvEhA9qAN6" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='sex3', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IY3TaKUcszTQ" + }, + "source": [ + "df_Titanic['sex3'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2nOAcv3iqEaK" + }, + "source": [ + "OK, we have indeed fixed the inconsistent values of the 'sex' variable." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dqLqmrTWylY3" + }, + "source": [ + "#### Fixing with df.map()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dRvuNo4E3Ewx" + }, + "source": [ + "df_Titanic= df_Titanic2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3X0_ZdwCyquk" + }, + "source": [ + "d_sexo= {}\n", + "d_sexo.update(dict.fromkeys(['m', 'MALE', 'M', 'mALE', 'Men', 'male'], 'male'))\n", + "d_sexo.update(dict.fromkeys(['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W', 'female'], 'female'))\n", + "d_sexo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ3lwKRKbsx0" + }, + "source": [ + "Apply the transformation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "idBwRNI7bvCC" + }, + "source": [ + "df_Titanic['sex2'] = df_Titanic['sex'].map(d_sexo)\n", + "df_Titanic['sex2'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FzDl78rfb3p5" + }, + "source": [ + "What is the conclusion? Do these values make more sense than the previous ones?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SvrZtKRpzIDc" + }, + "source": [ + "# Drop the original 'sex' column and rename 'sex2' to 'sex':\n", + "df_Titanic = df_Titanic.drop(columns = ['sex'], axis = 1).rename(columns= {'sex2': 'sex'})\n", + "\n", + "# Show the data:\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6URC6h8xzfc5" + }, + "source": [ + "sns.catplot(x=\"sex\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "k_spkJbmqdRW" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='sex', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bgBNoXUNzoWZ" + }, + "source": [ + "### Feature Engineering\n", + "#### Column 'cabin'\n", + "* Build the COLUMNS:\n", + "  * deck - the letter in Cabin;\n", + "  * seat - the number in Cabin" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8fHsLrnut6mk" + }, + "source": [ + "Suggestions:\n", + "1) Do not discard any information (Fábio);\n", + "\n", + "2) Add a column with the number of cabins booked (Thomaz)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "p0NFFxx8z-vq" + }, + "source": [ + "set(df_Titanic['cabin'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7E6yje89u7KF" + }, + "source": [ + "As we can see, this is a categorical variable with many levels. So we will keep only the first letter of the 'cabin' variable. To do that, we use the slice() function.\n", + "\n", + "> str.slice() - Extracts (slices) parts of s_Str." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wmZLlSaArR6F" + }, + "source": [ + "Next, we extract the first letter of the 'cabin' variable:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hUZTJU0MvVxP" + }, + "source": [ + "# define the 'deck' variable, which holds the first letter of the 'cabin' variable\n", + "df_Titanic[\"deck\"] = df_Titanic[\"cabin\"].str.slice(0, 1) # slice(start, stop): stop is exclusive\n", + "df_Titanic['deck'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6myhrth0rZ6t" + }, + "source": [ + "Next, let's extract the numeric part of the 'cabin' variable using regular expressions:\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8UXkACPmsfwN" + }, + "source": [ + "# Import the regular expressions library\n", + "import re" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QKk-fnW4rf4o" + }, + "source": [ + "# First, we use split() to break the content of the variable into COLUMNS: \n", + "new = df_Titanic[\"cabin\"].str.split(\" \", n = 3, expand = True) \n", + "new.head(5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dFqoR-Xew9gX" + }, + "source": [ + "Note above that the command produces as many splits of the variable as I ask for. For simplicity, though, I only care about the first split." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_M7vA6WoVG05" + }, + "source": [ + "Now I will extract the passenger's seat number using regular expressions:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rVH5o9KT_IH3" + }, + "source": [ + "# Here is the content of new[0]:\n", + "new[0].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "P7NTcsGOxxSX" + }, + "source": [ + "new2= new[0].str.extract(r'(\\d+)')\n", + "new2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bf8vw2Mc18bQ" + }, + "source": [ + "Finally, I add this information to the df_Titanic dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6l6EoRvsxRXn" + }, + "source": [ + "df_Titanic[\"seat\"] = new2\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LK4V61uy3N9s" + }, + "source": [ + "Last, drop the 'cabin' variable:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4uAr55J43NY7" + }, + "source": [ + "df_Titanic= df_Titanic.drop(columns= [\"cabin\"], axis =1, errors=\"ignore\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZuH7YJXZCgY" + }, + "source": [ + "### Column 'embarked'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nTPikhrIZGya" + }, + "source": [ + "df_Titanic['embarked'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ixbZsuqOZsOc" + }, + "source": [ + "sns.catplot(x=\"embarked\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VvdU8aAwZNvG" + }, + "source": [ + "I see no problems with this variable. Let's move on..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "k2SLRAhrub_B" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='embarked', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YRJcWaYkuxK4" + }, + "source": [ + "sns.countplot(x = 'pclass', hue ='embarked', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rzrOUULUu6-P" + }, + "source": [ + "sns.countplot(x = 'sex', hue ='embarked', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DfSMcYYZ5yLV" + }, + "source": [ + "### Variable 'pclass'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q2uU0k-G5yLN" + }, + "source": [ + "df_Titanic['pclass'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gue26Y3A5yLL" + }, + "source": [ + "Any problem with this variable?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q3P82wPp5yK8" + }, + "source": [ + "sns.catplot(x=\"pclass\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qrnc6VUKSTNp" + }, + "source": [ + "### Column 'parch'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2i4ed-0zSvJc" + }, + "source": [ + "df_Titanic['parch'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qd7u__6KZ6DM" + }, + "source": [ + "sns.catplot(x=\"parch\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z9vM3vktC7BG" + }, + "source": [ + "### Feature Engineering\n", + "* Create the column 'sozinho_parch', where sozinho_parch= 1 means the passenger travels alone and 0 otherwise." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Nd4TyOYjs-HW" + }, + "source": [ + "# Return 1 when variavel is 0 (passenger travels alone) and 0 otherwise\n", + "def sozinho(variavel):\n", + "    if (variavel == 0):\n", + "        return 1\n", + "    else:\n", + "        return 0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5oByiBuos_B3" + }, + "source": [ + "df_Titanic['sozinho_parch'] = df_Titanic['parch'].map(sozinho)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C1ICby1oSd41" + }, + "source": [ + "### Column 'sibsp'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5n7JNEQqTNjz" + }, + "source": [ + "df_Titanic['sibsp'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NLfMhiy0x4u5" + }, + "source": [ + "* Any problems?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nayYFRK9g8iV" + }, + "source": [ + "sns.catplot(x=\"sibsp\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KzCX2MTmE9Tw" + }, + "source": [ + "sns.countplot(x = 'survived', hue ='sibsp', data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_58rZqMaDzf-" + }, + "source": [ + "### Feature Engineering:\n", + "* Create the attribute 'sozinho_sibsp', where sozinho_sibsp= 1 means the passenger travels alone and 0 otherwise." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HUrJ4IywrEoA" + }, + "source": [ + "df_Titanic['sozinho_sibsp'] = df_Titanic['sibsp'].map(sozinho)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0MO9jj2NvGp_" + }, + "source": [ + "### Column 'fare'\n", + "> Discretize the 'fare' column into 10 buckets." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4-qO2Xk76Buz" + }, + "source": [ + "df_Titanic['fare_class'] = pd.qcut(df_Titanic['fare'], 10, labels=False)\n", + "df_Titanic['fare_class'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "boAj64RHvQHu" + }, + "source": [ + "sns.catplot(x=\"fare_class\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3CIqHUJpvcPa" + }, + "source": [ + "### Column 'age'\n", + "> Discretize the 'age' column into 10 buckets." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rCRnbKX57VN-" + }, + "source": [ + "df_Titanic['age_class'] = pd.qcut(df_Titanic['age'], 10, labels=False)\n", + "df_Titanic['age_class'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uFsZLYDi7VOH" + }, + "source": [ + "sns.catplot(x=\"age_class\", kind=\"count\", data = df_Titanic)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DIY-sL337uje" + }, + "source": [ + "#### An alternative way to discretize 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "W66GkyuKkhFe" + }, + "source": [ + "def Age_Category(age):\n", + "    if (age <= 1):\n", + "        return 1\n", + "    elif (age <= 5):\n", + "        return 2\n", + "    elif(age <= 10):\n", + "        return 3\n", + "    elif (age <= 15):\n", + "        return 4\n", + "    elif (age <= 20):\n", + "        return 5\n", + "    elif (age <= 25):\n", + "        return 6\n", + "    elif(age < 30):\n", + "        return 7\n", + "    elif(age < 35):\n", + "        return 8\n", + "    elif(age < 40):\n", + "        return 9\n", + "    elif(age < 45):\n", + "        return 10\n", + "    elif(age < 50):\n", + "        return 11\n", + "    elif(age < 60):\n", + "        return 12\n", + "    elif(age < 70):\n", + "        return 13\n", + "    elif(age < 80):\n", + "        return 14\n", + "    else:\n", + "        return 15" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TnLzC6hCkuBL" + }, + "source": [ + "df_Titanic['age_class2'] = df_Titanic['age'].map(Age_Category)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kG8td6HPsNlP" + }, + "source": [ + "set(df_Titanic['age_class2']) # Lists the distinct categories; note that NaN ages fall through to the else branch (15)." 
+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B_3s5cgxfNKQ" + }, + "source": [ + "### Column 'title'\n", + "\n", + "* For data manipulation purposes, let's capture the passengers' form of address contained in the 'name' variable, that is, 'Mr.', 'Mrs.', 'Miss' and so on.\n", + "\n", + "> Source: the functions get_title and title_map were taken from https://www.kaggle.com/tjsauer/titanic-survival-python-solution" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gslSjRdDoJFY" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XjqEVVnr8R4d" + }, + "source": [ + "First, let's understand how it works, step by step..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D6gjWc3XozK7" + }, + "source": [ + "'Allen, Mr. William Henry'.split(',')[1].split('.')[0].strip()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nfIG6toGfhd5" + }, + "source": [ + "def get_title(nome):\n", + "    if '.' in nome:\n", + "        return nome.split(',')[1].split('.')[0].strip()\n", + "    else:\n", + "        return 'Unknown'\n", + "\n", + "def title_map(title):\n", + "    if title in ['Mr', 'Ms']:\n", + "        return 1\n", + "    elif title in ['Master']:\n", + "        return 2\n", + "    elif title in ['Ms','Mlle','Miss']:\n", + "        return 3\n", + "    elif title in [\"Mme\", \"Ms\", \"Mrs\"]:\n", + "        return 4\n", + "    elif title in [\"Jonkheer\", \"Don\", \"Sir\", \"the Countess\", \"Dona\", \"Lady\"]:\n", + "        return 5\n", + "    elif title in [\"Capt\", \"Col\", \"Major\", \"Dr\", \"Rev\"]:\n", + "        return 6\n", + "    else:\n", + "        return 7" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HLQoJwf0rjrf" + }, + "source": [ + "Exercises\n", + "* Improve the title_map function." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7qNUwnCepe_x" + }, + "source": [ + "Capture the passengers' form of address:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r-Ltf33vgJ6Q" + }, + "source": [ + "df_Titanic['title'] = df_Titanic['name'].apply(get_title).apply(title_map) \n", + "set(df_Titanic['title']) # Lists the distinct values of the variable" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D3hY0WVhpRYK" + }, + "source": [ + "Drop the 'name' column, since we will no longer need it in our analyses:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y8i3xKCes5WF" + }, + "source": [ + "df_Titanic= df_Titanic.drop(columns= [\"name\"], axis =1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Sl1uFdwpW3y" + }, + "source": [ + "Show the content of the dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2uFnw-pZpan-" + }, + "source": [ + "df_Titanic.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B0fZMKKpdHIl" + }, + "source": [ + "## Missing Value\n", + "> Handle the NaNs in the COLUMNS of the df_Titanic dataframe appropriately.\n", + "\n", + "**Question**: In the 'value' column, are values of 0 (zero) considered missing values?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UHzKFytXsNkh" + }, + "source": [ + "df_Titanic['age'].isna().sum()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZC1ULWd883t2" + }, + "source": [ + "## Cause --> effect relationship" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_WCbklv0bDlp" + }, + "source": [ + "The function below helps with data visualization by crossing the response variable 'survived' with any other variable passed to the function:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "epxI-F2UbGGS" + }, + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "def taxa_sobrevivencia(df, column):\n", + "    title_xt = pd.crosstab(df[column], df['survived'])\n", + "    print(pd.crosstab(df[column], df['survived'], margins=True))\n", + "    title_xt_pct = title_xt.div(title_xt.sum(1).astype(float), axis =0)\n", + "    \n", + "    title_xt_pct.plot(kind='bar', stacked=True, title='Passenger Survival Rate', \n", + "                      color= ['r', 'g'])\n", + "    plt.xlabel(column)\n", + "    plt.ylabel('Survival Rate')\n", + "    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),shadow=True, ncol=2)\n", + "    plt.show()\n", + "\n", + "def grafico_catplot(x, y, hue = 'survived', col= None):\n", + "    plt.rcdefaults()\n", + "    g= sns.catplot(x= x, y= y, hue = hue, palette={'Died':'red','Survived':'blue'}, col= col, data = df_Titanic, kind= 'bar', height=4, aspect=.7)\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "34-Qbd_QrC8W" + }, + "source": [ + "What is the relationship between the 'sex' variable and the response variable?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bhY8-UjyrC8Z" + }, + "source": [ + "taxa_sobrevivencia(df_Titanic, 'sex')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UbexhGtayV4X" + }, + "source": [ + "## Exercise 7\n", + "See the page [Pandas Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/pandas/index.php) for more exercises on this topic." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P62MXm3tK8Ty" + }, + "source": [ + "## Exercise 8\n", + "Create the column 'aleatorio' in the df_Titanic dataframe, where each row receives a random value generated with np.random.random()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Du7Y8E4uFmiu" + }, + "source": [ + "i_linhas_Titanic = df_Titanic.shape[0]\n", + "\n", + "df_Titanic['aleatorio'] = np.random.random(i_linhas_Titanic)\n", + "df_Titanic.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LMD3HksDL0PQ" + }, + "source": [ + "## Exercise 9\n", + "\n", + "1. Load the file FIFA.csv (it is in the Dataframes area of the course);\n", + "2. Which columns can be dropped from the analysis up front? Why is it important to identify what can be dropped?\n", + "3. What is the dtype of each variable/attribute of the dataframe?\n", + "4. If a variable/attribute is of type string (object) but should be numeric, how do we change its type?\n", + "5. Normalize the column names, that is, rename the columns to lowercase;\n", + "6. Are there missing values in the data? If so, what is your proposal (the group's proposal) for handling them?\n", + "7. What is the distribution of the number of players per country? Present a table with the distribution.\n", + "8. What is the average age of the players per country (variable/attribute 'Nationality')?\n", + "9. What is the number of players per age?\n", + "10. 
How many players does each club have?\n", + "11. What is the average age per club?\n", + "12. What is the average wage per country?\n", + "13. What is the average wage per club?\n", + "14. What is the average wage per age?\n", + "15. How much does each club spend on wages?\n", + "16. What insights (what can you find out) are there about the variable 'Potential' (it measures the players' potential)?\n", + "17. What are the insights about the variable overall (the athlete's average rating) by age, club and country?\n", + "18. Which are the best clubs if we take the variables Potential and Overall into account?\n", + "19. Present the ranking of goalkeepers (use the variable/attribute 'Preferred Positions') by Potential and Overall. We are looking for 'GK'.\n", + "20. Who are the fastest players (variable/attribute 'Sprint speed')?\n", + "21. Who are the 5 best players in terms of shooting (shot strength) (use the variable/attribute 'Shot power')?\n", + "22. Who are the outliers in terms of wage?\n", + "23. Who are the outliers in terms of shot power?\n", + "24. What is the correlation between the variable 'value' and the other numeric variables of the dataframe, and how do you interpret it?\n", + "25. Build dummy variables for the columns preferred_foot and work_rate. 
Example: preferred_foot_left;\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "70Ml5KyZ04mk" + }, + "source": [ + "Below is the meaning of the values of the \"Position\" variable:\n", + "* GK = Goalkeeper.\n", + "* RB = Right Back.\n", + "* CB = Central Back.\n", + "* LB = Left Back.\n", + "* SW = Sweeper.\n", + "* RWB = Right Wing Back.\n", + "* LWB = Left Wing Back.\n", + "* CDM = Central Defensive Midfielder (holding midfielder).\n", + "* CM = Central Midfielder.\n", + "* CAM = Central Attacking Midfielder (playmaker).\n", + "* OM = Offensive Midfielder.\n", + "* LOM = Left Offensive Midfielder.\n", + "* ROM = Right Offensive Midfielder.\n", + "* LM = Left Midfielder.\n", + "* RM = Right Midfielder.\n", + "* LWM = Left Wing Midfielder.\n", + "* RWM = Right Wing Midfielder.\n", + "* RW = Right Winger.\n", + "* LW = Left Winger.\n", + "* LF = Left Forward.\n", + "* RF = Right Forward.\n", + "* ST = Striker.\n", + "* CF = Center Forward.\n", + "* RS = Right Striker.\n", + "* LS = Left Striker." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tjHDjj68zawa" + }, + "source": [ + "## 1. 
Load the file FIFA.csv (it is in the Dataframes area of the course);" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wzosi4Ue1vDs" + }, + "source": [ + "### Load the required libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B0fqR6rzMAa3" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vgoLTamaOC50" + }, + "source": [ + "#### Configure the environment" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RRwi_z8JOFrD" + }, + "source": [ + "d_configuracao = {\n", + "    'display.max_columns': 1000,\n", + "    'display.expand_frame_repr': True,\n", + "    'display.max_rows': 10,\n", + "    'display.precision': 2,\n", + "    'display.show_dimensions': True\n", + "    }\n", + "\n", + "for op, value in d_configuracao.items():\n", + "    pd.set_option(op, value)\n", + "    print(op, value)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MdVljEbcMGU9" + }, + "source": [ + "#### Load the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GMivDUHEMFKp" + }, + "source": [ + "df = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/FIFA.csv?token=AGDJQ63GC7SPIHTGNW73QB27RXRN6') #, index_col= 'PassengerId')\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7pDUpFVLTOfl" + }, + "source": [ + "#### Set the 'ID' column as the dataframe index" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TEue20CbMp9U" + }, + "source": [ + "df.set_index('ID', inplace = True)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G8CDrpI1_wMd" + }, + "source": [ + "### Function to remove the \"+\" or \"-\" signs in some columns/variables:\n", + "* Did you notice some columns 
with the \"+\" sign in them?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7zqHkNCsEDpJ" + }, + "source": [ + "Below is an example of some columns with this problem:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_hUvJbCqCBBl" + }, + "source": [ + "df[['RS', 'LS', 'ST']].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "78QhptWdEIB0" + }, + "source": [ + "Next, we build a dataframe called df_string holding the pieces obtained by splitting on the \"+\" sign. Note that we get at most 2 columns. Why?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DzeSvQMGF4G7" + }, + "source": [ + "df_string = df['RS'].str.split(r'\\+', n = 4, expand = True) # n limits the number of splits in the output.\n", + "df_string.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PEzqRR5CEUru" + }, + "source": [ + "df_string[0] = pd.to_numeric(df_string[0])\n", + "df_string[1] = pd.to_numeric(df_string[1])\n", + "df_string['RS2'] = df_string[0]+df_string[1]\n", + "\n", + "df_string.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2t4rnjRWFPON" + }, + "source": [ + "df_string.dtypes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MAYju4f6GFzw" + }, + "source": [ + "df_string.drop(columns= [0, 1], axis = 1, inplace = True)\n", + "df = pd.merge(df, df_string, how = 'left', on = 'ID')\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sm5lOGrrHoDp" + }, + "source": [ + " **Challenge**: Next step: turn this into a function to handle the remaining variables!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QtmOlKNpzbOz" + }, + "source": [ + "## 2. 
Which columns can be dropped from the analysis up front? Why is it important to identify what can be dropped?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7TzcuD2GxfBP" + }, + "source": [ + "### Columns that could be dropped up front:\n", + "* Photo\n", + "* Flag\n", + "* Club Logo\n", + "* Unnamed: 0" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kXDe_AdEx3DD" + }, + "source": [ + "df2 = df.copy()\n", + "\n", + "l_cols_drop = ['Unnamed: 0', 'Photo', 'Flag', 'Club Logo']\n", + "df2.drop(columns = l_cols_drop, axis = 1, inplace = True)\n", + "df2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m97dcDy9zbSO" + }, + "source": [ + "## 3. What is the dtype of each variable/attribute of the dataframe?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GEbvITXR2U17" + }, + "source": [ + "# Function that prints the type of each column:\n", + "def mostra_tipo(df):\n", + "    d_tipos = dict(zip(df.columns, df.dtypes))\n", + "    for item in d_tipos.items():\n", + "        print(item)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3B9vxmbl9HNP" + }, + "source": [ + "mostra_tipo(df2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5XKcxC0Pzshm" + }, + "source": [ + "## 4. If a variable/attribute is of type string (object) but should be numeric, how do we change its type?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A7T31nFiPdDu" + }, + "source": [ + "### Change the type of some columns\n", + "* Example: 'Wage', 'Value' and 'Release Clause'. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VJSsvOpK71n7" + }, + "source": [ + "df4 = df2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xyV-_MY9688C" + }, + "source": [ + "def transforma_monetarias(coluna):\n", + "    if 'M' in coluna:\n", + "        return int(float(coluna.replace('M', '')) * 1000000)\n", + "\n", + "    elif 'K' in coluna:\n", + "        return int(float(coluna.replace('K', '')) * 1000)\n", + "    \n", + "    else:\n", + "        return int(coluna) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AJ9-8sVS6MXj" + }, + "source": [ + "Replacing the \"€\" symbol with '':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ArgK2NVe6vqz" + }, + "source": [ + "l_colunas_monetarias = ['Value', 'Wage']\n", + "\n", + "for coluna in l_colunas_monetarias:\n", + "    df4[coluna] = df4[coluna].str.replace('€', '')\n", + "    df4[coluna] = df4[coluna].apply(lambda x: transforma_monetarias(x))\n", + "\n", + "df4.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c_lznTRHzbV9" + }, + "source": [ + "## 5. 
Normalize the column names, i.e., rename the columns to lowercase." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "usM674sR8Gv9" + }, + "source": [ + "df5 = df4.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N6LCmJ0QUsJo" + }, + "source": [ + "### Column names --> replace \" \" with \"_\" in the names" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NWJYqphfUxn1" + }, + "source": [ + "df5.columns = [c.replace(' ', '_') for c in df5.columns]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lXUOzLWmVTNZ" + }, + "source": [ + "### Rename the columns using lower()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZwwLMOYRVXnr" + }, + "source": [ + "df5.columns = [c.lower() for c in df5.columns]\n", + "mostra_tipo(df5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uc12gBThz1nD" + }, + "source": [ + "## 6. Are there missing values in the data? If so, what is your proposal (the group's proposal) for handling them?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nYgvxvcT8QIT" + }, + "source": [ + "df6 = df5.copy()\n", + "df6.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9STC9fsWJAHn" + }, + "source": [ + "# Saving a permanent copy of selected df6 columns for future use\n", + "df6[['overall', 'potential', 'value', 'wage', 'nationality', 'position', 'age', 'preferred_foot']].to_csv('FIFA_algumas_features.csv')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ESFYFvOy8XOM" + }, + "source": [ + "Here I replace the missing values with the median. 
Feel free to use other alternatives instead, such as the min, max, mean, or the upper and lower outlier limits." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "j7zDrRvi8iay" + }, + "source": [ + "l_colunas_numericas = df6.select_dtypes(np.number).columns.tolist()\n", + "l_colunas_numericas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "mZEM0N2f9vi7" + }, + "source": [ + "# Median before the replacement:\n", + "df6[l_colunas_numericas].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dzfw0kp69dK2" + }, + "source": [ + "# Replace missing values with the median\n", + "for coluna in l_colunas_numericas:\n", + "    df6[coluna] = df6[coluna].fillna(df6[coluna].median()) # fillna returns a new Series, so assign it back\n", + "\n", + "# Median after the replacement:\n", + "df6[l_colunas_numericas].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jpQR9zDC-nEj" + }, + "source": [ + "Below, I found 252 records with value = 0 --> in these cases I will assign the median as well." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s1Zj3gBJ-Z5c" + }, + "source": [ + "df6[df6['value'] == 0]['value'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HjuNw2u6-7i9" + }, + "source": [ + "# Median before\n", + "df6['value'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VWEp0Tc_-vLD" + }, + "source": [ + "# Assign the median to the rows where 'value' is 0\n", + "df6.loc[df6['value'] == 0, 'value'] = df6['value'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HynCT_Yu_JL-" + }, + "source": [ + "# Median after\n", + "df6['value'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B4O5kw6h_z3H" + }, + "source": [ + "What if we had used the mean instead of the median? Would anything have changed?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eU7ybhA2zbZh" + }, + "source": [ + "## 7. What is the distribution of the number of players by country? Present a table with the distribution." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A34BwvXrXAqU" + }, + "source": [ + "df7 = df6.copy()\n", + "df7.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fu87YSiudcM_" + }, + "source": [ + "df7.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IQQ7AvgBYZmx" + }, + "source": [ + "df_jogadores_por_paises = pd.DataFrame(df7.groupby(by=['nationality']).size())\n", + "df_jogadores_por_paises.columns = ['numero_jogadores']\n", + "df_jogadores_por_paises.sort_values(by = ['numero_jogadores'], ascending = False, inplace= True)\n", + "df_jogadores_por_paises = df_jogadores_por_paises.reset_index()\n", + "df_jogadores_por_paises\n", + "\n", + "# As a single statement, it would look like this:\n", + "df_jogadores_por_paises2 = pd.DataFrame(df7.groupby(by=['nationality']).size(), columns= ['numero_jogadores']).sort_values(by = ['numero_jogadores'], ascending = False).reset_index()\n", + "df_jogadores_por_paises2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JfyDUEC2zbcv" + }, + "source": [ + "## 8. What is the average player age by country (variable/attribute 'Nationality')?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0a9MvyWPcu-C" + }, + "source": [ + "df_media_idade_por_paises = df7.groupby(by = ['nationality']).agg({'age': ['count', 'mean']}).reset_index()\n", + "df_media_idade_por_paises.columns = ['nationality', 'numero_jogadores', 'media_idade']\n", + "df_media_idade_por_paises.sort_values(by = ['media_idade'], ascending = False, inplace = True)\n", + "df_media_idade_por_paises.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vNmu0xyg0CW4" + }, + "source": [ + "## 9. What is the number of players by age?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DRVvPgpRf9vw" + }, + "source": [ + "df_jogadores_por_idade = df7.groupby(by = ['age']).agg({'age': ['count']}).reset_index()\n", + "df_jogadores_por_idade.columns = ['age', 'numero_jogadores']\n", + "df_jogadores_por_idade.sort_values(by = ['numero_jogadores'], ascending = False, inplace = True)\n", + "df_jogadores_por_idade.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8eChi2NW0CZp" + }, + "source": [ + "## 10. How many players does each club have?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JpNI3ZlHgUx1" + }, + "source": [ + "df_jogadores_por_clube = df7.groupby(by = ['club']).size().reset_index()\n", + "df_jogadores_por_clube.columns = ['clube', 'numero_jogadores']\n", + "df_jogadores_por_clube.sort_values(by = ['numero_jogadores'], ascending = False, inplace = True)\n", + "df_jogadores_por_clube.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gMiibNwW0Cck" + }, + "source": [ + "## 11. What is the average age by club?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D9rF9frzgqSr" + }, + "source": [ + "df_media_idade_por_clube = df7.groupby(by = ['club']).agg({'age': ['count', 'mean']}).reset_index()\n", + "df_media_idade_por_clube.columns = ['clube', 'numero_jogadores', 'media_idade']\n", + "df_media_idade_por_clube.sort_values(by = ['media_idade'], ascending = False, inplace = True)\n", + "df_media_idade_por_clube.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uE_o76xH0QU-" + }, + "source": [ + "## 12. What is the average wage by country?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "keQXqnU7hJy4" + }, + "source": [ + "df_media_salario_por_pais = df7.groupby(by = ['nationality']).agg({'wage': ['count', 'mean']}).reset_index()\n", + "df_media_salario_por_pais.columns = ['nationality', 'numero_jogadores', 'media_salario']\n", + "df_media_salario_por_pais.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n", + "df_media_salario_por_pais.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vqT1ozNA0Cfd" + }, + "source": [ + "## 13. What is the average wage by club?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "54_Q2IGchmN-" + }, + "source": [ + "df_media_salario_por_clube = df7.groupby(by = ['club']).agg({'wage': ['count', 'mean']}).reset_index()\n", + "df_media_salario_por_clube.columns = ['clube', 'numero_jogadores', 'media_salario']\n", + "df_media_salario_por_clube.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n", + "df_media_salario_por_clube.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4eflozOo0Cif" + }, + "source": [ + "## 14. What is the average wage by age?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Xtq9Am60hwGr" + }, + "source": [ + "df_media_salario_por_idade = df7.groupby(by = ['age']).agg({'wage': ['count', 'mean']}).reset_index()\n", + "df_media_salario_por_idade.columns = ['age', 'numero_jogadores', 'media_salario']\n", + "df_media_salario_por_idade.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n", + "df_media_salario_por_idade.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L0yRSSIb0WYj" + }, + "source": [ + "## 15. How much does each club spend on wages?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C9N7_pLfh_uq" + }, + "source": [ + "df_soma_salario_por_clube = df7.groupby(by = ['club']).agg({'wage': ['count', 'mean', 'sum']}).reset_index()\n", + "df_soma_salario_por_clube.columns = ['clube', 'numero_jogadores', 'media_salario', 'soma_salario']\n", + "df_soma_salario_por_clube.sort_values(by = ['soma_salario'], ascending = False, inplace = True)\n", + "df_soma_salario_por_clube.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1c7NGMg90YMi" + }, + "source": [ + "## 16. What insights (what can you discover) do you get about the variable 'Potential' (which measures the players' potential)?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RAU41Iyaihvc" + }, + "source": [ + "df7.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bM_ePTWfiTFq" + }, + "source": [ + "df_potential_por_clube = df7.groupby(by = ['potential', 'club', 'nationality']).agg({'potential': ['count']}).reset_index()\n", + "df_potential_por_clube.columns = ['potential', 'club', 'nationality', 'numero_jogadores']\n", + "df_potential_por_clube.sort_values(by = ['potential'], ascending = False, inplace = True)\n", + "df_potential_por_clube.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HytWPvfvjTON" + }, + "source": [ + "#### Who is the player with potential = 95?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Fk2X1q7LjWJE" + }, + "source": [ + "df7.loc[df7['potential'] == 95]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W2o4oLzujnHj" + }, + "source": [ + "#### Who are the players with potential = 94?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GOCyMr-qjsL7" + }, + "source": [ + "df7.loc[df7['potential'] == 94]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LHDJimdw0ClU" + }, + "source": [ + "## 17. What insights do you get about the variable overall (the athlete's average rating) by age, club and country?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FXFp5nxrj9Yc" + }, + "source": [ + "df_overall = df7.groupby(by = ['overall', 'club', 'nationality']).agg({'overall': ['count']}).reset_index()\n", + "df_overall.columns = ['overall', 'club', 'nationality', 'numero_jogadores']\n", + "df_overall.sort_values(by = ['overall'], ascending = False, inplace = True)\n", + "df_overall.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4LTooiIdk1XV" + }, + "source": [ + "#### Who is the player with overall = 94?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QieAKyi7k5Bb" + }, + "source": [ + "df7.loc[df7['overall'] == 94]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JFH54d1D0b5B" + }, + "source": [ + "## 18. Which are the best clubs if we take the variables Potential and Overall into account?\n", + "* To answer this question, I took the mean of overall and potential." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0JZ7PTFTle_d" + }, + "source": [ + "df18 = df7.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "s25u8RoplMZ8" + }, + "source": [ + "df18['overall_potential'] = ((df18['potential']+df18['overall'])/2)\n", + "df18[['name', 'overall', 'potential', 'overall_potential']].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8gJFzhhIlDCH" + }, + "source": [ + "df_overall_potential = df18.groupby(by = ['club', 'nationality', 'age']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n", + "df_overall_potential.columns = ['club', 'nationality', 'age', 'numero_jogadores', 'media_overall_potential']\n", + "df_overall_potential.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n", + "df_overall_potential.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "adpxQpWlmvac" + }, + "source": [ + "Overall, by club only:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fzn_81eomrj2" + }, + "source": [ + "df_overall_potential2 = df18.groupby(by = ['club']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n", + "df_overall_potential2.columns = ['club', 'numero_jogadores', 'media_overall_potential']\n", + "df_overall_potential2.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n", + "df_overall_potential2.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dM8FehYC0df7" + }, + "source": [ + "## 19. Present the ranking of goalkeepers (use the variable/attribute 'Preferred Positions') by Potential and Overall. We are looking for 'GK'." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_967BF6MnD4U" + }, + "source": [ + "df19 = df18.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wXPah5zOmkXc" + }, + "source": [ + "df_goleiros = df19[df19['position'] == 'GK']\n", + "df_goleiros.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "77ehyNmSnTIB" + }, + "source": [ + "df_overall_potential_goleiros = df_goleiros.groupby(by = ['club']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n", + "df_overall_potential_goleiros.columns = ['club', 'numero_jogadores', 'media_overall_potential']\n", + "df_overall_potential_goleiros.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n", + "df_overall_potential_goleiros.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-dEtuBtF0fiZ" + }, + "source": [ + "## 20. Who are the fastest players (variable/attribute 'Sprint speed')?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KWMU1hMMnxTI" + }, + "source": [ + "df20 = df19.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sezEQIjqnwCZ" + }, + "source": [ + "df20.sort_values(by = 'sprintspeed', ascending = False).head(5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aEg0eaFO0lF6" + }, + "source": [ + "## 21. Who are the 5 best players in terms of shooting (shot strength) (use the variable/attribute 'Shot power')?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xXuj-dc7oA-0" + }, + "source": [ + "df21 = df20.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8HGT_dM2oEES" + }, + "source": [ + "df21.sort_values(by = 'shotpower', ascending = False).head(5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bRk42JIf0moZ" + }, + "source": [ + "## 22. Who are the outliers in terms of wage?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qRNaog7y0qI4" + }, + "source": [ + "### Identifying and treating outliers\n", + "* What is the average Overall of Barcelona, Juventus and Real Madrid?\n", + "* And what is the average Overall after treating the outliers?\n", + "* Which athletes are influencing the average?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_bIxG1Sw9OUB" + }, + "source": [ + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)\n", + "\n", + "Source: [Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qEiikIxNoZkl" + }, + "source": [ + "df22 = df21.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lGYYvE0BoOoV" + }, + "source": [ + "q1_salario, q3_salario = df22['wage'].quantile([0.25,0.75]).to_list()\n", + "iqr_salario = q3_salario - q1_salario\n", + "print(q1_salario, q3_salario)\n", + "print(iqr_salario)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PB44VV9pogT1" + }, + "source": [ + "outlier_salario_inferior = q1_salario - 1.5 * iqr_salario\n", + "outlier_salario_superior = q3_salario + 1.5 * iqr_salario\n", + "\n", + "df_outliers_salario = df22[['name', 'club', 'nationality', 'wage', 'overall', 'potential']]\n", + "\n", + "# Lower wage outliers\n", + "df_outliers_salario[df_outliers_salario['wage'] < outlier_salario_inferior]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9867KNNBqG7Z" + }, + "source": [ + "# Top 10 upper wage outliers\n", + "df_outliers_salario[df_outliers_salario['wage'] > outlier_salario_superior].sort_values(by = ['wage'], ascending = False).head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gT2zGwq90oQ5" + }, + "source": [ + "## 23. Who are the outliers in terms of shot power?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "05uYj7cwqrdW" + }, + "source": [ + "df23 = df22.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GzbVRU9HqrdZ" + }, + "source": [ + "q1_chute, q3_chute = df23['shotpower'].quantile([0.25,0.75]).to_list()\n", + "iqr_chute = q3_chute - q1_chute\n", + "print(q1_chute, q3_chute)\n", + "print(iqr_chute)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "V5TQX_yGqrda" + }, + "source": [ + "outlier_chute_inferior = q1_chute - 1.5 * iqr_chute\n", + "outlier_chute_superior = q3_chute + 1.5 * iqr_chute\n", + "\n", + "df_outliers_chute = df23[['name', 'club', 'nationality', 'shotpower', 'overall', 'potential']]\n", + "\n", + "# Lower outliers - shotpower\n", + "df_outliers_chute[df_outliers_chute['shotpower'] < outlier_chute_inferior]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "URj1SYXxqrdc" + }, + "source": [ + "# Top 10 upper outliers - shotpower\n", + "df_outliers_chute[df_outliers_chute['shotpower'] > outlier_chute_superior].sort_values(by = ['shotpower'], ascending = False).head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eHm1qeHx0pza" + }, + "source": [ + "## 24. What is the correlation between 'value' and the other numeric variables in the dataframe, and how do you interpret it?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OR6MD2GQ0rNq" + }, + "source": [ + "## 25. Build dummy variables for the columns preferred_foot and work_rate (e.g., preferred_foot_left)." + ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_01__Pandas_hs.ipynb b/Notebooks/NB10_01__Pandas_hs.ipynb new file mode 100644 index 000000000..9b3a57d2f --- /dev/null +++ b/Notebooks/NB10_01__Pandas_hs.ipynb @@ -0,0 +1,5202 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Copy of NB10_01__Pandas.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8fpUiw8PwC7_" + }, + "source": [ + "

PANDAS FOR DATA ANALYSIS

\n", + "\n", + "\n", + "\n", + "# **AGENDA**:\n", + "\n", + "> See the **table of contents** for the items covered in this chapter.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vo7mtiNSr_Wk" + }, + "source": [ + "___\n", + "# **REFERENCES**\n", + "* [Learn Aggregation and Data Wrangling with Python](https://data-flair.training/blogs/data-wrangling-with-python/)\n", + "* [Python Data Cleansing by Pandas & Numpy | Python Data Operations](https://data-flair.training/blogs/python-data-cleansing/)\n", + "* [Pandas from basic to advanced for Data Scientists](https://towardsdatascience.com/pandas-from-basic-to-advanced-for-data-scientists-aee4eed19cfe)\n", + "* [Feature engineering and ensembled models for the top 10 in Kaggle “Housing Prices Competition”](https://towardsdatascience.com/feature-engineering-and-ensembled-models-for-the-top-10-in-kaggle-housing-prices-competition-efb35828eef0)\n", + "* [Pandas.Series Methods for Machine Learning](https://towardsdatascience.com/pandas-series-methods-for-machine-learning-fd83709368ff)\n", + "* [Gaining a solid understanding of Pandas series](https://towardsdatascience.com/gaining-a-solid-understanding-of-pandas-series-893fb8f785aa)\n", + "* [Variáveis Dummy: o que é? Quando usar? 
E como usar?](https://medium.com/data-hackers/vari%C3%A1veis-dummy-o-que-%C3%A9-quando-usar-e-como-usar-78de66cfcca9)\n", + "* [Exploratory Data Analysis Made Easy Using Pandas Profiling](https://towardsdatascience.com/exploratory-data-analysis-made-easy-using-pandas-profiling-86e347ef5b65)\n", + "* [Data Handling using Pandas; Machine Learning in Real Life](https://towardsdatascience.com/data-handling-using-pandas-machine-learning-in-real-life-be76a697418c)\n", + "* [Exploratory Data Analysis Tutorial in Python](https://towardsdatascience.com/exploratory-data-analysis-tutorial-in-python-15602b417445)\n", + "* [Exploring the data using python](https://towardsdatascience.com/exploring-the-data-using-python-47c4bc7b8fa2)\n", + "* [A better EDA with Pandas-profiling](https://towardsdatascience.com/a-better-eda-with-pandas-profiling-e842a00e1136)\n", + "* [Exploratory Data Analysis: Haberman’s Cancer Survival Dataset](https://towardsdatascience.com/exploratory-data-analysis-habermans-cancer-survival-dataset-c511255d62cb)\n", + "* [Exploring Exploratory Data Analysis](https://towardsdatascience.com/exploring-exploratory-data-analysis-1aa72908a5df)\n", + "* [Getting started with Data Analysis with Python Pandas](https://towardsdatascience.com/getting-started-to-data-analysis-with-python-pandas-with-titanic-dataset-a195ab043c77)\n", + "* [A Gentle Introduction to Exploratory Data Analysis](https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184)\n", + "* [Exploratory Data Analysis (EDA) techniques for Kaggle competition beginners](https://towardsdatascience.com/exploratory-data-analysis-eda-techniques-for-kaggle-competition-beginners-be4237c3c3a9)\n", + "* [What is Exploratory Data Analysis?](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)\n", + "* [Exploring real estate investment opportunity in Boston and 
Seattle](https://towardsdatascience.com/exploring-real-estate-investment-opportunity-in-boston-and-seattle-9d89d0c9bed2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BUEbp88oD1Km" + }, + "source": [ + "___\n", + "# **DATA ANALYSIS WITH PANDAS**\n", + "## Highlights\n", + "\n", + "* A fast and efficient library for data manipulation;\n", + "* Tools for reading and writing many data types and formats: CSV, txt, Microsoft Excel, SQL databases, JSON and HDF5;\n", + "* Pandas is one of the most popular libraries for data analysis. The main things we will do with Pandas are:\n", + "  * Read/write different data formats;\n", + "  * Select subsets of the data;\n", + "  * Perform assorted calculations by column or by row of the tables;\n", + "  * Find and handle missing values;\n", + "  * Combine multiple dataframes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wkxQFPPmeKLl" + }, + "source": [ + "![Pandas](https://github.com/MathMachado/Materials/blob/master/Pandas.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eKawOG-neqaD" + }, + "source": [ + "![Pandas](https://github.com/MathMachado/Materials/blob/master/Pandas2.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TLdSmsJZwlcQ" + }, + "source": [ + "___\n", + "# **UP TO WHAT DATA VOLUME CAN WE USE PANDAS?**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O7YKF5gB2x0K" + }, + "source": [ + "![RightToolForEachSize](https://github.com/MathMachado/Materials/blob/master/SizesAndTools.PNG?raw=true)\n", + "\n", + "## Sources\n", + "### Dask\n", + "* [Pandas, Dask or PySpark? 
What Should You Choose for Your Dataset?](https://medium.com/datadriveninvestor/pandas-dask-or-pyspark-what-should-you-choose-for-your-dataset-c0f67e1b1d36)\n", + "* [Processing Data with Dask](https://medium.com/when-i-work-data/processing-data-with-dask-47e4233cf165)\n", + "* [Pandas, Fast and Slow](https://medium.com/when-i-work-data/pandas-fast-and-slow-b6d8dde6862e)\n", + "* [Por que Parquet](https://medium.com/when-i-work-data/por-que-parquet-2a3ec42141c6)\n", + "* [How to Run Parallel Data Analysis in Python using Dask Dataframes](https://towardsdatascience.com/trying-out-dask-dataframes-in-python-for-fast-data-analysis-in-parallel-aa960c18a915)\n", + "* [Why every Data Scientist should use Dask?](https://towardsdatascience.com/why-every-data-scientist-should-use-dask-81b2b850e15b)\n", + "\n", + "### Spark, Koalas\n", + "* [Databricks Koalas-Python Pandas for Spark](https://medium.com/future-vision/databricks-koalas-python-pandas-for-spark-ce20fc8a7d08)\n", + "* [Bye Pandas, Meet Koalas: Pandas APIs on Apache Spark (Ep. 
4)](https://medium.com/@kyleake/bye-pandas-meet-koalas-pandas-apis-on-apache-spark-ep-4-aedcd363cf4e)\n", + "* [Koalas: Easy Transition from pandas to Apache Spark](https://databricks.com/blog/2019/04/24/koalas-easy-transition-from-pandas-to-apache-spark.html?source=post_page-----aedcd363cf4e----------------------)\n", + "* [Use PySpark for Your Next Big Problem](https://medium.com/swlh/use-pyspark-for-your-next-big-problem-8aa288d5ecfa)\n", + "* [A Neanderthal’s Guide to Apache Spark in Python](https://towardsdatascience.com/a-neanderthals-guide-to-apache-spark-in-python-9ef1f156d427)\n", + "* [The Jungle of Koalas, Pandas, Optimus and Spark](https://towardsdatascience.com/the-jungle-of-koalas-pandas-optimus-and-spark-dd486f873aa4)\n", + "* [From Pandas to PySpark with Koalas](https://towardsdatascience.com/from-pandas-to-pyspark-with-koalas-e40f293be7c8)\n", + "\n", + "# What is Dask?\n", + "\n", + "\"Dask is designed to extend the numpy and pandas packages to work on data processing problems that are too large to be kept in memory. 
It breaks the larger processing job into many smaller tasks that are handled by numpy or pandas and then it reassembles the results into a coherent whole.\" - Eric Ness ([Processing Data with Dask](https://medium.com/when-i-work-data/processing-data-with-dask-47e4233cf165))\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yEyzjGUfG33-" + }, + "source": [ + "___\n", + "# **Load the Pandas library and check the version**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oVMjT3DrG97K" + }, + "source": [ + "# Load Pandas along with NumPy and Matplotlib\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "print(f'Pandas version: {pd.__version__}')\n", + "print(f'NumPy version.: {np.__version__}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OxoDsaKUVHdH" + }, + "source": [ + "# Settings\n", + "> We can configure pandas to make our work more productive. For example, we can set the number of ROWS and COLUMNS to display and the precision of float numbers. We will look at this in more detail below.\n", + "\n", + "Source: [5 Advanced Features of Pandas and How to Use Them](https://www.kdnuggets.com/2019/10/5-advanced-features-pandas.html)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IOdqrf7uVlhC" + }, + "source": [ + "d_configuracao = {\n", + "    'display.max_columns': 1000,\n", + "    'display.expand_frame_repr': True,\n", + "    'display.max_rows': 10,\n", + "    'display.precision': 2,\n", + "    'display.show_dimensions': True\n", + "    }\n", + "\n", + "for op, value in d_configuracao.items():\n", + "    pd.set_option(op, value)\n", + "    print(op, value)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Paz-R-FOAJ7F" + }, + "source": [ + "___\n", + "# **Create a dataframe from other objects**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L4Jc0C2qPAQz" + }, + "source": [ + "## Create a dataframe from dictionaries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Sa5rKwq6Fscj" + }, + "source": [ + "### Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0ofIGkiSSuYq" + }, + "source": [ + "d_frutas = {'Apple': [5, 6, 6, 8, 10, 3, 2],\n", + "            'Avocado': [6, 6, 3, 9, 3, 2, 1]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iJCNvPlUTzTI" + }, + "source": [ + "d_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7Y_0O_tJTfm3" + }, + "source": [ + "# index=['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'] below defines the row labels (weekdays, Monday to Sunday).\n", + "df_frutas = pd.DataFrame(d_frutas, index = ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'])\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l2ll8ktfUKz2" + }, + "source": [ + "What was bought on Friday ('Sex')?\n", + "\n", + "* The indexer df.loc[label] returns the value(s) associated with a label. In our case, the labels (the values passed as index) are 'Seg', 'Ter', ..., 'Dom'." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9Voor8_PUJum" + }, + "source": [ + "df_frutas.loc['Sex'] # Here, label = 'Sex'." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LMh4DTfebwAr" + }, + "source": [ + "* In other words, the label 'Sex', which sits at position 4, has the values:\n", + "    * Apple..: 10\n", + "    * Avocado: 3\n", + "\n", + "In the same way, we could use the indexer df.iloc[position] to return the content of the row at that integer position." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GJxawdh6bvJN" + }, + "source": [ + "df_frutas.iloc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "obJt9OPGcL-x" + }, + "source": [ + "Therefore, df.loc['Sex'] = df.iloc[4]. Correct?\n", + "\n", + "To help memorize, consider that:\n", + "\n", + "* df.loc[label] --> loc starts with the letter **l**, which recalls the row label.\n", + "* df.iloc[position] --> iloc starts with the letter **i**, which recalls the integer position of the row." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v7QlCcEorEIX" + }, + "source": [ + "#### What is the output of the code below?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kRRdQShrrKHk" + }, + "source": [ + "df_frutas.loc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EkjAtbrRF01h" + }, + "source": [ + "### Example 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2EOX5MC4E1xL" + }, + "source": [ + "In practice we deal with large datasets and, in those cases, the ROWS have no labels defined. 
To illustrate, consider the same example we have just seen, with a small change:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RC_OXmdjrkQm" + }, + "source": [ + "d_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D6FckgDPFFs0" + }, + "source": [ + "df_frutas = pd.DataFrame(d_frutas) # Note that here we did not define the index\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tkGc4JQcFPkp" + }, + "source": [ + "Notice that the labels are now integers from 0 to N-1, where N is the number of rows." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ri-EdUYAovLG" + }, + "source": [ + "#### What is the content of the row whose label is 4?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5YgWG_vlFVe_" + }, + "source": [ + "df_frutas.loc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rFQxcAcVo2KD" + }, + "source": [ + "#### What is the content of the row whose position is 4?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xB1j4n6HFank" + }, + "source": [ + "df_frutas.iloc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jEbCke3TFf_q" + }, + "source": [ + "In other words, in these cases it makes no difference whether we use df.loc[] or df.iloc[]. Got it?" 
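A caveat worth sketching: the two indexers only agree while the labels are still 0..N-1 in their original order. After sorting or filtering, `loc` keeps looking up labels while `iloc` looks up positions, and label slices include the end while position slices do not. The frame `df_demo` below is a hypothetical toy, not the notebook's `df_frutas`.

```python
import pandas as pd

# Toy frame with the default index: labels 0..4.
df_demo = pd.DataFrame({'Apple': [5, 6, 6, 8, 10]})

# With the default index, label 4 and position 4 are the same row.
same_row = df_demo.loc[4, 'Apple'] == df_demo.iloc[4, 0]

# After sorting by value (descending), each label travels with its row.
df_sorted = df_demo.sort_values('Apple', ascending=False)
by_label = df_sorted.loc[4, 'Apple']     # the row whose label is 4
by_position = df_sorted.iloc[4, 0]       # the fifth row after sorting

# Slicing: loc includes the end label, iloc excludes the end position.
n_loc = len(df_demo.loc[0:2])    # labels 0, 1 and 2
n_iloc = len(df_demo.iloc[0:2])  # positions 0 and 1
```

Here `by_label` is the largest value (it kept label 4) and `by_position` is the smallest one, which now occupies the last position.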
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bKHw_VBKjkoL" + }, + "source": [ + "### Example 3 - Define the dataframe index using df.set_index()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "13ArWIhYju6s" + }, + "source": [ + "d_frutas= {'Dia_Semana': ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'],\n", + "           'Apple': [5, 6, 6, 8, 10, 3, 2],\n", + "           'Avocado': [6, 6, 3, 9, 3, 2, 1]}\n", + "\n", + "d_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Evw9w16gk5h0" + }, + "source": [ + "# Create the dataframe df_frutas:\n", + "df_frutas = pd.DataFrame(d_frutas) # No index was given, so it is created automatically as 0..N-1.\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NLbbRrdYoclw" + }, + "source": [ + "#### What is the content of row 4?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lB-ngbutl_0c" + }, + "source": [ + "df_frutas.iloc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1aJLGapZlUFI" + }, + "source": [ + "# Set 'Dia_Semana' as the index (row labels) of the dataframe df_frutas\n", + "df_frutas.set_index('Dia_Semana', inplace = True)\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L1-U_sD-jAoO" + }, + "source": [ + "The expression above is equivalent to:\n", + "\n", + "```\n", + "df_frutas2 = df_frutas.set_index('Dia_Semana') # Note that there is no 'inplace' here\n", + "df_frutas2\n", + "```\n", + "\n", + "* So, what is the role of 'inplace = True' in the first option?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oXeFjJonpQfB" + }, + "source": [ + "#### What is the content of row 4?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MMXg3vVQpUhh" + }, + "source": [ + "df_frutas.iloc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fhoYuGMlpVFj" + }, + "source": [ + "#### What is the content of the row whose label is 'Sex'?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fmcWbrEspdYW" + }, + "source": [ + "df_frutas.loc['Sex']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bobggpoCTRkj" + }, + "source": [ + "### What is the difference between the next two lines?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SjiYgbNrsvpl" + }, + "source": [ + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OFhzE7hgTD0a" + }, + "source": [ + "df_frutas.mean()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "V42I3807TNte" + }, + "source": [ + "df_frutas.mean(axis = 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6iUCthsbtLV8" + }, + "source": [ + "df_frutas.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YdkmYePYtcON" + }, + "source": [ + "df_frutas.dtypes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2RmgCIC2HZFp" + }, + "source": [ + "### Example 4" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kbHHuMzzAR1A" + }, + "source": [ + "# Keys are lowercase so that they match the lookups used later ('nome', 'age', 'city', 'country').\n", + "d_estudantes = {'nome': ['Jack', 'Richard', 'Tommy', 'Ana'], \n", + "                'age': [25, 34, 18, 21],\n", + "                'city': ['Sydney', 'Rio de Janeiro', 'Lisbon', 'New York'],\n", + "                'country': ['Australia', 'Brazil', 'Portugal', 'United States']}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ayKqLmHTANOu" + }, + 
"source": [ + "# Show the content of the dictionary d_estudantes...\n", + "d_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0ONA8QsBBP6R" + }, + "source": [ + "# Keys of the dictionary d_estudantes\n", + "d_estudantes.keys()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "k8mmvKQ_BjO6" + }, + "source": [ + "# Items (key, value pairs) of the dictionary d_estudantes\n", + "d_estudantes.items()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcm8V_UmBr1Y" + }, + "source": [ + "# Values of the dictionary d_estudantes\n", + "d_estudantes.values()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KK7IejsPDkWC" + }, + "source": [ + "We have a key = 'nome'. What is the content of this key?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eHvPpeiTBwoR" + }, + "source": [ + "d_estudantes['nome']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S1y7p8CcDsXl" + }, + "source": [ + "What is the output of the following expression?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "26WIDl-HB3Bq" + }, + "source": [ + "d_estudantes['nome'][0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gV68kQ5HCIif" + }, + "source": [ + "Creating the dataframe df_estudantes from the dictionary d_estudantes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2oa808hkCSaq" + }, + "source": [ + "df_estudantes = pd.DataFrame(d_estudantes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7HLp0FYpCiSc" + }, + "source": [ + "# Show the content of the dataframe df_estudantes...\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "en06lfazciE0" + }, + "source": [ + "**Note**: In this case we did not define labels for the ROWS. In practice this is the most common situation: the labels coincide with the positions, which here are integers from 0 to N-1." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gFaPp-S-cy1-" + }, + "source": [ + "Once again, let's use df.loc[] and df.iloc[]..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mT9vwRBidGXX" + }, + "source": [ + "# Show the content of row 3 using df.loc[]\n", + "df_estudantes.loc[3]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zj88AwHUdix0" + }, + "source": [ + "OR" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SP2mG8todkMe" + }, + "source": [ + "# Show the content of row 3 using df.iloc[]\n", + "df_estudantes.iloc[3]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hzbLO0EDGWTf" + }, + "source": [ + "OK, we have already discussed this. 
When the ROWS have no explicit labels (default index), iloc[] and loc[] return the same row." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VvzVg7SpeOOB" + }, + "source": [ + "___\n", + "## Create dataframes from lists\n", + "* Consider the following list of fruits:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0_PY9OROeUiT" + }, + "source": [ + "l_frutas = [('Melon', 6, 8, 5, 4, 6, 2, 8), ('Avocado', 6, 6, 3, 8, 9, 3, 1), ('Blueberry', 7, 5, 9, 3, 1, 0, 4)]\n", + "l_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AfE_rHq5g4_P" + }, + "source": [ + "type(l_frutas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZpdPSi7RgVjK" + }, + "source": [ + "l_frutas[0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NMyIpVW8gZTH" + }, + "source": [ + "l_frutas[0][0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-cyZVqQFhjjg" + }, + "source": [ + "# List with the names of the dataframe COLUMNS:\n", + "l_colunas = ['Frutas', 'Dom', 'Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab']\n", + "l_colunas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wplKvgayfZm_" + }, + "source": [ + "# Convert the list of tuples into a dataframe\n", + "df_frutas = pd.DataFrame(l_frutas, columns = l_colunas) # Note that the COLUMN names are passed as a list.\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GojgsAXTFZmB" + }, + "source": [ + "___\n", + "# **Copy dataframes**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_Tda4ZwjWIW" + }, + "source": [ + "The dataframe df_estudantes has the following content:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P5y0aVkdkA8o" + }, + "source": [ + 
"df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cp3bvPEqj5fS" + }, + "source": [ + "If we do..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "J2PT5L11j8O0" + }, + "source": [ + "df_estudantes2 = df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2D29pGuikBBK" + }, + "source": [ + "then df_estudantes2 has the same content as df_estudantes, right?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_IseZEpLkGS4" + }, + "source": [ + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "29MpozLrkI83" + }, + "source": [ + "Now change the value 'Rio de Janeiro' to 'Sao Paulo' in the dataframe df_estudantes2." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TXCqFiGFkmyv" + }, + "source": [ + "df_estudantes2['city'] = df_estudantes2['city'].replace({'Rio de Janeiro': 'Sao Paulo'})\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_0mgT7-8Fsl" + }, + "source": [ + "# OR\n", + "alteracoes = {'Rio de Janeiro': 'Sao Paulo'}\n", + "df_estudantes2['city'] = df_estudantes2['city'].replace(alteracoes)\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BN8ZGu2Xk6vt" + }, + "source": [ + "OK, we replaced 'Rio de Janeiro' with 'Sao Paulo', as intended. Let's look at the content of df_estudantes (**which should be intact, since we made the change in df_estudantes2**)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "thNAWoDflRoQ" + }, + "source": [ + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VkIS8wVmlAyq" + }, + "source": [ + "Ooooops... df_estudantes was changed? 
How, if we made the change in df_estudantes2 and NOT in df_estudantes???\n", + "\n", + "* **Are the operations we perform on df_estudantes2 also applied to df_estudantes**?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e9u-Z9NMltC9" + }, + "source": [ + "**Answer**: YES, because df_estudantes2 is a pointer to df_estudantes, that is, a second name for the same object. In other words, **any change we make through df_estudantes2 is made to df_estudantes**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IDwvsxhhmlE4" + }, + "source": [ + "An easy way to see this is to compare the memory addresses of the two (**supposedly different**) dataframes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ePFwKua8mu7k" + }, + "source": [ + "id(df_estudantes2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bMvY_E0mmwQH" + }, + "source": [ + "id(df_estudantes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K5qC5BuzmyF0" + }, + "source": [ + "**Conclusion**: df_estudantes2 is a pointer to df_estudantes." 
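The same conclusion can be checked without reading memory addresses, using Python's `is` operator. The sketch below uses hypothetical toy names (`df_a`, `df_alias`, `df_copy`) rather than the notebook's dataframes, and contrasts plain assignment with `DataFrame.copy()`.

```python
import pandas as pd

df_a = pd.DataFrame({'city': ['Rio de Janeiro', 'Lisbon']})

df_alias = df_a        # no copy: a second name for the same object
df_copy = df_a.copy()  # an independent copy of the data

is_same_object = df_alias is df_a  # both names refer to one object
is_copy_same = df_copy is df_a     # the copy is a different object

# A change made through the alias is visible through df_a ...
df_alias.loc[0, 'city'] = 'Sao Paulo'
value_in_a = df_a.loc[0, 'city']
# ... while the copy keeps the original value.
value_in_copy = df_copy.loc[0, 'city']
```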
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZZ50ejRImAQ8" + }, + "source": [ + "## The correct way to copy a dataframe" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oTbzxNkDmQiJ" + }, + "source": [ + "First, let's rebuild df_estudantes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DmVq0vM0mTtQ" + }, + "source": [ + "df_estudantes = pd.DataFrame(d_estudantes)\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oZrlwtqJmYB_" + }, + "source": [ + "Copying the dataframe (**the correct way**):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "No5A7nHDFbsy" + }, + "source": [ + "df_estudantes_Copy = df_estudantes.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NvKNFr8RnEft" + }, + "source": [ + "Let's check the memory addresses of the two dataframes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0_OO90SFki4f" + }, + "source": [ + "id(df_estudantes_Copy)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "T0BibX8rkes5" + }, + "source": [ + "id(df_estudantes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fbm-8cCUFgJa" + }, + "source": [ + "Now the dataframe df_estudantes_Copy is an independent copy of the dataframe df_estudantes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SuL8WUxL-u6-" + }, + "source": [ + "___\n", + "# **Rename dataframe COLUMNS**\n", + "> **Snippet**: \n", + "\n", + "    * df.rename(columns = {'Old_Name': 'New_Name'}, inplace = True)\n", + "    * OR df = df.rename(columns = {'Old_Name': 'New_Name'})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IvpCfmQnIZKl" + }, + "source": [ + "Suppose I want to rename the COLUMN 'nome' to 'nome_cliente', which is 
a more descriptive name." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o54Fa-yxnmuz" + }, + "source": [ + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FwzXjYJgCvGk" + }, + "source": [ + "df_estudantes= df_estudantes.rename(columns = {'nome': 'nome_cliente'})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gOolGiWt4A18" + }, + "source": [ + "The command below produces the same result as the previous line:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y6jjAFRd341e" + }, + "source": [ + "```\n", + "df_estudantes.rename(columns= {'nome': 'nome_cliente'}, inplace = True)\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DwVMldKiF5gS" + }, + "source": [ + "# Show the content of df_estudantes after renaming the column/variable 'nome' to 'nome_cliente'...\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m-WZBLWqELOv" + }, + "source": [ + "Now suppose we want to rename 'age' to 'idade_cliente', 'city' to 'cidade_cliente' and 'country' to 'pais_cliente'..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS6ua4u1EX5g" + }, + "source": [ + "df_estudantes.rename(columns = {'age': 'idade_cliente', 'city': 'cidade_cliente', 'country': 'pais_cliente'}, inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i_7LW07y4SvO" + }, + "source": [ + "The command below produces the same result as the previous line (note that there is no 'inplace' when the result is assigned back):" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9X-cv9RL4WjV" + }, + "source": [ + "```\n", + "df_estudantes = df_estudantes.rename(columns= {'age': 'idade_cliente', 'city': 'cidade_cliente', 'country': 'pais_cliente'})\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EOb1-TEKGM9I" + }, + "source": [ + "# Show the content of df_estudantes after renaming several columns at once...\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q0IZZjLRJlU6" + }, + "source": [ + "Any questions so far?\n", + "Is everything clear up to this point?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5LwL2m5KbLYz" + }, + "source": [ + "## Challenge\n", + "* Convert the names of all COLUMNS of the dataframe df_estudantes to lowercase (lower()). How can we do that?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MURfzmeLbUzF" + }, + "source": [ + "### My solution:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r-FgBY-3xBi9" + }, + "source": [ + "df_estudantes2 = df_estudantes.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "hlSlfcoub8gH" + }, + "source": [ + "# Store the COLUMN names (a pandas Index, which can be iterated like a list):\n", + "l_colunas = df_estudantes2.columns\n", + "l_colunas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_IGvEK4bdQP" + }, + "source": [ + "# Lowercase all COLUMN names\n", + "df_estudantes2.columns = [col.lower() for col in l_colunas]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0qzzAa3ycKmF" + }, + "source": [ + "# Show the content of the dataframe df_estudantes2\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c-u-ndMPV_KX" + }, + "source": [ + "___\n", + "# **Add/append new ROWS to the dataframe**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MDkWbukBLhw7" + }, + "source": [ + "## Using dictionaries\n", + "* We need to provide {'Column_Name': value} for each insertion. 
For example, I will add the following record to the dataframe:\n", + "    * nome_cliente= 'Anderson';\n", + "    * idade_cliente= 22;\n", + "    * cidade_cliente= 'Porto';\n", + "    * pais_cliente= 'Portugal'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GECPO7iyK9UU" + }, + "source": [ + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "XQKqqC93LoQ_" + }, + "source": [ + "df_estudantes_Copia= df_estudantes.copy()\n", + "# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.\n", + "pd.concat([df_estudantes,\n", + "           pd.DataFrame([{'nome_cliente': 'Anderson', \n", + "                          'idade_cliente': 22,\n", + "                          'cidade_cliente': 'Porto',\n", + "                          'pais_cliente': 'Portugal'}])], ignore_index = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bdBttsHNLjd-" + }, + "source": [ + "Is this the result we want?\n", + "Can you explain what went wrong?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6jDoq6CCMerp" + }, + "source": [ + "**HINT**: The dictionary keys must match the COLUMN names exactly (remember the lower() step above); a key that does not match creates a new column filled with NaN. Also check whether df_estudantes itself changed: the result above was displayed but never assigned." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ffReAaUHLvEF" + }, + "source": [ + "# Define df_estudantes again from the copy df_estudantes_Copia\n", + "df_estudantes = df_estudantes_Copia.copy()\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EzTo-IvmM2Fg" + }, + "source": [ + "OK, we have restored df_estudantes from its copy. 
Now let's do it the correct way:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IRhE76i4M6d6" + }, + "source": [ + "# Assign the result back: pd.concat returns a new dataframe and does not modify df_estudantes.\n", + "df_estudantes = pd.concat([df_estudantes,\n", + "                           pd.DataFrame([{'nome_cliente': 'Anderson', \n", + "                                          'idade_cliente': 22,\n", + "                                          'cidade_cliente': 'Porto',\n", + "                                          'pais_cliente': 'Portugal'}])], ignore_index= True)\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jAojB2MMNDRJ" + }, + "source": [ + "Good, this is the result we were expecting..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5czZb-5wNp_F" + }, + "source": [ + "## Using Series\n", + "* As an example, suppose we want to add the following data:\n", + "    * nome_cliente= 'Bill';\n", + "    * idade_cliente= 30;\n", + "    * cidade_cliente= 'Sao Paulo';\n", + "    * pais_cliente= 'Brazil'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "J3qCydqMNtGt" + }, + "source": [ + "novo_estudante = pd.Series(['Bill', 30, 'Sao Paulo', 'Brazil'], index= df_estudantes.columns) # Note the interesting part: we are using index= df_estudantes.columns." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_DyMDrNPrmC" + }, + "source": [ + "Let's look at the content of novo_estudante:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jDQUl0RBPoLB" + }, + "source": [ + "novo_estudante" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zMKRNQrsPvxp" + }, + "source": [ + "Finally, add/append novo_estudante to the dataframe df_estudantes..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5mEQg26iPw4A" + }, + "source": [ + "# DataFrame.append was removed in pandas 2.0; wrap the Series in a one-row dataframe and use pd.concat.\n", + "df_estudantes = pd.concat([df_estudantes, pd.DataFrame([novo_estudante])], ignore_index= True)\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Biwk2McAWW1Z" + }, + "source": [ + "___\n", + "# **Add/append new COLUMNS to the dataframe**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EZFTH7A-Wpw5" + }, + "source": [ + "## Using lists\n", + "* Suppose we want to add the column/variable 'score'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YzBKQo5epXP5" + }, + "source": [ + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pPoObAKJW6YF" + }, + "source": [ + "# Add (create) the column/variable 'score' from a list with one value per ROW (6 rows here)\n", + "df_estudantes['score'] = [500, 300, 200, 800, 700, 100]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ocbh8sZqWsoW" + }, + "source": [ + "# Show the content of the dataframe df_estudantes...\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZxfCMcVxYQgL" + }, + "source": [ + "> **Note**:\n", + "\n", + "* If the number of values in the list differs from the number of ROWS in the dataframe, pandas raises an error (ValueError).\n", + "* If the column/variable we want to insert already exists in the dataframe, its values are overwritten with the new ones." 
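The two notes above can be checked with a small sketch. The frame `df_demo` and its values are hypothetical, chosen only to keep the example short.

```python
import pandas as pd

df_demo = pd.DataFrame({'nome': ['Jack', 'Ana', 'Bill']})

# A list with the wrong number of values is rejected with a ValueError.
try:
    df_demo['score'] = [500, 300]  # 2 values for 3 rows
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

# A list with one value per row creates the column ...
df_demo['score'] = [500, 300, 200]
# ... and assigning to the same name again overwrites its values.
df_demo['score'] = [1, 2, 3]
```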
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "34ntllD_YbNa" + }, + "source": [ + "## Using a default value\n", + "* Add the column 'total' with the same value for all ROWS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T7QSMJMQYous" + }, + "source": [ + "df_estudantes['total'] = 500\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gll-gJt7as3C" + }, + "source": [ + "## Add a COLUMN computed from other COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T_pB_isBaw-E" + }, + "source": [ + "# Share of each score in the total of all scores, as a percentage\n", + "df_estudantes['percentagem'] = 100*(df_estudantes['score']/df_estudantes['score'].sum())\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D9TNylt84hle" + }, + "source": [ + "___\n", + "# **Read/load data in Python**\n", + "* Several file formats can be read:\n", + "\n", + "| Format Type | Data Description | Reader | Writer |\n", + "|---|---|---|---|\n", + "| text | CSV | read_csv | to_csv |\n", + "| text | JSON | read_json | to_json |\n", + "| text | HTML | read_html | to_html |\n", + "| text | Local clipboard | read_clipboard | to_clipboard |\n", + "| binary | MS Excel | read_excel | to_excel |\n", + "| binary | HDF5 Format | read_hdf | to_hdf |\n", + "| binary | Stata | read_stata | to_stata |\n", + "| binary | SAS | read_sas | |\n", + "| binary | Python Pickle Format | read_pickle | to_pickle |\n", + "| SQL | SQL | read_sql | to_sql |\n", + "| SQL | Google Big Query | read_gbq | to_gbq |\n", + "\n", + "* Source: [IO tools (text, CSV, HDF5, …)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ss8jLEUSblDm" + }, + "source": [ + "___\n", + "# **Read/load csv**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n8e9aphab_oe" + }, + "source": [ + "# Load the 
Pandas library\n", + "import pandas as pd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R2fRd_MSQ2Xa" + }, + "source": [ + "Next, we will:\n", + "* Read the Titanic CSV file (Titanic_With_MV.csv) into a dataframe;\n", + "* Set 'PassengerId' as the index/key of the table through the argument index_col= 'PassengerId'." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1R9YoFJ02TR7" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_With_MV.csv?token=AGDJQ67OZ36XJUJPE77Z7LC7RBCAU'\n", + "df_Titanic = pd.read_csv(url, index_col = 'PassengerId')\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS7_V15u0MgR" + }, + "source": [ + "df_Titanic.iloc[4] # NOT THE SAME AS df_Titanic.loc[4]: the index now holds PassengerId values, not positions!" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WJ9RlRDSkk0_" + }, + "source": [ + "* Here is the data dictionary of the dataframe df_Titanic:\n", + "    * PassengerId: Passenger ID;\n", + "    * survived: Indicator, where 1= Passenger survived and 0= Passenger died;\n", + "    * Pclass: Ticket class;\n", + "    * Age: Passenger's age;\n", + "    * SibSp: Number of siblings/spouses aboard;\n", + "    * Parch: Number of parents/children aboard;\n", + "    * Fare: Amount paid by the Passenger;\n", + "    * Cabin: Passenger's cabin;\n", + "    * Embarked: Port where the Passenger embarked;\n", + "    * Name: Passenger's name;\n", + "    * sex: Passenger's sex." 
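To make the `index_col` step and the `iloc`/`loc` warning above concrete, here is a minimal sketch. It reads a tiny made-up CSV from memory instead of the real file; the three passengers and their values are invented for illustration only.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real Titanic file.
csv_text = 'PassengerId,Name,Age\n1,Ann,30\n2,Bob,41\n3,Cid,25\n'

# index_col turns the PassengerId column into the row labels.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

first_label = df_demo.index[0]            # labels start at 1, not 0
name_by_label = df_demo.loc[2, 'Name']    # row whose label (PassengerId) is 2
name_by_position = df_demo.iloc[2]['Name']  # third row, whatever its label
```

Because the labels start at 1, label 2 is the second row while position 2 is the third row, which is why the two indexers return different passengers.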
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wz7Qd9mqMrfY" + }, + "source": [ + "# Show the dataframe df_Titanic:\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nDlANdnm4iod" + }, + "source": [ + "### TIP 1\n", + "Suppose the file we want to read is located in:\n", + "\n", + "```\n", + "/home/nsolucoes4ds/Dropbox/Data_Science/Python/Python_RFB/Python_RFB-DS_Python_020919_2244/Dataframes\n", + "```\n", + "\n", + "In that case, to read the (local) file, just use the following command:\n", + "\n", + "```\n", + "url = '/home/nsolucoes4ds/Dropbox/Data_Science/Python/Python_RFB/Python_RFB-DS_Python_020919_2244/Dataframes/creditcard.csv'\n", + "df_creditcard = pd.read_csv(url)\n", + "```\n", + "\n", + "### TIP 2\n", + "On Windows, the directory looks, for example, like this: \n", + "```\n", + "C:\\nsolucoes4ds\\Data_Science\n", + "```\n", + "Note the '\\\\' (**backslashes**). In this case, use the following command:\n", + "\n", + "```\n", + "url= r'C:\\nsolucoes4ds\\Data_Science\\creditcard.csv'\n", + "df_creditcard = pd.read_csv(url)\n", + "```\n", + "\n", + "Did you notice the r'...' prefix? It creates a raw string, so the backslashes are not treated as escape characters." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HubfewY8NgUv" + }, + "source": [ + "___\n", + "# **Fix (or standardize) COLUMN names**\n", + "* For example, rewrite the COLUMN names in lowercase using lower()." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4f_pEEOjvwjk" + }, + "source": [ + "To make our analyses easier, we will also apply the lower() method to all values of the object/string COLUMNS. 
To do that, consider the function below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ft13IahH1kVX" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "G-UlaHFPv7kp" + }, + "source": [ + "def transformacao_lower(df):\n", + "    # First transformation: apply lower() to the COLUMN names:\n", + "    df.columns = [col.lower() for col in df.columns]\n", + "\n", + "    # Second transformation: apply .str.lower() to the values of the object/string COLUMNS:\n", + "    l_cols_objeto = df.select_dtypes(include = ['object']).columns\n", + "    \n", + "    for col in l_cols_objeto:\n", + "        df[col] = df[col].str.lower()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hNixsW8M7n1X" + }, + "source": [ + "To learn more about the method df[col].str.lower(), see [pandas.Series.str.lower](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.lower.html)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hz90zejtbxYj" + }, + "source": [ + "# The function modifies df_Titanic in place and returns nothing.\n", + "transformacao_lower(df_Titanic)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UE5P1W-CPePM" + }, + "source": [ + "# **Select a subset of columns**\n", + "Suppose I want to select only the name and sex columns (remember that the column names are now lowercase)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P7HJa4x7P0bQ" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3jLZUCfePsBs" + }, + "source": [ + "df_Titanic2 = df_Titanic[['name', 'sex']]\n", + "df_Titanic2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PyNsYTilnL2r" + }, + "source": [ + "# map()\n", + "> A device for transforming data using a dictionary: {'key': value}." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6z4FcyyAiTfF" + }, + "source": [ + "# Building a more intuitive variable to help with our analyses:\n", + "df_Titanic['survived2'] = df_Titanic['survived']\n", + "df_Titanic['survived2'] = df_Titanic['survived2'].map({0:'died', 1:'survived'})\n", + "df_Titanic[['survived', 'survived2']].head(3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jwBWkaJOdhCv" + }, + "source": [ + "___\n", + "# **Select COLUMNS of the dataframe**\n", + "* Suppose we want to select only the COLUMNS 'survived', 'sex' and 'embarked':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ivvj8JU2pBTq" + }, + "source": [ + "df_Titanic2 = df_Titanic[['survived', 'sex', 'embarked']]\n", + "df_Titanic2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nf-Wnof_fdTR" + }, + "source": [ + "___\n", + "# **Create a dictionary from a dataframe**\n", + "> Consider the example dataframe below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lxf6Lgp4fit8" + }, + "source": [ + "df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l7yzJu1y5huV" + }, + "source": [ + 
"De dataframe para Dicionário..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_6V0qFZGhEoF" + }, + "source": [ + "df.to_dict('dict')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0GIe6xtqPA1Z" + }, + "source": [ + "___\n", + "# **Criar uma lista a partir de um dataframe**\n", + "> Suponha o dataframe-exemplo a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fZxgejTtPLzX" + }, + "source": [ + "df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JoShm6oF5qLV" + }, + "source": [ + "De dataframe para Lista..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gigPpSH_hlXu" + }, + "source": [ + "df.to_dict('list')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GpJDX-5xUUC0" + }, + "source": [ + "___\n", + "# **Mostrar as primeiras k LINHAS do dataframe**\n", + "> df.head(k), onde k é o número de LINHAS que queremos visualizar. Por default, k= 10." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RwC9j_OxUbIR" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G9cp2QrsA5M0" + }, + "source": [ + "___\n", + "# **Mostrar as últimas k LINHAS do dataframe**\n", + "> df.tail(k), onde k é o número de LINHAS que queremos ver. Por default, k= 10." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9mPxyhqoA4Wc" + }, + "source": [ + "df_Titanic.tail()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Odwm2qSLA_Ro" + }, + "source": [ + "Por default, df.tail() mostra as últimas 5 LINHAS/instâncias do dataframe. 
However, any number of ROWS k can be shown, for example k = 10 as below." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pUAnR00WA8ma" + }, + "source": [ + "df_Titanic.tail(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cZ64LfWv4zxo" + }, + "source": [ + "___\n", + "# **Show the COLUMN names of the dataframe**\n", + "* df.columns" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CKUUrX5n4zFW" + }, + "source": [ + "df_Titanic.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6m7ukrOu5Inv" + }, + "source": [ + "___\n", + "# **Show the COLUMN types of the dataframe**\n", + "* Property: df.dtypes --> No parentheses!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "S4NIHAPPl9lc" + }, + "source": [ + "df_Titanic.dtypes # dtypes is a property, so it does not take \"()\". Methods, on the other hand, require \"(arg1, arg2, ..., argN)\"" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DGc6m-UBdHlE" + }, + "source": [ + "___\n", + "# **Select the COLUMNS of the dataframe automatically by type**\n", + "> snippet: df.select_dtypes(include=[type]).columns\n", + "\n", + "| Type | What it selects | Syntax |\n", + "|------|-----------------|---------|\n", + "| number | numeric columns | df.select_dtypes(include=['number']).columns |\n", + "| float | float columns | df.select_dtypes(include=['float']).columns |\n", + "| bool | boolean columns | df.select_dtypes(include=['bool']).columns |\n", + "| object | categorical/string columns | df.select_dtypes(include=['object']).columns |\n", + "\n", + "* To select more than one type, just pass a list of types. 
\n", + "  * Example: df.select_dtypes(include=['object', 'float']).columns\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O88YRCqIdYFL" + }, + "source": [ + "## Select the numeric COLUMNS of the dataframe automatically" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xG4a9ZfRnxPW" + }, + "source": [ + "### List with the numeric COLUMNS of the dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C87uga35dKsF" + }, + "source": [ + "l_cols_numericas = df_Titanic.select_dtypes(include = ['number']).columns # \".columns\" returns the names of the numeric columns (a pandas Index)\n", + "l_cols_numericas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5W6kbIVNn2UA" + }, + "source": [ + "### DataFrame with the numeric COLUMNS:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iTieUd_-eDmW" + }, + "source": [ + "df_numericas = df_Titanic.select_dtypes(include = ['number']) # Note: no .columns here --> in this case the return value is the dataframe.\n", + "df_numericas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xh4BFs_lds80" + }, + "source": [ + "## Select the float COLUMNS of the dataframe automatically" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tw3FD74MoC6q" + }, + "source": [ + "### List with the float COLUMNS:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5clAUAIrd3UR" + }, + "source": [ + "l_cols_float = df_Titanic.select_dtypes(include = ['float']).columns\n", + "l_cols_float" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IZPROG6IoHwy" + }, + "source": [ + "### DataFrame with the float COLUMNS:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "osJDsyMHeXX4" + }, + "source": [ + "df_float = df_Titanic.select_dtypes(include = ['float']) # Note: no .columns here\n", + 
"df_float.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5uObezIIfuJ4" + }, + "source": [ + "## Selecionar automaticamente as COLUNAS Booleanas do dataframe" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xMKP5HhgoeMg" + }, + "source": [ + "### Lista com as COLUNAS Booleanas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3Pn2IPBkf7k-" + }, + "source": [ + "l_cols_booleanas = df_Titanic.select_dtypes(include = ['bool']).columns\n", + "l_cols_booleanas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k3sdiuXYokBE" + }, + "source": [ + "### DataFrame com as COLUNAS Booleanas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Oem-M-17f7lG" + }, + "source": [ + "df_booleanas = df_Titanic.select_dtypes(include=['bool']) # Atenção: aqui não temos .columns\n", + "df_booleanas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ObHYW92-gOXz" + }, + "source": [ + "## Selecionar automaticamente as COLUNAS do tipo string (object)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IzM5CIKXoxHO" + }, + "source": [ + "### Lista com as COLUNAS do tipo object/string:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rdYThBingOX1" + }, + "source": [ + "l_cols_objeto = df_Titanic.select_dtypes(include=['object']).columns\n", + "l_cols_objeto" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2ZGB5d36o21t" + }, + "source": [ + "### DataFrame com as COLUNAS do tipo Object/String:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kWTtxeU4gOX4" + }, + "source": [ + "df_cols_obs = df_Titanic.select_dtypes(include=['object']) # Atenção: aqui não temos .columns\n", + "df_cols_obs.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "SEBKHKRLkbUK" + }, + "source": [ + "___\n", + "# **Reordenar as COLUNAS do dataframe**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XRWfelWEkhae" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KBGDeR_JkyCc" + }, + "source": [ + "* Suponha que queremos reordenar as COLUNAS do dataframe df_Titanic em ordem alfabética, conforme abaixo:\n", + " * age;\n", + " * embarked;\n", + " * fare;\n", + " * parch;\n", + " * pclass;\n", + " * sex;\n", + " * sibsp;\n", + " * survived." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "d9jJi6qllnq_" + }, + "source": [ + "df_Titanic = df_Titanic.reindex(sorted(df_Titanic.columns), axis = 1)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cj4MREti-izC" + }, + "source": [ + "___\n", + "# **Mostrar a dimensão do dataframe**\n", + "* Dimensão = Número de LINHAS e COLUNAS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "50Tij93l-n7B" + }, + "source": [ + "df_Titanic.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZQo4YeH_-qfL" + }, + "source": [ + "Qual a interpretação?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "klHcwpPEALP8" + }, + "source": [ + "## **Split the dimension into two parts: number of ROWS and COLUMNS**\n", + "* Number of rows in the dataframe: df_Titanic.shape[0]\n", + "* Number of columns in the dataframe: df_Titanic.shape[1]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qjR8OEdDAOog" + }, + "source": [ + "f'The dataframe df_Titanic has {df_Titanic.shape[0]} rows and {df_Titanic.shape[1]} columns.'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pIsf_nDtyAvF" + }, + "source": [ + "___\n", + "# **Combine dataframes: Merge, Join & Concatenate**\n", + "* Source: [Merge, join, and concatenate](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s1fSplrlEMHK" + }, + "source": [ + "* Below, three ways to combine dataframes:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6DYtWxuIrdzF" + }, + "source": [ + "## Concatenate\n", + "* Joins/stacks dataframes\n", + "* Source: https://github.com/aakankshaws/Pandas-exercises" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nnP5VuWkri_b" + }, + "source": [ + "import pandas as pd\n", + "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", + " 'B': ['B0', 'B1', 'B2', 'B3'],\n", + " 'C': ['C0', 'C1', 'C2', 'C3'],\n", + " 'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rkJvSGYSrm8b" + }, + "source": [ + "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n", + " 'B': ['B4', 'B5', 'B6', 'B7'],\n", + " 'C': ['C4', 'C5', 'C6', 'C7'],\n", + " 'D': ['D4', 'D5', 'D6', 'D7']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NCgdYvJIrqx1" + }, + "source": [ + "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n", + " 'B': ['B8', 
'B9', 'B10', 'B11'],\n", + " 'C': ['C8', 'C9', 'C10', 'C11'],\n", + " 'D': ['D8', 'D9', 'D10', 'D11']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gUoyjyjur5Zn" + }, + "source": [ + "df1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xU6Rh10Gr7NA" + }, + "source": [ + "df2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qKwmOWsQr9wA" + }, + "source": [ + "df3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-MNn-XdlsjJS" + }, + "source": [ + "df= pd.concat([df1, df2, df3])\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BV6HgxSYtG6Z" + }, + "source": [ + "Notice that we basically stacked the dataframes. However, if we do..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Dp-oh-7ftLo5" + }, + "source": [ + "df = pd.concat([df1, df2, df3], axis = 1) # axis = 1 concatenates side by side, along the columns\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iyDZt2XEtmVs" + }, + "source": [ + "If, however, we have:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5PAhjjVZtpP5" + }, + "source": [ + "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", + " 'B': ['B0', 'B1', 'B2', 'B3'],\n", + " 'C': ['C0', 'C1', 'C2', 'C3'],\n", + " 'D': ['D0', 'D1', 'D2', 'D3']},\n", + " index=[0, 1, 2, 3])\n", + "\n", + "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n", + " 'B': ['B4', 'B5', 'B6', 'B7'],\n", + " 'C': ['C4', 'C5', 'C6', 'C7'],\n", + " 'D': ['D4', 'D5', 'D6', 'D7']},\n", + " index=[4, 5, 6, 7])\n", + "\n", + "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n", + " 'B': ['B8', 'B9', 'B10', 'B11'],\n", + " 'C': ['C8', 'C9', 'C10', 'C11'],\n", + " 'D': ['D8', 'D9', 'D10', 
'D11']},\n", + " index=[8, 9, 10, 11])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zGDHd-kPt3-T" + }, + "source": [ + "Then..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3bTl2Nr2t5WM" + }, + "source": [ + "df = pd.concat([df1, df2, df3], axis = 1)\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sUXjlp_Jt925" + }, + "source": [ + "Why does this happen?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JdKXY873HrYt" + }, + "source": [ + "## Merge\n", + "> First, let's look at all the possible kinds of joins.\n", + "\n", + "### Example\n", + "> The example below was inspired by the one presented in [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins). Consider the dataframes below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g4pmhk2t3x8s" + }, + "source": [ + "import pandas as pd\n", + "\n", + "d_Tabela_A = {'indices': [1,2,3,6,7,5,4,10], 'valores': ['A','B','C','D','E','F','G','H']}\n", + "d_Tabela_B = {'indices': [1,2,3,6,7,8,9,11], 'valores': ['AA', 'BB','CC','DD','EE','FF','GG','HH']}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "XxfUULxY52ns" + }, + "source": [ + "df_conjunto_A = pd.DataFrame(d_Tabela_A).set_index('indices')\n", + "df_conjunto_B = pd.DataFrame(d_Tabela_B).set_index('indices')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gGdU36Vi0Yso" + }, + "source": [ + "![SQL_inner_join](https://github.com/MathMachado/Materials/blob/master/SQL_inner_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5w7ox7LV9cuG" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TPhmKw-F9fWX" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5AaTlCPy9FBZ" + }, + "source": [ + "df_inner_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'inner')\n", + "df_inner_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U3OjFM0E0af-" + }, + "source": [ + "![SQL_left_join](https://github.com/MathMachado/Materials/blob/master/SQL_left_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-efYd9c69k4L" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "SqFbNStz9k4S" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rUpc2k729KA-" + }, + "source": [ + "df_left_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'left')\n", + "df_left_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WioSBHjW06Hg" + }, + "source": [ + "![SQL_right_join](https://github.com/MathMachado/Materials/blob/master/SQL_right_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IrzPjGNp9o2n" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tFFTp_yG9o2s" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_D4tF7E-9PCx" + }, + "source": [ + "df_right_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'right')\n", + "df_right_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E9xFrurZ0ksg" + }, + "source": [ + "![SQL_outer_join](https://github.com/MathMachado/Materials/blob/master/SQL_outer_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kQCBAfj_9rO_" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FTDHYsgc9rP0" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "hJqyAs0U9XwO" + }, + "source": [ + "df_outer_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'outer')\n", + "df_outer_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fHEgLynu0vve" + }, + "source": [ + "![SQL_left_excluding_join](https://github.com/MathMachado/Materials/blob/master/SQL_left_excluding_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZA8CcERE-RRS" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "IZiAa9X6-UL0" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jdUt63rA-Vjo" + }, + "source": [ + "df_left_excluding_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how =\"outer\", indicator=True).query('_merge==\"left_only\"')\n", + "df_left_excluding_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CShcqL-h1MqK" + }, + "source": [ + "![SQL_right_excluding_join](https://github.com/MathMachado/Materials/blob/master/SQL_right_excluding_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ECjUDoYf_C9x" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xym7VsXi_FXa" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-zFalmly_HJ7" + }, + "source": [ + "df_right_excluding_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how =\"outer\", indicator=True).query('_merge==\"right_only\"')\n", + "df_right_excluding_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T8v4-zUt1WQz" + }, + "source": [ + "![SQL_outer_excluding_join](https://github.com/MathMachado/Materials/blob/master/SQL_outer_excluding_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8HeMgBqyAYjW" + }, + "source": [ + "### Desafio: Como resolver este?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SkCbLsoktgKl" + }, + "source": [ + "### Notes:\n", + "\n", + "* In some cases the key variable has different names in the two dataframes being joined. In that case, use 'left_on' and 'right_on' to name the key COLUMNS in the left and right dataframes:\n", + "  * pd.merge(df1, df2, left_on =\"employee\", right_on =\"nome\")\n", + "  * In the example above, the dataframe df1 (left dataframe) has the key 'employee', while the dataframe df2 (right dataframe) has the key 'nome'. Using 'left_on' and 'right_on' makes the joining key of each dataframe explicit." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Obc0fHUwIpu" + }, + "source": [ + "## Joining" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DQOa89_cwLyd" + }, + "source": [ + "df_esquerdo = pd.DataFrame({'A': ['A0', 'A1', 'A2'],\n", + " 'B': ['B0', 'B1', 'B2']},\n", + " index=['K0', 'K1', 'K2']) \n", + "\n", + "df_direito = pd.DataFrame({'C': ['C0', 'C2', 'C3'],\n", + " 'D': ['D0', 'D2', 'D3']},\n", + " index=['K0', 'K2', 'K3'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "UHnX9rxzwMmx" + }, + "source": [ + "df_esquerdo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GBc1Mr0Qwff3" + }, + "source": [ + "df_direito" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TmIk3Kjlwg-7" + }, + "source": [ + "df_esquerdo.join(df_direito)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h609fbjjwoZ3" + }, + "source": [ + "df_esquerdo.join(df_direito, how ='outer')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y8W2kP-VCB3E" + }, + "source": [ + "___\n", + "# **Select ROWS of the 
dataframe based on the indices**\n", + "### Further reading\n", + "* [pandas loc vs. iloc vs. ix vs. at vs. iat?\n", + "](https://stackoverflow.com/questions/28757389/pandas-loc-vs-iloc-vs-ix-vs-at-vs-iat/47098873#47098873)\n", + "* [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NN1R1ngAG61x" + }, + "source": [ + "## 1st Approach - df.loc[]\n", + "* To get the contents of row k, use df.loc[row_indexer,column_indexer]." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oduXMUtIUvkN" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JX9nGPWcVLgE" + }, + "source": [ + "\n", + "For example, the command below shows the contents of row 1, all COLUMNS (:)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U5-I2NgYC2fD" + }, + "source": [ + "df2= df_Titanic.loc[1,:]\n", + "df2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tDSJcQLTDyJw" + }, + "source": [ + "Showing the contents of ROWS k= 1:2 (that is, ROWS 1 and 2), all COLUMNS (:)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JD1TDTqAD_5r" + }, + "source": [ + "df_Titanic.loc[1:2, :]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EoAmcdfnEIho" + }, + "source": [ + "Show the contents of row k= 1, column 'pclass':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8vjc5z3_EQfY" + }, + "source": [ + "df_Titanic.loc[1, ['pclass']]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7bC8-H-QFLgd" + }, + "source": [ + "Show the contents of row k= 1 and COLUMNS ['pclass', 'sex']:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"LYFTrZr_FR5g" + }, + "source": [ + "df_Titanic.loc[0, ['pclass', 'sex']]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UtUsmU8sXYTU" + }, + "source": [ + "Porque temos um erro aqui?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CRy5sDx-XbBL" + }, + "source": [ + "Versão correta abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5Lfw0HEnXdn0" + }, + "source": [ + "df_Titanic.loc[1, ['pclass', 'sex']]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tjw3vjkDZg1Z" + }, + "source": [ + "Mostrar os conteúdos da linha k= 1:5 e COLUNAS ['pclass', 'sex']:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4GuAE5MSZjNb" + }, + "source": [ + "df_Titanic.loc[1:5, ['pclass', 'sex']]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xRZxqE6RFnJI" + }, + "source": [ + "Agora suponha que queremos selecionar toda a 'sex'. Como fazer isso?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JdeD_uzfFrp5" + }, + "source": [ + "df_sex= df_Titanic.loc[:, 'sex']\n", + "df_sex.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z_WUjYxsX-Av" + }, + "source": [ + "Fácil selecionarmos o que queremos usando .loc() e iloc(), certo?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RKk0zollHFbp" + }, + "source": [ + "## 2nd Approach - Using [] (slicing)\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jhwoY6LmGzC0" + }, + "source": [ + "df_Titanic[0:2] # Show the ROWS at positions 0 and 1 (the end of the slice is excluded)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I6EOVIDxGiy-" + }, + "source": [ + "df_Titanic[:3] # Show the first 3 ROWS (positions 0 to 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VOHp77F8H9t1" + }, + "source": [ + "df_Titanic['sex'].head() # Select the whole 'sex' variable (head() shows only the first rows)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8nvHNdhPZ040" + }, + "source": [ + "df_Titanic[0:5]['sex'].head() # Show the ROWS at positions 0 to 4 of the 'sex' variable" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GMFso1jaYXgN" + }, + "source": [ + "___\n", + "# **Select/Filter/Replace ROWS of the dataframe based on conditions**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BKljSpS5ou-i" + }, + "source": [ + "## Example 1\n", + "> Building on the previous example, we want to select from the dataframe only the passengers whose sex is 'male'." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jek8Ru3Aam23" + }, + "source": [ + "### Approach 1: df.loc[] and df.iloc[]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eysZoBX2YKb-" + }, + "source": [ + "df_sexo_m_1 = df_Titanic.loc[df_Titanic['sex'] == 'male', 'sex']\n", + "df_sexo_m_1.head() " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uLDOHKGfaq-Z" + }, + "source": [ + "### Approach 2: Using []" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QncrZwHkasiu" + }, + "source": [ + "df_sexo_m_2 = df_Titanic[df_Titanic['sex'] == 'male']['sex']\n", + "df_sexo_m_2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ot6UBTYJF-AJ" + }, + "source": [ + "### Approach 3: df.isin()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OBRF0be3VuTi" + }, + "source": [ + "#### Example 1 - Simple filter" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LeTDiGICGOzb" + }, + "source": [ + "# isin() returns a boolean mask (True where sex is 'male'), not the filtered rows\n", + "df_sexo_m_3 = df_Titanic['sex'].isin(['male'])\n", + "df_sexo_m_3.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q6emu30nGmpt" + }, + "source": [ + "#### Example 2 - Double filter = Two conditions\n", + "> Select all ROWS where sex = 'male' and pclass = 1." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TRaiCYMRGpgl" + }, + "source": [ + "# Filters using df.isin() \n", + "filtro_m = df_Titanic[\"sex\"].isin([\"male\"]) \n", + "filtro_class1 = df_Titanic[\"pclass\"].isin([1]) \n", + " \n", + "# Show the results \n", + "df_Titanic[filtro_m & filtro_class1].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Sh0DDj1xcPaI" + }, + "source": [ + "df_sexo_m_class = df_Titanic[((df_Titanic['sex'] == 'male') & (df_Titanic['pclass'] == 1))]\n", + "df_sexo_m_class.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ujrYHyOsfW7n" + }, + "source": [ + "### Approach 4 - Filter with df.str.contains('s_substr')" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gntbfHgTfanx" + }, + "source": [ + "# Show all ROWS where the string 'mr' appears in the passenger name\n", + "# (column names and string values were lowercased by transformacao_lower above):\n", + "df2 = df_Titanic[df_Titanic['name'].str.contains('mr')]\n", + "df2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eaRtQ8Ja8MOH" + }, + "source": [ + "To learn more about the df[col].str.contains() method, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FyJ-gEjzQI2Y" + }, + "source": [ + "## Replace values of the dataframe\n", + "> Suppose we want to replace all values of pclass as follows:\n", + "* If pclass = 1 --> pclass2 = 'Classe1';\n", + "* If pclass = 2 --> pclass2 = 'Classe2';\n", + "* If pclass = 3 --> pclass2 = 'Classe3';\n", + "\n", + "How do we do that?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Pi8MFiUPQQb7" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "19mynzdfQqVf" + }, + "source": [ + "df_Titanic['pclass2'] = df_Titanic['pclass'].astype(object)  # object dtype so the column can hold strings\n", + "df_Titanic.loc[df_Titanic['pclass'] == 1, 'pclass2'] = 'Classe1'  # .loc avoids chained assignment\n", + "df_Titanic.loc[df_Titanic['pclass'] == 2, 'pclass2'] = 'Classe2'\n", + "df_Titanic.loc[df_Titanic['pclass'] == 3, 'pclass2'] = 'Classe3'\n", + "df_Titanic['pclass2'].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KVSAYeU0KA2V" + }, + "source": [ + "___\n", + "# **Select random samples from the dataframe**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U502dAs3OfOH" + }, + "source": [ + "We saw that the df_Titanic dataframe is very large. So, let's randomly select 100 ROWS." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0BrKUnAiPcAy" + }, + "source": [ + "import random \n", + "\n", + "# Library for measuring the processing time of each alternative\n", + "import time" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iJ1G8lYgKGsc" + }, + "source": [ + "# Using sample\n", + "t0= time.time()\n", + "df_Titanic_a100= df_Titanic.sample(100, replace= False, random_state= 20111974)\n", + "t1= time.time()\n", + "t= t1-t0\n", + "df_Titanic_a100.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8DvWOKizZQr8" + }, + "source": [ + "f'Processing time (s): {t}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nAHLTjpvYKPS" + }, + "source": [ + "# Using NumPy\n", + "import numpy as np\n", + "\n", + "t0 = time.time()\n", + "np.random.seed(20111974)\n", + "indices = np.random.choice(df_Titanic.shape[0], replace = 
False, size=100)\n", + "df_Titanic_a100_2 = df_Titanic.iloc[indices]\n", + "t1 = time.time()\n", + "t = t1-t0\n", + "df_Titanic_a100_2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U8PEDMJ4a52P" + }, + "source": [ + "f'Processing time (s): {t}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wYeuJWdEdMPd" + }, + "source": [ + "df_Titanic_a100_2.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vNMiRkjCQ9Mu" + }, + "source": [ + "___\n", + "# **Describe the dataframe**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GllUFj56RHuD" + }, + "source": [ + "df_Titanic_a100.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "izbpIEi1d1sx" + }, + "source": [ + "df_Titanic_a100_2.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H40G3QzWbG9N" + }, + "source": [ + "___\n", + "# **Identify and handle duplicate ROWS**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_OoM_HS5ZgxG" + }, + "source": [ + "## Example 1\n", + "* Considers duplicates across all COLUMNS of the dataframe." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5XOOdOZBbLc_" + }, + "source": [ + "df = pd.DataFrame({'A':[1,1,3,4,5,1], 'B':[1,1,3,7,8,1], 'C':[3,1,1,6,7,1]})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gio08BkTbTOp" + }, + "source": [ + "# List the duplicates as booleans\n", + "df.duplicated()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "obgbM4d_hJ_J" + }, + "source": [ + "Look at row 5, which is flagged as a duplicate. 
In fact, row 5 is identical to row 1." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LHhOIb-EbWfn" + }, + "source": [ + "# Show the duplicate ROWS\n", + "df[df.duplicated()]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IyJS70_kZ-Jk" + }, + "source": [ + "# Drop row 5 which, as we saw, was a duplicate (a copy of row 1).\n", + "df= df.drop_duplicates()\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Q05mxOSaEjX" + }, + "source": [ + "## Example 2\n", + "* Considers only some COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jiqyjcqdaQ1y" + }, + "source": [ + "df = pd.DataFrame({'A':[1,1,3,4,5,1], 'B':[1,1,3,7,8,1], 'C':[3,1,1,6,7,1]})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "F_118d7vbZ9Y" + }, + "source": [ + "# Show the duplicate ROWS using COLUMNS 'A' and 'B'\n", + "df[df.duplicated(subset=['A','B'])]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_1w_ZZO4vF3A" + }, + "source": [ + "# Drop ROWS 1 and 5 because, as we can see, they duplicate row 0 in COLUMNS 'A' and 'B'\n", + "df= df.drop_duplicates(subset = ['A', 'B'])\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qVx6p8u36jhD" + }, + "source": [ + "___\n", + "# **Working with text data**\n", + "* Sources:\n", + " * [Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)\n", + " * [Using String Methods](https://www.ritchieng.com/pandas-string-methods/)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JLG3cVA1e8-B" + }, + "source": [ + "Preparing the data for the example:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "G_CEULoyeP8C" + }, + "source": [ + "# Define a dictionary with the data: \n", + "import numpy as np\n", + "\n", + "l_idade = []\n", + "for i in range(6):\n", + "    np.random.seed(i) \n", + "    l_idade.append(np.random.randint(10, 40))\n", + " \n", + "\n", + "d_exemplo = {'Nome':['Mr. Antonio dos Santos', 'Mr. Joao Pedro', 'Miss. Priscila Alvarenga', 'Mr. fagner NoVAES', 'Miss. Danielle Aparecida', 'Mr. Paullo Amarantes'], \n", + "             'Idade': l_idade, \n", + "             'Cidade':['lisboa', 'Sintra', 'Braga', 'Guimaraes', 'Mafra', 'Nazare']} \n", + " \n", + "# Convert the dictionary into a dataframe\n", + "df = pd.DataFrame(d_exemplo) \n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "or-Kzaqmdn2b" + }, + "source": [ + "* Suggestions for what we can do with the 'Nome' column of dataframe df:\n", + " * Extract the title from the name: Mr., Miss., etc.\n", + " * Build the primeiro_nome and ultimo_nome COLUMNS.\n", + " * Create the classe_idade variable." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vd99ksvcg7uy" + }, + "source": [ + "## Extract the title from the name" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rNsANzFAg_Kn" + }, + "source": [ + "df_Nome= df['Nome'].str.split(' ', n = 2, expand = True) \n", + "df_Nome" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ianqsxLol008" + }, + "source": [ + "Change the value of n to 3 and explain what happens..."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5NDAkEqCl6H5" + }, + "source": [ + "# Capturing the title from the name (despite its name, 'tamanho_nome' holds the title, e.g. 'Mr.'):\n", + "df['tamanho_nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[0]\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B1QoH4LyrpVI" + }, + "source": [ + "## Build the primeiro_nome and ultimo_nome COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cbi4eRN2mOu9" + }, + "source": [ + "# Capturing the first name and the remainder of the name:\n", + "df['primeiro_nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[1]\n", + "df['ultimo_nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[2]\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7eagWhgZrwOh" + }, + "source": [ + "### Build the classe_idade variable\n", + "\n", + " | Lower Bound | Upper Bound | Class  |\n", + " |-------------|-------------|--------|\n", + " | Inf         | 15          | Inf_15 |\n", + " | 15          | 20          | 15_20  |\n", + " | 20          | 30          | 20_30  |\n", + " | 30          | 40          | 30_40  |\n", + " | 40          | 50          | 40_50  |\n", + " | 50          | Sup         | 50_Sup |" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lBjRBGBWr2AH" + }, + "source": [ + "def classe_idade(Idade):\n", + "    if (Idade <= 15):\n", + "        return 'Inf_15'\n", + "    elif (15 < Idade <= 20):\n", + "        return '15_20'\n", + "    elif (20 < Idade <= 30):\n", + "        return '20_30'\n", + "    elif (30 < Idade <= 40):\n", + "        return '30_40'\n", + "    elif (40 < Idade <= 50):\n", + "        return '40_50'\n", + "    elif (Idade > 50):\n", + "        return '50_Sup'\n", + "    else:\n", + "        return 'Outros'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OogrvjCrsdoh" + }, + "source": [ + "df['classe_idade'] = df['Idade'].map(classe_idade)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JDtxz_eaRcmi" + }, + 
"source": [ + "___\n", + "# **Group information: df.groupby()**\n", + "* Source: [Group By: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)\n", + "\n", + "* The components of the groupby() command\n", + " * **Grouping_Column** - Categorical column by which the data will be grouped;\n", + " * **Aggregating_Column** - Numeric column whose values will be aggregated;\n", + " * **Aggregating_Function** - Aggregation function, e.g. sum, min, max, mean, median, etc.\n", + "\n", + "> Syntax: \n", + "\n", + "```\n", + "df.groupby('Grouping_Column').agg({'Aggregating_Column': 'Aggregating_Function'})\n", + "\n", + "OR\n", + "\n", + "df['Aggregating_Column'].groupby(df['Grouping_Column']).Function()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bmFf-273XPXj" + }, + "source": [ + "## Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wteEveUsd36C" + }, + "source": [ + "transformacao_lower(df_Titanic)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "buF5DhkFfqVA" + }, + "source": [ + "# Grouping df_Titanic by 'sex' and 'pclass'\n", + "df_Titanic.groupby(['sex', 'pclass']).agg({'fare': ['min', 'median', 'mean','max'], 'age': ['count', 'mean','max']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YP3GDwq0gR_V" + }, + "source": [ + "# Grouping df_Titanic by 'sex3' and 'Pclass'\n", + "df_Titanic.groupby(['sex3','Pclass']).agg({'Fare': ['max', 'min']}).round(0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "se4tQ3ETeUfv" + }, + "source": [ + "df_Titanic.groupby(['sex3']).agg({'Age': ['mean','min','max']}).round(0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zUj82I7Cm220" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": 
null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OrLZjm9bXTOr" + }, + "source": [ + "## Example 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x8aPZPT6XZVP" + }, + "source": [ + "### Preparing the example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KrCe6RgOXaFx" + }, + "source": [ + "l_coluna = []\n", + "\n", + "for i in range(1,6):\n", + "    np.random.seed(i)\n", + "    l_coluna.append(np.random.randint(0, 10, 10))\n", + " \n", + "np.random.seed(6)\n", + "l_coluna.append(np.random.rand(10))\n", + "\n", + "l_coluna" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tXaHjmfSXeCw" + }, + "source": [ + "l_coluna[0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U_aEVMTHq6ee" + }, + "source": [ + "df = pd.DataFrame({'coluna6' : ['a', 'a', 'b', 'b', 'a', 'b', 'b', 'b', 'a', 'a'],\n", + "                   'coluna7' : ['um', 'dois', 'um', 'dois', 'um', 'dois', 'dois', 'um', 'um', 'dois'],\n", + "                   'coluna1' : l_coluna[0],\n", + "                   'coluna2' : l_coluna[1],\n", + "                   'coluna3' : l_coluna[2],\n", + "                   'coluna4' : l_coluna[3],\n", + "                   'coluna5' : l_coluna[4],\n", + "                   'coluna8' : l_coluna[5],\n", + "                   'Pessoas' : ['Jose','Maria','Pedro','Carlos','Joao','Ana','Manoel','Mafalda','Antonio','Ricardo'],\n", + "                   'sexo' : ['m','f','m','m','m','f','m','f','m','m']})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ok4a28lGlVC5" + }, + "source": [ + "Grouping by 'coluna6':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vx77lyzlZIFW" + }, + "source": [ + "df.groupby('coluna6').agg({'coluna1': ['min','mean','median','max']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T6i-R2KemadE" + }, + "source": [ + "Now, let's repeat the process using two key COLUMNS, 'coluna6' and 
'coluna7':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WxmHQnQSZrXA" + }, + "source": [ + "df_estatisticas_descritivas = df.groupby(['coluna6','coluna7']).agg({'coluna1': ['min','mean','median','max']})\n", + "df_estatisticas_descritivas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ipw5EROwaaCX" + }, + "source": [ + "Note that df_estatisticas_descritivas is a dataframe. Therefore, we can select ROWS and/or COLUMNS from it however we like." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qk5uSdVwb7dH" + }, + "source": [ + "# Dataframe index:\n", + "df_estatisticas_descritivas.index" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "brIgUFlkalix" + }, + "source": [ + "# Selecting the contents for coluna6= 'a' and coluna7= 'um':\n", + "df_estatisticas_descritivas.loc[('a', 'um')]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fQUs2PVHc6iR" + }, + "source": [ + "# Selecting the contents for coluna6= 'a' and coluna7= 'um', first value (by position):\n", + "df_estatisticas_descritivas.loc[('a', 'um')].iloc[0] # i.e., we select min" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zT0xiee6dDpK" + }, + "source": [ + "# Selecting the contents for coluna6= 'a' and coluna7= 'um', second value (by position):\n", + "df_estatisticas_descritivas.loc[('a', 'um')].iloc[1] # i.e., we select mean" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vXlcjPM6dQKi" + }, + "source": [ + "And so on..."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EMxFMqn9dm3g" + }, + "source": [ + "To learn more about working with two index levels in a dataframe, see [Hierarchical indices, groupby and pandas](https://www.datacamp.com/community/tutorials/pandas-multi-index)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gNHyH7M0pGDy" + }, + "source": [ + "___\n", + "## Example 3\n", + "### Group operations and transformations" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ywl3k_l8pGD0" + }, + "source": [ + "# Show the example dataframe:\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AF8cbNsjpGD5" + }, + "source": [ + "# Build the df_Medias dataframe (numeric_only=True skips the text columns)\n", + "df_Medias = df.groupby('coluna6').mean(numeric_only=True).add_prefix('mean_')\n", + "df_Medias" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JGlA6ufLpGD9" + }, + "source": [ + "# Merge with dataframe df:\n", + "pd.merge(df, df_Medias, left_on ='coluna6', right_index=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1MjZu3sVpGEd" + }, + "source": [ + "___\n", + "# **Discretize numeric COLUMNS**\n", + "* pd.cut() - classes based on value ranges;\n", + "* pd.qcut() - classes based on sample quantiles, i.e. roughly the same number of items in each class.\n", + "\n", + "> This technique is widely used in Machine Learning when we want to build classes for numeric variables (integer or float). 
See below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yK772hiSfZaE" + }, + "source": [ + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wi-nv6fshKIX" + }, + "source": [ + "## pd.cut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SVExQmzDpGEe" + }, + "source": [ + "# Build 4 classes for the float variable 'coluna8':\n", + "Bucket_cut = pd.cut(df['coluna8'], 4) # here we are building 4 buckets\n", + "Bucket_cut" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OOD38I6ug1AY" + }, + "source": [ + "# Which buckets did we build:\n", + "Bucket_cut.value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9s2eaZGtfsxu" + }, + "source": [ + "As you can see, we did build 4 buckets. **Note that we do not have the same number of items in each class!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T7u0pS64hPHC" + }, + "source": [ + "## pd.qcut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cJTQTHA6pGEm" + }, + "source": [ + "Bucket_qcut = pd.qcut(df['coluna8'], 4, labels=False)\n", + "Bucket_qcut" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vM30Td_8hZre" + }, + "source": [ + "# Which buckets did we build:\n", + "Bucket_qcut.value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jhf6V5LTh4G7" + }, + "source": [ + "## Comments\n", + "* pd.qcut() gives a more even distribution of values across classes. This means you are less likely to end up with one class holding a lot of data and another holding very little.\n", + "* I prefer to use pd.qcut()."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RNsR0NsS5iIU" + }, + "source": [ + "___\n", + "# **Joint distribution - crosstabs**\n", + "> Suppose we want to analyze the number of survivors by the embarked COLUMN." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LKQv6YtSfGSU" + }, + "source": [ + "df_Titanic2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ANhb5rBffTh6" + }, + "source": [ + "pd.crosstab(df_Titanic2['survived'], df_Titanic2['embarked'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WIlHAYEVqSjT" + }, + "source": [ + "___\n", + "# **Delete COLUMNS from the dataframe**\n", + "> Delete the COLUMNS 'coluna2' and 'coluna5' from the dataframe." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YssOMF_Vqso5" + }, + "source": [ + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rVF_1p0Gq3gZ" + }, + "source": [ + "## Using inplace = True" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7BjRIX1jqWQT" + }, + "source": [ + "df.drop(['coluna2','coluna5'], axis =1, inplace =True)\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "POC2fnTlq8mK" + }, + "source": [ + "## Using assignment" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YRSwEbnfq7s_" + }, + "source": [ + "df= df.drop(['coluna2','coluna5'], axis =1, errors ='ignore')  # errors='ignore': the columns may already have been dropped by the cell above\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bHth6KSv7k0G" + }, + "source": [ + "___\n", + "# **Create dummy COLUMNS for categorical data**\n", + "> Our goal is to build dummy variables for our categorical COLUMNS.\n", + "\n", + "* Sources: \n", + " * [Categorical 
data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)\n", + " * [Creating Dummy Variables](https://www.ritchieng.com/pandas-creating-dummy-variables/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GOqcARHqjMr_" + }, + "source": [ + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yNqvwEu9jbuW" + }, + "source": [ + "Let's build dummy variables for the 'coluna6' and 'coluna7' COLUMNS as follows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "16osZsMEjmDh" + }, + "source": [ + "pd.get_dummies(df['coluna6'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cb1gp_Y1jxz2" + }, + "source": [ + "How do you interpret the result above?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Cic19l-Mj39q" + }, + "source": [ + "pd.get_dummies(df['coluna7'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "44FDXcoyj-tT" + }, + "source": [ + "How do you interpret the result above?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cxHc6BvDkCWl" + }, + "source": [ + "df = pd.get_dummies(df, columns =['coluna6', 'coluna7', 'sexo'])\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "A2m25N4znZ2O" + }, + "source": [ + "df.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x0uXu0RRlB2a" + }, + "source": [ + "___\n", + "# **Compute correlation (Correlation Analysis)**\n", + "> Correlation can be computed with the df.corr() method. 
For more details on the available types of correlation and when to apply each, see the links below:\n", + "\n", + "* [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)\n", + "* [Kendall rank correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient)\n", + "* [Spearman's rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).\n", + "\n", + "To learn more about generating heatmaps, see [Seaborn Heatmap Tutorial (Python Data Visualization)](https://likegeeks.com/seaborn-heatmap-tutorial/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AgoigF8AnYG0" + }, + "source": [ + "## Generating the example dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NsuhsZCTmqEm" + }, + "source": [ + "# View the data (first run the cell further below that creates df_X)\n", + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D0JNMHqYoSMs" + }, + "source": [ + "# Show the Pearson correlation matrix, keeping only the upper triangle (the rest becomes NaN)\n", + "set_Colunas_Correlacionadas = set()\n", + "matriz_correlacao = df_X.corr().where(np.triu(np.ones(df_X.corr().shape), k = 1).astype(bool))\n", + "matriz_correlacao" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6scRm8kNnbby" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "# Generating a dataframe with 15 columns: 9 informative and 6 redundant:\n", + "from sklearn.datasets import make_classification\n", + "X, y = make_classification(n_samples=1000, n_features=15, n_informative=9,\n", + "                           n_redundant=6, n_repeated=0, n_classes=2, n_clusters_per_class=1,\n", + "                           random_state=20111974)\n", + "\n", + "df_X = pd.DataFrame(X, columns= ['v1', 'v2', 'v3', 'v4', 
'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15'])\n", + "df_y = pd.DataFrame(y, columns= ['target'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vnj6A8z6r7nM" + }, + "source": [ + "### Which columns are highly correlated?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "a_YUD-dOr_p-" + }, + "source": [ + "for i in range(len(matriz_correlacao.columns)):\n", + "    for j in range(i):\n", + "        if abs(matriz_correlacao.iloc[j, i]) > 0.8:  # [j, i] with j < i: only the upper triangle holds values\n", + "            colnome = matriz_correlacao.columns[i]\n", + "            set_Colunas_Correlacionadas.add(colnome)\n", + "\n", + "set_Colunas_Correlacionadas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-0Xe6GdozYT" + }, + "source": [ + "Next, a more visual view of the correlation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5-_Qadx1o1U9" + }, + "source": [ + "fig, ax = plt.subplots(figsize = (12, 12)) \n", + "mask = np.zeros_like(df_X.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(df_X.corr().abs(), mask= mask, ax= ax, cmap='coolwarm', annot= True, fmt= '.2f', center= 0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5ZOp9ZGgtqFQ" + }, + "source": [ + "# **Scatterplot**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eReJJjG8tuKV" + }, + "source": [ + "## With regression" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tVmdSo6ztruA" + }, + "source": [ + "sns.pairplot(df_X, kind = \"reg\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xG9A6b32twv-" + }, + "source": [ + "## Without regression" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fyTOS3zVtz-O" + }, + "source": [ + "sns.pairplot(df_X, kind = \"scatter\")\n", + "plt.show()" + ], + "execution_count": null, + 
"outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f-1bpipc6bMh" + }, + "source": [ + "___\n", + "# **Save dataframe as csv**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "64CoM1aY6gf6" + }, + "source": [ + "df_X.to_csv('example.csv')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oy646p33DJV0" + }, + "source": [ + "# **Word dictionary**\n", + "> Widely used in NLP and Machine Learning.\n", + "* Use case: Insurers --> When a policyholder calls the insurer to describe an accident (for example), an algorithm transcribes the audio into text for text mining." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DQR906rVD1V-" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sHvDaztJDPP7" + }, + "source": [ + "from sklearn.feature_extraction.text import CountVectorizer\n", + "vectorizer = CountVectorizer()  # separate name so the CountVectorizer class is not overwritten\n", + "matriz_contagens = vectorizer.fit_transform(df_Titanic['name']) # Pass the text/string column we want to analyze\n", + "print(matriz_contagens)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jwT-56dED8VJ" + }, + "source": [ + "df_dicionario_palavras = pd.DataFrame(vectorizer.get_feature_names_out(), columns = ['palavra'])  # get_feature_names() was removed in scikit-learn 1.2\n", + "df_dicionario_palavras[\"vezes_que_aparece\"] = matriz_contagens.sum(axis = 0).tolist()[0]\n", + "df_dicionario_palavras = df_dicionario_palavras.sort_values(\"vezes_que_aparece\", ascending = False) #.reset_index(drop = True)\n", + "df_dicionario_palavras.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nx65RmEAGTvd" + }, + "source": [ + "# Challenge\n", + "> Turn the Python code from the **Word dictionary** section into a function so we can reuse it 
later." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iwd1lhq9mrD3" + }, + "source": [ + "___\n", + "# **Exercises**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o_cl0kFgQfFh" + }, + "source": [ + "## Exercise 1\n", + "* From the dataframes USA_Abbrev, USA_Area and USA_Population, build the USA dataframe containing the COLUMNS state, abbreviation, area, ages, year, population.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s8rQUo7yHKJ1" + }, + "source": [ + "* Note: The easiest way to read a CSV file (exported from Excel, for example) from GitHub is to click the csv file in your GitHub repository and then click 'raw'. Then copy the address shown in the browser and paste it into the 'url' variable. If in doubt, read the following document: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KTun1uSLuJ-A" + }, + "source": [ + "## Exercise 2\n", + "Source: https://github.com/aakankshaws/Pandas-exercises\n", + "\n", + "* Consider the dataframes below and merge dataframe df_esquerdo with dataframe df_direito:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Soq7GVZnuREq" + }, + "source": [ + "df_esquerdo = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", + "                            'A': ['A0', 'A1', 'A2', 'A3'],\n", + "                            'B': ['B0', 'B1', 'B2', 'B3']})\n", + " \n", + "df_direito = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", + "                           'C': ['C0', 'C1', 'C2', 'C3'],\n", + "                           'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6KEsTARfvM1C" + }, + "source": [ + "## Exercise 3\n", + "Source: https://github.com/aakankshaws/Pandas-exercises\n", + "\n", + "* Consider the dataframes below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hgxE5gZ9vMEg" + }, + "source": [ + "df_esquerdo = 
pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],\n", + "                            'key2': ['K0', 'K1', 'K0', 'K1'],\n", + "                            'A': ['A0', 'A1', 'A2', 'A3'],\n", + "                            'B': ['B0', 'B1', 'B2', 'B3']})\n", + " \n", + "df_direito = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],\n", + "                           'key2': ['K0', 'K0', 'K0', 'K0'],\n", + "                           'C': ['C0', 'C1', 'C2', 'C3'],\n", + "                           'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iv7AmZ1ivm8R" + }, + "source": [ + "### Questions\n", + "* What is the output and the interpretation of each of the following commands:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TWAW_1tuvvSO" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QjM7pBONvzCJ" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'outer', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D1Rr3Ghsv2iS" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'right', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vXQwLjT-v3Iu" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'left', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EIdltTC-t_lF" + }, + "source": [ + "## Exercise 5\n", + "5.1. Identify and delete the attributes of the df_Titanic dataframe that can be dropped at the start of the data analysis."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bMwPLgWclWBq" + }, + "source": [ + "___\n", + "## Exercise 6\n", + "* (a) Load the dataframe Titanic_With_MV.csv and inspect it for inconsistencies and Missing Values (NaN).\n", + "\n", + "### Feature Engineering\n", + "* (b) From the column 'cabin', build the columns:\n", + " * deck - the letter in Cabin;\n", + " * seat - the number in Cabin\n", + "* (c) Create the column 'sozinho_parch', where sozinho_parch = 1 means the passenger travels alone and 0 otherwise.\n", + "* (d) Create the attribute 'sozinho_sibsp', where sozinho_sibsp = 1 means the passenger travels alone and 0 otherwise.\n", + "* (e) Discretize the column 'fare' into 10 buckets.\n", + "* (f) Discretize the column 'age'.\n", + "* (g) Extract the titles ('Ms', 'Mr', etc.) contained in the column 'Title';\n", + "* (h) What is the relationship between the variables and the target variable?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V7KUGAX6lilP" + }, + "source": [ + "import pandas as pd\n", + "df_Titanic = pd.read_csv('https://raw.githubusercontent.com/MathMachado/Python4DS/DS_Python/Dataframes/Titanic_With_MV.csv?token =AGDJQ63MNPPPROFNSO2BZW25XSR72', index_col= 'PassengerId')\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m3UnAPJakCLR" + }, + "source": [ + "* Data dictionary of the Titanic dataframe:\n", + " * PassengerID: passenger ID;\n", + " * survived: indicator, where 1 = passenger survived and 0 = passenger died;\n", + " * Pclass: passenger class;\n", + " * Age: passenger age;\n", + " * SibSp: number of siblings/spouses aboard;\n", + " * Parch: number of parents/children aboard;\n", + " * Fare: fare paid by the passenger;\n", + " * Cabin: passenger cabin;\n", + " * Embarked: port where the passenger embarked.\n", + " * Name: passenger name;\n", + " * sex: passenger sex\n", + " " + ] + }, + { 
+ "cell_type": "markdown", + "metadata": { + "id": "B_3s5cgxfNKQ" + }, + "source": [ + "## Answer to item (a)\n", + "### Column XPTO\n", + "\n", + "\n", + "### Column XPTO2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q3oLgyhdL6xd" + }, + "source": [ + "## Answer to item (b)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UbexhGtayV4X" + }, + "source": [ + "## Exercise 7\n", + "See the page [Pandas Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/pandas/index.php) for more exercises related to this topic." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Iia0ikd_KBtH" + }, + "source": [ + "## Exercise 8\n", + "Create the column 'aleatorio' in the dataframe df_Titanic, where each row receives a random value generated with np.random.random()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HPiLKUkWNYs3" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ldWQd9j4NhPS" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_01__Pandas_hs2.ipynb b/Notebooks/NB10_01__Pandas_hs2.ipynb new file mode 100644 index 000000000..d4b3c0c78 --- /dev/null +++ b/Notebooks/NB10_01__Pandas_hs2.ipynb @@ -0,0 +1,5534 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Copy of NB10_01__Pandas.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8fpUiw8PwC7_" + }, + "source": [ + "

PANDAS FOR DATA ANALYSIS

\n", + "\n", + "\n", + "\n", + "# **AGENDA**:\n", + "\n", + "> See the **index** of the items covered in this chapter.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vo7mtiNSr_Wk" + }, + "source": [ + "___\n", + "# **REFERENCES**\n", + "* [Learn Aggregation and Data Wrangling with Python](https://data-flair.training/blogs/data-wrangling-with-python/)\n", + "* [Python Data Cleansing by Pandas & Numpy | Python Data Operations](https://data-flair.training/blogs/python-data-cleansing/)\n", + "* [Pandas from basic to advanced for Data Scientists](https://towardsdatascience.com/pandas-from-basic-to-advanced-for-data-scientists-aee4eed19cfe)\n", + "* [Feature engineering and ensembled models for the top 10 in Kaggle “Housing Prices Competition”](https://towardsdatascience.com/feature-engineering-and-ensembled-models-for-the-top-10-in-kaggle-housing-prices-competition-efb35828eef0)\n", + "* [Pandas.Series Methods for Machine Learning](https://towardsdatascience.com/pandas-series-methods-for-machine-learning-fd83709368ff)\n", + "* [Gaining a solid understanding of Pandas series](https://towardsdatascience.com/gaining-a-solid-understanding-of-pandas-series-893fb8f785aa)\n", + "* [Variáveis Dummy: o que é? Quando usar? 
E como usar?](https://medium.com/data-hackers/vari%C3%A1veis-dummy-o-que-%C3%A9-quando-usar-e-como-usar-78de66cfcca9)\n", + "* [Exploratory Data Analysis Made Easy Using Pandas Profiling](https://towardsdatascience.com/exploratory-data-analysis-made-easy-using-pandas-profiling-86e347ef5b65)\n", + "* [Data Handling using Pandas; Machine Learning in Real Life](https://towardsdatascience.com/data-handling-using-pandas-machine-learning-in-real-life-be76a697418c)\n", + "* [Exploratory Data Analysis Tutorial in Python](https://towardsdatascience.com/exploratory-data-analysis-tutorial-in-python-15602b417445)\n", + "* [Exploring the data using python](https://towardsdatascience.com/exploring-the-data-using-python-47c4bc7b8fa2)\n", + "* [A better EDA with Pandas-profiling](https://towardsdatascience.com/a-better-eda-with-pandas-profiling-e842a00e1136)\n", + "* [Exploratory Data Analysis: Haberman’s Cancer Survival Dataset](https://towardsdatascience.com/exploratory-data-analysis-habermans-cancer-survival-dataset-c511255d62cb)\n", + "* [Exploring Exploratory Data Analysis](https://towardsdatascience.com/exploring-exploratory-data-analysis-1aa72908a5df)\n", + "* [Getting started with Data Analysis with Python Pandas](https://towardsdatascience.com/getting-started-to-data-analysis-with-python-pandas-with-titanic-dataset-a195ab043c77)\n", + "* [A Gentle Introduction to Exploratory Data Analysis](https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184)\n", + "* [Exploratory Data Analysis (EDA) techniques for Kaggle competition beginners](https://towardsdatascience.com/exploratory-data-analysis-eda-techniques-for-kaggle-competition-beginners-be4237c3c3a9)\n", + "* [What is Exploratory Data Analysis?](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)\n", + "* [Exploring real estate investment opportunity in Boston and 
Seattle](https://towardsdatascience.com/exploring-real-estate-investment-opportunity-in-boston-and-seattle-9d89d0c9bed2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BUEbp88oD1Km" + }, + "source": [ + "___\n", + "# **DATA ANALYSIS WITH PANDAS**\n", + "## Highlights\n", + "\n", + "* A fast and efficient library for data manipulation;\n", + "* Tools for reading and writing many data formats: CSV, txt, Microsoft Excel, SQL databases, JSON and HDF5;\n", + "* Pandas is one of the most popular libraries for data analysis. The main things we will do with Pandas are:\n", + " * Read/write different data formats;\n", + " * Select subsets of data;\n", + " * Perform calculations by column or by row of the tables;\n", + " * Find and handle Missing Values;\n", + " * Combine multiple dataframes;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wkxQFPPmeKLl" + }, + "source": [ + "![Pandas](https://github.com/MathMachado/Materials/blob/master/Pandas.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eKawOG-neqaD" + }, + "source": [ + "![Pandas](https://github.com/MathMachado/Materials/blob/master/Pandas2.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TLdSmsJZwlcQ" + }, + "source": [ + "___\n", + "# **UP TO WHAT DATA VOLUME CAN WE USE PANDAS?**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O7YKF5gB2x0K" + }, + "source": [ + "![RightToolForEachSize](https://github.com/MathMachado/Materials/blob/master/SizesAndTools.PNG?raw=true)\n", + "\n", + "## Sources\n", + "### Dask\n", + "* [Pandas, Dask or PySpark? 
What Should You Choose for Your Dataset?](https://medium.com/datadriveninvestor/pandas-dask-or-pyspark-what-should-you-choose-for-your-dataset-c0f67e1b1d36)\n", + "* [Processing Data with Dask](https://medium.com/when-i-work-data/processing-data-with-dask-47e4233cf165)\n", + "* [Pandas, Fast and Slow](https://medium.com/when-i-work-data/pandas-fast-and-slow-b6d8dde6862e)\n", + "* [Por que Parquet](https://medium.com/when-i-work-data/por-que-parquet-2a3ec42141c6)\n", + "* [How to Run Parallel Data Analysis in Python using Dask Dataframes](https://towardsdatascience.com/trying-out-dask-dataframes-in-python-for-fast-data-analysis-in-parallel-aa960c18a915)\n", + "* [Why every Data Scientist should use Dask?](https://towardsdatascience.com/why-every-data-scientist-should-use-dask-81b2b850e15b)\n", + "\n", + "### Spark, Koalas\n", + "* [Databricks Koalas-Python Pandas for Spark](https://medium.com/future-vision/databricks-koalas-python-pandas-for-spark-ce20fc8a7d08)\n", + "* [Bye Pandas, Meet Koalas: Pandas APIs on Apache Spark (Ep. 
4)](https://medium.com/@kyleake/bye-pandas-meet-koalas-pandas-apis-on-apache-spark-ep-4-aedcd363cf4e)\n", + "* [Koalas: Easy Transition from pandas to Apache Spark](https://databricks.com/blog/2019/04/24/koalas-easy-transition-from-pandas-to-apache-spark.html?source=post_page-----aedcd363cf4e----------------------)\n", + "* [Use PySpark for Your Next Big Problem](https://medium.com/swlh/use-pyspark-for-your-next-big-problem-8aa288d5ecfa)\n", + "* [A Neanderthal’s Guide to Apache Spark in Python](https://towardsdatascience.com/a-neanderthals-guide-to-apache-spark-in-python-9ef1f156d427)\n", + "* [The Jungle of Koalas, Pandas, Optimus and Spark](https://towardsdatascience.com/the-jungle-of-koalas-pandas-optimus-and-spark-dd486f873aa4)\n", + "* [From Pandas to PySpark with Koalas](https://towardsdatascience.com/from-pandas-to-pyspark-with-koalas-e40f293be7c8)\n", + "\n", + "# What is Dask?\n", + "\n", + "\"Dask is designed to extend the numpy and pandas packages to work on data processing problems that are too large to be kept in memory. 
It breaks the larger processing job into many smaller tasks that are handled by numpy or pandas and then it reassembles the results into a coherent whole.\" - Eric Ness ([Processing Data with Dask](https://medium.com/when-i-work-data/processing-data-with-dask-47e4233cf165))\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yEyzjGUfG33-" + }, + "source": [ + "___\n", + "# **Load the Pandas library and check the version**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oVMjT3DrG97K" + }, + "source": [ + "# Load the Pandas library\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "import locale # New import\n", + "\n", + "print(f'Pandas version: {pd.__version__}')\n", + "print(f'NumPy version: {np.__version__}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OxoDsaKUVHdH" + }, + "source": [ + "# Settings\n", + "> We can configure pandas to make our work more productive. For example, we can set the number of ROWS and COLUMNS displayed and the precision of float numbers. We will look at this in more detail below.\n", + "\n", + "Source: [5 Advanced Features of Pandas and How to Use Them](https://www.kdnuggets.com/2019/10/5-advanced-features-pandas.html)." 
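, + "\n", + "\n", + "As a complement to the global settings applied in the next cells, here is a minimal sketch of changing display options only temporarily. It assumes `pd.option_context`, a pandas context manager that is not used elsewhere in this notebook; the dataframe `df_demo` is a hypothetical example.\n", + "\n", + "```python\n", + "import pandas as pd\n", + "\n", + "df_demo = pd.DataFrame({'x': [1.23456, 2.34567]})  # hypothetical data\n", + "\n", + "# The options below are active only inside the with-block\n", + "with pd.option_context('display.precision', 2, 'display.max_rows', 10):\n", + "    print(df_demo)\n", + "\n", + "# Outside the block, the previous settings are restored\n", + "print(pd.get_option('display.precision'))\n", + "```"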
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IOdqrf7uVlhC" + }, + "source": [ + "d_configuracao = {\n", + " 'display.max_columns': 1000,\n", + " 'display.expand_frame_repr': True,\n", + " 'display.max_rows': 10,\n", + " 'display.precision': 2,\n", + " 'display.show_dimensions': True\n", + " }\n", + "\n", + "for op, value in d_configuracao.items():\n", + " pd.set_option(op, value)\n", + " print(op, value)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "plpT_-jFTCjO" + }, + "source": [ + "# ANOTHER WAY TO CONFIGURE THE DISPLAY\n", + "# create the variable pdod (short for pd.options.display)\n", + "# and use it to adjust the data display options\n", + "# locale.setlocale(locale.LC_NUMERIC, 'pt_BR') ## RAISED AN ERROR - check later\n", + "pdod = pd.options.display\n", + "pdod.max_rows = 100\n", + "pdod.min_rows = 50\n", + "pdod.max_columns = None\n", + "# pdod.float_format = '{:.2f}'.format \n", + "pdod.float_format = lambda x: locale.format_string('%.2f', x, grouping=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-vJufa3WTrG_" + }, + "source": [ + "dir(pd.options.display)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Paz-R-FOAJ7F" + }, + "source": [ + "___\n", + "# **Create a dataframe from other objects**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L4Jc0C2qPAQz" + }, + "source": [ + "## Create a dataframe from dictionaries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Sa5rKwq6Fscj" + }, + "source": [ + "### Example 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0ofIGkiSSuYq" + }, + "source": [ + "d_frutas = {'Apple': [5, 6, 6, 8, 10, 3, 2],\n", + " 'Avocado': [6, 6, 3, 9, 3, 2, 1]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"iJCNvPlUTzTI" + }, + "source": [ + "d_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7Y_0O_tJTfm3" + }, + "source": [ + "# index=['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'] abaixo define os label.\n", + "df_frutas = pd.DataFrame(d_frutas, index = ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'])\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l2ll8ktfUKz2" + }, + "source": [ + "O que se comprou na sexta?\n", + "\n", + "* Função df.loc[label] retorna o(s) valor(es) associados à label. Em nosso caso, os label (chaves do dicionário) são 'Seg', 'Ter', ..., 'Dom'." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9Voor8_PUJum" + }, + "source": [ + "df_frutas.loc['Sex'] # Aqui, label= 'Sex'." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LMh4DTfebwAr" + }, + "source": [ + "* Ou seja, o label = 'Sex', que ocupa a posição 4, tem os valores:\n", + " * Apple..: 10\n", + " * Avocado: 3\n", + "\n", + "Da mesma forma, poderíamos utilizar a função df.iloc[index] para retornar o conteúdo/informações de index." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GJxawdh6bvJN" + }, + "source": [ + "df_frutas.iloc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "obJt9OPGcL-x" + }, + "source": [ + "Portanto, df.loc['Sex'] = df.iloc[4]. Correto?\n", + "\n", + "Para nos ajudar a memorizar, considere que:\n", + "\n", + "* pd.loc[label] --> loc começa com a letra **l**, o que remete à label da linha.\n", + "* pd.iloc[indice] --> iloc começa com a letra **i**, o que remete ao índice (inteiro) da linha." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v7QlCcEorEIX" + }, + "source": [ + "#### Qual é o output do code abaixo?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kRRdQShrrKHk" + }, + "source": [ + "df_frutas.loc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EkjAtbrRF01h" + }, + "source": [ + "### Example 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2EOX5MC4E1xL" + }, + "source": [ + "In practice we deal with large datasets and, in those cases, the ROWS usually have no labels defined. To illustrate, consider the same example we have just seen, with a small change:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RC_OXmdjrkQm" + }, + "source": [ + "d_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D6FckgDPFFs0" + }, + "source": [ + "df_frutas = pd.DataFrame(d_frutas) # Note that here we did not define the index\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tkGc4JQcFPkp" + }, + "source": [ + "Notice that the labels are now integers from 0 to N-1." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ri-EdUYAovLG" + }, + "source": [ + "#### What is the content of the row whose label is 4?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5YgWG_vlFVe_" + }, + "source": [ + "df_frutas.loc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rFQxcAcVo2KD" + }, + "source": [ + "#### What is the content of the row whose index is 4?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xB1j4n6HFank" + }, + "source": [ + "df_frutas.iloc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jEbCke3TFf_q" + }, + "source": [ + "That is, in these cases a single-row lookup with df.loc[] or df.iloc[] returns the same result. Got it?" 
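, + "\n", + "\n", + "A minimal sketch of where the two indexers stop being interchangeable, assuming the default integer index; `df_demo` is a hypothetical copy of the fruit data. Single-row lookups agree, but slices differ: `loc` includes the end label, while `iloc` excludes the end position.\n", + "\n", + "```python\n", + "import pandas as pd\n", + "\n", + "df_demo = pd.DataFrame({'Apple': [5, 6, 6, 8, 10, 3, 2],\n", + "                        'Avocado': [6, 6, 3, 9, 3, 2, 1]})  # labels 0..6\n", + "\n", + "# Single row: label 4 and position 4 point to the same row here\n", + "print(df_demo.loc[4].equals(df_demo.iloc[4]))  # True\n", + "\n", + "# Slices: loc is label-based and inclusive, iloc is position-based and exclusive\n", + "print(len(df_demo.loc[2:4]))   # 3 rows (labels 2, 3, 4)\n", + "print(len(df_demo.iloc[2:4]))  # 2 rows (positions 2, 3)\n", + "```"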
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bKHw_VBKjkoL" + }, + "source": [ + "### Example 3 - Set the dataframe index using df.set_index()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "13ArWIhYju6s" + }, + "source": [ + "d_frutas= {'Dia_Semana': ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'],\n", + " 'Apple': [5, 6, 6, 8, 10, 3, 2],\n", + " 'Avocado': [6, 6, 3, 9, 3, 2, 1]}\n", + "\n", + "d_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Evw9w16gk5h0" + }, + "source": [ + "# Create the dataframe df_frutas:\n", + "df_frutas = pd.DataFrame(d_frutas) # We did not specify the dataframe index, so it is created automatically from 0 to N-1.\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NLbbRrdYoclw" + }, + "source": [ + "#### What is the content of row 4?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lB-ngbutl_0c" + }, + "source": [ + "df_frutas.iloc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1aJLGapZlUFI" + }, + "source": [ + "# Set 'Dia_Semana' as the index (row labels) of the dataframe df_frutas\n", + "df_frutas.set_index('Dia_Semana', inplace = True)\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L1-U_sD-jAoO" + }, + "source": [ + "The expression above is equivalent to:\n", + "\n", + "```\n", + "df_frutas2 = df_frutas.set_index('Dia_Semana') # Note that there is no 'inplace' here\n", + "df_frutas2\n", + "```\n", + "\n", + "* So, what is the role of 'inplace = True' in the first option?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oXeFjJonpQfB" + }, + "source": [ + "#### What is the content of row 4?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MMXg3vVQpUhh" + }, + "source": [ + "df_frutas.iloc[4]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fhoYuGMlpVFj" + }, + "source": [ + "#### What is the content of the row whose label is 'Sex'?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fmcWbrEspdYW" + }, + "source": [ + "df_frutas.loc['Sex']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bobggpoCTRkj" + }, + "source": [ + "### What is the difference between the next two commands, df_frutas.mean() and df_frutas.mean(1)?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SjiYgbNrsvpl" + }, + "source": [ + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OFhzE7hgTD0a" + }, + "source": [ + "df_frutas.mean()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "V42I3807TNte" + }, + "source": [ + "df_frutas.mean(1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6iUCthsbtLV8" + }, + "source": [ + "df_frutas.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YdkmYePYtcON" + }, + "source": [ + "df_frutas.dtypes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2RmgCIC2HZFp" + }, + "source": [ + "### Example 4" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kbHHuMzzAR1A" + }, + "source": [ + "d_estudantes = {'nome': ['Jack', 'Richard', 'Tommy', 'Ana'], \n", + " 'age': [25, 34, 18, 21],\n", + " 'city': ['Sydney', 'Rio de Janeiro', 'Lisbon', 'New York'],\n", + " 'country': ['Australia', 'Brazil', 'Portugal', 'United States']}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "ayKqLmHTANOu" + }, + 
"source": [ + "# Mostrar o conteúdo do dicionário d_estudantes...\n", + "d_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0ONA8QsBBP6R" + }, + "source": [ + "# Keys associadas ao dicionário d_estudantes\n", + "d_estudantes.keys()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "k8mmvKQ_BjO6" + }, + "source": [ + "# Itens associados ao dicionário d_estudantes\n", + "d_estudantes.items()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcm8V_UmBr1Y" + }, + "source": [ + "# Valores associados ao dicionário d_estudantes\n", + "d_estudantes.values()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KK7IejsPDkWC" + }, + "source": [ + "Temos uma key = 'nome'. Qual o conteúdo desta key?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eHvPpeiTBwoR" + }, + "source": [ + "d_estudantes['nome']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S1y7p8CcDsXl" + }, + "source": [ + "Qual o output da expressão a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "26WIDl-HB3Bq" + }, + "source": [ + "d_estudantes['nome'][0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gV68kQ5HCIif" + }, + "source": [ + "Criando o dataframe df_estudantes a partir do dicionário d_estudantes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2oa808hkCSaq" + }, + "source": [ + "df_estudantes = pd.DataFrame(d_estudantes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7HLp0FYpCiSc" + }, + "source": [ + "# Mostra o conteúdo do dataframe df_estudantes...\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "en06lfazciE0" + }, + "source": [ + "**Atenção**: Observe que nesse caso, não definimos labels para as LINHAS. Na prática, isso é o mais comum, ou seja, os label = index, que aqui são números inteiros de 0 a N." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gFaPp-S-cy1-" + }, + "source": [ + "Mais uma vez, vamos usar df.loc[] e df.iloc[]..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mT9vwRBidGXX" + }, + "source": [ + "# Mostrando o conteúdo de da linha 3 usando df.loc[]\n", + "df_estudantes.loc[3]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zj88AwHUdix0" + }, + "source": [ + "OU" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SP2mG8todkMe" + }, + "source": [ + "# Mostrando o conteúdo de da linha 3 usando df.iloc[]\n", + "df_estudantes.iloc[3]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hzbLO0EDGWTf" + }, + "source": [ + "Ok, já discutimos isso anteriormente. 
When the ROWS have no labels, iloc[] and loc[] select the same row" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VvzVg7SpeOOB" + }, + "source": [ + "___\n", + "## Create dataframes from lists\n", + "* Consider the following list of fruits:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0_PY9OROeUiT" + }, + "source": [ + "l_frutas = [('Melon', 6, 8, 5, 4, 6, 2, 8), ('Avocado', 6, 6, 3, 8, 9, 3, 1), ('Blueberry', 7, 5, 9, 3, 1, 0, 4)]\n", + "l_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AfE_rHq5g4_P" + }, + "source": [ + "type(l_frutas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZpdPSi7RgVjK" + }, + "source": [ + "l_frutas[0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NMyIpVW8gZTH" + }, + "source": [ + "l_frutas[0][0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-cyZVqQFhjjg" + }, + "source": [ + "# List containing the names of the dataframe COLUMNS:\n", + "l_colunas = ['Frutas', 'Dom', 'Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab']\n", + "l_colunas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wplKvgayfZm_" + }, + "source": [ + "# Converting the list into a dataframe\n", + "df_frutas = pd.DataFrame(l_frutas, columns = l_colunas) # Note that here the COLUMN names are given as a list.\n", + "df_frutas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GojgsAXTFZmB" + }, + "source": [ + "___\n", + "# **Copy dataframes**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gCkqcDo8X_ld" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_Tda4ZwjWIW" + }, + "source": [ + "The dataframe 
df_estudantes has the following content:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P5y0aVkdkA8o" + }, + "source": [ + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cp3bvPEqj5fS" + }, + "source": [ + "if we do..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "J2PT5L11j8O0" + }, + "source": [ + "df_estudantes2 = df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2D29pGuikBBK" + }, + "source": [ + "then df_estudantes2 has the same content as df_estudantes, ok?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_IseZEpLkGS4" + }, + "source": [ + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "29MpozLrkI83" + }, + "source": [ + "Now change the value 'Rio de Janeiro' to 'Sao Paulo' in the dataframe df_estudantes2." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TXCqFiGFkmyv" + }, + "source": [ + "df_estudantes2['city'] = df_estudantes2['city'].replace({'Rio de Janeiro': 'Sao Paulo'})\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_0mgT7-8Fsl" + }, + "source": [ + "# OR\n", + "alteracoes = {'Rio de Janeiro': 'Sao Paulo'}\n", + "df_estudantes2['city'] = df_estudantes2['city'].replace(alteracoes)\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BN8ZGu2Xk6vt" + }, + "source": [ + "Ok, we replaced the value 'Rio de Janeiro' with 'Sao Paulo', as we wanted. Let's look at the content of df_estudantes (**which should be intact, since we made the change in the dataframe df_estudantes2**)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "thNAWoDflRoQ" + }, + "source": [ + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VkIS8wVmlAyq" + }, + "source": [ + "Ooooops... df_estudantes was changed? How, if we made the change in df_estudantes2 and NOT in df_estudantes???\n", + "\n", + "* **Are the operations we perform on df_estudantes2 also applied to df_estudantes**?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e9u-Z9NMltC9" + }, + "source": [ + "**Answer**: YES, because df_estudantes2 is a pointer to df_estudantes: both names refer to the same object. In other words, **any in-place change we make through df_estudantes2 is also visible in df_estudantes**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IDwvsxhhmlE4" + }, + "source": [ + "An easy way to see this is through the memory addresses of the two (**supposedly different**) dataframes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ePFwKua8mu7k" + }, + "source": [ + "id(df_estudantes2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bMvY_E0mmwQH" + }, + "source": [ + "id(df_estudantes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K5qC5BuzmyF0" + }, + "source": [ + "**Conclusion**: df_estudantes2 is a pointer to df_estudantes." 
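, + "\n", + "\n", + "The same behaviour can be sketched with the `is` operator, which tests whether two names refer to the same object; `df_demo` and `alias` are hypothetical names used only for this illustration.\n", + "\n", + "```python\n", + "import pandas as pd\n", + "\n", + "df_demo = pd.DataFrame({'city': ['Sydney', 'Rio de Janeiro']})\n", + "alias = df_demo  # a second name for the same object, not a copy\n", + "print(alias is df_demo)  # True\n", + "\n", + "# An in-place change made through the alias is visible through both names\n", + "alias.loc[1, 'city'] = 'Sao Paulo'\n", + "print(df_demo.loc[1, 'city'])  # Sao Paulo\n", + "```"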
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZZ50ejRImAQ8" + }, + "source": [ + "## The correct way to copy a dataframe" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oTbzxNkDmQiJ" + }, + "source": [ + "First, let's rebuild df_estudantes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DmVq0vM0mTtQ" + }, + "source": [ + "df_estudantes = pd.DataFrame(d_estudantes)\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oZrlwtqJmYB_" + }, + "source": [ + "Making the copy of the dataframe (**the correct way**):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "No5A7nHDFbsy" + }, + "source": [ + "df_estudantes_Copy = df_estudantes.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NvKNFr8RnEft" + }, + "source": [ + "Let's check the memory addresses of the two dataframes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0_OO90SFki4f" + }, + "source": [ + "id(df_estudantes_Copy)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "T0BibX8rkes5" + }, + "source": [ + "id(df_estudantes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fbm-8cCUFgJa" + }, + "source": [ + "Now the dataframe df_estudantes_Copy is an independent copy of the dataframe df_estudantes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SuL8WUxL-u6-" + }, + "source": [ + "___\n", + "# **Rename dataframe COLUMNS**\n", + "> **Snippet**: \n", + "\n", + " * df.rename(columns = {'Old_Name': 'New_Name'}, inplace = True)\n", + " * OR df = df.rename(columns = {'Old_Name': 'New_Name'})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IvpCfmQnIZKl" + }, + "source": [ + "Suppose I want to rename the COLUMN 'nome' to 'nome_cliente', which is 
um nome mais sugestivo." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o54Fa-yxnmuz" + }, + "source": [ + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FwzXjYJgCvGk" + }, + "source": [ + "df_estudantes= df_estudantes.rename(columns = {'nome': 'nome_cliente'})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gOolGiWt4A18" + }, + "source": [ + "O comando abaixo produz o mesmo resultado que a linha anterior:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y6jjAFRd341e" + }, + "source": [ + "```\n", + "df_estudantes.rename(columns= {'nome': 'nome_cliente'}, inplace = True)\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DwVMldKiF5gS" + }, + "source": [ + "# Mostrando o conteúdo de df_estudantes após renomearmos a coluna/variável 'nome' para 'nome_cliente'...\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m-WZBLWqELOv" + }, + "source": [ + "Agora, suponha que queremos renomear 'age' para 'idade_cliente', 'city' para 'cidade_cliente' e 'country' para 'pais_cliente'..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS6ua4u1EX5g" + }, + "source": [ + "df_estudantes.rename(columns = {'age': 'idade_cliente', 'city': 'cidade_cliente', 'country': 'pais_cliente'}, inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i_7LW07y4SvO" + }, + "source": [ + "O comando abaixo produz o mesmo resultado que a linha anterior (sem inplace = True, pois nesse caso rename() devolve None):" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9X-cv9RL4WjV" + }, + "source": [ + "```\n", + "df_estudantes = df_estudantes.rename(columns= {'age': 'idade_cliente', 'city': 'cidade_cliente', 'country': 'pais_cliente'})\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EOb1-TEKGM9I" + }, + "source": [ + "# Mostrando o conteúdo de df_estudantes após renomearmos várias colunas de uma só vez...\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q0IZZjLRJlU6" + }, + "source": [ + "Alguma dúvida até aqui?\n", + "Tudo bem até aqui?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5LwL2m5KbLYz" + }, + "source": [ + "## Challenge\n", + "* Aplicar o método lower() ao nome de todas as COLUNAS do dataframe df_estudantes. Como fazer isso?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MURfzmeLbUzF" + }, + "source": [ + "### Minha solução:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r-FgBY-3xBi9" + }, + "source": [ + "df_estudantes2 = df_estudantes.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "hlSlfcoub8gH" + }, + "source": [ + "# Guardar o nome das COLUNAS (um objeto Index, que se comporta como uma lista):\n", + "l_colunas = df_estudantes2.columns\n", + "l_colunas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_IGvEK4bdQP" + }, + "source": [ + "# Converter o nome de todas as COLUNAS para minúsculas\n", + "df_estudantes2.columns = [col.lower() for col in l_colunas]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0qzzAa3ycKmF" + }, + "source": [ + "# Mostrando o conteúdo do dataframe df_estudantes2\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c-u-ndMPV_KX" + }, + "source": [ + "___\n", + "# **Adicionar/Acrescentar novas LINHAS ao dataframe**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MDkWbukBLhw7" + }, + "source": [ + "## Usando dicionários\n", + "* É necessário informar {'Column_Name': value} para cada inserção. 
Por exemplo, vou adicionar o seguinte registro ao dataframe:\n", + " * nome_cliente= 'Anderson';\n", + " * idade_cliente= 22;\n", + " * cidade_cliente= 'Porto';\n", + " * pais_cliente= 'Portugal'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GECPO7iyK9UU" + }, + "source": [ + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "XQKqqC93LoQ_" + }, + "source": [ + "df_estudantes_Copia= df_estudantes.copy()\n", + "# DataFrame.append() foi removido no pandas 2.0; pd.concat() é o equivalente atual.\n", + "pd.concat([df_estudantes, pd.DataFrame([{'nome_cliente': 'Anderson',\n", + " 'idade_cliente': 22,\n", + " 'cidade_cliente': 'Porto',\n", + " 'pais_cliente': 'Portugal'}])], ignore_index = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bdBttsHNLjd-" + }, + "source": [ + "Esse é o resultado que desejamos?\n", + "Saberia explicar-nos o que houve de errado?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6jDoq6CCMerp" + }, + "source": [ + "**DICA**: A célula acima devolve um novo dataframe, mas df_estudantes só muda se o resultado lhe for atribuído. Mostre df_estudantes novamente e confira." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ffReAaUHLvEF" + }, + "source": [ + "# Definindo df_estudantes novamente usando a cópia df_estudantes_Copia\n", + "df_estudantes = df_estudantes_Copia.copy()\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EzTo-IvmM2Fg" + }, + "source": [ + "Ok, restabelecemos a cópia de df_estudantes. 
Agora vamos à forma correta:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IRhE76i4M6d6" + }, + "source": [ + "# DataFrame.append() foi removido no pandas 2.0; pd.concat() é o equivalente atual.\n", + "df_estudantes = pd.concat([df_estudantes, pd.DataFrame([{'nome_cliente': 'Anderson',\n", + " 'idade_cliente': 22,\n", + " 'cidade_cliente': 'Porto',\n", + " 'pais_cliente': 'Portugal'}])], ignore_index= True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jAojB2MMNDRJ" + }, + "source": [ + "Bom, esse é o resultado que estávamos à espera..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5czZb-5wNp_F" + }, + "source": [ + "## Usando Series\n", + "* Como exemplo, considere que queremos adicionar os seguintes dados:\n", + " * nome_cliente= 'Bill';\n", + " * idade_cliente= 30;\n", + " * cidade_cliente= 'São Paulo';\n", + " * pais_cliente= 'Brazil'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "J3qCydqMNtGt" + }, + "source": [ + "novo_estudante = pd.Series(['Bill', 30, 'São Paulo', 'Brazil'], index= df_estudantes2.columns) # Olha que interessante: estamos a usar index= df_estudantes2.columns." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_DyMDrNPrmC" + }, + "source": [ + "Vamos ver o conteúdo de novo_estudante:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jDQUl0RBPoLB" + }, + "source": [ + "novo_estudante" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zMKRNQrsPvxp" + }, + "source": [ + "Por fim, vamos adicionar/acrescentar novo_estudante ao dataframe df_estudantes2..." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5mEQg26iPw4A" + }, + "source": [ + "# DataFrame.append() foi removido no pandas 2.0; convertemos a Series numa linha e usamos pd.concat().\n", + "df_estudantes2 = pd.concat([df_estudantes2, novo_estudante.to_frame().T], ignore_index= True)\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Biwk2McAWW1Z" + }, + "source": [ + "___\n", + "# **Adicionar/acrescentar novas COLUNAS ao Dataframe**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EZFTH7A-Wpw5" + }, + "source": [ + "## Usando Lists\n", + "* Suponha que queremos adicionar a coluna/variável 'score'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YzBKQo5epXP5" + }, + "source": [ + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pPoObAKJW6YF" + }, + "source": [ + "# Acrescentando ou criando a coluna/variável 'score' ao dataframe usando um objeto list\n", + "df_estudantes2['score'] = [500, 300, 200, 800, 700]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ocbh8sZqWsoW" + }, + "source": [ + "# Mostra o conteúdo do dataframe df_estudantes2...\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZxfCMcVxYQgL" + }, + "source": [ + "> **Atenção**:\n", + "\n", + "* Se o número de valores da lista for diferente do número de LINHAS do dataframe, o pandas levanta um erro (ValueError).\n", + "* Se a coluna/variável que queremos inserir já existe no dataframe, então os valores serão atualizados com os novos." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "34ntllD_YbNa" + }, + "source": [ + "## Usando um valor default\n", + "* Adicionar a coluna 'total' com o mesmo valor para todas as LINHAS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T7QSMJMQYous" + }, + "source": [ + "df_estudantes['total'] = 500\n", + "df_estudantes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gll-gJt7as3C" + }, + "source": [ + "## Adicionar uma COLUNA calculada a partir de outras COLUNAS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T_pB_isBaw-E" + }, + "source": [ + "df_estudantes2['percentagem'] = 100*(df_estudantes2['score']/df_estudantes2['score'].sum())\n", + "df_estudantes2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D9TNylt84hle" + }, + "source": [ + "___\n", + "# **Ler/carregar dados no Python**\n", + "* Vários formatos de arquivos podem ser lidos:\n", + "\n", + "| Format Type | Data Description | Reader | Writer |\n", + "|---|---|---|---|\n", + "| text | CSV | read_csv | to_csv |\n", + "| text | JSON | read_json | to_json |\n", + "| text | HTML | read_html | to_html |\n", + "| text | Local clipboard | read_clipboard | to_clipboard |\n", + "| binary | MS Excel | read_excel | to_excel |\n", + "| binary | HDF5 Format | read_hdf | to_hdf |\n", + "| binary | Stata | read_stata | to_stata |\n", + "| binary | SAS | read_sas | |\n", + "| binary | Python Pickle Format | read_pickle | to_pickle |\n", + "| SQL | SQL | read_sql | to_sql |\n", + "| SQL | Google Big Query | read_gbq | to_gbq |\n", + "\n", + "* Fonte: [IO tools (text, CSV, HDF5, …)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ss8jLEUSblDm" + }, + "source": [ + "___\n", + "# **Ler/Carregar csv**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n8e9aphab_oe" + }, + "source": [ + "# carregar 
a library Pandas\n", + "import pandas as pd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R2fRd_MSQ2Xa" + }, + "source": [ + "A seguir, vamos:\n", + "* Ler o dataframe Titanic_With_MV.csv;\n", + "* Definir 'PassengerId' como índice/chave da tabela através do parâmetro index_col= 'PassengerId'." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1R9YoFJ02TR7" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/Hayltons/DSWP/master/Dataframes/Titanic_With_MV.csv'\n", + "df_Titanic = pd.read_csv(url, index_col = 'PassengerId')\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS7_V15u0MgR" + }, + "source": [ + "df_Titanic.iloc[4] # iloc[4] seleciona pela posição (5ª linha); loc[4] seleciona pelo rótulo do índice (PassengerId = 4). NÃO É A MESMA COISA!" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WJ9RlRDSkk0_" + }, + "source": [ + "* Segue o dicionário de dados do dataframe df_Titanic:\n", + " * PassengerId: ID do passageiro;\n", + " * survived: Indicador, sendo 1= Passageiro sobreviveu e 0= Passageiro morreu;\n", + " * Pclass: Classe;\n", + " * Age: Idade do Passageiro;\n", + " * SibSp: Número de irmãos/cônjuges a bordo;\n", + " * Parch: Número de pais/filhos a bordo;\n", + " * Fare: Valor pago pelo Passageiro;\n", + " * Cabin: Cabine do Passageiro;\n", + " * Embarked: O porto no qual o Passageiro embarcou;\n", + " * Name: Nome do Passageiro;\n", + " * sex: sexo do Passageiro." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wz7Qd9mqMrfY" + }, + "source": [ + "# Mostrar o dataframe df_Titanic:\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nDlANdnm4iod" + }, + "source": [ + "### DICA 1\n", + "Suponha que o dataframe que queremos ler esteja localizado em:\n", + "\n", + "```\n", + "/home/nsolucoes4ds/Dropbox/Data_Science/Python/Python_RFB/Python_RFB-DS_Python_020919_2244/Dataframes\n", + "```\n", + "\n", + "Desta forma, para ler o dataframe (local), basta usar o comando a seguir:\n", + "\n", + "```\n", + "url = '/home/nsolucoes4ds/Dropbox/Data_Science/Python/Python_RFB/Python_RFB-DS_Python_020919_2244/Dataframes/creditcard.csv'\n", + "df_Titanic = pd.read_csv(url)\n", + "```\n", + "\n", + "### DICA 2\n", + "No Windows, o diretório aparece, por exemplo, da seguinte forma: \n", + "```\n", + "C:\\nsolucoes4ds\\Data_Science\n", + "```\n", + "Observe as '\\\\' (**barras invertidas**). Neste caso, use o comando a seguir:\n", + "\n", + "```\n", + "url= r'C:\\nsolucoes4ds\\Data_Science\\creditcard.csv'\n", + "df_Titanic = pd.read_csv(url)\n", + "```\n", + "\n", + "Percebeu o r'diretorio'? O prefixo r cria uma raw string, na qual as barras invertidas não são tratadas como caracteres de escape." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HubfewY8NgUv" + }, + "source": [ + "___\n", + "# **Corrigir (ou uniformizar) nome das COLUNAS**\n", + "* Por exemplo, reescrever o nome das COLUNAS usando o método lower()." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4f_pEEOjvwjk" + }, + "source": [ + "Para facilitar nossas análises, vamos aplicar o método lower() em todos os valores das COLUNAS objects/strings. 
Para isso, considere a função abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ft13IahH1kVX" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "G-UlaHFPv7kp" + }, + "source": [ + "def transformacao_lower(df):\n", + " # Atenção: a função altera o dataframe recebido (in place) e não retorna nada.\n", + " # Primeira transformação: Aplicar lower() nos nomes das COLUNAS:\n", + " df.columns = [col.lower() for col in df.columns]\n", + "\n", + " # Segunda transformação: Aplicar o método .str.lower() nos valores das COLUNAS object/strings:\n", + " l_cols_objeto = df.select_dtypes(include = ['object']).columns\n", + " \n", + " for col in l_cols_objeto:\n", + " df[col] = df[col].str.lower()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hNixsW8M7n1X" + }, + "source": [ + "Para saber mais sobre o método df[col].str.lower(), consulte [pandas.Series.str.lower](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.lower.html)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hz90zejtbxYj" + }, + "source": [ + "transformacao_lower(df_Titanic)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UE5P1W-CPePM" + }, + "source": [ + "# **Selecionar um subconjunto de colunas**\n", + "Suponha que eu queira selecionar somente as colunas 'name' e 'sex' (os nomes das colunas já estão em minúsculas)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P7HJa4x7P0bQ" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3jLZUCfePsBs" + }, + "source": [ + "df_Titanic2 = df_Titanic[['name', 'sex']]\n", + "df_Titanic2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PyNsYTilnL2r" + }, + "source": [ + "# map()\n", + "> Artifício para transformar os valores de uma coluna utilizando um dicionário: {'key': valor}. Valores que não aparecem no dicionário tornam-se NaN." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6z4FcyyAiTfF" + }, + "source": [ + "# Construindo uma variável mais intuitiva para nos ajudar nas análises:\n", + "df_Titanic['survived2'] = df_Titanic['survived']\n", + "df_Titanic['survived2'] = df_Titanic['survived2'].map({0:'died', 1:'survived'})\n", + "df_Titanic[['survived', 'survived2']].head(3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jwBWkaJOdhCv" + }, + "source": [ + "___\n", + "# **Selecionar COLUNAS do dataframe**\n", + "* Suponha que queremos selecionar somente as COLUNAS 'survived', 'sex' e 'embarked':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ivvj8JU2pBTq" + }, + "source": [ + "df_Titanic2 = df_Titanic[['survived', 'sex', 'embarked']]\n", + "df_Titanic2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nf-Wnof_fdTR" + }, + "source": [ + "___\n", + "# **Criar um dicionário a partir de um dataframe**\n", + "> Suponha o dataframe-exemplo a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lxf6Lgp4fit8" + }, + "source": [ + "df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l7yzJu1y5huV" + }, + "source": [ + 
"De dataframe para Dicionário..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_6V0qFZGhEoF" + }, + "source": [ + "df.to_dict('dict')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0GIe6xtqPA1Z" + }, + "source": [ + "___\n", + "# **Criar uma lista a partir de um dataframe**\n", + "> Suponha o dataframe-exemplo a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fZxgejTtPLzX" + }, + "source": [ + "df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JoShm6oF5qLV" + }, + "source": [ + "De dataframe para Lista (o resultado é um dicionário em que cada coluna vira uma lista)..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gigPpSH_hlXu" + }, + "source": [ + "df.to_dict('list')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GpJDX-5xUUC0" + }, + "source": [ + "___\n", + "# **Mostrar as primeiras k LINHAS do dataframe**\n", + "> df.head(k), onde k é o número de LINHAS que queremos visualizar. Por default, k= 5." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RwC9j_OxUbIR" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G9cp2QrsA5M0" + }, + "source": [ + "___\n", + "# **Mostrar as últimas k LINHAS do dataframe**\n", + "> df.tail(k), onde k é o número de LINHAS que queremos ver. Por default, k= 5." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9mPxyhqoA4Wc" + }, + "source": [ + "df_Titanic.tail()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Odwm2qSLA_Ro" + }, + "source": [ + "Por default, df.tail() mostra as últimas 5 LINHAS/instâncias do dataframe. 
Entretanto, pode-se ver qualquer número de LINHAS k, como, por exemplo, k= 10 mostrado abaixo." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pUAnR00WA8ma" + }, + "source": [ + "df_Titanic.tail(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cZ64LfWv4zxo" + }, + "source": [ + "___\n", + "# **Mostrar o nome das COLUNAS do dataframe**\n", + "* df.columns" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CKUUrX5n4zFW" + }, + "source": [ + "df_Titanic.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6m7ukrOu5Inv" + }, + "source": [ + "___\n", + "# **Mostrar os tipos das COLUNAS do dataframe**\n", + "* Propriedade: df.dtypes --> Não há parênteses!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "S4NIHAPPl9lc" + }, + "source": [ + "df_Titanic.dtypes # dtypes é uma propriedade, portanto não requer \"()\". Os métodos, por outro lado, requerem \"(arg1, arg2, ..., argN)\"" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DGc6m-UBdHlE" + }, + "source": [ + "___\n", + "# **Selecionar automaticamente as COLUNAS do dataframe pelo tipo**\n", + "> snippet: df.select_dtypes(include=[tipo]).columns\n", + "\n", + "| Tipo | O que seleciona | Sintaxe |\n", + "|------|-----------------|---------|\n", + "| number | colunas do tipo numéricas | df.select_dtypes(include=['number']).columns |\n", + "| float | colunas do tipo float | df.select_dtypes(include=['float']).columns |\n", + "| bool | colunas do tipo booleanas | df.select_dtypes(include=['bool']).columns |\n", + "| object | colunas do tipo categóricas/strings | df.select_dtypes(include=['object']).columns |\n", + "\n", + "* Se quisermos selecionar mais de um tipo, basta informar a lista de tipos. 
\n", + " * Exemplo: df.select_dtypes(include=['object', 'float']).columns\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O88YRCqIdYFL" + }, + "source": [ + "## Selecionar automaticamente as COLUNAS Numéricas do dataframe" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xG4a9ZfRnxPW" + }, + "source": [ + "### Lista com as COLUNAS numéricas do dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C87uga35dKsF" + }, + "source": [ + "l_cols_numericas = df_Titanic.select_dtypes(include = ['number']).columns # \".columns\" retorna a lista de colunas numéricas\n", + "l_cols_numericas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5W6kbIVNn2UA" + }, + "source": [ + "### DataFrame com as COLUNAS numéricas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iTieUd_-eDmW" + }, + "source": [ + "df_numericas = df_Titanic.select_dtypes(include = ['number']) # Atenção: aqui não temos .columns --> Neste caso, o retorno será o dataframe.\n", + "df_numericas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xh4BFs_lds80" + }, + "source": [ + "## Selecionar automaticamente as COLUNAS float do dataframe" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tw3FD74MoC6q" + }, + "source": [ + "### Lista com as COLUNAS float:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5clAUAIrd3UR" + }, + "source": [ + "l_cols_float = df_Titanic.select_dtypes(include = ['float']).columns\n", + "l_cols_float" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IZPROG6IoHwy" + }, + "source": [ + "### DataFrame com as COLUNAS float:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "osJDsyMHeXX4" + }, + "source": [ + "df_float = df_Titanic.select_dtypes(include = ['float']) # Atenção: aqui não temos .columns\n", + 
"df_float.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5uObezIIfuJ4" + }, + "source": [ + "## Selecionar automaticamente as COLUNAS Booleanas do dataframe" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xMKP5HhgoeMg" + }, + "source": [ + "### Lista com as COLUNAS Booleanas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3Pn2IPBkf7k-" + }, + "source": [ + "l_cols_booleanas = df_Titanic.select_dtypes(include = ['bool']).columns\n", + "l_cols_booleanas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k3sdiuXYokBE" + }, + "source": [ + "### DataFrame com as COLUNAS Booleanas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Oem-M-17f7lG" + }, + "source": [ + "df_booleanas = df_Titanic.select_dtypes(include=['bool']) # Atenção: aqui não temos .columns\n", + "df_booleanas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ObHYW92-gOXz" + }, + "source": [ + "## Selecionar automaticamente as COLUNAS do tipo string (object)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IzM5CIKXoxHO" + }, + "source": [ + "### Lista com as COLUNAS do tipo object/string:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rdYThBingOX1" + }, + "source": [ + "l_cols_objeto = df_Titanic.select_dtypes(include=['object']).columns\n", + "l_cols_objeto" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2ZGB5d36o21t" + }, + "source": [ + "### DataFrame com as COLUNAS do tipo Object/String:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kWTtxeU4gOX4" + }, + "source": [ + "df_cols_obs = df_Titanic.select_dtypes(include=['object']) # Atenção: aqui não temos .columns\n", + "df_cols_obs.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "SEBKHKRLkbUK" + }, + "source": [ + "___\n", + "# **Reordenar as COLUNAS do dataframe**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XRWfelWEkhae" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KBGDeR_JkyCc" + }, + "source": [ + "* Suponha que queremos reordenar as COLUNAS do dataframe df_Titanic em ordem alfabética, conforme abaixo:\n", + " * age;\n", + " * embarked;\n", + " * fare;\n", + " * parch;\n", + " * pclass;\n", + " * sex;\n", + " * sibsp;\n", + " * survived." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "d9jJi6qllnq_" + }, + "source": [ + "# Dataframe ordenado\n", + "df_Titanic = df_Titanic.reindex(sorted(df_Titanic.columns), axis = 1)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cj4MREti-izC" + }, + "source": [ + "___\n", + "# **Mostrar a dimensão do dataframe**\n", + "* Dimensão = Número de LINHAS e COLUNAS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "50Tij93l-n7B" + }, + "source": [ + "df_Titanic.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZQo4YeH_-qfL" + }, + "source": [ + "Qual a interpretação?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "klHcwpPEALP8" + }, + "source": [ + "## **Quebrar a dimensão em duas partes: número de LINHAS e COLUNAS**\n", + "* Número de linhas do dataframe.: df_Titanic.shape[0]\n", + "* Número de colunas do dataframe: df_Titanic.shape[1]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qjR8OEdDAOog" + }, + "source": [ + "f'O dataframe df_Titanic possui {df_Titanic.shape[0]} linhas e {df_Titanic.shape[1]} colunas.'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pIsf_nDtyAvF" + }, + "source": [ + "___\n", + "# **Combinar dataframes: Merge, Join & Concatenate**\n", + "* Fonte: [Merge, join, and concatenate](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s1fSplrlEMHK" + }, + "source": [ + "* A seguir, três formas para combinar dataframes:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6DYtWxuIrdzF" + }, + "source": [ + "## Concatenate\n", + "* Une/empilha dataframes\n", + "* Fonte: https://github.com/aakankshaws/Pandas-exercises" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nnP5VuWkri_b" + }, + "source": [ + "import pandas as pd\n", + "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", + " 'B': ['B0', 'B1', 'B2', 'B3'],\n", + " 'C': ['C0', 'C1', 'C2', 'C3'],\n", + " 'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rkJvSGYSrm8b" + }, + "source": [ + "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n", + " 'B': ['B4', 'B5', 'B6', 'B7'],\n", + " 'C': ['C4', 'C5', 'C6', 'C7'],\n", + " 'D': ['D4', 'D5', 'D6', 'D7']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NCgdYvJIrqx1" + }, + "source": [ + "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n", + " 'B': ['B8', 
'B9', 'B10', 'B11'],\n", + " 'C': ['C8', 'C9', 'C10', 'C11'],\n", + " 'D': ['D8', 'D9', 'D10', 'D11']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gUoyjyjur5Zn" + }, + "source": [ + "df1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xU6Rh10Gr7NA" + }, + "source": [ + "df2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qKwmOWsQr9wA" + }, + "source": [ + "df3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-MNn-XdlsjJS" + }, + "source": [ + "df= pd.concat([df1, df2, df3])\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BV6HgxSYtG6Z" + }, + "source": [ + "Veja que basicamente empilhamos os dataframes. No entanto, se fizermos..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Dp-oh-7ftLo5" + }, + "source": [ + "df = pd.concat([df1, df2, df3], axis = 1) # axis = 1 é uma operação de coluna\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iyDZt2XEtmVs" + }, + "source": [ + "Se, no entanto, tivermos:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5PAhjjVZtpP5" + }, + "source": [ + "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", + " 'B': ['B0', 'B1', 'B2', 'B3'],\n", + " 'C': ['C0', 'C1', 'C2', 'C3'],\n", + " 'D': ['D0', 'D1', 'D2', 'D3']},\n", + " index=[0, 1, 2, 3])\n", + "\n", + "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n", + " 'B': ['B4', 'B5', 'B6', 'B7'],\n", + " 'C': ['C4', 'C5', 'C6', 'C7'],\n", + " 'D': ['D4', 'D5', 'D6', 'D7']},\n", + " index=[4, 5, 6, 7])\n", + "\n", + "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n", + " 'B': ['B8', 'B9', 'B10', 'B11'],\n", + " 'C': ['C8', 'C9', 'C10', 'C11'],\n", + " 'D': ['D8', 'D9', 'D10', 
'D11']},\n", + " index=[8, 9, 10, 11])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zGDHd-kPt3-T" + }, + "source": [ + "Então..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3bTl2Nr2t5WM" + }, + "source": [ + "df = pd.concat([df1, df2, df3], axis = 1)\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sUXjlp_Jt925" + }, + "source": [ + "Por que isso acontece? (Dica: com axis = 1, o concat alinha as LINHAS pelo índice; índices que não existem num dos dataframes são preenchidos com NaN.)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JdKXY873HrYt" + }, + "source": [ + "## Merge\n", + "> Primeiramente, vamos ver todos os casos possíveis de joins.\n", + "\n", + "### Exemplo\n", + "> O exemplo a seguir foi inspirado no exemplo apresentado em [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins). Considere os dataframes a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g4pmhk2t3x8s" + }, + "source": [ + "import pandas as pd\n", + "\n", + "d_Tabela_A = {'indices': [1,2,3,6,7,5,4,10], 'valores': ['A','B','C','D','E','F','G','H']}\n", + "d_Tabela_B = {'indices': [1,2,3,6,7,8,9,11], 'valores': ['AA', 'BB','CC','DD','EE','FF','GG','HH']}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "XxfUULxY52ns" + }, + "source": [ + "df_conjunto_A = pd.DataFrame(d_Tabela_A).set_index('indices')\n", + "df_conjunto_B = pd.DataFrame(d_Tabela_B).set_index('indices')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gGdU36Vi0Yso" + }, + "source": [ + "![SQL_inner_join](https://github.com/MathMachado/Materials/blob/master/SQL_inner_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5w7ox7LV9cuG" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TPhmKw-F9fWX" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5AaTlCPy9FBZ" + }, + "source": [ + "df_inner_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'inner')\n", + "df_inner_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U3OjFM0E0af-" + }, + "source": [ + "![SQL_left_join](https://github.com/MathMachado/Materials/blob/master/SQL_left_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-efYd9c69k4L" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "SqFbNStz9k4S" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rUpc2k729KA-" + }, + "source": [ + "df_left_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'left')\n", + "df_left_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WioSBHjW06Hg" + }, + "source": [ + "![SQL_right_join](https://github.com/MathMachado/Materials/blob/master/SQL_right_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IrzPjGNp9o2n" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tFFTp_yG9o2s" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_D4tF7E-9PCx" + }, + "source": [ + "df_right_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'right')\n", + "df_right_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E9xFrurZ0ksg" + }, + "source": [ + "![SQL_outer_join](https://github.com/MathMachado/Materials/blob/master/SQL_outer_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kQCBAfj_9rO_" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FTDHYsgc9rP0" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "hJqyAs0U9XwO" + }, + "source": [ + "df_outer_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'outer')\n", + "df_outer_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fHEgLynu0vve" + }, + "source": [ + "![SQL_left_excluding_join](https://github.com/MathMachado/Materials/blob/master/SQL_left_excluding_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZA8CcERE-RRS" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "IZiAa9X6-UL0" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jdUt63rA-Vjo" + }, + "source": [ + "df_left_excluding_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how =\"outer\", indicator=True).query('_merge==\"left_only\"')\n", + "df_left_excluding_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CShcqL-h1MqK" + }, + "source": [ + "![SQL_right_excluding_join](https://github.com/MathMachado/Materials/blob/master/SQL_right_excluding_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ECjUDoYf_C9x" + }, + "source": [ + "df_conjunto_A" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xym7VsXi_FXa" + }, + "source": [ + "df_conjunto_B" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-zFalmly_HJ7" + }, + "source": [ + "df_right_excluding_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how =\"outer\", indicator=True).query('_merge==\"right_only\"')\n", + "df_right_excluding_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T8v4-zUt1WQz" + }, + "source": [ + "![SQL_outer_excluding_join](https://github.com/MathMachado/Materials/blob/master/SQL_outer_excluding_join.png?raw=true)\n", + "\n", + "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iVEQFqx8Hdu5" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "8HeMgBqyAYjW" + }, + "source": [ + "### Desafio: Como resolver este?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "djYoI_eUHD71" + }, + "source": [ + "df_right_left_excluding_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how =\"outer\", indicator=True).query(\n", + " '_merge==\"left_only\" | _merge==\"right_only\"')\n", + "df_right_left_excluding_join" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SkCbLsoktgKl" + }, + "source": [ + "### Observações:\n", + "\n", + "* Em alguns casos a variável chave nos dois dataframes que se quer fazer o join possui nomes diferentes. Neste caso, use 'left_on' e 'right_on' para definir o nome das COLUNAS chaves no dataframe da esquerda e direita:\n", + " * pd.merge(df1, df2, left_on =\"employee\", right_on =\"nome\")\n", + " * No exemplo acima, o dataframe df1 (dataframe da esquerda) possui chave 'employee' enquanto que o dataframe df2 (dataframe da direita), possui chave 'nome'. Usando as 'left_on' e 'right_on' fica claro o nome das chaves de ligação de cada dataframe." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Obc0fHUwIpu" + }, + "source": [ + "## Joining" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DQOa89_cwLyd" + }, + "source": [ + "df_esquerdo = pd.DataFrame({'A': ['A0', 'A1', 'A2'],\n", + " 'B': ['B0', 'B1', 'B2']},\n", + " index=['K0', 'K1', 'K2']) \n", + "\n", + "df_direito = pd.DataFrame({'C': ['C0', 'C2', 'C3'],\n", + " 'D': ['D0', 'D2', 'D3']},\n", + " index=['K0', 'K2', 'K3'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "UHnX9rxzwMmx" + }, + "source": [ + "df_esquerdo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GBc1Mr0Qwff3" + }, + "source": [ + "df_direito" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TmIk3Kjlwg-7" + }, + "source": [ + "df_esquerdo.join(df_direito)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h609fbjjwoZ3" + }, + "source": [ + "df_esquerdo.join(df_direito, how ='outer')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y8W2kP-VCB3E" + }, + "source": [ + "___\n", + "# **Selecting ROWS of the dataframe by index**\n", + "### Further Reading\n", + "* [pandas loc vs. iloc vs. ix vs. at vs. iat?\n", + "](https://stackoverflow.com/questions/28757389/pandas-loc-vs-iloc-vs-ix-vs-at-vs-iat/47098873#47098873)\n", + "* [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NN1R1ngAG61x" + }, + "source": [ + "## 1st Approach - df.loc[]\n", + "* To get the contents of row k, use df.loc[row_indexer, column_indexer]."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oduXMUtIUvkN" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JX9nGPWcVLgE" + }, + "source": [ + "\n", + "For example, the following command shows the contents of row 1, all COLUMNS (:)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U5-I2NgYC2fD" + }, + "source": [ + "df2= df_Titanic.loc[1,:]\n", + "df2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tDSJcQLTDyJw" + }, + "source": [ + "Showing the contents of ROWS k = 1:2 (that is, ROWS 1 and 2, since .loc slices include both endpoints), all COLUMNS (:)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JD1TDTqAD_5r" + }, + "source": [ + "df_Titanic.loc[1:2, :]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EoAmcdfnEIho" + }, + "source": [ + "Show the contents of row k = 1, column 'pclass':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8vjc5z3_EQfY" + }, + "source": [ + "df_Titanic.loc[1, ['pclass']]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7bC8-H-QFLgd" + }, + "source": [ + "Show the contents of row k = 1 and COLUMNS ['pclass', 'sex']:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LYFTrZr_FR5g" + }, + "source": [ + "df_Titanic.loc[0, ['pclass', 'sex']]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UtUsmU8sXYTU" + }, + "source": [ + "Why do we get an error here?"
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CRy5sDx-XbBL" + }, + "source": [ + "Correct version below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5Lfw0HEnXdn0" + }, + "source": [ + "df_Titanic.loc[1, ['pclass', 'sex']]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tjw3vjkDZg1Z" + }, + "source": [ + "Show the contents of ROWS k = 1:5 and COLUMNS ['pclass', 'sex']:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4GuAE5MSZjNb" + }, + "source": [ + "df_Titanic.loc[1:5, ['pclass', 'sex']]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xRZxqE6RFnJI" + }, + "source": [ + "Now suppose we want to select the entire 'sex' column. How do we do that?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JdeD_uzfFrp5" + }, + "source": [ + "df_sex= df_Titanic.loc[:, 'sex']\n", + "df_sex.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z_WUjYxsX-Av" + }, + "source": [ + "Selecting what we want with .loc[] and .iloc[] is easy, right?"
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RKk0zollHFbp" + }, + "source": [ + "## 2nd Approach - Using slices with []\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jhwoY6LmGzC0" + }, + "source": [ + "df_Titanic[0:2] # Show ROWS at positions 0 and 1 (the end of the slice, 2, is excluded)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I6EOVIDxGiy-" + }, + "source": [ + "df_Titanic[:3] # Show the first 3 ROWS (positions 0 to 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VOHp77F8H9t1" + }, + "source": [ + "df_Titanic['sex'].head() # Show the first values of the 'sex' column" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8nvHNdhPZ040" + }, + "source": [ + "df_Titanic[0:5]['sex'].head() # Show ROWS at positions 0 to 4 of the 'sex' column" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GMFso1jaYXgN" + }, + "source": [ + "___\n", + "# **Selecting/Filtering/Replacing ROWS of the dataframe based on conditions**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BKljSpS5ou-i" + }, + "source": [ + "## Example 1\n", + "> Building on the previous example, we want to select only the passengers whose sex is 'male'."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jek8Ru3Aam23" + }, + "source": [ + "### Approach 1: df.loc[]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eysZoBX2YKb-" + }, + "source": [ + "df_sexo_m_1 = df_Titanic.loc[df_Titanic['sex'] == 'male', 'sex']\n", + "df_sexo_m_1.head() " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uLDOHKGfaq-Z" + }, + "source": [ + "### Approach 2: Using []" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QncrZwHkasiu" + }, + "source": [ + "df_sexo_m_2 = df_Titanic[df_Titanic['sex'] == 'male']['sex']\n", + "df_sexo_m_2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ot6UBTYJF-AJ" + }, + "source": [ + "### Approach 3: df.isin()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OBRF0be3VuTi" + }, + "source": [ + "#### Example 1 - Simple filter" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LeTDiGICGOzb" + }, + "source": [ + "# isin() returns a boolean mask; use df_Titanic[df_sexo_m_3] to get the filtered ROWS\n", + "df_sexo_m_3 = df_Titanic['sex'].isin(['male'])\n", + "df_sexo_m_3.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q6emu30nGmpt" + }, + "source": [ + "#### Example 2 - Double filter = Two conditions\n", + "> Select all ROWS where sex = 'male' and pclass = 1."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TRaiCYMRGpgl" + }, + "source": [ + "# Filters using isin() \n", + "filtro_m = df_Titanic[\"sex\"].isin([\"male\"]) \n", + "filtro_class1 = df_Titanic[\"pclass\"].isin([1]) \n", + " \n", + "# Show the results \n", + "df_Titanic[filtro_m & filtro_class1].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Sh0DDj1xcPaI" + }, + "source": [ + "df_sexo_m_class = df_Titanic[(df_Titanic['sex'] == 'male') & (df_Titanic['pclass'] == 1)]\n", + "df_sexo_m_class.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ujrYHyOsfW7n" + }, + "source": [ + "### Approach 4 - Filtering with df[col].str.contains('s_substr')" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gntbfHgTfanx" + }, + "source": [ + "# Show all ROWS where the string 'Mr' appears in the passenger's name (the match is case-sensitive and also matches 'Mrs'):\n", + "df2 = df_Titanic[df_Titanic['name'].str.contains('Mr')]\n", + "df2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eaRtQ8Ja8MOH" + }, + "source": [ + "To learn more about the df[col].str.contains() method, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FyJ-gEjzQI2Y" + }, + "source": [ + "## Replacing values in the dataframe\n", + "> Suppose we want to replace all the values of pclass as follows:\n", + "* If pclass = 1 --> pclass2 = 'Classe1';\n", + "* If pclass = 2 --> pclass2 = 'Classe2';\n", + "* If pclass = 3 --> pclass2 = 'Classe3';\n", + "\n", + "How do we do that?"
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Pi8MFiUPQQb7" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "19mynzdfQqVf" + }, + "source": [ + "df_Titanic['pclass2'] = df_Titanic['pclass'].astype(object) # object dtype so the column can hold the string labels\n", + "df_Titanic.loc[df_Titanic['pclass'] == 1, 'pclass2'] = 'Classe1'\n", + "df_Titanic.loc[df_Titanic['pclass'] == 2, 'pclass2'] = 'Classe2'\n", + "df_Titanic.loc[df_Titanic['pclass'] == 3, 'pclass2'] = 'Classe3'\n", + "df_Titanic['pclass2'].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JC9z602zimyY" + }, + "source": [ + "df_Titanic['pclass3'] = df_Titanic['pclass']\n", + "df_Titanic['pclass3'] = df_Titanic['pclass3'].map({1:'Classe1',2:'Classe2',3:'Classe3'})\n", + "df_Titanic['pclass3'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KVSAYeU0KA2V" + }, + "source": [ + "___\n", + "# **Selecting random samples from the dataframe**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U502dAs3OfOH" + }, + "source": [ + "We saw that the df_Titanic dataframe is large, so let's randomly select 100 ROWS."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0BrKUnAiPcAy" + }, + "source": [ + "import random \n", + "\n", + "# Library used to measure the processing time of each alternative\n", + "import time" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iJ1G8lYgKGsc" + }, + "source": [ + "# Using sample\n", + "t0= time.time()\n", + "df_Titanic_a100= df_Titanic.sample(100, replace= False, random_state= 20111974)\n", + "t1= time.time()\n", + "t= t1-t0\n", + "df_Titanic_a100.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8DvWOKizZQr8" + }, + "source": [ + "f'Processing time (seconds): {t}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nAHLTjpvYKPS" + }, + "source": [ + "# Using NumPy\n", + "import numpy as np\n", + "\n", + "t0 = time.time()\n", + "np.random.seed(20111974)\n", + "indices = np.random.choice(df_Titanic.shape[0], replace = False, size=100)\n", + "df_Titanic_a100_2 = df_Titanic.iloc[indices]\n", + "t1 = time.time()\n", + "t = t1-t0\n", + "df_Titanic_a100_2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U8PEDMJ4a52P" + }, + "source": [ + "f'Processing time (seconds): {t}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wYeuJWdEdMPd" + }, + "source": [ + "df_Titanic_a100_2.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vNMiRkjCQ9Mu" + }, + "source": [ + "___\n", + "# **Describing the Dataframe**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GllUFj56RHuD" + }, + "source": [ + "df_Titanic_a100.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "izbpIEi1d1sx" + }, + "source": [ + 
"df_Titanic_a100_2.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H40G3QzWbG9N" + }, + "source": [ + "___\n", + "# **Identificar e lidar com LINHAS duplicadas**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_OoM_HS5ZgxG" + }, + "source": [ + "## Exemplo 1\n", + "* considera as duplicatas em todas as COLUNAS do dataframe." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5XOOdOZBbLc_" + }, + "source": [ + "df = pd.DataFrame({'A':[1,1,3,4,5,1], 'B':[1,1,3,7,8,1], 'C':[3,1,1,6,7,1]})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gio08BkTbTOp" + }, + "source": [ + "# Lista as duplicações em forma booleana\n", + "df.duplicated()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "obgbM4d_hJ_J" + }, + "source": [ + "Observe a linha 5, onde temos a informação que esta linha está duplicada. 
In fact, row 5 is identical to row 1." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LHhOIb-EbWfn" + }, + "source": [ + "# Show the duplicated ROWS\n", + "df[df.duplicated()]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IyJS70_kZ-Jk" + }, + "source": [ + "# Drop row 5 which, as we saw, was a duplicate (a copy of row 1).\n", + "df= df.drop_duplicates()\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Q05mxOSaEjX" + }, + "source": [ + "## Example 2\n", + "* Considers only some COLUMNS." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jiqyjcqdaQ1y" + }, + "source": [ + "df = pd.DataFrame({'A':[1,1,3,4,5,1], 'B':[1,1,3,7,8,1], 'C':[3,1,1,6,7,1]})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "F_118d7vbZ9Y" + }, + "source": [ + "# Show the ROWS that are duplicated with respect to COLUMNS 'A' and 'B'\n", + "df[df.duplicated(subset=['A','B'])]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_1w_ZZO4vF3A" + }, + "source": [ + "# Drop ROWS 1 and 5 because, as we can see, they duplicate row 0 in COLUMNS 'A' and 'B'\n", + "df= df.drop_duplicates(subset = ['A', 'B'])\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qVx6p8u36jhD" + }, + "source": [ + "___\n", + "# **Working with text data**\n", + "* Sources:\n", + "  * [Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)\n", + "  * [Using String Methods](https://www.ritchieng.com/pandas-string-methods/)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JLG3cVA1e8-B" + }, + "source": [ + "Preparing the data for the example:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "G_CEULoyeP8C" + }, + "source": [ + "# Define a dictionary with the data: \n", + "import numpy as np\n", + "\n", + "l_idade = []\n", + "for i in range(6):\n", + "    np.random.seed(i) \n", + "    l_idade.append(np.random.randint(10, 40))\n", + " \n", + "\n", + "d_exemplo = {'Nome':['Mr. Antonio dos Santos', 'Mr. Joao Pedro', 'Miss. Priscila Alvarenga', 'Mr. fagner NoVAES', 'Miss. Danielle Aparecida', 'Mr. Paullo Amarantes'], \n", + " 'Idade': l_idade, \n", + " 'Cidade':['lisboa', 'Sintra', 'Braga', 'Guimaraes', 'Mafra', 'Nazare']} \n", + " \n", + "# Convert the dictionary into a dataframe\n", + "df = pd.DataFrame(d_exemplo) \n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "or-Kzaqmdn2b" + }, + "source": [ + "* Suggestions for what we can do with the 'Nome' column of the dataframe df:\n", + "  * Extract the title (form of address) from the name: Mr., Miss, etc.\n", + "  * Build the primeiro_nome and ultimo_nome COLUMNS.\n", + "  * Create the classe_idade variable." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vd99ksvcg7uy" + }, + "source": [ + "## Extracting the title from the name" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rNsANzFAg_Kn" + }, + "source": [ + "df_Nome= df['Nome'].str.split(' ', n = 2, expand = True) \n", + "df_Nome" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ianqsxLol008" + }, + "source": [ + "Change the value of n to 3 and explain what happens..."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5NDAkEqCl6H5" + }, + "source": [ + "# Capturing the title from the name:\n", + "df['Tratamento_Nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[0]\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B1QoH4LyrpVI" + }, + "source": [ + "## Building the primeiro_nome and ultimo_nome COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cbi4eRN2mOu9" + }, + "source": [ + "# Capturing the first name and the last name:\n", + "df['primeiro_nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[1]\n", + "df['ultimo_nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[2]\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7eagWhgZrwOh" + }, + "source": [ + "### Building the classe_idade variable\n", + "\n", + " | Lower Bound | Upper Bound | Class |\n", + " |-----------------|-----------------|--------|\n", + " | Inf | 15 | Inf_15 |\n", + " | 15 | 20 | 15_20 |\n", + " | 20 | 30 | 20_30 |\n", + " | 30 | 40 | 30_40 |\n", + " | 40 | 50 | 40_50 |\n", + " | 50 | Sup | 50_Sup |" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lBjRBGBWr2AH" + }, + "source": [ + "def classe_idade(Idade):\n", + "    if (Idade <= 15):\n", + "        return 'Inf_15'\n", + "    elif (15 < Idade <= 20):\n", + "        return '15_20'\n", + "    elif (20 < Idade <= 30):\n", + "        return '20_30'\n", + "    elif (30 < Idade <= 40):\n", + "        return '30_40'\n", + "    elif (40 < Idade <= 50):\n", + "        return '40_50'\n", + "    elif (Idade > 50):\n", + "        return '50_Sup'\n", + "    else:\n", + "        return 'Outros'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OogrvjCrsdoh" + }, + "source": [ + "df['classe_idade'] = df['Idade'].map(classe_idade)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "57tzJCS0p_G4" + }, + 
"source": [ + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JDtxz_eaRcmi" + }, + "source": [ + "___\n", + "# **Agrupar Informações: pd.groupby()**\n", + "* Fonte: [Group By: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)\n", + "\n", + "* Os componentes do comando Groupby()\n", + " * **Grouping_Column** - Coluna Categórica pelo qual os dados serão agrupados;\n", + " * **Aggregating_Column** - Coluna numérica cujos valores serão agrupados;\n", + " * **Aggregating_Function** - Função agregadora, ou seja: sum, min, max, mean, median, etc...\n", + "\n", + "> Sintaxe: \n", + "\n", + "```\n", + "df.groupby('Grouping_Column').agg({'Aggregating_Column': 'Aggregating_Function'})\n", + "\n", + "OU\n", + "\n", + "df['Aggregating_Column'].groupby(df['Grouping_Column']).Function()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bmFf-273XPXj" + }, + "source": [ + "## Exemplo 1" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wteEveUsd36C" + }, + "source": [ + "transformacao_lower(df_Titanic)\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "buF5DhkFfqVA" + }, + "source": [ + "# Agrupando df_Titanic por 'sex'\n", + "df_Titanic.groupby(['sex', 'pclass']).agg({'fare': ['min', 'median', 'mean','max'], 'age': ['count', 'mean','max']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YP3GDwq0gR_V" + }, + "source": [ + "# Agrupando df_Titanic por 'sex' e 'Pclass'\n", + "df_Titanic.groupby(['sex','pclass']).agg({'fare': ['max', 'min']}).round(0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "se4tQ3ETeUfv" + }, + "source": [ + "df_Titanic.groupby(['sex']).agg({'age': ['mean','min','max']}).round(0)" + ], + "execution_count": null, + 
"outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zUj82I7Cm220" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OrLZjm9bXTOr" + }, + "source": [ + "## Exemplo 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x8aPZPT6XZVP" + }, + "source": [ + "### Preparando o exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KrCe6RgOXaFx" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "l_coluna = []\n", + "\n", + "for i in range(1,6):\n", + " np.random.seed(i)\n", + " l_coluna.append(np.random.randint(0, 10, 10))\n", + " \n", + "np.random.seed(6)\n", + "l_coluna.append(np.random.rand(10))\n", + "\n", + "l_coluna" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tXaHjmfSXeCw" + }, + "source": [ + "l_coluna[0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U_aEVMTHq6ee" + }, + "source": [ + "df = pd.DataFrame({'coluna6' : ['a', 'a', 'b', 'b', 'a', 'b', 'b', 'b', 'a', 'a'],\n", + " 'coluna7' : ['um', 'dois', 'um', 'dois', 'um', 'dois', 'dois', 'um', 'um', 'dois'],\n", + " 'coluna1' : l_coluna[0],\n", + " 'coluna2' : l_coluna[1],\n", + " 'coluna3' : l_coluna[2],\n", + " 'coluna4' : l_coluna[3],\n", + " 'coluna5' : l_coluna[4],\n", + " 'coluna8' : l_coluna[5],\n", + " 'Pessoas' : ['Jose','Maria','Pedro','Carlos','Joao','Ana','Manoel','Mafalda','Antonio','Ricardo'],\n", + " 'sexo' : ['m','f','m','m','m','f','m','f','m','m']})\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ok4a28lGlVC5" + }, + "source": [ + "Agrupando por 'coluna6':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vx77lyzlZIFW" + }, + "source": [ + "df.groupby('coluna6').agg({'coluna1': ['min','mean','median','max']})" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T6i-R2KemadE" + }, + "source": [ + "Agora, vamos repetir o processo usando duas COLUNAS-chaves 'coluna6' e 'coluna7':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WxmHQnQSZrXA" + }, + "source": [ + "df_estatisticas_descritivas = df.groupby(['coluna6','coluna7']).agg({'coluna1': ['min','mean','median','max']})\n", + "df_estatisticas_descritivas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ipw5EROwaaCX" + }, + "source": [ + "Observe que df_estatisticas_descritivas é um dataframe. Portanto, podemos selecionar LINHAS e/ou COLUNAS deste dataframe da forma que quisermos." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qk5uSdVwb7dH" + }, + "source": [ + "# Índices do dataframe:\n", + "df_estatisticas_descritivas.index" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "brIgUFlkalix" + }, + "source": [ + "# Selecionando o conteúdo de coluna6= 'a' e coluna7= 'um':\n", + "df_estatisticas_descritivas.loc[('a', 'um')]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fQUs2PVHc6iR" + }, + "source": [ + "# Selecionando o conteúdo de coluna6= 'a' e coluna7= 'um', primeiro valor:\n", + "df_estatisticas_descritivas.loc[('a', 'um')][0] # ou seja, selecionamos min" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zT0xiee6dDpK" + }, + "source": [ + "# Selecionando o conteúdo de coluna6= 'a' e coluna7= 'um', segundo valor:\n", + "df_estatisticas_descritivas.loc[('a', 'um')][1] # ou seja, selecionamos mean" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vXlcjPM6dQKi" + }, + "source": [ + "E daí por diante..." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EMxFMqn9dm3g" + }, + "source": [ + "To learn more about working with two index levels in a dataframe, see [Hierarchical indices, groupby and pandas](https://www.datacamp.com/community/tutorials/pandas-multi-index)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gNHyH7M0pGDy" + }, + "source": [ + "___\n", + "## Example 3\n", + "### Group operations and transformations" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ywl3k_l8pGD0" + }, + "source": [ + "# Show the example dataframe:\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AF8cbNsjpGD5" + }, + "source": [ + "# Build the df_Medias dataframe (numeric_only=True skips the text COLUMNS)\n", + "df_Medias = df.groupby('coluna6').mean(numeric_only=True).add_prefix('mean_')\n", + "df_Medias" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JGlA6ufLpGD9" + }, + "source": [ + "# Combine (merge) with the dataframe df:\n", + "pd.merge(df, df_Medias, left_on ='coluna6', right_index=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1MjZu3sVpGEd" + }, + "source": [ + "___\n", + "# **Discretizing numeric COLUMNS**\n", + "* pd.cut() - classes based on values (equal-width intervals when given a number of bins);\n", + "* pd.qcut() - classes based on sample quantiles, so each class holds roughly the same number of items.\n", + "\n", + "> This technique is widely used in Machine Learning when we want to build classes for numeric variables (integer or float). 
Acompanhe a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yK772hiSfZaE" + }, + "source": [ + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wi-nv6fshKIX" + }, + "source": [ + "## pd.cut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GUAYB8J2KzJt" + }, + "source": [ + "Bucket_cut = pd.cut(df['coluna8'], 4) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "SVExQmzDpGEe" + }, + "source": [ + "# Construir 4 classes para a variável float 'coluna8':\n", + "Bucket_cut = pd.cut(df['coluna8'], 4) # aqui, estamos construindo 4 buckets\n", + "Bucket_cut" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OOD38I6ug1AY" + }, + "source": [ + "# Quem são os Bucket's que construimos:\n", + "Bucket_cut.value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9s2eaZGtfsxu" + }, + "source": [ + "Como podem ver, de fato construimos 4 bucket's. **Observe que não temos a mesma quantidade de itens em cada classe!!!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T7u0pS64hPHC" + }, + "source": [ + "## pd.qcut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cJTQTHA6pGEm" + }, + "source": [ + "Bucket_qcut = pd.qcut(df['coluna8'], 4, labels=False)\n", + "Bucket_qcut" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vM30Td_8hZre" + }, + "source": [ + "# Quem são os Bucket's que construimos:\n", + "Bucket_qcut.value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jhf6V5LTh4G7" + }, + "source": [ + "## Comentários\n", + "* pd.qcut() garante uma distribuição mais uniforme dos valores em cada classe. 
Isso significa que é menos provável que você tenha uma classe com muitos dados e outra com poucos dados.\n", + "* Eu prefiro usar pd.qcut()." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RNsR0NsS5iIU" + }, + "source": [ + "___\n", + "# **Distribuição conjunta - crosstabs**\n", + "> Suponha que queremos analisar o número de sobreviventes em relação à COLUNA embarked." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LKQv6YtSfGSU" + }, + "source": [ + "df_Titanic2 = df_Titanic.copy()\n", + "df_Titanic2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ANhb5rBffTh6" + }, + "source": [ + "pd.crosstab(df_Titanic2['survived'], df_Titanic2['embarked'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WIlHAYEVqSjT" + }, + "source": [ + "___\n", + "# **Deletar COLUNAS do dataframe**\n", + "> Deletar as COLUNAS 'coluna2' e 'coluna5' do dataframe." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YssOMF_Vqso5" + }, + "source": [ + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rVF_1p0Gq3gZ" + }, + "source": [ + "## Usando inplace = True" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7BjRIX1jqWQT" + }, + "source": [ + "df.drop(['coluna2','coluna5'], axis =1, inplace =True)\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "POC2fnTlq8mK" + }, + "source": [ + "## Usando atribuição" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YRSwEbnfq7s_" + }, + "source": [ + "df= df.drop(['coluna2','coluna5'], axis =1)\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bHth6KSv7k0G" + }, + "source": [ + "___\n", + "# **Criar COLUNAS dummies para dados categóricos**\n", + "> Nosso objetivo é construir variáveis dummies para nossas COLUNAS categóricas.\n", + "\n", + "* Fontes: \n", + " * [Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)\n", + " * [Creating Dummy Variables](https://www.ritchieng.com/pandas-creating-dummy-variables/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GOqcARHqjMr_" + }, + "source": [ + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yNqvwEu9jbuW" + }, + "source": [ + "Vamos construir variáveis dummies para as COLUNAS 'coluna6' e 'coluna7', da seguinte forma:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "16osZsMEjmDh" + }, + "source": [ + "pd.get_dummies(df['coluna6'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cb1gp_Y1jxz2" + }, + "source": [ + "Qual a interpretação do resultado acima?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Cic19l-Mj39q" + }, + "source": [ + "pd.get_dummies(df['coluna7'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "44FDXcoyj-tT" + }, + "source": [ + "Qual a interpretação do resultado acima?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cxHc6BvDkCWl" + }, + "source": [ + "df = pd.get_dummies(df, columns =['coluna6', 'coluna7', 'sexo'])\n", + "df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "A2m25N4znZ2O" + }, + "source": [ + "df.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x0uXu0RRlB2a" + }, + "source": [ + "___\n", + "# **Calcular correlação (Análise de Correlação)**\n", + "> A correlação pode ser calculada usando o método df.corr(). Para mais detalhes sobre os tipos de correlação existentes bem como a aplicação de cada uma delas, consulte os links a seguir:\n", + "\n", + "* [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)\n", + "* [Kendall rank correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient)\n", + "* [Spearman's rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).\n", + "\n", + "Para aprender mais sobre a geração de heatmap, consulte [Seaborn Heatmap Tutorial (Python Data Visualization)](https://likegeeks.com/seaborn-heatmap-tutorial/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AgoigF8AnYG0" + }, + "source": [ + "## Gerando o dataframe-exemplo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6scRm8kNnbby" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "# Gerando um dataframe com 15 colunas, 
sendo 9 informativas e 6 redundantes:\n", + "from sklearn.datasets import make_classification\n", + "X, y = make_classification(n_samples=1000, n_features=15, n_informative=9,\n", + " n_redundant=6, n_repeated=0, n_classes=2, n_clusters_per_class=1,\n", + " random_state=20111974)\n", + "\n", + "df_X = pd.DataFrame(X, columns= ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15'])\n", + "df_y = pd.DataFrame(y, columns= ['target'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NsuhsZCTmqEm" + }, + "source": [ + "# Visualizar os dados\n", + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D0JNMHqYoSMs" + }, + "source": [ + "# Mostra o triângulo superior da matriz de correlação de Pearson\n", + "set_Colunas_Correlacionadas = set()\n", + "matriz_correlacao = df_X.corr().where(np.triu(np.ones(df_X.corr().shape), k = 1).astype(bool)) # np.bool foi removido do NumPy; use bool\n", + "matriz_correlacao" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vnj6A8z6r7nM" + }, + "source": [ + "### Quem são as colunas altamente correlacionadas?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "a_YUD-dOr_p-" + }, + "source": [ + "set_Colunas_Correlacionadas = set()\n", + "for i in range(len(matriz_correlacao.columns)):\n", + " for j in range(i):\n", + " if abs(matriz_correlacao.iloc[j, i]) > 0.8: # consertei o código do Nélio invertendo i com j\n", + " colnome = matriz_correlacao.columns[j] # consertei o código do Nélio colocando j no lugar do i\n", + " set_Colunas_Correlacionadas.add(colnome)\n", + "\n", + "set_Colunas_Correlacionadas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MF8Cii-OXo8h" + }, + "source": [ + "for i in range(len(matriz_correlacao.columns)):\n", + " for j in range(i):\n", + " print(f'i = {i} | j = {j} | Correlacao = {matriz_correlacao.iloc[j, i]}') # consertei o código do Nélio invertendo i com j\n", + " # if abs(matriz_correlacao.iloc[i, j]) > 0.8: # código original do nelio com erro\n", + " # colnome = matriz_correlacao.columns[i] # código original do nelio com erro\n", + " # set_Colunas_Correlacionadas.add(colnome) \n", + "\n", + "#set_Colunas_Correlacionadas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-0Xe6GdozYT" + }, + "source": [ + "A seguir, a correlação mais visual:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5-_Qadx1o1U9" + }, + "source": [ + "fig, ax = plt.subplots(figsize = (12, 12)) \n", + "mask = np.zeros_like(df_X.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(df_X.corr().abs(), mask= mask, ax= ax, cmap='coolwarm', annot= True, fmt= '.2f', center= 0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5ZOp9ZGgtqFQ" + }, + "source": [ + "# **Scatterplot**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eReJJjG8tuKV" + }, + "source": [ + "## Com regressão" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"tVmdSo6ztruA" + }, + "source": [ + "sns.pairplot(df_X, kind = \"reg\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xG9A6b32twv-" + }, + "source": [ + "## Sem regressão" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fyTOS3zVtz-O" + }, + "source": [ + "sns.pairplot(df_X, kind = \"scatter\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f-1bpipc6bMh" + }, + "source": [ + "___\n", + "# **Salvar dataframe como csv**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "64CoM1aY6gf6" + }, + "source": [ + "df_X.to_csv('example.csv')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oy646p33DJV0" + }, + "source": [ + "# **Dicionário de palavras**\n", + "> Muito utilizado em NLP e Machine Learning.\n", + "* Caso de Uso: Seguradoras --> Quando um segurado aciona a Seguradora para descrever um acidente (por exemplo), há um algorítmo que transforma o áudio em texto para mineração de textos." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DQR906rVD1V-" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sHvDaztJDPP7" + }, + "source": [ + "from sklearn.feature_extraction.text import CountVectorizer\n", + "vetorizador = CountVectorizer() # nome próprio para a instância, para não sobrescrever a classe CountVectorizer\n", + "matriz_contagens = vetorizador.fit_transform(df_Titanic['name']) # Informe a coluna do tipo texto/string que queremos analisar/avaliar\n", + "print(matriz_contagens)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jwT-56dED8VJ" + }, + "source": [ + "# get_feature_names_out() substitui get_feature_names() a partir do scikit-learn 1.0\n", + "df_dicionario_palavras = pd.DataFrame(vetorizador.get_feature_names_out(), columns = ['palavra'])\n", + "df_dicionario_palavras[\"vezes_que_aparece\"] = matriz_contagens.sum(axis = 0).tolist()[0]\n", + "df_dicionario_palavras = df_dicionario_palavras.sort_values(\"vezes_que_aparece\", ascending = False) #.reset_index(drop = True) # sort_values ordena as linhas do dataframe\n", + "df_dicionario_palavras.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nx65RmEAGTvd" + }, + "source": [ + "# Desafio\n", + "> Transforme o código Python da seção **Dicionário de palavras** em uma função para usarmos futuramente." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iwd1lhq9mrD3" + }, + "source": [ + "___\n", + "# **Exercícios**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o_cl0kFgQfFh" + }, + "source": [ + "## Exercício 1\n", + "* A partir dos dataframes USA_Abbrev, USA_Area e USA_Population, construa o dataframe USA contendo as COLUNAS state, abbreviation, area, ages, year, population.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s8rQUo7yHKJ1" + }, + "source": [ + "* Observação: A forma mais fácil de ler um arquivo CSV (exportado do Excel, por exemplo) a partir do GitHub é clicar no arquivo csv no seu repositório do GitHub e, em seguida, clicar em 'raw'. Depois, copie o endereço apresentado no browser e cole na variável 'url'. Em caso de dúvida, leia o documento a seguir: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jlz6smN3TLuH" + }, + "source": [ + "url_USA_Abb = 'https://raw.githubusercontent.com/Hayltons/DSWP/master/Dataframes/USA_Abbrev.csv'\n", + "url_USA_Area = 'https://raw.githubusercontent.com/Hayltons/DSWP/master/Dataframes/USA_Area.csv'\n", + "url_USA_Pop = 'https://raw.githubusercontent.com/Hayltons/DSWP/master/Dataframes/USA_Population.csv'\n", + "df_USA_Abb = pd.read_csv(url_USA_Abb)\n", + "df_USA_Area = pd.read_csv(url_USA_Area)\n", + "df_USA_Pop = pd.read_csv(url_USA_Pop)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vh0yvaAmUl-G" + }, + "source": [ + "df_USA_Abb.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E_UBdJ66UlXS" + }, + "source": [ + "df_USA_Area.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ngvPiSVVUkcz" + }, + "source": [ + "df_USA_Pop.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "uZkWKRsWUx7N" + }, + "source": [ + "df_USA = pd.merge(df_USA_Abb, df_USA_Area, how='outer', on='state')\n", + "df_USA" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jhPpQsGlUxlK" + }, + "source": [ + "df_USA1 = pd.merge(df_USA, df_USA_Pop, how='outer', left_on='abbreviation', right_on = 'state_region')\n", + "df_USA1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "XponZhgnUxG8" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KTun1uSLuJ-A" + }, + "source": [ + "## Exercício 2\n", + "Source: https://github.com/aakankshaws/Pandas-exercises\n", + "\n", + "* Considere os dataframes a seguir e faça o merge do dataframe df_esquerdo com o dataframe df_direito:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Soq7GVZnuREq" + }, + "source": [ + "df_esquerdo = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", + " 'A': ['A0', 'A1', 'A2', 'A3'],\n", + " 'B': ['B0', 'B1', 'B2', 'B3']})\n", + " \n", + "df_direito = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", + " 'C': ['C0', 'C1', 'C2', 'C3'],\n", + " 'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6KEsTARfvM1C" + }, + "source": [ + "## Exercício 3\n", + "Source: https://github.com/aakankshaws/Pandas-exercises\n", + "\n", + "* Considere os dataframes a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hgxE5gZ9vMEg" + }, + "source": [ + "df_esquerdo = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],\n", + " 'key2': ['K0', 'K1', 'K0', 'K1'],\n", + " 'A': ['A0', 'A1', 'A2', 'A3'],\n", + " 'B': ['B0', 'B1', 'B2', 'B3']})\n", + " \n", + "df_direito = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],\n", + " 'key2': ['K0', 'K0', 'K0', 'K0'],\n", + " 'C': ['C0', 
'C1', 'C2', 'C3'],\n", + " 'D': ['D0', 'D1', 'D2', 'D3']})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iv7AmZ1ivm8R" + }, + "source": [ + "### Perguntas\n", + "* Qual o output e a interpretação dos comandos a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TWAW_1tuvvSO" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QjM7pBONvzCJ" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'outer', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D1Rr3Ghsv2iS" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'right', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vXQwLjT-v3Iu" + }, + "source": [ + "pd.merge(df_esquerdo, df_direito, how = 'left', on = ['key1', 'key2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EIdltTC-t_lF" + }, + "source": [ + "## Exercício 5\n", + "5.1. Identifique e delete os atributos do dataframe df_Titanic que podem ser excluídos inicialmente no início da análise de dados." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bMwPLgWclWBq" + }, + "source": [ + "___\n", + "## Exercício 6\n", + "* (a) Carregue o dataframe Titanic_With_MV.csv e analise o dataframe em busca de inconsistências e Missing Values (NaN).\n", + "\n", + "### Feature Engineering\n", + "* (b) Com a coluna 'cabin', construir as colunas:\n", + " * deck - Letra de Cabin;\n", + " * seat - Número de Cabin\n", + "* (c) Criar a coluna 'sozinho_parch', onde sozinho_parch= 1 significa que o passageiro viaja sozinho e 0, caso contrário.\n", + "* (d) Criar a coluna 'sozinho_sibsp', onde sozinho_sibsp= 1 significa que o passageiro viaja sozinho e 0, caso contrário.\n", + "* (e) Discretizar a coluna 'fare' em 10 buckets.\n", + "* (f) Discretizar a coluna 'age'.\n", + "* (g) Capturar os títulos 'Ms', 'Mr' etc. contidos na coluna 'Title';\n", + "* (h) Qual a relação entre as variáveis e a variável-target?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V7KUGAX6lilP" + }, + "source": [ + "import pandas as pd\n", + "df_Titanic = pd.read_csv('https://raw.githubusercontent.com/Hayltons/DSWP/master/Dataframes/Titanic_With_MV.csv', index_col= 'PassengerId')\n", + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m3UnAPJakCLR" + }, + "source": [ + "* Segue o dicionário de dados do dataframe Titanic:\n", + " * PassengerId: ID do passageiro;\n", + " * survived: Indicador, sendo 1= Passageiro sobreviveu e 0= Passageiro morreu;\n", + " * Pclass: Classe;\n", + " * Age: Idade do Passageiro;\n", + " * SibSp: Número de irmãos/cônjuges a bordo;\n", + " * Parch: Número de pais/filhos a bordo;\n", + " * Fare: Valor pago pelo Passageiro;\n", + " * Cabin: Cabine do Passageiro;\n", + " * Embarked: O porto pelo qual o Passageiro embarcou;\n", + " * Name: Nome do Passageiro;\n", + " * sex: sexo do Passageiro\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + 
"id": "B_3s5cgxfNKQ" + }, + "source": [ + "## Resposta do item (a)\n", + "### Coluna XPTO\n", + "\n", + "\n", + "### Coluna XPTO2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q3oLgyhdL6xd" + }, + "source": [ + "## Resposta do item (b)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Sm-LaRZZKAac" + }, + "source": [ + "df_Titanic.loc[pd.isna(df_Titanic['Cabin']), 'Cabin'] = 'N0'\n", + "df_Titanic['deck'] = df_Titanic['Cabin'].str.get(0)\n", + "df_Titanic['seat'] = df_Titanic['Cabin'].str[1:]\n", + "df_Titanic.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "d-d_4f6hKi6T" + }, + "source": [ + "## Resposta do item (c)\n", + "df_Titanic['sozinho_parch'] = df_Titanic['Parch']\n", + "s_sozinho_parch = df_Titanic['sozinho_parch'] > 0\n", + "df_Titanic.loc[s_sozinho_parch, ['sozinho_parch']] = 0\n", + "df_Titanic.loc[~s_sozinho_parch, ['sozinho_parch']] = 1\n", + "df_Titanic" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xAUT7OjvKirj" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fzzC3zP7KiaG" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "aRHx_74nKiGd" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WNHrcn9uKg-i" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UbexhGtayV4X" + }, + "source": [ + "## Exercício 7\n", + "Consulte a página [Pandas Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/pandas/index.php) para mais exercícios relacionados á este tópico." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Iia0ikd_KBtH" + }, + "source": [ + "## Exercício 8\n", + "Crie a coluna 'aleatorio' no dataframe df_Titanic em que cada linha recebe um valor aleatório usando o método np.random.random()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HPiLKUkWNYs3" + }, + "source": [ + "df_Titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TUVTlE9WYW8C" + }, + "source": [ + "## Exercício 9\n", + "O arquivo FIFA.csv contém dados relacionados à última edição do FIFA 2018 (um dos jogos de videogame mais famosos) e traz os mais variados dados sobre os jogadores, por exemplo: idade, nacionalidade, potencial, salário etc. Faça o seguinte:\n", + "\n", + "1. Carregue o arquivo FIFA.csv (está na área de Dataframes do curso);\n", + "2. Que colunas podem previamente ser eliminadas da análise? Por que identificar o que pode ser eliminado é importante?\n", + "3. Qual o dtype de cada variável/atributo do dataframe?\n", + "4. Se alguma variável/atributo é do tipo string (object) e supostamente deveria ser numérica, como alteramos o tipo?\n", + "5. Normalize os nomes das colunas, ou seja, renomeie as colunas para minúsculo;\n", + "6. Há Missing values nos dados? Se sim, qual é a sua proposta (proposta do grupo) para tratar estes Missing values?\n", + "7. Qual a distribuição do número de jogadores por país? Apresente uma tabela com a distribuição.\n", + "8. Qual a média de idade dos jogadores por país (variável/atributo 'Nationality')?\n", + "9. Qual o número de jogadores por idade?\n", + "10. Quantos jogadores possui cada clube?\n", + "11. Qual a média de idade por clube?\n", + "12. Qual a média de salário por país?\n", + "13. Qual a média de salário por clube?\n", + "14. Qual a média de salário por idade?\n", + "15. Quanto cada clube gasta com pagamento de salários?\n", + "16. 
Quais são os insight (o que você consegue descobrir) em relação à variável 'Potential' (mede o potencial dos jogadores)?\n", + "17. Quais os insights em relação à variável overall (nota média do atleta) por idade, clube e país?\n", + "18. Quais são os melhores clubes se levarmos em consideração as variáveis Potential e Overall?\n", + "19. Apresente o ranking dos goleiros (use a variável/atributo 'Preferred Positions') por Potencial, Overall. Estamos à procura de 'GK'.\n", + "20. Quem são os jogadores mais rápidos (variável/atributo 'Sprint speed'=?\n", + "21. Quem são os 5 melhores jogadores em termos de chute (força para chutar) (use a variável/atributo 'Shot power')?\n", + "22. Quem são os outliers em termos de salário?\n", + "23. Quem são os outliers em termos de potência no chute?\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ldWQd9j4NhPS" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KVhF-Lc_XhDL" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_04__3DP_2_Missing_Value_Handling_hs.ipynb b/Notebooks/NB10_04__3DP_2_Missing_Value_Handling_hs.ipynb new file mode 100644 index 000000000..3b217d9a3 --- /dev/null +++ b/Notebooks/NB10_04__3DP_2_Missing_Value_Handling_hs.ipynb @@ -0,0 +1,1848 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB10_04__3DP_2_Missing_Value_Handling.ipynb", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rGVp396DAlIW" + }, + "source": [ + "

3DP_2 - MISSING VALUES HANDLING

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ii1Mci_PxQdJ" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> Consulte **Table of contents**.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "slSYEvDtArHO" + }, + "source": [ + "___\n", + "# **REFERÊNCIAS**\n", + "* [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)\n", + "* [Handling Missing Data for a Beginner](https://towardsdatascience.com/handling-missing-data-for-a-beginner-6d6f5ea53436)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UGJtCSbTwraG" + }, + "source": [ + "___\n", + "# **3DP_MISSING VALUES HANDLING**\n", + "\n", + "> Lidar com Missing Values é um dos piores pesadelos de um Cientista de dados. Especialmente, se o número de MV for grande o suficiente (geralmente acima de 5%). Nesse caso, os valores não podem ser descartados e um Cientista de Dados inteligente deve \"imputar\" os valores ausentes.\n", + "\n", + "* Nesta sessão, vamos identificar, analisar e tratar Missing Values (MV).\n", + "* Como MV são gerados?\n", + " * Usuário se esqueceu de preencher ou preencheu errado o campo;\n", + " * Os dados foram perdidos durante a transferência manual de um banco de dados legado;\n", + " * Erro de programação;\n", + " * Os usuários optaram por não preencher um campo vinculado a suas crenças sobre como os resultados seriam usados ou interpretados.\n", + "* As funções df.isnull() e df.isna() são apropriadas para nos indicar quantas observações são MV no dataframe.\n", + "\n", + "* Na prática:\n", + " * Variáveis Contínuas/Numéricas - Podemos substituir os NaN por Média/Mediana/Moda;\n", + "\t* Variáveis Categóricas - Uma alternativa é atribuir uma categoria inexistente como, por exemplo \"MV\" para indicar o NaN.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4mFlY2iIHDaV" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING COM PYTHON (Scikit-Learn)**\n", + "\n", + 
"![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CA-GboEcP4zY" + }, + "source": [ + "## Carregar as biliotecas" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "x0fq_2HoP7OE" + }, + "source": [ + "import pandas as pd\n", + "from pandas import Series, DataFrame\n", + "\n", + "import numpy as np\n", + "from sklearn import preprocessing\n", + "import matplotlib\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "%matplotlib inline\n", + "matplotlib.style.use('ggplot')\n", + "\n", + "# remove warnings to keep notebook clean\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P4_7D_4NA7KJ" + }, + "source": [ + "## Dataframes\n", + "* O dataframe abaixo foi gerado aleatoriamente para entendermos como lidar com os NaN's." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwyndikLOld0", + "outputId": "c8a71b76-e743-42ef-c482-be2e4872123e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 355 + } + }, + "source": [ + "df= pd.DataFrame({\n", + " 'idade': [32,38,np.nan,37,np.nan,36,38,32,0,np.nan],\n", + " 'salario': ['High', 'High', 'High', 'Low', 'Low', 'High', np.nan, 'Medium', 'Medium', 'High'],\n", + " 'pais': ['Spain', 'France', 'France', np.nan, 'Germany', 'France', 'Spain', 'France', np.nan, 'Spain']})\n", + "\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idadesalariopais
032.0HighSpain
138.0HighFrance
2NaNHighFrance
337.0LowNaN
4NaNLowGermany
536.0HighFrance
638.0NaNSpain
732.0MediumFrance
80.0MediumNaN
9NaNHighSpain
\n", + "
" + ], + "text/plain": [ + " idade salario pais\n", + "0 32.0 High Spain\n", + "1 38.0 High France\n", + "2 NaN High France\n", + "3 37.0 Low NaN\n", + "4 NaN Low Germany\n", + "5 36.0 High France\n", + "6 38.0 NaN Spain\n", + "7 32.0 Medium France\n", + "8 0.0 Medium NaN\n", + "9 NaN High Spain" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nnJDArN0Thcs" + }, + "source": [ + "## Identificar os NaN's" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OuWnwsrWUOwJ" + }, + "source": [ + "A função df.isna() será usada para identificarmos os NaN's nos dataframes. Por exemplo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MbpVaEz0Vrhv", + "outputId": "5f0e80df-6612-4870-dcf7-85dfe1a76af4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 359 + } + }, + "source": [ + "df.isna()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdadeSalarioPais
0FalseFalseFalse
1FalseFalseFalse
2TrueFalseFalse
3FalseFalseTrue
4TrueFalseFalse
5FalseFalseFalse
6FalseTrueFalse
7FalseFalseFalse
8FalseFalseTrue
9TrueFalseFalse
\n", + "
" + ], + "text/plain": [ + " Idade Salario Pais\n", + "0 False False False\n", + "1 False False False\n", + "2 True False False\n", + "3 False False True\n", + "4 True False False\n", + "5 False False False\n", + "6 False True False\n", + "7 False False False\n", + "8 False False True\n", + "9 True False False" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 64 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yNTQr1HIYmfj" + }, + "source": [ + "Qual a interpretação deste output?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4j_sDA9UYwfy" + }, + "source": [ + "Para um dataframe muito grande, vamos usar a expressão abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_I9Rmip5Y0q1", + "outputId": "e308f7b0-385b-4060-e7cb-50c8b453a032", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "df.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Idade 3\n", + "Salario 1\n", + "Pais 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 65 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iylkpQhWY6Fb" + }, + "source": [ + "Mais prático não é? 
No entanto, vamos utilizar a função abaixo, que nos ajudará mais com os NaN's:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9CP1CsHPeeUQ" + }, + "source": [ + "def mostra_missing_value(df):\n", + " total = df.isnull().sum().sort_values(ascending = False)\n", + " percent = 100*round((df.isnull().sum()/df.isnull().count()).sort_values(ascending = False), 2)\n", + " missing_data = pd.concat([total, percent], axis = 1, keys=['Total', 'Percentual'])\n", + " print(missing_data.head(10))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nHB8ND2iefp4", + "outputId": "d9168eb0-b962-47dc-d07e-28b7e6462ff6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 87 + } + }, + "source": [ + "mostra_missing_value(df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " Total Percentual\n", + "idade 3 30.0\n", + "pais 2 20.0\n", + "salario 1 10.0\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qaKpKXBVZBeu" + }, + "source": [ + "## A função df.dropna()\n", + "* Esta função deleta as instâncias (linhas do dataframes) onde há pelo menos 1 NaN." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xhw5fJKFZGPn", + "outputId": "038d7b73-478d-4d1d-f488-237c7da6bf7a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 173 + } + }, + "source": [ + "df2 = df.dropna()\n", + "df2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdadeSalarioPais
032.0HighSpain
138.0HighFrance
536.0HighFrance
732.0MediumFrance
\n", + "
" + ], + "text/plain": [ + " Idade Salario Pais\n", + "0 32.0 High Spain\n", + "1 38.0 High France\n", + "5 36.0 High France\n", + "7 32.0 Medium France" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 66 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BqWoQZ5fZeVk" + }, + "source": [ + "Como podemos ver, somente as instâncias 0, 1, 5 e 7 tem atributos não NaN's." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lc7lmP7hikBT" + }, + "source": [ + "Uma forma menos severa seria:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "h1qLnkFoimU5", + "outputId": "7ee00406-e4df-42bb-d8e9-a156db0ec3dd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "df3 = df.dropna(axis = 0, subset = ['pais'])\n", + "df3" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdadeSalarioPaisIdade2Idade3Pais2
032.0HighSpain32.032.0Spain
138.0HighFrance38.038.0France
2NaNHighFranceNaN36.0France
4NaNLowGermanyNaN36.0Germany
536.0HighFrance36.036.0France
638.0NaNSpain38.038.0Spain
732.0MediumFrance32.032.0France
9NaNHighSpainNaN36.0Spain
\n", + "
" + ], + "text/plain": [ + " Idade Salario Pais Idade2 Idade3 Pais2\n", + "0 32.0 High Spain 32.0 32.0 Spain\n", + "1 38.0 High France 38.0 38.0 France\n", + "2 NaN High France NaN 36.0 France\n", + "4 NaN Low Germany NaN 36.0 Germany\n", + "5 36.0 High France 36.0 36.0 France\n", + "6 38.0 NaN Spain 38.0 38.0 Spain\n", + "7 32.0 Medium France 32.0 32.0 France\n", + "9 NaN High Spain NaN 36.0 Spain" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 91 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bbQC69pjizWO" + }, + "source": [ + "* Saberias explicar o que o comando acima fez?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z52kmtrSS_2J" + }, + "source": [ + "## Tratar os NaN's de Variáveis Numéricas\n", + "* Neste exemplo, vou substituir os NaN's da variável 'idade' pela mediana. No entanto, responda a seguinte perfunta:\n", + " * Faz sendido idade= 0?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TNadcxzRe5r3" + }, + "source": [ + "Acho que a resposta é não. Então, neste caso, 0 é um NaN. Vamos substituído pela mediana da variável:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O8SOOYbFfBtQ", + "outputId": "06afc715-c122-444a-891b-29ca9493efb0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 359 + } + }, + "source": [ + "df['idade2'] = df['idade'].replace({0: df['idade'].median()})\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdadeSalarioPaisIdade2
032.0HighSpain32.0
138.0HighFrance38.0
2NaNHighFranceNaN
337.0LowNaN37.0
4NaNLowGermanyNaN
536.0HighFrance36.0
638.0NaNSpain38.0
732.0MediumFrance32.0
80.0MediumNaN36.0
9NaNHighSpainNaN
\n", + "
" + ], + "text/plain": [ + " Idade Salario Pais Idade2\n", + "0 32.0 High Spain 32.0\n", + "1 38.0 High France 38.0\n", + "2 NaN High France NaN\n", + "3 37.0 Low NaN 37.0\n", + "4 NaN Low Germany NaN\n", + "5 36.0 High France 36.0\n", + "6 38.0 NaN Spain 38.0\n", + "7 32.0 Medium France 32.0\n", + "8 0.0 Medium NaN 36.0\n", + "9 NaN High Spain NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 80 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YzpEHNXffRyC" + }, + "source": [ + "Como podemos verificar acima na variável 'idade2', o valor 0 foi substituído pela mediana da variável 'idade'." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jlBNhkUtb60L" + }, + "source": [ + "Vamos verificar a média da variável antes da operação:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ix4ioHTHcCAJ", + "outputId": "43bbd457-d06f-4d40-aa4e-2861bd624340", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df['idade2'].mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "35.57142857142857" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 82 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WefKW3WaTFdQ", + "outputId": "26c1c9f4-59e8-4f84-e0e7-5a29c8326711", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 359 + } + }, + "source": [ + "df['idade3'] = df['idade2']\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdadeSalarioPaisIdade2Idade3
032.0HighSpain32.032.0
138.0HighFrance38.038.0
2NaNHighFranceNaNNaN
337.0LowNaN37.037.0
4NaNLowGermanyNaNNaN
536.0HighFrance36.036.0
638.0NaNSpain38.038.0
732.0MediumFrance32.032.0
80.0MediumNaN36.036.0
9NaNHighSpainNaNNaN
\n", + "
" + ], + "text/plain": [ + " Idade Salario Pais Idade2 Idade3\n", + "0 32.0 High Spain 32.0 32.0\n", + "1 38.0 High France 38.0 38.0\n", + "2 NaN High France NaN NaN\n", + "3 37.0 Low NaN 37.0 37.0\n", + "4 NaN Low Germany NaN NaN\n", + "5 36.0 High France 36.0 36.0\n", + "6 38.0 NaN Spain 38.0 38.0\n", + "7 32.0 Medium France 32.0 32.0\n", + "8 0.0 Medium NaN 36.0 36.0\n", + "9 NaN High Spain NaN NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 83 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AOWQOGmEcIfi" + }, + "source": [ + "Aplicamos a operação:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gAzxEchhdOXJ", + "outputId": "b17cafc9-eea2-4ce7-e2c5-de3b646a3c74", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 359 + } + }, + "source": [ + "df['idade3'].fillna(df['idade3'].median(), inplace = True)\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdadeSalarioPaisIdade2Idade3
032.0HighSpain32.032.0
138.0HighFrance38.038.0
2NaNHighFranceNaN36.0
337.0LowNaN37.037.0
4NaNLowGermanyNaN36.0
536.0HighFrance36.036.0
638.0NaNSpain38.038.0
732.0MediumFrance32.032.0
80.0MediumNaN36.036.0
9NaNHighSpainNaN36.0
\n", + "
" + ], + "text/plain": [ + " Idade Salario Pais Idade2 Idade3\n", + "0 32.0 High Spain 32.0 32.0\n", + "1 38.0 High France 38.0 38.0\n", + "2 NaN High France NaN 36.0\n", + "3 37.0 Low NaN 37.0 37.0\n", + "4 NaN Low Germany NaN 36.0\n", + "5 36.0 High France 36.0 36.0\n", + "6 38.0 NaN Spain 38.0 38.0\n", + "7 32.0 Medium France 32.0 32.0\n", + "8 0.0 Medium NaN 36.0 36.0\n", + "9 NaN High Spain NaN 36.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 84 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eeKi5thjd5cn" + }, + "source": [ + "Podemos observar que os valores NaN's do atributo 'idade3' foi substituído pelo valor 36." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YhhlVz2ddkbm" + }, + "source": [ + "E agora, a média após a operação:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "z2EsBMugdnCJ", + "outputId": "1e3e9566-3acb-4909-b433-c953deb5e589", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df['idade3'].mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "35.7" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 85 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kD-bY7Vlf6pH" + }, + "source": [ + "* Qual a conclusão?\n", + " * Houve muito impacto na distribuição da variável 'idade'?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3oE1ZuB4TFlr" + }, + "source": [ + "## Tratar NaN's de Variáveis Categóricas\n", + "* Observe a variável 'pais'. Temos alguns NaN's. As alternativas que temos são:\n", + " * substituir os NaN's desta variável pela moda (valor mais frequente) da distribuição.\n", + " * substiruir os NaN's por 'Undefined'." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KNUkV4x2hLeV" + }, + "source": [ + "Qual o valor (no caso, País) mais frequente ?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OzuJ5p9UKa4v", + "outputId": "02a2b4d1-87ca-4af3-f7cc-60ae42b91930", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "df.pais.value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "France 4\n", + "Spain 3\n", + "Germany 1\n", + "Name: Pais, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 87 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GGmqsEflhYV3" + }, + "source": [ + "Ok, a instância 'France' é o mais frequente. Então vamos substituir os NaN's por 'France'. De forma automática, temos:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ms1EBykBh3Ic", + "outputId": "ddfe280e-8ceb-4bf9-81e3-4b2411d98046", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 52 + } + }, + "source": [ + "sMode_Of_pais = df['pais'].mode()[0]\n", + "sMode_Of_pais" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 France\n", + "dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XhsTLndlhk6W", + "outputId": "64fcc5ac-d9fe-44c4-b370-8ca63861a286", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 359 + } + }, + "source": [ + "df[\"pais2\"] = df[\"pais\"]\n", + "df[\"pais2\"] = df[\"pais2\"].fillna(sMode_Of_pais)\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IdadeSalarioPaisIdade2Idade3Pais2
032.0HighSpain32.032.0Spain
138.0HighFrance38.038.0France
2NaNHighFranceNaN36.0France
337.0LowNaN37.037.0France
4NaNLowGermanyNaN36.0Germany
536.0HighFrance36.036.0France
638.0NaNSpain38.038.0Spain
732.0MediumFrance32.032.0France
80.0MediumNaN36.036.0France
9NaNHighSpainNaN36.0Spain
\n", + "
" + ], + "text/plain": [ + " Idade Salario Pais Idade2 Idade3 Pais2\n", + "0 32.0 High Spain 32.0 32.0 Spain\n", + "1 38.0 High France 38.0 38.0 France\n", + "2 NaN High France NaN 36.0 France\n", + "3 37.0 Low NaN 37.0 37.0 France\n", + "4 NaN Low Germany NaN 36.0 Germany\n", + "5 36.0 High France 36.0 36.0 France\n", + "6 38.0 NaN Spain 38.0 38.0 Spain\n", + "7 32.0 Medium France 32.0 32.0 France\n", + "8 0.0 Medium NaN 36.0 36.0 France\n", + "9 NaN High Spain NaN 36.0 Spain" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 90 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vdv--JyTj2s8" + }, + "source": [ + "df[\"pais3\"] = df[\"pais\"].fillna('pais_mv')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AU7mwDcSkeSz", + "outputId": "a072a642-00ff-465d-a9c5-a2cb45b07e0f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 355 + } + }, + "source": [ + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idadesalariopaispais3
032.0HighSpainSpain
138.0HighFranceFrance
2NaNHighFranceFrance
337.0LowNaNPais_MV
4NaNLowGermanyGermany
536.0HighFranceFrance
638.0NaNSpainSpain
732.0MediumFranceFrance
80.0MediumNaNPais_MV
9NaNHighSpainSpain
\n", + "
" + ], + "text/plain": [ + " idade salario pais pais3\n", + "0 32.0 High Spain Spain\n", + "1 38.0 High France France\n", + "2 NaN High France France\n", + "3 37.0 Low NaN Pais_MV\n", + "4 NaN Low Germany Germany\n", + "5 36.0 High France France\n", + "6 38.0 NaN Spain Spain\n", + "7 32.0 Medium France France\n", + "8 0.0 Medium NaN Pais_MV\n", + "9 NaN High Spain Spain" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lTtkurTY8ttj" + }, + "source": [ + "# **EXERCÍCIOS**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uGQqvlYhiQ56" + }, + "source": [ + "## Exercício 1\n", + "* Trate os NaN's da variável 'salario'." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5Nv2-w7t824J" + }, + "source": [ + "## Exercício 2 - Diabetes\n", + "* Carregue o dataframe diabeletes.csv e trate os NaN's." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "48KUsFSe9wwj" + }, + "source": [ + "### Carregar o dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I8RUE_aj9zND" + }, + "source": [ + "url_df= ''\n", + "df = pd.read_csv(url_df)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jgFm2aJL-EaL" + }, + "source": [ + "**Dica**: Algumas medidas não fazem sentido seram nulas (0). Portanto, os NaN's aqui neste dataframe são o valor 0. Portanto, substitua os NaN's (no caso, 0)das variáveis Glucose, BloodPressure, SkinThickness, Insulin e BMI por alguma medida como, por exemplo, média, mediana, moda e etc." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tK0YWKti_znY" + }, + "source": [ + "## Exercício 3 - Titanic\n", + "> Trate os NaN's do dataframe Titanic_With_MV.csv." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aMZzx7iFAFXH" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_04__3DP_3_Data_Transformation_hs.ipynb b/Notebooks/NB10_04__3DP_3_Data_Transformation_hs.ipynb new file mode 100644 index 000000000..1e66371b7 --- /dev/null +++ b/Notebooks/NB10_04__3DP_3_Data_Transformation_hs.ipynb @@ -0,0 +1,1239 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB10_04__3DP_3_Data_Transformation.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5CgDLvphxfcX" + }, + "source": [ + "

3DP_3 - DATA TRANSFORMATION

\n", + "\n", + "* **Objetivo**: Preparar os dados para o Machine Learning." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PvW689ZBxbxH" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> Consulte **Table of contents**.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GNiuYCCxGe8v" + }, + "source": [ + "# **Melhorias da sessão**\n", + "* Desenvolver a sessão sobe WOE." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-TdSY74U0XS9" + }, + "source": [ + "___\n", + "# **Referências**\n", + "* [Why, How and When to Scale your Features](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e)\n", + "* [Demonstrating the different strategies of KBinsDiscretizer](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html#sphx-glr-auto-examples-preprocessing-plot-discretization-strategies-py);\n", + "* [Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?](https://medium.com/@contactsunny/why-do-we-need-feature-scaling-in-machine-learning-and-how-to-do-it-using-scikit-learn-d8314206fe73)\n", + "* [Importance of Feature Scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py) --> Muito importante por demonstrar os efeitos e a importância de se transformar as colunas numéricas.\n", + "* [Feature discretization](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html#sphx-glr-auto-examples-preprocessing-plot-discretization-classification-py) --> Mostra o impacto na acurácia dos modelos com e sem discretização. Ou seja, discretizar faz sentido!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l9DGifbWSmW3" + }, + "source": [ + "___\n", + "# **Machine Learning com Python (Scikit-Learn)**\n", + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vg82Iouo_Qm2" + }, + "source": [ + "# Porque dimensionar (Scale), padronizar (Standardize) e normalizar (Normalize) importa?\n", + "* Porque muitos algoritmos de Machine Learning performam melhor ou convergem mais rápido quando os atributos/colunas/variáveis estão na mesma escala e possuem distribuição \"próxima\" da Normal." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q-chlATnKSza" + }, + "source": [ + "## Carregar as bibliotecas (genéricas) Python" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kQGVQB18-tM_" + }, + "source": [ + "!pip install category_encoders\n", + "!pip install update" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7FJxrZckYxk6" + }, + "source": [ + "import pandas as pd\n", + "\n", + "import numpy as np\n", + "from sklearn import preprocessing\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "%matplotlib inline\n", + "\n", + "import category_encoders as ce # library para aplicação do WOE - Weight Of Evidence para avaliar importância dos atributos\n", + "\n", + "# remove warnings to keep notebook clean\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CyuWQM2NTMls" + }, + "source": [ + "pd.options.display.float_format = '{:.2f}'.format" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R0fuDyI8_UPf" + }, + "source": [ + "## Carregar os dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9oRWtarakgMY" 
+ }, + "source": [ + "### Dataframe gerado aleatoriamente - variáveis com distribuição Normal" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7BXPXo3k0VDI" + }, + "source": [ + "np.random.seed(20111974)\n", + "\n", + "i_N = 10000\n", + "\n", + "df_A1 = pd.DataFrame({\n", + " 'coluna1': np.random.normal(0, 2, i_N), # Observem que a média das colunas são distintas\n", + " 'coluna2': np.random.normal(50, 3, i_N),\n", + " 'coluna3': np.random.normal(-5, 5, i_N),\n", + " 'coluna4': np.random.normal(-10, 10, i_N)\n", + "})\n", + "\n", + "df_A1.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "93ST1JnoRZKm" + }, + "source": [ + "**Dica**: Podemos usar outras distribuições (se quisermos), como a Exponential (mostrada abaixo)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XUqjo5QcQH99" + }, + "source": [ + "np.random.seed(20111974)\n", + "\n", + "df_A2 = pd.DataFrame({\n", + " 'coluna1': np.random.normal(0, 2, i_N),\n", + " 'coluna2': np.random.normal(50, 3, i_N),\n", + " 'coluna3': np.random.exponential(1, i_N), # coluna3 tem distribuição Exponential\n", + " 'coluna4': np.random.normal(-10, 10, i_N)\n", + "})\n", + "\n", + "df_A2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J8MZNLbUkp8R" + }, + "source": [ + "### Dataframe gerado aleatoriamente 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BR-fDDujcTup" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "\n", + "dados, classe = make_classification(n_samples = i_N, n_features = 4, n_informative = 3, n_redundant = 1, n_classes = 3)\n", + "\n", + "df_A3 = pd.DataFrame({'coluna1': dados[:,0],\n", + " 'coluna2':dados[:,1],\n", + " 'coluna3':dados[:,2],\n", + " 'coluna4':dados[:,3]}) #, 'coluna5':classe})\n", + "\n", + "df_A3.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "Zq1cnpwLKvjS" + }, + "source": [ + "df_A4 = pd.DataFrame({ \n", + " 'coluna1': np.random.beta(5, 1, i_N) * 25, \n", + " 'coluna2': np.random.exponential(10, i_N),\n", + " 'coluna3': np.random.normal(10, 2, i_N),\n", + " 'coluna4': np.random.normal(10, 10, i_N), \n", + "})\n", + "\n", + "df_A4.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O7sXQjvYRfhb" + }, + "source": [ + "#### Extração de amostras para compararmos" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rjVHsnnHRkIo" + }, + "source": [ + "df_A1_test = df_A1.sample(n = 100)\n", + "df_A2_test = df_A2.sample(n = 100)\n", + "df_A3_test = df_A3.sample(n = 100)\n", + "df_A4_test = df_A4.sample(n = 100)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t0v0uXFRl-yG" + }, + "source": [ + "___\n", + "# **Transformações**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pkzTO0fdz93b" + }, + "source": [ + "## (1) StandardScaler\n", + "* StandardScaler é a transformação que centraliza os dados através da remoção da média (dos dados) e, na sequência, redimensiona (scale) através da divisão pelo desvio-padrão;\n", + "* Após a transformação, os dados terão média zero e desvio-padrão 1;\n", + "* Assume que os dados (as colunas a serem transformadas) são normalmente distribuidos ;\n", + "* Se os dados não possuem distribuição Normal, então esta não é uma boa transformação a se aplicar.\n", + "\n", + "$$z_{i}= \\frac{x_{i}-mean(x)}{std(x)}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v1UOOWeQ0R_Y" + }, + "source": [ + "### Exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y1Lzx3xN6wpZ" + }, + "source": [ + "df_A3.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9cPq_7Vu2HCS" + }, + "source": [ + "Histograma:" + ] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "ZYW9WwBC3hd_" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A1['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Adiciona títulos e labels\n", + "plt.title('Histograma da coluna3')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h8ogcQvvT5zK" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A2['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Adiciona títulos e labels\n", + "plt.title('Histograma da coluna3')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RrgxkESc-Uaq" + }, + "source": [ + "Considere o gráfico a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U7dHTF1W-Xsn" + }, + "source": [ + "df_A1.plot(kind = 'kde') # KDE (= kernel Density Estimate) ajuda-nos a visualizar a distribuição dos dados, análogo ao histograma." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hMS72n14-hDO" + }, + "source": [ + "Qual a interpretação para o gráfico acima?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "izqGNcNILdaX" + }, + "source": [ + "df_A1.plot()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZEkAqlZg-p0v" + }, + "source": [ + "A seguir, a transformação StandardScaler:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N4u3T_BX-oc_" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "voFQ4odSzzPZ" + }, + "source": [ + "O ideal é termos um array com as preditoras, da seguinte forma:\n", + "X = [coluna1, coluna2, ..., colunaN]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rPa4-SCt-ynX" + }, + "source": [ + "np.set_printoptions(precision = 3)\n", + "\n", + "A1_scale = StandardScaler().fit_transform(df_A1) # Combinação dos métodos fit() + transform()\n", + "\n", + "A1_scale_fit = StandardScaler().fit(df_A1) # Aplica o fit() separadamente\n", + "A1_scale_transform = A1_scale_fit.transform(df_A1) # Aplica o transform() separadamente.\n", + "A1_scale_fit_transform = StandardScaler().fit(df_A1).transform(df_A1) # Aplica fit().transform() encadeado\n", + "\n", + "A2_scale = StandardScaler().fit_transform(df_A2)\n", + "\n", + "A3_scale = StandardScaler().fit_transform(df_A3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ioZ_IN3Z6d39" + }, + "source": [ + "Observe abaixo que A1_scale = A1_scale_transform = A1_scale_fit_transform --> São arrays multidimensionais (do tipo NumPy)!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v4xQR4cu5D1J" + }, + "source": [ + "A1_scale" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j6GtN2KF4E_A" + }, + "source": [ + "A1_scale_transform" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0q2bvSqb6T4g" + }, + "source": [ + "A1_scale_fit_transform" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WIhaErnA46Fi" + }, + "source": [ + "Converting to a dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HAhRvPze44JW" + }, + "source": [ + "df_A1_scale = pd.DataFrame(A1_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "df_A2_scale = pd.DataFrame(A2_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "df_A3_scale = pd.DataFrame(A3_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bmQp8wDO_E88" + }, + "source": [ + "Now compare with the new plot below --> the transformed columns of df_A1 have mean 0 and standard deviation 1; since they were Normal to begin with, they now follow a Normal(0, 1) distribution:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "csfqRhDH2zUb" + }, + "source": [ + "df_A1.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-krh1pDg22RF" + }, + "source": [ + "df_A1_scale.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2fTPWsm_Hq3" + }, + "source": [ + "df_A1_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9oN-829l3277" + }, + "source": [ + "df_A2.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jqh8L5BeUHT-" + }, + "source": 
[ + "df_A2_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Yvz6O1zk4XNE" + }, + "source": [ + "df_A3.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ffU-fQxCUSmm" + }, + "source": [ + "df_A3_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y24MOLL83w9j" + }, + "source": [ + "### Exercise: Compute the mean and the standard deviation." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1Aa25gVlSdOi" + }, + "source": [ + "df_A1.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EXZUiZImSmOE" + }, + "source": [ + "df_A1_scale.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uIUQw5dpRwvA" + }, + "source": [ + "#### Correlation between the columns\n", + "* Note that the correlations between the variables do not change with the transformation: each column is only shifted and divided by a positive constant, which leaves the Pearson correlation unchanged." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uj1UerjORq9q" + }, + "source": [ + "df_A1.corr()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jp6vPK0aR_p0" + }, + "source": [ + "df_A1_scale.corr()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4fuURrao_M0c" + }, + "source": [ + "What is the conclusion?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f0A9U7rs_RAT" + }, + "source": [ + "## (2) MinMaxScaler\n", + "* **A very popular and widely used transformation**.\n", + "* Transforms the data to the interval [0, 1];\n", + "* If StandardScaler is not applicable, this transformation works well.\n", + "* Sensitive to outliers. Therefore, ideally the outliers should be treated beforehand.\n", + "* A transformation similar to MinMaxScaler() is MaxAbsScaler(), which rescales the data to the interval [-1, 1].\n", + "\n", + "$$z_{i}= \\frac{x_{i}-min(x)}{max(x)-min(x)}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C0HbeuP-AU_p" + }, + "source": [ + "### Example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mgeLckzxAWaC" + }, + "source": [ + "from sklearn.preprocessing import MinMaxScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S_W9bTO2AbEg" + }, + "source": [ + "df_A1.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PJRFbUpBAg5J" + }, + "source": [ + "A1_MinMaxScaler = MinMaxScaler().fit_transform(df_A1)\n", + "df_A1_MinMaxScaler = pd.DataFrame(A1_MinMaxScaler, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "\n", + "# Plot\n", + "df_A1_MinMaxScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7g8GA4LTA40U" + }, + "source": [ + "What is the conclusion?" 
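+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A minimal sketch of the MinMaxScaler formula applied by hand and checked against scikit-learn. The toy array `x_demo` and the names `x_manual` and `x_sklearn` are illustrative assumptions, not part of the original notebook:" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "\n", + "# Hypothetical toy data: a single column with five values\n", + "x_demo = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])\n", + "\n", + "# z = (x - min) / (max - min), computed column by column\n", + "x_manual = (x_demo - x_demo.min(axis = 0)) / (x_demo.max(axis = 0) - x_demo.min(axis = 0))\n", + "x_sklearn = MinMaxScaler().fit_transform(x_demo)\n", + "\n", + "print(x_manual.ravel())\n", + "print(np.allclose(x_manual, x_sklearn)) # both computations should agree" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In each column the smallest value maps to 0 and the largest to 1, so a single extreme value compresses all the other values into a narrow band." 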
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Z6D3vfnB9Nm" + }, + "source": [ + "## (3) RobustScaler\n", + "* A transformation well suited to data with outliers.\n", + "* Centers each column on its median and scales it by the interquartile range (Q3 - Q1), two statistics that are barely affected by extreme values.\n", + "\n", + "$$z_{i}= \\frac{x_{i}-median(x)}{Q_{3}(x)-Q_{1}(x)}$$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "m3oyuxLeCW1D" + }, + "source": [ + "df_A1.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zeDF7-w_CcBy" + }, + "source": [ + "from sklearn.preprocessing import RobustScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vLoqSKijCf2v" + }, + "source": [ + "A1_RobustScaler = RobustScaler().fit_transform(df_A1)\n", + "df_A1_RobustScaler = pd.DataFrame(A1_RobustScaler, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "\n", + "# Plot\n", + "df_A1_RobustScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-YVMgt-WEFif" + }, + "source": [ + "## Encoding Categorical Variables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xHYvLc8T_jxQ" + }, + "source": [ + "### Encoding Ordinal Variables\n", + "* Example: variables with ordinal values: low, medium or high." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i1BgGiGdSTcG" + }, + "source": [ + "#### Generate an example dataframe." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kdVahfJAEkuO" + }, + "source": [ + "# np.random.randint returns random integers, including the lower bound and excluding the upper bound.\n", + "\n", + "l_idade = [np.random.randint(20, 40) for _ in range(10)]\n", + "\n", + "l_salario = ['baixo', 'medio', 'alto']\n", + "l_salario2 = np.random.choice(l_salario, 10, p = [0.6, 0.3, 0.1])\n", + "\n", + "df_A4 = pd.DataFrame({\n", + "    'idade': l_idade,\n", + "    'salario': l_salario2})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "m_15P2eUHSBY" + }, + "source": [ + "df_A4" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R1g9pEuyHe2q" + }, + "source": [ + "In this example, we recode the ordinal categorical variable 'salario' as follows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bkwFuEa8HnMV" + }, + "source": [ + "df_A4['salario_cat'] = df_A4['salario'].map({'baixo': 1, 'medio': 2, 'alto': 3})\n", + "df_A4" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DlaIFiWIIPAl" + }, + "source": [ + "### Encoding Nominal Variables\n", + "* Example: variables with nominal values: sex (female, male).\n", + "\n", + "* Use One-Hot Encoding or pd.get_dummies()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ffNoJQbgJRoY" + }, + "source": [ + "We will use the dataframe created in the previous step:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PMCoUWZOI7c0" + }, + "source": [ + "df_A4['salario'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bdIEyBkaJeN8" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder, OneHotEncoder" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4MwK4cUEKeK4" + }, + "source": [ + "#### Apply LabelEncoder()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6X6VXDsHJiII" + }, + "source": [ + "# LabelEncoder assigns integers following the sorted (alphabetical) order of the labels, not their ordinal meaning\n", + "le = LabelEncoder()\n", + "df_A4['salario_le'] = le.fit_transform(df_A4['salario'])\n", + "df_A4" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "RY80x59J8Ham" + }, + "source": [ + "df_A4['salario'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dgv2Zz07Kqfj" + }, + "source": [ + "#### Apply pd.get_dummies()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WSZRIEs6K5sP" + }, + "source": [ + "dummies = pd.get_dummies(df_A4['salario'])\n", + "df_A4 = pd.concat([df_A4, dummies], axis = 1)\n", + "df_A4" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CY8GZ-HlNOgm" + }, + "source": [ + "# **Wrap Up**\n", + "* Use MinMaxScaler as the default transformation, since it only shifts and rescales the data and therefore preserves the shape of the distribution;\n", + "* Use RobustScaler if your data/columns/variables contain outliers and you want to reduce their effect. However, the best treatment is to study the outliers carefully and handle them appropriately;\n", + "* Use StandardScaler if your data/columns/variables follow a Normal distribution (or at least approximate it well)." 
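+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make the trade-offs above concrete, here is a minimal sketch that applies the three scalers to one toy column containing a single outlier. The array `x_demo` and the dataframe `df_compare` are illustrative assumptions, not part of the original notebook:" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler\n", + "\n", + "# Hypothetical toy column: four ordinary values and one outlier (100)\n", + "x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])\n", + "\n", + "# One column per scaler, side by side with the original values\n", + "df_compare = pd.DataFrame({\n", + "    'original': x_demo.ravel(),\n", + "    'MinMaxScaler': MinMaxScaler().fit_transform(x_demo).ravel(),\n", + "    'RobustScaler': RobustScaler().fit_transform(x_demo).ravel(),\n", + "    'StandardScaler': StandardScaler().fit_transform(x_demo).ravel()})\n", + "\n", + "df_compare" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "MinMaxScaler squeezes the four ordinary values into a narrow band near 0 because the outlier defines the maximum, while RobustScaler keeps them spread around 0 because the median and the interquartile range barely react to the outlier." 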
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mwh0alhdgrE3" + }, + "source": [ + "___\n", + "# **Exercises**\n", + "> For each of the dataframes below, apply the following steps:\n", + "\n", + "* Standardize the column names\n", + "  * Remove spaces from the column names;\n", + "  * Remove special characters from the column names;\n", + "  * Rename the columns with lower() (or upper());\n", + "* Apply the StandardScaler and MinMaxScaler transformations to each column of the dataframe;\n", + "* DataViz - Show the distribution of the columns so that we can compare the results before and after the transformations.\n", + "* Do the correlations between the columns change with the transformations?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hSTKrd992LtI" + }, + "source": [ + "## Exercise 1 - Iris --> **Solved**\n", + "* [Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you will find more information about the iris dataframe. Check it out." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mThqvGGr2Vuk" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "iris = load_iris()\n", + "X= iris['data']\n", + "y= iris['target']\n", + "\n", + "df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))\n", + "df_iris['target2'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eU5FaJhdYblP" + }, + "source": [ + "df_iris.columns = [c.replace(' ', '_') for c in df_iris.columns]\n", + "df_iris.columns = [c.replace('_(cm)', '') for c in df_iris.columns]\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "K9DPAakJZQHH" + }, + "source": [ + "df_iris.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YYYmVq68Y8bB" + }, + "source": [ + "# Apply the transformation:\n", + "df_iris_MinMaxScaler = MinMaxScaler().fit_transform(df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])\n", + "\n", + "# Convert to a dataframe:\n", + "df_iris_MinMaxScaler = pd.DataFrame(df_iris_MinMaxScaler, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])\n", + "\n", + "# Plot\n", + "df_iris_MinMaxScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "caFkC6oCmUKK" + }, + "source": [ + "## Exercise 2 - Breast Cancer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vhOM-Z9zmf-f" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "cancer = load_breast_cancer()\n", + "X= cancer['data']\n", + "y= cancer['target']\n", + "\n", + "df_A1_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n", + "df_A1_cancer['target'] = df_A1_cancer['target'].map({0: 'malignant', 1: 'benign'})\n", + "df_A1_cancer.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1qruqUDqnvMc" + }, + "source": [ + "## Exercise 3 - Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "trxK8YXNnsam" + }, + "source": [ + "# Note: load_boston was removed in scikit-learn 1.2, so this cell only runs on older versions.\n", + "from sklearn.datasets import load_boston\n", + "\n", + "boston = load_boston()\n", + "X= boston['data']\n", + "y= boston['target']\n", + "\n", + "df_A1_boston = pd.DataFrame(np.c_[X, y], columns= np.append(boston['feature_names'], ['target']))\n", + "df_A1_boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nzu0Dz33c8ds" + }, + "source": [ + "## Exercise 4 - Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "d6ahBZmqc_-1" + }, + "source": [ + "from 
sklearn.datasets import load_diabetes\n", + "\n", + "diabetes = load_diabetes()\n", + "X= diabetes['data']\n", + "y= diabetes['target']\n", + "\n", + "df_A1_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))\n", + "df_A1_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NyunIr6oaWEl" + }, + "source": [ + "## Exercise 6 - 120 years of Olympic history: athletes and results\n", + "* [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)\n", + "  * Handle the variables 'sex', 'season', 'team', 'city', 'sport' and 'medal' appropriately;\n", + "  * Apply the transformations we have just studied to the numeric columns 'height' and 'weight'. Watch out for the missing values in these variables!\n", + "  * Check/assess the impact of the outliers in these columns.\n", + "  * In this case, which transformation is the most appropriate given the outliers?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hB_riY7ID0MV" + }, + "source": [ + "from google.colab import drive\n", + "\n", + "drive.mount('/content/drive')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1i3KI-M1Ds2U" + }, + "source": [ + "df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/athlete_events.zip', compression='zip')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lVKFY-FV_2Jx" + }, + "source": [ + "# To read the dataframe in a local Jupyter Notebook, use the code below\n", + "url = r'C:\\Users\\81689004720\\Desktop\\Python_Sufis\\Python - Avançado - Nelio\\athlete_events.csv'\n", + "df_olimpiadas = pd.read_csv(url)\n", + "\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o5fDp1Ib_Dg8" + }, + "source": [ + "# WOE - Weight Of Evidence\n", + "* The advantages of the WOE transformation are:\n", + "  * It handles NaNs well;\n", + "  * It handles outliers well;\n", + "  * The transformation is based on the logarithm of the ratio between two distributions.\n", + "  * With an appropriate binning technique, it can establish a monotonic (increasing or decreasing) relationship between the dependent and the independent variable." 
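+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A minimal sketch of how WOE could be computed by hand with pandas, assuming a binary target and the convention WOE = ln(% of events / % of non-events); the opposite sign convention is also in use. The dataframe `df_woe` and its columns `category` and `target` are hypothetical toy data, not part of the original notebook:" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Hypothetical toy data: one categorical feature and a binary target\n", + "df_woe = pd.DataFrame({\n", + "    'category': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],\n", + "    'target': [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]})\n", + "\n", + "# Events (target = 1) and non-events (target = 0) per category\n", + "tab = df_woe.groupby('category')['target'].agg(events = 'sum', total = 'count')\n", + "tab['non_events'] = tab['total'] - tab['events']\n", + "\n", + "# Share of all events / all non-events that falls in each category\n", + "tab['pct_events'] = tab['events'] / tab['events'].sum()\n", + "tab['pct_non_events'] = tab['non_events'] / tab['non_events'].sum()\n", + "\n", + "# WOE under the assumed convention\n", + "tab['woe'] = np.log(tab['pct_events'] / tab['pct_non_events'])\n", + "tab" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Under this convention, categories with a positive WOE concentrate relatively more events than non-events, and categories with a negative WOE the opposite. A category with zero events or zero non-events would need smoothing, which this sketch does not handle." 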
+ ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_04__3DP_3_Data_Transformation_hs2.ipynb b/Notebooks/NB10_04__3DP_3_Data_Transformation_hs2.ipynb new file mode 100644 index 000000000..8af225e60 --- /dev/null +++ b/Notebooks/NB10_04__3DP_3_Data_Transformation_hs2.ipynb @@ -0,0 +1,1580 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB10_04__3DP_3_Data_Transformation.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5CgDLvphxfcX" + }, + "source": [ + "

3DP_3 - DATA TRANSFORMATION

\n", + "\n", + "* **Goal**: Prepare the data for Machine Learning." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PvW689ZBxbxH" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> See the **Table of contents**.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GNiuYCCxGe8v" + }, + "source": [ + "# **Improvements for this session**\n", + "* Develop the section on WOE." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-TdSY74U0XS9" + }, + "source": [ + "___\n", + "# **References**\n", + "* [Why, How and When to Scale your Features](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e)\n", + "* [Demonstrating the different strategies of KBinsDiscretizer](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html#sphx-glr-auto-examples-preprocessing-plot-discretization-strategies-py)\n", + "* [Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?](https://medium.com/@contactsunny/why-do-we-need-feature-scaling-in-machine-learning-and-how-to-do-it-using-scikit-learn-d8314206fe73)\n", + "* [Importance of Feature Scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py) --> Very important, because it demonstrates the effects and the importance of transforming the numeric columns.\n", + "* [Feature discretization](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html#sphx-glr-auto-examples-preprocessing-plot-discretization-classification-py) --> Shows the impact on model accuracy with and without discretization. In other words, discretizing can make sense!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l9DGifbWSmW3" + }, + "source": [ + "___\n", + "# **Machine Learning with Python (Scikit-Learn)**\n", + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vg82Iouo_Qm2" + }, + "source": [ + "# Why do scaling (Scale), standardizing (Standardize) and normalizing (Normalize) matter?\n", + "* Because many **Machine Learning** algorithms perform better or converge faster when the attributes/columns/variables are on the same scale and have a distribution \"close\" to Normal." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q-chlATnKSza" + }, + "source": [ + "## Load the (generic) Python libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kQGVQB18-tM_" + }, + "source": [ + "!pip install category_encoders" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7FJxrZckYxk6" + }, + "source": [ + "import pandas as pd\n", + "\n", + "import numpy as np\n", + "from sklearn import preprocessing\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "%matplotlib inline\n", + "\n", + "import category_encoders as ce # library used to apply WOE (Weight Of Evidence) and assess the importance of the attributes\n", + "\n", + "# remove warnings to keep notebook clean\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CyuWQM2NTMls" + }, + "source": [ + "pd.options.display.float_format = '{:.2f}'.format" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R0fuDyI8_UPf" + }, + "source": [ + "## Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9oRWtarakgMY" + }, + "source": [ + "### Randomly generated dataframe - variables with Normal distribution" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7BXPXo3k0VDI" + }, + "source": [ + "np.random.seed(20111974)\n", + "\n", + "i_N = 10000\n", + "\n", + "df_A1 = pd.DataFrame({\n", + "    'coluna1': np.random.normal(0, 2, i_N), # Note that the columns have different means\n", + "    'coluna2': np.random.normal(50, 3, i_N),\n", + "    'coluna3': np.random.normal(-5, 5, i_N),\n", + "    'coluna4': np.random.normal(-10, 10, i_N)\n", + "})\n", + "\n", + "df_A1.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "93ST1JnoRZKm" + }, + "source": [ + "**Tip**: We can use other distributions (if we want), such as the Exponential (shown below)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XUqjo5QcQH99" + }, + "source": [ + "np.random.seed(20111974)\n", + "\n", + "df_A2 = pd.DataFrame({\n", + "    'coluna1': np.random.normal(0, 2, i_N),\n", + "    'coluna2': np.random.normal(50, 3, i_N),\n", + "    'coluna3': np.random.exponential(5, i_N), # coluna3 has an Exponential distribution\n", + "    'coluna4': np.random.normal(-10, 10, i_N)\n", + "})\n", + "\n", + "df_A2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J8MZNLbUkp8R" + }, + "source": [ + "### Randomly generated dataframe 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BR-fDDujcTup" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "\n", + "dados, classe = make_classification(n_samples = i_N, n_features = 4, n_informative = 3, n_redundant = 1, n_classes = 3)\n", + "\n", + "df_A3 = pd.DataFrame({'coluna1': dados[:,0],\n", + "                      'coluna2':dados[:,1],\n", + "                      'coluna3':dados[:,2],\n", + "                      'coluna4':dados[:,3]}) #, 'coluna5':classe})\n", + "\n", + "df_A3.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zq1cnpwLKvjS" + }, + "source": [ + "df_A4 = pd.DataFrame({\n", + "    'coluna1': np.random.beta(5, 1, i_N) * 25,\n", + "    'coluna2': np.random.exponential(5, i_N),\n", + "    'coluna3': np.random.normal(10, 2, i_N),\n", + "    'coluna4': np.random.normal(10, 10, i_N),\n", + "})\n", + "\n", + "df_A4.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O7sXQjvYRfhb" + }, + "source": [ + "#### Draw samples for comparison" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rjVHsnnHRkIo" + }, + "source": [ + "df_A1_test = df_A1.sample(n = 100)\n", + "df_A2_test = df_A2.sample(n = 100)\n", + "df_A3_test = df_A3.sample(n = 100)\n", + "df_A4_test = df_A4.sample(n = 100)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t0v0uXFRl-yG" + }, + "source": [ + "___\n", + "# **Transformations**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pkzTO0fdz93b" + }, + "source": [ + "## (1) StandardScaler\n", + "* StandardScaler is the transformation that centers the data by removing the mean (of the data) and then rescales it by dividing by the standard deviation;\n", + "* After the transformation, the data have mean zero and standard deviation 1;\n", + "* **Assumes that the data (the columns to be transformed) are approximately normally distributed**;\n", + "* If the data do not follow a Normal distribution, this is **NOT** a good transformation to apply.\n", + "\n", + "$$z_{i}= \\frac{x_{i}-mean(x)}{std(x)}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v1UOOWeQ0R_Y" + }, + "source": [ + "### Example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y1Lzx3xN6wpZ" + }, + "source": [ + "df_A3.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9cPq_7Vu2HCS" + }, + "source": [ + "Histogram:" + ] + }, 
+ { + "cell_type": "code", + "metadata": { + "id": "ZYW9WwBC3hd_" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A1['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Add a title\n", + "plt.title('Coluna3 - Normal distribution')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h8ogcQvvT5zK" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A2['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Add a title\n", + "plt.title('Coluna3 - Exponential distribution')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RrgxkESc-Uaq" + }, + "source": [ + "Consider the following plot:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U7dHTF1W-Xsn" + }, + "source": [ + "df_A1.plot(kind = 'kde') # KDE (kernel density estimate) helps us visualize the distribution of the data, analogous to a histogram." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hMS72n14-hDO" + }, + "source": [ + "What is the interpretation of the plot above?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "izqGNcNILdaX" + }, + "source": [ + "df_A1.plot()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZEkAqlZg-p0v" + }, + "source": [ + "Next, the StandardScaler transformation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N4u3T_BX-oc_" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "voFQ4odSzzPZ" + }, + "source": [ + "Ideally, we have an array with the predictors, in the following form:\n", + "X = [coluna1, coluna2, ..., colunaN]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rPa4-SCt-ynX" + }, + "source": [ + "np.set_printoptions(precision = 3)\n", + "\n", + "A1_scale = StandardScaler().fit_transform(df_A1) # fit() and transform() combined in a single call\n", + "\n", + "A1_scale_fit = StandardScaler().fit(df_A1) # Apply fit() on its own\n", + "A1_scale_transform = A1_scale_fit.transform(df_A1) # Apply transform() on its own\n", + "A1_scale_fit_transform = StandardScaler().fit(df_A1).transform(df_A1) # Apply fit().transform() chained\n", + "\n", + "A2_scale = StandardScaler().fit_transform(df_A2)\n", + "\n", + "A3_scale = StandardScaler().fit_transform(df_A3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a8tZJgbOplDd" + }, + "source": [ + "## Save the parameters of StandardScaler and of the other scalers --> To be added here!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ERfRIz-njqcD" + }, + "source": [ + "A1_scale_fit.scale_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ioZ_IN3Z6d39" + }, + "source": [ + "Note below that A1_scale = A1_scale_transform = A1_scale_fit_transform --> they are multidimensional NumPy arrays!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v4xQR4cu5D1J" + }, + "source": [ + "A1_scale" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j6GtN2KF4E_A" + }, + "source": [ + "A1_scale_transform" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0q2bvSqb6T4g" + }, + "source": [ + "A1_scale_fit_transform" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WIhaErnA46Fi" + }, + "source": [ + "Converting to a dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HAhRvPze44JW" + }, + "source": [ + "df_A1_scale = pd.DataFrame(A1_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "df_A2_scale = pd.DataFrame(A2_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "df_A3_scale = pd.DataFrame(A3_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bmQp8wDO_E88" + }, + "source": [ + "Now compare with the new plot below --> the transformed columns of df_A1 have mean 0 and standard deviation 1; since they were Normal to begin with, they now follow a Normal(0, 1) distribution:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "csfqRhDH2zUb" + }, + "source": [ + "df_A1.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-krh1pDg22RF" + }, + "source": [ + "df_A1_scale.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2fTPWsm_Hq3" + }, + "source": [ + "df_A1_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9oN-829l3277" + }, + "source": [ + "df_A2.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jqh8L5BeUHT-" + }, + "source": 
[ + "df_A2_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Yvz6O1zk4XNE" + }, + "source": [ + "df_A3.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ffU-fQxCUSmm" + }, + "source": [ + "df_A3_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y24MOLL83w9j" + }, + "source": [ + "### Exercise: Compute the mean and the standard deviation." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1Aa25gVlSdOi" + }, + "source": [ + "df_A1.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EXZUiZImSmOE" + }, + "source": [ + "df_A1_scale.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uIUQw5dpRwvA" + }, + "source": [ + "#### Column correlations\n", + "* Note that the correlations between the variables are not changed by the transformation." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uj1UerjORq9q" + }, + "source": [ + "df_A1.corr()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jp6vPK0aR_p0" + }, + "source": [ + "df_A1_scale.corr()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4fuURrao_M0c" + }, + "source": [ + "What is the conclusion?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f0A9U7rs_RAT" + }, + "source": [ + "## (2) MinMaxScaler\n", + "* **A very popular and widely used transformation**.\n", + "* Rescales the data to the interval [0, 1];\n", + "* If StandardScaler is not applicable, this transformation works well.\n", + "* Sensitive to _outliers_. 
Ideally, the _outliers_ should therefore be treated beforehand.\n", + "* A similar transformation is MaxAbsScaler(), which divides each column by its maximum absolute value, so the data fall in the interval [-1, 1]; it does not center the data.\n", + "* Does not correct skewness.\n", + "\n", + "$$z_{i}= \\frac{x_{i}-min(x)}{max(x)-min(x)}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C0HbeuP-AU_p" + }, + "source": [ + "### Example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mgeLckzxAWaC" + }, + "source": [ + "from sklearn.preprocessing import MinMaxScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S_W9bTO2AbEg" + }, + "source": [ + "df_A1.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PJRFbUpBAg5J" + }, + "source": [ + "A1_MinMaxScaler = MinMaxScaler().fit_transform(df_A1)\n", + "df_A1_MinMaxScaler = pd.DataFrame(A1_MinMaxScaler, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "\n", + "# Plot\n", + "df_A1_MinMaxScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7g8GA4LTA40U" + }, + "source": [ + "What is the conclusion?" 
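One way to approach the question above is to apply the min-max formula by hand. The following is a minimal sketch in plain Python (standard library only); `min_max_scale` is a hypothetical helper written for illustration and is not part of scikit-learn. It also hints at why outliers matter: one extreme value pushes every other point toward 0.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero.
        # Mapping everything to 0.0 is an illustrative choice.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 1000]

print(min_max_scale(clean))         # evenly spread over [0, 1]
print(min_max_scale(with_outlier))  # the first four values are squeezed near 0
```

In the second list the range is dominated by the single large value, so the remaining points occupy only a small fraction of the [0, 1] interval.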
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Z6D3vfnB9Nm" + }, + "source": [ + "## (3) RobustScaler\n", + "* A well-suited transformation for data with **outliers**.\n", + "* Removes the median and divides by the interquartile range ($Q_{3} - Q_{1}$), statistics that are much less affected by extreme values than the mean and the standard deviation.\n", + "\n", + "$$z_{i}= \\frac{x_{i}-median(x)}{Q_{3}(x)-Q_{1}(x)}$$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "m3oyuxLeCW1D" + }, + "source": [ + "df_A1.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zeDF7-w_CcBy" + }, + "source": [ + "from sklearn.preprocessing import RobustScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vLoqSKijCf2v" + }, + "source": [ + "A1_RobustScaler = RobustScaler().fit_transform(df_A1)\n", + "df_A1_RobustScaler = pd.DataFrame(A1_RobustScaler, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "\n", + "# Plot\n", + "df_A1_RobustScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_D-7ik2xXpU" + }, + "source": [ + "### **Insight**: Randomly generate columns/variables with Gamma, Beta, Normal, Exponential, etc. distributions and evaluate the impact of the various transformations." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GxIOPmSYwX-e" + }, + "source": [ + "# **Wrap Up**\n", + "* Use MinMaxScaler as the default transformation, since it rescales the data without distorting them;\n", + "* Use RobustScaler if your data/column/variable has **outliers** and you want to reduce the effect/impact of these **outliers**. 
Even so, the best treatment is to study the **outliers** carefully and handle them appropriately;\n", + "* Use StandardScaler if your data/columns/variables follow a Normal distribution (or at least approximate it well)." 
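The RobustScaler recommendation above can be illustrated with a small sketch. scikit-learn documents RobustScaler as removing the median and scaling by the interquartile range (IQR); the code below mimics that idea in plain Python. `robust_scale` is a hypothetical helper, and computing the quartiles with `statistics.quantiles(..., method="inclusive")` is an assumption about the interpolation rule that may differ slightly from a given library.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    median = statistics.median(values)
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate column: avoid dividing by zero, only center the data
        return [v - median for v in values]
    return [(v - median) / iqr for v in values]


clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]

print(robust_scale(clean))
print(robust_scale(with_outlier))
```

In this example the four ordinary points receive the same scaled values in both lists, because the median and the quartiles do not move when the largest value becomes an outlier.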
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-YVMgt-WEFif" + }, + "source": [ + "## Encoding Variáveis Categóricas" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xHYvLc8T_jxQ" + }, + "source": [ + "### Encoding Variáveis Ordinais\n", + "* Exemplo: Variáveis com valores ordinais: baixo, médio ou alto." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i1BgGiGdSTcG" + }, + "source": [ + "#### Dataframe-exemplo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kdVahfJAEkuO" + }, + "source": [ + "# Aqui vou usar a função randint - Retorna números inteiros aleatórios incluindo o número inferior e excluindo o superior.\n", + "\n", + "l_idade = [\n", + " np.random.randint(20, 40),\n", + " np.random.randint(20, 40),\n", + " np.random.randint(20, 40), \n", + " np.random.randint(20, 40),\n", + " np.random.randint(20, 40),\n", + " np.random.randint(20, 40),\n", + " np.random.randint(20, 40),\n", + " np.random.randint(20, 40),\n", + " np.random.randint(20, 40),\n", + " np.random.randint(20, 40)\n", + " ]\n", + "\n", + "l_salario = ['baixo', 'medio', 'alto']\n", + "l_salario2 = np.random.choice(l_salario, 10, p = [0.6, 0.3, 0.1])\n", + "\n", + "df_A5 = pd.DataFrame({\n", + " 'idade': l_idade,\n", + " 'salario': l_salario2})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "m_15P2eUHSBY" + }, + "source": [ + "df_A5" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R1g9pEuyHe2q" + }, + "source": [ + "Neste exemplo, vamos redefinir a variável categórical ordinal 'Salario' da seguinte forma:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bkwFuEa8HnMV" + }, + "source": [ + "df_A5['salario_cat'] = df_A5['salario'].map({'baixo': 1, 'medio': 2, 'alto': 3})\n", + "df_A5" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DlaIFiWIIPAl" 
+ }, + "source": [ + "### Encoding Nominal Variables\n", + "* Example: variables with nominal values, such as sex (female, male).\n", + "\n", + "* Use One-Hot Encoding or pd.get_dummies()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ffNoJQbgJRoY" + }, + "source": [ + "We will use the dataframe created in the previous step:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PMCoUWZOI7c0" + }, + "source": [ + "df_A5['salario'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bdIEyBkaJeN8" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder, OneHotEncoder" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4MwK4cUEKeK4" + }, + "source": [ + "#### Apply LabelEncoder()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6X6VXDsHJiII" + }, + "source": [ + "le = LabelEncoder()\n", + "df_A5['salario_le'] = le.fit_transform(df_A5['salario'])\n", + "df_A5" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "RY80x59J8Ham" + }, + "source": [ + "df_A5['salario'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dgv2Zz07Kqfj" + }, + "source": [ + "#### Apply pd.get_dummies()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WSZRIEs6K5sP" + }, + "source": [ + "dummies = pd.get_dummies(df_A5['salario'])\n", + "df_A5 = pd.concat([df_A5, dummies], axis = 1)\n", + "df_A5" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WeXKQ3lkg8qO" + }, + "source": [ + "# Power Transformations\n", + "* They aim to transform the probability distribution of a variable/column so that it becomes closer to Normal. 
This is done by correcting the skewness and stabilizing the variance of the distribution.\n", + "* Examples of Power Transformations:\n", + "  * log;\n", + "    * Helps with skewed distributions;\n", + "    * Useful for strictly positive data (no negatives or zeros);\n", + "  * square root;\n", + "  * cube root;\n", + "  * Box-Cox transformation and\n", + "  * Yeo-Johnson transformation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BQDf9EfzRXYC" + }, + "source": [ + "### Yeo-Johnson Transformation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o9q9lxbYhlKE" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A2['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Add title and labels\n", + "plt.title('Histograma da coluna3 - Distribuição Exponencial')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QJ91Stekh8JO" + }, + "source": [ + "from sklearn.preprocessing import PowerTransformer" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "RCGFIeszkLVK" + }, + "source": [ + "df_A2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "XOQYdkyzi-PL" + }, + "source": [ + "yeo_johnson = PowerTransformer(method = 'yeo-johnson', standardize = True)\n", + "A2_yeo_johnson = yeo_johnson.fit_transform(df_A2)\n", + "df_A2_yeo_johnson = pd.DataFrame(A2_yeo_johnson, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "df_A2_yeo_johnson.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "T_5beALqkao_" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A2_yeo_johnson['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Add title and labels\n", + "plt.title('Coluna3 - Distribuição aproximadamente 
Normal')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BwEhZBARk7oA" + }, + "source": [ + "## Box-Cox Transformation\n", + "* Proposed by the statisticians George Box and David Cox;\n", + "* The column/variable/attribute cannot contain negative numbers or zeros, that is, $x_{i} > 0$.\n", + "\n", + "* Let $w_{i}$ be the transformed variable and $x_{i}$ the variable we want to transform:\n", + "  * If $\\lambda = 0$ --> $w_{i}^{(\\lambda)} = \\log(x_{i})$;\n", + "  * If $\\lambda \\neq 0$ --> $w_{i}^{(\\lambda)} = \\frac{x_{i}^{\\lambda}-1}{\\lambda}$;\n", + "* If $\\lambda = 1$, then $w_{i} = x_{i} - 1$, a simple shift that leaves the shape of the distribution unchanged, so an optimal $\\lambda$ close to 1 suggests that the Box-Cox transformation is not needed.\n", + "* We need to choose the value of $\\lambda$ that gives the best approximation to the Normal distribution.\n", + "* Given a 1D array, scipy.stats.boxcox(array_1D) returns the transformed data together with the optimal $\\lambda$, that is, the value that best fits your data (estimated by maximum likelihood).\n", + "* To map the data back to the original values, use scipy.special.inv_boxcox(y, lambda).\n", + "* What are the disadvantages of the transformation?\n", + "  * The transformed values lose their direct interpretation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AZXkW4Baz87T" + }, + "source": [ + "Required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QzWRS8chz8V_" + }, + "source": [ + "import numpy as np \n", + "from scipy import stats \n", + "import matplotlib.pyplot as plt \n", + "import seaborn as sns  # used by compara_graficos below" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_lwyGVDMzC4y" + }, + "source": [ + "### Example 1\n", + "* The data follow an Exponential distribution." 
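Before the plotting examples, the transform itself can be sketched for a fixed $\lambda$ in plain Python, using the standard Box-Cox definition: $(x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $\log(x)$ for $\lambda = 0$. `box_cox` and `inv_box_cox` are hypothetical helpers written for illustration; they do not estimate $\lambda$, which `scipy.stats.boxcox` does when no $\lambda$ is passed.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a single positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def inv_box_cox(w, lam):
    """Invert box_cox: recover x from the transformed value w."""
    if lam == 0:
        return math.exp(w)
    return (lam * w + 1) ** (1 / lam)


values = [0.5, 1.0, 2.0, 4.0]
for lam in (0, 0.5, 1):
    transformed = [box_cox(v, lam) for v in values]
    restored = [inv_box_cox(w, lam) for w in transformed]
    print(lam, transformed, restored)
```

With `lam = 1` the transformed values are simply the inputs minus one, which is why a fitted $\lambda$ near 1 indicates that the transformation adds little.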
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Tx5J1L8Az4qQ" + }, + "source": [ + "# Gráficos: \n", + "def compara_graficos(y, w, lambda_box_cox):\n", + " fig, ax = plt.subplots(1, 2) \n", + " \n", + " # Gráfico das distribuições originais e transformada\n", + " sns.distplot(y, hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 2}, label = \"Non-Normal\", color =\"green\", ax = ax[0]) \n", + " sns.distplot(w, hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 2}, label = \"Normal\", color =\"green\", ax = ax[1]) \n", + " \n", + " # Legendas \n", + " plt.legend(loc = \"upper right\") \n", + " \n", + " # Redimensionando os sub-gráficos \n", + " fig.set_figheight(5) \n", + " fig.set_figwidth(10) \n", + " \n", + " print(f\"Valor de Lambda usado na transformação: {lambda_box_cox}\") " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cf5PPVl9Rr5H" + }, + "source": [ + "Transforma os dados/distribuições:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xWLsXEBB0CQO" + }, + "source": [ + "# Gerando dados com distribuição Exponencial\n", + "distribuicao_exponencial = np.random.exponential(size = 1000) \n", + "\n", + "# Dados transformados \n", + "box_cox, lambda_box_cox = stats.boxcox(distribuicao_exponencial) \n", + "compara_graficos(distribuicao_exponencial, box_cox, lambda_box_cox)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F9jObBLCZh19" + }, + "source": [ + "### Exemplo 2\n", + "* Dados possuem distribuição Beta.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CePoB8imzPRQ" + }, + "source": [ + "# Gerando dados com distribuição Exponencial\n", + "distribuicao_beta = np.random.beta(1, 3, 1000)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j1CwLPm6zRx2" + }, + "source": [ + "# transform training data & save lambda 
value \n", + "box_cox, lambda_box_cox = stats.boxcox(distribuicao_beta) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "snd63l9U0ugI" + }, + "source": [ + "compara_graficos(distribuicao_beta, box_cox, lambda_box_cox)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CQ6o9pOkPjUT" + }, + "source": [ + "### Transformação log\n", + "* De forma geral, a transformação **log** trata de dados skewed, tornando os dados (ou a distribuição dos dados) mais \"normal\";\n", + "* Se os dados forem de alguma forma normalmente distribuídos, então nada muda." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DrsXETsRPupd" + }, + "source": [ + "# Gerando dados com distribuição Exponencial\n", + "distribuicao_beta = np.random.beta(1, 3, 1000)\n", + "\n", + "transformacao_log = np.log(distribuicao_beta)\n", + "compara_graficos(distribuicao_beta, transformacao_log, 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mwh0alhdgrE3" + }, + "source": [ + "___\n", + "# **Exercícios**\n", + "> Para cada um dos dataframes a seguir, aplique os seguintes steps:\n", + "\n", + "* Padronizar o nome das colunas\n", + " * Eliminar espaços entre os nomes das colunas;\n", + " * Eliminar caracteres especiais dos nomes das colunas;\n", + " * Renomear as colunas com lower() (ou upper());\n", + "* Aplicar a trasformação StandardScaler e MinMaxScaler em cada uma das colunas do dataframe;\n", + "* DataViz - Mostrar a distribuição das colunas para compararmos os resultados antes e depois das transformações.\n", + "* As correlações das colunas mudam com as transformações?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hSTKrd992LtI" + }, + "source": [ + "## Exercício 1 - Iris --> **Resolvido**\n", + "* [Aqui](https://en.wikipedia.org/wiki/Iris_flower_data_set) você obterá mais informações sobre o dataframe iris. Confira." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mThqvGGr2Vuk" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "iris = load_iris()\n", + "X= iris['data']\n", + "y= iris['target']\n", + "\n", + "df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))\n", + "df_iris['target2'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eU5FaJhdYblP" + }, + "source": [ + "df_iris.columns = [c.replace(' ', '_') for c in df_iris.columns]\n", + "df_iris.columns = [c.replace('_(cm)', '') for c in df_iris.columns]\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "K9DPAakJZQHH" + }, + "source": [ + "df_iris.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YYYmVq68Y8bB" + }, + "source": [ + "# Aplica a transformação:\n", + "df_iris_MinMaxScaler = MinMaxScaler().fit_transform(df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])\n", + "\n", + "# Transformando em Dataframe:\n", + "df_iris_MinMaxScaler = pd.DataFrame(df_iris_MinMaxScaler, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])\n", + "\n", + "# Gráfico\n", + "df_iris_MinMaxScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DKsHcjd77YZT" + }, + "source": [ + "### Aplicar as outras transformações e comparar os gráficos." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "caFkC6oCmUKK" + }, + "source": [ + "## Exercício 2 - Breast Cancer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vhOM-Z9zmf-f" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "cancer = load_breast_cancer()\n", + "X= cancer['data']\n", + "y= cancer['target']\n", + "\n", + "df_A1_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n", + "df_A1_cancer['target'] = df_A1_cancer['target'].map({0: 'malign', 1: 'benign'})\n", + "df_A1_cancer.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1qruqUDqnvMc" + }, + "source": [ + "## Exercício 3 - Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "trxK8YXNnsam" + }, + "source": [ + "from sklearn.datasets import load_boston\n", + "\n", + "boston = load_boston()\n", + "X= boston['data']\n", + "y= boston['target']\n", + "\n", + "df_A1_boston = pd.DataFrame(np.c_[X, y], columns= np.append(boston['feature_names'], ['target']))\n", + "df_A1_boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nzu0Dz33c8ds" + }, + "source": [ + "## Exercícios 4 - Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "d6ahBZmqc_-1" + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "diabetes = load_diabetes()\n", + "X= diabetes['data']\n", + "y= diabetes['target']\n", + "\n", + "df_A1_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))\n", + "df_A1_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NyunIr6oaWEl" + }, + "source": [ + "## Exercícios 5 - 120 years of Olympic history: athletes and results\n", + "* [120 
years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)\n", + " * Trate adequadamente as variáveis 'sex', 'season', 'team', 'city', 'sport' e 'medal';\n", + " * Aplique as transformações que acabamos de estudar nos campos/colunas numéricas 'height' e 'weight'. Cuidado com os Missing Values contidos nas variáveis!\n", + " * Verifique/avalie o impacto dos outliers nestas colunas.\n", + " * Neste caso, qual transformação é mais adequado diante dos outliers?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8V7OCd3G9zj1" + }, + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z3g4dqM190mj" + }, + "source": [ + "url = '/content/drive/My Drive/Datasets4ML/athlete_events.csv'\n", + "df = pd.read_csv(url)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5ExKsjGmKaEx" + }, + "source": [ + "## Exercício 6 - FIFA\n", + "* Aplique as transformações MinMaxScaler, RobustScaler e StandardScaler às colunas numéricas do dataframe FIFA_algumas_features.csv.\n", + "* Para as colunas categóricas, aplique a transformação mais adequada." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zjukr52HK3S_" + }, + "source": [ + "import pandas as pd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S41tXs2EKlHN" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/FIFA_algumas_features.csv?token=AGDJQ62CSW5KBLZNXH4TULK7SXICE'\n", + "\n", + "df = pd.read_csv(url, index_col = 'ID')\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o5fDp1Ib_Dg8" + }, + "source": [ + "# WOE - Weight Of Evidence\n", + "* As vantagens da transformação WOE são\n", + " * Lida bem com NaN's;\n", + " * Lida bem com outliers;\n", + " * A transformação é baseada no valor logarítmico das distribuições.\n", + " * Usando a técnica de binning apropriada, pode estabelecer uma relação monotônica (aumentar ou diminuir) entre a variável dependente e independente." + ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_04__3DP_3_Data_Transformation_hs3.ipynb b/Notebooks/NB10_04__3DP_3_Data_Transformation_hs3.ipynb new file mode 100644 index 000000000..0a188359e --- /dev/null +++ b/Notebooks/NB10_04__3DP_3_Data_Transformation_hs3.ipynb @@ -0,0 +1,1642 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB10_04__3DP_3_Data_Transformation.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5CgDLvphxfcX" + }, + "source": [ + "

3DP_3 - DATA TRANSFORMATION

\n", + "\n", + "* **Objetivo**: Preparar os dados para o Machine Learning." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PvW689ZBxbxH" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> Consulte **Table of contents**.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GNiuYCCxGe8v" + }, + "source": [ + "# **Melhorias da sessão**\n", + "* Desenvolver a sessão sobe WOE." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-TdSY74U0XS9" + }, + "source": [ + "___\n", + "# **Referências**\n", + "* [Why, How and When to Scale your Features](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e)\n", + "* [Demonstrating the different strategies of KBinsDiscretizer](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html#sphx-glr-auto-examples-preprocessing-plot-discretization-strategies-py);\n", + "* [Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?](https://medium.com/@contactsunny/why-do-we-need-feature-scaling-in-machine-learning-and-how-to-do-it-using-scikit-learn-d8314206fe73)\n", + "* [Importance of Feature Scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py) --> Muito importante por demonstrar os efeitos e a importância de se transformar as colunas numéricas.\n", + "* [Feature discretization](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html#sphx-glr-auto-examples-preprocessing-plot-discretization-classification-py) --> Mostra o impacto na acurácia dos modelos com e sem discretização. Ou seja, discretizar faz sentido!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l9DGifbWSmW3" + }, + "source": [ + "___\n", + "# **Machine Learning com Python (Scikit-Learn)**\n", + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vg82Iouo_Qm2" + }, + "source": [ + "# Porque dimensionar (Scale), padronizar (Standardize) e normalizar (Normalize) importa?\n", + "* Porque muitos algoritmos de **Machine Learning** performam melhor ou convergem mais rápido quando os atributos/colunas/variáveis estão na mesma escala e possuem distribuição \"próxima\" da Normal." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q-chlATnKSza" + }, + "source": [ + "## Carregar as bibliotecas (genéricas) Python" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kQGVQB18-tM_" + }, + "source": [ + "!pip install category_encoders\n", + "!pip install update" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7FJxrZckYxk6" + }, + "source": [ + "import pandas as pd\n", + "\n", + "import numpy as np\n", + "from sklearn import preprocessing\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "%matplotlib inline\n", + "\n", + "import category_encoders as ce # library para aplicação do WOE - Weight Of Evidence para avaliar importância dos atributos\n", + "\n", + "# remove warnings to keep notebook clean\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CyuWQM2NTMls" + }, + "source": [ + "pd.options.display.float_format = '{:.2f}'.format" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R0fuDyI8_UPf" + }, + "source": [ + "## Carregar os dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"9oRWtarakgMY" + }, + "source": [ + "### Dataframe gerado aleatoriamente - variáveis com distribuição Normal" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7BXPXo3k0VDI" + }, + "source": [ + "np.random.seed(20111974)\n", + "\n", + "i_N = 10000\n", + "\n", + "df_A1 = pd.DataFrame({\n", + " 'coluna1': np.random.normal(0, 2, i_N), # Observem que a média das colunas são distintas\n", + " 'coluna2': np.random.normal(50, 3, i_N),\n", + " 'coluna3': np.random.normal(-5, 5, i_N),\n", + " 'coluna4': np.random.normal(-10, 10, i_N)\n", + "})\n", + "\n", + "df_A1.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "93ST1JnoRZKm" + }, + "source": [ + "**Dica**: Podemos usar outras distribuições (se quisermos), como a Exponential (mostrada abaixo)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XUqjo5QcQH99" + }, + "source": [ + "np.random.seed(20111974)\n", + "\n", + "df_A2 = pd.DataFrame({\n", + " 'coluna1': np.random.normal(0, 2, i_N),\n", + " 'coluna2': np.random.normal(50, 3, i_N),\n", + " 'coluna3': np.random.exponential(5, i_N), # coluna3 tem distribuição Exponential\n", + " 'coluna4': np.random.normal(-10, 10, i_N)\n", + "})\n", + "\n", + "df_A2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J8MZNLbUkp8R" + }, + "source": [ + "### Dataframe gerado aleatoriamente 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BR-fDDujcTup" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "\n", + "dados, classe = make_classification(n_samples = i_N, n_features = 4, n_informative = 3, n_redundant = 1, n_classes = 3)\n", + "\n", + "df_A3 = pd.DataFrame({'coluna1': dados[:,0],\n", + " 'coluna2':dados[:,1],\n", + " 'coluna3':dados[:,2],\n", + " 'coluna4':dados[:,3]}) #, 'coluna5':classe})\n", + "\n", + "df_A3.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "Zq1cnpwLKvjS" + }, + "source": [ + "df_A4 = pd.DataFrame({ \n", + " 'coluna1': np.random.beta(5, 1, i_N) * 25, \n", + " 'coluna2': np.random.exponential(5, i_N),\n", + " 'coluna3': np.random.normal(10, 2, i_N),\n", + " 'coluna4': np.random.normal(10, 10, i_N), \n", + "})\n", + "\n", + "df_A4.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O7sXQjvYRfhb" + }, + "source": [ + "#### Extração de amostras para compararmos" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rjVHsnnHRkIo" + }, + "source": [ + "df_A1_test = df_A1.sample(n = 100)\n", + "df_A2_test = df_A2.sample(n = 100)\n", + "df_A3_test = df_A3.sample(n = 100)\n", + "df_A4_test = df_A4.sample(n = 100)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t0v0uXFRl-yG" + }, + "source": [ + "___\n", + "# **Transformações**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pkzTO0fdz93b" + }, + "source": [ + "## (1) StandardScaler\n", + "* StandardScaler é a transformação que centraliza os dados através da remoção da média (dos dados) e, na sequência, redimensiona (scale) através da divisão pelo desvio-padrão;\n", + "* Após a transformação, os dados terão média zero e desvio-padrão 1;\n", + "* **Assume que os dados (as colunas a serem transformadas) são normalmente distribuidos**;\n", + "* Se os dados não possuem distribuição Normal, então esta **NÃO** é uma boa transformação a se aplicar.\n", + "\n", + "$$z_{i}= \\frac{x_{i}-mean(x)}{std(x)}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v1UOOWeQ0R_Y" + }, + "source": [ + "### Exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y1Lzx3xN6wpZ" + }, + "source": [ + "df_A3.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9cPq_7Vu2HCS" + }, + "source": [ + "Histograma:" + ] + }, 
+ { + "cell_type": "code", + "metadata": { + "id": "ZYW9WwBC3hd_" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A1['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Adiciona títulos e labels\n", + "plt.title('Coluna3 - Distribuição Normal')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h8ogcQvvT5zK" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A2['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Adiciona títulos e labels\n", + "plt.title('Coluna3 - Distribuição Exponencial')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RrgxkESc-Uaq" + }, + "source": [ + "Considere o gráfico a seguir:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U7dHTF1W-Xsn" + }, + "source": [ + "df_A1.plot(kind = 'kde') # KDE (= kernel Density Estimate) ajuda-nos a visualizar a distribuição dos dados, análogo ao histograma." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hMS72n14-hDO" + }, + "source": [ + "Qual a interpretação para o gráfico acima?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "izqGNcNILdaX" + }, + "source": [ + "df_A1.plot()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZEkAqlZg-p0v" + }, + "source": [ + "A seguir, a transformação StandardScaler:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N4u3T_BX-oc_" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "voFQ4odSzzPZ" + }, + "source": [ + "O ideal é termos um array com as preditoras, da seguinte forma:\n", + "X = [coluna1, coluna2, ..., colunaN]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rPa4-SCt-ynX" + }, + "source": [ + "np.set_printoptions(precision = 3)\n", + "\n", + "A1_scale = StandardScaler().fit_transform(df_A1) # Combinação dos métodos fit() + transform()\n", + "\n", + "A1_scale_fit = StandardScaler().fit(df_A1) # Aplica o fit() separadamente\n", + "A1_scale_transform = A1_scale_fit.transform(df_A1) # Aplica o transform() separadamente.\n", + "A1_scale_fit_transform = StandardScaler().fit(df_A1).transform(df_A1) # Aplica fit().transform() encadeado\n", + "\n", + "A2_scale = StandardScaler().fit_transform(df_A2)\n", + "\n", + "A3_scale = StandardScaler().fit_transform(df_A3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a8tZJgbOplDd" + }, + "source": [ + "## Salvar os parâmetros do StandardScaler e outros --> Colocar aqui!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ERfRIz-njqcD" + }, + "source": [ + "A1_scale_fit.scale_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ioZ_IN3Z6d39" + }, + "source": [ + "Observe abaixo que A1_scale = A1_scale_transform = A1_scale_fit_transform --> São arrays multidimensionais (do tipo NumPy)!" 
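The equivalence noted above follows from what `fit()` and `transform()` do: `fit()` only learns each column's mean and standard deviation, and `transform()` applies them. Below is a rough plain-Python sketch of that split for a single column, assuming the population standard deviation (ddof = 0); `TinyStandardScaler` is a hypothetical class written for illustration, not scikit-learn's implementation.

```python
import statistics


class TinyStandardScaler:
    """Single-column standardizer: z = (x - mean) / std."""

    def fit(self, values):
        # Learn the parameters from the data; nothing is transformed yet
        self.mean_ = statistics.fmean(values)
        # Population std (ddof = 0); fall back to 1.0 for constant columns
        # to avoid dividing by zero (an illustrative choice)
        self.scale_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        # Apply the parameters learned by fit() to any data, seen or unseen
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [3.0, 6.0]

scaler = TinyStandardScaler().fit(train)
print(scaler.mean_, scaler.scale_)
print(scaler.transform(train) == TinyStandardScaler().fit_transform(train))
print(scaler.transform(test))  # test data reuses the training mean and std
```

Keeping the fitted object around is what makes it possible to apply the same parameters later to new samples, such as the `df_A1_test` sample drawn earlier.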
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v4xQR4cu5D1J" + }, + "source": [ + "A1_scale" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j6GtN2KF4E_A" + }, + "source": [ + "A1_scale_transform" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0q2bvSqb6T4g" + }, + "source": [ + "A1_scale_fit_transform" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WIhaErnA46Fi" + }, + "source": [ + "Converting to a dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HAhRvPze44JW" + }, + "source": [ + "df_A1_scale = pd.DataFrame(A1_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "df_A2_scale = pd.DataFrame(A2_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "df_A3_scale = pd.DataFrame(A3_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bmQp8wDO_E88" + }, + "source": [ + "Now compare with the new plot below: every transformed column has mean 0 and standard deviation 1. StandardScaler only shifts and rescales, so the shape of each distribution is unchanged, and only columns that were already Normal become Normal(0, 1):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "csfqRhDH2zUb" + }, + "source": [ + "df_A1.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-krh1pDg22RF" + }, + "source": [ + "df_A1_scale.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2fTPWsm_Hq3" + }, + "source": [ + "df_A1_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9oN-829l3277" + }, + "source": [ + "df_A2.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jqh8L5BeUHT-" + }, + "source": 
[ + "df_A2_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Yvz6O1zk4XNE" + }, + "source": [ + "df_A3.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ffU-fQxCUSmm" + }, + "source": [ + "df_A3_scale.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y24MOLL83w9j" + }, + "source": [ + "### Exercise: compute the mean and the standard deviation." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1Aa25gVlSdOi" + }, + "source": [ + "df_A1.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EXZUiZImSmOE" + }, + "source": [ + "df_A1_scale.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uIUQw5dpRwvA" + }, + "source": [ + "#### Correlation between columns\n", + "* Note that the correlations between the variables are not changed by the transformations." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uj1UerjORq9q" + }, + "source": [ + "df_A1.corr()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jp6vPK0aR_p0" + }, + "source": [ + "df_A1_scale.corr()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4fuURrao_M0c" + }, + "source": [ + "What is the conclusion?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f0A9U7rs_RAT" + }, + "source": [ + "## (2) MinMaxScaler\n", + "* **A very popular and widely used transformation**.\n", + "* Rescales the data to the interval [0, 1];\n", + "* A reasonable choice when StandardScaler is not applicable;\n", + "* Sensitive to _outliers_, so ideally the _outliers_ should be treated beforehand;\n", + "* MaxAbsScaler() is a similar transformation: it divides each column by its maximum absolute value, which maps the data to the interval [-1, 1] without shifting or centering them;\n", + "* Does not correct skewness.\n", + "\n", + "$$z_{i}= \\frac{x_{i}-\\min(x)}{\\max(x)-\\min(x)}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C0HbeuP-AU_p" + }, + "source": [ + "### Example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mgeLckzxAWaC" + }, + "source": [ + "from sklearn.preprocessing import MinMaxScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S_W9bTO2AbEg" + }, + "source": [ + "df_A1.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PJRFbUpBAg5J" + }, + "source": [ + "A1_MinMaxScaler = MinMaxScaler().fit_transform(df_A1)\n", + "df_A1_MinMaxScaler = pd.DataFrame(A1_MinMaxScaler, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "\n", + "# Plot\n", + "df_A1_MinMaxScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7g8GA4LTA40U" + }, + "source": [ + "What is the conclusion?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Z6D3vfnB9Nm" + }, + "source": [ + "## (3) RobustScaler\n", + "* Well suited to data with **outliers**: it centers on the median and scales by the interquartile range, so extreme values have little influence on the result.\n", + "\n", + "$$z_{i}= \\frac{x_{i}-Q_{2}(x)}{Q_{3}(x)-Q_{1}(x)}$$\n", + "\n", + "where $Q_{2}(x)$ is the median and $Q_{3}(x)-Q_{1}(x)$ is the interquartile range." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "m3oyuxLeCW1D" + }, + "source": [ + "df_A1.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zeDF7-w_CcBy" + }, + "source": [ + "from sklearn.preprocessing import RobustScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vLoqSKijCf2v" + }, + "source": [ + "A1_RobustScaler = RobustScaler().fit_transform(df_A1)\n", + "df_A1_RobustScaler = pd.DataFrame(A1_RobustScaler, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "\n", + "# Plot\n", + "df_A1_RobustScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_D-7ik2xXpU" + }, + "source": [ + "### **Insight**: Randomly generate columns/variables with Gamma, Beta, Normal, Exponential, etc. distributions and assess the impact of each transformation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GxIOPmSYwX-e" + }, + "source": [ + "# **Wrap Up**\n", + "* Use MinMaxScaler as the default transformation, since it is a linear rescaling that preserves the shape of the distribution;\n", + "* Use RobustScaler if your data/columns/variables contain **outliers** and you want to reduce their effect. Even so, the best approach is to study the **outliers** carefully and treat them appropriately;\n", + "* Use StandardScaler if your data/columns/variables follow a Normal distribution (or at least approximate it well)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-YVMgt-WEFif" + }, + "source": [ + "## Encoding Categorical Variables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xHYvLc8T_jxQ" + }, + "source": [ + "### Encoding Ordinal Variables\n", + "* Example of a variable with ordinal values: low (baixo), medium (medio) or high (alto)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i1BgGiGdSTcG" + }, + "source": [ + "#### Example dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kdVahfJAEkuO" + }, + "source": [ + "# randint returns random integers, including the lower bound and excluding the upper bound.\n", + "\n", + "# Ten random ages between 20 and 39\n", + "l_idade = [np.random.randint(20, 40) for _ in range(10)]\n", + "\n", + "l_salario = ['baixo', 'medio', 'alto']\n", + "l_salario2 = np.random.choice(l_salario, 10, p = [0.6, 0.3, 0.1])\n", + "\n", + "df_A5 = pd.DataFrame({\n", + "    'idade': l_idade,\n", + "    'salario': l_salario2})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "m_15P2eUHSBY" + }, + "source": [ + "df_A5" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R1g9pEuyHe2q" + }, + "source": [ + "In this example, we recode the ordinal categorical variable 'salario' as follows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bkwFuEa8HnMV" + }, + "source": [ + "df_A5['salario_cat'] = df_A5['salario'].map({'baixo': 1, 'medio': 2, 'alto': 3})\n", + "df_A5" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DlaIFiWIIPAl" 
+ }, + "source": [ + "### Encoding Nominal Variables\n", + "* Example of a variable with nominal values: sex (female, male).\n", + "\n", + "* Use One-Hot Encoding or pd.get_dummies()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ffNoJQbgJRoY" + }, + "source": [ + "We will use the dataframe created in the previous step:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PMCoUWZOI7c0" + }, + "source": [ + "df_A5['salario'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bdIEyBkaJeN8" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder, OneHotEncoder" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4MwK4cUEKeK4" + }, + "source": [ + "#### Applying LabelEncoder()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6X6VXDsHJiII" + }, + "source": [ + "le = LabelEncoder()\n", + "df_A5['salario_le'] = le.fit_transform(df_A5['salario'])\n", + "df_A5" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "RY80x59J8Ham" + }, + "source": [ + "df_A5['salario'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dgv2Zz07Kqfj" + }, + "source": [ + "#### Applying pd.get_dummies()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WSZRIEs6K5sP" + }, + "source": [ + "dummies = pd.get_dummies(df_A5['salario'])\n", + "df_A5 = pd.concat([df_A5, dummies], axis = 1)\n", + "df_A5" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WeXKQ3lkg8qO" + }, + "source": [ + "# Power Transformations\n", + "* The goal is to transform the probability distribution of the variable/column so that it becomes closer to Normal. This is done by correcting the skewness of the distribution, which also helps to stabilize the variance.\n", + "* Examples of Power Transformations:\n", + "  * log;\n", + "    * Helps with skewed distributions;\n", + "    * Requires strictly positive values (no zeros or negative numbers);\n", + "  * square root;\n", + "  * cube root;\n", + "  * **Box-Cox transformation** and\n", + "  * Yeo-Johnson transformation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BQDf9EfzRXYC" + }, + "source": [ + "### Yeo-Johnson transformation (the default method of PowerTransformer)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o9q9lxbYhlKE" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A2['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Add the title\n", + "plt.title('Histogram of coluna3 - Exponential distribution')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QJ91Stekh8JO" + }, + "source": [ + "from sklearn.preprocessing import PowerTransformer" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "RCGFIeszkLVK" + }, + "source": [ + "df_A2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iNVvFc4NMcB2" + }, + "source": [ + "# General scikit-learn pattern (pseudocode, kept as comments so that the cell does not raise errors):\n", + "# dados = objeto.transform(dataframe)\n", + "# dados_transformados = objeto.fit(df).transform(df)\n", + "# dados_transformados = objeto.fit_transform(df)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "XOQYdkyzi-PL" + }, + "source": [ + "yeo_johnson = PowerTransformer(method = 'yeo-johnson', standardize = True)\n", + "A2_yeo_johnson = yeo_johnson.fit_transform(df_A2)\n", + "A2_yeo_johnson # NumPy array with the transformed data" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yn1T_TY9M2nR" + }, 
+ "source": [ + "df_A2_yeo_johnson = pd.DataFrame(A2_yeo_johnson, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n", + "df_A2_yeo_johnson.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "T_5beALqkao_" + }, + "source": [ + "plt.figure(figsize = (12, 8))\n", + "plt.hist(df_A2_yeo_johnson['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n", + "\n", + "# Add the title\n", + "plt.title('Coluna3 - Approximately Normal distribution')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BwEhZBARk7oA" + }, + "source": [ + "## Box-Cox Transformation\n", + "* Proposed by the statisticians George Box and David Cox;\n", + "* The column/variable/attribute cannot contain negative numbers or zeros, that is, $x_{i} > 0$.\n", + "\n", + "* Let $w_{i}$ be the transformed variable and $x_{i}$ the variable we want to transform.\n", + "  * If $\\lambda = 0$ --> $w_{i}^{(\\lambda)} = \\log(x_{i})$;\n", + "  * If $\\lambda \\neq 0$ --> $w_{i}^{(\\lambda)} = \\frac{x_{i}^{\\lambda}-1}{\\lambda}$;\n", + "* If $\\lambda = 1$, then $w_{i} = x_{i} - 1$: the data are only shifted and the shape of the distribution does not change. An estimated $\\lambda$ close to 1 therefore suggests that the data are already approximately Normal and that the Box-Cox transformation is unnecessary.\n", + "* We need to choose the value of $\\lambda$ that gives the best approximation to the Normal distribution.\n", + "* The function scipy.stats.boxcox(array_1D) estimates this value: given a 1D array (and no fixed $\\lambda$), it returns the transformed data together with the optimal $\\lambda$ for your data.\n", + "* To map the data back to the original values, use scipy.special.inv_boxcox(y, lambda).\n", + "* What are the drawbacks of the transformation?\n", + "  * The transformed values lose their direct interpretation." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AZXkW4Baz87T" + }, + "source": [ + "Required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QzWRS8chz8V_" + }, + "source": [ + "import numpy as np\n", + "from scipy import stats\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns # used by compara_graficos() below" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iWQlhjRpO1Zc" + }, + "source": [ + "# Generating data with an Exponential distribution\n", + "distribuicao_exponencial = np.random.exponential(size = 1000)\n", + "\n", + "# Transformed data and the estimated lambda\n", + "box_cox, lambda_box_cox = stats.boxcox(distribuicao_exponencial)\n", + "f\"Optimal lambda: {lambda_box_cox}\"" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "huT5bd89EiyD" + }, + "source": [ + "distribuicao_exponencial[:50]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zEG3u_LbPWP5" + }, + "source": [ + "# Data transformed by Box-Cox:\n", + "box_cox[0:30]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_lwyGVDMzC4y" + }, + "source": [ + "### Example 1\n", + "* The data follow an Exponential distribution." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Tx5J1L8Az4qQ" + }, + "source": [ + "# Plots:\n", + "def compara_graficos(y, w, lambda_box_cox):\n", + "    fig, ax = plt.subplots(1, 2)\n", + "\n", + "    # Plot the original and the transformed distributions\n", + "    # Note: sns.distplot is deprecated since seaborn 0.11; sns.kdeplot(..., fill = True) is the newer equivalent\n", + "    sns.distplot(y, hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 2}, label = \"Non-Normal\", color = \"green\", ax = ax[0])\n", + "    sns.distplot(w, hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 2}, label = \"Normal\", color = \"green\", ax = ax[1])\n", + "\n", + "    # Legends (one per subplot; plt.legend() alone would only affect the last subplot)\n", + "    ax[0].legend(loc = \"upper right\")\n", + "    ax[1].legend(loc = \"upper right\")\n", + "\n", + "    # Resize the subplots\n", + "    fig.set_figheight(5)\n", + "    fig.set_figwidth(10)\n", + "\n", + "    print(f\"Lambda value used in the transformation: {lambda_box_cox}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xWLsXEBB0CQO" + }, + "source": [ + "compara_graficos(distribuicao_exponencial, box_cox, lambda_box_cox)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F9jObBLCZh19" + }, + "source": [ + "### Example 2\n", + "* The data follow a Beta distribution.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CePoB8imzPRQ" + }, + "source": [ + "# Generating data with a Beta distribution\n", + "distribuicao_beta = np.random.beta(1, 3, 1000)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j1CwLPm6zRx2" + }, + "source": [ + "# Transform the data and save the lambda value\n", + "box_cox, lambda_box_cox = stats.boxcox(distribuicao_beta)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "DAy3IPUbWJut" + }, + "source": [ + "f\"Optimal lambda: {lambda_box_cox}\"" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "snd63l9U0ugI" + 
}, + "source": [ + "compara_graficos(distribuicao_beta, box_cox, lambda_box_cox)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CQ6o9pOkPjUT" + }, + "source": [ + "### Log transformation\n", + "* In general, the **log** transformation is applied to right-skewed data (data that depart from the Normal distribution) and makes their distribution closer to Normal;\n", + "* If the data are already approximately Normal, the log transformation is unnecessary and may even introduce skewness." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DrsXETsRPupd" + }, + "source": [ + "# Generating data with a Beta distribution\n", + "distribuicao_beta = np.random.beta(1, 3, 1000)\n", + "\n", + "transformacao_log = np.log(distribuicao_beta)\n", + "\n", + "# Reusing the compara_graficos() function (the log corresponds to Box-Cox with lambda = 0)\n", + "compara_graficos(distribuicao_beta, transformacao_log, 0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mwh0alhdgrE3" + }, + "source": [ + "___\n", + "# **Exercises**\n", + "> For each of the dataframes below, apply the following steps:\n", + "\n", + "* Standardize the column names\n", + "  * Remove spaces from the column names;\n", + "  * Remove special characters from the column names;\n", + "  * Rename the columns with lower() (or upper());\n", + "* Apply the StandardScaler and MinMaxScaler transformations to each column of the dataframe;\n", + "* DataViz - show the distribution of the columns so that the results before and after the transformations can be compared.\n", + "* Do the correlations between the columns change with the transformations?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hSTKrd992LtI" + }, + "source": [ + "## Exercise 1 - Iris --> **Solved**\n", + "* [Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you will find more information about the iris dataset." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mThqvGGr2Vuk" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "iris = load_iris()\n", + "X= iris['data']\n", + "y= iris['target']\n", + "\n", + "df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))\n", + "df_iris['target2'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eU5FaJhdYblP" + }, + "source": [ + "df_iris.columns = [c.replace(' ', '_') for c in df_iris.columns]\n", + "df_iris.columns = [c.replace('_(cm)', '') for c in df_iris.columns]\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "K9DPAakJZQHH" + }, + "source": [ + "df_iris.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YYYmVq68Y8bB" + }, + "source": [ + "# Apply the transformation:\n", + "df_iris_MinMaxScaler = MinMaxScaler().fit_transform(df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])\n", + "\n", + "# Convert to a dataframe:\n", + "df_iris_MinMaxScaler = pd.DataFrame(df_iris_MinMaxScaler, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])\n", + "\n", + "# Plot\n", + "df_iris_MinMaxScaler.plot(kind = 'kde')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DKsHcjd77YZT" + }, + "source": [ + "### Apply the other transformations and compare the plots." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "caFkC6oCmUKK" + }, + "source": [ + "## Exercise 2 - Breast Cancer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vhOM-Z9zmf-f" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "cancer = load_breast_cancer()\n", + "X= cancer['data']\n", + "y= cancer['target']\n", + "\n", + "df_A1_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n", + "df_A1_cancer['target'] = df_A1_cancer['target'].map({0: 'malignant', 1: 'benign'})\n", + "df_A1_cancer.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1qruqUDqnvMc" + }, + "source": [ + "## Exercise 3 - Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "trxK8YXNnsam" + }, + "source": [ + "# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version\n", + "from sklearn.datasets import load_boston\n", + "\n", + "boston = load_boston()\n", + "X= boston['data']\n", + "y= boston['target']\n", + "\n", + "df_A1_boston = pd.DataFrame(np.c_[X, y], columns= np.append(boston['feature_names'], ['target']))\n", + "df_A1_boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nzu0Dz33c8ds" + }, + "source": [ + "## Exercise 4 - Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "d6ahBZmqc_-1" + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "diabetes = load_diabetes()\n", + "X= diabetes['data']\n", + "y= diabetes['target']\n", + "\n", + "df_A1_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))\n", + "df_A1_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NyunIr6oaWEl" + }, + "source": [ + "## Exercise 5 - 120 years of Olympic history: athletes and results\n", + "* [120 
years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)\n", + "  * Handle the variables 'sex', 'season', 'team', 'city', 'sport' and 'medal' appropriately;\n", + "  * Apply the transformations we have just studied to the numeric columns 'height' and 'weight'. Watch out for the missing values in these variables!\n", + "  * Check and assess the impact of the outliers in these columns.\n", + "  * In this case, which transformation is the most suitable given the outliers?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8V7OCd3G9zj1" + }, + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z3g4dqM190mj" + }, + "source": [ + "url = '/content/drive/My Drive/Datasets4ML/athlete_events.csv'\n", + "df = pd.read_csv(url)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5ExKsjGmKaEx" + }, + "source": [ + "## Exercise 6 - FIFA\n", + "* Apply the MinMaxScaler, RobustScaler and StandardScaler transformations to the numeric columns of the dataframe FIFA_algumas_features.csv.\n", + "* For the categorical columns, apply the most suitable transformation." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zjukr52HK3S_" + }, + "source": [ + "import pandas as pd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S41tXs2EKlHN" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/FIFA_algumas_features.csv?token=AGDJQ62CSW5KBLZNXH4TULK7SXICE'\n", + "\n", + "df = pd.read_csv(url, index_col = 'ID')\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o5fDp1Ib_Dg8" + }, + "source": [ + "# WOE - Weight Of Evidence\n", + "* The advantages of the WOE transformation are:\n", + "  * It handles NaNs well;\n", + "  * It handles outliers well;\n", + "  * The transformation is based on the logarithm of the distributions.\n", + "  * With an appropriate binning technique, it can establish a monotonic (increasing or decreasing) relationship between the dependent and independent variables." + ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_04__3DP_4_Anomaly_Detection_hs.ipynb b/Notebooks/NB10_04__3DP_4_Anomaly_Detection_hs.ipynb new file mode 100644 index 000000000..b0d28a16f --- /dev/null +++ b/Notebooks/NB10_04__3DP_4_Anomaly_Detection_hs.ipynb @@ -0,0 +1,4873 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB10_04__3DP_4_Anomaly_Detection.ipynb", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EAqSDJGzyYrx" + }, + "source": [ + "

3DP_4 - ANOMALY/OUTLIER DETECTION

\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H-VrOjTTymSK" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> See the **Table of contents**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wSAsbafemNax" + }, + "source": [ + "# **Improvements for this session**\n", + "* Show the plots of the Anomaly Score region side by side with the probability distribution of the variables involved.\n", + "* Deprecation messages --> review and replace the deprecated methods and functions;\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7qK6Yx0tBqUz" + }, + "source": [ + "___\n", + "# **References**\n", + "* [Comparing anomaly detection algorithms for outlier detection on toy datasets](https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py)\n", + "* [Outlier detection with several methods](https://scikit-learn.org/0.18/auto_examples/covariance/plot_outlier_detection.html)\n", + "* [anomaly-detection-resources](https://github.com/MathMachado/anomaly-detection-resources)\n", + "* [Outlier Detection with Extended Isolation Forest](https://towardsdatascience.com/outlier-detection-with-extended-isolation-forest-1e248a3fe97b)\n", + "* [Outlier Detection with Isolation Forest](https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f7tTnUJ6B2UG" + }, + "source": [ + "___\n", + "## What is Anomaly Detection?\n", + "> The identification of any point/observation that is unusual when compared with all the other points/observations." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7VJZf1U5Ds_w" + }, + "source": [ + "___\n", + "# **Machine Learning with Python (Scikit-Learn)**\n", + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rpHJ1qVUEwOn" + }, + "source": [ + "___\n", + "# **Traditional techniques for outlier detection**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OOI_VTo3E3sv" + }, + "source": [ + "## Boxplot\n", + "\n", + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vivFsmJGFVC0" + }, + "source": [ + "## Z-Score\n", + "* The Z-Score can be used to detect outliers.\n", + "* It is the difference between the value and the sample mean, expressed as a number of standard deviations.\n", + "* For Normally distributed data, a z-score below -1.96 or above 1.96 places the value among the most extreme 5% of the values (2.5% in each tail of the distribution). For outlier detection a stricter cutoff is used: 2.5 or, more commonly in practice, 3 in absolute value.\n", + "\n", + "![Z_Score](https://github.com/MathMachado/Materials/blob/master/Z_Score.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hUw6W3SSFiwj" + }, + "source": [ + "## IQR Score\n", + "\n", + "* The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles: IQR = Q3 - Q1.\n", + "\n", + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7_YohlTIF8zi" + }, + "source": [ + "___\n", + "# **Hands-On**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OrXdGg8t0V_D" + }, + "source": [ + "## Loading the required libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7pYqwxIe1Hcq", + "outputId": "4065238d-0996-4036-d6bf-2a1bf1250073", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 683 + } + }, + "source": [ + "!pip install pyod" + ], + "execution_count": 1, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Collecting pyod\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/a3/4b/d2edd1e85b132d480feced17f044267b3e330391240779d78b1c3d378b24/pyod-0.8.3.tar.gz (96kB)\n", + "\r\u001b[K |███▍ | 10kB 24.8MB/s eta 0:00:01\r\u001b[K |██████▊ | 20kB 6.1MB/s eta 0:00:01\r\u001b[K |██████████▏ | 30kB 5.8MB/s eta 0:00:01\r\u001b[K |█████████████▌ | 40kB 6.3MB/s eta 0:00:01\r\u001b[K |█████████████████ | 51kB 6.5MB/s eta 0:00:01\r\u001b[K |████████████████████▎ | 61kB 7.2MB/s eta 0:00:01\r\u001b[K |███████████████████████▊ | 71kB 7.6MB/s eta 0:00:01\r\u001b[K |███████████████████████████ | 81kB 7.1MB/s eta 0:00:01\r\u001b[K |██████████████████████████████▍ | 92kB 7.5MB/s eta 0:00:01\r\u001b[K |████████████████████████████████| 102kB 4.8MB/s \n", + "\u001b[?25hCollecting combo\n", + " Downloading 
https://files.pythonhosted.org/packages/0a/2a/61b6ac584e75d8df16dc27962aa5fe99d76b09da5b6710e83d4862c84001/combo-0.1.1.tar.gz\n", + "Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from pyod) (0.16.0)\n", + "Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from pyod) (3.2.2)\n", + "Requirement already satisfied: numpy>=1.13 in /usr/local/lib/python3.6/dist-packages (from pyod) (1.18.5)\n", + "Requirement already satisfied: numba>=0.35 in /usr/local/lib/python3.6/dist-packages (from pyod) (0.48.0)\n", + "Requirement already satisfied: pandas>=0.25 in /usr/local/lib/python3.6/dist-packages (from pyod) (1.1.2)\n", + "Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.6/dist-packages (from pyod) (1.4.1)\n", + "Requirement already satisfied: scikit_learn>=0.19.1 in /usr/local/lib/python3.6/dist-packages (from pyod) (0.22.2.post1)\n", + "Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from pyod) (1.15.0)\n", + "Requirement already satisfied: statsmodels in /usr/local/lib/python3.6/dist-packages (from pyod) (0.10.2)\n", + "Collecting suod\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/a1/87/9170cabe1b5e10a7d095c0e28f2e30e7c1886a13f063de85d3cfacc06f4b/suod-0.0.4.tar.gz (2.1MB)\n", + "\u001b[K |████████████████████████████████| 2.1MB 13.6MB/s \n", + "\u001b[?25hRequirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pyod) (2.4.7)\n", + "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pyod) (0.10.0)\n", + "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pyod) (1.2.0)\n", + "Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pyod) (2.8.1)\n", + "Requirement already satisfied: 
llvmlite<0.32.0,>=0.31.0dev0 in /usr/local/lib/python3.6/dist-packages (from numba>=0.35->pyod) (0.31.0)\n", + "Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from numba>=0.35->pyod) (50.3.0)\n", + "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.25->pyod) (2018.9)\n", + "Requirement already satisfied: patsy>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from statsmodels->pyod) (0.5.1)\n", + "Building wheels for collected packages: pyod, combo, suod\n", + " Building wheel for pyod (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for pyod: filename=pyod-0.8.3-cp36-none-any.whl size=110349 sha256=41264a336d54b8628dab7dc68909de14cafec5c2fab86dc9a1d1adea764dcce0\n", + " Stored in directory: /root/.cache/pip/wheels/29/46/95/86facd235cce1d58ae6747ab1aea2b3742564325a66a60863a\n", + " Building wheel for combo (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for combo: filename=combo-0.1.1-cp36-none-any.whl size=42113 sha256=4e68819bebfddd1709034c7b2b6700761360d2c2d32f44a51b48c9eb70f400fa\n", + " Stored in directory: /root/.cache/pip/wheels/55/ec/e5/a2331372c676c467e70c6646e646edf6997d5c4905b8c0f5e6\n", + " Building wheel for suod (setup.py) ... 
done\n", + " Created wheel for suod: filename=suod-0.0.4-cp36-none-any.whl size=2167158 sha256=7a79c8fb77870bdd3263611e03970b202bdc30fb4f9e7b1eeea2e0fa05ffa1c1\n", + " Stored in directory: /root/.cache/pip/wheels/57/55/e5/a4fca65bba231f6d0115059b589148774b41faea25b3f2aa27\n", + "Successfully built pyod combo suod\n", + "Installing collected packages: combo, suod, pyod\n", + "Successfully installed combo-0.1.1 pyod-0.8.3 suod-0.0.4\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gxBgvhA4mowO" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from numpy import percentile\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "import matplotlib\n", + "\n", + "from sklearn.ensemble import IsolationForest\n", + "\n", + "# Scaling variables\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "\n", + "from pyod.models.abod import ABOD\n", + "from pyod.models.cblof import CBLOF\n", + "#from pyod.models.feature_bagging import FeatureBagging\n", + "from pyod.models.hbos import HBOS\n", + "from pyod.models.iforest import IForest\n", + "from pyod.models.knn import KNN\n", + "#from pyod.models.lof import LOF\n", + "from scipy import stats\n", + "\n", + "# suppress warnings to keep the notebook output clean\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ], + "execution_count": 2, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WLf_c29t0ekj" + }, + "source": [ + "## Load the dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YL_VQljA0gxZ", + "outputId": "bc14a91c-4e61-4061-eca5-ecf9c5ce4cfc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "df_titanic = sns.load_dataset('titanic')\n", + "df_titanic = df_titanic.dropna()\n", + "df_titanic.head()" + ], + "execution_count": 3, + "outputs": [ + { + 
"output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked</th><th>class</th><th>who</th><th>adult_male</th><th>deck</th><th>embark_town</th><th>alive</th><th>alone</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>First</td><td>woman</td><td>False</td><td>C</td><td>Cherbourg</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>First</td><td>woman</td><td>False</td><td>C</td><td>Southampton</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>6</th><td>0</td><td>1</td><td>male</td><td>54.0</td><td>0</td><td>0</td><td>51.8625</td><td>S</td><td>First</td><td>man</td><td>True</td><td>E</td><td>Southampton</td><td>no</td><td>True</td></tr>\n",
+ " <tr><th>10</th><td>1</td><td>3</td><td>female</td><td>4.0</td><td>1</td><td>1</td><td>16.7000</td><td>S</td><td>Third</td><td>child</td><td>False</td><td>G</td><td>Southampton</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>11</th><td>1</td><td>1</td><td>female</td><td>58.0</td><td>0</td><td>0</td><td>26.5500</td><td>S</td><td>First</td><td>woman</td><td>False</td><td>C</td><td>Southampton</td><td>yes</td><td>True</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>" + ], + "text/plain": [ + " survived pclass sex age ... deck embark_town alive alone\n", + "1 1 1 female 38.0 ... C Cherbourg yes False\n", + "3 1 1 female 35.0 ... C Southampton yes False\n", + "6 0 1 male 54.0 ... E Southampton no True\n", + "10 1 3 female 4.0 ... G Southampton yes False\n", + "11 1 1 female 58.0 ... C Southampton yes True\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q2oxyyQWB-uz" + }, + "source": [ + "# Standardize 'age' and 'fare' (zero mean, unit variance)\n", + "df_titanic_ss = df_titanic.copy()\n", + "df_titanic_ss[['fare', 'age']] = StandardScaler().fit_transform(df_titanic_ss[['fare','age']])" + ], + "execution_count": 4, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rAKnKtil9Oz1", + "outputId": "cbe30a0f-c362-42ac-c96d-34aeb3d7f2c8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Shape (rows, columns) of the standardized dataframe\n", + "df_titanic_ss.shape" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(182, 15)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sHSYUkEQFIwS" + }, + "source": [ + "# Helper that draws a boxplot of `column` grouped by 'survived'\n", + "def boxplot_sobreviveu(df, column):\n", + " plt.rcdefaults()\n", + " sns.catplot(x = 'survived', y = column, kind = \"box\", data = df, height = 4, aspect = 1.5)\n", + " \n", + " # add data points to boxplot with stripplot\n", + " sns.stripplot(x = 'survived', y = column, data = df, alpha = 0.3, jitter = 0.2, color = 'k');\n", + " plt.show()" + ], + "execution_count": 7, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "o9-VgcNnFNb1", + "outputId": "ff6653c7-f679-4683-9d64-f9c93fb58bba", + "colab": { + 
"base_uri": "https://localhost:8080/", + "height": 426 + } + }, + "source": [ + 
"boxplot_sobreviveu(df_titanic, 'fare')" + ], + "execution_count": 8, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8FIo9tD1FQ0u", + "outputId": "e213bc87-210e-4723-c133-02a1cf67c919", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 426 + } + }, + "source": [ + "boxplot_sobreviveu(df_titanic, 'age')" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wTd8J0LyfOKH", + "outputId": "b888e12d-1d41-47fd-f359-8ce104acb9d0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "df_titanic.columns" + ], + "execution_count": 10, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',\n", + " 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',\n", + " 'alive', 'alone'],\n", + " dtype='object')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TLX6YhlYfbQ9", + "outputId": "61dc83f0-0d37-47a9-8986-e1d9fc3c9a42", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 221 + } + }, + "source": [ + "df_titanic.fare.sort_values(ascending=False)" + ], + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "737 512.3292\n", + "679 512.3292\n", + "341 263.0000\n", + "438 263.0000\n", + "27 263.0000\n", + " ... \n", + "699 7.6500\n", + "715 7.6500\n", + "872 5.0000\n", + "263 0.0000\n", + "806 0.0000\n", + "Name: fare, Length: 182, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t0ZCXA15fzsq", + "outputId": "6a994ce8-185c-4eee-ba96-db3c338d195a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 111 + } + }, + "source": [ + "df_titanic[df_titanic['fare'] > 500]" + ], + "execution_count": 23, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked</th><th>class</th><th>who</th><th>adult_male</th><th>deck</th><th>embark_town</th><th>alive</th><th>alone</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>679</th><td>1</td><td>1</td><td>male</td><td>36.0</td><td>0</td><td>1</td><td>512.3292</td><td>C</td><td>First</td><td>man</td><td>True</td><td>B</td><td>Cherbourg</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>737</th><td>1</td><td>1</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>512.3292</td><td>C</td><td>First</td><td>man</td><td>True</td><td>B</td><td>Cherbourg</td><td>yes</td><td>True</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " survived pclass sex age ... deck embark_town alive alone\n", + "679 1 1 male 36.0 ... B Cherbourg yes False\n", + "737 1 1 male 35.0 ... B Cherbourg yes True\n", + "\n", + "[2 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 23 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dry76OnKi_X6", + "outputId": "471ba5fc-2fc8-4027-86ac-188e90c94952", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 111 + } + }, + "source": [ + "df_titanic[df_titanic['age'] > 70]" + ], + "execution_count": 26, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked</th><th>class</th><th>who</th><th>adult_male</th><th>deck</th><th>embark_town</th><th>alive</th><th>alone</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>96</th><td>0</td><td>1</td><td>male</td><td>71.0</td><td>0</td><td>0</td><td>34.6542</td><td>C</td><td>First</td><td>man</td><td>True</td><td>A</td><td>Cherbourg</td><td>no</td><td>True</td></tr>\n",
+ " <tr><th>630</th><td>1</td><td>1</td><td>male</td><td>80.0</td><td>0</td><td>0</td><td>30.0000</td><td>S</td><td>First</td><td>man</td><td>True</td><td>A</td><td>Southampton</td><td>yes</td><td>True</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>" + ], + "text/plain": [ + " survived pclass sex age ... deck embark_town alive alone\n", + "96 0 1 male 71.0 ... A Cherbourg no True\n", + "630 1 1 male 80.0 ... A Southampton yes True\n", + "\n", + "[2 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fCqj102y9Kfo", + "outputId": "891bd5c0-76f3-47d2-a125-12be081812e2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 168 + } + }, + "source": [ + "# Summary statistics for the standardized 'fare' column\n", + "df_titanic_ss['fare'].describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "count 1.820000e+02\n", + "mean 2.537653e-16\n", + "std 1.002759e+00\n", + "min -1.034601e+00\n", + "25% -6.452479e-01\n", + "50% -2.873576e-01\n", + "75% 1.452571e-01\n", + "max 5.681797e+00\n", + "Name: fare, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SMcvIb1K_69n", + "outputId": "701603fd-6df5-4622-dbdc-62603ca91fff", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 472 + } + }, + "source": [ + "plt.scatter(range(df_titanic_ss.shape[0]), np.sort(df_titanic_ss['fare'].values))\n", + "plt.xlabel('index')\n", + "plt.ylabel('Fares')\n", + "plt.title(\"Distribution of the Fare variable\")\n", + "\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NCcXnHHYIlM4", + "outputId": "c49ba369-ac5a-47b2-85a0-b25cffdfe4fe", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + } + }, + "source": [ + "df_titanic_ss.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>survived</th><th>pclass</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>count</th><td>182.000000</td><td>182.000000</td><td>1.820000e+02</td><td>182.000000</td><td>182.000000</td><td>1.820000e+02</td></tr>\n",
+ " <tr><th>mean</th><td>0.675824</td><td>1.192308</td><td>1.464030e-17</td><td>0.467033</td><td>0.478022</td><td>2.537653e-16</td></tr>\n",
+ " <tr><th>std</th><td>0.469357</td><td>0.516411</td><td>1.002759e+00</td><td>0.645007</td><td>0.755869</td><td>1.002759e+00</td></tr>\n",
+ " <tr><th>min</th><td>0.000000</td><td>1.000000</td><td>-2.220506e+00</td><td>0.000000</td><td>0.000000</td><td>-1.034601e+00</td></tr>\n",
+ " <tr><th>25%</th><td>0.000000</td><td>1.000000</td><td>-7.437173e-01</td><td>0.000000</td><td>0.000000</td><td>-6.452479e-01</td></tr>\n",
+ " <tr><th>50%</th><td>1.000000</td><td>1.000000</td><td>2.411064e-02</td><td>0.000000</td><td>0.000000</td><td>-2.873576e-01</td></tr>\n",
+ " <tr><th>75%</th><td>1.000000</td><td>1.000000</td><td>7.759421e-01</td><td>1.000000</td><td>1.000000</td><td>1.452571e-01</td></tr>\n",
+ " <tr><th>max</th><td>1.000000</td><td>3.000000</td><td>2.839480e+00</td><td>3.000000</td><td>4.000000</td><td>5.681797e+00</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>" + ], + "text/plain": [ + " survived pclass ... parch fare\n", + "count 182.000000 182.000000 ... 182.000000 1.820000e+02\n", + "mean 0.675824 1.192308 ... 0.478022 2.537653e-16\n", + "std 0.469357 0.516411 ... 0.755869 1.002759e+00\n", + "min 0.000000 1.000000 ... 0.000000 -1.034601e+00\n", + "25% 0.000000 1.000000 ... 0.000000 -6.452479e-01\n", + "50% 1.000000 1.000000 ... 0.000000 -2.873576e-01\n", + "75% 1.000000 1.000000 ... 1.000000 1.452571e-01\n", + "max 1.000000 3.000000 ... 4.000000 5.681797e+00\n", + "\n", + "[8 rows x 6 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7pzTvLleGpWc", + "outputId": "df81c529-7718-4a7c-8c31-6aa5453b8535", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 472 + } + }, + "source": [ + "# Distribution of the standardized 'fare' variable\n", + "\n", + "sns.distplot(df_titanic_ss['fare'])\n", + "plt.title(\"Distribution of the Fare variable\")\n", + "sns.despine()" + ], + "execution_count": 27, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vZ9RJvhHkIB5", + "outputId": "dc181c26-7dd5-43eb-982f-dd8a2bcc892e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "df_titanic.describe()" + ], + "execution_count": 28, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>survived</th><th>pclass</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>count</th><td>182.000000</td><td>182.000000</td><td>182.000000</td><td>182.000000</td><td>182.000000</td><td>182.000000</td></tr>\n",
+ " <tr><th>mean</th><td>0.675824</td><td>1.192308</td><td>35.623187</td><td>0.467033</td><td>0.478022</td><td>78.919735</td></tr>\n",
+ " <tr><th>std</th><td>0.469357</td><td>0.516411</td><td>15.671615</td><td>0.645007</td><td>0.755869</td><td>76.490774</td></tr>\n",
+ " <tr><th>min</th><td>0.000000</td><td>1.000000</td><td>0.920000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
+ " <tr><th>25%</th><td>0.000000</td><td>1.000000</td><td>24.000000</td><td>0.000000</td><td>0.000000</td><td>29.700000</td></tr>\n",
+ " <tr><th>50%</th><td>1.000000</td><td>1.000000</td><td>36.000000</td><td>0.000000</td><td>0.000000</td><td>57.000000</td></tr>\n",
+ " <tr><th>75%</th><td>1.000000</td><td>1.000000</td><td>47.750000</td><td>1.000000</td><td>1.000000</td><td>90.000000</td></tr>\n",
+ " <tr><th>max</th><td>1.000000</td><td>3.000000</td><td>80.000000</td><td>3.000000</td><td>4.000000</td><td>512.329200</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>" + ], + "text/plain": [ + " survived pclass age sibsp parch fare\n", + "count 182.000000 182.000000 182.000000 182.000000 182.000000 182.000000\n", + "mean 0.675824 1.192308 35.623187 0.467033 0.478022 78.919735\n", + "std 0.469357 0.516411 15.671615 0.645007 0.755869 76.490774\n", + "min 0.000000 1.000000 0.920000 0.000000 0.000000 0.000000\n", + "25% 0.000000 1.000000 24.000000 0.000000 0.000000 29.700000\n", + "50% 1.000000 1.000000 36.000000 0.000000 0.000000 57.000000\n", + "75% 1.000000 1.000000 47.750000 1.000000 1.000000 90.000000\n", + "max 1.000000 3.000000 80.000000 3.000000 4.000000 512.329200" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 28 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qa28Hc3ZC6FV" + }, + "source": [ + "___\n", + "## Kurtosis\n", + "> Kurtosis is a statistical measure of how strongly the tails of a distribution differ from the tails of a Normal distribution. In other words, kurtosis indicates whether the tails of a given distribution contain extreme values.\n", + ">> The kurtosis of any Normal distribution equals 3. The quantity Kurtosis - 3 is called excess kurtosis, and a positive value indicates tails heavier than those of a Normal distribution. Note that the pandas `kurt()` method used below already returns the excess kurtosis, so a Normal distribution gives a value close to 0.\n", + ">>> **High kurtosis is an indicator that the data have heavy tails or outliers**.\n", + "\n", + "* **Very important tip**: Standardize the data first! (Skewness and kurtosis are not changed by this linear rescaling; it only puts the variables on a common scale.)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ynyNHZqmD-tb" + }, + "source": [ + "___\n", + "## Skewness\n", + "> Skewness is the degree of distortion of a distribution, that is, it measures the lack of symmetry in the data, distinguishing extreme values in one tail from those in the other. 
A symmetric distribution has a skewness of 0.\n", + "\n", + "![Skewness](https://github.com/MathMachado/Materials/blob/master/Skewness.png?raw=true)\n", + "\n", + "Source: [Skew and Kurtosis: 2 Important Statistics terms you need to know in Data Science](https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uoo3xVhBFixi" + }, + "source": [ + "### Interpreting Skewness (Rule of Thumb)\n", + "* If -0.5 < Skewness < 0.5: the data are fairly symmetric;\n", + "* If -1 < Skewness < -0.5: the data are moderately negatively skewed;\n", + "* If 0.5 < Skewness < 1: the data are moderately positively skewed;\n", + "* If Skewness < -1: the data are highly negatively skewed;\n", + "* If Skewness > 1: the data are highly positively skewed.\n", + "\n", + "> **Tip**: Standardize the data first!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oHg3nyjUTiRu", + "outputId": "eaacbc04-e67b-4aa6-fec1-5d9cc97fa449", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "# Compute skewness and (excess) kurtosis for 'fare'\n", + "print(f\"Skewness: {df_titanic_ss['fare'].skew()}\")\n", + "print(f\"Kurtosis: {df_titanic_ss['fare'].kurt()}\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Skewness: 2.7073683146429004\n", + "Kurtosis: 10.690697893681472\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V2jCZLGVH3Qu" + }, + "source": [ + "Looking at the skewness and kurtosis values just above, what is the conclusion?" 
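To make the rule of thumb above concrete, here is a minimal Python sketch that maps a skewness value to its label. It is not part of the original notebook: the function name `classify_skewness` is hypothetical, and so is the handling of the exact boundary values (0.5 and 1, which the list leaves unassigned). The first value is the skewness the notebook printed for the standardized 'fare' column; the other two are made-up examples.

```python
def classify_skewness(skewness: float) -> str:
    """Map a skewness value to the rule-of-thumb label (hypothetical helper)."""
    magnitude = abs(skewness)
    if magnitude < 0.5:
        return "fairly symmetric"
    direction = "positively" if skewness > 0 else "negatively"
    # Assumption: the boundary values 0.5 and 1 fall into the stronger category.
    if magnitude < 1:
        return f"moderately {direction} skewed"
    return f"highly {direction} skewed"


# Skewness printed by the notebook for the standardized 'fare' column.
fare_skewness = 2.7073683146429004

# Two made-up values, only to exercise the other branches.
examples = {"fare": fare_skewness, "hypothetical A": -0.7, "hypothetical B": 0.2}

for name, value in examples.items():
    print(f"{name}: skewness = {value:.3f} -> {classify_skewness(value)}")
```

Read this way, 'fare' falls in the highly positively skewed bucket. Together with its large positive excess kurtosis, that suggests a heavy right tail, which is where outliers would be expected.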
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0nnFS8vi_rOe", + "outputId": "fa152ce8-a07f-4bf4-9671-fe0012a72242", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 168 + } + }, + "source": [ + "# Summary statistics for the standardized 'age' column\n", + "df_titanic_ss['age'].describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "count 1.820000e+02\n", + "mean 1.464030e-17\n", + "std 1.002759e+00\n", + "min -2.220506e+00\n", + "25% -7.437173e-01\n", + "50% 2.411064e-02\n", + "75% 7.759421e-01\n", + "max 2.839480e+00\n", + "Name: age, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "h9ZmvO1b_4sF", + "outputId": "56c563a6-6313-4d18-f336-d6bc308b1bf9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 472 + } + }, + "source": [ + "plt.scatter(range(df_titanic_ss.shape[0]), np.sort(df_titanic_ss['age'].values))\n", + "plt.xlabel('index')\n", + "plt.ylabel('age')\n", + "plt.title(\"Distribution of the age variable\")\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GIAYrDJyCT6r", + "outputId": "f90e17e3-b26f-49e1-8910-76461cb2d4fa", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 472 + } + }, + "source": [ + "sns.distplot(df_titanic_ss['age'])\n", + "plt.title(\"Distribution of the age variable\")\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "USy48-H2UXqB", + "outputId": "e13e7d01-589f-4329-cc96-7f827adac029", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "# Compute skewness and (excess) kurtosis for 'age'\n", + "print(f\"Skewness: {df_titanic_ss['age'].skew()}\")\n", + "print(f\"Kurtosis: {df_titanic_ss['age'].kurt()}\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Skewness: 0.01841894050949496\n", + "Kurtosis: -0.2309427735598728\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ENQaVw2lItVL" + }, + "source": [ + "Looking at the skewness and kurtosis values just above, what is the conclusion?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nt0PQIjW-wXd" + }, + "source": [ + "___\n", + "## **Isolation Forest Region**\n", + "* Source: [Outlier Detection with Isolation Forest](https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tM6Xht76KmUN" + }, + "source": [ + "### Anomaly Detection for 'fare'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uFuAUh5S778M", + "outputId": "7f8311f9-cd4f-4ed0-baf3-9fe55200790c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 388 + } + }, + "source": [ + "# Fit an Isolation Forest on the raw 'fare' values\n", + "isolation_forest = IsolationForest(n_estimators = 100)\n", + "isolation_forest.fit(df_titanic['fare'].values.reshape(-1, 1))\n", + "# Score an evenly spaced grid over the observed range; predict() returns -1 for outliers\n", + "xx = np.linspace(df_titanic['fare'].min(), df_titanic['fare'].max(), len(df_titanic)).reshape(-1, 1)\n", + "anomaly_score = isolation_forest.decision_function(xx)\n", + "outlier = isolation_forest.predict(xx)\n", + "# Plot the score and shade the region 
flagged as outliers\n", + "plt.figure(figsize = (10, 4))\n", + "plt.plot(xx, anomaly_score, label = 'anomaly score')\n", + "plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), where = 
outlier == -1, color = 'r', alpha = .4, label = 'outlier region')\n", + "plt.legend()\n", + "plt.ylabel('anomaly score')\n", + "plt.xlabel('fare')\n", + "plt.show();" + ], + "execution_count": 29, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FkhRwo1cgYtK", + "outputId": "771f93e4-9d22-41cf-d6f0-6ee611326a47", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "# Inspect the rows with fare > 200, for example\n", + "df_titanic.loc[df_titanic['fare'] > 200].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked</th><th>class</th><th>who</th><th>adult_male</th><th>deck</th><th>embark_town</th><th>alive</th><th>alone</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>27</th><td>0</td><td>1</td><td>male</td><td>19.0</td><td>3</td><td>2</td><td>263.0000</td><td>S</td><td>First</td><td>man</td><td>True</td><td>C</td><td>Southampton</td><td>no</td><td>False</td></tr>\n",
+ " <tr><th>88</th><td>1</td><td>1</td><td>female</td><td>23.0</td><td>3</td><td>2</td><td>263.0000</td><td>S</td><td>First</td><td>woman</td><td>False</td><td>C</td><td>Southampton</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>118</th><td>0</td><td>1</td><td>male</td><td>24.0</td><td>0</td><td>1</td><td>247.5208</td><td>C</td><td>First</td><td>man</td><td>True</td><td>B</td><td>Cherbourg</td><td>no</td><td>False</td></tr>\n",
+ " <tr><th>299</th><td>1</td><td>1</td><td>female</td><td>50.0</td><td>0</td><td>1</td><td>247.5208</td><td>C</td><td>First</td><td>woman</td><td>False</td><td>B</td><td>Cherbourg</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>311</th><td>1</td><td>1</td><td>female</td><td>18.0</td><td>2</td><td>2</td><td>262.3750</td><td>C</td><td>First</td><td>woman</td><td>False</td><td>B</td><td>Cherbourg</td><td>yes</td><td>False</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>" + ], + "text/plain": [ + " survived pclass sex age ... deck embark_town alive alone\n", + "27 0 1 male 19.0 ... C Southampton no False\n", + "88 1 1 female 23.0 ... C Southampton yes False\n", + "118 0 1 male 24.0 ... B Cherbourg no False\n", + "299 1 1 female 50.0 ... B Cherbourg yes False\n", + "311 1 1 female 18.0 ... B Cherbourg yes False\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XFbRlmrYgtTS", + "outputId": "c59e564a-392c-4d63-8743-08f9ca4300b2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 286 + } + }, + "source": [ + "# Zoom in on the row with index label 27\n", + "df_titanic.loc[27]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 0\n", + "pclass 1\n", + "sex male\n", + "age 19\n", + "sibsp 3\n", + "parch 2\n", + "fare 263\n", + "embarked S\n", + "class First\n", + "who man\n", + "adult_male True\n", + "deck C\n", + "embark_town Southampton\n", + "alive no\n", + "alone False\n", + "Name: 27, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bH4o-CL-N9Np" + }, + "source": [ + "The region where the data have a low probability of appearing (the outlier region) lies on the right-hand side of the distribution." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7HK9cBvwGOqG" + }, + "source": [ + "### Anomaly Detection for 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PoDzs4DTFSY-", + "outputId": "21bdf5d8-71fd-449f-d644-2f8ef94c2605", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 388 + } + }, + "source": [ + "# Same procedure as for 'fare', now on the raw 'age' values\n", + "isolation_forest = IsolationForest(n_estimators = 100)\n", + "isolation_forest.fit(df_titanic['age'].values.reshape(-1, 1))\n", + "xx = np.linspace(df_titanic['age'].min(), df_titanic['age'].max(), len(df_titanic)).reshape(-1, 1)\n", + "anomaly_score = isolation_forest.decision_function(xx)\n", + "outlier = isolation_forest.predict(xx)\n", + "plt.figure(figsize = (10, 4))\n", + "plt.plot(xx, anomaly_score, label='anomaly score')\n", + "plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), where = outlier == -1, color = 'r', alpha = .4, label = 'outlier region')\n", + "plt.legend()\n", + "plt.ylabel('anomaly score')\n", + "plt.xlabel('age')\n", + "plt.show();" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GivF2cSFS208" + }, + "source": [ + "Note in the plot above that there are two regions where the data have a low probability of appearing: one on the left-hand side of the distribution and one on the right-hand side." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XtizVySOlPUT", + "outputId": "0b387228-ac36-4617-9e06-a231e7b0fbdf", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "# Inspect the rows in the left tail\n", + "df_titanic.loc[df_titanic['age'] < 15].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked</th><th>class</th><th>who</th><th>adult_male</th><th>deck</th><th>embark_town</th><th>alive</th><th>alone</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>10</th><td>1</td><td>3</td><td>female</td><td>4.0</td><td>1</td><td>1</td><td>16.7000</td><td>S</td><td>Third</td><td>child</td><td>False</td><td>G</td><td>Southampton</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>183</th><td>1</td><td>2</td><td>male</td><td>1.0</td><td>2</td><td>1</td><td>39.0000</td><td>S</td><td>Second</td><td>child</td><td>False</td><td>F</td><td>Southampton</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>193</th><td>1</td><td>2</td><td>male</td><td>3.0</td><td>1</td><td>1</td><td>26.0000</td><td>S</td><td>Second</td><td>child</td><td>False</td><td>F</td><td>Southampton</td><td>yes</td><td>False</td></tr>\n",
+ " <tr><th>205</th><td>0</td><td>3</td><td>female</td><td>2.0</td><td>0</td><td>1</td><td>10.4625</td><td>S</td><td>Third</td><td>child</td><td>False</td><td>G</td><td>Southampton</td><td>no</td><td>False</td></tr>\n",
+ " <tr><th>297</th><td>0</td><td>1</td><td>female</td><td>2.0</td><td>1</td><td>2</td><td>151.5500</td><td>S</td><td>First</td><td>child</td><td>False</td><td>C</td><td>Southampton</td><td>no</td><td>False</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>" + ], + "text/plain": [ + " survived pclass sex age ... deck embark_town alive alone\n", + "10 1 3 female 4.0 ... G Southampton yes False\n", + "183 1 2 male 1.0 ... F Southampton yes False\n", + "193 1 2 male 3.0 ... F Southampton yes False\n", + "205 0 3 female 2.0 ... G Southampton no False\n", + "297 0 1 female 2.0 ... C Southampton no False\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YGnZlzDDlyZO", + "outputId": "d5e0eb2c-2637-4021-f6e7-885b3e74161d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 286 + } + }, + "source": [ + "# Zoom in on the row with index label 10\n", + "df_titanic.loc[10]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 1\n", + "pclass 3\n", + "sex female\n", + "age 4\n", + "sibsp 1\n", + "parch 1\n", + "fare 16.7\n", + "embarked S\n", + "class Third\n", + "who child\n", + "adult_male False\n", + "deck G\n", + "embark_town Southampton\n", + "alive yes\n", + "alone False\n", + "Name: 10, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 23 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YVhBJua_sG-u", + "outputId": "2b9f3aad-9f81-41e6-882d-cf5b3c0a0d02", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 136 + } + }, + "source": [ + "# Inspect the rows in the right tail\n", + "df_titanic.loc[df_titanic['age'] > 65].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealone
9601male71.00034.6542CFirstmanTrueACherbourgnoTrue
63011male80.00030.0000SFirstmanTrueASouthamptonyesTrue
74501male70.01171.0000SFirstmanTrueBSouthamptonnoFalse
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age ... deck embark_town alive alone\n", + "96 0 1 male 71.0 ... A Cherbourg no True\n", + "630 1 1 male 80.0 ... A Southampton yes True\n", + "745 0 1 male 70.0 ... B Southampton no False\n", + "\n", + "[3 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LRkUWSletcq-", + "outputId": "6fb001bd-c1d8-407b-ffd3-15ea7405dda0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 286 + } + }, + "source": [ + "# Zoom na linha 96\n", + "df_titanic.loc[96]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 0\n", + "pclass 1\n", + "sex male\n", + "age 71\n", + "sibsp 0\n", + "parch 0\n", + "fare 34.6542\n", + "embarked C\n", + "class First\n", + "who man\n", + "adult_male True\n", + "deck A\n", + "embark_town Cherbourg\n", + "alive no\n", + "alone True\n", + "Name: 96, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JQKECo0BSefE", + "outputId": "46a0911a-3cf8-43d1-8232-f05c710e54dc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 454 + } + }, + "source": [ + "sns.regplot(x = \"age\", y = \"fare\", data = df_titanic_ss)\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AChZpGY4Ghc9", + "outputId": "eed40cf6-a0e9-438c-e674-4bac462b872f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "cols = ['fare', 'age']\n", + "df_titanic_ss[cols].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
fareage
1-0.1001100.152082
3-0.338485-0.039875
6-0.3547081.175852
10-0.815672-2.023430
11-0.6865431.431795
\n", + "
" + ], + "text/plain": [ + " fare age\n", + "1 -0.100110 0.152082\n", + "3 -0.338485 -0.039875\n", + "6 -0.354708 1.175852\n", + "10 -0.815672 -2.023430\n", + "11 -0.686543 1.431795" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 27 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s2tddgHcUiAF" + }, + "source": [ + "___\n", + "## **CBLOF - Cluster-based Local Outlier Factor**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fbJ7k1bbbfr4" + }, + "source": [ + "# Normalizar as variáveis 'age' e 'fare'\n", + "df_titanic_ss = df_titanic.copy()\n", + "df_titanic_ss[['fare', 'age']] = MinMaxScaler().fit_transform(df_titanic_ss[['fare', 'age']])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "il0LFdCFJEsw" + }, + "source": [ + "X1 = df_titanic_ss['age'].values.reshape(-1, 1)\n", + "X2 = df_titanic_ss['fare'].values.reshape(-1, 1)\n", + "X = np.concatenate((X1,X2), axis = 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QtBn0u7CKlS6", + "outputId": "16701bb5-dcf1-4335-878b-545575f79f73", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 755 + } + }, + "source": [ + "outliers_fraction = 0.01\n", + "xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))\n", + "clf = CBLOF(contamination = outliers_fraction, check_estimator = False, random_state = 0)\n", + "clf.fit(X)\n", + "# predict raw anomaly score\n", + "scores_pred = clf.decision_function(X) * -1\n", + " \n", + "# prediction of a datapoint category outlier or inlier\n", + "y_pred = clf.predict(X)\n", + "n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n", + "n_outliers = np.count_nonzero(y_pred == 1)\n", + "\n", + "plt.figure(figsize = (8, 8))\n", + "\n", + "df1 = df_titanic_ss\n", + "df1['outlier'] = y_pred.tolist()\n", + "\n", + "inliers_fare = np.array(df1['fare'][df1['outlier'] == 0]).reshape(-1,1)\n", + "inliers_age = 
np.array(df1['age'][df1['outlier'] == 0]).reshape(-1,1)\n", + " \n", + "outliers_fare = df1['fare'][df1['outlier'] == 1].values.reshape(-1,1)\n", + "outliers_age = df1['age'][df1['outlier'] == 1].values.reshape(-1,1)\n", + " \n", + "print('OUTLIERS:',n_outliers,'INLIERS:',n_inliers)\n", + " \n", + "# threshold value that decides whether a point is an inlier or an outlier\n", + "# threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction)\n", + "threshold = percentile(scores_pred, 100 * outliers_fraction)\n", + " \n", + "# compute the raw anomaly score for every grid point\n", + "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n", + "Z = Z.reshape(xx.shape)\n", + "\n", + "# fill the blue region from the minimum anomaly score up to the threshold\n", + "plt.contourf(xx, yy, Z, levels = np.linspace(Z.min(), threshold, 7), cmap = plt.cm.Blues_r)\n", + " \n", + "# draw the red contour line where the anomaly score equals the threshold\n", + "a = plt.contour(xx, yy, Z, levels = [threshold], linewidths = 2, colors = 'red')\n", + " \n", + "# fill the orange region from the threshold up to the maximum anomaly score\n", + "plt.contourf(xx, yy, Z, levels= [threshold, Z.max()], colors='orange')\n", + "# X columns are (age, fare): plot age on x and fare on y to match the contour grid\n", + "b = plt.scatter(inliers_age, inliers_fare, c = 'white', s = 20, edgecolor = 'k')\n", + " \n", + "c = plt.scatter(outliers_age, outliers_fare, c = 'black', s = 20, edgecolor = 'k')\n", + " \n", + "plt.axis('tight') \n", + "plt.legend([a.collections[0], b, c], ['learned decision function', 'inliers', 'outliers'],\n", + " prop = matplotlib.font_manager.FontProperties(size = 10), loc = 'upper center', frameon = False, bbox_to_anchor = (0.5, -0.05),\n", + " fancybox = True, shadow = True, ncol = 5)\n", + " \n", + "plt.xlim((0, 1))\n", + "plt.ylim((0, 1))\n", + "plt.title('Cluster-based Local Outlier Factor (CBLOF)')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "OUTLIERS: 2 INLIERS: 180\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O7NmDgjRm5EE", + "outputId": "0d0379c1-b9ce-4aed-bf59-feb9e9d99091", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 106 + } + }, + "source": [ + "# Zoom em alguns outliers...\n", + "df1.loc[df1['outlier'] == 1].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealoneoutlier
67911male0.443601011.0CFirstmanTrueBCherbourgyesFalse1
73711male0.430956001.0CFirstmanTrueBCherbourgyesTrue1
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age ... embark_town alive alone outlier\n", + "679 1 1 male 0.443601 ... Cherbourg yes False 1\n", + "737 1 1 male 0.430956 ... Cherbourg yes True 1\n", + "\n", + "[2 rows x 16 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HIRxOj93nVXu", + "outputId": "8b95aa23-b3f7-4ed6-83bd-31e971b6fa62", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 286 + } + }, + "source": [ + "# Zoom na linha 679\n", + "df_titanic.loc[679]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 1\n", + "pclass 1\n", + "sex male\n", + "age 36\n", + "sibsp 0\n", + "parch 1\n", + "fare 512.329\n", + "embarked C\n", + "class First\n", + "who man\n", + "adult_male True\n", + "deck B\n", + "embark_town Cherbourg\n", + "alive yes\n", + "alone False\n", + "Name: 679, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 32 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "euxK-4K1oKs0", + "outputId": "24a8459e-163d-497c-ba60-b48a792bfff6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 166 + } + }, + "source": [ + "# Algumas medidas para compararmos\n", + "df_resumo = df_titanic.groupby('sex').agg({'age': ['mean'], 'fare': ['mean']}).round(0)\n", + "df_resumo" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
agefare
meanmean
sex
female33.089.0
male38.069.0
\n", + "
" + ], + "text/plain": [ + " age fare\n", + " mean mean\n", + "sex \n", + "female 33.0 89.0\n", + "male 38.0 69.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nuNxqgWMtMHC", + "outputId": "d9f8600f-c28d-4c63-9a14-4045efd4dc1d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Média Geral de 'age'\n", + "round(df_titanic['age'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "36" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 34 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bLIZcvyuuU2R", + "outputId": "772602e6-5c1a-4a83-cb88-857a9b905fba", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Média Geral de 'fare'\n", + "round(df_titanic['fare'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "79" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fFd-D1HTVhE7" + }, + "source": [ + "___\n", + "## **HBOS - Histogram-based Outlier Detection**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q5Hh5iMEXuhM", + "outputId": "8ac4d33b-c4bd-4b03-e7ae-af052af8b2a7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 755 + } + }, + "source": [ + "outliers_fraction = 0.01\n", + "xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))\n", + "clf = HBOS(contamination = outliers_fraction)\n", + "clf.fit(X)\n", + "# predict raw anomaly score\n", + "scores_pred = clf.decision_function(X) * -1\n", + " \n", + "# prediction of a datapoint category outlier or inlier\n", + "y_pred = clf.predict(X)\n", + "n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n", + "n_outliers = 
np.count_nonzero(y_pred == 1)\n", + "plt.figure(figsize = (8, 8))\n", + "# df1 is an alias of df_titanic_ss (not a copy), so 'outlier' is written in place\n", + "df1 = df_titanic_ss\n", + "df1['outlier'] = y_pred.tolist()\n", + " \n", + "inliers_fare = np.array(df1['fare'][df1['outlier'] == 0]).reshape(-1, 1)\n", + "inliers_age = np.array(df1['age'][df1['outlier'] == 0]).reshape(-1, 1)\n", + " \n", + "outliers_fare = df1['fare'][df1['outlier'] == 1].values.reshape(-1, 1)\n", + "outliers_age = df1['age'][df1['outlier'] == 1].values.reshape(-1, 1)\n", + " \n", + "print('OUTLIERS:', n_outliers, 'INLIERS:', n_inliers)\n", + " \n", + "# threshold value that decides whether a point is an inlier or an outlier\n", + "threshold = percentile(scores_pred, 100 * outliers_fraction)\n", + " \n", + "# compute the raw anomaly score for every grid point\n", + "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n", + "Z = Z.reshape(xx.shape)\n", + "\n", + "# fill the blue region from the minimum anomaly score up to the threshold\n", + "plt.contourf(xx, yy, Z, levels = np.linspace(Z.min(), threshold, 7), cmap = plt.cm.Blues_r)\n", + " \n", + "# draw the red contour line where the anomaly score equals the threshold\n", + "a = plt.contour(xx, yy, Z, levels = [threshold], linewidths = 2, colors = 'red')\n", + " \n", + "# fill the orange region from the threshold up to the maximum anomaly score\n", + "plt.contourf(xx, yy, Z, levels = [threshold, Z.max()],colors='orange')\n", + "# X columns are (age, fare): plot age on x and fare on y to match the contour grid\n", + "b = plt.scatter(inliers_age, inliers_fare, c='white',s=20, edgecolor='k')\n", + " \n", + "c = plt.scatter(outliers_age, outliers_fare, c='black',s=20, edgecolor='k')\n", + " \n", + "plt.axis('tight') \n", + " \n", + "plt.legend([a.collections[0], b, c], ['learned decision function', 'inliers', 'outliers'],\n", + " prop=matplotlib.font_manager.FontProperties(size = 10), loc ='upper center', frameon = False, bbox_to_anchor = (0.5, -0.05),\n", + " fancybox = True, shadow = True, ncol = 5)\n", + " \n", + "plt.xlim((0, 1))\n", + "plt.ylim((0, 1))\n", + "plt.title('Histogram-based Outlier Detection (HBOS)')\n", + "plt.show();" 
+ ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "OUTLIERS: 2 INLIERS: 180\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gHRoON0BnLVb", + "outputId": "73ec2b80-9edf-445b-8276-baa505178432", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 106 + } + }, + "source": [ + "# Zoom em alguns outliers...\n", + "df1.loc[df1['outlier'] == 1].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealoneoutlier
31811female0.380374020.321798SFirstwomanFalseCSouthamptonyesFalse1
68911female0.178048010.412503SFirstchildFalseBSouthamptonyesFalse1
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age ... embark_town alive alone outlier\n", + "318 1 1 female 0.380374 ... Southampton yes False 1\n", + "689 1 1 female 0.178048 ... Southampton yes False 1\n", + "\n", + "[2 rows x 16 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YblU2tnxnXi7", + "outputId": "c96e85f5-7565-46bf-c355-223297a1d980", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 286 + } + }, + "source": [ + "# Zoom na linha 689\n", + "df_titanic.loc[689]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 1\n", + "pclass 1\n", + "sex female\n", + "age 15\n", + "sibsp 0\n", + "parch 1\n", + "fare 211.338\n", + "embarked S\n", + "class First\n", + "who child\n", + "adult_male False\n", + "deck B\n", + "embark_town Southampton\n", + "alive yes\n", + "alone False\n", + "Name: 689, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 38 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AkWj5aQ-uzxB", + "outputId": "1a12b0f5-1cec-4a43-8fef-29c2ca67e779", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 166 + } + }, + "source": [ + "# Algumas medidas para compararmos\n", + "df_resumo = df_titanic.groupby('sex').agg({'age': ['mean'], 'fare': ['mean']}).round(0)\n", + "df_resumo" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
agefare
meanmean
sex
female33.089.0
male38.069.0
\n", + "
" + ], + "text/plain": [ + " age fare\n", + " mean mean\n", + "sex \n", + "female 33.0 89.0\n", + "male 38.0 69.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EVy5NDrFujgD", + "outputId": "e3e2dbd7-b4fd-445a-fac4-3dbf9139ba08", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Média Geral de 'age'\n", + "round(df_titanic['age'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "36" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Hgcp_LU6ujgJ", + "outputId": "e6959579-db19-4fbd-cc87-0b42421a0b90", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Média Geral de 'fare'\n", + "round(df_titanic['fare'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "79" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KyPUT9JmWeN-" + }, + "source": [ + "___\n", + "## **Isolation Forest**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Lrx85bG0YOqM", + "outputId": "84485503-9084-42e2-f251-56626897b391", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 755 + } + }, + "source": [ + "outliers_fraction = 0.01\n", + "xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))\n", + "clf = IForest(contamination = outliers_fraction,random_state = 0)\n", + "clf.fit(X)\n", + "# predict raw anomaly score\n", + "scores_pred = clf.decision_function(X) * -1\n", + " \n", + "# prediction of a datapoint category outlier or inlier\n", + "y_pred = clf.predict(X)\n", + "n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n", + "n_outliers = np.count_nonzero(y_pred 
== 1)\n", + "plt.figure(figsize = (8, 8))\n", + "# df1 is an alias of df_titanic_ss (not a copy), so 'outlier' is written in place\n", + "df1 = df_titanic_ss\n", + "df1['outlier'] = y_pred.tolist()\n", + " \n", + "# inliers: age is model feature 1, fare is model feature 2\n", + "inliers_fare = np.array(df1['fare'][df1['outlier'] == 0]).reshape(-1,1)\n", + "inliers_age = np.array(df1['age'][df1['outlier'] == 0]).reshape(-1,1)\n", + " \n", + "# outliers: age is model feature 1, fare is model feature 2\n", + "outliers_fare = df1['fare'][df1['outlier'] == 1].values.reshape(-1,1)\n", + "outliers_age = df1['age'][df1['outlier'] == 1].values.reshape(-1,1)\n", + " \n", + "print('OUTLIERS: ', n_outliers,'INLIERS: ', n_inliers)\n", + " \n", + "# threshold value to consider a datapoint inlier or outlier\n", + "threshold = percentile(scores_pred, 100 * outliers_fraction)\n", + " \n", + "# decision function calculates the raw anomaly score for every point\n", + "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n", + "Z = Z.reshape(xx.shape)\n", + "# fill blue map colormap from minimum anomaly score to threshold value\n", + "plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)\n", + " \n", + "# draw red contour line where anomaly score is equal to threshold\n", + "a = plt.contour(xx, yy, Z, levels= [threshold],linewidths=2, colors='red')\n", + " \n", + "# fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score\n", + "plt.contourf(xx, yy, Z, levels= [threshold, Z.max()],colors='orange')\n", + "# X columns are (age, fare): plot age on x and fare on y to match the contour grid\n", + "b = plt.scatter(inliers_age, inliers_fare, c='white',s=20, edgecolor='k')\n", + " \n", + "c = plt.scatter(outliers_age, outliers_fare, c='black',s=20, edgecolor='k')\n", + " \n", + "plt.axis('tight')\n", + "plt.legend([a.collections[0], b,c], ['learned decision function', 'inliers', 'outliers'],\n", + " prop=matplotlib.font_manager.FontProperties(size = 10), loc='upper center', frameon= False, bbox_to_anchor = (0.5, -0.05),\n", + " fancybox = True, shadow = True, ncol=5)\n", + " 
\n", + "plt.xlim((0, 1))\n", + "plt.ylim((0, 1))\n", + "plt.title('Isolation Forest')\n", + "plt.show();" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "OUTLIERS: 2 INLIERS: 180\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HLVraGcCnNTA", + "outputId": "957c19be-cec9-4d4a-9902-0f49f806198f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 106 + } + }, + "source": [ + "# Zoom em alguns outliers...\n", + "df1.loc[df1['outlier'] == 1].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealoneoutlier
67911male0.443601011.0CFirstmanTrueBCherbourgyesFalse1
73711male0.430956001.0CFirstmanTrueBCherbourgyesTrue1
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age ... embark_town alive alone outlier\n", + "679 1 1 male 0.443601 ... Cherbourg yes False 1\n", + "737 1 1 male 0.430956 ... Cherbourg yes True 1\n", + "\n", + "[2 rows x 16 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 43 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y0WBmFOonZKY", + "outputId": "06cdcee2-c8fc-44ec-aa7b-f33d16d4d7c2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 286 + } + }, + "source": [ + "# Zoom na linha 679\n", + "df_titanic.loc[679]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 1\n", + "pclass 1\n", + "sex male\n", + "age 36\n", + "sibsp 0\n", + "parch 1\n", + "fare 512.329\n", + "embarked C\n", + "class First\n", + "who man\n", + "adult_male True\n", + "deck B\n", + "embark_town Cherbourg\n", + "alive yes\n", + "alone False\n", + "Name: 679, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 44 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "auSy5b6Du3PH", + "outputId": "b1aae710-3d55-4ff1-81e1-635fdb96a8ca", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 166 + } + }, + "source": [ + "# Algumas medidas para compararmos\n", + "df_resumo = df_titanic.groupby('sex').agg({'age': ['mean'], 'fare': ['mean']}).round(0)\n", + "df_resumo" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
agefare
meanmean
sex
female33.089.0
male38.069.0
\n", + "
" + ], + "text/plain": [ + " age fare\n", + " mean mean\n", + "sex \n", + "female 33.0 89.0\n", + "male 38.0 69.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 45 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fIQg2D6fuoSG", + "outputId": "94244010-8314-481c-a428-b526d3b70ca8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Média Geral de 'age'\n", + "round(df_titanic['age'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "36" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 46 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pNUds1oDuoSO", + "outputId": "64f51541-4329-404d-8d0b-f65faf2b397e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Média Geral de 'fare'\n", + "round(df_titanic['fare'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "79" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 47 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QbpzXB2RV4sq" + }, + "source": [ + "___\n", + "## **KNN - K-Nearest Neighbors**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6gtIWWbRYxEj", + "outputId": "8bc4ae18-4a33-489e-d13e-98a3160a9e13", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 755 + } + }, + "source": [ + "outliers_fraction = 0.01\n", + "xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))\n", + "clf = KNN(contamination = outliers_fraction)\n", + "clf.fit(X)\n", + "# predict raw anomaly score\n", + "scores_pred = clf.decision_function(X) * -1\n", + " \n", + "# prediction of a datapoint category outlier or inlier\n", + "y_pred = clf.predict(X)\n", + "n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n", + "n_outliers = np.count_nonzero(y_pred == 1)\n", + 
"plt.figure(figsize = (8, 8))\n", + "# df1 is an alias of df_titanic_ss (not a copy), so 'outlier' is written in place\n", + "df1 = df_titanic_ss\n", + "df1['outlier'] = y_pred.tolist()\n", + " \n", + "inliers_fare = np.array(df1['fare'][df1['outlier'] == 0]).reshape(-1,1)\n", + "inliers_age = np.array(df1['age'][df1['outlier'] == 0]).reshape(-1,1)\n", + " \n", + "outliers_fare = df1['fare'][df1['outlier'] == 1].values.reshape(-1,1)\n", + "outliers_age = df1['age'][df1['outlier'] == 1].values.reshape(-1,1)\n", + " \n", + "print('OUTLIERS: ',n_outliers, 'INLIERS: ', n_inliers)\n", + " \n", + "# threshold value to consider a datapoint inlier or outlier\n", + "threshold = percentile(scores_pred, 100 * outliers_fraction)\n", + " \n", + "# decision function calculates the raw anomaly score for every point\n", + "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n", + "Z = Z.reshape(xx.shape)\n", + "# fill blue map colormap from minimum anomaly score to threshold value\n", + "plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)\n", + " \n", + "# draw red contour line where anomaly score is equal to threshold\n", + "a = plt.contour(xx, yy, Z, levels= [threshold],linewidths=2, colors='red')\n", + " \n", + "# fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score\n", + "plt.contourf(xx, yy, Z, levels= [threshold, Z.max()],colors='orange')\n", + "# X columns are (age, fare): plot age on x and fare on y to match the contour grid\n", + "b = plt.scatter(inliers_age, inliers_fare, c='white',s=20, edgecolor='k')\n", + " \n", + "c = plt.scatter(outliers_age, outliers_fare, c='black',s=20, edgecolor='k')\n", + " \n", + "plt.axis('tight') \n", + " \n", + "plt.legend([a.collections[0], b,c], ['learned decision function', 'inliers', 'outliers'],\n", + " prop=matplotlib.font_manager.FontProperties(size=10), loc='upper center', frameon= False, bbox_to_anchor = (0.5, -0.05),\n", + " fancybox = True, shadow = True, ncol = 5)\n", + " \n", + "plt.xlim((0, 1))\n", + "plt.ylim((0, 1))\n", + "plt.title('K-Nearest Neighbors (KNN)')\n", + "plt.show();" + ], 
+ "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "OUTLIERS: 2 INLIERS: 180\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6B-L7MwXg25Z", + "outputId": "0f8d806d-9e19-4133-c038-01d96c26b168", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df1.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealoneoutlier
111female0.468892100.139136CFirstwomanFalseCCherbourgyesFalse0
311female0.430956100.103644SFirstwomanFalseCSouthamptonyesFalse0
601male0.671219000.101229SFirstmanTrueESouthamptonnoTrue0
1013female0.038948110.032596SThirdchildFalseGSouthamptonyesFalse0
1111female0.721801000.051822SFirstwomanFalseCSouthamptonyesTrue0
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age ... embark_town alive alone outlier\n", + "1 1 1 female 0.468892 ... Cherbourg yes False 0\n", + "3 1 1 female 0.430956 ... Southampton yes False 0\n", + "6 0 1 male 0.671219 ... Southampton no True 0\n", + "10 1 3 female 0.038948 ... Southampton yes False 0\n", + "11 1 1 female 0.721801 ... Southampton yes True 0\n", + "\n", + "[5 rows x 16 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 49 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gvXGH0BHBBNN", + "outputId": "b44557de-c87c-4012-adb9-63597c401390", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 106 + } + }, + "source": [ + "# Zoom em alguns outliers...\n", + "df1.loc[df1['outlier'] == 1].head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealoneoutlier
67911male0.443601011.0CFirstmanTrueBCherbourgyesFalse1
73711male0.430956001.0CFirstmanTrueBCherbourgyesTrue1
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age ... embark_town alive alone outlier\n", + "679 1 1 male 0.443601 ... Cherbourg yes False 1\n", + "737 1 1 male 0.430956 ... Cherbourg yes True 1\n", + "\n", + "[2 rows x 16 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 50 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MYbNaaO7D3NY", + "outputId": "bc9fcf92-8937-42e7-b79f-84a17a50ecb3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 286 + } + }, + "source": [ + "# Inspect row 679 in the original (unscaled) dataframe\n", + "df_titanic.loc[679]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 1\n", + "pclass 1\n", + "sex male\n", + "age 36\n", + "sibsp 0\n", + "parch 1\n", + "fare 512.329\n", + "embarked C\n", + "class First\n", + "who man\n", + "adult_male True\n", + "deck B\n", + "embark_town Cherbourg\n", + "alive yes\n", + "alone False\n", + "Name: 679, dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 51 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-juEvWvru5jp", + "outputId": "6f5f85aa-7249-4c32-8473-db43dc01eec3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 166 + } + }, + "source": [ + "# Some summary measures for comparison\n", + "df_resumo = df_titanic.groupby('sex').agg({'age': ['mean'], 'fare': ['mean']}).round(0)\n", + "df_resumo" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
agefare
meanmean
sex
female33.089.0
male38.069.0
\n", + "
" + ], + "text/plain": [ + " age fare\n", + " mean mean\n", + "sex \n", + "female 33.0 89.0\n", + "male 38.0 69.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 52 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B6NXG6oDusSg", + "outputId": "707873f7-a36b-45a5-e184-2aaf789985fc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Overall mean of 'age'\n", + "round(df_titanic['age'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "36" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 53 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cgHJb3iBusSl", + "outputId": "47a1c2c1-498e-4f85-8176-5464ac8077c4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Overall mean of 'fare'\n", + "round(df_titanic['fare'].mean())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "79" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 54 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1w7MIkoAG2Qr" + }, + "source": [ + "___\n", + "# **Exercises**\n", + "For each of the dataframes below, run an outlier analysis using one of the techniques presented and explain your results." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ep_Z3iQIG56r" + }, + "source": [ + "## Exercise 1 - Predict Breast Cancer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v-Lvzrt7HN2l", + "outputId": "035c44f6-200d-41ab-c89d-4622f5b0bec6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 249 + } + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "cancer = load_breast_cancer()\n", + "X = cancer['data']\n", + "y = cancer['target']\n", + "\n", + "df_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n", + "df_cancer['target'] = df_cancer['target'].map({0: 'malign', 1: 'benign'})\n", + "df_cancer.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
mean radiusmean texturemean perimetermean areamean smoothnessmean compactnessmean concavitymean concave pointsmean symmetrymean fractal dimensionradius errortexture errorperimeter errorarea errorsmoothness errorcompactness errorconcavity errorconcave points errorsymmetry errorfractal dimension errorworst radiusworst textureworst perimeterworst areaworst smoothnessworst compactnessworst concavityworst concave pointsworst symmetryworst fractal dimensiontarget
017.9910.38122.801001.00.118400.277600.30010.147100.24190.078711.09500.90538.589153.400.0063990.049040.053730.015870.030030.00619325.3817.33184.602019.00.16220.66560.71190.26540.46010.11890malign
120.5717.77132.901326.00.084740.078640.08690.070170.18120.056670.54350.73393.39874.080.0052250.013080.018600.013400.013890.00353224.9923.41158.801956.00.12380.18660.24160.18600.27500.08902malign
219.6921.25130.001203.00.109600.159900.19740.127900.20690.059990.74560.78694.58594.030.0061500.040060.038320.020580.022500.00457123.5725.53152.501709.00.14440.42450.45040.24300.36130.08758malign
311.4220.3877.58386.10.142500.283900.24140.105200.25970.097440.49561.15603.44527.230.0091100.074580.056610.018670.059630.00920814.9126.5098.87567.70.20980.86630.68690.25750.66380.17300malign
420.2914.34135.101297.00.100300.132800.19800.104300.18090.058830.75720.78135.43894.440.0114900.024610.056880.018850.017560.00511522.5416.67152.201575.00.13740.20500.40000.16250.23640.07678malign
\n", + "
" + ], + "text/plain": [ + " mean radius mean texture ... worst fractal dimension target\n", + "0 17.99 10.38 ... 0.11890 malign\n", + "1 20.57 17.77 ... 0.08902 malign\n", + "2 19.69 21.25 ... 0.08758 malign\n", + "3 11.42 20.38 ... 0.17300 malign\n", + "4 20.29 14.34 ... 0.07678 malign\n", + "\n", + "[5 rows x 31 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 55 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cEHLrU0gHRtu" + }, + "source": [ + "## Exercise 2 - Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8G9GZnubHYjy", + "outputId": "c83f182b-c3f4-4d55-9ac8-ce1690d71acf", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.\n", + "from sklearn.datasets import load_boston\n", + "\n", + "boston = load_boston()\n", + "X = boston['data']\n", + "y = boston['target']\n", + "\n", + "df_boston = pd.DataFrame(np.c_[X, y], columns = np.append(boston['feature_names'], ['target']))\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATtarget
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
\n", + "
" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT target\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 56 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QlAdIYfmHaE8" + }, + "source": [ + "## Exercise 3 - Iris\n", + "* See [this Wikipedia article](https://en.wikipedia.org/wiki/Iris_flower_data_set) for more information about the Iris dataset." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rke4C3wFHfYU", + "outputId": "7a1966b5-c787-4130-9ff0-55cc65bdbba2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "iris = load_iris()\n", + "X= iris['data']\n", + "y= iris['target']\n", + "\n", + "df_iris = pd.DataFrame(np.c_[X, y], columns = np.append(iris['feature_names'], ['target']))\n", + "df_iris['target'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)target
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) ... petal width (cm) target\n", + "0 5.1 3.5 ... 0.2 setosa\n", + "1 4.9 3.0 ... 0.2 setosa\n", + "2 4.7 3.2 ... 0.2 setosa\n", + "3 4.6 3.1 ... 0.2 setosa\n", + "4 5.0 3.6 ... 0.2 setosa\n", + "\n", + "[5 rows x 5 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 57 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6qn3gC4NHj-p" + }, + "source": [ + "## Exercise 4 - Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P-esq5TSHnf6", + "outputId": "eb042bc6-ad2f-49cb-eb35-2e1ba208e93d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "diabetes = load_diabetes()\n", + "X = diabetes['data']\n", + "y = diabetes['target']\n", + "\n", + "df_diabetes = pd.DataFrame(np.c_[X, y], columns = np.append(diabetes['feature_names'], ['target']))\n", + "df_diabetes.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
agesexbmibps1s2s3s4s5s6target
00.0380760.0506800.0616960.021872-0.044223-0.034821-0.043401-0.0025920.019908-0.017646151.0
1-0.001882-0.044642-0.051474-0.026328-0.008449-0.0191630.074412-0.039493-0.068330-0.09220475.0
20.0852990.0506800.044451-0.005671-0.045599-0.034194-0.032356-0.0025920.002864-0.025930141.0
3-0.089063-0.044642-0.011595-0.0366560.0121910.024991-0.0360380.0343090.022692-0.009362206.0
40.005383-0.044642-0.0363850.0218720.0039350.0155960.008142-0.002592-0.031991-0.046641135.0
\n", + "
" + ], + "text/plain": [ + " age sex bmi bp ... s4 s5 s6 target\n", + "0 0.038076 0.050680 0.061696 0.021872 ... -0.002592 0.019908 -0.017646 151.0\n", + "1 -0.001882 -0.044642 -0.051474 -0.026328 ... -0.039493 -0.068330 -0.092204 75.0\n", + "2 0.085299 0.050680 0.044451 -0.005671 ... -0.002592 0.002864 -0.025930 141.0\n", + "3 -0.089063 -0.044642 -0.011595 -0.036656 ... 0.034309 0.022692 -0.009362 206.0\n", + "4 0.005383 -0.044642 -0.036385 0.021872 ... -0.002592 -0.031991 -0.046641 135.0\n", + "\n", + "[5 rows x 11 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 58 + } + ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_04__3DP_4_Anomaly_Detection_hs2.ipynb b/Notebooks/NB10_04__3DP_4_Anomaly_Detection_hs2.ipynb new file mode 100644 index 000000000..5ff30efaf --- /dev/null +++ b/Notebooks/NB10_04__3DP_4_Anomaly_Detection_hs2.ipynb @@ -0,0 +1,1484 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "NB10_04__3DP_4_Anomaly_Detection.ipynb", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EAqSDJGzyYrx" + }, + "source": [ + "

3DP_4 - ANOMALY/OUTLIER DETECTION

\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H-VrOjTTymSK" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> See the **Table of contents**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wSAsbafemNax" + }, + "source": [ + "# **Session improvements**\n", + "* Show the plots of the Anomaly Score region together with the probability distribution of the variables involved.\n", + "* Deprecation messages --> review and replace the deprecated methods and functions.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7qK6Yx0tBqUz" + }, + "source": [ + "___\n", + "# **References**\n", + "* [Comparing anomaly detection algorithms for outlier detection on toy datasets](https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py)\n", + "* [Outlier detection with several methods](https://scikit-learn.org/0.18/auto_examples/covariance/plot_outlier_detection.html)\n", + "* [anomaly-detection-resources](https://github.com/MathMachado/anomaly-detection-resources)\n", + "* [Outlier Detection with Extended Isolation Forest](https://towardsdatascience.com/outlier-detection-with-extended-isolation-forest-1e248a3fe97b)\n", + "* [Outlier Detection with Isolation Forest](https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f7tTnUJ6B2UG" + }, + "source": [ + "___\n", + "## What is Anomaly Detection (= Outlier Analysis)?\n", + "> The task of identifying any point/observation that is unusual when compared with all the other points/observations." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7VJZf1U5Ds_w" + }, + "source": [ + "___\n", + "# **Machine Learning with Python (Scikit-Learn)**\n", + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rpHJ1qVUEwOn" + }, + "source": [ + "___\n", + "# **Traditional techniques for outlier detection**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OOI_VTo3E3sv" + }, + "source": [ + "## Boxplot\n", + "* $IQR = Q_{3}-Q_{1}$\n", + "* Points below $Q_{1} - 1.5 \\times IQR$ or above $Q_{3} + 1.5 \\times IQR$ are usually flagged as outliers.\n", + "\n", + "![BoxPlot](https://github.com/MathMachado/Materials/blob/master/boxplot.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vivFsmJGFVC0" + }, + "source": [ + "## Z-Score\n", + "* The Z-score can be used to detect outliers.\n", + "* It is the difference between a value and the sample mean, expressed in number of standard deviations: $z = (x - \\mu)/\\sigma$.\n", + "* If the z-score is below -2.5 or above 2.5, the value lies among the roughly 1.2% most extreme values of a Normal distribution (about 0.6% in each tail). 
However, it is common practice to use 3 instead of 2.5.\n", + "\n", + "![Z_Score](https://github.com/MathMachado/Materials/blob/master/Z_Score.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7_YohlTIF8zi" + }, + "source": [ + "___\n", + "# **Hands-On**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OrXdGg8t0V_D" + }, + "source": [ + "## Load the required libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7pYqwxIe1Hcq" + }, + "source": [ + "!pip install pyod" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gxBgvhA4mowO" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from numpy import percentile\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "import matplotlib\n", + "\n", + "from sklearn.ensemble import IsolationForest\n", + "\n", + "# Scaling variables\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "\n", + "from pyod.models.abod import ABOD\n", + "from pyod.models.cblof import CBLOF\n", + "\n", + "#from pyod.models.feature_bagging import FeatureBagging\n", + "from pyod.models.hbos import HBOS\n", + "from pyod.models.iforest import IForest\n", + "from pyod.models.knn import KNN\n", + "#from pyod.models.lof import LOF\n", + "from scipy import stats\n", + "\n", + "# remove warnings to keep notebook clean\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WLf_c29t0ekj" + }, + "source": [ + "## Load the dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GZyPw_7RRx26" + }, + "source": [ + "df_titanic = sns.load_dataset('titanic')\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zGqsV7kxSSCj" 
+ }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YL_VQljA0gxZ" + }, + "source": [ + "# For simplicity, drop every row that has missing values\n", + "df_titanic = df_titanic.dropna() # This is not the proper approach! Give the missing values (NaN) in the dataset appropriate treatment.\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q2oxyyQWB-uz" + }, + "source": [ + "# Standardize the 'age' and 'fare' variables (zero mean, unit variance)\n", + "df_titanic_ss = df_titanic.copy()\n", + "df_titanic_ss[['fare', 'age']] = StandardScaler().fit_transform(df_titanic_ss[['fare', 'age']])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rAKnKtil9Oz1" + }, + "source": [ + "# Shape (rows, columns) of df_titanic_ss\n", + "df_titanic_ss.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sHSYUkEQFIwS" + }, + "source": [ + "# Helper that draws a boxplot of `column` split by 'survived'\n", + "def boxplot_sobreviveu(df, column):\n", + "    plt.rcdefaults()\n", + "    sns.catplot(x = 'survived', y = column, kind = \"box\", data = df, height = 4, aspect = 1.5)\n", + "\n", + "    # add data points to boxplot with stripplot\n", + "    sns.stripplot(x = 'survived', y = column, data = df, alpha = 0.3, jitter = 0.2, color = 'k');\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z7F4zWltT7l6" + }, + "source": [ + "This is the univariate view of the 'fare' variable:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y4pahkOLUX1D" + }, + "source": [ + "df_titanic[['survived']].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "o9-VgcNnFNb1" + }, + "source": [ + "boxplot_sobreviveu(df_titanic, 'fare')" + ], + "execution_count": null, + 
"outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BIN9WDB1ffb9" + }, + "source": [ + "boxplot_sobreviveu(df_titanic_ss, 'fare')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8FIo9tD1FQ0u" + }, + "source": [ + "boxplot_sobreviveu(df_titanic, 'age')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fCqj102y9Kfo" + }, + "source": [ + "# Describe the 'fare' variable of the dataframe\n", + "pd.set_option('display.float_format', lambda x: '%.3f' %x)\n", + "df_titanic_ss['fare'].describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RwVlK5jXe_ya" + }, + "source": [ + "## Show the cumulative frequency: it is more informative" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SMcvIb1K_69n" + }, + "source": [ + "plt.scatter(range(df_titanic_ss.shape[0]), np.sort(df_titanic_ss['fare'].values)) # The plot shows the values of df_titanic_ss['fare'] sorted from smallest to largest\n", + "plt.xlabel('indices')\n", + "plt.ylabel('fares')\n", + "plt.title(\"Distribution of the fare variable\")\n", + "\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6nuuTtqHh0Kk" + }, + "source": [ + "The plot above shows which points, and how many, lie above $\\mu + 3\\sigma$ (a value of 3 on the standardized data) --> under the z-score criterion, every point above this threshold is an outlier.\n", + "\n", + "To do: place the plots side by side, df_titanic and df_titanic_ss." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7pzTvLleGpWc" + }, + "source": [ + "# Distribution of the 'fare' variable (after StandardScaler)\n", + "sns.distplot(df_titanic_ss['fare'])\n", + "plt.title(\"Distribution of the fare variable\")\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3Cumr43hjByz" + }, + "source": [ + "# Distribution of the 'fare' variable (original scale, before StandardScaler)\n", + "sns.distplot(df_titanic['fare'])\n", + "plt.title(\"Distribution of the fare variable\")\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oXeTALDBk57N" + }, + "source": [ + "### Compute the median and compare it with the mean." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vB7ik75rjJUI" + }, + "source": [ + "df_titanic.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qa28Hc3ZC6FV" + }, + "source": [ + "___\n", + "## Kurtosis\n", + "> Kurtosis is a statistical measure of how strongly the tails of a distribution differ from the tails of a Normal distribution. In other words, kurtosis indicates whether the tails of a given distribution contain extreme values.\n", + ">> The kurtosis of a Normal distribution equals 3. The difference Kurtosis - 3 is called excess kurtosis, and a positive value indicates tails heavier than those of the Normal. Note that pandas `.kurt()` already returns the excess kurtosis (0 for a Normal distribution).\n", + ">>> **High kurtosis is an indicator that the data have heavy tails or outliers**.\n", + "\n", + "* **Very important tip**: normalize the data first!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ynyNHZqmD-tb" + }, + "source": [ + "___\n", + "## Skewness\n", + "> It is the degree of distortion of a distribution, i.e. it measures the lack of symmetry in the data distribution, distinguishing extreme values in one tail from those in the other. 
A symmetric distribution has a skewness of 0.\n", + "\n", + "![Skewness](https://github.com/MathMachado/Materials/blob/master/Skewness.png?raw=true)\n", + "\n", + "Source: [Skew and Kurtosis: 2 Important Statistics terms you need to know in Data Science](https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uoo3xVhBFixi" + }, + "source": [ + "### Interpreting Skewness (Rule of Thumb)\n", + "* If -0.5 < Skewness < 0.5: the data are fairly symmetric;\n", + "* If -1 < Skewness < -0.5: the data are moderately negatively skewed;\n", + "* If 0.5 < Skewness < 1: the data are moderately positively skewed;\n", + "* If Skewness < -1: the data are highly negatively skewed;\n", + "* If Skewness > 1: the data are highly positively skewed.\n", + "\n", + "> **Tip**: normalize the data first!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oHg3nyjUTiRu" + }, + "source": [ + "# Compute the Skewness and Kurtosis of 'fare'\n", + "print(f\"Skewness: {df_titanic_ss['fare'].skew()}\")\n", + "print(f\"Kurtosis: {df_titanic_ss['fare'].kurt()}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V2jCZLGVH3Qu" + }, + "source": [ + "Looking at the Skewness and Kurtosis values just above, what do you conclude?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0nnFS8vi_rOe" + }, + "source": [ + "# Summary statistics of the 'age' variable\n", + "df_titanic_ss['age'].describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h9ZmvO1b_4sF" + }, + "source": [ + "plt.scatter(range(df_titanic_ss.shape[0]), np.sort(df_titanic_ss['age'].values))\n", + "plt.xlabel('index')\n", + "plt.ylabel('age')\n", + "plt.title(\"Distribution of the age variable\")\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GIAYrDJyCT6r" + }, + "source": [ + "sns.distplot(df_titanic_ss['age'])\n", + "plt.title(\"Distribution of the age variable\")\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "USy48-H2UXqB" + }, + "source": [ + "# Compute the Skewness and Kurtosis of 'age'\n", + "print(f\"Skewness: {df_titanic_ss['age'].skew()}\")\n", + "print(f\"Kurtosis: {df_titanic_ss['age'].kurt()}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ENQaVw2lItVL" + }, + "source": [ + "Looking at the Skewness and Kurtosis values just above, what do you conclude?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nt0PQIjW-wXd" + }, + "source": [ + "___\n", + "## **Isolation Forest Region**\n", + "* Source: [Outlier Detection with Isolation Forest](https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tM6Xht76KmUN" + }, + "source": [ + "### Anomaly Detection for 'fare'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uFuAUh5S778M", + "outputId": "b2fda534-a71c-4bd0-b16d-444ab30794d6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 380 + } + }, + "source": [ + "# Instantiate the model\n", + "isolation_forest = IsolationForest(n_estimators = 100)\n", + "\n", + "# Fit the model (fit())\n", + "isolation_forest.fit(df_titanic['fare'].values.reshape(-1, 1))\n", + "xx = np.linspace(df_titanic['fare'].min(), df_titanic['fare'].max(), len(df_titanic)).reshape(-1, 1)\n", + "\n", + "anomaly_score = isolation_forest.decision_function(xx)\n", + "outlier = isolation_forest.predict(xx)\n", + "plt.figure(figsize = (10, 4))\n", + "plt.plot(xx, anomaly_score, label = 'anomaly score')\n", + "plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), where = outlier == -1, color = 'r', alpha = .4, label = 'outlier region')\n", + "plt.legend()\n", + "plt.ylabel('anomaly score')\n", + "plt.xlabel('fare')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F2-nWZ3KoPJ8" + }, + "source": [ + "**Conclusion**: anomaly_score > 0 --> not an outlier. Otherwise, that is, if anomaly_score < 0 --> outlier." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FkhRwo1cgYtK" + }, + "source": [ + "# Inspect the rows of the dataframe with fare > 200, for example\n", + "df_titanic.loc[df_titanic['fare'] > 200].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "XFbRlmrYgtTS" + }, + "source": [ + "# Inspect row 27\n", + "df_titanic.loc[27]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bH4o-CL-N9Np" + }, + "source": [ + "The region where the data have a low probability of occurring lies on the right side of the distribution." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7HK9cBvwGOqG" + }, + "source": [ + "### Anomaly Detection for 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PoDzs4DTFSY-" + }, + "source": [ + "isolation_forest = IsolationForest(n_estimators = 100)\n", + "isolation_forest.fit(df_titanic['age'].values.reshape(-1, 1))\n", + "xx = np.linspace(df_titanic['age'].min(), df_titanic['age'].max(), len(df_titanic)).reshape(-1, 1)\n", + "anomaly_score = isolation_forest.decision_function(xx)\n", + "outlier = isolation_forest.predict(xx)\n", + "plt.figure(figsize = (10, 4))\n", + "plt.plot(xx, anomaly_score, label='anomaly score')\n", + "plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), where = outlier == -1, color = 'r', alpha = .4, label = 'outlier region')\n", + "plt.legend()\n", + "plt.ylabel('anomaly score')\n", + "plt.xlabel('age')\n", + "plt.show();" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GivF2cSFS208" + }, + "source": [ + "Note in the plot above 
that there are two regions where the data have a low probability of occurring: one on the left side of the distribution and one on the right side." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XtizVySOlPUT" + }, + "source": [ + "# Inspect the rows in the left tail\n", + "df_titanic.loc[df_titanic['age'] < 15].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YGnZlzDDlyZO" + }, + "source": [ + "# Inspect row 10\n", + "df_titanic.loc[10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YVhBJua_sG-u" + }, + "source": [ + "# Inspect the rows in the right tail\n", + "df_titanic.loc[df_titanic['age'] > 65].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LRkUWSletcq-" + }, + "source": [ + "# Inspect row 96\n", + "df_titanic.loc[96]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JQKECo0BSefE" + }, + "source": [ + "sns.regplot(x = \"age\", y = \"fare\", data = df_titanic_ss)\n", + "sns.despine()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AChZpGY4Ghc9" + }, + "source": [ + "cols = ['fare', 'age']\n", + "df_titanic_ss[cols].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s2tddgHcUiAF" + }, + "source": [ + "___\n", + "## **CBLOF - Cluster-based Local Outlier Factor**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fbJ7k1bbbfr4" + }, + "source": [ + "# Rescale the 'age' and 'fare' variables to the [0, 1] range\n", + "df_titanic_ss = df_titanic.copy()\n", + "df_titanic_ss[['fare', 'age']] = MinMaxScaler().fit_transform(df_titanic_ss[['fare', 'age']])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "il0LFdCFJEsw" + }, + "source": [ + "X1 
= df_titanic_ss['fare'].values.reshape(-1, 1)  # column order (fare, age) matches the x/y order of the scatter plots below\n", + "X2 = df_titanic_ss['age'].values.reshape(-1, 1)\n", + "X = np.concatenate((X1,X2), axis = 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QtBn0u7CKlS6" + }, + "source": [ + "outliers_fraction = 0.01\n", + "xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))\n", + "clf = CBLOF(contamination = outliers_fraction, check_estimator = False, random_state = 0)\n", + "clf.fit(X)\n", + "# predict raw anomaly score\n", + "scores_pred = clf.decision_function(X) * -1\n", + " \n", + "# prediction of a datapoint category outlier or inlier\n", + "y_pred = clf.predict(X)\n", + "n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n", + "n_outliers = np.count_nonzero(y_pred == 1)\n", + "\n", + "plt.figure(figsize = (8, 8))\n", + "\n", + "df1 = df_titanic_ss\n", + "df1['outlier'] = y_pred.tolist()\n", + "\n", + "inliers_fare = np.array(df1['fare'][df1['outlier'] == 0]).reshape(-1,1)\n", + "inliers_age = np.array(df1['age'][df1['outlier'] == 0]).reshape(-1,1)\n", + " \n", + "outliers_fare = df1['fare'][df1['outlier'] == 1].values.reshape(-1,1)\n", + "outliers_age = df1['age'][df1['outlier'] == 1].values.reshape(-1,1)\n", + " \n", + "print('OUTLIERS:',n_outliers,'INLIERS:',n_inliers)\n", + " \n", + "# Threshold used to label a point as inlier or outlier\n", + "# threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction)\n", + "threshold = percentile(scores_pred, 100 * outliers_fraction)\n", + " \n", + "# Compute the anomaly score over the grid\n", + "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n", + "Z = Z.reshape(xx.shape)\n", + "\n", + "plt.contourf(xx, yy, Z, levels = np.linspace(Z.min(), threshold, 7), cmap = plt.cm.Blues_r)\n", + " \n", + "# Draw the red contour line where anomaly score = threshold\n", + "a = plt.contour(xx, yy, Z, levels = [threshold], linewidths = 2, colors = 'red')\n", + " \n", + "# Orange region where threshold < anomaly score < max(anomaly score)\n", + "plt.contourf(xx, yy, Z, levels= [threshold, Z.max()], colors='orange')\n", + "b = plt.scatter(inliers_fare, inliers_age, c = 'white', s = 20, edgecolor = 'k')\n", + " \n", + "c = plt.scatter(outliers_fare, outliers_age, c = 'black', s = 20, edgecolor = 'k')\n", + " \n", + "plt.axis('tight') \n", + "plt.legend([a.collections[0], b, c], ['learned decision function', 'inliers', 'outliers'],\n", + " prop = matplotlib.font_manager.FontProperties(size = 10), loc = 'upper center', frameon = False, bbox_to_anchor = (0.5, -0.05),\n", + " fancybox = True, shadow = True, ncol = 5)\n", + " \n", + "plt.xlim((0, 1))\n", + "plt.ylim((0, 1))\n", + "plt.title('Cluster-based Local Outlier Factor (CBLOF)')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "O7NmDgjRm5EE" + }, + "source": [ + "# Zoom in on some outliers...\n", + "df1.loc[df1['outlier'] == 1].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HIRxOj93nVXu" + }, + "source": [ + "# Zoom in on row 679\n", + "df_titanic.loc[679]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "euxK-4K1oKs0" + }, + "source": [ + "# Some summary statistics for comparison\n", + "df_resumo = df_titanic.groupby('sex').agg({'age': ['mean'], 'fare': ['mean']}).round(0)\n", + "df_resumo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nuNxqgWMtMHC" + }, + "source": [ + "# Overall mean of 'age'\n", + "round(df_titanic['age'].mean())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bLIZcvyuuU2R" + }, + "source": [ + "# Overall mean of 'fare'\n", + "round(df_titanic['fare'].mean())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fFd-D1HTVhE7" + }, 
+ "source": [ + "___\n", + "## **HBOS - Histogram-based Outlier Detection**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q5Hh5iMEXuhM" + }, + "source": [ + "outliers_fraction = 0.01\n", + "xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))\n", + "clf = HBOS(contamination = outliers_fraction)\n", + "clf.fit(X)\n", + "\n", + "# predict raw anomaly score\n", + "scores_pred = clf.decision_function(X) * -1\n", + " \n", + "# prediction of a datapoint category outlier or inlier\n", + "y_pred = clf.predict(X)\n", + "n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n", + "n_outliers = np.count_nonzero(y_pred == 1)\n", + "plt.figure(figsize = (8, 8))\n", + "\n", + "# reference to the scaled dataframe (no copy is made)\n", + "df1 = df_titanic_ss\n", + "df1['outlier'] = y_pred.tolist()\n", + " \n", + "inliers_fare = np.array(df1['fare'][df1['outlier'] == 0]).reshape(-1, 1)\n", + "inliers_age = np.array(df1['age'][df1['outlier'] == 0]).reshape(-1, 1)\n", + " \n", + "outliers_fare = df1['fare'][df1['outlier'] == 1].values.reshape(-1, 1)\n", + "outliers_age = df1['age'][df1['outlier'] == 1].values.reshape(-1, 1)\n", + " \n", + "print('OUTLIERS:', n_outliers, 'INLIERS:', n_inliers)\n", + " \n", + "# threshold used to label a point as outlier or inlier\n", + "threshold = percentile(scores_pred, 100 * outliers_fraction)\n", + " \n", + "# Compute the anomaly score over the grid\n", + "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n", + "Z = Z.reshape(xx.shape)\n", + "\n", + "# Blue region where min(anomaly score) < anomaly score < threshold\n", + "plt.contourf(xx, yy, Z, levels = np.linspace(Z.min(), threshold, 7), cmap = plt.cm.Blues_r)\n", + " \n", + "# Draw the contour line where anomaly score = threshold\n", + "a = plt.contour(xx, yy, Z, levels = [threshold], linewidths = 2, colors = 'red')\n", + " \n", + "# Orange region where threshold < anomaly score < max(anomaly score)\n", + "plt.contourf(xx, yy, Z, levels = [threshold, Z.max()],colors='orange')\n", + 
"b = plt.scatter(inliers_fare, inliers_age, c='white',s=20, edgecolor='k')\n", + " \n", + "c = plt.scatter(outliers_fare, outliers_age, c='black',s=20, edgecolor='k')\n", + " \n", + "plt.axis('tight') \n", + " \n", + "plt.legend([a.collections[0], b, c], ['learned decision function', 'inliers', 'outliers'],\n", + " prop=matplotlib.font_manager.FontProperties(size = 10), loc ='upper center', frameon = False, bbox_to_anchor = (0.5, -0.05),\n", + " fancybox = True, shadow = True, ncol = 5)\n", + " \n", + "plt.xlim((0, 1))\n", + "plt.ylim((0, 1))\n", + "plt.title('Histogram-based Outlier Detection (HBOS)')\n", + "plt.show();" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gHRoON0BnLVb" + }, + "source": [ + "# Zoom in on some outliers...\n", + "df1.loc[df1['outlier'] == 1].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YblU2tnxnXi7" + }, + "source": [ + "# Zoom in on row 689\n", + "df_titanic.loc[689]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AkWj5aQ-uzxB" + }, + "source": [ + "# Some summary statistics for comparison\n", + "df_resumo = df_titanic.groupby('sex').agg({'age': ['mean'], 'fare': ['mean']}).round(0)\n", + "df_resumo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EVy5NDrFujgD" + }, + "source": [ + "# Overall mean of 'age'\n", + "round(df_titanic['age'].mean())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Hgcp_LU6ujgJ" + }, + "source": [ + "# Overall mean of 'fare'\n", + "round(df_titanic['fare'].mean())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KyPUT9JmWeN-" + }, + "source": [ + "___\n", + "## **Isolation Forest**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Lrx85bG0YOqM" + }, + 
"source": [ + "outliers_fraction = 0.01\n", + "xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))\n", + "clf = IForest(contamination = outliers_fraction,random_state = 0)\n", + "clf.fit(X)\n", + "# predict raw anomaly score\n", + "scores_pred = clf.decision_function(X) * -1\n", + " \n", + "# prediction of a datapoint category outlier or inlier\n", + "y_pred = clf.predict(X)\n", + "n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n", + "n_outliers = np.count_nonzero(y_pred == 1)\n", + "plt.figure(figsize = (8, 8))\n", + "# reference to the scaled dataframe (no copy is made)\n", + "df1 = df_titanic_ss\n", + "df1['outlier'] = y_pred.tolist()\n", + " \n", + "# fare - inlier feature 1, age - inlier feature 2\n", + "inliers_fare = np.array(df1['fare'][df1['outlier'] == 0]).reshape(-1,1)\n", + "inliers_age = np.array(df1['age'][df1['outlier'] == 0]).reshape(-1,1)\n", + " \n", + "# fare - outlier feature 1, age - outlier feature 2\n", + "outliers_fare = df1['fare'][df1['outlier'] == 1].values.reshape(-1,1)\n", + "outliers_age = df1['age'][df1['outlier'] == 1].values.reshape(-1,1)\n", + " \n", + "print('OUTLIERS: ', n_outliers,'INLIERS: ', n_inliers)\n", + " \n", + "# threshold value to consider a datapoint inlier or outlier\n", + "threshold = percentile(scores_pred, 100 * outliers_fraction)\n", + " \n", + "# decision function calculates the raw anomaly score for every point\n", + "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n", + "Z = Z.reshape(xx.shape)\n", + "# fill blue map colormap from minimum anomaly score to threshold value\n", + "plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)\n", + " \n", + "# draw red contour line where anomaly score is equal to threshold\n", + "a = plt.contour(xx, yy, Z, levels= [threshold],linewidths=2, colors='red')\n", + " \n", + "# fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score\n", + "plt.contourf(xx, yy, Z, levels= [threshold, 
Z.max()],colors='orange')\n", + "b = plt.scatter(inliers_fare, inliers_age, c='white',s=20, edgecolor='k')\n", + " \n", + "c = plt.scatter(outliers_fare, outliers_age, c='black',s=20, edgecolor='k')\n", + " \n", + "plt.axis('tight')\n", + "plt.legend([a.collections[0], b,c], ['learned decision function', 'inliers', 'outliers'],\n", + " prop=matplotlib.font_manager.FontProperties(size = 10), loc='upper center', frameon= False, bbox_to_anchor = (0.5, -0.05),\n", + " fancybox = True, shadow = True, ncol=5)\n", + " \n", + "plt.xlim((0, 1))\n", + "plt.ylim((0, 1))\n", + "plt.title('Isolation Forest')\n", + "plt.show();" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HLVraGcCnNTA" + }, + "source": [ + "# Zoom in on some outliers...\n", + "df1.loc[df1['outlier'] == 1].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "y0WBmFOonZKY" + }, + "source": [ + "# Zoom in on row 679\n", + "df_titanic.loc[679]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "auSy5b6Du3PH" + }, + "source": [ + "# Some summary statistics for comparison\n", + "df_resumo = df_titanic.groupby('sex').agg({'age': ['mean'], 'fare': ['mean']}).round(0)\n", + "df_resumo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fIQg2D6fuoSG" + }, + "source": [ + "# Overall mean of 'age'\n", + "round(df_titanic['age'].mean())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pNUds1oDuoSO" + }, + "source": [ + "# Overall mean of 'fare'\n", + "round(df_titanic['fare'].mean())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QbpzXB2RV4sq" + }, + "source": [ + "___\n", + "## **KNN - K-Nearest Neighbors**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6gtIWWbRYxEj" + }, + 
"source": [ + "outliers_fraction = 0.01\n", + "xx , yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))\n", + "clf = KNN(contamination = outliers_fraction)\n", + "clf.fit(X)\n", + "# predict raw anomaly score\n", + "scores_pred = clf.decision_function(X) * -1\n", + " \n", + "# prediction of a datapoint category outlier or inlier\n", + "y_pred = clf.predict(X)\n", + "n_inliers = len(y_pred) - np.count_nonzero(y_pred)\n", + "n_outliers = np.count_nonzero(y_pred == 1)\n", + "plt.figure(figsize = (8, 8))\n", + "# reference to the scaled dataframe (no copy is made)\n", + "df1 = df_titanic_ss\n", + "df1['outlier'] = y_pred.tolist()\n", + " \n", + "inliers_fare = np.array(df1['fare'][df1['outlier'] == 0]).reshape(-1,1)\n", + "inliers_age = np.array(df1['age'][df1['outlier'] == 0]).reshape(-1,1)\n", + " \n", + "outliers_fare = df1['fare'][df1['outlier'] == 1].values.reshape(-1,1)\n", + "outliers_age = df1['age'][df1['outlier'] == 1].values.reshape(-1,1)\n", + " \n", + "print('OUTLIERS: ',n_outliers, 'INLIERS: ', n_inliers)\n", + " \n", + "# threshold value to consider a datapoint inlier or outlier\n", + "threshold = percentile(scores_pred, 100 * outliers_fraction)\n", + " \n", + "# decision function calculates the raw anomaly score for every point\n", + "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1\n", + "Z = Z.reshape(xx.shape)\n", + "# fill blue map colormap from minimum anomaly score to threshold value\n", + "plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)\n", + " \n", + "# draw red contour line where anomaly score is equal to threshold\n", + "a = plt.contour(xx, yy, Z, levels= [threshold],linewidths=2, colors='red')\n", + " \n", + "# fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score\n", + "plt.contourf(xx, yy, Z, levels= [threshold, Z.max()],colors='orange')\n", + "b = plt.scatter(inliers_fare, inliers_age, c='white',s=20, edgecolor='k')\n", + " \n", + "c = plt.scatter(outliers_fare, 
outliers_age, c='black',s=20, edgecolor='k')\n", + " \n", + "plt.axis('tight') \n", + " \n", + "plt.legend([a.collections[0], b,c], ['learned decision function', 'inliers', 'outliers'],\n", + " prop=matplotlib.font_manager.FontProperties(size=10), loc='upper center', frameon= False, bbox_to_anchor = (0.5, -0.05),\n", + " fancybox = True, shadow = True, ncol = 5)\n", + " \n", + "plt.xlim((0, 1))\n", + "plt.ylim((0, 1))\n", + "plt.title('K-Nearest Neighbors (KNN)')\n", + "plt.show();" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6B-L7MwXg25Z" + }, + "source": [ + "df1.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gvXGH0BHBBNN" + }, + "source": [ + "# Zoom in on some outliers...\n", + "df1.loc[df1['outlier'] == 1].head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MYbNaaO7D3NY" + }, + "source": [ + "# Zoom in on row 679\n", + "df_titanic.loc[679]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-juEvWvru5jp" + }, + "source": [ + "# Some summary statistics for comparison\n", + "df_resumo = df_titanic.groupby('sex').agg({'age': ['mean'], 'fare': ['mean']}).round(0)\n", + "df_resumo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "B6NXG6oDusSg" + }, + "source": [ + "# Overall mean of 'age'\n", + "round(df_titanic['age'].mean())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cgHJb3iBusSl" + }, + "source": [ + "# Overall mean of 'fare'\n", + "round(df_titanic['fare'].mean())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1w7MIkoAG2Qr" + }, + "source": [ + "___\n", + "# **Exercises**\n", + "For each of the dataframes below, run an outlier analysis using one of the techniques presented and explain your results." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ep_Z3iQIG56r" + }, + "source": [ + "## Exercise 1 - Predict Breast Cancer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v-Lvzrt7HN2l" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "cancer = load_breast_cancer()\n", + "X = cancer['data']\n", + "y = cancer['target']\n", + "\n", + "df_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n", + "df_cancer['target'] = df_cancer['target'].map({0: 'malignant', 1: 'benign'})\n", + "df_cancer.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cEHLrU0gHRtu" + }, + "source": [ + "## Exercise 2 - Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8G9GZnubHYjy" + }, + "source": [ + "from sklearn.datasets import load_boston  # load_boston was removed in scikit-learn 1.2; this cell needs an older version\n", + "\n", + "boston = load_boston()\n", + "X = boston['data']\n", + "y = boston['target']\n", + "\n", + "df_boston = pd.DataFrame(np.c_[X, y], columns = np.append(boston['feature_names'], ['target']))\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QlAdIYfmHaE8" + }, + "source": [ + "## Exercise 3 - Iris\n", + "* [Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you will find more information about the iris dataframe." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rke4C3wFHfYU" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "iris = load_iris()\n", + "X= iris['data']\n", + "y= iris['target']\n", + "\n", + "df_iris = pd.DataFrame(np.c_[X, y], columns = np.append(iris['feature_names'], ['target']))\n", + "df_iris['target'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6qn3gC4NHj-p" + }, + "source": [ + "## Exercise 4 - Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P-esq5TSHnf6" + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "diabetes = load_diabetes()\n", + "X = diabetes['data']\n", + "y = diabetes['target']\n", + "\n", + "df_diabetes = pd.DataFrame(np.c_[X, y], columns = np.append(diabetes['feature_names'], ['target']))\n", + "df_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_04__3DP_5_Feature_Selection_hs.ipynb b/Notebooks/NB10_04__3DP_5_Feature_Selection_hs.ipynb new file mode 100644 index 000000000..e190813dd --- /dev/null +++ b/Notebooks/NB10_04__3DP_5_Feature_Selection_hs.ipynb @@ -0,0 +1,6081 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + }, + "colab": { + "name": "NB10_04__3DP_5_Feature_Selection.ipynb", + "provenance": [], + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + 
"source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cka1jqOwy6KT" + }, + "source": [ + "

3DP_5 - FEATURE SELECTION

\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3aYp_plmy17y" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> See the **Table of contents**.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rSFnHHQUKDX5" + }, + "source": [ + "# **Session improvements**\n", + "* Develop t-SNE.\n", + "* https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "arSNhd_2KHL6" + }, + "source": [ + "___\n", + "# **References**\n", + "* [Feature Selection in Python — Recursive Feature Elimination](https://towardsdatascience.com/feature-selection-in-python-recursive-feature-elimination-19f1c39b8d15)\n", + "* [Feature Selection with sklearn and Pandas](https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cBLSvJTXHBjK" + }, + "source": [ + "___\n", + "# **CHEATSHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZdjR3nahUuKq" + }, + "source": [ + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ngLc7b9XiKxN" + }, + "source": [ + "___\n", + "# **3DP_FEATURE SELECTION**\n", + "## Introduction to Feature Selection\n", + "> Our goal with Feature Selection is to:\n", + "* Drop irrelevant columns;\n", + "* Drop columns with low correlation with the target variable;\n", + "* Drop columns with low variance;\n", + "* Drop columns with many NaNs.\n", + "\n", + "* Suggestions:\n", + " * Normalize numeric columns;\n", + " * Encode categorical columns with LabelEncoding or One Hot Encoding.\n", + "\n", + "![FeatureSelection](https://github.com/MathMachado/Materials/blob/master/FeatureSelection.png?raw=true)\n", + "\n", + 
"[Source](https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T9JCQatsiKxR" + }, + "source": [ + "from sklearn import feature_selection\n", + "import pandas as pd\n", + "import numpy as np\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9U6Az5qpiKxU" + }, + "source": [ + "___\n", + "# **VarianceThreshold**\n", + "* Drops variables/features whose variance is below a given threshold;\n", + "* This is an unsupervised method, i.e., the label (response or target variable) plays no role;\n", + "* **Intuition**: \n", + " * Low-variance features/variables carry little information;\n", + "* **How it works**:\n", + " * Computes the variance of each feature/variable and then drops the columns whose variance is below the threshold\n", + "* **Caveats**:\n", + " * Make sure the features/variables are on the same scale, e.g. with MinMaxScaler(). Note that StandardScaler() rescales every feature to unit variance, which leaves VarianceThreshold with nothing to discriminate on." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "euWJlVAViKxV", + "outputId": "0a16d5aa-7e5b-4db1-b5cc-e1a13f730cff", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + } + }, + "source": [ + "df = pd.DataFrame(\n", + " {'sexo': ['m', 'm', 'f', 'm', 'm', 'm', 'm', 'm'], \n", + " 'b': [1, 2, 3, 1, 2, 1, 1, 1], \n", + " 'c': [1, 2, 3, 1, 2, 1, 1, 1]})\n", + "\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sexo b c\n", + "0 m 1 1\n", + "1 m 2 2\n", + "2 f 3 3\n", + "3 m 1 1\n", + "4 m 2 2\n", + "5 m 1 1\n", + "6 m 1 1\n", + "7 m 1 1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 212 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1rvj2MZ6Jtgq" + }, + "source": [ + "Next, we apply [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to the 'sexo' column:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I6L5L_wtTSUe" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder\n", + "le = LabelEncoder()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS2v_GnbiKxi", + "outputId": "5d026279-9b0f-4e3d-c9ac-3a9bff1be40e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + } + }, + "source": [ + "# Apply the LabelEncoder to the 'sexo' column:\n", + "df['sexo'] = le.fit_transform(df['sexo'])\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sexo b c\n", + "0 1 1 1\n", + "1 1 2 2\n", + "2 0 3 3\n", + "3 1 1 1\n", + "4 1 2 2\n", + "5 1 1 1\n", + "6 1 1 1\n", + "7 1 1 1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 214 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1w-_VsJuWVHN", + "outputId": "3ad5cb6b-408e-4fd1-ba2e-2a98d7f55dda", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Compute the variance of each feature/variable (pandas .var() returns the sample variance, ddof=1):\n", + "l_variaveis= ['sexo', 'b', 'c']\n", + "print(f'Variance of the variables in dataframe df:')\n", + "for s_Var in l_variaveis:\n", + " print(f'{s_Var}: {round(df[s_Var].var(),2)}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Variance of the variables in dataframe df:\n", + "sexo: 0.12\n", + "b: 0.57\n", + "c: 0.57\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3IITDmqUiKxo", + "outputId": "16e79dc1-1519-4750-f397-327b5455781d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Keep only the features whose variance is greater than 0.20:\n", + "vt = feature_selection.VarianceThreshold(threshold= .20)\n", + "vt.fit_transform(df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[1, 1],\n", + " [2, 2],\n", + " [3, 3],\n", + " [1, 1],\n", + " [2, 2],\n", + " [1, 1],\n", + " [1, 1],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 216 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tAOL215MiKxu", + "outputId": "cf1d0ae2-8fd9-4971-b5a3-8b246bb39ed0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Variances computed by VarianceThreshold() (population variance, ddof=0)\n", + "vt.variances_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.109375, 0.5 , 0.5 ])" 
+ ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 217 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yntaZtd98boc" + }, + "source": [ + "### What happened here? What can we conclude?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FXyfpmWtiKyB" + }, + "source": [ + "___\n", + "# **ANOVA (Analysis Of Variance) with f_classif**\n", + "* Applies when the columns being tested are numeric and the target variable is discrete (categorical);\n", + "* ANOVA is a test that measures differences between groups/experiments. Here, **the purpose of ANOVA is to test, for each numeric column, whether its mean differs across the classes of the target**. Columns whose means do not differ between classes carry little information about the target, so dropping them reduces the number of columns and helps limit overfitting." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yB7TC9VKiKyC" + }, + "source": [ + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "df_cancer = load_breast_cancer()\n", + "X_cancer = df_cancer.data\n", + "y_cancer = df_cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "N2_mE0Z5xxvL", + "outputId": "5b255dcb-ec6e-4ee4-e0f8-d0d89a14fc64", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_cancer" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,\n", + " 1.189e-01],\n", + " [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,\n", + " 8.902e-02],\n", + " [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,\n", + " 8.758e-02],\n", + " ...,\n", + " [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,\n", + " 7.820e-02],\n", + " [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,\n", + " 1.240e-01],\n", + " [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 
2.871e-01,\n", + " 7.039e-02]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 219 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t0QYORbSiKyL" + }, + "source": [ + "chi2, p_value = feature_selection.f_classif(X_cancer, y_cancer)  # despite its name, chi2 holds ANOVA F-statistics, not chi-squared values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7BC6y7etiKyP", + "outputId": "b6cf6b7d-77d9-49fa-96f1-10983adf21b8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.round(chi2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([647., 118., 697., 573., 84., 313., 534., 862., 70., 0., 269.,\n", + " 0., 254., 244., 3., 53., 39., 113., 0., 3., 861., 150.,\n", + " 898., 662., 122., 304., 437., 964., 119., 66.])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 221 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYSOfeH4iKyW" + }, + "source": [ + "* **Comment**: Above, each value is the ANOVA F-statistic of one feature/column, a measure of its importance ==> **the higher, the better!**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "k1fHrPM4upex", + "outputId": "690c1ed2-9c4f-4518-a356-6486b3ffcf63", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.round(p_value, 2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.76, 0. ,\n", + " 0.84, 0. , 0. , 0.11, 0. , 0. , 0. , 0.88, 0.06, 0. , 0. ,\n", + " 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. 
])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 222 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1JYciSVkMW8f" + }, + "source": [ + "* **Comentário**: Acima, os p_value's associados à cada valor de chi2 ==> **Quanto menor, melhor!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1YVBLc7eu6_H" + }, + "source": [ + "## **Conclusão**: **Foco no p_value**. Se p_value < 0.05 ==> variável significativa/relevante para o modelo." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r29_gmCgiKyY" + }, + "source": [ + "___\n", + "# **Univariate Regression Test using f_regression**\n", + "* Modelo Linear para testar o efeito individual de cada uma das variáveis regressoras;\n", + "* **Como funciona**:\n", + " * Usa a correlação entre cada variável e variável-target;\n", + " * F-test calcula a dependência linear;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IDzu3kCiKyZ", + "outputId": "b4941bbd-ebe8-4718-81b0-48666ada1d44", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from sklearn.datasets import california_housing\n", + "house_data = california_housing.fetch_california_housing()\n", + "X_house, y_house = house_data.data, house_data.target\n", + "\n", + "X_house" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 8.3252 , 41. , 6.98412698, ..., 2.55555556,\n", + " 37.88 , -122.23 ],\n", + " [ 8.3014 , 21. , 6.23813708, ..., 2.10984183,\n", + " 37.86 , -122.22 ],\n", + " [ 7.2574 , 52. , 8.28813559, ..., 2.80225989,\n", + " 37.85 , -122.24 ],\n", + " ...,\n", + " [ 1.7 , 17. , 5.20554273, ..., 2.3256351 ,\n", + " 39.43 , -121.22 ],\n", + " [ 1.8672 , 18. , 5.32951289, ..., 2.12320917,\n", + " 39.43 , -121.32 ],\n", + " [ 2.3886 , 16. 
, 5.25471698, ..., 2.61698113,\n", + " 39.37 , -121.24 ]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 223 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1DS9T6WXw8qN", + "outputId": "038178a6-3f44-49da-f49c-695e8ff999c7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_house # Continuous variable" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 224 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uKYhjpEViKyl" + }, + "source": [ + "F, p_value = feature_selection.f_regression(X_house, y_house)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NEqZ3I4jiKyp", + "outputId": "44e87de3-1abc-4664-817b-93801dc8eb36", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.round(F, 2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1.855657e+04, 2.328400e+02, 4.877600e+02, 4.511000e+01,\n", + " 1.255000e+01, 1.164000e+01, 4.380100e+02, 4.370000e+01])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 226 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Rsaf7y8MiKyt" + }, + "source": [ + "### **Comments**: Columns with high F-values have a stronger linear relationship with the target and therefore more predictive power. So, **the higher, the better**."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Fh80Xf8KG-Vj", + "outputId": "5e7aaa4e-563d-4643-b5f7-a7c6a16e5c21", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.round(p_value, 2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0., 0., 0., 0., 0., 0., 0., 0.])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 227 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DD-JKUQ1xjR8" + }, + "source": [ + "### **Conclusion**: **Focus on the p_value**. If p_value < 0.05 ==> the variable is statistically significant (relevant to the model)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xvIXLHK9iKz8" + }, + "source": [ + "___\n", + "# **SelectFromModel**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "p0mtUVnjiKz8" + }, + "source": [ + "# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version\n", + "from sklearn.datasets import load_boston\n", + "boston = load_boston()\n", + "X_boston, y_boston = boston.data, boston.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WY1c2U10iK0A" + }, + "source": [ + "from sklearn.linear_model import LinearRegression" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3SJDM-Bxc_UF", + "outputId": "0096e7b9-84c4-4bfd-f6ea-06edb31747bc", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Observe abaixo que a variável-target é float.
Portanto, é um problema de regressão\n", + "y_boston" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,\n", + " 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,\n", + " 15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,\n", + " 13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,\n", + " 21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,\n", + " 35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,\n", + " 19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,\n", + " 20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,\n", + " 23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,\n", + " 33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,\n", + " 21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,\n", + " 20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,\n", + " 23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,\n", + " 15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,\n", + " 17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,\n", + " 25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,\n", + " 23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,\n", + " 32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,\n", + " 34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,\n", + " 20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,\n", + " 26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,\n", + " 31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,\n", + " 22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,\n", + " 42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,\n", + " 36.5, 22.8, 30.7, 50. 
, 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,\n", + " 32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,\n", + " 20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,\n", + " 20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,\n", + " 22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,\n", + " 21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,\n", + " 19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,\n", + " 32.7, 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9, 24.1,\n", + " 18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6, 25. , 19.9, 20.8,\n", + " 16.8, 21.9, 27.5, 21.9, 23.1, 50. , 50. , 50. , 50. , 50. , 13.8,\n", + " 13.8, 15. , 13.9, 13.3, 13.1, 10.2, 10.4, 10.9, 11.3, 12.3, 8.8,\n", + " 7.2, 10.5, 7.4, 10.2, 11.5, 15.1, 23.2, 9.7, 13.8, 12.7, 13.1,\n", + " 12.5, 8.5, 5. , 6.3, 5.6, 7.2, 12.1, 8.3, 8.5, 5. , 11.9,\n", + " 27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3, 7. , 7.2, 7.5, 10.4,\n", + " 8.8, 8.4, 16.7, 14.2, 20.8, 13.4, 11.7, 8.3, 10.2, 10.9, 11. ,\n", + " 9.5, 14.5, 14.1, 16.1, 14.3, 11.7, 13.4, 9.6, 8.7, 8.4, 12.8,\n", + " 10.5, 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. , 13.4,\n", + " 15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9, 20. , 16.4, 17.7,\n", + " 19.5, 20.2, 21.4, 19.9, 19. , 19.1, 19.1, 20.1, 19.9, 19.6, 23.2,\n", + " 29.8, 13.8, 13.3, 16.7, 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8,\n", + " 20.6, 21.2, 19.1, 20.6, 15.2, 7. , 8.1, 13.6, 20.1, 21.8, 24.5,\n", + " 23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. 
, 11.9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 231 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2zNpau9HiK0C" + }, + "source": [ + "ml = LinearRegression()\n", + "# Keep only the features whose absolute coefficient is at least 0.25\n", + "sfm = feature_selection.SelectFromModel(ml, threshold = 0.25)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "62vOFTVViK0D", + "outputId": "a4a64f62-a280-4d40-8188-82b49ceda42d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Shape of the array that keeps only the selected (most relevant) columns\n", + "sfm.fit_transform(X_boston, y_boston).shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(506, 7)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 233 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y8_j7HRZiK0J", + "outputId": "9a732d60-aa2f-43d2-8c94-c5ff85b3a694", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Shape of the original data\n", + "X_boston.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(506, 13)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 234 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yX5oufYkcrH7" + }, + "source": [ + "### **Conclusion**: The number of columns was reduced from 13 to 7."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FaGrKVl0Re2A" + }, + "source": [ + "Below, the boolean mask indicating which columns were selected:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hYVIANETRE2p", + "outputId": "e31de91e-1f8e-4a20-c2f3-891a43632b44", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "l_variaveis_selecionadas = sfm.get_support()\n", + "l_variaveis_selecionadas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([False, False, False, True, True, True, False, True, True,\n", + " False, True, False, True])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 235 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ol88OWzaVpJi" + }, + "source": [ + "___\n", + "# **Correlation Analysis**\n", + "* It is usually a good idea to drop highly correlated columns from the dataframe, because highly correlated columns carry largely redundant information.\n", + "\n", + "Source: [Better Heatmaps and Correlation Matrix Plots in Python](https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KWBe1v_6V5HB" + }, + "source": [ + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "from sklearn.datasets import load_breast_cancer\n", + "df_cancer = load_breast_cancer()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vDz4Byzk8yA9" + }, + "source": [ + "X_cancer = pd.DataFrame(df_cancer.data)\n", + "y_cancer = df_cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XIPITCn4cgjs" + }, + "source": [ + "Using Pearson correlation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0vHleiMlxuCG", + "outputId":
"ee6e2dd6-a45f-4527-a548-52b3c708bc64", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 955 + } + }, + "source": [ + "# Compute the absolute correlation between the columns/variables of the dataframe\n", + "correlacao = X_cancer.corr().abs()\n", + "\n", + "# Keep only the upper triangle of the correlation matrix (np.bool was removed from NumPy, so use bool)\n", + "correlacao = correlacao.where(np.triu(np.ones(correlacao.shape), k = 1).astype(bool))\n", + "correlacao" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01234567891011121314151617181920212223242526272829
0NaN0.3237820.9978550.9873570.1705810.5061240.6767640.8225290.1477410.3116310.6790900.0973170.6741720.7358640.2226000.2060000.1942040.3761690.1043210.0426410.9695390.2970080.9651370.9410820.1196160.4134630.5269110.7442140.1639530.007066
1NaNNaN0.3295330.3210860.0233890.2367020.3024180.2934640.0714010.0764370.2758690.3863580.2816730.2598450.0066140.1919750.1432930.1638510.0091270.0544580.3525730.9120450.3580400.3435460.0775030.2778300.3010250.2953160.1050080.119205
2NaNNaNNaN0.9865070.2072780.5569360.7161360.8509770.1830270.2614770.6917650.0867610.6931350.7449830.2026940.2507440.2280820.4072170.0816290.0055230.9694760.3030380.9703870.9415500.1505490.4557740.5638790.7712410.1891150.051019
3NaNNaNNaNNaN0.1770280.4985020.6859830.8232690.1512930.2831100.7325620.0662800.7266280.8000860.1667770.2125830.2076600.3723200.0724970.0198870.9627460.2874890.9591200.9592130.1235230.3904100.5126060.7220170.1435700.003738
4NaNNaNNaNNaNNaN0.6591230.5219840.5536950.5577750.5847920.3014670.0684060.2960920.2465520.3323750.3189430.2483960.3806760.2007740.2836070.2131200.0360720.2388530.2067180.8053240.4724680.4349260.5030530.3943090.499316
5NaNNaNNaNNaNNaNNaN0.8831210.8311350.6026410.5653690.4974730.0462050.5489050.4556530.1352990.7387220.5705170.6422620.2299770.5073180.5353150.2481330.5902100.5096040.5655410.8658090.8162750.8155730.5102230.687382
6NaNNaNNaNNaNNaNNaNNaN0.9213910.5006670.3367830.6319250.0762180.6603910.6174270.0985640.6702790.6912700.6832600.1780090.4493010.6882360.2998790.7295650.6759870.4488220.7549680.8841030.8613230.4094640.514930
7NaNNaNNaNNaNNaNNaNNaNNaN0.4624970.1669170.6980500.0214800.7106500.6902990.0276530.4904240.4391670.6156340.0953510.2575840.8303180.2927520.8559230.8096300.4527530.6674540.7523990.9101550.3757440.368661
8NaNNaNNaNNaNNaNNaNNaNNaNNaN0.4799210.3033790.1280530.3138930.2239700.1873210.4216590.3426270.3932980.4491370.3317860.1857280.0906510.2191690.1771930.4266750.4732000.4337210.4302970.6998260.438413
9NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0001110.1641740.0398300.0901700.4019640.5598370.4466300.3411980.3450070.6881320.2536910.0512690.2051510.2318540.5049420.4587980.3462340.1753250.3340190.767297
10NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.2132470.9727940.9518300.1645140.3560650.3323580.5133460.2405670.2277540.7150650.1947990.7196840.7515480.1419190.2871030.3805850.5310620.0945430.049559
11NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.2231710.1115670.3972430.2317000.1949980.2302830.4116210.2797230.1116900.4090030.1022420.0831950.0736580.0924390.0689560.1196380.1282150.045655
12NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.9376550.1510750.4163220.3624820.5562640.2664870.2441430.6972010.2003710.7210310.7307130.1300540.3419190.4188990.5548970.1099300.085433
13NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0751500.2848400.2708950.4157300.1341090.1270710.7573730.1964970.7612130.8114080.1253890.2832570.3851000.5381660.0741260.017539
14NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3366960.2686850.3284290.4135060.4273740.2306910.0747430.2173040.1821950.3144570.0555580.0582980.1020070.1073420.101480
15NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.8012680.7440830.3947130.8032690.2046070.1430030.2605160.1993710.2273940.6787800.6391470.4832080.2778780.590973
16NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.7718040.3094290.7273720.1869040.1002410.2266800.1883530.1684810.4848580.6625640.4404720.1977880.439329
17NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3127800.6110440.3581270.0867410.3949990.3422710.2153510.4528880.5495920.6024500.1431160.310655
18NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3690780.1281210.0774730.1037530.1103430.0126620.0602550.0371190.0304130.3894020.078079
19NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0374880.0031950.0010000.0227360.1705680.3901590.3799750.2152040.1110940.591328
20NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3599210.9937080.9840150.2165740.4758200.5739750.7874240.2435290.093492
21NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3650980.3458420.2254290.3608320.3683660.3597550.2330270.219122
22NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.9775780.2367750.5294080.6183440.8163220.2694930.138957
23NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.2091450.4382960.5433310.7474190.2091460.079647
24NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.5681870.5185230.5476910.4938380.617624
25NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.8922610.8010800.6144410.810455
26NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.8554340.5325200.686511
27NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.5025280.511114
28NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.537848
29NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3 ... 26 27 28 29\n", + "0 NaN 0.323782 0.997855 0.987357 ... 0.526911 0.744214 0.163953 0.007066\n", + "1 NaN NaN 0.329533 0.321086 ... 0.301025 0.295316 0.105008 0.119205\n", + "2 NaN NaN NaN 0.986507 ... 0.563879 0.771241 0.189115 0.051019\n", + "3 NaN NaN NaN NaN ... 0.512606 0.722017 0.143570 0.003738\n", + "4 NaN NaN NaN NaN ... 0.434926 0.503053 0.394309 0.499316\n", + "5 NaN NaN NaN NaN ... 0.816275 0.815573 0.510223 0.687382\n", + "6 NaN NaN NaN NaN ... 0.884103 0.861323 0.409464 0.514930\n", + "7 NaN NaN NaN NaN ... 0.752399 0.910155 0.375744 0.368661\n", + "8 NaN NaN NaN NaN ... 0.433721 0.430297 0.699826 0.438413\n", + "9 NaN NaN NaN NaN ... 0.346234 0.175325 0.334019 0.767297\n", + "10 NaN NaN NaN NaN ... 0.380585 0.531062 0.094543 0.049559\n", + "11 NaN NaN NaN NaN ... 0.068956 0.119638 0.128215 0.045655\n", + "12 NaN NaN NaN NaN ... 0.418899 0.554897 0.109930 0.085433\n", + "13 NaN NaN NaN NaN ... 0.385100 0.538166 0.074126 0.017539\n", + "14 NaN NaN NaN NaN ... 0.058298 0.102007 0.107342 0.101480\n", + "15 NaN NaN NaN NaN ... 0.639147 0.483208 0.277878 0.590973\n", + "16 NaN NaN NaN NaN ... 0.662564 0.440472 0.197788 0.439329\n", + "17 NaN NaN NaN NaN ... 0.549592 0.602450 0.143116 0.310655\n", + "18 NaN NaN NaN NaN ... 0.037119 0.030413 0.389402 0.078079\n", + "19 NaN NaN NaN NaN ... 0.379975 0.215204 0.111094 0.591328\n", + "20 NaN NaN NaN NaN ... 0.573975 0.787424 0.243529 0.093492\n", + "21 NaN NaN NaN NaN ... 0.368366 0.359755 0.233027 0.219122\n", + "22 NaN NaN NaN NaN ... 0.618344 0.816322 0.269493 0.138957\n", + "23 NaN NaN NaN NaN ... 0.543331 0.747419 0.209146 0.079647\n", + "24 NaN NaN NaN NaN ... 0.518523 0.547691 0.493838 0.617624\n", + "25 NaN NaN NaN NaN ... 0.892261 0.801080 0.614441 0.810455\n", + "26 NaN NaN NaN NaN ... NaN 0.855434 0.532520 0.686511\n", + "27 NaN NaN NaN NaN ... NaN NaN 0.502528 0.511114\n", + "28 NaN NaN NaN NaN ... NaN NaN NaN 0.537848\n", + "29 NaN NaN NaN NaN ... 
NaN NaN NaN NaN\n", + "\n", + "[30 rows x 30 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 238 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gDxrqpPXxuCM", + "outputId": "ebd4ba38-df7c-4775-c12d-b20102a2e73c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 989 + } + }, + "source": [ + "fig, ax = plt.subplots(figsize = (20, 17))\n", + "# Mask the upper triangle (diagonal included) so each pair of columns is shown only once\n", + "mask = np.zeros_like(X_cancer.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(X_cancer.corr().abs(), mask = mask, ax = ax, cmap ='coolwarm', annot = True, fmt = '.2f')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 239 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2p0kgfS-ao5o" + }, + "source": [ + "As we can see, the dataframe has several highly correlated columns. Let's drop them (automatically!) as follows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C7mUtTlFaoFx", + "outputId": "cb4734a6-f56a-48f0-9b23-cc478f88e9e8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 955 + } + }, + "source": [ + "set_variaveis_corr = set()\n", + "matrix_corr = X_cancer.corr()\n", + "matrix_corr" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01234567891011121314151617181920212223242526272829
01.0000000.3237820.9978550.9873570.1705810.5061240.6767640.8225290.147741-0.3116310.679090-0.0973170.6741720.735864-0.2226000.2060000.1942040.376169-0.104321-0.0426410.9695390.2970080.9651370.9410820.1196160.4134630.5269110.7442140.1639530.007066
10.3237821.0000000.3295330.321086-0.0233890.2367020.3024180.2934640.071401-0.0764370.2758690.3863580.2816730.2598450.0066140.1919750.1432930.1638510.0091270.0544580.3525730.9120450.3580400.3435460.0775030.2778300.3010250.2953160.1050080.119205
20.9978550.3295331.0000000.9865070.2072780.5569360.7161360.8509770.183027-0.2614770.691765-0.0867610.6931350.744983-0.2026940.2507440.2280820.407217-0.081629-0.0055230.9694760.3030380.9703870.9415500.1505490.4557740.5638790.7712410.1891150.051019
30.9873570.3210860.9865071.0000000.1770280.4985020.6859830.8232690.151293-0.2831100.732562-0.0662800.7266280.800086-0.1667770.2125830.2076600.372320-0.072497-0.0198870.9627460.2874890.9591200.9592130.1235230.3904100.5126060.7220170.1435700.003738
40.170581-0.0233890.2072780.1770281.0000000.6591230.5219840.5536950.5577750.5847920.3014670.0684060.2960920.2465520.3323750.3189430.2483960.3806760.2007740.2836070.2131200.0360720.2388530.2067180.8053240.4724680.4349260.5030530.3943090.499316
50.5061240.2367020.5569360.4985020.6591231.0000000.8831210.8311350.6026410.5653690.4974730.0462050.5489050.4556530.1352990.7387220.5705170.6422620.2299770.5073180.5353150.2481330.5902100.5096040.5655410.8658090.8162750.8155730.5102230.687382
60.6767640.3024180.7161360.6859830.5219840.8831211.0000000.9213910.5006670.3367830.6319250.0762180.6603910.6174270.0985640.6702790.6912700.6832600.1780090.4493010.6882360.2998790.7295650.6759870.4488220.7549680.8841030.8613230.4094640.514930
70.8225290.2934640.8509770.8232690.5536950.8311350.9213911.0000000.4624970.1669170.6980500.0214800.7106500.6902990.0276530.4904240.4391670.6156340.0953510.2575840.8303180.2927520.8559230.8096300.4527530.6674540.7523990.9101550.3757440.368661
80.1477410.0714010.1830270.1512930.5577750.6026410.5006670.4624971.0000000.4799210.3033790.1280530.3138930.2239700.1873210.4216590.3426270.3932980.4491370.3317860.1857280.0906510.2191690.1771930.4266750.4732000.4337210.4302970.6998260.438413
9-0.311631-0.076437-0.261477-0.2831100.5847920.5653690.3367830.1669170.4799211.0000000.0001110.1641740.039830-0.0901700.4019640.5598370.4466300.3411980.3450070.688132-0.253691-0.051269-0.205151-0.2318540.5049420.4587980.3462340.1753250.3340190.767297
100.6790900.2758690.6917650.7325620.3014670.4974730.6319250.6980500.3033790.0001111.0000000.2132470.9727940.9518300.1645140.3560650.3323580.5133460.2405670.2277540.7150650.1947990.7196840.7515480.1419190.2871030.3805850.5310620.0945430.049559
11-0.0973170.386358-0.086761-0.0662800.0684060.0462050.0762180.0214800.1280530.1641740.2132471.0000000.2231710.1115670.3972430.2317000.1949980.2302830.4116210.279723-0.1116900.409003-0.102242-0.083195-0.073658-0.092439-0.068956-0.119638-0.128215-0.045655
120.6741720.2816730.6931350.7266280.2960920.5489050.6603910.7106500.3138930.0398300.9727940.2231711.0000000.9376550.1510750.4163220.3624820.5562640.2664870.2441430.6972010.2003710.7210310.7307130.1300540.3419190.4188990.5548970.1099300.085433
130.7358640.2598450.7449830.8000860.2465520.4556530.6174270.6902990.223970-0.0901700.9518300.1115670.9376551.0000000.0751500.2848400.2708950.4157300.1341090.1270710.7573730.1964970.7612130.8114080.1253890.2832570.3851000.5381660.0741260.017539
14-0.2226000.006614-0.202694-0.1667770.3323750.1352990.0985640.0276530.1873210.4019640.1645140.3972430.1510750.0751501.0000000.3366960.2686850.3284290.4135060.427374-0.230691-0.074743-0.217304-0.1821950.314457-0.055558-0.058298-0.102007-0.1073420.101480
150.2060000.1919750.2507440.2125830.3189430.7387220.6702790.4904240.4216590.5598370.3560650.2317000.4163220.2848400.3366961.0000000.8012680.7440830.3947130.8032690.2046070.1430030.2605160.1993710.2273940.6787800.6391470.4832080.2778780.590973
160.1942040.1432930.2280820.2076600.2483960.5705170.6912700.4391670.3426270.4466300.3323580.1949980.3624820.2708950.2686850.8012681.0000000.7718040.3094290.7273720.1869040.1002410.2266800.1883530.1684810.4848580.6625640.4404720.1977880.439329
170.3761690.1638510.4072170.3723200.3806760.6422620.6832600.6156340.3932980.3411980.5133460.2302830.5562640.4157300.3284290.7440830.7718041.0000000.3127800.6110440.3581270.0867410.3949990.3422710.2153510.4528880.5495920.6024500.1431160.310655
18-0.1043210.009127-0.081629-0.0724970.2007740.2299770.1780090.0953510.4491370.3450070.2405670.4116210.2664870.1341090.4135060.3947130.3094290.3127801.0000000.369078-0.128121-0.077473-0.103753-0.110343-0.0126620.0602550.037119-0.0304130.3894020.078079
19-0.0426410.054458-0.005523-0.0198870.2836070.5073180.4493010.2575840.3317860.6881320.2277540.2797230.2441430.1270710.4273740.8032690.7273720.6110440.3690781.000000-0.037488-0.003195-0.001000-0.0227360.1705680.3901590.3799750.2152040.1110940.591328
200.9695390.3525730.9694760.9627460.2131200.5353150.6882360.8303180.185728-0.2536910.715065-0.1116900.6972010.757373-0.2306910.2046070.1869040.358127-0.128121-0.0374881.0000000.3599210.9937080.9840150.2165740.4758200.5739750.7874240.2435290.093492
210.2970080.9120450.3030380.2874890.0360720.2481330.2998790.2927520.090651-0.0512690.1947990.4090030.2003710.196497-0.0747430.1430030.1002410.086741-0.077473-0.0031950.3599211.0000000.3650980.3458420.2254290.3608320.3683660.3597550.2330270.219122
220.9651370.3580400.9703870.9591200.2388530.5902100.7295650.8559230.219169-0.2051510.719684-0.1022420.7210310.761213-0.2173040.2605160.2266800.394999-0.103753-0.0010000.9937080.3650981.0000000.9775780.2367750.5294080.6183440.8163220.2694930.138957
230.9410820.3435460.9415500.9592130.2067180.5096040.6759870.8096300.177193-0.2318540.751548-0.0831950.7307130.811408-0.1821950.1993710.1883530.342271-0.110343-0.0227360.9840150.3458420.9775781.0000000.2091450.4382960.5433310.7474190.2091460.079647
240.1196160.0775030.1505490.1235230.8053240.5655410.4488220.4527530.4266750.5049420.141919-0.0736580.1300540.1253890.3144570.2273940.1684810.215351-0.0126620.1705680.2165740.2254290.2367750.2091451.0000000.5681870.5185230.5476910.4938380.617624
250.4134630.2778300.4557740.3904100.4724680.8658090.7549680.6674540.4732000.4587980.287103-0.0924390.3419190.283257-0.0555580.6787800.4848580.4528880.0602550.3901590.4758200.3608320.5294080.4382960.5681871.0000000.8922610.8010800.6144410.810455
260.5269110.3010250.5638790.5126060.4349260.8162750.8841030.7523990.4337210.3462340.380585-0.0689560.4188990.385100-0.0582980.6391470.6625640.5495920.0371190.3799750.5739750.3683660.6183440.5433310.5185230.8922611.0000000.8554340.5325200.686511
270.7442140.2953160.7712410.7220170.5030530.8155730.8613230.9101550.4302970.1753250.531062-0.1196380.5548970.538166-0.1020070.4832080.4404720.602450-0.0304130.2152040.7874240.3597550.8163220.7474190.5476910.8010800.8554341.0000000.5025280.511114
280.1639530.1050080.1891150.1435700.3943090.5102230.4094640.3757440.6998260.3340190.094543-0.1282150.1099300.074126-0.1073420.2778780.1977880.1431160.3894020.1110940.2435290.2330270.2694930.2091460.4938380.6144410.5325200.5025281.0000000.537848
290.0070660.1192050.0510190.0037380.4993160.6873820.5149300.3686610.4384130.7672970.049559-0.0456550.0854330.0175390.1014800.5909730.4393290.3106550.0780790.5913280.0934920.2191220.1389570.0796470.6176240.8104550.6865110.5111140.5378481.000000
\n", + "
" + ], + "text/plain": [ + " 0 1 2 ... 27 28 29\n", + "0 1.000000 0.323782 0.997855 ... 0.744214 0.163953 0.007066\n", + "1 0.323782 1.000000 0.329533 ... 0.295316 0.105008 0.119205\n", + "2 0.997855 0.329533 1.000000 ... 0.771241 0.189115 0.051019\n", + "3 0.987357 0.321086 0.986507 ... 0.722017 0.143570 0.003738\n", + "4 0.170581 -0.023389 0.207278 ... 0.503053 0.394309 0.499316\n", + "5 0.506124 0.236702 0.556936 ... 0.815573 0.510223 0.687382\n", + "6 0.676764 0.302418 0.716136 ... 0.861323 0.409464 0.514930\n", + "7 0.822529 0.293464 0.850977 ... 0.910155 0.375744 0.368661\n", + "8 0.147741 0.071401 0.183027 ... 0.430297 0.699826 0.438413\n", + "9 -0.311631 -0.076437 -0.261477 ... 0.175325 0.334019 0.767297\n", + "10 0.679090 0.275869 0.691765 ... 0.531062 0.094543 0.049559\n", + "11 -0.097317 0.386358 -0.086761 ... -0.119638 -0.128215 -0.045655\n", + "12 0.674172 0.281673 0.693135 ... 0.554897 0.109930 0.085433\n", + "13 0.735864 0.259845 0.744983 ... 0.538166 0.074126 0.017539\n", + "14 -0.222600 0.006614 -0.202694 ... -0.102007 -0.107342 0.101480\n", + "15 0.206000 0.191975 0.250744 ... 0.483208 0.277878 0.590973\n", + "16 0.194204 0.143293 0.228082 ... 0.440472 0.197788 0.439329\n", + "17 0.376169 0.163851 0.407217 ... 0.602450 0.143116 0.310655\n", + "18 -0.104321 0.009127 -0.081629 ... -0.030413 0.389402 0.078079\n", + "19 -0.042641 0.054458 -0.005523 ... 0.215204 0.111094 0.591328\n", + "20 0.969539 0.352573 0.969476 ... 0.787424 0.243529 0.093492\n", + "21 0.297008 0.912045 0.303038 ... 0.359755 0.233027 0.219122\n", + "22 0.965137 0.358040 0.970387 ... 0.816322 0.269493 0.138957\n", + "23 0.941082 0.343546 0.941550 ... 0.747419 0.209146 0.079647\n", + "24 0.119616 0.077503 0.150549 ... 0.547691 0.493838 0.617624\n", + "25 0.413463 0.277830 0.455774 ... 0.801080 0.614441 0.810455\n", + "26 0.526911 0.301025 0.563879 ... 0.855434 0.532520 0.686511\n", + "27 0.744214 0.295316 0.771241 ... 
1.000000 0.502528 0.511114\n", + "28 0.163953 0.105008 0.189115 ... 0.502528 1.000000 0.537848\n", + "29 0.007066 0.119205 0.051019 ... 0.511114 0.537848 1.000000\n", + "\n", + "[30 rows x 30 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 240 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9lvLldX5eWVW", + "outputId": "3a8f8ccb-e59e-408e-cd99-1f7439029321", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "for i in range(len(matrix_corr.columns)):\n", + "    for j in range(i):\n", + "        if abs(matrix_corr.iloc[i, j]) > 0.8:\n", + "            colname = matrix_corr.columns[i]\n", + "            set_variaveis_corr.add(colname)\n", + "\n", + "set_variaveis_corr" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{2, 3, 6, 7, 12, 13, 16, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 241 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R4YC3Kl5erhc" + }, + "source": [ + "Dropping the highly correlated columns from the dataframe and computing the correlation again:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "psDDQlrSevG5", + "outputId": "a2aea1b6-1826-4147-d8a5-0ccd7c95bc47", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 989 + } + }, + "source": [ + "X_cancer = X_cancer.drop(set_variaveis_corr, axis = 1)\n", + "\n", + "fig, ax = plt.subplots(figsize = (20, 17)) \n", + "mask = np.zeros_like(X_cancer.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(X_cancer.corr().abs(), mask = mask, ax = ax, cmap='coolwarm', annot = True, fmt = '.2f')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 242 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1oDecB_8fGwM" + }, + "source": [ + "### **Conclusion**: What conclusion can we draw from this analysis?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "avZ12nHqiK0M" + }, + "source": [ + "___\n", + "# **RFE - Recursive Feature Elimination** (continuation of the Correlation Analysis)\n", + "* RFE takes a long time to run! Therefore, drop the highly correlated columns from the dataframe beforehand.\n", + "* The matrix X and the target used in this topic come from the previous topic, \"Correlation Analysis\".\n", + "\n", + "* Recommended reading: [Feature Selection in Python — Recursive Feature Elimination](https://towardsdatascience.com/feature-selection-in-python-recursive-feature-elimination-19f1c39b8d15)\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MCGTzI59R3G7" + }, + "source": [ + "from sklearn.feature_selection import RFECV\n", + "from sklearn.ensemble import RandomForestRegressor\n", + "from sklearn.model_selection import StratifiedKFold" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sqsK1bGmiK0O", + "outputId": "3a989f0a-6324-4b9c-a668-c0b512558223", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "rf = RandomForestRegressor(random_state = 20111974)\n", + "filtro_rfe = RFECV(estimator = rf, step = 1, cv = StratifiedKFold(10))\n", + "filtro_rfe.fit(X_cancer, y_cancer)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),\n", + " estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,\n", + " criterion='mse', max_depth=None,\n", + " max_features='auto', max_leaf_nodes=None,\n", + " max_samples=None,\n", + " min_impurity_decrease=0.0,\n", + " min_impurity_split=None,\n", + " 
min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0,\n", + " n_estimators=100, n_jobs=None,\n", + " oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False),\n", + " min_features_to_select=1, n_jobs=None, scoring=None, step=1, verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 244 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MBnSFtWiDvOd", + "outputId": "863be880-0320-4154-c1a8-06fd72c65133", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_cancer.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(569, 13)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 245 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yomSdCzJD2tS", + "outputId": "f956d8e0-6038-4eed-b908-a1ee5d80762c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_cancer.size" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "569" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 246 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f-dKku9IiK0V", + "outputId": "aa8192fa-dbe9-4399-caf0-10ecd83b4e4a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Optimal number of columns:\n", + "filtro_rfe.n_features_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "8" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 247 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TUoMb40-iK0Y", + "outputId": "fc509029-ce5e-4102-c3a3-715e0e0af10c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 605 + } + }, + "source": [ + "plt.figure(figsize=(16, 9))\n", + "plt.title('Recursive Feature Elimination (RFE) with Cross-Validation', 
 fontsize=18, fontweight='bold', pad=20)\n", + "plt.xlabel('Number of selected columns', fontsize=14, labelpad=20)\n", + "plt.ylabel('Cross-validated model score (R²)', fontsize=14, labelpad=20)\n", + "plt.plot(range(1, len(filtro_rfe.grid_scores_) + 1), filtro_rfe.grid_scores_, color='#303F9F', linewidth=3)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RG16C-2QdhUx" + }, + "source": [ + "### **Conclusion**: The feature set was reduced from 13 to 8 columns." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M8lpuOjti6Fw", + "outputId": "cf3d3eeb-0b63-40f1-8b4e-a2c581638b9f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "X_cancer.drop(X_cancer.columns[np.where(filtro_rfe.support_ == False)[0]], axis = 1, inplace = True)\n", + "X_cancer.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>4</th><th>5</th><th>10</th><th>15</th><th>18</th><th>28</th></tr>\n", + "  </thead>\n", + "  <tbody>\n",
 + "    <tr><th>0</th><td>17.99</td><td>10.38</td><td>0.11840</td><td>0.27760</td><td>1.0950</td><td>0.04904</td><td>0.03003</td><td>0.4601</td></tr>\n",
 + "    <tr><th>1</th><td>20.57</td><td>17.77</td><td>0.08474</td><td>0.07864</td><td>0.5435</td><td>0.01308</td><td>0.01389</td><td>0.2750</td></tr>\n",
 + "    <tr><th>2</th><td>19.69</td><td>21.25</td><td>0.10960</td><td>0.15990</td><td>0.7456</td><td>0.04006</td><td>0.02250</td><td>0.3613</td></tr>\n",
 + "    <tr><th>3</th><td>11.42</td><td>20.38</td><td>0.14250</td><td>0.28390</td><td>0.4956</td><td>0.07458</td><td>0.05963</td><td>0.6638</td></tr>\n",
 + "    <tr><th>4</th><td>20.29</td><td>14.34</td><td>0.10030</td><td>0.13280</td><td>0.7572</td><td>0.02461</td><td>0.01756</td><td>0.2364</td></tr>\n",
 + "  </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " 0 1 4 5 10 15 18 28\n", + "0 17.99 10.38 0.11840 0.27760 1.0950 0.04904 0.03003 0.4601\n", + "1 20.57 17.77 0.08474 0.07864 0.5435 0.01308 0.01389 0.2750\n", + "2 19.69 21.25 0.10960 0.15990 0.7456 0.04006 0.02250 0.3613\n", + "3 11.42 20.38 0.14250 0.28390 0.4956 0.07458 0.05963 0.6638\n", + "4 20.29 14.34 0.10030 0.13280 0.7572 0.02461 0.01756 0.2364" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 249 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s7S87LApdESo", + "outputId": "096f4cc6-45fe-49f5-c4e8-8d098d3001f2", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "filtro_rfe.estimator_.feature_importances_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.6274782 , 0.06895439, 0.02932265, 0.06823762, 0.05093922,\n", + " 0.01941762, 0.03383343, 0.10181687])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 250 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FEfmeic37hvw" + }, + "source": [ + "___\n", + "# **Feature Selection with Random Forest**\n", + "* To demonstrate this method, I will use the Iris dataframe.\n", + "\n", + "![Supervised_X_Unsupervised](https://github.com/MathMachado/Materials/blob/master/Supervised_X_Unsupervised.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0F2BdrZgKzV5" + }, + "source": [ + "### Load the dataframe\n", + "* [Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you will find more information about the Iris dataframe. Take a look."
 + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6H31U15q7kIO" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "# Function that loads the Iris dataframe and its related objects\n", + "def carrega_df_iris():\n", + "    global df_iris, l_iris_labels, X_iris, y_iris, iris\n", + "\n", + "    iris = load_iris()\n", + "    X_iris = iris['data']\n", + "    y_iris= iris['target']\n", + "\n", + "    df_iris = pd.DataFrame(np.c_[X_iris, y_iris], columns= np.append(iris['feature_names'], ['target']))\n", + "    df_iris['target2']= df_iris['target']\n", + "    df_iris= df_iris.rename(columns={'sepal length (cm)': 'Sepal Length', 'sepal width (cm)': 'sepal width', 'petal length (cm)': 'petal length', 'petal width (cm)': 'petal width'})\n", + "    df_iris['target'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", + "\n", + "    # Build the list of variable names\n", + "    l_iris_labels = ['Sepal Length','Sepal Width','Petal Length','Petal Width']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rD2DmkpNXkFy" + }, + "source": [ + "# Load the Iris dataframe\n", + "carrega_df_iris()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jVQuRHYhM4fD" + }, + "source": [ + "> The response variable we are trying to predict/explain is categorical. Therefore, we will use a supervised classification algorithm.\n", + "\n", + "* By default, SelectFromModel selects the features whose importance is greater than or equal to the mean importance of all features, but we can change this threshold if we want."
 + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1pPpC7GXLgpC" + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.feature_selection import SelectFromModel" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dfSfuUlHQOSt" + }, + "source": [ + "# Split the data into training (80%) and test (20%) sets\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_iris, y_iris, test_size = 0.2, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0JDsdjsZ0M4F", + "outputId": "71d1931c-9569-4a44-d7b0-7a7cfb789fa5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(120, 4)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 255 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SH_B3C1u0Qkl", + "outputId": "1e00e16e-ed0b-4317-fe46-f0f5f0f46562", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_teste.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(30, 4)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 256 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YQ270kclOFeK" + }, + "source": [ + "# Create a random forest classifier\n", + "ml_rf = RandomForestClassifier(n_estimators = 10000, random_state = 20111974, n_jobs = -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vFSCt8uaeKFN", + "outputId": "154735f9-3388-4023-9de6-cabd217c793f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# 
Train the classifier\n", + "ml_rf.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=None, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, n_estimators=10000,\n", + " n_jobs=-1, oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 258 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VfdKeUkgS6ul" + }, + "source": [ + "The most important features are:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tnrwVLPKSNxr", + "outputId": "2a3db4d0-5794-48ad-ca10-ba988b399e85", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Print each feature name together with its Gini-based importance\n", + "for feature in zip(l_iris_labels, ml_rf.feature_importances_):\n", + "    print(feature)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "('Sepal Length', 0.08731002037613723)\n", + "('Sepal Width', 0.021750035432184116)\n", + "('Petal Length', 0.39132734233988486)\n", + "('Petal Width', 0.4996126018517938)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x8FHRHlDWTn4" + }, + "source": [ + "* The scores above represent the importance of each feature.\n", + "    * The scores sum to 1 (100%);\n", + "    * 'Petal Width' (score ≈ 0.50) and 'Petal Length' (score ≈ 0.39) are the most important features;\n", + "    * Combined, the two most important features account for ~0.89."
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wbMnS_gIVBA8" + }, + "source": [ + "As a rule of thumb, keep the features whose importance is at least 0.15. \n", + "\n", + "TODO: cite the author/reference for this rule." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M3TnDYRVeMEs" + }, + "source": [ + "A more visual view:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o8QkARWpeI_e" + }, + "source": [ + "def importancia_variaveis():\n", + "    # Get the feature importances from the fitted model\n", + "    importances = ml_rf.feature_importances_\n", + "\n", + "    # Sort the features by importance (descending)\n", + "    indices = np.argsort(importances)[::-1]\n", + "\n", + "    # Match each feature name to its importance\n", + "    names = [iris.feature_names[i] for i in indices]\n", + "\n", + "    # Bar plot\n", + "    plt.bar(range(X_treinamento.shape[1]), importances[indices])\n", + "\n", + "    # Put the feature names on the x-axis\n", + "    plt.xticks(range(X_treinamento.shape[1]), names, rotation = 20, fontsize = 8)\n", + "\n", + "    # Set the plot title\n", + "    plt.title(\"Predictive importance of the features\")\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ahZdlCBE6h_e", + "outputId": "1d263536-5b65-489a-905b-1bc6a4e54e1f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 304 + } + }, + "source": [ + "importancia_variaveis()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "owVs6pvJ8F8B" + }, + "source": [ + "## Correlation Analysis" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RISO1Ury8EEH", + "outputId": "943423c9-1aa1-448c-b737-ee23adda9e31", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "# Compute the correlation between the columns/variables of the dataframe\n", + "correlacao = df_iris.corr().abs()\n", + "\n", + "# Select the upper triangle of the correlation matrix\n", + "correlacao = correlacao.where(np.triu(np.ones(correlacao.shape), k = 1).astype(bool))\n", + "correlacao" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Sepal Lengthsepal widthpetal lengthpetal widthtarget2
Sepal LengthNaN0.117570.8717540.8179410.782561
sepal widthNaNNaN0.4284400.3661260.426658
petal lengthNaNNaNNaN0.9628650.949035
petal widthNaNNaNNaNNaN0.956547
target2NaNNaNNaNNaNNaN
\n", + "
" + ], + "text/plain": [ + " Sepal Length sepal width petal length petal width target2\n", + "Sepal Length NaN 0.11757 0.871754 0.817941 0.782561\n", + "sepal width NaN NaN 0.428440 0.366126 0.426658\n", + "petal length NaN NaN NaN 0.962865 0.949035\n", + "petal width NaN NaN NaN NaN 0.956547\n", + "target2 NaN NaN NaN NaN NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 262 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lieB7hDg8EEM", + "outputId": "a6c1c02d-3c2f-41e8-c1e5-6154f34d95ae", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 500 + } + }, + "source": [ + "fig, ax = plt.subplots(figsize = (8, 8)) \n", + "mask = np.zeros_like(df_iris.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(df_iris.corr().abs(), mask = mask, ax = ax, cmap ='coolwarm', annot = True, fmt = '.2f')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 263 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A_FXdDMr8EEQ" + }, + "source": [ + "> The correlation analysis shows two variables that are highly correlated with the response variable: 'Petal Width' (0.96) and 'Petal Length' (0.95). These are also the two most important features according to the Random Forest, remember?\n", + ">> However, check the correlation between 'Petal Width' and 'Petal Length' themselves: it is 0.96. The two features are highly correlated with each other, so they carry largely redundant information." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "taei2KXSTmZ0" + }, + "source": [ + "### Using SelectFromModel()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JLwboa9tTpBq", + "outputId": "ea927ac2-d0c4-4e10-ae72-7f9014aae8bb", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# From the Random Forest, selects the features whose importance is at least 0.15 (sfm) and 0.45 (sfm_2)\n", + "# Note: rf is the estimator defined earlier in the notebook (a RandomForestRegressor, per the output below)\n", + "sfm = SelectFromModel(rf, threshold = 0.15)\n", + "sfm_2 = SelectFromModel(rf, threshold = 0.45)\n", + "\n", + "# Fits the selectors\n", + "sfm.fit(X_treinamento, y_treinamento)\n", + "sfm_2.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SelectFromModel(estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,\n", + " criterion='mse', max_depth=None,\n", + " max_features='auto',\n", + " max_leaf_nodes=None,\n", + " max_samples=None,\n", + " min_impurity_decrease=0.0,\n", + " min_impurity_split=None,\n", + " min_samples_leaf=1,\n", + " min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0,\n", + " n_estimators=100, n_jobs=None,\n", + " oob_score=False,\n", + " random_state=20111974,\n", + " verbose=0, warm_start=False),\n", + " max_features=None, norm_order=1, prefit=False, threshold=0.45)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 264 + } + 
] + }, + { + "cell_type": "code", + "metadata": { + "id": "2MiZrU56VzUW", + "outputId": "d64749e9-8883-4432-e68d-bfe4c1b8c7f1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Shows the features selected by sfm (threshold = 0.15)\n", + "for feature_list_index in sfm.get_support(indices = True):\n", + " print(l_iris_labels[feature_list_index])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Petal Length\n", + "Petal Width\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M0junBxr79Th", + "outputId": "e150591e-1db0-4754-8999-f5e198d3bd60", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Shows the features selected by sfm_2 (threshold = 0.45)\n", + "for feature_list_index in sfm_2.get_support(indices = True):\n", + " print(l_iris_labels[feature_list_index])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Petal Width\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "neMolQ0gYtp7" + }, + "source": [ + "Selecting only the most important features:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dXRWZgDeYxHS" + }, + "source": [ + "# Builds datasets containing only the selected features\n", + "# Note: the same transformation is applied to both the training set and the test set\n", + "X_treinamento_rfi = sfm.transform(X_treinamento)\n", + "X_teste_rfi = sfm.transform(X_teste)\n", + "\n", + "X_treinamento_rfi_2 = sfm_2.transform(X_treinamento)\n", + "X_teste_rfi_2 = sfm_2.transform(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1tmpt3YvZFvW", + "outputId": "a214e4c8-b1f9-4599-b018-757235a1aa0d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Creates Random Forest classifiers using only the most important features\n", + "clf_rfi = RandomForestClassifier(n_estimators = 10000, random_state = 20111974, n_jobs = 50)\n", + "clf_rfi_2 = RandomForestClassifier(n_estimators = 10000, random_state = 20111974, n_jobs = 50)\n", + "\n", + "# Trains the models with the most important features\n", + "clf_rfi.fit(X_treinamento_rfi, y_treinamento)\n", + "clf_rfi_2.fit(X_treinamento_rfi_2, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=None, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, n_estimators=10000,\n", + " n_jobs=50, oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 268 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "epVwhUeYZM7v", + "outputId": "1503fb45-653c-4c2a-9304-c3d0689b300d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + } + }, + "source": [ + "from sklearn.metrics import accuracy_score\n", + "\n", + "# Applies the classifier to the test set\n", + "y_pred = clf_rfi.predict(X_teste_rfi)\n", + "\n", + "# Computes the accuracy (not displayed: it is not the last expression of the cell)\n", + "accuracy_score(y_teste, y_pred)\n", + "\n", + "# Confusion matrix\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "cm = cm(y_teste, y_pred) \n", + "index = ['setosa', 'versicolor', 'virginica'] \n", + "columns = ['setosa','versicolor', 'virginica'] \n", + "cm_df = pd.DataFrame(cm, index = index, columns = columns) \n", + "plt.figure(figsize = (10, 6)) \n", + "cm_df.index.name = 'Actual'\n", + "cm_df.columns.name = 'Predicted'\n", + "sns.heatmap(cm_df, annot = True)" + ], + 
"execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 269 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CA_rhTMiZbLN", + "outputId": "a1af75c9-690e-49a0-ed79-87259804b06b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + } + }, + "source": [ + "# Applies the classifier to the test set\n", + "y_pred_rfi = clf_rfi.predict(X_teste_rfi)\n", + "\n", + "# Computes the accuracy\n", + "accuracy_score(y_teste, y_pred_rfi)\n", + "\n", + "# Confusion matrix\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "cm = cm(y_teste, y_pred_rfi) \n", + "index = ['setosa','versicolor','virginica'] \n", + "columns = ['setosa','versicolor','virginica'] \n", + "cm_df = pd.DataFrame(cm, index = index, columns = columns) \n", + "plt.figure(figsize=(10,6)) \n", + "cm_df.index.name = 'Actual'\n", + "cm_df.columns.name = 'Predicted'\n", + "sns.heatmap(cm_df, annot=True)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 270 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lIYvP0Bq8ST_", + "outputId": "c16ef565-c421-42a2-a9c9-48fc04d4d6bc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + } + }, + "source": [ + "# Applies the classifier to the test set after the correlation analysis\n", + "y_pred_rfi_2 = clf_rfi_2.predict(X_teste_rfi_2)\n", + "\n", + "# Computes the accuracy\n", + "accuracy_score(y_teste, y_pred_rfi_2)\n", + "\n", + "# Confusion matrix\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "cm = cm(y_teste, y_pred_rfi_2) \n", + "index = ['setosa','versicolor', 'virginica'] \n", + "columns = ['setosa','versicolor', 'virginica'] \n", + "cm_df = pd.DataFrame(cm, index = index, columns = columns) \n", + "plt.figure(figsize = (10, 6)) \n", + "cm_df.index.name = 'Actual'\n", + "cm_df.columns.name = 'Predicted'\n", + "sns.heatmap(cm_df, annot = True)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 271 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uWdAUMQoZtaV" + }, + "source": [ + "> As we can see:\n", + "* The original model (4 features) has an accuracy of 93.3%;\n", + "* The reduced model (2 features) has an accuracy of 93%;\n", + "* Reduced model 2 (1 feature) has an accuracy of 93%.\n", + "\n", + ">> In other words, we reduced the model from 4 features to 1 and the accuracy stayed practically the same." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OfCq7UGIpSYg", + "outputId": "5bec1c14-5210-4e28-84a2-b795ef78fe5a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 106 + } + }, + "source": [ + "# Pairwise correlation...\n", + "df_iris[['petal length', 'petal width']].corr()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
petal lengthpetal width
petal length1.0000000.962865
petal width0.9628651.000000
\n", + "
" + ], + "text/plain": [ + " petal length petal width\n", + "petal length 1.000000 0.962865\n", + "petal width 0.962865 1.000000" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 272 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BMfxK3UCbjfc" + }, + "source": [ + "## Feature Selection With XGBoost (Extreme Gradient Boosting)\n", + "> XGBoost often achieves strong results on tabular data, but it is not guaranteed to beat other Machine Learning algorithms: in this example it reaches 86.7% accuracy, below the Random Forest above." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8ZYGjr0Su4y-", + "outputId": "919fa042-5b49-4378-a46b-cf33f6469e48", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "!pip install xgboost" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (0.90)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.4.1)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.18.5)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "--vKKHVWbwGv" + }, + "source": [ + "from xgboost import XGBClassifier\n", + "\n", + "# Loads the Iris dataframe\n", + "carrega_df_iris()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_d5qWQmXgPIH", + "outputId": "1a2fe686-b70e-4812-d129-304cefd93611", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Creates an XGBoost classifier\n", + "clf = XGBClassifier(n_estimators = 10000, random_state = 20111974, n_jobs = 50, max_depth = 5, learning_rate = 0.05)\n", + "\n", + "# Trains the classifier\n", + "clf.fit(X_treinamento, y_treinamento)\n", + "\n", + "# Computes y_pred and evaluates the quality of the fit\n", + "y_pred = clf.predict(X_teste)\n", + "predictions = [round(value) for value in y_pred]\n", 
+ "accuracy = accuracy_score(y_teste, predictions)\n", + "print(f\"Acurácia: {accuracy}\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácia: 0.8666666666666667\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fTdKMOKdC2UA", + "outputId": "0421c20e-1af7-473b-f0fa-4a77c4d1d3dd", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Adapted from https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/\n", + "# Fits a model using each computed importance as the threshold\n", + "\n", + "thresholds = sorted(clf.feature_importances_)\n", + "for thresh in thresholds:\n", + "\t# Selects the features whose importance is at least thresh\n", + "\tselection = SelectFromModel(clf, threshold=thresh, prefit=True)\n", + "\tselect_X_treinamento = selection.transform(X_treinamento)\n", + "\t\n", + "\t# Trains the model on the selected features\n", + "\tselection_clf = XGBClassifier()\n", + "\tselection_clf.fit(select_X_treinamento, y_treinamento)\n", + "\t\n", + "\t# Evaluates the model on the test set\n", + "\tselect_X_teste = selection.transform(X_teste)\n", + "\ty_pred = selection_clf.predict(select_X_teste)\n", + "\tpredictions = [round(value) for value in y_pred]\n", + "\taccuracy = accuracy_score(y_teste, predictions)\n", + "\tprint(f\"Threshold= {round(thresh,2)}, n= {select_X_treinamento.shape[1]}, Acurácia: {round(accuracy*100.0,2)}\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Threshold= 0.009999999776482582, n= 4, Acurácia: 86.67\n", + "Threshold= 0.03999999910593033, n= 3, Acurácia: 86.67\n", + "Threshold= 0.44999998807907104, n= 2, Acurácia: 86.67\n", + "Threshold= 0.5, n= 1, Acurácia: 86.67\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zv2gbuc5glFJ", + "outputId": "490e0459-db4a-4406-f4f4-5f5d70519cce", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 304 + } + }, + 
"source": [ + "# Calcula a importância das features\n", + "importances = clf.feature_importances_\n", + "\n", + "# Ordena as importâncias por ordem descendente\n", + "indices = np.argsort(importances)[::-1]\n", + "\n", + "# Organiza...\n", + "names = [iris.feature_names[i] for i in indices]\n", + "\n", + "# Barplot\n", + "plt.bar(range(X_treinamento.shape[1]), importances[indices])\n", + "\n", + "# Coloca o nome dos labels no eixo X\n", + "plt.xticks(range(X_treinamento.shape[1]), names, rotation=20, fontsize = 8)\n", + "\n", + "# Constroi o gráfico\n", + "plt.title(\"Feature Importance\")\n", + "\n", + "# Mostra o gráfico\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEfCAYAAABRUD3KAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3deZgcVb3/8feHhAACASSRNSEIEQkKKoEgXoXrxqYBLqJBRVD8ISpuiArKRUQUAcUHFRdEBBFlUzBqEFSEe1FQgiwSIBq4gSwsYTHsS8j398f3dKbSTDKdZCaTOfm8nidPurtquk7X1Hzq9FmqFBGYmdnAt0p/F8DMzHqHA93MrBIOdDOzSjjQzcwq4UA3M6uEA93MrBIOdDOzSjjQrWOSpkt6StLjjX8b98J7vrm3ytjB9o6T9NPltb3FkXSwpGv6uxxWDwe6Lam3R8RajX+z+7Mwkgb35/aX1kAtt63YHOi2zCStI+lHku6VNEvSCZIGlWVbSLpS0kOSHpR0nqR1y7JzgZHAr0tt/7OSdpU0s+39F9TiSw37Ykk/lfQocPDitt9B2UPSRyT9S9Jjkr5cyvwXSY9KulDSkLLurpJmSvp8+SzTJb2nbT/8RNIcSXdLOkbSKmXZwZL+LOmbkh4CLgC+D7y2fPZ/l/X2knRj2fYMScc13n9UKe9Bku4pZfhCY/mgUrY7y2e5QdKIsuzlkn4v6WFJUyW9cwl/zTYAONCtN5wNzAO2BF4NvBX4YFkm4ERgY2BrYARwHEBEHAjcQ1et/+QOt7c3cDGwLnBeD9vvxG7A9sBOwGeBM4D3lrK+Ajigse6GwDBgE+Ag4AxJW5Vl3wbWAV4K7AK8D3h/42fHAXcBG5T3Pwy4tnz2dcs6T5SfWxfYC/iwpH3ayvsfwFbAm4BjJW1dXj+ilHVPYCjwAeBJSWsCvwd+BrwEmAB8V9KYJdhHNgA40G1JXSrp3+XfpZI2IAPkkxHxREQ8AHyTDA0iYlpE/D4inomIOcCpZNgti2sj4tKImE8G1yK336GTI+LRiJgC3ApcERF3RcRc4DLyJNH03+XzXA38Fnhn+UYwATg6Ih6LiOnAN4ADGz83OyK+HRHzIuKp7goSEVdFxD8iYn5E3AL8nBfury9FxFMRcTNwM7Bdef2DwDERMTXSzRHxEPA2YHpE/Lhs+0bgF8D+S7CPbABwO54tqX0i4g+tJ5J2BFYF7p
XUenkVYEZZvgFwGvB6YO2y7JFlLMOMxuPNFrf9Dt3fePxUN883bDx/JCKeaDy/m/z2MayU4+62ZZssotzdkjQO+Br5zWAIsBpwUdtq9zUePwmsVR6PAO7s5m03A8a1mnWKwcC5PZXHBhbX0G1ZzQCeAYZFxLrl39CI2KYs/yoQwCsjYijZ1KDGz7df7vMJ4EWtJ6XmO7xtnebP9LT93rZeacJoGQnMBh4EniPDs7ls1iLK3d1zyGaRicCIiFiHbGdXN+t1ZwawxSJev7qxf9YtzTwf7vB9bYBwoNsyiYh7gSuAb0gaKmmV0qnYaiZYG3gcmCtpE+AzbW9xP9nm3PJPYPXSObgqcAxZS13a7feFL0kaIun1ZHPGRRHxPHAh8BVJa0vajGzTXtwQyfuBTVudrsXawMMR8XT59vPuJSjXmcCXJY1W2lbS+sBvgJdJOlDSquXfDo22d6uEA916w/vI5oHbyOaUi4GNyrIvAa8B5pLtzb9s+9kTgWNKm/yRpd36I2Q4zSJr7DNZvMVtv7fdV7Yxm+yQPSwi7ijLPkaW9y7gGrK2fdZi3utKYApwn6QHy2sfAY6X9BhwLHmS6NSpZf0rgEeBHwFrRMRjZEfxhFLu+4CTWMyJ0gYm+QYXZp2RtCvw04jYtL/LYtYd19DNzCrhQDczq4SbXMzMKuEauplZJRzoZmaV6LeZosOGDYtRo0b11+bNzAakG2644cGIaJ9sB/RjoI8aNYrJkyf31+bNzAYkSXcvapmbXMzMKuFANzOrhAPdzKwSDnQzs0p0FOiSdi+3rZom6ahulh9cbrt1U/m3JHeLMTOzXtDjKJdyPerTgbeQV727XtLEiLitbdULIuLwPiijmZl1oJMa+o7AtHJLrmeB88l7OpqZ2Qqkk0DfhIVvnTWThW+r1bKfpFvKHdlHdPdGkg6VNFnS5Dlz5ixFcc3MbFF6a2LRr4GfR8Qzkj4EnAO8sX2liDiDvKM6Y8eOXeqrgo066rdL+6NVmP61vfq7CGa2Auqkhj6LvPlsy6YsfJ9EIuKhiHimPD0T2L53imdmZp3qJNCvB0ZL2rzc+3ACeRPbBSQ1b/c1Hri994poZmad6LHJJSLmSTocuBwYBJwVEVMkHQ9MjoiJwMcljQfmAQ8DB/dhmc3MrBsdtaFHxCRgUttrxzYeHw0c3btFMzOzJeGZomZmlXCgm5lVwoFuZlYJB7qZWSUc6GZmlXCgm5lVwoFuZlYJB7qZWSUc6GZmlXCgm5lVwoFuZlYJB7qZWSUc6GZmlXCgm5lVorduQWcDiG/h51v4WZ1cQzczq4QD3cysEg50M7NKONDNzCrhQDczq4QD3cysEg50M7NKONDNzCrhQDczq4QD3cysEg50M7NKONDNzCrhQDczq4QD3cysEg50M7NKONDNzCrhQDczq0RHgS5pd0lTJU2TdNRi1ttPUkga23tFNDOzTvQY6JIGAacDewBjgAMkjelmvbWBTwB/7e1CmplZzzqpoe8ITIuIuyLiWeB8YO9u1vsycBLwdC+Wz8zMOtRJoG8CzGg8n1leW0DSa4AREbFy333YzKwfLXOnqKRVgFOBT3ew7qGSJkuaPGfOnGXdtJmZNXQS6LOAEY3nm5bXWtYGXgFcJWk6sBMwsbuO0Yg4IyLGRsTY4cOHL32pzczsBToJ9OuB0ZI2lzQEmABMbC2MiLkRMSwiRkXEKOA6YHxETO6TEpuZWbd6DPSImAccDlwO3A5cGBFTJB0vaXxfF9DMzDozuJOVImISMKnttWMXse6uy14sMzNbUp4pamZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mV
klHOhmZpVwoJuZVcKBbmZWiY4CXdLukqZKmibpqG6WHybpH5JuknSNpDG9X1QzM1ucHgNd0iDgdGAPYAxwQDeB/bOIeGVEvAo4GTi110tqZmaL1UkNfUdgWkTcFRHPAucDezdXiIhHG0/XBKL3imhmZp0Y3ME6mwAzGs9nAuPaV5L0UeAIYAjwxu7eSNKhwKEAI0eOXNKympnZYvRap2hEnB4RWwCfA45ZxDpnRMTYiBg7fPjw3tq0mZnRWaDPAkY0nm9aXluU84F9lqVQZma25DoJ9OuB0ZI2lzQEmABMbK4gaXTj6V7Av3qviGZm1oke29AjYp6kw4HLgUHAWRExRdLxwOSImAgcLunNwHPAI8BBfVloMzN7oU46RYmIScCktteObTz+RC+Xy8zMlpBnipqZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpXoKNAl7S5pqqRpko7qZvkRkm6TdIukP0rarPeLamZmi9NjoEsaBJwO7AGMAQ6QNKZttRuBsRGxLXAxcHJvF9TMzBavkxr6jsC0iLgrIp4Fzgf2bq4QEX+KiCfL0+uATXu3mGZm1pNOAn0TYEbj+czy2qIcAly2LIUyM7MlN7g330zSe4GxwC6LWH4ocCjAyJEje3PTZmYrvU5q6LOAEY3nm5bXFiLpzcAXgPER8Ux3bxQRZ0TE2IgYO3z48KUpr5mZLUIngX49MFrS5pKGABOAic0VJL0a+AEZ5g/0fjHNzKwnPQZ6RMwDDgcuB24HLoyIKZKOlzS+rHYKsBZwkaSbJE1cxNuZmVkf6agNPSImAZPaXju28fjNvVwuMzNbQp4pamZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWCQe6mVklHOhmZpVwoJuZVcKBbmZWiY4CXdLukqZKmibpqG6Wv0HS3yXNk/SO3i+mmZn1pMdAlzQIOB3YAxgDHCBpTNtq9wAHAz/r7QKamVlnBnewzo7AtIi4C0DS+cDewG2tFSJielk2vw/KaGZmHeikyWUTYEbj+czympmZrUCWa6eopEMlTZY0ec6cOctz02Zm1esk0GcBIxrPNy2vLbGIOCMixkbE2OHDhy/NW5iZ2SJ0EujXA6MlbS5pCDABmNi3xTIzsyXVY6BHxDzgcOBy4HbgwoiYIul4SeMBJO0gaSawP/ADSVP6stBmZvZCnYxyISImAZPaXju28fh6sinGzMz6iWeKmplVwoFuZlYJB7qZWSUc6GZmlXCgm5lVwoFuZlYJB7qZWSUc6GZmlXCgm5lVwoFuZlYJB7qZWSUc6GZmlXCgm5lVwoFuZlYJB7qZWSUc6GZmlXCgm5lVwoFuZlYJB7qZWSU6uqeomXUZddRv+7sI/W761/bq7yJYN1xDNzOrhAPdzKwSDnQzs0o40M3MKuFANzOrhAPdzKwSDnQzs0o40M3MKuFANzOrhAPdzKwSDnQzs0o40M3MKuGLc5nZcreyX+Csry5u1lENXdLukqZKmibpqG6WrybpgrL8r5JG9XZBzcxs8XoMdEmDgNOBPYAxwAGSxrStdgjwSERsCXwTOKm3C2pmZovXSQ19R2BaRNwVEc8C5wN7t62zN3BOeXwx8CZJ6r1implZTzppQ98EmNF4PhMYt6h1ImKepLnA+sCDzZUkHQ
ocWp4+Lmnq0hR6BTCMts+2PGngf//x/lt23ofLZiDvv80WtWC5dopGxBnAGctzm31B0uSIGNvf5RiovP+Wnffhsql1/3XS5DILGNF4vml5rdt1JA0G1gEe6o0CmplZZzoJ9OuB0ZI2lzQEmABMbFtnInBQefwO4MqIiN4rppmZ9aTHJpfSJn44cDkwCDgrIqZIOh6YHBETgR8B50qaBjxMhn7NBnyzUT/z/lt23ofLpsr9J1ekzczq4Kn/ZmaVcKCbmVXCgb6Ck7SOpG3LY0/WWgqSRkh6S3+XYyBS2k7ShNbz/i7TQCTp9ZI+JWm1vtyOA30F1fjD2Qw4QtKaHjm0ZMoQWoD5wM6tULKeSRokSeWYuxf4L0mjfAx2ruzDVsbeCMwGPiNp1b7apgN9BdL45dP6w4mIW4A7gHdLWq+/yjZQtO3DeeX/WcA3gNdLenN/lW1FV2rjqwBExPMREZKGRMQDwGXA3pI2799Srti62YfzS2Xs8Yi4ANgI2Levtu9AXwE0DoD5jdf2kzS+PL0AWA94XT8Ub0Bo34eSXiTpk5KOkLRZRDwOXEP9Q2qXWqTW/ttW0sVA6+qqrbknO/RL4VZg7RWxEuKrSxon6U/AaZJ2K6v8BvivviqLr4feTyQNiojnYaEQ2hHYGXgE2A14kaR7IuImSXcDG0taq4TTSq99H5ZmqrcArwf+TV6vYybwdWB/4JfAh7wPF1xFdX6zCUXSq8gT3suBPwPnAntKelNE/FHSbGDtUuN8ol8KvgJqq4jtAuxJXpn2QeAIYB5wGnB5RFxWKhkbRsR9vV0W19CXg/I1bA1JC64d0QoiSWtL2lLSecBHgZ2AXSPi3cBFwGtLm9v9wNrAU8v/E6wYJK0laXtJa8NC+3CL8tpV5AnxX2TN8ssRcTqwpqTtI+IZ4AbgP/rlA/QzSVtJOgQWalJ5qaRXS9oZ+DYwFfgDsFVE/Aq4AhgjaXVgDrBJRDyxsnaOln34LklrNF57g6QPSBoK/IA8xvYnLzk+MyL+ATwk6T/Lj1xNBn6vc6AvB6UWtCn5SwZA0kGSfgccAKwFrE7WJD8OTJe0NTAZ2BAYCdwK7NgKsZXUZsBBEfFYOUkeIOkK4KsR8RgwF/hbRPwE+DVdbZW/B95XHt9GdvKtjIaSs71bN6U5h6yFDwfuA/4J/J28BPYdkrYD/kYef1uSHXuvgK4+npWQgN0j4ikASd8EPgkMiohHyf33WEQ8Te7bd5Sf+xPwgfL4eeCmviicA70PLKL28hqyBomkEcBY4LhyBco7yFrRWLKp4AHgNRExFRgCbA48C0yTtGGff4AVQGkSaHcb8GJJ65VAGQd8IyLeVZb/kayhQ/Y7HFAeXwy0houNA+7sm1KvOBax/15H16WwdwZmRcTrIuIK8iR3Nbl/7geeBHaOiBnAXWQIrQlcJmmdvi5/f2qNjlrEaJS7yUt/b1yWzwc+HRE/LMt/Tzb7AVwKHFwen0Meh5AnhT5psnKg95LSCff2Rti0Xm/t4zXo+qq/BbB5RFwHUM7mtwKjyV/2vcAw5cXQTo6IP5CB/vW+aHdbUUgaIum4Ztt4m2HALcDmkoaR1w2a3Vj+S8o+jojLgSckDY2IGRFxWFnn0xHxeI1NBpJGSvpvSS9u7j91Dd+8H3hbefwEWYEAoNQ4/0XWwIcC/wD+r/wuvhcRtwP3RsTZETF3eXye/qAcJ36FpMER8Vz5Jtg8VoYD08lvzsPJCti6jeW/JG/ws3pE/Bn4btn/T5YmLCLiq6X5r9c50JeBpFUkbVDaF4eStZ5ty2u7ta3+u7KciLgKGCppo/I+LyVrRI+TtfHLIuK0iHg2Ih4pP/N063FtJL2ynAifJb/JvFXSyyV9VNKWjVXnk/toFNlx/BQ5FHFNSfuTHaC3S2o1C7wrIh5t/UGWcHqsLBvwTQaSVi19MNuWY/Bhsia9b+l4OwG6hm+SI1VeIWm1iPgbsLqk15Y24HeTnXdXAqtGxP9GxO8a/R
StMelVTi6StK+k4SVo7wQ+LOkY4DyyLbxlLnkMviIiZpOXDp8g6b2SLirrfJXst1FEnBMR89qPt77ahw70pVDCZl/yrkxjyRB+FLgO+DTwfeATytmJAig163sl7Vne5iLgKEk3Ax8im1lOjog7+ursvSIpJ8NWs8A4oNVs8kvgY+SdrYaR96gFICIeImuRryT3/Q+BDcigeifZYffhiLi1EeKrNMb0V9P/IGkv4EiyGWRb4O3Aa8l+l/cAQY6Sem8Je8rIlNuB/crbfJYckfFlskJxY0RcGhEvuJNPM5AqORmu0haqGwFfl7QrObx1X7Im/jXgmNZK5dvJZGBrSS+LiLPIfbojcF5EzI2ICyLiocWdAPtsH0aE/3X4Dxhc/h9JtoltTA5HupZss92ZHGf6EmB7csjSlo2f3ws4ARhZnm8GvLRtG+rvz9kP+3Uv4OmyD19JtoVvX5ZdCuzQWHdVsuP4c+X5i4C12t5vlf7+TL24b1pXRN2i/D+o/L8GeUIbAfyWHJ1yQtl/vwReTI6kOAIY03i/scDRwF7l+ZBFbbO2f919LnJAwo5kheIJskK2Bjm655VlnfOAN7b9Pg4ETl7UvuqvY9A19B5IWl/S12Chr65DyRricPJr7j+Bz5PBPpk8QO4EngNe1ni7y4BJwDGlieHuiLir2U4X5WioRfloLzjOSn/DJ5TDNR8mO+ROjBziNYVseoFsqjqg/MzgiHiO/ANbV9I2EfFkZJv4KupmgtZAFxGhnCG8d3n+vKTtyVDejuyP+TFwFlm5uBX4H+A/yRrm85SRKeXnJ5MngjeWtvZnYeFO1AqPwUGw8OeStLWks4GTgOPIcfcXA/8T2Z8wGXhTWf0y4P+Vx61j+SLyON2kvN9Cx3l/HYMO9G60/WIeAr7Y6Pn+Cvk17DHgMHLc6TXA6HLA/B14VUT8m/waO7LxlXd+RPyFHGb3SDPEa/sjaikfbcHszfL/gWTv/4Nk4DwFnEnXiIAL6Wq3vISc6LLghFp+JydGxJTGdubXFORtx+AjwA/UNe3+IHJM+AXA+8nj7wnym0yQTQA7kft1FvBk4/hVRDwYEZ+OiIcb26imOQpesP9a/QA7S3p5efltdH2r+SewFbk/Wzexv5iuGZ0XA38v+641ke3pyPbxmeV5rAjHn2eKFs1On0YArUq2j7+MbBb4MDlefF9yOOGZ5D58Dti0HCwPA2tI2pQMo8dbtaCWiJhe/q8qxMsJSrHwzLmXkOPv9wdukPQdsnPz2og4T9KTZdlJwMck7UDej3Y9SRtHdjzt2b6tyDG/VWk/BksIb0Q2k6wK7CfpA8DQiPi2cnLLWWQTwf3ATpLGADeTtcf1gF80j7O2WqpqOQbVNnu17Rh8G1nDHgTcqZz/cRewWUTcK+ly8hj7HPBpSScDvwKulrROZLv5KYvY7gq1D1fqGrrKeNpmx1l5vrWkT5Hta18lv8KOLl/dtgBGNA6eXcn2tq3JDrx5ZA18ZkQ83B7mtZH0YkmHKq+XEiWI1pc0rqyyCzlr813kCIojgWlkPwPkhJYtyf32U3KSxqrAHhExu1XT6q7ZZqCTtI2k15THg1vHYGk+2h84nxzNc0hEXEgG9Dyys/M1pWlgTbJDdCLZsS7g+sjhhQ8vLmxWpCBaFspRYsdK2qY8HyrpSOVInyHkt8CtIuJtwE/Imvc9dM3WXJPsF3sx8AXyW/bkiPjvaAzRXK6dm0upuj+STimHaX2r66lWlfRuSVsAryJr48+TtclBZAiNI2cgvk/Su8gDYBty3PgXI2KPiLguKr/OhRaeWLI62Z+wQ1n2fvL+s3tLOg74BTnM6yGynRxyTP1gSd8mv+I+DbwlIs6IiPdExC0R8XQ50c6HutrFG+3V21FGUETeu3cNSSeSo1XWI4+v9YG55ZvOTcAbge8BB0i6hPyGOCcinomI70TEKaW5r8rhhbDwibB4jAzo7crzI8nO8huBiyPiMnLq/UbkPIanyG/Vt0i6Eh
hPTrjapfz9nh9lpJm6uQLqimylCfRmp1nxG7J9e3DkdS2eIycI7EK2p/2ObId8nByNcglwWET8iGyzHA+cEhFHl5rp463tLL9Ptfy1nQgh23L/QQ7jEjm6Z3fgbOAQuoYVTij7aDY5Y/GLwP+STS3X0pg5V2PnJiz0uVrt1ReQnbvrlNefIkf4bEPux6vIysWa5LH5W+BTEfEnsgP0RxGxZ0T8vLmNZt/M8vhcy0t3J8KiNZz15cqJQVuR11N5A7Cl8horfwD2L3/n04HXRcS3yNr6R8iA/3vZTr93bi6t2sNneOtxq9OsfB0bWtpg7wb2afzIOWQTwOvItvC5ZNvaIHJK79PKGWB/KDXJP7Rvc6AdAD1Z3IkQoPyB/B85tX4Mua+uIy809qGIuJfcdx8sP38B8OfICT63kB2hh5AdVJT3rGIfdjfyQdK6kg6W9PIS7DfTuMZPOaZuI0emiNwv44HVIuJq4Dul3XZmRPymbGehbVQY5D2dCOeTIT2fHGE2i+zs/BUwtvytX0LXLNkzIi/aRnntEvKE+UDj/Qak6gK9VTtRjix5h7pGVmyjHCJ3JfAR5dTxsymTLMrX+yfIM/l8ssPkr8D1sOAP5dDIafqtbVW3/2CpToT3kV9ZX0sG/mRy0somkj5ZQurUEkTTIuLG8nMbA38la0szqEyrT6H1XNJ+ZMiMI2dzbk9WIsaX5a2a9TVk2/kGEXEHecK7p+y/ie2BPZADqDtLcyIkO4Wnk31Z55IXFdsBOF/S5yLiJsoxGwtfOvlqYM+I+FCpZAxo1QRS+9fMEryjgb9IOgJ4BjiD7MRcCzgwcgr+SyQNKwfN4NKJeQlwVfkDentE3NDYzoD9OrY4S3siLB4hv/K+lGwymAT8jByjezFAM4gaNa4rI+KSqKDjWC+ceYjyXpynSjpN0gZkmO9GjhsfRwbJTWR/whYREeq6INSJ5F2WiIgfR85ArKrmvShLeiIsPzOXHLnyCvJbzZnkt8YjIuKkss6T7b+jyOv8PNfHH2m5GdCB3l2HhaSXSXqHpPXLoqeBsyJiGvl5LyQPjNGlbW0yXc0BrZPB9yPHmLbesznpopoQh2U/EZafmU+2jT9Jjl45NyL2i4jjoozTbaplHzbDodXU0fp2oxzlcynZMfd7st9hMFnb/gQ5kWWY8uqZ15HT9SFHsRARf46IO7vbVk1640TY+NHp5L4eFBHXR3YST2u+d+0nxQEZ6I0Qmt/2+mFkrWYr8o/mu+QveKuyyjjyoDiEHEf+BvK6K7eW92teoU6N7dQ86aI3ToQA/4qIEyLinuaJsMYg6q7TUXmRq18B5yo7jmeTNca/lbbuB8ihm8+TzQGzgXXIDtBvAKd2857NE0Y1QdRHJ0LKsffrVtNJ8294ZTEgJxY1AmM3spf6e+WMvR55fYX55M0i/k5ea/z1ZFvtFHKW51S6OuceITv1ut1GTUoT0kJfZ8vrh5ETp64jL/T0XXLUyVbkfmudCP8KHEvXiXCb1ntEN1flq+1E2NI4/saTofxTckTU6RFxhaRfkMfgT8kQP5489h4k+x9OJcPqqFjM5ZBrOwYbx99CJ0JynsJqkn5CjnxqnQinlr/x5onwQfK4a50IXxDY3W1nZaGB+JmVvdtfIttsJ5FX5XuQHN51ONkpchFZ896QbMd9iBxRMSwi/tn2fivUbK++1n4ilHQ0ORSzdSI8idyXG0XE1yW9nYVPhFdEpZfy7URp5/4OOUb8BHK/fJZslnqWHDL3PfLYO5scfrg9sM/iAnxl0XYi/CQwpXEivIic+ToiIo4vlY0ZLHwiPNf7sXsDNdAHA58B1o6Iz5ez/EHkJJZ9yRrmvuQQuuPJKxpOa3uPVSj9L8u18P3IJ8LeUXi2fqkAAAIcSURBVJqcPktOMluPDPSdyIuIvZOcrr8NGexHAndHjnlesM/UzU2aa+cTYd8bkG3okRdpuoqcpQgZSneRtczfkRMF/g2cWtropkH3bXfLs9wrgC
fI4V23RsT3yREpryLHg19GtpPvQ076uZ+8f+f4yCnk/4S6J650KnLo5nnkrM21yUrDbLJCcS55NcMh5D68jRyeudD0/ig3aV7+pe9Xa5AT0U4hQ3tD8sJYj5FXk/whOWloJvl3PDUidoqI+xqjsKrsl+ktA7INvbgdWF/St8ixp1dExFOSfh0RE7v7gZXwD2ghkdPLr6Jr/G7zRLgGeSK8HTintLMvOBE2gqiKESrLKiJulzSbvCrkILKN/G/A7yLi/tZ6yptYP1D24bzu323lEHn3qPPISsM08kR4Jl0nwvXIb9etE+FoWHAibI3+qbJfprcMyCYXWNBkMgF4K3BadE1WWbDc4fNCktYlp43PpetEeMrK2nyytJQXfTqObD44OyKubVu+qPuirvRK098e5Djyj5PfHIe2nQhHkN8ef+PjsnMDNtAhx6uSdw75SpTri/uXv3g+EVp/8omwbw3kJhfI4YaTyR7zRxzmPYucEdu6I9B0cJPKslClFxLrK5Gzgj+/mOUO82UwoGvotnTKKI09geui3GzDbHnyibBvONDNzCoxIIctmpnZCznQzcwq4UA3M6uEA93MrBIOdDOzSjjQzcwq8f8BuZf3oTE41uEAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cXTwXiB_LVB3", + "outputId": "83b9f975-bd71-4604-aded-b8310da30193", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + } + }, + "source": [ + "# Matriz de Confusão\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "cm = cm(y_teste, y_pred) \n", + "index = ['setosa','versicolor','virginica'] \n", + "columns = ['setosa','versicolor','virginica'] \n", + "cm_df = pd.DataFrame(cm,columns,index) \n", + "plt.figure(figsize = (10, 6)) \n", + "cm_df.index.name = 'Actual'\n", + "cm_df.columns.name = 'Predicted'\n", + "sns.heatmap(cm_df, annot=True)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 278 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_ebv3nAzU2ac" + }, + "source": [ + "## Feature Selection using PCA (Principal Component Analysis)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8M5uO9r-Vtze" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "# Load the Iris dataframe\n", + "carrega_df_iris()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wWS8bvXzX4Fg" + }, + "source": [ + "### Standardize the Data\n", + "* PCA is sensitive to scale, so the features must be scaled before applying PCA.\n", + "* Use StandardScaler to standardize each feature to mean = 0 and variance = 1." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2oVG8_1HXweo" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.decomposition import PCA" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xYXnYebxclya" + }, + "source": [ + "Standardizing the features of X:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iDMzHm3mcpbs" + }, + "source": [ + "X_STD = StandardScaler().fit_transform(X_iris) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MmKMCuvMc63E" + }, + "source": [ + "pca_2c = PCA(n_components = 2)\n", + "X_PCA_2c = pca_2c.fit_transform(X_STD)\n", + "df_PCA_2c = pd.DataFrame(data = X_PCA_2c, columns = ['PCA1', 'PCA2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0Yfvb02JdV8B" + }, + "source": [ + "Let's understand what is happening:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-PVc1vJ8d_w6" + }, + "source": [ + "First, look at our array X below. Each column of this array corresponds to a column of the dataframe df_iris. For example, the first column holds the values of the variable 'Sepal Length'. Can you spot it?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BEp1JD0Odd3L", + "outputId": "28a25a6b-d24a-4e3d-cbf0-2f3af24b4298", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Show the first 5 rows of X\n", + "X_iris[0:5]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[5.1, 3.5, 1.4, 0.2],\n", + " [4.9, 3. , 1.4, 0.2],\n", + " [4.7, 3.2, 1.3, 0.2],\n", + " [4.6, 3.1, 1.5, 0.2],\n", + " [5. , 3.6, 1.4, 0.2]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 286 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "41-KcSTneURx" + }, + "source": [ + "Second, standardization gives us the array X_STD, shown below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "igBQNHS5eaND", + "outputId": "93a6ece8-138e-4f56-f44a-061c45c7380e", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_STD[0:5]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.90068117, 1.01900435, -1.34022653, -1.3154443 ],\n", + " [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],\n", + " [-1.38535265, 0.32841405, -1.39706395, -1.3154443 ],\n", + " [-1.50652052, 0.09821729, -1.2833891 , -1.3154443 ],\n", + " [-1.02184904, 1.24920112, -1.34022653, -1.3154443 ]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 287 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TgRD7-qPjg29" + }, + "source": [ + "See below the mean and standard deviation of the array X_STD:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M0VL4ilZjliL", + "outputId": "cd424bdd-ef89-45b1-c965-2695797f5a87", + "colab": { + "base_uri": 
"https://localhost:8080/" + } + }, + "source": [ + "np.mean(X_STD),np.std(X_STD)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(-1.4684549872375404e-15, 1.0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 288 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pyDJwNCgju0O" + }, + "source": [ + "The mean is 0 (up to floating-point rounding) and the standard deviation is 1, which is what we wanted." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KB7R7OXQemze" + }, + "source": [ + "Finally, from X_STD we build the array X_PCA_2c, shown below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lNmAskXWerqG", + "outputId": "2f417913-1e00-445e-a990-6a1cbce2b417", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_PCA_2c[0:5]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-2.26470281, 0.4800266 ],\n", + " [-2.08096115, -0.67413356],\n", + " [-2.36422905, -0.34190802],\n", + " [-2.29938422, -0.59739451],\n", + " [-2.38984217, 0.64683538]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 289 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fT2N6Ym7fBt-" + }, + "source": [ + "We have therefore reduced (or summarized) the array X_STD from 4 dimensions (columns) to 2." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cA54fhYgfQuC" + }, + "source": [ + "The final dataframe is shown below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kAQ4O9EBfUlN" + }, + "source": [ + "df_PCA_final_2c = pd.concat([df_PCA_2c, df_iris[['target']]], axis= 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JXrOdfSZPBS_", + "outputId": "21e43aaa-c78d-4779-86e4-e4ec6864b1e9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_PCA_final_2c.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " PCA1 PCA2 target\n", + "0 -2.264703 0.480027 setosa\n", + "1 -2.080961 -0.674134 setosa\n", + "2 -2.364229 -0.341908 setosa\n", + "3 -2.299384 -0.597395 setosa\n", + "4 -2.389842 0.646835 setosa" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 291 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MEbvp3RFf-zs" + }, + "source": [ + "### Visualize the results" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GEqP-NVngBO1", + "outputId": "8ff483a3-ba1b-4579-deb2-400020f310c7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 523 + } + }, + "source": [ + "fig = plt.figure(figsize = (8,8))\n", + "ax = fig.add_subplot(1,1,1) \n", + "ax.set_xlabel('PCA1', fontsize = 15)\n", + "ax.set_ylabel('PCA2', fontsize = 15)\n", + "ax.set_title('2-component PCA', fontsize = 20)\n", + "targets = ['setosa', 'versicolor', 'virginica']\n", + "colors = ['r', 'g', 'b']\n", + "for target, color in zip(targets,colors):\n", + "    indicesToKeep = df_PCA_final_2c['target'] == target\n", + "    ax.scatter(df_PCA_final_2c.loc[indicesToKeep, 'PCA1']\n", + "               , df_PCA_final_2c.loc[indicesToKeep, 'PCA2']\n", + "               , c = color\n", + "               , s = 50)\n", + "ax.legend(targets)\n", + "ax.grid()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pZaGiGnUl6ER" + }, + "source": [ + "What does it mean to reduce an array with 4 dimensions to 2 dimensions?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xF-whgBmlHN1", + "outputId": "a23c21ca-cc81-48ee-f293-067c4ea585d1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 654 + } + }, + "source": [ + "X_new = pca_2c.inverse_transform(X_PCA_2c)\n", + "plt.figure(figsize=(10, 10), dpi=80)\n", + "plt.scatter(X_STD[:, 0], X_STD[:, 1], alpha=0.2)\n", + "plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.9)\n", + "plt.axis('equal');" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XvLDQ2JF5NP8" + }, + "source": [ + "### Correlation Analysis" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DK3R415_5RzY", + "outputId": "7bb27d07-dda9-4f39-ec26-39b20f49a0ca", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "# Compute the correlation between the columns/variables of the dataframe\n", + "correlacao= df_iris.corr().abs()\n", + "\n", + "# Keep the upper triangle of the correlation matrix (np.bool was removed from recent NumPy; use bool)\n", + "correlacao = correlacao.where(np.triu(np.ones(correlacao.shape), k=1).astype(bool))\n", + "correlacao" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Sepal Length sepal width petal length petal width target2\n", + "Sepal Length NaN 0.11757 0.871754 0.817941 0.782561\n", + "sepal width NaN NaN 0.428440 0.366126 0.426658\n", + "petal length NaN NaN NaN 0.962865 0.949035\n", + "petal width NaN NaN NaN NaN 0.956547\n", + "target2 NaN NaN NaN NaN NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 294 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uLvvzpvl5Zy9", + "outputId": "02c8c610-efe1-45a0-e5c1-7bd308513e56", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 500 + } + }, + "source": [ + "fig, ax = plt.subplots(figsize=(8, 8)) \n", + "mask = np.zeros_like(df_iris.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(df_iris.corr().abs(), mask= mask, ax= ax, cmap='coolwarm', annot= True, fmt= '.2f')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 295 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xo-llZpb7JfO" + }, + "source": [ + "The correlation analysis shows two variables that are highly correlated with the response variable: 'Petal Width' and 'Petal Length', which are the two most important variables in the dataframe. Remember?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kjPawS_dgSKU" + }, + "source": [ + "### Explained variance\n", + "* As we saw, we reduced a 4-dimensional array to a 2-dimensional one. Some of the information (variance) is lost in the process. But how much?\n", + "\n", + "* The explained variance measures how much of the information (variance) is captured by each principal component. The attribute explained_variance_ratio_ shows that the first principal component holds 72.96% of the variance and the second holds 22.85%. Together, the two components hold 95.81% of the variance, so the reduction discards only about 4.2% and the model should be little affected by it.\n", + "\n", + "\n", + "The answer to that question is:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i6gcdvtYgwpX", + "outputId": "d416ed99-d2aa-4f64-cad9-9ba37903aa3e", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "PCA().fit(X_STD).explained_variance_ratio_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.72962445, 0.22850762, 0.03668922, 0.00517871])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 298 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "71ubjjflQEUf" + }, + "source": [ + "Note that the third and fourth values are very low, i.e. those components explain little of the variance... 
So from here we can already see that 2 is a good number of components." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m_4TqauJvIvX" + }, + "source": [ + "### How many components to choose" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O2NRgjCjvUli", + "outputId": "da0675c7-c382-48d7-eccb-1f00037e5aff", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 279 + } + }, + "source": [ + "pca = PCA().fit(X_STD)\n", + "# Start the x axis at 1 so that it reads as the number of components kept\n", + "plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_))\n", + "plt.xlabel('number of components')\n", + "plt.ylabel('cumulative explained variance');" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AvQ6KwQdwAdC" + }, + "source": [ + "**Interpretation**: This curve quantifies how much of the total 4-dimensional variance is contained in the first N components. For example, the first principal component holds about 73% of the variance, while the first 2 components together explain about 96% of it. Therefore, in our case, 2 principal components are enough to capture most of the variability in the data." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "S151TcjQw6vc" + }, + "source": [ + "pca.explained_variance_ratio_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2dNWrsMCyUUE" + }, + "source": [ + "### Measuring the impact" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ueFuNN47zMd6" + }, + "source": [ + "#### Train the model with X_PCA" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yb4dpTHPyYlU" + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Mr3tAGbjTET8" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_PCA_2c, y_iris, test_size = 0.2, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IzCZCBSmylwZ", + "outputId": "d8218700-7b4d-40b3-be83-dd2f87c8f400", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# We chose 2 principal components\n", + "classifier_2c = RandomForestClassifier(max_depth = 2, random_state = 0)\n", + "classifier_2c.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": 
"execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=2, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, n_estimators=100,\n", + " n_jobs=None, oob_score=False, random_state=0, verbose=0,\n", + " warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 307 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BiDgWh2PzYEY" + }, + "source": [ + "#### Make the predictions" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A7V8DWW4zVPs" + }, + "source": [ + "y_pred_2c = classifier_2c.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JTm4msQy2ezQ" + }, + "source": [ + "___\n", + "# **Exercises**\n", + "* For each of the dataframes below, select the best features using the following techniques:\n", + "    * Random Forest\n", + "    * XGBoost\n", + "    * RFE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "caFkC6oCmUKK" + }, + "source": [ + "## Exercise 1 - Breast Cancer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vhOM-Z9zmf-f" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "cancer = load_breast_cancer()\n", + "X= cancer['data']\n", + "y= cancer['target']\n", + "\n", + "df_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n", + "df_cancer['target'] = df_cancer['target'].map({0: 'malignant', 1: 'benign'})\n", + "df_cancer.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zA395jtOfGEl" + }, + "source": [ + "## Exercise 2 - Fraud 
Detection" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "14fV0gz0flb8" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "url= 'https://raw.githubusercontent.com/MathMachado/Python_RFB/DS_Python/Dataframes/creditcard.csv?token=AGDJQ63IAZCFP7GTSZTOMAK5QBSP6'\n", + "df_CC= pd.read_csv(url)\n", + "df_CC.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1qruqUDqnvMc" + }, + "source": [ + "## Exercise 3 - Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "trxK8YXNnsam" + }, + "source": [ + "from sklearn.datasets import load_boston\n", + "\n", + "boston = load_boston()\n", + "X= boston['data']\n", + "y= boston['target']\n", + "\n", + "df_boston = pd.DataFrame(np.c_[X, y], columns= np.append(boston['feature_names'], ['target']))\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-CawPH2nb5cl" + }, + "source": [ + "## Exercise 4 - Diabetes\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lVjBS7QcZuT" + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "diabetes = load_diabetes()\n", + "X= diabetes['data']\n", + "y= diabetes['target']\n", + "\n", + "df_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))\n", + "df_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qP1vjDdylyHr" + }, + "source": [ + "## Exercise 5 - Crimes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fqLHDXbOl0Sf" + }, + "source": [ + "url= 'https://raw.githubusercontent.com/MathMachado/Python_RFB/DS_Python/Dataframes/Crime.txt?token=AGDJQ665WUIWIEKDPK6WO625P3QUQ'\n", + "df_Crime = pd.read_table(url, sep=',', na_values='?')\n", + "df_Crime.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "fxhTXj6Ll7wB" + }, + "source": [ + "df_Crime.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d3b-Yv2HUmoI" + }, + "source": [ + "## Exercise 6 - Titanic" + ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB10_04__3DP_5_Feature_Selection_hs2.ipynb b/Notebooks/NB10_04__3DP_5_Feature_Selection_hs2.ipynb new file mode 100644 index 000000000..13566c5bb --- /dev/null +++ b/Notebooks/NB10_04__3DP_5_Feature_Selection_hs2.ipynb @@ -0,0 +1,6081 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + }, + "colab": { + "name": "NB10_04__3DP_5_Feature_Selection.ipynb", + "provenance": [], + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cka1jqOwy6KT" + }, + "source": [ + "

3DP_5 - FEATURE SELECTION

\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3aYp_plmy17y" + }, + "source": [ + "# **AGENDA**:\n", + "\n", + "> See the **Table of contents**.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rSFnHHQUKDX5" + }, + "source": [ + "# **Improvements for this session**\n", + "* Develop t-SNE.\n", + "* https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "arSNhd_2KHL6" + }, + "source": [ + "___\n", + "# **References**\n", + "* [Feature Selection in Python — Recursive Feature Elimination](https://towardsdatascience.com/feature-selection-in-python-recursive-feature-elimination-19f1c39b8d15)\n", + "* [Feature Selection with sklearn and Pandas](https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cBLSvJTXHBjK" + }, + "source": [ + "___\n", + "# **CHEATSHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZdjR3nahUuKq" + }, + "source": [ + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ngLc7b9XiKxN" + }, + "source": [ + "___\n", + "# **3DP_FEATURE SELECTION**\n", + "## Introduction to Feature Selection\n", + "> Our goal with Feature Selection is to:\n", + "* Drop irrelevant columns;\n", + "* Drop columns with low correlation with the target variable;\n", + "* Drop columns with low variance;\n", + "* Drop columns with many NaN's.\n", + "\n", + "* Suggestions:\n", + "    * Normalize numeric columns;\n", + "    * Apply LabelEncoding or One Hot Encoding to categorical columns.\n", + "\n", + "![FeatureSelection](https://github.com/MathMachado/Materials/blob/master/FeatureSelection.png?raw=true)\n", + "\n", + 
"[Source](https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T9JCQatsiKxR" + }, + "source": [ + "from sklearn import feature_selection\n", + "import pandas as pd\n", + "import numpy as np\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9U6Az5qpiKxU" + }, + "source": [ + "___\n", + "# **VarianceThreshold**\n", + "* Drops variables/features whose variance is below a given threshold;\n", + "* This is an unsupervised method, i.e. the labeled variable (response or target variable) plays no role;\n", + "* **Intuition**: \n", + "    * Features/variables with low variance carry little information;\n", + "* **How it works**:\n", + "    * Computes the variance of each feature/variable and drops the columns with low variance\n", + "* **Caveats**:\n", + "    * Make sure the features/variables are on the same scale, for example with MinMaxScaler(). Avoid StandardScaler() for this purpose: it sets every variance to 1, so the threshold could no longer tell the features apart." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "euWJlVAViKxV", + "outputId": "0a16d5aa-7e5b-4db1-b5cc-e1a13f730cff", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + } + }, + "source": [ + "df = pd.DataFrame(\n", + "    {'sexo': ['m', 'm', 'f', 'm', 'm', 'm', 'm', 'm'], \n", + "     'b': [1, 2, 3, 1, 2, 1, 1, 1], \n", + "     'c': [1, 2, 3, 1, 2, 1, 1, 1]})\n", + "\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sexo b c\n", + "0 m 1 1\n", + "1 m 2 2\n", + "2 f 3 3\n", + "3 m 1 1\n", + "4 m 2 2\n", + "5 m 1 1\n", + "6 m 1 1\n", + "7 m 1 1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 212 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1rvj2MZ6Jtgq" + }, + "source": [ + "Next, we apply `sklearn.preprocessing.LabelEncoder` to the 'sexo' column:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I6L5L_wtTSUe" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder\n", + "le = LabelEncoder()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS2v_GnbiKxi", + "outputId": "5d026279-9b0f-4e3d-c9ac-3a9bff1be40e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + } + }, + "source": [ + "# Apply the LabelEncoder to the 'sexo' column:\n", + "df['sexo'] = le.fit_transform(df['sexo'])\n", + "df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sexo b c\n", + "0 1 1 1\n", + "1 1 2 2\n", + "2 0 3 3\n", + "3 1 1 1\n", + "4 1 2 2\n", + "5 1 1 1\n", + "6 1 1 1\n", + "7 1 1 1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 214 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1w-_VsJuWVHN", + "outputId": "3ad5cb6b-408e-4fd1-ba2e-2a98d7f55dda", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Variance of each feature/variable (pandas .var() is the sample variance, ddof=1):\n", + "l_variaveis= ['sexo', 'b', 'c']\n", + "print('Variance of the variables in dataframe df:')\n", + "for s_Var in l_variaveis:\n", + "    print(f'{s_Var}: {round(df[s_Var].var(),2)}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Variance of the variables in dataframe df:\n", + "sexo: 0.12\n", + "b: 0.57\n", + "c: 0.57\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3IITDmqUiKxo", + "outputId": "16e79dc1-1519-4750-f397-327b5455781d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Keep the features whose variance is greater than 0.20:\n", + "vt = feature_selection.VarianceThreshold(threshold= .20)\n", + "vt.fit_transform(df)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[1, 1],\n", + " [2, 2],\n", + " [3, 3],\n", + " [1, 1],\n", + " [2, 2],\n", + " [1, 1],\n", + " [1, 1],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 216 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tAOL215MiKxu", + "outputId": "cf1d0ae2-8fd9-4971-b5a3-8b246bb39ed0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Variance computed by VarianceThreshold() (population variance, ddof=0, hence slightly lower than pandas .var())\n", + "vt.variances_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.109375, 0.5 , 0.5 ])" 
+ ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 217 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yntaZtd98boc" + }, + "source": [ + "### What happened here? What is the conclusion?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FXyfpmWtiKyB" + }, + "source": [ + "___\n", + "# **ANOVA (Analysis Of Variance) with f_classif**\n", + "* Applies when the columns being tested are numeric and the target variable is discrete;\n", + "* ANOVA is a test for differences between groups/experiments. Here, **for each numeric column, the F-test checks whether the column's mean differs between the classes of the target**. Columns whose means barely differ between classes (low F, high p-value) say little about the target and are candidates for removal, which reduces the number of columns and the risk of overfitting." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yB7TC9VKiKyC" + }, + "source": [ + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "df_cancer = load_breast_cancer()\n", + "X_cancer = df_cancer.data\n", + "y_cancer = df_cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "N2_mE0Z5xxvL", + "outputId": "5b255dcb-ec6e-4ee4-e0f8-d0d89a14fc64", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_cancer" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,\n", + " 1.189e-01],\n", + " [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,\n", + " 8.902e-02],\n", + " [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,\n", + " 8.758e-02],\n", + " ...,\n", + " [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,\n", + " 7.820e-02],\n", + " [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,\n", + " 1.240e-01],\n", + " [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 
2.871e-01,\n", + " 7.039e-02]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 219 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t0QYORbSiKyL" + }, + "source": [ + "F, p_value = feature_selection.f_classif(X_cancer, y_cancer)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7BC6y7etiKyP", + "outputId": "b6cf6b7d-77d9-49fa-96f1-10983adf21b8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.round(F)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([647., 118., 697., 573., 84., 313., 534., 862., 70., 0., 269.,\n", + " 0., 254., 244., 3., 53., 39., 113., 0., 3., 861., 150.,\n", + " 898., 662., 122., 304., 437., 964., 119., 66.])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 221 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYSOfeH4iKyW" + }, + "source": [ + "* **Comment**: Above, each value is the ANOVA F statistic of one feature/column, a measure of its importance ==> **The higher, the better!**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "k1fHrPM4upex", + "outputId": "690c1ed2-9c4f-4518-a356-6486b3ffcf63", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.round(p_value, 2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.76, 0. ,\n", + " 0.84, 0. , 0. , 0.11, 0. , 0. , 0. , 0.88, 0.06, 0. , 0. ,\n", + " 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. 
])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 222 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1JYciSVkMW8f" + }, + "source": [ + "* **Comment**: Above, the p-values associated with each F statistic returned by f_classif ==> **The lower, the better!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1YVBLc7eu6_H" + }, + "source": [ + "## **Conclusion**: **Focus on the p_value**. If p_value < 0.05 ==> the variable is significant/relevant for the model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r29_gmCgiKyY" + }, + "source": [ + "___\n", + "# **Univariate Regression Test using f_regression**\n", + "* Linear model for testing the individual effect of each regressor;\n", + "* **How it works**:\n", + "    * Uses the correlation between each variable and the target variable;\n", + "    * The F-test measures the linear dependence;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IDzu3kCiKyZ", + "outputId": "b4941bbd-ebe8-4718-81b0-48666ada1d44", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from sklearn.datasets import fetch_california_housing\n", + "house_data = fetch_california_housing()\n", + "X_house, y_house = house_data.data, house_data.target\n", + "\n", + "X_house" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 8.3252 , 41. , 6.98412698, ..., 2.55555556,\n", + " 37.88 , -122.23 ],\n", + " [ 8.3014 , 21. , 6.23813708, ..., 2.10984183,\n", + " 37.86 , -122.22 ],\n", + " [ 7.2574 , 52. , 8.28813559, ..., 2.80225989,\n", + " 37.85 , -122.24 ],\n", + " ...,\n", + " [ 1.7 , 17. , 5.20554273, ..., 2.3256351 ,\n", + " 39.43 , -121.22 ],\n", + " [ 1.8672 , 18. , 5.32951289, ..., 2.12320917,\n", + " 39.43 , -121.32 ],\n", + " [ 2.3886 , 16. 
, 5.25471698, ..., 2.61698113,\n", + " 39.37 , -121.24 ]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 223 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1DS9T6WXw8qN", + "outputId": "038178a6-3f44-49da-f49c-695e8ff999c7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_house # Continuous variable" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 224 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uKYhjpEViKyl" + }, + "source": [ + "F, p_value = feature_selection.f_regression(X_house, y_house)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NEqZ3I4jiKyp", + "outputId": "44e87de3-1abc-4664-817b-93801dc8eb36", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.round(F, 2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1.855657e+04, 2.328400e+02, 4.877600e+02, 4.511000e+01,\n", + " 1.255000e+01, 1.164000e+01, 4.380100e+02, 4.370000e+01])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 226 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Rsaf7y8MiKyt" + }, + "source": [ + "### **Comments**: Columns with high F-values have a stronger linear relationship with the target and hence more predictive power in a linear model. Therefore, **the higher, the better**." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Fh80Xf8KG-Vj", + "outputId": "5e7aaa4e-563d-4643-b5f7-a7c6a16e5c21", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.round(p_value, 2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0., 0., 0., 0., 0., 0., 0., 0.])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 227 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DD-JKUQ1xjR8" + }, + "source": [ + "### **Conclusion**: **Focus on the p_value**. If p_value < 0.05 ==> the variable is statistically significant and therefore relevant to the model. Here all eight p-values round to 0, so every feature passes this test." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xvIXLHK9iKz8" + }, + "source": [ + "___\n", + "# **SelectFromModel**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "p0mtUVnjiKz8" + }, + "source": [ + "# load_boston was removed in scikit-learn 1.2; this cell needs an older release\n", + "from sklearn.datasets import load_boston\n", + "boston = load_boston()\n", + "X_boston, y_boston = boston.data, boston.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WY1c2U10iK0A" + }, + "source": [ + "from sklearn.linear_model import LinearRegression" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3SJDM-Bxc_UF", + "outputId": "0096e7b9-84c4-4bfd-f6ea-06edb31747bc", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Note below that the target variable is float, so this is a regression problem\n", + "y_boston" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,\n", + " 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,\n", + " 15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,\n", + " 13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,\n", + " 21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,\n", + " 35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,\n", + " 19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,\n", + " 20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,\n", + " 23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,\n", + " 33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,\n", + " 21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,\n", + " 20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,\n", + " 23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,\n", + " 15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,\n", + " 17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,\n", + " 25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,\n", + " 23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,\n", + " 32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,\n", + " 34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,\n", + " 20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,\n", + " 26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,\n", + " 31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,\n", + " 22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,\n", + " 42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,\n", + " 36.5, 22.8, 30.7, 50. 
, 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,\n", + " 32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,\n", + " 20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,\n", + " 20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,\n", + " 22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,\n", + " 21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,\n", + " 19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,\n", + " 32.7, 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9, 24.1,\n", + " 18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6, 25. , 19.9, 20.8,\n", + " 16.8, 21.9, 27.5, 21.9, 23.1, 50. , 50. , 50. , 50. , 50. , 13.8,\n", + " 13.8, 15. , 13.9, 13.3, 13.1, 10.2, 10.4, 10.9, 11.3, 12.3, 8.8,\n", + " 7.2, 10.5, 7.4, 10.2, 11.5, 15.1, 23.2, 9.7, 13.8, 12.7, 13.1,\n", + " 12.5, 8.5, 5. , 6.3, 5.6, 7.2, 12.1, 8.3, 8.5, 5. , 11.9,\n", + " 27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3, 7. , 7.2, 7.5, 10.4,\n", + " 8.8, 8.4, 16.7, 14.2, 20.8, 13.4, 11.7, 8.3, 10.2, 10.9, 11. ,\n", + " 9.5, 14.5, 14.1, 16.1, 14.3, 11.7, 13.4, 9.6, 8.7, 8.4, 12.8,\n", + " 10.5, 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. , 13.4,\n", + " 15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9, 20. , 16.4, 17.7,\n", + " 19.5, 20.2, 21.4, 19.9, 19. , 19.1, 19.1, 20.1, 19.9, 19.6, 23.2,\n", + " 29.8, 13.8, 13.3, 16.7, 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8,\n", + " 20.6, 21.2, 19.1, 20.6, 15.2, 7. , 8.1, 13.6, 20.1, 21.8, 24.5,\n", + " 23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. 
, 11.9])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 231 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2zNpau9HiK0C" + }, + "source": [ + "ml = LinearRegression()\n", + "# Keep only the features whose absolute coefficient is at least 0.25\n", + "sfm = feature_selection.SelectFromModel(ml, threshold = 0.25)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "62vOFTVViK0D", + "outputId": "a4a64f62-a280-4d40-8188-82b49ceda42d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Shape of the array that keeps only the most relevant columns\n", + "sfm.fit_transform(X_boston, y_boston).shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(506, 7)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 233 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y8_j7HRZiK0J", + "outputId": "9a732d60-aa2f-43d2-8c94-c5ff85b3a694", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Original data\n", + "X_boston.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(506, 13)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 234 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yX5oufYkcrH7" + }, + "source": [ + "### **Conclusion**: The number of columns was reduced from 13 to 7." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FaGrKVl0Re2A" + }, + "source": [ + "Below, the boolean mask indicating which columns were selected:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hYVIANETRE2p", + "outputId": "e31de91e-1f8e-4a20-c2f3-891a43632b44", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "l_variaveis_selecionadas = sfm.get_support()\n", + "l_variaveis_selecionadas" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([False, False, False, True, True, True, False, True, True,\n", + " False, True, False, True])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 235 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ol88OWzaVpJi" + }, + "source": [ + "___\n", + "# **Correlation Analysis**\n", + "* It is usually a good idea to drop highly correlated columns from the dataframe, because highly correlated columns carry largely redundant information.\n", + "\n", + "Source: [Better Heatmaps and Correlation Matrix Plots in Python](https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KWBe1v_6V5HB" + }, + "source": [ + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "from sklearn.datasets import load_breast_cancer\n", + "df_cancer = load_breast_cancer()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vDz4Byzk8yA9" + }, + "source": [ + "X_cancer = pd.DataFrame(df_cancer.data)\n", + "y_cancer = df_cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XIPITCn4cgjs" + }, + "source": [ + "Using Pearson correlation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0vHleiMlxuCG", + "outputId": "ee6e2dd6-a45f-4527-a548-52b3c708bc64", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 955 + } + }, + "source": [ + "# Compute the absolute correlation between the dataframe's columns/variables\n", + "correlacao = X_cancer.corr().abs()\n", + "\n", + "# Keep only the upper triangle of the correlation matrix (diagonal excluded)\n", + "correlacao = correlacao.where(np.triu(np.ones(correlacao.shape), k = 1).astype(bool))\n", + "correlacao" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01234567891011121314151617181920212223242526272829
0NaN0.3237820.9978550.9873570.1705810.5061240.6767640.8225290.1477410.3116310.6790900.0973170.6741720.7358640.2226000.2060000.1942040.3761690.1043210.0426410.9695390.2970080.9651370.9410820.1196160.4134630.5269110.7442140.1639530.007066
1NaNNaN0.3295330.3210860.0233890.2367020.3024180.2934640.0714010.0764370.2758690.3863580.2816730.2598450.0066140.1919750.1432930.1638510.0091270.0544580.3525730.9120450.3580400.3435460.0775030.2778300.3010250.2953160.1050080.119205
2NaNNaNNaN0.9865070.2072780.5569360.7161360.8509770.1830270.2614770.6917650.0867610.6931350.7449830.2026940.2507440.2280820.4072170.0816290.0055230.9694760.3030380.9703870.9415500.1505490.4557740.5638790.7712410.1891150.051019
3NaNNaNNaNNaN0.1770280.4985020.6859830.8232690.1512930.2831100.7325620.0662800.7266280.8000860.1667770.2125830.2076600.3723200.0724970.0198870.9627460.2874890.9591200.9592130.1235230.3904100.5126060.7220170.1435700.003738
4NaNNaNNaNNaNNaN0.6591230.5219840.5536950.5577750.5847920.3014670.0684060.2960920.2465520.3323750.3189430.2483960.3806760.2007740.2836070.2131200.0360720.2388530.2067180.8053240.4724680.4349260.5030530.3943090.499316
5NaNNaNNaNNaNNaNNaN0.8831210.8311350.6026410.5653690.4974730.0462050.5489050.4556530.1352990.7387220.5705170.6422620.2299770.5073180.5353150.2481330.5902100.5096040.5655410.8658090.8162750.8155730.5102230.687382
6NaNNaNNaNNaNNaNNaNNaN0.9213910.5006670.3367830.6319250.0762180.6603910.6174270.0985640.6702790.6912700.6832600.1780090.4493010.6882360.2998790.7295650.6759870.4488220.7549680.8841030.8613230.4094640.514930
7NaNNaNNaNNaNNaNNaNNaNNaN0.4624970.1669170.6980500.0214800.7106500.6902990.0276530.4904240.4391670.6156340.0953510.2575840.8303180.2927520.8559230.8096300.4527530.6674540.7523990.9101550.3757440.368661
8NaNNaNNaNNaNNaNNaNNaNNaNNaN0.4799210.3033790.1280530.3138930.2239700.1873210.4216590.3426270.3932980.4491370.3317860.1857280.0906510.2191690.1771930.4266750.4732000.4337210.4302970.6998260.438413
9NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0001110.1641740.0398300.0901700.4019640.5598370.4466300.3411980.3450070.6881320.2536910.0512690.2051510.2318540.5049420.4587980.3462340.1753250.3340190.767297
10NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.2132470.9727940.9518300.1645140.3560650.3323580.5133460.2405670.2277540.7150650.1947990.7196840.7515480.1419190.2871030.3805850.5310620.0945430.049559
11NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.2231710.1115670.3972430.2317000.1949980.2302830.4116210.2797230.1116900.4090030.1022420.0831950.0736580.0924390.0689560.1196380.1282150.045655
12NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.9376550.1510750.4163220.3624820.5562640.2664870.2441430.6972010.2003710.7210310.7307130.1300540.3419190.4188990.5548970.1099300.085433
13NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0751500.2848400.2708950.4157300.1341090.1270710.7573730.1964970.7612130.8114080.1253890.2832570.3851000.5381660.0741260.017539
14NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3366960.2686850.3284290.4135060.4273740.2306910.0747430.2173040.1821950.3144570.0555580.0582980.1020070.1073420.101480
15NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.8012680.7440830.3947130.8032690.2046070.1430030.2605160.1993710.2273940.6787800.6391470.4832080.2778780.590973
16NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.7718040.3094290.7273720.1869040.1002410.2266800.1883530.1684810.4848580.6625640.4404720.1977880.439329
17NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3127800.6110440.3581270.0867410.3949990.3422710.2153510.4528880.5495920.6024500.1431160.310655
18NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3690780.1281210.0774730.1037530.1103430.0126620.0602550.0371190.0304130.3894020.078079
19NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0374880.0031950.0010000.0227360.1705680.3901590.3799750.2152040.1110940.591328
20NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3599210.9937080.9840150.2165740.4758200.5739750.7874240.2435290.093492
21NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.3650980.3458420.2254290.3608320.3683660.3597550.2330270.219122
22NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.9775780.2367750.5294080.6183440.8163220.2694930.138957
23NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.2091450.4382960.5433310.7474190.2091460.079647
24NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.5681870.5185230.5476910.4938380.617624
25NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.8922610.8010800.6144410.810455
26NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.8554340.5325200.686511
27NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.5025280.511114
28NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.537848
29NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3 ... 26 27 28 29\n", + "0 NaN 0.323782 0.997855 0.987357 ... 0.526911 0.744214 0.163953 0.007066\n", + "1 NaN NaN 0.329533 0.321086 ... 0.301025 0.295316 0.105008 0.119205\n", + "2 NaN NaN NaN 0.986507 ... 0.563879 0.771241 0.189115 0.051019\n", + "3 NaN NaN NaN NaN ... 0.512606 0.722017 0.143570 0.003738\n", + "4 NaN NaN NaN NaN ... 0.434926 0.503053 0.394309 0.499316\n", + "5 NaN NaN NaN NaN ... 0.816275 0.815573 0.510223 0.687382\n", + "6 NaN NaN NaN NaN ... 0.884103 0.861323 0.409464 0.514930\n", + "7 NaN NaN NaN NaN ... 0.752399 0.910155 0.375744 0.368661\n", + "8 NaN NaN NaN NaN ... 0.433721 0.430297 0.699826 0.438413\n", + "9 NaN NaN NaN NaN ... 0.346234 0.175325 0.334019 0.767297\n", + "10 NaN NaN NaN NaN ... 0.380585 0.531062 0.094543 0.049559\n", + "11 NaN NaN NaN NaN ... 0.068956 0.119638 0.128215 0.045655\n", + "12 NaN NaN NaN NaN ... 0.418899 0.554897 0.109930 0.085433\n", + "13 NaN NaN NaN NaN ... 0.385100 0.538166 0.074126 0.017539\n", + "14 NaN NaN NaN NaN ... 0.058298 0.102007 0.107342 0.101480\n", + "15 NaN NaN NaN NaN ... 0.639147 0.483208 0.277878 0.590973\n", + "16 NaN NaN NaN NaN ... 0.662564 0.440472 0.197788 0.439329\n", + "17 NaN NaN NaN NaN ... 0.549592 0.602450 0.143116 0.310655\n", + "18 NaN NaN NaN NaN ... 0.037119 0.030413 0.389402 0.078079\n", + "19 NaN NaN NaN NaN ... 0.379975 0.215204 0.111094 0.591328\n", + "20 NaN NaN NaN NaN ... 0.573975 0.787424 0.243529 0.093492\n", + "21 NaN NaN NaN NaN ... 0.368366 0.359755 0.233027 0.219122\n", + "22 NaN NaN NaN NaN ... 0.618344 0.816322 0.269493 0.138957\n", + "23 NaN NaN NaN NaN ... 0.543331 0.747419 0.209146 0.079647\n", + "24 NaN NaN NaN NaN ... 0.518523 0.547691 0.493838 0.617624\n", + "25 NaN NaN NaN NaN ... 0.892261 0.801080 0.614441 0.810455\n", + "26 NaN NaN NaN NaN ... NaN 0.855434 0.532520 0.686511\n", + "27 NaN NaN NaN NaN ... NaN NaN 0.502528 0.511114\n", + "28 NaN NaN NaN NaN ... NaN NaN NaN 0.537848\n", + "29 NaN NaN NaN NaN ... 
NaN NaN NaN NaN\n", + "\n", + "[30 rows x 30 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 238 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gDxrqpPXxuCM", + "outputId": "ebd4ba38-df7c-4775-c12d-b20102a2e73c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 989 + } + }, + "source": [ + "fig, ax = plt.subplots(figsize = (20, 17))\n", + "# Mask the upper triangle (diagonal included) so each pair of columns is shown only once\n", + "mask = np.zeros_like(X_cancer.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(X_cancer.corr().abs(), mask = mask, ax = ax, cmap ='coolwarm', annot = True, fmt = '.2f')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 239 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2p0kgfS-ao5o" + }, + "source": [ + "As we can see, there are several highly correlated columns in the dataframe. Let's drop the highly correlated columns (automatically!) as follows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C7mUtTlFaoFx", + "outputId": "cb4734a6-f56a-48f0-9b23-cc478f88e9e8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 955 + } + }, + "source": [ + "set_variaveis_corr = set()\n", + "matrix_corr = X_cancer.corr()\n", + "matrix_corr" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01234567891011121314151617181920212223242526272829
01.0000000.3237820.9978550.9873570.1705810.5061240.6767640.8225290.147741-0.3116310.679090-0.0973170.6741720.735864-0.2226000.2060000.1942040.376169-0.104321-0.0426410.9695390.2970080.9651370.9410820.1196160.4134630.5269110.7442140.1639530.007066
10.3237821.0000000.3295330.321086-0.0233890.2367020.3024180.2934640.071401-0.0764370.2758690.3863580.2816730.2598450.0066140.1919750.1432930.1638510.0091270.0544580.3525730.9120450.3580400.3435460.0775030.2778300.3010250.2953160.1050080.119205
20.9978550.3295331.0000000.9865070.2072780.5569360.7161360.8509770.183027-0.2614770.691765-0.0867610.6931350.744983-0.2026940.2507440.2280820.407217-0.081629-0.0055230.9694760.3030380.9703870.9415500.1505490.4557740.5638790.7712410.1891150.051019
30.9873570.3210860.9865071.0000000.1770280.4985020.6859830.8232690.151293-0.2831100.732562-0.0662800.7266280.800086-0.1667770.2125830.2076600.372320-0.072497-0.0198870.9627460.2874890.9591200.9592130.1235230.3904100.5126060.7220170.1435700.003738
40.170581-0.0233890.2072780.1770281.0000000.6591230.5219840.5536950.5577750.5847920.3014670.0684060.2960920.2465520.3323750.3189430.2483960.3806760.2007740.2836070.2131200.0360720.2388530.2067180.8053240.4724680.4349260.5030530.3943090.499316
50.5061240.2367020.5569360.4985020.6591231.0000000.8831210.8311350.6026410.5653690.4974730.0462050.5489050.4556530.1352990.7387220.5705170.6422620.2299770.5073180.5353150.2481330.5902100.5096040.5655410.8658090.8162750.8155730.5102230.687382
60.6767640.3024180.7161360.6859830.5219840.8831211.0000000.9213910.5006670.3367830.6319250.0762180.6603910.6174270.0985640.6702790.6912700.6832600.1780090.4493010.6882360.2998790.7295650.6759870.4488220.7549680.8841030.8613230.4094640.514930
70.8225290.2934640.8509770.8232690.5536950.8311350.9213911.0000000.4624970.1669170.6980500.0214800.7106500.6902990.0276530.4904240.4391670.6156340.0953510.2575840.8303180.2927520.8559230.8096300.4527530.6674540.7523990.9101550.3757440.368661
80.1477410.0714010.1830270.1512930.5577750.6026410.5006670.4624971.0000000.4799210.3033790.1280530.3138930.2239700.1873210.4216590.3426270.3932980.4491370.3317860.1857280.0906510.2191690.1771930.4266750.4732000.4337210.4302970.6998260.438413
9-0.311631-0.076437-0.261477-0.2831100.5847920.5653690.3367830.1669170.4799211.0000000.0001110.1641740.039830-0.0901700.4019640.5598370.4466300.3411980.3450070.688132-0.253691-0.051269-0.205151-0.2318540.5049420.4587980.3462340.1753250.3340190.767297
100.6790900.2758690.6917650.7325620.3014670.4974730.6319250.6980500.3033790.0001111.0000000.2132470.9727940.9518300.1645140.3560650.3323580.5133460.2405670.2277540.7150650.1947990.7196840.7515480.1419190.2871030.3805850.5310620.0945430.049559
11-0.0973170.386358-0.086761-0.0662800.0684060.0462050.0762180.0214800.1280530.1641740.2132471.0000000.2231710.1115670.3972430.2317000.1949980.2302830.4116210.279723-0.1116900.409003-0.102242-0.083195-0.073658-0.092439-0.068956-0.119638-0.128215-0.045655
120.6741720.2816730.6931350.7266280.2960920.5489050.6603910.7106500.3138930.0398300.9727940.2231711.0000000.9376550.1510750.4163220.3624820.5562640.2664870.2441430.6972010.2003710.7210310.7307130.1300540.3419190.4188990.5548970.1099300.085433
130.7358640.2598450.7449830.8000860.2465520.4556530.6174270.6902990.223970-0.0901700.9518300.1115670.9376551.0000000.0751500.2848400.2708950.4157300.1341090.1270710.7573730.1964970.7612130.8114080.1253890.2832570.3851000.5381660.0741260.017539
14-0.2226000.006614-0.202694-0.1667770.3323750.1352990.0985640.0276530.1873210.4019640.1645140.3972430.1510750.0751501.0000000.3366960.2686850.3284290.4135060.427374-0.230691-0.074743-0.217304-0.1821950.314457-0.055558-0.058298-0.102007-0.1073420.101480
150.2060000.1919750.2507440.2125830.3189430.7387220.6702790.4904240.4216590.5598370.3560650.2317000.4163220.2848400.3366961.0000000.8012680.7440830.3947130.8032690.2046070.1430030.2605160.1993710.2273940.6787800.6391470.4832080.2778780.590973
160.1942040.1432930.2280820.2076600.2483960.5705170.6912700.4391670.3426270.4466300.3323580.1949980.3624820.2708950.2686850.8012681.0000000.7718040.3094290.7273720.1869040.1002410.2266800.1883530.1684810.4848580.6625640.4404720.1977880.439329
170.3761690.1638510.4072170.3723200.3806760.6422620.6832600.6156340.3932980.3411980.5133460.2302830.5562640.4157300.3284290.7440830.7718041.0000000.3127800.6110440.3581270.0867410.3949990.3422710.2153510.4528880.5495920.6024500.1431160.310655
18-0.1043210.009127-0.081629-0.0724970.2007740.2299770.1780090.0953510.4491370.3450070.2405670.4116210.2664870.1341090.4135060.3947130.3094290.3127801.0000000.369078-0.128121-0.077473-0.103753-0.110343-0.0126620.0602550.037119-0.0304130.3894020.078079
19-0.0426410.054458-0.005523-0.0198870.2836070.5073180.4493010.2575840.3317860.6881320.2277540.2797230.2441430.1270710.4273740.8032690.7273720.6110440.3690781.000000-0.037488-0.003195-0.001000-0.0227360.1705680.3901590.3799750.2152040.1110940.591328
200.9695390.3525730.9694760.9627460.2131200.5353150.6882360.8303180.185728-0.2536910.715065-0.1116900.6972010.757373-0.2306910.2046070.1869040.358127-0.128121-0.0374881.0000000.3599210.9937080.9840150.2165740.4758200.5739750.7874240.2435290.093492
210.2970080.9120450.3030380.2874890.0360720.2481330.2998790.2927520.090651-0.0512690.1947990.4090030.2003710.196497-0.0747430.1430030.1002410.086741-0.077473-0.0031950.3599211.0000000.3650980.3458420.2254290.3608320.3683660.3597550.2330270.219122
220.9651370.3580400.9703870.9591200.2388530.5902100.7295650.8559230.219169-0.2051510.719684-0.1022420.7210310.761213-0.2173040.2605160.2266800.394999-0.103753-0.0010000.9937080.3650981.0000000.9775780.2367750.5294080.6183440.8163220.2694930.138957
230.9410820.3435460.9415500.9592130.2067180.5096040.6759870.8096300.177193-0.2318540.751548-0.0831950.7307130.811408-0.1821950.1993710.1883530.342271-0.110343-0.0227360.9840150.3458420.9775781.0000000.2091450.4382960.5433310.7474190.2091460.079647
240.1196160.0775030.1505490.1235230.8053240.5655410.4488220.4527530.4266750.5049420.141919-0.0736580.1300540.1253890.3144570.2273940.1684810.215351-0.0126620.1705680.2165740.2254290.2367750.2091451.0000000.5681870.5185230.5476910.4938380.617624
250.4134630.2778300.4557740.3904100.4724680.8658090.7549680.6674540.4732000.4587980.287103-0.0924390.3419190.283257-0.0555580.6787800.4848580.4528880.0602550.3901590.4758200.3608320.5294080.4382960.5681871.0000000.8922610.8010800.6144410.810455
260.5269110.3010250.5638790.5126060.4349260.8162750.8841030.7523990.4337210.3462340.380585-0.0689560.4188990.385100-0.0582980.6391470.6625640.5495920.0371190.3799750.5739750.3683660.6183440.5433310.5185230.8922611.0000000.8554340.5325200.686511
270.7442140.2953160.7712410.7220170.5030530.8155730.8613230.9101550.4302970.1753250.531062-0.1196380.5548970.538166-0.1020070.4832080.4404720.602450-0.0304130.2152040.7874240.3597550.8163220.7474190.5476910.8010800.8554341.0000000.5025280.511114
280.1639530.1050080.1891150.1435700.3943090.5102230.4094640.3757440.6998260.3340190.094543-0.1282150.1099300.074126-0.1073420.2778780.1977880.1431160.3894020.1110940.2435290.2330270.2694930.2091460.4938380.6144410.5325200.5025281.0000000.537848
290.0070660.1192050.0510190.0037380.4993160.6873820.5149300.3686610.4384130.7672970.049559-0.0456550.0854330.0175390.1014800.5909730.4393290.3106550.0780790.5913280.0934920.2191220.1389570.0796470.6176240.8104550.6865110.5111140.5378481.000000
\n", + "
" + ], + "text/plain": [ + " 0 1 2 ... 27 28 29\n", + "0 1.000000 0.323782 0.997855 ... 0.744214 0.163953 0.007066\n", + "1 0.323782 1.000000 0.329533 ... 0.295316 0.105008 0.119205\n", + "2 0.997855 0.329533 1.000000 ... 0.771241 0.189115 0.051019\n", + "3 0.987357 0.321086 0.986507 ... 0.722017 0.143570 0.003738\n", + "4 0.170581 -0.023389 0.207278 ... 0.503053 0.394309 0.499316\n", + "5 0.506124 0.236702 0.556936 ... 0.815573 0.510223 0.687382\n", + "6 0.676764 0.302418 0.716136 ... 0.861323 0.409464 0.514930\n", + "7 0.822529 0.293464 0.850977 ... 0.910155 0.375744 0.368661\n", + "8 0.147741 0.071401 0.183027 ... 0.430297 0.699826 0.438413\n", + "9 -0.311631 -0.076437 -0.261477 ... 0.175325 0.334019 0.767297\n", + "10 0.679090 0.275869 0.691765 ... 0.531062 0.094543 0.049559\n", + "11 -0.097317 0.386358 -0.086761 ... -0.119638 -0.128215 -0.045655\n", + "12 0.674172 0.281673 0.693135 ... 0.554897 0.109930 0.085433\n", + "13 0.735864 0.259845 0.744983 ... 0.538166 0.074126 0.017539\n", + "14 -0.222600 0.006614 -0.202694 ... -0.102007 -0.107342 0.101480\n", + "15 0.206000 0.191975 0.250744 ... 0.483208 0.277878 0.590973\n", + "16 0.194204 0.143293 0.228082 ... 0.440472 0.197788 0.439329\n", + "17 0.376169 0.163851 0.407217 ... 0.602450 0.143116 0.310655\n", + "18 -0.104321 0.009127 -0.081629 ... -0.030413 0.389402 0.078079\n", + "19 -0.042641 0.054458 -0.005523 ... 0.215204 0.111094 0.591328\n", + "20 0.969539 0.352573 0.969476 ... 0.787424 0.243529 0.093492\n", + "21 0.297008 0.912045 0.303038 ... 0.359755 0.233027 0.219122\n", + "22 0.965137 0.358040 0.970387 ... 0.816322 0.269493 0.138957\n", + "23 0.941082 0.343546 0.941550 ... 0.747419 0.209146 0.079647\n", + "24 0.119616 0.077503 0.150549 ... 0.547691 0.493838 0.617624\n", + "25 0.413463 0.277830 0.455774 ... 0.801080 0.614441 0.810455\n", + "26 0.526911 0.301025 0.563879 ... 0.855434 0.532520 0.686511\n", + "27 0.744214 0.295316 0.771241 ... 
1.000000 0.502528 0.511114\n", + "28 0.163953 0.105008 0.189115 ... 0.502528 1.000000 0.537848\n", + "29 0.007066 0.119205 0.051019 ... 0.511114 0.537848 1.000000\n", + "\n", + "[30 rows x 30 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 240 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9lvLldX5eWVW", + "outputId": "3a8f8ccb-e59e-408e-cd99-1f7439029321", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "for i in range(len(matrix_corr.columns)):\n", + "    for j in range(i):\n", + "        if abs(matrix_corr.iloc[i, j]) > 0.8:\n", + "            colname = matrix_corr.columns[i]\n", + "            set_variaveis_corr.add(colname)\n", + "\n", + "set_variaveis_corr" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{2, 3, 6, 7, 12, 13, 16, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 241 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R4YC3Kl5erhc" + }, + "source": [ + "Dropping the highly correlated columns from the dataframe and recomputing the correlation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "psDDQlrSevG5", + "outputId": "a2aea1b6-1826-4147-d8a5-0ccd7c95bc47", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 989 + } + }, + "source": [ + "X_cancer = X_cancer.drop(set_variaveis_corr, axis = 1)\n", + "\n", + "fig, ax = plt.subplots(figsize = (20, 17)) \n", + "mask = np.zeros_like(X_cancer.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(X_cancer.corr().abs(), mask = mask, ax = ax, cmap='coolwarm', annot = True, fmt = '.2f')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 242 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1oDecB_8fGwM" + }, + "source": [ + "### **Conclusion**: What conclusion can we draw from this analysis?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "avZ12nHqiK0M" + }, + "source": [ + "___\n", + "# **RFE - Recursive Feature Elimination** (continuation of the Correlation Analysis)\n", + "* RFE takes a long time to run, so drop the highly correlated columns from the dataframe beforehand.\n", + "* The matrix X and the target used in this topic come from the previous topic, \"Correlation Analysis\".\n", + "\n", + "* Recommended reading: [Feature Selection in Python — Recursive Feature Elimination](https://towardsdatascience.com/feature-selection-in-python-recursive-feature-elimination-19f1c39b8d15)\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MCGTzI59R3G7" + }, + "source": [ + "from sklearn.feature_selection import RFECV\n", + "from sklearn.ensemble import RandomForestRegressor\n", + "from sklearn.model_selection import StratifiedKFold" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sqsK1bGmiK0O", + "outputId": "3a989f0a-6324-4b9c-a668-c0b512558223", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "rf = RandomForestRegressor(random_state = 20111974)\n", + "filtro_rfe = RFECV(estimator = rf, step = 1, cv = StratifiedKFold(10))\n", + "filtro_rfe.fit(X_cancer, y_cancer)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),\n", + " estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,\n", + " criterion='mse', max_depth=None,\n", + " max_features='auto', max_leaf_nodes=None,\n", + " max_samples=None,\n", + " min_impurity_decrease=0.0,\n", + " min_impurity_split=None,\n", + " 
min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0,\n", + " n_estimators=100, n_jobs=None,\n", + " oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False),\n", + " min_features_to_select=1, n_jobs=None, scoring=None, step=1, verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 244 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MBnSFtWiDvOd", + "outputId": "863be880-0320-4154-c1a8-06fd72c65133", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_cancer.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(569, 13)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 245 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yomSdCzJD2tS", + "outputId": "f956d8e0-6038-4eed-b908-a1ee5d80762c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_cancer.size" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "569" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 246 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f-dKku9IiK0V", + "outputId": "aa8192fa-dbe9-4399-caf0-10ecd83b4e4a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Optimal number of columns:\n", + "filtro_rfe.n_features_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "8" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 247 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TUoMb40-iK0Y", + "outputId": "fc509029-ce5e-4102-c3a3-715e0e0af10c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 605 + } + }, + "source": [ + "plt.figure(figsize=(16, 9))\n", + "plt.title('Recursive Feature Elimination (RFE) with Cross-Validation', 
fontsize=18, fontweight='bold', pad=20)\n", + "plt.xlabel('Number of selected columns', fontsize=14, labelpad=20)\n", + "plt.ylabel('Cross-validated model score (R^2)', fontsize=14, labelpad=20)\n", + "plt.plot(range(1, len(filtro_rfe.grid_scores_) + 1), filtro_rfe.grid_scores_, color='#303F9F', linewidth=3)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RG16C-2QdhUx" + }, + "source": [ + "### **Conclusion**: RFECV reduced the dataframe from 13 to 8 columns." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M8lpuOjti6Fw", + "outputId": "cf3d3eeb-0b63-40f1-8b4e-a2c581638b9f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "X_cancer.drop(X_cancer.columns[np.where(filtro_rfe.support_ == False)[0]], axis = 1, inplace = True)\n", + "X_cancer.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       0      1        4        5      10       15       18      28
0  17.99  10.38  0.11840  0.27760  1.0950  0.04904  0.03003  0.4601
1  20.57  17.77  0.08474  0.07864  0.5435  0.01308  0.01389  0.2750
2  19.69  21.25  0.10960  0.15990  0.7456  0.04006  0.02250  0.3613
3  11.42  20.38  0.14250  0.28390  0.4956  0.07458  0.05963  0.6638
4  20.29  14.34  0.10030  0.13280  0.7572  0.02461  0.01756  0.2364
\n", + "
" + ], + "text/plain": [ + " 0 1 4 5 10 15 18 28\n", + "0 17.99 10.38 0.11840 0.27760 1.0950 0.04904 0.03003 0.4601\n", + "1 20.57 17.77 0.08474 0.07864 0.5435 0.01308 0.01389 0.2750\n", + "2 19.69 21.25 0.10960 0.15990 0.7456 0.04006 0.02250 0.3613\n", + "3 11.42 20.38 0.14250 0.28390 0.4956 0.07458 0.05963 0.6638\n", + "4 20.29 14.34 0.10030 0.13280 0.7572 0.02461 0.01756 0.2364" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 249 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s7S87LApdESo", + "outputId": "096f4cc6-45fe-49f5-c4e8-8d098d3001f2", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "filtro_rfe.estimator_.feature_importances_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.6274782 , 0.06895439, 0.02932265, 0.06823762, 0.05093922,\n", + " 0.01941762, 0.03383343, 0.10181687])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 250 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FEfmeic37hvw" + }, + "source": [ + "___\n", + "# **Feature Selection with Random Forest**\n", + "* To demonstrate this method, I will use the Iris dataframe.\n", + "\n", + "![Supervised_X_Unsupervised](https://github.com/MathMachado/Materials/blob/master/Supervised_X_Unsupervised.jpeg?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0F2BdrZgKzV5" + }, + "source": [ + "### Load the dataframe\n", + "* [Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you will find more information about the Iris dataframe. Check it out." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6H31U15q7kIO" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "# Function that loads the Iris dataframe\n", + "def carrega_df_iris():\n", + "    global df_iris, l_iris_labels, X_iris, y_iris, iris\n", + "\n", + "    iris = load_iris()\n", + "    X_iris = iris['data']\n", + "    y_iris= iris['target']\n", + "\n", + "    df_iris = pd.DataFrame(np.c_[X_iris, y_iris], columns= np.append(iris['feature_names'], ['target']))\n", + "    df_iris['target2']= df_iris['target']\n", + "    df_iris= df_iris.rename(columns={'sepal length (cm)': 'Sepal Length', 'sepal width (cm)': 'sepal width', 'petal length (cm)': 'petal length', 'petal width (cm)': 'petal width'})\n", + "    df_iris['target'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", + "\n", + "    # Build the list of variable names\n", + "    l_iris_labels = ['Sepal Length','Sepal Width','Petal Length','Petal Width']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rD2DmkpNXkFy" + }, + "source": [ + "# Load the Iris dataframe\n", + "carrega_df_iris()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jVQuRHYhM4fD" + }, + "source": [ + "> The response variable we are trying to predict/explain is categorical. Therefore, we will use a supervised Classification algorithm.\n", + "\n", + "* By default, SelectFromModel selects the features whose importance is at or above the mean importance of all features, but we can change this threshold if we want." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1pPpC7GXLgpC" + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.feature_selection import SelectFromModel" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dfSfuUlHQOSt" + }, + "source": [ + "# Split the data into training (80%) and test (20%) sets\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_iris, y_iris, test_size = 0.2, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0JDsdjsZ0M4F", + "outputId": "71d1931c-9569-4a44-d7b0-7a7cfb789fa5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(120, 4)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 255 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SH_B3C1u0Qkl", + "outputId": "1e00e16e-ed0b-4317-fe46-f0f5f0f46562", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_teste.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(30, 4)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 256 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YQ270kclOFeK" + }, + "source": [ + "# Create a random forest classifier\n", + "ml_rf = RandomForestClassifier(n_estimators = 10000, random_state = 20111974, n_jobs = -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vFSCt8uaeKFN", + "outputId": "154735f9-3388-4023-9de6-cabd217c793f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# 
Train the classifier\n", + "ml_rf.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=None, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, n_estimators=10000,\n", + " n_jobs=-1, oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 258 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VfdKeUkgS6ul" + }, + "source": [ + "The most important features are:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tnrwVLPKSNxr", + "outputId": "2a3db4d0-5794-48ad-ca10-ba988b399e85", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Print each feature name with its importance (based on the Gini index)\n", + "for feature in zip(l_iris_labels, ml_rf.feature_importances_):\n", + "    print(feature)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "('Sepal Length', 0.08731002037613723)\n", + "('Sepal Width', 0.021750035432184116)\n", + "('Petal Length', 0.39132734233988486)\n", + "('Petal Width', 0.4996126018517938)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x8FHRHlDWTn4" + }, + "source": [ + "* The scores above represent the importance of each variable.\n", + "  * The scores sum to 1 (100%);\n", + "  * The features 'Petal Width' (score ≈ 0.50) and 'Petal Length' (score ≈ 0.39) are the most important.\n", + "  * Combined, the two most important variables account for ~0.89." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wbMnS_gIVBA8" + }, + "source": [ + "As a rule of thumb, select the features whose importance is at least 0.15. \n", + "\n", + "TODO: cite the author/reference for this rule!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M3TnDYRVeMEs" + }, + "source": [ + "Something more visual:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o8QkARWpeI_e" + }, + "source": [ + "def importancia_variaveis():\n", + "    # Get the feature importances from the trained model\n", + "    importances = ml_rf.feature_importances_\n", + "\n", + "    # Sort the features by importance, in descending order\n", + "    indices = np.argsort(importances)[::-1]\n", + "\n", + "    # Match each feature name to its importance\n", + "    names = [iris.feature_names[i] for i in indices]\n", + "\n", + "    # Barplot\n", + "    plt.bar(range(X_treinamento.shape[1]), importances[indices])\n", + "\n", + "    # Add the feature names to the x-axis\n", + "    plt.xticks(range(X_treinamento.shape[1]), names, rotation = 20, fontsize = 8)\n", + "\n", + "    # Set the plot title\n", + "    plt.title(\"Predictive importance of the variables\")\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ahZdlCBE6h_e", + "outputId": "1d263536-5b65-489a-905b-1bc6a4e54e1f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 304 + } + }, + "source": [ + "importancia_variaveis()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "owVs6pvJ8F8B" + }, + "source": [ + "## Correlation Analysis" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RISO1Ury8EEH", + "outputId": "943423c9-1aa1-448c-b737-ee23adda9e31", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "# Compute the absolute correlation between the columns/variables of the dataframe\n", + "correlacao = df_iris.corr().abs()\n", + "\n", + "# Keep only the upper triangle of the correlation matrix\n", + "correlacao = correlacao.where(np.triu(np.ones(correlacao.shape), k = 1).astype(bool))\n", + "correlacao" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Sepal Lengthsepal widthpetal lengthpetal widthtarget2
Sepal LengthNaN0.117570.8717540.8179410.782561
sepal widthNaNNaN0.4284400.3661260.426658
petal lengthNaNNaNNaN0.9628650.949035
petal widthNaNNaNNaNNaN0.956547
target2NaNNaNNaNNaNNaN
\n", + "
" + ], + "text/plain": [ + " Sepal Length sepal width petal length petal width target2\n", + "Sepal Length NaN 0.11757 0.871754 0.817941 0.782561\n", + "sepal width NaN NaN 0.428440 0.366126 0.426658\n", + "petal length NaN NaN NaN 0.962865 0.949035\n", + "petal width NaN NaN NaN NaN 0.956547\n", + "target2 NaN NaN NaN NaN NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 262 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lieB7hDg8EEM", + "outputId": "a6c1c02d-3c2f-41e8-c1e5-6154f34d95ae", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 500 + } + }, + "source": [ + "fig, ax = plt.subplots(figsize = (8, 8)) \n", + "mask = np.zeros_like(df_iris.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(df_iris.corr().abs(), mask = mask, ax = ax, cmap ='coolwarm', annot = True, fmt = '.2f')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 263 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A_FXdDMr8EEQ" + }, + "source": [ + "> Pela Análise de Correlação, vemos duas variáveis altamente correlacionadas com a variável-resposta, que são: 'Petal Width' e 'Petal Length', que são as duas variáveis mais importantes no dataframe. Lembram-se?\n", + ">> No entanto, confira a correlação entre 'Petal Width' e 'Petal Length'. Observou que a correlação entre elas é de 0.96? Estas variáveis são altamente correlacionadas..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "taei2KXSTmZ0" + }, + "source": [ + "### Usando SelectFromModel()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JLwboa9tTpBq", + "outputId": "ea927ac2-d0c4-4e10-ae72-7f9014aae8bb", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# A partir do Random Forest, seleciona features cuja importância seja maior que 0.15 e 0.45\n", + "sfm = SelectFromModel(rf, threshold = 0.15)\n", + "sfm_2 = SelectFromModel(rf, threshold = 0.45)\n", + "\n", + "# Treina o seletor\n", + "sfm.fit(X_treinamento, y_treinamento)\n", + "sfm_2.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SelectFromModel(estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,\n", + " criterion='mse', max_depth=None,\n", + " max_features='auto',\n", + " max_leaf_nodes=None,\n", + " max_samples=None,\n", + " min_impurity_decrease=0.0,\n", + " min_impurity_split=None,\n", + " min_samples_leaf=1,\n", + " min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0,\n", + " n_estimators=100, n_jobs=None,\n", + " oob_score=False,\n", + " random_state=20111974,\n", + " verbose=0, warm_start=False),\n", + " max_features=None, norm_order=1, prefit=False, threshold=0.45)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 264 + } + 
] + }, + { + "cell_type": "code", + "metadata": { + "id": "2MiZrU56VzUW", + "outputId": "d64749e9-8883-4432-e68d-bfe4c1b8c7f1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Show the features selected by sfm\n", + "for feature_list_index in sfm.get_support(indices = True):\n", + " print(l_iris_labels[feature_list_index])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Petal Length\n", + "Petal Width\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M0junBxr79Th", + "outputId": "e150591e-1db0-4754-8999-f5e198d3bd60", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Show the features selected by sfm_2\n", + "for feature_list_index in sfm_2.get_support(indices = True):\n", + " print(l_iris_labels[feature_list_index])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Petal Width\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "neMolQ0gYtp7" + }, + "source": [ + "Keeping only the most important features:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dXRWZgDeYxHS" + }, + "source": [ + "# Build datasets that contain only the most important variables\n", + "# Note: the transformation is applied to both the training set and the test set\n", + "X_treinamento_rfi = sfm.transform(X_treinamento)\n", + "X_teste_rfi = sfm.transform(X_teste)\n", + "\n", + "X_treinamento_rfi_2 = sfm_2.transform(X_treinamento)\n", + "X_teste_rfi_2 = sfm_2.transform(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1tmpt3YvZFvW", + "outputId": "a214e4c8-b1f9-4599-b018-757235a1aa0d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Create Random Forest classifiers that use only the most important features\n", + "clf_rfi = RandomForestClassifier(n_estimators = 10000, random_state = 20111974, n_jobs = 50)\n", + "clf_rfi_2 = RandomForestClassifier(n_estimators = 10000, random_state = 20111974, n_jobs = 50)\n", + "\n", + "# Train the models on the most important features\n", + "clf_rfi.fit(X_treinamento_rfi, y_treinamento)\n", + "clf_rfi_2.fit(X_treinamento_rfi_2, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=None, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, n_estimators=10000,\n", + " n_jobs=50, oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 268 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "epVwhUeYZM7v", + "outputId": "1503fb45-653c-4c2a-9304-c3d0689b300d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + } + }, + "source": [ + "from sklearn.metrics import accuracy_score\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "\n", + "# Apply the classifier to the test set\n", + "y_pred = clf_rfi.predict(X_teste_rfi)\n", + "\n", + "# Check the accuracy\n", + "accuracy_score(y_teste, y_pred)\n", + "\n", + "# Confusion matrix\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "cm = cm(y_teste, y_pred) \n", + "index = ['setosa', 'versicolor', 'virginica'] \n", + "columns = ['setosa','versicolor', 'virginica'] \n", + "cm_df = pd.DataFrame(cm, columns, index) \n", + "plt.figure(figsize = (10, 6)) \n", + "cm_df.index.name = 'Actual'\n", + "cm_df.columns.name = 'Predicted'\n", + "sns.heatmap(cm_df, annot = True)" + ], + 
"execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 269 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CA_rhTMiZbLN", + "outputId": "a1af75c9-690e-49a0-ed79-87259804b06b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + } + }, + "source": [ + "# Aplica o classificador na base de teste\n", + "y_pred_rfi = clf_rfi.predict(X_teste_rfi)\n", + "\n", + "# Avalia acurácia\n", + "accuracy_score(y_teste, y_pred_rfi)\n", + "\n", + "# Matriz de Confusão\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "cm = cm(y_teste, y_pred_rfi) \n", + "index = ['setosa','versicolor','virginica'] \n", + "columns = ['setosa','versicolor','virginica'] \n", + "cm_df = pd.DataFrame(cm,columns,index) \n", + "plt.figure(figsize=(10,6)) \n", + "cm_df.index.name = 'Actual'\n", + "cm_df.columns.name = 'Predicted'\n", + "sns.heatmap(cm_df, annot=True)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 270 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lIYvP0Bq8ST_", + "outputId": "c16ef565-c421-42a2-a9c9-48fc04d4d6bc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + } + }, + "source": [ + "# Aplica o classificador na base de teste depois da análise de correlação\n", + "y_pred_rfi_2 = clf_rfi_2.predict(X_teste_rfi_2)\n", + "\n", + "# Avalia acurácia\n", + "accuracy_score(y_teste, y_pred_rfi_2)\n", + "\n", + "# Matriz de Confusão\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "cm = cm(y_teste, y_pred_rfi_2) \n", + "index = ['setosa','versicolor', 'virginica'] \n", + "columns = ['setosa','versicolor', 'virginica'] \n", + "cm_df = pd.DataFrame(cm,columns, index) \n", + "plt.figure(figsize = (10, 6)) \n", + "cm_df.index.name = 'Actual'\n", + "cm_df.columns.name = 'Predicted'\n", + "sns.heatmap(cm_df, annot = True)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 271 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uWdAUMQoZtaV" + }, + "source": [ + "> Como podemos ver:\n", + "* Modelo original (com 4 atributos) presenta acurácia de 93.3%;\n", + "* Modelo reduzido (com 2 atributos) apresenta acurácia de 93%;\n", + "* Modelo reduzido 2 (com 1 atributo) apresenta acurácia de 93%.\n", + "\n", + ">> Ou seja, reduzimos o modelo de 4 para 1 atributo/variável e a acurária continua a mesma." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OfCq7UGIpSYg", + "outputId": "5bec1c14-5210-4e28-84a2-b795ef78fe5a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 106 + } + }, + "source": [ + "# Correlação dois a dois...\n", + "df_iris[['petal length', 'petal width']].corr()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
petal lengthpetal width
petal length1.0000000.962865
petal width0.9628651.000000
\n", + "
" + ], + "text/plain": [ + " petal length petal width\n", + "petal length 1.000000 0.962865\n", + "petal width 0.962865 1.000000" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 272 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BMfxK3UCbjfc" + }, + "source": [ + "## Feature Selection With XGBoost (Extreme Gradient Boosting)\n", + "> XGBoost, em geral, fornece melhores soluções do que outros algoritmos de Machine Learning." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8ZYGjr0Su4y-", + "outputId": "919fa042-5b49-4378-a46b-cf33f6469e48", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "!pip install xgboost" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (0.90)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.4.1)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.18.5)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "--vKKHVWbwGv" + }, + "source": [ + "from xgboost import XGBClassifier\n", + "\n", + "# Carregar as informações do dataframe Iris\n", + "carrega_df_iris()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_d5qWQmXgPIH", + "outputId": "1a2fe686-b70e-4812-d129-304cefd93611", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Cria um clasificador XGBoost\n", + "clf = XGBClassifier(n_estimators = 10000, random_state = 20111974, n_jobs = 50, max_depth = 5, learning_rate = 0.05)\n", + "\n", + "# Treina o classificador\n", + "clf.fit(X_treinamento, y_treinamento)\n", + "\n", + "# Calcula o y_pred e avalia a qualidade do ajuste\n", + "y_pred = clf.predict(X_teste)\n", + "predictions = [round(value) for value in y_pred]\n", 
+ "accuracy = accuracy_score(y_teste, predictions)\n", + "print(f\"Acurácia: {accuracy}\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácia: 0.8666666666666667\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fTdKMOKdC2UA", + "outputId": "0421c20e-1af7-473b-f0fa-4a77c4d1d3dd", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Adapted from https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/\n", + "# Refit the model using each computed importance as the threshold\n", + "\n", + "thresholds = sorted(clf.feature_importances_)\n", + "for thresh in thresholds:\n", + "\t# select the features using the threshold\n", + "\tselection = SelectFromModel(clf, threshold=thresh, prefit=True)\n", + "\tselect_X_treinamento = selection.transform(X_treinamento)\n", + "\t\n", + "\t# train the model\n", + "\tselection_clf = XGBClassifier()\n", + "\tselection_clf.fit(select_X_treinamento, y_treinamento)\n", + "\t\n", + "\t# evaluate the model\n", + "\tselect_X_teste = selection.transform(X_teste)\n", + "\ty_pred = selection_clf.predict(select_X_teste)\n", + "\tpredictions = [round(value) for value in y_pred]\n", + "\taccuracy = accuracy_score(y_teste, predictions)\n", + "\tprint(f\"Threshold= {round(thresh,2)}, n= {select_X_treinamento.shape[1]}, Acurácia: {round(accuracy*100.0,2)}\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Threshold= 0.009999999776482582, n= 4, Acurácia: 86.67\n", + "Threshold= 0.03999999910593033, n= 3, Acurácia: 86.67\n", + "Threshold= 0.44999998807907104, n= 2, Acurácia: 86.67\n", + "Threshold= 0.5, n= 1, Acurácia: 86.67\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zv2gbuc5glFJ", + "outputId": "490e0459-db4a-4406-f4f4-5f5d70519cce", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 304 + } + }, + 
"source": [ + "# Calcula a importância das features\n", + "importances = clf.feature_importances_\n", + "\n", + "# Ordena as importâncias por ordem descendente\n", + "indices = np.argsort(importances)[::-1]\n", + "\n", + "# Organiza...\n", + "names = [iris.feature_names[i] for i in indices]\n", + "\n", + "# Barplot\n", + "plt.bar(range(X_treinamento.shape[1]), importances[indices])\n", + "\n", + "# Coloca o nome dos labels no eixo X\n", + "plt.xticks(range(X_treinamento.shape[1]), names, rotation=20, fontsize = 8)\n", + "\n", + "# Constroi o gráfico\n", + "plt.title(\"Feature Importance\")\n", + "\n", + "# Mostra o gráfico\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cXTwXiB_LVB3", + "outputId": "83b9f975-bd71-4604-aded-b8310da30193", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 405 + } + }, + "source": [ + "# Matriz de Confusão\n", + "from sklearn.metrics import confusion_matrix as cm\n", + "cm = cm(y_teste, y_pred) \n", + "index = ['setosa','versicolor','virginica'] \n", + "columns = ['setosa','versicolor','virginica'] \n", + "cm_df = pd.DataFrame(cm,columns,index) \n", + "plt.figure(figsize = (10, 6)) \n", + "cm_df.index.name = 'Actual'\n", + "cm_df.columns.name = 'Predicted'\n", + "sns.heatmap(cm_df, annot=True)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 278 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_ebv3nAzU2ac" + }, + "source": [ + "## Feature Selection using PCA (Principal Components Analysis)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8M5uO9r-Vtze" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "# Carregar as informações do dataframe Iris\n", + "carrega_df_iris()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wWS8bvXzX4Fg" + }, + "source": [ + "### Standardize the Data\n", + "* O PCA é afetado por escala, portanto, é necessário dimensionar as features/atributos antes de aplicar o PCA.\n", + "* Use o StandardScaler para padronizar os features/atributos usando com média = 0 e variância = 1." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2oVG8_1HXweo" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.decomposition import PCA" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xYXnYebxclya" + }, + "source": [ + "Standardizing as features de X:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iDMzHm3mcpbs" + }, + "source": [ + "X_STD = StandardScaler().fit_transform(X_iris) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MmKMCuvMc63E" + }, + "source": [ + "pca_2c = PCA(n_components = 2)\n", + "X_PCA_2c = pca_2c.fit_transform(X_STD)\n", + "df_PCA_2c = pd.DataFrame(data = X_PCA_2c, columns = ['PCA1', 'PCA2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0Yfvb02JdV8B" + }, + "source": [ + "Vamos entender o que está acontecendo:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-PVc1vJ8d_w6" + }, + "source": [ + "Primeiramente, observe nosso array X 
abaixo. Cada coluna desse array representa uma coluna do dataframe df_iris. Por exemplo, a primeira coluna são os dados da variável 'Sepal Length'. Identificou?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BEp1JD0Odd3L", + "outputId": "28a25a6b-d24a-4e3d-cbf0-2f3af24b4298", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Listando as primeiras 5 linhas de X\n", + "X_iris[0:5]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[5.1, 3.5, 1.4, 0.2],\n", + " [4.9, 3. , 1.4, 0.2],\n", + " [4.7, 3.2, 1.3, 0.2],\n", + " [4.6, 3.1, 1.5, 0.2],\n", + " [5. , 3.6, 1.4, 0.2]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 286 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "41-KcSTneURx" + }, + "source": [ + "Segundo, com a standardização, construimos o array X_STD, que mostramos abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "igBQNHS5eaND", + "outputId": "93a6ece8-138e-4f56-f44a-061c45c7380e", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_STD[0:5]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.90068117, 1.01900435, -1.34022653, -1.3154443 ],\n", + " [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],\n", + " [-1.38535265, 0.32841405, -1.39706395, -1.3154443 ],\n", + " [-1.50652052, 0.09821729, -1.2833891 , -1.3154443 ],\n", + " [-1.02184904, 1.24920112, -1.34022653, -1.3154443 ]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 287 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TgRD7-qPjg29" + }, + "source": [ + "Veja abaixo a média e desvio-padrão do array X_STD:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M0VL4ilZjliL", + "outputId": "cd424bdd-ef89-45b1-c965-2695797f5a87", + "colab": { + "base_uri": 
"https://localhost:8080/" + } + }, + "source": [ + "np.mean(X_STD),np.std(X_STD)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(-1.4684549872375404e-15, 1.0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 288 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pyDJwNCgju0O" + }, + "source": [ + "Temos média 0 e desvio-padrão 1, certo? É isso que queríamos." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KB7R7OXQemze" + }, + "source": [ + "Por fim, a partir de X_STD, construimos o array X_PCA_2c, mostrado abaixo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lNmAskXWerqG", + "outputId": "2f417913-1e00-445e-a990-6a1cbce2b417", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_PCA_2c[0:5]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-2.26470281, 0.4800266 ],\n", + " [-2.08096115, -0.67413356],\n", + " [-2.36422905, -0.34190802],\n", + " [-2.29938422, -0.59739451],\n", + " [-2.38984217, 0.64683538]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 289 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fT2N6Ym7fBt-" + }, + "source": [ + "Portanto, reduzimos (ou resumimos) o array X_STD de 4 dimensões para um array de 2 dimensões." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cA54fhYgfQuC" + }, + "source": [ + "The final dataframe is shown below:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kAQ4O9EBfUlN" + }, + "source": [ + "df_PCA_final_2c = pd.concat([df_PCA_2c, df_iris[['target']]], axis= 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JXrOdfSZPBS_", + "outputId": "21e43aaa-c78d-4779-86e4-e4ec6864b1e9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_PCA_final_2c.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PCA1PCA2target
0-2.2647030.480027setosa
1-2.080961-0.674134setosa
2-2.364229-0.341908setosa
3-2.299384-0.597395setosa
4-2.3898420.646835setosa
\n", + "
" + ], + "text/plain": [ + " PCA1 PCA2 target\n", + "0 -2.264703 0.480027 setosa\n", + "1 -2.080961 -0.674134 setosa\n", + "2 -2.364229 -0.341908 setosa\n", + "3 -2.299384 -0.597395 setosa\n", + "4 -2.389842 0.646835 setosa" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 291 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MEbvp3RFf-zs" + }, + "source": [ + "### Visualizar reultados" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GEqP-NVngBO1", + "outputId": "8ff483a3-ba1b-4579-deb2-400020f310c7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 523 + } + }, + "source": [ + "fig = plt.figure(figsize = (8,8))\n", + "ax = fig.add_subplot(1,1,1) \n", + "ax.set_xlabel('PCA1', fontsize = 15)\n", + "ax.set_ylabel('PCA2', fontsize = 15)\n", + "ax.set_title('2 componentes PCA', fontsize = 20)\n", + "targets = ['setosa', 'versicolor', 'virginica']\n", + "colors = ['r', 'g', 'b']\n", + "for target, color in zip(targets,colors):\n", + " indicesToKeep = df_PCA_final_2c['target'] == target\n", + " ax.scatter(df_PCA_final_2c.loc[indicesToKeep, 'PCA1']\n", + " , df_PCA_final_2c.loc[indicesToKeep, 'PCA2']\n", + " , c = color\n", + " , s = 50)\n", + "ax.legend(targets)\n", + "ax.grid()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pZaGiGnUl6ER" + }, + "source": [ + "O que significa reduzir para 2 dimensões um array com 4 dimensões?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xF-whgBmlHN1", + "outputId": "a23c21ca-cc81-48ee-f293-067c4ea585d1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 654 + } + }, + "source": [ + "X_new = pca_2c.inverse_transform(X_PCA_2c)\n", + "plt.figure(figsize=(10, 10), dpi=80)\n", + "plt.scatter(X_STD[:, 0], X_STD[:, 1], alpha=0.2)\n", + "plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.9)\n", + "plt.axis('equal');" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XvLDQ2JF5NP8" + }, + "source": [ + "### Análise de Correlação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DK3R415_5RzY", + "outputId": "7bb27d07-dda9-4f39-ec26-39b20f49a0ca", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "# calcula a correlação entre as colunas/variáveis do dataframe\n", + "correlacao= df_iris.corr().abs()\n", + "\n", + "# Seleciona o triângulo superior da matriz de correlação\n", + "correlacao = correlacao.where(np.triu(np.ones(correlacao.shape), k=1).astype(np.bool))\n", + "correlacao" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Sepal Lengthsepal widthpetal lengthpetal widthtarget2
Sepal LengthNaN0.117570.8717540.8179410.782561
sepal widthNaNNaN0.4284400.3661260.426658
petal lengthNaNNaNNaN0.9628650.949035
petal widthNaNNaNNaNNaN0.956547
target2NaNNaNNaNNaNNaN
\n", + "
" + ], + "text/plain": [ + " Sepal Length sepal width petal length petal width target2\n", + "Sepal Length NaN 0.11757 0.871754 0.817941 0.782561\n", + "sepal width NaN NaN 0.428440 0.366126 0.426658\n", + "petal length NaN NaN NaN 0.962865 0.949035\n", + "petal width NaN NaN NaN NaN 0.956547\n", + "target2 NaN NaN NaN NaN NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 294 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uLvvzpvl5Zy9", + "outputId": "02c8c610-efe1-45a0-e5c1-7bd308513e56", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 500 + } + }, + "source": [ + "fig, ax = plt.subplots(figsize=(8, 8)) \n", + "mask = np.zeros_like(df_iris.corr().abs())\n", + "mask[np.triu_indices_from(mask)] = 1\n", + "sns.heatmap(df_iris.corr().abs(), mask= mask, ax= ax, cmap='coolwarm', annot= True, fmt= '.2f')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 295 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xo-llZpb7JfO" + }, + "source": [ + "Pela Análise de Correlação, vemos duas variáveis altamente correlacionadas com a variável-resposta, que são: 'Peta Width' e 'Petal Length', que são as duas variáveis mais importantes no dataframe. Lembram-se?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kjPawS_dgSKU" + }, + "source": [ + "### Variância explicada\n", + "* Como vimos, reduzimos um array de informações de 4 dimensões para um array com 2 dimensões. Com isso, perde-se alguma informação relativa à variância. Mas quanto perdemos?\n", + "\n", + "* A Variância Explicada (ou Explained Variance, em inglês), mede o quanto de informação (variação) foi atribuída a cada um dos componentes principais. Usando o atributo explain_variance_ratio_, é possível ver que o primeiro componente principal contém 72,77% da variação e o segundo componente principal contém 23,03% da variação. Juntos, os dois componentes contêm 95,80% das informações. Portanto, perdemos quase nada em termos de informação e o modelo não é prejudicado por esta redução.\n", + "\n", + "\n", + "A resposta à essa pergunta é:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i6gcdvtYgwpX", + "outputId": "d416ed99-d2aa-4f64-cad9-9ba37903aa3e", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "pca.explained_variance_ratio_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.72962445, 0.22850762, 0.03668922, 0.00517871])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 298 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "71ubjjflQEUf" + }, + "source": [ + "Observe que o terceiro e quarto valores são muito baixo. Ou seja, baixa variabilidade explicada... 
So from this we can already see that the ideal number of components is 2." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m_4TqauJvIvX" + }, + "source": [ + "### How many components to choose" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O2NRgjCjvUli", + "outputId": "da0675c7-c382-48d7-eccb-1f00037e5aff", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 279 + } + }, + "source": [ + "pca = PCA().fit(X_STD)\n", + "plt.plot(np.cumsum(pca.explained_variance_ratio_))\n", + "plt.xlabel('number of components')\n", + "plt.ylabel('cumulative explained variance');" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AvQ6KwQdwAdC" + }, + "source": [ + "**Interpretation**: This curve quantifies how much of the total variance of the 4 dimensions is contained in the first N components. For example, the first principal component holds approximately 95% of the variance, while the first 2 components together explain almost 100% of the variability. Therefore, in our case, 2 principal components are enough to capture most of the variability in the data." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "S151TcjQw6vc" + }, + "source": [ + "pca.explained_variance_ratio_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2dNWrsMCyUUE" + }, + "source": [ + "### Measuring the impact" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ueFuNN47zMd6" + }, + "source": [ + "#### Train the model with X_PCA" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yb4dpTHPyYlU" + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Mr3tAGbjTET8" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_PCA_2c, y_iris, test_size = 0.2, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IzCZCBSmylwZ", + "outputId": "d8218700-7b4d-40b3-be83-dd2f87c8f400", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# We chose 2 principal components\n", + "classifier_2c = RandomForestClassifier(max_depth = 2, random_state = 0)\n", + "classifier_2c.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": 
"execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=2, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, n_estimators=100,\n", + " n_jobs=None, oob_score=False, random_state=0, verbose=0,\n", + " warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 307 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BiDgWh2PzYEY" + }, + "source": [ + "#### Make the predictions" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A7V8DWW4zVPs" + }, + "source": [ + "y_pred_2c = classifier_2c.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JTm4msQy2ezQ" + }, + "source": [ + "___\n", + "# **Exercises**\n", + "* For each of the following dataframes, select the best features using the following techniques:\n", + " * Random Forest\n", + " * XGBoost\n", + " * RFE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "caFkC6oCmUKK" + }, + "source": [ + "## Exercise 1 - Breast Cancer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vhOM-Z9zmf-f" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "cancer = load_breast_cancer()\n", + "X= cancer['data']\n", + "y= cancer['target']\n", + "\n", + "df_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n", + "df_cancer['target'] = df_cancer['target'].map({0: 'malignant', 1: 'benign'})\n", + "df_cancer.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zA395jtOfGEl" + }, + "source": [ + "## Exercise 2 - Fraud 
Detection" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "14fV0gz0flb8" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "url= 'https://raw.githubusercontent.com/MathMachado/Python_RFB/DS_Python/Dataframes/creditcard.csv?token=AGDJQ63IAZCFP7GTSZTOMAK5QBSP6'\n", + "df_CC= pd.read_csv(url)\n", + "df_CC.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1qruqUDqnvMc" + }, + "source": [ + "## Exercise 3 - Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "trxK8YXNnsam" + }, + "source": [ + "from sklearn.datasets import load_boston # removed in scikit-learn 1.2; this cell needs an older version\n", + "\n", + "boston = load_boston()\n", + "X= boston['data']\n", + "y= boston['target']\n", + "\n", + "df_boston = pd.DataFrame(np.c_[X, y], columns= np.append(boston['feature_names'], ['target']))\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-CawPH2nb5cl" + }, + "source": [ + "## Exercise 4 - Diabetes\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lVjBS7QcZuT" + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "diabetes = load_diabetes()\n", + "X= diabetes['data']\n", + "y= diabetes['target']\n", + "\n", + "df_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))\n", + "df_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qP1vjDdylyHr" + }, + "source": [ + "## Exercise 5 - Crimes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fqLHDXbOl0Sf" + }, + "source": [ + "url= 'https://raw.githubusercontent.com/MathMachado/Python_RFB/DS_Python/Dataframes/Crime.txt?token=AGDJQ665WUIWIEKDPK6WO625P3QUQ'\n", + "df_Crime = pd.read_table(url, sep=',', na_values='?')\n", + "df_Crime.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "fxhTXj6Ll7wB" + }, + "source": [ + "df_Crime.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d3b-Yv2HUmoI" + }, + "source": [ + "## Exercício 6 - Titanic" + ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB11__DataViz_Matplotlib & Seaborn_hs.ipynb b/Notebooks/NB11__DataViz_Matplotlib & Seaborn_hs.ipynb new file mode 100644 index 000000000..4ec3664e4 --- /dev/null +++ b/Notebooks/NB11__DataViz_Matplotlib & Seaborn_hs.ipynb @@ -0,0 +1,1246 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Untitled31.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oRokSLxEMgDN" + }, + "source": [ + "## Referência\n", + "* [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)\n", + "* [Python Graph Galery](https://python-graph-gallery.com/all-charts/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IFiAWdKnFS5A" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "import bokeh # Library necessária ***\n", + "\n", + "plt.rcParams[\"figure.figsize\"] = [15, 12]\n", + "%matplotlib inline" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UfrAHnWpJTwD" + }, + "source": [ + "## Séries temporais simples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_PV_kTGRMq4B" + }, + "source": [ + "#### Série/Dados simulados" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_yVTB9v0KQxp" + }, + 
"source": [ + "from datetime import datetime\n", + "\n", + "dt_hoje = datetime.strptime('2020-10-14', '%Y-%m-%d')\n", + "dt_inicio = datetime.strptime('2020-01-01', '%Y-%m-%d')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gMQx3JSlJz0R" + }, + "source": [ + "# Quantos dias desde a data inicial?\n", + "i_quantidade_dias = abs((dt_hoje - dt_inicio).days)\n", + "i_quantidade_dias" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Tb70ycS_JWvQ" + }, + "source": [ + "np.random.seed(20111974)\n", + "\n", + "i_qtd_ativos = 4\n", + "df_series_temporais = pd.DataFrame(np.random.randn(i_quantidade_dias, i_qtd_ativos), index = pd.date_range(dt_inicio, periods = i_quantidade_dias)) #, columns = list('ABCD'))\n", + "df_series_temporais.columns = ['Ativo1', 'Ativo2', 'Ativo3', 'Ativo4']\n", + "\n", + "#serie_temporal = pd.Series(np.random.randn(i_quantidade_dias), index = pd.date_range(dt_inicio, periods = i_quantidade_dias))\n", + "df_series_temporais.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hPq0XtirNMhm" + }, + "source": [ + "## Gráfico de séries temporais" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kEu3wDl9L92i" + }, + "source": [ + "df_series_temporais2 = df_series_temporais.cumsum()\n", + "plt.figure()\n", + "df_series_temporais2.plot()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oEQQHUG8KtAv" + }, + "source": [ + "Gráfico de 1 única série temporal" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xqNCkZdIKh3L" + }, + "source": [ + "df_series_temporais3 = df_series_temporais['Ativo1']\n", + "plt.figure()\n", + "df_series_temporais3.plot()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "m5rMpulVKrSe" + }, + "source": [ + 
"df_series_temporais3 = df_series_temporais['Ativo1'].cumsum()\n", + "plt.figure()\n", + "df_series_temporais3.plot(kind = 'line')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wa4sXjcMNkzS" + }, + "source": [ + "Experimente kind = {'line', 'box', 'hist', 'kde'}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8eAETzNARsxo" + }, + "source": [ + "### Se quisermos comparar horizontalmente\n", + "* No caso abaixo, estou a comparar as colunas 'Ativo1', 'Ativo2', 'Ativo3' e 'Ativo4' quanto ao conteúdo da linha 3 --> iloc[3]." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "APnKHRMSbYMO" + }, + "source": [ + "df_series_temporais2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6a0FB-SPReD9" + }, + "source": [ + "plt.figure()\n", + "df_series_temporais2.iloc[3].plot(kind = 'bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qJ8SBoT6SSu0" + }, + "source": [ + "### Comparar grupos\n", + "* Neste caso, vou selecionar (ou dar um zoom) somente em alguns dias do dataframe." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kKeby_vwTB5j" + }, + "source": [ + "df_series_temporais2_zoom = df_series_temporais2[0:10]\n", + "df_series_temporais2_zoom" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_XBwdn_Sa8h" + }, + "source": [ + "df_series_temporais2_zoom.plot(kind = 'bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zru6GuoYTuzd" + }, + "source": [ + "#### Outra forma de visualizar o mesmo resultado:\n", + "* stacked bar plot --> Basta usar o parâmetro stacked = True" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lHY7A1RLTzaT" + }, + "source": [ + "df_series_temporais2_zoom.plot(kind = 'bar', stacked = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UWP6hLn8US1M" + }, + "source": [ + "### Se quiser visualizar o gráfico na horizontal..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7dtzx-vOUWNG" + }, + "source": [ + "df_series_temporais2_zoom.plot(kind = 'barh', stacked = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z22k7IOyU6la" + }, + "source": [ + "### Histogramas" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LKLWYWYeU8UV" + }, + "source": [ + "df_series_temporais2.plot(kind = 'hist', bins = 30) # O que são bins?" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MjLO8BqUeQvP" + }, + "source": [ + "#### O que são bins?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dG4zhQExVbY1" + }, + "source": [ + "#### Individual histogram" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZNGWjh9LVdb7" + }, + "source": [ + "plt.figure()\n", + "df_series_temporais2['Ativo3'].diff().hist() # See below for a fuller explanation of diff(axis, periods)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3LQlM_qjWd7g" + }, + "source": [ + "df_series_temporais2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "x3N6q_fTWl60" + }, + "source": [ + "df_series_temporais2.diff(axis = 0, periods = 1).head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LGknpyFaWqcZ" + }, + "source": [ + "df_series_temporais2.iloc[1, 0] - df_series_temporais2.iloc[0, 0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TdjsYr4Wer73" + }, + "source": [ + "df_series_temporais2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Yq6TtAU2XAHL" + }, + "source": [ + "#### diff(axis = 1, periods = 1) takes the difference across columns! 
See below:\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6QRBLyBQXKq8" + }, + "source": [ + "df_series_temporais2.diff(axis = 1, periods = 1).head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "niDjEkSpYgAj" + }, + "source": [ + "### Histograms in multiple plots" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4ie8toFUYlF-" + }, + "source": [ + "plt.figure()\n", + "df_series_temporais2.diff(axis = 0, periods = 1).hist(color ='g', alpha = 0.5, bins = 50)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r7W97FztGTMl" + }, + "source": [ + "## Boxplot" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q-19pTLZZKVj" + }, + "source": [ + "plt.figure()\n", + "boxplot = df_series_temporais2.boxplot(vert = True) # Note the parameter vert = True" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "aQ2qQetiGU8f" + }, + "source": [ + "plt.figure()\n", + "boxplot = df_series_temporais2.boxplot(vert = False) # Note the parameter vert = False" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wo6AFzOPMvMf" + }, + "source": [ + "#### Wine quality data - White vs Red\n", + "\n", + "* The goal is to assess wine quality (red vs white) on a scale of 0–100. 
The quality bands on that scale are:\n", + "\n", + "* 95–100 Classic: a great wine\n", + "* 90–94 Outstanding: a wine of superior character and style\n", + "* 85–89 Very good: a wine with special qualities\n", + "* 80–84 Good: a solid, well-made wine\n", + "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n", + "* 50–74 Not recommended" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aO9K8R9Qa9Uj" + }, + "source": [ + "url_tinto = 'https://raw.githubusercontent.com/Hayltons/DSWP/master/Dataframes/Wine_red.csv'\n", + "url_branco = 'https://raw.githubusercontent.com/Hayltons/DSWP/master/Dataframes/Wine_white.csv'\n", + "df_vinho_tinto = pd.read_csv(url_tinto)\n", + "df_vinho_tinto[\"color\"] = 1 # --> Red wine\n", + "\n", + "df_vinho_branco = pd.read_csv(url_branco)\n", + "df_vinho_branco[\"color\"] = 0 # --> White wine" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "owdOjksbg7Dc" + }, + "source": [ + "# Stack the dataframes df_vinho_tinto and df_vinho_branco:\n", + "df_vinhos = pd.concat([df_vinho_tinto, df_vinho_branco], axis = 0)\n", + "df_vinhos.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zYniNn5PfGx9" + }, + "source": [ + "df_vinho_tinto.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KL7iW5mtgCre" + }, + "source": [ + "df_vinhos['quality'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "G_yOZ-Gqmscv" + }, + "source": [ + "df_vinhos['color'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IKTEbTW2jMVv" + }, + "source": [ + "#### Cleaning up the column names" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JeXjuKNIm39F" + }, + "source": [ + "df_vinhos.head()" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1Oo-6k2jh3bx" + }, + "source": [ + "df_vinhos.columns = [col.lower() for col in df_vinhos.columns]\n", + "\n", + "# substituir ' ' por '_' no nome das colunas:\n", + "df_vinhos.columns = [col.replace(' ', '_') for col in df_vinhos.columns]\n", + "df_vinhos.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eiMHK6aJjoZl" + }, + "source": [ + "df_vinhos.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sUNEzoC7j0PV" + }, + "source": [ + "print(f\"Média do vinho Branco: {df_vinho_branco['quality'].mean()}\")\n", + "print(f\"Média do vinho Tinto.: {df_vinho_tinto['quality'].mean()}\")\n", + "print(f\"Média Geral..........: {df_vinhos['quality'].mean()}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tIBDUBI4n78b" + }, + "source": [ + "Abaixo, o mesmo cálculo, porém usando o artificio de procurar/selecionar o tipo que queremos no dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X1Nllwpxl228" + }, + "source": [ + "print(f\"Média do vinho Branco: {df_vinhos[df_vinhos['color'] == 0]['quality'].mean()}\")\n", + "print(f\"Média do vinho Tinto.: {df_vinhos[df_vinhos['color'] == 1]['quality'].mean()}\")\n", + "print(f\"Média Geral..........: {df_vinhos['quality'].mean()}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GHjfSmExmg0u" + }, + "source": [ + "df_vinhos.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J3ZsHlrWmLDQ" + }, + "source": [ + "df_vinhos[df_vinhos['color'] == 1]['quality']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "a-4XRBelnKCW" + }, + "source": [ + "fig, ax = 
plt.subplots(figsize=(10, 6))\n", + "df_vinhos['quality'].value_counts().plot(kind = 'bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7HjKZ6Z1bkct" + }, + "source": [ + "Next, something more polished, with a chart title, axis labels, and a grid." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jB9BTwBOa7UA" + }, + "source": [ + "fig, ax = plt.subplots(figsize = (10, 6))\n", + "df_vinhos['quality'].value_counts().plot(kind = 'bar')\n", + "\n", + "# Title and labels for the X and Y axes\n", + "plt.title('Wine quality ratings', fontsize = 25)\n", + "plt.xlabel('Attribute: quality', fontsize = 10)\n", + "plt.ylabel('Count', fontsize = 10)\n", + "\n", + "# Add a grid to the chart\n", + "ax.grid(True)\n", + "\n", + "# Configure the legend\n", + "#plt.legend()\n", + "\n", + "# Set the Y-axis limits\n", + "#plt.ylim(0, 5000)\n", + "\n", + "# Set the X-axis limits\n", + "#plt.xlim(0, 3000)\n", + " \n", + "# Show the chart\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "w1CyCXVkmrFV" + }, + "source": [ + "df_vinhos['color'].value_counts().plot(kind = 'bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jU1AY-_wpU2h" + }, + "source": [ + "df_vinhos.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ke8nw1nIcFpT" + }, + "source": [ + "df_vinhos['fixed_acidity'].value_counts().sort_index()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "e0ayzbRanNDq" + }, + "source": [ + "df_vinhos['fixed_acidity'].value_counts().sort_index().plot(kind = 'area')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eSxvaczjoll-" + }, + "source": [ + "### Challenge: Make the chart below 
more informative\n", + "* For example, show which variable is being analyzed, label the X and Y axes, add titles, and so on." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RjzkMuPTn0yI" + }, + "source": [ + "l_colunas = df_vinhos.columns # loop over every column automatically\n", + "for caracteristica in l_colunas:\n", + "    plt.figure() # Remove this line and see what happens\n", + "    df_vinhos[caracteristica].value_counts().sort_index().plot(kind = 'area')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PYIjyMkVnWnr" + }, + "source": [ + "### Correlations\n", + "* Present the table with the interpretation of the correlations." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gn7xXclM7ewN" + }, + "source": [ + "### Introduction\n", + "The code below generates simulated samples so we can assess the correlations between variables." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4un3dsyZ7fFU" + }, + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "i_simulacoes = 5000\n", + "\n", + "# Set the seed --> Reproducibility\n", + "np.random.seed(19741120)\n", + "\n", + "# Array with the means of the samples:\n", + "a_media = np.array([0.0, 5.0, 10.0])\n", + "\n", + "# Array with the covariance matrix:\n", + "a_covariancia = np.array([\n", + " [ 3.40, -2.75, -2.00],\n", + " [ -2.75, 5.50, 1.50],\n", + " [ -2.00, 1.50, 1.25]\n", + " ])\n", + "\n", + "# Generate the random samples from a_media and a_covariancia.\n", + "a_amostras = np.random.multivariate_normal(a_media, a_covariancia, size = i_simulacoes)\n", + "a_amostras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "akHw3Mym_FgQ" + }, + "source": [ + "Next, a plot showing the correlation between a_amostras[:, 0] and a_amostras[:, 1]:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iTLIn1uwJoVi" + }, + "source": [ + "plt.figure(figsize= (12, 8))\n", + 
"ax = sns.regplot(x = a_amostras[:,0], y = a_amostras[:,1], color = 'g')\n", + "plt.xlabel('a_amostras[0]')\n", + "plt.ylabel('a_amostras[1]')\n", + "#plt.axis('equal')\n", + "plt.grid(True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JcermWt-Ar5c" + }, + "source": [ + "np.corrcoef(a_amostras[:, 0], a_amostras[:, 1])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ryLXMQ66_fce" + }, + "source": [ + "Gráfico da correlação entre a_amostras[:, 0] e a_amostras[:, 2]:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8Xp69Xgg9iRV" + }, + "source": [ + "plt.figure(figsize= (12, 8))\n", + "ax = sns.regplot(x = a_amostras[:,0], y = a_amostras[:,2], color = 'g')\n", + "plt.xlabel('a_amostras[0]')\n", + "plt.ylabel('a_amostras[2]')\n", + "#plt.axis('equal')\n", + "plt.grid(True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gw6OpxFBA5Sp" + }, + "source": [ + "np.corrcoef(a_amostras[:, 0], a_amostras[:, 2])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GmnKTqxQ_uZ9" + }, + "source": [ + "E por fim, gráfico com as correlações entre a_amostras[:, 1] e a_amostras[:, 2]:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yjWoFPhR_t3I" + }, + "source": [ + "plt.figure(figsize= (12, 8))\n", + "ax = sns.regplot(x = a_amostras[:, 1], y = a_amostras[:, 2], color = 'g')\n", + "plt.xlabel('a_amostras[1]')\n", + "plt.ylabel('a_amostras[2]')\n", + "#plt.axis('equal')\n", + "plt.grid(True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xnJkxZ25C7kX" + }, + "source": [ + "np.corrcoef(a_amostras[:, 1], a_amostras[:, 2])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qEttRQwgDGq_" + }, + "source": [ + "E a 
seguir, o pairplot para avaliarmos todas as colunas ao mesmo tempo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mkAJivoPC_OM" + }, + "source": [ + "sns.pairplot(pd.DataFrame(a_amostras))\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6FVQwuNP8w6s" + }, + "source": [ + "### Análise do dataframe df_vinhos" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N-Aa8wnh6rky" + }, + "source": [ + "df_vinhos.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZhtIILrs6vUT" + }, + "source": [ + "### Correlações entre um par de variáveis X e Y" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lJh2Comx6a_k" + }, + "source": [ + "np.corrcoef(df_vinhos['fixed_acidity'], df_vinhos['alcohol'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ifZybEAE68V9" + }, + "source": [ + "### Correlações do dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IOCk4vhpnYn9" + }, + "source": [ + "correlacoes = df_vinhos.corr()\n", + "\n", + "top_correlacoes_cols = correlacoes.color.sort_values(ascending = False).keys()\n", + "top_correlacoes = correlacoes.loc[top_correlacoes_cols, top_correlacoes_cols]\n", + "dropSelf = np.zeros_like(top_correlacoes)\n", + "dropSelf[np.triu_indices_from(dropSelf)] = True\n", + "plt.figure(figsize = (15, 9))\n", + "sns.heatmap(top_correlacoes, cmap=sns.diverging_palette(220, 10, as_cmap = True), annot = True, fmt = \".2f\", mask = dropSelf)\n", + "sns.set(font_scale=1.5)\n", + "plt.show()\n", + "del correlacoes, dropSelf, top_correlacoes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "SFqklDJf-8le" + }, + "source": [ + "df_vinhos.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"H7hKbxfdBV8w" + }, + "source": [ + "### Avaliar o comportamento bivariado\n", + "* 2D Density Plot\n", + " * Útil para avaliarmos a relação entre 2 variáveis numéricas. O gráfico no centro mostra a correlação entre as variáveis enquanto os gráficos marginais mostra a distribuição das respectivas variáveis usando histogramas ou gráficos de densidade." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LllKqLx3_IIG" + }, + "source": [ + "sns.jointplot(x = df_vinhos['alcohol'], y = df_vinhos['density'], kind = \"scatter\", color = 'm', s=50, edgecolor = \"skyblue\", linewidth = 2)\n", + "plt.savefig('minha_figura.png')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "33yTNYN2K40X" + }, + "source": [ + "Mesmos dados, gráfico diferente --> Explorem as opções disponíveis: https://python-graph-gallery.com/82-marginal-plot-with-seaborn/" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BVmAt0wCK1Ob" + }, + "source": [ + "sns.jointplot(x = df_vinhos['alcohol'], y = df_vinhos['density'], kind = \"reg\", color = 'm', )\n", + "plt.savefig('minha_figura.png')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4ixcDmeXIFQ1" + }, + "source": [ + "### Pairplot\n", + "* Verificar relacionamentos entre pares no dataframe." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lWqwaZ_lArji" + }, + "source": [ + "sns.pairplot(df_vinhos)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vAhaEgyYtfX9" + }, + "source": [ + "Below, the same plot segmented by color:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jnu-giD_tcwd" + }, + "source": [ + "sns.pairplot(df_vinhos, hue = \"color\") # Compare the plots with and without the hue option\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vtOH-mTHLGC-" + }, + "source": [ + "df_vinhos.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dcaQ8aPaHwBB" + }, + "source": [ + "sns.lmplot(x = \"alcohol\", y = \"density\", data = df_vinhos, hue = \"color\", fit_reg = False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pWsCs585LPyn" + }, + "source": [ + "sns.lmplot(x = \"alcohol\", y = \"density\", data = df_vinhos, hue = \"quality\", fit_reg = False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5RwOiYi3OfD5" + }, + "source": [ + "### Boxplot" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZqIP5xUOMAqL" + }, + "source": [ + "df_vinhos.boxplot(column = 'alcohol', by = 'quality', figsize = (12, 8))\n", + "plt.xlabel('Quality', fontsize = 10, color= 'blue')\n", + "plt.ylabel('alcohol', fontsize = 10, color= 'blue')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lWypAe78YQNm" + }, + "source": [ + "## Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YD8jgEZyYSHP" + }, + "source": [ + "### Exercise 1\n", + "* Graphical analysis of the variables in the IRIS dataframe.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"h0F7uXixYVqx" + }, + "source": [ + "from sklearn.datasets import load_iris\n", + "\n", + "iris = load_iris()\n", + "X= iris['data']\n", + "y= iris['target']\n", + "\n", + "df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))\n", + "df_iris['target2'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", + "df_iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JdPniPYQlI8X" + }, + "source": [ + "sns.pairplot(df_iris)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yV5gDSF1YdGL" + }, + "source": [ + "### Exercise 2\n", + "* Using the FIFA dataframe:\n", + " * (1) Show a bar chart of the number of players per club;\n", + " * (2) Show the boxplot/histogram of player wages for the clubs Real Madrid, Barcelona, Paris Saint-Germain, Bayern Munich, and Juventus." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "27NbnlDkYoeH" + }, + "source": [ + "df_FIFA_2[df_FIFA_2['club'].isin(\n", + " ['Real Madrid', 'FC Barcelona','Paris Saint-Germain','FC Bayern München', 'Juventus'])].boxplot(column = 'wage_montante', by = 'club', figsize = (12, 8))" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15_00__Machine_Learning___DSWP_h2.ipynb b/Notebooks/NB15_00__Machine_Learning___DSWP_h2.ipynb new file mode 100644 index 000000000..0cbbbaf5f --- /dev/null +++ b/Notebooks/NB15_00__Machine_Learning___DSWP_h2.ipynb @@ -0,0 +1,4001 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "colab": { + "name": "NB15_00__Machine_Learning.ipynb", + "provenance": [], + "include_colab_link": true + }, + "accelerator": "TPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + 
}, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ShVXyGj9wkgN" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYQ4cDfcPu4e" + }, + "source": [ + "___\n", + "# **NOTAS E OBSERVAÇÕES**\n", + "* Abordar o impacto do desbalanceamento da amostra;\n", + "* Colocar AUROC no material e mostrar o cut off para classificação entre 0 e 1;\n", + "* Conceitos estatísticos de bias & variance;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YvhLC_uf4_G" + }, + "source": [ + "___\n", + "# **AGENDA**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QgX6n2VDyY1O" + }, + "source": [ + "___\n", + "# **REFERÊNCIAS**\n", + "* [scikit-learn - Machine Learning With Python](https://scikit-learn.org/stable/);\n", + "* [An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples](https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer)\n", + "* [The Difference Between Artificial Intelligence, Machine Learning, and Deep Learning](https://medium.com/iotforall/the-difference-between-artificial-intelligence-machine-learning-and-deep-learning-3aa67bff5991)\n", + "* [A Gentle Guide to Machine Learning](https://blog.monkeylearn.com/a-gentle-guide-to-machine-learning/)\n", + "* [A Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n", + "* [Introduction to Machine Learning](http://alex.smola.org/drafts/thebook.pdf)\n", + "* [The 10 Statistical Techniques Data Scientists Need to Master](https://medium.com/cracking-the-data-science-interview/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7)\n", + "* [Tune: a library for fast hyperparameter tuning at any scale](https://towardsdatascience.com/fast-hyperparameter-tuning-at-scale-d428223b081c)\n", + "* [How to lie with Data Science](https://towardsdatascience.com/how-to-lie-with-data-science-5090f3891d9c)\n", + "* [5 Reasons “Logistic Regression” should be the first thing you learn when becoming a Data 
Scientist](https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4)\n", + "* [Machine learning on categorical variables](https://towardsdatascience.com/machine-learning-on-categorical-variables-3b76ffe4a7cb)\n", + "\n", + "## Deep Learning & Neural Networks\n", + "\n", + "- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html)\n", + "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721)\n", + "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsCbZd2epfxo" + }, + "source": [ + "___\n", + "# **INTRODUCTION**\n", + "\n", + "* \"__Information is the oil of the 21st century, and analytics is the combustion engine__.\" - Peter Sondergaard, SVP, Gartner Research;\n", + "\n", + "\n", + ">This chapter focuses on:\n", + "* Linear Regression, Logistic Regression, Decision Tree, Random Forest, Support Vector Machine and XGBoost algorithms for building Machine Learning models;\n", + "* Understanding how to solve classification and regression problems;\n", + "* Applying Ensemble techniques such as Bagging and Boosting;\n", + "* Measuring the accuracy of Machine Learning models;\n", + "* Learning the main Machine Learning algorithms, for both supervised and unsupervised learning.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HqqB2vaHXMGt" + }, + "source": [ + "___\n", + "# **ARTIFICIAL INTELLIGENCE VS MACHINE LEARNING VS DEEP LEARNING**\n", + "* **Machine Learning** - gives computers the ability to learn without being explicitly programmed. 
Computers can improve their ability to learn by practicing a task, usually with large datasets.\n", + "* **Deep Learning** - is a Machine Learning method based on artificial neural networks, which lets computer systems learn by example, much as we humans do." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P961GcguXFFA" + }, + "source": [ + "![EvolutionOfAI](https://github.com/MathMachado/Materials/blob/master/Evolution%20of%20AI.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://github.com/MathMachado/P4ML/blob/DS_Python/Material/Evolution%20of%20AI.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkqGtO88ZkPr" + }, + "source": [ + "![AI_vs_ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/AI_vs_ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xesQpzfmaqj6" + }, + "source": [ + "![ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. 
Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KeIVR59IIS7f" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING - TECHNIQUES**\n", + "\n", + "* Supervised Learning\n", + "* Unsupervised Learning\n", + "\n", + "![MachineLearning](https://github.com/MathMachado/Materials/blob/master/MachineLearningTechniques.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rvwp5UHdBiup" + }, + "source": [ + "___\n", + "# **OUR FOCUS HERE WILL BE...**\n", + "\n", + "![ClassicalML](https://github.com/MathMachado/Materials/blob/master/ClassicalML.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cBLSvJTXHBjK" + }, + "source": [ + "___\n", + "# **CHEAT SHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZdjR3nahUuKq" + }, + "source": [ + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MkBSvyorGXQz" + }, + "source": [ + "___\n", + "# **CROSS-VALIDATION**\n", + "* K-fold is the best-known and most widely used Cross-Validation (CV) method;\n", + "* How it works: split the training dataframe into k parts (folds);\n", + "  * Use k-1 parts to train the model and the remaining part to validate it;\n", + "  * Repeat this process k times, computing the desired metrics in each iteration;\n", + "  * After the k iterations we have k values of each metric, from which we compute the mean and the standard deviation.\n", + "\n", + " The figure below helps us understand how 
CV works:\n", + "\n", + "![Cross-Validation](https://github.com/MathMachado/Materials/blob/master/CV2.PNG?raw=true)\n", + "\n", + "Source: [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + "\n", + "* **Value of k**:\n", + "  * Typical values of k (folds): between 5 and 10 --> There is no general rule for choosing k;\n", + "  * The larger the value of k, the smaller the bias of the CV estimate;\n", + "\n", + "[Applied Predictive Modeling, 2013](https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/ref=as_li_ss_tl?ie=UTF8&qid=1520380699&sr=8-1&keywords=applied+predictive+modeling&linkCode=sl1&tag=inspiredalgor-20&linkId=1af1f3de89c11e4a7fd49de2b05e5ebf)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HscfN-a1V043" + }, + "source": [ + "* **Advantages of using CV**:\n", + "  * Models with better accuracy, because model selection relies on more reliable performance estimates;\n", + "  * Better use of the data, since all the data are used for both training and validation. 
Therefore, any problem with the data will be found at this stage.\n", + "\n", + "* **Further Reading**\n", + "  * [Cross-Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)\n", + "  * [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + "  * [Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XRukccWQSklx" + }, + "source": [ + "## Measures for assessing the variability present in the data\n", + "* The main measures of data variability are range, variance, standard deviation and coefficient of variation;\n", + "* These measures let us conclude whether the data are homogeneous (lower dispersion/variability) or heterogeneous (higher variability/dispersion).\n", + "\n", + "* **In the next version, bring these concepts into the Notebook and use Python to compute these measures**."
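+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of how the measures listed above could be computed with NumPy. The array `a_values` is a hypothetical sample made up for illustration, and `ddof=1` assumes that the sample (not the population) variance and standard deviation are wanted."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Hypothetical sample, used only for illustration\n",
+ "a_values = np.array([12.0, 15.0, 11.0, 18.0, 14.0])\n",
+ "\n",
+ "f_range = a_values.max() - a_values.min()  # range (amplitude)\n",
+ "f_variance = a_values.var(ddof=1)  # sample variance\n",
+ "f_std = a_values.std(ddof=1)  # sample standard deviation\n",
+ "f_cv = f_std / a_values.mean()  # coefficient of variation\n",
+ "\n",
+ "print(f'Range....................: {f_range:.2f}')\n",
+ "print(f'Variance.................: {f_variance:.2f}')\n",
+ "print(f'Standard deviation.......: {f_std:.2f}')\n",
+ "print(f'Coefficient of variation.: {f_cv:.2%}')"
+ ],
+ "execution_count": null,
+ "outputs": [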
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yBR8tWV_lhQq" + }, + "source": [ + "___\n", + "# **ENSEMBLE METHODS** (= Combining predictive models)\n", + "* Methods\n", + "  * **Bagging** (Bootstrap AGGregatING)\n", + "  * **Boosting**\n", + "  * Stacking --> Not widely used\n", + "* Helps avoid overfitting (overfitting is when the model/function fits the training data very well but fails to generalize to other samples or to the population).\n", + "* Builds meta-classifiers: combines the results of several algorithms to produce predictions that are more accurate and robust than those of each individual classifier.\n", + "* Ensembles reduce/minimize the effects of the main causes of error in Machine Learning models:\n", + "  * noise;\n", + "  * bias;\n", + "  * variance --> the main measure of the variability present in the data.\n", + "\n", + "# References\n", + "* [Simple guide for ensemble learning methods](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2) - A didactic explanation of how ensembles work."
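+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of Bagging, assuming scikit-learn is available. It compares a single decision tree with a bagged ensemble of trees on toy data. The names `X_demo` and `y_demo` and the values `n_estimators=50` and `cv=5` are illustrative assumptions, not recommendations, and the ensemble is not guaranteed to win on every dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.ensemble import BaggingClassifier\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "\n",
+ "# Toy data, generated only for this sketch\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)\n",
+ "\n",
+ "single_tree = DecisionTreeClassifier(random_state=0)\n",
+ "# BaggingClassifier uses a decision tree as its default base estimator\n",
+ "bagged_trees = BaggingClassifier(n_estimators=50, random_state=0)\n",
+ "\n",
+ "for s_name, model in [('Single tree', single_tree), ('Bagging', bagged_trees)]:\n",
+ "    a_scores = cross_val_score(model, X_demo, y_demo, cv=5)\n",
+ "    print(f'{s_name}: mean accuracy = {a_scores.mean():.4f}, std = {a_scores.std():.4f}')"
+ ],
+ "execution_count": null,
+ "outputs": [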
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25RW8u-Sj780" + }, + "source": [ + "### Further Reading\n", + "* [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)\n", + "* [Ensemble Methods in Machine Learning: What are They and Why Use Them?](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)\n", + "* [Ensemble Learning Using Scikit-learn](https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a)\n", + "* [Let’s Talk About Machine Learning Ensemble Learning In Python](https://medium.com/fintechexplained/lets-talk-about-machine-learning-ensemble-learning-in-python-382747e5fba8)\n", + "* [Boosting, Bagging, and Stacking — Ensemble Methods with sklearn and mlens](https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FugME1HSl4jJ" + }, + "source": [ + "___\n", + "# **PARAMETER TUNING** (= Optimal parameters of Machine Learning models)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u_147cIRl9F1" + }, + "source": [ + "## GridSearch (the tool we will use to optimize the parameters of the ML models)\n", + "* Finds the optimal parameters (hyperparameter tuning) that improve the accuracy of the models.\n", + "* Requires the following inputs:\n", + "  * The matrix $X_{p}$ with the $p$ COLUMNS (variables or features) of the dataframe;\n", + "  * The vector $y$ with the target COLUMN (response variable);\n", + "  * The estimator (algorithm) to be tuned. Examples: DecisionTreeClassifier, RandomForestClassifier, XGBClassifier, etc.;\n", + "  * A dictionary with the parameters to be optimized;\n", + "  * The number of folds for the Cross-Validation method."
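+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The inputs listed above can be sketched as follows, assuming scikit-learn's `GridSearchCV`. The toy data (`X_demo`, `y_demo`) and the grid `d_params` are hypothetical and only illustrate the shape of each input. They are not tuned recommendations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "\n",
+ "# Inputs 1 and 2: feature matrix X and target vector y (toy data)\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)\n",
+ "\n",
+ "# Input 4: dictionary with the parameters to be optimized (illustrative values)\n",
+ "d_params = {'max_depth': [3, 5, None], 'min_samples_split': [2, 10, 20]}\n",
+ "\n",
+ "grid = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),  # input 3: the estimator\n",
+ "                    param_grid=d_params,\n",
+ "                    cv=5,  # input 5: number of folds\n",
+ "                    scoring='accuracy')\n",
+ "grid.fit(X_demo, y_demo)\n",
+ "\n",
+ "print(f'Best parameters......: {grid.best_params_}')\n",
+ "print(f'Best mean CV accuracy: {grid.best_score_:.4f}')"
+ ],
+ "execution_count": null,
+ "outputs": [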
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "39Sg77fbTWCO" + }, + "source": [ + "___\n", + "# **MODEL SELECTION & EVALUATION**\n", + "> In this phase we identify and apply the most suitable metrics (Accuracy, Sensitivity, Specificity, F-Score, AUC, R-Sq, Adj R-Sq, RMSE (Root Mean Square Error)) to evaluate the performance of the ML models.\n", + ">> We train the ML models on the training sample and evaluate their performance on the test/validation sample.\n", + "\n", + "* Further Reading\n", + "  * [The 5 Classification Evaluation metrics every Data Scientist must know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)\n", + "  * [Confusion matrix and other metrics in machine learning](https://medium.com/hugo-ferreiras-blog/confusion-matrix-and-other-metrics-in-machine-learning-894688cb1c0a)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oQQVzZ2ZTYrB" + }, + "source": [ + "## Confusion Matrix\n", + "* Terms associated with the Confusion Matrix:\n", + "  * **True Positive** (TP): when the observed value is True and the model predicts True. That is, the model got the prediction right.\n", + "    * Example: **Observed**: Fraud (Positive); **Model**: Fraud (Positive) --> The model was right!\n", + "  * **True Negative** (TN): when the observed value is False and the model predicts False. That is, the model got the prediction right;\n", + "    * Example: **Observed**: NON-Fraud (Negative); **Model**: NON-Fraud (Negative) --> The model was right!\n", + "  * **False Positive** (FP): when the observed value is False and the model predicts True. That is, the model got the prediction wrong. 
\n", + "    * Example: **Observed**: NON-Fraud (Negative); **Model**: Fraud (Positive) --> The model was wrong!\n", + "  * **False Negative** (FN): when the observed value is True and the model predicts False. That is, the model got the prediction wrong.\n", + "    * Example: **Observed**: Fraud (Positive); **Model**: NON-Fraud (Negative) --> The model was wrong!\n", + "\n", + "* See [Confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)\n", + "\n", + "![ConfusionMatrix](https://github.com/MathMachado/Materials/blob/master/ConfusionMatrix.PNG?raw=true)\n", + "\n", + "Source: [Confusion Matrix](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ci-6eiqBTgbL" + }, + "source": [ + "## Accuracy\n", + "> Accuracy - is the proportion of correct predictions made by the model.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "How often does the classifier (predictive model) classify correctly?\n", + "```\n", + "\n", + "$$Accuracy= \\frac{TP+TN}{TP+TN+FP+FN}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7YI8X5TRx-R" + }, + "source": [ + "## Precision (or Positive Predictive Value)\n", + "> **Precision** - tells us how the classifier performs with respect to False Positives (how many of the predicted positives are real positives).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the model predicts Positive, how often is the classifier correct?\n", + "```\n", + "\n", + "\n", + "$$Precision= \\frac{TP}{TP+FP}$$\n", + "\n", + "**Example**: Precision tells us the proportion of customers that the model predicted as Fraud who really are Fraud.\n", + "\n", + "**Comment**: If our focus is to minimize False Positives (FP), we should strive to bring Precision as close to 100% as possible."
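+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of the Accuracy and Precision formulas above, computed from the confusion matrix counts and checked against scikit-learn. The labels `y_true` and `y_pred` are hypothetical, with 1 = Fraud (Positive) and 0 = NON-Fraud (Negative)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "from sklearn.metrics import accuracy_score, confusion_matrix, precision_score\n",
+ "\n",
+ "# Hypothetical observed and predicted labels, used only for illustration\n",
+ "y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]\n",
+ "y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]\n",
+ "\n",
+ "# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP\n",
+ "tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()\n",
+ "\n",
+ "f_accuracy = (tp + tn) / (tp + tn + fp + fn)\n",
+ "f_precision = tp / (tp + fp)\n",
+ "\n",
+ "# The manual formulas should agree with the scikit-learn functions\n",
+ "assert abs(f_accuracy - accuracy_score(y_true, y_pred)) < 1e-12\n",
+ "assert abs(f_precision - precision_score(y_true, y_pred)) < 1e-12\n",
+ "\n",
+ "print(f'TP={tp}, TN={tn}, FP={fp}, FN={fn}')\n",
+ "print(f'Accuracy = {f_accuracy:.2f}')\n",
+ "print(f'Precision = {f_precision:.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": [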
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zO39n8x_Sz3L" + }, + "source": [ + "## Recall (or Sensitivity)\n", + "> **Recall** - tells us how the classifier performs with respect to False Negatives (how many positives we missed).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Positive, how often is the classifier correct?\n", + "```\n", + "\n", + "$$Recall = Sensitivity = \\frac{TP}{TP+FN}$$\n", + "\n", + "**Example**: Recall is the proportion of customers observed as Fraud that the model predicts as Fraud.\n", + "\n", + "**Comment**: If our focus is to minimize False Negatives (FN), we should strive to bring Recall as close to 100% as possible." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "htS6rdHVVXRG" + }, + "source": [ + "## Specificity\n", + "> **Specificity** - proportion of TN over TN+FP.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Negative, how often is the classifier correct?\n", + "```\n", + "\n", + "**Example**: Specificity is the proportion of NON-Fraud customers that the model predicts as NON-Fraud.\n", + "\n", + "$$Specificity= \\frac{TN}{TN+FP}$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mNn0twadTacc" + }, + "source": [ + "## F1-Score\n", + "> F1-Score is the harmonic mean of Recall and Precision and is a number between 0 and 1. The closer to 1, the better. The closer to 0, the worse. 
That is, it is a balance between Recall and Precision.\n", + "\n", + "$$F1\\_Score= 2\\left(\\frac{Recall*Precision}{Recall+Precision}\\right)$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rsH9dMxazWCg" + }, + "source": [ + "# **EXAMPLE DATAFRAME USED IN THIS TUTORIAL**\n", + "> Generate a dataframe with 18 columns: 9 informative, 6 redundant and 3 repeated:\n", + "\n", + "To learn more about generating example (toy) dataframes, see [Synthetic data generation — a must-have skill for new data scientists](https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GEyDo_EIV_jV" + }, + "source": [ + "## Define global variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TdwgpZ76WFaT" + }, + "source": [ + "i_CV = 10 # Number of Cross-Validation folds\n", + "i_Seed = 20111974 # seed, for reproducibility\n", + "f_Test_Size = 0.3 # Proportion of the validation dataframe (other values could be 0.15, 0.20 or 0.25)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gJTJfpwWzykS" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "\n", + "X, y = make_classification(n_samples = 1000, \n", + "                           n_features = 18, \n", + "                           n_informative = 9, \n", + "                           n_redundant = 6, \n", + "                           n_repeated = 3, \n", + "                           n_classes = 2, \n", + "                           n_clusters_per_class = 1, \n", + "                           random_state=i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gWy2IZh3s-o3" + }, + "source": [ + "X" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ccjhGnzxtAaV" + }, + "source": [ + "y[0:30] # Similar to fraud cases: {0, 1}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"OHO2befKJxR3" + }, + "source": [ + "___\n", + "# **DECISION TREE**\n", + "> Decision Trees have a tree-shaped structure.\n", + "\n", + "* **Main advantages**:\n", + "  * They are easy to understand, visualize and interpret;\n", + "  * They easily capture non-linear patterns in the data;\n", + "  * They require little computing power --> Training Decision Trees does not need many computational resources!\n", + "  * They handle numerical and categorical COLUMNS well;\n", + "  * They do not require the data to be normalized;\n", + "  * They can be used for Feature Engineering when dealing with Missing Values;\n", + "  * They can be used for Feature Selection;\n", + "  * They make no assumptions about the distribution of the data, because the algorithm is non-parametric.\n", + "\n", + "* **Main disadvantages**\n", + "  * Prone to overfitting, because Decision Trees can grow complex trees that do not generalize well. Things get much worse if the training sample contains outliers. Therefore, **treating outliers beforehand is strongly recommended**.\n", + "  * They can produce biased trees if the dataframe is unbalanced or one class is dominant. For this reason, **balancing the dataframe beforehand is recommended**.\n", + "\n", + "* **Main parameters (split criteria)**\n", + "  * **Gini Index** - a metric that measures how often a randomly selected point/observation would be misclassified.\n", + "    * Therefore, the lower the Gini Index, the better the COLUMN;\n", + "  * **Entropy** - a metric that measures the randomness of the information in the data.\n", + "    * Therefore, the higher the entropy of a COLUMN, the less it helps us reach a conclusion (to classify, for example).\n", + "\n", + "## **References**:\n", + "* [1.10. 
Decision Trees](https://scikit-learn.org/stable/modules/tree.html).\n", + "* [Decision Tree Algorithm With Hands On Example](https://medium.com/datadriveninvestor/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38) - a great tutorial for learning, understanding, interpreting and computing the Gini and entropy indices.\n", + "* [Intuitive Guide to Understanding Decision Trees](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-decision-trees-adb2165ccab7) - a great tutorial for learning, understanding, interpreting and computing the Gini and entropy indices.\n", + "* [The Complete Guide to Decision Trees](https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14)\n", + "* [Creating and Visualizing Decision Tree Algorithm in Machine Learning Using Sklearn](https://intellipaat.com/blog/decision-tree-algorithm-in-machine-learning/) - Very didactic!\n", + "* [Decision Trees in Machine Learning](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FrMkPN5aLp0Y" + }, + "source": [ + "## Load the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FVU1CM0PKgO4" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "15clh4XrISpz" + }, + "source": [ + "## Load/Read the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UMPL46w2IWJw" + }, + "source": [ + "l_colunas = ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18']\n", + "\n", + "df_X = pd.DataFrame(X, columns = l_colunas)\n", + "df_y = pd.DataFrame(y, columns = ['target'])" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFaQF2MGFl_M" + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "s-ibdD2ZG7tm" + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "f9cqRaywa_TR" + }, + "source": [ + "set(df_y['target'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BN6jbpn6Iwmu" + }, + "source": [ + "## Basic descriptive statistics of the dataframe - df.describe()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KlwhxxUNIyYs" + }, + "source": [ + "df_X.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N_QhFqyZOKFB" + }, + "source": [ + "## Select the training and validation samples\n", + "\n", + "* Split the data/sample into:\n", + "  * **Training sample**: used to train the model and tune the hyperparameters;\n", + "  * **Test sample**: used to check whether the tuned model works on completely unseen data. It is on this test sample that we evaluate how well the model generalizes (handles data it has not been shown);\n", + "* We usually use 70% of the sample for training and 30% for validation. 
Other options are the 80/20 or 75/25 (default) splits.\n", + "* See [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for more details.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8sKBgs-QOOfn" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TPTKBBHgOpoA", + "outputId": "3c8ab56e-2746-4310-df58-9b16986b9413", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_train.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lEn_LLs2OtRI", + "outputId": "7e53d785-2595-4ba6-c229-ac02b99d3c55", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_train.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_uAw8EcyOvrG", + "outputId": "00356053-c127-40d1-8bdd-d769af9ef0e2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_test.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A2LYI-9hOyXI", + "outputId": 
"b4f4b728-0bee-435e-e697-27768787d43e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_test.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "npgoBSX2dd4l" + }, + "source": [ + "## Train the algorithm on the training data\n", + "### Load the algorithms/libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcvzrtolGfnQ", + "outputId": "b0d2ab18-7386-461b-d5f5-8e1880496244", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "!pip install graphviz\n", + "!pip install pydotplus" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (0.10.1)\n", + "Requirement already satisfied: pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)\n", + "Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from pydotplus) (2.4.7)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_pF-HH3JKL2" + }, + "source": [ + "from sklearn.metrics import accuracy_score # to measure the accuracy of the predictive model\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix # to plot the confusion matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV # to tune the parameters of the predictive models\n", + "from sklearn.model_selection import cross_val_score\n", + "from time import time\n", + "from operator import itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from sklearn.tree import export_graphviz\n", + "from 
io import StringIO # replaces sklearn.externals.six, which was removed in scikit-learn 0.23\n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ROlyvgij2yl" + }, + "source": [ + "Function for plotting the Confusion Matrix, taken from [Confusion Matrix Visualization](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klQ0FLOIgeX1" + }, + "source": [ + "def mostra_confusion_matrix(cf, \n", + "                            group_names = None, \n", + "                            categories = 'auto', \n", + "                            count = True, \n", + "                            percent = True, \n", + "                            cbar = True, \n", + "                            xyticks = False, \n", + "                            xyplotlabels = True, \n", + "                            sum_stats = True, figsize = (8, 8), \n", + "                            cmap = 'Blues'):\n", + "    '''\n", + "    This function will make a pretty plot of an sklearn Confusion Matrix cf using a Seaborn heatmap visualization.\n", + "    Arguments\n", + "    ---------\n", + "    cf: confusion matrix to be passed in\n", + "    group_names: List of strings that represent the labels row by row to be shown in each square.\n", + "    categories: List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'\n", + "    count: If True, show the raw number in the confusion matrix. Default is True.\n", + "    percent: If True, show the proportions for each category. Default is True.\n", + "    cbar: If True, show the color bar. The cbar values are based off the values in the confusion matrix.\n", + "        Default is True.\n", + "    xyticks: If True, show x and y ticks. Default is False.\n", + "    xyplotlabels: If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.\n", + "    sum_stats: If True, display summary statistics below the figure. Default is True.\n", + "    figsize: Tuple representing the figure size. 
Default is (8, 8); pass None to use the matplotlib rcParams value.\n", + "    cmap: Colormap of the values displayed from matplotlib.pyplot.cm. Default is 'Blues'\n", + "        See http://matplotlib.org/examples/color/colormaps_reference.html\n", + "    '''\n", + "\n", + "    # CODE TO GENERATE TEXT INSIDE EACH SQUARE\n", + "    blanks = ['' for i in range(cf.size)]\n", + "\n", + "    if group_names and len(group_names)==cf.size:\n", + "        group_labels = [\"{}\\n\".format(value) for value in group_names]\n", + "    else:\n", + "        group_labels = blanks\n", + "\n", + "    if count:\n", + "        group_counts = [\"{0:0.0f}\\n\".format(value) for value in cf.flatten()]\n", + "    else:\n", + "        group_counts = blanks\n", + "\n", + "    if percent:\n", + "        group_percentages = [\"{0:.2%}\".format(value) for value in cf.flatten()/np.sum(cf)]\n", + "    else:\n", + "        group_percentages = blanks\n", + "\n", + "    box_labels = [f\"{v1}{v2}{v3}\".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]\n", + "    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])\n", + "\n", + "    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS\n", + "    if sum_stats:\n", + "        #Accuracy is sum of diagonal divided by total observations\n", + "        accuracy = np.trace(cf) / float(np.sum(cf))\n", + "\n", + "        #if it is a binary confusion matrix, show some more stats\n", + "        if len(cf)==2:\n", + "            #Metrics for Binary Confusion Matrices\n", + "            precision = cf[1,1] / sum(cf[:,1])\n", + "            recall = cf[1,1] / sum(cf[1,:])\n", + "            f1_score = 2*precision*recall / (precision + recall)\n", + "            stats_text = \"\\n\\nAccuracy={:0.3f}\\nPrecision={:0.3f}\\nRecall={:0.3f}\\nF1 Score={:0.3f}\".format(accuracy,precision,recall,f1_score)\n", + "        else:\n", + "            stats_text = \"\\n\\nAccuracy={:0.3f}\".format(accuracy)\n", + "    else:\n", + "        stats_text = \"\"\n", + "\n", + "    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS\n", + "    if figsize is None:\n", + "        #Get default figure size if not set\n", + "        figsize = 
plt.rcParams.get('figure.figsize')\n", + "\n", + "    if xyticks==False:\n", + "        #Do not show categories if xyticks is False\n", + "        categories=False\n", + "\n", + "    # MAKE THE HEATMAP VISUALIZATION\n", + "    plt.figure(figsize=figsize)\n", + "    sns.heatmap(cf,annot=box_labels,fmt=\"\",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)\n", + "\n", + "    if xyplotlabels:\n", + "        plt.ylabel('True label')\n", + "        plt.xlabel('Predicted label' + stats_text)\n", + "    else:\n", + "        plt.xlabel(stats_text)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YJMS9ePQ6B6t" + }, + "source": [ + "**Attention**: min_samples_split = 2 is the default of DecisionTreeClassifier. To reduce overfitting, consider larger values of min_samples_split (or limiting max_depth)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nNeRHYePJc-r" + }, + "source": [ + "from sklearn.tree import DecisionTreeClassifier # Library for Decision Tree (classification)\n", + "\n", + "# Instantiate with the default parameters (baseline, before any tuning).\n", + "# min_impurity_split and presort are omitted: they were removed from recent scikit-learn releases.\n", + "ml_DT= DecisionTreeClassifier(criterion = 'gini', \n", + "                              splitter = 'best', \n", + "                              max_depth = None, \n", + "                              min_samples_split = 2, \n", + "                              min_samples_leaf = 1, \n", + "                              min_weight_fraction_leaf = 0.0, \n", + "                              max_features = None, \n", + "                              random_state = i_Seed, \n", + "                              max_leaf_nodes = None, \n", + "                              min_impurity_decrease = 0.0, \n", + "                              class_weight = None)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gVLZznprx2YX", + "outputId": "956487e9-beb3-4638-c305-786d7e06c0c0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 120 + } + }, + "source": [ + "# Configured object\n", + "ml_DT" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " 
max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=None, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OgAHfXVo-Nw8", + "outputId": "10fed276-0cf3-4149-e5d1-784e736a2841", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 120 + } + }, + "source": [ + "# Treina o algoritmo: fit(df)\n", + "ml_DT.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=None, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ohmGCDpfyhvV", + "outputId": "fee641eb-64d0-4072-874c-f704c6a70cfe", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "i_CV" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "10" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6exa9D8R2fDJ", + "outputId": "5bfc98af-bd00-440d-b504-ab499254c533", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT, X_train, y_train, cv = i_CV)\n", + "\n", + "print(f'Média das Acurácias calculadas pelo CV....: 
{100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 91.43\n", + "std médio das Acurácias calculadas pelo CV: 3.8899999999999997\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uxoplcea0byV", + "outputId": "578c5e51-c311-4cdf-c5ad-0de8fedd4e17", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_scores_CV # array com os scores a cada iteração do CV" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.87142857, 0.98571429, 0.85714286, 0.91428571, 0.9 ,\n", + " 0.95714286, 0.91428571, 0.92857143, 0.87142857, 0.94285714])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y3k-PcbN0o_i", + "outputId": "0334a08d-8d2b-4687-ccda-65c6eac86759", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_scores_CV.mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9142857142857144" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6_rYker2gzeG" + }, + "source": [ + "**Interpretação**: Nosso classificador (DecisionTreeClassifier) tem uma acurácia média de 91,43% (base de treinamento). Além disso, o std é da ordem de 3,66%, ou seja, pequena. Vamos tentar melhorar a acurácia do classificador usando parameter tunning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tkwchmkP3p_A", + "outputId": "8b157dfc-f416-49d2-d185-3cf8ebfa13b0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácias: [0.87142857 0.98571429 0.85714286 0.91428571 0.9 0.95714286\n", + " 0.91428571 0.92857143 0.87142857 0.94285714]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sI31WkZs2ht_" + }, + "source": [ + "# Faz predições...\n", + "y_pred = ml_DT.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rfapj3OG13PG", + "outputId": "af6e5144-5cdb-4017-885e-e398508d9cf5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y_pred[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,\n", + " 1, 0, 0, 1, 1, 0, 1, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sc88ofqh16RT", + "outputId": "4c2d7859-fa1a-4ecb-ea61-9ec399e439de", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fSaVzJ9xFpwW", + "outputId": "12eb1946-18c6-4369-af9d-916b5a0fc42d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + 
"# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p8D975NqsGtj" + }, + "source": [ + "## Parameter tunning\n", + "### Referência\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - Explica didaticamente e step by step como fazer parameter tunning." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bfdq5zEhlVsk" + }, + "source": [ + "# Dicionário de parâmetros para o parameter tunning. Ao todo serão ajustados 2X13X5X5X7= 4.550 modelos. Contando com 10 folds no Cross-Validation, então são 45.500 modelos.\n", + "d_parametros_DT= {\"criterion\": [\"gini\", \"entropy\"]} #, \"min_samples_split\": [2, 5, 10, 30, 50, 70, 90, 120, 150, 180, 210, 240, 270, 350, 400], \"max_depth\": [None, 2, 5, 9, 15], \"min_samples_leaf\": [20, 40, 60, 80, 100], \"max_leaf_nodes\": [None, 2, 3, 4, 5, 10, 15]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H8gNSs0G0A-L" + }, + "source": [ + "```\n", + "grid_search = GridSearchCV(ml_DT, param_grid= d_parametros_DT, cv = i_CV, n_jobs= -1)\n", + "start = time()\n", + "grid_search.fit(X_train, y_train)\n", + "tempo_elapsed= time()-start\n", + "print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos para estimar {len(grid_search.cv_results_)} modelos candidatos\")\n", + "\n", + "GridSearchCV levou 1999.12 segundos para estimar 23 modelos candidatos\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ap3WMXqDthu9" + }, + "source": [ + "# Definindo a função para o GridSearchCV\n", + "def 
GridSearchOptimizer(modelo, ml_Opt, d_Parametros, X_train, y_train, X_test, y_test, cv = i_CV):\n", + " ml_GridSearchCV = GridSearchCV(modelo, d_Parametros, cv = i_CV, n_jobs= -1, verbose= 10, scoring= 'accuracy')\n", + " start = time()\n", + " ml_GridSearchCV.fit(X_train, y_train)\n", + " tempo_elapsed= time()-start\n", + " #print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos.\")\n", + "\n", + " # Parâmetros que otimizam a classificação:\n", + " print(f'\\nParametros otimizados: {ml_GridSearchCV.best_params_}')\n", + " \n", + " if ml_Opt == 'ml_DT2':\n", + " print(f'\\nDecisionTreeClassifier *********************************************************************************************************')\n", + " ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_leaf_nodes= ml_GridSearchCV.best_params_['max_leaf_nodes'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_split'], \n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_RF2':\n", + " print(f'\\nRandomForestClassifier *********************************************************************************************************')\n", + " ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_features= ml_GridSearchCV.best_params_['max_features'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + " n_estimators= ml_GridSearchCV.best_params_['n_estimators'],\n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_AB2':\n", + " print(f'\\nAdaBoostClassifier *********************************************************************************************************')\n", + " 
ml_Opt = AdaBoostClassifier(algorithm='SAMME.R', \n", + " base_estimator=RandomForestClassifier(bootstrap = False, \n", + " max_depth = 10, \n", + " max_features = 'auto', \n", + " min_samples_leaf = 1, \n", + " min_samples_split = 2, \n", + " n_estimators = 400), \n", + " learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + " n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + " random_state = i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_GB2':\n", + " print(f'\\nGradientBoostingClassifier *********************************************************************************************************')\n", + " ml_Opt = GradientBoostingClassifier(learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + " n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + " max_depth = ml_GridSearchCV.best_params_['max_depth'], \n", + " min_samples_split = ml_GridSearchCV.best_params_['min_samples_split'], \n", + " min_samples_leaf = ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + " max_features = ml_GridSearchCV.best_params_['max_features'])\n", + " \n", + " elif ml_Opt == 'ml_XGB2':\n", + " print(f'\\nXGBoostingClassifier *********************************************************************************************************')\n", + " ml_Opt = XGBoostingClassifier(learning_rate= ml_GridSearchCV.best_params_['learning_rate'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'], \n", + " colsample_bytree= ml_GridSearchCV.best_params_['colsample_bytree'], \n", + " subsample= ml_GridSearchCV.best_params_['subsample'], \n", + " gamma= ml_GridSearchCV.best_params_['gamma'], \n", + " min_child_weight= ml_GridSearchCV.best_params_['min_child_weight'])\n", + " \n", + " # Treina novamente usando os parametros otimizados...\n", + " ml_Opt.fit(X_train, y_train)\n", + "\n", + " # Cross-Validation com 10 folds\n", + " print(f'\\n********* CROSS-VALIDATION ***********')\n", + " a_scores_CV = cross_val_score(ml_Opt, 
X_train, y_train, cv = i_CV)\n", + " print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + " print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')\n", + "\n", + " # Faz predições com os parametros otimizados...\n", + " y_pred = ml_Opt.predict(X_test)\n", + " \n", + " # Importância das COLUNAS\n", + " print(f'\\n********* IMPORTÂNCIA DAS COLUNAS ***********')\n", + " df_importancia_variaveis = pd.DataFrame(zip(l_colunas, ml_Opt.feature_importances_), columns= ['coluna', 'importancia'])\n", + " df_importancia_variaveis = df_importancia_variaveis.sort_values(by= ['importancia'], ascending=False)\n", + " print(df_importancia_variaveis)\n", + "\n", + " # Matriz de Confusão\n", + " print(f'\\n********* CONFUSION MATRIX - PARAMETER TUNNING ***********')\n", + " cf_matrix = confusion_matrix(y_test, y_pred)\n", + " cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + " cf_categories = ['Zero', 'One']\n", + " mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)\n", + "\n", + " return ml_Opt, ml_GridSearchCV.best_params_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "44-BRnNjBT25" + }, + "source": [ + "# Invoca a função\n", + "ml_DT2, best_params = GridSearchOptimizer(ml_DT, 'ml_DT2', d_parametros_DT, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gmCkjGjPJMLr" + }, + "source": [ + "### Visualizar o resultado" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cIc3ZgaISEd0" + }, + "source": [ + "from sklearn.tree import export_graphviz\n", + "from sklearn.externals.six import StringIO \n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "dot_data = StringIO()\n", + "export_graphviz(ml_DT2, out_file = dot_data, filled = True, rounded 
= True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n", + "\n", + "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", + "graph.write_png('DecisionTree.png')\n", + "Image(graph.create_png())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e1R2GBkbnV37" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vv7GKBvs6Ybf" + }, + "source": [ + "# Função desenvolvida para Selecionar COLUNAS relevantes\n", + "from sklearn.feature_selection import SelectFromModel\n", + "\n", + "def seleciona_colunas_relevantes(modelo, X_train, X_test, threshold = 0.05):\n", + " # Cria um seletor para selecionar as COLUNAS com importância > threshold\n", + " sfm = SelectFromModel(modelo, threshold)\n", + " \n", + " # Treina o seletor\n", + " sfm.fit(X_train, y_train)\n", + "\n", + " # Mostra o indice das COLUNAS mais importantes\n", + " print(f'\\n********** COLUNAS Relevantes ******')\n", + " print(sfm.get_support(indices=True))\n", + "\n", + " # Seleciona somente as COLUNAS relevantes\n", + " X_train_I = sfm.transform(X_train)\n", + " X_test_I = sfm.transform(X_test)\n", + " return X_train_I, X_test_I " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ukMLoEr7nbUf" + }, + "source": [ + "X_train_DT, X_test_DT = seleciona_colunas_relevantes(ml_DT2, X_train, X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8JjePRQAoqkk" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gt3aCPpfKRxm" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq6uCVtzovMt" + }, + "source": [ + "# Treina usando as COLUNAS relevantes...\n", 
+ "ml_DT2.fit(X_train_DT, y_train)\n", + "\n", + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT2, X_train_DT, y_train, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Tc7esxqtq-Og" + }, + "source": [ + "****************************************************************" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "znWy3LE1q-Z3" + }, + "source": [ + "ml_DT3, best_params2 = GridSearchOptimizer(ml_DT2, 'ml_DT2', d_parametros_DT, X_train_DT, y_train, X_test_DT, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IhCC6pfq-jL" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qw6Dk3kesT0q" + }, + "source": [ + "best_params2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "SbS4ZKN8s-ee" + }, + "source": [ + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT3, X_train_DT, y_train, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_at3XP1Bq-qb" + }, + "source": [ + "***************************************************************" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MZ1-vGRcxJoN" + }, + "source": [ + "## Valida o modelo usando o dataframe X_test" + ] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "ig9GiUAEw9jr" + }, + "source": [ + "y_pred_DT = ml_DT2.predict(X_test_DT)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7UZz4UzHDqae" + }, + "source": [ + "# Calcula acurácia\n", + "accuracy_score(y_test, y_pred_DT)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3EUMAxxKBur" + }, + "source": [ + "___\n", + "# **RANDOM FOREST**\n", + "* Decision Trees possuem estrutura em forma de árvores.\n", + "* Random Forest pode ser utilizado tanto para classificação (RandomForestClassifier)quanto para Regressão (RandomForestRegressor).\n", + "\n", + "* **Vantagens**:\n", + " * Não requer tanto data preprocessing;\n", + " * Lida bem com COLUNAS categóricas e numéricas;\n", + " * É um Boosting Ensemble Method (pois constrói muitas árvores). Estes modelos aprendem com os próprios erros e ajustam as árvores de modo a fazer melhores classificações;\n", + " * Mais robusta que uma simples Decision Tree. **Porque?**\n", + " * Controla automaticamente overfitting (**porque?**) e frequentemente produz modelos muito robustos e de alta-performance.\n", + " * Pode ser utilizado como Feature Selection, pois gera a matriz de importância dos atributos (importance sample). A soma das importâncias soma 100;\n", + " * Assim como as Decision Trees, esses modelos capturam facilmente padrões não-lineares presentes nos dados;\n", + " * Não requer os dados sejam normalizados;\n", + " * Lida bem com Missing Values;\n", + " * Não requer suposições (assumptions) sobre a distribuição dos dados por causa da natureza não-paramétrica do algoritmo\n", + "\n", + "* **Desvantagens**\n", + " * **Recomenda-se balancear o dataframe previamente para se evitar esse problema**.\n", + "\n", + "* **Principais parâmetros**\n", + "\n", + "## **Referências**:\n", + "* [Running Random Forests? 
Inspect the feature importances with this code](https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e)\n", + "* [Feature importances with forests of trees](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)\n", + "* [Understanding Random Forests Classifiers in Python](https://www.datacamp.com/community/tutorials/random-forests-classifier-python)\n", + "* [Understanding Random Forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)\n", + "* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76)\n", + "* [Random Forest Simple Explanation](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)\n", + "* [Random Forest Explained](https://www.youtube.com/watch?v=eM4uJ6XGnSM)\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) - Explica os principais parâmetros do Random Forest." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cnfDw_GEKBuu" + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instancia...\n", + "ml_RF= RandomForestClassifier(n_estimators=100, min_samples_split= 2, max_features=\"auto\", random_state= i_Seed)\n", + "\n", + "# Treina...\n", + "ml_RF.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lYa9oaZW__o6" + }, + "source": [ + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_RF, X_train, y_train, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AouWUu8vANdb" + }, + "source": [ + "**Interpretação**: Nosso classificador (RandomForestClassifier) tem uma acurácia média de 96,44% (base de treinamento). Além disso, o std é da ordem de 2,77%, ou seja, pequena. Vamos tentar melhorar a acurácia do classificador usando parameter tunning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vbducxlgAa85" + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lxx-LUw_5sd" + }, + "source": [ + "# Faz predições...\n", + "y_pred = ml_RF.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQIRO_LpGAkw" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yKLHZ5_C6FJ8" + }, + "source": [ + "## Parameter tunning\n", + "### Referência\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - Explica didaticamente e step by step como fazer parameter tunning.\n", + "* [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) - Outro approach para entender parameter tunning. Recomendo fortemente a leitura! 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XOa9naju6FKA" + }, + "source": [ + "# Dicionário de parâmetros para o parameter tunning.\n", + "d_parametros_RF= {'bootstrap': [True, False]} #,\n", + "# 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],\n", + "# 'max_features': ['auto', 'sqrt'],\n", + "# 'min_samples_leaf': [1, 2, 4],\n", + "# 'min_samples_split': [2, 5, 10],\n", + "# 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6__f2jZaTQat" + }, + "source": [ + "# Invoca a função\n", + "ml_RF2, best_params = GridSearchOptimizer(ml_RF, 'ml_RF2', d_parametros_RF, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "crfn-n--KG4n" + }, + "source": [ + "### Resultado da execução do Random Forest\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 7920 out of 7920 | elapsed: 194.0min finished\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SGTOe5PaRw59" + }, + "source": [ + "# Como o procedimento acima levou 194 minutos para executar, então vou estimar ml_RF2 abaixo usando os parâmetros acima estimados\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HMJcAdLlTQa0" + }, + "source": [ + "## Visualizar o resultado\n", + "> Implementar a visualização do RandomForest." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WWNiy7Z0TQa3" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kOi11YOKTQa4" + }, + "source": [ + "X_train_RF, X_test_RF = seleciona_colunas_relevantes(ml_RF2, X_train, X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_O7c_DTQbE" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UwEOwzSGTQbF" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rr8qDrgvTQbL" + }, + "source": [ + "# Treina com as COLUNAS relevantes...\n", + "ml_RF2.fit(X_train_RF, y_train)\n", + "\n", + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_RF2, X_train_RF, y_train, cv = i_CV)\n", + "print(f'Acurácia Media: {100*a_scores_CV.mean():.2f}')\n", + "print(f'std médio.....: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-mYfQLlsTQbQ" + }, + "source": [ + "## Valida o modelo usando o dataframe X_test" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sSD5o1JQTQbR" + }, + "source": [ + "y_pred_RF = ml_RF2.predict(X_test_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wywF6LymDzKr" + }, + "source": [ + "# Calcula acurácia\n", + "accuracy_score(y_test, y_pred_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hJJsL0IJb6iO" + }, + "source": [ + "## 
Estudo do comportamento dos parametros do algoritmo\n", + "> Consulte [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) para mais detalhes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "navUWMwHi44D" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_train, \n", + " y_train, \n", + " param_name=\"n_estimators\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label = \"Training score\", color = \"black\")\n", + "plt.plot(param_range, test_mean, label = \"Cross-validation score\", color = \"dimgrey\")\n", + "\n", + "# Plot accurancy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color = \"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color = \"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Number Of Trees\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc = \"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "rv7TIM9kjsud" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_train, \n", + " y_train, \n", + " param_name = \"max_depth\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accurancy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Number Of Trees\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lm_fPGYwkJYc" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_train, \n", + " 
y_train, \n", + " param_name='min_samples_leaf', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot (the x axis is the swept parameter, min_samples_leaf)\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Min Samples Leaf\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CAqdiSaVlAB8" + }, + "source": [ + "param_range = np.arange(0.05, 1, 0.05)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_train, \n", + " y_train, \n", + " param_name='min_samples_split', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 
1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot (the x axis is the swept parameter, min_samples_split as a fraction of samples)\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Min Samples Split\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cX_gfsbQSdNd" + }, + "source": [ + "___\n", + "# **BOOSTING MODELS**\n", + "* These algorithms are widely used in Kaggle competitions;\n", + "* They are used to improve the performance of Machine Learning algorithms;\n", + "* Models:\n", + " - [X] AdaBoost\n", + " - [X] XGBoost\n", + " - [X] LightGBM\n", + " - [X] GradientBoosting\n", + " - [X] CatBoost\n", + "\n", + "## Bagging vs Boosting vs Stacking\n", + "### **Bagging**\n", + "* The goal is to reduce variance;\n", + "\n", + "#### How it works\n", + "* Draws several samples **WITH REPLACEMENT** from the training dataframe. Each sample is used to train a model using Decision Trees. The result is an ensemble of many different models (Decision Trees). 
The predictions of these many different models (Decision Trees) are averaged to produce the final result;\n", + "* The final result is more robust than the one from a single Decision Tree.\n", + "\n", + "![Bagging](https://github.com/MathMachado/Materials/blob/master/Bagging.png?raw=true)\n", + "\n", + "Source: [Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm](https://hackernoon.com/how-to-develop-a-robust-algorithm-c38e08f32201).\n", + "\n", + "#### Steps\n", + "* Suppose a dataframe X_train (training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Bagging randomly draws a sample **WITH REPLACEMENT** from X_train;\n", + " 2. Bagging randomly selects M2 (M2 < M) COLUMNS from the dataframe drawn in step (1);\n", + " 3. It builds a Decision Tree with the M2 COLUMNS from step (2) and the dataframe from step (1), and the COLUMNS are evaluated by their ability to classify the observations;\n", + " 4. Steps (1) -> (2) -> (3) are repeated K times (that is, K Decision Trees), so that the COLUMNS are ranked by their predictive power and the final result (accuracy, for example) is obtained by aggregating the predictions of the K Decision Trees.\n", + "\n", + "#### Advantages\n", + "* Reduces overfitting;\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Handles Missing Values automatically;\n", + "\n", + "#### Disadvantage\n", + "* The final prediction is based on the average of the K Decision Trees, which can compromise the final accuracy.\n", + "\n", + "___ \n", + "### **Boosting**\n", + "* The goal is to improve accuracy;\n", + "\n", + "#### How it works\n", + "* The classifiers are used sequentially, so that the classifier at step N learns from the errors of the classifier at step N-1. 
In other words, the goal is to improve precision/accuracy at each step by learning from the past.\n", + "\n", + "![Boosting](https://github.com/MathMachado/Materials/blob/master/Boosting.png?raw=true)\n", + "\n", + "Source: [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205), Joseph Rocca.\n", + "\n", + "#### Steps\n", + "* Suppose a dataframe X_train (training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Boosting randomly draws a sample D1 WITHOUT replacement from X_train;\n", + " 2. Boosting trains classifier C1 on D1;\n", + " 3. Boosting randomly draws a SECOND sample D2 WITHOUT replacement from X_train and adds to D2 50% of the observations that were misclassified, in order to train classifier C2;\n", + " 4. Boosting finds in X_train the sample D3 on which classifiers C1 and C2 disagree, and trains C3 on it;\n", + " 5. It combines (by vote) the predictions of C1, C2 and C3 to produce the final result.\n", + "\n", + "#### Advantages\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Handles Missing Values automatically;\n", + "\n", + "#### Disadvantages\n", + "* Prone to overfitting. Treating outliers beforehand is recommended.\n", + "* Requires careful tuning of the hyperparameters;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9fgUrkmPk4dr" + }, + "source": [ + "___\n", + "# STACKING\n", + "\n", + "![Stacking](https://github.com/MathMachado/Materials/blob/master/Stacking.png?raw=true)\n", + "\n", + "TODO: add the source of this figure." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B0jxx3ETpOdm" + }, + "source": [ + "___\n", + "# **BOOTSTRAPPING METHODS**\n", + "> Before talking about Boosting or Bagging, we first need to understand what Bootstrap is: Bagging (Bootstrap Aggregating) is built directly on it, and Boosting also relies on resampling or reweighting the training data.\n", + "\n", + "* In Statistics (and in Machine Learning), Bootstrap refers to drawing random samples WITH replacement from the population X." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyqazmUuifkE" + }, + "source": [ + "___\n", + "# **ADABOOST (Adaptive Boosting)**\n", + "* When nothing else works, AdaBoost works!\n", + "* It was one of the first Boosting algorithms (1995);\n", + "* AdaBoost can be used for both classification (AdaBoostClassifier) and regression (AdaBoostRegressor);\n", + "* By default, AdaBoost uses DecisionTree algorithms as base_estimator;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RU-vzkXqrFVw" + }, + "source": [ + "## References\n", + "* [AdaBoost Classifier Example In Python](https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464) - Didactic, and explains exactly how AdaBoost works.\n", + "* [Adaboost for Dummies: Breaking Down the Math (and its Equations) into Simple Terms](https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf) - For those who want to understand the math behind the algorithm.\n", + "* [Gradient Boosting and XGBoost](https://medium.com/hackernoon/gradient-boosting-and-xgboost-90862daa6c77)\n", + "* [Understanding AdaBoost](https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe), Akash Desarda." + ] + }, 
+ { + "cell_type": "markdown", + "metadata": { + "id": "6EMrjQDZIMl_" + }, + "source": [ + "## What is AdaBoost (Adaptive Boosting)?\n", + "* It is an ensemble classifier (it combines several classifiers to increase accuracy).\n", + "* AdaBoost is a strong, iterative classifier that combines (ensemble) several weak classifiers to improve accuracy.\n", + "* Any Machine Learning algorithm can be used as the base classifier (base_estimator parameter), as long as it supports sample weights;\n", + "\n", + "## Most important AdaBoost parameters:\n", + "* base_estimator (renamed to estimator in scikit-learn 1.2) - The classifier used to train the model. By default, AdaBoost uses DecisionTreeClassifier. As mentioned above, different algorithms can be used for this purpose.\n", + "* n_estimators - Number of base_estimator instances to train iteratively.\n", + "* learning_rate - Controls the contribution of each base_estimator to the final solution/combination;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TzLtHzWNJBix" + }, + "source": [ + "## Using different algorithms as base_estimator\n", + "> As mentioned above, several kinds of base_estimator can be used in AdaBoost. 
For example, to use SVM (Support Vector Machines), proceed as follows:\n", + "\n", + "\n", + "```\n", + "# Import the base_estimator class and AdaBoost\n", + "from sklearn.svm import SVC\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "\n", + "# Instantiate the base classifier (algorithm)\n", + "ml_SVC= SVC(probability=True, kernel='linear')\n", + "\n", + "# Build the AdaBoost model\n", + "ml_AB = AdaBoostClassifier(n_estimators= 50, base_estimator=ml_SVC, learning_rate=1)\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hrj4a4s6hMMB" + }, + "source": [ + "## Advantages\n", + "* AdaBoost is easy to implement;\n", + "* AdaBoost iteratively corrects the errors of the base_estimator and improves accuracy;\n", + "* Performs Feature Selection automatically (**Why**?);\n", + "* Many algorithms can be used as base_estimator;\n", + "* Since it is an ensemble method, the final model is less prone to overfitting.\n", + "\n", + "## Disadvantages\n", + "* AdaBoost is sensitive to noise in the data;\n", + "* Heavily affected by outliers (which contributes to overfitting), because the algorithm tries to fit every point as well as possible;\n", + "* AdaBoost is slower than XGBoost;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bgJmu7YLiyv7" + }, + "source": [ + "In the following example, I will use RandomForestClassifier with the optimized parameters, that is:\n", + "\n", + "```\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5VCRNyZT3qvc" + }, + "source": [ + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1gIboJdriq61" + }, + "source": [ + "from sklearn.ensemble import 
AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate RandomForestClassifier - optimized parameters!\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)\n", + "# Instantiate AdaBoostClassifier\n", + "ml_AB= AdaBoostClassifier(n_estimators=100, base_estimator= ml_RF2, random_state= i_Seed)\n", + "\n", + "# Train...\n", + "ml_AB.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "A4Cs81OLD40y" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_AB, X_train, y_train, cv = i_CV)\n", + "print(f'Mean of the CV accuracies....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'Std of the CV accuracies.....: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7Ce5L38ECoC" + }, + "source": [ + "**Interpretation**: Our classifier (AdaBoostClassifier) has a mean accuracy of 96.72% (training set). In addition, the std is around 2.54%, which is small. Let's try to improve the classifier's accuracy with parameter tuning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t5GfnBwEifkO" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q9rSpuXyEPA5" + }, + "source": [ + "# Make predictions with the optimized parameters...\n", + "y_pred = ml_AB.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2F9k-_eXGDLa" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XweWTjQ9EXLw" + }, + "source": [ + "## Parameter tuning" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fcrKzse9EbL_" + }, + "source": [ + "# Parameter dictionary for parameter tuning.\n", + "d_parametros_AB = {'n_estimators':[50, 100, 200], 'learning_rate':[.001, 0.01, 0.05, 0.1, 0.3,1]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Susc3I7mFDQX" + }, + "source": [ + "# Call the function\n", + "ml_AB2, best_params= GridSearchOptimizer(ml_AB, 'ml_AB2', d_parametros_AB, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w4JjWsusjNS8" + }, + "source": [ + "___\n", + "# **GRADIENT BOOSTING**\n", + "* Gradient Boosting can be used for classification (GradientBoostingClassifier) and regression (GradientBoostingRegressor) problems;\n", + "* Gradient Boosting is a refinement of AdaBoost (recall that AdaBoost was one of the first Boosting methods, created in 1995). 
What Gradient Boosting does in addition to AdaBoost is minimize the loss (loss function), i.e., minimize the difference between the observed values of y and the predicted values.\n", + "* It uses Gradient Descent to find the shortcomings in the previous step's predictions. Gradient Descent is a popular and powerful algorithm, also used in Neural Networks;\n", + "* The goal of Gradient Boosting is to minimize the 'loss function'. Therefore, Gradient Boosting depends on the \"loss function\".\n", + "* Gradient Boosting uses DecisionTree algorithms as base_estimator;\n", + "\n", + "## Advantages\n", + "* No pre-processing is needed;\n", + "* Works with numeric or categorical COLUMNS;\n", + "* Handles Missing Values automatically. That is, there is no need to apply Missing Value Imputation methods;\n", + "\n", + "## Disadvantages\n", + "* Because Gradient Boosting keeps trying to minimize the errors at each iteration, it can overemphasize outliers and cause overfitting. Therefore, one should:\n", + " * Treat the outliers beforehand OR\n", + " * Use Cross-Validation to neutralize the effects of the outliers (**I prefer this method, since it takes less time**);\n", + "* Computationally expensive. 
Many trees (> 1000) are usually needed to get good results;\n", + "* Because of its flexibility (many parameters to tune), GridSearchCV is needed to find the optimal combination of hyperparameters;\n", + "\n", + "## References\n", + "* [Gradient Boosting Decision Tree Algorithm Explained](https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4) - Didactic and detailed.\n", + "* [Predicting Wine Quality with Gradient Boosting Machines](https://towardsdatascience.com/predicting-wine-quality-with-gradient-boosting-machines-a-gmb-tutorial-d950b1542065)\n", + "* [Parameter Tuning in Gradient Boosting (GBM) with Python](https://www.datacareer.de/blog/parameter-tuning-in-gradient-boosting-gbm/)\n", + "* [Tune Learning Rate for Gradient Boosting with XGBoost in Python](https://machinelearningmastery.com/tune-learning-rate-for-gradient-boosting-with-xgboost-in-python/)\n", + "* [In Depth: Parameter tuning for Gradient Boosting](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae) - Very good\n", + "* [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q4bUCZs2jNTA" + }, + "source": [ + "from sklearn.ensemble import GradientBoostingClassifier\n", + "\n", + "# Instantiate...\n", + "ml_GB=GradientBoostingClassifier(n_estimators=100, min_samples_split= 2)\n", + "\n", + "# Train...\n", + "ml_GB.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-dr6dyjdXwvd" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_GB, X_train, y_train, cv = i_CV)\n", + "print(f'Mean of the CV accuracies....: 
{100*round(a_scores_CV.mean(),4)}')\n", + "print(f'Std of the CV accuracies.....: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VlC3y3M5YaGG" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vnLvQ0ZDYNjB" + }, + "source": [ + "**Interpretation**: Our classifier (GradientBoostingClassifier) has a mean accuracy of 96.86% (training set). In addition, the std is around 2.52%, which is small. Let's try to improve the classifier's accuracy with parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2n1RKZuXq3D" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_GB.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8r6JCzQRGFa0" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KFv-Q2AD5uCk" + }, + "source": [ + "## Parameter tuning\n", + "> See [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/) for details about the parameters, their meaning, etc." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wgU040AcjNTF" + }, + "source": [ + "# Parameter dictionary for parameter tuning.\n", + "d_parametros_GB= {'learning_rate': [1, 0.5, 0.25, 0.1, 0.05, 0.01]} #,\n", + "# 'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200],\n", + "# 'max_depth': [5, 10, 15, 20, 25, 30],\n", + "# 'min_samples_split': [0.1, 0.3, 0.5, 0.7, 0.9],\n", + "# 'min_samples_leaf': [0.1, 0.2, 0.3, 0.4, 0.5],\n", + "# 'max_features': list(range(1, X_train.shape[1]))}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v5KLFlpTjNTH" + }, + "source": [ + "# Call the function\n", + "ml_GB2, best_params= GridSearchOptimizer(ml_GB, 'ml_GB2', d_parametros_GB, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ6ERz3fi9i2" + }, + "source": [ + "### Result of running Gradient Boosting" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSa7uKw13mKG" + }, + "source": [ + "```\n", + "[Parallel(n_jobs=-1)]: Done 275400 out of 275400 | elapsed: 93.7min finished\n", + "\n", + "Parametros otimizados: {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wiJpA2PyjDjR" + }, + "source": [ + "# Since the procedure above took 93 minutes to run, I will build ml_GB2 below using the parameters estimated above\n", + "best_params= {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "\n", + "#ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + "# max_depth= best_params['max_depth'],\n", + "# max_features= best_params['max_features'],\n", + "# min_samples_leaf= 
best_params['min_samples_leaf'],\n", + "# min_samples_split= best_params['min_samples_split'],\n", + "# n_estimators= best_params['n_estimators'],\n", + "# random_state= i_Seed)\n", + "\n", + "ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + " max_depth= best_params['max_depth'],\n", + " min_samples_leaf= best_params['min_samples_leaf'],\n", + " min_samples_split= best_params['min_samples_split'],\n", + " n_estimators= best_params['n_estimators'],\n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mb14gJ7-jbVM" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TAqGZIFYm2sU" + }, + "source": [ + "X_train_GB, X_test_GB = seleciona_colunas_relevantes(ml_GB2, X_train, X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6yiu6dahnBvC" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "APrtWN18nc4t" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS0mLdOmnXAY" + }, + "source": [ + "# Train with the relevant COLUMNS\n", + "ml_GB2.fit(X_train_GB, y_train)\n", + "\n", + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_GB2, X_train_GB, y_train, cv = i_CV)\n", + "print(f'Mean accuracy..: {100*a_scores_CV.mean():.2f}')\n", + "print(f'Std of accuracy: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vmc9PP_Rn1TN" + }, + "source": [ + "## Validate the model using the X_test dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e3mnIALvnzP2" + }, + "source": [ + "y_pred_GB = ml_GB2.predict(X_test_GB)\n", + 
"\n", + "# Compute the accuracy\n", + "accuracy_score(y_test, y_pred_GB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwP9Z2GnkV7r" + }, + "source": [ + "___\n", + "# **XGBOOST (eXtreme Gradient Boosting)**\n", + "* XGBoost is an improvement on Gradient Boosting. The improvements are in speed and performance, in addition to fixing inefficiencies of GradientBoosting.\n", + "* The algorithm preferred by Kaggle Grandmasters;\n", + "* Parallelizable;\n", + "* State of the art in Machine Learning;\n", + "\n", + "## Relevant parameters and their initial values\n", + "See [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) for full details about the parameters, their meaning, etc.\n", + "\n", + "* n_estimators = 100 (100 if the dataframe is large; 1000 if it is medium/small) - The number of trees to build;\n", + "* max_depth = 3 - Determines how deep each tree can grow during any training round. Typical values lie in the range [3, 10];\n", + "* learning_rate = 0.01 - Used to prevent overfitting, range: [0, 1];\n", + "* alpha (reg_alpha in XGBClassifier) - L1 regularization on the weights. Higher values result in more regularization;\n", + "* lambda (reg_lambda in XGBClassifier) - L2 regularization on the weights.\n", + "* colsample_bytree: 1 - fraction of COLUMNS used by each tree. A high value can cause overfitting;\n", + "* subsample: 0.8 - fraction of samples used per tree. A low value can lead to underfitting;\n", + "* gamma: 1 - Controls whether a given node is split, based on the expected loss reduction after the split. A higher value leads to fewer splits.\n", + "* objective: Defines the \"loss function\". 
The options are:\n", + " * reg:linear - For regression problems (named reg:squarederror in recent XGBoost versions);\n", + " * reg:logistic - For classification problems;\n", + " * binary:logistic - For classification problems, with probability output;\n", + "\n", + "## References\n", + "* [How exactly XGBoost Works?](https://medium.com/@pushkarmandot/how-exactly-xgboost-works-a320d9b8aeef)\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Gentle Introduction of XGBoost Library](https://medium.com/@imoisharma/gentle-introduction-of-xgboost-library-2b1ac2669680)\n", + "* [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7)\n", + "* [Exploring XGBoost](https://towardsdatascience.com/exploring-xgboost-4baf9ace0cf6)\n", + "* [Feature Importance and Feature Selection With XGBoost in Python](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/)\n", + "* [Ensemble Learning case study: Running XGBoost on Google Colab free GPU](https://towardsdatascience.com/running-xgboost-on-google-colab-free-gpu-a-case-study-841c90fef101) - Recommended\n", + "* [Predicting movie revenue with AdaBoost, XGBoost and LightGBM](https://towardsdatascience.com/predicting-movie-revenue-with-adaboost-xgboost-and-lightgbm-262eadee6daa)\n", + "* [Tuning XGBoost Hyperparameters with Scikit Optimize](https://towardsdatascience.com/how-to-improve-the-performance-of-xgboost-models-1af3995df8ad)\n", + "* [An Example of Hyperparameter Optimization on XGBoost, LightGBM and CatBoost using Hyperopt](https://towardsdatascience.com/an-example-of-hyperparameter-optimization-on-xgboost-lightgbm-and-catboost-using-hyperopt-12bc41a271e) - Interesting\n", + "* [XGBOOST vs LightGBM: Which algorithm wins the race !!!](https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d) - LightGBM has proven interesting.\n", + "* [From Zero to Hero in XGBoost Tuning](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58) - I liked it\n", + "* [Build XGBoost / LightGBM models on large datasets — what are the possible solutions?](https://towardsdatascience.com/build-xgboost-lightgbm-models-on-large-datasets-what-are-the-possible-solutions-bf882da2c27d)\n", + "* [Selecting Optimal Parameters for XGBoost Model Training](https://towardsdatascience.com/selecting-optimal-parameters-for-xgboost-model-training-c7cd9ed5e45e) - Very good!\n", + "* [CatBoost vs. Light GBM vs. XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db)\n" + ] + }, 
+ { + "cell_type": "code", + "metadata": { + "id": "iMM_R4_ukV7x" + }, + "source": [ + "from xgboost import XGBClassifier\n", + "import xgboost as xgb\n", + "\n", + "# Instantiate...\n", + "ml_XGB= XGBClassifier(silent=False, \n", + " scale_pos_weight=1,\n", + " learning_rate=0.01, \n", + " colsample_bytree = 1,\n", + " subsample = 0.8,\n", + " objective='binary:logistic', \n", + " n_estimators=1000, \n", + " reg_alpha = 0.3,\n", + " max_depth= 3, \n", + " gamma=1, \n", + " max_delta_step=5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E4wQMlDEFINR" + }, + "source": [ + "# Train...\n", + "ml_XGB.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zAhsTtwGqMkG" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_XGB, X_train, y_train, cv = i_CV)\n", + "print(f'Mean of the CV accuracies....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'Std of the CV accuracies.....: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, 
+ { + "cell_type": "markdown", + "metadata": { + "id": "JNyKX6PkrXOk" + }, + "source": [ + "**Interpretation**: Our classifier (XGBClassifier) has a mean accuracy of 96.72% (training set). In addition, the std is around 2.02%, which is small. Let's try to improve the classifier's accuracy with parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_h0QYv3FkV73" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AKhhAZLjkV76" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_XGB.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ir2Kd1PqGHgz" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jEC7gW4qYpWw" + }, + "source": [ + "## Parameter tuning\n", + "### Further reading:\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)\n", + "\n", + "> Looking at the results above, which model is best?\n", + "\n", + "XGBoost? Assuming so, let's now fine-tune the model's parameters." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n3MsUONPwIV9" + }, + "source": [ + "# Dicionário de parâmetros para XGBoost:\n", + "d_parametros_XGB = {'min_child_weight': [i for i in np.arange(1, 13)]} #,\n", + "# 'gamma': [i for i in np.arange(0, 5, 0.5)],\n", + "# 'subsample': [0.6, 0.8, 1.0],\n", + "# 'colsample_bytree': [0.6, 0.8, 1.0],\n", + "# 'max_depth': [3, 4, 5, 7, 9],\n", + "# 'learning_rate': [i for i in np.arange(0.01, 1, 0.1)]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CX27FCKmwSni" + }, + "source": [ + "# Invoca a função\n", + "ml_XGB, best_params= GridSearchOptimizer(ml_XGB, 'ml_XGB2', d_parametros_XGB, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9b7uCuF74Hjv" + }, + "source": [ + "### Resultado da execução do XGBoostClassifier\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 108000 out of 108000 | elapsed: 372.0min finished\n", + "\n", + "Parametros otimizados: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n7E0oyxEtbGi" + }, + "source": [ + "# Como o procedimento acima levou 372 minutos para executar, então vou estimar ml_XGB2 abaixo usando os parâmetros acima estimados\n", + "best_params= {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "\n", + "ml_XGB2= XGBClassifier(min_child_weight= best_params['min_child_weight'], \n", + " gamma= best_params['gamma'], \n", + " subsample= best_params['subsample'], \n", + " colsample_bytree= best_params['colsample_bytree'], \n", + " max_depth= best_params['max_depth'], \n", + " learning_rate= best_params['learning_rate'], \n", + " random_state= i_Seed)" + ], + "execution_count": null, + 
"outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CuqyLHTU5Z-j" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes\n", + "* [The Multiple faces of ‘Feature importance’ in XGBoost](https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QPG3JZIpRZ-T" + }, + "source": [ + "# plot feature importance\n", + "from xgboost import plot_importance\n", + "\n", + "xgb.plot_importance(ml_XGB2, color = 'red')\n", + "plt.title('importance', fontsize = 20)\n", + "plt.yticks(fontsize = 10)\n", + "plt.ylabel('features', fontsize = 20)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EmpRC2lHW-KP" + }, + "source": [ + "ml_XGB2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4f9MIEBiyq-5" + }, + "source": [ + "X_train_XGB, X_test_XGB= seleciona_colunas_relevantes(ml_XGB2, X_train, X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6EayWaY5nMm" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Huy18gKI5qad" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E3-PaTdc5vZk" + }, + "source": [ + "# Treina com as COLUNAS relevantes...\n", + "ml_XGB2.fit(X_train_XGB, y_train)\n", + "\n", + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_XGB2, X_train_XGB, y_train, cv = i_CV)\n", + "print(f'Acurácia Media: {100*a_scores_CV.mean():.2f}')\n", + "print(f'std médio.....: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tBdYikDU6NhD" + }, + "source": [ + "## Valida o modelo 
usando o dataframe X_test" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GcvY-VdL6VIZ" + }, + "source": [ + "y_pred_XGB = ml_XGB2.predict(X_test_XGB)\n", + "\n", + "# Calcula acurácia\n", + "accuracy_score(y_test, y_pred_XGB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8oLtdH-vTSbC" + }, + "source": [ + "xgb.to_graphviz(ml_XGB2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "czXQG3MCHfHM" + }, + "source": [ + "# KNN - KNEIGHBORSCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "llTTXNeyHiwx" + }, + "source": [ + "# BAGGINGCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fbkekd4QHoZO" + }, + "source": [ + "# EXTRATREESCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "widavwR4HzwE" + }, + "source": [ + "# SVM\n", + "https://data-flair.training/blogs/svm-support-vector-machine-tutorial/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "id_Ubulns6We" + }, + "source": [ + "# NAIVE BAYES" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3e0m7lEnYOV9" + }, + "source": [ + "# **IMPORTANCIA DAS COLUNAS**\n", + "Source: [Plotting Feature Importances](https://www.kaggle.com/grfiv4/plotting-feature-importances)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fjco0HnNYr-N" + }, + "source": [ + "def mostra_feature_importances(clf, X_train, y_train=None, \n", + " top_n=10, figsize=(8,8), print_table=False, title=\"Feature Importances\"):\n", + " '''\n", + " plot feature importances of a tree-based sklearn estimator\n", + " \n", + " Note: X_train and y_train are pandas DataFrames\n", + " \n", + " Note: Scikit-plot is a lovely package but I sometimes have issues\n", + " 1. flexibility/extendibility\n", + " 2. 
complicated models/datasets\n", + " But for many situations Scikit-plot is the way to go\n", + " see https://scikit-plot.readthedocs.io/en/latest/Quickstart.html\n", + " \n", + " Parameters\n", + " ----------\n", + " clf (sklearn estimator) if not fitted, this routine will fit it\n", + " \n", + " X_train (pandas DataFrame)\n", + " \n", + " y_train (pandas DataFrame) optional\n", + " required only if clf has not already been fitted \n", + " \n", + " top_n (int) Plot the top_n most-important features\n", + " Default: 10\n", + " \n", + " figsize ((int,int)) The physical size of the plot\n", + " Default: (8,8)\n", + " \n", + " print_table (boolean) If True, print out the table of feature importances\n", + " Default: False\n", + " \n", + " Returns\n", + " -------\n", + " the pandas dataframe with the features and their importance\n", + " \n", + " Author\n", + " ------\n", + " George Fisher\n", + " '''\n", + " \n", + " __name__ = \"mostra_feature_importances\"\n", + " \n", + " import pandas as pd\n", + " import numpy as np\n", + " import matplotlib.pyplot as plt\n", + " \n", + " from xgboost.core import XGBoostError\n", + " from lightgbm.sklearn import LightGBMError\n", + " \n", + " try: \n", + " if not hasattr(clf, 'feature_importances_'):\n", + " clf.fit(X_train.values, y_train.values.ravel())\n", + "\n", + " if not hasattr(clf, 'feature_importances_'):\n", + " raise AttributeError(\"{} does not have feature_importances_ attribute\".\n", + " format(clf.__class__.__name__))\n", + " \n", + " except (XGBoostError, LightGBMError, ValueError):\n", + " clf.fit(X_train.values, y_train.values.ravel())\n", + " \n", + " feat_imp = pd.DataFrame({'importance':clf.feature_importances_}) \n", + " feat_imp['feature'] = X_train.columns\n", + " feat_imp.sort_values(by ='importance', ascending = False, inplace = True)\n", + " feat_imp = feat_imp.iloc[:top_n]\n", + " \n", + " feat_imp.sort_values(by='importance', inplace = True)\n", + " feat_imp = feat_imp.set_index('feature', drop = 
True)\n", + " feat_imp.plot.barh(title=title, figsize=figsize)\n", + " plt.xlabel('Feature Importance Score')\n", + " plt.show()\n", + " \n", + " if print_table:\n", + " from IPython.display import display\n", + " print(\"Top {} features in descending order of importance\".format(top_n))\n", + " display(feat_imp.sort_values(by = 'importance', ascending = False))\n", + " \n", + " return feat_imp" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ycu_EIGlYUYn" + }, + "source": [ + "import pandas as pd\n", + "\n", + "from xgboost import XGBClassifier\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", + "from sklearn.tree import ExtraTreeClassifier\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import GradientBoostingClassifier\n", + "from sklearn.ensemble import BaggingClassifier\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "from lightgbm import LGBMClassifier\n", + "\n", + "clfs = [XGBClassifier(), LGBMClassifier(), \n", + " ExtraTreesClassifier(), ExtraTreeClassifier(),\n", + " BaggingClassifier(), DecisionTreeClassifier(),\n", + " GradientBoostingClassifier(), LogisticRegression(),\n", + " AdaBoostClassifier(), RandomForestClassifier()]\n", + "\n", + "for clf in clfs:\n", + " try:\n", + " _ = mostra_feature_importances(clf, X_train, y_train, top_n=X_train.shape[1], title=clf.__class__.__name__)\n", + " except AttributeError as e:\n", + " print(e)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EwWkjfC8KEZH" + }, + "source": [ + "# ENSEMBLE METHODS\n", + "https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b\n", + "\n", + "![Ensemble](https://github.com/MathMachado/Materials/blob/master/Ensemble.png?raw=true)" + ] + 
}, + { + "cell_type": "markdown", + "metadata": { + "id": "3Uf1RML7xETY" + }, + "source": [ + "# WOE e IV\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TBNRfYZCyhMP" + }, + "source": [ + "## Construção do exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gIIroyyP4ZRZ" + }, + "source": [ + "df_y.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PzQQdrkf1ohX" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo']= choices(['A', 'B', 'C', 'D'], k= 1000)\n", + "df_X2['idade']= np.random.randint(10, 15, size= 1000)\n", + "df_X2['target']= df_y['target']\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v-OpwIpx4hXJ" + }, + "source": [ + "df_X2['target'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yZfqSvbKzeJ3" + }, + "source": [ + "def Constroi_Buckets(df, i, k= 10):\n", + " coluna= 'v'+ str(i)\n", + " df[coluna+'_Bucket']= pd.cut(df[coluna], bins= k, labels= np.arange(1, k+1))\n", + " df= df.drop(columns= [coluna], axis= 1)\n", + " return df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "V6Nrpsx60HD3" + }, + "source": [ + "for i in np.arange(1,19):\n", + " df_X2= Constroi_Buckets(df_X2, i)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J2Fbh41-03OB" + }, + "source": [ + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "O9r5BeWVxIr3" + }, + "source": [ + "# Função para calcular WOE e IV\n", + "def calculate_woe_iv(dataset, feature, target):\n", + "\n", + " def codethem(IV):\n", + " if IV < 0.02: return 'Useless'\n", + " elif IV >= 0.02 and IV < 0.1: return 'Weak'\n", + " elif IV >= 
0.1 and IV < 0.3: return 'Medium'\n", + " elif IV >= 0.3 and IV < 0.5: return 'Strong'\n", + " elif IV >= 0.5: return 'Suspicious'\n", + " else: return 'None'\n", + "\n", + " lst = []\n", + " for i in range(dataset[feature].nunique()):\n", + " val = list(dataset[feature].unique())[i]\n", + " lst.append({\n", + " 'Value': val,\n", + " 'All': dataset[dataset[feature] == val].count()[feature],\n", + " 'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],\n", + " 'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]\n", + " })\n", + " \n", + " dset = pd.DataFrame(lst)\n", + " dset['Distr_Good'] = dset['Good']/dset['Good'].sum()\n", + " dset['Distr_Bad'] = dset['Bad']/dset['Bad'].sum()\n", + " dset['Mean']= dset['All']/dset['All'].sum()\n", + " dset['WoE'] = np.log(dset['Distr_Good']/dset['Distr_Bad'])\n", + " dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})\n", + " dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']\n", + " #dset= dset.drop(columns= ['Distr_Good', 'Distr_Bad'], axis= 1)\n", + "\n", + " dset['Predictive_Power']= dset['IV'].map(codethem)\n", + " iv = dset['IV'].sum() \n", + " dset = dset.sort_values(by='IV') \n", + " return dset, iv" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y8WGjWH63nx_" + }, + "source": [ + "df_Lab = df_X2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-N6xr1MgxTiz" + }, + "source": [ + "def calcula_Predictive_Power(df_Lab, coluna):\n", + " print('WoE and IV for column: {}'.format(coluna))\n", + " df, iv = calculate_woe_iv(df_Lab, coluna, 'target')\n", + " print(df)\n", + " print('IV score: {:.2f}'.format(iv))\n", + " print('\\n')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ayqN_7WnxVq9" + }, + "source": [ + "for i in np.arange(1,19):\n", + " coluna= 
'v'+str(i)+'_Bucket'\n", + " calcula_Predictive_Power(df_Lab, coluna)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtoJVI4Pyx3I" + }, + "source": [ + "# **IMBALANCED SAMPLE**\n", + "> Some tasks, such as detecting fraud in banking transactions or detecting network intrusions, have in common that the class of interest (what we want to detect) is usually a rare event.\n", + "\n", + "## Example: detecting fraud\n", + "The ratio of FRAUD to NON-FRAUD is roughly 1%/99%. In this case, if we build a fraud-detection model and it classifies every instance as NON-FRAUD, the model will have an accuracy of 99%. However, such a model is of no use to us, because it never detects a single fraud.\n", + "\n", + "## The need for other metrics \n", + "> It is recommended to use other metrics (in fact, it is good practice to use more than one metric to measure model performance) such as F1-Score, Precision, Recall/Sensitivity, Specificity and AUROC.\n", + "\n", + "## How to deal with an imbalanced sample?\n", + "* Under-sampling\n", + "> Randomly samples the MAJORITY class (in our example, NON-FRAUD) down to the number of instances of the MINORITY class (FRAUD);\n", + "\n", + "* Over-sampling\n", + "> Randomly resamples the MINORITY class (in our example, FRAUD), with replacement, up to the number of instances of the MAJORITY class (NON-FRAUD), or up to a proportion of the MAJORITY class. 
See SMOTE (Synthetic Minority Over-sampling Technique), implemented in the imbalanced-learn library;\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2o45zx8zw-aB" + }, + "source": [ + "## EFFECTS OF AN IMBALANCED SAMPLE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCVTPCB-Xkbd" + }, + "source": [ + "# TPOT\n", + "https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2ulXii6JXpWd" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_TWUq-z4X4yZ" + }, + "source": [ + "___\n", + "# FEATURETOOLS\n", + "https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183\n", + "\n", + "https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/\n", + "\n", + "https://mlwhiz.com/blog/2019/05/19/feature_extraction/\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aZUNOgmSgAmq" + }, + "source": [ + "!pip install featuretools" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_sxdONzsh9rb" + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "p5_ynGo1dBJJ" + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TqJRJXUhiDqf" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo'] = choices(['A', 'B', 'C', 'D'], k = 1000)\n", + "df_X2['idade'] = np.random.randint(10, 15, size = 1000)\n", + "df_X2['id'] = range(0,1000)\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nR56bGGngk-W" + }, + "source": [ + "# Automated 
feature engineering\n", + "import featuretools as ft\n", + "import featuretools.variable_types as vtypes\n", + "\n", + "es= ft.EntitySet(id = 'simulacao')\n", + "\n", + "# adding a dataframe \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id')\n", + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IOJ4Tr5Ogk6M" + }, + "source": [ + "es['df_X2'].variables" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1uXPqHDZgkys" + }, + "source": [ + "variable_types = {'idade': vtypes.Categorical}\n", + " \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id', variable_types= variable_types)\n", + "\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'tipo', index='id')\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'idade', index='id')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dnbYTBqugkvm" + }, + "source": [ + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I2v_jetdgkr7" + }, + "source": [ + "feature_matrix, feature_names = ft.dfs(entityset=es, target_entity = 'df_X2', max_depth = 3, verbose = 3, n_jobs= 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zZiRBvHXgkoJ" + }, + "source": [ + "feature_matrix.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aWiahwKe2d6U" + }, + "source": [ + "# **EXERCÍCIOS**\n", + "> Encontre algoritmos adequados para ser aplicados aos seguintes problemas:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XbSLkbDB2mzK" + }, + "source": [ + "## Exercício 1 - Credit Card Fraud Detection\n", + "Source: [Credit Card Fraud 
Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)\n", + "\n", + "### Supporting reading\n", + "* [Detecting Credit Card Fraud Using Machine Learning](https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8)\n", + "* [Credit Card Fraud Detection](https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59)\n", + "\n", + "### Dataframe\n", + "* [Creditcard.csv](https://raw.githubusercontent.com/MathMachado/DataFrames/master/creditcard.csv)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYgK6JXd3MgA" + }, + "source": [ + "## Exercício 2 - Predicting species on IRIS dataset\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "si0rsJvu3O6O" + }, + "source": [ + "from sklearn import datasets\n", + "import xgboost as xgb\n", + "\n", + "iris = datasets.load_iris()\n", + "X_iris = iris.data\n", + "y_iris = iris.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zom8t4yWC_UC" + }, + "source": [ + "## Exercício 3 - Predict Wine Quality\n", + "> Estimate wine quality on a scale of 0–100. 
The quality bands on this scale are:\n", + "\n", + "* 95–100 Classic: a great wine\n", + "* 90–94 Outstanding: a wine of superior character and style\n", + "* 85–89 Very good: a wine with special qualities\n", + "* 80–84 Good: a solid, well-made wine\n", + "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n", + "* 50–74 Not recommended\n", + "\n", + "Source: [Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klL2Q9Ria96n" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Wine = datasets.load_wine()\n", + "X_vinho = Wine.data\n", + "y_vinho = Wine.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lhVhSWBgGijq" + }, + "source": [ + "## Exercício 4 - Predict Parkinson\n", + "Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SVCxHqv0VBJn" + }, + "source": [ + "## Exercício 5 - Predict survivors from Titanic tragedy\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwvB8us4eKNi" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "\n", + "df_titanic = sns.load_dataset('titanic')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZJrT9YIXVdtx" + }, + "source": [ + "## Exercício 6 - Predict Loan\n", + "> The data must be obtained directly from the source: [Loan Default Prediction - Imperial College London](https://www.kaggle.com/c/loan-default-prediction/data)\n", + "\n", + "* [Bank Loan Default Prediction](https://medium.com/@wutianhao910/bank-loan-default-prediction-94d4902db740)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R8-GVu7ZWeA8" + }, + "source": [ + "## Exercício 7 - Predict the sales of a store.\n", + "* 
[Predicting expected sales for Bigmart’s stores](https://medium.com/diogo-menezes-borges/project-1-bigmart-sale-prediction-fdc04f07dc1e)\n", + "* Dataframes\n", + " * [Training](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_train.txt)\n", + " * [Validation](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_test.txt)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fv9w86j4Wnwj" + }, + "source": [ + "## Exercício 8 - [The Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)\n", + "> Predict the median value of owner-occupied homes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5HYRt8-ug1BT" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2, so this needs an older version\n", + "X_boston = Boston.data\n", + "y_boston = Boston.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UDIaqmtXQ0T" + }, + "source": [ + "## Exercício 9 - Predict the height or weight of a person.\n", + "\n", + "http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-7R146nIXmMT" + }, + "source": [ + "## Exercício 10 - Black Friday Sales Prediction - Predict purchase amount.\n", + "\n", + "This dataset comprises sales transactions captured at a retail store. It’s a classic dataset to explore and expand your feature engineering skills and day-to-day understanding from multiple shopping experiences. This is a regression problem. 
The dataset has 550,069 rows and 12 columns.\n", + "\n", + "https://github.com/MathMachado/DataFrames/blob/master/blackfriday.zip\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mQ8FPbuLZlIh" + }, + "source": [ + "## Exercício 11 - Predict the income class of US population.\n", + "\n", + "http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Af4NRrchgPlM" + }, + "source": [ + "## Exercício 12 - Predicting Cancer\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c4LOlgZW3P40" + }, + "source": [ + "from sklearn import datasets\n", + "cancer = datasets.load_breast_cancer()\n", + "X_cancer = cancer.data\n", + "y_cancer = cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74PmpT8Ix0tD" + }, + "source": [ + "## Exercício 13\n", + "Source: [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WY8GZMixZ9W9" + }, + "source": [ + "## Exercício 14 - Predict Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y92t6tbOge0S" + }, + "source": [ + "from sklearn import datasets\n", + "Diabetes= datasets.load_diabetes()\n", + "\n", + "X_diabetes = Diabetes.data\n", + "y_diabetes = Diabetes.target" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15_00__Machine_Learning___DSWP_hs.ipynb b/Notebooks/NB15_00__Machine_Learning___DSWP_hs.ipynb new file mode 100644 index 000000000..fac30d1e5 --- /dev/null +++ b/Notebooks/NB15_00__Machine_Learning___DSWP_hs.ipynb @@ -0,0 +1,4316 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "name": "python3", + 
"display_name": "Python 3" + }, + "colab": { + "name": "NB15_00__Machine_Learning.ipynb", + "provenance": [], + "include_colab_link": true + }, + "accelerator": "TPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ShVXyGj9wkgN" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYQ4cDfcPu4e" + }, + "source": [ + "___\n", + "# **NOTAS E OBSERVAÇÕES**\n", + "* Abordar o impacto do desbalanceamento da amostra;\n", + "* Colocar AUROC no material e mostrar o cut off para classificação entre 0 e 1." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YvhLC_uf4_G" + }, + "source": [ + "___\n", + "# **AGENDA**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QgX6n2VDyY1O" + }, + "source": [ + "___\n", + "# **REFERÊNCIAS**\n", + "* [scikit-learn - Machine Learning With Python](https://scikit-learn.org/stable/);\n", + "* [An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples](https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer)\n", + "* [The Difference Between Artificial Intelligence, Machine Learning, and Deep Learning](https://medium.com/iotforall/the-difference-between-artificial-intelligence-machine-learning-and-deep-learning-3aa67bff5991)\n", + "* [A Gentle Guide to Machine Learning](https://blog.monkeylearn.com/a-gentle-guide-to-machine-learning/)\n", + "* [A Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n", + "* [Introduction to Machine Learning](http://alex.smola.org/drafts/thebook.pdf)\n", + "* [The 10 Statistical Techniques Data Scientists Need to Master](https://medium.com/cracking-the-data-science-interview/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7)\n", + "* [Tune: a library for fast hyperparameter tuning at any scale](https://towardsdatascience.com/fast-hyperparameter-tuning-at-scale-d428223b081c)\n", + "* [How to lie with Data Science](https://towardsdatascience.com/how-to-lie-with-data-science-5090f3891d9c)\n", + "* [5 Reasons “Logistic Regression” should be the first thing you learn when becoming a Data 
Scientist](https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4)\n", + "* [Machine learning on categorical variables](https://towardsdatascience.com/machine-learning-on-categorical-variables-3b76ffe4a7cb)\n", + "\n", + "## Deep Learning & Neural Networks\n", + "\n", + "- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html)\n", + "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721)\n", + "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsCbZd2epfxo" + }, + "source": [ + "___\n", + "# **INTRODUCTION**\n", + "\n", + "* \"__Information is the oil of the 21st century, and analytics is the combustion engine__.\" - Peter Sondergaard, SVP, Gartner Research;\n", + "\n", + "\n", + ">The focus of this chapter will be:\n", + "* Linear Regression, Logistic Regression, Decision Tree, Random Forest, Support Vector Machine and XGBoost algorithms for building Machine Learning models;\n", + "* Understanding how to solve classification and regression problems;\n", + "* Applying ensemble techniques such as Bagging and Boosting;\n", + "* Measuring the accuracy of Machine Learning models;\n", + "* Learning the main Machine Learning algorithms for both supervised and unsupervised learning.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HqqB2vaHXMGt" + }, + "source": [ + "___\n", + "# **ARTIFICIAL INTELLIGENCE VS MACHINE LEARNING VS DEEP LEARNING**\n", + "* **Machine Learning** - gives computers the ability to learn without being explicitly programmed. 
Computers can improve their ability to learn by practicing a task, usually using large datasets.\n", + "* **Deep Learning** - is a Machine Learning method that relies on artificial neural networks, allowing computer systems to learn by example, much as we humans do." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P961GcguXFFA" + }, + "source": [ + "![EvolutionOfAI](https://github.com/MathMachado/Materials/blob/master/Evolution%20of%20AI.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://github.com/MathMachado/P4ML/blob/DS_Python/Material/Evolution%20of%20AI.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkqGtO88ZkPr" + }, + "source": [ + "![AI_vs_ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/AI_vs_ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xesQpzfmaqj6" + }, + "source": [ + "![ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. 
Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KeIVR59IIS7f" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING - TECHNIQUES**\n", + "\n", + "* Supervised Learning\n", + "* Unsupervised Learning\n", + "\n", + "![MachineLearning](https://github.com/MathMachado/Materials/blob/master/MachineLearningTechniques.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rvwp5UHdBiup" + }, + "source": [ + "___\n", + "# **OUR FOCUS HERE WILL BE...**\n", + "\n", + "![ClassicalML](https://github.com/MathMachado/Materials/blob/master/ClassicalML.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cBLSvJTXHBjK" + }, + "source": [ + "___\n", + "# **CHEAT SHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZdjR3nahUuKq" + }, + "source": [ + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MkBSvyorGXQz" + }, + "source": [ + "___\n", + "# **CROSS-VALIDATION**\n", + "> Cross-validation (CV) is a technique in which we train our model on one subset of the training dataframe X and validate it on another subset of the same training dataframe, typically rotating the roles so that every subset is used for validation once. 
A figura abaixo nos ajuda a entender como funciona CV:\n", + "\n", + "![Cross-Validation](https://github.com/MathMachado/Materials/blob/master/CV2.PNG?raw=true)\n", + "\n", + "Source: [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + "\n", + "* **Vantagens do uso de CV**:\n", + " * Modelos com melhor acurácia;\n", + " * Melhor uso dos dados, pois todos os dados são utilizados como treinamento e validação. Portanto, qualquer problema com os dados serão encontrados nesta fase.\n", + "\n", + "* **Leitura Adicional**\n", + " * [Cross-Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)\n", + " * [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + " * [Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yBR8tWV_lhQq" + }, + "source": [ + "___\n", + "# **ENSEMBLE METHODS** (=Combinar modelos preditivos)\n", + "* Métodos\n", + " * **Bagging** (Bootstrap AGGregatING)\n", + " * **Boosting**\n", + " * Stacking --> Não é muito utilizado\n", + "* Evita overfitting (Overfitting é quando o modelo/função se ajusta muito bem o dados, sendo ineficiente para generalizar para outras amostras/população).\n", + "* Constroi meta-classificadores: combinar os resultados de vários algoritmos para produzir previsões mais precisas e robustas do que as previsões de cada classificador individual.\n", + "* Ensemble reduz/minimiza os efeitos das principais causas de erros nos modelos de Machine Learning:\n", + " * ruído;\n", + " * bias (viés);\n", + " * variância\n", + "\n", + "# Referências\n", + "* [Simple 
guide for ensemble learning methods](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2) - Explica didaticamente como funcionam ensembes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25RW8u-Sj780" + }, + "source": [ + "### Leitura Adicional\n", + "* [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)\n", + "* [Ensemble Methods in Machine Learning: What are They and Why Use Them?](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)\n", + "* [Ensemble Learning Using Scikit-learn](https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a)\n", + "* [Let’s Talk About Machine Learning Ensemble Learning In Python](https://medium.com/fintechexplained/lets-talk-about-machine-learning-ensemble-learning-in-python-382747e5fba8)\n", + "* [Boosting, Bagging, and Stacking — Ensemble Methods with sklearn and mlens](https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FugME1HSl4jJ" + }, + "source": [ + "___\n", + "# **PARAMETER TUNNING**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u_147cIRl9F1" + }, + "source": [ + "## GridSearch\n", + "* Encontra os parâmetros ótimos (hyperparameter tunning) que melhoram a acurácia dos modelos.\n", + "* Necessita dos seguintes inputs:\n", + " * A matrix $X_{p}$ com as $p$ COLUNAS (variáveis ou atributos) do dataframe;\n", + " * A matriz $y_{p}$ com a COLUNA-target;\n", + " * Exemplo: DecisionTree, RandomForestClassifier, XGBoostClassificer e etc;\n", + " * Um dicionário com os parâmetros a serem otimizados;\n", + " * O número de folds para o método de Cross-validation." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "39Sg77fbTWCO" + }, + "source": [ + "___\n", + "# **MODEL SELECTION & EVALUATION**\n", + "> Nesta fase identificamos e aplicamos as melhores métricas (Accuracy, Sensitivity, Specificity, F-Score, AUC, R-Sq, Adj R-SQ, RMSE (Root Mean Square Error)) para avaliar o desempenho/acurácia/performance dos modelos de ML.\n", + ">> Treinamos os modelos de ML usando a amostra de treinamento e avaliamos o desempenho/acurácia/performance na amostra de teste/validação.\n", + "\n", + "* Leitura Adicional\n", + " * [The 5 Classification Evaluation metrics every Data Scientist must know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)\n", + " * [Confusion matrix and other metrics in machine learning](https://medium.com/hugo-ferreiras-blog/confusion-matrix-and-other-metrics-in-machine-learning-894688cb1c0a)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oQQVzZ2ZTYrB" + }, + "source": [ + "## Confusion Matrix\n", + "* Termos associados à Confusion Matrix:\n", + " * **Verdadeiro Positivo** (TP = True Positive): Quando o valor observado é True e o modelo estima como True. Ou seja, o modelo acertou na estimativa.\n", + " * Exemplo: **Observado**: Fraude (Positive); **Modelo**: Fraude (Positive) --> Modelo acertou!\n", + " * **Verdadeiro Negativo** (TN = True Negative): Quando o valor observado é False e o modelo estima como False. Ou seja, o modelo acertou na estimativa;\n", + " * Exemplo: **Observado**: NÃO-Fraude (Negative); **Modelo**: NÃO-Fraude (Negative) --> Modelo acertou!\n", + " * **Falso Positivo** (FP = False Positive): Quando o valor observado é False e o modelo estima como True. Ou seja, o modelo errou na estimativa. 
\n", + " * Exemplo: **Observado**: NÃO-Fraude (Negative); **Modelo**: Fraude (Positive) --> Modelo errou!\n", + " * **Falso Negativo** (FN = False Negative): Quando o valor observado é True e o modelo estima como False.\n", + " * Exemplo: **Observado**: Fraude (Positive); **Modelo**: NÃO-Fraude (Negative) --> Modelo errou!\n", + "\n", + "* Consulte [Confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)\n", + "\n", + "![ConfusionMatrix](https://github.com/MathMachado/Materials/blob/master/ConfusionMatrix.PNG?raw=true)\n", + "\n", + "Source: [Confusion Matrix](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ci-6eiqBTgbL" + }, + "source": [ + "## Accuracy\n", + "> Accuracy - é o número de previsões corretas feitas pelo modelo.\n", + "\n", + "Responde à seguinte pergunta:\n", + "\n", + "```\n", + "Com que frequência o classificador (modelo preditivo) classifica corretamente?\n", + "```\n", + "\n", + "$$Accuracy= \\frac{TP+TN}{TP+TN+FP+FN}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7YI8X5TRx-R" + }, + "source": [ + "## Precision (ou Specificity)\n", + "> **Precision** - fornece informações sobre o desempenho em relação a Falsos Positivos (quantos capturamos).\n", + "\n", + "Responde à seguinte pergunta:\n", + "\n", + "```\n", + "Com relação ao resultado Positivo, com que frequência o classificador está correto?\n", + "```\n", + "\n", + "\n", + "$$Precision= \\frac{TP}{TP+FP}$$\n", + "\n", + "**Exemplo**: Precison nos dirá a proporção de clientes que o modelo estimou como sendo Fraude quando, na verdade, são fraude.\n", + "\n", + "**Comentário**: Se nosso foco é minimizar Falso Negativos (FN), então precisamos nos esforçar para termos Recall próximo de 100%." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zO39n8x_Sz3L" + }, + "source": [ + "## Recall (ou Sensitivity)\n", + "> **Recall** - nos fornece informações sobre o desempenho de um classificador em relação a Falsos Negativos (quantos perdemos).\n", + "\n", + "Responde à seguinte pergunta:\n", + "\n", + "```\n", + "Quando o valor observado é Positivo, com que frequência o classificador está correto?\n", + "```\n", + "\n", + "$$Recall = Sensitivity = \\frac{TP}{TP+FN}$$\n", + "\n", + "**Exemplo**: Recall é a proporção de clientes observados como Fraude e que o modelo estima como Fraude.\n", + "\n", + "**Comentário**: Se nosso foco for minimizar Falso Positivos (FP), então precisamos nos esforçar para fazer Precision mais próximo de 100% possível." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "htS6rdHVVXRG" + }, + "source": [ + "## Specificity\n", + "> **Specificity** - proporção de TN por TN+FP.\n", + "\n", + "Responde à seguinte pergunta:\n", + "\n", + "```\n", + "Quando o valor observado é Negativo, com que frequência o classificador está correto?\n", + "```\n", + "\n", + "**Exemplo**: Specificity é a proporção de clientes NÃO-Fraude que o modelo estima como NÃO-Fraude.\n", + "\n", + "$$Specificity= \\frac{TN}{TN+FP}$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mNn0twadTacc" + }, + "source": [ + "## F1-Score\n", + "> F1-Score é a média harmônica entre Recall e Precision e é um número entre 0 e 1. Quanto mais próximo de 1, melhor. Quanto mais próximo de 0, pior. 
Ou seja, é um equilíbrio entre Recall e Precision.\n", + "\n", + "$$F1\\_Score= 2\\left(\\frac{Recall*Precision}{Recall+Precision}\\right)$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rsH9dMxazWCg" + }, + "source": [ + "# **DATAFRAME-EXEMPLO USADO NESTE TUTORIAL**\n", + "> Gerar um dataframe com 18 colunas, sendo 9 informativas, 6 redundantes e 3 repetidas:\n", + "\n", + "Para saber mais sobre a geração de dataframes-exemplo (toy), consulte [Synthetic data generation — a must-have skill for new data scientists](https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GEyDo_EIV_jV" + }, + "source": [ + "## Definir variáveis globais" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TdwgpZ76WFaT" + }, + "source": [ + "i_CV= 10 # Número de Cross-Validations\n", + "i_Seed= 20111974 # semente por questões de reproducibilidade\n", + "f_Test_Size= 0.3 # Proporção do dataframe de validação" + ], + "execution_count": 1, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gJTJfpwWzykS" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "X, y = make_classification(n_samples = 1000, n_features = 18, n_informative = 9, n_redundant = 6, n_repeated = 3, n_classes = 2, n_clusters_per_class = 1, random_state=i_Seed)" + ], + "execution_count": 9, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bXG2uEFYwZAk", + "outputId": "cc865b0f-9b3c-4df4-a783-48484a08c784", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "X[0:1]" + ], + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 0.06844089, 4.21184154, -2.5583024 , 3.66548195, -3.83515815,\n", + " 3.49985104, 2.49085623, 3.66548195, 0.24511712, 0.86717207,\n", + " 2.86554551, 0.49395636, -5.14859605, 
2.86554551, 3.49985104,\n", + " -0.63061895, -0.97831983, -0.88826977]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wE-IZJfGwzvE", + "outputId": "dddf7c0f-ad67-42d2-ec50-6d9d95929e0a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y[0:30]" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHO2befKJxR3" + }, + "source": [ + "___\n", + "# **DECISION TREE**\n", + "> Decision Trees possuem estrutura em forma de árvores.\n", + "\n", + "* **Principais Vantagens**:\n", + " * São algoritmos fáceis de entender, visualizar e interpretar;\n", + " * Captura facilmente padrões não-lineares presentes nos dados;\n", + " * Requer pouco poder computacional;\n", + " * Lida bem com COLUNAS numéricas ou categóricas;\n", + " * Não requer os dados sejam normalizados;\n", + " * Pode ser utilizado como Feature Engineering ao lidar com Missing Values;\n", + " * Pode ser utilizado como Feature Selection;\n", + " * Não requer suposições sobre a distribuição dos dados por causa da natureza não-paramétrica do algoritmo\n", + "\n", + "* **Principais desvantagens**\n", + " * Propenso a Overfitting, pois Decision Trees podem construir árvores complexas que não sejam capazes de generalizar bem os dados. As coisas complicam muito se a amostra de treinamento possuir outliers. Portanto, **recomenda-se fortemente a tratar os outliers previamente**.\n", + " * Pode criar árvores viesadas se tivermos um dataframe não-balanceado ou que alguma classe seja dominante. 
Por conta disso, **recomenda-se balancear o dataframe previamente para se evitar esse problema**.\n", + "\n", + "* **Principais parâmetros**\n", + " * **Gini Index** - é uma métrica que mede a frequência com que um ponto/observação aleatoriamente selecionado seria incorretamente identificado.\n", + " * Portanto, quanto menor o valor de Gini Index, melhor a COLUNA;\n", + " * **Entropy** - é uma métrica que mede aleatoriedade da informação presente nos dados.\n", + " * Portanto, quanto maior a entropia da COLUNA, pior ela se torna para nos ajudar a tomar uma conclusão (classificar, por exemplo).\n", + "\n", + "## **Referências**:\n", + "* [1.10. Decision Trees](https://scikit-learn.org/stable/modules/tree.html).\n", + "* [Decision Tree Algorithm With Hands On Example](https://medium.com/datadriveninvestor/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38) - ótimo tutorial para aprender, entender, interpretar e calcular os índices de Gini e entropia.\n", + "* [Intuitive Guide to Understanding Decision Trees](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-decision-trees-adb2165ccab7) - ótimo tutorial para aprender, entender, interpretar e calcular os índices de Gini e entropia.\n", + "* [The Complete Guide to Decision Trees](https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14)\n", + "* [Creating and Visualizing Decision Tree Algorithm in Machine Learning Using Sklearn](https://intellipaat.com/blog/decision-tree-algorithm-in-machine-learning/) - Muito didático!\n", + "* [Decision Trees in Machine Learning](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FrMkPN5aLp0Y" + }, + "source": [ + "## Carregar as bibliotecas" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FVU1CM0PKgO4" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import 
seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")" + ], + "execution_count": 3, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "15clh4XrISpz" + }, + "source": [ + "## Carregar/Ler os dados" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UMPL46w2IWJw" + }, + "source": [ + "l_colunas= ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18']\n", + "df_X = pd.DataFrame(X, columns = l_colunas)\n", + "df_y = pd.DataFrame(y, columns = ['target'])" + ], + "execution_count": 4, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFaQF2MGFl_M", + "outputId": "44232441-3030-4dfe-bd29-fd1473b70c3a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + } + }, + "source": [ + "df_X.head()" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18
00.0684414.211842-2.5583023.665482-3.8351583.4998512.4908563.6654820.2451170.8671722.8655460.493956-5.1485962.8655463.499851-0.630619-0.978320-0.888270
1-4.8240210.179509-2.9844731.033618-3.8934263.428734-3.3346051.033618-0.882780-0.7532811.441522-1.395514-4.0028801.4415223.4287340.3399201.891538-6.109676
21.389530-0.2264761.8774002.7134264.6302570.516455-3.7430272.7134261.2840392.030797-1.0955361.560159-1.014211-1.0955360.516455-1.4778450.9605262.060204
31.1458092.2559460.2073644.6658172.2946786.5013060.9647704.6658170.1194103.1963541.8947873.519138-4.7578071.8947876.501306-3.7890290.5794911.397106
4-0.9366463.697163-3.3636173.805126-1.7544304.9543460.4066053.805126-0.8247381.3825911.665704-0.649758-3.5130361.6657044.9543460.2570520.904244-3.071354
\n", + "
" + ], + "text/plain": [ + " v1 v2 v3 ... v16 v17 v18\n", + "0 0.068441 4.211842 -2.558302 ... -0.630619 -0.978320 -0.888270\n", + "1 -4.824021 0.179509 -2.984473 ... 0.339920 1.891538 -6.109676\n", + "2 1.389530 -0.226476 1.877400 ... -1.477845 0.960526 2.060204\n", + "3 1.145809 2.255946 0.207364 ... -3.789029 0.579491 1.397106\n", + "4 -0.936646 3.697163 -3.363617 ... 0.257052 0.904244 -3.071354\n", + "\n", + "[5 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s-ibdD2ZG7tm", + "outputId": "79840be6-361c-424d-a6bb-0fdf9acbe222", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_X.shape" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1000, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f9cqRaywa_TR", + "outputId": "beae2b96-5d53-4870-af26-01dbeac1ab40", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "set(df_y['target'])" + ], + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{0, 1}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BN6jbpn6Iwmu" + }, + "source": [ + "## Estatísticas Descritivas básicas do dataframe - df.describe()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KlwhxxUNIyYs", + "outputId": "77235ef7-30e5-43b1-afa6-bb2eeb202f2b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "df_X.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18
count1000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.000000
mean-0.0851591.0342270.6574081.4053170.6872791.1315600.1080531.4053171.0070231.0488010.0792480.001650-0.3654380.0792481.131560-0.0277510.9846060.633624
std2.0022471.6315073.6087722.2568574.0195984.4818321.9813072.2568571.8632881.6439001.9492731.9326414.1606681.9492734.4818322.0654551.8505933.552991
min-6.944169-4.620754-16.300139-6.235192-12.454256-14.305401-6.152747-6.235192-5.484992-3.293216-7.135349-5.705500-9.120941-7.135349-14.305401-6.009023-5.035184-11.439074
25%-1.305566-0.089052-1.623657-0.152888-1.854645-1.684751-1.216983-0.152888-0.240908-0.012710-1.209675-1.292162-3.555363-1.209675-1.684751-1.436673-0.261610-1.691346
50%0.0525230.9941500.5738491.4499310.8123641.2815040.1670911.4499311.0661251.0128990.1803440.035237-0.9666380.1803441.281504-0.0001900.9757930.844784
75%1.3838532.0719953.0385862.8871413.4139524.0081031.4387192.8871412.2881882.1872021.4391991.3153422.7458061.4391994.0081031.3653692.2565043.109330
max4.9971727.35486011.7201658.49456612.84441815.9998036.2935508.4945668.1465596.5231806.2524485.53821611.2593506.25244815.9998036.5315617.64680212.090528
\n", + "
" + ], + "text/plain": [ + " v1 v2 ... v17 v18\n", + "count 1000.000000 1000.000000 ... 1000.000000 1000.000000\n", + "mean -0.085159 1.034227 ... 0.984606 0.633624\n", + "std 2.002247 1.631507 ... 1.850593 3.552991\n", + "min -6.944169 -4.620754 ... -5.035184 -11.439074\n", + "25% -1.305566 -0.089052 ... -0.261610 -1.691346\n", + "50% 0.052523 0.994150 ... 0.975793 0.844784\n", + "75% 1.383853 2.071995 ... 2.256504 3.109330\n", + "max 4.997172 7.354860 ... 7.646802 12.090528\n", + "\n", + "[8 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 242 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N_QhFqyZOKFB" + }, + "source": [ + "## Selecionar as amostras de treinamento e validação\n", + "* Neste fase, devemos selecionar amostras de treinamento para treinar o modelo de Machine Learning e validação, para validar o modelo de Machine Learning.\n", + "* Geralmente usamos 70% da amostra para treinamento e 30% validação. Outras opções são usar os percentuais 80/20 ou 75/25 (default).\n", + "* Consulte [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) para mais detalhes.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8sKBgs-QOOfn" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": 17, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TPTKBBHgOpoA", + "outputId": "b986cda5-98f9-4dbe-b9ab-d4a7133b8c9d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_train.shape" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { 
+ "cell_type": "code", + "metadata": { + "id": "lEn_LLs2OtRI", + "outputId": "4ace28f9-1d68-4bb6-ce5a-c5cc2fd9e2a6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_train.shape" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_uAw8EcyOvrG", + "outputId": "d841fc6a-42ec-40a7-e300-afcc4d34e75d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_test.shape" + ], + "execution_count": 20, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A2LYI-9hOyXI", + "outputId": "156e39cf-80a9-4259-961d-a14ce1252e83", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_test.shape" + ], + "execution_count": 21, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "npgoBSX2dd4l" + }, + "source": [ + "## Treinar o algoritmo com os dados de treinamento\n", + "### Carregar os algoritmos/libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcvzrtolGfnQ", + "outputId": "50d1a229-5e17-4be8-c9f9-54bd55bc8b08", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "!pip install graphviz\n", + "!pip install pydotplus" + ], + "execution_count": 22, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (0.10.1)\n", + "Requirement already satisfied: 
pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)\n", + "Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from pydotplus) (2.4.7)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_pF-HH3JKL2" + }, + "source": [ + "from sklearn.metrics import accuracy_score\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV\n", + "from sklearn.model_selection import cross_val_score\n", + "from time import time\n", + "from operator import itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from sklearn.tree import export_graphviz\n", + "from sklearn.externals.six import StringIO \n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": 23, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ROlyvgij2yl" + }, + "source": [ + "Função para plotar a Confusion Matrix extraído de [Confusion Matrix Visualization](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klQ0FLOIgeX1" + }, + "source": [ + "def mostra_confusion_matrix(cf, \n", + " group_names = None, \n", + " categories = 'auto', \n", + " count = True, \n", + " percent = True, \n", + " cbar = True, \n", + " xyticks = False, \n", + " xyplotlabels = True, \n", + " sum_stats = True, figsize = (8, 8), \n", + " cmap = 'Blues'):\n", + " '''\n", + " This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.\n", + " Arguments\n", + " ---------\n", + " cf: confusion matrix to be passed in\n", + " group_names: List of strings that represent the labels row by row to be shown in each square.\n", + " categories: 
List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'\n", + " count: If True, show the raw number in the confusion matrix. Default is True.\n", + " normalize: If True, show the proportions for each category. Default is True.\n", + " cbar: If True, show the color bar. The cbar values are based off the values in the confusion matrix.\n", + " Default is True.\n", + " xyticks: If True, show x and y ticks. Default is True.\n", + " xyplotlabels: If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.\n", + " sum_stats: If True, display summary statistics below the figure. Default is True.\n", + " figsize: Tuple representing the figure size. Default will be the matplotlib rcParams value.\n", + " cmap: Colormap of the values displayed from matplotlib.pyplot.cm. Default is 'Blues'\n", + " See http://matplotlib.org/examples/color/colormaps_reference.html\n", + " '''\n", + "\n", + " # CODE TO GENERATE TEXT INSIDE EACH SQUARE\n", + " blanks = ['' for i in range(cf.size)]\n", + "\n", + " if group_names and len(group_names)==cf.size:\n", + " group_labels = [\"{}\\n\".format(value) for value in group_names]\n", + " else:\n", + " group_labels = blanks\n", + "\n", + " if count:\n", + " group_counts = [\"{0:0.0f}\\n\".format(value) for value in cf.flatten()]\n", + " else:\n", + " group_counts = blanks\n", + "\n", + " if percent:\n", + " group_percentages = [\"{0:.2%}\".format(value) for value in cf.flatten()/np.sum(cf)]\n", + " else:\n", + " group_percentages = blanks\n", + "\n", + " box_labels = [f\"{v1}{v2}{v3}\".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]\n", + " box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])\n", + "\n", + " # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS\n", + " if sum_stats:\n", + " #Accuracy is sum of diagonal divided by total observations\n", + " accuracy = np.trace(cf) / float(np.sum(cf))\n", + "\n", + " #if it is a binary 
confusion matrix, show some more stats\n", + " if len(cf)==2:\n", + " #Metrics for Binary Confusion Matrices\n", + " precision = cf[1,1] / sum(cf[:,1])\n", + " recall = cf[1,1] / sum(cf[1,:])\n", + " f1_score = 2*precision*recall / (precision + recall)\n", + " stats_text = \"\\n\\nAccuracy={:0.3f}\\nPrecision={:0.3f}\\nRecall={:0.3f}\\nF1 Score={:0.3f}\".format(accuracy,precision,recall,f1_score)\n", + " else:\n", + " stats_text = \"\\n\\nAccuracy={:0.3f}\".format(accuracy)\n", + " else:\n", + " stats_text = \"\"\n", + "\n", + " # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS\n", + " if figsize==None:\n", + " #Get default figure size if not set\n", + " figsize = plt.rcParams.get('figure.figsize')\n", + "\n", + " if xyticks==False:\n", + " #Do not show categories if xyticks is False\n", + " categories=False\n", + "\n", + " # MAKE THE HEATMAP VISUALIZATION\n", + " plt.figure(figsize=figsize)\n", + " sns.heatmap(cf,annot=box_labels,fmt=\"\",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)\n", + "\n", + " if xyplotlabels:\n", + " plt.ylabel('True label')\n", + " plt.xlabel('Predicted label' + stats_text)\n", + " else:\n", + " plt.xlabel(stats_text)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YJMS9ePQ6B6t" + }, + "source": [ + "**Atenção**: Para evitar overfitting nos algoritmos DecisionTreeClassifier, considere min_samples_split= 2 como default." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nNeRHYePJc-r" + }, + "source": [ + "from sklearn.tree import DecisionTreeClassifier\n", + "\n", + "# Instantiate the baseline model (min_samples_split=2 is the scikit-learn default):\n", + "ml_DT= DecisionTreeClassifier(criterion = 'gini', \n", + " splitter = 'best', \n", + " max_depth = None, \n", + " min_samples_split=2, \n", + " min_samples_leaf = 1, \n", + " min_weight_fraction_leaf = 0.0, \n", + " max_features = None, \n", + " random_state = i_Seed, \n", + " max_leaf_nodes = None, \n", + " min_impurity_decrease = 0.0, \n", + " min_impurity_split = None, class_weight = None, \n", + " presort = False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OgAHfXVo-Nw8", + "outputId": "3eaa0531-beb0-4ed7-b470-457b0b2c0a63", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "# Train the algorithm\n", + "ml_DT.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 252 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6exa9D8R2fDJ", + "outputId": "9d32930b-0aec-4322-d86d-da7f08869e55", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT, X_train, y_train, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias 
calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 91.43\n", + "std médio das Acurácias calculadas pelo CV: 3.44\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6_rYker2gzeG" + }, + "source": [ + "**Interpretation**: Our classifier (DecisionTreeClassifier) has a mean cross-validated accuracy of 91.43% on the training set. The standard deviation across folds is about 3.44%, which is small. Next we try to improve the classifier's accuracy with parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tkwchmkP3p_A", + "outputId": "1197dd8b-421b-47e0-866d-0200bae53b28", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácias: [0.9 0.98571429 0.85714286 0.92857143 0.88571429 0.94285714\n", + " 0.92857143 0.9 0.88571429 0.92857143]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sI31WkZs2ht_" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_DT.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fSaVzJ9xFpwW", + "outputId": "90af8559-6487-42a2-e497-25288ab965bb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + 
"text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p8D975NqsGtj" + }, + "source": [ + "## Parameter tuning\n", + "### Reference\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - A didactic, step-by-step explanation of how to do parameter tuning." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bfdq5zEhlVsk" + }, + "source": [ + "# Parameter dictionary for parameter tuning. The full grid (commented out below) has 2x15x5x5x7 = 5,250 candidate combinations. With 10 folds in the Cross-Validation, that is 52,500 model fits.\n", + "d_parametros_DT= {\"criterion\": [\"gini\", \"entropy\"]} #, \"min_samples_split\": [2, 5, 10, 30, 50, 70, 90, 120, 150, 180, 210, 240, 270, 350, 400], \"max_depth\": [None, 2, 5, 9, 15], \"min_samples_leaf\": [20, 40, 60, 80, 100], \"max_leaf_nodes\": [None, 2, 3, 4, 5, 10, 15]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H8gNSs0G0A-L" + }, + "source": [ + "```\n", + "grid_search = GridSearchCV(ml_DT, param_grid= d_parametros_DT, cv = i_CV, n_jobs= -1)\n", + "start = time()\n", + "grid_search.fit(X_train, y_train)\n", + "tempo_elapsed= time()-start\n", + "print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos para estimar {len(grid_search.cv_results_)} modelos candidatos\")\n", + "\n", + "GridSearchCV levou 1999.12 segundos para estimar 23 modelos candidatos\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ap3WMXqDthu9" + }, + "source": [ + "# Define the function for GridSearchCV\n", + "def 
GridSearchOptimizer(modelo, ml_Opt, d_Parametros, X_train, y_train, X_test, y_test, cv = i_CV):\n", + " ml_GridSearchCV = GridSearchCV(modelo, d_Parametros, cv = cv, n_jobs= -1, verbose= 10, scoring= 'accuracy')\n", + " start = time()\n", + " ml_GridSearchCV.fit(X_train, y_train)\n", + " tempo_elapsed= time()-start\n", + " #print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos.\")\n", + "\n", + " # Parameters that optimize the classification:\n", + " # Note: every key read from best_params_ below must be present in d_Parametros, otherwise a KeyError is raised.\n", + " print(f'\\nParametros otimizados: {ml_GridSearchCV.best_params_}')\n", + " \n", + " if ml_Opt == 'ml_DT2':\n", + " print(f'\\nDecisionTreeClassifier *********************************************************************************************************')\n", + " ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_leaf_nodes= ml_GridSearchCV.best_params_['max_leaf_nodes'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_RF2':\n", + " print(f'\\nRandomForestClassifier *********************************************************************************************************')\n", + " ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_features= ml_GridSearchCV.best_params_['max_features'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + " n_estimators= ml_GridSearchCV.best_params_['n_estimators'],\n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_AB2':\n", + " print(f'\\nAdaBoostClassifier *********************************************************************************************************')\n", + " 
ml_Opt = AdaBoostClassifier(algorithm='SAMME.R', \n", + " base_estimator=RandomForestClassifier(bootstrap = False, \n", + " max_depth = 10, \n", + " max_features = 'auto', \n", + " min_samples_leaf = 1, \n", + " min_samples_split = 2, \n", + " n_estimators = 400), \n", + " learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + " n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + " random_state = i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_GB2':\n", + " print(f'\\nGradientBoostingClassifier *********************************************************************************************************')\n", + " ml_Opt = GradientBoostingClassifier(learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + " n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + " max_depth = ml_GridSearchCV.best_params_['max_depth'], \n", + " min_samples_split = ml_GridSearchCV.best_params_['min_samples_split'], \n", + " min_samples_leaf = ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + " max_features = ml_GridSearchCV.best_params_['max_features'])\n", + " \n", + " elif ml_Opt == 'ml_XGB2':\n", + " print(f'\\nXGBoostingClassifier *********************************************************************************************************')\n", + " ml_Opt = XGBoostingClassifier(learning_rate= ml_GridSearchCV.best_params_['learning_rate'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'], \n", + " colsample_bytree= ml_GridSearchCV.best_params_['colsample_bytree'], \n", + " subsample= ml_GridSearchCV.best_params_['subsample'], \n", + " gamma= ml_GridSearchCV.best_params_['gamma'], \n", + " min_child_weight= ml_GridSearchCV.best_params_['min_child_weight'])\n", + " \n", + " # Treina novamente usando os parametros otimizados...\n", + " ml_Opt.fit(X_train, y_train)\n", + "\n", + " # Cross-Validation com 10 folds\n", + " print(f'\\n********* CROSS-VALIDATION ***********')\n", + " a_scores_CV = cross_val_score(ml_Opt, 
X_train, y_train, cv = i_CV)\n", + " print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + " print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')\n", + "\n", + " # Faz predições com os parametros otimizados...\n", + " y_pred = ml_Opt.predict(X_test)\n", + " \n", + " # Importância das COLUNAS\n", + " print(f'\\n********* IMPORTÂNCIA DAS COLUNAS ***********')\n", + " df_importancia_variaveis = pd.DataFrame(zip(l_colunas, ml_Opt.feature_importances_), columns= ['coluna', 'importancia'])\n", + " df_importancia_variaveis = df_importancia_variaveis.sort_values(by= ['importancia'], ascending=False)\n", + " print(df_importancia_variaveis)\n", + "\n", + " # Matriz de Confusão\n", + " print(f'\\n********* CONFUSION MATRIX - PARAMETER TUNNING ***********')\n", + " cf_matrix = confusion_matrix(y_test, y_pred)\n", + " cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + " cf_categories = ['Zero', 'One']\n", + " mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)\n", + "\n", + " return ml_Opt, ml_GridSearchCV.best_params_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "44-BRnNjBT25", + "outputId": "da9fa734-cd1d-4731-d6c6-2ff2cbc1d379", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 520 + } + }, + "source": [ + "# Invoca a função\n", + "ml_DT2, best_params = GridSearchOptimizer(ml_DT, 'ml_DT2', d_parametros_DT, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 2 candidates, totalling 20 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.0s\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | 
elapsed: 1.1s\n", + "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.2s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1813s.) Setting batch_size=2.\n", + "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 1.2s\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'criterion': 'entropy'}\n", + "\n", + "DecisionTreeClassifier *********************************************************************************************************\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Done 20 out of 20 | elapsed: 1.3s remaining: 0.0s\n", + "[Parallel(n_jobs=-1)]: Done 20 out of 20 | elapsed: 1.3s finished\n" + ], + "name": "stderr" + }, + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Invoca a função\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mml_DT2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbest_params\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mGridSearchOptimizer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mml_DT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'ml_DT2'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0md_parametros_DT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcv\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mi_CV\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mGridSearchOptimizer\u001b[0;34m(modelo, ml_Opt, 
d_Parametros, X_train, y_train, X_test, y_test, cv)\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'\\nDecisionTreeClassifier *********************************************************************************************************'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'], \n\u001b[0;32m---> 15\u001b[0;31m \u001b[0mmax_depth\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'max_depth'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 16\u001b[0m \u001b[0mmax_leaf_nodes\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'max_leaf_nodes'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 17\u001b[0m \u001b[0mmin_samples_split\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'min_samples_leaf'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'max_depth'" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gmCkjGjPJMLr" + }, + "source": [ + "### Visualizar o resultado" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cIc3ZgaISEd0" + }, + "source": [ + "from sklearn.tree import export_graphviz\n", + "from sklearn.externals.six import StringIO \n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "dot_data = StringIO()\n", + "export_graphviz(ml_DT2, out_file = dot_data, filled = True, 
rounded = True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n", + "\n", + "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", + "graph.write_png('DecisionTree.png')\n", + "Image(graph.create_png())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e1R2GBkbnV37" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vv7GKBvs6Ybf" + }, + "source": [ + "# Function to select the relevant COLUMNS\n", + "from sklearn.feature_selection import SelectFromModel\n", + "\n", + "def seleciona_colunas_relevantes(modelo, X_train, X_test, threshold = 0.05):\n", + " # Create a selector that keeps the COLUMNS with importance >= threshold\n", + " sfm = SelectFromModel(modelo, threshold = threshold)\n", + " \n", + " # Train the selector (y_train is read from the notebook's global scope)\n", + " sfm.fit(X_train, y_train)\n", + "\n", + " # Show the indices of the most important COLUMNS\n", + " print(f'\\n********** COLUNAS Relevantes ******')\n", + " print(sfm.get_support(indices=True))\n", + "\n", + " # Keep only the relevant COLUMNS\n", + " X_train_I = sfm.transform(X_train)\n", + " X_test_I = sfm.transform(X_test)\n", + " return X_train_I, X_test_I " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ukMLoEr7nbUf" + }, + "source": [ + "X_train_DT, X_test_DT = seleciona_colunas_relevantes(ml_DT2, X_train, X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8JjePRQAoqkk" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gt3aCPpfKRxm" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq6uCVtzovMt" + }, + "source": [ + "# Treina usando as COLUNAS 
relevantes...\n", + "ml_DT2.fit(X_train_DT, y_train)\n", + "\n", + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT2, X_train_DT, y_train, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Tc7esxqtq-Og" + }, + "source": [ + "# ****************************************************************" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "znWy3LE1q-Z3" + }, + "source": [ + "ml_DT3, best_params2 = GridSearchOptimizer(ml_DT2, 'ml_DT2', d_parametros_DT, X_train_DT, y_train, X_test_DT, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IhCC6pfq-jL" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qw6Dk3kesT0q" + }, + "source": [ + "best_params2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "SbS4ZKN8s-ee" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT3, X_train_DT, y_train, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_at3XP1Bq-qb" + }, + "source": [ + "# ***************************************************************" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MZ1-vGRcxJoN" + }, + "source": [ + "## Validate the model using the X_test dataframe" + ] + }, 
+ { + "cell_type": "code", + "metadata": { + "id": "ig9GiUAEw9jr" + }, + "source": [ + "y_pred_DT = ml_DT2.predict(X_test_DT)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7UZz4UzHDqae" + }, + "source": [ + "# Compute the accuracy\n", + "accuracy_score(y_test, y_pred_DT)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3EUMAxxKBur" + }, + "source": [ + "___\n", + "# **RANDOM FOREST**\n", + "* Decision Trees have a tree-shaped structure.\n", + "* Random Forest can be used for both classification (RandomForestClassifier) and regression (RandomForestRegressor).\n", + "\n", + "* **Advantages**:\n", + " * Requires little data preprocessing;\n", + " * Handles categorical and numerical COLUMNS well;\n", + " * It is a Bagging Ensemble Method: it builds many trees, each trained on a bootstrap sample of the data, and combines their predictions by voting (classification) or averaging (regression);\n", + " * More robust than a single Decision Tree. **Why?**\n", + " * Less prone to overfitting than a single tree (**why?**) and often produces very robust, high-performance models.\n", + " * Can be used for Feature Selection, since it produces the feature importances. In scikit-learn, the importances sum to 1;\n", + " * Like Decision Trees, these models easily capture non-linear patterns in the data;\n", + " * Does not require the data to be normalized;\n", + " * Can handle Missing Values, depending on the implementation and library version;\n", + " * Requires no assumptions about the distribution of the data, because of the non-parametric nature of the algorithm.\n", + "\n", + "* **Disadvantages**\n", + " * Can be biased toward the majority class when the classes are imbalanced. **Balancing the dataframe beforehand is recommended to avoid this problem**.\n", + "\n", + "* **Main parameters**\n", + "\n", + "## **References**:\n", + "* [Running Random Forests? 
Inspect the feature importances with this code](https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e)\n", + "* [Feature importances with forests of trees](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)\n", + "* [Understanding Random Forests Classifiers in Python](https://www.datacamp.com/community/tutorials/random-forests-classifier-python)\n", + "* [Understanding Random Forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)\n", + "* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76)\n", + "* [Random Forest Simple Explanation](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)\n", + "* [Random Forest Explained](https://www.youtube.com/watch?v=eM4uJ6XGnSM)\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) - Explains the main Random Forest parameters."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cnfDw_GEKBuu" + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate...\n", + "ml_RF= RandomForestClassifier(n_estimators=100, min_samples_split= 2, max_features=\"auto\", random_state= i_Seed)\n", + "\n", + "# Train...\n", + "ml_RF.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lYa9oaZW__o6" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_RF, X_train, y_train, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AouWUu8vANdb" + }, + "source": [ + "**Interpretation**: Our classifier (RandomForestClassifier) has a mean cross-validated accuracy of 96.44% on the training set. The standard deviation across folds is about 2.77%, which is small. Next we try to improve the classifier's accuracy with parameter tuning (GridSearchCV)."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vbducxlgAa85" + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lxx-LUw_5sd" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_RF.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQIRO_LpGAkw" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yKLHZ5_C6FJ8" + }, + "source": [ + "## Parameter tuning\n", + "### Reference\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - A didactic, step-by-step explanation of how to do parameter tuning.\n", + "* [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) - Another approach to understanding parameter tuning. Highly recommended reading! 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XOa9naju6FKA" + }, + "source": [ + "# Parameter dictionary for parameter tuning.\n", + "d_parametros_RF= {'bootstrap': [True, False]} #,\n", + "# 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],\n", + "# 'max_features': ['auto', 'sqrt'],\n", + "# 'min_samples_leaf': [1, 2, 4],\n", + "# 'min_samples_split': [2, 5, 10],\n", + "# 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6__f2jZaTQat" + }, + "source": [ + "# Call the function\n", + "ml_RF2, best_params = GridSearchOptimizer(ml_RF, 'ml_RF2', d_parametros_RF, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "crfn-n--KG4n" + }, + "source": [ + "### Result of the Random Forest run\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 7920 out of 7920 | elapsed: 194.0min finished\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SGTOe5PaRw59" + }, + "source": [ + "# Since the search above took 194 minutes to run, ml_RF2 is built below directly from the parameters it found\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HMJcAdLlTQa0" + }, + "source": [ + "## Visualize the result\n", + "> To do: implement the RandomForest visualization." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WWNiy7Z0TQa3" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kOi11YOKTQa4" + }, + "source": [ + "X_train_RF, X_test_RF = seleciona_colunas_relevantes(ml_RF2, X_train, X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_O7c_DTQbE" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UwEOwzSGTQbF" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rr8qDrgvTQbL" + }, + "source": [ + "# Train with the relevant COLUMNS...\n", + "ml_RF2.fit(X_train_RF, y_train)\n", + "\n", + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_RF2, X_train_RF, y_train, cv = i_CV)\n", + "print(f'Acurácia Media: {100*a_scores_CV.mean():.2f}')\n", + "print(f'std médio.....: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-mYfQLlsTQbQ" + }, + "source": [ + "## Validate the model using the X_test dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sSD5o1JQTQbR" + }, + "source": [ + "y_pred_RF = ml_RF2.predict(X_test_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wywF6LymDzKr" + }, + "source": [ + "# Compute the accuracy\n", + "accuracy_score(y_test, y_pred_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hJJsL0IJb6iO" + }, + "source": [ + "## 
Study of the behavior of the algorithm's parameters\n", + "> See [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) for more details." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "navUWMwHi44D" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on the training and validation folds for each value of n_estimators\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_train, \n", + " y_train, \n", + " param_name=\"n_estimators\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation of the training scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the cross-validation scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy for the training and cross-validation scores\n", + "plt.plot(param_range, train_mean, label = \"Training score\", color = \"black\")\n", + "plt.plot(param_range, test_mean, label = \"Cross-validation score\", color = \"dimgrey\")\n", + "\n", + "# Plot accuracy bands (mean +/- one standard deviation)\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color = \"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color = \"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Number Of Trees\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc = \"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "rv7TIM9kjsud" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on the training and validation folds for each value of max_depth\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_train, \n", + " y_train, \n", + " param_name = \"max_depth\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "# Calculate mean and standard deviation of the training scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the cross-validation scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy for the training and cross-validation scores\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands (mean +/- one standard deviation)\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Max Depth\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lm_fPGYwkJYc" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_train, \n", + " 
y_train, \n", + " param_name='min_samples_leaf', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation of the training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands (mean +/- one std) for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot (the x axis is the swept parameter, min_samples_leaf)\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Minimum Samples Per Leaf\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CAqdiSaVlAB8" + }, + "source": [ + "param_range = np.arange(0.05, 1, 0.05)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_train, \n", + " y_train, \n", + " param_name='min_samples_split', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation of the training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 
1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands (mean +/- one std) for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot (the x axis is the swept parameter, min_samples_split as a fraction)\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Minimum Samples Split (fraction of samples)\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cX_gfsbQSdNd" + }, + "source": [ + "___\n", + "# **BOOSTING MODELS**\n", + "* These algorithms are widely used in Kaggle competitions;\n", + "* They are used to improve the performance of Machine Learning algorithms;\n", + "* Models:\n", + " - [X] AdaBoost\n", + " - [X] XGBoost\n", + " - [X] LightGBM\n", + " - [X] GradientBoosting\n", + " - [X] CatBoost\n", + "\n", + "## Bagging vs Boosting vs Stacking\n", + "### **Bagging**\n", + "* The goal is to reduce variance;\n", + "\n", + "#### How it works\n", + "* It draws several samples **WITH REPLACEMENT** from the training dataframe. Each sample is used to train a model using Decision Trees. As a result, we get an ensemble of many different models (Decision Trees). 
The average of these many different models (Decision Trees) is used to produce the final result;\n", + "* The final result is more robust than the one given by a single Decision Tree.\n", + "\n", + "![Bagging](https://github.com/MathMachado/Materials/blob/master/Bagging.png?raw=true)\n", + "\n", + "Source: [Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm](https://hackernoon.com/how-to-develop-a-robust-algorithm-c38e08f32201).\n", + "\n", + "#### Steps\n", + "* Suppose a dataframe X_train (the training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Bagging randomly draws a sample **WITH REPLACEMENT** from X_train;\n", + " 2. Bagging randomly selects M2 (M2 < M) COLUMNS from the dataframe drawn in step (1);\n", + " 3. It builds a Decision Tree with the M2 COLUMNS from step (2) and the dataframe from step (1), and the COLUMNS are evaluated by their ability to classify the observations;\n", + " 4. Steps (1) --> (2) --> (3) are repeated K times (that is, K Decision Trees), so that the COLUMNS are ranked by their predictive power and the final result (accuracy, for example) is obtained by aggregating the predictions of the K Decision Trees.\n", + "\n", + "#### Advantages\n", + "* Reduces overfitting;\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Handles Missing Values automatically;\n", + "\n", + "#### Disadvantage\n", + "* The final prediction is based on the average of the K Decision Trees, which can compromise the final accuracy.\n", + "\n", + "___ \n", + "### **Boosting**\n", + "* The goal is to improve accuracy;\n", + "\n", + "#### How it works\n", + "* The classifiers are applied sequentially, so that the classifier at step N learns from the errors of the classifier at step N-1. 
In other words, the goal is to improve precision/accuracy at each step by learning from the past.\n", + "\n", + "![Boosting](https://github.com/MathMachado/Materials/blob/master/Boosting.png?raw=true)\n", + "\n", + "Source: [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205), Joseph Rocca.\n", + "\n", + "#### Steps\n", + "* Suppose a dataframe X_train (the training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Boosting randomly draws a sample D1 WITHOUT replacement from X_train;\n", + " 2. Boosting trains classifier C1 on D1;\n", + " 3. Boosting randomly draws a SECOND sample D2 WITHOUT replacement from X_train and adds to D2 50% of the observations that were misclassified, in order to train classifier C2;\n", + " 4. Boosting finds in X_train the sample D3 on which classifiers C1 and C2 disagree, and trains C3 on it;\n", + " 5. It combines (by vote) the predictions of C1, C2 and C3 to produce the final result.\n", + "\n", + "#### Advantages\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Handles Missing Values automatically;\n", + "\n", + "#### Disadvantages\n", + "* Prone to overfitting. Treating outliers beforehand is recommended;\n", + "* Requires careful tuning of the hyperparameters;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9fgUrkmPk4dr" + }, + "source": [ + "___\n", + "# STACKING\n", + "\n", + "![Stacking](https://github.com/MathMachado/Materials/blob/master/Stacking.png?raw=true)\n", + "\n", + "TODO: find the reference for this figure." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B0jxx3ETpOdm" + }, + "source": [ + "___\n", + "# **BOOTSTRAPPING METHODS**\n", + "> Before talking about Boosting or Bagging, we first need to understand what Bootstrap is, since both (Boosting and Bagging) are based on Bootstrap.\n", + "\n", + "* In Statistics (and in Machine Learning), Bootstrap refers to drawing random samples WITH replacement from the population X." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyqazmUuifkE" + }, + "source": [ + "___\n", + "# **ADABOOST (Adaptive Boosting)**\n", + "* When nothing else works, AdaBoost works!\n", + "* It was one of the first Boosting algorithms (1995);\n", + "* AdaBoost can be used for both classification (AdaBoostClassifier) and regression (AdaBoostRegressor);\n", + "* By default, AdaBoost uses DecisionTree algorithms as its base_estimator;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RU-vzkXqrFVw" + }, + "source": [ + "## References\n", + "* [AdaBoost Classifier Example In Python](https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464) - Didactic; explains exactly how AdaBoost works.\n", + "* [Adaboost for Dummies: Breaking Down the Math (and its Equations) into Simple Terms](https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf) - For those who want to understand the math behind the algorithm.\n", + "* [Gradient Boosting and XGBoost](https://medium.com/hackernoon/gradient-boosting-and-xgboost-90862daa6c77)\n", + "* [Understanding AdaBoost](https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe), Akash Desarda." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6EMrjQDZIMl_" + }, + "source": [ + "## What is AdaBoost (Adaptive Boosting)?\n", + "* It is an ensemble classifier (it combines several classifiers to increase accuracy).\n", + "* AdaBoost is a strong, iterative classifier that combines (ensembles) several weak classifiers to improve accuracy.\n", + "* Any machine learning algorithm that accepts sample weights when fitting can be used as the base classifier (base_estimator parameter);\n", + "\n", + "## Most important AdaBoost parameters:\n", + "* base_estimator - The classifier used to train the model. By default, AdaBoost uses DecisionTreeClassifier. As mentioned earlier, different algorithms can be used for this purpose.\n", + "* n_estimators - Number of base_estimator instances to train iteratively.\n", + "* learning_rate - Controls the contribution of each base_estimator to the final solution/combination;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TzLtHzWNJBix" + }, + "source": [ + "## Using different algorithms for base_estimator\n", + "> As mentioned earlier, several types of base_estimator can be used in AdaBoost. 
For example, to use SVM (Support Vector Machines), proceed as follows:\n", + "\n", + "\n", + "```\n", + "# Import the base estimator class and AdaBoost\n", + "from sklearn.svm import SVC\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "\n", + "# Instantiate the base classifier (algorithm)\n", + "ml_SVC= SVC(probability=True, kernel='linear')\n", + "\n", + "# Build the AdaBoost model\n", + "ml_AB = AdaBoostClassifier(n_estimators= 50, base_estimator=ml_SVC, learning_rate=1)\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hrj4a4s6hMMB" + }, + "source": [ + "## Advantages\n", + "* AdaBoost is easy to implement;\n", + "* AdaBoost iteratively corrects the errors of the base_estimator and improves accuracy;\n", + "* It performs Feature Selection automatically (**Why**?);\n", + "* Many algorithms can be used as base_estimator;\n", + "* Since it is an ensemble method, the final model is less prone to overfitting.\n", + "\n", + "## Disadvantages\n", + "* AdaBoost is sensitive to noise in the data;\n", + "* It is heavily affected by outliers (which contributes to overfitting), because the algorithm tries to fit every point as well as possible;\n", + "* AdaBoost is slower than XGBoost;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bgJmu7YLiyv7" + }, + "source": [ + "In the following example, I use RandomForestClassifier with the optimized parameters, that is:\n", + "\n", + "```\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5VCRNyZT3qvc" + }, + "source": [ + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1gIboJdriq61" + }, + "source": [ + "from sklearn.ensemble import 
AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate RandomForestClassifier with the optimized parameters\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)\n", + "# Instantiate AdaBoostClassifier\n", + "ml_AB= AdaBoostClassifier(n_estimators=100, base_estimator= ml_RF2, random_state= i_Seed)\n", + "\n", + "# Train...\n", + "ml_AB.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "A4Cs81OLD40y" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_AB, X_train, y_train, cv = i_CV)\n", + "print(f'Mean of the CV accuracies..............: {100*a_scores_CV.mean():.2f}')\n", + "print(f'Standard deviation of the CV accuracies: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7Ce5L38ECoC" + }, + "source": [ + "**Interpretation**: Our classifier (AdaBoostClassifier) has a mean accuracy of 96.72% (training set). In addition, the std is about 2.54%, which is small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t5GfnBwEifkO" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q9rSpuXyEPA5" + }, + "source": [ + "# Make predictions on the test set (ml_AB has not been tuned yet)\n", + "y_pred = ml_AB.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2F9k-_eXGDLa" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XweWTjQ9EXLw" + }, + "source": [ + "## Parameter tuning" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fcrKzse9EbL_" + }, + "source": [ + "# Parameter dictionary for the parameter tuning.\n", + "d_parametros_AB = {'n_estimators':[50, 100, 200], 'learning_rate':[.001, 0.01, 0.05, 0.1, 0.3,1]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Susc3I7mFDQX" + }, + "source": [ + "# Call the function\n", + "ml_AB2, best_params= GridSearchOptimizer(ml_AB, 'ml_AB2', d_parametros_AB, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w4JjWsusjNS8" + }, + "source": [ + "___\n", + "# **GRADIENT BOOSTING**\n", + "* Gradient boosting can be used for classification (GradientBoostingClassifier) and regression (GradientBoostingRegressor) problems;\n", + "* Gradient boosting is a refinement of AdaBoost (recall that AdaBoost was one of the first Boosting methods, created in 1995). 
What Gradient Boosting does in addition to AdaBoost is minimize the loss (loss function), i.e., minimize the difference between the observed values of y and the predicted values.\n", + "* It uses Gradient Descent to find the shortcomings in the predictions of the previous step. Gradient Descent is a popular and powerful algorithm, also used in Neural Networks;\n", + "* The goal of Gradient Boosting is to minimize the 'loss function'. Therefore, Gradient Boosting depends on the \"loss function\".\n", + "* Gradient boosting uses DecisionTree algorithms as its base learners;\n", + "\n", + "## Advantages\n", + "* Little need for pre-processing;\n", + "* Works with numeric or categorical COLUMNS;\n", + "* Handles Missing Values automatically in implementations that support them natively (e.g. XGBoost, LightGBM). In that case, Missing Value Imputation is not required;\n", + "\n", + "## Disadvantages\n", + "* Since Gradient Boosting keeps trying to minimize the errors at each iteration, it can overemphasize outliers and cause overfitting. Therefore, one should:\n", + " * Treat the outliers beforehand OR\n", + " * Use Cross-Validation to neutralize the effects of the outliers (**I prefer this method, since it takes less time**);\n", + "* Computationally expensive. 
Many trees (> 1000) are usually needed to get good results;\n", + "* Because of its flexibility (many parameters to tune), GridSearchCV is needed to find the optimal combination of hyperparameters;\n", + "\n", + "## References\n", + "* [Gradient Boosting Decision Tree Algorithm Explained](https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4) - Didactic and detailed.\n", + "* [Predicting Wine Quality with Gradient Boosting Machines](https://towardsdatascience.com/predicting-wine-quality-with-gradient-boosting-machines-a-gmb-tutorial-d950b1542065)\n", + "* [Parameter Tuning in Gradient Boosting (GBM) with Python](https://www.datacareer.de/blog/parameter-tuning-in-gradient-boosting-gbm/)\n", + "* [Tune Learning Rate for Gradient Boosting with XGBoost in Python](https://machinelearningmastery.com/tune-learning-rate-for-gradient-boosting-with-xgboost-in-python/)\n", + "* [In Depth: Parameter tuning for Gradient Boosting](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae) - Very good\n", + "* [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q4bUCZs2jNTA" + }, + "source": [ + "from sklearn.ensemble import GradientBoostingClassifier\n", + "\n", + "# Instantiate...\n", + "ml_GB=GradientBoostingClassifier(n_estimators=100, min_samples_split= 2)\n", + "\n", + "# Train...\n", + "ml_GB.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-dr6dyjdXwvd" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_GB, X_train, y_train, cv = i_CV)\n", + "print(f'Mean of the CV accuracies..............: 
{100*a_scores_CV.mean():.2f}')\n", + "print(f'Standard deviation of the CV accuracies: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VlC3y3M5YaGG" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vnLvQ0ZDYNjB" + }, + "source": [ + "**Interpretation**: Our classifier (GradientBoostingClassifier) has a mean accuracy of 96.86% (training set). In addition, the std is about 2.52%, which is small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2n1RKZuXq3D" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_GB.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8r6JCzQRGFa0" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KFv-Q2AD5uCk" + }, + "source": [ + "## Parameter tuning\n", + "> See [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/) for details on the parameters, their meaning, etc." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wgU040AcjNTF" + }, + "source": [ + "# Parameter dictionary for the parameter tuning.\n", + "d_parametros_GB= {'learning_rate': [1, 0.5, 0.25, 0.1, 0.05, 0.01]} #,\n", + "# 'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200],\n", + "# 'max_depth': [5, 10, 15, 20, 25, 30],\n", + "# 'min_samples_split': [0.1, 0.3, 0.5, 0.7, 0.9],\n", + "# 'min_samples_leaf': [0.1, 0.2, 0.3, 0.4, 0.5],\n", + "# 'max_features': list(range(1, X_train.shape[1]))}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v5KLFlpTjNTH" + }, + "source": [ + "# Call the function\n", + "ml_GB2, best_params= GridSearchOptimizer(ml_GB, 'ml_GB2', d_parametros_GB, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ6ERz3fi9i2" + }, + "source": [ + "### Result of the Gradient Boosting run" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSa7uKw13mKG" + }, + "source": [ + "```\n", + "[Parallel(n_jobs=-1)]: Done 275400 out of 275400 | elapsed: 93.7min finished\n", + "\n", + "Parametros otimizados: {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wiJpA2PyjDjR" + }, + "source": [ + "# Since the procedure above took 93 minutes to run, I build ml_GB2 below directly from the parameters found above\n", + "best_params= {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "\n", + "#ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + "# max_depth= best_params['max_depth'],\n", + "# max_features= best_params['max_features'],\n", + "# min_samples_leaf= 
best_params['min_samples_leaf'],\n", + "# min_samples_split= best_params['min_samples_split'],\n", + "# n_estimators= best_params['n_estimators'],\n", + "# random_state= i_Seed)\n", + "\n", + "ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + " max_depth= best_params['max_depth'],\n", + " min_samples_leaf= best_params['min_samples_leaf'],\n", + " min_samples_split= best_params['min_samples_split'],\n", + " n_estimators= best_params['n_estimators'],\n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mb14gJ7-jbVM" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TAqGZIFYm2sU" + }, + "source": [ + "X_train_GB, X_test_GB = seleciona_colunas_relevantes(ml_GB2, X_train, X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6yiu6dahnBvC" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "APrtWN18nc4t" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS0mLdOmnXAY" + }, + "source": [ + "# Train with the relevant COLUMNS\n", + "ml_GB2.fit(X_train_GB, y_train)\n", + "\n", + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_GB2, X_train_GB, y_train, cv = i_CV)\n", + "print(f'Mean accuracy.....: {100*a_scores_CV.mean():.2f}')\n", + "print(f'Standard deviation: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vmc9PP_Rn1TN" + }, + "source": [ + "## Validate the model using the X_test dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e3mnIALvnzP2" + }, + "source": [ + "y_pred_GB = ml_GB2.predict(X_test_GB)\n", + 
"\n", + "# Compute the accuracy\n", + "accuracy_score(y_test, y_pred_GB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwP9Z2GnkV7r" + }, + "source": [ + "___\n", + "# **XGBOOST (eXtreme Gradient Boosting)**\n", + "* XGBoost is an improvement on Gradient Boosting. The improvements are in speed and performance, in addition to fixing inefficiencies of GradientBoosting.\n", + "* The algorithm of choice among Kaggle Grandmasters;\n", + "* Parallelizable;\n", + "* Among the state-of-the-art Machine Learning algorithms;\n", + "\n", + "## Relevant parameters and their initial values\n", + "See [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) for full details on the parameters, their meaning, etc.\n", + "\n", + "* n_estimators = 100 (100 if the dataframe is large; if the dataframe is medium/small, then 1000) - The number of trees we want to build;\n", + "* max_depth = 3 - Determines how deep each tree can grow during any training round. Typical values are in the range [3, 10];\n", + "* learning_rate = 0.01 - Used to prevent overfitting, range: [0, 1];\n", + "* alpha (reg_alpha in the scikit-learn wrapper) - L1 regularization on the weights. Higher values result in more regularization;\n", + "* lambda (reg_lambda in the scikit-learn wrapper) - L2 regularization on the weights.\n", + "* colsample_bytree: 1 - fraction of COLUMNS used by each tree. A high value can cause overfitting;\n", + "* subsample: 0.8 - fraction of samples used per tree. A low value can lead to underfitting;\n", + "* gamma: 1 - Controls whether a given node will be split, based on the expected reduction in loss after the split. A higher value leads to fewer splits.\n", + "* objective: Defines the \"loss function\". 
The options include:\n", + " * reg:linear - For regression problems (renamed reg:squarederror in recent XGBoost releases);\n", + " * reg:logistic - For classification problems;\n", + " * binary:logistic - For binary classification problems, with predicted probabilities as output;\n", + "\n", + "## References\n", + "* [How exactly XGBoost Works?](https://medium.com/@pushkarmandot/how-exactly-xgboost-works-a320d9b8aeef)\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Gentle Introduction of XGBoost Library](https://medium.com/@imoisharma/gentle-introduction-of-xgboost-library-2b1ac2669680)\n", + "* [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7)\n", + "* [Exploring XGBoost](https://towardsdatascience.com/exploring-xgboost-4baf9ace0cf6)\n", + "* [Feature Importance and Feature Selection With XGBoost in Python](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/)\n", + "* [Ensemble Learning case study: Running XGBoost on Google Colab free GPU](https://towardsdatascience.com/running-xgboost-on-google-colab-free-gpu-a-case-study-841c90fef101) - Recommended\n", + "* [Predicting movie revenue with AdaBoost, XGBoost and LightGBM](https://towardsdatascience.com/predicting-movie-revenue-with-adaboost-xgboost-and-lightgbm-262eadee6daa)\n", + "* [Tuning XGBoost Hyperparameters with Scikit Optimize](https://towardsdatascience.com/how-to-improve-the-performance-of-xgboost-models-1af3995df8ad)\n", + "* [An Example of Hyperparameter Optimization on XGBoost, LightGBM and CatBoost using Hyperopt](https://towardsdatascience.com/an-example-of-hyperparameter-optimization-on-xgboost-lightgbm-and-catboost-using-hyperopt-12bc41a271e) - Interesting\n", + "* [XGBOOST vs LightGBM: Which algorithm wins the race !!!](https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d) - LightGBM has been showing promise.\n", + "* [From Zero to Hero in XGBoost Tuning](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58) - I liked it\n", + "* [Build XGBoost / LightGBM models on large datasets — what are the possible solutions?](https://towardsdatascience.com/build-xgboost-lightgbm-models-on-large-datasets-what-are-the-possible-solutions-bf882da2c27d)\n", + "* [Selecting Optimal Parameters for XGBoost Model Training](https://towardsdatascience.com/selecting-optimal-parameters-for-xgboost-model-training-c7cd9ed5e45e) - Very good!\n", + "* [CatBoost vs. Light GBM vs. XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db)\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iMM_R4_ukV7x" + }, + "source": [ + "from xgboost import XGBClassifier\n", + "import xgboost as xgb\n", + "\n", + "# Instantiate...\n", + "ml_XGB= XGBClassifier(silent=False, \n", + " scale_pos_weight=1,\n", + " learning_rate=0.01, \n", + " colsample_bytree = 1,\n", + " subsample = 0.8,\n", + " objective='binary:logistic', \n", + " n_estimators=1000, \n", + " reg_alpha = 0.3,\n", + " max_depth= 3, \n", + " gamma=1, \n", + " max_delta_step=5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E4wQMlDEFINR" + }, + "source": [ + "# Train...\n", + "ml_XGB.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zAhsTtwGqMkG" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_XGB, X_train, y_train, cv = i_CV)\n", + "print(f'Mean of the CV accuracies..............: {100*a_scores_CV.mean():.2f}')\n", + "print(f'Standard deviation of the CV accuracies: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JNyKX6PkrXOk" + }, + "source": [ + "**Interpretation**: Our classifier (XGBClassifier) has a mean accuracy of 96.72% (training set). In addition, the std is about 2.02%, which is small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_h0QYv3FkV73" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AKhhAZLjkV76" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_XGB.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ir2Kd1PqGHgz" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_test, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jEC7gW4qYpWw" + }, + "source": [ + "## Parameter tuning\n", + "### Further reading:\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)\n", + "\n", + "> Looking at the results above, which model is the best?\n", + "\n", + "XGBoost? Assuming so, let's now fine-tune the model's parameters." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n3MsUONPwIV9" + }, + "source": [ + "# Dicionário de parâmetros para XGBoost:\n", + "d_parametros_XGB = {'min_child_weight': [i for i in np.arange(1, 13)]} #,\n", + "# 'gamma': [i for i in np.arange(0, 5, 0.5)],\n", + "# 'subsample': [0.6, 0.8, 1.0],\n", + "# 'colsample_bytree': [0.6, 0.8, 1.0],\n", + "# 'max_depth': [3, 4, 5, 7, 9],\n", + "# 'learning_rate': [i for i in np.arange(0.01, 1, 0.1)]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CX27FCKmwSni" + }, + "source": [ + "# Invoca a função\n", + "ml_XGB, best_params= GridSearchOptimizer(ml_XGB, 'ml_XGB2', d_parametros_XGB, X_train, y_train, X_test, y_test, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9b7uCuF74Hjv" + }, + "source": [ + "### Resultado da execução do XGBoostClassifier\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 108000 out of 108000 | elapsed: 372.0min finished\n", + "\n", + "Parametros otimizados: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n7E0oyxEtbGi" + }, + "source": [ + "# Como o procedimento acima levou 372 minutos para executar, então vou estimar ml_XGB2 abaixo usando os parâmetros acima estimados\n", + "best_params= {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "\n", + "ml_XGB2= XGBClassifier(min_child_weight= best_params['min_child_weight'], \n", + " gamma= best_params['gamma'], \n", + " subsample= best_params['subsample'], \n", + " colsample_bytree= best_params['colsample_bytree'], \n", + " max_depth= best_params['max_depth'], \n", + " learning_rate= best_params['learning_rate'], \n", + " random_state= i_Seed)" + ], + "execution_count": null, + 
"outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CuqyLHTU5Z-j" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes\n", + "* [The Multiple faces of ‘Feature importance’ in XGBoost](https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QPG3JZIpRZ-T" + }, + "source": [ + "# plot feature importance\n", + "from xgboost import plot_importance\n", + "\n", + "xgb.plot_importance(ml_XGB2, color = 'red')\n", + "plt.title('importance', fontsize = 20)\n", + "plt.yticks(fontsize = 10)\n", + "plt.ylabel('features', fontsize = 20)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EmpRC2lHW-KP" + }, + "source": [ + "ml_XGB2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4f9MIEBiyq-5" + }, + "source": [ + "X_train_XGB, X_test_XGB= seleciona_colunas_relevantes(ml_XGB2, X_train, X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6EayWaY5nMm" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Huy18gKI5qad" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E3-PaTdc5vZk" + }, + "source": [ + "# Treina com as COLUNAS relevantes...\n", + "ml_XGB2.fit(X_train_XGB, y_train)\n", + "\n", + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_XGB2, X_train_XGB, y_train, cv = i_CV)\n", + "print(f'Acurácia Media: {100*a_scores_CV.mean():.2f}')\n", + "print(f'std médio.....: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tBdYikDU6NhD" + }, + "source": [ + "## Valida o modelo 
usando o dataframe X_test" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GcvY-VdL6VIZ" + }, + "source": [ + "y_pred_XGB = ml_XGB2.predict(X_test_XGB)\n", + "\n", + "# Calcula acurácia\n", + "accuracy_score(y_test, y_pred_XGB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8oLtdH-vTSbC" + }, + "source": [ + "xgb.to_graphviz(ml_XGB2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "czXQG3MCHfHM" + }, + "source": [ + "# KNN - KNEIGHBORSCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "llTTXNeyHiwx" + }, + "source": [ + "# BAGGINGCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fbkekd4QHoZO" + }, + "source": [ + "# EXTRATREESCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "widavwR4HzwE" + }, + "source": [ + "# SVM\n", + "https://data-flair.training/blogs/svm-support-vector-machine-tutorial/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "id_Ubulns6We" + }, + "source": [ + "# NAIVE BAYES" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3e0m7lEnYOV9" + }, + "source": [ + "# **IMPORTANCIA DAS COLUNAS**\n", + "Source: [Plotting Feature Importances](https://www.kaggle.com/grfiv4/plotting-feature-importances)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fjco0HnNYr-N" + }, + "source": [ + "def mostra_feature_importances(clf, X_train, y_train=None, \n", + " top_n=10, figsize=(8,8), print_table=False, title=\"Feature Importances\"):\n", + " '''\n", + " plot feature importances of a tree-based sklearn estimator\n", + " \n", + " Note: X_train and y_train are pandas DataFrames\n", + " \n", + " Note: Scikit-plot is a lovely package but I sometimes have issues\n", + " 1. flexibility/extendibility\n", + " 2. 
complicated models/datasets\n", + " But for many situations Scikit-plot is the way to go\n", + " see https://scikit-plot.readthedocs.io/en/latest/Quickstart.html\n", + " \n", + " Parameters\n", + " ----------\n", + " clf (sklearn estimator) if not fitted, this routine will fit it\n", + " \n", + " X_train (pandas DataFrame)\n", + " \n", + " y_train (pandas DataFrame) optional\n", + " required only if clf has not already been fitted \n", + " \n", + " top_n (int) Plot the top_n most-important features\n", + " Default: 10\n", + " \n", + " figsize ((int,int)) The physical size of the plot\n", + " Default: (8,8)\n", + " \n", + " print_table (boolean) If True, print out the table of feature importances\n", + " Default: False\n", + " \n", + " Returns\n", + " -------\n", + " the pandas dataframe with the features and their importance\n", + " \n", + " Author\n", + " ------\n", + " George Fisher\n", + " '''\n", + " \n", + " __name__ = \"mostra_feature_importances\"\n", + " \n", + " import pandas as pd\n", + " import numpy as np\n", + " import matplotlib.pyplot as plt\n", + " \n", + " from xgboost.core import XGBoostError\n", + " from lightgbm.sklearn import LightGBMError\n", + " \n", + " try: \n", + " if not hasattr(clf, 'feature_importances_'):\n", + " clf.fit(X_train.values, y_train.values.ravel())\n", + "\n", + " if not hasattr(clf, 'feature_importances_'):\n", + " raise AttributeError(\"{} does not have feature_importances_ attribute\".\n", + " format(clf.__class__.__name__))\n", + " \n", + " except (XGBoostError, LightGBMError, ValueError):\n", + " clf.fit(X_train.values, y_train.values.ravel())\n", + " \n", + " feat_imp = pd.DataFrame({'importance':clf.feature_importances_}) \n", + " feat_imp['feature'] = X_train.columns\n", + " feat_imp.sort_values(by ='importance', ascending = False, inplace = True)\n", + " feat_imp = feat_imp.iloc[:top_n]\n", + " \n", + " feat_imp.sort_values(by='importance', inplace = True)\n", + " feat_imp = feat_imp.set_index('feature', drop = 
True)\n", + " feat_imp.plot.barh(title=title, figsize=figsize)\n", + " plt.xlabel('Feature Importance Score')\n", + " plt.show()\n", + " \n", + " if print_table:\n", + " from IPython.display import display\n", + " print(\"Top {} features in descending order of importance\".format(top_n))\n", + " display(feat_imp.sort_values(by = 'importance', ascending = False))\n", + " \n", + " return feat_imp" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ycu_EIGlYUYn" + }, + "source": [ + "import pandas as pd\n", + "\n", + "from xgboost import XGBClassifier\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", + "from sklearn.tree import ExtraTreeClassifier\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import GradientBoostingClassifier\n", + "from sklearn.ensemble import BaggingClassifier\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "from lightgbm import LGBMClassifier\n", + "\n", + "clfs = [XGBClassifier(), LGBMClassifier(), \n", + " ExtraTreesClassifier(), ExtraTreeClassifier(),\n", + " BaggingClassifier(), DecisionTreeClassifier(),\n", + " GradientBoostingClassifier(), LogisticRegression(),\n", + " AdaBoostClassifier(), RandomForestClassifier()]\n", + "\n", + "for clf in clfs:\n", + " try:\n", + " _ = mostra_feature_importances(clf, X_train, y_train, top_n=X_train.shape[1], title=clf.__class__.__name__)\n", + " except AttributeError as e:\n", + " print(e)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EwWkjfC8KEZH" + }, + "source": [ + "# ENSEMBLE METHODS\n", + "https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b\n", + "\n", + "![Ensemble](https://github.com/MathMachado/Materials/blob/master/Ensemble.png?raw=true)" + ] + 
}, + { + "cell_type": "markdown", + "metadata": { + "id": "3Uf1RML7xETY" + }, + "source": [ + "# WOE e IV\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TBNRfYZCyhMP" + }, + "source": [ + "## Construção do exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gIIroyyP4ZRZ" + }, + "source": [ + "df_y.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PzQQdrkf1ohX" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo']= choices(['A', 'B', 'C', 'D'], k= 1000)\n", + "df_X2['idade']= np.random.randint(10, 15, size= 1000)\n", + "df_X2['target']= df_y['target']\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v-OpwIpx4hXJ" + }, + "source": [ + "df_X2['target'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yZfqSvbKzeJ3" + }, + "source": [ + "def Constroi_Buckets(df, i, k= 10):\n", + " coluna= 'v'+ str(i)\n", + " df[coluna+'_Bucket']= pd.cut(df[coluna], bins= k, labels= np.arange(1, k+1))\n", + " df= df.drop(columns= [coluna], axis= 1)\n", + " return df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "V6Nrpsx60HD3" + }, + "source": [ + "for i in np.arange(1,19):\n", + " df_X2= Constroi_Buckets(df_X2, i)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J2Fbh41-03OB" + }, + "source": [ + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "O9r5BeWVxIr3" + }, + "source": [ + "# Função para calcular WOE e IV\n", + "def calculate_woe_iv(dataset, feature, target):\n", + "\n", + " def codethem(IV):\n", + " if IV < 0.02: return 'Useless'\n", + " elif IV >= 0.02 and IV < 0.1: return 'Weak'\n", + " elif IV >= 
0.1 and IV < 0.3: return 'Medium'\n", + " elif IV >= 0.3 and IV < 0.5: return 'Strong'\n", + " elif IV >= 0.5: return 'Suspicious'\n", + " else: return 'None'\n", + "\n", + " lst = []\n", + " for i in range(dataset[feature].nunique()):\n", + " val = list(dataset[feature].unique())[i]\n", + " lst.append({\n", + " 'Value': val,\n", + " 'All': dataset[dataset[feature] == val].count()[feature],\n", + " 'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],\n", + " 'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]\n", + " })\n", + " \n", + " dset = pd.DataFrame(lst)\n", + " dset['Distr_Good'] = dset['Good']/dset['Good'].sum()\n", + " dset['Distr_Bad'] = dset['Bad']/dset['Bad'].sum()\n", + " dset['Mean']= dset['All']/dset['All'].sum()\n", + " dset['WoE'] = np.log(dset['Distr_Good']/dset['Distr_Bad'])\n", + " dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})\n", + " dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']\n", + " #dset= dset.drop(columns= ['Distr_Good', 'Distr_Bad'], axis= 1)\n", + "\n", + " dset['Predictive_Power']= dset['IV'].map(codethem)\n", + " iv = dset['IV'].sum() \n", + " dset = dset.sort_values(by='IV') \n", + " return dset, iv" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y8WGjWH63nx_" + }, + "source": [ + "df_Lab = df_X2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-N6xr1MgxTiz" + }, + "source": [ + "def calcula_Predictive_Power(df_Lab, coluna):\n", + " print('WoE and IV for column: {}'.format(coluna))\n", + " df, iv = calculate_woe_iv(df_Lab, coluna, 'target')\n", + " print(df)\n", + " print('IV score: {:.2f}'.format(iv))\n", + " print('\\n')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ayqN_7WnxVq9" + }, + "source": [ + "for i in np.arange(1,19):\n", + " coluna= 
'v'+str(i)+'_Bucket'\n", + " calcula_Predictive_Power(df_Lab, coluna)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtoJVI4Pyx3I" + }, + "source": [ + "# **IMBALANCED SAMPLE**\n", + "> Alguns problemas, como a detecção de fraude em transações bancárias ou a detecção de intrusão em redes, têm em comum o fato de que a classe de interesse (o que queremos detectar) geralmente é um evento raro.\n", + "\n", + "## Exemplo: Detectar fraude\n", + "A proporção entre FRAUDES e NÃO-FRAUDES é de aproximadamente 1%/99%. Neste caso, se desenvolvermos um modelo para detectar fraudes e ele classificar todas as instâncias como NÃO-FRAUDE, o modelo terá uma acurácia de 99%. No entanto, este modelo não nos ajudará em nada, pois não detecta nenhuma fraude.\n", + "\n", + "## Necessidade de se usar outras métricas\n", + "> Recomenda-se utilizar outras métricas (na verdade, é boa prática usar mais de uma métrica para medir a performance dos modelos) como, por exemplo, F1-Score, Precision, Recall/Sensitivity, Specificity e AUROC.\n", + "\n", + "## Como lidar com a amostra desbalanceada?\n", + "* Under-sampling\n", + "> Seleciona aleatoriamente instâncias da classe MAJORITÁRIA (em nosso exemplo, NÃO-FRAUDE) até igualar o número de instâncias da classe MINORITÁRIA (FRAUDE);\n", + "\n", + "* Over-sampling\n", + "> Reamostra aleatoriamente, com reposição, a classe MINORITÁRIA (em nosso exemplo, FRAUDE) até igualar o número de instâncias da classe MAJORITÁRIA (NÃO-FRAUDE), ou uma proporção da classe MAJORITÁRIA. 
Veja a técnica SMOTE (Synthetic Minority Over-sampling Technique), disponível na biblioteca imbalanced-learn;\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2o45zx8zw-aB" + }, + "source": [ + "## EFEITOS DA AMOSTRA DESBALANCEADA" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCVTPCB-Xkbd" + }, + "source": [ + "# TPOT\n", + "https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2ulXii6JXpWd" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_TWUq-z4X4yZ" + }, + "source": [ + "___\n", + "# FEATURETOOLS\n", + "https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183\n", + "\n", + "https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/\n", + "\n", + "https://mlwhiz.com/blog/2019/05/19/feature_extraction/\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aZUNOgmSgAmq" + }, + "source": [ + "!pip install featuretools" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_sxdONzsh9rb" + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "p5_ynGo1dBJJ" + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TqJRJXUhiDqf" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo'] = choices(['A', 'B', 'C', 'D'], k = 1000)\n", + "df_X2['idade'] = np.random.randint(10, 15, size = 1000)\n", + "df_X2['id'] = range(0,1000)\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nR56bGGngk-W" + }, + "source": [ + "# Automated 
feature engineering\n", + "import featuretools as ft\n", + "import featuretools.variable_types as vtypes\n", + "\n", + "es= ft.EntitySet(id = 'simulacao')\n", + "\n", + "# adding a dataframe \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id')\n", + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IOJ4Tr5Ogk6M" + }, + "source": [ + "es['df_X2'].variables" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1uXPqHDZgkys" + }, + "source": [ + "variable_types = {'idade': vtypes.Categorical}\n", + " \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id', variable_types= variable_types)\n", + "\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'tipo', index='id')\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'idade', index='id')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dnbYTBqugkvm" + }, + "source": [ + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I2v_jetdgkr7" + }, + "source": [ + "feature_matrix, feature_names = ft.dfs(entityset=es, target_entity = 'df_X2', max_depth = 3, verbose = 3, n_jobs= 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zZiRBvHXgkoJ" + }, + "source": [ + "feature_matrix.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aWiahwKe2d6U" + }, + "source": [ + "# **EXERCÍCIOS**\n", + "> Encontre algoritmos adequados para ser aplicados aos seguintes problemas:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XbSLkbDB2mzK" + }, + "source": [ + "## Exercício 1 - Credit Card Fraud Detection\n", + "Source: [Credit Card Fraud 
Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)\n", + "\n", + "### Leitura de suporte\n", + "* [Detecting Credit Card Fraud Using Machine Learning](https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8)\n", + "* [Credit Card Fraud Detection](https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59)\n", + "\n", + "### Dataframe\n", + "* [Creditcard.csv](https://raw.githubusercontent.com/MathMachado/DataFrames/master/creditcard.csv)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYgK6JXd3MgA" + }, + "source": [ + "## Exercício 2 - Predicting species on IRIS dataset\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "si0rsJvu3O6O" + }, + "source": [ + "from sklearn import datasets\n", + "import xgboost as xgb\n", + "\n", + "iris = datasets.load_iris()\n", + "X_iris = iris.data\n", + "y_iris = iris.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zom8t4yWC_UC" + }, + "source": [ + "## Exercício 3 - Predict Wine Quality\n", + "> Estimar a qualidade dos vinhos, numa escala de 0–100. 
A seguir, a qualidade em função da escala:\n", + "\n", + "* 95–100 Classic: a great wine\n", + "* 90–94 Outstanding: a wine of superior character and style\n", + "* 85–89 Very good: a wine with special qualities\n", + "* 80–84 Good: a solid, well-made wine\n", + "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n", + "* 50–74 Not recommended\n", + "\n", + "Source: [Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klL2Q9Ria96n" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Wine = datasets.load_wine()\n", + "X_vinho = Wine.data\n", + "y_vinho = Wine.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lhVhSWBgGijq" + }, + "source": [ + "## Exercício 4 - Predict Parkinson\n", + "Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SVCxHqv0VBJn" + }, + "source": [ + "## Exercício 5 - Predict survivors from Titanic tragedy\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwvB8us4eKNi" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "\n", + "df_titanic = sns.load_dataset('titanic')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZJrT9YIXVdtx" + }, + "source": [ + "## Exercício 6 - Predict Loan\n", + "> Os dados devem ser obtidos diretamente da fonte: [Loan Default Prediction - Imperial College London](https://www.kaggle.com/c/loan-default-prediction/data)\n", + "\n", + "* [Bank Loan Default Prediction](https://medium.com/@wutianhao910/bank-loan-default-prediction-94d4902db740)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R8-GVu7ZWeA8" + }, + "source": [ + "## Exercício 7 - Predict the sales of a store.\n", + "* 
[Predicting expected sales for Bigmart’s stores](https://medium.com/diogo-menezes-borges/project-1-bigmart-sale-prediction-fdc04f07dc1e)\n", + "* Dataframes\n", + " * [Treinamento](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_train.txt)\n", + " * [Validação](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_test.txt)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fv9w86j4Wnwj" + }, + "source": [ + "## Exercício 8 - [The Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)\n", + "> Predict the median value of owner-occupied homes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5HYRt8-ug1BT" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "# Nota: load_boston foi removido do scikit-learn a partir da versão 1.2.\n", + "Boston = datasets.load_boston()\n", + "X_boston = Boston.data\n", + "y_boston = Boston.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UDIaqmtXQ0T" + }, + "source": [ + "## Exercício 9 - Predict the height or weight of a person.\n", + "\n", + "http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-7R146nIXmMT" + }, + "source": [ + "## Exercício 10 - Black Friday Sales Prediction - Predict purchase amount.\n", + "\n", + "This dataset comprises sales transactions captured at a retail store. It’s a classic dataset to explore and expand your feature engineering skills and day-to-day understanding from multiple shopping experiences. This is a regression problem. 
The dataset has 550,069 rows and 12 columns.\n", + "\n", + "https://github.com/MathMachado/DataFrames/blob/master/blackfriday.zip\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mQ8FPbuLZlIh" + }, + "source": [ + "## Exercício 11 - Predict the income class of US population.\n", + "\n", + "http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Af4NRrchgPlM" + }, + "source": [ + "## Exercício 12 - Predicting Cancer\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c4LOlgZW3P40" + }, + "source": [ + "from sklearn import datasets\n", + "cancer = datasets.load_breast_cancer()\n", + "X_cancer = cancer.data\n", + "y_cancer = cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74PmpT8Ix0tD" + }, + "source": [ + "## Exercício 13\n", + "Source: [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WY8GZMixZ9W9" + }, + "source": [ + "## Exercício 14 - Predict Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y92t6tbOge0S" + }, + "source": [ + "from sklearn import datasets\n", + "Diabetes= datasets.load_diabetes()\n", + "\n", + "X_diabetes = Diabetes.data\n", + "y_diabetes = Diabetes.target" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15_00__Machine_Learning___DSWP_hs3.ipynb b/Notebooks/NB15_00__Machine_Learning___DSWP_hs3.ipynb new file mode 100644 index 000000000..910b16443 --- /dev/null +++ b/Notebooks/NB15_00__Machine_Learning___DSWP_hs3.ipynb @@ -0,0 +1,6645 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "name": "python3", + 
"display_name": "Python 3" + }, + "colab": { + "name": "NB15_00__Machine_Learning.ipynb", + "provenance": [], + "include_colab_link": true + }, + "accelerator": "TPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ShVXyGj9wkgN" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYQ4cDfcPu4e" + }, + "source": [ + "___\n", + "# **NOTAS E OBSERVAÇÕES**\n", + "* Abordar o impacto do desbalanceamento da amostra;\n", + "* Colocar AUROC no material e mostrar o cut off para classificação entre 0 e 1;\n", + "* Conceitos estatísticos de bias & variance;\n", + "* Ver Sklearn.optimize: https://web.telegram.org/#/im?p=g497957288" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YvhLC_uf4_G" + }, + "source": [ + "___\n", + "# **AGENDA**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QgX6n2VDyY1O" + }, + "source": [ + "___\n", + "# **REFERÊNCIAS**\n", + "* [scikit-learn - Machine Learning With Python](https://scikit-learn.org/stable/);\n", + "* [An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples](https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer)\n", + "* [The Difference Between Artificial Intelligence, Machine Learning, and Deep Learning](https://medium.com/iotforall/the-difference-between-artificial-intelligence-machine-learning-and-deep-learning-3aa67bff5991)\n", + "* [A Gentle Guide to Machine Learning](https://blog.monkeylearn.com/a-gentle-guide-to-machine-learning/)\n", + "* [A Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n", + "* [Introduction to Machine Learning](http://alex.smola.org/drafts/thebook.pdf)\n", + "* [The 10 Statistical Techniques Data Scientists Need to Master](https://medium.com/cracking-the-data-science-interview/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7)\n", + "* [Tune: a library for fast hyperparameter tuning at any scale](https://towardsdatascience.com/fast-hyperparameter-tuning-at-scale-d428223b081c)\n", + "* [How to lie with Data Science](https://towardsdatascience.com/how-to-lie-with-data-science-5090f3891d9c)\n", + "* [5 Reasons “Logistic Regression” 
should be the first thing you learn when becoming a Data Scientist](https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4)\n", + "* [Machine learning on categorical variables](https://towardsdatascience.com/machine-learning-on-categorical-variables-3b76ffe4a7cb)\n", + "\n", + "## Deep Learning & Neural Networks\n", + "\n", + "- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html)\n", + "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721)\n", + "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsCbZd2epfxo" + }, + "source": [ + "___\n", + "# **INTRODUÇÃO**\n", + "\n", + "* \"__Information is the oil of the 21st century, and analytics is the combustion engine__.\" - Peter Sondergaard, SVP, Garner Research;\n", + "\n", + "\n", + ">O foco deste capítulo será:\n", + "* Linear, Logistic Regression, Decision Tree, Random Forest, Support Vector Machine and XGBoost algorithms for building Machine Learning models;\n", + "* Entender como resolver problemas de classificação e Regressão;\n", + "* Aplicar técnicas de Ensemble como Bagging e Boosting;\n", + "* Como medir a acurácia dos modelos de Machine Learning;\n", + "* Aprender os principais algoritmos de Machine Learning tanto das técnicas de aprendizagem supervisionada quanto da não-supervisionada.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HqqB2vaHXMGt" + }, + "source": [ + "___\n", + "# **ARTIFICIAL INTELLIGENCE VS MACHINE LEARNING VS DEEP LEARNING**\n", + "* **Machine Learning** - dá aos computadores a capacidade de aprender sem serem explicitamente programados. 
Os computadores podem melhorar sua capacidade de aprendizagem através da prática de uma tarefa, geralmente usando grandes conjuntos de dados.\n", + "* **Deep Learning** - é um método de Machine Learning que depende de redes neurais artificiais, permitindo que os sistemas de computadores aprendam pelo exemplo, assim como nós humanos aprendemos." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P961GcguXFFA" + }, + "source": [ + "![EvolutionOfAI](https://github.com/MathMachado/Materials/blob/master/Evolution%20of%20AI.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://github.com/MathMachado/P4ML/blob/DS_Python/Material/Evolution%20of%20AI.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkqGtO88ZkPr" + }, + "source": [ + "![AI_vs_ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/AI_vs_ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xesQpzfmaqj6" + }, + "source": [ + "![ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. 
Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KeIVR59IIS7f" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING - TECHNIQUES**\n", + "\n", + "* Supervised Learning\n", + "* Unsupervised Learning\n", + "\n", + "![MachineLearning](https://github.com/MathMachado/Materials/blob/master/MachineLearningTechniques.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rvwp5UHdBiup" + }, + "source": [ + "___\n", + "# **NOSSO FOCO AQUI SERÁ...**\n", + "\n", + "![ClassicalML](https://github.com/MathMachado/Materials/blob/master/ClassicalML.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cBLSvJTXHBjK" + }, + "source": [ + "___\n", + "# **CHEATSHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZdjR3nahUuKq" + }, + "source": [ + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MkBSvyorGXQz" + }, + "source": [ + "___\n", + "# **CROSS-VALIDATION**\n", + "* K-fold é o método de Cross-Validation (CV) mais conhecido e utilizado;\n", + "* Como funciona: divide o dataframe de treinamento em k partes;\n", + " * Usa k-1 partes para treinar o modelo e a parte restante para validá-lo;\n", + " * Repete este processo k vezes, sendo que em cada iteração calcula as métricas desejadas (exemplo: acurácia);\n", + " * Ao final das k iterações, teremos k valores de cada métrica, dos quais calculamos a média e o desvio-padrão.\n", + "\n", + " A figura abaixo nos ajuda 
a entender como funciona CV:\n", + "\n", + "![Cross-Validation](https://github.com/MathMachado/Materials/blob/master/CV2.PNG?raw=true)\n", + "\n", + "Source: [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + "\n", + "* **valor de k**:\n", + " * valor de k (folds): entre 5 e 10 --> Não há regra geral para a escolha de k;\n", + " * Quanto maior o valor de k --> menor o viés do CV --> Experimento Estatístico para mostrar o efeito.\n", + "\n", + "[Applied Predictive Modeling, 2013](https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/ref=as_li_ss_tl?ie=UTF8&qid=1520380699&sr=8-1&keywords=applied+predictive+modeling&linkCode=sl1&tag=inspiredalgor-20&linkId=1af1f3de89c11e4a7fd49de2b05e5ebf)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HscfN-a1V043" + }, + "source": [ + "* **Vantagens do uso de CV**:\n", + " * Modelos com melhor acurácia;\n", + " * Melhor uso dos dados, pois todos os dados são utilizados como treinamento e validação. 
Portanto, qualquer problema com os dados serão encontrados nesta fase.\n", + "\n", + "* **Leitura Adicional**\n", + " * [Cross-Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)\n", + " * [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + " * [Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XRukccWQSklx" + }, + "source": [ + "## Medidas para avaliarmos a variabilidade presente nos dados\n", + "* As principais medidas para medirmos a variabilidade dos dados são amplitude, variância, desvio padrão e coeficiente de variação;\n", + "* Estas medidas nos permite concluir se os dados são homogêneos (menor dispersão/variabilidade) ou heterogêneos (maior variabilidade/dispersão).\n", + "\n", + "* **Na próxima versão, trazer estes conceitos para o Notebook e usar o Python para calcular estas medidas**." 
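+ , + "\n", + "\n", + "As a starting point, and assuming a sample $x_1, \\dots, x_n$ with mean $\\bar{x}$ (the $n-1$ denominator below is the usual sample-variance convention, an assumption that is not stated above), these measures can be written as:\n", + "\n", + "$$R = \\max_i x_i - \\min_i x_i, \\qquad s^2 = \\frac{1}{n-1}\\sum_{i=1}^{n}(x_i-\\bar{x})^2, \\qquad s = \\sqrt{s^2}, \\qquad CV = \\frac{s}{\\bar{x}}$$\n", + "\n", + "For a hypothetical sample $\\{2, 4, 4, 6, 9\\}$: $\\bar{x} = 5$, $R = 9 - 2 = 7$, $s^2 = (9+1+1+1+16)/4 = 7$, $s = \\sqrt{7} \\approx 2.65$ and $CV \\approx 2.65/5 = 0.53$. Here $CV$ denotes the coefficient of variation, not Cross-Validation."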
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yBR8tWV_lhQq" + }, + "source": [ + "___\n", + "# **ENSEMBLE METHODS** (= combining predictive models)\n", + "* Methods\n", + "  * **Bagging** (Bootstrap AGGregatING)\n", + "  * **Boosting**\n", + "  * Stacking --> less commonly used\n", + "* Helps reduce overfitting (overfitting is when the model/function fits the training data very well but fails to generalize to other samples/the population).\n", + "* Builds meta-classifiers: combines the results of several algorithms to produce predictions that are more accurate and robust than those of each individual classifier.\n", + "* Ensembles reduce the effects of the main sources of error in Machine Learning models:\n", + "  * noise;\n", + "  * bias;\n", + "  * variance --> in this context, how much the model's predictions change from one training sample to another.\n", + "\n", + "# References\n", + "* [Simple guide for ensemble learning methods](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2) - Explains clearly how ensembles work."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25RW8u-Sj780" + }, + "source": [ + "### Further Reading\n", + "* [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)\n", + "* [Ensemble Methods in Machine Learning: What are They and Why Use Them?](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)\n", + "* [Ensemble Learning Using Scikit-learn](https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a)\n", + "* [Let’s Talk About Machine Learning Ensemble Learning In Python](https://medium.com/fintechexplained/lets-talk-about-machine-learning-ensemble-learning-in-python-382747e5fba8)\n", + "* [Boosting, Bagging, and Stacking — Ensemble Methods with sklearn and mlens](https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FugME1HSl4jJ" + }, + "source": [ + "___\n", + "# **PARAMETER TUNING** (= optimal parameters for Machine Learning models)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u_147cIRl9F1" + }, + "source": [ + "## GridSearch (the tool we will use to optimize the parameters of the ML models)\n", + "* Finds the optimal parameters (hyperparameter tuning) that improve the accuracy of the models.\n", + "* It needs the following inputs:\n", + "  * The matrix $X_{p}$ with the $p$ COLUMNS (variables or features) of the dataframe;\n", + "  * The vector $y$ with the target COLUMN (response variable);\n", + "  * The estimator to be tuned, for example DecisionTreeClassifier, RandomForestClassifier, XGBClassifier, etc.;\n", + "  * A dictionary with the parameters to be optimized;\n", + "  * The number of folds for the Cross-Validation method."
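+ , + "\n", + "\n", + "To get a feel for the cost of a grid search, the number of model fits can be sketched as follows, assuming a hypothetical grid over $m$ hyperparameters with $n_j$ candidate values each and $k$ folds:\n", + "\n", + "$$\\text{fits} = k \\times \\prod_{j=1}^{m} n_j$$\n", + "\n", + "For example, a hypothetical grid with 3 values of `max_depth` and 4 values of `min_samples_split`, evaluated with $k = 10$ folds, requires $10 \\times 3 \\times 4 = 120$ fits, plus one final refit on the whole training sample if the best model is refitted."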
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "39Sg77fbTWCO" + }, + "source": [ + "___\n", + "# **MODEL SELECTION & EVALUATION**\n", + "> In this phase we identify and apply the most appropriate metrics (Accuracy, Sensitivity, Specificity, F-Score, AUC, R-Sq, Adj R-Sq, RMSE (Root Mean Square Error)) to evaluate the performance of the ML models.\n", + ">> We train the ML models on the training sample and evaluate their performance on the test/validation sample.\n", + "\n", + "* Further Reading\n", + "  * [The 5 Classification Evaluation metrics every Data Scientist must know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)\n", + "  * [Confusion matrix and other metrics in machine learning](https://medium.com/hugo-ferreiras-blog/confusion-matrix-and-other-metrics-in-machine-learning-894688cb1c0a)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oQQVzZ2ZTYrB" + }, + "source": [ + "## Confusion Matrix\n", + "* Terms associated with the Confusion Matrix:\n", + "  * **True Positive** (TP): the observed value is True and the model predicts True. In other words, the model got it right.\n", + "    * Example: **Observed**: Fraud (Positive); **Model**: Fraud (Positive) --> the model got it right!\n", + "  * **True Negative** (TN): the observed value is False and the model predicts False. In other words, the model got it right.\n", + "    * Example: **Observed**: NOT Fraud (Negative); **Model**: NOT Fraud (Negative) --> the model got it right!\n", + "  * **False Positive** (FP): the observed value is False and the model predicts True. In other words, the model got it wrong.\n", + "    * Example: **Observed**: NOT Fraud (Negative); **Model**: Fraud (Positive) --> the model got it wrong!\n", + "  * **False Negative** (FN): the observed value is True and the model predicts False. In other words, the model got it wrong.\n", + "    * Example: **Observed**: Fraud (Positive); **Model**: NOT Fraud (Negative) --> the model got it wrong!\n", + "\n", + "* See [Confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)\n", + "\n", + "![ConfusionMatrix](https://github.com/MathMachado/Materials/blob/master/ConfusionMatrix.PNG?raw=true)\n", + "\n", + "Source: [Confusion Matrix](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ci-6eiqBTgbL" + }, + "source": [ + "## Accuracy\n", + "> Accuracy - the proportion of predictions that the model gets right.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "How often does the classifier (predictive model) classify correctly?\n", + "```\n", + "\n", + "$$Accuracy= \\frac{TP+TN}{TP+TN+FP+FN}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7YI8X5TRx-R" + }, + "source": [ + "## Precision (or Positive Predictive Value)\n", + "> **Precision** - tells us how the classifier performs with respect to False Positives (how many of the predicted Positives are actually Positive).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the classifier predicts Positive, how often is it correct?\n", + "```\n", + "\n", + "\n", + "$$Precision= \\frac{TP}{TP+FP}$$\n", + "\n", + "**Example**: Precision tells us the proportion of customers the model flagged as Fraud that really are Fraud.\n", + "\n", + "**Comment**: If our focus is to minimize False Positives (FP), we should aim for Precision as close to 100% as possible."
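+ , + "\n", + "\n", + "As a worked illustration, assume a hypothetical confusion matrix (these counts are made up, not results from this notebook) with $TP = 40$, $TN = 45$, $FP = 5$ and $FN = 10$:\n", + "\n", + "$$Accuracy = \\frac{40+45}{40+45+5+10} = 0.85 \\qquad Precision = \\frac{40}{40+5} \\approx 0.889$$\n", + "\n", + "So 85% of all predictions are correct, and about 89% of the cases flagged as Positive really are Positive."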
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zO39n8x_Sz3L" + }, + "source": [ + "## Recall (or Sensitivity)\n", + "> **Recall** - tells us how the classifier performs with respect to False Negatives (how many Positives we missed).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Positive, how often is the classifier correct?\n", + "```\n", + "\n", + "$$Recall = Sensitivity = \\frac{TP}{TP+FN}$$\n", + "\n", + "**Example**: Recall is the proportion of customers observed as Fraud that the model predicts as Fraud.\n", + "\n", + "**Comment**: If our focus is to minimize False Negatives (FN), we should aim for Recall as close to 100% as possible." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "htS6rdHVVXRG" + }, + "source": [ + "## Specificity\n", + "> **Specificity** - the ratio of TN to TN+FP.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Negative, how often is the classifier correct?\n", + "```\n", + "\n", + "**Example**: Specificity is the proportion of NOT Fraud customers that the model predicts as NOT Fraud.\n", + "\n", + "$$Specificity= \\frac{TN}{TN+FP}$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mNn0twadTacc" + }, + "source": [ + "## F1-Score\n", + "> F1-Score is the harmonic mean of Recall and Precision and is a number between 0 and 1. The closer to 1, the better; the closer to 0, the worse. In other words, it balances Recall and Precision.\n", + "\n", + "$$F1\\_Score= 2\\left(\\frac{Recall*Precision}{Recall+Precision}\\right)$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rsH9dMxazWCg" + }, + "source": [ + "# **EXAMPLE DATAFRAME USED IN THIS TUTORIAL**\n", + "> Generate a dataframe with 18 columns: 9 informative, 6 redundant and 3 repeated:\n", + "\n", + "To learn more about generating example (toy) dataframes, see [Synthetic data generation — a must-have skill for new data scientists](https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GEyDo_EIV_jV" + }, + "source": [ + "## Define global variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TdwgpZ76WFaT" + }, + "source": [ + "i_CV = 10 # Number of Cross-Validation folds\n", + "i_Seed = 20111974 # seed, for reproducibility\n", + "f_Test_Size = 0.3 # Proportion of the dataframe held out for validation (other common values are 0.15, 0.20 or 0.25)" + ], + "execution_count": 73, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gJTJfpwWzykS" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "\n", + "X, y = make_classification(n_samples = 1000, \n", + "                           n_features = 18, \n", + "                           n_informative = 9, \n", + "                           n_redundant = 6, \n", + "                           n_repeated = 3, \n", + "                           n_classes = 2, \n", + "                           n_clusters_per_class = 1, \n", + "                           random_state=i_Seed)" + ], + "execution_count": 74, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gWy2IZh3s-o3", + "outputId": "ad659add-4519-4c5b-c6a2-8b0f1fa98196", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 238 + } + }, + "source": [ + "X" + ], + "execution_count": 75, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 0.06844089, 4.21184154, -2.5583024 , ..., 
-0.63061895,\n", + " -0.97831983, -0.88826977],\n", + " [-4.8240213 , 0.17950903, -2.98447332, ..., 0.33992045,\n", + " 1.89153784, -6.10967565],\n", + " [ 1.38953042, -0.226476 , 1.8774004 , ..., -1.47784549,\n", + " 0.96052606, 2.06020368],\n", + " ...,\n", + " [ 1.62548685, 0.43377848, 4.93537285, ..., -4.61990917,\n", + " 0.18310709, 6.16040231],\n", + " [-2.40619087, -1.65474635, 2.64196493, ..., -1.21427845,\n", + " 0.83745861, 0.8254619 ],\n", + " [-4.00041881, 2.52475556, -4.15290177, ..., -0.51680266,\n", + " 1.72224835, -5.59558306]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 75 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ccjhGnzxtAaV", + "outputId": "d42b942e-a902-488c-d585-5ecb5a2a98bc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y[0:30] # Binary target, similar to a fraud flag: {0, 1}" + ], + "execution_count": 76, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 76 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHO2befKJxR3" + }, + "source": [ + "___\n", + "# **DECISION TREE**\n", + "> Decision Trees have a tree-shaped structure.\n", + "\n", + "* **Main advantages**:\n", + "  * They are easy to understand, visualize and interpret;\n", + "  * They easily capture non-linear patterns in the data;\n", + "  * They require little computing power --> training Decision Trees is not computationally expensive!\n", + "  * They handle numeric or categorical COLUMNS well;\n", + "  * They do not require the data to be normalized;\n", + "  * They can be used for Feature Engineering when dealing with Missing Values;\n", + "  * They can be used for Feature Selection;\n", + "  * They require no assumptions about the distribution of the data, because the algorithm is non-parametric.\n", + "\n", + "* **Main disadvantages**\n", + "  * Prone to overfitting, since Decision Trees can grow complex trees that do not generalize well. Things get much worse if the training sample contains outliers, so **treating outliers beforehand is strongly recommended**.\n", + "  * They can produce biased trees if the dataframe is unbalanced or one class is dominant. For this reason, **balancing the dataframe beforehand is recommended**.\n", + "\n", + "* **Main split criteria** (the `criterion` parameter)\n", + "  * **Gini Index** - a metric that measures how often a randomly selected point/observation would be incorrectly classified.\n", + "    * Therefore, the lower the Gini Index, the better the COLUMN;\n", + "  * **Entropy** - a metric that measures the randomness of the information in the data.\n", + "    * Therefore, the higher the entropy of a COLUMN, the less it helps us reach a conclusion (classify, for example).\n", + "\n", + "## **References**:\n", + "* [1.10. 
Decision Trees](https://scikit-learn.org/stable/modules/tree.html).\n", + "* [Decision Tree Algorithm With Hands On Example](https://medium.com/datadriveninvestor/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38) - a great tutorial for learning, understanding, interpreting and computing the Gini index and entropy.\n", + "* [Intuitive Guide to Understanding Decision Trees](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-decision-trees-adb2165ccab7) - a great tutorial for learning, understanding, interpreting and computing the Gini index and entropy.\n", + "* [The Complete Guide to Decision Trees](https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14)\n", + "* [Creating and Visualizing Decision Tree Algorithm in Machine Learning Using Sklearn](https://intellipaat.com/blog/decision-tree-algorithm-in-machine-learning/) - Very didactic!\n", + "* [Decision Trees in Machine Learning](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FrMkPN5aLp0Y" + }, + "source": [ + "## Load the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FVU1CM0PKgO4" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")" + ], + "execution_count": 77, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "15clh4XrISpz" + }, + "source": [ + "## Load/Read the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UMPL46w2IWJw" + }, + "source": [ + "l_colunas = ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18']\n", + "\n", + "df_X = pd.DataFrame(X, columns = l_colunas)\n", + "df_y = pd.DataFrame(y, columns = ['target'])" + ], + 
"execution_count": 78, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFaQF2MGFl_M", + "outputId": "ae5defe5-050e-40ab-9910-005d7abdef8d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "df_X.head()" + ], + "execution_count": 79, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18
00.0684414.211842-2.5583023.665482-3.8351583.4998512.4908563.6654820.2451170.8671722.8655460.493956-5.1485962.8655463.499851-0.630619-0.978320-0.888270
1-4.8240210.179509-2.9844731.033618-3.8934263.428734-3.3346051.033618-0.882780-0.7532811.441522-1.395514-4.0028801.4415223.4287340.3399201.891538-6.109676
21.389530-0.2264761.8774002.7134264.6302570.516455-3.7430272.7134261.2840392.030797-1.0955361.560159-1.014211-1.0955360.516455-1.4778450.9605262.060204
31.1458092.2559460.2073644.6658172.2946786.5013060.9647704.6658170.1194103.1963541.8947873.519138-4.7578071.8947876.501306-3.7890290.5794911.397106
4-0.9366463.697163-3.3636173.805126-1.7544304.9543460.4066053.805126-0.8247381.3825911.665704-0.649758-3.5130361.6657044.9543460.2570520.904244-3.071354
\n", + "
" + ], + "text/plain": [ + " v1 v2 v3 ... v16 v17 v18\n", + "0 0.068441 4.211842 -2.558302 ... -0.630619 -0.978320 -0.888270\n", + "1 -4.824021 0.179509 -2.984473 ... 0.339920 1.891538 -6.109676\n", + "2 1.389530 -0.226476 1.877400 ... -1.477845 0.960526 2.060204\n", + "3 1.145809 2.255946 0.207364 ... -3.789029 0.579491 1.397106\n", + "4 -0.936646 3.697163 -3.363617 ... 0.257052 0.904244 -3.071354\n", + "\n", + "[5 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 79 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s-ibdD2ZG7tm", + "outputId": "4f22d899-818b-4def-b303-d6b41343bb12", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_X.shape" + ], + "execution_count": 80, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1000, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 80 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f9cqRaywa_TR", + "outputId": "1b6b47df-d61a-49cc-9a64-7ee2133adefd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "set(df_y['target'])" + ], + "execution_count": 81, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{0, 1}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 81 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BN6jbpn6Iwmu" + }, + "source": [ + "## Estatísticas Descritivas básicas do dataframe - df.describe()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KlwhxxUNIyYs", + "outputId": "2d17c864-a0cd-4845-f908-16329f2a2844", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 317 + } + }, + "source": [ + "df_X.describe()" + ], + "execution_count": 82, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18
count1000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.000000
mean-0.0851591.0342270.6574081.4053170.6872791.1315600.1080531.4053171.0070231.0488010.0792480.001650-0.3654380.0792481.131560-0.0277510.9846060.633624
std2.0022471.6315073.6087722.2568574.0195984.4818321.9813072.2568571.8632881.6439001.9492731.9326414.1606681.9492734.4818322.0654551.8505933.552991
min-6.944169-4.620754-16.300139-6.235192-12.454256-14.305401-6.152747-6.235192-5.484992-3.293216-7.135349-5.705500-9.120941-7.135349-14.305401-6.009023-5.035184-11.439074
25%-1.305566-0.089052-1.623657-0.152888-1.854645-1.684751-1.216983-0.152888-0.240908-0.012710-1.209675-1.292162-3.555363-1.209675-1.684751-1.436673-0.261610-1.691346
50%0.0525230.9941500.5738491.4499310.8123641.2815040.1670911.4499311.0661251.0128990.1803440.035237-0.9666380.1803441.281504-0.0001900.9757930.844784
75%1.3838532.0719953.0385862.8871413.4139524.0081031.4387192.8871412.2881882.1872021.4391991.3153422.7458061.4391994.0081031.3653692.2565043.109330
max4.9971727.35486011.7201658.49456612.84441815.9998036.2935508.4945668.1465596.5231806.2524485.53821611.2593506.25244815.9998036.5315617.64680212.090528
\n", + "
" + ], + "text/plain": [ + " v1 v2 ... v17 v18\n", + "count 1000.000000 1000.000000 ... 1000.000000 1000.000000\n", + "mean -0.085159 1.034227 ... 0.984606 0.633624\n", + "std 2.002247 1.631507 ... 1.850593 3.552991\n", + "min -6.944169 -4.620754 ... -5.035184 -11.439074\n", + "25% -1.305566 -0.089052 ... -0.261610 -1.691346\n", + "50% 0.052523 0.994150 ... 0.975793 0.844784\n", + "75% 1.383853 2.071995 ... 2.256504 3.109330\n", + "max 4.997172 7.354860 ... 7.646802 12.090528\n", + "\n", + "[8 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 82 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N_QhFqyZOKFB" + }, + "source": [ + "## Selecionar as amostras de treinamento e validação\n", + "\n", + "* Dividir os dados/amostras em:\n", + " * **Amostra de treinamento**: usado para treinar o modelo e otimizar os hiperparâmetros;\n", + " * **Amostra de teste**: usado para verificar se o modelo otimizado funciona em dados totalmente desconhecidos. É nesta amostra de teste que avaliamos a performance do modelo em termos de generalização (trabalhar com dados que não lhe foi apresentado);\n", + "* Geralmente usamos 70% da amostra para treinamento e 30% validação. 
Outras opções são usar os percentuais 80/20 ou 75/25 (default).\n", + "* Consulte [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) para mais detalhes.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8sKBgs-QOOfn" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": 83, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TPTKBBHgOpoA", + "outputId": "ddea49d2-f1e3-4175-9941-ba5d1ae64f58", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": 84, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 84 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lEn_LLs2OtRI", + "outputId": "7275edce-ac1f-4bcc-b149-30b6486eaad3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_treinamento.shape" + ], + "execution_count": 85, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 85 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_uAw8EcyOvrG", + "outputId": "35abf0d9-546e-4950-b538-cee3e57bd683", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_teste.shape" + ], + "execution_count": 86, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 86 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A2LYI-9hOyXI", + "outputId": 
"4b847a2d-4c78-4bc3-a0e2-a8a38bf7c1a7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_teste.shape" + ], + "execution_count": 87, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 87 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "npgoBSX2dd4l" + }, + "source": [ + "## Treinar o algoritmo com os dados de treinamento\n", + "### Carregar os algoritmos/libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcvzrtolGfnQ", + "outputId": "d1c6be41-6108-436b-d0eb-6c4f5f116cdf", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "!pip install graphviz\n", + "!pip install pydotplus" + ], + "execution_count": 88, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (0.10.1)\n", + "Requirement already satisfied: pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)\n", + "Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from pydotplus) (2.4.7)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_pF-HH3JKL2" + }, + "source": [ + "from sklearn.metrics import accuracy_score # para medir a acurácia do modelo preditivo\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix # para plotar a confusion matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV # para otimizar os parâmetros dos modelos preditivos\n", + "from sklearn.model_selection import cross_val_score # Para o CV (Cross-Validation)\n", + "from time import time\n", + "from operator import itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from sklearn.tree import 
export_graphviz\n", + "from sklearn.externals.six import StringIO \n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": 89, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ROlyvgij2yl" + }, + "source": [ + "Função para plotar a Confusion Matrix extraído de [Confusion Matrix Visualization](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klQ0FLOIgeX1" + }, + "source": [ + "def mostra_confusion_matrix(cf, \n", + " group_names = None, \n", + " categories = 'auto', \n", + " count = True, \n", + " percent = True, \n", + " cbar = True, \n", + " xyticks = False, \n", + " xyplotlabels = True, \n", + " sum_stats = True, \n", + " figsize = (8, 8), \n", + " cmap = 'Blues'):\n", + " '''\n", + " This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.\n", + " Arguments\n", + " ---------\n", + " cf: confusion matrix to be passed in\n", + " group_names: List of strings that represent the labels row by row to be shown in each square.\n", + " categories: List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'\n", + " count: If True, show the raw number in the confusion matrix. Default is True.\n", + " normalize: If True, show the proportions for each category. Default is True.\n", + " cbar: If True, show the color bar. The cbar values are based off the values in the confusion matrix.\n", + " Default is True.\n", + " xyticks: If True, show x and y ticks. Default is True.\n", + " xyplotlabels: If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.\n", + " sum_stats: If True, display summary statistics below the figure. Default is True.\n", + " figsize: Tuple representing the figure size. 
Default is (8, 8); pass None to use the matplotlib rcParams value.\n", + "    cmap: Colormap of the values displayed from matplotlib.pyplot.cm. Default is 'Blues'\n", + "          See http://matplotlib.org/examples/color/colormaps_reference.html\n", + "    '''\n", + "\n", + "    # CODE TO GENERATE TEXT INSIDE EACH SQUARE\n", + "    blanks = ['' for i in range(cf.size)]\n", + "\n", + "    if group_names and len(group_names)==cf.size:\n", + "        group_labels = [\"{}\\n\".format(value) for value in group_names]\n", + "    else:\n", + "        group_labels = blanks\n", + "\n", + "    if count:\n", + "        group_counts = [\"{0:0.0f}\\n\".format(value) for value in cf.flatten()]\n", + "    else:\n", + "        group_counts = blanks\n", + "\n", + "    if percent:\n", + "        group_percentages = [\"{0:.2%}\".format(value) for value in cf.flatten()/np.sum(cf)]\n", + "    else:\n", + "        group_percentages = blanks\n", + "\n", + "    box_labels = [f\"{v1}{v2}{v3}\".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]\n", + "    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])\n", + "\n", + "    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS\n", + "    if sum_stats:\n", + "        #Accuracy is sum of diagonal divided by total observations\n", + "        accuracy = np.trace(cf) / float(np.sum(cf))\n", + "\n", + "        #if it is a binary confusion matrix, show some more stats\n", + "        if len(cf)==2:\n", + "            #Metrics for Binary Confusion Matrices\n", + "            precision = cf[1,1] / sum(cf[:,1])\n", + "            recall = cf[1,1] / sum(cf[1,:])\n", + "            f1_score = 2*precision*recall / (precision + recall)\n", + "            stats_text = \"\\n\\nAccuracy={:0.3f}\\nPrecision={:0.3f}\\nRecall={:0.3f}\\nF1 Score={:0.3f}\".format(accuracy,precision,recall,f1_score)\n", + "        else:\n", + "            stats_text = \"\\n\\nAccuracy={:0.3f}\".format(accuracy)\n", + "    else:\n", + "        stats_text = \"\"\n", + "\n", + "    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS\n", + "    if figsize is None:\n", + "        #Get default figure size if not set\n", + "        figsize = 
plt.rcParams.get('figure.figsize')\n", + "\n", + "    if not xyticks:\n", + "        #Do not show categories if xyticks is False\n", + "        categories=False\n", + "\n", + "    # MAKE THE HEATMAP VISUALIZATION\n", + "    plt.figure(figsize=figsize)\n", + "    sns.heatmap(cf,annot=box_labels,fmt=\"\",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)\n", + "\n", + "    if xyplotlabels:\n", + "        plt.ylabel('True label')\n", + "        plt.xlabel('Predicted label' + stats_text)\n", + "    else:\n", + "        plt.xlabel(stats_text)" + ], + "execution_count": 90, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YJMS9ePQ6B6t" + }, + "source": [ + "**Note**: `min_samples_split = 2` is the scikit-learn default for DecisionTreeClassifier. Combined with `max_depth = None`, it lets the tree keep splitting until its leaves are pure, so the model tends to overfit; raising `min_samples_split` or limiting `max_depth` are common ways to reduce overfitting." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nNeRHYePJc-r" + }, + "source": [ + "from sklearn.tree import DecisionTreeClassifier # Library for Decision Tree (classification)\n", + "\n", + "# Instantiate the Decision Tree with its parameters written out explicitly (all set to the scikit-learn defaults, except random_state).\n", + "# min_impurity_split and presort are not passed: both were removed from recent scikit-learn releases.\n", + "ml_DT = DecisionTreeClassifier(criterion = 'gini',\n", + "                               splitter = 'best',\n", + "                               max_depth = None,\n", + "                               min_samples_split = 2,\n", + "                               min_samples_leaf = 1,\n", + "                               min_weight_fraction_leaf = 0.0,\n", + "                               max_features = None,\n", + "                               random_state = i_Seed,\n", + "                               max_leaf_nodes = None,\n", + "                               min_impurity_decrease = 0.0,\n", + "                               class_weight = None)" + ], + "execution_count": 91, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gVLZznprx2YX", + "outputId": "d07ef173-ad82-4397-ced7-a75d2233be38", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "# Configured object/classifier\n", + "ml_DT" + ], + "execution_count": 92, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, 
class_weight=None, criterion='gini',\n", + "                       max_depth=None, max_features=None, max_leaf_nodes=None,\n", + "                       min_impurity_decrease=0.0, min_impurity_split=None,\n", + "                       min_samples_leaf=1, min_samples_split=2,\n", + "                       min_weight_fraction_leaf=0.0, presort=False,\n", + "                       random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 92 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OgAHfXVo-Nw8", + "outputId": "9d1806d3-ea4f-47b3-da8c-eb0cbdd4caaf", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "# Train the algorithm: fit(X, y)\n", + "ml_DT.fit(X_treinamento, y_treinamento)" + ], + "execution_count": 93, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + "                       max_depth=None, max_features=None, max_leaf_nodes=None,\n", + "                       min_impurity_decrease=0.0, min_impurity_split=None,\n", + "                       min_samples_leaf=1, min_samples_split=2,\n", + "                       min_weight_fraction_leaf=0.0, presort=False,\n", + "                       random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 93 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ohmGCDpfyhvV", + "outputId": "d4556530-84b1-414f-fb81-f82b0a405628", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "i_CV" + ], + "execution_count": 94, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "10" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 94 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6exa9D8R2fDJ", + "outputId": "d080c556-c4f6-49e3-f938-9144c15c750b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-validation with k = i_CV = 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT, X_treinamento, y_treinamento, cv = i_CV)\n", + 
"\n", + "print(f'Mean CV accuracy (%)..............: {100*a_scores_CV.mean():.2f}')\n", + "print(f'Std. dev. of CV accuracies (%)....: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": 95, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Mean CV accuracy (%)..............: 91.43\n", + "Std. dev. of CV accuracies (%)....: 3.44\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uxoplcea0byV", + "outputId": "1d810e2d-c12e-4432-ee0b-3628d91afa5d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_scores_CV # array with the score from each CV fold" + ], + "execution_count": 96, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.9       , 0.98571429, 0.85714286, 0.92857143, 0.88571429,\n", + "       0.94285714, 0.92857143, 0.9       , 0.88571429, 0.92857143])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 96 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y3k-PcbN0o_i", + "outputId": "610bc0fc-ddfe-4a45-8621-3fc755a750c4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_scores_CV.mean()" + ], + "execution_count": 97, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9142857142857144" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 97 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6_rYker2gzeG" + }, + "source": [ + "**Interpretation**: Our classifier (DecisionTreeClassifier) has a mean cross-validated accuracy of 91.43% on the training set. The standard deviation across the folds is about 3.44%, which is small. Next, we try to improve the classifier's accuracy with hyperparameter tuning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tkwchmkP3p_A", + "outputId": "c2c8e0be-6ef2-433b-b940-13c91d6e48ec", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": 98, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Accuracies: [0.9        0.98571429 0.85714286 0.92857143 0.88571429 0.94285714\n", + " 0.92857143 0.9        0.88571429 0.92857143]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sI31WkZs2ht_" + }, + "source": [ + "# Use the classifier (Decision Tree) to make predictions on the test sample:\n", + "y_pred = ml_DT.predict(X_teste)" + ], + "execution_count": 99, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rfapj3OG13PG", + "outputId": "ceb65e06-5029-4aa8-cbe9-5225d66f4776", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y_pred[0:30]" + ], + "execution_count": 100, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,\n", + "       1, 0, 0, 1, 1, 0, 1, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 100 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sc88ofqh16RT" + }, + "source": [ + "# True labels of the test sample, for comparison with y_pred above\n", + "y_teste[0:30]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fSaVzJ9xFpwW", + "outputId": "d56ad709-c966-472d-8e11-bd73c4f977f6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + 
"execution_count": 101, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAccAAAIJCAYAAADQ9vbrAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOzdd3gVRdvH8e+dAqF3EEGkKlVpIlXpAiJFpAmKgiIiAooF1MfuK3YsFFFQEEERQbHQRBFB6RY6IkhvEnoIpMz7xznE5NDCmpDk5Pd5r3NxdnZ3ZjaPb+7cM7uz5pxDRERE/hWS1h0QERFJbxQcRUREAig4ioiIBFBwFBERCaDgKCIiEkDBUUREJEBYalSarVo/PR8iGd6Bpe+kdRdEUkREGJZadafG7/vjv76Tav1NLmWOIiIiAVIlcxQRkUzCgjPHCs6rEhER+Q+UOYqIiHeW5tODqUKZo4iISABljiIi4l2QzjkqOIqIiHcaVhUREckclDmKiIh3QTqsGpxXJSIi8h8ocxQREe+CdM5RwVFERLzTsKqIiEjmoMxRRES8C9JhVWWOIiKS4ZjZWDPba2arzrBvkJk5Myvo3zYze8vMNprZH2ZW/Xz1KziKiIh3FpLyn+T5EGhxWnfMLgOaA1sTFbcEyvk/vYGR56tcwVFERLwzS/lPMjjn5gORZ9j1BvAIkPglzG2B8c5nEZDXzIqeq34FRxERCQpm1hbY4Zz7PWBXMWBbou3t/rKz0g05IiLiXSo8ymFmvfENf54y2jk3+jznZAcewzek+p8pOIqISLriD4TnDIZnUAYoBfxuvqHZ4sAKM6sF7AAuS3RscX/ZWSk4ioiId+nkUQ7n3Eqg8KltM/sbqOmc+8fMpgP9zOwT4FrgkHNu17nq05yjiIhkOGY2CfgFuNLMtptZr3Mc/i2wCdgIvAf0PV/9yhxFRMS7NFo+zjnX9Tz7Syb67oD7LqR+BUcREfFOa6uKiIhkDsocRUTEu5D0cUNOSlPmKCIiEkCZo4iIeBekc44KjiIi4l06ec4xpQVnyBcREfkPlDmKiIh3QTqsGpxXJSIi8h8ocxQREe+CdM5RwVFERLzTsKqIiEjmoMxRRES8C9JhVWWOIiIiAZQ5ioiId0E656jgKCIi3mlYVUREJHNQ5igiIt4F6bBqcF6ViIjIf6DMUUREvNOco4iISOagzFFERLwL0jlHBUcREfEuSINjcF6ViIjIf6DMUUREvNMNOSIiIpmDMkcREfEuSOccFRxFRMQ7DauKiIhkDsocRUTEuyAdVg3OqxIREfkPlDmKiIh3QTrnqOAoIiKeWZAGRw2rioiIBFDmKCIinilzFBERySSUOYqIiHfBmTgqcxQREQmkzFFERDwL1jlHBUcREfEsWIOjhlVFREQCKHMUERHPlDmKiIhkEsocRUTEs2DNHBUcRUTEu+CMjRpWFRERCaTMUUREPAvWYVVljiIiIgGUOYqIiGfBmjkqOIqIiGfBGhw1rCoiIhJAmaOIiHimzFFERCSTUOYoIiLeBWfiqMxRREQyHjMba2Z7zWxVorJXzGydmf1hZtPMLG+ifUPMbKOZrTezG85Xv4KjiIh4ZmYp/kmmD4EWAWVzgMrOuauADcAQfx8rAl2ASv5zRphZ6LkqV3AUERHP0io4OufmA5EBZbOdc7H+zUVAcf/3tsAnzrkTzrnNwEag1rnqV3AUEZF0xcx6m9myRJ/eHqrpCczwfy8GbEu0b7u/7Kx0Q46IiHiWGo9yOOdGA6O9nm9mjwOxwMde61BwFBGRoGFmdwCtgSbOOecv3gFcluiw4v6ys9KwqoiIeGep8PHaFbMWwCNAG+dcVKJd04EuZpbVzEoB5YAl56pLmaOIiHiW
VivkmNkkoCFQ0My2A0/huzs1KzDH369Fzrk+zrnVZjYZWINvuPU+51zcuepXcBQRkQzHOdf1DMVjznH8C8ALya1fwVFERDzT2qoiIiKZhDJHERHxLFgzRwVHERHxLFiDo4ZVRUREAihzFBER74IzcVTmeCHy58nBok8Gs+iTwWye83/8Nev5hO3wsHMu8H7B1n3zDJNevSthu33Tqox+pnuKtgHQ79aGZIsIT9ie9va95MmZLcXbkfSpWpUKdLq5bcJnx47tZz22ds1qKdZurztuo82NN9CxfRt6dOvC35s3XXAd9/W5m8OHD3P48GE+nfTvKmF79+5h0MD+KdZXyZyUOV6AyEPHqN1lKACP39OKY1EnGPbR3IT9oaEhxMXFp1h71SpcRvnSl7Bu0+4UqzNQv26NmPTtUo5HxwDQ/v6RqdaWpD9Zs0YweeqXadL2iy+9SqXKVZgy+VNef/Vl3ho+6oLOHz7qPQB27NjOp59MonPXbgAULlyE14a9leL9lTPTnKOc0ehnuvPW412YP/4h/m9gOx6/pxUDb2uSsH/ZZ49Romh+ALq0uoafPnqIRZ8M5u3HuxAScu7/qN786Hse7XX6OzmzR2Rh1FPd+Omjh/hl0qO0blgFgGwR4Ux4qScrPn+cT1+7m/njH6J6xRK+uh7rzIKPH2H5lMd5ok8rAPp2vZ6ihfIwc/QAZo72/aW97ptnKJA3B8/1b8M9na5LaDPxdT1wexMWTHiYJZ8OSahLgkPUsWPc3bMHnW9pT4d2N/HD99+ddsy+fXu58/ZudLq5LTe3bc2K5csA+HnhAm67tTOdb2nPQw/0J+rYsWS1WaNmTbZt3YpzjtdffYmb27amQ7ubmDnj23O217JZYw4ciOTNN15j+7atdLq5La+/+hI7dmzn5ratAejetRMbN/6Z0FavO25j9aqVREVF8eQTQ7i18y106tDujNcpmZsyxxRQrHBeGt7xGvHxjsfvOXOwuLJUEW5pXp1Gd75ObGw8w4Z0okura5j49dmX9/t89gp6d2xA6csKJil/9K4bmLd0A32e+Zg8ObPx04SH+X7Renp3bMCBw1FU7/ACFcsUZfEngxPOefqdrzhwOIqQEGPGu/2pXO5SRkz6kf7dG9Oi95vsP5j0F9mUWSt45eEOvDt5PgAdmlejTd/hNKldnjIlClO/+yuYGVOG3UO96mVYuOIvrz8+SUMnTkTT6ea2AFxavDivvv4mb7w1nJw5c3LgQCS3de1Mw0ZNkmQH337zNXXr1efue+4lLi6O6OjjHDgQyXvvjuTd9z8ge/bsjH1/NOPHfUCfvv3O24cf5/1A2SuuYO6c2axft47Ppn7JwQMHuLXzLdSoWfOM7SU24IFBbPzzz4QMOPHQ8A0tWjF75gzK9ivHvn172bdvL5UqV+GtYa9T69raPPv8ixw+fJhuXTpybe26ZM+ePSV+rJlKsGaOCo4pYOp3vxIf7855TKNaV1K9YgkWTHgEgGxZw9kXefSc58TFx/PG+O94uGdzZi9ck1DepE4Fbry+CgNv92VyEVnCuKxoPupWK807E+cBsOavXaz8c2fCOR2aV6fnzfUICw3hkkK5qVC6KKsS7Q/0+/rtFMqXi6KF8lAwX04OHo5i+56D3HdrI5rWKc8if+DNmS0rZUsUVnDMoAKHVWNiYnhr2OusWL6UEAth79497P/nHwoWKpRwTOXKVXjqiceIjY2lUeOmlK9QgWVLf2DTXxu5o3vXhHquqlr1nG0PefQhIrJGcGmxYgx+7H98NO4DWrS6kdDQUAoULEiNa65h9cqVZ2wvuZq3aEmfu3vSt19/Zs+cQbPmvhfH//LzAub98D3jPxgLwMkTJ9i9axely5RJdt3io+AoZxV1/ETC99i4uCTDpRFZfDe7mBkTvlrMk29Pv6C6J36zhId7NmfNxl0JZQZ0feh9/tyyN1l1XH5pAQbe1oT63V/m4JHjjH6mO1mznP9/+qnf/Ur7plUpUiA3U2av
8F8HvDJ2NmM+X3hB1yEZw7dff8WBA5FMmjyV8PBwWjZrzImTJ5IcU6PmNYwdP4GffvyRJx8fzG097iRX7tzUrlOPl159PdltnZpzPJ8ztXdT23bJaqNIkSLkzZuXDevXMWvmDJ548mkAnIPXh71FyVKlk91fyVw055jCtuyMpGoF32vDqpYvTsliBQD4Ycl62jetSqF8OQHIlzs7JYrmO299sbHxvD3hB+7v1iih7Ltf1tK3y/UJ21dfWRyAX37bRIfm1QEoX/oSKpe9FIDcOSM4Fn2CQ0ejKZw/F83rVUw498ixE+TMHnHGtqfMWk7HG2rQvmk1ps75FYA5P6+lR9s65MiWBYBLC+VJuCbJ+I4ePUL+/AUIDw9nyeJF7Nx5+ivvdu7cQYECBenQsRPtO3Rk7ZrVXHV1VX77dQVbt2wBICoqir//3nxBbVerUZNZM2YQFxdHZGQkK5Yto3KVq87YXmI5cuQ45/zmDS1a8cHY9zly5AhXXFkegLr16jPx4wmcet3f2rVrznq+nEc6emVVSlLmmMK+mPsb3VrXYvmUx1m68u+E7G7dpt08M/xrvhrZjxAzYmLjeGDoZLbuOnDeOj/84hcG390iYfvF92byykMdWDr5MUJCjL937KfDgFG8O/kn3n/uNlZ8/jgbNu9hzaZdHDp6nL+27uP3ddv5fdr/2L77AIt++/e2+bFTFzJ9eF927TtEi95J7/Bbu2k3ObNHsHPvQXb/cxiAuYvWUb7UJcwb9xAAx46f4M7Hx7HvwLmHiCVjaNX6Jvrfdy8d2t1ExUqVKVX69Mxq2ZIlfPjBGMLCwsiePTvPv/gS+fPn59kXXmTwww9yMuYkAP3uH0jJkqWS3XaTps344/df6XhzW8yMgYMepmChQkz/Ytpp7SWWN28+qlarzs1tW1O/QYOEu1ZPadb8Bl4e+gK9+/RNKOvdpy8vD/0/bmnfhvj4eIoVL847I969kB+VBDn790XJKSdbtX4pX6mcV0iIER4WyomTsZQqXpBvR/XjqnbPERN7zteWyVkcWPpOWndBJEVEhKVePlbi/ukp/vt+69tt0jx/VOYYRLJHZGHmewMIDwvBMAa8OFmBUURSlW7IkVQxf/xDZAm4OabXE+NZvfHsd5KezdGoE9Tv9nJKdU3kPxvY/z52bk+66s6ABx+iXv0GadQjkeRRcExj193+6hnLRz3VjZbXVWZf5BFqdvy/JPsG3NaYoQ/eTPFGjyY8n9igRjleebgD4WGh7D94lOZ3vZnqfRc5n2FvDT/rvr83b+KRQQ8kbG/fvo2+/frT/fY7LkLPJKUoc5SL6qOvFjHq0x95/7nbk5QXL5KXJrUrsHVXZEJZnpzZePOxTrS9bwTbdh/Q3aOSIZQsVTrhGcu4uDiaNbqOxk2bpXGvRHz0KEc6tXDFX0Qeijqt/OWHOvD4m1+Q+Eaqzi1r8uXc39m223fnq+4clYxm8aJfuOyyy7j00mJp3RW5QGaW4p/0QJljBtK6YRV27j3Iyg1Jnz0rd3lhwsJCmfXeAHJmz8rwSfPOuSydSHozc8Y3tGjVOq27IV6kj1iW4hQcM4hsEeE80vMGWvc9/fGCsNAQqle4jJb3vE22iHDmjRvEkj/+ZuPW5K2gI5KWYk6e5McfvmfAwEFp3RWRBAqOGUTp4oW4vFgBlnw6BPAtdv7LxEdpcNsr7Nh7kP2HjhEVfZKo6JMsWLGRq64opuAoGcKCBfMpX7ESBQoWPP/Bku6kl2HQlKY5xwxi9cadXN5kCOVvfIryNz7Fjr0HqXPrS+zZf4Sv5v1B3aplCA0NIVtEONdULsm6zan3DkiRlDTj229o2erGtO6GSBLKHNOpcS/eQYMa5SiYNycbZz7Hc6O+ZdwXv5zx2PWb9zDn5zUsnTyE+HjHh9N+Zs1fu854rEh6EhUVxaKff+Z/Tz2b1l0Rj4I1c9TycSJnoeXjJFik5vJxZQbNSPHf93+91jLNI64yRxER8SxIE0cFRxER
8S5Yh1V1Q46IiEgAZY5pJCTEWPjxI+zce4gOA0bRsNYV/N/A9oSEGMeiTnD3Ux+xads/p51XudylvPNEV3LliCA+3lG/+8uEh4Xy3dh/16gsVjgvn3y7lIdf/Zx7u1xPrw712Lb7AJ0eGE1MbBx1q5amXZOqPPLa1It5yRLEdu/axeNDHiFy/34w45aOneh2W48kx/zw/XcMf/tNQiyE0LBQHn70MarXqMnOnTt4oH8/XHw8MbGxdO3WnU6du3Ly5EkG9LuXPXv20LlL14T3ND771P/o2LkLFSpWSotLlQBBmjgqOKaVfrc2Yv3mPeTKEQHAW491oeMD77J+8x56d2zA4Lta0PupCUnOCQ0NYezzPej1v/Gs3LCD/HlyEBMbx4mTsdTuMjThuIUfP8IX3/8GQJeWNbmm04s80qs5zepW4Nv5qxh8d0t6DPng4l2sBL3QsFAeemQwFSpW4tixo3Tp2IHadepRpmzZhGOuvbYODRs1wczYsH4dDw8ayJdfz6RQwUJ8NPFTsmTJQtSxY3RodxMNGzVmzapVVKteg7t696FHd19wXL9uHXHxcQqMkuo0rJoGihXOS4v6lfhg2s8JZc45cvsDZe5c2di179Bp5zWtU55Vf+5IWD4u8tAx4uOT3ihWtkRhCufPxcIVfwG++YDwsFCyR2QhJjaOrjdew+yFqzlw+PR1W0W8KlSocELAypEjJ6VLl2bv3j1JjsmeI0fC/NTx48cTvodnyUKWLFkAOBlzkvj4eADCwsOIjo4mNjY2YS3h4W8P4777B1yUa5Lk0dqqkmJeedi3eHjO7BEJZX2fnci0t/sSfeIkh49Fc/3tr512XrkShXEOpg+/j4L5cjJl1nJeH/ddkmM6tqjOlNkrErZHfvojP44fxNq/dvHLb5v47I3e3HTf2V8jJPJf7dixnXVr11LlqqtP2zf3uzm8New1IvdH8s7IdxPKd+/aRb++vdm2dSsPDHqEwoWLkD9/Ab6ePp3uXTtxx529mPf9XCpUrEThwkUu5uXIeaSTWJbiFBwvspYNKrM38gi/rt1GgxrlEsrv79aI9vePYOmqLTxwexNeGnQzfZ+dmOTcsNBQ6lYrTf3urxAVfZIZ7/ZnxdqtzFuyIeGYjjfUoNcT4xO2J32zlEnfLAVgSO8WjJj0IzfUq0S31rXYvvsAj74+jdR41lUyp6hjxxg0sD8PD36MnDlPf3Vak6bNaNK0GcuXLWX4228yesyHAFxStChTpn3F3r17GHj/fTRrfgMFChZk6Cu+PxJjYmK4t3cv3nxnBK+89CK7d+3ipjZtadi4ycW8PMlENKx6kdWpWprW11dh3TfPMH7onTS85gqmvtWHKlcUY+mqLQBMmb2C2leXOu3cHXsPsmDFX+w/eIzj0THMXLCaauUvS9hf5YpihIWG8uvabaedW7RQHmpWKslX8/5gwG2N6f7oWA4eOU6jWlem3sVKphITE8ODA/vT6sabaNqs+TmPrVHzGrZv38aBA5FJygsXLkLZcuVYsXxZkvLJn0zkpjbt+OP338mVKxcvv/YG48dp3jw9CAmxFP+kBwqOF9mTb0+nbIv/Uf7Gp7h98AfMW7qBjg+MJnfObJQtURiAxrXLs37zntPOnfPzGiqVvZRsEeGEhobQoEZZ1m76dw3VTi1qMHnmstPOA3iy7408N/JrALJlDcc5iHeO7NnCU+EqJbNxzvH0k49TunRpbr/jzjMes3XLloRRirVrVnPy5Eny5s3Hnt27iY6OBuDwoUP8umIFJUv9+8fh4UOHmP/jPG5q247o6OMJ81KnzhFJDRpWTQfi4uK577mJTHr1LuJdPAcPH+eep313qt54fRWqVyzBcyO/4eCR47w14XsWTHgE5xyzFqxm5oLVCfV0aFaddvePPK3+q68sDsBv67YD8OmMZSz77DG27z7A6x9+d9rxIhfq1xXL+Xr6l5S74go63dwWgPsHPsiuXTsB6NS5K9/NmcVX078kPCyMrBER
vPzqG5gZmzb9xWuvDMUwHI4ed/Sk3BX/jmi8O3I4d/XuQ0hICHXrNeCTSRPp0O4mOnbukibXKkkF65yj1lYVOQutrSrBIjXXVq38xJwU/32/6vlmaR5yNawqIiISQMOqIiLiWbAOqypzFBERCaDMUUREPEsvK9qkNGWOIiIiAZQ5ioiIZ8GaOSo4ioiIZ0EaGzWsKiIiEkiZo4iIeBasw6rKHEVERAIocxQREc+CNHFUcBQREe80rCoiIpJJKHMUERHPgjRxVOYoIiISSMFRREQ8M7MU/ySz3bFmttfMViUqy29mc8zsT/+/+fzlZmZvmdlGM/vDzKqfr34FRxER8cws5T/J9CHQIqBsMDDXOVcOmOvfBmgJlPN/egMjz1e5gqOIiGQ4zrn5QGRAcVtgnP/7OKBdovLxzmcRkNfMip6rft2QIyIinqWzRzmKOOd2+b/vBor4vxcDtiU6bru/bBdnocxRRETSFTPrbWbLEn16X2gdzjkHOK99UOYoIiKepUbi6JwbDYz2cOoeMyvqnNvlHzbd6y/fAVyW6Lji/rKzUuYoIiLBYjrQw/+9B/BlovLb/Xet1gYOJRp+PSNljiIi4llazTma2SSgIVDQzLYDTwFDgclm1gvYAnTyH/4t0ArYCEQBd56vfgVHERHxLK3ux3HOdT3LriZnONYB911I/RpWFRERCaDMUUREPEtnj3KkGGWOIiIiAZQ5ioiIZ0GaOCo4ioiIdxpWFRERySSUOYqIiGfKHEVERDIJZY4iIuJZkCaOCo4iIuKdhlVFREQyCWWOIiLiWZAmjsocRUREAilzFBERz4J1zlHBUUREPAvS2KhhVRERkUDKHEVExLOQIE0dlTmKiIgEUOYoIiKeBWniqMxRREQkkDJHERHxTI9yiIiIBAgJztioYVUREZFAyhxFRMSzYB1WVeYoIiISQJmjiIh4FqSJo4KjiIh4ZwRndNSwqoiISABljiIi4pke5RAREckklDmKiIhnwfooh4KjiIh4FqSxUcOqIiIigZQ5ioiIZ3rZsYiISCahzFFERDwL0sRRmaOIiEggZY4iIuKZHuUQEREJEKSxUcOqIiIigZQ5ioiIZ3qUQ0REJJNQ5igiIp4FZ96o4CgiIv9BsN6tqmFVERGRAMocRUTEs2B92fFZg6OZvQ24s+13zvVPlR6JiIiksXNljssuWi9ERCRDCtY5x7MGR+fcuMTbZpbdOReV+l0SEZGMIkhj4/lvyDGzOma2Bljn377azEakes9ERETSSHLuVh0G3ADsB3DO/Q5cl5qdEhGRjMHMUvyTHiTrUQ7n3LaAorhU6IuIiEi6kJxHObaZWV3AmVk4MABYm7rdEhGRjCBYH+VITubYB7gPKAbsBKr6t0VERILSeTNH59w/QLeL0BcREclg0mqO0MweAO7C9zz+SuBOoCjwCVAAWA7c5pw76aX+5NytWtrMvjKzfWa218y+NLPSXhoTEZHgYqnwOW+bZsWA/kBN51xlIBToArwEvOGcKwscAHp5va7kDKtOBCbji8iXAp8Bk7w2KCIikgLCgGxmFgZkB3YBjYEp/v3jgHZeK09OcMzunPvIORfr/0wAIrw2KCIiwSPELMU/ZtbbzJYl+vRO3KZzbgfwKrAVX1A8hG8Y9aBzLtZ/2HZ898p4cq61VfP7v84ws8H4xnEd0Bn41muDIiIi5+KcGw2MPtt+M8sHtAVKAQfxjWi2SMk+nOuGnOX4guGpIeB7Eu1zwJCU7IiIiGQ8aXQ/TlNgs3Nun68PNhWoB+Q1szB/9lgc2OG1gXOtrVrKa6UiIpI5pNHdqluB2maWHTgONMH3sowfgFvwjXT2AL702kCy3udoZpWBiiSaa3TOjffaqIiIiFfOucVmNgVYAcQCv+Ibhv0G+MTMnveXjfHaxnmDo5k9BTTEFxy/BVoCCwAFRxGRTC6tlkJ1zj0FPBVQvAmolRL1J+du1Vvwpay7nXN3AlcDeVKicRERkfQoOcOqx51z
8WYWa2a5gb3AZancLxERyQBC0slbNFJacoLjMjPLC7yH7w7Wo8AvqdorERHJEII0NiZrbdW+/q+jzGwmkNs590fqdktERCTtnGsRgOrn2uecW5E6XRIRkYwivbycOKWdK3N87Rz7HL417M5o7y9vee6QSHqRr8XQtO6CSIo4/t3gtO5ChnOuRQAaXcyOiIhIxpOcRx4yomC9LhEREc+StUKOiIjImWTGOUcREZFzCgnO2Hj+YVXz6W5mT/q3S5hZiizPIyIikh4lZ85xBFAH6OrfPgIMT7UeiYhIhhFiKf9JD5IzrHqtc666mf0K4Jw7YGZZUrlfIiIiaSY5wTHGzELxPduImRUC4lO1VyIikiFk5hty3gKmAYXN7AV8b+l4IlV7JSIiGUJ6GQZNaclZW/VjM1uO77VVBrRzzq1N9Z6JiIikkeS87LgEEAV8lbjMObc1NTsmIiLpX5COqiZrWPUbfPONBkQApYD1QKVU7JeIiEiaSc6wapXE2/63dfQ9y+EiIpKJZOaXHSfhnFthZtemRmdERCRjCdYFupMz5/hgos0QoDqwM9V6JCIiksaSkznmSvQ9Ft8c5Oep0x0REclIgnRU9dzB0f/wfy7n3EMXqT8iIiJp7qzB0czCnHOxZlbvYnZIREQyjsx4Q84SfPOLv5nZdOAz4Nipnc65qancNxERkTSRnDnHCGA/0Jh/n3d0gIKjiEgmF6SJ4zmDY2H/naqr+DconuJStVciIpIhZMa1VUOBnCQNiqcoOIqISNA6V3Dc5Zx79qL1REREMpxgvSHnXIsbBOcVi4iInMe5MscmF60XIiKSIQVp4nj24Oici7yYHRERkYwnWG/ICdY1Y0VERDy74LdyiIiInGJBenuKMkcREZEAyhxFRMSzYJ1zVHAUERHPgjU4alhVREQkgDJHERHxzIL0QUdljiIiIgGUOYqIiGeacxQREckklDmKiIhnQTrlqOAoIiLeZcZXVomIiGRKyhxFRMQz3ZAjIiKSSShzFBERz4J0ylHBUUREvAvRK6tEREQyB2WOIiLiWbAOqypzFBERCaDMUUREPNOjHCIiIgFCzFL8kxxmltfMppjZOjNba2Z1zCy/mc0xsz/9/+bzfF1eTxQREUlDbwIznXPlgauBtcBgYK5zrhww17/tiYKjiIh4Zpbyn/O3aXmA64AxAM65k865g0BbYJz/sHFAO6/XpeAoIiIZTSlgHwYQdmQAACAASURBVPCBmf1qZu+bWQ6giHNul/+Y3UARrw0oOIqIiGepMedoZr3NbFmiT++AZsOA6sBI51w14BgBQ6jOOQc4r9elu1VFRCRdcc6NBkaf45DtwHbn3GL/9hR8wXGPmRV1zu0ys6LAXq99UOYoIiKepcWco3NuN7DNzK70FzUB1gDTgR7+sh7Al16vS5mjiIh4loYZ1v3Ax2aWBdgE3OnvzmQz6wVsATp5rVzBUUREMhzn3G9AzTPsapIS9Ss4ioiIZxaki6tqzlFERCSAMkcREfEsOPNGBUcREfkPkrsWakajYVUREZEAyhxFRMSz4MwblTmKiIicRpmjiIh4FqRTjgqOIiLinZ5zFBERySSUOYqIiGfBmmEF63WJiIh4psxRREQ805yjiIhIJqHMUUREPAvOvFHBUURE/gMNq4qIiGQSyhxFRMSzYM2wgvW6REREPFPmKCIingXrnKOCo4iIeBacoVHDqiIiIqdR5igiIp4F6aiqMkcREZFAyhxFRMSzkCCddVRwFBERzzSsKiIikkkocxQREc8sSIdVlTmKiIgEUOYoIiKeBeuco4KjiIh4Fqx3q2pYVUREJIAyRxER8SxYh1WVOYqIiARQ5igiIp4pcxQREckklDkmU61qlShb7oqE7VffeIdLixU747ENatfgp0XLU6Td3r1u53hUFB9NmgLAmtWrGPb6y4weMz5F6j/lqy+nUbtOPQoVLgzAc08/Qbfb7qB0mbIp2o6kP/lzR/Dty10BKJI/B/Hxjn0H
owBo0G8cMbHxKdbWugn3cuT4CZyDPZHHuOulr9lz4NgF1fHDm91pNGACJYrkoU6lYnz6/RoAql9xCd2aVWbQ8O9SrL9yfsG6CICCYzJlzRrBxMnT0qTtyMhIFi6YT73616VaG19Nn0aZsuUSguP/nn4+1dqS9CXycDS1+3wAwOO31+fY8ZMM+2xJwv7QECMu3qVYey0GTWL/4eM80/M6Hrm1zgUHs0YDJgBw+SV56NS4YkJwXLFhNys27E6xfkryhARnbFRw9Coq6hiDBvTj8OFDxMbGcm+/ATRs1CTJMf/s28uQRx7k2LFjxMbGMuSJp6hWvSaLfl7IuyPf5uTJkxS/rARPPfsC2bPnOGtbt/Xoydj33j0tOMbFxfHOm6+zfNkSTp48ScfOt9KhY2fi4+N5+cXnWLpkMUUuuYSwsDDatOtA02Y38N6o4fw0fx7R0dFcXbUaj/3vGeZ+N5u1q1fzxJCHiYiIYOz4SfS/rzcDH3yENatXsWP7NgY8+DDgyzDXrF7Fo4/9j2+/ns4nEycQGxtDpcpXMfjxJwkNDU35H7ZcdKMfvpHok7FULVuEX1Zv53DUySRBc9l7vbj5iSls3XOILk0qcV/7GoSHhbJ03U4GvDWb+GQE0wUrt9G3XU2yhofy1oAbqH7lJcTGOR4dOZf5v2+lwuUFGf1wK8LDQgkJMbo+M42/dhxg31cPUuim13n+ruu5skQBFo26k4/nrOK3jXsY2LEWt/xvCms/updr7xnLoWMnAFj5YW+aDJxAvHO8PbAFlxXODcDDI77jl9U7Uu8HKRmW5hyT6cSJaG7t1J5bO7XnoYH9yJIlK6+88TYffzqVd98fx7DXXsa5pL8QZn77DbXr1mfi5GlM+uwLrriyAgcPHGDMeyMZ8e5YPv50KhUqVuLj8R+es+2rrq5KeHg4y5YsTlL+5bTPyZEzJ+Mnfsb4iZ/xxdTP2LF9O9/PncPOnTv4bNrXPPvCS6z8/feEczp17cb4iZ8xeepXREdH89OP82ja7AYqVKrE8y++wsTJ04iIiEg4vknT5vzw/b9/2c+ZNYMbWrRi86a/mDNrBmPHfczEydMIDQ1hxrdf/YefsKQ3xQrlouGAj3h01PdnPebKEgW4pWEFGg2YQO0+HxAX7+jSpFKy6m91bVlWb95Hn7Y1cMA1d4+lxwtf8v6jN5I1PJS7b6rG8KnLqN3nA+r1/ZAd+44kOf+J939k4crt1O7zAW9/vjSh3Dn4+uc/aVPfNw1yTfmibN17mL0Ho3j1vqa8/flS6t83jq7PTGPEgy0v/AcjSVgq/F96oMwxmQKHVWNjYhj+1hv8umIZISEh7Nu7h/37/6FgwUIJx1SsXJlnn3qC2NhYGjZqwpXlK/DTsiVs2vQXve7oBkBMTAxVrrr6vO33ursPY94bxf0DByWULfplIRs3rOf772YDcPTIEbZt3cLvvy6nabMWhISEULBgIWpeUyvhnGVLlzD+gzFERx/n8KFDlClTjusaNjpru/ny56dYseKs/OM3LitxOX9v3sTV1aoz+ZOJrF27mtu7dQIgOjqafPkLJPOnKRnB1B/XnTcDbFTtcqqXK8KC4T0AyJY1LGG+8mxmvtaVuDjHqs37ePqD+Yx++EZGfOGbo9+wLZKtew5Trnh+Fq/ZwSO31qFYoVx8sWADf+04kOy+T5m3liG31eOjWSvp2KgiU+at9fe3JOVLFEw4LneOrOSICOdYdEyy65bMQcHRoxnffs3BA5FMmDSFsPBwbmrZhJMnTiY5pnqNa3hv7Ecs+Gkezzz5GLfe1oPcufJwbe26/N9Lr11Qe9dcW5uRw99k5R//ZoHOOR4e/AR16tVPcuzCBT+esY4TJ07w0gvPMn7SZ1xySVHeHfkOJ06eOG/bzVu0Ys6smZQsVZqGjZtiZjjnaH1TO/oNePCCrkMyjqhEASM2Lp6QRPfsR2Tx/eowMybMWcWTY87839yZnJpzPJ9Pv1/DkrU7aXltGb54oSP9
hs3ix9+2JKuNRWt2UObSfBTMk42b6pZj6McLAQgJMa6/fzwnYuKS3V85Nz3KIUkcPXqEfPkLEOYf7ty1c+dpx+zauYP8BQrQvkMn2ra/hfVr11Dlqqv5/bdf2bbV9//kx6Oi2PL35mS12fPuPoz/cEzCdp269Zny2SfExvh+iW35ezPHo6K4ump1vv9uNvHx8ezf/w/Ll/mGnE6e8AXCvHnzERV1jLlzZiXUlSN7DqKOnfmuwUZNmvLjvO+ZNeMbmrdoBUCta2sz97tZRO7fD8ChQwfZtVNzN8Fqy+5DVC1XBICqZYtQ8pI8APyw4m/aN7iSQnmzA5AvVwQl/PN5ybVw5Ta6NKkIQNli+biscG42bI+kZNE8bN51kBFfLOfrn/+kSulCSc47GnWCXNmynLXe6Qs38FKfJqzbup/Iw9EAzF2+mb7tayQcc1WZwhfUVzmdhlUliZatbuKB/vfSuUMbKlasTMlSpU87ZvmypYz/cAxhYeFkz56dZ54fSr78+Xn62f/j8cEPcfKkL9O8t98ALi9Z6rxt1m9wPfny5UvYbnfzLezauYNuXTrgnCNfvvy8NuwdGjdtzpLFi+jYvjVFLrmE8hUqkDNnTnLlzk27DrfQuUMbChQsSKVKVRLqat22Pf/3/NMJN+Qkljt3HkqVKs3mTX9RucpVAJQuU5Z77xtAv3vvIj4+nrCwMB597H8UvfTMj7dIxvbFT+vp1qwyy9/vxdJ1u/hzeyQA67bu55kP5/PV0M6EhBgxsfE88PZstu49nOy6352+grcG3MDS93oSG+e4++VvOBkTxy3XV6Br00rExMaz58AxXp70S5LzVm7aR1y8Y/G7PZkweyW/bdyTZP+UeWtZOOIO7nrp64SyQe98x7D+zVkyuidhoSEs+GMb/d+chUggC7yJJCUciU7B+77Fk6ioY2TPnoODBw/Qo1tnxoz7OMl8qJxf4dYvp3UXRFLE8e8Gp1o6Nn9DZIr/vr/uivxpnj4qcwxSA++/l6NHjhATE8Ndve9VYBQRuQAKjunEQwP7sTNgzu7+AYNOu9kmuVJ6BR2RCzH/7dvJEp70mddeL33N6s370qhHklrSyxxhSlNwTCdeHfZOWndBJMVcd7/+OMssgvVuVQXHDOCZJx9nwfx55Mufn8lTfQ/afzd7JqNHvsPmzZsY9/FkKlaqnMa9FDmzUQ+1ouW1Zdh3MIqad/vutn7yjga0rlsuYR3X3q98w679RwFocHUJXrm3CeFhIew/dJzmgyamZfclk9KjHBnATW3b8fbI0UnKypQtx8tvvE21GjXTqFciyfPRrJW0HTI5SdkbkxdTq/dYavf5gBmLNjKkez0A8uTIypv9m9Pxyc+pcdcYuj33RVp0WS6ApcInPVDmmAFUr3ENO3cknY8sVbpMGvVG5MIsXLmNEkXyJCk7EvXvghnZs4Xj8N3w2LlJRb5csJ5t/kdBzrfajkhqUXAUkTTx9J3X0a1ZZQ4dO0GLh3xDp+WK5ScsLIRZr91KzmxZGD5tGRPnrErjnsq5hATppKOGVUUkTTz9wXzK3TqCT75fTZ+2vlVrwkJDqH7FJbR//DPaDP6UId3qUrZYvvPUJJLyFBxFJE19OncN7RpcCcCOf44wZ+lmoqJj2H/4OAtWbtMSb+lcsM45KjiKyEVXJlE22LpuOTZs863R+9XPf1K3cnFCQ4xsWcO4pvylrNu6P626KcmRhtHRzELN7Fcz+9q/XcrMFpvZRjP71MzOvvjueWjOMQN47NFBLF+2hIMHD9KqWUN639uPPHny8MrQFzhwIJKB/fpwxZXleWfU+2ndVZHTjHusDQ2uLkHBPNnYOKkvz41bQItry1CueH7inWPrnsP0HzYTgPVb9zNn2SaWvteL+HjHhzN+Z83f/6TxFUg6NgBYC5xa7f4l4A3n3CdmNgroBYz0UrHWVhU5C62tKsEiNddWXfzXoRT/fX9tmTzn7a+ZFQfGAS8ADwI3AfuAS5xzsWZWB3ja
OXeDlz5oWFVERDKiYcAjQLx/uwBw0DkX69/eDnh+TZCCo4iIeGaWGh/rbWbLEn16J23TWgN7nXPLU+u6NOeYhk6cOMHdd95GTMxJ4mJjadLsBu7pe/9px82ZNYPRo4ZjQLkry/PC0FdZtmQxr786NOGYvzdv4v9eeo2GjZvyxJCH2fjnBhpc15D7+j8AwPujR1K2bDkaNm56sS5PMomQEGPhiDvY+c8ROjwxhdEP30iDqy7j0DHfy7V7v/INf/y1N8k5JQrn5pNnbibEjPCwEEZ+sZz3v/4NgC9f7MQl+XMSFmosXLmdgW/PJj7e8fxdDWleqzR//LU34R2NXZpUomCebLwzddnFvWhJkBrjtc650cDocxxSD2hjZq2ACHxzjm8Cec0szJ89Fgc8v4FdwTENZcmShVHvf0D27DmIjYmh1x3dqVu/AVWuqppwzNYtf/PBmPcYM+5jcufOQ+R+3517NWtdy8TJ0wA4dOgg7Vu3oHadevy5YT1Zs2blkylf0veenhw9coTo6OOsXvkHd/W+N02uU4Jbv/Y1Wb/1H3Jlz5pQ9tjoH5j20/qznrMr8igN+3/EyZg4ckSEs/z9u/jml43s2n+U7s99kbCCzqSn2tPhuvLMWrqJquWKUKv3WEY82JJKpQrx144D3H5DFdoELE0nwc85NwQYAmBmDYGHnHPdzOwz4BbgE6AH8KXXNjSsmobMjOzZcwAQGxtLbGzMaa9/mTb1Mzp16Uru3L7lt/IXKHBaPXPnzKZu/QZEZMtGWFgYJ06cID4+ntjYWEJCQxg14m3u6dsv9S9IMp1iBXPR4toyfPDtHxd0XkxsPCdj4gDImiWUkES/iU4FxrDQEMLDQnE44uMd4WG+V2BljwgjJjaOgR1rMfKL5cTGxZ9Wv1xE6etBx0eBB81sI745yDFeK1JwTGNxcXHc2qk9zRrV59radal81dVJ9m/dsoUtW/6mZ49buaN7Z35e+NNpdcye+S03tGgF+NZczZcvH927dOC66xqxbetW4uPjKV+h0kW5HslcXunbhMff+4H4gLven+55HUtG9+Tle5uc9l7HU4oXysWS0T35c+J9vPbJ4oS3cgBMH9qJrVP6c/T4CabOX8/R4yeZtfgvFo26k937j3H42AmuqXApX/38Z6pen6R/zrl5zrnW/u+bnHO1nHNlnXMdnXMnvNarYdU0FhoaysTJ0zhy+DAPPXA/G//cQNlyVyTsj4uNZduWLYx+fxx79uyhd8/b+GTKl+TK7Xus5599e9m4cQN16v77UuRBjzyW8P2B++/lsf89w5j3RvHnhvVcW7sO7Tt0ungXKEGr5bVl2Hswil//3EODq0sklD85Zh67I4+RJTyU4Q+0YFDn2rw4YeFp52/fd4RavcdStEBOJj9zM9Pmr2Ovf6HxNoMnkzU8lA8fa0PDqpfz/Yq/eX3yYl6fvBiAEQ+25LkPf+KOllfRtGYpVm7ax0sf/3xxLlySCNaXHStzTCdy5c5NzWtq8cvPC5KUFy5yCdc1bExYeDjFihenxOUl2bp1S8L+ObNn0qhxU8LCw0+rc94PcylfsRJRUcfYvm0bQ195g7lzZhN9/HiqX48EvzqVi9O6TlnWTbiX8Y/7gtjYwa3ZHXkMgJMxcYyftZKa5Yues55d+4+y+u9/qFflsiTlJ2Li+OrnP7mpbrkk5VeXLYIZbNgeyc3Xl6f7c19SumjeJKvuyMWTGnerpgcKjmnoQGQkRw77Xs0THR3N4kW/ULJkqSTHNGzchOXLlgBw8MABtm75m2LFiyfsnzXjG25oceNpdcfGxDBpwnh63NGLEydOJPwHFx8fR0xMTCpdkWQmT475kbJdR1C++0huf2E6837bQs+hX3NJ/hwJx7SpW441f+877dxiBXMRkcU3cJU3Z1bqVi7Ohu2R5IgITzg/NMRoeW0Z1m9Lunzck3c04NkPfyI8NIRQ/2RlvHNkz3r6H4giXmlYNQ39888+nnpiCPHxccTHx9Os
eQsaXN+IUcPfokKlylzfsDF16tZn0c8L6di+NSEhIfR/4CHy5vX9hbxzxw727N5N9ZrXnFb35E8n0rpNOyKyZaPcFVcSHR1N5w5tqFf/uoQhWZHU8MGQNhTMmw3D+OOvPdw/bBYA1a+4hLtaV6Pv6zO4skQBhvZpjHO+TGHYZ4tZvXkfhfNmZ8pzt5AlPJQQM+b/vpX3vvo1oe6b6pZjxYbdCfOTf2zcw9L3erJq0z5Wbtp7xv5I6koniV6K0/JxImeh5eMkWKTm8nEr/j6c4r/vq5fMneYxV5mjiIh4l+ZhLHVozlFERCSAMkcREfEsWB/lUHAUERHP0sujFylNw6oiIiIBlDmKiIhnQZo4KnMUEREJpMxRRES8C9LUUcFRREQ8C9a7VTWsKiIiEkCZo4iIeKZHOURERDIJZY4iIuJZkCaOCo4iIvIfBGl01LCqiIhIAGWOIiLimR7lEBERySSUOYqIiGd6lENERCSTUOYoIiKeBWniqOAoIiL/QZBGRw2rioiIBFDmKCIinulRDhERkUxCmaOIiHgWrI9yKDiKiIhnQRobNawqIiISSJmjiIh4F6SpozJHERGRAMocRUTEs2B9lEPBUUREPAvWu1U1rCoiIhJAmaOIiHgWpImjMkcREZFAyhxFRMS7IE0dlTmKiIgEUOYoIiKe6VEOERGRAHqUQ0REJJNQ5igiIp4FaeKozFFERCSQMkcREfEuSFNHBUcREfEsWO9W1bCqiIhIAGWOIiLimR7lEBERySSUOYqIiGdBmjgqOIqIiHcaVhUREUkHzOwyM/vBzNaY2WozG+Avz29mc8zsT/+/+by2oeAoIiL/gaXC57xigUHOuYpAbeA+M6sIDAbmOufKAXP9254oOIqISIbinNvlnFvh/34EWAsUA9oC4/yHjQPaeW1Dc44iIuJZWs85mllJoBqwGCjinNvl37UbKOK1XmWOIiKSrphZbzNblujT+yzH5QQ+BwY65w4n3uecc4Dz2gdljiIi4llqJI7OudHA6HO2axaOLzB+7Jyb6i/eY2ZFnXO7zKwosNdrH5Q5ioiIZ2Yp/zl/m2bAGGCtc+71RLumAz3833sAX3q9LmWOIiKS0dQDbgNWmtlv/rLHgKHAZDPrBWwBOnltQMFRREQ8S4u3cjjnFnD2Ed0mKdGGhlVFREQCKHMUERHvgnT5OAVHERHxLEhjo4ZVRUREAilzFBERz9J6hZzUosxRREQkgDJHERHxLC0e5bgYFBxFRMS74IyNGlYVEREJpMxRREQ8C9LEUZmjiIhIIGWOIiLimR7lEBERySSUOYqIiGd6lENERCSAhlVFREQyCQVHERGRAAqOIiIiATTnKCIingXrnKOCo4iIeBasd6tqWFVERCSAMkcREfEsWIdVlTmKiIgEUOYoIiKeBWniqOAoIiL/QZBGRw2rioiIBFDmKCIinulRDhERkUxCmaOIiHimRzlEREQyCWWOIiLiWZAmjgqOIiLyHwRpdNSwqoiISABljiIi4pke5RAREckklDmKiIhnwfoohznn0roPIiIi6YqGVUVERAIoOIqIiARQcBQREQmg4CjpipnFmdlvZrbKzD4zs+z/oa4PzewW//f3zaziOY5taGZ1PbTxt5kVTG55wDFHL7Ctp83soQvto4hcOAVHSW+OO+eqOucqAyeBPol3mpmnO6ydc3c559ac45CGwAUHRxEJTgqOkp79BJT1Z3U/mdl0YI2ZhZrZK2a21Mz+MLN7AMznHTNbb2bfAYVPVWRm88yspv97CzNbYWa/m9lcMyuJLwg/4M9aG5hZITP73N/GUjOr5z+3gJnNNrPVZvY+yVg8y8y+MLPl/nN6B+x7w18+18wK+cvKmNlM/zk/mVn5lPhhikjy6TlHSZf8GWJLYKa/qDpQ2Tm32R9gDjnnrjGzrMBCM5sNVAOuBCoCRYA1wNiAegsB7wHX+evK75yLNLNRwFHn3Kv+4yYCbzjnFphZCWAWUAF4CljgnHvW
zG4EeiXjcnr628gGLDWzz51z+4EcwDLn3ANm9qS/7n7AaKCPc+5PM7sWGAE09vBjFBGPFBwlvclmZr/5v/8EjME33LnEObfZX94cuOrUfCKQBygHXAdMcs7FATvN7Psz1F8bmH+qLudc5Fn60RSoaP8+4ZzbzHL627jZf+43ZnYgGdfU38za+79f5u/rfiAe+NRfPgGY6m+jLvBZorazJqMNEUlBCo6S3hx3zlVNXOAPEscSFwH3O+dmBRzXKgX7EQLUds5Fn6EvyWZmDfEF2jrOuSgzmwdEnOVw52/3YODPQEQuLs05SkY0C7jXzMIBzOwKM8sBzAc6++ckiwKNznDuIuA6MyvlPze/v/wIkCvRcbOB+09tmNmpYDUfuNVf1hLId56+5gEO+ANjeXyZ6ykhwKns91Z8w7WHgc1m1tHfhpnZ1edpQ0RSmIKjZETv45tPXGFmq4B38Y2CTAP+9O8bD/wSeKJzbh/QG98Q5u/8O6z5FdD+1A05QH+gpv+GnzX8e9fsM/iC62p8w6tbz9PXmUCYma0FhuILzqccA2r5r6Ex8Ky/vBvQy9+/1UDbZPxMRCQFaW1VERGRAMocRUREAig4ioiIBFBwFBERCaDgKCIiEkDBUUREJICCo4iISAAFRxERkQAKjiIiIgEUHEVERAIoOIqIiARQcBQREQmg4CgiIhJAwVFERCSAgqOIiEgABUdJc2bWzsyc/2XAGZ6Z1TCzlWa20czeMjM7wzH5zGya/32RS8yscsD+UDP71cy+TlRWyswW++v91MyyXIzrEcmMFBwlPegKLPD/myrMLDS16j6DkcDdQDn/p8UZjnkM+M05dxVwO/BmwP4BwNqAspeAN5xzZYEDQK+U7LSI/EvBUdKUmeUE6uP7Rd/FXxZqZq+a2Sp/ZnW/v/waM/vZzH73Z1u5zOwOM3snUX1fm1lD//ejZvaamf0O1DGzJ81sqb/e0acyOjMra2bf+etdYWZlzGy8mbVLVO/HZtY2GddTFMjtnFvkfG8SHw+0O8OhFYHvAZxz64CSZlbEX0dx4Ebg/UT1GtAYmOIvGneWekUkBYSldQck02sLzHTObTCz/WZWA6gFlASqOudizSy/fwjxU6Czc26pmeUGjp+n7hzAYufcIAAzW+Oce9b//SOgNfAV8DEw1Dk3zcwi8P3ROAZ4APjCzPIAdYEeZnalvx9n0hAoBmxPVLbdXxbod+Bm4CczqwVcDhQH9gDDgEeAXImOLwAcdM7FnqdeEUkBCo6S1rry75DiJ/7tUsCoU4HAORdpZlWAXc65pf6ywwBnmM5LLA74PNF2IzN7BMgO5AdWm9k8oJhzbpq/3mj/sT+a2QgzKwR0AD7392c9UPVsDZ6nP4kNBd40s9+AlcCvQJyZtQb2OueWn8qAReTiU3CUNGNm+fENFVYxMweEAg5YegHVxJJ0eiAi0fdo51ycv60IYARQ0zm3zcyeDjj2TMYD3fEN997pr+d8meMOfBngKcX9ZUn4g/upOg3YDGwCOgNtzKyVv3+5zWwCcBuQ18zC/EH6jPWKSMrQnKOkpVuAj5xzlzvnSjrnLsMXJH4H7jGzMEgIouuBov/f3r0H21XWZxz/PoJiIJHQUFJHcWLFNtCIEQTUiqMSqPSCMmodFRQQK3gJNKXFtn94GTsFvEBJvSFpwE5LKWoqKpLQllsFxUAuhIQASrRYa1AuIRBhCI9/vL8ti+0+J+ecXM7J+Hxm9ux93rXetdc6M2d+513rXc+SdEi1Tanl64DZkp4maV/aKdlBeoXwp3Wd800Ath8C7uldX5S0m6Tda92LgNNrvdX1vtb27CFeD9j+MbBB0suq6L0D+Gr/zkia2pltejJwne0Ntv/a9nNtz6AV5f+2fVxdv7y6t9/AOwdtNyK2jRTHGE9vBRb1tX0ZeDbwQ2BlTaZ5m+3HaKOq+dV2Fa3gfYtWUFcD5wO3DPoi2w8AXwBWAYt56uj0eGCupJXADcBvVZ+f0GaMLhzl
cb2XNpnmLuB7wDcBJJ0i6ZRaZ39glaS1wNG02albciYwT9JdtGuQC0a5XxExQmr/kEZEvxpB3gocZPvB8d6fiNhxMnKMGEDSHNqocX4KY8Svn4wcIyIi+mTkGBER0SfFMcaVpM2SlldqzWWdmaJbs82P1mnRoZafezO85gAACOdJREFUIukdW/s9w2x/q7JVJa2r/sslLe20v1nSbZKekPTS7bX/EZHTqjHOJG20Pbk+/wtws+1PdZb37uvbaUi6CZgLfAe4Ajjf9jf71vk4sNH2R9QC1z9t+4hato52P+ZP+/rsDzwBfB44w/ZSImK7yMgxJpLrgf0kvVrS9ZIuB1arZa1+vHJRV0p6T6+DpDNrlLVC0lnVdpGkN9XnsyStrn6fqLYPSzqjPs+W9O1avkjSXtV+jaSza1R3h6TDR3IA2gbZqkOxvcb22pHsR0RsnSTkxIRQN/QfDVxZTQcBs2zfLenPgAdtHyJpN+BbkpYAM2nZrIfZfqTCArrbnAYcC8y0bUlTB3z1F4EP2L5W0keBD1E3/gO72j600mo+BMzZQdmqBpZUatDnbV8wxPdFxHaS4hjjbVLli0IbOS6ghXzfZPvuaj8KOLA3GgT2pD0Kag6w0PYj0DJY+7b9IPBzYIHacxG/3l2oFig+1fa11XQxcFlnla/U+820IHRq5LbdslVr2Stt/0jSPsBVkm63fd1INxwRWy/FMcbbJttPKTZVYB7uNtFGd4v71vuD4TZcT/Q4FDiCFrv2flqW60g9Wu+bqb+VHZCtiu0f1ft6SYtokXgpjhE7UK45xs5gMXCqpKcDSPodSXvQIuRO7M1wHXBadTKwp+0raI+fenF3ed3cf3/neuLxwLUMY3tnq0raQ9KUWmcP2qh51Uh/URGxbWTkGDuDC2mnNW+pgnMv8AbbV0qaDSyV9BhtZujfdPpNAb6q9kQOAfMGbPudwOeqwH6fGs1tpffSQssn0XJVf5mtCmD7c7Rs1YvruuJttIc9A0wHFtXoeVfgX21fWf2PBeYDvwl8Q9Jy28OOniNibHIrR0RERJ+cVo2IiOiT4hgREdEnxTEmrL5oua8NcZ/i1mx/naS96/PGUfR7vqTvVDzcpZ2JNd11niFpYSeg4NXVvrukb0i6vaLgzur0OUHSvXXMyyWdvA0OMyLGIMUxJrJNNQt0FnAf8L7x3qFyNnCu7f2A+3lyMk3XuwFsvwg4EvikpN7f2ydszwReAvy+pKM7/S7tzH69cPsdQkQMJ8UxdhY3Ukkzkl4g6UpJN1fM3Mxqn14RcCvq9Ypq/49a97ZK2xmzmi37WuBL1XQxW46HWw88QMtLfcT21dX+GHALT70vMiImgNzKEROepF1oN/IvqKYLgFNs3ynpMOAztIJ1PnCt7WOrz+Ra/yTb90maBHxX0pdt/2yI75pCS+oZ5G3AeuCBThj6cPFwx0i6BNgXOLjeb+p811TgT4B/6PR7o6RXAXcAf277f4fYl4jYjlIcYyLrRcs9B1hDi1KbTIuXu6wT1bZbvb+WdtM9tjfT4uMA5tY9gtAK1AuBgcXR9kMMHw+39wj3/Z9o9zIuBX4A3MCT8XC9LNlLaE/s+H41fw24xPajauHqFzO6RJ+I2EZSHGMi22R7dt2gv5h2zfEi2shtyALWVRNh5gAvr3Dya4BnDrP+lkaOa4CpevJRWkPFwz1OS+XpbfcG2miw5wLgTtvndfp0C/aFwDnDHlxEbDe55hgTXgWLzwX+AngEuFvSm6FdA5TUi4X7L+DUat+lgsX3BO6vwjgTeNkWvuuhYeLhVtdjqK6mZbVCS9gZFA+3e8W/IelI4HHbq+vnj9V+nd7X59mdH4+hFeKIGAcpjrFTsL0MWAm8FXg78C5JK2jRa6+v1U4DXiPpVtqTNA6gPQJrV0lraE/C+PY22J0zgXmS7gKmUddCJR2j9tgrgH1ocXdrav3ja53nAn9b+3ZL3y0bc2vS0AraPwMnbIN9jYgxSHxcREREn4wc
IyIi+qQ4RkRE9ElxjIiI6JPiGOOuk6Hae82QNE3S1ZI2SvrHYfr+saRllYizuu4PHDeSfkPSVZLurPe9hljv7MqMXSXpLZ32BXUsKyV9qe7rRNLz6vexrJb94Y46pohfR5mQE+NO0kbbk/va9qBlj84CZtl+/4B+T6fdYH+o7Xsk7QbMsL12K/ZFtL+LJ8bY/xzgPttnSfogsJftM/vW+SPabRxH0wIMrgGOsL1B0rNsb6j1PgWsr21dACyz/VlJBwBX2J4xxsOMiC3IyDEmJNsP2/4f4OfDrDaFFmTxs+rzaK8wDpOzOq8zYju92mZIWivpi8AqYF9JfynpuzVK+8godv31tGQbGD539Trbj9t+mHaLyuvqGHqFUcAkoPffq4Fn1ec9gf8bxT5FxCilOMZEMKlzSnXRSDvZvg+4HPiBpEskvV1PPvmil7P6YuAg4DZJBwMnAofRwgDeLekltf4Lgc/Y/j3gd+vnQ2lRcgdX3ilqQefLB7zm1Ham2/5xff5/YPqAXV8BvK6CAvYGXkOLtaO+Y2H1nQnMr+YPA8dJuge4AvjASH9PETF6iY+LiWDTSOPg+tk+WdKLaBFxZ9AeD3UCA3JWJb0SWFSjNSR9BTicKrC2ewEBR9VrWf08mVYsr7N9+Cj2zZJ+5bqF7SWSDqHlrd5Le+LI5s7yE9WC0+cDbwEW0sIPLrL9SUkvB/5Z0qyxnv6NiOFl5Bg7Pdu32j6XVhjfOMbNPNz5LODvO7Fx+9nupeBsaeT4k14MXL2vH2Kf/662fWR93x19yzcD/9Y5nncB/17LbqTlw440BD0iRinFMXZakiarBYv3zKZN0IHBOavXA2/o5J4ey+CQ8cXASZ2Zos+RtA+A7cOHyF39z+p7OS1vFYbOXd1F0rT6fCBwILBEzX7VLlq+6u3V7Ye0x3YhaX9acbx3xL+siBiVzFaNcTdotmq1r6NNQnkG7WHBR/XCu2v5FOBS4AXAJtro7zTbSyVNpz354rdppyxPtX2jpHnASbWJC22fJ2kG8HXbszrbPg3oZZ5uBI6z/b0RHMs02gjvebRC/af1LMmX0p5BebKkZ9IecgywodqX1/XS6+uYRbs2eWrNYj0A+ALtFK+Bv7K9ZEv7ExFjk+IYERHRJ6dVIyIi+qQ4RkRE9ElxjIiI6JPiGBER0SfFMSIiok+KY0RERJ8Ux4iIiD4pjhEREX1+AbLgBDCeetvLAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t-KyMOWgRyQ4" + }, + "source": [ + "\n", + "\n", + "---\n", + "##### ACURÁCIA com dados de TREINO SEM CROSS VALIDATION\n", + "---\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VTT_qynaPV6-", + "outputId": "5f6af706-b5b3-4b1b-f0bd-a1886f0a9e9f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 0 + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TREINO SEM CROSS VALIDATION\n", + "# Preparar o array de predições\n", + "a_scores_pred_treino = ml_DT.predict(X_treinamento)\n", + "a_scores_pred_treino[0:30]" + ], + "execution_count": 30, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,\n", + " 0, 1, 0, 1, 1, 0, 0, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NgzcGE75PVKB", + "outputId": "259de4dd-4c60-4bed-b750-e6a7f51d40db", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 0 + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TREINO SEM CROSS VALIDATION\n", + "accuracy_score(y_treinamento, a_scores_pred_treino)" + ], + "execution_count": 31, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DL-b5ehHSeaV" + }, + "source": [ + "---\n", + "#### Criar Modelo usando classificador Naive Bayes\n", + "---" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mG3gUR4aSwD3" + }, + "source": [ + "from sklearn.naive_bayes import GaussianNB" + ], + "execution_count": 33, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "N8KXRDsSS3uQ" + }, + "source": [ + "# Criando o modelo 
preditivo\n", + "modelo_v1 = GaussianNB()" + ], + "execution_count": 34, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tO9zI48tS3Wy", + "outputId": "dc906ec3-5010-486e-dbd8-78493fa23d2b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Treinando o modelo\n", + "modelo_v1.fit(X_treinamento, y_treinamento)" + ], + "execution_count": 38, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "GaussianNB(priors=None, var_smoothing=1e-09)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 38 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5OdlSGHWS28K", + "outputId": "0a175639-5633-4b6d-a320-e8c929468659", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_scores_pred_treino_NB = modelo_v1.predict(X_treinamento)\n", + "a_scores_pred_treino_NB[0:30]" + ], + "execution_count": 39, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0,\n", + " 0, 1, 0, 1, 0, 0, 0, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aEFkIz3wS2gm", + "outputId": "2c656f9d-ba1c-47fc-80f6-98385c7420d7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TREINO SEM CROSS VALIDATION\n", + "accuracy_score(y_treinamento, a_scores_pred_treino_NB)" + ], + "execution_count": 40, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.8942857142857142" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vrjgRbCUT2PB", + "outputId": "8dbd3d1e-b7a7-46c1-b8d5-0e62b172a199", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + 
"source": [ + "# Medir ACURÁCIA com dados de TESTE\n", + "a_scores_pred_teste_NB = modelo_v1.predict(X_teste)\n", + "accuracy_score(y_teste, a_scores_pred_teste_NB)" + ], + "execution_count": 46, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.92" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 46 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1AvybYmQT2AU", + "outputId": "83397313-5911-4d70-be99-0e73645455c7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TREINO **COM** CROSS VALIDATION\n", + "# Cross-Validation com k = 10 folds\n", + "a_scores_CV_NB = cross_val_score(modelo_v1, X_treinamento, y_treinamento, cv = i_CV)\n", + "\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV_NB.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV_NB.std(),4)}')" + ], + "execution_count": 45, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 89.71000000000001\n", + "std médio das Acurácias calculadas pelo CV: 3.3099999999999996\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b-QHuY26T1g3" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p8D975NqsGtj" + }, + "source": [ + "## Parameter tunning\n", + "### Referência\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - Explica didaticamente e step by step como fazer parameter tunning." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bfdq5zEhlVsk" + }, + "source": [ + "# Parameter grid for the parameter tuning. In total, 2 x 15 x 5 x 5 x 7 = 5,250 candidate models are evaluated. With 10 folds in the cross-validation, that makes 52,500 fits.\n", + "#d_parametros_DT = {\"criterion\": [\"gini\", \"entropy\"], \n", + "# \"min_samples_split\": [2, 5, 350, 400], \n", + "# \"max_depth\": [None, 2, 15], \n", + "# \"min_samples_leaf\": [20, 80, 100], \n", + "# \"max_leaf_nodes\": [None, 2, 15]}\n", + "\n", + "# ORIGINAL DICTIONARY\n", + "d_parametros_DT = {\"criterion\": [\"gini\", \"entropy\"], \n", + " \"min_samples_split\": [2, 5, 10, 30, 50, 70, 90, 120, 150, 180, 210, 240, 270, 350, 400], \n", + " \"max_depth\": [None, 2, 5, 9, 15], \n", + " \"min_samples_leaf\": [20, 40, 60, 80, 100], \n", + " \"max_leaf_nodes\": [None, 2, 3, 4, 5, 10, 15]}\n" + ], + "execution_count": 102, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BtajXuuUpGwq", + "outputId": "8bb5a299-00cc-48c0-a8c3-8c47232f8bf1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 340 + } + }, + "source": [ + "d_parametros_DT" + ], + "execution_count": 103, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': ['gini', 'entropy'],\n", + " 'max_depth': [None, 2, 5, 9, 15],\n", + " 'max_leaf_nodes': [None, 2, 3, 4, 5, 10, 15],\n", + " 'min_samples_leaf': [20, 40, 60, 80, 100],\n", + " 'min_samples_split': [2,\n", + " 5,\n", + " 10,\n", + " 30,\n", + " 50,\n", + " 70,\n", + " 90,\n", + " 120,\n", + " 150,\n", + " 180,\n", + " 210,\n", + " 240,\n", + " 270,\n", + " 350,\n", + " 400]}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 103 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H8gNSs0G0A-L" + }, + "source": [ + "```\n", + "grid_search = GridSearchCV(ml_DT, param_grid= d_parametros_DT, cv = i_CV, n_jobs= -1)\n", + "start = time()\n", + 
"grid_search.fit(X_treinamento, y_treinamento)\n", + "tempo_elapsed= time()-start\n", + "print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos para estimar {len(grid_search.cv_results_)} modelos candidatos\")\n", + "\n", + "GridSearchCV levou 1999.12 segundos para estimar 23 modelos candidatos\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ap3WMXqDthu9" + }, + "source": [ + "# Definindo a função para o GridSearchCV\n", + "def GridSearchOptimizer(modelo, ml_Opt, d_Parametros, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV):\n", + " ml_GridSearchCV = GridSearchCV(modelo, d_Parametros, cv = i_CV, n_jobs= -1, verbose= 10, scoring= 'accuracy')\n", + " start = time()\n", + " ml_GridSearchCV.fit(X_treinamento, y_treinamento)\n", + " tempo_elapsed= time()-start\n", + " #print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos.\")\n", + "\n", + " # Parâmetros que otimizam a classificação:\n", + " print(f'\\nParametros otimizados: {ml_GridSearchCV.best_params_}')\n", + " \n", + " if ml_Opt == 'ml_DT2':\n", + " print(f'\\nDecisionTreeClassifier *********************************************************************************************************')\n", + " ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_leaf_nodes= ml_GridSearchCV.best_params_['max_leaf_nodes'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_split'], \n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_RF2':\n", + " print(f'\\nRandomForestClassifier *********************************************************************************************************')\n", + " ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_features= 
ml_GridSearchCV.best_params_['max_features'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + " n_estimators= ml_GridSearchCV.best_params_['n_estimators'],\n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_AB2':\n", + " print(f'\\nAdaBoostClassifier *********************************************************************************************************')\n", + " ml_Opt = AdaBoostClassifier(algorithm='SAMME.R', \n", + " base_estimator=RandomForestClassifier(bootstrap = False, \n", + " max_depth = 10, \n", + " max_features = 'auto', \n", + " min_samples_leaf = 1, \n", + " min_samples_split = 2, \n", + " n_estimators = 400), \n", + " learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + " n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + " random_state = i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_GB2':\n", + " print(f'\\nGradientBoostingClassifier *********************************************************************************************************')\n", + " ml_Opt = GradientBoostingClassifier(learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + " n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + " max_depth = ml_GridSearchCV.best_params_['max_depth'], \n", + " min_samples_split = ml_GridSearchCV.best_params_['min_samples_split'], \n", + " min_samples_leaf = ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + " max_features = ml_GridSearchCV.best_params_['max_features'])\n", + " \n", + " elif ml_Opt == 'ml_XGB2':\n", + " print(f'\\nXGBoostingClassifier *********************************************************************************************************')\n", + " ml_Opt = XGBClassifier(learning_rate= ml_GridSearchCV.best_params_['learning_rate'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'], \n", + " colsample_bytree= 
ml_GridSearchCV.best_params_['colsample_bytree'], \n", + " subsample= ml_GridSearchCV.best_params_['subsample'], \n", + " gamma= ml_GridSearchCV.best_params_['gamma'], \n", + " min_child_weight= ml_GridSearchCV.best_params_['min_child_weight'])\n", + " \n", + " # Train again using the optimized parameters...\n", + " ml_Opt.fit(X_treinamento, y_treinamento)\n", + "\n", + " # Cross-validation with `cv` folds (10 in this notebook)\n", + " print(f'\\n********* CROSS-VALIDATION ***********')\n", + " a_scores_CV = cross_val_score(ml_Opt, X_treinamento, y_treinamento, cv = cv)\n", + " print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + " print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')\n", + "\n", + " # Make predictions with the optimized parameters...\n", + " y_pred = ml_Opt.predict(X_teste)\n", + " \n", + " # Importance of the COLUMNS (l_colunas is read from the notebook's global scope)\n", + " print(f'\\n********* IMPORTÂNCIA DAS COLUNAS ***********')\n", + " df_importancia_variaveis = pd.DataFrame(zip(l_colunas, ml_Opt.feature_importances_), columns= ['coluna', 'importancia'])\n", + " df_importancia_variaveis = df_importancia_variaveis.sort_values(by= ['importancia'], ascending=False)\n", + " print(df_importancia_variaveis)\n", + "\n", + " # Confusion matrix\n", + " print(f'\\n********* CONFUSION MATRIX - PARAMETER TUNNING ***********')\n", + " cf_matrix = confusion_matrix(y_teste, y_pred)\n", + " cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + " cf_categories = ['Zero', 'One']\n", + " mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)\n", + "\n", + " return ml_Opt, ml_GridSearchCV.best_params_" + ], + "execution_count": 104, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "44-BRnNjBT25", + "outputId": "d2a0bad2-3989-4441-efdf-8c94bdaa0a53", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "# Invoca a função com o 
modelo baseline\n", + "ml_DT2, best_params = GridSearchOptimizer(ml_DT, 'ml_DT2', d_parametros_DT, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)" + ], + "execution_count": 105, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 5250 candidates, totalling 52500 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.3s\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 1.4s\n", + "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.4s\n", + "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 1.5s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1747s.) Setting batch_size=2.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0593s.) Setting batch_size=4.\n", + "[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 1.6s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0885s.) Setting batch_size=8.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1811s.) 
Setting batch_size=16.\n", + "[Parallel(n_jobs=-1)]: Done 58 tasks | elapsed: 1.9s\n", + "[Parallel(n_jobs=-1)]: Done 186 tasks | elapsed: 2.8s\n", + "[Parallel(n_jobs=-1)]: Done 330 tasks | elapsed: 3.7s\n", + "[Parallel(n_jobs=-1)]: Done 506 tasks | elapsed: 4.8s\n", + "[Parallel(n_jobs=-1)]: Done 682 tasks | elapsed: 5.8s\n", + "[Parallel(n_jobs=-1)]: Done 890 tasks | elapsed: 7.0s\n", + "[Parallel(n_jobs=-1)]: Done 1098 tasks | elapsed: 8.2s\n", + "[Parallel(n_jobs=-1)]: Done 1338 tasks | elapsed: 9.6s\n", + "[Parallel(n_jobs=-1)]: Done 1578 tasks | elapsed: 11.0s\n", + "[Parallel(n_jobs=-1)]: Done 1850 tasks | elapsed: 12.7s\n", + "[Parallel(n_jobs=-1)]: Done 2122 tasks | elapsed: 14.3s\n", + "[Parallel(n_jobs=-1)]: Done 2426 tasks | elapsed: 16.1s\n", + "[Parallel(n_jobs=-1)]: Done 2730 tasks | elapsed: 17.9s\n", + "[Parallel(n_jobs=-1)]: Done 3066 tasks | elapsed: 19.7s\n", + "[Parallel(n_jobs=-1)]: Done 3402 tasks | elapsed: 21.7s\n", + "[Parallel(n_jobs=-1)]: Done 3770 tasks | elapsed: 23.8s\n", + "[Parallel(n_jobs=-1)]: Done 4138 tasks | elapsed: 26.2s\n", + "[Parallel(n_jobs=-1)]: Done 4538 tasks | elapsed: 28.6s\n", + "[Parallel(n_jobs=-1)]: Done 4938 tasks | elapsed: 31.0s\n", + "[Parallel(n_jobs=-1)]: Done 5370 tasks | elapsed: 33.4s\n", + "[Parallel(n_jobs=-1)]: Done 5802 tasks | elapsed: 35.7s\n", + "[Parallel(n_jobs=-1)]: Done 6266 tasks | elapsed: 38.1s\n", + "[Parallel(n_jobs=-1)]: Done 6730 tasks | elapsed: 40.7s\n", + "[Parallel(n_jobs=-1)]: Done 7226 tasks | elapsed: 43.4s\n", + "[Parallel(n_jobs=-1)]: Done 7722 tasks | elapsed: 46.0s\n", + "[Parallel(n_jobs=-1)]: Done 8250 tasks | elapsed: 48.9s\n", + "[Parallel(n_jobs=-1)]: Done 8778 tasks | elapsed: 51.7s\n", + "[Parallel(n_jobs=-1)]: Done 9338 tasks | elapsed: 54.8s\n", + "[Parallel(n_jobs=-1)]: Done 9898 tasks | elapsed: 57.8s\n", + "[Parallel(n_jobs=-1)]: Done 10490 tasks | elapsed: 1.0min\n", + "[Parallel(n_jobs=-1)]: Done 11082 tasks | elapsed: 1.1min\n", + "[Parallel(n_jobs=-1)]: Done 
11706 tasks | elapsed: 1.1min\n", + "[Parallel(n_jobs=-1)]: Done 12330 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 12986 tasks | elapsed: 1.3min\n", + "[Parallel(n_jobs=-1)]: Done 13642 tasks | elapsed: 1.3min\n", + "[Parallel(n_jobs=-1)]: Done 14330 tasks | elapsed: 1.4min\n", + "[Parallel(n_jobs=-1)]: Done 15018 tasks | elapsed: 1.5min\n", + "[Parallel(n_jobs=-1)]: Done 15738 tasks | elapsed: 1.5min\n", + "[Parallel(n_jobs=-1)]: Done 16458 tasks | elapsed: 1.6min\n", + "[Parallel(n_jobs=-1)]: Done 17210 tasks | elapsed: 1.7min\n", + "[Parallel(n_jobs=-1)]: Done 17962 tasks | elapsed: 1.7min\n", + "[Parallel(n_jobs=-1)]: Done 18746 tasks | elapsed: 1.8min\n", + "[Parallel(n_jobs=-1)]: Done 19530 tasks | elapsed: 1.9min\n", + "[Parallel(n_jobs=-1)]: Done 20346 tasks | elapsed: 2.0min\n", + "[Parallel(n_jobs=-1)]: Done 21162 tasks | elapsed: 2.0min\n", + "[Parallel(n_jobs=-1)]: Done 22010 tasks | elapsed: 2.1min\n", + "[Parallel(n_jobs=-1)]: Done 22858 tasks | elapsed: 2.2min\n", + "[Parallel(n_jobs=-1)]: Done 23738 tasks | elapsed: 2.3min\n", + "[Parallel(n_jobs=-1)]: Done 24618 tasks | elapsed: 2.4min\n", + "[Parallel(n_jobs=-1)]: Done 25530 tasks | elapsed: 2.5min\n", + "[Parallel(n_jobs=-1)]: Done 26442 tasks | elapsed: 2.6min\n", + "[Parallel(n_jobs=-1)]: Done 27386 tasks | elapsed: 2.7min\n", + "[Parallel(n_jobs=-1)]: Done 28330 tasks | elapsed: 2.8min\n", + "[Parallel(n_jobs=-1)]: Done 29306 tasks | elapsed: 2.9min\n", + "[Parallel(n_jobs=-1)]: Done 30282 tasks | elapsed: 3.0min\n", + "[Parallel(n_jobs=-1)]: Done 31290 tasks | elapsed: 3.1min\n", + "[Parallel(n_jobs=-1)]: Done 32298 tasks | elapsed: 3.2min\n", + "[Parallel(n_jobs=-1)]: Done 33338 tasks | elapsed: 3.3min\n", + "[Parallel(n_jobs=-1)]: Done 34378 tasks | elapsed: 3.5min\n", + "[Parallel(n_jobs=-1)]: Done 35450 tasks | elapsed: 3.6min\n", + "[Parallel(n_jobs=-1)]: Done 36522 tasks | elapsed: 3.7min\n", + "[Parallel(n_jobs=-1)]: Done 37626 tasks | elapsed: 3.8min\n", + 
"[Parallel(n_jobs=-1)]: Done 38730 tasks | elapsed: 4.0min\n", + "[Parallel(n_jobs=-1)]: Done 39866 tasks | elapsed: 4.1min\n", + "[Parallel(n_jobs=-1)]: Done 41002 tasks | elapsed: 4.2min\n", + "[Parallel(n_jobs=-1)]: Done 42170 tasks | elapsed: 4.4min\n", + "[Parallel(n_jobs=-1)]: Done 43338 tasks | elapsed: 4.5min\n", + "[Parallel(n_jobs=-1)]: Done 44538 tasks | elapsed: 4.6min\n", + "[Parallel(n_jobs=-1)]: Done 45738 tasks | elapsed: 4.8min\n", + "[Parallel(n_jobs=-1)]: Done 46970 tasks | elapsed: 4.9min\n", + "[Parallel(n_jobs=-1)]: Done 48202 tasks | elapsed: 5.1min\n", + "[Parallel(n_jobs=-1)]: Done 49466 tasks | elapsed: 5.2min\n", + "[Parallel(n_jobs=-1)]: Done 50730 tasks | elapsed: 5.4min\n", + "[Parallel(n_jobs=-1)]: Done 52026 tasks | elapsed: 5.6min\n", + "[Parallel(n_jobs=-1)]: Done 52500 out of 52500 | elapsed: 5.6min finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 20, 'min_samples_split': 70}\n", + "\n", + "DecisionTreeClassifier *********************************************************************************************************\n", + "\n", + "********* CROSS-VALIDATION ***********\n", + "Média das Acurácias calculadas pelo CV....: 87.14\n", + "std médio das Acurácias calculadas pelo CV: 4.33\n", + "\n", + "********* IMPORTÂNCIA DAS COLUNAS ***********\n", + " coluna importancia\n", + "12 v13 0.735896\n", + "0 v1 0.135030\n", + "9 v10 0.090888\n", + "6 v7 0.025768\n", + "1 v2 0.012418\n", + "3 v4 0.000000\n", + "4 v5 0.000000\n", + "5 v6 0.000000\n", + "7 v8 0.000000\n", + "8 v9 0.000000\n", + "10 v11 0.000000\n", + "11 v12 0.000000\n", + "2 v3 0.000000\n", + "13 v14 0.000000\n", + "14 v15 0.000000\n", + "15 v16 0.000000\n", + "16 v17 0.000000\n", + "17 v18 0.000000\n", + "\n", + "********* CONFUSION MATRIX - PARAMETER TUNNING ***********\n" + ], + "name": "stdout" + }, + { + 
"output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gmCkjGjPJMLr" + }, + "source": [ + "### Visualizar o resultado" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cIc3ZgaISEd0", + "outputId": "48a1f7da-d77d-4630-c8c6-9d02d802971a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 753 + } + }, + "source": [ + "from sklearn.tree import export_graphviz\n", + "from sklearn.externals.six import StringIO \n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "dot_data = StringIO()\n", + "export_graphviz(ml_DT2, out_file = dot_data, filled = True, rounded = True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n", + "\n", + "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", + "graph.write_png('DecisionTree.png')\n", + "Image(graph.create_png())" + ], + "execution_count": 106, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 106 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e1R2GBkbnV37" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vv7GKBvs6Ybf" + }, + "source": [ + "# Função desenvolvida para Selecionar COLUNAS relevantes\n", + "from sklearn.feature_selection import SelectFromModel\n", + "\n", + "def seleciona_colunas_relevantes(modelo, X_treinamento, X_teste, threshold = 0.05):\n", + " # Cria um seletor para selecionar as COLUNAS com importância > threshold\n", + " sfm = SelectFromModel(modelo, threshold)\n", + " \n", + " # Treina o seletor\n", + " sfm.fit(X_treinamento, y_treinamento)\n", + "\n", + " # Mostra o indice das COLUNAS mais importantes\n", + " print(f'\\n********** COLUNAS Relevantes ******')\n", + " 
print(sfm.get_support(indices=True))\n", + "\n", + " # Seleciona somente as COLUNAS relevantes\n", + " X_treinamento_I = sfm.transform(X_treinamento)\n", + " X_teste_I = sfm.transform(X_teste)\n", + " return X_treinamento_I, X_teste_I " + ], + "execution_count": 107, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ukMLoEr7nbUf", + "outputId": "afd9f4fa-535f-4da0-c736-3c606ef53568", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "X_treinamento_DT, X_teste_DT = seleciona_colunas_relevantes(ml_DT2, X_treinamento, X_teste)" + ], + "execution_count": 108, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "********** COLUNAS Relevantes ******\n", + "[ 0 9 12]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8JjePRQAoqkk" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gt3aCPpfKRxm", + "outputId": "c4af1a97-2585-40ee-e144-d15260fbbef1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "best_params" + ], + "execution_count": 109, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': 'entropy',\n", + " 'max_depth': None,\n", + " 'max_leaf_nodes': None,\n", + " 'min_samples_leaf': 20,\n", + " 'min_samples_split': 70}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 109 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq6uCVtzovMt", + "outputId": "bf2faa39-ed81-4f33-8b9d-9c4d10c2147b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Treina usando as COLUNAS relevantes...\n", + "ml_DT2.fit(X_treinamento_DT, y_treinamento)\n", + "\n", + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT2, X_treinamento_DT, y_treinamento, cv = i_CV)\n", + "print(f'Média das 
Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": 110, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 88.71\n", + "std médio das Acurácias calculadas pelo CV: 2.5100000000000002\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Tc7esxqtq-Og" + }, + "source": [ + "****************************************************************" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "znWy3LE1q-Z3" + }, + "source": [ + "ml_DT3, best_params2 = GridSearchOptimizer(ml_DT2, 'ml_DT2', d_parametros_DT, X_treinamento_DT, y_treinamento, X_teste_DT, y_teste, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IhCC6pfq-jL" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qw6Dk3kesT0q" + }, + "source": [ + "best_params2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "SbS4ZKN8s-ee" + }, + "source": [ + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT3, X_treinamento_DT, y_treinamento, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_at3XP1Bq-qb" + }, + "source": [ + "***************************************************************" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MZ1-vGRcxJoN" + }, + "source": [ + "## Valida o modelo usando o 
dataframe X_teste" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ig9GiUAEw9jr" + }, + "source": [ + "y_pred_DT = ml_DT2.predict(X_teste_DT)" + ], + "execution_count": 111, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7UZz4UzHDqae", + "outputId": "1e931808-3887-4e6a-b27b-329ea9776e1e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Compute the accuracy\n", + "accuracy_score(y_teste, y_pred_DT)" + ], + "execution_count": 112, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9333333333333333" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 112 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3EUMAxxKBur" + }, + "source": [ + "___\n", + "# **RANDOM FOREST**\n", + "* Decision Trees have a tree-shaped structure.\n", + "* Random Forest can be used both for classification (RandomForestClassifier) and for regression (RandomForestRegressor).\n", + "\n", + "* **Advantages**:\n", + " * Does not require much data preprocessing;\n", + " * Handles categorical and numerical COLUMNS well;\n", + " * It is a Bagging ensemble method: it trains many decision trees independently, each on a bootstrap sample of the data, and combines their predictions by voting (classification) or averaging (regression);\n", + " * More robust than a single Decision Tree. **Why?**\n", + " * Helps control overfitting automatically (**why?**) and often produces very robust, high-performance models.\n", + " * Can be used for Feature Selection, since it produces the feature importances (`feature_importances_`). 
The importances sum to 1 (that is, 100%);\n", + " * Like Decision Trees, these models easily capture non-linear patterns in the data;\n", + " * Does not require the data to be normalized;\n", + " * Copes with Missing Values (in scikit-learn this depends on the version; older releases require imputing them first);\n", + " * Does not require assumptions about the distribution of the data, because the algorithm is non-parametric\n", + "\n", + "* **Disadvantages**\n", + " * Tends to be biased toward the majority class when the classes are imbalanced. **Balancing the dataframe beforehand is recommended to avoid this problem**.\n", + "\n", + "* **Main parameters**\n", + "\n", + "## **References**:\n", + "* [Running Random Forests? Inspect the feature importances with this code](https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e)\n", + "* [Feature importances with forests of trees](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)\n", + "* [Understanding Random Forests Classifiers in Python](https://www.datacamp.com/community/tutorials/random-forests-classifier-python)\n", + "* [Understanding Random Forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)\n", + "* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76)\n", + "* [Random Forest Simple Explanation](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)\n", + "* [Random Forest Explained](https://www.youtube.com/watch?v=eM4uJ6XGnSM)\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) - Explains the main Random Forest parameters."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cnfDw_GEKBuu", + "outputId": "83f52ce2-ac65-4e96-aa5d-ce5fb204511c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 153 + } + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate...\n", + "ml_RF= RandomForestClassifier(n_estimators=100, min_samples_split= 2, max_features=\"auto\", random_state= i_Seed)\n", + "\n", + "# Train...\n", + "ml_RF.fit(X_treinamento, y_treinamento)" + ], + "execution_count": 113, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=None, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, n_estimators=100,\n", + " n_jobs=None, oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 113 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lYa9oaZW__o6" + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_RF, X_treinamento, y_treinamento, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AouWUu8vANdb" + }, + "source": [ + "**Interpretation**: Our classifier (RandomForestClassifier) has a mean accuracy of 96.44% on the training set. In addition, the std is about 2.77%, which is small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vbducxlgAa85" + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lxx-LUw_5sd" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_RF.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQIRO_LpGAkw" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yKLHZ5_C6FJ8" + }, + "source": [ + "## Parameter tuning\n", + "### References\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - Explains step by step, in a didactic way, how to do parameter tuning.\n", + "* [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) - Another approach to understanding parameter tuning. Strongly recommended reading! 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XOa9naju6FKA" + }, + "source": [ + "# Dicionário de parâmetros para o parameter tunning.\n", + "d_parametros_RF= {'bootstrap': [True, False]} #,\n", + "# 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],\n", + "# 'max_features': ['auto', 'sqrt'],\n", + "# 'min_samples_leaf': [1, 2, 4],\n", + "# 'min_samples_split': [2, 5, 10],\n", + "# 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6__f2jZaTQat" + }, + "source": [ + "# Invoca a função\n", + "ml_RF2, best_params = GridSearchOptimizer(ml_RF, 'ml_RF2', d_parametros_RF, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "crfn-n--KG4n" + }, + "source": [ + "### Resultado da execução do Random Forest\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 7920 out of 7920 | elapsed: 194.0min finished\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SGTOe5PaRw59" + }, + "source": [ + "# Como o procedimento acima levou 194 minutos para executar, então vou estimar ml_RF2 abaixo usando os parâmetros acima estimados\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= 
i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HMJcAdLlTQa0" + }, + "source": [ + "## Visualizar o resultado\n", + "> Implementar a visualização do RandomForest." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WWNiy7Z0TQa3" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kOi11YOKTQa4" + }, + "source": [ + "X_treinamento_RF, X_teste_RF = seleciona_colunas_relevantes(ml_RF2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_O7c_DTQbE" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UwEOwzSGTQbF" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rr8qDrgvTQbL" + }, + "source": [ + "# Treina com as COLUNAS relevantes...\n", + "ml_RF2.fit(X_treinamento_RF, y_treinamento)\n", + "\n", + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_RF2, X_treinamento_RF, y_treinamento, cv = i_CV)\n", + "print(f'Acurácia Media: {100*a_scores_CV.mean():.2f}')\n", + "print(f'std médio.....: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-mYfQLlsTQbQ" + }, + "source": [ + "## Valida o modelo usando o dataframe X_teste" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sSD5o1JQTQbR" + }, + "source": [ + "y_pred_RF = ml_RF2.predict(X_teste_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wywF6LymDzKr" + }, + "source": [ + "# Calcula acurácia\n", + "accuracy_score(y_teste, y_pred_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "hJJsL0IJb6iO" + }, + "source": [ + "## Estudo do comportamento dos parametros do algoritmo\n", + "> Consulte [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) para mais detalhes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "navUWMwHi44D" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name=\"n_estimators\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label = \"Training score\", color = \"black\")\n", + "plt.plot(param_range, test_mean, label = \"Cross-validation score\", color = \"dimgrey\")\n", + "\n", + "# Plot accurancy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color = \"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color = \"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Number Of Trees\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc = \"best\")\n", + "plt.show()" + ], 
+ "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rv7TIM9kjsud" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name = \"max_depth\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot (the x axis is the max_depth value being varied)\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Max Depth\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lm_fPGYwkJYc" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, 
test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name='min_samples_leaf', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot (the x axis is the min_samples_leaf value being varied)\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Min Samples Leaf\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CAqdiSaVlAB8" + }, + "source": [ + "param_range = np.arange(0.05, 1, 0.05)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name='min_samples_split', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate 
mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot (the x axis is the min_samples_split fraction being varied)\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Min Samples Split (fraction of samples)\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cX_gfsbQSdNd" + }, + "source": [ + "___\n", + "# **BOOSTING MODELS**\n", + "* These algorithms are widely used in Kaggle competitions;\n", + "* They are used to improve the performance of Machine Learning algorithms;\n", + "* Models:\n", + " - [X] AdaBoost\n", + " - [X] XGBoost\n", + " - [X] LightGBM\n", + " - [X] GradientBoosting\n", + " - [X] CatBoost\n", + "\n", + "## Bagging vs Boosting vs Stacking\n", + "### **Bagging**\n", + "* The goal is to reduce variance;\n", + "\n", + "#### How it works\n", + "* Draws several samples **WITH REPLACEMENT** from the training dataframe. Each sample is used to train a Decision Tree model. The result is an ensemble of many different models (Decision Trees). 
The average of these many different models (Decision Trees) is used to produce the final result;\n", + "* The final result is more robust than using a single Decision Tree.\n", + "\n", + "![Bagging](https://github.com/MathMachado/Materials/blob/master/Bagging.png?raw=true)\n", + "\n", + "Source: [Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm](https://hackernoon.com/how-to-develop-a-robust-algorithm-c38e08f32201).\n", + "\n", + "#### Steps\n", + "* Suppose a dataframe X_treinamento (the training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Bagging randomly draws a sample **WITH REPLACEMENT** from X_treinamento;\n", + " 2. Bagging randomly selects M2 (M2 < M) COLUMNS from the dataframe drawn in step (1);\n", + " 3. It builds a Decision Tree with the M2 COLUMNS from step (2) and the dataframe from step (1), and the COLUMNS are evaluated by their ability to classify the observations;\n", + " 4. Steps (1) --> (2) --> (3) are repeated K times (that is, K Decision Trees), so that the COLUMNS are ranked by their predictive power and the final result (accuracy, for example) is obtained by aggregating the predictions of the K Decision Trees.\n", + "\n", + "#### Advantages\n", + "* Reduces overfitting;\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Handles Missing Values automatically;\n", + "\n", + "#### Disadvantage\n", + "* The final prediction is based on the average of the K Decision Trees, which can compromise the final accuracy.\n", + "\n", + "___ \n", + "### **Boosting**\n", + "* The goal is to improve accuracy;\n", + "\n", + "#### How it works\n", + "* The classifiers are used sequentially, so that the classifier at step N learns from the errors of the classifier at step N-1. 
In other words, the goal is to improve accuracy at each step by learning from the previous ones.\n", + "\n", + "![Boosting](https://github.com/MathMachado/Materials/blob/master/Boosting.png?raw=true)\n", + "\n", + "Source: [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205), Joseph Rocca.\n", + "\n", + "#### Steps\n", + "* Suppose a dataframe X_treinamento (the training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Boosting randomly draws a sample D1 WITHOUT replacement from X_treinamento;\n", + " 2. Boosting trains classifier C1 on D1;\n", + " 3. Boosting randomly draws a SECOND sample D2 WITHOUT replacement from X_treinamento and adds to D2 50% of the observations that were misclassified by C1, then trains classifier C2;\n", + " 4. Boosting finds in X_treinamento the sample D3 on which classifiers C1 and C2 disagree and trains C3 on it;\n", + " 5. It combines (by vote) the predictions of C1, C2 and C3 to produce the final result.\n", + "\n", + "#### Advantages\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Handles Missing Values automatically;\n", + "\n", + "#### Disadvantages\n", + "* Prone to overfitting. Treating outliers beforehand is recommended.\n", + "* Requires careful tuning of the hyperparameters;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9fgUrkmPk4dr" + }, + "source": [ + "___\n", + "# STACKING\n", + "\n", + "![Stacking](https://github.com/MathMachado/Materials/blob/master/Stacking.png?raw=true)\n", + "\n", + "TODO: add the reference for this figure."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B0jxx3ETpOdm" + }, + "source": [ + "___\n", + "# **BOOTSTRAPPING METHODS**\n", + "> Before talking about Boosting or Bagging, we first need to understand what Bootstrap is, because both (Boosting and Bagging) are based on Bootstrap.\n", + "\n", + "* In Statistics (and in Machine Learning), Bootstrap refers to drawing random samples WITH replacement from the population X." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyqazmUuifkE" + }, + "source": [ + "___\n", + "# **ADABOOST (Adaptive Boosting)**\n", + "* When nothing else works, AdaBoost works!\n", + "* It was one of the first Boosting algorithms (1995);\n", + "* AdaBoost can be used for both classification (AdaBoostClassifier) and regression (AdaBoostRegressor);\n", + "* By default, AdaBoost uses a DecisionTree algorithm as base_estimator;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RU-vzkXqrFVw" + }, + "source": [ + "## References\n", + "* [AdaBoost Classifier Example In Python](https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464) - Didactic; explains exactly how AdaBoost works.\n", + "* [Adaboost for Dummies: Breaking Down the Math (and its Equations) into Simple Terms](https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf) - For those who want to understand the math behind the algorithm.\n", + "* [Gradient Boosting and XGBoost](https://medium.com/hackernoon/gradient-boosting-and-xgboost-90862daa6c77)\n", + "* [Understanding AdaBoost](https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe), Akash Desarda." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6EMrjQDZIMl_" + }, + "source": [ + "## What
is AdaBoost (Adaptive Boosting)?\n", + "* It is an ensemble classifier (it combines several classifiers to improve accuracy).\n", + "* AdaBoost is a strong, iterative classifier that combines (ensemble) several weak classifiers to improve accuracy.\n", + "* Any machine learning algorithm can be used as the base classifier (base_estimator parameter), as long as it accepts sample weights;\n", + "\n", + "## Most important AdaBoost parameters:\n", + "* base_estimator - The classifier used to train the model. By default, AdaBoost uses DecisionTreeClassifier. As noted above, other algorithms can be used instead.\n", + "* n_estimators - Number of base_estimator instances to train iteratively.\n", + "* learning_rate - Controls how much each base_estimator contributes to the final combination;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TzLtHzWNJBix" + }, + "source": [ + "## Using different algorithms as base_estimator\n", + "> As noted above, AdaBoost accepts several kinds of base_estimator. 
For example, to use SVM (Support Vector Machines), proceed as follows:\n", + "\n", + "\n", + "```\n", + "# Import AdaBoost and the class used as base_estimator\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.svm import SVC\n", + "\n", + "# Instantiate the base classifier (algorithm)\n", + "ml_SVC= SVC(probability=True, kernel='linear')\n", + "\n", + "# Build the AdaBoost model\n", + "# Note: scikit-learn 1.2+ renames base_estimator to estimator\n", + "ml_AB = AdaBoostClassifier(n_estimators= 50, base_estimator=ml_SVC, learning_rate=1)\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hrj4a4s6hMMB" + }, + "source": [ + "## Advantages\n", + "* AdaBoost is easy to implement;\n", + "* AdaBoost iteratively corrects the errors of the base_estimator and improves accuracy;\n", + "* Performs Feature Selection automatically (**Why**?);\n", + "* Many algorithms can be used as base_estimator;\n", + "* Because it is an ensemble method, the final model is not very prone to overfitting.\n", + "\n", + "## Disadvantages\n", + "* AdaBoost is sensitive to noise in the data;\n", + "* Strongly affected by outliers (which contributes to overfitting), because the algorithm tries to fit every point as well as possible;\n", + "* AdaBoost is slower than XGBoost;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bgJmu7YLiyv7" + }, + "source": [ + "In the following example, I use RandomForestClassifier with the optimized parameters, that is:\n", + "\n", + "```\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5VCRNyZT3qvc" + }, + "source": [ + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1gIboJdriq61" + }, + "source": [ + "from sklearn.ensemble import 
AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate RandomForestClassifier with the optimized parameters\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)\n", + "# Instantiate AdaBoostClassifier with the RandomForest as base_estimator\n", + "ml_AB= AdaBoostClassifier(n_estimators=100, base_estimator= ml_RF2, random_state= i_Seed)\n", + "\n", + "# Train...\n", + "ml_AB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "A4Cs81OLD40y" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_AB, X_treinamento, y_treinamento, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7Ce5L38ECoC" + }, + "source": [ + "**Interpretation**: Our classifier (AdaBoostClassifier) has a mean accuracy of 96.72% (training set). The standard deviation is about 2.54%, which is small. Let's try to improve the classifier's accuracy with parameter tuning (GridSearchCV)."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t5GfnBwEifkO" + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q9rSpuXyEPA5" + }, + "source": [ + "# Make predictions (the base RandomForest uses the optimized parameters)...\n", + "y_pred = ml_AB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2F9k-_eXGDLa" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XweWTjQ9EXLw" + }, + "source": [ + "## Parameter tuning" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fcrKzse9EbL_" + }, + "source": [ + "# Parameter dictionary for the parameter tuning.\n", + "d_parametros_AB = {'n_estimators':[50, 100, 200], 'learning_rate':[.001, 0.01, 0.05, 0.1, 0.3,1]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Susc3I7mFDQX" + }, + "source": [ + "# Call the function\n", + "ml_AB2, best_params= GridSearchOptimizer(ml_AB, 'ml_AB2', d_parametros_AB, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w4JjWsusjNS8" + }, + "source": [ + "___\n", + "# **GRADIENT BOOSTING**\n", + "* Gradient boosting can be used for classification (GradientBoostingClassifier) and regression (GradientBoostingRegressor) problems;\n", + "* Gradient boosting is a refinement of AdaBoost (recall that AdaBoost was one of the first Boosting methods, created in 1995). 
What Gradient Boosting does beyond AdaBoost is explicitly minimize the loss (loss function), i.e., minimize the difference between the observed values of y and the predicted values.\n", + "* It uses Gradient Descent to find the shortcomings in the previous step's predictions. Gradient Descent is a popular and powerful algorithm, also used in Neural Networks;\n", + "* The goal of Gradient Boosting is to minimize the loss function, so its behavior depends on which loss function is chosen.\n", + "* Gradient boosting uses decision trees as its base learners;\n", + "\n", + "## Advantages\n", + "* No pre-processing is needed;\n", + "* Works with both numerical and categorical COLUMNS;\n", + "* Handles Missing Values automatically, so Missing Value Imputation methods are not needed;\n", + "\n", + "## Disadvantages\n", + "* Because Gradient Boosting keeps trying to minimize the errors at each iteration, it can overemphasize outliers and cause overfitting. Therefore, one should:\n", + " * Treat the outliers beforehand OR\n", + " * Use Cross-Validation to neutralize the effect of the outliers (**I prefer this method because it takes less time**);\n", + "* Computationally expensive. 
Many trees (> 1000) are usually needed to get good results;\n", + "* Because of its flexibility (many parameters to adjust), a search such as GridSearchCV is needed to find the best combination of hyperparameters;\n", + "\n", + "## References\n", + "* [Gradient Boosting Decision Tree Algorithm Explained](https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4) - Didactic and detailed.\n", + "* [Predicting Wine Quality with Gradient Boosting Machines](https://towardsdatascience.com/predicting-wine-quality-with-gradient-boosting-machines-a-gmb-tutorial-d950b1542065)\n", + "* [Parameter Tuning in Gradient Boosting (GBM) with Python](https://www.datacareer.de/blog/parameter-tuning-in-gradient-boosting-gbm/)\n", + "* [Tune Learning Rate for Gradient Boosting with XGBoost in Python](https://machinelearningmastery.com/tune-learning-rate-for-gradient-boosting-with-xgboost-in-python/)\n", + "* [In Depth: Parameter tuning for Gradient Boosting](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae) - Very good\n", + "* [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q4bUCZs2jNTA" + }, + "source": [ + "from sklearn.ensemble import GradientBoostingClassifier\n", + "\n", + "# Instantiate...\n", + "ml_GB=GradientBoostingClassifier(n_estimators=100, min_samples_split= 2)\n", + "\n", + "# Train...\n", + "ml_GB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-dr6dyjdXwvd" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_GB, X_treinamento, y_treinamento, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo 
CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VlC3y3M5YaGG" + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vnLvQ0ZDYNjB" + }, + "source": [ + "**Interpretation**: Our classifier (GradientBoostingClassifier) has a mean accuracy of 96.86% (training set). The standard deviation is about 2.52%, which is small. Let's try to improve the classifier's accuracy with parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2n1RKZuXq3D" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_GB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8r6JCzQRGFa0" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KFv-Q2AD5uCk" + }, + "source": [ + "## Parameter tuning\n", + "> See [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/) for details on the parameters, their meaning, and so on."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wgU040AcjNTF" + }, + "source": [ + "# Dicionário de parâmetros para o parameter tunning.\n", + "d_parametros_GB= {'learning_rate': [1, 0.5, 0.25, 0.1, 0.05, 0.01]} #,\n", + "# 'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200],\n", + "# 'max_depth': [5, 10, 15, 20, 25, 30],\n", + "# 'min_samples_split': [0.1, 0.3, 0.5, 0.7, 0.9],\n", + "# 'min_samples_leaf': [0.1, 0.2, 0.3, 0.4, 0.5],\n", + "# 'max_features': list(range(1, X_treinamento.shape[1]))}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v5KLFlpTjNTH" + }, + "source": [ + "# Invoca a função\n", + "ml_GB2, best_params= GridSearchOptimizer(ml_GB, 'ml_GB2', d_parametros_GB, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ6ERz3fi9i2" + }, + "source": [ + "### Resultado da execução do Gradient Boosting" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSa7uKw13mKG" + }, + "source": [ + "```\n", + "[Parallel(n_jobs=-1)]: Done 275400 out of 275400 | elapsed: 93.7min finished\n", + "\n", + "Parametros otimizados: {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wiJpA2PyjDjR" + }, + "source": [ + "# Como o procedimento acima levou 93 minutos para executar, então vou estimar ml_GB2 abaixo usando os parâmetros acima estimados\n", + "best_params= {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "\n", + "#ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + "# max_depth= best_params['max_depth'],\n", + "# max_features= best_params['max_features'],\n", + "# min_samples_leaf= 
best_params['min_samples_leaf'],\n", + "# min_samples_split= best_params['min_samples_split'],\n", + "# n_estimators= best_params['n_estimators'],\n", + "# random_state= i_Seed)\n", + "\n", + "ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + " max_depth= best_params['max_depth'],\n", + " min_samples_leaf= best_params['min_samples_leaf'],\n", + " min_samples_split= best_params['min_samples_split'],\n", + " n_estimators= best_params['n_estimators'],\n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mb14gJ7-jbVM" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TAqGZIFYm2sU" + }, + "source": [ + "X_treinamento_GB, X_teste_GB = seleciona_colunas_relevantes(ml_GB2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6yiu6dahnBvC" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "APrtWN18nc4t" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS0mLdOmnXAY" + }, + "source": [ + "# Train with the relevant COLUMNS\n", + "ml_GB2.fit(X_treinamento_GB, y_treinamento)\n", + "\n", + "# Cross-validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_GB2, X_treinamento_GB, y_treinamento, cv = i_CV)\n", + "print(f'Acurácia Media: {100*a_scores_CV.mean():.2f}')\n", + "print(f'std médio.....: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vmc9PP_Rn1TN" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e3mnIALvnzP2" + }, + "source": [ + "y_pred_GB 
= ml_GB2.predict(X_teste_GB)\n", + "\n", + "# Compute the accuracy\n", + "accuracy_score(y_teste, y_pred_GB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwP9Z2GnkV7r" + }, + "source": [ + "___\n", + "# **XGBOOST (eXtreme Gradient Boosting)**\n", + "* XGBoost is an improvement on Gradient Boosting. The gains are in speed and performance, and it also addresses inefficiencies of GradientBoosting.\n", + "* A favorite algorithm among Kaggle Grandmasters;\n", + "* Parallelizable;\n", + "* State of the art in Machine Learning for tabular data;\n", + "\n", + "## Relevant parameters and their initial values\n", + "See [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) for full details on the parameters and their meaning.\n", + "\n", + "* n_estimators = 100 (100 if the dataframe is large; 1000 if the dataframe is medium/small) - The number of trees we want to build;\n", + "* max_depth = 3 - Determines how deep each tree may grow during any training round. Typical values lie in the range [3, 10];\n", + "* learning_rate = 0.01 - Shrinks the contribution of each tree, which helps avoid overfitting; range: [0, 1];\n", + "* alpha (reg_alpha in XGBClassifier) - L1 regularization on the weights. Higher values result in more regularization. It applies to classification as well as regression;\n", + "* lambda (reg_lambda in XGBClassifier) - L2 regularization on the weights;\n", + "* colsample_bytree: 1 - fraction of COLUMNS used by each tree. A high value may cause overfitting;\n", + "* subsample: 0.8 - fraction of samples used per tree. Lower values help prevent overfitting, but a value that is too low may lead to underfitting;\n", + "* gamma: 1 - Controls whether a given node is split, based on the expected reduction in loss after the split. A higher value leads to fewer splits.\n", + "* objective: Defines the \"loss function\". 
The options include:\n", + "    * reg:linear - For regression problems (deprecated in recent XGBoost releases in favor of reg:squarederror);\n", + "    * reg:logistic - Logistic regression;\n", + "    * binary:logistic - For binary classification problems, with probabilities as output;\n", + "\n", + "# References\n", + "* [How exactly XGBoost Works?](https://medium.com/@pushkarmandot/how-exactly-xgboost-works-a320d9b8aeef)\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Gentle Introduction of XGBoost Library](https://medium.com/@imoisharma/gentle-introduction-of-xgboost-library-2b1ac2669680)\n", + "* [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7)\n", + "* [Exploring XGBoost](https://towardsdatascience.com/exploring-xgboost-4baf9ace0cf6)\n", + "* [Feature Importance and Feature Selection With XGBoost in Python](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/)\n", + "* [Ensemble Learning case study: Running XGBoost on Google Colab free GPU](https://towardsdatascience.com/running-xgboost-on-google-colab-free-gpu-a-case-study-841c90fef101) - Recommended\n", + "* [Predicting movie revenue with AdaBoost, XGBoost and LightGBM](https://towardsdatascience.com/predicting-movie-revenue-with-adaboost-xgboost-and-lightgbm-262eadee6daa)\n", + "* [Tuning XGBoost Hyperparameters with Scikit Optimize](https://towardsdatascience.com/how-to-improve-the-performance-of-xgboost-models-1af3995df8ad)\n", + "* [An Example of Hyperparameter Optimization on XGBoost, LightGBM and CatBoost using Hyperopt](https://towardsdatascience.com/an-example-of-hyperparameter-optimization-on-xgboost-lightgbm-and-catboost-using-hyperopt-12bc41a271e) - Interesting\n", + "* [XGBOOST vs LightGBM: Which algorithm wins the race !!!](https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d) - LightGBM has been 
looking promising.\n", + "* [From Zero to Hero in XGBoost Tuning](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58) - I liked it\n", + "* [Build XGBoost / LightGBM models on large datasets — what are the possible solutions?](https://towardsdatascience.com/build-xgboost-lightgbm-models-on-large-datasets-what-are-the-possible-solutions-bf882da2c27d)\n", + "* [Selecting Optimal Parameters for XGBoost Model Training](https://towardsdatascience.com/selecting-optimal-parameters-for-xgboost-model-training-c7cd9ed5e45e) - Very good!\n", + "* [CatBoost vs. Light GBM vs. XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db)\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iMM_R4_ukV7x" + }, + "source": [ + "from xgboost import XGBClassifier\n", + "import xgboost as xgb\n", + "\n", + "# Instantiate...\n", + "ml_XGB= XGBClassifier(silent=False, \n", + " scale_pos_weight=1,\n", + " learning_rate=0.01, \n", + " colsample_bytree = 1,\n", + " subsample = 0.8,\n", + " objective='binary:logistic', \n", + " n_estimators=1000, \n", + " reg_alpha = 0.3,\n", + " max_depth= 3, \n", + " gamma=1, \n", + " max_delta_step=5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E4wQMlDEFINR" + }, + "source": [ + "# Train...\n", + "ml_XGB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zAhsTtwGqMkG" + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_XGB, X_treinamento, y_treinamento, cv = i_CV)\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JNyKX6PkrXOk" + }, + "source": [ + 
"**Interpretation**: Our classifier (XGBClassifier) has a mean cross-validation accuracy of 96.72% on the training set. In addition, the standard deviation is about 2.02%, which is small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_h0QYv3FkV73" + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AKhhAZLjkV76" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_XGB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ir2Kd1PqGHgz" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jEC7gW4qYpWw" + }, + "source": [ + "## Parameter tuning\n", + "### Further reading:\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)\n", + "\n", + "> Looking at the results above, which model is best?\n", + "\n", + "XGBoost? Assuming so, let's now fine-tune the model's parameters."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n3MsUONPwIV9" + }, + "source": [ + "# Parameter dictionary for XGBoost:\n", + "d_parametros_XGB = {'min_child_weight': [i for i in np.arange(1, 13)]} #,\n", + "# 'gamma': [i for i in np.arange(0, 5, 0.5)],\n", + "# 'subsample': [0.6, 0.8, 1.0],\n", + "# 'colsample_bytree': [0.6, 0.8, 1.0],\n", + "# 'max_depth': [3, 4, 5, 7, 9],\n", + "# 'learning_rate': [i for i in np.arange(0.01, 1, 0.1)]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CX27FCKmwSni" + }, + "source": [ + "# Call the function\n", + "ml_XGB, best_params= GridSearchOptimizer(ml_XGB, 'ml_XGB2', d_parametros_XGB, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9b7uCuF74Hjv" + }, + "source": [ + "### Result of the XGBClassifier run\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 108000 out of 108000 | elapsed: 372.0min finished\n", + "\n", + "Parametros otimizados: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n7E0oyxEtbGi" + }, + "source": [ + "# The search above took about 372 minutes to run, so ml_XGB2 is built below directly from the parameters it found\n", + "best_params= {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "\n", + "ml_XGB2= XGBClassifier(min_child_weight= best_params['min_child_weight'], \n", + " gamma= best_params['gamma'], \n", + " subsample= best_params['subsample'], \n", + " colsample_bytree= best_params['colsample_bytree'], \n", + " max_depth= best_params['max_depth'], \n", + " learning_rate= best_params['learning_rate'], \n", + " random_state= i_Seed)" + ], + "execution_count": 
null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CuqyLHTU5Z-j" + }, + "source": [ + "## Select the important/relevant COLUMNS\n", + "* [The Multiple faces of ‘Feature importance’ in XGBoost](https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QPG3JZIpRZ-T" + }, + "source": [ + "# plot feature importance\n", + "from xgboost import plot_importance\n", + "\n", + "xgb.plot_importance(ml_XGB2, color = 'red')\n", + "plt.title('importance', fontsize = 20)\n", + "plt.yticks(fontsize = 10)\n", + "plt.ylabel('features', fontsize = 20)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EmpRC2lHW-KP" + }, + "source": [ + "ml_XGB2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4f9MIEBiyq-5" + }, + "source": [ + "X_treinamento_XGB, X_teste_XGB= seleciona_colunas_relevantes(ml_XGB2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6EayWaY5nMm" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Huy18gKI5qad" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E3-PaTdc5vZk" + }, + "source": [ + "# Train with the relevant COLUMNS...\n", + "ml_XGB2.fit(X_treinamento_XGB, y_treinamento)\n", + "\n", + "# Cross-validation with 10 folds\n", + "a_scores_CV = cross_val_score(ml_XGB2, X_treinamento_XGB, y_treinamento, cv = i_CV)\n", + "print(f'Acurácia Media: {100*a_scores_CV.mean():.2f}')\n", + "print(f'std médio.....: {100*a_scores_CV.std():.2f}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"tBdYikDU6NhD" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GcvY-VdL6VIZ" + }, + "source": [ + "y_pred_XGB = ml_XGB2.predict(X_teste_XGB)\n", + "\n", + "# Compute the accuracy\n", + "accuracy_score(y_teste, y_pred_XGB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8oLtdH-vTSbC" + }, + "source": [ + "xgb.to_graphviz(ml_XGB2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "czXQG3MCHfHM" + }, + "source": [ + "# KNN - KNEIGHBORSCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "llTTXNeyHiwx" + }, + "source": [ + "# BAGGINGCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fbkekd4QHoZO" + }, + "source": [ + "# EXTRATREESCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "widavwR4HzwE" + }, + "source": [ + "# SVM\n", + "https://data-flair.training/blogs/svm-support-vector-machine-tutorial/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "id_Ubulns6We" + }, + "source": [ + "# NAIVE BAYES" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3e0m7lEnYOV9" + }, + "source": [ + "# **COLUMN IMPORTANCE**\n", + "Source: [Plotting Feature Importances](https://www.kaggle.com/grfiv4/plotting-feature-importances)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fjco0HnNYr-N" + }, + "source": [ + "def mostra_feature_importances(clf, X_treinamento, y_treinamento=None, \n", + " top_n=10, figsize=(8,8), print_table=False, title=\"Feature Importances\"):\n", + " '''\n", + " plot feature importances of a tree-based sklearn estimator\n", + " \n", + " Note: X_treinamento and y_treinamento are pandas DataFrames\n", + " \n", + " Note: Scikit-plot is a lovely package but I sometimes have issues\n", + " 1. flexibility/extendibility\n", + " 2. 
complicated models/datasets\n", + " But for many situations Scikit-plot is the way to go\n", + " see https://scikit-plot.readthedocs.io/en/latest/Quickstart.html\n", + " \n", + " Parameters\n", + " ----------\n", + " clf (sklearn estimator) if not fitted, this routine will fit it\n", + " \n", + " X_treinamento (pandas DataFrame)\n", + " \n", + " y_treinamento (pandas DataFrame) optional\n", + " required only if clf has not already been fitted \n", + " \n", + " top_n (int) Plot the top_n most-important features\n", + " Default: 10\n", + " \n", + " figsize ((int,int)) The physical size of the plot\n", + " Default: (8,8)\n", + " \n", + " print_table (boolean) If True, print out the table of feature importances\n", + " Default: False\n", + " \n", + " Returns\n", + " -------\n", + " the pandas dataframe with the features and their importance\n", + " \n", + " Author\n", + " ------\n", + " George Fisher\n", + " '''\n", + " \n", + " __name__ = \"mostra_feature_importances\"\n", + " \n", + " import pandas as pd\n", + " import numpy as np\n", + " import matplotlib.pyplot as plt\n", + " \n", + " from xgboost.core import XGBoostError\n", + " from lightgbm.sklearn import LightGBMError\n", + " \n", + " try: \n", + " if not hasattr(clf, 'feature_importances_'):\n", + " clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n", + "\n", + " if not hasattr(clf, 'feature_importances_'):\n", + " raise AttributeError(\"{} does not have feature_importances_ attribute\".\n", + " format(clf.__class__.__name__))\n", + " \n", + " except (XGBoostError, LightGBMError, ValueError):\n", + " clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n", + " \n", + " feat_imp = pd.DataFrame({'importance':clf.feature_importances_}) \n", + " feat_imp['feature'] = X_treinamento.columns\n", + " feat_imp.sort_values(by ='importance', ascending = False, inplace = True)\n", + " feat_imp = feat_imp.iloc[:top_n]\n", + " \n", + " feat_imp.sort_values(by='importance', inplace = True)\n", + " feat_imp 
= feat_imp.set_index('feature', drop = True)\n", + " feat_imp.plot.barh(title=title, figsize=figsize)\n", + " plt.xlabel('Feature Importance Score')\n", + " plt.show()\n", + " \n", + " if print_table:\n", + " from IPython.display import display\n", + " print(\"Top {} features in descending order of importance\".format(top_n))\n", + " display(feat_imp.sort_values(by = 'importance', ascending = False))\n", + " \n", + " return feat_imp" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ycu_EIGlYUYn" + }, + "source": [ + "import pandas as pd\n", + "\n", + "from xgboost import XGBClassifier\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", + "from sklearn.tree import ExtraTreeClassifier\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import GradientBoostingClassifier\n", + "from sklearn.ensemble import BaggingClassifier\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "from lightgbm import LGBMClassifier\n", + "\n", + "clfs = [XGBClassifier(), LGBMClassifier(), \n", + " ExtraTreesClassifier(), ExtraTreeClassifier(),\n", + " BaggingClassifier(), DecisionTreeClassifier(),\n", + " GradientBoostingClassifier(), LogisticRegression(),\n", + " AdaBoostClassifier(), RandomForestClassifier()]\n", + "\n", + "for clf in clfs:\n", + " try:\n", + " _ = mostra_feature_importances(clf, X_treinamento, y_treinamento, top_n=X_treinamento.shape[1], title=clf.__class__.__name__)\n", + " except AttributeError as e:\n", + " print(e)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EwWkjfC8KEZH" + }, + "source": [ + "# ENSEMBLE METHODS\n", + "https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b\n", + "\n", + 
"![Ensemble](https://github.com/MathMachado/Materials/blob/master/Ensemble.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Uf1RML7xETY" + }, + "source": [ + "# WOE e IV\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TBNRfYZCyhMP" + }, + "source": [ + "## Construção do exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gIIroyyP4ZRZ" + }, + "source": [ + "df_y.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PzQQdrkf1ohX" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo']= choices(['A', 'B', 'C', 'D'], k= 1000)\n", + "df_X2['idade']= np.random.randint(10, 15, size= 1000)\n", + "df_X2['target']= df_y['target']\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v-OpwIpx4hXJ" + }, + "source": [ + "df_X2['target'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yZfqSvbKzeJ3" + }, + "source": [ + "def Constroi_Buckets(df, i, k= 10):\n", + " coluna= 'v'+ str(i)\n", + " df[coluna+'_Bucket']= pd.cut(df[coluna], bins= k, labels= np.arange(1, k+1))\n", + " df= df.drop(columns= [coluna], axis= 1)\n", + " return df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "V6Nrpsx60HD3" + }, + "source": [ + "for i in np.arange(1,19):\n", + " df_X2= Constroi_Buckets(df_X2, i)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J2Fbh41-03OB" + }, + "source": [ + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "O9r5BeWVxIr3" + }, + "source": [ + "# Função para calcular WOE e IV\n", + "def calculate_woe_iv(dataset, feature, target):\n", + "\n", + " def codethem(IV):\n", + " if IV < 
0.02: return 'Useless'\n", + " elif IV >= 0.02 and IV < 0.1: return 'Weak'\n", + " elif IV >= 0.1 and IV < 0.3: return 'Medium'\n", + " elif IV >= 0.3 and IV < 0.5: return 'Strong'\n", + " elif IV >= 0.5: return 'Suspicious'\n", + " else: return 'None'\n", + "\n", + " lst = []\n", + " for i in range(dataset[feature].nunique()):\n", + " val = list(dataset[feature].unique())[i]\n", + " lst.append({\n", + " 'Value': val,\n", + " 'All': dataset[dataset[feature] == val].count()[feature],\n", + " 'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],\n", + " 'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]\n", + " })\n", + " \n", + " dset = pd.DataFrame(lst)\n", + " dset['Distr_Good'] = dset['Good']/dset['Good'].sum()\n", + " dset['Distr_Bad'] = dset['Bad']/dset['Bad'].sum()\n", + " dset['Mean']= dset['All']/dset['All'].sum()\n", + " dset['WoE'] = np.log(dset['Distr_Good']/dset['Distr_Bad'])\n", + " dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})\n", + " dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']\n", + " #dset= dset.drop(columns= ['Distr_Good', 'Distr_Bad'], axis= 1)\n", + "\n", + " dset['Predictive_Power']= dset['IV'].map(codethem)\n", + " iv = dset['IV'].sum() \n", + " dset = dset.sort_values(by='IV') \n", + " return dset, iv" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y8WGjWH63nx_" + }, + "source": [ + "df_Lab = df_X2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-N6xr1MgxTiz" + }, + "source": [ + "def calcula_Predictive_Power(df_Lab, coluna):\n", + " print('WoE and IV for column: {}'.format(coluna))\n", + " df, iv = calculate_woe_iv(df_Lab, coluna, 'target')\n", + " print(df)\n", + " print('IV score: {:.2f}'.format(iv))\n", + " print('\\n')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"ayqN_7WnxVq9" + }, + "source": [ + "for i in np.arange(1,19):\n", + " coluna= 'v'+str(i)+'_Bucket'\n", + " calcula_Predictive_Power(df_Lab, coluna)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtoJVI4Pyx3I" + }, + "source": [ + "# **IMBALANCED SAMPLE**\n", + "> Tasks such as detecting fraud in bank transactions or detecting network intrusions have one thing in common: the class of interest (what we want to detect) is usually a rare event.\n", + "\n", + "## Example: Detecting fraud\n", + "The ratio of fraud to NON-FRAUD is roughly 1%/99%. In this case, if we build a model to detect fraud and the model classifies every instance as NON-FRAUD, the model will have an accuracy of 99%. However, such a model is of no help at all, because it never detects a single fraud.\n", + "\n", + "## The need for other metrics \n", + "> It is recommended to use other metrics (in fact, it is good practice to use more than one metric to measure model performance) such as F1-Score, Precision, Recall/Sensitivity, Specificity and AUROC.\n", + "\n", + "## How to deal with the imbalanced sample?\n", + "* Under-sampling\n", + "> Randomly sample the MAJORITY class (in our example, NON-FRAUD) down to the number of instances of the MINORITY class (FRAUD);\n", + "\n", + "* Over-sampling\n", + "> Randomly resample the MINORITY class (in our example, FRAUD) up to the number of instances of the MAJORITY class (NON-FRAUD), or to a proportion of the MAJORITY class. 
See SMOTE (Synthetic Minority Over-sampling Technique), implemented in the imbalanced-learn library;\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2o45zx8zw-aB" + }, + "source": [ + "## EFFECTS OF THE IMBALANCED SAMPLE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCVTPCB-Xkbd" + }, + "source": [ + "# TPOT\n", + "https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2ulXii6JXpWd" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_TWUq-z4X4yZ" + }, + "source": [ + "___\n", + "# FEATURETOOLS\n", + "https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183\n", + "\n", + "https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/\n", + "\n", + "https://mlwhiz.com/blog/2019/05/19/feature_extraction/\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aZUNOgmSgAmq" + }, + "source": [ + "!pip install featuretools" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_sxdONzsh9rb" + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "p5_ynGo1dBJJ" + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TqJRJXUhiDqf" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo'] = choices(['A', 'B', 'C', 'D'], k = 1000)\n", + "df_X2['idade'] = np.random.randint(10, 15, size = 1000)\n", + "df_X2['id'] = range(0,1000)\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nR56bGGngk-W" + }, + "source": [ + "# Automated 
feature engineering\n", + "import featuretools as ft\n", + "import featuretools.variable_types as vtypes\n", + "\n", + "es= ft.EntitySet(id = 'simulacao')\n", + "\n", + "# adding a dataframe \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id')\n", + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IOJ4Tr5Ogk6M" + }, + "source": [ + "es['df_X2'].variables" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1uXPqHDZgkys" + }, + "source": [ + "variable_types = {'idade': vtypes.Categorical}\n", + " \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id', variable_types= variable_types)\n", + "\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'tipo', index='id')\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'idade', index='id')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dnbYTBqugkvm" + }, + "source": [ + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I2v_jetdgkr7" + }, + "source": [ + "feature_matrix, feature_names = ft.dfs(entityset=es, target_entity = 'df_X2', max_depth = 3, verbose = 3, n_jobs= 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zZiRBvHXgkoJ" + }, + "source": [ + "feature_matrix.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aWiahwKe2d6U" + }, + "source": [ + "# **EXERCISES**\n", + "> Find suitable algorithms to apply to the following problems:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XbSLkbDB2mzK" + }, + "source": [ + "## Exercise 1 - Credit Card Fraud Detection\n", + "Source: [Credit Card Fraud 
Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)\n", + "\n", + "### Supporting reading\n", + "* [Detecting Credit Card Fraud Using Machine Learning](https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8)\n", + "* [Credit Card Fraud Detection](https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59)\n", + "\n", + "### Dataframe\n", + "* [Creditcard.csv](https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JYVM3StS-g0E" + }, + "source": [ + "### Import the required libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dyliPChh-jPk" + }, + "source": [ + "from sklearn.metrics import accuracy_score # to measure the accuracy of the predictive model\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix # to build the confusion matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV # to tune the parameters of the predictive models\n", + "from sklearn.model_selection import cross_val_score\n", + "from time import time\n", + "from operator import itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from sklearn.tree import export_graphviz\n", + "from io import StringIO # sklearn.externals.six was removed from recent scikit-learn releases\n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": 114, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lAl9ZwP_0-d0" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv'\n", + "df_cc = pd.read_csv(url)" + ], + "execution_count": 115, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "w6lN8FjJ12VU", + "outputId": "f7a75b8d-f178-442f-e00b-4ed17485cd80", + "colab": { + 
"base_uri": "https://localhost:8080/", + "height": 379 + } + }, + "source": [ + "df_cc.head(10)" + ], + "execution_count": 116, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TimeV1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28AmountClass
00-1.359807-0.0727812.5363471.378155-0.3383210.4623880.2395990.0986980.3637870.090794-0.551600-0.617801-0.991390-0.3111691.468177-0.4704010.2079710.0257910.4039930.251412-0.0183070.277838-0.1104740.0669280.128539-0.1891150.133558-0.021053149.620.0
101.1918570.2661510.1664800.4481540.060018-0.082361-0.0788030.085102-0.255425-0.1669741.6127271.0652350.489095-0.1437720.6355580.463917-0.114805-0.183361-0.145783-0.069083-0.225775-0.6386720.101288-0.3398460.1671700.125895-0.0089830.0147242.690.0
21-1.358354-1.3401631.7732090.379780-0.5031981.8004990.7914610.247676-1.5146540.2076430.6245010.0660840.717293-0.1659462.345865-2.8900831.109969-0.121359-2.2618570.5249800.2479980.7716790.909412-0.689281-0.327642-0.139097-0.055353-0.059752378.660.0
31-0.966272-0.1852261.792993-0.863291-0.0103091.2472030.2376090.377436-1.387024-0.054952-0.2264870.1782280.507757-0.287924-0.631418-1.059647-0.6840931.965775-1.232622-0.208038-0.1083000.005274-0.190321-1.1755750.647376-0.2219290.0627230.061458123.500.0
42-1.1582330.8777371.5487180.403034-0.4071930.0959210.592941-0.2705330.8177390.753074-0.8228430.5381961.345852-1.1196700.175121-0.451449-0.237033-0.0381950.8034870.408542-0.0094310.798278-0.1374580.141267-0.2060100.5022920.2194220.21515369.990.0
52-0.4259660.9605231.141109-0.1682520.420987-0.0297280.4762010.260314-0.568671-0.3714071.3412620.359894-0.358091-0.1371340.5176170.401726-0.0581330.068653-0.0331940.084968-0.208254-0.559825-0.026398-0.371427-0.2327940.1059150.2538440.0810803.670.0
641.2296580.1410040.0453711.2026130.1918810.272708-0.0051590.0812130.464960-0.099254-1.416907-0.153826-0.7510630.1673720.050144-0.4435870.002821-0.611987-0.045575-0.219633-0.167716-0.270710-0.154104-0.7800550.750137-0.2572370.0345070.0051684.990.0
77-0.6442691.4179641.074380-0.4921990.9489340.4281181.120631-3.8078640.6153751.249376-0.6194680.2914741.757964-1.3238650.686133-0.076127-1.222127-0.3582220.324505-0.1567421.943465-1.0154550.057504-0.649709-0.415267-0.051634-1.206921-1.08533940.800.0
87-0.8942860.286157-0.113192-0.2715262.6695993.7218180.3701450.851084-0.392048-0.410430-0.705117-0.110452-0.2862540.074355-0.328783-0.210077-0.4997680.1187650.5703280.052736-0.073425-0.268092-0.2042331.0115920.373205-0.3841570.0117470.14240493.200.0
99-0.3382621.1195931.044367-0.2221870.499361-0.2467610.6515830.069539-0.736727-0.3668461.0176140.8363901.006844-0.4435230.1502190.739453-0.5409800.4766770.4517730.203711-0.246914-0.633753-0.120794-0.385050-0.0697330.0941990.2462190.0830763.680.0
\n", + "
" + ], + "text/plain": [ + " Time V1 V2 V3 ... V27 V28 Amount Class\n", + "0 0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0.0\n", + "1 0 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0.0\n", + "2 1 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0.0\n", + "3 1 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0.0\n", + "4 2 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0.0\n", + "5 2 -0.425966 0.960523 1.141109 ... 0.253844 0.081080 3.67 0.0\n", + "6 4 1.229658 0.141004 0.045371 ... 0.034507 0.005168 4.99 0.0\n", + "7 7 -0.644269 1.417964 1.074380 ... -1.206921 -1.085339 40.80 0.0\n", + "8 7 -0.894286 0.286157 -0.113192 ... 0.011747 0.142404 93.20 0.0\n", + "9 9 -0.338262 1.119593 1.044367 ... 0.246219 0.083076 3.68 0.0\n", + "\n", + "[10 rows x 31 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 116 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M47GS1cK2NdD", + "outputId": "e402576c-cee8-4d09-b776-04fdddf9c2b0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_cc.shape" + ], + "execution_count": 117, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(12842, 31)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 117 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b2QBZFbR3W_q", + "outputId": "5a8ed763-f794-4bd8-b846-63d36101f9e8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "df_cc['Class'].value_counts()" + ], + "execution_count": 118, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.0 12785\n", + "1.0 56\n", + "Name: Class, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 118 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pzjW3Bf_3h7J", + "outputId": "4949014a-3f1a-4565-d7b0-213362ef9156", + "colab": { + "base_uri": 
"https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "56/12842" + ], + "execution_count": 119, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.004360691481077714" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 119 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9bWDX9H12k5g" + }, + "source": [ + "### Drop NaN" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "27ob8tRR21TB", + "outputId": "ea900ea1-79f6-413e-8cb6-ae0492293547", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 561 + } + }, + "source": [ + "df_cc.isna().sum()" + ], + "execution_count": 120, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Time 0\n", + "V1 0\n", + "V2 0\n", + "V3 0\n", + "V4 0\n", + "V5 0\n", + "V6 0\n", + "V7 0\n", + "V8 0\n", + "V9 0\n", + "V10 1\n", + "V11 1\n", + "V12 1\n", + "V13 1\n", + "V14 1\n", + "V15 1\n", + "V16 1\n", + "V17 1\n", + "V18 1\n", + "V19 1\n", + "V20 1\n", + "V21 1\n", + "V22 1\n", + "V23 1\n", + "V24 1\n", + "V25 1\n", + "V26 1\n", + "V27 1\n", + "V28 1\n", + "Amount 1\n", + "Class 1\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 120 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X9k16WLI49JI", + "outputId": "3de3f96c-5f76-4763-ced1-953c0646612e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_cc2 = df_cc.copy()\n", + "df_cc2 = df_cc.dropna()\n", + "df_cc2.shape" + ], + "execution_count": 71, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(12841, 31)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 71 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OY-DYRKg34ZX" + }, + "source": [ + "### Definir as variáveis globais" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KVhHgV_s3_5f" + }, + "source": [ + "i_CV = 10 # 
Número de Cross-Validations\n", + "i_Seed = 20111974 # semente por questões de reproducibilidade\n", + "f_Test_Size = 0.3 # Proporção do dataframe de validação (outros valores poderiam ser 0.15, 0.20 ou 0.25)" + ], + "execution_count": 121, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wKbqrF4Q2nBq" + }, + "source": [ + "### Define amostras de treinamento e teste" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N8CUAiA57OhS", + "outputId": "9e6169f9-9930-4c60-df3f-34e831469da8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + } + }, + "source": [ + "df_cc.head()" + ], + "execution_count": 122, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TimeV1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28AmountClass
00-1.359807-0.0727812.5363471.378155-0.3383210.4623880.2395990.0986980.3637870.090794-0.551600-0.617801-0.991390-0.3111691.468177-0.4704010.2079710.0257910.4039930.251412-0.0183070.277838-0.1104740.0669280.128539-0.1891150.133558-0.021053149.620.0
101.1918570.2661510.1664800.4481540.060018-0.082361-0.0788030.085102-0.255425-0.1669741.6127271.0652350.489095-0.1437720.6355580.463917-0.114805-0.183361-0.145783-0.069083-0.225775-0.6386720.101288-0.3398460.1671700.125895-0.0089830.0147242.690.0
21-1.358354-1.3401631.7732090.379780-0.5031981.8004990.7914610.247676-1.5146540.2076430.6245010.0660840.717293-0.1659462.345865-2.8900831.109969-0.121359-2.2618570.5249800.2479980.7716790.909412-0.689281-0.327642-0.139097-0.055353-0.059752378.660.0
31-0.966272-0.1852261.792993-0.863291-0.0103091.2472030.2376090.377436-1.387024-0.054952-0.2264870.1782280.507757-0.287924-0.631418-1.059647-0.6840931.965775-1.232622-0.208038-0.1083000.005274-0.190321-1.1755750.647376-0.2219290.0627230.061458123.500.0
42-1.1582330.8777371.5487180.403034-0.4071930.0959210.592941-0.2705330.8177390.753074-0.8228430.5381961.345852-1.1196700.175121-0.451449-0.237033-0.0381950.8034870.408542-0.0094310.798278-0.1374580.141267-0.2060100.5022920.2194220.21515369.990.0
\n", + "
" + ], + "text/plain": [ + " Time V1 V2 V3 ... V27 V28 Amount Class\n", + "0 0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0.0\n", + "1 0 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0.0\n", + "2 1 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0.0\n", + "3 1 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0.0\n", + "4 2 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0.0\n", + "\n", + "[5 rows x 31 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 122 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LZjNUDNb7s1t", + "outputId": "c0ec58c1-72b6-4621-b96a-f1987534409d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + } + }, + "source": [ + "# Definição do dataframe contendo as variáveis preditoras:\n", + "df_X = df_cc2.copy()\n", + "df_X.drop(columns= ['Class'], axis = 1, inplace = True)\n", + "df_X.head()" + ], + "execution_count": 123, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TimeV1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28Amount
00-1.359807-0.0727812.5363471.378155-0.3383210.4623880.2395990.0986980.3637870.090794-0.551600-0.617801-0.991390-0.3111691.468177-0.4704010.2079710.0257910.4039930.251412-0.0183070.277838-0.1104740.0669280.128539-0.1891150.133558-0.021053149.62
101.1918570.2661510.1664800.4481540.060018-0.082361-0.0788030.085102-0.255425-0.1669741.6127271.0652350.489095-0.1437720.6355580.463917-0.114805-0.183361-0.145783-0.069083-0.225775-0.6386720.101288-0.3398460.1671700.125895-0.0089830.0147242.69
21-1.358354-1.3401631.7732090.379780-0.5031981.8004990.7914610.247676-1.5146540.2076430.6245010.0660840.717293-0.1659462.345865-2.8900831.109969-0.121359-2.2618570.5249800.2479980.7716790.909412-0.689281-0.327642-0.139097-0.055353-0.059752378.66
31-0.966272-0.1852261.792993-0.863291-0.0103091.2472030.2376090.377436-1.387024-0.054952-0.2264870.1782280.507757-0.287924-0.631418-1.059647-0.6840931.965775-1.232622-0.208038-0.1083000.005274-0.190321-1.1755750.647376-0.2219290.0627230.061458123.50
42-1.1582330.8777371.5487180.403034-0.4071930.0959210.592941-0.2705330.8177390.753074-0.8228430.5381961.345852-1.1196700.175121-0.451449-0.237033-0.0381950.8034870.408542-0.0094310.798278-0.1374580.141267-0.2060100.5022920.2194220.21515369.99
\n", + "
" + ], + "text/plain": [ + " Time V1 V2 V3 ... V26 V27 V28 Amount\n", + "0 0 -1.359807 -0.072781 2.536347 ... -0.189115 0.133558 -0.021053 149.62\n", + "1 0 1.191857 0.266151 0.166480 ... 0.125895 -0.008983 0.014724 2.69\n", + "2 1 -1.358354 -1.340163 1.773209 ... -0.139097 -0.055353 -0.059752 378.66\n", + "3 1 -0.966272 -0.185226 1.792993 ... -0.221929 0.062723 0.061458 123.50\n", + "4 2 -1.158233 0.877737 1.548718 ... 0.502292 0.219422 0.215153 69.99\n", + "\n", + "[5 rows x 30 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 123 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "d3DDsN2V7IOU", + "outputId": "ac1843d4-50bb-49f6-b434-eb400ade1c38", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "df_y = df_cc2['Class'] # Variável-resposta\n", + "df_y.head()" + ], + "execution_count": 131, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 0.0\n", + "1 0.0\n", + "2 0.0\n", + "3 0.0\n", + "4 0.0\n", + "Name: Class, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 131 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aMthdXHD8vnh", + "outputId": "f2aec987-220f-4c58-9a6d-60f2c21a86ab", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_y.shape" + ], + "execution_count": 132, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(12841,)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 132 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EiJRftpZ2103" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": 125, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TmSkPzNt8O6I", + 
"outputId": "76a362af-dbb3-4e48-e1a0-968c429e555a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": 126, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(8988, 30)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 126 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9h1PjPKh8Xb1", + "outputId": "44f90b9c-0968-40f8-bb7c-43f7bd93a8c3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_teste.shape" + ], + "execution_count": 127, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(3853, 30)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 127 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NbCN_puI2qk1" + }, + "source": [ + "### Ajusta o modelo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hjRwSI079ADn" + }, + "source": [ + "# Importar o classificador (modelo, algoritmo, ...)\n", + "from sklearn.tree import DecisionTreeClassifier # Este é o nosso classificador" + ], + "execution_count": 128, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HuhKJOQA22bR", + "outputId": "2f648705-8bc5-4965-83d4-71005c80cd5c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "ml_DT = DecisionTreeClassifier(max_depth = 5, min_samples_split = 2, random_state = i_Seed)\n", + "ml_DT" + ], + "execution_count": 129, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=5, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort='deprecated',\n", + " random_state=20111974, 
splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 129 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zai1d6eM93VQ", + "outputId": "97e92d12-b2b5-4560-ebf8-0cd5b40567f3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "# Treinar o algoritmo/classificador: fit(df)\n", + "ml_DT.fit(X_treinamento, y_treinamento)" + ], + "execution_count": 130, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=5, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort='deprecated',\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 130 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ybbS4zHn-8BO", + "outputId": "b9c3cef3-0748-407e-8213-9f04692257b0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-Validation com 10 folds\n", + "a_scores_CV = cross_val_score(ml_DT, X_treinamento, y_treinamento, cv = i_CV)\n", + "\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(), 4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(), 4)}')" + ], + "execution_count": 133, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 99.9\n", + "std médio das Acurácias calculadas pelo CV: 0.09\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r_NLku7q_YT9", + "outputId": "cb2666bb-dee9-4640-a7fa-bdae648df9b8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_scores_CV # array com os scores a cada iteração 
do CV" + ], + "execution_count": 134, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1. , 0.99888765, 0.99777531, 0.99777531, 1. ,\n", + " 0.99888765, 1. , 0.99888765, 0.99777283, 1. ])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 134 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bCRgHxUu2s7c" + }, + "source": [ + "### Cross-Validation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2wMWm-p5229A" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Am_UELOg2vDh" + }, + "source": [ + "### Fine tuning dos parâmetros" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lF9mxe7y23hr" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bG31I7_n4RQg" + }, + "source": [ + "### Aplicar as transformações (principais) estudadas e reestimar o modelo novamente\n", + "* Qual o impacto das transformações?\n", + "* A conclusão muda/mudou?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYgK6JXd3MgA" + }, + "source": [ + "## Exercício 2 - Predicting species on IRIS dataset\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "si0rsJvu3O6O" + }, + "source": [ + "from sklearn import datasets\n", + "import xgboost as xgb\n", + "\n", + "iris = datasets.load_iris()\n", + "X_iris = iris.data\n", + "y_iris = iris.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zom8t4yWC_UC" + }, + "source": [ + "## Exercício 3 - Predict Wine Quality\n", + "> Estimar a qualidade dos vinhos, numa scala de 0–100. 
A seguir, a qualidade em função da escala:\n", + "\n", + "* 95–100 Classic: a great wine\n", + "* 90–94 Outstanding: a wine of superior character and style\n", + "* 85–89 Very good: a wine with special qualities\n", + "* 80–84 Good: a solid, well-made wine\n", + "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n", + "* 50–74 Not recommended\n", + "\n", + "Source: [Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klL2Q9Ria96n" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Wine = datasets.load_wine()\n", + "X_vinho = Wine.data\n", + "y_vinho = Wine.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lhVhSWBgGijq" + }, + "source": [ + "## Exercício 4 - Predict Parkinson\n", + "Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SVCxHqv0VBJn" + }, + "source": [ + "## Exercício 5 - Predict survivors from Titanic tragedy\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwvB8us4eKNi" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "\n", + "df_titanic = sns.load_dataset('titanic')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZJrT9YIXVdtx" + }, + "source": [ + "## Exercício 6 - Predict Loan\n", + "> Os dados devem ser obtidos diretamente da fonte: [Loan Default Prediction - Imperial College London](https://www.kaggle.com/c/loan-default-prediction/data)\n", + "\n", + "* [Bank Loan Default Prediction](https://medium.com/@wutianhao910/bank-loan-default-prediction-94d4902db740)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R8-GVu7ZWeA8" + }, + "source": [ + "## Exercício 7 - Predict the sales of a store.\n", + "* 
[Predicting expected sales for Bigmart’s stores](https://medium.com/diogo-menezes-borges/project-1-bigmart-sale-prediction-fdc04f07dc1e)\n", + "* Dataframes\n", + " * [Treinamento](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_train.txt)\n", + " * [Validação](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_test.txt)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fv9w86j4Wnwj" + }, + "source": [ + "## Exercício 8 - [The Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)\n", + "> Predict the median value of owner occupied homes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5HYRt8-ug1BT" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Boston = datasets.load_boston()\n", + "X_boston = Boston.data\n", + "y_boston = Boston.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UDIaqmtXQ0T" + }, + "source": [ + "## Exercício 9 - Predict the height or weight of a person.\n", + "\n", + "http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-7R146nIXmMT" + }, + "source": [ + "## Exercício 10 - Black Friday Sales Prediction - Predict purchase amount.\n", + "\n", + "This dataset comprises of sales transactions captured at a retail store. It’s a classic dataset to explore and expand your feature engineering skills and day to day understanding from multiple shopping experiences. This is a regression problem. 
The dataset has 550,069 rows and 12 columns.\n", + "\n", + "https://github.com/MathMachado/DataFrames/blob/master/blackfriday.zip\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mQ8FPbuLZlIh" + }, + "source": [ + "## Exercício 11 - Predict the income class of US population.\n", + "\n", + "http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Af4NRrchgPlM" + }, + "source": [ + "## Exercício 12 - Predicting Cancer\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c4LOlgZW3P40" + }, + "source": [ + "from sklearn import datasets\n", + "cancer = datasets.load_breast_cancer()\n", + "X_cancer = cancer.data\n", + "y_cancer = cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74PmpT8Ix0tD" + }, + "source": [ + "## Exercício 13\n", + "Source: [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WY8GZMixZ9W9" + }, + "source": [ + "## Exercício 14 - Predict Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y92t6tbOge0S" + }, + "source": [ + "from sklearn import datasets\n", + "Diabetes= datasets.load_diabetes()\n", + "\n", + "X_diabetes = Diabetes.data\n", + "y_diabetes = Diabetes.target" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15_00__Machine_Learning___DSWP_hs4.ipynb b/Notebooks/NB15_00__Machine_Learning___DSWP_hs4.ipynb new file mode 100644 index 000000000..1025a0945 --- /dev/null +++ b/Notebooks/NB15_00__Machine_Learning___DSWP_hs4.ipynb @@ -0,0 +1,8583 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "name": "python3", + 
"display_name": "Python 3" + }, + "colab": { + "name": "NB15_00__Machine_Learning.ipynb", + "provenance": [], + "include_colab_link": true + }, + "accelerator": "TPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ShVXyGj9wkgN" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYQ4cDfcPu4e" + }, + "source": [ + "___\n", + "# **NOTAS E OBSERVAÇÕES**\n", + "* Abordar o impacto do desbalanceamento da amostra;\n", + "* Colocar AUROC no material e mostrar o cut off para classificação entre 0 e 1;\n", + "* Conceitos estatísticos de bias & variance;\n", + "* Ver Sklearn.optimize: https://web.telegram.org/#/im?p=g497957288;\n", + "* Construir a package para conter todas as funções definidas e colocar estas funções na package --> Manutenção rápida, fácil e centralizada! Desta forma, o tópico (\"Funções usadas neste tutorial\") vai totalmente para o package." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YvhLC_uf4_G" + }, + "source": [ + "___\n", + "# **AGENDA**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QgX6n2VDyY1O" + }, + "source": [ + "___\n", + "# **REFERÊNCIAS**\n", + "* [scikit-learn - Machine Learning With Python](https://scikit-learn.org/stable/);\n", + "* [An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples](https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer)\n", + "* [The Difference Between Artificial Intelligence, Machine Learning, and Deep Learning](https://medium.com/iotforall/the-difference-between-artificial-intelligence-machine-learning-and-deep-learning-3aa67bff5991)\n", + "* [A Gentle Guide to Machine Learning](https://blog.monkeylearn.com/a-gentle-guide-to-machine-learning/)\n", + "* [A Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n", + "* [Introduction to Machine Learning](http://alex.smola.org/drafts/thebook.pdf)\n", + "* [The 10 Statistical Techniques Data Scientists Need to Master](https://medium.com/cracking-the-data-science-interview/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7)\n", + "* [Tune: a library for fast hyperparameter tuning at any 
scale](https://towardsdatascience.com/fast-hyperparameter-tuning-at-scale-d428223b081c)\n", + "* [How to lie with Data Science](https://towardsdatascience.com/how-to-lie-with-data-science-5090f3891d9c)\n", + "* [5 Reasons “Logistic Regression” should be the first thing you learn when becoming a Data Scientist](https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4)\n", + "* [Machine learning on categorical variables](https://towardsdatascience.com/machine-learning-on-categorical-variables-3b76ffe4a7cb)\n", + "\n", + "## Deep Learning & Neural Networks\n", + "\n", + "- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html)\n", + "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721)\n", + "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsCbZd2epfxo" + }, + "source": [ + "___\n", + "# **INTRODUCTION**\n", + "\n", + "* \"__Information is the oil of the 21st century, and analytics is the combustion engine__.\" - Peter Sondergaard, SVP, Gartner Research;\n", + "\n", + "\n", + ">This chapter focuses on:\n", + "* Linear and Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, and XGBoost algorithms for building Machine Learning models;\n", + "* Understanding how to solve classification and regression problems;\n", + "* Applying Ensemble techniques such as Bagging and Boosting;\n", + "* Measuring the accuracy of Machine Learning models;\n", + "* Learning the main Machine Learning algorithms for both supervised and unsupervised learning.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HqqB2vaHXMGt" + }, + "source": [ + "___\n", + "# 
**ARTIFICIAL INTELLIGENCE VS MACHINE LEARNING VS DEEP LEARNING**\n", + "* **Machine Learning** - gives computers the ability to learn without being explicitly programmed. Computers get better at a task by practicing it, usually on large datasets.\n", + "* **Deep Learning** - is a Machine Learning method that relies on artificial neural networks, letting computer systems learn by example, much as we humans do." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P961GcguXFFA" + }, + "source": [ + "![EvolutionOfAI](https://github.com/MathMachado/Materials/blob/master/Evolution%20of%20AI.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://github.com/MathMachado/P4ML/blob/DS_Python/Material/Evolution%20of%20AI.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkqGtO88ZkPr" + }, + "source": [ + "![AI_vs_ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/AI_vs_ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xesQpzfmaqj6" + }, + "source": [ + "![ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. 
Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KeIVR59IIS7f" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING - TECHNIQUES**\n", + "\n", + "* Supervised Learning\n", + "* Unsupervised Learning\n", + "\n", + "![MachineLearning](https://github.com/MathMachado/Materials/blob/master/MachineLearningTechniques.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rvwp5UHdBiup" + }, + "source": [ + "___\n", + "# **OUR FOCUS HERE WILL BE...**\n", + "\n", + "![ClassicalML](https://github.com/MathMachado/Materials/blob/master/ClassicalML.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cBLSvJTXHBjK" + }, + "source": [ + "___\n", + "# **CHEAT SHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZdjR3nahUuKq" + }, + "source": [ + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XRukccWQSklx" + }, + "source": [ + "## Measures for assessing the variability in the data\n", + "* The main measures of variability are the range, variance, standard deviation, and coefficient of variation;\n", + "* These measures tell us whether the data are homogeneous (lower dispersion/variability) or heterogeneous (higher dispersion/variability).\n", + "\n", + "* **In the next version, bring these concepts into the Notebook and use Python to compute these measures**."
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yBR8tWV_lhQq" + }, + "source": [ + "___\n", + "# **ENSEMBLE METHODS** (= Combining predictive models)\n", + "* Methods\n", + "  * **Bagging** (Bootstrap AGGregatING)\n", + "  * **Boosting**\n", + "  * Stacking --> not widely used\n", + "* Helps reduce overfitting (overfitting is when the model/function fits the training data very well but fails to generalize to other samples or to the population).\n", + "* Builds meta-classifiers: combines the results of several algorithms to produce predictions that are more accurate and robust than those of any individual classifier.\n", + "* Ensembles reduce the effects of the main sources of error in Machine Learning models:\n", + "  * noise;\n", + "  * bias;\n", + "  * variance --> the main measure of the variability present in the data.\n", + "\n", + "# References\n", + "* [Simple guide for ensemble learning methods](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2) - A didactic explanation of how ensembles work."
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25RW8u-Sj780" + }, + "source": [ + "### Further Reading\n", + "* [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)\n", + "* [Ensemble Methods in Machine Learning: What are They and Why Use Them?](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)\n", + "* [Ensemble Learning Using Scikit-learn](https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a)\n", + "* [Let’s Talk About Machine Learning Ensemble Learning In Python](https://medium.com/fintechexplained/lets-talk-about-machine-learning-ensemble-learning-in-python-382747e5fba8)\n", + "* [Boosting, Bagging, and Stacking — Ensemble Methods with sklearn and mlens](https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FugME1HSl4jJ" + }, + "source": [ + "___\n", + "# **PARAMETER TUNING** (= Optimal parameters for Machine Learning models)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u_147cIRl9F1" + }, + "source": [ + "## GridSearch (the tool we will use to optimize the parameters of the ML models)\n", + "* Finds the optimal parameters (hyperparameter tuning) that improve model accuracy.\n", + "* Requires the following inputs:\n", + "  * The matrix $X_{p}$ with the $p$ COLUMNS (variables or attributes) of the dataframe;\n", + "  * The vector $y_{p}$ with the target COLUMN (response variable);\n", + "  * The estimator to tune, for example DecisionTreeClassifier, RandomForestClassifier, XGBClassifier, etc.;\n", + "  * A dictionary with the parameters to be optimized;\n", + "  * The number of folds for the Cross-validation method."
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "39Sg77fbTWCO" + }, + "source": [ + "___\n", + "# **MODEL SELECTION & EVALUATION**\n", + "> In this phase we identify and apply the most suitable metrics (Accuracy, Sensitivity, Specificity, F-Score, AUC, R-Sq, Adj R-Sq, RMSE (Root Mean Square Error)) to evaluate the performance of the ML models.\n", + ">> We train the ML models on the training sample and evaluate their performance on the test/validation sample.\n", + "\n", + "* Further Reading\n", + "  * [The 5 Classification Evaluation metrics every Data Scientist must know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)\n", + "  * [Confusion matrix and other metrics in machine learning](https://medium.com/hugo-ferreiras-blog/confusion-matrix-and-other-metrics-in-machine-learning-894688cb1c0a)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oQQVzZ2ZTYrB" + }, + "source": [ + "## Confusion Matrix\n", + "* Terms associated with the Confusion Matrix:\n", + "  * **True Positive** (TP): the observed value is True and the model predicts True. In other words, the model's prediction was correct.\n", + "    * Example: **Observed**: Fraud (Positive); **Model**: Fraud (Positive) --> the model got it right!\n", + "  * **True Negative** (TN): the observed value is False and the model predicts False. In other words, the model's prediction was correct;\n", + "    * Example: **Observed**: NON-Fraud (Negative); **Model**: NON-Fraud (Negative) --> the model got it right!\n", + "  * **False Positive** (FP): the observed value is False and the model predicts True. In other words, the model's prediction was wrong. 
\n", + "    * Example: **Observed**: NON-Fraud (Negative); **Model**: Fraud (Positive) --> the model got it wrong!\n", + "  * **False Negative** (FN): the observed value is True and the model predicts False. In other words, the model's prediction was wrong.\n", + "    * Example: **Observed**: Fraud (Positive); **Model**: NON-Fraud (Negative) --> the model got it wrong!\n", + "\n", + "* See [Confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)\n", + "\n", + "![ConfusionMatrix](https://github.com/MathMachado/Materials/blob/master/ConfusionMatrix.PNG?raw=true)\n", + "\n", + "Source: [Confusion Matrix](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ci-6eiqBTgbL" + }, + "source": [ + "## Accuracy\n", + "> Accuracy - is the proportion of predictions the model gets right.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "How often does the classifier (predictive model) classify correctly?\n", + "```\n", + "\n", + "$$Accuracy= \\frac{TP+TN}{TP+TN+FP+FN}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7YI8X5TRx-R" + }, + "source": [ + "## Precision (or Positive Predictive Value)\n", + "> **Precision** - tells us how the classifier performs with respect to False Positives (how many of the flagged cases are real).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the model predicts Positive, how often is the classifier correct?\n", + "```\n", + "\n", + "\n", + "$$Precision= \\frac{TP}{TP+FP}$$\n", + "\n", + "**Example**: Precision tells us the proportion of customers the model flagged as Fraud that really are Fraud.\n", + "\n", + "**Comment**: If our focus is minimizing False Positives (FP), we should aim for Precision as close to 100% as possible."
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zO39n8x_Sz3L" + }, + "source": [ + "## Recall (or Sensitivity)\n", + "> **Recall** - tells us how the classifier performs with respect to False Negatives (how many we missed).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Positive, how often is the classifier correct?\n", + "```\n", + "\n", + "$$Recall = Sensitivity = \\frac{TP}{TP+FN}$$\n", + "\n", + "**Example**: Recall is the proportion of customers observed as Fraud that the model also predicts as Fraud.\n", + "\n", + "**Comment**: If our focus is minimizing False Negatives (FN), we should aim for Recall as close to 100% as possible." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "htS6rdHVVXRG" + }, + "source": [ + "## Specificity\n", + "> **Specificity** - the ratio of TN to TN+FP.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Negative, how often is the classifier correct?\n", + "```\n", + "\n", + "**Example**: Specificity is the proportion of NON-Fraud customers that the model predicts as NON-Fraud.\n", + "\n", + "$$Specificity= \\frac{TN}{TN+FP}$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mNn0twadTacc" + }, + "source": [ + "## F1-Score\n", + "> F1-Score is the harmonic mean of Recall and Precision and lies between 0 and 1. The closer to 1, the better. The closer to 0, the worse. 
Put differently, it balances Recall and Precision.\n", + "\n", + "$$F1\\_Score= 2\\left(\\frac{Recall*Precision}{Recall+Precision}\\right)$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gkCubyUCP_hn" + }, + "source": [ + "### Functions used in this tutorial" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZD2pH9hfTnZv" + }, + "source": [ + "#### Cross-validation function" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hr8LczrSQB0x" + }, + "source": [ + "" + ], + "execution_count": 2, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ROlyvgij2yl" + }, + "source": [ + "#### Function to plot the Confusion Matrix\n", + "* Taken from [Confusion Matrix Visualization](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klQ0FLOIgeX1" + }, + "source": [ + "def mostra_confusion_matrix(cf, \n", + "                            group_names = None, \n", + "                            categories = 'auto', \n", + "                            count = True, \n", + "                            percent = True, \n", + "                            cbar = True, \n", + "                            xyticks = False, \n", + "                            xyplotlabels = True, \n", + "                            sum_stats = True, \n", + "                            figsize = (8, 8), \n", + "                            cmap = 'Blues'):\n", + "    '''\n", + "    This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.\n", + "    Arguments\n", + "    ---------\n", + "    cf: confusion matrix to be passed in\n", + "    group_names: List of strings that represent the labels row by row to be shown in each square.\n", + "    categories: List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'\n", + "    count: If True, show the raw number in the confusion matrix. Default is True.\n", + "    percent: If True, show the proportions for each category. Default is True.\n", + "    cbar: If True, show the color bar. The cbar values are based off the values in the confusion matrix.\n", + "        Default is True.\n", + "    xyticks: If True, show x and y ticks. 
Default is False.\n", + "    xyplotlabels: If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.\n", + "    sum_stats: If True, display summary statistics below the figure. Default is True.\n", + "    figsize: Tuple representing the figure size. Default is (8, 8); if None, the matplotlib rcParams value is used.\n", + "    cmap: Colormap of the values displayed from matplotlib.pyplot.cm. Default is 'Blues'\n", + "        See http://matplotlib.org/examples/color/colormaps_reference.html\n", + "    '''\n", + "\n", + "    # CODE TO GENERATE TEXT INSIDE EACH SQUARE\n", + "    blanks = ['' for i in range(cf.size)]\n", + "\n", + "    if group_names and len(group_names)==cf.size:\n", + "        group_labels = [\"{}\\n\".format(value) for value in group_names]\n", + "    else:\n", + "        group_labels = blanks\n", + "\n", + "    if count:\n", + "        group_counts = [\"{0:0.0f}\\n\".format(value) for value in cf.flatten()]\n", + "    else:\n", + "        group_counts = blanks\n", + "\n", + "    if percent:\n", + "        group_percentages = [\"{0:.2%}\".format(value) for value in cf.flatten()/np.sum(cf)]\n", + "    else:\n", + "        group_percentages = blanks\n", + "\n", + "    box_labels = [f\"{v1}{v2}{v3}\".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]\n", + "    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])\n", + "\n", + "    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS\n", + "    if sum_stats:\n", + "        #Accuracy is sum of diagonal divided by total observations\n", + "        accuracy = np.trace(cf) / float(np.sum(cf))\n", + "\n", + "        #if it is a binary confusion matrix, show some more stats\n", + "        if len(cf)==2:\n", + "            #Metrics for Binary Confusion Matrices\n", + "            precision = cf[1,1] / sum(cf[:,1])\n", + "            recall = cf[1,1] / sum(cf[1,:])\n", + "            f1_score = 2*precision*recall / (precision + recall)\n", + "            stats_text = \"\\n\\nAccuracy={:0.3f}\\nPrecision={:0.3f}\\nRecall={:0.3f}\\nF1 Score={:0.3f}\".format(accuracy,precision,recall,f1_score)\n", + "        else:\n", + "            stats_text = 
\"\\n\\nAccuracy={:0.3f}\".format(accuracy)\n", + "    else:\n", + "        stats_text = \"\"\n", + "\n", + "    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS\n", + "    if figsize is None:\n", + "        #Get default figure size if not set\n", + "        figsize = plt.rcParams.get('figure.figsize')\n", + "\n", + "    if not xyticks:\n", + "        #Do not show categories if xyticks is False\n", + "        categories=False\n", + "\n", + "    # MAKE THE HEATMAP VISUALIZATION\n", + "    plt.figure(figsize=figsize)\n", + "    sns.heatmap(cf,annot=box_labels,fmt=\"\",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)\n", + "\n", + "    if xyplotlabels:\n", + "        plt.ylabel('True label')\n", + "        plt.xlabel('Predicted label' + stats_text)\n", + "    else:\n", + "        plt.xlabel(stats_text)" + ], + "execution_count": 3, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8J-sTUfTTdLi" + }, + "source": [ + "#### GridSearchCV function" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ap3WMXqDthu9" + }, + "source": [ + "from time import time\n", + "from sklearn.model_selection import GridSearchCV\n", + "from sklearn.metrics import confusion_matrix\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier\n", + "from xgboost import XGBClassifier\n", + "\n", + "def GridSearchOptimizer(modelo, ml_Opt, d_Parametros, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas):\n", + "    ml_GridSearchCV = GridSearchCV(modelo, d_Parametros, cv = i_CV, n_jobs = -1, verbose= 10, scoring = 'accuracy')\n", + "    start = time()\n", + "    ml_GridSearchCV.fit(X_treinamento, y_treinamento)\n", + "    tempo_elapsed = time()-start\n", + "    #print(f\"\\nGridSearchCV took {tempo_elapsed:.2f} seconds.\")\n", + "\n", + "    # Parameters that optimize the classification:\n", + "    print(f'\\nOptimized parameters: {ml_GridSearchCV.best_params_}')\n", + "    \n", + "    if ml_Opt == 'ml_DT2':\n", + "        print(f'\\nDecisionTreeClassifier *********************************************************************************************************')\n", + "        ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'], \n", + "                                        max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + "                                        max_leaf_nodes= 
ml_GridSearchCV.best_params_['max_leaf_nodes'],\n", + "                                        min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + "                                        min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + "                                        random_state= i_Seed)\n", + "    \n", + "    elif ml_Opt == 'ml_RF2':\n", + "        print(f'\\nRandomForestClassifier *********************************************************************************************************')\n", + "        ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'], \n", + "                                        max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + "                                        max_features= ml_GridSearchCV.best_params_['max_features'],\n", + "                                        min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + "                                        min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + "                                        n_estimators= ml_GridSearchCV.best_params_['n_estimators'],\n", + "                                        random_state= i_Seed)\n", + "    \n", + "    elif ml_Opt == 'ml_AB2':\n", + "        print(f'\\nAdaBoostClassifier *********************************************************************************************************')\n", + "        ml_Opt = AdaBoostClassifier(algorithm='SAMME.R', \n", + "                                    base_estimator=RandomForestClassifier(bootstrap = False, \n", + "                                                                          max_depth = 10, \n", + "                                                                          max_features = 'auto', \n", + "                                                                          min_samples_leaf = 1, \n", + "                                                                          min_samples_split = 2, \n", + "                                                                          n_estimators = 400), \n", + "                                    learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + "                                    n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + "                                    random_state = i_Seed)\n", + "    \n", + "    elif ml_Opt == 'ml_GB2':\n", + "        print(f'\\nGradientBoostingClassifier *********************************************************************************************************')\n", + "        ml_Opt = GradientBoostingClassifier(learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + "                                            n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + "                                            max_depth = ml_GridSearchCV.best_params_['max_depth'], \n", + "                                            
min_samples_split = ml_GridSearchCV.best_params_['min_samples_split'], \n", + "                                            min_samples_leaf = ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + "                                            max_features = ml_GridSearchCV.best_params_['max_features'])\n", + "    \n", + "    elif ml_Opt == 'ml_XGB2':\n", + "        print(f'\\nXGBClassifier *********************************************************************************************************')\n", + "        ml_Opt = XGBClassifier(learning_rate= ml_GridSearchCV.best_params_['learning_rate'], \n", + "                               max_depth= ml_GridSearchCV.best_params_['max_depth'], \n", + "                               colsample_bytree= ml_GridSearchCV.best_params_['colsample_bytree'], \n", + "                               subsample= ml_GridSearchCV.best_params_['subsample'], \n", + "                               gamma= ml_GridSearchCV.best_params_['gamma'], \n", + "                               min_child_weight= ml_GridSearchCV.best_params_['min_child_weight'])\n", + "    \n", + "    # Retrain using the optimized parameters...\n", + "    ml_Opt.fit(X_treinamento, y_treinamento)\n", + "\n", + "    # Cross-validation with i_CV folds\n", + "    print(f'\\n********* CROSS-VALIDATION ***********')\n", + "    a_scores_CV = funcao_cross_val_score(ml_Opt, X_treinamento, y_treinamento, i_CV)\n", + "\n", + "    # Predict with the optimized parameters...\n", + "    y_pred = ml_Opt.predict(X_teste)\n", + "    \n", + "    # COLUMN importance\n", + "    print(f'\\n********* COLUMN IMPORTANCE ***********')\n", + "    df_importancia_variaveis = pd.DataFrame(zip(l_colunas, ml_Opt.feature_importances_), columns= ['coluna', 'importancia'])\n", + "    df_importancia_variaveis = df_importancia_variaveis.sort_values(by= ['importancia'], ascending=False)\n", + "    print(df_importancia_variaveis)\n", + "\n", + "    # Confusion matrix\n", + "    print(f'\\n********* CONFUSION MATRIX - PARAMETER TUNING ***********')\n", + "    cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "    cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "    cf_categories = ['Zero', 'One']\n", + "    
mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)\n", + "\n", + "    return ml_Opt, ml_GridSearchCV.best_params_" + ], + "execution_count": 4, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YMnQn2XgT7Mg" + }, + "source": [ + "#### Function to select the relevant COLUMNS of the dataframes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fsnHcaeLUDFS" + }, + "source": [ + "from sklearn.feature_selection import SelectFromModel\n", + "\n", + "def seleciona_colunas_relevantes(modelo, X_treinamento, X_teste, threshold = 0.05):\n", + "    # Create a selector that keeps the COLUMNS with importance > threshold\n", + "    sfm = SelectFromModel(modelo, threshold = threshold)\n", + "    \n", + "    # Train the selector (note: y_treinamento is read from the notebook's global scope)\n", + "    sfm.fit(X_treinamento, y_treinamento)\n", + "\n", + "    # Show the indices of the most important COLUMNS\n", + "    print(f'\\n********** Relevant COLUMNS ******')\n", + "    print(sfm.get_support(indices=True))\n", + "\n", + "    # Keep only the relevant COLUMNS\n", + "    X_treinamento_I = sfm.transform(X_treinamento)\n", + "    X_teste_I = sfm.transform(X_teste)\n", + "    return X_treinamento_I, X_teste_I " + ], + "execution_count": 5, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gd98JFSGUV5n" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3e0m7lEnYOV9" + }, + "source": [ + "### Function to compute the importance of the columns/variables/attributes\n", + "* Source: [Plotting Feature Importances](https://www.kaggle.com/grfiv4/plotting-feature-importances)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fjco0HnNYr-N" + }, + "source": [ + "def mostra_feature_importances(clf, X_treinamento, y_treinamento=None, \n", + "                               top_n=10, figsize=(8,8), print_table=False, title=\"Feature Importances\"):\n", + "    '''\n", + "    plot feature importances of a tree-based sklearn estimator\n", + "    \n", + "    Note: X_treinamento and y_treinamento are 
pandas DataFrames\n", + "    \n", + "    Note: Scikit-plot is a lovely package but I sometimes have issues\n", + "        1. flexibility/extendibility\n", + "        2. complicated models/datasets\n", + "    But for many situations Scikit-plot is the way to go\n", + "    see https://scikit-plot.readthedocs.io/en/latest/Quickstart.html\n", + "    \n", + "    Parameters\n", + "    ----------\n", + "    clf (sklearn estimator) if not fitted, this routine will fit it\n", + "    \n", + "    X_treinamento (pandas DataFrame)\n", + "    \n", + "    y_treinamento (pandas DataFrame) optional\n", + "        required only if clf has not already been fitted \n", + "    \n", + "    top_n (int) Plot the top_n most-important features\n", + "        Default: 10\n", + "    \n", + "    figsize ((int,int)) The physical size of the plot\n", + "        Default: (8,8)\n", + "    \n", + "    print_table (boolean) If True, print out the table of feature importances\n", + "        Default: False\n", + "    \n", + "    Returns\n", + "    -------\n", + "    the pandas dataframe with the features and their importance\n", + "    \n", + "    Author\n", + "    ------\n", + "    George Fisher\n", + "    '''\n", + "    \n", + "    __name__ = \"mostra_feature_importances\"\n", + "    \n", + "    import pandas as pd\n", + "    import numpy as np\n", + "    import matplotlib.pyplot as plt\n", + "    \n", + "    from xgboost.core import XGBoostError\n", + "    from lightgbm.sklearn import LightGBMError\n", + "    \n", + "    try: \n", + "        if not hasattr(clf, 'feature_importances_'):\n", + "            clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n", + "\n", + "            if not hasattr(clf, 'feature_importances_'):\n", + "                raise AttributeError(\"{} does not have feature_importances_ attribute\".\n", + "                                     format(clf.__class__.__name__))\n", + "    \n", + "    except (XGBoostError, LightGBMError, ValueError):\n", + "        clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n", + "    \n", + "    feat_imp = pd.DataFrame({'importance':clf.feature_importances_}) \n", + "    feat_imp['feature'] = X_treinamento.columns\n", + "    feat_imp.sort_values(by ='importance', ascending = 
False, inplace = True)\n", + "    feat_imp = feat_imp.iloc[:top_n]\n", + "    \n", + "    feat_imp.sort_values(by='importance', inplace = True)\n", + "    feat_imp = feat_imp.set_index('feature', drop = True)\n", + "    feat_imp.plot.barh(title=title, figsize=figsize)\n", + "    plt.xlabel('Feature Importance Score')\n", + "    plt.show()\n", + "    \n", + "    if print_table:\n", + "        from IPython.display import display\n", + "        print(\"Top {} features in descending order of importance\".format(top_n))\n", + "        display(feat_imp.sort_values(by = 'importance', ascending = False))\n", + "    \n", + "    return feat_imp" + ], + "execution_count": 6, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rsH9dMxazWCg" + }, + "source": [ + "# **EXAMPLE DATAFRAME USED IN THIS TUTORIAL**\n", + "> Generate a dataframe with 18 columns: 9 informative, 6 redundant, and 3 repeated:\n", + "\n", + "To learn more about generating example (toy) dataframes, see [Synthetic data generation — a must-have skill for new data scientists](https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GEyDo_EIV_jV" + }, + "source": [ + "## Define global variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TdwgpZ76WFaT" + }, + "source": [ + "i_CV = 10 # Number of cross-validation folds\n", + "i_Seed = 20111974 # seed, for reproducibility\n", + "f_Test_Size = 0.3 # Proportion of the dataframe held out for validation (other options would be 0.15, 0.20 or 0.25)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gJTJfpwWzykS" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "\n", + "X, y = make_classification(n_samples = 1000, \n", + "                           n_features = 18, \n", + "                           n_informative = 9, \n", + "                           n_redundant = 6, \n", + "                           n_repeated = 3, \n", + "                           n_classes = 2, \n", + "                           
n_clusters_per_class = 1, \n", + "                           random_state=i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gWy2IZh3s-o3", + "outputId": "7971af09-b2a0-45a4-e59b-5311f981e367", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 240 + } + }, + "source": [ + "X" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 0.06844089, 4.21184154, -2.5583024 , ..., -0.63061895,\n", + " -0.97831983, -0.88826977],\n", + " [-4.8240213 , 0.17950903, -2.98447332, ..., 0.33992045,\n", + " 1.89153784, -6.10967565],\n", + " [ 1.38953042, -0.226476 , 1.8774004 , ..., -1.47784549,\n", + " 0.96052606, 2.06020368],\n", + " ...,\n", + " [ 1.62548685, 0.43377848, 4.93537285, ..., -4.61990917,\n", + " 0.18310709, 6.16040231],\n", + " [-2.40619087, -1.65474635, 2.64196493, ..., -1.21427845,\n", + " 0.83745861, 0.8254619 ],\n", + " [-4.00041881, 2.52475556, -4.15290177, ..., -0.51680266,\n", + " 1.72224835, -5.59558306]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ccjhGnzxtAaV", + "outputId": "1e5b7a88-9c9c-4a81-ab35-251e6c0aa3df", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y[0:30] # Similar to fraud cases: {0, 1}" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHO2befKJxR3" + }, + "source": [ + "___\n", + "# **DECISION TREE**\n", + "> Decision Trees have a tree-shaped structure.\n", + "\n", + "* **Main advantages**:\n", + "  * They are easy to understand, visualize, and 
interpret;\n", + "  * They easily capture non-linear patterns in the data;\n", + "  * They require little computing power --> training Decision Trees does not take many computational resources!\n", + "  * They handle both numerical and categorical COLUMNS well;\n", + "  * They do not require the data to be normalized;\n", + "  * They can be used for Feature Engineering when dealing with Missing Values;\n", + "  * They can be used for Feature Selection;\n", + "  * They make no assumptions about the distribution of the data, owing to the non-parametric nature of the algorithm\n", + "\n", + "* **Main disadvantages**\n", + "  * Prone to overfitting, since Decision Trees can grow complex trees that do not generalize well. Things get much worse if the training sample contains outliers. Therefore, **treating outliers beforehand is strongly recommended**.\n", + "  * They can produce biased trees if the dataframe is unbalanced or one class is dominant. For this reason, **balancing the dataframe beforehand is recommended**.\n", + "\n", + "* **Main parameters**\n", + "  * **Gini Index** - a metric that measures how often a randomly selected point/observation would be incorrectly classified.\n", + "    * Therefore, the lower the Gini Index, the better the COLUMN;\n", + "  * **Entropy** - a metric that measures the randomness of the information in the data.\n", + "    * Therefore, the higher the entropy of a COLUMN, the less it helps us reach a conclusion (classify, for example).\n", + "\n", + "## **References**:\n", + "* [1.10. 
Decision Trees](https://scikit-learn.org/stable/modules/tree.html).\n", + "* [Decision Tree Algorithm With Hands On Example](https://medium.com/datadriveninvestor/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38) - a great tutorial for learning, understanding, interpreting and computing the Gini index and entropy.\n", + "* [Intuitive Guide to Understanding Decision Trees](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-decision-trees-adb2165ccab7) - a great tutorial for learning, understanding, interpreting and computing the Gini index and entropy.\n", + "* [The Complete Guide to Decision Trees](https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14)\n", + "* [Creating and Visualizing Decision Tree Algorithm in Machine Learning Using Sklearn](https://intellipaat.com/blog/decision-tree-algorithm-in-machine-learning/) - Very didactic!\n", + "* [Decision Trees in Machine Learning](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FrMkPN5aLp0Y" + }, + "source": [ + "## Load the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FVU1CM0PKgO4" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")" + ], + "execution_count": 8, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "15clh4XrISpz" + }, + "source": [ + "## Load/Read the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UMPL46w2IWJw" + }, + "source": [ + "l_colunas = ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18']\n", + "\n", + "df_X = pd.DataFrame(X, columns = l_colunas)\n", + "df_y = pd.DataFrame(y, columns = ['target'])" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFaQF2MGFl_M", + "outputId": "6427f328-d63c-4845-d20c-92938031305d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 223 + } + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + " v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18\n",
+ "0 0.068441 4.211842 -2.558302 3.665482 -3.835158 3.499851 2.490856 3.665482 0.245117 0.867172 2.865546 0.493956 -5.148596 2.865546 3.499851 -0.630619 -0.978320 -0.888270\n",
+ "1 -4.824021 0.179509 -2.984473 1.033618 -3.893426 3.428734 -3.334605 1.033618 -0.882780 -0.753281 1.441522 -1.395514 -4.002880 1.441522 3.428734 0.339920 1.891538 -6.109676\n",
+ "2 1.389530 -0.226476 1.877400 2.713426 4.630257 0.516455 -3.743027 2.713426 1.284039 2.030797 -1.095536 1.560159 -1.014211 -1.095536 0.516455 -1.477845 0.960526 2.060204\n",
+ "3 1.145809 2.255946 0.207364 4.665817 2.294678 6.501306 0.964770 4.665817 0.119410 3.196354 1.894787 3.519138 -4.757807 1.894787 6.501306 -3.789029 0.579491 1.397106\n",
+ "4 -0.936646 3.697163 -3.363617 3.805126 -1.754430 4.954346 0.406605 3.805126 -0.824738 1.382591 1.665704 -0.649758 -3.513036 1.665704 4.954346 0.257052 0.904244 -3.071354\n",
+ "
" + ], + "text/plain": [ + " v1 v2 v3 ... v16 v17 v18\n", + "0 0.068441 4.211842 -2.558302 ... -0.630619 -0.978320 -0.888270\n", + "1 -4.824021 0.179509 -2.984473 ... 0.339920 1.891538 -6.109676\n", + "2 1.389530 -0.226476 1.877400 ... -1.477845 0.960526 2.060204\n", + "3 1.145809 2.255946 0.207364 ... -3.789029 0.579491 1.397106\n", + "4 -0.936646 3.697163 -3.363617 ... 0.257052 0.904244 -3.071354\n", + "\n", + "[5 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s-ibdD2ZG7tm", + "outputId": "1dedf9d5-1a8d-47a3-b665-4ce84f870b0b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1000, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f9cqRaywa_TR", + "outputId": "faeca466-1081-47e0-bb69-1bff8c00f51e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "set(df_y['target'])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{0, 1}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BN6jbpn6Iwmu" + }, + "source": [ + "## Estatísticas Descritivas básicas do dataframe - df.describe()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KlwhxxUNIyYs", + "outputId": "9ea27467-c83e-4654-9703-1947ada7f50b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 315 + } + }, + "source": [ + "df_X.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + " v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18\n",
+ "count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000\n",
+ "mean -0.085159 1.034227 0.657408 1.405317 0.687279 1.131560 0.108053 1.405317 1.007023 1.048801 0.079248 0.001650 -0.365438 0.079248 1.131560 -0.027751 0.984606 0.633624\n",
+ "std 2.002247 1.631507 3.608772 2.256857 4.019598 4.481832 1.981307 2.256857 1.863288 1.643900 1.949273 1.932641 4.160668 1.949273 4.481832 2.065455 1.850593 3.552991\n",
+ "min -6.944169 -4.620754 -16.300139 -6.235192 -12.454256 -14.305401 -6.152747 -6.235192 -5.484992 -3.293216 -7.135349 -5.705500 -9.120941 -7.135349 -14.305401 -6.009023 -5.035184 -11.439074\n",
+ "25% -1.305566 -0.089052 -1.623657 -0.152888 -1.854645 -1.684751 -1.216983 -0.152888 -0.240908 -0.012710 -1.209675 -1.292162 -3.555363 -1.209675 -1.684751 -1.436673 -0.261610 -1.691346\n",
+ "50% 0.052523 0.994150 0.573849 1.449931 0.812364 1.281504 0.167091 1.449931 1.066125 1.012899 0.180344 0.035237 -0.966638 0.180344 1.281504 -0.000190 0.975793 0.844784\n",
+ "75% 1.383853 2.071995 3.038586 2.887141 3.413952 4.008103 1.438719 2.887141 2.288188 2.187202 1.439199 1.315342 2.745806 1.439199 4.008103 1.365369 2.256504 3.109330\n",
+ "max 4.997172 7.354860 11.720165 8.494566 12.844418 15.999803 6.293550 8.494566 8.146559 6.523180 6.252448 5.538216 11.259350 6.252448 15.999803 6.531561 7.646802 12.090528\n",
+ "" + ], + "text/plain": [ + " v1 v2 ... v17 v18\n", + "count 1000.000000 1000.000000 ... 1000.000000 1000.000000\n", + "mean -0.085159 1.034227 ... 0.984606 0.633624\n", + "std 2.002247 1.631507 ... 1.850593 3.552991\n", + "min -6.944169 -4.620754 ... -5.035184 -11.439074\n", + "25% -1.305566 -0.089052 ... -0.261610 -1.691346\n", + "50% 0.052523 0.994150 ... 0.975793 0.844784\n", + "75% 1.383853 2.071995 ... 2.256504 3.109330\n", + "max 4.997172 7.354860 ... 7.646802 12.090528\n", + "\n", + "[8 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N_QhFqyZOKFB" + }, + "source": [ + "## Select the training and test samples\n", + "\n", + "* Split the data/samples into:\n", + " * **Training sample**: used to train the model and tune the hyperparameters;\n", + " * **Test sample**: used to check whether the tuned model works on completely unseen data. It is on this test sample that we evaluate how well the model generalizes (how it handles data it was never shown);\n", + "* **Hold-Out technique**: split the data into a training sample and a test sample. We usually use 70% of the sample for training and 30% for testing. Other options are 80/20 or 75/25 (the default).\n", + " * **Disadvantage of Hold-Out**: the measured performance depends on a single split, so variance in the data can compromise it when, for example, the training sample is not similar to the test sample. 
\n", + "* Consulte [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) para mais detalhes.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8sKBgs-QOOfn" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TPTKBBHgOpoA", + "outputId": "14ff5eb3-d8a8-4475-a783-58ff2657943b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lEn_LLs2OtRI", + "outputId": "2da233e6-1f0e-449f-ce61-8e3f9bcb3076", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_treinamento.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_uAw8EcyOvrG", + "outputId": "247633e0-375c-4f2c-e21b-ff394cb7850a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_teste.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A2LYI-9hOyXI", + "outputId": "efe6513a-ac85-468a-ac49-32d4ddd4266f", + "colab": { + 
"base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_teste.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "npgoBSX2dd4l" + }, + "source": [ + "## Treinar o algoritmo com os dados de treinamento\n", + "### Carregar os algoritmos/libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcvzrtolGfnQ", + "outputId": "13d2f619-2128-43da-b1df-e2e3253928a8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "!pip install graphviz\n", + "!pip install pydotplus" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (0.10.1)\n", + "Requirement already satisfied: pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)\n", + "Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from pydotplus) (2.4.7)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_pF-HH3JKL2" + }, + "source": [ + "from sklearn.metrics import accuracy_score # para medir a acurácia do modelo preditivo\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix # para plotar a confusion matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV # para otimizar os parâmetros dos modelos preditivos\n", + "from sklearn.model_selection import cross_val_score # Para o CV (Cross-Validation)\n", + "from sklearn.model_selection import cross_validate\n", + "\n", + "from time import time\n", + "from operator import itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from 
sklearn.tree import export_graphviz\n", + "from io import StringIO \n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YJMS9ePQ6B6t" + }, + "source": [ + "**Attention**: `min_samples_split = 2` is the default of DecisionTreeClassifier. With it (and `max_depth = None`) the tree keeps splitting until the leaves are pure, which favors overfitting; to reduce overfitting, consider larger values of min_samples_split and min_samples_leaf, or a limit on max_depth." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nNeRHYePJc-r" + }, + "source": [ + "from sklearn.tree import DecisionTreeClassifier # Decision Tree estimator (classification)\n", + "\n", + "# Instantiate (configure) the Decision Tree; min_samples_split = 2 and min_samples_leaf = 1 are the library defaults (starting point, before tuning):\n", + "ml_DT = DecisionTreeClassifier(criterion = 'gini', \n", + " splitter = 'best', \n", + " max_depth = None, \n", + " min_samples_split = 2, \n", + " min_samples_leaf = 1, \n", + " min_weight_fraction_leaf = 0.0, \n", + " max_features = None, \n", + " random_state = i_Seed, \n", + " max_leaf_nodes = None, \n", + " min_impurity_decrease = 0.0, \n", + " min_impurity_split = None, \n", + " class_weight = None, \n", + " presort = False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gVLZznprx2YX", + "outputId": "c2cd5eff-b03f-4c15-f42e-8c0862e2b45f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 120 + } + }, + "source": [ + "# Configured object/classifier\n", + "ml_DT" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=20111974, 
splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8CC24H-JHhlj" + }, + "source": [ + "### Train the algorithm: fit(X, y)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OgAHfXVo-Nw8", + "outputId": "c760acbb-3017-4181-cf88-118990fc43d5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 120 + } + }, + "source": [ + "ml_DT.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oNVgVA4Bqy2m" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CNiRjmrRHVnx" + }, + "source": [ + "### Validate the model with the test sample" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2GMCSs89HquJ", + "outputId": "722edff2-b895-4ed2-ab48-77c6763bb009", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "ml_DT.score(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.94" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bmv9YZobIer4" + }, + "source": [ + "### Compute the predictions using the model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2YufZaRNIkFL" + }, 
+ "source": [ + "y_pred = ml_DT.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fYvMN-tvIX-p" + }, + "source": [ + "### Confusion Matrix" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9iTK6pBwIZ3F", + "outputId": "2cfe7aa6-a696-4169-dfac-8cb40a9d513b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jOnkFBcEIVAb" + }, + "source": [ + "### Go back to the beginning, draw a new sample and compute the accuracy\n", + "* Did you notice that the accuracy changed? That happens because we drew a new training sample.\n", + "* What are the drawbacks of getting a different metric for each sample used to fit the predictive model?\n", + "* How should you report the results of your model?\n", + "* How can you be confident about the most representative value of the metric?\n", + " * Use Statistics to your advantage! --> Use Cross-Validation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MkBSvyorGXQz" + }, + "source": [ + "___\n", + "# **CROSS-VALIDATION**\n", + "* K-fold is the best-known and most widely used Cross-Validation (CV) method;\n", + "* How it works: the training dataframe is split into k parts (each part is a fold);\n", + " * k-1 parts are used to train the model and the remaining part is used to validate it;\n", + " * The process is repeated k times, and each iteration computes the desired metrics (for example, accuracy);\n", + " * This way the model is trained and tested on every part of the data;\n", + " * After the k iterations we have k values of each metric, from which we compute the mean and the standard deviation.\n", + "\n", + " The figure below helps to understand how CV works:\n", + "\n", + "![Cross-Validation](https://github.com/MathMachado/Materials/blob/master/CV2.PNG?raw=true)\n", + "\n", + "Source: [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + "\n", + "* **value of k**:\n", + " * value of k (folds): between 5 and 10 --> There is no general rule for choosing k;\n", + " * The larger the value of k --> the smaller the bias of the CV --> a statistical experiment can show this effect.\n", + "\n", + "[Applied Predictive 
Modeling, 2013](https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/ref=as_li_ss_tl?ie=UTF8&qid=1520380699&sr=8-1&keywords=applied+predictive+modeling&linkCode=sl1&tag=inspiredalgor-20&linkId=1af1f3de89c11e4a7fd49de2b05e5ebf)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HscfN-a1V043" + }, + "source": [ + "* **Advantages of using CV**:\n", + " * More reliable estimates of the model's accuracy;\n", + " * Better use of the data, since every observation is used for both training and validation. Therefore, any problem with the data is likely to surface at this stage.\n", + "\n", + "* **Further reading**\n", + " * [Cross-Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)\n", + " * [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + " * [Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8x2UPwOYQPcI", + "outputId": "b587c430-3e4d-487c-838a-ffc3b8c5864b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-Validation with k = 10 folds (= 10 parts)\n", + "a_scores_CV = funcao_cross_val_score(ml_DT, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 91.43\n", + "std médio das Acurácias calculadas pelo CV: 3.44\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uxoplcea0byV", + "outputId": "597530be-6e07-452c-b22e-37d1c054769e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_scores_CV # array with the score of each CV iteration" + ], + "execution_count": null, 
+ "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.9 , 0.98571429, 0.85714286, 0.92857143, 0.88571429,\n", + " 0.94285714, 0.92857143, 0.9 , 0.88571429, 0.92857143])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y3k-PcbN0o_i", + "outputId": "2a4a51a9-1636-4d75-aa07-ff75a7f00ad1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_scores_CV.mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9142857142857144" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6_rYker2gzeG" + }, + "source": [ + "**Interpretation**: Our classifier (DecisionTreeClassifier) has a mean cross-validated accuracy of 91.43% (training set). In addition, the std is about 3.44%, which is small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." 
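+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The mean and std printed by the cross-validation above can be re-derived by hand from the ten fold accuracies. The sketch below is only an illustration under two assumptions: that the reported std is the population standard deviation (the default of NumPy's std()), and that the fold accuracies are the rounded values shown in the a_scores_CV output. The names fold_scores, mean_pct and std_pct are hypothetical and not part of the notebook.

```python
from statistics import fmean, pstdev

# Fold accuracies copied from the a_scores_CV output (k = 10 folds).
fold_scores = [0.9, 0.98571429, 0.85714286, 0.92857143, 0.88571429,
               0.94285714, 0.92857143, 0.9, 0.88571429, 0.92857143]

# Mean accuracy over the folds, expressed as a percentage.
mean_pct = round(100 * fmean(fold_scores), 2)

# Population standard deviation (divides by k, not k - 1), as a percentage.
std_pct = round(100 * pstdev(fold_scores), 2)

print(f'mean accuracy: {mean_pct}%')
print(f'std of accuracy: {std_pct}%')
```

Run as written, this gives a mean of 91.43% and a std of 3.44%, the same figures as the cross-validation output above."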
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tkwchmkP3p_A", + "outputId": "28332397-3b99-437c-e37c-387b116e6311", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácias: [0.9 0.98571429 0.85714286 0.92857143 0.88571429 0.94285714\n", + " 0.92857143 0.9 0.88571429 0.92857143]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lQNyqHCiKRUh" + }, + "source": [ + "### Validate the model with the test sample" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Qb0ZPyvKKRUp", + "outputId": "dbbbe053-f3e9-4190-d88e-78c92dca7544", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "ml_DT.score(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.94" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iL2tEdbqKY5P" + }, + "source": [ + "### Predictions with the trained model\n", + "* Use the classifier (Decision Tree) to make predictions on the test sample:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sI31WkZs2ht_" + }, + "source": [ + "y_pred = ml_DT.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rfapj3OG13PG", + "outputId": "0a725c57-50d1-4151-a889-19103c0aeb1d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y_pred[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,\n", + " 1, 0, 0, 1, 1, 0, 1, 1])" + ] + }, + 
"metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sc88ofqh16RT", + "outputId": "a6f0823a-b1a0-41f3-e6fa-77223c2d7415", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cecv-51TKgz-" + }, + "source": [ + "### Matriz de Confusão" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fSaVzJ9xFpwW", + "outputId": "cf7cc4d8-8cec-4603-a561-4484b7cd235b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3ySIXWeFVDlh" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t-KyMOWgRyQ4" + }, + "source": [ + "\n", + "\n", + "---\n", + "##### ACURÁCIA com dados de TREINO SEM CROSS VALIDATION\n", + "---\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VTT_qynaPV6-", + "outputId": "5f6af706-b5b3-4b1b-f0bd-a1886f0a9e9f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TREINO SEM CROSS VALIDATION\n", + "# Preparar o array de predições\n", + "a_scores_pred_treino = ml_DT.predict(X_treinamento)\n", + "a_scores_pred_treino[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,\n", + " 0, 1, 0, 1, 1, 0, 0, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NgzcGE75PVKB", + "outputId": "259de4dd-4c60-4bed-b750-e6a7f51d40db", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TREINO SEM CROSS VALIDATION\n", + "accuracy_score(y_treinamento, a_scores_pred_treino)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DL-b5ehHSeaV" + }, + "source": [ + "---\n", + "#### Criar Modelo usando classificador Naive Bayes\n", + "---" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mG3gUR4aSwD3" + }, + "source": [ + "from sklearn.naive_bayes import GaussianNB" + ], + "execution_count": null, + "outputs": 
[] + }, + { + "cell_type": "code", + "metadata": { + "id": "N8KXRDsSS3uQ" + }, + "source": [ + "# Criando o modelo preditivo\n", + "modelo_v1 = GaussianNB()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tO9zI48tS3Wy", + "outputId": "dc906ec3-5010-486e-dbd8-78493fa23d2b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Treinando o modelo\n", + "modelo_v1.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "GaussianNB(priors=None, var_smoothing=1e-09)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 38 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5OdlSGHWS28K", + "outputId": "0a175639-5633-4b6d-a320-e8c929468659", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_scores_pred_treino_NB = modelo_v1.predict(X_treinamento)\n", + "a_scores_pred_treino_NB[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0,\n", + " 0, 1, 0, 1, 0, 0, 0, 0])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aEFkIz3wS2gm", + "outputId": "2c656f9d-ba1c-47fc-80f6-98385c7420d7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TREINO SEM CROSS VALIDATION\n", + "accuracy_score(y_treinamento, a_scores_pred_treino_NB)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.8942857142857142" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vrjgRbCUT2PB", + 
"outputId": "8dbd3d1e-b7a7-46c1-b8d5-0e62b172a199", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TESTE\n", + "a_scores_pred_teste_NB = modelo_v1.predict(X_teste)\n", + "accuracy_score(y_teste, a_scores_pred_teste_NB)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.92" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 46 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1AvybYmQT2AU", + "outputId": "83397313-5911-4d70-be99-0e73645455c7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Medir ACURÁCIA com dados de TREINO **COM** CROSS VALIDATION\n", + "# Cross-Validation com k = 10 folds\n", + "a_scores_CV_NB = cross_val_score(modelo_v1, X_treinamento, y_treinamento, cv = i_CV)\n", + "\n", + "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV_NB.mean(),4)}')\n", + "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV_NB.std(),4)}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 89.71000000000001\n", + "std médio das Acurácias calculadas pelo CV: 3.3099999999999996\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p8D975NqsGtj" + }, + "source": [ + "## Parameter tunning\n", + "### Referência\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - Explica didaticamente e step by step como fazer parameter tunning." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bfdq5zEhlVsk" + }, + "source": [ + "# Parameter dictionary for hyperparameter tuning. In total, 2 x 15 x 5 x 5 x 7 = 5,250 parameter combinations are evaluated. With 10 folds in the cross-validation, that amounts to 52,500 model fits.\n", + "d_parametros_DT = {\"criterion\": [\"gini\", \"entropy\"], \n", + " \"min_samples_split\": [2, 5, 10, 30, 50, 70, 90, 120, 150, 180, 210, 240, 270, 350, 400], \n", + " \"max_depth\": [None, 2, 5, 9, 15], \n", + " \"min_samples_leaf\": [20, 40, 60, 80, 100], \n", + " \"max_leaf_nodes\": [None, 2, 3, 4, 5, 10, 15]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BtajXuuUpGwq", + "outputId": "47780f80-f16c-4582-de59-64272fbcbb20", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 343 + } + }, + "source": [ + "d_parametros_DT" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': ['gini', 'entropy'],\n", + " 'max_depth': [None, 2, 5, 9, 15],\n", + " 'max_leaf_nodes': [None, 2, 3, 4, 5, 10, 15],\n", + " 'min_samples_leaf': [20, 40, 60, 80, 100],\n", + " 'min_samples_split': [2,\n", + " 5,\n", + " 10,\n", + " 30,\n", + " 50,\n", + " 70,\n", + " 90,\n", + " 120,\n", + " 150,\n", + " 180,\n", + " 210,\n", + " 240,\n", + " 270,\n", + " 350,\n", + " 400]}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H8gNSs0G0A-L" + }, + "source": [ + "```\n", + "grid_search = GridSearchCV(ml_DT, param_grid= d_parametros_DT, cv = i_CV, n_jobs= -1)\n", + "start = time()\n", + "grid_search.fit(X_treinamento, y_treinamento)\n", + "tempo_elapsed= time()-start\n", + "print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos para estimar {len(grid_search.cv_results_)} modelos candidatos\")\n", + "\n", + "GridSearchCV levou 1999.12 segundos para estimar 23 modelos candidatos\n", +
"```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "44-BRnNjBT25", + "outputId": "c671315c-7109-4ff6-d0c8-c78f12ff042b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "# Invoca a função com o modelo baseline\n", + "ml_DT2, best_params = GridSearchOptimizer(ml_DT, 'ml_DT2', d_parametros_DT, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 5250 candidates, totalling 52500 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.5s\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 1.6s\n", + "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.7s\n", + "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 1.7s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1730s.) Setting batch_size=2.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0582s.) Setting batch_size=4.\n", + "[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 1.8s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0972s.) Setting batch_size=8.\n", + "[Parallel(n_jobs=-1)]: Done 58 tasks | elapsed: 2.1s\n", + "[Parallel(n_jobs=-1)]: Done 130 tasks | elapsed: 2.7s\n", + "[Parallel(n_jobs=-1)]: Done 202 tasks | elapsed: 3.3s\n", + "[Parallel(n_jobs=-1)]: Done 290 tasks | elapsed: 3.9s\n", + "[Parallel(n_jobs=-1)]: Done 378 tasks | elapsed: 4.5s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1926s.) 
Setting batch_size=16.\n", + "[Parallel(n_jobs=-1)]: Done 490 tasks | elapsed: 5.3s\n", + "[Parallel(n_jobs=-1)]: Done 698 tasks | elapsed: 6.7s\n", + "[Parallel(n_jobs=-1)]: Done 938 tasks | elapsed: 8.0s\n", + "[Parallel(n_jobs=-1)]: Done 1178 tasks | elapsed: 9.5s\n", + "[Parallel(n_jobs=-1)]: Done 1450 tasks | elapsed: 11.1s\n", + "[Parallel(n_jobs=-1)]: Done 1722 tasks | elapsed: 12.9s\n", + "[Parallel(n_jobs=-1)]: Done 2026 tasks | elapsed: 14.7s\n", + "[Parallel(n_jobs=-1)]: Done 2330 tasks | elapsed: 16.6s\n", + "[Parallel(n_jobs=-1)]: Done 2666 tasks | elapsed: 18.8s\n", + "[Parallel(n_jobs=-1)]: Done 3002 tasks | elapsed: 20.8s\n", + "[Parallel(n_jobs=-1)]: Done 3370 tasks | elapsed: 23.2s\n", + "[Parallel(n_jobs=-1)]: Done 3738 tasks | elapsed: 25.4s\n", + "[Parallel(n_jobs=-1)]: Done 4138 tasks | elapsed: 28.2s\n", + "[Parallel(n_jobs=-1)]: Done 4538 tasks | elapsed: 30.5s\n", + "[Parallel(n_jobs=-1)]: Done 4970 tasks | elapsed: 33.4s\n", + "[Parallel(n_jobs=-1)]: Done 5402 tasks | elapsed: 35.9s\n", + "[Parallel(n_jobs=-1)]: Done 5866 tasks | elapsed: 38.6s\n", + "[Parallel(n_jobs=-1)]: Done 6330 tasks | elapsed: 41.2s\n", + "[Parallel(n_jobs=-1)]: Done 6826 tasks | elapsed: 44.1s\n", + "[Parallel(n_jobs=-1)]: Done 7322 tasks | elapsed: 47.0s\n", + "[Parallel(n_jobs=-1)]: Done 7850 tasks | elapsed: 50.1s\n", + "[Parallel(n_jobs=-1)]: Done 8378 tasks | elapsed: 53.0s\n", + "[Parallel(n_jobs=-1)]: Done 8938 tasks | elapsed: 56.2s\n", + "[Parallel(n_jobs=-1)]: Done 9498 tasks | elapsed: 59.3s\n", + "[Parallel(n_jobs=-1)]: Done 10090 tasks | elapsed: 1.0min\n", + "[Parallel(n_jobs=-1)]: Done 10682 tasks | elapsed: 1.1min\n", + "[Parallel(n_jobs=-1)]: Done 11306 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 11930 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 12586 tasks | elapsed: 1.3min\n", + "[Parallel(n_jobs=-1)]: Done 13242 tasks | elapsed: 1.4min\n", + "[Parallel(n_jobs=-1)]: Done 13930 tasks | elapsed: 1.4min\n", + 
"[Parallel(n_jobs=-1)]: Done 14618 tasks | elapsed: 1.5min\n", + "[Parallel(n_jobs=-1)]: Done 15338 tasks | elapsed: 1.6min\n", + "[Parallel(n_jobs=-1)]: Done 16058 tasks | elapsed: 1.7min\n", + "[Parallel(n_jobs=-1)]: Done 16810 tasks | elapsed: 1.7min\n", + "[Parallel(n_jobs=-1)]: Done 17562 tasks | elapsed: 1.8min\n", + "[Parallel(n_jobs=-1)]: Done 18346 tasks | elapsed: 1.9min\n", + "[Parallel(n_jobs=-1)]: Done 19130 tasks | elapsed: 2.0min\n", + "[Parallel(n_jobs=-1)]: Done 19946 tasks | elapsed: 2.1min\n", + "[Parallel(n_jobs=-1)]: Done 20762 tasks | elapsed: 2.2min\n", + "[Parallel(n_jobs=-1)]: Done 21610 tasks | elapsed: 2.2min\n", + "[Parallel(n_jobs=-1)]: Done 22458 tasks | elapsed: 2.3min\n", + "[Parallel(n_jobs=-1)]: Done 23338 tasks | elapsed: 2.4min\n", + "[Parallel(n_jobs=-1)]: Done 24218 tasks | elapsed: 2.5min\n", + "[Parallel(n_jobs=-1)]: Done 25130 tasks | elapsed: 2.6min\n", + "[Parallel(n_jobs=-1)]: Done 26042 tasks | elapsed: 2.7min\n", + "[Parallel(n_jobs=-1)]: Done 26986 tasks | elapsed: 2.8min\n", + "[Parallel(n_jobs=-1)]: Done 27930 tasks | elapsed: 2.9min\n", + "[Parallel(n_jobs=-1)]: Done 28906 tasks | elapsed: 3.1min\n", + "[Parallel(n_jobs=-1)]: Done 29882 tasks | elapsed: 3.2min\n", + "[Parallel(n_jobs=-1)]: Done 30890 tasks | elapsed: 3.3min\n", + "[Parallel(n_jobs=-1)]: Done 31898 tasks | elapsed: 3.4min\n", + "[Parallel(n_jobs=-1)]: Done 32938 tasks | elapsed: 3.6min\n", + "[Parallel(n_jobs=-1)]: Done 33978 tasks | elapsed: 3.7min\n", + "[Parallel(n_jobs=-1)]: Done 35050 tasks | elapsed: 3.8min\n", + "[Parallel(n_jobs=-1)]: Done 36122 tasks | elapsed: 3.9min\n", + "[Parallel(n_jobs=-1)]: Done 37226 tasks | elapsed: 4.1min\n", + "[Parallel(n_jobs=-1)]: Done 38330 tasks | elapsed: 4.2min\n", + "[Parallel(n_jobs=-1)]: Done 39466 tasks | elapsed: 4.3min\n", + "[Parallel(n_jobs=-1)]: Done 40602 tasks | elapsed: 4.5min\n", + "[Parallel(n_jobs=-1)]: Done 41770 tasks | elapsed: 4.6min\n", + "[Parallel(n_jobs=-1)]: Done 42938 tasks | 
elapsed: 4.8min\n", + "[Parallel(n_jobs=-1)]: Done 44138 tasks | elapsed: 4.9min\n", + "[Parallel(n_jobs=-1)]: Done 45338 tasks | elapsed: 5.1min\n", + "[Parallel(n_jobs=-1)]: Done 46570 tasks | elapsed: 5.2min\n", + "[Parallel(n_jobs=-1)]: Done 47802 tasks | elapsed: 5.4min\n", + "[Parallel(n_jobs=-1)]: Done 49066 tasks | elapsed: 5.5min\n", + "[Parallel(n_jobs=-1)]: Done 50330 tasks | elapsed: 5.7min\n", + "[Parallel(n_jobs=-1)]: Done 51626 tasks | elapsed: 5.8min\n", + "[Parallel(n_jobs=-1)]: Done 52500 out of 52500 | elapsed: 6.0min finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 20, 'min_samples_split': 70}\n", + "\n", + "DecisionTreeClassifier *********************************************************************************************************\n", + "\n", + "********* CROSS-VALIDATION ***********\n", + "Média das Acurácias calculadas pelo CV....: 87.14\n", + "std médio das Acurácias calculadas pelo CV: 4.33\n", + "\n", + "********* IMPORTÂNCIA DAS COLUNAS ***********\n", + " coluna importancia\n", + "12 v13 0.735896\n", + "0 v1 0.135030\n", + "9 v10 0.090888\n", + "6 v7 0.025768\n", + "1 v2 0.012418\n", + "3 v4 0.000000\n", + "4 v5 0.000000\n", + "5 v6 0.000000\n", + "7 v8 0.000000\n", + "8 v9 0.000000\n", + "10 v11 0.000000\n", + "11 v12 0.000000\n", + "2 v3 0.000000\n", + "13 v14 0.000000\n", + "14 v15 0.000000\n", + "15 v16 0.000000\n", + "16 v17 0.000000\n", + "17 v18 0.000000\n", + "\n", + "********* CONFUSION MATRIX - PARAMETER TUNNING ***********\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gmCkjGjPJMLr" + }, + "source": [ + "### Visualizar o resultado" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cIc3ZgaISEd0", + "outputId": "b6511d91-c6b6-4faa-ea66-73f0b6ecca4b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 753 + } + }, + "source": [ + "from sklearn.tree import export_graphviz\n", + "from sklearn.externals.six import StringIO \n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "dot_data = StringIO()\n", + "export_graphviz(ml_DT2, out_file = dot_data, filled = True, rounded = True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n", + "\n", + "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", + "graph.write_png('DecisionTree.png')\n", + "Image(graph.create_png())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e1R2GBkbnV37" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ukMLoEr7nbUf", + "outputId": "8d728e73-211e-44d7-f025-be3523e21cd0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "X_treinamento_DT, X_teste_DT = seleciona_colunas_relevantes(ml_DT2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "********** COLUNAS Relevantes ******\n", + "[ 0 9 12]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8JjePRQAoqkk" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", 
+ "metadata": { + "id": "Gt3aCPpfKRxm", + "outputId": "f3cdfdc8-c5f5-401f-bd4e-5cd19149e689", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 103 + } + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': 'entropy',\n", + " 'max_depth': None,\n", + " 'max_leaf_nodes': None,\n", + " 'min_samples_leaf': 20,\n", + " 'min_samples_split': 70}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 43 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq6uCVtzovMt", + "outputId": "f97be011-6ddc-45c9-9c57-ea995e6309c2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 120 + } + }, + "source": [ + "# Treina usando as COLUNAS relevantes...\n", + "ml_DT2.fit(X_treinamento_DT, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=70, min_samples_split=20,\n", + " min_weight_fraction_leaf=0.0, presort='deprecated',\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 44 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M2h3EpinRD5Q", + "outputId": "1ac155a6-6003-4efe-9168-5c071cc7b917", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-Validation com 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_DT2, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 87.14\n", + "std médio das Acurácias calculadas pelo CV: 4.33\n" + ], + "name": "stdout" + } + ] + }, 
+ { + "cell_type": "code", + "metadata": { + "id": "znWy3LE1q-Z3", + "outputId": "c9f15800-e330-4cfc-ede2-afcee536b089", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "ml_DT3, best_params2 = GridSearchOptimizer(ml_DT2, 'ml_DT2', d_parametros_DT, X_treinamento_DT, y_treinamento, X_teste_DT, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 5250 candidates, totalling 52500 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0129s.) Setting batch_size=2.\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.0s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0358s.) Setting batch_size=4.\n", + "[Parallel(n_jobs=-1)]: Done 16 tasks | elapsed: 0.1s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0443s.) Setting batch_size=8.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0788s.) Setting batch_size=16.\n", + "[Parallel(n_jobs=-1)]: Done 44 tasks | elapsed: 0.2s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1434s.) 
Setting batch_size=32.\n", + "[Parallel(n_jobs=-1)]: Done 156 tasks | elapsed: 0.7s\n", + "[Parallel(n_jobs=-1)]: Done 380 tasks | elapsed: 2.1s\n", + "[Parallel(n_jobs=-1)]: Done 668 tasks | elapsed: 3.6s\n", + "[Parallel(n_jobs=-1)]: Done 956 tasks | elapsed: 5.2s\n", + "[Parallel(n_jobs=-1)]: Done 1308 tasks | elapsed: 7.1s\n", + "[Parallel(n_jobs=-1)]: Done 1660 tasks | elapsed: 9.0s\n", + "[Parallel(n_jobs=-1)]: Done 2076 tasks | elapsed: 10.7s\n", + "[Parallel(n_jobs=-1)]: Done 2492 tasks | elapsed: 11.8s\n", + "[Parallel(n_jobs=-1)]: Done 2972 tasks | elapsed: 13.1s\n", + "[Parallel(n_jobs=-1)]: Done 3452 tasks | elapsed: 14.5s\n", + "[Parallel(n_jobs=-1)]: Done 3996 tasks | elapsed: 16.1s\n", + "[Parallel(n_jobs=-1)]: Done 4540 tasks | elapsed: 17.6s\n", + "[Parallel(n_jobs=-1)]: Done 5148 tasks | elapsed: 19.3s\n", + "[Parallel(n_jobs=-1)]: Done 5756 tasks | elapsed: 21.1s\n", + "[Parallel(n_jobs=-1)]: Done 6428 tasks | elapsed: 22.9s\n", + "[Parallel(n_jobs=-1)]: Done 7100 tasks | elapsed: 24.8s\n", + "[Parallel(n_jobs=-1)]: Done 7836 tasks | elapsed: 26.9s\n", + "[Parallel(n_jobs=-1)]: Done 8572 tasks | elapsed: 28.8s\n", + "[Parallel(n_jobs=-1)]: Done 9372 tasks | elapsed: 31.1s\n", + "[Parallel(n_jobs=-1)]: Done 10172 tasks | elapsed: 33.4s\n", + "[Parallel(n_jobs=-1)]: Done 11036 tasks | elapsed: 35.9s\n", + "[Parallel(n_jobs=-1)]: Done 11900 tasks | elapsed: 38.3s\n", + "[Parallel(n_jobs=-1)]: Done 12828 tasks | elapsed: 41.0s\n", + "[Parallel(n_jobs=-1)]: Done 13756 tasks | elapsed: 43.7s\n", + "[Parallel(n_jobs=-1)]: Done 14748 tasks | elapsed: 46.6s\n", + "[Parallel(n_jobs=-1)]: Done 15740 tasks | elapsed: 49.5s\n", + "[Parallel(n_jobs=-1)]: Done 16796 tasks | elapsed: 52.5s\n", + "[Parallel(n_jobs=-1)]: Done 17852 tasks | elapsed: 55.6s\n", + "[Parallel(n_jobs=-1)]: Done 18972 tasks | elapsed: 59.0s\n", + "[Parallel(n_jobs=-1)]: Done 20092 tasks | elapsed: 1.0min\n", + "[Parallel(n_jobs=-1)]: Done 21276 tasks | elapsed: 1.1min\n", + 
"[Parallel(n_jobs=-1)]: Done 22460 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 23708 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 24956 tasks | elapsed: 1.3min\n", + "[Parallel(n_jobs=-1)]: Done 26268 tasks | elapsed: 1.3min\n", + "[Parallel(n_jobs=-1)]: Done 27580 tasks | elapsed: 1.4min\n", + "[Parallel(n_jobs=-1)]: Done 28956 tasks | elapsed: 1.5min\n", + "[Parallel(n_jobs=-1)]: Done 30332 tasks | elapsed: 1.6min\n", + "[Parallel(n_jobs=-1)]: Done 31772 tasks | elapsed: 1.6min\n", + "[Parallel(n_jobs=-1)]: Done 33212 tasks | elapsed: 1.7min\n", + "[Parallel(n_jobs=-1)]: Done 34716 tasks | elapsed: 1.8min\n", + "[Parallel(n_jobs=-1)]: Done 36220 tasks | elapsed: 1.8min\n", + "[Parallel(n_jobs=-1)]: Done 37788 tasks | elapsed: 1.9min\n", + "[Parallel(n_jobs=-1)]: Done 39356 tasks | elapsed: 2.0min\n", + "[Parallel(n_jobs=-1)]: Done 40988 tasks | elapsed: 2.1min\n", + "[Parallel(n_jobs=-1)]: Done 42620 tasks | elapsed: 2.2min\n", + "[Parallel(n_jobs=-1)]: Done 44316 tasks | elapsed: 2.3min\n", + "[Parallel(n_jobs=-1)]: Done 46012 tasks | elapsed: 2.4min\n", + "[Parallel(n_jobs=-1)]: Done 47772 tasks | elapsed: 2.4min\n", + "[Parallel(n_jobs=-1)]: Done 49532 tasks | elapsed: 2.5min\n", + "[Parallel(n_jobs=-1)]: Done 51356 tasks | elapsed: 2.6min\n", + "[Parallel(n_jobs=-1)]: Done 52500 out of 52500 | elapsed: 2.7min finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 60, 'min_samples_split': 2}\n", + "\n", + "DecisionTreeClassifier *********************************************************************************************************\n", + "\n", + "********* CROSS-VALIDATION ***********\n", + "Média das Acurácias calculadas pelo CV....: 89.29\n", + "std médio das Acurácias calculadas pelo CV: 2.73\n", + "\n", + "********* IMPORTÂNCIA DAS COLUNAS ***********\n", + " coluna 
importancia\n", + "2 v3 0.691283\n", + "0 v1 0.177569\n", + "1 v2 0.131148\n", + "\n", + "********* CONFUSION MATRIX - PARAMETER TUNNING ***********\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IhCC6pfq-jL", + "outputId": "cd1291ea-34ae-4c8a-f617-aca1daf4e5d7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 103 + } + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': 'entropy',\n", + " 'max_depth': None,\n", + " 'max_leaf_nodes': None,\n", + " 'min_samples_leaf': 20,\n", + " 'min_samples_split': 70}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 47 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qw6Dk3kesT0q", + "outputId": "782be18c-e89e-438a-fe48-8eb48a266fdc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 103 + } + }, + "source": [ + "best_params2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': 'entropy',\n", + " 'max_depth': None,\n", + " 'max_leaf_nodes': None,\n", + " 'min_samples_leaf': 60,\n", + " 'min_samples_split': 2}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 48 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YFoK1ZGrRHf3", + "outputId": "bcdcb58f-3499-4e2d-ec50-88d117d0eeee", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-Validation com 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_DT3, X_treinamento_DT, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 89.29\n", + "std médio das Acurácias calculadas pelo CV: 2.73\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MZ1-vGRcxJoN" + }, + "source": [ + "## Valida o modelo usando o dataframe X_teste" + ] + }, + { + "cell_type": "code", 
+ "metadata": { + "id": "ig9GiUAEw9jr" + }, + "source": [ + "y_pred_DT = ml_DT2.predict(X_teste_DT)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7UZz4UzHDqae", + "outputId": "68be0407-57ec-4542-ef23-09eef53d95c9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Compute the accuracy\n", + "accuracy_score(y_teste, y_pred_DT)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9333333333333333" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 51 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3EUMAxxKBur" + }, + "source": [ + "___\n", + "# **RANDOM FOREST**\n", + "* Decision Trees have a tree-shaped structure.\n", + "* Random Forest can be used for both classification (RandomForestClassifier) and regression (RandomForestRegressor).\n", + "\n", + "* **Advantages**:\n", + " * Does not require much data preprocessing;\n", + " * Handles both categorical and numerical COLUMNS well;\n", + " * It is a Bagging Ensemble Method (not a boosting one): it builds many trees, each trained independently on a bootstrap sample of the rows and considering a random subset of the columns at each split, and combines their predictions by voting (classification) or averaging (regression);\n", + " * More robust than a single Decision Tree. **Why?**\n", + " * Helps control overfitting (**why?**) and often produces very robust, high-performance models.\n", + " * Can be used for Feature Selection, since it produces the feature importances (`feature_importances_`). In scikit-learn the importances sum to 1 (that is, 100%);\n", +
" * Like Decision Trees, these models easily capture non-linear patterns in the data;\n", + " * Does not require the data to be normalized;\n", + " * Can handle Missing Values in some implementations (older scikit-learn releases require imputing them first);\n", + " * Requires no assumptions about the distribution of the data, because the algorithm is non-parametric.\n", + "\n", + "* **Disadvantages**\n", + " * Can be biased toward the majority class when the classes are imbalanced. **Balancing the dataframe beforehand is recommended to avoid this problem**.\n", + "\n", + "* **Main parameters**\n", + "\n", + "## **References**:\n", + "* [Running Random Forests? Inspect the feature importances with this code](https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e)\n", + "* [Feature importances with forests of trees](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)\n", + "* [Understanding Random Forests Classifiers in Python](https://www.datacamp.com/community/tutorials/random-forests-classifier-python)\n", + "* [Understanding Random Forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)\n", + "* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76)\n", + "* [Random Forest Simple Explanation](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)\n", + "* [Random Forest Explained](https://www.youtube.com/watch?v=eM4uJ6XGnSM)\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) - Explains the main Random Forest parameters."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cnfDw_GEKBuu", + "outputId": "f1658397-246a-42de-99c2-4bb57166dece", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 154 + } + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate...\n", + "ml_RF= RandomForestClassifier(n_estimators=100, min_samples_split= 2, max_features=\"auto\", random_state= i_Seed)\n", + "\n", + "# Train...\n", + "ml_RF.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=None, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, n_estimators=100,\n", + " n_jobs=None, oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 52 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E25BIxM0RTzs", + "outputId": "1e988b9c-76c1-41a1-d21b-9d7d1188175f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_RF, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 96.28999999999999\n", + "std médio das Acurácias calculadas pelo CV: 2.94\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AouWUu8vANdb" + }, + "source": [ + "**Interpretation**: Our classifier (RandomForestClassifier) has a mean cross-validated accuracy of 96.29% on the training set. In addition, the standard deviation is about 2.94%, which is small. Let's try to improve the classifier's accuracy with hyperparameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vbducxlgAa85", + "outputId": "9207c10f-4166-4147-8f19-8a84bf941ee9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácias: [0.9 0.98571429 0.98571429 0.95714286 0.92857143 1.\n", + " 0.97142857 0.98571429 0.94285714 0.97142857]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lxx-LUw_5sd" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_RF.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQIRO_LpGAkw", + "outputId": "82b0a991-1af3-40ff-b22e-fc0c5829b74b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAccAAAIJCAYAAADQ9vbrAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOzdeZzV0x/H8ddnlvZ9lYpW7UlCWmhDUdo3lRBJkpCU/OxblhCSJEppkVDSooUU7SFttGuP9qZllvP7497GdJuZpm8zzcyd9/P3uA/3nu/3e875zo/5zOec8z3XnHOIiIjIf0JSuwMiIiJpjYKjiIhIAAVHERGRAAqOIiIiARQcRUREAig4ioiIBAhLiUqzXtVLz4dIundg6Xup3QWRZJElDEupulPi9/3xle+lWH+TSpmjiIhIgBTJHEVEJIOw4MyxgvOuRERELoAyRxER8c5SfXowRShzFBERCaDMUUREvAvSOUcFRxER8U7DqiIiIhmDMkcREfEuSIdVg/OuRERELoAyRxER8S5I5xwVHEVExDsNq4qIiGQMyhxFRMS7IB1WVeYoIiISQJmjiIh4F6RzjgqOIiLinYZVRUREMgZljiIi4l2QDqsG512JiIhcAGWOIiLineYcRUREMgZljiIi4l2QzjkqOIqIiHdBGhyD865EREQugDJHERHxLkQLckRERDIEBUcREfHOQpL/lZRmzUaa2V4z+yOeY4+ZmTOzAv7PZmZDzGyDmf1uZtXPVb+Co4iIeGeW/K+k+RRofHZ3rDhwM7AtTnEToKz/1R344FyVKziKiEi645ybD+yP59BbQD/AxSlrDox2PouAPGZWJLH6tSBHRES8S0OPcphZc2CHc+43OzMDLQr8Hefzdn/ZroTqUnAUEZE0xcy64xv+PG24c274Oa7JBjyJb0j1gik4ioiIdymwt6o/ECYaDONRGigJnM4aiwErzOxaYAdQPM65xfxlCVJwFBER79LIsKpzbhVQ6PRnM9sC1HDO/WNmU4BeZjYeuA445JxLcEgVtCBHRETSITMbB/wClDOz7WbWLZHTvwM2ARuAj4Ce56pfmaOIiHiXSl9Z5ZzreI7jJeK8d8CD51O/MkcREZEAyhxFRMS7NDLnmNwUHEVExLtUGlZNacEZ8kVERC6AMkcREfEuSIdVg/OuRERELoAyRxER8U5zjiIiIhmDMkcREfEuSOccFRxFRMS7IA2OwXlXIiIiF0CZo4iIeKcFOSIiIhmDMkcREfEuSOccFRxFRMQ7DauKiIhkDMocRUTEuyAdVg3OuxIREbkAyhxFRMS7IJ1zVHAUERHPLEiDo4ZVRUREAihzFBERz5Q5ioiIZBDKHEVExLvgTByVOYqIiARS5igiIp4F65yjgqOIiHgWrMFRw6oiIiIBlDmKiIhnyhxFREQyCGWOIiLiWbBmjgqOIiLiXXDGRg2rioiIBFLmKCIingXrsKoyRxERkQDKHEVExLNgzRwVHEVExLNgDY4aVhUREQmgzFFERDxT5igiIpJBKHMUERHvgjNxVOYoIiISSJmjiIh4FqxzjgqOIiLiWbAGRw2rioiIBFDmKCIinilzFBERySCUOYqIiHfBmTgqOIqIiHcaVhUREckgFBxFRMQzM0v2VxLbHWlme83sjzhlr5vZOjP73cy+MrM8cY4NMLMNZrbezG45V/0KjiIikh59CjQOKPseqOycqwr8CQwAMLOKQAegkv+aoWYWmljlCo4iIuJZamWOzrn5wP6AslnOuSj/x0VAMf/75sB459xJ59xmYANwbWL1KziKiIhnKREczay7mS2L8+ruoWv3ANP974sCf8c5tt1fliCtVhURkTTFOTccGO71ejMbCEQBY73WoeAoIiLepbEnOczsLqAp0NA55/zFO4DicU4r5i9LkIZVz0O+3NlZNL4/i8b3Z/P3L7Nx5ouxn8PDEp3bPW/rpj3HuDfujf3cslE1hj/XOVnbAOh
\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yKLHZ5_C6FJ8" + }, + "source": [ + "## Parameter tuning\n", + "### References\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - A didactic, step-by-step explanation of how to do parameter tuning.\n", + "* [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) - Another approach to understanding parameter tuning. Highly recommended reading!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XOa9naju6FKA" + }, + "source": [ + "# Parameter dictionary for the parameter tuning.\n", + "# Only 'bootstrap' is searched here; the full grid, commented out below, took about 194 minutes in an earlier run.\n", + "# Note: GridSearchOptimizer reads 'max_depth' and the other hyperparameters from best_params_, so searching only 'bootstrap' raises KeyError: 'max_depth'.\n", + "d_parametros_RF= {'bootstrap': [True, False]} #,\n", + "# 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],\n", + "# 'max_features': ['auto', 'sqrt'],\n", + "# 'min_samples_leaf': [1, 2, 4],\n", + "# 'min_samples_split': [2, 5, 10],\n", + "# 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6__f2jZaTQat", + "outputId": "e43cf8dc-3af7-4726-b906-26adf07cfdd2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 579 + } + }, + "source": [ + "# Call the function\n", + "ml_RF2, best_params = GridSearchOptimizer(ml_RF, 'ml_RF2', d_parametros_RF, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 2 
candidates, totalling 20 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.4s\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.9s\n", + "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 2.1s\n", + "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 3.0s\n", + "[Parallel(n_jobs=-1)]: Done 20 out of 20 | elapsed: 4.4s remaining: 0.0s\n", + "[Parallel(n_jobs=-1)]: Done 20 out of 20 | elapsed: 4.4s finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'bootstrap': False}\n", + "\n", + "RandomForestClassifier *********************************************************************************************************\n" + ], + "name": "stdout" + }, + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Invoca a função\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mml_RF2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbest_params\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mGridSearchOptimizer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mml_RF\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'ml_RF2'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0md_parametros_RF\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_treinamento\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_treinamento\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_teste\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_teste\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mi_CV\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0ml_colunas\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mGridSearchOptimizer\u001b[0;34m(modelo, ml_Opt, d_Parametros, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)\u001b[0m\n\u001b[1;32m 21\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'\\nRandomForestClassifier *********************************************************************************************************'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 22\u001b[0m ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'], \n\u001b[0;32m---> 23\u001b[0;31m \u001b[0mmax_depth\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'max_depth'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 24\u001b[0m \u001b[0mmax_features\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'max_features'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 25\u001b[0m \u001b[0mmin_samples_leaf\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'min_samples_leaf'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'max_depth'" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "crfn-n--KG4n" + }, + "source": [ + "### Random Forest execution result\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 7920 out of 7920 | elapsed: 194.0min finished\n", + "best_params= 
{'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SGTOe5PaRw59" + }, + "source": [ + "# The grid search above took 194 minutes to run, so ml_RF2 is built below directly from the best parameters it found\n", + "# Note: max_features='auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HMJcAdLlTQa0" + }, + "source": [ + "## Visualize the result\n", + "> Implement the RandomForest visualization." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WWNiy7Z0TQa3" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kOi11YOKTQa4" + }, + "source": [ + "X_treinamento_RF, X_teste_RF = seleciona_colunas_relevantes(ml_RF2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_O7c_DTQbE" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UwEOwzSGTQbF" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rr8qDrgvTQbL" + }, + "source": [ + "# Train with the relevant COLUMNS...\n", + "ml_RF2.fit(X_treinamento_RF, y_treinamento)\n", + "\n", + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_RF2, X_treinamento_RF, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-mYfQLlsTQbQ" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sSD5o1JQTQbR" + }, + "source": [ + "y_pred_RF = ml_RF2.predict(X_teste_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wywF6LymDzKr" + }, + "source": [ + "# Compute the accuracy\n", + "accuracy_score(y_teste, y_pred_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hJJsL0IJb6iO" + }, + "source": [ + "## Study of how the algorithm's parameters behave\n", + "> See [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) for more details." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "navUWMwHi44D" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name=\"n_estimators\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label = \"Training score\", color = \"black\")\n", + "plt.plot(param_range, test_mean, label = \"Cross-validation score\", color = \"dimgrey\")\n", + "\n", + "# Plot accuracy bands (mean +/- one standard deviation) for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color = \"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color = \"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Number Of Trees\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc = \"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rv7TIM9kjsud" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = 
validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name = \"max_depth\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "# Calculate mean and standard deviation for training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands (mean +/- one standard deviation) for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot (the x axis is max_depth in this cell, not the number of trees)\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Max Depth\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lm_fPGYwkJYc" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name='min_samples_leaf', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for 
training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"min_samples_leaf\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CAqdiSaVlAB8" + }, + "source": [ + "param_range = np.arange(0.05, 1, 0.05)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name='min_samples_split', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = 
np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"min_samples_split\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cX_gfsbQSdNd" + }, + "source": [ + "___\n", + "# **BOOSTING MODELS**\n", + "* These algorithms are widely used in Kaggle competitions;\n", + "* They are used to improve the performance of Machine Learning algorithms;\n", + "* Models:\n", + " - [X] AdaBoost\n", + " - [X] XGBoost\n", + " - [X] LightGBM\n", + " - [X] GradientBoosting\n", + " - [X] CatBoost\n", + "\n", + "## Bagging vs Boosting vs Stacking\n", + "### **Bagging**\n", + "* The goal is to reduce variance;\n", + "\n", + "#### How it works\n", + "* Draws several samples **WITH REPLACEMENT** from the training dataframe. Each sample is used to train a model using Decision Trees. As a result, we have an ensemble of many different models (Decision Trees). 
The average of these many different models (Decision Trees) is used to produce the final result;\n", + "* The final result is more robust than using a single Decision Tree.\n", + "\n", + "![Bagging](https://github.com/MathMachado/Materials/blob/master/Bagging.png?raw=true)\n", + "\n", + "Source: [Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm](https://hackernoon.com/how-to-develop-a-robust-algorithm-c38e08f32201).\n", + "\n", + "#### Steps\n", + "* Assume a dataframe X_treinamento (the training dataframe) containing N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Bagging randomly draws a sample **WITH REPLACEMENT** from X_treinamento;\n", + " 2. Bagging randomly selects M2 (M2 < M) COLUMNS from the dataframe drawn in step (1);\n", + " 3. It builds a Decision Tree with the M2 COLUMNS from step (2) and the dataframe obtained in step (1), and the COLUMNS are evaluated by their ability to classify the observations;\n", + " 4. Steps (1) --> (2) --> (3) are repeated K times (that is, K Decision Trees), so that the COLUMNS are ranked by their predictive power and the final result (accuracy, for example) is obtained by aggregating the predictions of the K Decision Trees.\n", + "\n", + "#### Advantages\n", + "* Reduces overfitting;\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Handles Missing Values automatically;\n", + "\n", + "#### Disadvantage\n", + "* The final prediction is based on the average of the K Decision Trees, which can compromise the final accuracy.\n", + "\n", + "___ \n", + "### **Boosting**\n", + "* The goal is to improve accuracy;\n", + "\n", + "#### How it works\n", + "* The classifiers are used sequentially, so that the classifier at step N learns from the errors of the classifier at step N-1. 
In other words, the goal is to improve precision/accuracy at each step by learning from the past.\n", + "\n", + "![Boosting](https://github.com/MathMachado/Materials/blob/master/Boosting.png?raw=true)\n", + "\n", + "Source: [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205), Joseph Rocca\n", + ".\n", + "\n", + "#### Steps\n", + "* Assume a dataframe X_treinamento (the training dataframe) containing N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Boosting randomly draws a sample D1 WITHOUT replacement from X_treinamento;\n", + " 2. Boosting trains the classifier C1;\n", + " 3. Boosting randomly draws the SECOND sample D2 WITHOUT replacement from X_treinamento and adds to D2 50% of the observations that were misclassified, in order to train the classifier C2;\n", + " 4. Boosting finds in X_treinamento the sample D3 on which the classifiers C1 and C2 disagree and trains C3;\n", + " 5. It combines (by vote) the predictions of C1, C2 and C3 to produce the final result.\n", + "\n", + "#### Advantages\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Handles Missing Values automatically;\n", + "\n", + "#### Disadvantages\n", + "* Prone to overfitting. Treating outliers beforehand is recommended.\n", + "* Requires careful tuning of the hyperparameters;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9fgUrkmPk4dr" + }, + "source": [ + "___\n", + "# STACKING\n", + "\n", + "![Stacking](https://github.com/MathMachado/Materials/blob/master/Stacking.png?raw=true)\n", + "\n", + "Where is the reference for this figure???" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B0jxx3ETpOdm" + }, + "source": [ + "___\n", + "# **BOOTSTRAPPING METHODS**\n", + "> Before talking about Boosting or Bagging, we first need to understand what Bootstrap is, since both (Boosting and Bagging) are based on Bootstrap.\n", + "\n", + "* In Statistics (and in Machine Learning), Bootstrap refers to drawing random samples WITH replacement from the population X." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyqazmUuifkE" + }, + "source": [ + "___\n", + "# **ADABOOST (Adaptive Boosting)**\n", + "* When nothing else works, AdaBoost works!\n", + "* It was one of the first Boosting algorithms (1995);\n", + "* AdaBoost can be used for both classification (AdaBoostClassifier) and regression (AdaBoostRegressor);\n", + "* AdaBoost uses DecisionTree algorithms as its base_estimator by default;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RU-vzkXqrFVw" + }, + "source": [ + "## References\n", + "* [AdaBoost Classifier Example In Python](https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464) - Didactic; explains exactly how AdaBoost works.\n", + "* [Adaboost for Dummies: Breaking Down the Math (and its Equations) into Simple Terms](https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf) - For those who want to understand the math behind the algorithm.\n", + "* [Gradient Boosting and XGBoost](https://medium.com/hackernoon/gradient-boosting-and-xgboost-90862daa6c77)\n", + "* [Understanding AdaBoost](https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe), Akash Desarda.\n", + "* [AdaBoost Classifier Example In Python](https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6EMrjQDZIMl_" + }, + "source": [ + "## O 
que é AdaBoost (Adaptive Boosting)?\n", + "* It is an ensemble classifier (it combines several classifiers to increase accuracy).\n", + "* AdaBoost is a strong, iterative classifier that combines (ensembles) several weak classifiers to improve accuracy.\n", + "* Any machine learning algorithm that accepts sample weights can be used as the base classifier (base_estimator parameter);\n", + "\n", + "## Most important AdaBoost parameters:\n", + "* base_estimator - The classifier used to train the model. By default, AdaBoost uses DecisionTreeClassifier. As mentioned earlier, different algorithms can be used for this purpose.\n", + "* n_estimators - Number of base_estimator instances to train iteratively.\n", + "* learning_rate - Controls the contribution of each base_estimator to the final solution/combination;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TzLtHzWNJBix" + }, + "source": [ + "## Using different algorithms for base_estimator\n", + "> As mentioned earlier, several kinds of base_estimator can be used in AdaBoost. 
For example, if we want to use SVM (Support Vector Machines), we proceed as follows:\n", + "\n", + "\n", + "```\n", + "# Import the base_estimator class (AdaBoostClassifier itself comes from sklearn.ensemble)\n", + "from sklearn.svm import SVC\n", + "\n", + "# Instantiate the base classifier (algorithm)\n", + "ml_SVC= SVC(probability=True, kernel='linear')\n", + "\n", + "# Build the AdaBoost model\n", + "ml_AB = AdaBoostClassifier(n_estimators= 50, base_estimator=ml_SVC, learning_rate=1)\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hrj4a4s6hMMB" + }, + "source": [ + "## Advantages\n", + "* AdaBoost is easy to implement;\n", + "* AdaBoost iteratively corrects the errors of the base_estimator and improves accuracy;\n", + "* It performs Feature Selection automatically (**Why**?);\n", + "* Many algorithms can be used as the base_estimator;\n", + "* Since it is an ensemble method, the final model is not very prone to overfitting.\n", + "\n", + "## Disadvantages\n", + "* AdaBoost is sensitive to noise in the data;\n", + "* It is highly affected by outliers (which contributes to overfitting), because the algorithm tries to fit every point as well as possible;\n", + "* AdaBoost is slower than XGBoost;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bgJmu7YLiyv7" + }, + "source": [ + "In the following example, I will use RandomForestClassifier with the optimized parameters, that is:\n", + "\n", + "```\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5VCRNyZT3qvc" + }, + "source": [ + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1gIboJdriq61" + }, + "source": [ + "from sklearn.ensemble import 
AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate RandomForestClassifier - optimized parameters!\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)\n", + "# Instantiate AdaBoostClassifier\n", + "ml_AB= AdaBoostClassifier(n_estimators=100, base_estimator= ml_RF2, random_state= i_Seed)\n", + "\n", + "# Train...\n", + "ml_AB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tBOuTywWRm91" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_AB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7Ce5L38ECoC" + }, + "source": [ + "**Interpretation**: Our classifier (AdaBoostClassifier) has a mean accuracy of 96.72% (training set). In addition, the std is on the order of 2.54%, that is, small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t5GfnBwEifkO" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q9rSpuXyEPA5" + }, + "source": [ + "# Make predictions with the optimized parameters...\n", + "y_pred = ml_AB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2F9k-_eXGDLa" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XweWTjQ9EXLw" + }, + "source": [ + "## Parameter tuning" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fcrKzse9EbL_" + }, + "source": [ + "# Parameter dictionary for parameter tuning.\n", + "d_parametros_AB = {'n_estimators':[50, 100, 200], 'learning_rate':[.001, 0.01, 0.05, 0.1, 0.3,1]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Susc3I7mFDQX" + }, + "source": [ + "# Call the function\n", + "ml_AB2, best_params= GridSearchOptimizer(ml_AB, 'ml_AB2', d_parametros_AB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w4JjWsusjNS8" + }, + "source": [ + "___\n", + "# **GRADIENT BOOSTING**\n", + "* Gradient Boosting can be used to solve classification (GradientBoostingClassifier) and regression (GradientBoostingRegressor) problems;\n", + "* Gradient Boosting is a refinement of AdaBoost (recall that AdaBoost was one of the first Boosting methods, created in 1995). 
What Gradient Boosting does on top of AdaBoost is minimize the loss (loss function), i.e., minimize the difference between the observed values of y and the predicted values.\n", + "* It uses Gradient Descent to find the shortcomings in the predictions of the previous step. Gradient Descent is a popular and powerful algorithm that is also used in Neural Networks;\n", + "* The goal of Gradient Boosting is to minimize the \"loss function\". Therefore, Gradient Boosting depends on the \"loss function\".\n", + "* Gradient Boosting uses DecisionTree algorithms as the base_estimator;\n", + "\n", + "## Advantages\n", + "* No pre-processing is needed;\n", + "* Works with numeric or categorical COLUMNS;\n", + "* Handles Missing Values automatically. That is, there is no need to apply Missing Value Imputation methods;\n", + "\n", + "## Disadvantages\n", + "* Since Gradient Boosting continually tries to minimize the errors at each iteration, it can overemphasize outliers and cause overfitting. Therefore, one should:\n", + " * Treat the outliers beforehand OR\n", + " * Use Cross-Validation to neutralize the effects of the outliers (**I prefer this method, because it takes less time**);\n", + "* Computationally expensive. 
Many trees (> 1000) are usually needed to obtain good results;\n", + "* Because of its flexibility (many parameters to tune), GridSearchCV is needed to find the optimal combination of hyperparameters;\n", + "\n", + "## References\n", + "* [Gradient Boosting Decision Tree Algorithm Explained](https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4) - Didactic and detailed.\n", + "* [Predicting Wine Quality with Gradient Boosting Machines](https://towardsdatascience.com/predicting-wine-quality-with-gradient-boosting-machines-a-gmb-tutorial-d950b1542065)\n", + "* [Parameter Tuning in Gradient Boosting (GBM) with Python](https://www.datacareer.de/blog/parameter-tuning-in-gradient-boosting-gbm/)\n", + "* [Tune Learning Rate for Gradient Boosting with XGBoost in Python](https://machinelearningmastery.com/tune-learning-rate-for-gradient-boosting-with-xgboost-in-python/)\n", + "* [In Depth: Parameter tuning for Gradient Boosting](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae) - Very good\n", + "* [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q4bUCZs2jNTA" + }, + "source": [ + "from sklearn.ensemble import GradientBoostingClassifier\n", + "\n", + "# Instantiate...\n", + "ml_GB = GradientBoostingClassifier(n_estimators = 100, min_samples_split = 2)\n", + "\n", + "# Train... 
\n", + "ml_GB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PKOG1ugSRvLM" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_GB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VlC3y3M5YaGG" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vnLvQ0ZDYNjB" + }, + "source": [ + "**Interpretation**: Our classifier (GradientBoostingClassifier) has a mean accuracy of 96.86% (training set). In addition, the std is on the order of 2.52%, that is, small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2n1RKZuXq3D" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_GB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8r6JCzQRGFa0" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KFv-Q2AD5uCk" + }, + "source": [ + "## Parameter tuning\n", + "> See [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/) for details on the parameters, their meaning, and so on." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wgU040AcjNTF" + }, + "source": [ + "# Parameter dictionary for parameter tuning.\n", + "d_parametros_GB= {'learning_rate': [1, 0.5, 0.25, 0.1, 0.05, 0.01]} #,\n", + "# 'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200],\n", + "# 'max_depth': [5, 10, 15, 20, 25, 30],\n", + "# 'min_samples_split': [0.1, 0.3, 0.5, 0.7, 0.9],\n", + "# 'min_samples_leaf': [0.1, 0.2, 0.3, 0.4, 0.5],\n", + "# 'max_features': list(range(1, X_treinamento.shape[1]))}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v5KLFlpTjNTH" + }, + "source": [ + "# Call the function\n", + "ml_GB2, best_params= GridSearchOptimizer(ml_GB, 'ml_GB2', d_parametros_GB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ6ERz3fi9i2" + }, + "source": [ + "### Result of running Gradient Boosting" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSa7uKw13mKG" + }, + "source": [ + "```\n", + "[Parallel(n_jobs=-1)]: Done 275400 out of 275400 | elapsed: 93.7min finished\n", + "\n", + "Parametros otimizados: {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wiJpA2PyjDjR" + }, + "source": [ + "# Since the procedure above took 93 minutes to run, I will build ml_GB2 below using the parameters estimated above\n", + "best_params= {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "\n", + "#ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + "# max_depth= best_params['max_depth'],\n", + "# max_features= best_params['max_features'],\n", + "# min_samples_leaf= 
best_params['min_samples_leaf'],\n", + "# min_samples_split= best_params['min_samples_split'],\n", + "# n_estimators= best_params['n_estimators'],\n", + "# random_state= i_Seed)\n", + "\n", + "ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + " max_depth= best_params['max_depth'],\n", + " min_samples_leaf= best_params['min_samples_leaf'],\n", + " min_samples_split= best_params['min_samples_split'],\n", + " n_estimators= best_params['n_estimators'],\n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mb14gJ7-jbVM" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TAqGZIFYm2sU" + }, + "source": [ + "X_treinamento_GB, X_teste_GB = seleciona_colunas_relevantes(ml_GB2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6yiu6dahnBvC" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "APrtWN18nc4t" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS0mLdOmnXAY" + }, + "source": [ + "# Train with the relevant COLUMNS\n", + "ml_GB2.fit(X_treinamento_GB, y_treinamento)\n", + "\n", + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_GB2, X_treinamento_GB, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vmc9PP_Rn1TN" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e3mnIALvnzP2" + }, + "source": [ + "y_pred_GB = ml_GB2.predict(X_teste_GB)\n", + "\n", + "# Compute accuracy\n", + "accuracy_score(y_teste, y_pred_GB)" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwP9Z2GnkV7r" + }, + "source": [ + "___\n", + "# **XGBOOST (eXtreme Gradient Boosting)**\n", + "* XGBoost é uma melhoria de Gradient Boosting. As melhorias são em velocidade e performace, além de corrigir as ineficiências do GradientBoosting.\n", + "* Algoritmo preferido pelos Kaggle Grandmasters;\n", + "* Paralelizável;\n", + "* Estado-da-arte em termos de Machine Learning;\n", + "\n", + "## Parâmetros relevantes e seus valores iniciais\n", + "Consulte [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) para detalhes completos sobre os parâmetros, significado e etc.\n", + "\n", + "* n_estimators = 100 (100 caso o dataframe for grande. Se o dataframe for médio/pequeno, então 1000) - É o número de árvores desejamos construir;\n", + "* max_depth= 3 - Determina quão profundo cada árvore pode crescer durante qualquer round de treinamento. Valores típicos no intervalo [3, 10];\n", + "* learning rate= 0.01 - Usado para evitar overfitting, intervalo: [0, 1];\n", + "* alpha (somente para problemas de Regressão) - L1 regularization nos pesos. Valores altos resulta em mais regularization;\n", + "* lambda (somente para problemas de Regressão) - L2 regularization nos pesos.\n", + "* colsample_bytree: 1 - porcentagem de COLUNAS usados por cada árvore. Alto valor pode causar overfitting;\n", + "* subsample: 0.8 - porcentagem de amostras usadas por árvore. Um valor baixo pode levar a overfitting;\n", + "* gamma: 1 - Controla se um determinado nó será dividido com base na redução esperada na perda após a divisão. Um valor mais alto leva a menos divisões.\n", + "* objective: Define a \"loss function\". 
The options are:\n", + " * reg:linear - For regression problems (deprecated in recent XGBoost releases in favor of reg:squarederror);\n", + " * reg:logistic - For classification problems;\n", + " * binary:logistic - For classification problems with probability output;\n", + "\n", + "## References\n", + "* [How exactly XGBoost Works?](https://medium.com/@pushkarmandot/how-exactly-xgboost-works-a320d9b8aeef)\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Gentle Introduction of XGBoost Library](https://medium.com/@imoisharma/gentle-introduction-of-xgboost-library-2b1ac2669680)\n", + "* [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7)\n", + "* [Exploring XGBoost](https://towardsdatascience.com/exploring-xgboost-4baf9ace0cf6)\n", + "* [Feature Importance and Feature Selection With XGBoost in Python](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/)\n", + "* [Ensemble Learning case study: Running XGBoost on Google Colab free GPU](https://towardsdatascience.com/running-xgboost-on-google-colab-free-gpu-a-case-study-841c90fef101) - Recommended\n", + "* [Predicting movie revenue with AdaBoost, XGBoost and LightGBM](https://towardsdatascience.com/predicting-movie-revenue-with-adaboost-xgboost-and-lightgbm-262eadee6daa)\n", + "* [Tuning XGBoost Hyperparameters with Scikit Optimize](https://towardsdatascience.com/how-to-improve-the-performance-of-xgboost-models-1af3995df8ad)\n", + "* [An Example of Hyperparameter Optimization on XGBoost, LightGBM and CatBoost using Hyperopt](https://towardsdatascience.com/an-example-of-hyperparameter-optimization-on-xgboost-lightgbm-and-catboost-using-hyperopt-12bc41a271e) - Interesting\n", + "* [XGBOOST vs LightGBM: Which algorithm wins the race !!!](https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d) - LightGBM tem se 
mostrado interessante.\n", + "* [From Zero to Hero in XGBoost Tuning](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58) - I liked it\n", + "* [Build XGBoost / LightGBM models on large datasets — what are the possible solutions?](https://towardsdatascience.com/build-xgboost-lightgbm-models-on-large-datasets-what-are-the-possible-solutions-bf882da2c27d)\n", + "* [Selecting Optimal Parameters for XGBoost Model Training](https://towardsdatascience.com/selecting-optimal-parameters-for-xgboost-model-training-c7cd9ed5e45e) - Very good!\n", + "* [CatBoost vs. Light GBM vs. XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db)\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iMM_R4_ukV7x" + }, + "source": [ + "from xgboost import XGBClassifier\n", + "import xgboost as xgb\n", + "\n", + "# Instantiate...\n", + "ml_XGB= XGBClassifier(silent=False, \n", + " scale_pos_weight=1,\n", + " learning_rate=0.01, \n", + " colsample_bytree = 1,\n", + " subsample = 0.8,\n", + " objective='binary:logistic', \n", + " n_estimators=1000, \n", + " reg_alpha = 0.3,\n", + " max_depth= 3, \n", + " gamma=1, \n", + " max_delta_step=5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E4wQMlDEFINR" + }, + "source": [ + "# Train...\n", + "ml_XGB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S77LljiQR_16" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_XGB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JNyKX6PkrXOk" + }, + "source": [ + "**Interpretation**: Our classifier (XGBClassifier) has a mean accuracy of 96.72% (training set). In addition, the std is on the order of 2.02%, that is, small. 
Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_h0QYv3FkV73" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AKhhAZLjkV76" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_XGB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ir2Kd1PqGHgz" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jEC7gW4qYpWw" + }, + "source": [ + "## Parameter tuning\n", + "### Further reading:\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)\n", + "\n", + "> Looking at the results above, which is the best model?\n", + "\n", + "XGBoost? Assuming so, let's now fine-tune the model's parameters." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n3MsUONPwIV9" + }, + "source": [ + "# Parameter dictionary for XGBoost:\n", + "d_parametros_XGB = {'min_child_weight': [i for i in np.arange(1, 13)]} #,\n", + "# 'gamma': [i for i in np.arange(0, 5, 0.5)],\n", + "# 'subsample': [0.6, 0.8, 1.0],\n", + "# 'colsample_bytree': [0.6, 0.8, 1.0],\n", + "# 'max_depth': [3, 4, 5, 7, 9],\n", + "# 'learning_rate': [i for i in np.arange(0.01, 1, 0.1)]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CX27FCKmwSni" + }, + "source": [ + "# Call the function\n", + "ml_XGB2, best_params= GridSearchOptimizer(ml_XGB, 'ml_XGB2', d_parametros_XGB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9b7uCuF74Hjv" + }, + "source": [ + "### Result of running XGBClassifier\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 108000 out of 108000 | elapsed: 372.0min finished\n", + "\n", + "Parametros otimizados: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n7E0oyxEtbGi" + }, + "source": [ + "# Since the procedure above took 372 minutes to run, I will build ml_XGB2 below using the parameters estimated above\n", + "best_params= {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "\n", + "ml_XGB2= XGBClassifier(min_child_weight= best_params['min_child_weight'], \n", + " gamma= best_params['gamma'], \n", + " subsample= best_params['subsample'], \n", + " colsample_bytree= best_params['colsample_bytree'], \n", + " max_depth= best_params['max_depth'], \n", + " learning_rate= best_params['learning_rate'], \n", + " random_state= i_Seed)" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CuqyLHTU5Z-j" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes\n", + "* [The Multiple faces of ‘Feature importance’ in XGBoost](https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QPG3JZIpRZ-T" + }, + "source": [ + "# plot feature importance\n", + "from xgboost import plot_importance\n", + "\n", + "xgb.plot_importance(ml_XGB2, color = 'red')\n", + "plt.title('importance', fontsize = 20)\n", + "plt.yticks(fontsize = 10)\n", + "plt.ylabel('features', fontsize = 20)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EmpRC2lHW-KP" + }, + "source": [ + "ml_XGB2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4f9MIEBiyq-5" + }, + "source": [ + "X_treinamento_XGB, X_teste_XGB= seleciona_colunas_relevantes(ml_XGB2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6EayWaY5nMm" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Huy18gKI5qad" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E3-PaTdc5vZk" + }, + "source": [ + "# Treina com as COLUNAS relevantes...\n", + "ml_XGB2.fit(X_treinamento_XGB, y_treinamento)\n", + "\n", + "# Cross-Validation com 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_XGB2, X_treinamento_XGB, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tBdYikDU6NhD" + }, + "source": [ + "## Valida o modelo usando o dataframe X_teste" + ] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "GcvY-VdL6VIZ" + }, + "source": [ + "y_pred_XGB = ml_XGB2.predict(X_teste_XGB)\n", + "\n", + "# Calcula acurácia\n", + "accuracy_score(y_teste, y_pred_XGB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8oLtdH-vTSbC" + }, + "source": [ + "xgb.to_graphviz(ml_XGB2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "czXQG3MCHfHM" + }, + "source": [ + "# KNN - KNEIGHBORSCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "llTTXNeyHiwx" + }, + "source": [ + "# BAGGINGCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fbkekd4QHoZO" + }, + "source": [ + "# EXTRATREESCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "widavwR4HzwE" + }, + "source": [ + "# SVM\n", + "https://data-flair.training/blogs/svm-support-vector-machine-tutorial/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "id_Ubulns6We" + }, + "source": [ + "# NAIVE BAYES" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ycu_EIGlYUYn" + }, + "source": [ + "import pandas as pd\n", + "\n", + "from xgboost import XGBClassifier\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", + "from sklearn.tree import ExtraTreeClassifier\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import GradientBoostingClassifier\n", + "from sklearn.ensemble import BaggingClassifier\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "from lightgbm import LGBMClassifier\n", + "\n", + "clfs = [XGBClassifier(), LGBMClassifier(), \n", + " ExtraTreesClassifier(), ExtraTreeClassifier(),\n", + " BaggingClassifier(), DecisionTreeClassifier(),\n", + " GradientBoostingClassifier(), LogisticRegression(),\n", + " 
AdaBoostClassifier(), RandomForestClassifier()]\n", + "\n", + "for clf in clfs:\n", + "    try:\n", + "        _ = mostra_feature_importances(clf, X_treinamento, y_treinamento, top_n=X_treinamento.shape[1], title=clf.__class__.__name__)\n", + "    except AttributeError as e:\n", + "        print(e)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EwWkjfC8KEZH" + }, + "source": [ + "# ENSEMBLE METHODS\n", + "https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b\n", + "\n", + "![Ensemble](https://github.com/MathMachado/Materials/blob/master/Ensemble.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Uf1RML7xETY" + }, + "source": [ + "# WOE and IV\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TBNRfYZCyhMP" + }, + "source": [ + "## Building the example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gIIroyyP4ZRZ" + }, + "source": [ + "df_y.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PzQQdrkf1ohX" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo']= choices(['A', 'B', 'C', 'D'], k= 1000)\n", + "df_X2['idade']= np.random.randint(10, 15, size= 1000)\n", + "df_X2['target']= df_y['target']\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v-OpwIpx4hXJ" + }, + "source": [ + "df_X2['target'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yZfqSvbKzeJ3" + }, + "source": [ + "def Constroi_Buckets(df, i, k= 10):\n", + "    coluna= 'v'+ str(i)\n", + "    df[coluna+'_Bucket']= pd.cut(df[coluna], bins= k, labels= np.arange(1, k+1))\n", + "    df= df.drop(columns= [coluna])\n", + "    return df" + ], + "execution_count": null, + "outputs": [] + }, + { +
"cell_type": "code", + "metadata": { + "id": "V6Nrpsx60HD3" + }, + "source": [ + "for i in np.arange(1,19):\n", + "    df_X2= Constroi_Buckets(df_X2, i)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J2Fbh41-03OB" + }, + "source": [ + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "O9r5BeWVxIr3" + }, + "source": [ + "# Function to compute WOE and IV\n", + "def calculate_woe_iv(dataset, feature, target):\n", + "\n", + "    # Rule-of-thumb bands normally used for a feature's total IV; here they label each bucket's contribution\n", + "    def codethem(IV):\n", + "        if IV < 0.02: return 'Useless'\n", + "        elif IV >= 0.02 and IV < 0.1: return 'Weak'\n", + "        elif IV >= 0.1 and IV < 0.3: return 'Medium'\n", + "        elif IV >= 0.3 and IV < 0.5: return 'Strong'\n", + "        elif IV >= 0.5: return 'Suspicious'\n", + "        else: return 'None'\n", + "\n", + "    lst = []\n", + "    for val in dataset[feature].unique():\n", + "        lst.append({\n", + "            'Value': val,\n", + "            'All': dataset[dataset[feature] == val].count()[feature],\n", + "            'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],\n", + "            'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]\n", + "        })\n", + "    \n", + "    dset = pd.DataFrame(lst)\n", + "    dset['Distr_Good'] = dset['Good']/dset['Good'].sum()\n", + "    dset['Distr_Bad'] = dset['Bad']/dset['Bad'].sum()\n", + "    dset['Mean']= dset['All']/dset['All'].sum()\n", + "    dset['WoE'] = np.log(dset['Distr_Good']/dset['Distr_Bad'])\n", + "    dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})\n", + "    dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']\n", + "    #dset= dset.drop(columns= ['Distr_Good', 'Distr_Bad'], axis= 1)\n", + "\n", + "    dset['Predictive_Power']= dset['IV'].map(codethem)\n", + "    iv = dset['IV'].sum()\n", + "    dset = dset.sort_values(by='IV')\n", + "    return dset, iv" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "Y8WGjWH63nx_" + }, + "source": [ + "df_Lab = df_X2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-N6xr1MgxTiz" + }, + "source": [ + "def calcula_Predictive_Power(df_Lab, coluna):\n", + "    print('WoE and IV for column: {}'.format(coluna))\n", + "    df, iv = calculate_woe_iv(df_Lab, coluna, 'target')\n", + "    print(df)\n", + "    print('IV score: {:.2f}'.format(iv))\n", + "    print('\\n')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ayqN_7WnxVq9" + }, + "source": [ + "for i in np.arange(1,19):\n", + "    coluna= 'v'+str(i)+'_Bucket'\n", + "    calcula_Predictive_Power(df_Lab, coluna)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtoJVI4Pyx3I" + }, + "source": [ + "# **IMBALANCED SAMPLE**\n", + "> Tasks such as detecting fraud in banking transactions or detecting network intrusions have one thing in common: the class of interest (what we want to detect) is usually a rare event.\n", + "\n", + "## Example: detecting fraud\n", + "The ratio of fraud to NON-FRAUD is roughly 1%/99%. In that case, if we build a fraud-detection model and it classifies every instance as NON-FRAUD, the model reaches 99% accuracy. However, such a model is of no help at all, because it never detects a single fraud.\n", + "\n", + "## The need for other metrics\n", + "> Other metrics are recommended (in fact, it is good practice to measure model performance with more than one metric), for example F1-Score, Precision, Recall (Sensitivity), Specificity and AUROC.\n", + "\n", + "## How to deal with an imbalanced sample?\n", + "* Under-sampling\n", + "> Randomly selects instances of the MAJORITY class (NON-FRAUD in our example) until their number matches the number of instances of the MINORITY class (FRAUD);\n", + "\n", + "* Over-sampling\n", + "> Randomly resamples the MINORITY class (FRAUD in our example) until it reaches the number of instances of the MAJORITY class (NON-FRAUD), or a chosen proportion of it. See also SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic minority instances instead of repeating existing ones;\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2o45zx8zw-aB" + }, + "source": [ + "## EFFECTS OF THE IMBALANCED SAMPLE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCVTPCB-Xkbd" + }, + "source": [ + "# TPOT\n", + "https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2ulXii6JXpWd" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_TWUq-z4X4yZ" + }, + "source": [ + "___\n", + "# FEATURETOOLS\n", + "https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183\n", + "\n", + "https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/\n", + "\n", + "https://mlwhiz.com/blog/2019/05/19/feature_extraction/\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aZUNOgmSgAmq" + }, + "source": [ + "!pip install featuretools" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "_sxdONzsh9rb" + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "p5_ynGo1dBJJ" + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TqJRJXUhiDqf" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo'] = choices(['A', 'B', 'C', 'D'], k = 1000)\n", + "df_X2['idade'] = np.random.randint(10, 15, size = 1000)\n", + "df_X2['id'] = range(0,1000)\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nR56bGGngk-W" + }, + "source": [ + "# Automated feature engineering\n", + "import featuretools as ft\n", + "import featuretools.variable_types as vtypes\n", + "\n", + "es= ft.EntitySet(id = 'simulacao')\n", + "\n", + "# adding a dataframe \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id')\n", + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IOJ4Tr5Ogk6M" + }, + "source": [ + "es['df_X2'].variables" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1uXPqHDZgkys" + }, + "source": [ + "variable_types = {'idade': vtypes.Categorical}\n", + " \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id', variable_types= variable_types)\n", + "\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'tipo', index='id')\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'idade', index='id')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dnbYTBqugkvm" + }, + "source": [ + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I2v_jetdgkr7" + }, + 
"source": [ + "feature_matrix, feature_names = ft.dfs(entityset=es, target_entity = 'df_X2', max_depth = 3, verbose = 3, n_jobs= 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zZiRBvHXgkoJ" + }, + "source": [ + "feature_matrix.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aWiahwKe2d6U" + }, + "source": [ + "# **EXERCÍCIOS**\n", + "> Encontre algoritmos adequados para ser aplicados aos seguintes problemas:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XbSLkbDB2mzK" + }, + "source": [ + "## Exercício 1 - Credit Card Fraud Detection\n", + "Source: [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)\n", + "\n", + "### Leitura suporte\n", + "* [Detecting Credit Card Fraud Using Machine Learning](https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8)\n", + "* [Credit Card Fraud Detection](https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59)\n", + "\n", + "### Dataframe\n", + "* [Creditcard.csv](https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JYVM3StS-g0E" + }, + "source": [ + "### Importar as libraries necessárias" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dyliPChh-jPk" + }, + "source": [ + "from sklearn.metrics import accuracy_score # para medir a acurácia do modelo preditivo\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix # para plotar a confusion matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV # para otimizar os parâmetros dos modelos preditivos\n", + "from sklearn.model_selection import cross_val_score\n", + "from time import time\n", + "from operator import 
itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from sklearn.tree import export_graphviz\n", + "from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases\n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": 9, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lAl9ZwP_0-d0" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv'\n", + "df_cc = pd.read_csv(url)" + ], + "execution_count": 10, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "w6lN8FjJ12VU", + "outputId": "0d825f64-aa57-4d12-ca42-e3909b82e384", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 379 + } + }, + "source": [ + "df_cc.head(10)" + ], + "execution_count": 11, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TimeV1V2V3V4V5V6V7V8V9V10V11V12V13V14V15V16V17V18V19V20V21V22V23V24V25V26V27V28AmountClass
00-1.359807-0.0727812.5363471.378155-0.3383210.4623880.2395990.0986980.3637870.090794-0.551600-0.617801-0.991390-0.3111691.468177-0.4704010.2079710.0257910.4039930.251412-0.0183070.277838-0.1104740.0669280.128539-0.1891150.133558-0.021053149.620.0
101.1918570.2661510.1664800.4481540.060018-0.082361-0.0788030.085102-0.255425-0.1669741.6127271.0652350.489095-0.1437720.6355580.463917-0.114805-0.183361-0.145783-0.069083-0.225775-0.6386720.101288-0.3398460.1671700.125895-0.0089830.0147242.690.0
21-1.358354-1.3401631.7732090.379780-0.5031981.8004990.7914610.247676-1.5146540.2076430.6245010.0660840.717293-0.1659462.345865-2.8900831.109969-0.121359-2.2618570.5249800.2479980.7716790.909412-0.689281-0.327642-0.139097-0.055353-0.059752378.660.0
31-0.966272-0.1852261.792993-0.863291-0.0103091.2472030.2376090.377436-1.387024-0.054952-0.2264870.1782280.507757-0.287924-0.631418-1.059647-0.6840931.965775-1.232622-0.208038-0.1083000.005274-0.190321-1.1755750.647376-0.2219290.0627230.061458123.500.0
42-1.1582330.8777371.5487180.403034-0.4071930.0959210.592941-0.2705330.8177390.753074-0.8228430.5381961.345852-1.1196700.175121-0.451449-0.237033-0.0381950.8034870.408542-0.0094310.798278-0.1374580.141267-0.2060100.5022920.2194220.21515369.990.0
52-0.4259660.9605231.141109-0.1682520.420987-0.0297280.4762010.260314-0.568671-0.3714071.3412620.359894-0.358091-0.1371340.5176170.401726-0.0581330.068653-0.0331940.084968-0.208254-0.559825-0.026398-0.371427-0.2327940.1059150.2538440.0810803.670.0
641.2296580.1410040.0453711.2026130.1918810.272708-0.0051590.0812130.464960-0.099254-1.416907-0.153826-0.7510630.1673720.050144-0.4435870.002821-0.611987-0.045575-0.219633-0.167716-0.270710-0.154104-0.7800550.750137-0.2572370.0345070.0051684.990.0
77-0.6442691.4179641.074380-0.4921990.9489340.4281181.120631-3.8078640.6153751.249376-0.6194680.2914741.757964-1.3238650.686133-0.076127-1.222127-0.3582220.324505-0.1567421.943465-1.0154550.057504-0.649709-0.415267-0.051634-1.206921-1.08533940.800.0
87-0.8942860.286157-0.113192-0.2715262.6695993.7218180.3701450.851084-0.392048-0.410430-0.705117-0.110452-0.2862540.074355-0.328783-0.210077-0.4997680.1187650.5703280.052736-0.073425-0.268092-0.2042331.0115920.373205-0.3841570.0117470.14240493.200.0
99-0.3382621.1195931.044367-0.2221870.499361-0.2467610.6515830.069539-0.736727-0.3668461.0176140.8363901.006844-0.4435230.1502190.739453-0.5409800.4766770.4517730.203711-0.246914-0.633753-0.120794-0.385050-0.0697330.0941990.2462190.0830763.680.0
\n", + "
" + ], + "text/plain": [ + " Time V1 V2 V3 ... V27 V28 Amount Class\n", + "0 0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0.0\n", + "1 0 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0.0\n", + "2 1 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0.0\n", + "3 1 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0.0\n", + "4 2 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0.0\n", + "5 2 -0.425966 0.960523 1.141109 ... 0.253844 0.081080 3.67 0.0\n", + "6 4 1.229658 0.141004 0.045371 ... 0.034507 0.005168 4.99 0.0\n", + "7 7 -0.644269 1.417964 1.074380 ... -1.206921 -1.085339 40.80 0.0\n", + "8 7 -0.894286 0.286157 -0.113192 ... 0.011747 0.142404 93.20 0.0\n", + "9 9 -0.338262 1.119593 1.044367 ... 0.246219 0.083076 3.68 0.0\n", + "\n", + "[10 rows x 31 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jrHdWYBryKDQ" + }, + "source": [ + "### Normalizar os nomes das colunas" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fGMvMcmYyUtV", + "outputId": "14ce428c-6846-428b-8413-cd4a46485779", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 103 + } + }, + "source": [ + "df_cc.columns" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',\n", + " 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',\n", + " 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',\n", + " 'Class'],\n", + " dtype='object')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 62 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pMGifxs2yGPL", + "outputId": "735fc6c3-c3ba-4cba-feba-d5461d47eedf", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + } + }, + "source": [ + "df_cc.columns = [coluna.lower() for 
coluna in df_cc.columns]\n", + "df_cc.head()" + ], + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
timev1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18v19v20v21v22v23v24v25v26v27v28amountclass
00-1.359807-0.0727812.5363471.378155-0.3383210.4623880.2395990.0986980.3637870.090794-0.551600-0.617801-0.991390-0.3111691.468177-0.4704010.2079710.0257910.4039930.251412-0.0183070.277838-0.1104740.0669280.128539-0.1891150.133558-0.021053149.620.0
101.1918570.2661510.1664800.4481540.060018-0.082361-0.0788030.085102-0.255425-0.1669741.6127271.0652350.489095-0.1437720.6355580.463917-0.114805-0.183361-0.145783-0.069083-0.225775-0.6386720.101288-0.3398460.1671700.125895-0.0089830.0147242.690.0
21-1.358354-1.3401631.7732090.379780-0.5031981.8004990.7914610.247676-1.5146540.2076430.6245010.0660840.717293-0.1659462.345865-2.8900831.109969-0.121359-2.2618570.5249800.2479980.7716790.909412-0.689281-0.327642-0.139097-0.055353-0.059752378.660.0
31-0.966272-0.1852261.792993-0.863291-0.0103091.2472030.2376090.377436-1.387024-0.054952-0.2264870.1782280.507757-0.287924-0.631418-1.059647-0.6840931.965775-1.232622-0.208038-0.1083000.005274-0.190321-1.1755750.647376-0.2219290.0627230.061458123.500.0
42-1.1582330.8777371.5487180.403034-0.4071930.0959210.592941-0.2705330.8177390.753074-0.8228430.5381961.345852-1.1196700.175121-0.451449-0.237033-0.0381950.8034870.408542-0.0094310.798278-0.1374580.141267-0.2060100.5022920.2194220.21515369.990.0
\n", + "
" + ], + "text/plain": [ + " time v1 v2 v3 ... v27 v28 amount class\n", + "0 0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0.0\n", + "1 0 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0.0\n", + "2 1 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0.0\n", + "3 1 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0.0\n", + "4 2 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0.0\n", + "\n", + "[5 rows x 31 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M47GS1cK2NdD", + "outputId": "1aeb156c-862b-49b8-ac61-1ae7d18b8a21", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_cc.shape" + ], + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(12842, 31)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b2QBZFbR3W_q", + "outputId": "66a283a2-eb4c-4b9f-fbf8-218601fbc749", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "df_cc['class'].value_counts()" + ], + "execution_count": 14, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.0 12785\n", + "1.0 56\n", + "Name: class, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BpLSPuhIyjO1", + "outputId": "4cb83ec3-098f-4400-d6cb-774773526c8b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "classes2 = {0: 'nao-fraude', 1: 'fraude'}\n", + "df_classes = df_cc['class'].value_counts().rename(index = classes2)\n", + "df_classes.head()" + ], + "execution_count": 15, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "nao-fraude 12785\n", + "fraude 
56\n", + "Name: class, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RnhS_KMizHbw", + "outputId": "56721a15-b9c0-440a-ae04-1df33bca7703", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "source": [ + "sns.countplot(x = 'class', data = df_cc)" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAEGCAYAAACkQqisAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAS90lEQVR4nO3df6xf9X3f8ecrdkiaLon5ccdS26u9xupk0kyhFmGNNFV4AsO6GFUkImqLS61400h/7FcD3Va3JEzNko2FtKGyioNBaQglzfA2WmZB2uxHMFwKJfwo44o0wRaEG2xIW5qkzt774/u5yTfuvfbNx/5+v1zu8yFd3XPe53PO+RzJ4sU553M+N1WFJEk9XjHpDkiSli5DRJLUzRCRJHUzRCRJ3QwRSVK3lZPuwLidccYZtW7dukl3Q5KWlPvvv/8rVTV1dH3Zhci6deuYnp6edDckaUlJ8sX56j7OkiR1M0QkSd0MEUlSN0NEktTNEJEkdTNEJEndDBFJUjdDRJLUzRCRJHVbdl+sn6gf/tc3TboLegm6/4OXTboL0kR4JyJJ6maISJK6GSKSpG6GiCSpmyEiSepmiEiSuhkikqRuhogkqdvIQiTJ7iTPJnl4qPbBJH+S5KEkn06yamjbVUlmkjye5IKh+pZWm0ly5VB9fZL9rf7JJKeM6lokSfMb5Z3IjcCWo2r7gDdV1ZuB/wtcBZBkI3ApcFbb56NJViRZAfwGcCGwEXhXawvwAeDaqnojcBjYPsJrkSTNY2QhUlWfBQ4dVfsfVXWkrd4DrGnLW4FbqurrVfUFYAY4p/3MVNWTVfUN4BZga5IA5wG3tf33ABeP6lokSfOb5DuRnwF+ry2vBp4a2nag1Raqnw48PxRIc/V5JdmRZDrJ9Ozs7EnqviRpIiGS5N8AR4CPj+N8VbWrqjZV1aapqalxnFKSloWxz+Kb5KeBHwM2V1W18kFg7VCzNa3GAvXngFVJVra7keH2kqQxGeudSJItwC8Cb6+qF4c27QUuTfKqJOuBDcC9wH3AhjYS6xQGL9/3tvD5DHBJ238bcPu4rkOSNDDKIb6fAD4H/GCSA0m2A78OvBbYl+TBJL8JUFWPALcCjwK/D1xRVd9sdxnvAe4EHgNubW0B3gv8iyQzDN6R3DCqa5EkzW9kj7Oq6l3zlBf8D31VXQNcM0/9DuCOeepPMhi9JUmaEL9YlyR1M0QkSd0MEUlSN0NEktTNEJEkdTNEJEndDBFJUjdDRJLUzRCRJHUzRCRJ3QwRSVI3Q0SS1M0QkSR1M0QkSd0MEUlSN0NEktTNEJEkdTNEJEndDBFJUjdDRJLUzRCRJHUzRCRJ3QwRSVI3
Q0SS1G1kIZJkd5Jnkzw8VDstyb4kT7Tfp7Z6klyXZCbJQ0nOHtpnW2v/RJJtQ/UfTvL5ts91STKqa5EkzW+UdyI3AluOql0J3FVVG4C72jrAhcCG9rMDuB4GoQPsBN4KnAPsnAue1ubdQ/sdfS5J0oiNLESq6rPAoaPKW4E9bXkPcPFQ/aYauAdYleQNwAXAvqo6VFWHgX3AlrbtdVV1T1UVcNPQsSRJYzLudyJnVtXTbfkZ4My2vBp4aqjdgVY7Vv3APPV5JdmRZDrJ9Ozs7IldgSTpWyb2Yr3dQdSYzrWrqjZV1aapqalxnFKSloVxh8iX26Mo2u9nW/0gsHao3ZpWO1Z9zTx1SdIYjTtE9gJzI6y2AbcP1S9ro7TOBV5oj73uBM5Pcmp7oX4+cGfb9tUk57ZRWZcNHUuSNCYrR3XgJJ8AfhQ4I8kBBqOsfg24Ncl24IvAO1vzO4CLgBngReBygKo6lOR9wH2t3dVVNfey/p8xGAH2PcDvtR9J0hiNLESq6l0LbNo8T9sCrljgOLuB3fPUp4E3nUgfJUknxi/WJUndDBFJUjdDRJLUzRCRJHUzRCRJ3QwRSVI3Q0SS1M0QkSR1M0QkSd0MEUlSN0NEktTNEJEkdTNEJEndDBFJUjdDRJLUzRCRJHUzRCRJ3QwRSVI3Q0SS1M0QkSR1M0QkSd0MEUlSN0NEktRtIiGS5J8neSTJw0k+keTVSdYn2Z9kJsknk5zS2r6qrc+07euGjnNVqz+e5IJJXIskLWdjD5Ekq4GfAzZV1ZuAFcClwAeAa6vqjcBhYHvbZTtwuNWvbe1IsrHtdxawBfhokhXjvBZJWu4m9ThrJfA9SVYCrwGeBs4Dbmvb9wAXt+WtbZ22fXOStPotVfX1qvoCMAOcM6b+S5KYQIhU1UHgQ8CXGITHC8D9wPNVdaQ1OwCsbsurgafavkda+9OH6/Ps8x2S7EgynWR6dnb25F6QJC1jk3icdSqDu4j1wPcB38vgcdTIVNWuqtpUVZumpqZGeSpJWlYm8TjrHwJfqKrZqvor4HeBtwGr2uMtgDXAwbZ8EFgL0La/HnhuuD7PPpKkMZhEiHwJODfJa9q7jc3Ao8BngEtam23A7W15b1unbb+7qqrVL22jt9YDG4B7x3QNkiQGL7jHqqr2J7kN+CPgCPAAsAv478AtSd7faje0XW4Abk4yAxxiMCKLqnokya0MAugIcEVVfXOsFyNJy9zYQwSgqnYCO48qP8k8o6uq6mvAOxY4zjXANSe9g5KkRfGLdUlSN0NEktTNEJEkdTNEJEndFhUiSe5aTE2StLwcc3RWklczmNvqjPaledqm17HAFCOSpOXjeEN8/wnwCwymJ7mfb4fIV4FfH2G/JElLwDFDpKo+DHw4yc9W1UfG1CdJ0hKxqI8Nq+ojSX4EWDe8T1XdNKJ+SZKWgEWFSJKbgR8AHgTmphYpwBCRpGVssdOebAI2tokPJUkCFv+dyMPA3xplRyRJS89i70TOAB5Nci/w9bliVb19JL2SJC0Jiw2RXxllJyRJS9NiR2f94ag7IklaehY7OuvPGIzGAjgFeCXwF1X1ulF1TJL00rfYO5HXzi23P2m7FTh3VJ2SJC0N3/UsvjXwX4ALRtAfSdISstjHWT8+tPoKBt+NfG0kPZIkLRmLHZ31j4eWjwB/yuCRliRpGVvsO5HLR90RSdLSs9g/SrUmyaeTPNt+PpVkzag7J0l6aVvsi/WPAXsZ/F2R7wP+a6tJkpaxxYbIVFV9rKqOtJ8bganekyZZleS2JH+S5LEkfz/JaUn2JXmi/T61tU2S65LMJHkoydlDx9nW2j+RZFtvfyRJfRYbIs8l+ckkK9rPTwLPncB5Pwz8flX9XeDvAY8BVwJ3VdUG4K62DnAhsKH97ACuB0hyGrATeCtwDrBzLngkSeOx2BD5GeCdwDPA08AlwE/3nDDJ64F/ANwAUFXfqKrnGYz22tOa7QEubstbgZva9yn3AKuSvIHB
dyr7qupQVR0G9gFbevokSeqz2BC5GthWVVNV9TcZhMqvdp5zPTALfCzJA0l+K8n3AmdW1dOtzTPAmW15NfDU0P4HWm2huiRpTBYbIm9u/7cPQFUdAt7Sec6VwNnA9VX1FuAv+Pajq7njF9+eq+uEJdmRZDrJ9Ozs7Mk6rCQte4sNkVcMv29o7yMW+6Hi0Q4AB6pqf1u/jUGofLk9pqL9frZtPwisHdp/TastVP9rqmpXVW2qqk1TU93jASRJR1lsiPxH4HNJ3pfkfcD/Af5Dzwmr6hngqSQ/2EqbgUcZDCGeG2G1Dbi9Le8FLmujtM4FXmiPve4Ezk9yagu481tNkjQmi/1i/aYk08B5rfTjVfXoCZz3Z4GPJzkFeBK4nEGg3ZpkO/BFBi/yAe4ALgJmgBdbW6rqUAu0+1q7q9tjNknSmCz6kVQLjRMJjuFjPchgEsejbZ6nbQFXLHCc3cDuk9EnSdJ377ueCl6SpDmGiCSpmyEiSepmiEiSuhkikqRuhogkqZshIknqZohIkroZIpKkboaIJKmbISJJ6maISJK6GSKSpG6GiCSpmyEiSepmiEiSuhkikqRuhogkqZshIknqZohIkroZIpKkboaIJKmbISJJ6maISJK6TSxEkqxI8kCS/9bW1yfZn2QmySeTnNLqr2rrM237uqFjXNXqjye5YDJXIknL1yTvRH4eeGxo/QPAtVX1RuAwsL3VtwOHW/3a1o4kG4FLgbOALcBHk6wYU98lSUwoRJKsAf4R8FttPcB5wG2tyR7g4ra8ta3Ttm9u7bcCt1TV16vqC8AMcM54rkCSBJO7E/nPwC8C/6+tnw48X1VH2voBYHVbXg08BdC2v9Daf6s+zz7fIcmOJNNJpmdnZ0/mdUjSsjb2EEnyY8CzVXX/uM5ZVbuqalNVbZqamhrXaSXpZW/lBM75NuDtSS4CXg28DvgwsCrJyna3sQY42NofBNYCB5KsBF4PPDdUnzO8jyRpDMZ+J1JVV1XVmqpax+DF+N1V9RPAZ4BLWrNtwO1teW9bp22/u6qq1S9to7fWAxuAe8d0GZIkJnMnspD3ArckeT/wAHBDq98A3JxkBjjEIHioqkeS3Ao8ChwBrqiqb46/25K0fE00RKrqD4A/aMtPMs/oqqr6GvCOBfa/BrhmdD2UJB2LX6xLkroZIpKkboaIJKmbISJJ6maISJK6GSKSpG6GiCSpmyEiSepmiEiSuhkikqRuhogkqZshIknqZohIkroZIpKkboaIJKmbISJJ6maISJK6GSKSpG6GiCSpmyEiSepmiEiSuhkikqRuhogkqdvYQyTJ2iSfSfJokkeS/Hyrn5ZkX5In2u9TWz1Jrksyk+ShJGcPHWtba/9Ekm3jvhZJWu4mcSdyBPiXVbUROBe4IslG4ErgrqraANzV1gEuBDa0nx3A9TAIHWAn8FbgHGDnXPBIksZj7CFSVU9X1R+15T8DHgNWA1uBPa3ZHuDitrwVuKkG7gFWJXkDcAGwr6oOVdVhYB+wZYyXIknL3kTfiSRZB7wF2A+cWVVPt03PAGe25dXAU0O7HWi1herznWdHkukk07Ozsyet/5K03E0sRJL8DeBTwC9U1VeHt1VVAXWyzlVVu6pqU1VtmpqaOlmHlaRlbyIhkuSVDALk41X1u6385faYivb72VY/CKwd2n1Nqy1UlySNySRGZwW4AXisqv7T0Ka9wNwIq23A7UP1y9oorXOBF9pjrzuB85Oc2l6on99qkqQxWTmBc74N+Cng80kebLVfAn4NuDXJduCLwDvbtjuAi4AZ4EXgcoCqOpTkfcB9rd3VVXVoPJcgSYIJhEhV/S8gC2zePE/7Aq5Y4Fi7gd0nr3eSpO+GX6xLkroZIpKkboaIJKmbISJJ6maISJK6GSKSpG6GiCSpmyEiSepmiEiSuhkikqRuhogkqZshIknqZohIkroZIpKkboaIJKmbISJJ6maISJK6GSKSpG6GiCSpmyEiSepmiEiSuhki
kqRuhogkqZshIknqtuRDJMmWJI8nmUly5aT7I0nLyZIOkSQrgN8ALgQ2Au9KsnGyvZKk5WPlpDtwgs4BZqrqSYAktwBbgUcn2itpQr509Q9Nugt6Cfrbv/z5kR17qYfIauCpofUDwFuPbpRkB7Cjrf55ksfH0Lfl4AzgK5PuxEtBPrRt0l3QX+e/zzk7czKO8v3zFZd6iCxKVe0Cdk26Hy83SaaratOk+yHNx3+f47Gk34kAB4G1Q+trWk2SNAZLPUTuAzYkWZ/kFOBSYO+E+yRJy8aSfpxVVUeSvAe4E1gB7K6qRybcreXER4R6KfPf5xikqibdB0nSErXUH2dJkibIEJEkdTNEdFzHm1omyauSfLJt359k3fh7qeUoye4kzyZ5eIHtSXJd+7f5UJKzx93HlztDRMe0yKlltgOHq+qNwLXAB8bbSy1jNwJbjrH9QmBD+9kBXD+GPi0rhoiO51tTy1TVN4C5qWWGbQX2tOXbgM1JTsonstKxVNVngUPHaLIVuKkG7gFWJXnDeHq3PBgiOp75ppZZvVCbqjoCvACcPpbeSce2mH+/OgGGiCSpmyGi41nM1DLfapNkJfB64Lmx9E46NqdGGjFDRMezmKll9gJz09heAtxdfsWql4a9wGVtlNa5wAtV9fSkO/VysqSnPdHoLTS1TJKrgemq2gvcANycZIbBS85LJ9djLSdJPgH8KHBGkgPATuCVAFX1m8AdwEXADPAicPlkevry5bQnkqRuPs6SJHUzRCRJ3QwRSVI3Q0SS1M0QkSR1M0SkMUryK0n+1aT7IZ0shogkqZshIo1Qksva37H44yQ3H7Xt3Unua9s+leQ1rf6OJA+3+mdb7awk9yZ5sB1vwySuRzqaHxtKI5LkLODTwI9U1VeSnAb8HPDnVfWhJKdX1XOt7fuBL1fVR5J8HthSVQeTrKqq55N8BLinqj7epp9ZUVV/Oalrk+Z4JyKNznnA71TVVwCq6ui/e/GmJP+zhcZPAGe1+v8GbkzybgZTzQB8DvilJO8Fvt8A0UuFISJNzo3Ae6rqh4BfBV4NUFX/FPi3DGafvb/dsfw28HbgL4E7kpw3mS5L38kQkUbnbuAdSU4HaI+zhr0WeDrJKxncidDa/UBV7a+qXwZmgbVJ/g7wZFVdB9wOvHksVyAdh7P4SiPSZju+BvjDJN8EHgD+dKjJvwP2MwiK/QxCBeCD7cV5gLuAPwbeC/xUkr8CngH+/VguQjoOX6xLkrr5OEuS1M0QkSR1M0QkSd0MEUlSN0NEktTNEJEkdTNEJEnd/j/eEDTdweaeVQAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pzjW3Bf_3h7J", + "outputId": "b7bc849f-e680-4f11-e8d6-8db4a568452b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "56/12842" + ], + "execution_count": 17, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.004360691481077714" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9bWDX9H12k5g" + }, + "source": [ + "### Drop NaN" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "27ob8tRR21TB", + "outputId": "13dd0a84-505e-4e94-8e76-9fc617d58664", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 561 + } + }, + "source": [ + "df_cc.isna().sum()" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "time 0\n", + "v1 0\n", + "v2 0\n", + "v3 0\n", + "v4 0\n", + "v5 0\n", + "v6 0\n", + "v7 0\n", + "v8 0\n", + "v9 0\n", + "v10 1\n", + "v11 1\n", + "v12 1\n", + "v13 1\n", + "v14 1\n", + "v15 1\n", + "v16 1\n", + "v17 1\n", + "v18 1\n", + "v19 1\n", + "v20 1\n", + "v21 1\n", + "v22 1\n", + "v23 1\n", + "v24 1\n", + "v25 1\n", + "v26 1\n", + "v27 1\n", + "v28 1\n", + "amount 1\n", + "class 1\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HCTCnZUWzgrj" + }, + "source": [ + "### DataViz (= Data Visualization)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m9H7zUA40Arv" + }, + "source": [ + "[X1, X2, ..., Xn, y]" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "x2RHoMGZzndC", + "outputId": "7637945b-cc58-4c83-961b-a8622a95bf14", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 400 + } + }, + "source": [ + "# Boxplot da variável 'v1' 
por variável-target (class):\n", + "sns.catplot(x = 'class', y = 'v6', kind = 'box', data = df_cc)" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAFuCAYAAAChovKPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAVDElEQVR4nO3df3Ck933Q8ffnJPcqOxTbm+Mazuc2tY5h7GIPiRoK/GMSnbv1MHbBpBiYnkpDbjokuTSE0Pyg2GGSTJgydGJTIIJmrCuhacAU30wcBelKGmCIEzm1fXZ+EI0tx3dzts/rJE642KmkD39o7dGddbJ11u53v/u8XzOZ2+fZze5HI8073zz77LORmUiS6rGj9ACSpK0x3JJUGcMtSZUx3JJUGcMtSZUZLT3Admi32zk7O1t6DEnabrHRzqFYcT/11FOlR5CkvhmKcEtSkxhuSaqM4ZakyhhuSaqM4ZakyhhuSaqM4ZakyhhuSaqM4Zakyhjuhup0Ohw6dIhOp1N6FElbZLgbamZmhmPHjnH48OHSo0jaIsPdQJ1Oh9nZWTKT2dlZV91SZQx3A83MzLC6ugrAysqKq26pMoa7gebn51leXgZgeXmZubm5whNJ2grD3UCTk5OMjq5din10dJT9+/cXnkjSVhjuBpqammLHjrVf/cjICAcOHCg8kaStMNwN1Gq1aLfbRATtdptWq1V6JElbMBRfXaatm5qaYmlpydW2VKHIzNIzvGITExO5sLBQegxJ2m6D9Z2TEbE3Iv5HRHw1Ih6KiHd2918aEXMR8c3uv5eUmlGSBlHJY9zLwLsz80rgZ4G3RcSVwHuBo5m5Dzja3ZYkdRULd2aezMyvdG9/D/gasAe4EZjpPmwG+IUyE0rSYBqIs0oi4ieBvwjcA+zOzJPdux4Hdp/jv3MwIhYiYuHUqVN9mVOSBkHxcEfEq4A7gV/LzGfW35dr75xu+O5pZk5n5kRmTuzatasPk0rSYCga7oi4gLVofzIz/2t39xMR8Zru/a8Bniw1nyQNopJnlQTwO8DXMvNfrbvrCDDVvT0F3NXv2SRpkJX8AM5fBX4JOBYR93X3vR/4KPDpiHgL8Cjwi4Xmk6SBVCzcmfm/OMfJ5cCb+jmLJNWk+JuTkqStMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdwN1el0OHToEJ1Op/QokraoaLgj4hMR8WREPLhu36URMRcR3+z+e0nJGYfVzMwMx44d4/Dhw6VHkbRFpVfcdwDts/a9FziamfuAo91tbaNOp8Ps7CyZyezsrKtuqTJFw52ZXwCePmv3jcBM9/YM8At9HaoBZmZm
WF1dBWBlZcVVt1SZ0ivujezOzJPd248Duzd6UEQcjIiFiFg4depU/6YbAvPz8ywvLwOwvLzM3Nxc4YkkbcUghvsFmZlAnuO+6cycyMyJXbt29Xmyuk1OTjI6OgrA6Ogo+/fvLzyRpK0YxHA/ERGvAej++2TheYbO1NQUO3as/epHRkY4cOBA4YkkbcUghvsIMNW9PQXcVXCWodRqtWi320QE7XabVqtVeiRJWzBa8sUj4veAa4FXR8Rx4Bbgo8CnI+ItwKPAL5abcHhNTU2xtLTkaluqUKwdRq7bxMRELiwslB5DkrZbbLRzEA+VSJI2YbglqTKGW5IqY7glqTKGu6G8OqBUL8PdUF4dUKqX4W4grw4o1c1wN5BXB5TqZrgbyKsDSnUz3A00OTnJyMgIsHaRKa8OKNXFcDfQ1NTUC4dKVldXvV6JVBnD3VDPX6NmGK5VIzWN4W6gj3/842dsT09PF5pE0vkw3A109OjRM7bn5+cLTSLpfBjuBlpZWdl0W9JgM9ySVBnD3UA7d+7cdFvSYDPcDfTcc89tui1psBluSaqM4W6gsbGxTbclDTbD3UAe45bqZrgb6Dvf+c6m25IGm+GWpMoYbkmqjOGWpMoYbkmqjOGWpMoYbkmqjOGWpMoYbkmqjOGWpMoYbkmqjOGW1FOdTodDhw7R6XRKjzI0DLeknpqZmeHYsWMcPny49ChDw3BL6plOp8Ps7CyZyezsrKvubWK4JfXMzMwMq6urwNqXUrvq3h6GW1LPzM/Ps7y8DMDy8jJzc3OFJxoOhltSz0xOTjI6OgrA6Ogo+/fvLzzRcDDcknpmamqKHTvWMjMyMsKBAwcKTzQcDLeknmm1WrTbbSKCdrtNq9UqPdJQGC09gKThNjU1xdLSkqvtbWS4JfVUq9XitttuKz3GUPFQiSRVxnBLUmU8VFLQ7bffzuLiYukxAHjnO9/Z19cbHx/nHe94R19fUxoWrrglqTKRmaVneMUmJiZyYWGh9BjVuPbaa1+07/Of/3zf55D0kmKjna64G2hsbOyM7YsuuqjQJJLOh+FuoM9+9rNnbH/mM58pNImk82G4G87VtlQfzyppqGuuuQaAj33sY4UnkbRVrrgl9ZRfXbb9DLeknvKry7af4ZbUM351WW8Ybkk941eX9UbjP4AzSB8776fnf+bx8fHCk/SfH7fvn+uvv57Tp0+/sH3hhRdy9913F5yoOht+AGdgzyqJiDbwMWAE+A+Z+dFevM7i4iL3Pfg1Vi68tBdPP7B2/HDtf7DvffiJwpP018jpp0uPUESpBcrY2NgZ4R4bG/O6ONtgIMMdESPAbwP7gePAlyPiSGZ+tRevt3Lhpfzgz1/fi6fWgBn7uqu9ftq9e/cLx7Ujgt27dxeeaDgMZLiBNwCLmfkwQER8CrgR6Em4pWFXcsV500030el0uOGGG3jXu95VbI5hMqjh3gM8tm77OPCX1j8gIg4CBwEuv/zy836hEydOMHL6u67EGmLkdIcTJ5ZLj9Eou3fv5tlnn/Wry7ZRtWeVZOZ0Zk5k5sSuXbtKjyPpHC644ALGx8f9ouBtNKgr7hPA3nXbl3X3bbs9e/bw+HOjHuNuiLGv382ePR5nVd0GNdxfBvZFxGtZC/bNwN/t1YuNnH66cYdKdjz7DACrP/pjhSfpr7WzSgy36nbOcEfE5cCTmflsRATwy8DrWHuD8N9nZs8OFGbmckS8Hfgca6cDfiIzH+rFazXxPGaAxcXvATD+U02L2O7G/s41PDZbcd/N2tkdAB8FrgD+G/BG4GeAX+nlYJl5d3eGnhq28ztfrufPpfXqgFJ9Ngv3jsx8/sz5SeBnMnMV+I8RcX/vR5MkbWSzs0oei4g3dm8v0X2zMCJ8a1iSCtpsxf0PgMMRcSvwXeC+iLgPuBj4R32YTZK0gXOGOzMfA/5aRPwG8DBwB92Pn3cPmUjagqZf0Kzf1ygZBL26TsrLOR1wB/B+4Gng91k7bNKsKxNJ22BxcZFv
PvTHXP6qldKj9NWP/MnaEdnnHj2/K3jW6lvfH+nZc79kuDPzg8AHI+Jq4G8DfxQRxzNzsmdTqeeeeeYZHnnkEe69915e//rXlx6nMS5/1Qrvf90zpcdQH3zkK737jMRWPvL+JPA40AH+TG/GUb888sgjALznPe8pPImkrXrJcEfEP4yIzwNHgRbw1sy8uteDqXfWf+nE6uoq9957b8FpJG3VyznGvRf4tcy8r9fDNE2pN6vuv//M0/Df/e53c8011/R1hmG8uP1LOXHiBP/veyM9/b/QGhyPfm+Ei0705BJLL+sY9/t68sqSpPMyqBeZaoRSK85rr732Rfv86Hvv7dmzh+eWT/rmZEN85Cs/xs49e3ry3NVej1uSmsoVt9RH3/p+845xP3F6bX24+8JmfW7vW98fYV+PnttwS33S1MvJ/rD7BvzOn2jWz7+P3v3ODbfUJ007i+Z5XkJ4+3mMW5IqY7glqTKGW5IqY7glqTKGW5IqY7glqTKGW5Iq43ncUgOU/Nq00l9dNoxXonTF3UBXX33m5dT7fUlXNcvY2BhjY2OlxxgqkZmlZ3jFJiYmcv2XA2hznU6Hm2666YXtO++8k1arVXAiSecQG+10xd1A3/72tzfdljTYDHcDfehDH9p0W9JgM9wNtLS0tOm2pMFmuBto7969m25LGmyGu4EuvvjiM7YvueSSQpNIOh+Gu4GOHTt2xvYDDzxQaBJJ58NwS1JlDLckVcZwS1JlDHcDXXbZZWdse1aJVBfD3UC33nrrGdu33HJLmUEknRfD3UDj4+MvrLr37t3L+Ph44YkkbYXhbqhbb72Viy66yNW2VCHD3VCXXHIJV1xxhR++kSpkuBtqenqaBx54gOnp6dKjSNoiw91AnU6Hubk5AObm5uh0OoUnkrQVhruBpqenWV1dBWB1ddVVt1QZw91A8/Pzm25LGmyGu4FWVlY23ZY02Ax3A42MjGy6LWmwGe4Gmpyc3HRb0mAz3A108OBBIta+PDoiOHjwYOGJJG2F4W6gVqvFddddB8B1111Hq9UqPJGkrRgtPYDKOHjwICdPnnS1LVUoMrP0DK/YxMRELiwslB5DkrZbbLTTQyWSVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVKRLuiHhzRDwUEasRMXHWfe+LiMWI+EZE/FyJ+SRpkJX6AM6DwN8EPr5+Z0RcCdwMXAX8WWA+Iv5cZnr5OknqKrLizsyvZeY3NrjrRuBTmflcZj4CLAJv6O90kjTYBu0Y9x7gsXXbx7v7XiQiDkbEQkQsnDp1qi/DSdIg6NmhkoiYB358g7s+kJl3vdLnz8xpYBrWPvL+Sp9PkmrRs3Bn5vlc5PkEsHfd9mXdfZKkrkE7VHIEuDkidkbEa4F9wJcKzyRJA6XU6YB/IyKOA38Z+ExEfA4gMx8CPg18FZgF3uYZJZJ0Ji/rKkmDy8u6StIwMNySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVJki4Y6I34yIr0fEAxHxBxFx8br73hcRixHxjYj4uRLzSdIgK7XingN+OjOvBv4v8D6AiLgSuBm4CmgD/yYiRgrNKEkDqUi4M/O/Z+Zyd/OLwGXd2zcCn8rM5zLzEWAReEOJGSVpUA3CMe5fAT7bvb0HeGzdfce7+14kIg5GxEJELJw6darHI0rS4Bjt1RNHxDzw4xvc9YHMvKv7mA8Ay8An
t/r8mTkNTANMTEzkKxhVkqrSs3Bn5uRm90fELwN/HXhTZj4f3hPA3nUPu6y7T5LUVeqskjbwT4AbMvP0uruOADdHxM6IeC2wD/hSiRklaVD1bMX9Ev41sBOYiwiAL2bmr2bmQxHxaeCrrB1CeVtmrhSaUZIGUpFwZ+b4Jvd9GPhwH8eRpKoMwlklkqQtMNySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdySVBnDLUmVMdwN1el0OHToEJ1Op/QokrbIcDfUzMwMx44d4/Dhw6VHkbRFhruBOp0Os7OzZCazs7OuuqXKGO4GmpmZYXV1FYCVlRVX3VJlDHcDzc/Ps7y8DMDy8jJzc3OFJ5K0FYa7gSYnJxkdXfvWutHRUfbv3194IklbYbgbaGpqih071n71IyMjHDhwoPBEkrbCcDdQq9Wi3W4TEbTbbVqtVumRJG1BkW95V3lTU1MsLS252pYqFJlZeoZXbGJiIhcWFkqPIUnbLTba6aESSaqM4ZakyhhuSaqM4ZakyhhuSaqM4ZakyhhuSaqM4ZakyhhuSarMUHxyMiJOAY+WnqNCrwaeKj2EGsG/tfPzVGa2z945FOHW+YmIhcycKD2Hhp9/a9vLQyWSVBnDLUmVMdzNNl16ADWGf2vbyGPcklQZV9ySVBnDLUmVMdwNEBHtiPhGRCxGxHs3uH9nRPx+9/57IuIn+z+lahcRn4iIJyPiwXPcHxFxW/fv7IGIeF2/ZxwWhnvIRcQI8NvAzwNXAn8nIq4862FvAb6dmePAbwH/or9TakjcAbzowyLr/Dywr/ufg8C/7cNMQ8lwD783AIuZ+XBm/hD4FHDjWY+5EZjp3v4vwJsiYsPvupPOJTO/ADy9yUNuBA7nmi8CF0fEa/oz3XAx3MNvD/DYuu3j3X0bPiYzl4HvAq2+TKcmeTl/i3oZDLckVcZwD78TwN5125d19234mIgYBf400OnLdGqSl/O3qJfBcA+/LwP7IuK1EfEjwM3AkbMecwSY6t7+W8Afpp/M0vY7Ahzonl3ys8B3M/Nk6aFqNFp6APVWZi5HxNuBzwEjwCcy86GI+OfAQmYeAX4H+N2IWGTtzaWby02sWkXE7wHXAq+OiOPALcAFAJn574C7geuBReA08PfLTFo/P/IuSZXxUIkkVcZwS1JlDLckVcZwS1JlDLckVcZwS2eJiFsj4h+XnkM6F8MtSZUx3Gq8iDjQvT70/RHxu2fd99aI+HL3vjsj4sLu/jdHxIPd/V/o7rsqIr4UEfd1n29fiZ9Hw88P4KjRIuIq4A+Av5KZT0XEpcAh4PuZ+S8jopWZne5jPwQ8kZm3R8QxoJ2ZJyLi4sz8TkTcDnwxMz/ZvbzASGb+oNTPpuHliltN90bgP2fmUwCZefb1pH86Iv5nN9R/D7iqu/9/A3dExFtZu5QAwP8B3h8Rvw78hNFWrxhuaXN3AG/PzL8AfBD4UYDM/FXgn7J2tbt7uyvz/wTcAPwAuDsi3lhmZA07w62m+0PgzRHRAugeKlnvTwEnI+IC1lbcdB93RWbek5n/DDgF7I2InwIezszbgLuAq/vyE6hxvDqgGq17pcQPA38UESvAHwNL6x7yG8A9rMX5HtZCDvCb3TcfAzgK3A/8OvBLEfEnwOPAR/ryQ6hxfHNSkirjoRJJqozhlqTKGG5JqozhlqTKGG5JqozhlqTKGG5Jqsz/B3+Fb3CsjmIsAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0pjVw0t41Wjm", + "outputId": "ac634515-7250-4153-d501-cb4af565b1ec", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "len(df_cc.columns)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "31" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 77 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bwT2CdAf1Noy", + "outputId": "00a8bbe0-f2ab-475b-aefe-8cceb71df343", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 700 + } + }, + "source": [ + "figura = plt.figure(figsize= (16, 12))\n", + "\n", + "for i in range(1, 29):\n", + " plt.subplot(5, 6, i)\n", + " plt.plot(df_cc['v'+str(i)])\n", + "\n", + "plt.plot(df_cc['amount'])\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "apVkeoCl3AST", + "outputId": "77d75c8d-65f1-4d80-b6f1-eabfa01f8364", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 315 + } + }, + "source": [ + "df_cc.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>time</th><th>v1</th><th>v2</th><th>v3</th><th>v4</th><th>v5</th><th>v6</th><th>v7</th><th>v8</th><th>v9</th><th>v10</th><th>v11</th><th>v12</th><th>v13</th><th>v14</th><th>v15</th><th>v16</th><th>v17</th><th>v18</th><th>v19</th><th>v20</th><th>v21</th><th>v22</th><th>v23</th><th>v24</th><th>v25</th><th>v26</th><th>v27</th><th>v28</th><th>amount</th><th>class</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>count</th><td>12842.000000</td><td>12842.000000</td><td>12842.000000</td><td>12842.000000</td><td>12842.000000</td><td>12842.000000</td><td>12842.000000</td><td>12842.000000</td><td>12842.000000</td><td>12842.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td><td>12841.000000</td></tr>\n", + "
<tr><th>mean</th><td>8949.099984</td><td>-0.216783</td><td>0.275675</td><td>0.875939</td><td>0.280864</td><td>-0.111409</td><td>0.134556</td><td>-0.147548</td><td>-0.031229</td><td>0.966379</td><td>-0.320292</td><td>0.842062</td><td>-1.494439</td><td>0.965904</td><td>0.815934</td><td>-0.177547</td><td>-0.036491</td><td>0.393921</td><td>-0.012121</td><td>-0.072788</td><td>0.021230</td><td>-0.062996</td><td>-0.147793</td><td>-0.035406</td><td>0.015229</td><td>0.113644</td><td>0.043892</td><td>0.011375</td><td>0.000744</td><td>62.219386</td><td>0.004361</td></tr>\n", + "
<tr><th>std</th><td>6914.588371</td><td>1.653324</td><td>1.338732</td><td>1.453434</td><td>1.495532</td><td>1.232153</td><td>1.307300</td><td>1.202073</td><td>1.243610</td><td>1.217670</td><td>1.209806</td><td>1.189877</td><td>1.544952</td><td>1.171303</td><td>1.331752</td><td>0.981075</td><td>0.949526</td><td>1.158501</td><td>0.833046</td><td>0.823737</td><td>0.571957</td><td>0.894494</td><td>0.621258</td><td>0.496013</td><td>0.589287</td><td>0.426605</td><td>0.563938</td><td>0.401608</td><td>0.257426</td><td>175.780115</td><td>0.065897</td></tr>\n", + "
<tr><th>min</th><td>0.000000</td><td>-27.670569</td><td>-34.607649</td><td>-24.667741</td><td>-4.657545</td><td>-32.092129</td><td>-23.496714</td><td>-26.548144</td><td>-23.632502</td><td>-7.175097</td><td>-14.166795</td><td>-2.595325</td><td>-17.769143</td><td>-3.389510</td><td>-19.214325</td><td>-4.152532</td><td>-12.227189</td><td>-18.587366</td><td>-8.061208</td><td>-4.932733</td><td>-13.276034</td><td>-11.468435</td><td>-8.593642</td><td>-19.254328</td><td>-2.512377</td><td>-4.781606</td><td>-1.338556</td><td>-7.976100</td><td>-3.575312</td><td>0.000000</td><td>0.000000</td></tr>\n", + "
<tr><th>25%</th><td>2789.250000</td><td>-0.966739</td><td>-0.280216</td><td>0.420540</td><td>-0.631430</td><td>-0.713310</td><td>-0.618201</td><td>-0.612138</td><td>-0.180510</td><td>0.252389</td><td>-0.773721</td><td>0.041619</td><td>-2.442860</td><td>0.148875</td><td>0.218535</td><td>-0.759873</td><td>-0.525637</td><td>-0.078526</td><td>-0.447914</td><td>-0.554832</td><td>-0.159392</td><td>-0.265563</td><td>-0.534956</td><td>-0.171847</td><td>-0.334567</td><td>-0.134560</td><td>-0.372507</td><td>-0.077529</td><td>-0.015002</td><td>5.490000</td><td>0.000000</td></tr>\n", + "
<tr><th>50%</th><td>7605.500000</td><td>-0.319439</td><td>0.245807</td><td>0.962057</td><td>0.205730</td><td>-0.195153</td><td>-0.146859</td><td>-0.109023</td><td>0.017735</td><td>0.944073</td><td>-0.373792</td><td>0.782630</td><td>-1.817630</td><td>1.053044</td><td>1.090067</td><td>-0.041798</td><td>0.034157</td><td>0.392384</td><td>0.044854</td><td>-0.069879</td><td>-0.035732</td><td>-0.129139</td><td>-0.115690</td><td>-0.044329</td><td>0.067107</td><td>0.153360</td><td>-0.022228</td><td>-0.000787</td><td>0.015907</td><td>15.300000</td><td>0.000000</td></tr>\n", + "
<tr><th>75%</th><td>14441.750000</td><td>1.162983</td><td>0.875673</td><td>1.610908</td><td>1.169584</td><td>0.337192</td><td>0.508040</td><td>0.420298</td><td>0.265311</td><td>1.643443</td><td>0.133831</td><td>1.648365</td><td>-0.248728</td><td>1.836428</td><td>1.544681</td><td>0.504464</td><td>0.534263</td><td>0.873357</td><td>0.485550</td><td>0.448875</td><td>0.140779</td><td>0.020585</td><td>0.234024</td><td>0.071117</td><td>0.397918</td><td>0.388684</td><td>0.391314</td><td>0.101575</td><td>0.071500</td><td>50.000000</td><td>0.000000</td></tr>\n", + "
<tr><th>max</th><td>22549.000000</td><td>1.960497</td><td>10.558600</td><td>4.101716</td><td>11.927512</td><td>34.099309</td><td>21.393069</td><td>34.303177</td><td>8.675685</td><td>10.392889</td><td>12.259949</td><td>12.018913</td><td>3.774837</td><td>4.465413</td><td>7.692209</td><td>3.635042</td><td>4.816252</td><td>9.253526</td><td>4.295648</td><td>4.555359</td><td>8.012574</td><td>22.614889</td><td>4.534454</td><td>13.876221</td><td>3.200201</td><td>5.525093</td><td>3.517346</td><td>8.254376</td><td>4.860769</td><td>7712.430000</td><td>1.000000</td></tr>\n", + "</tbody>\n", + "</table>
\n", + "
" + ], + "text/plain": [ + " time v1 ... amount class\n", + "count 12842.000000 12842.000000 ... 12841.000000 12841.000000\n", + "mean 8949.099984 -0.216783 ... 62.219386 0.004361\n", + "std 6914.588371 1.653324 ... 175.780115 0.065897\n", + "min 0.000000 -27.670569 ... 0.000000 0.000000\n", + "25% 2789.250000 -0.966739 ... 5.490000 0.000000\n", + "50% 7605.500000 -0.319439 ... 15.300000 0.000000\n", + "75% 14441.750000 1.162983 ... 50.000000 0.000000\n", + "max 22549.000000 1.960497 ... 7712.430000 1.000000\n", + "\n", + "[8 rows x 31 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 81 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VuDjYxY359EL" + }, + "source": [ + "Q3, IQR = df_cc.quantile(0.75), df_cc.quantile(0.75) - df_cc.quantile(0.25)  # assumption: per-column quartiles; Q3 and IQR were undefined\n", + "limite_superior_outlier = Q3+1.5*IQR  # upper outlier fence (Tukey's rule)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "X9k16WLI49JI", + "outputId": "b2e13475-0c22-40e6-95bf-6e099efc8585", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_cc2 = df_cc.dropna()  # dropna() already returns a new DataFrame, so no copy() is needed\n", + "df_cc2.shape" + ], + "execution_count": 20, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(12841, 31)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OY-DYRKg34ZX" + }, + "source": [ + "### Define the global variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KVhHgV_s3_5f" + }, + "source": [ + "i_CV = 10 # Number of cross-validation folds\n", + "i_Seed = 20111974 # seed, for reproducibility\n", + "f_Test_Size = 0.3 # Proportion of the data held out for the test set (other values could be 0.15, 0.20 or 0.25)" + ], + "execution_count": 21, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wKbqrF4Q2nBq" + }, + "source": [ + "### Define training and test samples" + ] + }, + { + "cell_type": "code", + "metadata": { + 
"id": "N8CUAiA57OhS", + "outputId": "d769cf9c-544a-4f1d-b3e0-49c1380e1a36", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + } + }, + "source": [ + "df_cc2.head()" + ], + "execution_count": 22, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>time</th><th>v1</th><th>v2</th><th>v3</th><th>v4</th><th>v5</th><th>v6</th><th>v7</th><th>v8</th><th>v9</th><th>v10</th><th>v11</th><th>v12</th><th>v13</th><th>v14</th><th>v15</th><th>v16</th><th>v17</th><th>v18</th><th>v19</th><th>v20</th><th>v21</th><th>v22</th><th>v23</th><th>v24</th><th>v25</th><th>v26</th><th>v27</th><th>v28</th><th>amount</th><th>class</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>-1.359807</td><td>-0.072781</td><td>2.536347</td><td>1.378155</td><td>-0.338321</td><td>0.462388</td><td>0.239599</td><td>0.098698</td><td>0.363787</td><td>0.090794</td><td>-0.551600</td><td>-0.617801</td><td>-0.991390</td><td>-0.311169</td><td>1.468177</td><td>-0.470401</td><td>0.207971</td><td>0.025791</td><td>0.403993</td><td>0.251412</td><td>-0.018307</td><td>0.277838</td><td>-0.110474</td><td>0.066928</td><td>0.128539</td><td>-0.189115</td><td>0.133558</td><td>-0.021053</td><td>149.62</td><td>0.0</td></tr>\n", + "
<tr><th>1</th><td>0</td><td>1.191857</td><td>0.266151</td><td>0.166480</td><td>0.448154</td><td>0.060018</td><td>-0.082361</td><td>-0.078803</td><td>0.085102</td><td>-0.255425</td><td>-0.166974</td><td>1.612727</td><td>1.065235</td><td>0.489095</td><td>-0.143772</td><td>0.635558</td><td>0.463917</td><td>-0.114805</td><td>-0.183361</td><td>-0.145783</td><td>-0.069083</td><td>-0.225775</td><td>-0.638672</td><td>0.101288</td><td>-0.339846</td><td>0.167170</td><td>0.125895</td><td>-0.008983</td><td>0.014724</td><td>2.69</td><td>0.0</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>-1.358354</td><td>-1.340163</td><td>1.773209</td><td>0.379780</td><td>-0.503198</td><td>1.800499</td><td>0.791461</td><td>0.247676</td><td>-1.514654</td><td>0.207643</td><td>0.624501</td><td>0.066084</td><td>0.717293</td><td>-0.165946</td><td>2.345865</td><td>-2.890083</td><td>1.109969</td><td>-0.121359</td><td>-2.261857</td><td>0.524980</td><td>0.247998</td><td>0.771679</td><td>0.909412</td><td>-0.689281</td><td>-0.327642</td><td>-0.139097</td><td>-0.055353</td><td>-0.059752</td><td>378.66</td><td>0.0</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>-0.966272</td><td>-0.185226</td><td>1.792993</td><td>-0.863291</td><td>-0.010309</td><td>1.247203</td><td>0.237609</td><td>0.377436</td><td>-1.387024</td><td>-0.054952</td><td>-0.226487</td><td>0.178228</td><td>0.507757</td><td>-0.287924</td><td>-0.631418</td><td>-1.059647</td><td>-0.684093</td><td>1.965775</td><td>-1.232622</td><td>-0.208038</td><td>-0.108300</td><td>0.005274</td><td>-0.190321</td><td>-1.175575</td><td>0.647376</td><td>-0.221929</td><td>0.062723</td><td>0.061458</td><td>123.50</td><td>0.0</td></tr>\n", + "
<tr><th>4</th><td>2</td><td>-1.158233</td><td>0.877737</td><td>1.548718</td><td>0.403034</td><td>-0.407193</td><td>0.095921</td><td>0.592941</td><td>-0.270533</td><td>0.817739</td><td>0.753074</td><td>-0.822843</td><td>0.538196</td><td>1.345852</td><td>-1.119670</td><td>0.175121</td><td>-0.451449</td><td>-0.237033</td><td>-0.038195</td><td>0.803487</td><td>0.408542</td><td>-0.009431</td><td>0.798278</td><td>-0.137458</td><td>0.141267</td><td>-0.206010</td><td>0.502292</td><td>0.219422</td><td>0.215153</td><td>69.99</td><td>0.0</td></tr>\n", + "</tbody>\n", + "</table>
\n", + "
" + ], + "text/plain": [ + " time v1 v2 v3 ... v27 v28 amount class\n", + "0 0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0.0\n", + "1 0 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0.0\n", + "2 1 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0.0\n", + "3 1 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0.0\n", + "4 2 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0.0\n", + "\n", + "[5 rows x 31 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HPwapSnV6Rmu", + "outputId": "de83aafd-59a1-4e92-d731-1951288f51c4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "l_preditoras = df_cc2.iloc[:, 1:30].columns\n", + "l_preditoras2 = list(df_cc2.columns)\n", + "l_preditoras" + ], + "execution_count": 23, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Index(['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11',\n", + " 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18', 'v19', 'v20', 'v21',\n", + " 'v22', 'v23', 'v24', 'v25', 'v26', 'v27', 'v28', 'amount'],\n", + " dtype='object')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 23 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FqSWGYIv6i0-", + "outputId": "7d87d6f3-8b76-4455-d087-91629ca54bbb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "l_preditoras2.remove('class')\n", + "l_preditoras2" + ], + "execution_count": 24, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['time',\n", + " 'v1',\n", + " 'v2',\n", + " 'v3',\n", + " 'v4',\n", + " 'v5',\n", + " 'v6',\n", + " 'v7',\n", + " 'v8',\n", + " 'v9',\n", + " 'v10',\n", + " 'v11',\n", + " 'v12',\n", + " 'v13',\n", + " 'v14',\n", + " 'v15',\n", + " 'v16',\n", + " 'v17',\n", + " 'v18',\n", + " 'v19',\n", + " 
'v20',\n", + " 'v21',\n", + " 'v22',\n", + " 'v23',\n", + " 'v24',\n", + " 'v25',\n", + " 'v26',\n", + " 'v27',\n", + " 'v28',\n", + " 'amount']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NC4z51Rt6q9y", + "outputId": "51dc5065-7475-4b87-beda-103fdc186253", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + } + }, + "source": [ + "df_X = df_cc2[l_preditoras2]\n", + "df_X.head()" + ], + "execution_count": 25, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>time</th><th>v1</th><th>v2</th><th>v3</th><th>v4</th><th>v5</th><th>v6</th><th>v7</th><th>v8</th><th>v9</th><th>v10</th><th>v11</th><th>v12</th><th>v13</th><th>v14</th><th>v15</th><th>v16</th><th>v17</th><th>v18</th><th>v19</th><th>v20</th><th>v21</th><th>v22</th><th>v23</th><th>v24</th><th>v25</th><th>v26</th><th>v27</th><th>v28</th><th>amount</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>-1.359807</td><td>-0.072781</td><td>2.536347</td><td>1.378155</td><td>-0.338321</td><td>0.462388</td><td>0.239599</td><td>0.098698</td><td>0.363787</td><td>0.090794</td><td>-0.551600</td><td>-0.617801</td><td>-0.991390</td><td>-0.311169</td><td>1.468177</td><td>-0.470401</td><td>0.207971</td><td>0.025791</td><td>0.403993</td><td>0.251412</td><td>-0.018307</td><td>0.277838</td><td>-0.110474</td><td>0.066928</td><td>0.128539</td><td>-0.189115</td><td>0.133558</td><td>-0.021053</td><td>149.62</td></tr>\n", + "
<tr><th>1</th><td>0</td><td>1.191857</td><td>0.266151</td><td>0.166480</td><td>0.448154</td><td>0.060018</td><td>-0.082361</td><td>-0.078803</td><td>0.085102</td><td>-0.255425</td><td>-0.166974</td><td>1.612727</td><td>1.065235</td><td>0.489095</td><td>-0.143772</td><td>0.635558</td><td>0.463917</td><td>-0.114805</td><td>-0.183361</td><td>-0.145783</td><td>-0.069083</td><td>-0.225775</td><td>-0.638672</td><td>0.101288</td><td>-0.339846</td><td>0.167170</td><td>0.125895</td><td>-0.008983</td><td>0.014724</td><td>2.69</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>-1.358354</td><td>-1.340163</td><td>1.773209</td><td>0.379780</td><td>-0.503198</td><td>1.800499</td><td>0.791461</td><td>0.247676</td><td>-1.514654</td><td>0.207643</td><td>0.624501</td><td>0.066084</td><td>0.717293</td><td>-0.165946</td><td>2.345865</td><td>-2.890083</td><td>1.109969</td><td>-0.121359</td><td>-2.261857</td><td>0.524980</td><td>0.247998</td><td>0.771679</td><td>0.909412</td><td>-0.689281</td><td>-0.327642</td><td>-0.139097</td><td>-0.055353</td><td>-0.059752</td><td>378.66</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>-0.966272</td><td>-0.185226</td><td>1.792993</td><td>-0.863291</td><td>-0.010309</td><td>1.247203</td><td>0.237609</td><td>0.377436</td><td>-1.387024</td><td>-0.054952</td><td>-0.226487</td><td>0.178228</td><td>0.507757</td><td>-0.287924</td><td>-0.631418</td><td>-1.059647</td><td>-0.684093</td><td>1.965775</td><td>-1.232622</td><td>-0.208038</td><td>-0.108300</td><td>0.005274</td><td>-0.190321</td><td>-1.175575</td><td>0.647376</td><td>-0.221929</td><td>0.062723</td><td>0.061458</td><td>123.50</td></tr>\n", + "
<tr><th>4</th><td>2</td><td>-1.158233</td><td>0.877737</td><td>1.548718</td><td>0.403034</td><td>-0.407193</td><td>0.095921</td><td>0.592941</td><td>-0.270533</td><td>0.817739</td><td>0.753074</td><td>-0.822843</td><td>0.538196</td><td>1.345852</td><td>-1.119670</td><td>0.175121</td><td>-0.451449</td><td>-0.237033</td><td>-0.038195</td><td>0.803487</td><td>0.408542</td><td>-0.009431</td><td>0.798278</td><td>-0.137458</td><td>0.141267</td><td>-0.206010</td><td>0.502292</td><td>0.219422</td><td>0.215153</td><td>69.99</td></tr>\n", + "</tbody>\n", + "</table>
\n", + "
" + ], + "text/plain": [ + " time v1 v2 v3 ... v26 v27 v28 amount\n", + "0 0 -1.359807 -0.072781 2.536347 ... -0.189115 0.133558 -0.021053 149.62\n", + "1 0 1.191857 0.266151 0.166480 ... 0.125895 -0.008983 0.014724 2.69\n", + "2 1 -1.358354 -1.340163 1.773209 ... -0.139097 -0.055353 -0.059752 378.66\n", + "3 1 -0.966272 -0.185226 1.792993 ... -0.221929 0.062723 0.061458 123.50\n", + "4 2 -1.158233 0.877737 1.548718 ... 0.502292 0.219422 0.215153 69.99\n", + "\n", + "[5 rows x 30 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LZjNUDNb7s1t" + }, + "source": [ + "# Alternative definition of the dataframe holding the predictor variables:\n", + "#df_X = df_cc2.copy()\n", + "#df_X.drop(columns= ['class'], inplace = True)\n", + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "d3DDsN2V7IOU", + "outputId": "3e36b91a-9f84-43bb-a99f-6c8e749ed4eb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "df_y = df_cc2['class'] # Response (target) variable\n", + "df_y.head()" + ], + "execution_count": 26, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 0.0\n", + "1 0.0\n", + "2 0.0\n", + "3 0.0\n", + "4 0.0\n", + "Name: class, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aMthdXHD8vnh", + "outputId": "d753dc23-6041-499d-85b8-01cb7d535e90", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_y.shape" + ], + "execution_count": 27, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(12841,)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 27 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EiJRftpZ2103" + }, + "source": [ + "from 
sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": 28, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TmSkPzNt8O6I", + "outputId": "2bf1a100-92cb-47b7-b2b2-aa306a20325e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": 29, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(8988, 30)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 29 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9h1PjPKh8Xb1", + "outputId": "b1fde667-7b52-4549-d259-0952d4dfd1b2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_teste.shape" + ], + "execution_count": 30, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(3853, 30)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NbCN_puI2qk1" + }, + "source": [ + "### Fit the model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hjRwSI079ADn" + }, + "source": [ + "# Import the classifier (model, algorithm, ...)\n", + "from sklearn.tree import DecisionTreeClassifier # This is our classifier" + ], + "execution_count": 31, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HuhKJOQA22bR", + "outputId": "6cd87596-b4a9-4bba-9ce2-43ce6270db2c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "ml_DT = DecisionTreeClassifier(max_depth = 5, min_samples_split = 2, random_state = i_Seed)\n", + "ml_DT" + ], + "execution_count": 32, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + 
"DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=5, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort='deprecated',\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 32 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zai1d6eM93VQ", + "outputId": "dc156d73-6d66-4a70-c4ce-5cdb037c2848", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 119 + } + }, + "source": [ + "# Train the algorithm/classifier: fit(df)\n", + "ml_DT.fit(X_treinamento, y_treinamento)" + ], + "execution_count": 33, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=5, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort='deprecated',\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3OOuaVQB-AkN" + }, + "source": [ + "# Conceptual model (a formula, not executable code): Y = mu + alpha1*X1 + alpha2*X2 + Error" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ybbS4zHn-8BO", + "outputId": "460d1262-1068-4d55-f09e-821fd69a2f6b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_DT, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": 34, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 99.9\n", + 
"std médio das Acurácias calculadas pelo 
CV: 0.09\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r_NLku7q_YT9", + "outputId": "46653af5-a2a2-4429-fb2a-fc02a3e01b3e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "a_scores_CV # array with the scores from each CV iteration" + ], + "execution_count": 35, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1. , 0.99888765, 0.99777531, 0.99777531, 1. ,\n", + " 0.99888765, 1. , 0.99888765, 0.99777283, 1. ])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wgX7eBNY78sa", + "outputId": "00f74266-5757-4e4f-f152-5476aae2d20f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "ml_DT.score(X_teste, y_teste)" + ], + "execution_count": 36, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9994809239553595" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kDNrT7VM8IaT" + }, + "source": [ + "### Confusion Matrix" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uct1z_sS8OGt", + "outputId": "b6868fd4-32f7-480f-cbf5-3d9fd7ada5e6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "y_pred = ml_DT.predict(X_teste)\n", + "y_pred[:30]" + ], + "execution_count": 38, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", + " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 38 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FmG8KiqS8tnn", + "outputId": "65d11296-2e98-4072-8f6d-986dfe010797", + "colab": { + "base_uri": "https://localhost:8080/", + 
"height": 544 + } + }, + "source": [ + "y_teste[0:30]" + ], + "execution_count": 39, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1224 0.0\n", + "11994 0.0\n", + "5408 0.0\n", + "4385 0.0\n", + "8164 0.0\n", + "10540 0.0\n", + "10246 0.0\n", + "10110 0.0\n", + "3787 0.0\n", + "7263 0.0\n", + "4124 0.0\n", + "12319 0.0\n", + "12573 0.0\n", + "2934 0.0\n", + "1074 0.0\n", + "8300 0.0\n", + "11444 0.0\n", + "9251 0.0\n", + "1691 0.0\n", + "10482 0.0\n", + "10295 0.0\n", + "7868 0.0\n", + "823 0.0\n", + "6021 0.0\n", + "9713 0.0\n", + "2005 0.0\n", + "10049 0.0\n", + "10364 0.0\n", + "3300 0.0\n", + "2071 0.0\n", + "Name: class, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2wMWm-p5229A", + "outputId": "16b7a184-540f-4eb1-a08c-f5048e59ccf8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": 40, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Am_UELOg2vDh" + }, + "source": [ + "### Fine tuning dos parâmetros" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lF9mxe7y23hr", + "outputId": "c8f9f8a8-8d06-4b06-87c0-272981248da4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "# Hiperparâmetros para GridSearch()\n", + "d_parametros = {'criterion': ['gini', 'entropy'],\n", + " 'min_samples_split': [2, 5, 7, 9, 11, 50],\n", + " 'max_depth': [2, 9, 15],\n", + " 'min_samples_leaf': [5, 10, 15, 50, 100],\n", + " 'max_leaf_nodes': [2, 7, 11]}\n", + "\n", + "d_parametros" + ], + "execution_count": 41, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': ['gini', 'entropy'],\n", + " 'max_depth': [2, 9, 15],\n", + " 'max_leaf_nodes': [2, 7, 11],\n", + " 'min_samples_leaf': [5, 10, 15, 50, 100],\n", + " 'min_samples_split': [2, 5, 7, 9, 11, 50]}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F8TGaDQvA5jJ" + }, + "source": [ + "### GridSearchOptimizer()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UvRWYAk7A-Gp", + "outputId": "21c0ea2f-f148-4bd2-84a0-d7625ec50ca7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "l_colunas = list(df_X.columns)\n", + "ml_DT2, melhor_hiperparam = GridSearchOptimizer(ml_DT, 'ml_DT2', d_parametros, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": 42, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 540 candidates, totalling 5400 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + 
"[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.4s\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 1.6s\n", + "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.9s\n", + "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 2.2s\n", + "[Parallel(n_jobs=-1)]: Done 21 tasks | elapsed: 2.6s\n", + "[Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 3.0s\n", + "[Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 3.5s\n", + "[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 4.0s\n", + "[Parallel(n_jobs=-1)]: Done 57 tasks | elapsed: 4.6s\n", + "[Parallel(n_jobs=-1)]: Done 68 tasks | elapsed: 5.2s\n", + "[Parallel(n_jobs=-1)]: Done 81 tasks | elapsed: 6.0s\n", + "[Parallel(n_jobs=-1)]: Done 94 tasks | elapsed: 6.7s\n", + "[Parallel(n_jobs=-1)]: Done 109 tasks | elapsed: 7.6s\n", + "[Parallel(n_jobs=-1)]: Done 124 tasks | elapsed: 8.4s\n", + "[Parallel(n_jobs=-1)]: Done 141 tasks | elapsed: 9.4s\n", + "[Parallel(n_jobs=-1)]: Done 158 tasks | elapsed: 10.4s\n", + "[Parallel(n_jobs=-1)]: Done 177 tasks | elapsed: 11.5s\n", + "[Parallel(n_jobs=-1)]: Done 196 tasks | elapsed: 12.5s\n", + "[Parallel(n_jobs=-1)]: Done 217 tasks | elapsed: 13.7s\n", + "[Parallel(n_jobs=-1)]: Done 238 tasks | elapsed: 14.8s\n", + "[Parallel(n_jobs=-1)]: Done 261 tasks | elapsed: 16.2s\n", + "[Parallel(n_jobs=-1)]: Done 284 tasks | elapsed: 17.4s\n", + "[Parallel(n_jobs=-1)]: Done 309 tasks | elapsed: 18.9s\n", + "[Parallel(n_jobs=-1)]: Done 334 tasks | elapsed: 20.3s\n", + "[Parallel(n_jobs=-1)]: Done 361 tasks | elapsed: 21.9s\n", + "[Parallel(n_jobs=-1)]: Done 388 tasks | elapsed: 23.4s\n", + "[Parallel(n_jobs=-1)]: Done 417 tasks | elapsed: 25.0s\n", + "[Parallel(n_jobs=-1)]: Done 446 tasks | elapsed: 26.6s\n", + "[Parallel(n_jobs=-1)]: Done 477 tasks | elapsed: 28.4s\n", + "[Parallel(n_jobs=-1)]: Done 508 tasks | elapsed: 30.1s\n", + "[Parallel(n_jobs=-1)]: Done 541 tasks | elapsed: 32.0s\n", + "[Parallel(n_jobs=-1)]: Done 574 tasks | elapsed: 33.8s\n", + "[Parallel(n_jobs=-1)]: Done 609 tasks | 
elapsed: 35.8s\n", + "[Parallel(n_jobs=-1)]: Done 5400 out of 5400 | elapsed: 7.6min finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'criterion': 'gini', 'max_depth': 2, 'max_leaf_nodes': 2, 'min_samples_leaf': 5, 'min_samples_split': 2}\n", + "\n", + "DecisionTreeClassifier *********************************************************************************************************\n", + "\n", + "********* CROSS-VALIDATION ***********\n", + "Média das Acurácias calculadas pelo CV....: 99.87\n", + "std médio das Acurácias calculadas pelo CV: 0.12\n", + "\n", + "********* IMPORTÂNCIA DAS COLUNAS ***********\n", + " coluna importancia\n", + "12 v12 1.0\n", + "0 time 0.0\n", + "16 v16 0.0\n", + "28 v28 0.0\n", + "27 v27 0.0\n", + "26 v26 0.0\n", + "25 v25 0.0\n", + "24 v24 0.0\n", + "23 v23 0.0\n", + "22 v22 0.0\n", + "21 v21 0.0\n", + "20 v20 0.0\n", + "19 v19 0.0\n", + "18 v18 0.0\n", + "17 v17 0.0\n", + "15 v15 0.0\n", + "1 v1 0.0\n", + "14 v14 0.0\n", + "13 v13 0.0\n", + "11 v11 0.0\n", + "10 v10 0.0\n", + "9 v9 0.0\n", + "8 v8 0.0\n", + "7 v7 0.0\n", + "6 v6 0.0\n", + "5 v5 0.0\n", + "4 v4 0.0\n", + "3 v3 0.0\n", + "2 v2 0.0\n", + "29 amount 0.0\n", + "\n", + "********* CONFUSION MATRIX - PARAMETER TUNNING ***********\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fdaQoeuiEgaE", + "outputId": "f60b420d-c7e6-4cf3-b896-d01eb960a66e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "120*5" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "600" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 119 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "STnFiVlCCet9" + }, + "source": [ + "### Visualize the results" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ufrYXAd5_8uN", + "outputId": "18d7793f-d8ac-4ac3-c1f5-de83f1a6e30c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 277 + } + }, + "source": [ + "from sklearn.tree import export_graphviz\n", + "from io import StringIO # sklearn.externals.six was removed in scikit-learn 0.23\n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "dot_data = StringIO()\n", + "export_graphviz(ml_DT2, out_file = dot_data, filled = True, rounded = True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n", + "\n", + "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", + "graph.write_png('DecisionTree.png')\n", + "Image(graph.create_png())" + ], + "execution_count": 43, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 43 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bG31I7_n4RQg" + }, + "source": [ + "### Apply the (main) transformations studied and re-estimate the model\n", + "* What is the impact of the transformations?\n", + "* Does the conclusion change?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYgK6JXd3MgA" + }, + "source": [ + "## Exercise 2 - Predicting species on the Iris dataset\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "si0rsJvu3O6O" + }, + "source": [ + "from sklearn import datasets\n", + "import xgboost as xgb\n", + "\n", + "iris = datasets.load_iris()\n", + "X_iris = iris.data\n", + "y_iris = iris.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zom8t4yWC_UC" + }, + "source": [ + "## Exercise 3 - Predict Wine Quality\n", + "> Estimate wine quality on a scale of 0–100. The score ranges map to quality as follows:\n", + "\n", + "* 95–100 Classic: a great wine\n", + "* 90–94 Outstanding: a wine of superior character and style\n", + "* 85–89 Very good: a wine with special qualities\n", + "* 80–84 Good: a solid, well-made wine\n", + "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n", + "* 50–74 Not recommended\n", + "\n", + "Source: [Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klL2Q9Ria96n" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Wine = datasets.load_wine()\n", + "X_vinho = Wine.data\n", + "y_vinho = Wine.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lhVhSWBgGijq" + }, + "source": [ + "## Exercise 4 - Predict Parkinson\n", + "Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SVCxHqv0VBJn" + }, + "source": [ + "## Exercise 5 - Predict survivors of the Titanic tragedy\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwvB8us4eKNi" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "\n", + "df_titanic = 
sns.load_dataset('titanic')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZJrT9YIXVdtx" + }, + "source": [ + "## Exercise 6 - Predict Loan Default\n", + "> The data must be obtained directly from the source: [Loan Default Prediction - Imperial College London](https://www.kaggle.com/c/loan-default-prediction/data)\n", + "\n", + "* [Bank Loan Default Prediction](https://medium.com/@wutianhao910/bank-loan-default-prediction-94d4902db740)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R8-GVu7ZWeA8" + }, + "source": [ + "## Exercise 7 - Predict the sales of a store.\n", + "* [Predicting expected sales for Bigmart’s stores](https://medium.com/diogo-menezes-borges/project-1-bigmart-sale-prediction-fdc04f07dc1e)\n", + "* Dataframes\n", + " * [Training](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_train.txt)\n", + " * [Validation](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_test.txt)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fv9w86j4Wnwj" + }, + "source": [ + "## Exercise 8 - [The Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)\n", + "> Predict the median value of owner-occupied homes." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5HYRt8-ug1BT" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Boston = datasets.load_boston() # load_boston was removed in scikit-learn 1.2; this call requires an older version\n", + "X_boston = Boston.data\n", + "y_boston = Boston.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UDIaqmtXQ0T" + }, + "source": [ + "## Exercise 9 - Predict the height or weight of a person.\n", + "\n", + "http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-7R146nIXmMT" + }, + "source": [ + "## Exercise 10 - Black Friday Sales Prediction - Predict purchase amount.\n", + "\n", + "This dataset comprises sales transactions captured at a retail store. It’s a classic dataset for exploring and expanding your feature engineering skills and your day-to-day understanding of multiple shopping experiences. This is a regression problem. 
The dataset has 550,069 rows and 12 columns.\n", + "\n", + "https://github.com/MathMachado/DataFrames/blob/master/blackfriday.zip\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mQ8FPbuLZlIh" + }, + "source": [ + "## Exercise 11 - Predict the income class of the US population.\n", + "\n", + "http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Af4NRrchgPlM" + }, + "source": [ + "## Exercise 12 - Predicting Cancer\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c4LOlgZW3P40" + }, + "source": [ + "from sklearn import datasets\n", + "cancer = datasets.load_breast_cancer()\n", + "X_cancer = cancer.data\n", + "y_cancer = cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74PmpT8Ix0tD" + }, + "source": [ + "## Exercise 13\n", + "Source: [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WY8GZMixZ9W9" + }, + "source": [ + "## Exercise 14 - Predict Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y92t6tbOge0S" + }, + "source": [ + "from sklearn import datasets\n", + "Diabetes = datasets.load_diabetes()\n", + "\n", + "X_diabetes = Diabetes.data\n", + "y_diabetes = Diabetes.target" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15_00__Machine_Learning_hs5.ipynb b/Notebooks/NB15_00__Machine_Learning_hs5.ipynb new file mode 100644 index 000000000..186bb35ec --- /dev/null +++ b/Notebooks/NB15_00__Machine_Learning_hs5.ipynb @@ -0,0 +1,6271 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, 
+ "colab": { + "name": "NB15_00__Machine_Learning.ipynb", + "provenance": [], + "include_colab_link": true + }, + "accelerator": "TPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ShVXyGj9wkgN" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e-VOopTKxrMs" + }, + "source": [ + "A seguir, sugestão de problemas para resolver com Regressão Linear\n", + "* https://lionbridge.ai/datasets/10-open-datasets-for-linear-regression/" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iTAGmmHouqQd", + "outputId": "61bb183a-d989-4d92-867d-bb6265a98717", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "!pip install azureml\n", + "!pip install azureml-opendatasets\n", + "!pip install azureml-dataset-runtime" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: azureml in /usr/local/lib/python3.6/dist-packages (0.2.7)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from azureml) (2.23.0)\n", + "Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/dist-packages (from azureml) (2.8.1)\n", + "Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from azureml) (1.1.2)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->azureml) (2020.6.20)\n", + "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->azureml) (3.0.4)\n", + "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->azureml) (1.24.3)\n", + "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->azureml) (2.10)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil->azureml) (1.15.0)\n", + "Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.6/dist-packages (from pandas->azureml) (1.18.5)\n", + "Requirement already satisfied: pytz>=2017.2 in 
/usr/local/lib/python3.6/dist-packages (from pandas->azureml) (2018.9)\n", + "Collecting azureml-opendatasets\n", + " Downloading https://files.pythonhosted.org/packages/50/80/9cdbe1f2574e03ca9292a4db5de98b958f817d79fc50b556c09e98378b98/azureml_opendatasets-1.16.0-py3-none-any.whl (1.3MB)\n", + "Collecting pandas<=1.0.0,>=0.21.0\n", + " Downloading https://files.pythonhosted.org/packages/12/d1/a6502c2f5c15b50f5dd579fc1c52b47edf6f2e9f682aed917dd7565b3e60/pandas-1.0.0-cp36-cp36m-manylinux1_x86_64.whl (10.1MB)\n", + "Collecting azureml-dataset-runtime[fuse,pandas]~=1.16.0\n", + " Downloading https://files.pythonhosted.org/packages/31/b8/88abfd9d5fe86ed3eeed4dc7736eca9ad264026b6744e42ca25b557a78d4/azureml_dataset_runtime-1.16.0-py3-none-any.whl\n", + "Requirement already satisfied: numpy<=1.19.0,>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from azureml-opendatasets) (1.18.5)\n", + "Collecting pyarrow>=0.16.0\n", + " Downloading https://files.pythonhosted.org/packages/d7/e1/27958a70848f8f7089bff8d6ebe42519daf01f976d28b481e1bfd52c8097/pyarrow-2.0.0-cp36-cp36m-manylinux2014_x86_64.whl (17.7MB)\n", + "Collecting azureml-core~=1.16.0\n", + " Downloading https://files.pythonhosted.org/packages/3b/5b/c7f67ec1ac8066eba75a4cbaa9631bb8713585e56f6487fdda9dd55db8a9/azureml_core-1.16.0.post1-py3-none-any.whl (2.0MB)\n", + "Collecting azureml-telemetry~=1.16.0\n", + " Downloading https://files.pythonhosted.org/packages/a5/04/130242589a85a55e16aebf0c38f5b735d7fa1cd385e3ac741f69fca8295c/azureml_telemetry-1.16.0-py3-none-any.whl\n", + "Requirement already satisfied: scipy<=1.4.1,>=1.0.0 in /usr/local/lib/python3.6/dist-packages 
(from azureml-opendatasets) (1.4.1)\n", + "Collecting pyspark\n", + " Downloading https://files.pythonhosted.org/packages/f0/26/198fc8c0b98580f617cb03cb298c6056587b8f0447e20fa40c5b634ced77/pyspark-3.0.1.tar.gz (204.2MB)\n", + "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas<=1.0.0,>=0.21.0->azureml-opendatasets) (2018.9)\n", + "Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas<=1.0.0,>=0.21.0->azureml-opendatasets) (2.8.1)\n", + "Collecting azureml-dataprep<2.4.0a,>=2.3.0a\n", + " Downloading https://files.pythonhosted.org/packages/27/86/648d3ec3feddf41187a4ed25551293d0480239a920fef00fb979034dc3e1/azureml_dataprep-2.3.4-py3-none-any.whl (28.2MB)\n", + "Collecting fusepy<4.0.0,>=3.0.1; extra == \"fuse\"\n", + " Downloading https://files.pythonhosted.org/packages/04/0b/4506cb2e831cea4b0214d3625430e921faaa05a7fb520458c75a2dbd2152/fusepy-3.0.1.tar.gz\n", + "Collecting azure-mgmt-authorization<1.0.0,>=0.40.0\n", + " Downloading https://files.pythonhosted.org/packages/b4/50/7a923f58bf053280fe1890f3332c08f6a82a208c92035ad8f7888c87b786/azure_mgmt_authorization-0.61.0-py2.py3-none-any.whl (94kB)\n", + "Requirement already satisfied: urllib3>=1.23 in /usr/local/lib/python3.6/dist-packages (from azureml-core~=1.16.0->azureml-opendatasets) (1.24.3)\n", + "Requirement already satisfied: requests>=2.19.1 in /usr/local/lib/python3.6/dist-packages (from azureml-core~=1.16.0->azureml-opendatasets) (2.23.0)\n", + "Collecting PyJWT\n", + " Downloading https://files.pythonhosted.org/packages/87/8b/6a9f14b5f781697e51259d81657e6048fd31a113229cf346880bb7545565/PyJWT-1.7.1-py2.py3-none-any.whl\n", + "Collecting 
ruamel.yaml>=0.15.35\n", + " Downloading https://files.pythonhosted.org/packages/7e/39/186f14f3836ac5d2a6a042c8de69988770e8b9abb537610edc429e4914aa/ruamel.yaml-0.16.12-py2.py3-none-any.whl (111kB)\n", + "Collecting jmespath\n", + " Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl\n", + "Collecting azure-mgmt-storage<16.0.0,>=1.5.0\n", + " Downloading https://files.pythonhosted.org/packages/33/cc/8ace313fd151af6663b1e1778f216532eab0258133ef21498c0e2caefad6/azure_mgmt_storage-11.2.0-py2.py3-none-any.whl (547kB)\n", + "Collecting ndg-httpsclient\n", + " Downloading https://files.pythonhosted.org/packages/fb/67/c2f508c00ed2a6911541494504b7cac16fe0b0473912568df65fd1801132/ndg_httpsclient-0.5.1-py3-none-any.whl\n", + "Collecting backports.tempfile\n", + " Downloading https://files.pythonhosted.org/packages/b4/5c/077f910632476281428fe254807952eb47ca78e720d059a46178c541e669/backports.tempfile-1.0-py2.py3-none-any.whl\n", + "Collecting msrest>=0.5.1\n", + " Downloading https://files.pythonhosted.org/packages/fa/f5/9e315fe8cb985b0ce052b34bcb767883dc739f46fadb62f05a7e6d6eedbe/msrest-0.6.19-py2.py3-none-any.whl (84kB)\n", + "Collecting pathspec\n", + " Downloading https://files.pythonhosted.org/packages/5d/d0/887c58853bd4b6ffc7aa9cdba4fc57d7b979b45888a6bd47e4568e1cf868/pathspec-0.8.0-py2.py3-none-any.whl\n", + "Collecting azure-common>=1.1.12\n", + " Downloading https://files.pythonhosted.org/packages/e5/4d/d000fc3c5af601d00d55750b71da5c231fcb128f42ac95b208ed1091c2c1/azure_common-1.1.25-py2.py3-none-any.whl\n", + "Collecting docker\n", + " Downloading 
https://files.pythonhosted.org/packages/9e/8c/8d42dbd83679483db207535f4fb02dc84325fa78b290f057694b057fcd21/docker-4.3.1-py2.py3-none-any.whl (145kB)\n", + "Requirement already satisfied: contextlib2 in /usr/local/lib/python3.6/dist-packages (from azureml-core~=1.16.0->azureml-opendatasets) (0.5.5)\n", + "Collecting adal>=1.2.0\n", + " Downloading https://files.pythonhosted.org/packages/e9/51/5081e3705fdc4bf56fe26990b959b3379c9db38c6a0a0cd6b66508d161db/adal-1.2.5-py2.py3-none-any.whl (55kB)\n", + "Collecting azure-graphrbac<1.0.0,>=0.40.0\n", + " Downloading https://files.pythonhosted.org/packages/3e/93/02056aca45162f9fc275d1eaad12a2a07ef92375afb48eabddc4134b8315/azure_graphrbac-0.61.1-py2.py3-none-any.whl (141kB)\n", + "Collecting azure-mgmt-containerregistry>=2.0.0\n", + " Downloading https://files.pythonhosted.org/packages/97/70/8c2d0509db466678eba16fa2b0a539499f3b351b1f2993126ad843d5be13/azure_mgmt_containerregistry-2.8.0-py2.py3-none-any.whl (718kB)\n", + "Collecting SecretStorage\n", + " Downloading https://files.pythonhosted.org/packages/c3/50/8a02cad020e949e6d7105f5f4530d41e3febcaa5b73f8f2148aacb3aeba5/SecretStorage-3.1.2-py3-none-any.whl\n", + "Collecting azure-mgmt-keyvault<7.0.0,>=0.40.0\n", + " Downloading https://files.pythonhosted.org/packages/f1/af/1ba15e7176bcf6b1531b453e410ae41a983c09f834d8700dfce739451b53/azure_mgmt_keyvault-2.2.0-py2.py3-none-any.whl (89kB)\n", + "Collecting msrestazure>=0.4.33\n", + " Downloading 
https://files.pythonhosted.org/packages/5e/3a/7adb08fd2f0ee6fdfd03685fac38477b64f184943dcf6ea0cbffb205f22d/msrestazure-0.6.4-py2.py3-none-any.whl (40kB)\n", + "Collecting cryptography!=1.9,!=2.0.*,!=2.1.*,!=2.2.*\n", + " Downloading https://files.pythonhosted.org/packages/33/62/30f6936941d87a5ed72efb24249437824f6b2c953901245b58c91fde2f27/cryptography-3.1.1-cp35-abi3-manylinux2010_x86_64.whl (2.6MB)\n", + "Collecting jsonpickle\n", + " Downloading https://files.pythonhosted.org/packages/af/ca/4fee219cc4113a5635e348ad951cf8a2e47fed2e3342312493f5b73d0007/jsonpickle-1.4.1-py2.py3-none-any.whl\n", + "Collecting pyopenssl<20.0.0\n", + " Downloading https://files.pythonhosted.org/packages/9e/de/f8342b68fa9e981d348039954657bdf681b2ab93de27443be51865ffa310/pyOpenSSL-19.1.0-py2.py3-none-any.whl (53kB)\n", + "Collecting azure-mgmt-resource<15.0.0,>=1.2.1\n", + " Downloading https://files.pythonhosted.org/packages/af/b3/8009c149d7d162b7a2a22a5007f984aa090f089bf8dc09e7e84bd354b868/azure_mgmt_resource-10.2.0-py2.py3-none-any.whl (968kB)\n", + "Collecting applicationinsights\n", + " Downloading https://files.pythonhosted.org/packages/a1/53/234c53004f71f0717d8acd37876e0b65c121181167057b9ce1b1795f96a0/applicationinsights-0.11.9-py2.py3-none-any.whl (58kB)\n", + "Collecting py4j==0.10.9\n", + " Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)\n", + "Requirement 
already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas<=1.0.0,>=0.21.0->azureml-opendatasets) (1.15.0)\n", + "Collecting azure-identity<2.0.0,>=1.2.0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/93/97/0e57f9d0bb0e9aee5cce0007616f6d3c2e09931fd24ad140c9cc3b06b7ef/azure_identity-1.4.1-py2.py3-none-any.whl (86kB)\n", + "\u001b[K |████████████████████████████████| 92kB 9.9MB/s \n", + "\u001b[?25hCollecting azureml-dataprep-native<24.0.0,>=23.0.0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/50/a8/7c85148c3cc1bbcba9476fee04079899ba8dd0b6b49fed4a685926e3bcdc/azureml_dataprep_native-23.0.0-cp36-cp36m-manylinux1_x86_64.whl (1.3MB)\n", + "\u001b[K |████████████████████████████████| 1.3MB 38.8MB/s \n", + "\u001b[?25hCollecting dotnetcore2<3.0.0,>=2.1.14\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/57/a3/43ee595226ae611c2077da17f6ae0df5694aac04429733fb40608c54f83c/dotnetcore2-2.1.17-py3-none-manylinux1_x86_64.whl (28.7MB)\n", + "\u001b[K |████████████████████████████████| 28.7MB 144kB/s \n", + "\u001b[?25hRequirement already satisfied: cloudpickle<2.0.0,>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from azureml-dataprep<2.4.0a,>=2.3.0a->azureml-dataset-runtime[fuse,pandas]~=1.16.0->azureml-opendatasets) (1.3.0)\n", + "Collecting azureml-dataprep-rslex<1.2.0a,>=1.1.0dev0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/7c/2c/e26558af094a5c63611bbee20d118c320db8841ac4dfe75c602804611834/azureml_dataprep_rslex-1.1.3-cp36-cp36m-manylinux2010_x86_64.whl (7.9MB)\n", + "\u001b[K |████████████████████████████████| 7.9MB 16.4MB/s \n", + "\u001b[?25hRequirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.1->azureml-core~=1.16.0->azureml-opendatasets) (2.10)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from 
requests>=2.19.1->azureml-core~=1.16.0->azureml-opendatasets) (2020.6.20)\n", + "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.1->azureml-core~=1.16.0->azureml-opendatasets) (3.0.4)\n", + "Collecting ruamel.yaml.clib>=0.1.2; platform_python_implementation == \"CPython\" and python_version < \"3.9\"\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/88/ff/ec25dc01ef04232a9e68ff18492e37dfa01f1f58172e702ad4f38536d41b/ruamel.yaml.clib-0.2.2-cp36-cp36m-manylinux1_x86_64.whl (549kB)\n", + "\u001b[K |████████████████████████████████| 552kB 44.3MB/s \n", + "\u001b[?25hRequirement already satisfied: pyasn1>=0.1.1 in /usr/local/lib/python3.6/dist-packages (from ndg-httpsclient->azureml-core~=1.16.0->azureml-opendatasets) (0.4.8)\n", + "Collecting backports.weakref\n", + " Downloading https://files.pythonhosted.org/packages/88/ec/f598b633c3d5ffe267aaada57d961c94fdfa183c5c3ebda2b6d151943db6/backports.weakref-1.0.post1-py2.py3-none-any.whl\n", + "Requirement already satisfied: requests-oauthlib>=0.5.0 in /usr/local/lib/python3.6/dist-packages (from msrest>=0.5.1->azureml-core~=1.16.0->azureml-opendatasets) (1.3.0)\n", + "Collecting isodate>=0.6.0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/9b/9f/b36f7774ff5ea8e428fdcfc4bb332c39ee5b9362ddd3d40d9516a55221b2/isodate-0.6.0-py2.py3-none-any.whl (45kB)\n", + "\u001b[K |████████████████████████████████| 51kB 6.3MB/s \n", + "\u001b[?25hCollecting websocket-client>=0.32.0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/4c/5f/f61b420143ed1c8dc69f9eaec5ff1ac36109d52c80de49d66e0c36c3dfdf/websocket_client-0.57.0-py2.py3-none-any.whl (200kB)\n", + "\u001b[K |████████████████████████████████| 204kB 43.8MB/s \n", + "\u001b[?25hCollecting jeepney>=0.4.2\n", + " Downloading 
https://files.pythonhosted.org/packages/79/31/2e8d42727595faf224c6dbb748c32b192e212f25495fe841fb7ce8e168b8/jeepney-0.4.3-py3-none-any.whl\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib/python3.6/dist-packages (from cryptography!=1.9,!=2.0.*,!=2.1.*,!=2.2.*->azureml-core~=1.16.0->azureml-opendatasets) (1.14.3)\n", + "Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.6/dist-packages (from jsonpickle->azureml-core~=1.16.0->azureml-opendatasets) (2.0.0)\n", + "Collecting msal-extensions~=0.2.2\n", + " Downloading https://files.pythonhosted.org/packages/33/da/eed514cb6902405c5c11a03f1e65adbd95e2c26d9b22eae390eddb561201/msal_extensions-0.2.2-py2.py3-none-any.whl\n", + "Collecting azure-core<2.0.0,>=1.0.0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/4c/fa/46974f4a7ad78b27e3eda8a573cc0c2508849f0d7d360b61c07cc5b46014/azure_core-1.8.2-py2.py3-none-any.whl (122kB)\n", + "\u001b[K |████████████████████████████████| 122kB 49.9MB/s \n", + "\u001b[?25hCollecting msal<2.0.0,>=1.3.0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/47/84/72f350389f24a3127c8d4d8da0d0d73adb14f04dadca296575ca2ad20d42/msal-1.5.1-py2.py3-none-any.whl (50kB)\n", + "\u001b[K |████████████████████████████████| 51kB 6.1MB/s \n", + "\u001b[?25hCollecting distro>=1.2.0\n", + " Downloading https://files.pythonhosted.org/packages/25/b7/b3c4270a11414cb22c6352ebc7a83aaa3712043be29daa05018fd5a5c956/distro-1.5.0-py2.py3-none-any.whl\n", + "Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from requests-oauthlib>=0.5.0->msrest>=0.5.1->azureml-core~=1.16.0->azureml-opendatasets) (3.1.0)\n", + "Requirement already satisfied: pycparser in /usr/local/lib/python3.6/dist-packages (from cffi!=1.11.3,>=1.8->cryptography!=1.9,!=2.0.*,!=2.1.*,!=2.2.*->azureml-core~=1.16.0->azureml-opendatasets) (2.20)\n", + "Requirement already satisfied: zipp>=0.5 in 
/usr/local/lib/python3.6/dist-packages (from importlib-metadata->jsonpickle->azureml-core~=1.16.0->azureml-opendatasets) (3.2.0)\n", + "Collecting portalocker~=1.0; platform_system != \"Windows\"\n", + " Downloading https://files.pythonhosted.org/packages/3b/e7/ceef002a300a98a208232fab593183249b6964b306ee7dabb29908419cca/portalocker-1.7.1-py2.py3-none-any.whl\n", + "Building wheels for collected packages: pyspark, fusepy\n", + " Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=4c5104794485520f1b81919c0c0ca277ae8ea760b3935ca45aeb0354de5d5c99\n", + " Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f\n", + " Building wheel for fusepy (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for fusepy: filename=fusepy-3.0.1-cp36-none-any.whl size=10504 sha256=6ed5e9d7be0a259083f02d32dbf37fea9cc48dc8e8aa628d37952fafa4ebf2bd\n", + " Stored in directory: /root/.cache/pip/wheels/4c/a5/91/7772af9e21c461f07bb40f26d928d7d231d224977dd8353bab\n", + "Successfully built pyspark fusepy\n", + "\u001b[31mERROR: google-colab 1.0.0 has requirement pandas~=1.1.0; python_version >= \"3.0\", but you'll have pandas 1.0.0 which is incompatible.\u001b[0m\n", + "\u001b[31mERROR: fbprophet 0.7.1 has requirement pandas>=1.0.4, but you'll have pandas 1.0.0 which is incompatible.\u001b[0m\n", + "\u001b[31mERROR: azureml-dataset-runtime 1.16.0 has requirement pyarrow<2.0.0,>=0.17.0, but you'll have pyarrow 2.0.0 which is incompatible.\u001b[0m\n", + "Installing collected packages: pandas, pyarrow, PyJWT, msal, portalocker, msal-extensions, azure-core, cryptography, azure-identity, azureml-dataprep-native, distro, dotnetcore2, azureml-dataprep-rslex, azureml-dataprep, fusepy, azureml-dataset-runtime, azure-common, isodate, msrest, adal, msrestazure, azure-mgmt-authorization, ruamel.yaml.clib, ruamel.yaml, jmespath, 
azure-mgmt-storage, pyopenssl, ndg-httpsclient, backports.weakref, backports.tempfile, pathspec, websocket-client, docker, azure-graphrbac, azure-mgmt-containerregistry, jeepney, SecretStorage, azure-mgmt-keyvault, jsonpickle, azure-mgmt-resource, azureml-core, applicationinsights, azureml-telemetry, py4j, pyspark, azureml-opendatasets\n", + " Found existing installation: pandas 1.1.2\n", + " Uninstalling pandas-1.1.2:\n", + " Successfully uninstalled pandas-1.1.2\n", + " Found existing installation: pyarrow 0.14.1\n", + " Uninstalling pyarrow-0.14.1:\n", + " Successfully uninstalled pyarrow-0.14.1\n", + "Successfully installed PyJWT-1.7.1 SecretStorage-3.1.2 adal-1.2.5 applicationinsights-0.11.9 azure-common-1.1.25 azure-core-1.8.2 azure-graphrbac-0.61.1 azure-identity-1.4.1 azure-mgmt-authorization-0.61.0 azure-mgmt-containerregistry-2.8.0 azure-mgmt-keyvault-2.2.0 azure-mgmt-resource-10.2.0 azure-mgmt-storage-11.2.0 azureml-core-1.16.0.post1 azureml-dataprep-2.3.4 azureml-dataprep-native-23.0.0 azureml-dataprep-rslex-1.1.3 azureml-dataset-runtime-1.16.0 azureml-opendatasets-1.16.0 azureml-telemetry-1.16.0 backports.tempfile-1.0 backports.weakref-1.0.post1 cryptography-3.1.1 distro-1.5.0 docker-4.3.1 dotnetcore2-2.1.17 fusepy-3.0.1 isodate-0.6.0 jeepney-0.4.3 jmespath-0.10.0 jsonpickle-1.4.1 msal-1.5.1 msal-extensions-0.2.2 msrest-0.6.19 msrestazure-0.6.4 ndg-httpsclient-0.5.1 pandas-1.0.0 pathspec-0.8.0 portalocker-1.7.1 py4j-0.10.9 pyarrow-2.0.0 pyopenssl-19.1.0 pyspark-3.0.1 ruamel.yaml-0.16.12 ruamel.yaml.clib-0.2.2 websocket-client-0.57.0\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "application/vnd.colab-display-data+json": { + "pip_warning": { + "packages": [ + "azureml", + "pandas" + ] + } + } + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "74DHjOrSuOwd" + }, + "source": [ + "from azureml import Datasets" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "uIMB3a9EuQ9h" + }, + "source": [ + "from azureml.core import Dataset\n", + "from azureml.opendatasets import NycTlcYellow, NycTlcGreen\n", + "from dateutil import parser\n", + "from datetime import datetime\n", + "from dateutil.relativedelta import relativedelta" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oV-ETadXvsuG", + "outputId": "61ceff13-4640-4fad-ddc1-810132ee4b5a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 766 + } + }, + "source": [ + "end_date = parser.parse('2018-06-06')\n", + "start_date = parser.parse('2018-05-01')\n", + "nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)\n", + "nyc_tlc_df = nyc_tlc.to_pandas_dataframe() " + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.6/dist-packages/azureml/opendatasets/dataaccess/_blob_accessor.py:526: Warning: Please install azureml-dataset-runtimeusing pip install azureml-dataset-runtime\n", + " Warning)\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00001-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426336-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00002-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426334-119.c000.snappy.parquet\n", + "[Info] read from 
/tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00003-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426340-115.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00004-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426331-116.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00005-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426324-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00006-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426326-116.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00007-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426332-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00008-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426341-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00009-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426325-116.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00010-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426335-117.c000.snappy.parquet\n", + "[Info] read from 
/tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00011-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426338-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00012-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426337-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00013-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426327-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00014-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426330-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00015-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426342-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00016-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426328-116.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00017-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426323-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00018-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426329-118.c000.snappy.parquet\n", + "[Info] read from 
/tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00019-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426333-116.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-119.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00001-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426336-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00002-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426334-120.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00003-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426340-116.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00004-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426331-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00005-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426324-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00006-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426326-117.c000.snappy.parquet\n", + "[Info] read from 
/tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00007-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426332-119.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00008-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426341-119.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00009-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426325-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00010-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426335-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00011-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426338-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00012-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426337-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00013-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426327-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00014-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426330-119.c000.snappy.parquet\n", + "[Info] read from 
/tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00015-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426342-118.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00016-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426328-117.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00017-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426323-119.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00018-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426329-119.c000.snappy.parquet\n", + "[Info] read from /tmp/tmpmgax2urz/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=6/part-00019-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426333-117.c000.snappy.parquet\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H-zwPxoLxVgp" + }, + "source": [ + "Link: https://docs.microsoft.com/pt-pt/azure/machine-learning/tutorial-auto-train-models" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tUwpNHtbwn41", + "outputId": "f7a9e926-5e4c-49d0-9fc8-2797121bd120", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 221 + } + }, + "source": [ + " nyc_tlc_df.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
vendorIDtpepPickupDateTimetpepDropoffDateTimepassengerCounttripDistancepuLocationIddoLocationIdstartLonstartLatendLonendLatrateCodeIdstoreAndFwdFlagpaymentTypefareAmountextramtaTaximprovementSurchargetipAmounttollsAmounttotalAmount
022018-05-27 17:50:342018-05-27 17:56:4130.82161100NaNNaNNaNNaN1N26.00.00.50.30.000.06.80
122018-05-23 08:20:412018-05-23 08:37:0611.69142162NaNNaNNaNNaN1N111.50.00.50.33.080.015.38
322018-05-23 09:02:542018-05-23 09:17:5926.6414087NaNNaNNaNNaN1N219.50.00.50.30.000.020.30
522018-05-23 13:28:482018-05-23 13:35:1510.61170234NaNNaNNaNNaN1N16.00.00.50.31.000.07.80
722018-05-23 07:05:502018-05-23 07:07:4020.484850NaNNaNNaNNaN1N23.50.00.50.30.000.04.30
\n", + "
" + ], + "text/plain": [ + " vendorID tpepPickupDateTime ... tollsAmount totalAmount\n", + "0 2 2018-05-27 17:50:34 ... 0.0 6.80\n", + "1 2 2018-05-23 08:20:41 ... 0.0 15.38\n", + "3 2 2018-05-23 09:02:54 ... 0.0 20.30\n", + "5 2 2018-05-23 13:28:48 ... 0.0 7.80\n", + "7 2 2018-05-23 07:05:50 ... 0.0 4.30\n", + "\n", + "[5 rows x 21 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oflehhy7wtde", + "outputId": "b62418c1-5469-4c9c-d693-20af6c87b48b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "nyc_tlc_df.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(10695823, 21)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C1-G9EajxFbJ" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYQ4cDfcPu4e" + }, + "source": [ + "___\n", + "# **NOTAS E OBSERVAÇÕES**\n", + "* Abordar o impacto do desbalanceamento da amostra;\n", + "* Colocar AUROC no material e mostrar o cut off para classificação entre 0 e 1;\n", + "* Conceitos estatísticos de bias & variance;\n", + "* Ver Sklearn.optimize: https://web.telegram.org/#/im?p=g497957288;\n", + "* Construir a package para conter todas as funções definidas e colocar estas funções na package --> Manutenção rápida, fácil e centralizada! Desta forma, o tópico (\"Funções usadas neste tutorial\") vai totalmente para o package." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YvhLC_uf4_G" + }, + "source": [ + "___\n", + "# **AGENDA**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QgX6n2VDyY1O" + }, + "source": [ + "___\n", + "# **REFERENCES**\n", + "* [scikit-learn - Machine Learning With Python](https://scikit-learn.org/stable/)\n", + "* [An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples](https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer)\n", + "* [The Difference Between Artificial Intelligence, Machine Learning, and Deep Learning](https://medium.com/iotforall/the-difference-between-artificial-intelligence-machine-learning-and-deep-learning-3aa67bff5991)\n", + "* [A Gentle Guide to Machine Learning](https://blog.monkeylearn.com/a-gentle-guide-to-machine-learning/)\n", + "* [A Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n", + "* [Introduction to Machine Learning](http://alex.smola.org/drafts/thebook.pdf)\n", + "* [The 10 Statistical Techniques Data Scientists Need to Master](https://medium.com/cracking-the-data-science-interview/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7)\n", + "* [Tune: a library for fast hyperparameter tuning at any scale](https://towardsdatascience.com/fast-hyperparameter-tuning-at-scale-d428223b081c)\n", + "* [How to lie with Data Science](https://towardsdatascience.com/how-to-lie-with-data-science-5090f3891d9c)\n", + "* [5 Reasons “Logistic Regression” should be the first thing you learn when becoming a Data Scientist](https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4)\n", + "* [Machine learning on categorical variables](https://towardsdatascience.com/machine-learning-on-categorical-variables-3b76ffe4a7cb)\n", + "\n", + "## Deep Learning & Neural Networks\n", + "\n", +
"- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html)\n", + "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721)\n", + "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsCbZd2epfxo" + }, + "source": [ + "___\n", + "# **INTRODUÇÃO**\n", + "\n", + "* \"__Information is the oil of the 21st century, and analytics is the combustion engine__.\" - Peter Sondergaard, SVP, Garner Research;\n", + "\n", + "\n", + ">O foco deste capítulo será:\n", + "* Linear, Logistic Regression, Decision Tree, Random Forest, Support Vector Machine and XGBoost algorithms for building Machine Learning models;\n", + "* Entender como resolver problemas de classificação e Regressão;\n", + "* Aplicar técnicas de Ensemble como Bagging e Boosting;\n", + "* Como medir a acurácia dos modelos de Machine Learning;\n", + "* Aprender os principais algoritmos de Machine Learning tanto das técnicas de aprendizagem supervisionada quanto da não-supervisionada.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HqqB2vaHXMGt" + }, + "source": [ + "___\n", + "# **ARTIFICIAL INTELLIGENCE VS MACHINE LEARNING VS DEEP LEARNING**\n", + "* **Machine Learning** - dá aos computadores a capacidade de aprender sem serem explicitamente programados. Os computadores podem melhorar sua capacidade de aprendizagem através da prática de uma tarefa, geralmente usando grandes conjuntos de dados.\n", + "* **Deep Learning** - é um método de Machine Learning que depende de redes neurais artificiais, permitindo que os sistemas de computadores aprendam pelo exemplo, assim como nós humanos aprendemos." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P961GcguXFFA" + }, + "source": [ + "![EvolutionOfAI](https://github.com/MathMachado/Materials/blob/master/Evolution%20of%20AI.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://github.com/MathMachado/P4ML/blob/DS_Python/Material/Evolution%20of%20AI.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkqGtO88ZkPr" + }, + "source": [ + "![AI_vs_ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/AI_vs_ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xesQpzfmaqj6" + }, + "source": [ + "![ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. 
Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KeIVR59IIS7f" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING - TECHNIQUES**\n", + "\n", + "* Supervised Learning\n", + "* Unsupervised Learning\n", + "\n", + "![MachineLearning](https://github.com/MathMachado/Materials/blob/master/MachineLearningTechniques.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rvwp5UHdBiup" + }, + "source": [ + "___\n", + "# **OUR FOCUS HERE WILL BE...**\n", + "\n", + "![ClassicalML](https://github.com/MathMachado/Materials/blob/master/ClassicalML.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cBLSvJTXHBjK" + }, + "source": [ + "___\n", + "# **CHEAT SHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZdjR3nahUuKq" + }, + "source": [ + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XRukccWQSklx" + }, + "source": [ + "## Measures of the variability present in the data\n", + "* The main measures of variability are the range, the variance, the standard deviation and the coefficient of variation;\n", + "* These measures tell us whether the data are homogeneous (lower dispersion/variability) or heterogeneous (higher dispersion/variability).\n", + "\n", + "* **In the next version, bring these concepts into the Notebook and use Python to compute these measures**."
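, + "\n", + "As a quick reference, the usual sample definitions of these measures can be written as follows. The notation ($x_1, \\dots, x_n$ for the observations, $\\bar{x}$ for their mean) and the toy numbers are assumptions introduced here for illustration only:\n", + "\n", + "$$Range = x_{max} - x_{min} \\qquad s^2 = \\frac{1}{n-1}\\sum_{i=1}^{n}(x_i-\\bar{x})^2 \\qquad s = \\sqrt{s^2} \\qquad CV = \\frac{s}{\\bar{x}}$$\n", + "\n", + "For the toy sample $2, 4, 4, 6$: $\\bar{x} = 4$, $Range = 4$, $s^2 = 8/3 \\approx 2.67$, $s \\approx 1.63$ and $CV \\approx 0.41$."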
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yBR8tWV_lhQq" + }, + "source": [ + "___\n", + "# **ENSEMBLE METHODS** (= combining predictive models)\n", + "* Methods\n", + "  * **Bagging** (Bootstrap AGGregatING)\n", + "  * **Boosting**\n", + "  * Stacking --> less commonly used\n", + "* Helps avoid overfitting (overfitting happens when the model/function fits the training data very well but fails to generalize to other samples or to the population).\n", + "* Builds meta-classifiers: the results of several algorithms are combined to produce predictions that are more accurate and robust than those of any individual classifier.\n", + "* Ensembles reduce the effect of the main sources of error in Machine Learning models:\n", + "  * noise;\n", + "  * bias;\n", + "  * variance --> the main measure of the variability present in the data.\n", + "\n", + "# References\n", + "* [Simple guide for ensemble learning methods](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2) - A clear, didactic explanation of how ensembles work."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25RW8u-Sj780" + }, + "source": [ + "### Further Reading\n", + "* [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)\n", + "* [Ensemble Methods in Machine Learning: What are They and Why Use Them?](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)\n", + "* [Ensemble Learning Using Scikit-learn](https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a)\n", + "* [Let’s Talk About Machine Learning Ensemble Learning In Python](https://medium.com/fintechexplained/lets-talk-about-machine-learning-ensemble-learning-in-python-382747e5fba8)\n", + "* [Boosting, Bagging, and Stacking - Ensemble Methods with sklearn and mlens](https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FugME1HSl4jJ" + }, + "source": [ + "___\n", + "# **PARAMETER TUNING** (= optimal hyperparameters for Machine Learning models)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u_147cIRl9F1" + }, + "source": [ + "## GridSearch (the tool we will use to optimize the hyperparameters of the ML models)\n", + "* Finds the optimal hyperparameters (hyperparameter tuning) that improve the accuracy of the models.\n", + "* It needs the following inputs:\n", + "  * The matrix $X_{p}$ with the $p$ COLUMNS (variables or features) of the dataframe;\n", + "  * The vector $y$ with the target COLUMN (response variable);\n", + "  * The model (estimator) to be tuned, for example: DecisionTreeClassifier, RandomForestClassifier, XGBClassifier, etc.;\n", + "  * A dictionary with the hyperparameters to be optimized;\n", + "  * The number of folds for the Cross-validation method."
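, + "\n", + "To get a feel for the cost of a grid search, the number of model fits can be estimated as below. The symbols ($K$ folds, $m$ hyperparameters, $|G_j|$ candidate values for hyperparameter $j$) and the example grid are assumptions made here for illustration only:\n", + "\n", + "$$N_{fits} = K \\times \\prod_{j=1}^{m} |G_j|$$\n", + "\n", + "For example, a grid with 2 values of `criterion`, 3 of `max_depth` and 4 of `min_samples_split`, evaluated with $K = 10$ folds, needs $10 \\times 2 \\times 3 \\times 4 = 240$ fits, plus one final refit on the whole training sample when `refit=True` (the scikit-learn default)."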
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "39Sg77fbTWCO" + }, + "source": [ + "___\n", + "# **MODEL SELECTION & EVALUATION**\n", + "> In this phase we identify and apply the most suitable metrics (Accuracy, Sensitivity, Specificity, F-Score, AUC, R-squared, Adjusted R-squared, RMSE (Root Mean Square Error)) to evaluate the performance of the ML models.\n", + ">> We train the ML models on the training sample and evaluate their performance on the test/validation sample.\n", + "\n", + "* Further Reading\n", + "  * [The 5 Classification Evaluation metrics every Data Scientist must know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)\n", + "  * [Confusion matrix and other metrics in machine learning](https://medium.com/hugo-ferreiras-blog/confusion-matrix-and-other-metrics-in-machine-learning-894688cb1c0a)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oQQVzZ2ZTYrB" + }, + "source": [ + "## Confusion Matrix\n", + "* Terms associated with the Confusion Matrix:\n", + "  * **True Positive** (TP): the observed value is True and the model predicts True. That is, the model got the estimate right.\n", + "    * Example: **Observed**: Fraud (Positive); **Model**: Fraud (Positive) --> the model was right!\n", + "  * **True Negative** (TN): the observed value is False and the model predicts False. That is, the model got the estimate right;\n", + "    * Example: **Observed**: NON-Fraud (Negative); **Model**: NON-Fraud (Negative) --> the model was right!\n", + "  * **False Positive** (FP): the observed value is False and the model predicts True. That is, the model got the estimate wrong. 
\n", + "    * Example: **Observed**: NON-Fraud (Negative); **Model**: Fraud (Positive) --> the model was wrong!\n", + "  * **False Negative** (FN): the observed value is True and the model predicts False. That is, the model got the estimate wrong.\n", + "    * Example: **Observed**: Fraud (Positive); **Model**: NON-Fraud (Negative) --> the model was wrong!\n", + "\n", + "* See [Confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)\n", + "\n", + "![ConfusionMatrix](https://github.com/MathMachado/Materials/blob/master/ConfusionMatrix.PNG?raw=true)\n", + "\n", + "Source: [Confusion Matrix](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ci-6eiqBTgbL" + }, + "source": [ + "## Accuracy\n", + "> Accuracy - the proportion of predictions that the model got right.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "How often does the classifier (predictive model) classify correctly?\n", + "```\n", + "\n", + "$$Accuracy= \\frac{TP+TN}{TP+TN+FP+FN}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7YI8X5TRx-R" + }, + "source": [ + "## Precision (or Positive Predictive Value)\n", + "> **Precision** - tells us how the classifier performs with respect to False Positives (how many of the cases flagged as Positive are really Positive).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the classifier predicts Positive, how often is it correct?\n", + "```\n", + "\n", + "\n", + "$$Precision= \\frac{TP}{TP+FP}$$\n", + "\n", + "**Example**: Precision is the proportion of customers that the model estimated as Fraud that really are Fraud.\n", + "\n", + "**Comment**: If our focus is to minimize False Positives (FP), we should aim for Precision as close to 100% as possible."
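, + "\n", + "A small worked example may help. The counts below are hypothetical, chosen only for illustration: $TP = 80$, $TN = 890$, $FP = 20$, $FN = 10$.\n", + "\n", + "$$Accuracy = \\frac{80+890}{80+890+20+10} = 0.97 \\qquad Precision = \\frac{80}{80+20} = 0.80$$\n", + "\n", + "So the model is right in 97% of all cases, but only 80% of the cases it flags as Fraud are actually Fraud."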
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zO39n8x_Sz3L" + }, + "source": [ + "## Recall (or Sensitivity)\n", + "> **Recall** - tells us how the classifier performs with respect to False Negatives (how many Positives we missed).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Positive, how often is the classifier correct?\n", + "```\n", + "\n", + "$$Recall = Sensitivity = \\frac{TP}{TP+FN}$$\n", + "\n", + "**Example**: Recall is the proportion of customers observed as Fraud that the model estimates as Fraud.\n", + "\n", + "**Comment**: If our focus is to minimize False Negatives (FN), we should aim for Recall as close to 100% as possible." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "htS6rdHVVXRG" + }, + "source": [ + "## Specificity\n", + "> **Specificity** - the ratio of TN to TN+FP.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Negative, how often is the classifier correct?\n", + "```\n", + "\n", + "**Example**: Specificity is the proportion of NON-Fraud customers that the model estimates as NON-Fraud.\n", + "\n", + "$$Specificity= \\frac{TN}{TN+FP}$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mNn0twadTacc" + }, + "source": [ + "## F1-Score\n", + "> F1-Score is the harmonic mean of Recall and Precision and is a number between 0 and 1. The closer to 1, the better. The closer to 0, the worse. 
That is, it balances Recall and Precision.\n", + "\n", + "$$F1\\_Score= 2\\left(\\frac{Recall*Precision}{Recall+Precision}\\right)$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gkCubyUCP_hn" + }, + "source": [ + "### Functions used in this tutorial" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZD2pH9hfTnZv" + }, + "source": [ + "#### Function for Cross-Validation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hr8LczrSQB0x" + }, + "source": [ + "from sklearn.model_selection import cross_val_score\n", + "\n", + "def funcao_cross_val_score(modelo, X_treinamento, y_treinamento, CV):\n", + "    # Version with sklearn.model_selection.cross_validate (kept for reference):\n", + "    #a_scores_CV = cross_validate(modelo, X_treinamento, y_treinamento, cv = CV, scoring = metodo)\n", + "    #print(f'Mean of the CV accuracies..............: {100*round(a_scores_CV.mean(),4)}')\n", + "    #print(f'Standard deviation of the CV accuracies: {100*round(a_scores_CV.std(),4)}')\n", + "    #return a_scores_CV\n", + "\n", + "    # Version with cross_val_score:\n", + "    a_scores_CV = cross_val_score(modelo, X_treinamento, y_treinamento, cv = CV)\n", + "    print(f'Mean of the CV accuracies..............: {100*a_scores_CV.mean():.2f}')\n", + "    print(f'Standard deviation of the CV accuracies: {100*a_scores_CV.std():.2f}')\n", + "    return a_scores_CV" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ROlyvgij2yl" + }, + "source": [ + "#### Function to plot the Confusion Matrix\n", + "* Adapted from [Confusion Matrix Visualization](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klQ0FLOIgeX1" + }, + "source": [ + "def mostra_confusion_matrix(cf, \n", + "                            group_names = None, \n", + "                            categories = 'auto', \n", + "                            count = True, \n", + "                            percent = True, \n", + "                            cbar = True, \n", + "                            xyticks = False, \n", + "                            xyplotlabels = True, \n", + "                            sum_stats = True, \n", + "                            figsize = (8, 8), 
\n", + "                            cmap = 'Blues'):\n", + "    '''\n", + "    This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.\n", + "    Arguments\n", + "    ---------\n", + "    cf:            confusion matrix to be passed in\n", + "    group_names:   List of strings that represent the labels row by row to be shown in each square.\n", + "    categories:    List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'\n", + "    count:         If True, show the raw number in the confusion matrix. Default is True.\n", + "    percent:       If True, show the proportions for each category. Default is True.\n", + "    cbar:          If True, show the color bar. The cbar values are based off the values in the confusion matrix.\n", + "                   Default is True.\n", + "    xyticks:       If True, show x and y ticks. Default is False.\n", + "    xyplotlabels:  If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.\n", + "    sum_stats:     If True, display summary statistics below the figure. Default is True.\n", + "    figsize:       Tuple representing the figure size. Default is (8, 8); pass None to use the matplotlib rcParams value.\n", + "    cmap:          Colormap of the values displayed from matplotlib.pyplot.cm. 
Default is 'Blues'\n", + "                   See http://matplotlib.org/examples/color/colormaps_reference.html\n", + "    '''\n", + "\n", + "    # CODE TO GENERATE TEXT INSIDE EACH SQUARE\n", + "    blanks = ['' for i in range(cf.size)]\n", + "\n", + "    if group_names and len(group_names)==cf.size:\n", + "        group_labels = [\"{}\\n\".format(value) for value in group_names]\n", + "    else:\n", + "        group_labels = blanks\n", + "\n", + "    if count:\n", + "        group_counts = [\"{0:0.0f}\\n\".format(value) for value in cf.flatten()]\n", + "    else:\n", + "        group_counts = blanks\n", + "\n", + "    if percent:\n", + "        group_percentages = [\"{0:.2%}\".format(value) for value in cf.flatten()/np.sum(cf)]\n", + "    else:\n", + "        group_percentages = blanks\n", + "\n", + "    box_labels = [f\"{v1}{v2}{v3}\".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]\n", + "    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])\n", + "\n", + "    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS\n", + "    if sum_stats:\n", + "        # Accuracy is sum of diagonal divided by total observations\n", + "        accuracy = np.trace(cf) / float(np.sum(cf))\n", + "\n", + "        # If it is a binary confusion matrix, show some more stats\n", + "        if len(cf)==2:\n", + "            # Metrics for Binary Confusion Matrices\n", + "            precision = cf[1,1] / sum(cf[:,1])\n", + "            recall = cf[1,1] / sum(cf[1,:])\n", + "            f1_score = 2*precision*recall / (precision + recall)\n", + "            stats_text = \"\\n\\nAccuracy={:0.3f}\\nPrecision={:0.3f}\\nRecall={:0.3f}\\nF1 Score={:0.3f}\".format(accuracy,precision,recall,f1_score)\n", + "        else:\n", + "            stats_text = \"\\n\\nAccuracy={:0.3f}\".format(accuracy)\n", + "    else:\n", + "        stats_text = \"\"\n", + "\n", + "    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS\n", + "    if figsize is None:\n", + "        # Get default figure size if not set\n", + "        figsize = plt.rcParams.get('figure.figsize')\n", + "\n", + "    if not xyticks:\n", + "        # Do not show categories if xyticks is False\n", + "        categories = False\n", +
"\n", + " # MAKE THE HEATMAP VISUALIZATION\n", + " plt.figure(figsize=figsize)\n", + " sns.heatmap(cf,annot=box_labels,fmt=\"\",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)\n", + "\n", + " if xyplotlabels:\n", + " plt.ylabel('True label')\n", + " plt.xlabel('Predicted label' + stats_text)\n", + " else:\n", + " plt.xlabel(stats_text)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8J-sTUfTTdLi" + }, + "source": [ + "#### Função para o GridSearchCV" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ap3WMXqDthu9" + }, + "source": [ + "def GridSearchOptimizer(modelo, ml_Opt, d_hiperparametros, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas):\n", + " ml_GridSearchCV = GridSearchCV(modelo, d_hiperparametros, cv = i_CV, n_jobs = -1, verbose= 10, scoring = 'accuracy')\n", + " start = time()\n", + " ml_GridSearchCV.fit(X_treinamento, y_treinamento)\n", + " tempo_elapsed = time()-start\n", + " #print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos.\")\n", + "\n", + " # Hiperparâmetros que otimizam a classificação:\n", + " print(f'\\nHiperparâmetros otimizados: {ml_GridSearchCV.best_params_}')\n", + " \n", + " if ml_Opt == 'ml_DT2':\n", + " print(f'\\nDecisionTreeClassifier *********************************************************************************************************')\n", + " ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_leaf_nodes= ml_GridSearchCV.best_params_['max_leaf_nodes'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_split'], \n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_RF2':\n", + " print(f'\\nRandomForestClassifier 
*********************************************************************************************************')\n", + "        ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'], \n", + "                                        max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + "                                        max_features= ml_GridSearchCV.best_params_['max_features'],\n", + "                                        min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + "                                        min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + "                                        n_estimators= ml_GridSearchCV.best_params_['n_estimators'],\n", + "                                        random_state= i_Seed)\n", + "    \n", + "    elif ml_Opt == 'ml_AB2':\n", + "        print(f'\\nAdaBoostClassifier *********************************************************************************************************')\n", + "        ml_Opt = AdaBoostClassifier(algorithm='SAMME.R', \n", + "                                    base_estimator=RandomForestClassifier(bootstrap = False, \n", + "                                                                          max_depth = 10, \n", + "                                                                          max_features = 'auto', \n", + "                                                                          min_samples_leaf = 1, \n", + "                                                                          min_samples_split = 2, \n", + "                                                                          n_estimators = 400), \n", + "                                    learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + "                                    n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + "                                    random_state = i_Seed)\n", + "    \n", + "    elif ml_Opt == 'ml_GB2':\n", + "        print(f'\\nGradientBoostingClassifier *********************************************************************************************************')\n", + "        ml_Opt = GradientBoostingClassifier(learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + "                                            n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + "                                            max_depth = ml_GridSearchCV.best_params_['max_depth'], \n", + "                                            min_samples_split = ml_GridSearchCV.best_params_['min_samples_split'], \n", + "                                            min_samples_leaf = ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + "                                            max_features = ml_GridSearchCV.best_params_['max_features'])\n", + "    \n", + "    elif ml_Opt == 'ml_XGB2':\n", + "        print(f'\\nXGBoostingClassifier 
*********************************************************************************************************')\n", + "        # XGBClassifier is the scikit-learn-compatible classifier class exposed by the xgboost package\n", + "        ml_Opt = XGBClassifier(learning_rate= ml_GridSearchCV.best_params_['learning_rate'], \n", + "                               max_depth= ml_GridSearchCV.best_params_['max_depth'], \n", + "                               colsample_bytree= ml_GridSearchCV.best_params_['colsample_bytree'], \n", + "                               subsample= ml_GridSearchCV.best_params_['subsample'], \n", + "                               gamma= ml_GridSearchCV.best_params_['gamma'], \n", + "                               min_child_weight= ml_GridSearchCV.best_params_['min_child_weight'])\n", + "    \n", + "    # Train again using the optimized hyperparameters...\n", + "    ml_Opt.fit(X_treinamento, y_treinamento)\n", + "\n", + "    # Cross-validation with i_CV folds\n", + "    print(f'\\n********* CROSS-VALIDATION ***********')\n", + "    a_scores_CV = funcao_cross_val_score(ml_Opt, X_treinamento, y_treinamento, i_CV)\n", + "\n", + "    # Make predictions with the optimized hyperparameters...\n", + "    y_pred = ml_Opt.predict(X_teste)\n", + "    \n", + "    # Importance of the COLUMNS\n", + "    print(f'\\n********* COLUMN IMPORTANCES ***********')\n", + "    df_importancia_variaveis = pd.DataFrame(list(zip(l_colunas, ml_Opt.feature_importances_)), columns= ['coluna', 'importancia'])\n", + "    df_importancia_variaveis = df_importancia_variaveis.sort_values(by= ['importancia'], ascending=False)\n", + "    print(df_importancia_variaveis)\n", + "\n", + "    # Confusion matrix\n", + "    print(f'\\n********* CONFUSION MATRIX - PARAMETER TUNING ***********')\n", + "    cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "    cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "    cf_categories = ['Zero', 'One']\n", + "    mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)\n", + "\n", + "    return ml_Opt, ml_GridSearchCV.best_params_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YMnQn2XgT7Mg" + }, + "source": [ + "#### Function to select 
relevant COLUMNS from the dataframes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fsnHcaeLUDFS" + }, + "source": [ + "from sklearn.feature_selection import SelectFromModel\n", + "\n", + "def seleciona_colunas_relevantes(modelo, X_treinamento, X_teste, threshold = 0.05):\n", + "    # Create a selector that keeps the COLUMNS with importance > threshold\n", + "    # (threshold is passed by keyword, as required by recent scikit-learn versions)\n", + "    sfm = SelectFromModel(modelo, threshold = threshold)\n", + "    \n", + "    # Fit the selector\n", + "    # Note: y_treinamento is not a parameter; it is read from the notebook's global scope\n", + "    sfm.fit(X_treinamento, y_treinamento)\n", + "\n", + "    # Show the indices of the most important COLUMNS\n", + "    print(f'\\n********** Relevant COLUMNS ******')\n", + "    print(sfm.get_support(indices=True))\n", + "\n", + "    # Keep only the relevant COLUMNS\n", + "    X_treinamento_I = sfm.transform(X_treinamento)\n", + "    X_teste_I = sfm.transform(X_teste)\n", + "    return X_treinamento_I, X_teste_I " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gd98JFSGUV5n" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3e0m7lEnYOV9" + }, + "source": [ + "### Function to compute the importance of the columns/variables/features\n", + "* Source: [Plotting Feature Importances](https://www.kaggle.com/grfiv4/plotting-feature-importances)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fjco0HnNYr-N" + }, + "source": [ + "def mostra_feature_importances(clf, X_treinamento, y_treinamento=None, \n", + "                               top_n=10, figsize=(8,8), print_table=False, title=\"Feature Importances\"):\n", + "    '''\n", + "    plot feature importances of a tree-based sklearn estimator\n", + "    \n", + "    Note: X_treinamento and y_treinamento are pandas DataFrames\n", + "    \n", + "    Note: Scikit-plot is a lovely package but I sometimes have issues\n", + "          1. flexibility/extendibility\n", + "          2. 
complicated models/datasets\n", + "          But for many situations Scikit-plot is the way to go\n", + "          see https://scikit-plot.readthedocs.io/en/latest/Quickstart.html\n", + "    \n", + "    Parameters\n", + "    ----------\n", + "        clf         (sklearn estimator) if not fitted, this routine will fit it\n", + "        \n", + "        X_treinamento     (pandas DataFrame)\n", + "        \n", + "        y_treinamento     (pandas DataFrame)  optional\n", + "                    required only if clf has not already been fitted \n", + "        \n", + "        top_n       (int) Plot the top_n most-important features\n", + "                    Default: 10\n", + "        \n", + "        figsize     ((int,int)) The physical size of the plot\n", + "                    Default: (8,8)\n", + "        \n", + "        print_table (boolean) If True, print out the table of feature importances\n", + "                    Default: False\n", + "        \n", + "    Returns\n", + "    -------\n", + "        the pandas dataframe with the features and their importance\n", + "        \n", + "    Author\n", + "    ------\n", + "        George Fisher\n", + "    '''\n", + "    \n", + "    __name__ = \"mostra_feature_importances\"\n", + "    \n", + "    import pandas as pd\n", + "    import numpy as np\n", + "    import matplotlib.pyplot as plt\n", + "    \n", + "    from xgboost.core import XGBoostError\n", + "    from lightgbm.sklearn import LightGBMError\n", + "    \n", + "    try: \n", + "        if not hasattr(clf, 'feature_importances_'):\n", + "            clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n", + "\n", + "            if not hasattr(clf, 'feature_importances_'):\n", + "                raise AttributeError(\"{} does not have feature_importances_ attribute\".\n", + "                                    format(clf.__class__.__name__))\n", + "                \n", + "    except (XGBoostError, LightGBMError, ValueError):\n", + "        clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n", + "            \n", + "    feat_imp = pd.DataFrame({'importance':clf.feature_importances_})    \n", + "    feat_imp['feature'] = X_treinamento.columns\n", + "    feat_imp.sort_values(by ='importance', ascending = False, inplace = True)\n", + "    feat_imp = feat_imp.iloc[:top_n]\n", + "    \n", + "    feat_imp.sort_values(by='importance', inplace = True)\n", + "    feat_imp 
= feat_imp.set_index('feature', drop = True)\n", + "    feat_imp.plot.barh(title=title, figsize=figsize)\n", + "    plt.xlabel('Feature Importance Score')\n", + "    plt.show()\n", + "    \n", + "    if print_table:\n", + "        from IPython.display import display\n", + "        print(\"Top {} features in descending order of importance\".format(top_n))\n", + "        display(feat_imp.sort_values(by = 'importance', ascending = False))\n", + "        \n", + "    return feat_imp" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rsH9dMxazWCg" + }, + "source": [ + "# **EXAMPLE DATAFRAME USED IN THIS TUTORIAL**\n", + "> Generate a dataframe with 18 columns: 9 informative, 6 redundant and 3 repeated:\n", + "\n", + "To learn more about generating example (toy) dataframes, see [Synthetic data generation - a must-have skill for new data scientists](https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GEyDo_EIV_jV" + }, + "source": [ + "## Define global variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TdwgpZ76WFaT" + }, + "source": [ + "i_CV = 10 # Number of cross-validation folds\n", + "i_Seed = 20111974 # Seed, for reproducibility\n", + "f_Test_Size = 0.3 # Proportion of the data held out for validation (other common values: 0.15, 0.20 or 0.25)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gJTJfpwWzykS" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "\n", + "X, y = make_classification(n_samples = 1000, \n", + "                           n_features = 18, \n", + "                           n_informative = 9, \n", + "                           n_redundant = 6, \n", + "                           n_repeated = 3, \n", + "                           n_classes = 2, \n", + "                           n_clusters_per_class = 1, \n", + "                           random_state=i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"gWy2IZh3s-o3", + "outputId": "9ddd6f9a-cfa1-4421-f25b-5e61297a966f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 235 + } + }, + "source": [ + "X" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 0.06844089, 4.21184154, -2.5583024 , ..., -0.63061895,\n", + " -0.97831983, -0.88826977],\n", + " [-4.8240213 , 0.17950903, -2.98447332, ..., 0.33992045,\n", + " 1.89153784, -6.10967565],\n", + " [ 1.38953042, -0.226476 , 1.8774004 , ..., -1.47784549,\n", + " 0.96052606, 2.06020368],\n", + " ...,\n", + " [ 1.62548685, 0.43377848, 4.93537285, ..., -4.61990917,\n", + " 0.18310709, 6.16040231],\n", + " [-2.40619087, -1.65474635, 2.64196493, ..., -1.21427845,\n", + " 0.83745861, 0.8254619 ],\n", + " [-4.00041881, 2.52475556, -4.15290177, ..., -0.51680266,\n", + " 1.72224835, -5.59558306]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 73 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ccjhGnzxtAaV", + "outputId": "f756e8a1-baba-42e0-c358-660fd16b5017", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "y[0:30] # Semelhante aos casos de fraude: {0, 1}" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 74 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHO2befKJxR3" + }, + "source": [ + "___\n", + "# **DECISION TREE**\n", + "> Decision Trees possuem estrutura em forma de árvores.\n", + "\n", + "* **Principais Vantagens**:\n", + " * São algoritmos fáceis de entender, visualizar e interpretar;\n", + " * Captura facilmente padrões não-lineares presentes nos dados;\n", + " * Requer pouco poder computacional --> Treinar Decision Trees não requer 
many computational resources!\n", + "  * They handle numeric and categorical COLUMNS well (in scikit-learn, categorical columns must first be encoded as numbers);\n", + "  * They do not require the data to be normalized;\n", + "  * They can be used for Feature Engineering when dealing with Missing Values;\n", + "  * They can be used for Feature Selection;\n", + "  * They make no assumptions about the distribution of the data, because the algorithm is non-parametric.\n", + "\n", + "* **Main disadvantages**\n", + "  * Prone to overfitting, because Decision Trees can grow complex trees that do not generalize well. Things get considerably worse if the training sample contains outliers. Therefore, **treating outliers beforehand is strongly recommended**.\n", + "  * They can produce biased trees if the dataframe is imbalanced or one class is dominant. For this reason, **balancing the dataframe beforehand is recommended to avoid this problem**.\n", + "\n", + "* **Main hyperparameters** (split criteria, chosen through `criterion`)\n", + "  * **Gini Index** - a metric that measures how often a randomly selected point/observation would be incorrectly classified.\n", + "    * Therefore, the lower the Gini Index, the better the COLUMN;\n", + "  * **Entropy** - a metric that measures the randomness of the information present in the data.\n", + "    * Therefore, the higher the entropy of a COLUMN, the less it helps us reach a conclusion (classify, for example).\n", + "\n", + "## **References**:\n", + "* [1.10. 
Decision Trees](https://scikit-learn.org/stable/modules/tree.html).\n", + "* [Decision Tree Algorithm With Hands On Example](https://medium.com/datadriveninvestor/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38) - a great tutorial for learning, understanding, interpreting and computing the Gini and entropy indices.\n", + "* [Intuitive Guide to Understanding Decision Trees](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-decision-trees-adb2165ccab7) - a great tutorial for learning, understanding, interpreting and computing the Gini and entropy indices.\n", + "* [The Complete Guide to Decision Trees](https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14)\n", + "* [Creating and Visualizing Decision Tree Algorithm in Machine Learning Using Sklearn](https://intellipaat.com/blog/decision-tree-algorithm-in-machine-learning/) - Very didactic!\n", + "* [Decision Trees in Machine Learning](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FrMkPN5aLp0Y" + }, + "source": [ + "## Load the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FVU1CM0PKgO4" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "15clh4XrISpz" + }, + "source": [ + "## Load/Read the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UMPL46w2IWJw" + }, + "source": [ + "l_colunas = ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18']\n", + "\n", + "df_X = pd.DataFrame(X, columns = l_colunas)\n", + "df_y = pd.DataFrame(y, columns = ['target'])" + ], +
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFaQF2MGFl_M", + "outputId": "eefcee87-d906-4d9e-ddf4-b6c969601263", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18
00.0684414.211842-2.5583023.665482-3.8351583.4998512.4908563.6654820.2451170.8671722.8655460.493956-5.1485962.8655463.499851-0.630619-0.978320-0.888270
1-4.8240210.179509-2.9844731.033618-3.8934263.428734-3.3346051.033618-0.882780-0.7532811.441522-1.395514-4.0028801.4415223.4287340.3399201.891538-6.109676
21.389530-0.2264761.8774002.7134264.6302570.516455-3.7430272.7134261.2840392.030797-1.0955361.560159-1.014211-1.0955360.516455-1.4778450.9605262.060204
31.1458092.2559460.2073644.6658172.2946786.5013060.9647704.6658170.1194103.1963541.8947873.519138-4.7578071.8947876.501306-3.7890290.5794911.397106
4-0.9366463.697163-3.3636173.805126-1.7544304.9543460.4066053.805126-0.8247381.3825911.665704-0.649758-3.5130361.6657044.9543460.2570520.904244-3.071354
\n", + "
" + ], + "text/plain": [ + " v1 v2 v3 ... v16 v17 v18\n", + "0 0.068441 4.211842 -2.558302 ... -0.630619 -0.978320 -0.888270\n", + "1 -4.824021 0.179509 -2.984473 ... 0.339920 1.891538 -6.109676\n", + "2 1.389530 -0.226476 1.877400 ... -1.477845 0.960526 2.060204\n", + "3 1.145809 2.255946 0.207364 ... -3.789029 0.579491 1.397106\n", + "4 -0.936646 3.697163 -3.363617 ... 0.257052 0.904244 -3.071354\n", + "\n", + "[5 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 77 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s-ibdD2ZG7tm", + "outputId": "1a828ec6-51fa-4e8e-f081-c43f8e74b55b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1000, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 78 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f9cqRaywa_TR", + "outputId": "750797e1-f1c0-4da6-bc32-33d91a8770e4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "set(df_y['target'])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{0, 1}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 79 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BN6jbpn6Iwmu" + }, + "source": [ + "## Estatísticas Descritivas básicas do dataframe - df.describe()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KlwhxxUNIyYs", + "outputId": "5ecb44eb-d7ba-46cf-c460-c48cbe02dbea", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 304 + } + }, + "source": [ + "df_X.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18
count1000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.000000
mean-0.0851591.0342270.6574081.4053170.6872791.1315600.1080531.4053171.0070231.0488010.0792480.001650-0.3654380.0792481.131560-0.0277510.9846060.633624
std2.0022471.6315073.6087722.2568574.0195984.4818321.9813072.2568571.8632881.6439001.9492731.9326414.1606681.9492734.4818322.0654551.8505933.552991
min-6.944169-4.620754-16.300139-6.235192-12.454256-14.305401-6.152747-6.235192-5.484992-3.293216-7.135349-5.705500-9.120941-7.135349-14.305401-6.009023-5.035184-11.439074
25%-1.305566-0.089052-1.623657-0.152888-1.854645-1.684751-1.216983-0.152888-0.240908-0.012710-1.209675-1.292162-3.555363-1.209675-1.684751-1.436673-0.261610-1.691346
50%0.0525230.9941500.5738491.4499310.8123641.2815040.1670911.4499311.0661251.0128990.1803440.035237-0.9666380.1803441.281504-0.0001900.9757930.844784
75%1.3838532.0719953.0385862.8871413.4139524.0081031.4387192.8871412.2881882.1872021.4391991.3153422.7458061.4391994.0081031.3653692.2565043.109330
max4.9971727.35486011.7201658.49456612.84441815.9998036.2935508.4945668.1465596.5231806.2524485.53821611.2593506.25244815.9998036.5315617.64680212.090528
\n", + "
" + ], + "text/plain": [ + " v1 v2 ... v17 v18\n", + "count 1000.000000 1000.000000 ... 1000.000000 1000.000000\n", + "mean -0.085159 1.034227 ... 0.984606 0.633624\n", + "std 2.002247 1.631507 ... 1.850593 3.552991\n", + "min -6.944169 -4.620754 ... -5.035184 -11.439074\n", + "25% -1.305566 -0.089052 ... -0.261610 -1.691346\n", + "50% 0.052523 0.994150 ... 0.975793 0.844784\n", + "75% 1.383853 2.071995 ... 2.256504 3.109330\n", + "max 4.997172 7.354860 ... 7.646802 12.090528\n", + "\n", + "[8 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 80 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N_QhFqyZOKFB" + }, + "source": [ + "## Select the training and test samples\n", + "\n", + "* Split the data/samples into:\n", + " * **Training sample**: used to train the model and tune the hyperparameters;\n", + " * **Test sample**: used to check whether the tuned model works on completely unseen data. It is on this test sample that we evaluate how well the model generalizes (works with data it was never shown);\n", + "* **Hold-Out technique**: split the data into a training sample and a test sample. We usually use 70% of the sample for training and 30% for testing. Other options are 80/20 or 75/25 (the `train_test_split` default).\n", + " * **Disadvantage of Hold-Out**: variance in the data can compromise the measured performance, for example when the training sample is not representative of the test sample, so the result depends on the particular split. 
\n", + "* See [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for more details.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8sKBgs-QOOfn" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TPTKBBHgOpoA", + "outputId": "257ee423-a06a-47ad-b865-bbc2078a305f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 82 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lEn_LLs2OtRI", + "outputId": "bdf4bde9-fb11-4b98-e9ce-14c23fa75e11", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_treinamento.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 83 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_uAw8EcyOvrG", + "outputId": "271793ff-abc0-4295-bac6-b5125e7a8528", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_teste.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 84 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A2LYI-9hOyXI", + "outputId": "3d9f27a2-ae55-4448-d59e-059fc80cb37c", + "colab": { + 
"base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_teste.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 85 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "npgoBSX2dd4l" + }, + "source": [ + "## Train the algorithm on the training data\n", + "### Load the algorithms/libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcvzrtolGfnQ", + "outputId": "d4435fc5-9916-4abb-d2c8-ad320c867eee", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 67 + } + }, + "source": [ + "!pip install graphviz\n", + "!pip install pydotplus" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (0.10.1)\n", + "Requirement already satisfied: pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)\n", + "Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from pydotplus) (2.4.7)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_pF-HH3JKL2" + }, + "source": [ + "from sklearn.metrics import accuracy_score # to measure the accuracy of the predictive model\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix # to plot the confusion matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV # to tune the hyperparameters of the predictive models\n", + "from sklearn.model_selection import cross_val_score # for CV (Cross-Validation)\n", + "from sklearn.model_selection import cross_validate\n", + "\n", + "from time import time\n", + "from operator import itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from 
sklearn.tree import export_graphviz\n", + "from io import StringIO # sklearn.externals.six is no longer available in recent scikit-learn releases\n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YJMS9ePQ6B6t" + }, + "source": [ + "**Attention**: min_samples_split = 2 is the DecisionTreeClassifier default and the least restrictive value. To reduce overfitting, consider larger values of min_samples_split (and of min_samples_leaf)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nNeRHYePJc-r" + }, + "source": [ + "from sklearn.tree import DecisionTreeClassifier # Library for Decision Tree (Classification)\n", + "\n", + "# Instantiates the Decision Tree with scikit-learn's default hyperparameters (baseline, no overfitting control yet):\n", + "ml_DT = DecisionTreeClassifier(criterion = 'gini', \n", + " splitter = 'best', \n", + " max_depth = None, \n", + " min_samples_split = 2, \n", + " min_samples_leaf = 1, \n", + " min_weight_fraction_leaf = 0.0, \n", + " max_features = None, \n", + " random_state = i_Seed, \n", + " max_leaf_nodes = None, \n", + " min_impurity_decrease = 0.0, \n", + " class_weight = None) # min_impurity_split and presort were dropped: both are deprecated and rejected by recent scikit-learn releases" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gVLZznprx2YX", + "outputId": "a4c2b726-2902-4d0f-a048-fb83a0beee72", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 118 + } + }, + "source": [ + "# Configured object/classifier\n", + "ml_DT" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=20111974, 
splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 90 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8CC24H-JHhlj" + }, + "source": [ + "### Train the algorithm: fit(X, y)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OgAHfXVo-Nw8", + "outputId": "06683422-3e33-4643-8283-8ba0cebd7980", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 118 + } + }, + "source": [ + "ml_DT.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 91 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CNiRjmrRHVnx" + }, + "source": [ + "### Validate the model on the test sample" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2GMCSs89HquJ", + "outputId": "82a67cd4-033e-484f-9209-7d434380b074", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "ml_DT.score(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.94" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 92 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bmv9YZobIer4" + }, + "source": [ + "### Compute the predictions with the model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2YufZaRNIkFL" + }, + "source": [ + "y_pred = ml_DT.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "fYvMN-tvIX-p" + }, + "source": [ + "### Confusion Matrix" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9iTK6pBwIZ3F", + "outputId": "20492dbd-48e7-46c2-d724-931ce2cd9e98", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jOnkFBcEIVAb" + }, + "source": [ + "### Go back to the start, draw a new sample and compute the accuracy\n", + "* Did you notice that the accuracy changed? This happens because we drew a new training sample.\n", + "* What are the drawbacks of getting a different metric for each sample used by the predictive model?\n", + "* How should you report the results of your model?\n", + "* How can you be confident about the most representative value of the metric?\n", + " * use Statistics in your favor! --> Use Cross-Validation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MkBSvyorGXQz" + }, + "source": [ + "___\n", + "# **CROSS-VALIDATION**\n", + "* K-fold is the best-known and most widely used Cross-Validation (CV) method;\n", + "* How it works: it splits the training dataframe into k parts (each part is a fold);\n", + " * It uses k-1 parts to train the model and the remaining part to validate it;\n", + " * The process is repeated k times, and each iteration computes the desired metrics (for example, accuracy);\n", + " * This way the model is trained and tested on every part of the data;\n", + " * After the k iterations we have k values of the metric, from which we compute the mean and the standard deviation.\n", + "\n", + " The figure below helps us understand how CV works:\n", + "\n", + "![Cross-Validation](https://github.com/MathMachado/Materials/blob/master/CV2.PNG?raw=true)\n", + "\n", + "Source: [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + "\n", + "* **value of k**:\n", + " * value of k (folds): between 5 and 10 --> There is no general rule for choosing k;\n", + " * The larger the value of k --> the lower the bias of the CV estimate --> Statistical experiment to show the effect.\n", + "\n", + "[Applied Predictive 
Modeling, 2013](https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/ref=as_li_ss_tl?ie=UTF8&qid=1520380699&sr=8-1&keywords=applied+predictive+modeling&linkCode=sl1&tag=inspiredalgor-20&linkId=1af1f3de89c11e4a7fd49de2b05e5ebf)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HscfN-a1V043" + }, + "source": [ + "* **Advantages of using CV**:\n", + " * More reliable estimate of the model's accuracy, since it is averaged over k folds instead of coming from a single split;\n", + " * Better use of the data, since all the data are used for both training and validation. Therefore, any problem with the data will be found at this stage.\n", + "\n", + "* **Further Reading**\n", + " * [Cross-Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)\n", + " * [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + " * [Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8x2UPwOYQPcI", + "outputId": "9fe6a0f2-c7e9-48b4-a2bc-4f4e7bd09672", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "# Cross-Validation with k = 10 folds (= 10 parts)\n", + "a_scores_CV = funcao_cross_val_score(ml_DT, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 91.43\n", + "std médio das Acurácias calculadas pelo CV: 3.44\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uxoplcea0byV", + "outputId": "19aa7a0d-59c6-432c-bf8b-ae7672d68f27", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "a_scores_CV # array with the score from each CV iteration" + ], + "execution_count": null, 
+ "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.9 , 0.98571429, 0.85714286, 0.92857143, 0.88571429,\n", + " 0.94285714, 0.92857143, 0.9 , 0.88571429, 0.92857143])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 96 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y3k-PcbN0o_i", + "outputId": "9f821901-32a6-4ce2-884c-4cbda1f1b774", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "a_scores_CV.mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9142857142857144" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 97 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6_rYker2gzeG" + }, + "source": [ + "**Interpretation**: Our classifier (DecisionTreeClassifier) has a mean accuracy of 91.43% (training set). In addition, the std is around 3.44%, that is, small. Let's try to improve the classifier's accuracy using hyperparameter tuning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tkwchmkP3p_A", + "outputId": "5a54bca4-47e0-4077-de1c-1b88fae4afdd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácias: [0.9 0.98571429 0.85714286 0.92857143 0.88571429 0.94285714\n", + " 0.92857143 0.9 0.88571429 0.92857143]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lQNyqHCiKRUh" + }, + "source": [ + "### Validate the model on the test sample" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Qb0ZPyvKKRUp", + "outputId": "fa17085f-fc2d-4323-b046-873b0d14c391", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "ml_DT.score(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.94" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 99 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iL2tEdbqKY5P" + }, + "source": [ + "### Predictions with the trained model\n", + "* Uses the classifier (Decision Trees) to make predictions on the test sample:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sI31WkZs2ht_" + }, + "source": [ + "y_pred = ml_DT.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rfapj3OG13PG", + "outputId": "330e27c8-c056-4161-c668-a26b27fee148", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "y_pred[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,\n", + " 1, 0, 0, 1, 1, 0, 1, 1])" + ] + }, + 
"metadata": { + "tags": [] + }, + "execution_count": 101 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sc88ofqh16RT", + "outputId": "b404e488-1f50-4157-dfba-7db67a45f328", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "y[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 102 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cecv-51TKgz-" + }, + "source": [ + "### Confusion Matrix" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fSaVzJ9xFpwW", + "outputId": "77c1a652-fcd6-400d-db0e-323a37024c55", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p8D975NqsGtj" + }, + "source": [ + "## Hyperparameter tuning\n", + "### Reference\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - Explains didactically, step by step, how to do hyperparameter tuning." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bfdq5zEhlVsk" + }, + "source": [ + "# Dictionary of hyperparameters for the tuning. In total 2 x 15 x 5 x 5 x 7 = 5,250 candidate models are fitted. With 10 folds in the Cross-Validation, that makes 52,500 fits.\n", + "d_hiperparametros_DT = {\"criterion\": [\"gini\", \"entropy\"], \n", + " \"min_samples_split\": [2, 5, 10, 30, 50, 70, 90, 120, 150, 180, 210, 240, 270, 350, 400], \n", + " \"max_depth\": [None, 2, 5, 9, 15], \n", + " \"min_samples_leaf\": [20, 40, 60, 80, 100], \n", + " \"max_leaf_nodes\": [None, 2, 3, 4, 5, 10, 15]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BtajXuuUpGwq", + "outputId": "0a395fb9-2b86-4baf-c06f-87562bdd55c7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 336 + } + }, + "source": [ + "d_hiperparametros_DT" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': ['gini', 'entropy'],\n", + " 'max_depth': [None, 2, 5, 9, 15],\n", + " 'max_leaf_nodes': [None, 2, 3, 4, 5, 10, 15],\n", + " 'min_samples_leaf': [20, 40, 60, 80, 100],\n", + " 'min_samples_split': [2,\n", + " 5,\n", + " 10,\n", + " 30,\n", + " 50,\n", + " 70,\n", + " 90,\n", + " 
120,\n", + " 150,\n", + " 180,\n", + " 210,\n", + " 240,\n", + " 270,\n", + " 350,\n", + " 400]}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 105 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H8gNSs0G0A-L" + }, + "source": [ + "```\n", + "grid_search = GridSearchCV(ml_DT, param_grid= d_hiperparametros_DT, cv = i_CV, n_jobs= -1)\n", + "start = time()\n", + "grid_search.fit(X_treinamento, y_treinamento)\n", + "tempo_elapsed= time()-start\n", + "print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos para estimar {len(grid_search.cv_results_)} modelos candidatos\")\n", + "\n", + "GridSearchCV levou 1999.12 segundos para estimar 23 modelos candidatos\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "44-BRnNjBT25", + "outputId": "40ce711e-9258-432d-c327-30d92900dc5c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "# Invoca a função com o modelo baseline\n", + "ml_DT2, best_params = GridSearchOptimizer(ml_DT, 'ml_DT2', d_hiperparametros_DT, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 5250 candidates, totalling 52500 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.0s\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 1.0s\n", + "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.1s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1908s.) Setting batch_size=2.\n", + "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 1.1s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0439s.) 
Setting batch_size=4.\n", + "[Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 1.2s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0734s.) Setting batch_size=8.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1218s.) Setting batch_size=16.\n", + "[Parallel(n_jobs=-1)]: Done 72 tasks | elapsed: 1.5s\n", + "[Parallel(n_jobs=-1)]: Done 216 tasks | elapsed: 2.2s\n", + "[Parallel(n_jobs=-1)]: Done 360 tasks | elapsed: 2.9s\n", + "[Parallel(n_jobs=-1)]: Done 536 tasks | elapsed: 3.6s\n", + "[Parallel(n_jobs=-1)]: Done 712 tasks | elapsed: 4.3s\n", + "[Parallel(n_jobs=-1)]: Done 920 tasks | elapsed: 5.1s\n", + "[Parallel(n_jobs=-1)]: Done 1128 tasks | elapsed: 6.0s\n", + "[Parallel(n_jobs=-1)]: Done 1368 tasks | elapsed: 6.9s\n", + "[Parallel(n_jobs=-1)]: Done 1608 tasks | elapsed: 7.9s\n", + "[Parallel(n_jobs=-1)]: Done 1880 tasks | elapsed: 9.0s\n", + "[Parallel(n_jobs=-1)]: Done 2152 tasks | elapsed: 10.0s\n", + "[Parallel(n_jobs=-1)]: Done 2456 tasks | elapsed: 11.2s\n", + "[Parallel(n_jobs=-1)]: Done 2760 tasks | elapsed: 12.4s\n", + "[Parallel(n_jobs=-1)]: Done 3096 tasks | elapsed: 13.8s\n", + "[Parallel(n_jobs=-1)]: Done 3432 tasks | elapsed: 15.3s\n", + "[Parallel(n_jobs=-1)]: Done 3800 tasks | elapsed: 16.7s\n", + "[Parallel(n_jobs=-1)]: Done 4168 tasks | elapsed: 18.4s\n", + "[Parallel(n_jobs=-1)]: Done 4568 tasks | elapsed: 20.3s\n", + "[Parallel(n_jobs=-1)]: Done 4968 tasks | elapsed: 22.4s\n", + "[Parallel(n_jobs=-1)]: Done 5400 tasks | elapsed: 24.1s\n", + "[Parallel(n_jobs=-1)]: Done 5832 tasks | elapsed: 25.7s\n", + "[Parallel(n_jobs=-1)]: Done 6296 tasks | elapsed: 27.5s\n", + "[Parallel(n_jobs=-1)]: Done 6760 tasks | elapsed: 29.2s\n", + "[Parallel(n_jobs=-1)]: Done 7256 tasks | elapsed: 31.2s\n", + "[Parallel(n_jobs=-1)]: Done 7752 tasks | elapsed: 33.1s\n", + "[Parallel(n_jobs=-1)]: Done 8280 tasks | elapsed: 35.2s\n", + "[Parallel(n_jobs=-1)]: Done 8808 tasks | elapsed: 37.2s\n", + "[Parallel(n_jobs=-1)]: Done 9368 tasks | 
elapsed: 39.4s\n", + "[Parallel(n_jobs=-1)]: Done 9928 tasks | elapsed: 41.5s\n", + "[Parallel(n_jobs=-1)]: Done 10520 tasks | elapsed: 43.7s\n", + "[Parallel(n_jobs=-1)]: Done 11112 tasks | elapsed: 46.2s\n", + "[Parallel(n_jobs=-1)]: Done 11736 tasks | elapsed: 48.6s\n", + "[Parallel(n_jobs=-1)]: Done 12360 tasks | elapsed: 51.1s\n", + "[Parallel(n_jobs=-1)]: Done 13016 tasks | elapsed: 53.8s\n", + "[Parallel(n_jobs=-1)]: Done 13672 tasks | elapsed: 56.8s\n", + "[Parallel(n_jobs=-1)]: Done 14360 tasks | elapsed: 59.6s\n", + "[Parallel(n_jobs=-1)]: Done 15048 tasks | elapsed: 1.0min\n", + "[Parallel(n_jobs=-1)]: Done 15768 tasks | elapsed: 1.1min\n", + "[Parallel(n_jobs=-1)]: Done 16488 tasks | elapsed: 1.1min\n", + "[Parallel(n_jobs=-1)]: Done 17240 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 17992 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 18776 tasks | elapsed: 1.3min\n", + "[Parallel(n_jobs=-1)]: Done 19560 tasks | elapsed: 1.3min\n", + "[Parallel(n_jobs=-1)]: Done 20376 tasks | elapsed: 1.4min\n", + "[Parallel(n_jobs=-1)]: Done 21192 tasks | elapsed: 1.5min\n", + "[Parallel(n_jobs=-1)]: Done 22040 tasks | elapsed: 1.5min\n", + "[Parallel(n_jobs=-1)]: Done 22888 tasks | elapsed: 1.6min\n", + "[Parallel(n_jobs=-1)]: Done 23768 tasks | elapsed: 1.6min\n", + "[Parallel(n_jobs=-1)]: Done 24648 tasks | elapsed: 1.7min\n", + "[Parallel(n_jobs=-1)]: Done 25560 tasks | elapsed: 1.8min\n", + "[Parallel(n_jobs=-1)]: Done 26472 tasks | elapsed: 1.8min\n", + "[Parallel(n_jobs=-1)]: Done 27416 tasks | elapsed: 1.9min\n", + "[Parallel(n_jobs=-1)]: Done 28360 tasks | elapsed: 2.0min\n", + "[Parallel(n_jobs=-1)]: Done 29336 tasks | elapsed: 2.1min\n", + "[Parallel(n_jobs=-1)]: Done 30312 tasks | elapsed: 2.2min\n", + "[Parallel(n_jobs=-1)]: Done 31320 tasks | elapsed: 2.2min\n", + "[Parallel(n_jobs=-1)]: Done 32328 tasks | elapsed: 2.3min\n", + "[Parallel(n_jobs=-1)]: Done 33368 tasks | elapsed: 2.4min\n", + "[Parallel(n_jobs=-1)]: Done 34408 
tasks | elapsed: 2.5min\n", + "[Parallel(n_jobs=-1)]: Done 35480 tasks | elapsed: 2.6min\n", + "[Parallel(n_jobs=-1)]: Done 36552 tasks | elapsed: 2.7min\n", + "[Parallel(n_jobs=-1)]: Done 37656 tasks | elapsed: 2.8min\n", + "[Parallel(n_jobs=-1)]: Done 38760 tasks | elapsed: 2.8min\n", + "[Parallel(n_jobs=-1)]: Done 39896 tasks | elapsed: 2.9min\n", + "[Parallel(n_jobs=-1)]: Done 41032 tasks | elapsed: 3.0min\n", + "[Parallel(n_jobs=-1)]: Done 42200 tasks | elapsed: 3.1min\n", + "[Parallel(n_jobs=-1)]: Done 43368 tasks | elapsed: 3.2min\n", + "[Parallel(n_jobs=-1)]: Done 44568 tasks | elapsed: 3.3min\n", + "[Parallel(n_jobs=-1)]: Done 45768 tasks | elapsed: 3.4min\n", + "[Parallel(n_jobs=-1)]: Done 47000 tasks | elapsed: 3.6min\n", + "[Parallel(n_jobs=-1)]: Done 48232 tasks | elapsed: 3.7min\n", + "[Parallel(n_jobs=-1)]: Done 49496 tasks | elapsed: 3.8min\n", + "[Parallel(n_jobs=-1)]: Done 50760 tasks | elapsed: 3.9min\n", + "[Parallel(n_jobs=-1)]: Done 52056 tasks | elapsed: 4.0min\n", + "[Parallel(n_jobs=-1)]: Done 52500 out of 52500 | elapsed: 4.0min finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 20, 'min_samples_split': 70}\n", + "\n", + "DecisionTreeClassifier *********************************************************************************************************\n", + "\n", + "********* CROSS-VALIDATION ***********\n", + "Média das Acurácias calculadas pelo CV....: 87.14\n", + "std médio das Acurácias calculadas pelo CV: 4.33\n", + "\n", + "********* IMPORTÂNCIA DAS COLUNAS ***********\n", + " coluna importancia\n", + "12 v13 0.735896\n", + "0 v1 0.135030\n", + "9 v10 0.090888\n", + "6 v7 0.025768\n", + "1 v2 0.012418\n", + "3 v4 0.000000\n", + "4 v5 0.000000\n", + "5 v6 0.000000\n", + "7 v8 0.000000\n", + "8 v9 0.000000\n", + "10 v11 0.000000\n", + "11 v12 0.000000\n", + "2 v3 0.000000\n", + "13 v14 
0.000000\n", + "14 v15 0.000000\n", + "15 v16 0.000000\n", + "16 v17 0.000000\n", + "17 v18 0.000000\n", + "\n", + "********* CONFUSION MATRIX - PARAMETER TUNNING ***********\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gmCkjGjPJMLr" + }, + "source": [ + "### Visualize the result" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cIc3ZgaISEd0", + "outputId": "281d1c7d-0104-4575-bb27-88e98e590ee2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 753 + } + }, + "source": [ + "from sklearn.tree import export_graphviz\n", + "from io import StringIO  # replaces sklearn.externals.six, which newer scikit-learn releases no longer ship\n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "dot_data = StringIO()\n", + "export_graphviz(ml_DT2, out_file = dot_data, filled = True, rounded = True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n", + "\n", + "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", + "graph.write_png('DecisionTree.png')\n", + "Image(graph.create_png())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 108 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e1R2GBkbnV37" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ukMLoEr7nbUf", + "outputId": "81c17f60-7890-4f2a-8946-56a443d3306c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 67 + } + }, + "source": [ + "X_treinamento_DT, X_teste_DT = seleciona_colunas_relevantes(ml_DT2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "********** COLUNAS Relevantes ******\n", + "[ 0 9 12]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8JjePRQAoqkk" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "Gt3aCPpfKRxm", + "outputId": "3e5e4663-dfae-46f6-d037-92dd028bc834", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 101 + } + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': 'entropy',\n", + " 'max_depth': None,\n", + " 'max_leaf_nodes': None,\n", + " 'min_samples_leaf': 20,\n", + " 'min_samples_split': 70}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 111 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq6uCVtzovMt", + "outputId": "31e88125-b928-4910-8536-19f922dc1ee1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 118 + } + }, + "source": [ + "# Train using the relevant COLUMNS...\n", + "ml_DT2.fit(X_treinamento_DT, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=70, min_samples_split=20,\n", + " min_weight_fraction_leaf=0.0, presort='deprecated',\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 112 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M2h3EpinRD5Q", + "outputId": "1345e3c8-460f-4cbc-df74-2d84140df2bb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_DT2, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 87.14\n", + "std médio das Acurácias calculadas pelo CV: 4.33\n" + ], + "name": "stdout" + } 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "znWy3LE1q-Z3", + "outputId": "7b7b71c8-89cd-4d91-ea70-81f8d57fc7b3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "ml_DT3, best_params2 = GridSearchOptimizer(ml_DT2, 'ml_DT2', d_hiperparametros_DT, X_treinamento_DT, y_treinamento, X_teste_DT, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 5250 candidates, totalling 52500 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0088s.) Setting batch_size=2.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0088s.) Setting batch_size=4.\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.0s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0256s.) Setting batch_size=8.\n", + "[Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 0.1s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0537s.) Setting batch_size=16.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0830s.) Setting batch_size=32.\n", + "[Parallel(n_jobs=-1)]: Done 84 tasks | elapsed: 0.2s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1519s.) 
Setting batch_size=64.\n", + "[Parallel(n_jobs=-1)]: Done 308 tasks | elapsed: 0.8s\n", + "[Parallel(n_jobs=-1)]: Done 756 tasks | elapsed: 1.6s\n", + "[Parallel(n_jobs=-1)]: Done 1332 tasks | elapsed: 2.6s\n", + "[Parallel(n_jobs=-1)]: Done 1908 tasks | elapsed: 3.6s\n", + "[Parallel(n_jobs=-1)]: Done 2612 tasks | elapsed: 4.9s\n", + "[Parallel(n_jobs=-1)]: Done 3316 tasks | elapsed: 6.1s\n", + "[Parallel(n_jobs=-1)]: Done 4148 tasks | elapsed: 7.7s\n", + "[Parallel(n_jobs=-1)]: Done 4980 tasks | elapsed: 9.1s\n", + "[Parallel(n_jobs=-1)]: Done 5940 tasks | elapsed: 10.7s\n", + "[Parallel(n_jobs=-1)]: Done 6900 tasks | elapsed: 12.4s\n", + "[Parallel(n_jobs=-1)]: Done 7988 tasks | elapsed: 14.3s\n", + "[Parallel(n_jobs=-1)]: Done 9076 tasks | elapsed: 16.2s\n", + "[Parallel(n_jobs=-1)]: Done 10292 tasks | elapsed: 18.2s\n", + "[Parallel(n_jobs=-1)]: Done 11508 tasks | elapsed: 20.6s\n", + "[Parallel(n_jobs=-1)]: Done 12852 tasks | elapsed: 23.3s\n", + "[Parallel(n_jobs=-1)]: Done 14196 tasks | elapsed: 25.6s\n", + "[Parallel(n_jobs=-1)]: Done 15668 tasks | elapsed: 28.3s\n", + "[Parallel(n_jobs=-1)]: Done 17140 tasks | elapsed: 30.9s\n", + "[Parallel(n_jobs=-1)]: Done 18740 tasks | elapsed: 33.6s\n", + "[Parallel(n_jobs=-1)]: Done 20340 tasks | elapsed: 36.6s\n", + "[Parallel(n_jobs=-1)]: Done 22068 tasks | elapsed: 39.7s\n", + "[Parallel(n_jobs=-1)]: Done 23796 tasks | elapsed: 42.8s\n", + "[Parallel(n_jobs=-1)]: Done 25652 tasks | elapsed: 46.1s\n", + "[Parallel(n_jobs=-1)]: Done 27508 tasks | elapsed: 49.5s\n", + "[Parallel(n_jobs=-1)]: Done 29492 tasks | elapsed: 53.3s\n", + "[Parallel(n_jobs=-1)]: Done 31476 tasks | elapsed: 57.6s\n", + "[Parallel(n_jobs=-1)]: Done 33588 tasks | elapsed: 1.0min\n", + "[Parallel(n_jobs=-1)]: Done 35700 tasks | elapsed: 1.1min\n", + "[Parallel(n_jobs=-1)]: Done 37940 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 40180 tasks | elapsed: 1.2min\n", + "[Parallel(n_jobs=-1)]: Done 42548 tasks | elapsed: 1.3min\n", + 
"[Parallel(n_jobs=-1)]: Done 44916 tasks | elapsed: 1.4min\n", + "[Parallel(n_jobs=-1)]: Done 47412 tasks | elapsed: 1.5min\n", + "[Parallel(n_jobs=-1)]: Done 49908 tasks | elapsed: 1.6min\n", + "[Parallel(n_jobs=-1)]: Done 52406 tasks | elapsed: 1.6min\n", + "[Parallel(n_jobs=-1)]: Done 52500 out of 52500 | elapsed: 1.6min finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 60, 'min_samples_split': 2}\n", + "\n", + "DecisionTreeClassifier *********************************************************************************************************\n", + "\n", + "********* CROSS-VALIDATION ***********\n", + "Média das Acurácias calculadas pelo CV....: 89.29\n", + "std médio das Acurácias calculadas pelo CV: 2.73\n", + "\n", + "********* IMPORTÂNCIA DAS COLUNAS ***********\n", + " coluna importancia\n", + "2 v3 0.691283\n", + "0 v1 0.177569\n", + "1 v2 0.131148\n", + "\n", + "********* CONFUSION MATRIX - PARAMETER TUNNING ***********\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IhCC6pfq-jL", + "outputId": "c971d5d4-56b4-4945-ce9e-d5175b721e6a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 101 + } + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': 'entropy',\n", + " 'max_depth': None,\n", + " 'max_leaf_nodes': None,\n", + " 'min_samples_leaf': 20,\n", + " 'min_samples_split': 70}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 115 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qw6Dk3kesT0q", + "outputId": "2c5d1f1e-ef01-4734-debf-2a8aa1598616", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 101 + } + }, + "source": [ + "best_params2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': 'entropy',\n", + " 'max_depth': None,\n", + " 'max_leaf_nodes': None,\n", + " 'min_samples_leaf': 60,\n", + " 'min_samples_split': 2}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 116 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YFoK1ZGrRHf3", + "outputId": "78b2d27c-0831-419b-cdd0-f7d43bf5c72b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_DT3, X_treinamento_DT, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 89.29\n", + "std médio das Acurácias calculadas pelo CV: 2.73\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MZ1-vGRcxJoN" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "ig9GiUAEw9jr" + }, + "source": [ + "y_pred_DT = ml_DT2.predict(X_teste_DT)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7UZz4UzHDqae", + "outputId": "e2e56b21-9412-4c85-f713-c9ebcd1eb8c9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Compute the accuracy\n", + "accuracy_score(y_teste, y_pred_DT)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9333333333333333" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 119 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3EUMAxxKBur" + }, + "source": [ + "___\n", + "# **RANDOM FOREST**\n", + "* Decision Trees have a tree-shaped structure; a Random Forest combines many of them.\n", + "* Random Forest can be used for both classification (RandomForestClassifier) and regression (RandomForestRegressor);\n", + "* The tree nodes are created from the variables (columns) of the dataframe;\n", + "\n", + "* **Advantages**:\n", + " * Does not require much data preprocessing;\n", + " * Handles both categorical and numerical COLUMNS well;\n", + " * Gives good results on many kinds of problems;\n", + " * Each tree is trained independently on a random sample of the rows and columns, so the errors of individual trees tend to cancel out when their votes are combined;\n", + " * An ensemble is a combination of different predictive models;\n", + " * Ensembles make the algorithms/results more robust and more complex, which raises the computational cost but usually comes with better results.\n", + " * More robust than a single Decision Tree. **Why?**\n", + " * Reduces overfitting automatically (**why?**) and often produces very robust, high-performance models.\n", + " * Can be used for **Feature Selection**, since it produces an importance score for each attribute (feature importances). 
The importances sum to 1 (that is, 100%);\n", + " * Like Decision Trees, these models easily capture non-linear patterns in the data;\n", + " * Does not require normalized data;\n", + " * Handles Missing Values well;\n", + " * Makes no assumptions about the distribution of the data, because the algorithm is non-parametric;\n", + "\n", + "* **Disadvantages**\n", + " * **Balancing the dataframe beforehand is recommended**.\n", + "\n", + "* **Main Hyperparameters**\n", + "\n", + "## **References**:\n", + "* [Running Random Forests? Inspect the feature importances with this code](https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e)\n", + "* [Feature importances with forests of trees](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)\n", + "* [Understanding Random Forests Classifiers in Python](https://www.datacamp.com/community/tutorials/random-forests-classifier-python)\n", + "* [Understanding Random Forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)\n", + "* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76)\n", + "* [Random Forest Simple Explanation](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)\n", + "* [Random Forest Explained](https://www.youtube.com/watch?v=eM4uJ6XGnSM)\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) - Explains the main Random Forest hyperparameters." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CMQt5wiw1tt8" + }, + "source": [ + "### How does it work?\n", + "\n", + "The algorithm has 4 steps:\n", + "1. Random selection of some features;\n", + "2. 
Selection of the most suitable feature for the root node;\n", + "3. Generation of the child nodes;\n", + "4. Repetition of the steps above until the desired number of trees is reached.\n", + "\n", + "**Note**: Once the model is built, predictions are made by “voting” among the trees. The class with the most votes is the algorithm's answer." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VLGqtjs42zkN" + }, + "source": [ + "![DecisionTree](https://github.com/MathMachado/Materials/blob/master/DecisionTree.PNG?raw=true)\n", + "\n", + "Source: [Um tutorial completo sobre modelagem baseada em árvores de decisão (códigos R e Python)](https://www.vooo.pro/insights/um-tutorial-completo-sobre-a-modelagem-baseada-em-tree-arvore-do-zero-em-r-python/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HaUjVgEd2rzU" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r0s2vixBzFAR" + }, + "source": [ + "### Main hyperparameters\n", + "* `n_estimators`, `max_depth`, `max_features`, `min_samples_leaf`, `min_samples_split` and `bootstrap` (the ones listed in the parameter tuning dictionary of this notebook)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cnfDw_GEKBuu", + "outputId": "09ec89b8-4cd3-4773-9b3f-bae1c49d51a6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 151 + } + }, + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate...\n", + "# Note: for classifiers max_features = \"auto\" means \"sqrt\"; recent scikit-learn releases no longer accept \"auto\", so use \"sqrt\" there.\n", + "ml_RF = RandomForestClassifier(n_estimators = 100, min_samples_split = 2, max_features = \"auto\", random_state = i_Seed)\n", + "\n", + "# Train...\n", + "ml_RF.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", + " criterion='gini', max_depth=None, max_features='auto',\n", + " max_leaf_nodes=None, max_samples=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " 
min_weight_fraction_leaf=0.0, n_estimators=100,\n", + " n_jobs=None, oob_score=False, random_state=20111974,\n", + " verbose=0, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 120 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E25BIxM0RTzs", + "outputId": "7827d97f-c63b-4b02-af65-1dcac773b83d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_RF, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 96.28999999999999\n", + "std médio das Acurácias calculadas pelo CV: 2.94\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AouWUu8vANdb" + }, + "source": [ + "**Interpretation**: Our classifier (RandomForestClassifier) has a mean cross-validated accuracy of 96.29% on the training set. The standard deviation is about 2.94%, which is small. Let's try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vbducxlgAa85", + "outputId": "b3393d84-e6a6-47f7-93a8-61a3cc8bdb8f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 50 + } + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácias: [0.9 0.98571429 0.98571429 0.95714286 0.92857143 1.\n", + " 0.97142857 0.98571429 0.94285714 0.97142857]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lxx-LUw_5sd" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_RF.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQIRO_LpGAkw", + "outputId": "40c83739-5055-4d1a-8e8a-5abce6de6cc8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yKLHZ5_C6FJ8" + }, + "source": [ + "## Parameter tuning\n", + "### References\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - A didactic, step-by-step explanation of how to do parameter tuning.\n", + "* [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) - Another approach to understanding parameter tuning. Highly recommended reading!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XOa9naju6FKA" + }, + "source": [ + "# Dictionary of hyperparameters for parameter tuning.\n", + "# Only 'bootstrap' is searched here; the other entries are commented out (the full grid took 194 minutes to run, see below).\n", + "d_hiperparametros_RF= {'bootstrap': [True, False]} #,\n", + "# 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],\n", + "# 'max_features': ['auto', 'sqrt'],\n", + "# 'min_samples_leaf': [1, 2, 4],\n", + "# 'min_samples_split': [2, 5, 10],\n", + "# 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6__f2jZaTQat", + "outputId": "1cb740d8-20c4-4255-d2f1-5303efb546ee", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 498 + } + }, + "source": [ + "# Call the function\n", + "# Note: with a grid that lacks 'max_depth', this call fails with KeyError: 'max_depth' (see the traceback in the output).\n", + "ml_RF2, best_params = GridSearchOptimizer(ml_RF, 'ml_RF2', d_hiperparametros_RF, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for 
each of 2 candidates, totalling 20 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.3s\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.6s\n", + "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.4s\n", + "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 2.0s\n", + "[Parallel(n_jobs=-1)]: Done 20 out of 20 | elapsed: 2.9s remaining: 0.0s\n", + "[Parallel(n_jobs=-1)]: Done 20 out of 20 | elapsed: 2.9s finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "Parametros otimizados: {'bootstrap': False}\n", + "\n", + "RandomForestClassifier *********************************************************************************************************\n" + ], + "name": "stdout" + }, + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Invoca a função\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mml_RF2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbest_params\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mGridSearchOptimizer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mml_RF\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'ml_RF2'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0md_parametros_RF\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_treinamento\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_treinamento\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_teste\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_teste\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcv\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mi_CV\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mGridSearchOptimizer\u001b[0;34m(modelo, ml_Opt, d_Parametros, X_treinamento, y_treinamento, X_teste, y_teste, cv)\u001b[0m\n\u001b[1;32m 22\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'\\nRandomForestClassifier *********************************************************************************************************'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 23\u001b[0m ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'], \n\u001b[0;32m---> 24\u001b[0;31m \u001b[0mmax_depth\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'max_depth'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 25\u001b[0m \u001b[0mmax_features\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'max_features'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 26\u001b[0m \u001b[0mmin_samples_leaf\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mml_GridSearchCV\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbest_params_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'min_samples_leaf'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'max_depth'" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "crfn-n--KG4n" + }, + "source": [ + "### Resultado da execução do Random Forest\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 7920 out of 7920 | elapsed: 194.0min finished\n", + "best_params= {'bootstrap': False, 
'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SGTOe5PaRw59" + }, + "source": [ + "# The grid search above took 194 minutes to run, so ml_RF2 is built below directly from the parameters it found\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)\n", + "\n", + "# Assumption: seleciona_colunas_relevantes needs a fitted model (it reads the feature importances), so fit on all columns first\n", + "ml_RF2.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HMJcAdLlTQa0" + }, + "source": [ + "## Visualize the result\n", + "> To do: implement the RandomForest visualization." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WWNiy7Z0TQa3" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kOi11YOKTQa4" + }, + "source": [ + "X_treinamento_RF, X_teste_RF = seleciona_colunas_relevantes(ml_RF2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_O7c_DTQbE" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UwEOwzSGTQbF" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rr8qDrgvTQbL" + }, + "source": [ + "# Train with the relevant COLUMNS...\n", + "ml_RF2.fit(X_treinamento_RF, y_treinamento)\n", + "\n", + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_RF2, X_treinamento_RF, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "flyxvhIA1B8l" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-mYfQLlsTQbQ" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sSD5o1JQTQbR" + }, + "source": [ + "y_pred_RF = ml_RF2.predict(X_teste_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wywF6LymDzKr" + }, + "source": [ + "# Compute the accuracy\n", + "accuracy_score(y_teste, y_pred_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hJJsL0IJb6iO" + }, + "source": [ + "## Study of how the algorithm's hyperparameters behave\n", + "> See [Optimizing Hyperparameters in Random Forest 
Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) for more details." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "navUWMwHi44D" + }, + "source": [ + "from sklearn.model_selection import validation_curve\n", + "\n", + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name=\"n_estimators\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation of the training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label = \"Training score\", color = \"black\")\n", + "plt.plot(param_range, test_mean, label = \"Cross-validation score\", color = \"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color = \"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color = \"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Number Of Trees\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc = \"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rv7TIM9kjsud" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + 
"\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name = \"max_depth\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "# Calculate mean and standard deviation of the training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Maximum Depth\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lm_fPGYwkJYc" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name='min_samples_leaf', \n", + " param_range=param_range,\n", 
+ " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accurancy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Number Of Trees\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CAqdiSaVlAB8" + }, + "source": [ + "param_range = np.arange(0.05, 1, 0.05)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name='min_samples_split', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation for training set a_scores_CV\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# 
Calculate mean and standard deviation for test set a_scores_CV\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy a_scores_CV for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Min Samples Split\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cX_gfsbQSdNd" + }, + "source": [ + "___\n", + "# **BOOSTING MODELS**\n", + "* These algorithms are widely used in Kaggle competitions;\n", + "* They are used to improve the performance of Machine Learning algorithms;\n", + "* Models:\n", + " - [X] AdaBoost\n", + " - [X] XGBoost\n", + " - [X] LightGBM\n", + " - [X] GradientBoosting\n", + " - [X] CatBoost\n", + "\n", + "## Bagging vs Boosting vs Stacking\n", + "### **Bagging**\n", + "* The goal is to reduce variance;\n", + "\n", + "#### How it works\n", + "* It draws several samples **WITH REPLACEMENT** from the training dataframe. Each sample is used to train a model using Decision Trees. As a result, we get an ensemble of many different models (Decision Trees). 
The average of these many different models (Decision Trees), or their majority vote in classification, is used to produce the final result;\n", + "* The final result is more robust than the one obtained from a single Decision Tree.\n", + "\n", + "![Bagging](https://github.com/MathMachado/Materials/blob/master/Bagging.png?raw=true)\n", + "\n", + "Source: [Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm](https://hackernoon.com/how-to-develop-a-robust-algorithm-c38e08f32201).\n", + "\n", + "#### Steps\n", + "* Assume a dataframe X_treinamento (training dataframe) containing N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Bagging randomly draws a sample **WITH REPLACEMENT** from X_treinamento;\n", + " 2. Bagging randomly selects M2 (M2 < M) COLUMNS from the dataframe drawn in step (1);\n", + " 3. It builds a Decision Tree with the M2 COLUMNS from step (2) and the dataframe obtained in step (1), and the COLUMNS are evaluated by their ability to classify the observations;\n", + " 4. Steps (1) --> (2) --> (3) are repeated K times (that is, K Decision Trees), so that the COLUMNS are ranked by their predictive power and the final result (accuracy, for example) is obtained by aggregating the predictions of the K Decision Trees.\n", + "\n", + "#### Advantages\n", + "* Reduces overfitting;\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Can handle Missing Values automatically (depending on the implementation);\n", + "\n", + "#### Disadvantage\n", + "* The final prediction is based on the average of the K Decision Trees, which may compromise the final accuracy.\n", + "\n", + "___ \n", + "### **Boosting**\n", + "* The goal is to improve accuracy;\n", + "\n", + "#### How it works\n", + "* The classifiers are used sequentially, so that the classifier at step N learns from the errors of the classifier at step N-1. 
In other words, the goal is to improve accuracy at each step by learning from the past.\n", + "\n", + "![Boosting](https://github.com/MathMachado/Materials/blob/master/Boosting.png?raw=true)\n", + "\n", + "Source: [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205), Joseph Rocca.\n", + "\n", + "#### Steps\n", + "* Assume a dataframe X_treinamento (training dataframe) containing N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Boosting randomly draws a sample D1 WITHOUT replacement from X_treinamento;\n", + " 2. Boosting trains the classifier C1 on D1;\n", + " 3. Boosting randomly draws a SECOND sample D2 WITHOUT replacement from X_treinamento and adds to D2 50% of the observations that were previously misclassified, then trains the classifier C2;\n", + " 4. Boosting finds in X_treinamento the sample D3 on which the classifiers C1 and C2 disagree and trains C3 on it;\n", + " 5. It combines (by voting) the predictions of C1, C2 and C3 to produce the final result.\n", + "\n", + "#### Advantages\n", + "* Handles dataframes with many COLUMNS well (high dimensionality);\n", + "* Can handle Missing Values automatically (depending on the implementation);\n", + "\n", + "#### Disadvantages\n", + "* Prone to overfitting. Treating outliers beforehand is recommended.\n", + "* Requires careful tuning of the hyperparameters;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9fgUrkmPk4dr" + }, + "source": [ + "___\n", + "# STACKING\n", + "\n", + "![Stacking](https://github.com/MathMachado/Materials/blob/master/Stacking.png?raw=true)\n", + "\n", + "TODO: add the reference for this figure." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B0jxx3ETpOdm" + }, + "source": [ + "___\n", + "# **BOOTSTRAPPING METHODS**\n", + "> To understand Bagging (and the resampling used by some Boosting variants), we first need to understand Bootstrap, since Bagging is built directly on it.\n", + "\n", + "* In Statistics (and in Machine Learning), Bootstrap refers to drawing random samples WITH replacement from the observed data X." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyqazmUuifkE" + }, + "source": [ + "___\n", + "# **ADABOOST (Adaptive Boosting)**\n", + "* When nothing else works, AdaBoost often does!\n", + "* It was one of the first Boosting algorithms (1995);\n", + "* AdaBoost can be used for both classification (AdaBoostClassifier) and regression (AdaBoostRegressor);\n", + "* By default, AdaBoost uses DecisionTree algorithms as base_estimator;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RU-vzkXqrFVw" + }, + "source": [ + "## References\n", + "* [AdaBoost Classifier Example In Python](https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464) - Didactic, explains exactly how AdaBoost works.\n", + "* [Adaboost for Dummies: Breaking Down the Math (and its Equations) into Simple Terms](https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf) - For those who want to understand the math behind the algorithm.\n", + "* [Gradient Boosting and XGBoost](https://medium.com/hackernoon/gradient-boosting-and-xgboost-90862daa6c77)\n", + "* [Understanding AdaBoost](https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe), Akash Desarda." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6EMrjQDZIMl_" + }, + "source": [ + "## What 
is AdaBoost (Adaptive Boosting)?\n", + "* It is an ensemble classifier (it combines several classifiers to increase accuracy).\n", + "* AdaBoost is an iterative, strong classifier that combines (ensemble) several weak classifiers to improve accuracy.\n", + "* Any machine learning algorithm that supports sample weights can be used as the base classifier (base_estimator parameter);\n", + "\n", + "## Most important AdaBoost hyperparameters:\n", + "* base_estimator - The classifier used to train the model. By default, AdaBoost uses DecisionTreeClassifier. As mentioned above, different algorithms can be used for this purpose.\n", + "* n_estimators - Number of base_estimator instances to train iteratively.\n", + "* learning_rate - Controls the contribution of each base_estimator to the final solution/combination;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TzLtHzWNJBix" + }, + "source": [ + "## Using different algorithms as base_estimator\n", + "> As mentioned above, several types of base_estimator can be used in AdaBoost. 
For example, if we want to use SVM (Support Vector Machines), we proceed as follows:\n", + "\n", + "\n", + "```\n", + "# Import the base_estimator and AdaBoost classes\n", + "from sklearn.svm import SVC\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "\n", + "# Instantiate the base classifier (algorithm)\n", + "ml_SVC= SVC(probability=True, kernel='linear')\n", + "\n", + "# Build the AdaBoost model\n", + "ml_AB = AdaBoostClassifier(n_estimators= 50, base_estimator=ml_SVC, learning_rate=1)\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hrj4a4s6hMMB" + }, + "source": [ + "## Advantages\n", + "* AdaBoost is easy to implement;\n", + "* AdaBoost iteratively corrects the errors of the base_estimator and improves accuracy;\n", + "* Performs Feature Selection automatically (**Why**?);\n", + "* Many algorithms can be used as base_estimator;\n", + "* As an ensemble method, the final model tends to be less prone to overfitting than a single classifier, provided the data is not too noisy.\n", + "\n", + "## Disadvantages\n", + "* AdaBoost is sensitive to noise in the data;\n", + "* Highly affected by outliers (which contributes to overfitting), because the algorithm tries to fit every point as well as possible;\n", + "* AdaBoost is usually slower than XGBoost;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bgJmu7YLiyv7" + }, + "source": [ + "In the following example, I will use RandomForestClassifier with the optimized hyperparameters as the base_estimator, that is:\n", + "\n", + "```\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5VCRNyZT3qvc" + }, + "source": [ + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1gIboJdriq61" + }, + "source": [ + "from sklearn.ensemble import 
AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate RandomForestClassifier - optimized hyperparameters!\n", + "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)\n", + "# Instantiate AdaBoostClassifier\n", + "ml_AB= AdaBoostClassifier(n_estimators=100, base_estimator= ml_RF2, random_state= i_Seed)\n", + "\n", + "# Train...\n", + "ml_AB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tBOuTywWRm91" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_AB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7Ce5L38ECoC" + }, + "source": [ + "**Interpretation**: Our classifier (AdaBoostClassifier) has a mean accuracy of 96.72% (training set). In addition, the std is around 2.54%, which is small. Let's try to improve the accuracy of the classifier using parameter tuning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t5GfnBwEifkO" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q9rSpuXyEPA5" + }, + "source": [ + "# Make predictions with the optimized hyperparameters...\n", + "y_pred = ml_AB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2F9k-_eXGDLa" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XweWTjQ9EXLw" + }, + "source": [ + "## Parameter tuning" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fcrKzse9EbL_" + }, + "source": [ + "# Hyperparameter dictionary for parameter tuning.\n", + "d_hiperparametros_AB = {'n_estimators':[50, 100, 200], 'learning_rate':[.001, 0.01, 0.05, 0.1, 0.3,1]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Susc3I7mFDQX" + }, + "source": [ + "# Call the function\n", + "ml_AB2, best_params= GridSearchOptimizer(ml_AB, 'ml_AB2', d_hiperparametros_AB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w4JjWsusjNS8" + }, + "source": [ + "___\n", + "# **GRADIENT BOOSTING**\n", + "* Gradient boosting can be used to solve classification (GradientBoostingClassifier) and regression (GradientBoostingRegressor) problems;\n", + "* Gradient boosting is a refinement of AdaBoost (recall that AdaBoost was one of the first Boosting methods, 
created in 1995). What Gradient Boosting does on top of AdaBoost is to minimize an explicit loss (loss function), i.e., to minimize the difference between the observed values of y and the predicted values.\n", + "* It uses Gradient Descent to find the shortcomings in the predictions of the previous step. Gradient Descent is a popular and powerful algorithm that is also used in Neural Networks;\n", + "* The goal of Gradient Boosting is to minimize the \"loss function\". Therefore, Gradient Boosting depends on the chosen \"loss function\".\n", + "* Gradient boosting uses Decision Trees as base learners;\n", + "\n", + "## Advantages\n", + "* Little pre-processing is needed (tree-based models do not require feature scaling);\n", + "* Works with numerical or categorical COLUMNS (in scikit-learn, categorical COLUMNS must be encoded first);\n", + "* Some implementations (e.g. XGBoost, LightGBM) handle Missing Values automatically, so Missing Value Imputation may not be needed;\n", + "\n", + "## Disadvantages\n", + "* Because Gradient Boosting keeps trying to minimize the errors at each iteration, it can overemphasize outliers and cause overfitting. Therefore, one should:\n", + " * Treat the outliers beforehand OR\n", + " * Use Cross-Validation to detect and limit the effects of the outliers (**I prefer this method, since it takes less time**);\n", + "* Computationally expensive. 
Many trees (> 1000) are usually needed to obtain good results;\n", + "* Because of its flexibility (many hyperparameters to tune), a search such as GridSearchCV is needed to find a good combination of hyperparameters;\n", + "\n", + "## References\n", + "* [Gradient Boosting Decision Tree Algorithm Explained](https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4) - Didactic and detailed.\n", + "* [Predicting Wine Quality with Gradient Boosting Machines](https://towardsdatascience.com/predicting-wine-quality-with-gradient-boosting-machines-a-gmb-tutorial-d950b1542065)\n", + "* [Parameter Tuning in Gradient Boosting (GBM) with Python](https://www.datacareer.de/blog/parameter-tuning-in-gradient-boosting-gbm/)\n", + "* [Tune Learning Rate for Gradient Boosting with XGBoost in Python](https://machinelearningmastery.com/tune-learning-rate-for-gradient-boosting-with-xgboost-in-python/)\n", + "* [In Depth: Parameter tuning for Gradient Boosting](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae) - Very good\n", + "* [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q4bUCZs2jNTA" + }, + "source": [ + "from sklearn.ensemble import GradientBoostingClassifier\n", + "\n", + "# Instantiate...\n", + "ml_GB = GradientBoostingClassifier(n_estimators = 100, min_samples_split = 2)\n", + "\n", + "# Train... 
\n", + "ml_GB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PKOG1ugSRvLM" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_GB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VlC3y3M5YaGG" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vnLvQ0ZDYNjB" + }, + "source": [ + "**Interpretation**: Our classifier (GradientBoostingClassifier) has a mean accuracy of 96.86% (training set). In addition, the std is around 2.52%, which is small. Let's try to improve the accuracy of the classifier using parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2n1RKZuXq3D" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_GB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8r6JCzQRGFa0" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KFv-Q2AD5uCk" + }, + "source": [ + "## Parameter tuning\n", + "> See [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/) for details on the hyperparameters, their meaning, and so on." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wgU040AcjNTF" + }, + "source": [ + "# Hyperparameter dictionary for parameter tuning.\n", + "d_hiperparametros_GB= {'learning_rate': [1, 0.5, 0.25, 0.1, 0.05, 0.01]} #,\n", + "# 'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200],\n", + "# 'max_depth': [5, 10, 15, 20, 25, 30],\n", + "# 'min_samples_split': [0.1, 0.3, 0.5, 0.7, 0.9],\n", + "# 'min_samples_leaf': [0.1, 0.2, 0.3, 0.4, 0.5],\n", + "# 'max_features': list(range(1, X_treinamento.shape[1]))}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v5KLFlpTjNTH" + }, + "source": [ + "# Call the function\n", + "ml_GB2, best_params= GridSearchOptimizer(ml_GB, 'ml_GB2', d_hiperparametros_GB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ6ERz3fi9i2" + }, + "source": [ + "### Result of running Gradient Boosting" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSa7uKw13mKG" + }, + "source": [ + "```\n", + "[Parallel(n_jobs=-1)]: Done 275400 out of 275400 | elapsed: 93.7min finished\n", + "\n", + "Hiperparâmetros otimizados: {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wiJpA2PyjDjR" + }, + "source": [ + "# Since the procedure above took 93 minutes to run, I will build ml_GB2 below using the hyperparameters estimated above\n", + "best_params= {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "\n", + "#ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + "# max_depth= best_params['max_depth'],\n", + "# max_features= best_params['max_features'],\n", + "# 
min_samples_leaf= best_params['min_samples_leaf'],\n", + "# min_samples_split= best_params['min_samples_split'],\n", + "# n_estimators= best_params['n_estimators'],\n", + "# random_state= i_Seed)\n", + "\n", + "ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + " max_depth= best_params['max_depth'],\n", + " min_samples_leaf= best_params['min_samples_leaf'],\n", + " min_samples_split= best_params['min_samples_split'],\n", + " n_estimators= best_params['n_estimators'],\n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mb14gJ7-jbVM" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TAqGZIFYm2sU" + }, + "source": [ + "X_treinamento_GB, X_teste_GB = seleciona_colunas_relevantes(ml_GB2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6yiu6dahnBvC" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "APrtWN18nc4t" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS0mLdOmnXAY" + }, + "source": [ + "# Train with the relevant COLUMNS\n", + "ml_GB2.fit(X_treinamento_GB, y_treinamento)\n", + "\n", + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_GB2, X_treinamento_GB, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vmc9PP_Rn1TN" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e3mnIALvnzP2" + }, + "source": [ + "y_pred_GB = ml_GB2.predict(X_teste_GB)\n", + "\n", + "# Compute accuracy\n", + "accuracy_score(y_teste, 
y_pred_GB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwP9Z2GnkV7r" + }, + "source": [ + "___\n", + "# **XGBOOST (eXtreme Gradient Boosting)**\n", + "* XGBoost is an improvement on Gradient Boosting. The improvements are in speed and performance, and it also addresses inefficiencies of GradientBoosting.\n", + "* A favorite algorithm among Kaggle Grandmasters;\n", + "* Parallelizable;\n", + "* Widely regarded as state of the art for tabular data;\n", + "\n", + "## Relevant hyperparameters and their initial values\n", + "See [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) for full details on the hyperparameters, their meaning, and so on.\n", + "\n", + "* n_estimators = 100 (100 if the dataframe is large. If the dataframe is medium/small, then 1000) - The number of trees we want to build;\n", + "* max_depth= 3 - Determines how deep each tree can grow during any training round. Typical values are in the range [3, 10];\n", + "* learning_rate= 0.01 - Shrinks the contribution of each tree, which helps to avoid overfitting; range: [0, 1];\n", + "* alpha (reg_alpha) - L1 regularization on the weights. Higher values result in more regularization;\n", + "* lambda (reg_lambda) - L2 regularization on the weights.\n", + "* colsample_bytree: 1 - fraction of COLUMNS used by each tree. A high value may cause overfitting;\n", + "* subsample: 0.8 - fraction of samples used by each tree. Lower values make the model more conservative, but a value that is too low may lead to underfitting;\n", + "* gamma: 1 - Controls whether a given node will be split based on the expected reduction in loss after the split. A higher value leads to fewer splits.\n", + "* objective: Defines the \"loss function\". 
Options include:\n", + " * reg:linear - For regression problems (renamed reg:squarederror in newer XGBoost versions);\n", + " * reg:logistic - Logistic regression;\n", + " * binary:logistic - For binary classification problems, with probabilities as output;\n", + "\n", + "## References\n", + "* [How exactly XGBoost Works?](https://medium.com/@pushkarmandot/how-exactly-xgboost-works-a320d9b8aeef)\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Gentle Introduction of XGBoost Library](https://medium.com/@imoisharma/gentle-introduction-of-xgboost-library-2b1ac2669680)\n", + "* [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7)\n", + "* [Exploring XGBoost](https://towardsdatascience.com/exploring-xgboost-4baf9ace0cf6)\n", + "* [Feature Importance and Feature Selection With XGBoost in Python](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/)\n", + "* [Ensemble Learning case study: Running XGBoost on Google Colab free GPU](https://towardsdatascience.com/running-xgboost-on-google-colab-free-gpu-a-case-study-841c90fef101) - Recommended\n", + "* [Predicting movie revenue with AdaBoost, XGBoost and LightGBM](https://towardsdatascience.com/predicting-movie-revenue-with-adaboost-xgboost-and-lightgbm-262eadee6daa)\n", + "* [Tuning XGBoost Hyperparameters with Scikit Optimize](https://towardsdatascience.com/how-to-improve-the-performance-of-xgboost-models-1af3995df8ad)\n", + "* [An Example of Hyperparameter Optimization on XGBoost, LightGBM and CatBoost using Hyperopt](https://towardsdatascience.com/an-example-of-hyperparameter-optimization-on-xgboost-lightgbm-and-catboost-using-hyperopt-12bc41a271e) - Interesting\n", + "* [XGBOOST vs LightGBM: Which algorithm wins the race !!!](https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d) - LightGBM has 
proven to be interesting.\n", + "* [From Zero to Hero in XGBoost Tuning](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58) - I liked it\n", + "* [Build XGBoost / LightGBM models on large datasets — what are the possible solutions?](https://towardsdatascience.com/build-xgboost-lightgbm-models-on-large-datasets-what-are-the-possible-solutions-bf882da2c27d)\n", + "* [Selecting Optimal Parameters for XGBoost Model Training](https://towardsdatascience.com/selecting-optimal-parameters-for-xgboost-model-training-c7cd9ed5e45e) - Very good!\n", + "* [CatBoost vs. Light GBM vs. XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db)\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iMM_R4_ukV7x" + }, + "source": [ + "from xgboost import XGBClassifier\n", + "import xgboost as xgb\n", + "\n", + "# Instantiate...\n", + "ml_XGB= XGBClassifier(silent=False, \n", + " scale_pos_weight=1,\n", + " learning_rate=0.01, \n", + " colsample_bytree = 1,\n", + " subsample = 0.8,\n", + " objective='binary:logistic', \n", + " n_estimators=1000, \n", + " reg_alpha = 0.3,\n", + " max_depth= 3, \n", + " gamma=1, \n", + " max_delta_step=5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E4wQMlDEFINR" + }, + "source": [ + "# Train...\n", + "ml_XGB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S77LljiQR_16" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_XGB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JNyKX6PkrXOk" + }, + "source": [ + "**Interpretation**: Our classifier (XGBClassifier) has a mean accuracy of 96.72% (training set). In addition, the std is around 2.02%, which is small. 
Let's try to improve the classifier's accuracy with parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_h0QYv3FkV73" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AKhhAZLjkV76" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_XGB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ir2Kd1PqGHgz" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jEC7gW4qYpWw" + }, + "source": [ + "## Parameter tuning\n", + "### Further reading:\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)\n", + "\n", + "> Looking at the results above, which model is the best?\n", + "\n", + "XGBoost? Assuming so, let's now fine-tune the model's hyperparameters." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n3MsUONPwIV9" + }, + "source": [ + "# Dicionário de Hiperparâmetros para XGBoost:\n", + "d_hiperparametros_XGB = {'min_child_weight': [i for i in np.arange(1, 13)]} #,\n", + "# 'gamma': [i for i in np.arange(0, 5, 0.5)],\n", + "# 'subsample': [0.6, 0.8, 1.0],\n", + "# 'colsample_bytree': [0.6, 0.8, 1.0],\n", + "# 'max_depth': [3, 4, 5, 7, 9],\n", + "# 'learning_rate': [i for i in np.arange(0.01, 1, 0.1)]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CX27FCKmwSni" + }, + "source": [ + "# Invoca a função\n", + "ml_XGB, best_params= GridSearchOptimizer(ml_XGB, 'ml_XGB2', d_hiperparametros_XGB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9b7uCuF74Hjv" + }, + "source": [ + "### Resultado da execução do XGBoostClassifier\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 108000 out of 108000 | elapsed: 372.0min finished\n", + "\n", + "Hiperparâmetros otimizados: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n7E0oyxEtbGi" + }, + "source": [ + "# Como o procedimento acima levou 372 minutos para executar, então vou estimar ml_XGB2 abaixo usando os parâmetros acima estimados\n", + "best_params= {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "\n", + "ml_XGB2= XGBClassifier(min_child_weight= best_params['min_child_weight'], \n", + " gamma= best_params['gamma'], \n", + " subsample= best_params['subsample'], \n", + " colsample_bytree= best_params['colsample_bytree'], \n", + " max_depth= best_params['max_depth'], \n", + " learning_rate= best_params['learning_rate'], \n", + " random_state= i_Seed)" 
+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CuqyLHTU5Z-j" + }, + "source": [ + "## Selecionar as COLUNAS importantes/relevantes\n", + "* [The Multiple faces of ‘Feature importance’ in XGBoost](https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QPG3JZIpRZ-T" + }, + "source": [ + "# plot feature importance\n", + "from xgboost import plot_importance\n", + "\n", + "xgb.plot_importance(ml_XGB2, color = 'red')\n", + "plt.title('importance', fontsize = 20)\n", + "plt.yticks(fontsize = 10)\n", + "plt.ylabel('features', fontsize = 20)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EmpRC2lHW-KP" + }, + "source": [ + "ml_XGB2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4f9MIEBiyq-5" + }, + "source": [ + "X_treinamento_XGB, X_teste_XGB= seleciona_colunas_relevantes(ml_XGB2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6EayWaY5nMm" + }, + "source": [ + "## Treina o classificador com as COLUNAS relevantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Huy18gKI5qad" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E3-PaTdc5vZk" + }, + "source": [ + "# Treina com as COLUNAS relevantes...\n", + "ml_XGB2.fit(X_treinamento_XGB, y_treinamento)\n", + "\n", + "# Cross-Validation com 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_XGB2, X_treinamento_XGB, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tBdYikDU6NhD" + }, + "source": [ + "## Valida o modelo usando o dataframe X_teste" + ] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "GcvY-VdL6VIZ" + }, + "source": [ + "y_pred_XGB = ml_XGB2.predict(X_teste_XGB)\n", + "\n", + "# Calcula acurácia\n", + "accuracy_score(y_teste, y_pred_XGB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8oLtdH-vTSbC" + }, + "source": [ + "xgb.to_graphviz(ml_XGB2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "czXQG3MCHfHM" + }, + "source": [ + "# KNN - KNEIGHBORSCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "llTTXNeyHiwx" + }, + "source": [ + "# BAGGINGCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fbkekd4QHoZO" + }, + "source": [ + "# EXTRATREESCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "widavwR4HzwE" + }, + "source": [ + "# SVM\n", + "https://data-flair.training/blogs/svm-support-vector-machine-tutorial/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "id_Ubulns6We" + }, + "source": [ + "# NAIVE BAYES" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ycu_EIGlYUYn" + }, + "source": [ + "import pandas as pd\n", + "\n", + "from xgboost import XGBClassifier\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", + "from sklearn.tree import ExtraTreeClassifier\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import GradientBoostingClassifier\n", + "from sklearn.ensemble import BaggingClassifier\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "from lightgbm import LGBMClassifier\n", + "\n", + "clfs = [XGBClassifier(), LGBMClassifier(), \n", + " ExtraTreesClassifier(), ExtraTreeClassifier(),\n", + " BaggingClassifier(), DecisionTreeClassifier(),\n", + " GradientBoostingClassifier(), LogisticRegression(),\n", + " 
AdaBoostClassifier(), RandomForestClassifier()]\n", + "\n", + "for clf in clfs:\n", + " try:\n", + " _ = mostra_feature_importances(clf, X_treinamento, y_treinamento, top_n=X_treinamento.shape[1], title=clf.__class__.__name__)\n", + " except AttributeError as e:\n", + " print(e)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EwWkjfC8KEZH" + }, + "source": [ + "# ENSEMBLE METHODS\n", + "https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b\n", + "\n", + "![Ensemble](https://github.com/MathMachado/Materials/blob/master/Ensemble.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Uf1RML7xETY" + }, + "source": [ + "# WOE e IV\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TBNRfYZCyhMP" + }, + "source": [ + "## Construção do exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gIIroyyP4ZRZ" + }, + "source": [ + "df_y.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PzQQdrkf1ohX" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo']= choices(['A', 'B', 'C', 'D'], k= 1000)\n", + "df_X2['idade']= np.random.randint(10, 15, size= 1000)\n", + "df_X2['target']= df_y['target']\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v-OpwIpx4hXJ" + }, + "source": [ + "df_X2['target'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yZfqSvbKzeJ3" + }, + "source": [ + "def Constroi_Buckets(df, i, k= 10):\n", + " coluna= 'v'+ str(i)\n", + " df[coluna+'_Bucket']= pd.cut(df[coluna], bins= k, labels= np.arange(1, k+1))\n", + " df= df.drop(columns= [coluna], axis= 1)\n", + " return df" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "V6Nrpsx60HD3" + }, + "source": [ + "for i in np.arange(1,19):\n", + " df_X2= Constroi_Buckets(df_X2, i)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J2Fbh41-03OB" + }, + "source": [ + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "O9r5BeWVxIr3" + }, + "source": [ + "# Função para calcular WOE e IV\n", + "def calculate_woe_iv(dataset, feature, target):\n", + "\n", + " def codethem(IV):\n", + " if IV < 0.02: return 'Useless'\n", + " elif IV >= 0.02 and IV < 0.1: return 'Weak'\n", + " elif IV >= 0.1 and IV < 0.3: return 'Medium'\n", + " elif IV >= 0.3 and IV < 0.5: return 'Strong'\n", + " elif IV >= 0.5: return 'Suspicious'\n", + " else: return 'None'\n", + "\n", + " lst = []\n", + " for i in range(dataset[feature].nunique()):\n", + " val = list(dataset[feature].unique())[i]\n", + " lst.append({\n", + " 'Value': val,\n", + " 'All': dataset[dataset[feature] == val].count()[feature],\n", + " 'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],\n", + " 'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]\n", + " })\n", + " \n", + " dset = pd.DataFrame(lst)\n", + " dset['Distr_Good'] = dset['Good']/dset['Good'].sum()\n", + " dset['Distr_Bad'] = dset['Bad']/dset['Bad'].sum()\n", + " dset['Mean']= dset['All']/dset['All'].sum()\n", + " dset['WoE'] = np.log(dset['Distr_Good']/dset['Distr_Bad'])\n", + " dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})\n", + " dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']\n", + " #dset= dset.drop(columns= ['Distr_Good', 'Distr_Bad'], axis= 1)\n", + "\n", + " dset['Predictive_Power']= dset['IV'].map(codethem)\n", + " iv = dset['IV'].sum() \n", + " dset = dset.sort_values(by='IV') \n", + " return dset, iv" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "Y8WGjWH63nx_" + }, + "source": [ + "df_Lab = df_X2.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-N6xr1MgxTiz" + }, + "source": [ + "def calcula_Predictive_Power(df_Lab, coluna):\n", + " print('WoE and IV for column: {}'.format(coluna))\n", + " df, iv = calculate_woe_iv(df_Lab, coluna, 'target')\n", + " print(df)\n", + " print('IV score: {:.2f}'.format(iv))\n", + " print('\\n')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ayqN_7WnxVq9" + }, + "source": [ + "for i in np.arange(1,19):\n", + " coluna= 'v'+str(i)+'_Bucket'\n", + " calcula_Predictive_Power(df_Lab, coluna)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtoJVI4Pyx3I" + }, + "source": [ + "# **IMBALANCED SAMPLE**\n", + "> Tasks such as detecting fraud in banking transactions or detecting network intrusions have one thing in common: the class of interest (what we want to detect) is usually a rare event.\n", + "\n", + "## Example: fraud detection\n", + "The ratio of fraud to NON-FRAUD cases is roughly 1%/99%. In that case, if we build a fraud-detection model and it classifies every instance as NON-FRAUD, the model will have an accuracy of 99%. 
However, such a model does not help us at all.\n", + "\n", + "## Need for other metrics\n", + "> It is advisable to use other metrics (in fact, it is good practice to use more than one metric to measure model performance), for example F1-Score, Precision, Recall (Sensitivity), Specificity, and AUROC.\n", + "\n", + "## How to deal with an imbalanced sample?\n", + "* Under-sampling\n", + "> Randomly samples the MAJORITY class (in our example, NON-FRAUD) down to the number of instances of the MINORITY class (FRAUD);\n", + "\n", + "* Over-sampling\n", + "> Randomly resamples the MINORITY class (in our example, FRAUD) up to the number of instances of the MAJORITY class (NON-FRAUD), or to a proportion of the MAJORITY class. See SMOTE (Synthetic Minority Over-sampling Technique), a technique implemented, for example, in the imbalanced-learn library;\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2o45zx8zw-aB" + }, + "source": [ + "## EFFECTS OF THE IMBALANCED SAMPLE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCVTPCB-Xkbd" + }, + "source": [ + "# TPOT\n", + "https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2ulXii6JXpWd" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_TWUq-z4X4yZ" + }, + "source": [ + "___\n", + "# FEATURETOOLS\n", + "https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183\n", + "\n", + "https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/\n", + "\n", + "https://mlwhiz.com/blog/2019/05/19/feature_extraction/\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aZUNOgmSgAmq" + }, + "source": [ + "!pip install featuretools" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "_sxdONzsh9rb" + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "p5_ynGo1dBJJ" + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TqJRJXUhiDqf" + }, + "source": [ + "from random import choices\n", + "\n", + "df_X2= df_X.copy()\n", + "df_X2['tipo'] = choices(['A', 'B', 'C', 'D'], k = 1000)\n", + "df_X2['idade'] = np.random.randint(10, 15, size = 1000)\n", + "df_X2['id'] = range(0,1000)\n", + "df_X2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nR56bGGngk-W" + }, + "source": [ + "# Automated feature engineering\n", + "import featuretools as ft\n", + "import featuretools.variable_types as vtypes\n", + "\n", + "es= ft.EntitySet(id = 'simulacao')\n", + "\n", + "# adding a dataframe \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id')\n", + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IOJ4Tr5Ogk6M" + }, + "source": [ + "es['df_X2'].variables" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1uXPqHDZgkys" + }, + "source": [ + "variable_types = {'idade': vtypes.Categorical}\n", + " \n", + "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id', variable_types= variable_types)\n", + "\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'tipo', index='id')\n", + "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'idade', index='id')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dnbYTBqugkvm" + }, + "source": [ + "es" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I2v_jetdgkr7" + }, + 
"source": [ + "feature_matrix, feature_names = ft.dfs(entityset=es, target_entity = 'df_X2', max_depth = 3, verbose = 3, n_jobs= 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zZiRBvHXgkoJ" + }, + "source": [ + "feature_matrix.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aWiahwKe2d6U" + }, + "source": [ + "# **EXERCISES**\n", + "> Find suitable algorithms to apply to the following problems:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XbSLkbDB2mzK" + }, + "source": [ + "## Exercise 1 - Credit Card Fraud Detection\n", + "Source: [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)\n", + "\n", + "### Supporting reading\n", + "* [Detecting Credit Card Fraud Using Machine Learning](https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8)\n", + "* [Credit Card Fraud Detection](https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59)\n", + "\n", + "### Dataframe\n", + "* [Creditcard.csv](https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JYVM3StS-g0E" + }, + "source": [ + "### Import the required libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dyliPChh-jPk" + }, + "source": [ + "from sklearn.metrics import accuracy_score # to measure the accuracy of the predictive model\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix # to plot the confusion matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV # to optimize the hyperparameters of the predictive models\n", + "from sklearn.model_selection import cross_val_score\n", + "from time import time\n", + "from operator import 
itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from sklearn.tree import export_graphviz\n", + "from io import StringIO # sklearn.externals.six was removed from scikit-learn\n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lAl9ZwP_0-d0" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv'\n", + "df_cc = pd.read_csv(url)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "w6lN8FjJ12VU" + }, + "source": [ + "df_cc.head(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M47GS1cK2NdD" + }, + "source": [ + "df_cc.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "b2QBZFbR3W_q" + }, + "source": [ + "df_cc['Class'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pzjW3Bf_3h7J" + }, + "source": [ + "56/12842" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9bWDX9H12k5g" + }, + "source": [ + "### Drop NaN" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "27ob8tRR21TB" + }, + "source": [ + "df_cc.isna().sum()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "X9k16WLI49JI" + }, + "source": [ + "# dropna() already returns a new dataframe, so no prior copy is needed\n", + "df_cc2 = df_cc.dropna()\n", + "df_cc2.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OY-DYRKg34ZX" + }, + "source": [ + "### Define the global variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KVhHgV_s3_5f" + }, + "source": [ + "i_CV = 10 # Number of cross-validation folds\n", + "i_Seed = 
20111974 # seed, for reproducibility\n", + "f_Test_Size = 0.3 # Proportion of the validation dataframe (other values could be 0.15, 0.20 or 0.25)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wKbqrF4Q2nBq" + }, + "source": [ + "### Define the training and test samples" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N8CUAiA57OhS" + }, + "source": [ + "df_cc.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LZjNUDNb7s1t" + }, + "source": [ + "# Dataframe containing the predictor variables:\n", + "df_X = df_cc2.copy()\n", + "df_X.drop(columns= ['Class'], axis = 1, inplace = True)\n", + "df_X.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "d3DDsN2V7IOU" + }, + "source": [ + "df_y = df_cc2['Class'] # Response variable\n", + "df_y.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "aMthdXHD8vnh" + }, + "source": [ + "df_y.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EiJRftpZ2103" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TmSkPzNt8O6I" + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9h1PjPKh8Xb1" + }, + "source": [ + "X_teste.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NbCN_puI2qk1" + }, + "source": [ + "### Fit the model" + ] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "hjRwSI079ADn" + }, + "source": [ + "# Import the classifier (model, algorithm, ...)\n", + "from sklearn.tree import DecisionTreeClassifier # This is our classifier" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HuhKJOQA22bR" + }, + "source": [ + "ml_DT = DecisionTreeClassifier(max_depth = 5, min_samples_split = 2, random_state = i_Seed)\n", + "ml_DT" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zai1d6eM93VQ" + }, + "source": [ + "# Train the algorithm/classifier: fit(df)\n", + "ml_DT.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ybbS4zHn-8BO" + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_DT, X_treinamento, y_treinamento, cv = i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "r_NLku7q_YT9" + }, + "source": [ + "a_scores_CV # array with the scores from each CV iteration" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bCRgHxUu2s7c" + }, + "source": [ + "### Cross-Validation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2wMWm-p5229A" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Am_UELOg2vDh" + }, + "source": [ + "### Hyperparameter fine-tuning" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lF9mxe7y23hr" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bG31I7_n4RQg" + }, + "source": [ + "### Apply the (main) transformations studied and re-estimate the model\n", + "* What is the impact of the transformations?\n", + "* Does the conclusion change?" 
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYgK6JXd3MgA" + }, + "source": [ + "## Exercise 2 - Predicting species on IRIS dataset\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "si0rsJvu3O6O" + }, + "source": [ + "from sklearn import datasets\n", + "import xgboost as xgb\n", + "\n", + "iris = datasets.load_iris()\n", + "X_iris = iris.data\n", + "y_iris = iris.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zom8t4yWC_UC" + }, + "source": [ + "## Exercise 3 - Predict Wine Quality\n", + "> Estimate wine quality on a scale of 0–100. The quality categories by score are:\n", + "\n", + "* 95–100 Classic: a great wine\n", + "* 90–94 Outstanding: a wine of superior character and style\n", + "* 85–89 Very good: a wine with special qualities\n", + "* 80–84 Good: a solid, well-made wine\n", + "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n", + "* 50–74 Not recommended\n", + "\n", + "Source: [Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klL2Q9Ria96n" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Wine = datasets.load_wine()\n", + "X_vinho = Wine.data\n", + "y_vinho = Wine.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lhVhSWBgGijq" + }, + "source": [ + "## Exercise 4 - Predict Parkinson\n", + "Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SVCxHqv0VBJn" + }, + "source": [ + "## Exercise 5 - Predict survivors from Titanic tragedy\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwvB8us4eKNi" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "\n", + "df_titanic = 
sns.load_dataset('titanic')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZJrT9YIXVdtx" + }, + "source": [ + "## Exercise 6 - Predict Loan\n", + "> The data must be obtained directly from the source: [Loan Default Prediction - Imperial College London](https://www.kaggle.com/c/loan-default-prediction/data)\n", + "\n", + "* [Bank Loan Default Prediction](https://medium.com/@wutianhao910/bank-loan-default-prediction-94d4902db740)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R8-GVu7ZWeA8" + }, + "source": [ + "## Exercise 7 - Predict the sales of a store.\n", + "* [Predicting expected sales for Bigmart’s stores](https://medium.com/diogo-menezes-borges/project-1-bigmart-sale-prediction-fdc04f07dc1e)\n", + "* Dataframes\n", + " * [Training](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_train.txt)\n", + " * [Validation](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_test.txt)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fv9w86j4Wnwj" + }, + "source": [ + "## Exercise 8 - [The Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)\n", + "> Predict the median value of owner-occupied homes." 
 + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5HYRt8-ug1BT" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "# Note: load_boston was removed in scikit-learn 1.2; this cell requires an older release.\n", + "Boston = datasets.load_boston()\n", + "X_boston = Boston.data\n", + "y_boston = Boston.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UDIaqmtXQ0T" + }, + "source": [ + "## Exercise 9 - Predict the height or weight of a person.\n", + "\n", + "http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-7R146nIXmMT" + }, + "source": [ + "## Exercise 10 - Black Friday Sales Prediction - Predict purchase amount.\n", + "\n", + "This dataset comprises sales transactions captured at a retail store. It’s a classic dataset for exploring and expanding your feature engineering skills and your day-to-day understanding of multiple shopping experiences. This is a regression problem. 
The dataset has 550,069 rows and 12 columns.\n", + "\n", + "https://github.com/MathMachado/DataFrames/blob/master/blackfriday.zip\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mQ8FPbuLZlIh" + }, + "source": [ + "## Exercise 11 - Predict the income class of US population.\n", + "\n", + "http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Af4NRrchgPlM" + }, + "source": [ + "## Exercise 12 - Predicting Cancer\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c4LOlgZW3P40" + }, + "source": [ + "from sklearn import datasets\n", + "cancer = datasets.load_breast_cancer()\n", + "X_cancer = cancer.data\n", + "y_cancer = cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74PmpT8Ix0tD" + }, + "source": [ + "## Exercise 13\n", + "Source: [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WY8GZMixZ9W9" + }, + "source": [ + "## Exercise 14 - Predict Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y92t6tbOge0S" + }, + "source": [ + "from sklearn import datasets\n", + "Diabetes= datasets.load_diabetes()\n", + "\n", + "X_diabetes = Diabetes.data\n", + "y_diabetes = Diabetes.target" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15_00__Machine_Learning_hs6.ipynb b/Notebooks/NB15_00__Machine_Learning_hs6.ipynb new file mode 100644 index 000000000..a63d6f900 --- /dev/null +++ b/Notebooks/NB15_00__Machine_Learning_hs6.ipynb @@ -0,0 +1,4996 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, 
+ "colab": { + "name": "NB15_00__Machine_Learning.ipynb", + "provenance": [], + "include_colab_link": true + }, + "accelerator": "TPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ShVXyGj9wkgN" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e-VOopTKxrMs" + }, + "source": [ + "Below are suggested problems to solve with linear regression:\n", + "* https://lionbridge.ai/datasets/10-open-datasets-for-linear-regression/" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iTAGmmHouqQd" + }, + "source": [ + "#!pip install azureml\n", + "#!pip install azureml-opendatasets\n", + "#!pip install azureml-dataset-runtime" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "74DHjOrSuOwd" + }, + "source": [ + "#from azureml import Datasets" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uIMB3a9EuQ9h" + }, + "source": [ + "#from azureml.core import Dataset\n", + "#from azureml.opendatasets import NycTlcYellow, NycTlcGreen\n", + "#from dateutil import parser\n", + "#from datetime import datetime\n", + "#from dateutil.relativedelta import relativedelta" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oV-ETadXvsuG" + }, + "source": [ + "#end_date = parser.parse('2018-06-06')\n", + "#start_date = parser.parse('2018-05-01')\n", + "#nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)\n", + "#nyc_tlc_df = nyc_tlc.to_pandas_dataframe() " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H-zwPxoLxVgp" + }, + "source": [ + "Link: https://docs.microsoft.com/pt-pt/azure/machine-learning/tutorial-auto-train-models" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tUwpNHtbwn41" + }, + "source": [ + "#nyc_tlc_df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oflehhy7wtde" + }, + "source": [ + "#nyc_tlc_df.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "C1-G9EajxFbJ" + }, + 
"source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYQ4cDfcPu4e" + }, + "source": [ + "___\n", + "# **NOTES AND REMARKS**\n", + "* Cover the impact of class imbalance in the sample;\n", + "* Add AUROC to the material and show the cut-off used to classify between 0 and 1;\n", + "* Statistical concepts of bias & variance;\n", + "* See Sklearn.optimize: https://web.telegram.org/#/im?p=g497957288;\n", + "* Build a package to hold all the functions defined here and move them into it --> fast, easy, centralized maintenance! That way, the topic (\"Functions used in this tutorial\") moves entirely into the package." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YvhLC_uf4_G" + }, + "source": [ + "___\n", + "# **AGENDA**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QgX6n2VDyY1O" + }, + "source": [ + "___\n", + "# **REFERENCES**\n", + "* [scikit-learn - Machine Learning With Python](https://scikit-learn.org/stable/);\n", + "* [An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples](https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer)\n", + "* [The Difference Between Artificial Intelligence, Machine Learning, and Deep Learning](https://medium.com/iotforall/the-difference-between-artificial-intelligence-machine-learning-and-deep-learning-3aa67bff5991)\n", + "* [A Gentle Guide to Machine Learning](https://blog.monkeylearn.com/a-gentle-guide-to-machine-learning/)\n", + "* [A Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n", + "* [Introduction to Machine Learning](http://alex.smola.org/drafts/thebook.pdf)\n", + "* [The 10 Statistical Techniques Data Scientists Need to Master](https://medium.com/cracking-the-data-science-interview/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7)\n", + "* 
[Tune: a library for fast hyperparameter tuning at any scale](https://towardsdatascience.com/fast-hyperparameter-tuning-at-scale-d428223b081c)\n", + "* [How to lie with Data Science](https://towardsdatascience.com/how-to-lie-with-data-science-5090f3891d9c)\n", + "* [5 Reasons “Logistic Regression” should be the first thing you learn when becoming a Data Scientist](https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4)\n", + "* [Machine learning on categorical variables](https://towardsdatascience.com/machine-learning-on-categorical-variables-3b76ffe4a7cb)\n", + "\n", + "## Deep Learning & Neural Networks\n", + "\n", + "- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html)\n", + "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721)\n", + "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsCbZd2epfxo" + }, + "source": [ + "___\n", + "# **INTRODUCTION**\n", + "\n", + "* \"__Information is the oil of the 21st century, and analytics is the combustion engine__.\" - Peter Sondergaard, SVP, Gartner Research;\n", + "\n", + "\n", + ">The focus of this chapter will be:\n", + "* Linear, Logistic Regression, Decision Tree, Random Forest, Support Vector Machine and XGBoost algorithms for building Machine Learning models;\n", + "* Understanding how to solve classification and regression problems;\n", + "* Applying Ensemble techniques such as Bagging and Boosting;\n", + "* How to measure the accuracy of Machine Learning models;\n", + "* Learning the main Machine Learning algorithms from both supervised and unsupervised learning techniques.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"HqqB2vaHXMGt" + }, + "source": [ + "___\n", + "# **ARTIFICIAL INTELLIGENCE VS MACHINE LEARNING VS DEEP LEARNING**\n", + "* **Machine Learning** - gives computers the ability to learn without being explicitly programmed. Computers improve their performance on a task through practice, usually with large datasets.\n", + "* **Deep Learning** - a Machine Learning method that relies on artificial neural networks, allowing computer systems to learn by example, much as we humans do." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P961GcguXFFA" + }, + "source": [ + "![EvolutionOfAI](https://github.com/MathMachado/Materials/blob/master/Evolution%20of%20AI.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://github.com/MathMachado/P4ML/blob/DS_Python/Material/Evolution%20of%20AI.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkqGtO88ZkPr" + }, + "source": [ + "![AI_vs_ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/AI_vs_ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xesQpzfmaqj6" + }, + "source": [ + "![ML_vs_DL](https://github.com/MathMachado/Materials/blob/master/ML_vs_DL.PNG?raw=true)\n", + "\n", + "Source: [Artificial Intelligence vs. Machine Learning vs. 
Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KeIVR59IIS7f" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING - TECHNIQUES**\n", + "\n", + "* Supervised Learning\n", + "* Unsupervised Learning\n", + "\n", + "![MachineLearning](https://github.com/MathMachado/Materials/blob/master/MachineLearningTechniques.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rvwp5UHdBiup" + }, + "source": [ + "___\n", + "# **OUR FOCUS HERE WILL BE...**\n", + "\n", + "![ClassicalML](https://github.com/MathMachado/Materials/blob/master/ClassicalML.jpg?raw=true)\n", + "\n", + "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cBLSvJTXHBjK" + }, + "source": [ + "___\n", + "# **CHEATSHEET**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZdjR3nahUuKq" + }, + "source": [ + "\n", + "![Scikit-Learn](https://github.com/MathMachado/Materials/blob/master/scikit-learn-1.png?raw=true)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XRukccWQSklx" + }, + "source": [ + "## Measures for assessing the variability present in the data\n", + "* The main measures of variability are the range, variance, standard deviation, and coefficient of variation;\n", + "* These measures let us conclude whether the data are homogeneous (lower dispersion/variability) or heterogeneous (higher dispersion/variability).\n", + "\n", + "* **In the next version, bring these concepts into the Notebook and use Python to compute these measures**." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yBR8tWV_lhQq" + }, + "source": [ + "___\n", + "# **ENSEMBLE METHODS** (= Combining predictive models)\n", + "* Methods\n", + " * **Bagging** (Bootstrap AGGregatING)\n", + " * **Boosting**\n", + " * Stacking --> Less commonly used\n", + "* Helps avoid overfitting (overfitting is when the model/function fits the training data very well but generalizes poorly to other samples/the population).\n", + "* Builds meta-classifiers: it combines the results of several algorithms to produce predictions that are more accurate and robust than those of each individual classifier.\n", + "* Ensembles reduce the effects of the main sources of error in Machine Learning models:\n", + " * noise;\n", + " * bias;\n", + " * variance --> here, how much the model's predictions change when it is trained on a different sample.\n", + "\n", + "# References\n", + "* [Simple guide for ensemble learning methods](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2) - A didactic explanation of how ensembles work." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25RW8u-Sj780" + }, + "source": [ + "### Further Reading\n", + "* [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)\n", + "* [Ensemble Methods in Machine Learning: What are They and Why Use Them?](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)\n", + "* [Ensemble Learning Using Scikit-learn](https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a)\n", + "* [Let’s Talk About Machine Learning Ensemble Learning In Python](https://medium.com/fintechexplained/lets-talk-about-machine-learning-ensemble-learning-in-python-382747e5fba8)\n", + "* [Boosting, Bagging, and Stacking — Ensemble Methods with sklearn and mlens](https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FugME1HSl4jJ" + }, + "source": [ + "___\n", + "# **PARAMETER TUNING** (= Optimal hyperparameters for Machine Learning models)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u_147cIRl9F1" + }, + "source": [ + "## GridSearch (the tool we will use to optimize the hyperparameters of the ML models)\n", + "* Finds the optimal hyperparameters (hyperparameter tuning) that improve model accuracy.\n", + "* Requires the following inputs:\n", + " * The matrix $X_{p}$ with the $p$ COLUMNS (variables or features) of the dataframe;\n", + " * The vector $y$ with the target COLUMN (response variable);\n", + " * The model (estimator) to be tuned, for example DecisionTreeClassifier, RandomForestClassifier, XGBClassifier, etc.;\n", + " * A dictionary with the hyperparameters to be optimized;\n", + " * The number of folds for Cross-validation." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "39Sg77fbTWCO" + }, + "source": [ + "___\n", + "# **MODEL SELECTION & EVALUATION**\n", + "> In this phase we identify and apply the most suitable metrics (Accuracy, Sensitivity, Specificity, F-Score, AUC, R-Sq, Adj R-SQ, RMSE (Root Mean Square Error)) to evaluate the performance of the ML models.\n", + ">> We train the ML models on the training sample and evaluate their performance on the test/validation sample.\n", + "\n", + "* Further Reading\n", + " * [The 5 Classification Evaluation metrics every Data Scientist must know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)\n", + " * [Confusion matrix and other metrics in machine learning](https://medium.com/hugo-ferreiras-blog/confusion-matrix-and-other-metrics-in-machine-learning-894688cb1c0a)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oQQVzZ2ZTYrB" + }, + "source": [ + "## Confusion Matrix\n", + "* Terms associated with the Confusion Matrix:\n", + " * **True Positive** (TP): When the observed value is True and the model predicts True. In other words, the model's prediction was correct.\n", + " * Example: **Observed**: Fraud (Positive); **Model**: Fraud (Positive) --> The model got it right!\n", + " * **True Negative** (TN): When the observed value is False and the model predicts False. In other words, the model's prediction was correct;\n", + " * Example: **Observed**: NON-Fraud (Negative); **Model**: NON-Fraud (Negative) --> The model got it right!\n", + " * **False Positive** (FP): When the observed value is False and the model predicts True. In other words, the model's prediction was wrong. 
\n", + " * Example: **Observed**: NON-Fraud (Negative); **Model**: Fraud (Positive) --> The model got it wrong!\n", + " * **False Negative** (FN): When the observed value is True and the model predicts False. In other words, the model's prediction was wrong.\n", + " * Example: **Observed**: Fraud (Positive); **Model**: NON-Fraud (Negative) --> The model got it wrong!\n", + "\n", + "* See [Confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)\n", + "\n", + "![ConfusionMatrix](https://github.com/MathMachado/Materials/blob/master/ConfusionMatrix.PNG?raw=true)\n", + "\n", + "Source: [Confusion Matrix](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ci-6eiqBTgbL" + }, + "source": [ + "## Accuracy\n", + "> Accuracy - is the proportion of predictions the model got right.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "How often does the classifier (predictive model) classify correctly?\n", + "```\n", + "\n", + "$$Accuracy= \\frac{TP+TN}{TP+TN+FP+FN}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7YI8X5TRx-R" + }, + "source": [ + "## Precision (or Positive Predictive Value)\n", + "> **Precision** - tells us how the classifier performs with respect to False Positives (how many of the predicted positives are truly positive).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "Of the cases predicted as Positive, how often is the classifier correct?\n", + "```\n", + "\n", + "\n", + "$$Precision= \\frac{TP}{TP+FP}$$\n", + "\n", + "**Example**: Precision is the proportion of customers the model predicted as Fraud that actually are Fraud.\n", + "\n", + "**Comment**: If our focus is on minimizing False Positives (FP), we should aim for Precision as close to 100% as possible." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zO39n8x_Sz3L" + }, + "source": [ + "## Recall (or Sensitivity)\n", + "> **Recall** - tells us how the classifier performs with respect to False Negatives (how many positives we missed).\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Positive, how often is the classifier correct?\n", + "```\n", + "\n", + "$$Recall = Sensitivity = \\frac{TP}{TP+FN}$$\n", + "\n", + "**Example**: Recall is the proportion of customers observed as Fraud that the model predicts as Fraud.\n", + "\n", + "**Comment**: If our focus is on minimizing False Negatives (FN), we should aim for Recall as close to 100% as possible." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "htS6rdHVVXRG" + }, + "source": [ + "## Specificity\n", + "> **Specificity** - the proportion of TN over TN+FP.\n", + "\n", + "It answers the following question:\n", + "\n", + "```\n", + "When the observed value is Negative, how often is the classifier correct?\n", + "```\n", + "\n", + "**Example**: Specificity is the proportion of NON-Fraud customers that the model predicts as NON-Fraud.\n", + "\n", + "$$Specificity= \\frac{TN}{TN+FP}$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mNn0twadTacc" + }, + "source": [ + "## F1-Score\n", + "> F1-Score is the harmonic mean of Recall and Precision and is a number between 0 and 1. The closer to 1, the better. The closer to 0, the worse. 
In other words, it balances Recall and Precision.\n", + "\n", + "$$F1\\_Score= 2\\left(\\frac{Recall*Precision}{Recall+Precision}\\right)$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gkCubyUCP_hn" + }, + "source": [ + "### Functions used in this tutorial" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZD2pH9hfTnZv" + }, + "source": [ + "#### Cross-Validation function" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hr8LczrSQB0x" + }, + "source": [ + "from sklearn.model_selection import cross_val_score\n", + "\n", + "def funcao_cross_val_score(modelo, X_treinamento, y_treinamento, CV):\n", + " # version using sklearn.model_selection.cross_validate:\n", + " #a_scores_CV = cross_validate(modelo, X_treinamento, y_treinamento, cv = CV, scoring = metodo)\n", + " #print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + " #print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')\n", + " #return a_scores_CV\n", + "\n", + " # version using cross_val_score (prints the mean and std of the fold accuracies, in %):\n", + " a_scores_CV = cross_val_score(modelo, X_treinamento, y_treinamento, cv = CV)\n", + " print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n", + " print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')\n", + " return a_scores_CV" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ROlyvgij2yl" + }, + "source": [ + "#### Function to plot the Confusion Matrix\n", + "* Taken from [Confusion Matrix Visualization](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klQ0FLOIgeX1" + }, + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "def mostra_confusion_matrix(cf, \n", + " group_names = None, \n", + " categories = 'auto', \n", + " count = True, \n", + " percent = True, \n", + " cbar = True, \n", + " xyticks = False, \n", + " xyplotlabels = True, \n", + " sum_stats = True, \n", + " figsize = (8, 8), 
\n", + " cmap = 'Blues'):\n", + " '''\n", + " This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.\n", + " Arguments\n", + " ---------\n", + " cf: confusion matrix to be passed in\n", + " group_names: List of strings that represent the labels row by row to be shown in each square.\n", + " categories: List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'\n", + " count: If True, show the raw number in the confusion matrix. Default is True.\n", + " percent: If True, show each cell as a percentage of all observations. Default is True.\n", + " cbar: If True, show the color bar. The cbar values are based off the values in the confusion matrix.\n", + " Default is True.\n", + " xyticks: If True, show x and y ticks. Default is False.\n", + " xyplotlabels: If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.\n", + " sum_stats: If True, display summary statistics below the figure. Default is True.\n", + " figsize: Tuple representing the figure size. Default is (8, 8); pass None to use the matplotlib rcParams value.\n", + " cmap: Colormap of the values displayed from matplotlib.pyplot.cm. 
Default is 'Blues'\n", + " See http://matplotlib.org/examples/color/colormaps_reference.html\n", + " '''\n", + "\n", + " # CODE TO GENERATE TEXT INSIDE EACH SQUARE\n", + " blanks = ['' for i in range(cf.size)]\n", + "\n", + " if group_names and len(group_names)==cf.size:\n", + " group_labels = [\"{}\\n\".format(value) for value in group_names]\n", + " else:\n", + " group_labels = blanks\n", + "\n", + " if count:\n", + " group_counts = [\"{0:0.0f}\\n\".format(value) for value in cf.flatten()]\n", + " else:\n", + " group_counts = blanks\n", + "\n", + " if percent:\n", + " group_percentages = [\"{0:.2%}\".format(value) for value in cf.flatten()/np.sum(cf)]\n", + " else:\n", + " group_percentages = blanks\n", + "\n", + " box_labels = [f\"{v1}{v2}{v3}\".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]\n", + " box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])\n", + "\n", + " # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS\n", + " if sum_stats:\n", + " #Accuracy is sum of diagonal divided by total observations\n", + " accuracy = np.trace(cf) / float(np.sum(cf))\n", + "\n", + " #if it is a binary confusion matrix, show some more stats\n", + " if len(cf)==2:\n", + " #Metrics for Binary Confusion Matrices\n", + " precision = cf[1,1] / sum(cf[:,1])\n", + " recall = cf[1,1] / sum(cf[1,:])\n", + " f1_score = 2*precision*recall / (precision + recall)\n", + " stats_text = \"\\n\\nAccuracy={:0.3f}\\nPrecision={:0.3f}\\nRecall={:0.3f}\\nF1 Score={:0.3f}\".format(accuracy,precision,recall,f1_score)\n", + " else:\n", + " stats_text = \"\\n\\nAccuracy={:0.3f}\".format(accuracy)\n", + " else:\n", + " stats_text = \"\"\n", + "\n", + " # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS\n", + " if figsize==None:\n", + " #Get default figure size if not set\n", + " figsize = plt.rcParams.get('figure.figsize')\n", + "\n", + " if xyticks==False:\n", + " #Do not show categories if xyticks is False\n", + " categories=False\n", + 
"\n", + " # MAKE THE HEATMAP VISUALIZATION\n", + " plt.figure(figsize=figsize)\n", + " sns.heatmap(cf,annot=box_labels,fmt=\"\",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)\n", + "\n", + " if xyplotlabels:\n", + " plt.ylabel('True label')\n", + " plt.xlabel('Predicted label' + stats_text)\n", + " else:\n", + " plt.xlabel(stats_text)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8J-sTUfTTdLi" + }, + "source": [ + "#### Function for GridSearchCV" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ap3WMXqDthu9" + }, + "source": [ + "from time import time\n", + "\n", + "def GridSearchOptimizer(modelo, ml_Opt, d_hiperparametros, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas):\n", + " ml_GridSearchCV = GridSearchCV(modelo, d_hiperparametros, cv = i_CV, n_jobs = -1, verbose= 10, scoring = 'accuracy')\n", + " start = time()\n", + " ml_GridSearchCV.fit(X_treinamento, y_treinamento)\n", + " tempo_elapsed = time()-start\n", + " print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos.\")\n", + "\n", + " # Hyperparameters that optimize the classification:\n", + " print(f'\\nHiperparâmetros otimizados: {ml_GridSearchCV.best_params_}')\n", + " \n", + " if ml_Opt == 'ml_DT2':\n", + " print(f'\\nDecisionTreeClassifier *********************************************************************************************************')\n", + " ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_leaf_nodes= ml_GridSearchCV.best_params_['max_leaf_nodes'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_RF2':\n", + " print(f'\\nRandomForestClassifier 
*********************************************************************************************************')\n", + " ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'],\n", + " max_features= ml_GridSearchCV.best_params_['max_features'],\n", + " min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n", + " min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n", + " n_estimators= ml_GridSearchCV.best_params_['n_estimators'],\n", + " random_state= i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_AB2':\n", + " print(f'\\nAdaBoostClassifier *********************************************************************************************************')\n", + " ml_Opt = AdaBoostClassifier(algorithm='SAMME.R', \n", + " base_estimator=RandomForestClassifier(bootstrap = False, \n", + " max_depth = 10, \n", + " max_features = 'auto', \n", + " min_samples_leaf = 1, \n", + " min_samples_split = 2, \n", + " n_estimators = 400), \n", + " learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + " n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + " random_state = i_Seed)\n", + " \n", + " elif ml_Opt == 'ml_GB2':\n", + " print(f'\\nGradientBoostingClassifier *********************************************************************************************************')\n", + " ml_Opt = GradientBoostingClassifier(learning_rate = ml_GridSearchCV.best_params_['learning_rate'], \n", + " n_estimators = ml_GridSearchCV.best_params_['n_estimators'], \n", + " max_depth = ml_GridSearchCV.best_params_['max_depth'], \n", + " min_samples_split = ml_GridSearchCV.best_params_['min_samples_split'], \n", + " min_samples_leaf = ml_GridSearchCV.best_params_['min_samples_leaf'], \n", + " max_features = ml_GridSearchCV.best_params_['max_features'])\n", + " \n", + " elif ml_Opt == 'ml_XGB2':\n", + " print(f'\\nXGBoostingClassifier 
*********************************************************************************************************')\n", + " ml_Opt = XGBClassifier(learning_rate= ml_GridSearchCV.best_params_['learning_rate'], \n", + " max_depth= ml_GridSearchCV.best_params_['max_depth'], \n", + " colsample_bytree= ml_GridSearchCV.best_params_['colsample_bytree'], \n", + " subsample= ml_GridSearchCV.best_params_['subsample'], \n", + " gamma= ml_GridSearchCV.best_params_['gamma'], \n", + " min_child_weight= ml_GridSearchCV.best_params_['min_child_weight'])\n", + " \n", + " # Retrain using the optimized hyperparameters...\n", + " ml_Opt.fit(X_treinamento, y_treinamento)\n", + "\n", + " # Cross-validation with i_CV folds\n", + " print(f'\\n********* CROSS-VALIDATION ***********')\n", + " a_scores_CV = funcao_cross_val_score(ml_Opt, X_treinamento, y_treinamento, i_CV)\n", + "\n", + " # Make predictions with the optimized hyperparameters...\n", + " y_pred = ml_Opt.predict(X_teste)\n", + " \n", + " # Importance of the COLUMNS\n", + " print(f'\\n********* IMPORTÂNCIA DAS COLUNAS ***********')\n", + " df_importancia_variaveis = pd.DataFrame(zip(l_colunas, ml_Opt.feature_importances_), columns= ['coluna', 'importancia'])\n", + " df_importancia_variaveis = df_importancia_variaveis.sort_values(by= ['importancia'], ascending=False)\n", + " print(df_importancia_variaveis)\n", + "\n", + " # Confusion matrix\n", + " print(f'\\n********* CONFUSION MATRIX - PARAMETER TUNNING ***********')\n", + " cf_matrix = confusion_matrix(y_teste, y_pred)\n", + " cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + " cf_categories = ['Zero', 'One']\n", + " mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)\n", + "\n", + " return ml_Opt, ml_GridSearchCV.best_params_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YMnQn2XgT7Mg" + }, + "source": [ + 
"#### Function to select the relevant COLUMNS of the dataframes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fsnHcaeLUDFS" + }, + "source": [ + "from sklearn.feature_selection import SelectFromModel\n", + "\n", + "def seleciona_colunas_relevantes(modelo, X_treinamento, X_teste, threshold = 0.05):\n", + " # Create a selector that keeps the COLUMNS with importance >= threshold\n", + " sfm = SelectFromModel(modelo, threshold = threshold)\n", + " \n", + " # Fit the selector (note: y_treinamento is read from the notebook's global scope)\n", + " sfm.fit(X_treinamento, y_treinamento)\n", + "\n", + " # Show the indices of the most important COLUMNS\n", + " print(f'\\n********** COLUNAS Relevantes ******')\n", + " print(sfm.get_support(indices=True))\n", + "\n", + " # Keep only the relevant COLUMNS\n", + " X_treinamento_I = sfm.transform(X_treinamento)\n", + " X_teste_I = sfm.transform(X_teste)\n", + " return X_treinamento_I, X_teste_I " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gd98JFSGUV5n" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3e0m7lEnYOV9" + }, + "source": [ + "### Function to compute and plot the importance of the columns/variables/features\n", + "* Source: [Plotting Feature Importances](https://www.kaggle.com/grfiv4/plotting-feature-importances)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fjco0HnNYr-N" + }, + "source": [ + "def mostra_feature_importances(clf, X_treinamento, y_treinamento=None, \n", + " top_n=10, figsize=(8,8), print_table=False, title=\"Feature Importances\"):\n", + " '''\n", + " plot feature importances of a tree-based sklearn estimator\n", + " \n", + " Note: X_treinamento and y_treinamento are pandas DataFrames\n", + " \n", + " Note: Scikit-plot is a lovely package but I sometimes have issues\n", + " 1. flexibility/extendibility\n", + " 2. 
complicated models/datasets\n", + " But for many situations Scikit-plot is the way to go\n", + " see https://scikit-plot.readthedocs.io/en/latest/Quickstart.html\n", + " \n", + " Parameters\n", + " ----------\n", + " clf (sklearn estimator) if not fitted, this routine will fit it\n", + " \n", + " X_treinamento (pandas DataFrame)\n", + " \n", + " y_treinamento (pandas DataFrame) optional\n", + " required only if clf has not already been fitted \n", + " \n", + " top_n (int) Plot the top_n most-important features\n", + " Default: 10\n", + " \n", + " figsize ((int,int)) The physical size of the plot\n", + " Default: (8,8)\n", + " \n", + " print_table (boolean) If True, print out the table of feature importances\n", + " Default: False\n", + " \n", + " Returns\n", + " -------\n", + " the pandas dataframe with the features and their importance\n", + " \n", + " Author\n", + " ------\n", + " George Fisher\n", + " '''\n", + " \n", + " __name__ = \"mostra_feature_importances\"\n", + " \n", + " import pandas as pd\n", + " import numpy as np\n", + " import matplotlib.pyplot as plt\n", + " \n", + " from xgboost.core import XGBoostError\n", + " from lightgbm.sklearn import LightGBMError\n", + " \n", + " try: \n", + " if not hasattr(clf, 'feature_importances_'):\n", + " clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n", + "\n", + " if not hasattr(clf, 'feature_importances_'):\n", + " raise AttributeError(\"{} does not have feature_importances_ attribute\".\n", + " format(clf.__class__.__name__))\n", + " \n", + " except (XGBoostError, LightGBMError, ValueError):\n", + " clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n", + " \n", + " feat_imp = pd.DataFrame({'importance':clf.feature_importances_}) \n", + " feat_imp['feature'] = X_treinamento.columns\n", + " feat_imp.sort_values(by ='importance', ascending = False, inplace = True)\n", + " feat_imp = feat_imp.iloc[:top_n]\n", + " \n", + " feat_imp.sort_values(by='importance', inplace = True)\n", + " feat_imp 
= feat_imp.set_index('feature', drop = True)\n", + " feat_imp.plot.barh(title=title, figsize=figsize)\n", + " plt.xlabel('Feature Importance Score')\n", + " plt.show()\n", + " \n", + " if print_table:\n", + " from IPython.display import display\n", + " print(\"Top {} features in descending order of importance\".format(top_n))\n", + " display(feat_imp.sort_values(by = 'importance', ascending = False))\n", + " \n", + " return feat_imp" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rsH9dMxazWCg" + }, + "source": [ + "# **EXAMPLE DATAFRAME USED IN THIS TUTORIAL**\n", + "> Generate a dataset with 18 columns: 9 informative, 6 redundant, and 3 repeated:\n", + "\n", + "To learn more about generating example (toy) dataframes, see [Synthetic data generation — a must-have skill for new data scientists](https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GEyDo_EIV_jV" + }, + "source": [ + "## Define global variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TdwgpZ76WFaT" + }, + "source": [ + "i_CV = 10 # Number of cross-validation folds\n", + "i_Seed = 20111974 # seed, for reproducibility\n", + "f_Test_Size = 0.3 # Proportion of the data held out for validation (other common values are 0.15, 0.20 or 0.25)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gJTJfpwWzykS" + }, + "source": [ + "from sklearn.datasets import make_classification\n", + "\n", + "X, y = make_classification(n_samples = 1000, \n", + " n_features = 18, \n", + " n_informative = 9, \n", + " n_redundant = 6, \n", + " n_repeated = 3, \n", + " n_classes = 2, \n", + " n_clusters_per_class = 1, \n", + " random_state=i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"gWy2IZh3s-o3", + "outputId": "e72d1b18-fca3-4352-9ea3-0009377505b3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 0.06844089, 4.21184154, -2.5583024 , ..., -0.63061895,\n", + " -0.97831983, -0.88826977],\n", + " [-4.8240213 , 0.17950903, -2.98447332, ..., 0.33992045,\n", + " 1.89153784, -6.10967565],\n", + " [ 1.38953042, -0.226476 , 1.8774004 , ..., -1.47784549,\n", + " 0.96052606, 2.06020368],\n", + " ...,\n", + " [ 1.62548685, 0.43377848, 4.93537285, ..., -4.61990917,\n", + " 0.18310709, 6.16040231],\n", + " [-2.40619087, -1.65474635, 2.64196493, ..., -1.21427845,\n", + " 0.83745861, 0.8254619 ],\n", + " [-4.00041881, 2.52475556, -4.15290177, ..., -0.51680266,\n", + " 1.72224835, -5.59558306]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ccjhGnzxtAaV", + "outputId": "82144bbd-78c7-4bc5-b0e1-9457ca30f421", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y[0:30] # Semelhante aos casos de fraude: {0, 1}" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHO2befKJxR3" + }, + "source": [ + "___\n", + "# **DECISION TREE**\n", + "> Decision Trees possuem estrutura em forma de árvores.\n", + "\n", + "![DecisionTree](https://github.com/MathMachado/Materials/blob/master/DecisionTree.PNG?raw=true)\n", + "\n", + "Fonte: [Um tutorial completo sobre modelagem baseada em árvores de decisão (códigos R e 
Python)](https://www.vooo.pro/insights/um-tutorial-completo-sobre-a-modelagem-baseada-em-tree-arvore-do-zero-em-r-python/)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J4fRargyX_1M" + }, + "source": [ + "* **Main advantages**:\n", + " * Easy to understand, visualize, and interpret;\n", + " * Capture non-linear patterns in the data easily;\n", + " * Require little computing power --> training a Decision Tree is computationally cheap!\n", + " * Handle numerical and categorical COLUMNS well (in scikit-learn, categorical columns must first be encoded as numbers);\n", + " * Do not require the data to be normalized;\n", + " * Can be used for Feature Engineering when dealing with Missing Values;\n", + " * Can be used for Feature Selection;\n", + " * Make no assumptions about the distribution of the data, because the algorithm is non-parametric.\n", + "\n", + "* **Main disadvantages**\n", + " * Prone to Overfitting, because Decision Trees can grow complex trees that do not generalize well. Things get much worse if the training sample contains outliers. Therefore, **treating outliers beforehand is strongly recommended**.\n", + " * Can produce biased trees when the dataframe is imbalanced or one class is dominant. For that reason, **balancing the dataframe beforehand is recommended**.\n", + "\n", + "* **Main split criteria (values of the `criterion` hyperparameter)**\n", + " * **Gini Index** - a metric that measures how often a randomly selected point/observation would be incorrectly classified.\n", + " * So the lower the Gini Index, the better the COLUMN;\n", + " * **Entropy** - a metric that measures the randomness of the information in the data.\n", + " * So the higher the entropy of a COLUMN, the less it helps us reach a conclusion (classify, for example).\n", + "\n", + "## **References**:\n", + "* [1.10. 
Decision Trees](https://scikit-learn.org/stable/modules/tree.html).\n", + "* [Decision Tree Algorithm With Hands On Example](https://medium.com/datadriveninvestor/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38) - a great tutorial for learning, understanding, interpreting, and computing the Gini and entropy indexes.\n", + "* [Intuitive Guide to Understanding Decision Trees](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-decision-trees-adb2165ccab7) - a great tutorial for learning, understanding, interpreting, and computing the Gini and entropy indexes.\n", + "* [The Complete Guide to Decision Trees](https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14)\n", + "* [Creating and Visualizing Decision Tree Algorithm in Machine Learning Using Sklearn](https://intellipaat.com/blog/decision-tree-algorithm-in-machine-learning/) - Very didactic!\n", + "* [Decision Trees in Machine Learning](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FrMkPN5aLp0Y" + }, + "source": [ + "## Load the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FVU1CM0PKgO4" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "15clh4XrISpz" + }, + "source": [ + "## Load/Read the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UMPL46w2IWJw" + }, + "source": [ + "l_colunas = ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18']\n", + "\n", + "df_X = pd.DataFrame(X, columns = l_colunas)\n", + "df_y = pd.DataFrame(y, columns = ['target'])" + ], + "execution_count": 
null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFaQF2MGFl_M", + "outputId": "8fbd0af4-559f-49fc-d9cf-4e39a64b9b30", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_X.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18
00.0684414.211842-2.5583023.665482-3.8351583.4998512.4908563.6654820.2451170.8671722.8655460.493956-5.1485962.8655463.499851-0.630619-0.978320-0.888270
1-4.8240210.179509-2.9844731.033618-3.8934263.428734-3.3346051.033618-0.882780-0.7532811.441522-1.395514-4.0028801.4415223.4287340.3399201.891538-6.109676
21.389530-0.2264761.8774002.7134264.6302570.516455-3.7430272.7134261.2840392.030797-1.0955361.560159-1.014211-1.0955360.516455-1.4778450.9605262.060204
31.1458092.2559460.2073644.6658172.2946786.5013060.9647704.6658170.1194103.1963541.8947873.519138-4.7578071.8947876.501306-3.7890290.5794911.397106
4-0.9366463.697163-3.3636173.805126-1.7544304.9543460.4066053.805126-0.8247381.3825911.665704-0.649758-3.5130361.6657044.9543460.2570520.904244-3.071354
\n", + "
" + ], + "text/plain": [ + " v1 v2 v3 ... v16 v17 v18\n", + "0 0.068441 4.211842 -2.558302 ... -0.630619 -0.978320 -0.888270\n", + "1 -4.824021 0.179509 -2.984473 ... 0.339920 1.891538 -6.109676\n", + "2 1.389530 -0.226476 1.877400 ... -1.477845 0.960526 2.060204\n", + "3 1.145809 2.255946 0.207364 ... -3.789029 0.579491 1.397106\n", + "4 -0.936646 3.697163 -3.363617 ... 0.257052 0.904244 -3.071354\n", + "\n", + "[5 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s-ibdD2ZG7tm", + "outputId": "f85b02f9-1289-4347-ecf5-d0b2ba73a845", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_X.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1000, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f9cqRaywa_TR", + "outputId": "db612d0d-7ec6-495c-ed69-4b48d653fd84", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "set(df_y['target'])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{0, 1}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BN6jbpn6Iwmu" + }, + "source": [ + "## Estatísticas Descritivas básicas do dataframe - df.describe()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KlwhxxUNIyYs", + "outputId": "8d06dab9-86f2-447a-ad02-4e03a714cad1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 304 + } + }, + "source": [ + "df_X.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18
count1000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.0000001000.000000
mean-0.0851591.0342270.6574081.4053170.6872791.1315600.1080531.4053171.0070231.0488010.0792480.001650-0.3654380.0792481.131560-0.0277510.9846060.633624
std2.0022471.6315073.6087722.2568574.0195984.4818321.9813072.2568571.8632881.6439001.9492731.9326414.1606681.9492734.4818322.0654551.8505933.552991
min-6.944169-4.620754-16.300139-6.235192-12.454256-14.305401-6.152747-6.235192-5.484992-3.293216-7.135349-5.705500-9.120941-7.135349-14.305401-6.009023-5.035184-11.439074
25%-1.305566-0.089052-1.623657-0.152888-1.854645-1.684751-1.216983-0.152888-0.240908-0.012710-1.209675-1.292162-3.555363-1.209675-1.684751-1.436673-0.261610-1.691346
50%0.0525230.9941500.5738491.4499310.8123641.2815040.1670911.4499311.0661251.0128990.1803440.035237-0.9666380.1803441.281504-0.0001900.9757930.844784
75%1.3838532.0719953.0385862.8871413.4139524.0081031.4387192.8871412.2881882.1872021.4391991.3153422.7458061.4391994.0081031.3653692.2565043.109330
max4.9971727.35486011.7201658.49456612.84441815.9998036.2935508.4945668.1465596.5231806.2524485.53821611.2593506.25244815.9998036.5315617.64680212.090528
\n", + "
" + ], + "text/plain": [ + " v1 v2 ... v17 v18\n", + "count 1000.000000 1000.000000 ... 1000.000000 1000.000000\n", + "mean -0.085159 1.034227 ... 0.984606 0.633624\n", + "std 2.002247 1.631507 ... 1.850593 3.552991\n", + "min -6.944169 -4.620754 ... -5.035184 -11.439074\n", + "25% -1.305566 -0.089052 ... -0.261610 -1.691346\n", + "50% 0.052523 0.994150 ... 0.975793 0.844784\n", + "75% 1.383853 2.071995 ... 2.256504 3.109330\n", + "max 4.997172 7.354860 ... 7.646802 12.090528\n", + "\n", + "[8 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N_QhFqyZOKFB" + }, + "source": [ + "## Selecionar as amostras de treinamento e validação\n", + "\n", + "* Dividir os dados/amostras em:\n", + " * **Amostra de treinamento**: usado para treinar o modelo e otimizar os hiperparâmetros;\n", + " * **Amostra de teste**: usado para verificar se o modelo otimizado funciona em dados totalmente desconhecidos. É nesta amostra de teste que avaliamos a performance do modelo em termos de generalização (trabalhar com dados que não lhe foi apresentado);\n", + "* **Técnica de Hold-Out**: Separar/dividir os dados em amostra de treinamento e teste. Geralmente usamos 70% da amostra para treinamento e 30% validação. Outras opções são usar os percentuais 80/20 ou 75/25 (default).\n", + " * **Desvatangem do Hold-Out**: Variância nos dados pode comprometer performance do modelo quando, por exemplo, amostra de treinamento é semelhante amostra de teste. 
\n", + "* Consulte [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) para mais detalhes.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8sKBgs-QOOfn" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TPTKBBHgOpoA", + "outputId": "2948cf55-c38b-499e-f584-ba102fb84d5c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 23 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lEn_LLs2OtRI", + "outputId": "98cf2d71-40b5-474c-e30d-cfe7468f8920", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_treinamento.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(700, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_uAw8EcyOvrG", + "outputId": "ada5d946-e73d-4fcd-e4da-5287b463a875", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_teste.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 18)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A2LYI-9hOyXI", + "outputId": "06ef3a8f-929f-42cf-d538-bcefa254fac0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": 
[ + "y_teste.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(300, 1)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 26 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "npgoBSX2dd4l" + }, + "source": [ + "## Treinar o algoritmo com os dados de treinamento\n", + "### Carregar os algoritmos/libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcvzrtolGfnQ", + "outputId": "7994c7f4-c644-4ffe-ea58-d27f522e9576", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "!pip install graphviz\n", + "!pip install pydotplus" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (0.10.1)\n", + "Requirement already satisfied: pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)\n", + "Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from pydotplus) (2.4.7)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_pF-HH3JKL2" + }, + "source": [ + "from sklearn.metrics import accuracy_score # para medir a acurácia do modelo preditivo\n", + "#from sklearn.model_selection import train_test_split\n", + "#from sklearn.metrics import classification_report\n", + "from sklearn.metrics import confusion_matrix # para plotar a confusion matrix\n", + "\n", + "from sklearn.model_selection import GridSearchCV # para otimizar os hiperparâmetros dos modelos preditivos\n", + "from sklearn.model_selection import cross_val_score # Para o CV (Cross-Validation)\n", + "from sklearn.model_selection import cross_validate\n", + "\n", + "from time import time\n", + "from operator import itemgetter\n", + "from scipy.stats import randint\n", + "\n", + "from sklearn.tree import export_graphviz\n", + "from sklearn.externals.six import StringIO \n", + 
"from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "np.set_printoptions(suppress=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YJMS9ePQ6B6t" + }, + "source": [ + "**Atenção**: Para evitar overfitting nos algoritmos DecisionTreeClassifier, considere min_samples_split = 2 como default." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nNeRHYePJc-r" + }, + "source": [ + "from sklearn.tree import DecisionTreeClassifier # Library para Decision Tree (Classificação)\n", + "\n", + "# Instancia (configuração do Decision Trees) com os hiperparâmetros sugeridos para se evitar overfitting:\n", + "ml_DT = DecisionTreeClassifier(criterion = 'gini', \n", + " splitter = 'best', \n", + " max_depth = None, \n", + " min_samples_split = 2, \n", + " min_samples_leaf = 1, \n", + " min_weight_fraction_leaf = 0.0, \n", + " max_features = None, \n", + " random_state = i_Seed, \n", + " max_leaf_nodes = None, \n", + " min_impurity_decrease = 0.0, \n", + " min_impurity_split = None, \n", + " class_weight = None, \n", + " presort = False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gVLZznprx2YX", + "outputId": "6266f184-5949-4f2d-eb76-7215f87f1b3a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Objeto/classificador configurado\n", + "ml_DT" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "8CC24H-JHhlj" + }, + "source": [ + "### Treina o algoritmo: fit(df)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OgAHfXVo-Nw8", + "outputId": "a2c20323-1ef2-4491-fb19-4b96c370e04b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "ml_DT.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", + " max_depth=None, max_features=None, max_leaf_nodes=None,\n", + " min_impurity_decrease=0.0, min_impurity_split=None,\n", + " min_samples_leaf=1, min_samples_split=2,\n", + " min_weight_fraction_leaf=0.0, presort=False,\n", + " random_state=20111974, splitter='best')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CNiRjmrRHVnx" + }, + "source": [ + "### Valida o modelo com a amostra de treinamento" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2GMCSs89HquJ", + "outputId": "cde08303-2def-4401-ecf7-dd3c45c393c1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "ml_DT.score(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.94" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 32 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bmv9YZobIer4" + }, + "source": [ + "### Calcula as predições usando o modelo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2YufZaRNIkFL" + }, + "source": [ + "y_pred = ml_DT.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fYvMN-tvIX-p" + }, + "source": [ + "### Matriz de Confusão" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"9iTK6pBwIZ3F", + "outputId": "9345ea23-6691-446f-dead-d5c893bcc6c8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jOnkFBcEIVAb" + }, + "source": [ + "### Volte ao início, extraia nova amostra e calcule a acurácia\n", + "* Observou que a acurácia mudou? Isso acontece porque extraimos uma nova amostra de treinamento.\n", + "* Quais os inconvenientes de termos uma métrica diferente para cada amostra do modelo preditivo?\n", + "* Como reportar os resultados do seu modelo?\n", + "* Como se assegurar acerca do valor mais ideal da métrica?\n", + " * use a Estatística a seu favor! --> Use Cross-Validation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MkBSvyorGXQz" + }, + "source": [ + "___\n", + "# **CROSS-VALIDATION**\n", + "* K-fold é o método de Cross-Validation (CV) mais conhecido e utilizado;\n", + "* Como funciona: divide o dataframe de treinamento em k partes (cada parte é um fold);\n", + " * Usa k-1 partes para treinar o modelo e o restante para validar o modelo;\n", + " * O processo é repetido k vezes, sendo que em cada iteração calcula as métricas desejadas (exemplo: acurácia);\n", + " * Desta forma o modelo é treinado e testado com todas as partes dos dados;\n", + " * Ao final das k iterações, teremos k métricas das quais calculamos média e desvio-padrão.\n", + "\n", + " A figura abaixo nos ajuda a entender como funciona CV:\n", + "\n", + "![Cross-Validation](https://github.com/MathMachado/Materials/blob/master/CV2.PNG?raw=true)\n", + "\n", + "Source: [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + "\n", + "* **valor de k**:\n", + " * valor de k (folds): entre 5 e 10 --> Não há regra geral para a escolha de k;\n", + " * Quanto maior o valor de k --> menor o viés do CV --> Experimento Estatístico para mostrar o efeito.\n", + "\n", + "[Applied Predictive 
Modeling, 2013](https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/ref=as_li_ss_tl?ie=UTF8&qid=1520380699&sr=8-1&keywords=applied+predictive+modeling&linkCode=sl1&tag=inspiredalgor-20&linkId=1af1f3de89c11e4a7fd49de2b05e5ebf)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HscfN-a1V043" + }, + "source": [ + "* **Advantages of using CV**:\n", + " * More reliable accuracy estimates than a single train/test split;\n", + " * Better use of the data, since every observation is used for both training and validation. Any problem with the data therefore tends to surface at this stage.\n", + "\n", + "* **Further Reading**\n", + " * [Cross-Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)\n", + " * [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n", + " * [Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8x2UPwOYQPcI", + "outputId": "7b9c4045-2257-4795-a54f-3305d2e97ab2", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Cross-Validation with k = 10 folds (= 10 parts)\n", + "a_scores_CV = funcao_cross_val_score(ml_DT, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Média das Acurácias calculadas pelo CV....: 91.43\n", + "std médio das Acurácias calculadas pelo CV: 3.44\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uxoplcea0byV", + "outputId": "0acecb3b-565d-4356-c46b-4c48da8268fb", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "a_scores_CV # array with the score from each CV iteration" + ], + "execution_count": null, + "outputs": [ + { + 
"output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.9 , 0.98571429, 0.85714286, 0.92857143, 0.88571429,\n", + " 0.94285714, 0.92857143, 0.9 , 0.88571429, 0.92857143])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y3k-PcbN0o_i", + "outputId": "0d4222a7-dee5-4468-8e58-9e3dccd8def1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "a_scores_CV.mean()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9142857142857144" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6_rYker2gzeG" + }, + "source": [ + "**Interpretação**: Nosso classificador (DecisionTreeClassifier) tem uma acurácia média de 91,43% (base de treinamento). Além disso, o std é da ordem de 3,66%, ou seja, pequena. Vamos tentar melhorar a acurácia do classificador usando parameter tunning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tkwchmkP3p_A", + "outputId": "11804dfc-d593-4d00-8b6b-b37d162d0bd6", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Acurácias: [0.9 0.98571429 0.85714286 0.92857143 0.88571429 0.94285714\n", + " 0.92857143 0.9 0.88571429 0.92857143]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lQNyqHCiKRUh" + }, + "source": [ + "### Validate the model on the test sample" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Qb0ZPyvKKRUp", + "outputId": "1785679f-fa54-4fe9-82f4-e9847374f950", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "ml_DT.score(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.94" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iL2tEdbqKY5P" + }, + "source": [ + "### Predictions with the trained model\n", + "* Use the classifier (Decision Tree) to make predictions on the test sample:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sI31WkZs2ht_" + }, + "source": [ + "y_pred = ml_DT.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rfapj3OG13PG", + "outputId": "367c435b-c3b9-49a5-b78f-7c03042e7da0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_pred[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,\n", + " 1, 0, 0, 1, 1, 0, 1, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 41 + 
} + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sc88ofqh16RT", + "outputId": "05a0fb74-4512-445b-9d76-0f6589d18541", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n", + " 1, 1, 0, 1, 0, 1, 0, 1])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 42 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cecv-51TKgz-" + }, + "source": [ + "### Matriz de Confusão" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fSaVzJ9xFpwW", + "outputId": "053c1abb-ef0b-479e-a770-738d4691538f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + } + }, + "source": [ + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p8D975NqsGtj" + }, + "source": [ + "## Parameter tunning\n", + "### Referência\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - Explica didaticamente e step by step como fazer parameter tunning." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bfdq5zEhlVsk" + }, + "source": [ + "# Dicionário com hiperparâmetros para o parameter tunning:\n", + "d_hiperparametros_DT = {\"criterion\": [\"gini\", \"entropy\"], \n", + " \"min_samples_split\": [2, 5, 10, 270, 350, 400], \n", + " \"max_depth\": [None, 2, 5, 9, 15], \n", + " \"min_samples_leaf\": [20, 40, 100], \n", + " \"max_leaf_nodes\": [None, 2, 3, 15]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BtajXuuUpGwq", + "outputId": "020d97bf-4940-406e-8fb3-23c8b09018a5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "d_hiperparametros_DT" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'criterion': ['gini', 'entropy'],\n", + " 'max_depth': [None, 2, 5, 9, 15],\n", + " 'max_leaf_nodes': [None, 2, 3, 15],\n", + " 'min_samples_leaf': [20, 40, 100],\n", + " 'min_samples_split': [2, 5, 10, 270, 350, 400]}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 45 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H8gNSs0G0A-L" + }, + "source": [ + "```\n", + "grid_search = GridSearchCV(ml_DT, param_grid= d_hiperparametros_DT, cv = i_CV, n_jobs= 
-1)\n", + "start = time()\n", + "grid_search.fit(X_treinamento, y_treinamento)\n", + "tempo_elapsed= time()-start\n", + "print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos para estimar {len(grid_search.cv_results_)} modelos candidatos\")\n", + "\n", + "GridSearchCV levou 1999.12 segundos para estimar 23 modelos candidatos\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "44-BRnNjBT25", + "outputId": "e8547d4f-1eff-42e2-aa2a-9e8735f2a997", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "# Invoca a função com o modelo baseline\n", + "ml_DT2, best_params = GridSearchOptimizer(ml_DT, 'ml_DT2', d_hiperparametros_DT, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Fitting 10 folds for each of 720 candidates, totalling 7200 fits\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", + "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.5s\n", + "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 1.6s\n", + "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.6s\n", + "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 1.7s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1995s.) Setting batch_size=2.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0662s.) Setting batch_size=4.\n", + "[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 1.8s\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0940s.) Setting batch_size=8.\n", + "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1391s.) 
Setting batch_size=16.\n", + "[Parallel(n_jobs=-1)]: Done 58 tasks | elapsed: 2.0s\n", + "[Parallel(n_jobs=-1)]: Done 186 tasks | elapsed: 2.9s\n", + "[Parallel(n_jobs=-1)]: Done 330 tasks | elapsed: 3.8s\n", + "[Parallel(n_jobs=-1)]: Done 506 tasks | elapsed: 4.9s\n", + "[Parallel(n_jobs=-1)]: Done 682 tasks | elapsed: 6.0s\n", + "[Parallel(n_jobs=-1)]: Done 890 tasks | elapsed: 7.3s\n", + "[Parallel(n_jobs=-1)]: Done 1098 tasks | elapsed: 8.5s\n", + "[Parallel(n_jobs=-1)]: Done 1338 tasks | elapsed: 10.0s\n", + "[Parallel(n_jobs=-1)]: Done 1578 tasks | elapsed: 11.5s\n", + "[Parallel(n_jobs=-1)]: Done 1850 tasks | elapsed: 13.2s\n", + "[Parallel(n_jobs=-1)]: Done 2122 tasks | elapsed: 15.0s\n", + "[Parallel(n_jobs=-1)]: Done 2426 tasks | elapsed: 16.7s\n", + "[Parallel(n_jobs=-1)]: Done 2730 tasks | elapsed: 18.5s\n", + "[Parallel(n_jobs=-1)]: Done 3066 tasks | elapsed: 20.4s\n", + "[Parallel(n_jobs=-1)]: Done 3402 tasks | elapsed: 22.3s\n", + "[Parallel(n_jobs=-1)]: Done 3770 tasks | elapsed: 24.8s\n", + "[Parallel(n_jobs=-1)]: Done 4138 tasks | elapsed: 27.4s\n", + "[Parallel(n_jobs=-1)]: Done 4538 tasks | elapsed: 30.2s\n", + "[Parallel(n_jobs=-1)]: Done 4938 tasks | elapsed: 32.9s\n", + "[Parallel(n_jobs=-1)]: Done 5370 tasks | elapsed: 35.8s\n", + "[Parallel(n_jobs=-1)]: Done 5802 tasks | elapsed: 38.9s\n", + "[Parallel(n_jobs=-1)]: Done 6266 tasks | elapsed: 42.2s\n", + "[Parallel(n_jobs=-1)]: Done 6730 tasks | elapsed: 45.5s\n", + "[Parallel(n_jobs=-1)]: Done 7181 tasks | elapsed: 48.8s\n", + "[Parallel(n_jobs=-1)]: Done 7200 out of 7200 | elapsed: 49.0s finished\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "\n", + "GridSearchCV levou 49.12 segundos.\n", + "\n", + "Hiperparâmetros otimizados: {'criterion': 'entropy', 'max_depth': 5, 'max_leaf_nodes': None, 'min_samples_leaf': 20, 'min_samples_split': 2}\n", + "\n", + "DecisionTreeClassifier 
*********************************************************************************************************\n", + "\n", + "********* CROSS-VALIDATION ***********\n", + "Média das Acurácias calculadas pelo CV....: 91.14\n", + "std médio das Acurácias calculadas pelo CV: 3.25\n", + "\n", + "********* IMPORTÂNCIA DAS COLUNAS ***********\n", + " coluna importancia\n", + "12 v13 0.552335\n", + "0 v1 0.116321\n", + "6 v7 0.074047\n", + "9 v10 0.069885\n", + "11 v12 0.058293\n", + "1 v2 0.040086\n", + "8 v9 0.028917\n", + "16 v17 0.020542\n", + "2 v3 0.017079\n", + "7 v8 0.011940\n", + "13 v14 0.010556\n", + "4 v5 0.000000\n", + "5 v6 0.000000\n", + "3 v4 0.000000\n", + "10 v11 0.000000\n", + "14 v15 0.000000\n", + "15 v16 0.000000\n", + "17 v18 0.000000\n", + "\n", + "********* CONFUSION MATRIX - PARAMETER TUNNING ***********\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gmCkjGjPJMLr" + }, + "source": [ + "### Visualize the result" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cIc3ZgaISEd0", + "outputId": "2e2b5d3c-2aea-4c2f-ec7a-1d37b4f41eb7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 623 + } + }, + "source": [ + "from io import StringIO # sklearn.externals.six was removed from recent scikit-learn releases\n", + "from sklearn.tree import export_graphviz\n", + "from IPython.display import Image\n", + "import pydotplus\n", + "\n", + "# Export the tuned tree to DOT format in memory, then render it as a PNG\n", + "dot_data = StringIO()\n", + "export_graphviz(ml_DT2, out_file = dot_data, filled = True, rounded = True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n", + "\n", + "graph = pydotplus.graph_from_dot_data(dot_data.getvalue())\n", + "graph.write_png('DecisionTree.png')\n", + "Image(graph.create_png())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 47 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e1R2GBkbnV37" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ukMLoEr7nbUf", + "outputId": "4fc221e2-0c35-4669-ddf4-7cba982106c7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_treinamento_DT, X_teste_DT = seleciona_colunas_relevantes(ml_DT2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "********** COLUNAS Relevantes ******\n", + "[ 0 6 9 11 12]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xSDB20yWOrH1" + }, + "source": [ + "### Variable importance via modelo.feature_importances_" + ] + }, + { + "cell_type": "code", 
+ "metadata": { + "id": "WJbGcEeqOFRQ", + "outputId": "ce97e5d9-f6ec-48f0-f214-e01f5b24c471", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 162 + } + }, + "source": [ + "ml_DT2.feature_importances_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8JjePRQAoqkk" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gt3aCPpfKRxm" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq6uCVtzovMt" + }, + "source": [ + "# Train using the relevant COLUMNS...\n", + "ml_DT2.fit(X_treinamento_DT, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M2h3EpinRD5Q" + }, + "source": [ + "# Cross-validation with 10 folds, on the relevant COLUMNS only\n", + "a_scores_CV = funcao_cross_val_score(ml_DT2, X_treinamento_DT, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "znWy3LE1q-Z3" + }, + "source": [ + "ml_DT3, best_params2 = GridSearchOptimizer(ml_DT2, 'ml_DT3', d_hiperparametros_DT, X_treinamento_DT, y_treinamento, X_teste_DT, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "6IhCC6pfq-jL" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qw6Dk3kesT0q" + }, + "source": [ + "best_params2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YFoK1ZGrRHf3" + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_DT3, X_treinamento_DT, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MZ1-vGRcxJoN" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ig9GiUAEw9jr" + }, + "source": [ + "y_pred_DT = ml_DT2.predict(X_teste_DT)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7UZz4UzHDqae" + }, + "source": [ + "# Compute accuracy\n", + "accuracy_score(y_teste, y_pred_DT)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3EUMAxxKBur" + }, + "source": [ + "___\n", + "# **RANDOM FOREST**\n", + "* Decision Trees have a tree-shaped structure.\n", + "* Random Forest can be used both for classification (RandomForestClassifier) and for regression (RandomForestRegressor);\n", + "* The nodes of each tree are created from the variables of the dataframe;\n", + "\n", + "* **Advantages**:\n", + " * Does not require much data preprocessing;\n", + " * Handles categorical and numerical COLUMNS well (in scikit-learn, categorical COLUMNS must be encoded as numbers first);\n", + " * Gives good results on many kinds of problems;\n", + " * Each tree is trained independently on a random sample of the data, and the forest combines their votes to produce better classifications (learning from the previous model's errors is what Boosting does);\n", + " * An ensemble is the combination of different predictive models;\n", + " * It makes the algorithms/results more robust and more complex, leading to a higher computational cost which is usually accompanied 
by better results.\n", + " * More robust than a single Decision Tree. **Why?**\n", + " * Automatically controls overfitting (**why?**) and often produces very robust, high-performance models.\n", + " * Can be used for **Feature Selection**, since it produces the feature importance matrix (importance sample). \n", + " * The variable importances sum to 1 (that is, 100%);\n", + " * Like Decision Trees, these models easily capture non-linear patterns present in the data;\n", + " * Does not require normalized data;\n", + " * Handles Missing Values well (note: scikit-learn's implementation only accepts NaN natively in recent versions; older versions require imputation first);\n", + " * Requires no assumptions about the distribution of the data, because of the non-parametric nature of the algorithm;\n", + "\n", + "* **Disadvantages/Cautions**\n", + " * **Balancing the dataframe beforehand is recommended**.\n", + "\n", + "* **Main Hyperparameters**\n", + "\n", + "## **References**:\n", + "* [Running Random Forests? Inspect the feature importances with this code](https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e)\n", + "* [Feature importances with forests of trees](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)\n", + "* [Understanding Random Forests Classifiers in Python](https://www.datacamp.com/community/tutorials/random-forests-classifier-python)\n", + "* [Understanding Random Forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)\n", + "* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76)\n", + "* [Random Forest Simple Explanation](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)\n", + "* [Random Forest Explained](https://www.youtube.com/watch?v=eM4uJ6XGnSM)\n", + "* [Hyperparameter Tuning the Random Forest in 
Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) - Explains the main hyperparameters of Random Forest." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CMQt5wiw1tt8" + }, + "source": [ + "### How does it work?\n", + "\n", + "The algorithm has 4 steps:\n", + "1. Randomly select a subset of the features;\n", + "2. Select the most suitable feature for the root node;\n", + "3. Generate the child nodes;\n", + "4. Repeat the steps above until the desired number of trees is reached.\n", + "\n", + "**Note**: Once the model has been built, predictions are made by “voting” among the trees. The decision with the most votes is the answer of the algorithm." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VLGqtjs42zkN" + }, + "source": [ + "![DecisionTree](https://github.com/MathMachado/Materials/blob/master/DecisionTree.PNG?raw=true)\n", + "\n", + "Source: [Um tutorial completo sobre modelagem baseada em árvores de decisão (códigos R e Python)](https://www.vooo.pro/insights/um-tutorial-completo-sobre-a-modelagem-baseada-em-tree-arvore-do-zero-em-r-python/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r0s2vixBzFAR" + }, + "source": [ + "### Main hyperparameters\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) - Explains the main hyperparameters of Random Forest.\n", + "\n", + " * n_estimators: specifies the number of trees. The default value is 100 in current scikit-learn releases (it was 10 in older ones), which means that 100 different DecisionTree models will be fitted;\n", + " * max_depth: specifies the maximum depth of each tree. 
The default value is None, which means that each tree expands until every leaf is pure or holds fewer than min_samples_split samples. A pure leaf is one in which all the data in the leaf come from the same class;\n", + " * min_samples_split: specifies the minimum number of samples required to split an internal node. The default value is 2, which means that a node must hold at least two samples before it can be split;\n", + " * min_samples_leaf: specifies the minimum number of samples required at a leaf node. The default value is 1, which means that every leaf must hold at least 1 sample." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rp1e0atD_gXN" + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KxIGuaRZ_ivJ" + }, + "source": [ + "X_treinamento.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cnfDw_GEKBuu" + }, + "source": [ + "# Load the library:\n", + "from sklearn.ensemble import RandomForestClassifier # For classification problems\n", + "\n", + "# Instantiate...\n", + "# max_features = \"sqrt\" replaces \"auto\", which meant the same thing for classifiers and was removed from recent scikit-learn releases\n", + "ml_RF = RandomForestClassifier(n_estimators = 100, min_samples_split = 2, max_features = \"sqrt\", random_state = i_Seed)\n", + "\n", + "# Train...\n", + "ml_RF.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E25BIxM0RTzs" + }, + "source": [ + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_RF, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FDsj9bJ0BCk6" + }, + "source": [ + "### Evaluate the distribution of the response variable: $y$\n", + "* Is the minority class much smaller than the majority class?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SSDy4vwoBx8H" + }, + "source": [ + "X_treinamento.shape[0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "68-p5HaHARqY" + }, + "source": [ + "y_treinamento['target'].value_counts(normalize = True) * 100 # Class distribution in %" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "23Zv-Oc-CQIj" + }, + "source": [ + "### Evaluate how many rows there are per predictor" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8F2ELvqdCXAR" + }, + "source": [ + "X_treinamento.shape[0]/X_treinamento.shape[1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x7Ql5N4gCm5u" + }, + "source": [ + "We have 38 rows for each predictor in X_treinamento. A common rule of thumb is about 50 rows/observations per predictor." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AouWUu8vANdb" + }, + "source": [ + "**Interpretation**: Our classifier (RandomForestClassifier) has a mean accuracy of 96.44% (training set). In addition, the std is around 2.77%, that is, small. Let's try to improve the accuracy of the classifier using parameter tuning (GridSearchCV)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vbducxlgAa85" + }, + "source": [ + "print(f'Acurácias: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_lxx-LUw_5sd" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_RF.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQIRO_LpGAkw" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yKLHZ5_C6FJ8" + }, + "source": [ + "## Parameter tuning\n", + "### References\n", + "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n", + "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - A didactic, step-by-step explanation of how to do parameter tuning.\n", + "* [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) - Another approach to understanding parameter tuning. 
Strongly recommended reading!\n", + "* [A Beginner’s Guide to Random Forest Hyperparameter Tuning](https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/) " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XOa9naju6FKA" + }, + "source": [ + "# Dictionary of hyperparameters for parameter tuning.\n", + "# 'auto' was dropped from max_features: for classifiers it meant the same as 'sqrt' and it was removed from recent scikit-learn releases.\n", + "d_hiperparametros_RF= {'bootstrap': [True, False],\n", + " 'max_depth': [10, 20, 30, 40, 50, 70, 90, 100, None],\n", + " 'max_features': ['sqrt'],\n", + " 'min_samples_leaf': [1, 2, 4],\n", + " 'min_samples_split': [2, 5, 10],\n", + " 'n_estimators': [200, 400, 600, 800, 1000, 1500, 2000]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "KP5oHFGjF3ii" + }, + "source": [ + "d_hiperparametros_RF" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6__f2jZaTQat" + }, + "source": [ + "# Call the function\n", + "ml_RF2, best_params = GridSearchOptimizer(ml_RF, 'ml_RF2', d_hiperparametros_RF, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "crfn-n--KG4n" + }, + "source": [ + "### Result of the Random Forest run\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 7920 out of 7920 | elapsed: 194.0min finished\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SGTOe5PaRw59" + }, + "source": [ + "# The procedure above took 194 minutes to run, so ml_RF3 is fitted below using the parameters estimated above\n", + "# ('max_features': 'auto' from the recorded run is written as 'sqrt', its equivalent for classifiers)\n", + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "\n", + "ml_RF3= 
RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2HwM8hbuRSx_" + }, + "source": [ + "ml_RF3.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WWNiy7Z0TQa3" + }, + "source": [ + "## Select the important/relevant COLUMNS\n", + "* 2 ways:\n", + " * Using the function seleciona_colunas_relevantes(), which selects based on a threshold;\n", + " * Using modelo.feature_importances_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qqXvO7wePWRg" + }, + "source": [ + "### Most important predictors via modelo.feature_importances_" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "l3m6QfTiQ0E8" + }, + "source": [ + "sum(ml_RF3.feature_importances_) # The column importances sum to 1 (that is, 100%)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "mczpm2-NRmAr" + }, + "source": [ + "ml_RF3.feature_importances_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fRPuq2g3PtmN" + }, + "source": [ + "df_importantes = pd.DataFrame(zip(X_treinamento.columns, ml_RF3.feature_importances_), columns = ['coluna', 'importancia'])\n", + "df_importantes.sort_values(by = 'importancia', ascending = False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vgxPCnphc6r_" + }, + "source": [ + "Let's select the 7 most important predictors: v13, v11, v7, v12, v14, v16 and v1." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kOi11YOKTQa4" + }, + "source": [ + "# Drop the predictors that are not important:\n", + "X_treinamento_RF = X_treinamento.drop(columns = ['v5', 'v2', 'v15', 'v3', 'v17', 'v9', 'v10', 'v6', 'v8', 'v4', 'v18'])\n", + "X_teste_RF = X_teste.drop(columns = ['v5', 'v2', 'v15', 'v3', 'v17', 'v9', 'v10', 'v6', 'v8', 'v4', 'v18'])\n", + "X_treinamento_RF.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_O7c_DTQbE" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UwEOwzSGTQbF" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Rr8qDrgvTQbL" + }, + "source": [ + "# Train with the relevant COLUMNS...\n", + "ml_RF3.fit(X_treinamento_RF, y_treinamento)\n", + "\n", + "# Cross-validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_RF3, X_treinamento_RF, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-mYfQLlsTQbQ" + }, + "source": [ + "## Validate the model using the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sSD5o1JQTQbR" + }, + "source": [ + "y_pred_RF = ml_RF3.predict(X_teste_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "wywF6LymDzKr" + }, + "source": [ + "# Compute accuracy\n", + "accuracy_score(y_teste, y_pred_RF)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hJJsL0IJb6iO" + }, + "source": [ + "## Study of how the algorithm's hyperparameters behave\n", + "> See [Optimizing Hyperparameters in Random Forest 
Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) for more details." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ytRCEWjrjEQd" + }, + "source": [ + "from sklearn.model_selection import validation_curve" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YhY94cKcjnny" + }, + "source": [ + "### Effect of n_estimators" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_1--fbI7fQR_" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name = \"n_estimators\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "# Calculate mean and standard deviation of the training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label = \"Training score\", color = \"black\")\n", + "plt.plot(param_range, test_mean, label = \"Cross-validation score\", color = \"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color = \"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color = \"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + 
"plt.xlabel(\"Number Of Trees\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc = \"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U60YOyRLjYsE" + }, + "source": [ + "### Effect of max_depth" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rv7TIM9kjsud" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name = \"max_depth\", \n", + " param_range = param_range, \n", + " cv = i_CV, \n", + " scoring = \"accuracy\", \n", + " n_jobs = -1)\n", + "\n", + "# Calculate mean and standard deviation of the training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Maximum Depth\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hyqgm6GzjUd5" + }, + "source": [ + "### Effect of min_samples_leaf" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lm_fPGYwkJYc" + }, + "source": [ + "param_range = np.arange(1, 250, 2)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name='min_samples_leaf', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation of the training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Minimum Samples Per Leaf\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Ox_t7NljNmq" + }, + "source": [ + "### Effect of 
min_samples_split" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CAqdiSaVlAB8" + }, + "source": [ + "# Float values are read by scikit-learn as a fraction of the number of samples\n", + "param_range = np.arange(0.05, 1, 0.05)\n", + "\n", + "# Calculate accuracy on training and test set using range of parameter values\n", + "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n", + " X_treinamento, \n", + " y_treinamento, \n", + " param_name='min_samples_split', \n", + " param_range=param_range,\n", + " cv = i_CV, \n", + " scoring=\"accuracy\", \n", + " n_jobs=-1)\n", + "\n", + "\n", + "# Calculate mean and standard deviation of the training set scores\n", + "train_mean = np.mean(train_a_scores_CV, axis = 1)\n", + "train_std = np.std(train_a_scores_CV, axis = 1)\n", + "\n", + "# Calculate mean and standard deviation of the test set scores\n", + "test_mean = np.mean(test_a_scores_CV, axis = 1)\n", + "test_std = np.std(test_a_scores_CV, axis = 1)\n", + "\n", + "# Plot mean accuracy scores for training and test sets\n", + "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n", + "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n", + "\n", + "# Plot accuracy bands for training and test sets\n", + "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n", + "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n", + "\n", + "# Create plot\n", + "plt.title(\"Validation Curve With Random Forest\")\n", + "plt.xlabel(\"Minimum Samples To Split (Fraction)\")\n", + "plt.ylabel(\"Accuracy Score\")\n", + "plt.tight_layout()\n", + "plt.legend(loc=\"best\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y1q5cOJHV3HI" + }, + "source": [ + "### Exercise: LabData\n", + "* https://www.kaggle.com/c/labdata-churn-challenge-2020" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NeCCAwqVVmKA" + }, + 
"source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N_1A6lC9SF53" + }, + "source": [ + "# **BOOTSTRAPPING METHODS**\n", + "> Before talking about Boosting or Bagging, we first need to understand what Bootstrap is, because Bagging is built directly on Bootstrap and several Boosting implementations also rely on resampling the training data.\n", + "\n", + "* In Statistics (and in Machine Learning), Bootstrap refers to drawing random samples WITH replacement from the observed data X." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cX_gfsbQSdNd" + }, + "source": [ + "___\n", + "# **BOOSTING MODELS**\n", + "* These algorithms are widely used in Kaggle competitions;\n", + "* They are used to improve the performance of Machine Learning algorithms;\n", + "* Models:\n", + " - [X] AdaBoost\n", + " - [X] XGBoost\n", + " - [X] LightGBM\n", + " - [X] GradientBoosting\n", + " - [X] CatBoost\n", + "\n", + "## Bagging vs Boosting\n", + "### **Bagging**\n", + "* **Goal**: reduce variance;\n", + "* **Advantages**:\n", + " * Reduces overfitting;\n", + " * Handles dataframes with many COLUMNS well (high dimensionality);\n", + " * Handles Missing Values automatically;\n", + "* **Disadvantages**:\n", + " * The final prediction is based on the average of the K Decision Trees, which can compromise the final accuracy.\n", + "\n", + "### **Boosting**\n", + "* **Goal**: improve accuracy;\n", + "* **Advantages**:\n", + " * Handles dataframes with many COLUMNS well (high dimensionality);\n", + " * Handles Missing Values automatically;\n", + "* **Disadvantages**:\n", + " * Prone to overfitting. Treating outliers beforehand is recommended.\n", + " * Requires careful tuning of the hyperparameters;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FqJV-QGFQx90" + }, + "source": [ + "#### **Bagging**: How it works\n", + "* Draws several samples **WITH REPLACEMENT** from the training dataframe. 
Each sample is used to train one Decision Tree. The result is an ensemble of many different models (Decision Trees), whose predictions are averaged to produce the final result;\n", + "* The final result is more robust than that of a single Decision Tree.\n", + "\n", + "![Bagging](https://github.com/MathMachado/Materials/blob/master/Bagging.png?raw=true)\n", + "\n", + "Source: [Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm](https://hackernoon.com/how-to-develop-a-robust-algorithm-c38e08f32201).\n", + "\n", + "#### Steps\n", + "* Suppose a dataframe X_treinamento (training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Bagging randomly draws a sample **WITH REPLACEMENT** from X_treinamento;\n", + " 2. Bagging randomly selects M2 (M2 < M) COLUMNS from the dataframe drawn in step (1);\n", + " 3. A Decision Tree is built from the M2 COLUMNS of step (2) and the dataframe from step (1), and the COLUMNS are evaluated by their ability to classify the observations;\n", + " 4. Steps (1) --> (2) --> (3) are repeated K times (that is, K Decision Trees), so that the COLUMNS are ranked by their predictive power and the final result (accuracy, for example) is obtained by aggregating the predictions of the K Decision Trees." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EysHMsU9Q6nQ" + }, + "source": [ + "___\n", + "#### **Boosting**: How it works\n", + "* The classifiers are trained sequentially, so that the classifier at step N learns from the errors of the classifier at step N-1. 
In other words, the goal is to improve accuracy at each step by learning from past mistakes.\n", + "\n", + "![Boosting](https://github.com/MathMachado/Materials/blob/master/Boosting.png?raw=true)\n", + "\n", + "Source: [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205), Joseph Rocca.\n", + "\n", + "#### Steps\n", + "* Suppose a dataframe X_treinamento (training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n", + " 1. Boosting randomly draws a sample D1 WITHOUT replacement from X_treinamento;\n", + " 2. Boosting trains classifier C1 on D1;\n", + " 3. Boosting randomly draws a SECOND sample D2 WITHOUT replacement from X_treinamento and adds to D2 50% of the observations that were misclassified, in order to train classifier C2;\n", + " 4. Boosting finds in X_treinamento the sample D3 on which classifiers C1 and C2 disagree, and trains C3 on it;\n", + " 5. The predictions of C1, C2 and C3 are combined (by vote) to produce the final result." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyqazmUuifkE" + }, + "source": [ + "___\n", + "# **ADABOOST (Adaptive Boosting)**\n", + "* It was one of the first Boosting algorithms (1995);\n", + "* AdaBoost can be used for classification (AdaBoostClassifier) as well as for regression (AdaBoostRegressor);\n", + "* By default, AdaBoost uses a DecisionTree as base_estimator;\n", + "* It is an ensemble classifier (it combines several classifiers to increase accuracy).\n", + " * AdaBoost is an iterative, strong classifier that combines (ensemble) several weak classifiers to improve accuracy.\n", + "* Any machine learning algorithm that supports sample weights can be used as the base classifier (base_estimator parameter);\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RU-vzkXqrFVw" + }, + "source": [ + "## References\n", + "* [AdaBoost Classifier Example In Python](https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464) - Didactic; explains exactly how AdaBoost works.\n", + "* [Adaboost for Dummies: Breaking Down the Math (and its Equations) into Simple Terms](https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf) - For those who want to understand the math behind the algorithm.\n", + "* [Gradient Boosting and XGBoost](https://medium.com/hackernoon/gradient-boosting-and-xgboost-90862daa6c77)\n", + "* [Understanding AdaBoost](https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe), Akash Desarda." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6EMrjQDZIMl_" + }, + "source": [ + "## Most important AdaBoost hyperparameters:\n", + "* base_estimator - The classifier used to train the 
model. By default, AdaBoost uses DecisionTreeClassifier (with max_depth=1, a decision stump). Other algorithms can be used instead.\n", + "* n_estimators - Number of base_estimator instances trained iteratively.\n", + "* learning_rate - Controls the contribution of each base_estimator to the final combination;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TzLtHzWNJBix" + }, + "source": [ + "## Using different algorithms as base_estimator\n", + "> As mentioned earlier, several kinds of base_estimator can be used in AdaBoost. For example, to use SVM (Support Vector Machines), proceed as follows:\n", + "\n", + "\n", + "```\n", + "# Import AdaBoost and the class used as base_estimator\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.svm import SVC\n", + "\n", + "# Instantiate the base classifier (it is trained inside AdaBoost)\n", + "ml_SVC= SVC(probability=True, kernel='linear')\n", + "\n", + "# Build the AdaBoost model\n", + "ml_AB = AdaBoostClassifier(n_estimators= 50, base_estimator=ml_SVC, learning_rate=1)\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hrj4a4s6hMMB" + }, + "source": [ + "## Advantages\n", + "* AdaBoost is easy to implement;\n", + "* AdaBoost iteratively corrects the errors of the base_estimator and improves accuracy;\n", + "* Performs Feature Selection automatically (**Why?**);\n", + "* Many algorithms can be used as base_estimator;\n", + "* As an ensemble method, the final model tends to be less prone to overfitting on clean data (see the caveat about noise and outliers under Disadvantages).\n", + "\n", + "## Disadvantages\n", + "* AdaBoost is sensitive to noise in the data;\n", + "* Strongly affected by outliers (which contributes to overfitting), because the algorithm tries to fit every point as well as possible;\n", + "* AdaBoost is usually slower than XGBoost;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bgJmu7YLiyv7" + }, + "source": [ + "The following example uses RandomForestClassifier with the optimized hyperparameters, that is:\n", + "\n", + "```\n", + "best_params= {'bootstrap': False, 'max_depth': 
10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5VCRNyZT3qvc" + }, + "source": [ + "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1gIboJdriq61" + }, + "source": [ + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Instantiate RandomForestClassifier with the optimized hyperparameters\n", + "ml_RF4 = RandomForestClassifier(bootstrap= best_params['bootstrap'], \n", + " max_depth= best_params['max_depth'], \n", + " max_features= best_params['max_features'], \n", + " min_samples_leaf= best_params['min_samples_leaf'], \n", + " min_samples_split= best_params['min_samples_split'], \n", + " n_estimators= best_params['n_estimators'], \n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2v9F-N2STPes" + }, + "source": [ + "# Instantiate AdaBoostClassifier with the Random Forest as base estimator\n", + "ml_AB = AdaBoostClassifier(n_estimators=100, base_estimator= ml_RF4, random_state = i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sdvkTlxkTRS_" + }, + "source": [ + "# Train...\n", + "ml_AB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tBOuTywWRm91" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_AB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7Ce5L38ECoC" + }, + "source": [ + "**Interpretation**: Our classifier 
(AdaBoostClassifier) has a mean accuracy of 96.72% (training set) with a standard deviation of about 2.54%. Next, we try to improve the model with GridSearch." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t5GfnBwEifkO" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q9rSpuXyEPA5" + }, + "source": [ + "# Make predictions on the test set...\n", + "y_pred = ml_AB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2F9k-_eXGDLa" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XweWTjQ9EXLw" + }, + "source": [ + "## Parameter tuning" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fcrKzse9EbL_" + }, + "source": [ + "# Hyperparameter dictionary for parameter tuning.\n", + "d_hiperparametros_AB = {'n_estimators': [50, 100, 200], 'learning_rate': [.001, 0.01, 0.05, 0.1, 0.3,1]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Susc3I7mFDQX" + }, + "source": [ + "# Call the function\n", + "ml_AB2, best_params= GridSearchOptimizer(ml_AB, 'ml_AB2', d_hiperparametros_AB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w4JjWsusjNS8" + }, + "source": [ + "___\n", + "# **GRADIENT BOOSTING**\n", + "* Gradient boosting can be used for classification 
(GradientBoostingClassifier) and regression (GradientBoostingRegressor) problems;\n", + "* Gradient boosting is a refinement of AdaBoost (recall that AdaBoost was one of the first Boosting methods, created in 1995). What Gradient Boosting adds to AdaBoost is the explicit minimization of a loss function, i.e., of the difference between the observed values of y and the predicted values.\n", + "* It uses Gradient Descent to find the shortcomings in the predictions of the previous step. \n", + " * Gradient Descent is a popular and powerful algorithm, also used in Neural Networks;\n", + " * The goal of Gradient Boosting is to minimize the \"loss function\", so the result depends on the loss function chosen;\n", + " * Gradient boosting uses DecisionTree algorithms as base_estimator;\n", + "\n", + "## Advantages\n", + "* Needs little pre-processing (for example, no feature scaling);\n", + "* Works with numerical or categorical COLUMNS;\n", + "* Some implementations (XGBoost and LightGBM, for example) handle Missing Values automatically, in which case Missing Value Imputation is not required;\n", + "\n", + "## Disadvantages\n", + "* Because Gradient Boosting keeps trying to minimize the errors at each iteration, it can overemphasize outliers and cause overfitting. 
Therefore, one should:\n", + " * Treat outliers beforehand OR\n", + " * Use Cross-Validation to neutralize the effect of outliers (**I prefer this method because it takes less time**);\n", + "* Computationally expensive: many trees (often more than 1000) are usually needed to obtain good results;\n", + "* Because of its flexibility (many hyperparameters to tune), GridSearchCV is needed to find the best combination of hyperparameters;\n", + "\n", + "## References\n", + "* [Gradient Boosting Decision Tree Algorithm Explained](https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4) - Didactic and detailed.\n", + "* [Predicting Wine Quality with Gradient Boosting Machines](https://towardsdatascience.com/predicting-wine-quality-with-gradient-boosting-machines-a-gmb-tutorial-d950b1542065)\n", + "* [Parameter Tuning in Gradient Boosting (GBM) with Python](https://www.datacareer.de/blog/parameter-tuning-in-gradient-boosting-gbm/)\n", + "* [Tune Learning Rate for Gradient Boosting with XGBoost in Python](https://machinelearningmastery.com/tune-learning-rate-for-gradient-boosting-with-xgboost-in-python/)\n", + "* [In Depth: Parameter tuning for Gradient Boosting](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae) - Very good\n", + "* [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q4bUCZs2jNTA" + }, + "source": [ + "from sklearn.ensemble import GradientBoostingClassifier\n", + "\n", + "# Instantiate...\n", + "ml_GB = GradientBoostingClassifier(n_estimators = 100, min_samples_split = 2)\n", + "\n", + "# Train... 
\n", + "ml_GB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "PKOG1ugSRvLM" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_GB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VlC3y3M5YaGG" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vnLvQ0ZDYNjB" + }, + "source": [ + "**Interpretation**: Our classifier (GradientBoostingClassifier) has a mean accuracy of 96.86% (training set) and a standard deviation of about 2.52%. Next, we try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D2n1RKZuXq3D" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_GB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8r6JCzQRGFa0" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KFv-Q2AD5uCk" + }, + "source": [ + "## Parameter tuning\n", + "> See [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/) for details on the hyperparameters and their meaning." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wgU040AcjNTF" + }, + "source": [ + "# Hyperparameter dictionary for parameter tuning.\n", + "d_hiperparametros_GB= {'learning_rate': [1, 0.5, 0.25, 0.1, 0.05, 0.01]} #,\n", + "# 'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200],\n", + "# 'max_depth': [5, 10, 15, 20, 25, 30],\n", + "# 'min_samples_split': [0.1, 0.3, 0.5, 0.7, 0.9],\n", + "# 'min_samples_leaf': [0.1, 0.2, 0.3, 0.4, 0.5],\n", + "# 'max_features': list(range(1, X_treinamento.shape[1]))}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v5KLFlpTjNTH" + }, + "source": [ + "# Call the function\n", + "ml_GB2, best_params= GridSearchOptimizer(ml_GB, 'ml_GB2', d_hiperparametros_GB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQ6ERz3fi9i2" + }, + "source": [ + "### Result of running Gradient Boosting" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSa7uKw13mKG" + }, + "source": [ + "```\n", + "[Parallel(n_jobs=-1)]: Done 275400 out of 275400 | elapsed: 93.7min finished\n", + "\n", + "Hiperparâmetros otimizados: {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wiJpA2PyjDjR" + }, + "source": [ + "# The search above took 93 minutes to run, so ml_GB2 is built below directly from the parameters it found\n", + "best_params= {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n", + "\n", + "#ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + "# max_depth= best_params['max_depth'],\n", + "# max_features= best_params['max_features'],\n", + "# 
min_samples_leaf= best_params['min_samples_leaf'],\n", + "# min_samples_split= best_params['min_samples_split'],\n", + "# n_estimators= best_params['n_estimators'],\n", + "# random_state= i_Seed)\n", + "\n", + "ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n", + " max_depth= best_params['max_depth'],\n", + " min_samples_leaf= best_params['min_samples_leaf'],\n", + " min_samples_split= best_params['min_samples_split'],\n", + " n_estimators= best_params['n_estimators'],\n", + " random_state= i_Seed)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mb14gJ7-jbVM" + }, + "source": [ + "## Select the important/relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TAqGZIFYm2sU" + }, + "source": [ + "X_treinamento_GB, X_teste_GB = seleciona_colunas_relevantes(ml_GB2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6yiu6dahnBvC" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "APrtWN18nc4t" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VS0mLdOmnXAY" + }, + "source": [ + "# Train with the relevant COLUMNS\n", + "ml_GB2.fit(X_treinamento_GB, y_treinamento)\n", + "\n", + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_GB2, X_treinamento_GB, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vmc9PP_Rn1TN" + }, + "source": [ + "## Validate the model on the X_teste dataframe" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "e3mnIALvnzP2" + }, + "source": [ + "y_pred_GB = ml_GB2.predict(X_teste_GB)\n", + "\n", + "# Compute accuracy\n", + "accuracy_score(y_teste, 
y_pred_GB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwP9Z2GnkV7r" + }, + "source": [ + "___\n", + "# **XGBOOST (eXtreme Gradient Boosting)**\n", + "* XGBoost is an improved implementation of Gradient Boosting. \n", + " * The improvements are in speed and predictive performance, and it addresses inefficiencies of GradientBoosting.\n", + "* One of the algorithms preferred by Kaggle Grandmasters;\n", + "* Parallelizable;\n", + "* Widely regarded as state of the art for Machine Learning on tabular data;\n", + "\n", + "## Relevant hyperparameters and their starting values\n", + "See [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) for full details on the hyperparameters and their meaning.\n", + "\n", + "* n_estimators = 100 (100 if the dataframe is large; 1000 if the dataframe is medium/small) - The number of trees to build;\n", + "* max_depth = 3 - How deep each tree may grow during any training round. Typical values are in the range [3, 10];\n", + "* learning_rate = 0.01 - Shrinks the contribution of each tree to help avoid overfitting; range [0, 1];\n", + "* alpha (reg_alpha in XGBClassifier) - L1 regularization on the weights. Higher values mean more regularization;\n", + "* lambda (reg_lambda in XGBClassifier) - L2 regularization on the weights;\n", + "* colsample_bytree = 1 - Fraction of COLUMNS used by each tree. A high value can cause overfitting;\n", + "* subsample = 0.8 - Fraction of samples used by each tree. Lower values help prevent overfitting, but a value that is too low can lead to underfitting;\n", + "* gamma = 1 - Controls whether a given node is split, based on the expected reduction in loss after the split. A higher value leads to fewer splits;\n", + "* objective - Defines the \"loss function\". 
The options include:\n", + " * reg:linear - regression problems (renamed reg:squarederror in recent XGBoost versions);\n", + " * reg:logistic - logistic regression;\n", + " * binary:logistic - binary classification, returning probabilities;\n", + "\n", + "## References\n", + "* [How exactly XGBoost Works?](https://medium.com/@pushkarmandot/how-exactly-xgboost-works-a320d9b8aeef)\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Gentle Introduction of XGBoost Library](https://medium.com/@imoisharma/gentle-introduction-of-xgboost-library-2b1ac2669680)\n", + "* [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7)\n", + "* [Exploring XGBoost](https://towardsdatascience.com/exploring-xgboost-4baf9ace0cf6)\n", + "* [Feature Importance and Feature Selection With XGBoost in Python](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/)\n", + "* [Ensemble Learning case study: Running XGBoost on Google Colab free GPU](https://towardsdatascience.com/running-xgboost-on-google-colab-free-gpu-a-case-study-841c90fef101) - Recommended\n", + "* [Predicting movie revenue with AdaBoost, XGBoost and LightGBM](https://towardsdatascience.com/predicting-movie-revenue-with-adaboost-xgboost-and-lightgbm-262eadee6daa)\n", + "* [Tuning XGBoost Hyperparameters with Scikit Optimize](https://towardsdatascience.com/how-to-improve-the-performance-of-xgboost-models-1af3995df8ad)\n", + "* [An Example of Hyperparameter Optimization on XGBoost, LightGBM and CatBoost using Hyperopt](https://towardsdatascience.com/an-example-of-hyperparameter-optimization-on-xgboost-lightgbm-and-catboost-using-hyperopt-12bc41a271e) - Interesting\n", + "* [XGBOOST vs LightGBM: Which algorithm wins the race !!!](https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d) - LightGBM has been 
showing promise.\n", + "* [From Zero to Hero in XGBoost Tuning](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58) - I liked it\n", + "* [Build XGBoost / LightGBM models on large datasets — what are the possible solutions?](https://towardsdatascience.com/build-xgboost-lightgbm-models-on-large-datasets-what-are-the-possible-solutions-bf882da2c27d)\n", + "* [Selecting Optimal Parameters for XGBoost Model Training](https://towardsdatascience.com/selecting-optimal-parameters-for-xgboost-model-training-c7cd9ed5e45e) - Very good!\n", + "* [CatBoost vs. Light GBM vs. XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db)\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iMM_R4_ukV7x" + }, + "source": [ + "from xgboost import XGBClassifier\n", + "import xgboost as xgb\n", + "\n", + "# Instantiate...\n", + "ml_XGB = XGBClassifier(silent = False, \n", + " scale_pos_weight=1,\n", + " learning_rate=0.01, \n", + " colsample_bytree = 1,\n", + " subsample = 0.8,\n", + " objective='binary:logistic', \n", + " n_estimators=1000, \n", + " reg_alpha = 0.3,\n", + " max_depth= 3, \n", + " gamma=1, \n", + " max_delta_step=5)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E4wQMlDEFINR" + }, + "source": [ + "# Train...\n", + "ml_XGB.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S77LljiQR_16" + }, + "source": [ + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_XGB, X_treinamento, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JNyKX6PkrXOk" + }, + "source": [ + "**Interpretation**: Our classifier (XGBClassifier) has a mean accuracy of 96.72% (training set) and a standard deviation of about 2.02%. 
Next, we try to improve the classifier's accuracy using parameter tuning (GridSearchCV)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_h0QYv3FkV73" + }, + "source": [ + "print(f'Accuracies: {a_scores_CV}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AKhhAZLjkV76" + }, + "source": [ + "# Make predictions...\n", + "y_pred = ml_XGB.predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ir2Kd1PqGHgz" + }, + "source": [ + "# Confusion Matrix\n", + "cf_matrix = confusion_matrix(y_teste, y_pred)\n", + "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n", + "cf_categories = ['Zero', 'One']\n", + "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jEC7gW4qYpWw" + }, + "source": [ + "## Parameter tuning\n", + "### Further reading:\n", + "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n", + "* [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)\n", + "\n", + "> Looking at the results above, which model is best?\n", + "\n", + "XGBoost? Assuming so, let's now fine-tune the model's hyperparameters." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n3MsUONPwIV9" + }, + "source": [ + "# Hyperparameter dictionary for XGBoost:\n", + "d_hiperparametros_XGB = {'min_child_weight': [i for i in np.arange(1, 13)]} #,\n", + "# 'gamma': [i for i in np.arange(0, 5, 0.5)],\n", + "# 'subsample': [0.6, 0.8, 1.0],\n", + "# 'colsample_bytree': [0.6, 0.8, 1.0],\n", + "# 'max_depth': [3, 4, 5, 7, 9],\n", + "# 'learning_rate': [i for i in np.arange(0.01, 1, 0.1)]}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CX27FCKmwSni" + }, + "source": [ + "# Call the function\n", + "ml_XGB, best_params= GridSearchOptimizer(ml_XGB, 'ml_XGB2', d_hiperparametros_XGB, X_treinamento, y_treinamento, X_teste, y_teste, i_CV, l_colunas)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9b7uCuF74Hjv" + }, + "source": [ + "### Result of running XGBoostClassifier\n", + "\n", + "```\n", + "[Parallel(n_jobs=-1)]: Done 108000 out of 108000 | elapsed: 372.0min finished\n", + "\n", + "Hiperparâmetros otimizados: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n7E0oyxEtbGi" + }, + "source": [ + "# The search above took 372 minutes to run, so ml_XGB2 is built below directly from the parameters it found\n", + "best_params= {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n", + "\n", + "ml_XGB2= XGBClassifier(min_child_weight= best_params['min_child_weight'], \n", + " gamma= best_params['gamma'], \n", + " subsample= best_params['subsample'], \n", + " colsample_bytree= best_params['colsample_bytree'], \n", + " max_depth= best_params['max_depth'], \n", + " learning_rate= best_params['learning_rate'], \n", + " random_state= i_Seed)" 
+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CuqyLHTU5Z-j" + }, + "source": [ + "## Select the important/relevant COLUMNS\n", + "* [The Multiple faces of ‘Feature importance’ in XGBoost](https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QPG3JZIpRZ-T" + }, + "source": [ + "# plot feature importance\n", + "from xgboost import plot_importance\n", + "\n", + "# plot_importance needs a fitted model, so fit ml_XGB2 first\n", + "ml_XGB2.fit(X_treinamento, y_treinamento)\n", + "\n", + "plot_importance(ml_XGB2, color = 'red')\n", + "plt.title('importance', fontsize = 20)\n", + "plt.yticks(fontsize = 10)\n", + "plt.ylabel('features', fontsize = 20)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EmpRC2lHW-KP" + }, + "source": [ + "ml_XGB2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4f9MIEBiyq-5" + }, + "source": [ + "X_treinamento_XGB, X_teste_XGB= seleciona_colunas_relevantes(ml_XGB2, X_treinamento, X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6EayWaY5nMm" + }, + "source": [ + "## Train the classifier with the relevant COLUMNS" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Huy18gKI5qad" + }, + "source": [ + "best_params" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E3-PaTdc5vZk" + }, + "source": [ + "# Train with the relevant COLUMNS...\n", + "ml_XGB2.fit(X_treinamento_XGB, y_treinamento)\n", + "\n", + "# Cross-Validation with 10 folds\n", + "a_scores_CV = funcao_cross_val_score(ml_XGB2, X_treinamento_XGB, y_treinamento, i_CV)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tBdYikDU6NhD" + }, + "source": [ + "## Validate the model on the X_teste dataframe" + ] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "GcvY-VdL6VIZ" + }, + "source": [ + "y_pred_XGB = ml_XGB2.predict(X_teste_XGB)\n", + "\n", + "# Calcula acurácia\n", + "accuracy_score(y_teste, y_pred_XGB)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8oLtdH-vTSbC" + }, + "source": [ + "xgb.to_graphviz(ml_XGB2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "czXQG3MCHfHM" + }, + "source": [ + "# KNN - KNEIGHBORSCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "llTTXNeyHiwx" + }, + "source": [ + "# BAGGINGCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fbkekd4QHoZO" + }, + "source": [ + "# EXTRATREESCLASSIFIER" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "widavwR4HzwE" + }, + "source": [ + "# SVM" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "id_Ubulns6We" + }, + "source": [ + "# NAIVE BAYES" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EwWkjfC8KEZH" + }, + "source": [ + "# ENSEMBLE METHODS\n", + "https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b\n", + "\n", + "![Ensemble](https://github.com/MathMachado/Materials/blob/master/Ensemble.png?raw=true)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ycu_EIGlYUYn" + }, + "source": [ + "import pandas as pd\n", + "\n", + "from xgboost import XGBClassifier\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", + "from sklearn.tree import ExtraTreeClassifier\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import GradientBoostingClassifier\n", + "from sklearn.ensemble import BaggingClassifier\n", + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "from lightgbm 
import LGBMClassifier\n", + "\n", + "clfs = [XGBClassifier(), LGBMClassifier(), \n", + " ExtraTreesClassifier(), ExtraTreeClassifier(),\n", + " BaggingClassifier(), DecisionTreeClassifier(),\n", + " GradientBoostingClassifier(), LogisticRegression(),\n", + " AdaBoostClassifier(), RandomForestClassifier()]\n", + "\n", + "for clf in clfs:\n", + " try:\n", + " _ = mostra_feature_importances(clf, X_treinamento, y_treinamento, top_n = X_treinamento.shape[1], title=clf.__class__.__name__)\n", + " except AttributeError as e:\n", + " print(e)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtoJVI4Pyx3I" + }, + "source": [ + "# **IMBALANCED SAMPLE**\n", + "> Some tasks, such as detecting fraud in bank transactions or detecting network intrusions, share one trait: the class of interest (the one we want to detect) is usually a rare event.\n", + "\n", + "## Example: Detecting fraud\n", + "The ratio of FRAUD to NON-FRAUD is roughly 1%/99%. In this case, if we build a fraud-detection model and it classifies every instance as NON-FRAUD, the model reaches 99% accuracy. Yet such a model is useless, because it never flags a single fraud.\n", + "\n", + "## Why other metrics are needed \n", + "> Use other metrics as well (in fact, it is good practice to measure model performance with more than one metric), for example F1-Score, Precision, Recall/Sensitivity, Specificity and AUROC.\n", + "\n", + "## How to deal with an imbalanced sample?\n", + "* Under-sampling\n", + "> Randomly samples the MAJORITY class (in our example, NON-FRAUD) down to the number of instances of the MINORITY class (FRAUD);\n", + "\n", + "* Over-sampling\n", + "> Randomly resamples the MINORITY class (in our example, FRAUD) up to the number of instances of the MAJORITY class (NON-FRAUD), or up to a proportion of the MAJORITY class. 
See SMOTE (Synthetic Minority Over-sampling Technique), implemented in the imbalanced-learn library;\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2o45zx8zw-aB" + }, + "source": [ + "## EFFECTS OF THE IMBALANCED SAMPLE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCVTPCB-Xkbd" + }, + "source": [ + "# TPOT\n", + "https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2ulXii6JXpWd" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_TWUq-z4X4yZ" + }, + "source": [ + "___\n", + "# FEATURETOOLS\n", + "https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183\n", + "\n", + "https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/\n", + "\n", + "https://mlwhiz.com/blog/2019/05/19/feature_extraction/\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "igHfzeA1Y90p" + }, + "source": [ + "# PyCaret" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aWiahwKe2d6U" + }, + "source": [ + "# **EXERCISES**\n", + "> Find suitable algorithms for the following problems:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XbSLkbDB2mzK" + }, + "source": [ + "## Exercise 1 - Credit Card Fraud Detection\n", + "Source: [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)\n", + "\n", + "### Supporting reading\n", + "* [Detecting Credit Card Fraud Using Machine Learning](https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8)\n", + "* [Credit Card Fraud Detection](https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59)\n", + "\n", + "### Dataframe\n", + "* 
[Creditcard.csv](https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lAl9ZwP_0-d0" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv'\n", + "df_cc = pd.read_csv(url)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYgK6JXd3MgA" + }, + "source": [ + "## Exercise 2 - Predicting species on IRIS dataset\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "si0rsJvu3O6O" + }, + "source": [ + "from sklearn import datasets\n", + "import xgboost as xgb\n", + "\n", + "iris = datasets.load_iris()\n", + "X_iris = iris.data\n", + "y_iris = iris.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zom8t4yWC_UC" + }, + "source": [ + "## Exercise 3 - Predict Wine Quality\n", + "> Estimate wine quality on a scale of 0–100. 
The quality bands on that scale are:\n", + "\n", + "* 95–100 Classic: a great wine\n", + "* 90–94 Outstanding: a wine of superior character and style\n", + "* 85–89 Very good: a wine with special qualities\n", + "* 80–84 Good: a solid, well-made wine\n", + "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n", + "* 50–74 Not recommended\n", + "\n", + "Source: [Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "klL2Q9Ria96n" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Wine = datasets.load_wine()\n", + "X_vinho = Wine.data\n", + "y_vinho = Wine.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lhVhSWBgGijq" + }, + "source": [ + "## Exercise 4 - Predict Parkinson's disease\n", + "Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SVCxHqv0VBJn" + }, + "source": [ + "## Exercise 5 - Predict survivors of the Titanic tragedy\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CwvB8us4eKNi" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "\n", + "df_titanic = sns.load_dataset('titanic')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZJrT9YIXVdtx" + }, + "source": [ + "## Exercise 6 - Predict Loan\n", + "> The data must be downloaded directly from the source: [Loan Default Prediction - Imperial College London](https://www.kaggle.com/c/loan-default-prediction/data)\n", + "\n", + "* [Bank Loan Default Prediction](https://medium.com/@wutianhao910/bank-loan-default-prediction-94d4902db740)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R8-GVu7ZWeA8" + }, + "source": [ + "## Exercise 7 - Predict the sales of a store.\n", + "* 
[Predicting expected sales for Bigmart’s stores](https://medium.com/diogo-menezes-borges/project-1-bigmart-sale-prediction-fdc04f07dc1e)\n", + "* Dataframes\n", + " * [Training](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_train.txt)\n", + " * [Validation](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_test.txt)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fv9w86j4Wnwj" + }, + "source": [ + "## Exercise 8 - [The Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)\n", + "> Predict the median value of owner-occupied homes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5HYRt8-ug1BT" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "Boston = datasets.load_boston()\n", + "X_boston = Boston.data\n", + "y_boston = Boston.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UDIaqmtXQ0T" + }, + "source": [ + "## Exercise 9 - Predict the height or weight of a person.\n", + "\n", + "http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-7R146nIXmMT" + }, + "source": [ + "## Exercise 10 - Black Friday Sales Prediction - Predict purchase amount.\n", + "\n", + "This dataset comprises sales transactions captured at a retail store. It’s a classic dataset to explore and expand your feature engineering skills and day-to-day understanding from multiple shopping experiences. This is a regression problem. 
The dataset has 550,069 rows and 12 columns.\n", + "\n", + "https://github.com/MathMachado/DataFrames/blob/master/blackfriday.zip\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mQ8FPbuLZlIh" + }, + "source": [ + "## Exercise 11 - Predict the income class of US population.\n", + "\n", + "http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Af4NRrchgPlM" + }, + "source": [ + "## Exercise 12 - Predicting Cancer\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c4LOlgZW3P40" + }, + "source": [ + "from sklearn import datasets\n", + "cancer = datasets.load_breast_cancer()\n", + "X_cancer = cancer.data\n", + "y_cancer = cancer.target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74PmpT8Ix0tD" + }, + "source": [ + "## Exercise 13\n", + "Source: [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WY8GZMixZ9W9" + }, + "source": [ + "## Exercise 14 - Predict Diabetes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y92t6tbOge0S" + }, + "source": [ + "from sklearn import datasets\n", + "Diabetes= datasets.load_diabetes()\n", + "\n", + "X_diabetes = Diabetes.data\n", + "y_diabetes = Diabetes.target" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git "a/Notebooks/NB15_02__Regress\303\243o Linear_hs.ipynb" "b/Notebooks/NB15_02__Regress\303\243o Linear_hs.ipynb" new file mode 100644 index 000000000..011db45d3 --- /dev/null +++ "b/Notebooks/NB15_02__Regress\303\243o Linear_hs.ipynb" @@ -0,0 +1,5928 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", 
+ "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.1" + }, + "colab": { + "name": "NB15_02__Regressão Linear.ipynb", + "provenance": [], + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XwQDhId7N6_r" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

\n", + "

SUPERVISED LEARNING

\n", + "

REGRESSION MODELS (LINEAR AND LOGISTIC)

\n", + "\n", + "Source: https://realpython.com/linear-regression-in-python/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PN-dQFJcM1UV" + }, + "source": [ + "Steps to implement Linear Regression:\n", + "\n", + "* (1) Import the required libraries;\n", + "* (2) Load the data;\n", + "* (3) Apply the necessary transformations: outlier handling, NaN handling, scaling/normalization (MinMaxScaler, RobustScaler, StandardScaler, Log, Box-Cox, etc.);\n", + "* (4) Visualize the data (DataViz): understand the relationships, distributions, etc. present in the data;\n", + "* (5) Build and train the predictive model (in this case, a regression model);\n", + "* (6) Validate the model(s) by checking the evaluation metrics;\n", + "* (7) Predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8TldGZxAFV5E" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0QRbxlqaq7pr" + }, + "source": [ + "# Improvements for this session:\n", + "* " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P4sAIblOgFyL" + }, + "source": [ + "# Regression Models with Regularization for Classification and Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o7Y7cuJNgFyU" + }, + "source": [ + "## Simple Linear Regression (using OLS - Ordinary Least Squares)\n", + "\n", + "* Features $X_{np}$: a matrix of dimension n x p holding the dataframe's attributes/predictor variables (independent variables);\n", + "* The target/dependent variable is denoted by y;\n", + "* The relationship between X and y is given by the equation below, where $w_{i}$ is the weight (coefficient) of each feature and $w_{0}$ is the intercept." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NpJ580y9gFyU" + }, + "source": [ + "\n", + "\n", + "![X_y](https://github.com/MathMachado/Materials/blob/master/Architecture.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5rhbVGJ0gFyY" + }, + "source": [ + "* Residual Sum of Squares (RSS) - the sum of the squared differences between the observed and predicted values: $RSS = \\sum_{i=1}^{n} (y_{i} - \\hat{y}_{i})^{2}$.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u8gA0YkbgFyp" + }, + "source": [ + "## Main parameters of the algorithm:\n", + "* fit_intercept - Whether the intercept $w_{0}$ should be fitted. If the data are already centered (zero mean), there is no need to fit the intercept $w_{0}$.\n", + "\n", + "* normalize - $X$ is normalized automatically before fitting (subtracts the mean and divides by the L2 norm). This parameter was deprecated in scikit-learn 1.0 and removed in 1.2; on recent versions, apply StandardScaler before fitting instead;\n", + "\n", + "## Attributes of the Machine Learning regression model\n", + "* coef_ - weight/factor of each independent variable in the ML model;\n", + "\n", + "* intercept_ ($w_{0}$) - intercept or bias of $y$;\n", + "\n", + "## Functions for fitting the ML model:\n", + "* fit - trains the model with the matrices $X$ and $y$;\n", + "* predict - once the model is trained, for a given $X$ it uses the fitted weights to compute the predicted values of $y$ (y_pred).\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A-JG8El1gFy7" + }, + "source": [ + "# Limitações do OLS (Ordinary Least Squares):\n", + "* Impactado/sensível à Outliers;\n", + "* Multicolinearidade; \n", + "* Heterocedasticidade - apresenta-se como uma forte dispersão dos dados em torno de uma reta;\n", + "\n", + "* References" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xylMYR8COyrw" + }, + "source": [ + "### Importar as libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2BGgrILlPK6Z" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from scipy import stats" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "263GgbwhO2kQ" + }, + "source": [ + "### Carregar os dados\n", + "* Vamos carregar o dataset [Boston House Pricing](https://archive.ics.uci.edu/ml/datasets/housing)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1h66x_-rXGhi" + }, + "source": [ + "from sklearn.datasets import load_boston, load_iris" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rWniNkMpXQFU", + "outputId": "4fd6cce0-22dc-43e0-b31a-effc991c8c69", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "boston = load_boston()\n", + "boston" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'DESCR': \".. _boston_dataset:\\n\\nBoston house prices dataset\\n---------------------------\\n\\n**Data Set Characteristics:** \\n\\n :Number of Instances: 506 \\n\\n :Number of Attributes: 13 numeric/categorical predictive. 
Median Value (attribute 14) is usually the target.\\n\\n :Attribute Information (in order):\\n - CRIM per capita crime rate by town\\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\\n - INDUS proportion of non-retail business acres per town\\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\\n - NOX nitric oxides concentration (parts per 10 million)\\n - RM average number of rooms per dwelling\\n - AGE proportion of owner-occupied units built prior to 1940\\n - DIS weighted distances to five Boston employment centres\\n - RAD index of accessibility to radial highways\\n - TAX full-value property-tax rate per $10,000\\n - PTRATIO pupil-teacher ratio by town\\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\\n - LSTAT % lower status of the population\\n - MEDV Median value of owner-occupied homes in $1000's\\n\\n :Missing Attribute Values: None\\n\\n :Creator: Harrison, D. and Rubinfeld, D.L.\\n\\nThis is a copy of UCI ML housing dataset.\\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\\n\\n\\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\\n\\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\\nprices and the demand for clean air', J. Environ. Economics & Management,\\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\\n...', Wiley, 1980. N.B. Various transformations are used in the table on\\npages 244-261 of the latter.\\n\\nThe Boston house-price data has been used in many machine learning papers that address regression\\nproblems. \\n \\n.. topic:: References\\n\\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. 
In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\\n\",\n", + " 'data': array([[0.01, 18.00, 2.31, ..., 15.30, 396.90, 4.98],\n", + " [0.03, 0.00, 7.07, ..., 17.80, 396.90, 9.14],\n", + " [0.03, 0.00, 7.07, ..., 17.80, 392.83, 4.03],\n", + " ...,\n", + " [0.06, 0.00, 11.93, ..., 21.00, 396.90, 5.64],\n", + " [0.11, 0.00, 11.93, ..., 21.00, 393.45, 6.48],\n", + " [0.05, 0.00, 11.93, ..., 21.00, 396.90, 7.88]]),\n", + " 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n", + " 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
\n", + "" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 75 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQzFW7DUX_KW", + "outputId": "39ff2f0a-75ef-41c8-eba0-01f2fd56ac73", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "# Variável target/resposta\n", + "df_boston['preco'] = load_boston().target\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  preco
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2
\n", + "
" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT preco\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 76 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H71da4bIO4kI" + }, + "source": [ + "### Data Transformation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K-6YOdsTfciO" + }, + "source": [ + "#### Normalização/padronização dos nomes das colunas" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L8OJEapufhq4" + }, + "source": [ + "# Renomear as colunas do dataframe\n", + "df_boston.columns = [col.lower() for col in df_boston.columns]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uRinX-5ofol_", + "outputId": "b3b4003c-7f76-4779-8b53-23ca25c1d38d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
      crim    zn  indus  chas    nox     rm   age     dis  rad    tax  ptratio       b  lstat  preco
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2
\n", + "
" + ], + "text/plain": [ + " crim zn indus chas nox ... tax ptratio b lstat preco\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 78 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CMDh5jyqekmr" + }, + "source": [ + "#### Outliers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jJIG0jJQf6em" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FgYPzlvfemFc" + }, + "source": [ + "#### Missing values" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BAjw7UhJen0D", + "outputId": "9159a3f6-ef34-4507-91c3-19f8138230af", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 272 + } + }, + "source": [ + "# Missing values por colunas/variáveis\n", + "df_boston.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "crim 0\n", + "zn 0\n", + "indus 0\n", + "chas 0\n", + "nox 0\n", + "rm 0\n", + "age 0\n", + "dis 0\n", + "rad 0\n", + "tax 0\n", + "ptratio 0\n", + "b 0\n", + "lstat 0\n", + "preco 0\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 79 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jo3UWNpbYnNF", + "outputId": "13a0c7e1-219f-42d1-ef65-d8603a2af1e0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Número de atributos\n", + "len(load_boston().feature_names)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "13" + ] + }, + "metadata": { 
+ "tags": [] + }, + "execution_count": 80 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0Yp8g7hxfQli", + "outputId": "6296d35e-e488-489d-bfad-489884473d40", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 49 + } + }, + "source": [ + "# Missing Values por linhas\n", + "df_boston[df_boston.isnull().any(axis = 1)]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crim  zn  indus  chas  nox  rm  age  dis  rad  tax  ptratio  b  lstat  preco
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: [crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, preco]\n", + "Index: []" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 81 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5qmkTFLrf9MT" + }, + "source": [ + "#### Estatísticas Descritivas" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Nprn3p_Wf_bn", + "outputId": "ac3fe2f6-7255-4257-9a6d-42ae210854f4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "df_boston.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
             crim          zn       indus        chas         nox          rm         age         dis         rad         tax     ptratio           b       lstat       preco
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000
" + ], + "text/plain": [ + " crim zn indus ... b lstat preco\n", + "count 506.000000 506.000000 506.000000 ... 506.000000 506.000000 506.000000\n", + "mean 3.613524 11.363636 11.136779 ... 356.674032 12.653063 22.532806\n", + "std 8.601545 23.322453 6.860353 ... 91.294864 7.141062 9.197104\n", + "min 0.006320 0.000000 0.460000 ... 0.320000 1.730000 5.000000\n", + "25% 0.082045 0.000000 5.190000 ... 375.377500 6.950000 17.025000\n", + "50% 0.256510 0.000000 9.690000 ... 391.440000 11.360000 21.200000\n", + "75% 3.677083 12.500000 18.100000 ... 396.225000 16.955000 25.000000\n", + "max 88.976200 100.000000 27.740000 ... 396.900000 37.970000 50.000000\n", + "\n", + "[8 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 82 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1JimyY3SgECE" + }, + "source": [ + "#### Análise de Correlação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jScHq7eTgIpm", + "outputId": "17d9046b-b44c-40f9-9cd6-7a0a602d26e3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 483 + } + }, + "source": [ + "correlacoes = df_boston.corr()\n", + "correlacoes" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
crim1.000000-0.2004690.406583-0.0558920.420972-0.2192470.352734-0.3796700.6255050.5827640.289946-0.3850640.455621-0.388305
zn-0.2004691.000000-0.533828-0.042697-0.5166040.311991-0.5695370.664408-0.311948-0.314563-0.3916790.175520-0.4129950.360445
indus0.406583-0.5338281.0000000.0629380.763651-0.3916760.644779-0.7080270.5951290.7207600.383248-0.3569770.603800-0.483725
chas-0.055892-0.0426970.0629381.0000000.0912030.0912510.086518-0.099176-0.007368-0.035587-0.1215150.048788-0.0539290.175260
nox0.420972-0.5166040.7636510.0912031.000000-0.3021880.731470-0.7692300.6114410.6680230.188933-0.3800510.590879-0.427321
rm-0.2192470.311991-0.3916760.091251-0.3021881.000000-0.2402650.205246-0.209847-0.292048-0.3555010.128069-0.6138080.695360
age0.352734-0.5695370.6447790.0865180.731470-0.2402651.000000-0.7478810.4560220.5064560.261515-0.2735340.602339-0.376955
dis-0.3796700.664408-0.708027-0.099176-0.7692300.205246-0.7478811.000000-0.494588-0.534432-0.2324710.291512-0.4969960.249929
rad0.625505-0.3119480.595129-0.0073680.611441-0.2098470.456022-0.4945881.0000000.9102280.464741-0.4444130.488676-0.381626
tax0.582764-0.3145630.720760-0.0355870.668023-0.2920480.506456-0.5344320.9102281.0000000.460853-0.4418080.543993-0.468536
ptratio0.289946-0.3916790.383248-0.1215150.188933-0.3555010.261515-0.2324710.4647410.4608531.000000-0.1773830.374044-0.507787
b-0.3850640.175520-0.3569770.048788-0.3800510.128069-0.2735340.291512-0.444413-0.441808-0.1773831.000000-0.3660870.333461
lstat0.455621-0.4129950.603800-0.0539290.590879-0.6138080.602339-0.4969960.4886760.5439930.374044-0.3660871.000000-0.737663
preco-0.3883050.360445-0.4837250.175260-0.4273210.695360-0.3769550.249929-0.381626-0.468536-0.5077870.333461-0.7376631.000000
\n", + "
" + ], + "text/plain": [ + " crim zn indus ... b lstat preco\n", + "crim 1.000000 -0.200469 0.406583 ... -0.385064 0.455621 -0.388305\n", + "zn -0.200469 1.000000 -0.533828 ... 0.175520 -0.412995 0.360445\n", + "indus 0.406583 -0.533828 1.000000 ... -0.356977 0.603800 -0.483725\n", + "chas -0.055892 -0.042697 0.062938 ... 0.048788 -0.053929 0.175260\n", + "nox 0.420972 -0.516604 0.763651 ... -0.380051 0.590879 -0.427321\n", + "rm -0.219247 0.311991 -0.391676 ... 0.128069 -0.613808 0.695360\n", + "age 0.352734 -0.569537 0.644779 ... -0.273534 0.602339 -0.376955\n", + "dis -0.379670 0.664408 -0.708027 ... 0.291512 -0.496996 0.249929\n", + "rad 0.625505 -0.311948 0.595129 ... -0.444413 0.488676 -0.381626\n", + "tax 0.582764 -0.314563 0.720760 ... -0.441808 0.543993 -0.468536\n", + "ptratio 0.289946 -0.391679 0.383248 ... -0.177383 0.374044 -0.507787\n", + "b -0.385064 0.175520 -0.356977 ... 1.000000 -0.366087 0.333461\n", + "lstat 0.455621 -0.412995 0.603800 ... -0.366087 1.000000 -0.737663\n", + "preco -0.388305 0.360445 -0.483725 ... 
0.333461 -0.737663 1.000000\n", + "\n", + "[14 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 83 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AxQp7xqdgTJP" + }, + "source": [ + "##### Gráfico das correlações entre as features/variáveis/colunas\n", + "Source: https://seaborn.pydata.org/examples/many_pairwise_correlations.html\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KOiH2X-WgqmN", + "outputId": "31174dc1-d09b-4efc-a9be-96a72c98a8d7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 557 + } + }, + "source": [ + "import seaborn as sns\n", + "from string import ascii_letters\n", + "import matplotlib.pyplot as plt\n", + "\n", + "sns.set_theme(style = \"white\")\n", + "\n", + "d = df_boston\n", + "\n", + "# Compute the correlation matrix\n", + "corr = d.corr()\n", + "\n", + "# Generate a mask for the upper triangle\n", + "mask = np.triu(np.ones_like(corr, dtype=bool))\n", + "\n", + "# Set up the matplotlib figure\n", + "f, ax = plt.subplots(figsize=(11, 9))\n", + "\n", + "# Generate a custom diverging colormap\n", + "cmap = sns.diverging_palette(230, 20, as_cmap=True)\n", + "\n", + "# Draw the heatmap with the mask and correct aspect ratio\n", + "sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,\n", + " square=True, linewidths=.5, cbar_kws={\"shrink\": .5})" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 84 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nogPhyfVO70G" + }, + "source": [ + "### Construir e treinar o(s) modelo(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HxYpfyvQaIe1" + }, + "source": [ + "$X = [X_{1}, X_{2}, X_{p}]$ = X_boston abaixo." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0BhLZJhibVNG" + }, + "source": [ + "X_boston = df_boston.drop(columns = ['preco'], axis = 1) # todas as variáveis/atributos, exceto 'preco'\n", + "y_boston = df_boston['preco'] # variável-target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_nC_RGva1Z6", + "outputId": "de1647e4-f836-40a5-a49a-a9a21dd5edca", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "X_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstat
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.98
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.14
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.03
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.94
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.33
\n", + "
" + ], + "text/plain": [ + " crim zn indus chas nox ... rad tax ptratio b lstat\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 86 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nlVJM--Ya5fS", + "outputId": "6007a0f6-9709-4c93-a65a-85f16c578b71", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "y_boston[0:10] # Series (coluna)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 24.0\n", + "1 21.6\n", + "2 34.7\n", + "3 33.4\n", + "4 36.2\n", + "5 28.7\n", + "6 22.9\n", + "7 27.1\n", + "8 16.5\n", + "9 18.9\n", + "Name: preco, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 87 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b50_6tv5h1kY" + }, + "source": [ + "# Definindo os dataframes de treinamento e teste:\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_boston, \n", + " y_boston, \n", + " test_size = 0.2, \n", + " random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1U3hpdkDbYTv", + "outputId": "5e71907d-eb8c-44f0-ce98-3c1ac56e6f82", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "print(f\"Dataframe de treinamento: {X_treinamento.shape[0]} linhas\")\n", + "print(f\"Dataframe de teste......: {X_teste.shape[0]} linhas\")" + ], + "execution_count": null, + "outputs": [ + { + 
"output_type": "stream", + "text": [ + "Dataframe de treinamento: 404 linhas\n", + "Dataframe de teste......: 102 linhas\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SvevXulFiJj1" + }, + "source": [ + "#### Treinamento do modelo de Regressão Linear" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GVwF3vp8iNff" + }, + "source": [ + "# Importa a library LinearRegression --> Para treinamento da Regressão Linear\n", + "from sklearn.linear_model import LinearRegression\n", + "\n", + "# Library para statmodels\n", + "import statsmodels.api as sm" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ibX6bCbViW-v" + }, + "source": [ + "# Instancia o objeto\n", + "regressao_linear = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M-5wRGUribY0", + "outputId": "fde40bc7-3f51-4e11-bdba-978d7e3c3710", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Treina o modelo usando as amostras/dataset de treinamento: X_treinamento e y_treinamento \n", + "regressao_linear.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 92 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jri-jA1VjmUl", + "outputId": "e95d7477-f836-47d1-b300-a573d30c9cfd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Valor do intercepto\n", + "regressao_linear.intercept_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "35.9020918753502" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 93 + 
} + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VOjadxdxjqtT", + "outputId": "bf2f3507-81cd-4f2b-b0cc-27b043ff15b3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 452 + } + }, + "source": [ + "# Coeficientes do modelo de Regressão Linear\n", + "coeficientes_regressao_linear = pd.DataFrame([X_treinamento.columns, regressao_linear.coef_]).T\n", + "coeficientes_regressao_linear = coeficientes_regressao_linear.rename(columns={0: 'Feature/variável/coluna', 1: 'Coeficientes'})\n", + "coeficientes_regressao_linear" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Feature/variável/colunaCoeficientes
0crim-0.0822083
1zn0.0428002
2indus0.0756011
3chas3.16348
4nox-19.4945
5rm3.98161
6age0.00480929
7dis-1.37396
8rad0.298883
9tax-0.0123962
10ptratio-0.984657
11b0.008949
12lstat-0.526478
\n", + "
" + ], + "text/plain": [ + " Feature/variável/coluna Coeficientes\n", + "0 crim -0.0822083\n", + "1 zn 0.0428002\n", + "2 indus 0.0756011\n", + "3 chas 3.16348\n", + "4 nox -19.4945\n", + "5 rm 3.98161\n", + "6 age 0.00480929\n", + "7 dis -1.37396\n", + "8 rad 0.298883\n", + "9 tax -0.0123962\n", + "10 ptratio -0.984657\n", + "11 b 0.008949\n", + "12 lstat -0.526478" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 94 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jwnkhPwDjkhS" + }, + "source": [ + "#### Usando statmodels" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ltbekHd_k3PH", + "outputId": "c1265a64-74b3-492d-8d13-04eaee9ade8b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 680 + } + }, + "source": [ + "X2_treinamento = sm.add_constant(X_treinamento)\n", + "lm_sm = sm.OLS(y_treinamento, X2_treinamento).fit()\n", + "print(lm_sm.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.725\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 78.97\n", + "Date: Tue, 27 Oct 2020 Prob (F-statistic): 1.48e-100\n", + "Time: 18:36:57 Log-Likelihood: -1214.8\n", + "No. 
Observations: 404 AIC: 2458.\n", + "Df Residuals: 390 BIC: 2514.\n", + "Df Model: 13 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.9021 6.037 5.947 0.000 24.033 47.771\n", + "crim -0.0822 0.045 -1.824 0.069 -0.171 0.006\n", + "zn 0.0428 0.016 2.638 0.009 0.011 0.075\n", + "indus 0.0756 0.072 1.054 0.292 -0.065 0.217\n", + "chas 3.1635 0.997 3.174 0.002 1.204 5.123\n", + "nox -19.4945 4.539 -4.295 0.000 -28.418 -10.571\n", + "rm 3.9816 0.510 7.802 0.000 2.978 4.985\n", + "age 0.0048 0.015 0.312 0.755 -0.025 0.035\n", + "dis -1.3740 0.236 -5.827 0.000 -1.838 -0.910\n", + "rad 0.2989 0.079 3.760 0.000 0.143 0.455\n", + "tax -0.0124 0.004 -2.814 0.005 -0.021 -0.004\n", + "ptratio -0.9847 0.156 -6.309 0.000 -1.292 -0.678\n", + "b 0.0089 0.003 2.796 0.005 0.003 0.015\n", + "lstat -0.5265 0.060 -8.764 0.000 -0.645 -0.408\n", + "==============================================================================\n", + "Omnibus: 140.799 Durbin-Watson: 2.083\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 591.650\n", + "Skew: 1.484 Prob(JB): 3.35e-129\n", + "Kurtosis: 8.132 Cond. No. 1.51e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.51e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kpt3A4Q0guHv" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rVUJkfg4gSh7", + "outputId": "a620d85e-38ae-49a5-f5ea-bac502964bac", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 663 + } + }, + "source": [ + "X3 = X_treinamento.drop(columns = 'age', axis = 1)\n", + "X3_treinamento = sm.add_constant(X3)\n", + "lm_sm2 = sm.OLS(y_treinamento, X3_treinamento).fit()\n", + "print(lm_sm2.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.725\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 85.75\n", + "Date: Tue, 27 Oct 2020 Prob (F-statistic): 1.64e-101\n", + "Time: 18:36:57 Log-Likelihood: -1214.8\n", + "No. 
Observations: 404 AIC: 2456.\n", + "Df Residuals: 391 BIC: 2508.\n", + "Df Model: 12 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.7325 6.006 5.950 0.000 23.925 47.540\n", + "crim -0.0815 0.045 -1.812 0.071 -0.170 0.007\n", + "zn 0.0422 0.016 2.623 0.009 0.011 0.074\n", + "indus 0.0750 0.072 1.048 0.295 -0.066 0.216\n", + "chas 3.1794 0.994 3.198 0.001 1.225 5.134\n", + "nox -19.1299 4.381 -4.367 0.000 -27.742 -10.517\n", + "rm 4.0153 0.498 8.059 0.000 3.036 4.995\n", + "dis -1.3963 0.224 -6.223 0.000 -1.837 -0.955\n", + "rad 0.2958 0.079 3.755 0.000 0.141 0.451\n", + "tax -0.0123 0.004 -2.802 0.005 -0.021 -0.004\n", + "ptratio -0.9812 0.156 -6.310 0.000 -1.287 -0.675\n", + "b 0.0090 0.003 2.825 0.005 0.003 0.015\n", + "lstat -0.5202 0.057 -9.203 0.000 -0.631 -0.409\n", + "==============================================================================\n", + "Omnibus: 142.363 Durbin-Watson: 2.081\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 608.694\n", + "Skew: 1.496 Prob(JB): 6.67e-133\n", + "Kurtosis: 8.216 Cond. No. 1.48e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.48e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_lcp7m5FmZvG" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'indus'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jEiBywx4hGNB", + "outputId": "c283ddc3-1c44-4d53-e483-11c9034c1533", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 646 + } + }, + "source": [ + "X4 = X3_treinamento.drop(columns = 'indus', axis = 1)\n", + "X4_treinamento = sm.add_constant(X4)\n", + "lm_sm3 = sm.OLS(y_treinamento, X4_treinamento).fit()\n", + "print(lm_sm3.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.724\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 93.42\n", + "Date: Tue, 27 Oct 2020 Prob (F-statistic): 2.86e-102\n", + "Time: 18:36:57 Log-Likelihood: -1215.4\n", + "No. 
Observations: 404 AIC: 2455.\n", + "Df Residuals: 392 BIC: 2503.\n", + "Df Model: 11 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.4757 6.001 5.911 0.000 23.677 47.275\n", + "crim -0.0840 0.045 -1.871 0.062 -0.172 0.004\n", + "zn 0.0407 0.016 2.539 0.012 0.009 0.072\n", + "chas 3.2924 0.989 3.330 0.001 1.349 5.236\n", + "nox -17.9558 4.235 -4.239 0.000 -26.283 -9.629\n", + "rm 3.9674 0.496 7.996 0.000 2.992 4.943\n", + "dis -1.4553 0.217 -6.699 0.000 -1.882 -1.028\n", + "rad 0.2744 0.076 3.606 0.000 0.125 0.424\n", + "tax -0.0103 0.004 -2.603 0.010 -0.018 -0.003\n", + "ptratio -0.9609 0.154 -6.227 0.000 -1.264 -0.658\n", + "b 0.0089 0.003 2.778 0.006 0.003 0.015\n", + "lstat -0.5151 0.056 -9.145 0.000 -0.626 -0.404\n", + "==============================================================================\n", + "Omnibus: 142.123 Durbin-Watson: 2.073\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 605.868\n", + "Skew: 1.494 Prob(JB): 2.74e-132\n", + "Kurtosis: 8.202 Cond. No. 1.47e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.47e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rFejox5XmrEE" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'crim'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DOehOql8hZWr", + "outputId": "12f9c9a0-b10f-4ad2-fcbd-c010eaea95bb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 629 + } + }, + "source": [ + "X5 = X4_treinamento.drop(columns = 'crim', axis = 1)\n", + "X5_treinamento = sm.add_constant(X5)\n", + "lm_sm4 = sm.OLS(y_treinamento, X5_treinamento).fit()\n", + "print(lm_sm4.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.721\n", + "Model: OLS Adj. R-squared: 0.714\n", + "Method: Least Squares F-statistic: 101.8\n", + "Date: Tue, 27 Oct 2020 Prob (F-statistic): 1.55e-102\n", + "Time: 18:36:57 Log-Likelihood: -1217.2\n", + "No. 
Observations: 404 AIC: 2456.\n", + "Df Residuals: 393 BIC: 2500.\n", + "Df Model: 10 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 33.9950 5.968 5.696 0.000 22.262 45.728\n", + "zn 0.0375 0.016 2.349 0.019 0.006 0.069\n", + "chas 3.3959 0.990 3.430 0.001 1.449 5.343\n", + "nox -17.1637 4.228 -4.060 0.000 -25.475 -8.852\n", + "rm 4.0365 0.496 8.132 0.000 3.061 5.012\n", + "dis -1.3999 0.216 -6.484 0.000 -1.824 -0.975\n", + "rad 0.2278 0.072 3.158 0.002 0.086 0.370\n", + "tax -0.0100 0.004 -2.513 0.012 -0.018 -0.002\n", + "ptratio -0.9493 0.155 -6.137 0.000 -1.253 -0.645\n", + "b 0.0101 0.003 3.217 0.001 0.004 0.016\n", + "lstat -0.5315 0.056 -9.523 0.000 -0.641 -0.422\n", + "==============================================================================\n", + "Omnibus: 140.245 Durbin-Watson: 2.070\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 609.563\n", + "Skew: 1.464 Prob(JB): 4.32e-133\n", + "Kurtosis: 8.257 Cond. No. 1.46e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.46e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UafIUrpZB0YP" + }, + "source": [ + "### Conclusão\n", + "* Quais variáveis/colunas/atributos ficam no modelo?\n", + "* **Muito importante (exercício)**: normalizar (MinMaxScaler) as covariáveis e refazer a análise.\n", + "* Nesta iteração (depois de excluirmos (nesta ordem) as variáveis age, indus e crim, não surge nenhuma outra variável insignificante ao nível de 5 (na verdade, o maior valor é 1.9%)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jx7sOzrrm-H_" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nXeiFtnJO_1u" + }, + "source": [ + "### Validação do(s) modelo(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QlGVFA6uPDvr" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PE3aKJ6mPDyJ" + }, + "source": [ + "### Predições" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d3nGiyX8jadH" + }, + "source": [ + "### Deployment da solução **analítica**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YQF4NIlGSLH" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UQfpoo1igFy8" + }, + "source": [ + "# Regularized Regression Methods \n", + "## Ridge Regression - Penalized Regression\n", + "> Reduz a complexidade do modelo através do uso de todas as variáveis de $X$, mas penalizando (valor de $\\alpha$) os coeficientes $w_{i}$ quando estiverem muito longe de zero, forçando-os a serem pequenos de maneira contínua. 
Dessa forma, diminuímos a complexidade do modelo enquanto mantemos todas as variáveis no modelo.\n", + "* Menor impacto dos outliers.\n", + "\n", + "### Exemplo" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rVCtuvztS6gk" + }, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o00xH2MvxvgP" + }, + "source": [ + "# Matriz de covariáveis do modelo:\n", + "X_new = [[0, 0], [0, 0], [1, 1]]\n", + "y_new = [0, .1, 1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v9U7c03NzW_c", + "outputId": "5ac868a1-3985-4b81-e0c7-906dd6a28f37", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_new" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[[0, 0], [0, 0], [1, 1]]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 100 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iiVEAPpUzXyN", + "outputId": "1c600dce-e3f2-4720-a526-827f2698aa0d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_new" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[0, 0.1, 1]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 101 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8mWj2GbPOkHx", + "outputId": "d98acd27-e45f-4a27-c726-7abef730474c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "ridge = Ridge(alpha = .1)\n", + "ridge.fit(X_new, y_new)\n", + "ridge.coef_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.44, 0.44])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 102 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0kD7Bsq_OkH1", + "outputId": 
"3101e945-41cb-4bd3-eefe-c374f6654982", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# treinando a regressão Ridge\n", + "ridge.fit(X_new, y_new)\n", + "\n", + "# treinando a regressão linear simples (OLS)\n", + "lr.fit(X_new, y_new)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 103 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "v9SNxnj42k6Y", + "outputId": "5973fa6f-f754-4e9f-b573-0afc861cce59", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_new" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[[0, 0], [0, 0], [1, 1]]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 104 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "utc-f86d2ne_", + "outputId": "28e5a75b-ad1d-42d4-fb79-e292d0475a2f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_new" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[0, 0.1, 1]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 105 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aUEyK4lygFy_", + "outputId": "cbbdae3b-741c-41c1-8c97-ac2becf713aa", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "ridge.coef_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.44, 0.44])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 106 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qYRLUwIugFzC", + "outputId": 
"eb51b510-4979-49f6-ba9a-e1d85514ae18", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "lr.coef_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.47, 0.47])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 107 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u5jsTkUmS9wK" + }, + "source": [ + "### Aplicação da Regressão Ridge no dataframe Boston Housing Price." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Kp4VIJWxgFy8" + }, + "source": [ + "from sklearn.linear_model import Ridge\n", + "ridge = Ridge(alpha = 0.1) # Definição do valor de alpha da regressão ridge\n", + "lr = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cmRMoOwV6FMt", + "outputId": "2b81c12b-2be6-490e-f642-10565d18b99e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Ao inves de: regressao_linear.fit(X_treinamento, y_treinamento)\n", + "ridge.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,\n", + " normalize=False, random_state=None, solver='auto', tol=0.001)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 109 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VPnekyUbK6Xg" + }, + "source": [ + "#### Peso/contribuição das variáveis para a regressão usando RIDGE" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "k83RDArjsUrj", + "outputId": "aa386a9b-8595-4fa7-fb9a-643828bb78c7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "df_boston.columns" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { 
+ "text/plain": [ + "Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',\n", + " 'ptratio', 'b', 'lstat', 'preco'],\n", + " dtype='object')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 110 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vMCb0CFjK973", + "outputId": "92d03811-fb34-47f1-927c-6486e5bee39a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "ridge.coef_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-0.08, 0.04, 0.07, 3.14, -18.00, 3.99, 0.00, -1.35, 0.30, -0.01,\n", + " -0.97, 0.01, -0.53])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 111 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZqksuIjXypRJ", + "outputId": "24457c5e-5a22-405b-835c-8b36b7619295", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# treinando a regressão Ridge\n", + "ridge.fit(X_treinamento, y_treinamento)\n", + "\n", + "# treinando a regressão linear simples (OLS)\n", + "lr.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 112 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7r28PBsWLtjA", + "outputId": "c63f1b7c-c635-4de2-c97d-e032ab4f1e8b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "ridge.alpha" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 113 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dDZ_TJnhuZno" + }, + "source": [ + "#### $\\alpha 
= 0.01$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hRMK_QTmNgc1", + "outputId": "835a8daa-8efc-4927-c1d1-25024b859384", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "# Larger alpha --> stronger penalty: the coefficients are shrunk more toward zero.\n", + "# Smaller alpha --> weaker penalty: Ridge approaches OLS; with alpha = 0, Ridge = OLS.\n", + "rr = Ridge(alpha = 0.01) # The closer alpha is to 0, the closer Ridge is to OLS\n", + "rr.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,\n", + " normalize=False, random_state=None, solver='auto', tol=0.001)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 114 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IRuWmBE7Ngc7" + }, + "source": [ + "# MSE = Mean Squared Error, computed here on the training data\n", + "from sklearn.metrics import mean_squared_error\n", + "rr_model = mean_squared_error(y_true = y_treinamento, y_pred = rr.predict(X_treinamento))\n", + "lr_model = mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "L4an-zHetafI", + "outputId": "9012a92d-44f0-4f70-cda8-fe27bca50233", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "print(rr_model)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "23.94639697817076\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QsLVzk3EtbGs", + "outputId": "4baf36bd-a557-4965-fbac-5d1343879a70", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "print(lr_model)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "23.946319854597377\n" 
+ ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K2sjngo1QhY2" + }, + "source": [ + "### Ridge coefficients (absolute values):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s5i87o3quByz", + "outputId": "0560bb2c-fe7c-4961-d254-4305a10465dd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 238 + } + }, + "source": [ + "list(zip(X_treinamento.columns, abs(ridge.coef_)))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[('crim', 0.08087280884194979),\n", + " ('zn', 0.0431105323320636),\n", + " ('indus', 0.06967744483334821),\n", + " ('chas', 3.144789492713716),\n", + " ('nox', 17.9983019701622),\n", + " ('rm', 3.9867565296916703),\n", + " ('age', 0.0035446489044452497),\n", + " ('dis', 1.3530395756206453),\n", + " ('rad', 0.29504291572154007),\n", + " ('tax', 0.012511527307639232),\n", + " ('ptratio', 0.9682821087614826),\n", + " ('b', 0.009027440635645128),\n", + " ('lstat', 0.5291356457993021)]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 118 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s44vo9IjQonE" + }, + "source": [ + "### Try several other values for $\\alpha$, for example $\\alpha = 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CDv5fGPbuUq5" + }, + "source": [ + "#### $\\alpha = 100$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NEaj4QRrNgdA" + }, + "source": [ + "rr100 = Ridge(alpha = 100)\n", + "rr100.fit(X_treinamento, y_treinamento)\n", + "# R^2 of the OLS model on the training and test sets\n", + "train_score = lr.score(X_treinamento, y_treinamento)\n", + "test_score = lr.score(X_teste, y_teste)\n", + "# R^2 of the Ridge model with alpha = 100 on the training set\n", + "Ridge_treinamento_score = rr100.score(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zhcfoTEENgdE" + }, + "source": [ + "# MSE on the training data\n", + "rr100_model = 
(mean_squared_error(y_true = y_treinamento, y_pred = rr100.predict(X_treinamento)))\n", + "lr_model = (mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NGDBpfiquxoc", + "outputId": "654af80a-5188-4e8c-c7ea-4bcbcd8cfd66", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "print(rr100_model)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "26.460105089888508\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Owami5MVureW", + "outputId": "c571e71f-f2cf-41c6-b746-f6669c9ad401", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "print(lr_model)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "23.946319854597377\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xk5dN3Owu6Kw" + }, + "source": [ + "### Next step: fit the Ridge models with statsmodels." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cEF_3GgUgF0Q" + }, + "source": [ + "# LASSO (Least Absolute Shrinkage and Selection Operator)\n", + "* One of the most common methods used for regularization;\n", + "* Reduces overfitting;\n", + "* Performs **feature selection**: it can shrink coefficients to exactly zero and, among highly correlated variables, tends to keep one and drop the others." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-YiKb9reQdI4" + }, + "source": [ + "* Used for regularization, that is, penalizing the coefficients so that only the most important features remain. This is especially useful for a dataframe with many variables;\n", + "* Lasso regression has a parameter ($\\alpha$): the larger $\\alpha$ is, the more coefficients are shrunk to exactly zero. 
When $\\alpha = 0$, Lasso regression produces the same coefficients as ordinary linear regression. When $\\alpha$ is very large, all coefficients are zero." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5p_ZPZ4tTUX1" + }, + "source": [ + "### LASSO example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i5JZTnkTOkI9", + "outputId": "287e4f8c-e6f0-459d-fd88-bfb829e6cc95", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "from sklearn.linear_model import Lasso\n", + "lasso = Lasso(alpha = .1)\n", + "lasso.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,\n", + " normalize=False, positive=False, precompute=False, random_state=None,\n", + " selection='cyclic', tol=0.0001, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 123 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gEUxSlThOkJD", + "outputId": "7a5cfd51-4c2c-48a2-de0b-5d0292636ef2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "lasso.coef_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.50, 0.00])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 124 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EQaGWzzLT9qP" + }, + "source": [ + "### Applying LASSO to the Boston Housing Price data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ME6v6LFlgF0Q", + "outputId": "3852532d-56fa-40a9-b2aa-c8893f4e4bb4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "from sklearn.linear_model import Lasso\n", + "lasso = Lasso(alpha = .1)\n", + "lasso.fit(X_treinamento, y_treinamento)" + ], + 
"execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,\n", + " normalize=False, positive=False, precompute=False, random_state=None,\n", + " selection='cyclic', tol=0.0001, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 125 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "h6DSEHc1gF0V", + "outputId": "f057ebcb-1604-42d7-86ef-cc2fefa1e104", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "lasso.coef_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-0.07, 0.05, 0.00, 1.57, -0.00, 3.78, -0.01, -1.06, 0.26, -0.01,\n", + " -0.78, 0.01, -0.59])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 126 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8SzYnpVGy4cy" + }, + "source": [ + "### Coeficientes do LASSO:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O2w2QDmdxxVe", + "outputId": "04cf1b03-7cc0-496a-ed56-b58c8caadcb1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 238 + } + }, + "source": [ + "list(zip(X_treinamento.columns, abs(lasso.coef_)))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[('crim', 0.06530501689285828),\n", + " ('zn', 0.04699294932304524),\n", + " ('indus', 0.002030456305853612),\n", + " ('chas', 1.5663885184641415),\n", + " ('nox', 0.0),\n", + " ('rm', 3.779546713514268),\n", + " ('age', 0.006404324032734558),\n", + " ('dis', 1.0612931166345525),\n", + " ('rad', 0.2580730613206583),\n", + " ('tax', 0.014270830653978057),\n", + " ('ptratio', 0.7817739916684686),\n", + " ('b', 0.009950918490594119),\n", + " ('lstat', 0.5874528237350962)]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 127 + } + ] + }, 
+ { + "cell_type": "markdown", + "metadata": { + "id": "UBOCg1H9zn6A" + }, + "source": [ + "### Ridge coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g1fF-mEZzXpH", + "outputId": "ff52cc03-8289-47f1-9609-97830355a798", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 238 + } + }, + "source": [ + "list(zip(X_treinamento.columns, abs(ridge.coef_)))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[('crim', 0.08087280884194979),\n", + " ('zn', 0.0431105323320636),\n", + " ('indus', 0.06967744483334821),\n", + " ('chas', 3.144789492713716),\n", + " ('nox', 17.9983019701622),\n", + " ('rm', 3.9867565296916703),\n", + " ('age', 0.0035446489044452497),\n", + " ('dis', 1.3530395756206453),\n", + " ('rad', 0.29504291572154007),\n", + " ('tax', 0.012511527307639232),\n", + " ('ptratio', 0.9682821087614826),\n", + " ('b', 0.009027440635645128),\n", + " ('lstat', 0.5291356457993021)]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 128 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xP1fX1Bi6VdX" + }, + "source": [ + "**Conclusion**: variables whose coefficient is zero can be dropped from the analysis/model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TbtxIWyGSXkH" + }, + "source": [ + "### Effect of the values of $\\alpha$\n", + "* Function adapted from https://chrisalbon.com/machine_learning/linear_regression/effect_of_alpha_on_lasso_regression/." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B4AuWA4LRBE3" + }, + "source": [ + "# Create a function called lasso,\n", + "def lasso(alphas):\n", + "    '''\n", + "    Takes in a list of alphas. 
Outputs a dataframe containing the coefficients of lasso regressions from each alpha.\n", + "    '''\n", + "    # Create an empty data frame\n", + "    df = pd.DataFrame()\n", + "\n", + "    # Create a column of feature names\n", + "    df['Feature Name'] = names\n", + "\n", + "    # For each alpha value in the list of alpha values,\n", + "    for alpha in alphas:\n", + "        # Create a lasso regression with that alpha value,\n", + "        lasso = Lasso(alpha = alpha)\n", + "\n", + "        # Fit the lasso regression\n", + "        lasso.fit(X_treinamento, y_treinamento)\n", + "\n", + "        # Create a column name for that alpha value\n", + "        column_name = 'Alpha = %f' % alpha\n", + "\n", + "        # Create a column of coefficient values\n", + "        df[column_name] = lasso.coef_\n", + "\n", + "    # Return the dataframe\n", + "    return df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VEDvXvuNRK0C", + "outputId": "829e9736-e54f-4059-8da3-89e353d59263", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 452 + } + }, + "source": [ + "names = X_treinamento.columns\n", + "lasso([.0001, .001, .01, .1, 1, 10, 1000])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Feature NameAlpha = 0.000100Alpha = 0.001000Alpha = 0.010000Alpha = 0.100000Alpha = 1.000000Alpha = 10.000000Alpha = 1000.000000
0crim-0.082177-0.081898-0.079113-0.065305-0.043175-0.000000-0.0
1zn0.0428070.0428700.0435010.0469930.0465110.0229070.0
2indus0.0754670.0742630.0622170.0020300.000000-0.000000-0.0
3chas3.1618173.1468792.9974911.5663890.0000000.0000000.0
4nox-19.459924-19.148839-16.038598-0.000000-0.0000000.000000-0.0
5rm3.9815023.9805423.9709003.7795470.7671230.0000000.0
6age0.0047840.0045580.002299-0.0064040.0277000.000000-0.0
7dis-1.373444-1.368773-1.322091-1.061293-0.603672-0.0000000.0
8rad0.2988000.2980600.2906670.2580730.2630290.000000-0.0
9tax-0.012399-0.012425-0.012688-0.014271-0.014111-0.007210-0.0
10ptratio-0.984286-0.980948-0.947575-0.781774-0.754648-0.000000-0.0
11b0.0089510.0089670.0091300.0099510.0082250.0076680.0
12lstat-0.526561-0.527311-0.534805-0.587453-0.800866-0.600753-0.0
\n", + "
" + ], + "text/plain": [ + " Feature Name Alpha = 0.000100 ... Alpha = 10.000000 Alpha = 1000.000000\n", + "0 crim -0.082177 ... -0.000000 -0.0\n", + "1 zn 0.042807 ... 0.022907 0.0\n", + "2 indus 0.075467 ... -0.000000 -0.0\n", + "3 chas 3.161817 ... 0.000000 0.0\n", + "4 nox -19.459924 ... 0.000000 -0.0\n", + "5 rm 3.981502 ... 0.000000 0.0\n", + "6 age 0.004784 ... 0.000000 -0.0\n", + "7 dis -1.373444 ... -0.000000 0.0\n", + "8 rad 0.298800 ... 0.000000 -0.0\n", + "9 tax -0.012399 ... -0.007210 -0.0\n", + "10 ptratio -0.984286 ... -0.000000 -0.0\n", + "11 b 0.008951 ... 0.007668 0.0\n", + "12 lstat -0.526561 ... -0.600753 -0.0\n", + "\n", + "[13 rows x 8 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 130 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jSYw6SdcXa0q" + }, + "source": [ + "### Cross-Validation & GridSearch para LASSO" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "irFZAkvVXfya" + }, + "source": [ + "from sklearn.linear_model import LassoCV\n", + "from sklearn.model_selection import RepeatedKFold" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "T3Jjom8RYdly" + }, + "source": [ + "# define model evaluation method\n", + "cv = RepeatedKFold(n_splits = 5, n_repeats = 3, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Cw3lAvRPYgJe" + }, + "source": [ + "# define model\n", + "model = LassoCV(alphas = np.arange(0.001, 10, 0.001), cv = cv, n_jobs = -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oLX3CpThXvkJ", + "outputId": "e5b49993-c7a8-49d7-f11b-2c7ea63d6cba", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 102 + } + }, + "source": [ + "# fit model\n", + "model.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": 
"execute_result", + "data": { + "text/plain": [ + "LassoCV(alphas=array([0.00, 0.00, 0.00, ..., 10.00, 10.00, 10.00]), copy_X=True,\n", + " cv=RepeatedKFold(n_repeats=3, n_splits=10, random_state=1), eps=0.001,\n", + " fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=-1,\n", + " normalize=False, positive=False, precompute='auto', random_state=None,\n", + " selection='cyclic', tol=0.0001, verbose=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 180 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U1ubd5huYQ7u", + "outputId": "a847a297-a66c-449b-feeb-491827b16bd3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# summarize chosen configuration\n", + "print('alpha: %f' % model.alpha_)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "alpha: 0.001000\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9P7hYoo4gF0Z" + }, + "source": [ + "# Elastic Net \n", + "* Combina o poder de Ridge e LASSO;\n", + "* Remove variáveis de pouco poder preditivo (LASSO) ou as penaliza (Ridge)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yChNUYs7gF0b" + }, + "source": [ + "from sklearn.linear_model import ElasticNet\n", + "from sklearn.model_selection import GridSearchCV\n", + "\n", + "# Instancia o objeto\n", + "en = ElasticNet(alpha = .1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4mbIaAUAF4N6", + "outputId": "3a7bd712-d468-46bf-ca8a-bf783249e5af", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "en.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.5,\n", + " max_iter=1000, normalize=False, positive=False, precompute=False,\n", + " random_state=None, selection='cyclic', tol=0.0001, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 132 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MaUkZw8ngF0h", + "outputId": "9e417454-b4c6-4c50-8029-afee0cd8020b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "en.coef_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-0.07, 0.05, 0.00, 1.32, -0.12, 3.29, -0.00, -1.08, 0.28, -0.02,\n", + " -0.81, 0.01, -0.62])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 133 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xl-Qh9caDyCp" + }, + "source": [ + "# Instancia o objeto:\n", + "en = ElasticNet(normalize = True)\n", + "\n", + "# Otimização dos hiperparâmetros:\n", + "d_hiperparametros = {'alpha': np.logspace(-5, 2, 8), \n", + " 'l1_ratio': [.2, .4, .6, .8]}\n", + "\n", + "search = GridSearchCV(estimator = en, \n", + " param_grid = d_hiperparametros, \n", + " scoring = 'neg_mean_squared_error', \n", + " n_jobs = 1,\n", + " refit = 
True, \n", + " cv = 10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "c3_XCQCPGlr3", + "outputId": "7407462c-8b3a-472b-e9d4-3878344970e3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "search.fit(X_treinamento, y_treinamento)\n", + "search.best_params_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'alpha': 0.0001, 'l1_ratio': 0.4}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 135 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq0_ugQfGrdb", + "outputId": "317e8866-1996-40a2-f7df-698d196a5f67", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Hyperparameters set manually here; the grid search above selected alpha = 0.0001, l1_ratio = 0.4\n", + "en2 = ElasticNet(normalize = True, alpha = 0.001, l1_ratio = 0.6)\n", + "en2.fit(X_treinamento, y_treinamento)\n", + "\n", + "# Metric: MSE on the test set\n", + "ml2 = (mean_squared_error(y_true = y_teste, y_pred = en2.predict(X_teste)))\n", + "print(ml2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "15.410850398354441\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5geUMgC6ztxE" + }, + "source": [ + "### Elastic Net coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LyLdASRqzwCq" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(en2.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "90pfP9-3OkJG" + }, + "source": [ + "Any coefficient estimated as 0 indicates a variable that can be excluded from the model." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ILCXvYKDOkJH" + }, + "source": [ + "### Elastic Net example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GaQPDCR2OkJI" + }, + "source": [ + "from sklearn.linear_model import ElasticNet\n", + "\n", + "# Instantiate the estimator\n", + "en = ElasticNet(alpha = .1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xVp16Eu_OkJL", + "outputId": "e1152c6f-9129-4ac6-90ea-d48280820550", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "en.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.5,\n", + " max_iter=1000, normalize=False, positive=False, precompute=False,\n", + " random_state=None, selection='cyclic', tol=0.0001, warm_start=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 139 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kwj018U8OkJO", + "outputId": "9261a5fb-80c1-4623-dd6f-42ad3a5f16fa", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "en.coef_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + 
"data": { + "text/plain": [ + "array([0.33, 0.33])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 140 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rJRWBzSQCcss" + }, + "source": [ + "# Regressão Logística" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XwuMfMD1gFyd" + }, + "source": [ + "# Exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "efF3st3sHxPG" + }, + "source": [ + "# Carrega as bibliotecas\n", + "import numpy as np\n", + "np.set_printoptions(formatter = {'float': lambda x: \"{0:0.2f}\".format(x)})\n", + "\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.model_selection import train_test_split\n", + "import statsmodels.api as sm\n", + "\n", + "%matplotlib inline" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bk9F6JO0IELv", + "outputId": "1e85bafb-5da0-4b1e-e95f-ac65df8388a5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "# Carregar/ler o banco de dados - Dataframe Diabetes\n", + "from sklearn import datasets\n", + "#Diabetes = datasets.load_diabetes()\n", + "\n", + "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/diabetes.csv'\n", + "diabetes = pd.read_csv(url)\n", + "diabetes.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PregnanciesGlucoseBloodPressureSkinThicknessInsulinBMIDiabetesPedigreeFunctionAgeOutcome
061487235033.60.627501
11856629026.60.351310
28183640023.30.672321
318966239428.10.167210
40137403516843.12.288331
\n", + "
" + ], + "text/plain": [ + " Pregnancies Glucose BloodPressure ... DiabetesPedigreeFunction Age Outcome\n", + "0 6 148 72 ... 0.627 50 1\n", + "1 1 85 66 ... 0.351 31 0\n", + "2 8 183 64 ... 0.672 32 1\n", + "3 1 89 66 ... 0.167 21 0\n", + "4 0 137 40 ... 2.288 33 1\n", + "\n", + "[5 rows x 9 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 142 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tjRmpaPIDknb", + "outputId": "f6b56d34-1ebb-4dd4-e92b-06846299edf3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "# Definir as matrizes X e y\n", + "X_diabetes = diabetes.copy()\n", + "X_diabetes.drop(columns = ['Outcome'], axis = 1, inplace = True)\n", + "y_diabetes = diabetes['Outcome']\n", + "\n", + "X_diabetes.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PregnanciesGlucoseBloodPressureSkinThicknessInsulinBMIDiabetesPedigreeFunctionAge
061487235033.60.62750
11856629026.60.35131
28183640023.30.67232
318966239428.10.16721
40137403516843.12.28833
\n", + "
" + ], + "text/plain": [ + " Pregnancies Glucose BloodPressure ... BMI DiabetesPedigreeFunction Age\n", + "0 6 148 72 ... 33.6 0.627 50\n", + "1 1 85 66 ... 26.6 0.351 31\n", + "2 8 183 64 ... 23.3 0.672 32\n", + "3 1 89 66 ... 28.1 0.167 21\n", + "4 0 137 40 ... 43.1 2.288 33\n", + "\n", + "[5 rows x 8 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 143 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jLrx69TH-Mad", + "outputId": "b090efbf-e878-4e97-b6bb-6eaa74f942f5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "X_diabetes.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(768, 8)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 144 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mdFBioP6-Ply", + "outputId": "029ee89e-63be-444e-c28f-da5abafcc73e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "y_diabetes.shape" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(768,)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 145 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fhLySN65IaDF" + }, + "source": [ + "# Definir as matrizes de treinamento e validação\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_diabetes, y_diabetes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J5R8HlnuIGpL", + "outputId": "366b8993-ba2e-4cc2-a4c8-d4785645bc23", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "# Usando statmodels:\n", + "x = sm.add_constant(X_treinamento)\n", + "lr_sm = sm.Logit(y_treinamento, X_treinamento) # Atenção: aqui é o contrário: [y, x]\n", + "\n", + "# Treinar o modelo\n", + 
"lr.fit(X_treinamento, y_treinamento)\n", + "resultado_sm = lr_sm.fit()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Optimization terminated successfully.\n", + " Current function value: 0.605992\n", + " Iterations 5\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GlbCaPp1ETNa", + "outputId": "8e1be25a-2500-4432-d4e1-21d94cd3adbb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 374 + } + }, + "source": [ + "resultado_sm.summary()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "
Logit Regression Results
Dep. Variable: Outcome No. Observations: 576
Model: Logit Df Residuals: 568
Method: MLE Df Model: 7
Date: Tue, 27 Oct 2020 Pseudo R-squ.: 0.06151
Time: 18:36:59 Log-Likelihood: -349.05
converged: True LL-Null: -371.93
Covariance Type: nonrobust LLR p-value: 9.756e-08
\n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "
coef std err z P>|z| [0.025 0.975]
Pregnancies 0.1260 0.034 3.719 0.000 0.060 0.192
Glucose 0.0129 0.003 4.135 0.000 0.007 0.019
BloodPressure -0.0320 0.005 -6.038 0.000 -0.042 -0.022
SkinThickness 0.0074 0.007 1.043 0.297 -0.007 0.021
Insulin -0.0001 0.001 -0.125 0.900 -0.002 0.002
BMI -0.0074 0.012 -0.596 0.551 -0.032 0.017
DiabetesPedigreeFunction 0.1565 0.270 0.580 0.562 -0.373 0.686
Age -0.0096 0.010 -0.962 0.336 -0.029 0.010
" + ], + "text/plain": [ + "\n", + "\"\"\"\n", + " Logit Regression Results \n", + "==============================================================================\n", + "Dep. Variable: Outcome No. Observations: 576\n", + "Model: Logit Df Residuals: 568\n", + "Method: MLE Df Model: 7\n", + "Date: Tue, 27 Oct 2020 Pseudo R-squ.: 0.06151\n", + "Time: 18:36:59 Log-Likelihood: -349.05\n", + "converged: True LL-Null: -371.93\n", + "Covariance Type: nonrobust LLR p-value: 9.756e-08\n", + "============================================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "--------------------------------------------------------------------------------------------\n", + "Pregnancies 0.1260 0.034 3.719 0.000 0.060 0.192\n", + "Glucose 0.0129 0.003 4.135 0.000 0.007 0.019\n", + "BloodPressure -0.0320 0.005 -6.038 0.000 -0.042 -0.022\n", + "SkinThickness 0.0074 0.007 1.043 0.297 -0.007 0.021\n", + "Insulin -0.0001 0.001 -0.125 0.900 -0.002 0.002\n", + "BMI -0.0074 0.012 -0.596 0.551 -0.032 0.017\n", + "DiabetesPedigreeFunction 0.1565 0.270 0.580 0.562 -0.373 0.686\n", + "Age -0.0096 0.010 -0.962 0.336 -0.029 0.010\n", + "============================================================================================\n", + "\"\"\"" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 148 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-FJaSnJLKICU", + "outputId": "bf103788-6bb8-4755-a4ea-f06fc8bd8514", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# EQM - Erro Quadrático Médio\n", + "np.mean((resultado_sm.predict(X_teste) - y_teste) ** 2) " + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.2153796977006658" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 149 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bVEUSTUPzOj" + }, + "source": [ + 
"### Calcular y_pred - os valores preditos de y" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OjGrNhTNLcr-" + }, + "source": [ + "y_pred = resultado_sm.predict(X_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vfS5RCx_VnGT", + "outputId": "05b0e59b-60ce-4bbe-d582-f983ea39d7a1", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + } + }, + "source": [ + "compara = list(zip(np.array(diabetes['Outcome']), resultado_sm.predict()))\n", + "compara[0:30]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[(1, 0.4964813717283173),\n", + " (0, 0.43813015739999056),\n", + " (1, 0.3351229393597726),\n", + " (0, 0.4170598535890674),\n", + " (1, 0.33525360127159953),\n", + " (0, 0.33954024324811855),\n", + " (1, 0.7161856828807713),\n", + " (0, 0.30737583540608177),\n", + " (1, 0.7463860219809548),\n", + " (1, 0.5224158390283206),\n", + " (0, 0.6617811831292351),\n", + " (1, 0.28390512013813257),\n", + " (0, 0.30820517679884585),\n", + " (1, 0.4402216426898358),\n", + " (1, 0.5364393605837372),\n", + " (1, 0.20147010634149431),\n", + " (1, 0.3084811467691899),\n", + " (1, 0.549873161998823),\n", + " (0, 0.3078243845960755),\n", + " (1, 0.3943683761693644),\n", + " (0, 0.32026940349835475),\n", + " (0, 0.49754641616990125),\n", + " (1, 0.6901893365750147),\n", + " (1, 0.6287550079979413),\n", + " (1, 0.42408221478347446),\n", + " (1, 0.3478297305544648),\n", + " (1, 0.31084157363014775),\n", + " (0, 0.23512026465552527),\n", + " (0, 0.21009588386462008),\n", + " (0, 0.428293478648919)]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 151 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pUxasncIFaw4", + "outputId": "91683144-9981-4319-9747-d5711c223640", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "resultado_sm.pred_table()" + 
], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[326.00, 50.00],\n", + " [127.00, 73.00]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 152 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_liLYinwFgch", + "outputId": "fd843722-cc87-4435-b232-fdeeb269dbf5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 111 + } + }, + "source": [ + "confusion_matrix = pd.DataFrame(resultado_sm.pred_table())\n", + "confusion_matrix.columns = ['Predicted No Diabetes', 'Predicted Diabetes']\n", + "confusion_matrix = confusion_matrix.rename(index = {0 : 'Actual No Diabetes', 1 : 'Actual Diabetes'})\n", + "confusion_matrix" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Predicted No DiabetesPredicted Diabetes
Actual No Diabetes326.050.0
Actual Diabetes127.073.0
\n", + "
" + ], + "text/plain": [ + " Predicted No Diabetes Predicted Diabetes\n", + "Actual No Diabetes 326.0 50.0\n", + "Actual Diabetes 127.0 73.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 153 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ceH3MODWFm7S", + "outputId": "d8352417-9d66-49c3-ed3f-d12cb0d89dc3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "cm = np.array(confusion_matrix)\n", + "training_accuracy = (cm[0,0] + cm[1,1])/ cm.sum()\n", + "training_accuracy" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.6927083333333334" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 154 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CH_iEuzhO109" + }, + "source": [ + "# Exercício 1 - Mall_Customers.csv\n", + "> A variável-target deste dataframe é 'Annual Income'. Desenvolva um modelo de regressão utilizando OLS, Ridge e LASSO e compare os resultados.\n", + "\n", + "* Experimente:\n", + " * Lasso(alpha = 0.01, max_iter = 10e5);\n", + " * Lasso(alpha = 0.0001, max_iter = 10e5);\n", + " * Ridge(alpha = 0.01);\n", + " * Ridge(alpha = 100);" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZfRDEaaRYxFQ" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn import preprocessing\n", + "import matplotlib.pyplot as plt \n", + "plt.rc(\"font\", size=14)\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.model_selection import train_test_split\n", + "import seaborn as sns\n", + "sns.set(style=\"white\")\n", + "sns.set(style=\"whitegrid\", color_codes=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nulrLzUqYxFY" + }, + "source": [ + "## Dados\n", + "\n", + "The data is related with direct marketing campaigns (phone calls) of a Portuguese banking 
institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4LdrQCwxYxFY" + }, + "source": [ + "This dataset provides the customer information. It includes 41188 records and 21 fields." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qoT6zkoFYxFZ", + "outputId": "517870d0-203d-48b1-dbd2-74dfeab2b8ec", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "df_bank = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/bank-full.csv', header = 0)\n", + "df_bank = df_bank.dropna()\n", + "print(df_bank.shape)\n", + "print(list(df_bank.columns))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "(45211, 1)\n", + "['age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"']\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZD23hMCeYxFc", + "outputId": "56468732-3d3c-4cb0-d5f5-1753495f2d31", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "df_bank.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"
058;\"management\";\"married\";\"tertiary\";\"no\";2143...
144;\"technician\";\"single\";\"secondary\";\"no\";29;\"...
233;\"entrepreneur\";\"married\";\"secondary\";\"no\";2...
347;\"blue-collar\";\"married\";\"unknown\";\"no\";1506...
433;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n...
\n", + "
" + ], + "text/plain": [ + " age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"\n", + "0 58;\"management\";\"married\";\"tertiary\";\"no\";2143... \n", + "1 44;\"technician\";\"single\";\"secondary\";\"no\";29;\"... \n", + "2 33;\"entrepreneur\";\"married\";\"secondary\";\"no\";2... \n", + "3 47;\"blue-collar\";\"married\";\"unknown\";\"no\";1506... \n", + "4 33;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n... " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 157 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CtGbim_EYxFh" + }, + "source": [ + "#### Input variables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0pJ7ai5ZYxFh" + }, + "source": [ + "1 - age (numeric)\n", + "\n", + "2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')\n", + "\n", + "3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)\n", + "\n", + "4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')\n", + "\n", + "5 - default: has credit in default? (categorical: 'no','yes','unknown')\n", + "\n", + "6 - housing: has housing loan? (categorical: 'no','yes','unknown')\n", + "\n", + "7 - loan: has personal loan? (categorical: 'no','yes','unknown')\n", + "\n", + "8 - contact: contact communication type (categorical: 'cellular','telephone')\n", + "\n", + "9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')\n", + "\n", + "10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')\n", + "\n", + "11 - duration: last contact duration, in seconds (numeric). 
Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.\n", + "\n", + "12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n", + "\n", + "13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)\n", + "\n", + "14 - previous: number of contacts performed before this campaign and for this client (numeric)\n", + "\n", + "15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')\n", + "\n", + "16 - emp.var.rate: employment variation rate - (numeric)\n", + "\n", + "17 - cons.price.idx: consumer price index - (numeric)\n", + "\n", + "18 - cons.conf.idx: consumer confidence index - (numeric) \n", + "\n", + "19 - euribor3m: euribor 3 month rate - (numeric)\n", + "\n", + "20 - nr.employed: number of employees - (numeric)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YwsaBV_OYxFi" + }, + "source": [ + "#### Predict variable (desired target):\n", + "\n", + "y - has the client subscribed a term deposit? (binary: '1','0')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2SsNWV_SYxFj" + }, + "source": [ + "The education column of the dataset has many categories and we need to reduce the categories for a better modelling. 
The education column has the following categories:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6TFbgh3vYxFk", + "outputId": "dbc5f647-726c-4951-e037-ce0ba2c0b5fd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 561 + } + }, + "source": [ + "df_bank['education'].unique()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2894\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2895\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2896\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + 
"\u001b[0;31mKeyError\u001b[0m: 'education'", + "\nThe above exception was the direct cause of the following exception:\n", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf_bank\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'education'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2900\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2901\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2902\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2903\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2904\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + 
"\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2895\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2896\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2897\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2898\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2899\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'education'" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "luv7Bdf_YxFn" + }, + "source": [ + "Let us group \"basic.4y\", \"basic.9y\" and \"basic.6y\" together and call them \"basic\"." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gkOlUOs2YxFn" + }, + "source": [ + "df_bank['education']=np.where(df_bank['education'] =='basic.9y', 'Basic', df_bank['education'])\n", + "df_bank['education']=np.where(df_bank['education'] =='basic.6y', 'Basic', df_bank['education'])\n", + "df_bank['education']=np.where(df_bank['education'] =='basic.4y', 'Basic', df_bank['education'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H-X1WMv2YxFq" + }, + "source": [ + "After grouping, the education column has these categories:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r9LlgpkjYxFq" + }, + "source": [ + "df_bank['education'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fcnJy3KYYxFt" + }, + "source": [ + "### Data exploration" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qUrTMR8BYxFt" + }, + "source": [ + "df_bank['y'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rpzHnzJKYxFx" + }, + "source": [ + "sns.countplot(x='y',data=df_bank, palette='hls')\n", + "# Save before plt.show(): once the figure has been shown, savefig writes an empty image\n", + "plt.savefig('count_plot')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C99nOe3mYxF0" + }, + "source": [ + "There are 36548 no's and 4640 yes's in the outcome variable."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8nGaox_kYxF1" + }, + "source": [ + "Let's get a sense of the numbers across the two classes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sQvzA60bYxF1" + }, + "source": [ + "# numeric_only=True restricts the means to numeric columns (recent pandas raises on text columns otherwise)\n", + "df_bank.groupby('y').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u3xjoceKYxF3" + }, + "source": [ + "Observations:\n", + "\n", + "* The average age of customers who bought the term deposit is higher than that of the customers who didn't.\n", + "* The pdays value (days since the customer was last contacted) is understandably lower for the customers who bought it. The lower the pdays, the fresher the memory of the last call and hence the better the chances of a sale.\n", + "* Surprisingly, campaign (the number of contacts or calls made during the current campaign) is lower for customers who bought the term deposit." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jvzGMePPYxF4" + }, + "source": [ + "We can calculate group means for other categorical variables, such as job, marital status and education, to get a more detailed sense of our data."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RqLVMjoxYxF5" + }, + "source": [ + "df_bank.groupby('job').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GTUeRJAtYxF7" + }, + "source": [ + "df_bank.groupby('marital').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xsxdFumiYxF9" + }, + "source": [ + "df_bank.groupby('education').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3i1DCWV-YxGA" + }, + "source": [ + "Visualizations" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OEArHQPbYxGB" + }, + "source": [ + "%matplotlib inline\n", + "pd.crosstab(df_bank.job,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Job Title')\n", + "plt.xlabel('Job')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('purchase_fre_job')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PNwo5du_YxGD" + }, + "source": [ + "The frequency of purchase of the deposit depends a great deal on the job title. Thus, the job title can be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eM7CWfAZYxGE" + }, + "source": [ + "table=pd.crosstab(df_bank.marital,df_bank.y)\n", + "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n", + "plt.title('Stacked Bar Chart of Marital Status vs Purchase')\n", + "plt.xlabel('Marital Status')\n", + "plt.ylabel('Proportion of Customers')\n", + "plt.savefig('mariral_vs_pur_stack')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LWBLh7toYxGG" + }, + "source": [ + "It is hard to see, but marital status does not seem to be a strong predictor of the outcome variable."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vh_u4QphYxGH" + }, + "source": [ + "table=pd.crosstab(df_bank.education,df_bank.y)\n", + "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n", + "plt.title('Stacked Bar Chart of Education vs Purchase')\n", + "plt.xlabel('Education')\n", + "plt.ylabel('Proportion of Customers')\n", + "plt.savefig('edu_vs_pur_stack')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d9AgJroYYxGK" + }, + "source": [ + "Education seems a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dHI2LT-IYxGL" + }, + "source": [ + "pd.crosstab(df_bank.day_of_week,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Day of Week')\n", + "plt.xlabel('Day of Week')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_dayofweek_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3A2jmS4MYxGR" + }, + "source": [ + "Day of week may not be a good predictor of the outcome" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bzafDBHpYxGS" + }, + "source": [ + "pd.crosstab(df_bank.month,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Month')\n", + "plt.xlabel('Month')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_fre_month_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x5CBtquEYxGW" + }, + "source": [ + "Month might be a good predictor of the outcome variable" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tgF_3SqWYxGY" + }, + "source": [ + "df_bank.age.hist()\n", + "plt.title('Histogram of Age')\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Frequency')\n", + "plt.savefig('hist_age')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + 
"id": "y0FhKYDsYxGc" + }, + "source": [ + "The most of the customers of the bank in this dataset are in the age range of 30-40." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5Nd3yV7DYxGd" + }, + "source": [ + "pd.crosstab(df_bank.poutcome,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Poutcome')\n", + "plt.xlabel('Poutcome')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_fre_pout_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oRKUAGrjYxGh" + }, + "source": [ + "Poutcome seems to be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "63RLRI9uYxGi" + }, + "source": [ + "### Create dummy variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V8S4WUKmYxGj" + }, + "source": [ + "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", + "for var in cat_vars:\n", + " cat_list='var'+'_'+var\n", + " cat_list = pd.get_dummies(df_bank[var], prefix=var)\n", + " df_bank1=df_bank.join(cat_list)\n", + " data=df_bank1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uX3w9i9WYxGl" + }, + "source": [ + "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", + "df_bank_vars=df_bank.columns.values.tolist()\n", + "to_keep=[i for i in df_bank_vars if i not in cat_vars]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cMX_82xaYxGq" + }, + "source": [ + "df_bank_final=df_bank[to_keep]\n", + "df_bank_final.columns.values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LkTjpxYoYxGr" + }, + "source": [ + "df_bank_final_vars=df_bank_final.columns.values.tolist()\n", + "y=['y']\n", + "X=[i for i in 
df_bank_final_vars if i not in y]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2QbKaRcsYxGt" + }, + "source": [ + "### Feature Selection" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EkxjW1AQYxGu" + }, + "source": [ + "from sklearn import datasets\n", + "from sklearn.feature_selection import RFE\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "logreg = LogisticRegression()\n", + "\n", + "# Pass the number of features by keyword; recent scikit-learn versions reject it as a positional argument\n", + "rfe = RFE(logreg, n_features_to_select=18)\n", + "# ravel() turns the single-column target frame into the 1-D array that fit() expects\n", + "rfe = rfe.fit(df_bank_final[X], df_bank_final[y].values.ravel())\n", + "print(rfe.support_)\n", + "print(rfe.ranking_)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2P9hd4jHYxGw" + }, + "source": [ + "Recursive Feature Elimination (RFE) has helped us select the following features: \"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5PW8WZX_YxGx" + }, + "source": [ + "cols=[\"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \n", + " \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \n", + " \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"] \n", + "X=df_bank_final[cols]\n", + "y=df_bank_final['y']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ix0mN9qxYxG0" + }, + "source": [ + "### Implementing the model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Hbx2bwtiYxG0" + }, + "source": [ + "import statsmodels.api as sm\n", + "logit_model=sm.Logit(y,X)\n", + "result=logit_model.fit()\n", + "print(result.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HR1ui-UcYxG2" + }, + "source": [ + "The p-values for most of the variables are very small, so most of them are statistically significant in the model." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9GHhrsaeYxG3" + }, + "source": [ + "### Logistic Regression Model Fitting" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFQnH5MzYxG3" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn import metrics\n", + "# Hold out 30% of the rows as the test set\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", + "logreg = LogisticRegression()\n", + "logreg.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YUa3QL7tYxG6" + }, + "source": [ + "#### Predicting the test set results and calculating the accuracy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SD-y2e33YxG6" + }, + "source": [ + "y_pred = logreg.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kkPWzos7YxG-" + }, + "source": [ + "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwC3rt_6YxHA" + }, + "source": [ + "### Cross Validation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Muw50oqSYxHB" + }, + "source": [ + "from sklearn import model_selection\n", + "from sklearn.model_selection import cross_val_score\n", + "# random_state only takes effect (and is only accepted) when shuffle=True\n", + "kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)\n", + "modelCV = LogisticRegression()\n", + "scoring = 'accuracy'\n", + "results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)\n", + "print(\"10-fold cross validation average accuracy: %.3f\" % (results.mean()))" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4y8XCTqoYxHE" + }, + "source": [ + "### Confusion Matrix" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"BCza9NkVYxHE" + }, + "source": [ + "from sklearn.metrics import confusion_matrix\n", + "# Store the result under a different name so the imported function is not shadowed\n", + "cm = confusion_matrix(y_test, y_pred)\n", + "print(cm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X9SapwS2YxHG" + }, + "source": [ + "The confusion matrix shows 10872 + 254 = 11126 correct predictions and 1122 + 109 = 1231 incorrect predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bEWvWScYxHG" + }, + "source": [ + "#### Accuracy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NaH2nESwYxHH" + }, + "source": [ + "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6oxlhbpYxHJ" + }, + "source": [ + "#### Compute precision, recall, F-measure and support\n", + "\n", + "The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.\n", + "\n", + "The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.\n", + "\n", + "The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.\n", + "\n", + "The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.\n", + "\n", + "The support is the number of occurrences of each class in y_test."
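
The ratios defined above can be checked by hand from the four confusion-matrix counts quoted earlier. The sketch below is only an illustration: it assumes the counts sit in the `[[tn, fp], [fn, tp]]` layout that scikit-learn uses for binary labels, with 109 taken as the false positives and 1122 as the false negatives, since the printed matrix itself is not shown. The helper `f_beta` is a hypothetical name.

```python
# Counts quoted in the text; their placement in the matrix is an assumption.
tn, fp = 10872, 109
fn, tp = 1122, 254


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # how many predicted positives were right
recall = tp / (tp + fn)      # how many actual positives were found
f1 = f_beta(precision, recall, beta=1.0)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Under that assumed layout these are the scores for the positive class only; `classification_report` also reports per-class and averaged values, which can differ substantially when the classes are imbalanced.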
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mhN5_p4yYxHK" + }, + "source": [ + "from sklearn.metrics import classification_report\n", + "print(classification_report(y_test, y_pred))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xzSFVEnAYxHP" + }, + "source": [ + "#### Interpretation: \n", + "\n", + "Across the test set, 88% of the term deposits the model promoted were ones the customers actually wanted (precision), and 90% of the term deposits the customers wanted were promoted (recall)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NGXJ6g2nYxHQ" + }, + "source": [ + "### ROC Curve\n", + "\n", + "from sklearn import metrics\n", + "from ggplot import *\n", + "\n", + "prob = clf1.predict_proba(X_test)[:,1]\n", + "fpr, sensitivity, _ = metrics.roc_curve(Y_test, prob)\n", + "\n", + "df = pd.DataFrame(dict(fpr=fpr, sensitivity=sensitivity))\n", + "ggplot(df, aes(x='fpr', y='sensitivity')) +\\\n", + " geom_line() +\\\n", + " geom_abline(linetype='dashed')" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "u9QKDuS0YxHQ" + }, + "source": [ + "from sklearn.metrics import roc_auc_score\n", + "from sklearn.metrics import roc_curve\n", + "# Use predicted probabilities so the reported area matches the plotted curve\n", + "logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])\n", + "fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])\n", + "plt.figure()\n", + "plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)\n", + "plt.plot([0, 1], [0, 1],'r--')\n", + "plt.xlim([0.0, 1.0])\n", + "plt.ylim([0.0, 1.05])\n", + "plt.xlabel('False Positive Rate')\n", + "plt.ylabel('True Positive Rate')\n", + "plt.title('Receiver operating characteristic')\n", + "plt.legend(loc=\"lower right\")\n", + "plt.savefig('Log_ROC')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No 
newline at end of file diff --git "a/Notebooks/NB15_02__Regress\303\243o Linear_hs2.ipynb" 
"b/Notebooks/NB15_02__Regress\303\243o Linear_hs2.ipynb" new file mode 100644 index 000000000..837f0d6a5 --- /dev/null +++ "b/Notebooks/NB15_02__Regress\303\243o Linear_hs2.ipynb" @@ -0,0 +1,10651 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.1" + }, + "colab": { + "name": "NB15_02__Regressão Linear.ipynb", + "provenance": [], + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XwQDhId7N6_r" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

\n", + "

SUPERVISED LEARNING

\n", + "

REGRESSION MODELS (LINEAR AND LOGISTIC)

\n", + "\n", + "Source: https://realpython.com/linear-regression-in-python/\n", + "https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PN-dQFJcM1UV" + }, + "source": [ + "Steps to implement Linear Regression:\n", + "\n", + "* (1) Import the required libraries;\n", + "* (2) Load the data;\n", + "* (3) Apply the required transformations: outliers, NaNs, normalization (MinMaxScaler, RobustScaler, StandardScaler, Log, Box-Cox, etc.);\n", + "* (4) Visualize the data: understand the relationships, distributions, etc. present in the data;\n", + "* (5) Build and train the predictive model (in this case, a regression model);\n", + "* (6) Validate/check the metrics used to evaluate the model(s);\n", + "* (7) Predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8TldGZxAFV5E" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0QRbxlqaq7pr" + }, + "source": [ + "# Improvements for this session:\n", + "* Compute the correlations before and after RIDGE and LASSO to show the multicollinearity and explain why certain columns \"stop\" being important." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P4sAIblOgFyL" + }, + "source": [ + "# Regression Models with Regularization for Classification and Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o7Y7cuJNgFyU" + }, + "source": [ + "## Simple Linear Regression (using OLS - Ordinary Least Squares)\n", + "\n", + "* Features $X_{np}$: an n x p matrix containing the attributes/predictor variables of the dataframe (independent variables);\n", + "* The target/dependent variable is represented by y;\n", + "* The relationship between X and y is represented by the equation below, where $w_{i}$ is the weight of each coefficient and $w_{0}$ is the intercept." 
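
For a single feature, the OLS weights described above have a closed form: the slope $w_{1}$ is the covariance of x and y divided by the variance of x, and the intercept $w_{0}$ places the line through the point of means. A minimal pure-Python sketch on made-up data (the function name `fit_simple_ols` and the sample values are illustrative assumptions, not part of the notebook):

```python
def fit_simple_ols(xs, ys):
    """Return (w0, w1) minimizing the sum of squared residuals of y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx              # slope
    w0 = mean_y - w1 * mean_x   # intercept
    return w0, w1


# Made-up data generated from y = 1 + 2x, so the fit should recover w0=1, w1=2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w0, w1 = fit_simple_ols(xs, ys)
y_pred = [w0 + w1 * x for x in xs]
rss = sum((y - yp) ** 2 for y, yp in zip(ys, y_pred))
print(w0, w1, rss)
```

With several features the same idea applies, but the weights are obtained by solving a linear system, which is what a library estimator does internally.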
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NpJ580y9gFyU" + }, + "source": [ + "\n", + "\n", + "![X_y](https://github.com/MathMachado/Materials/blob/master/Architecture.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5rhbVGJ0gFyY" + }, + "source": [ + "* Residual Sum of Squares (RSS) - the sum of the squared differences between the observed and predicted values, $RSS = \\sum_{i=1}^{n}(y_{i} - \\hat{y}_{i})^{2}$.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u8gA0YkbgFyp" + }, + "source": [ + "## Main parameters of the algorithm:\n", + "* fit_intercept - Indicates whether the intercept $w_{0}$ should be fitted. If the data is already centered, fitting the intercept $w_{0}$ is unnecessary.\n", + "\n", + "* normalize - $X$ is normalized automatically before fitting (the mean is subtracted and the result is divided by the L2 norm);\n", + "\n", + "## Attributes of the Machine Learning regression model\n", + "* coef_ - weight/factor of each independent variable in the ML model;\n", + "\n", + "* intercept_ ($w_{0}$) - intercept or bias of $y$;\n", + "\n", + "## Functions for fitting the model:\n", + "* fit - trains the model with the matrices $X$ and $y$;\n", + "* predict - once the model has been trained, use it on a given $X$ to compute the predicted values of $y$ (y_pred).\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A-JG8El1gFy7" + }, + "source": [ + "# Limitations of OLS (Ordinary Least Squares):\n", + "* Sensitive to outliers;\n", + "* Multicollinearity; \n", + "* Heteroscedasticity - shows up as strong, uneven dispersion of the data around the fitted line;\n", + "\n", + "* References" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xylMYR8COyrw" + }, + "source": [ + "### Import the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2BGgrILlPK6Z" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from scipy import stats" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "263GgbwhO2kQ" + }, + "source": [ + "### Load the data\n", + "* Let's load the [Boston House Pricing](https://archive.ics.uci.edu/ml/datasets/housing) dataset" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1h66x_-rXGhi" + }, + "source": [ + "from sklearn.datasets import load_boston, load_iris" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rWniNkMpXQFU", + "outputId": "5096d239-2c8c-4327-dbf5-f9128faa589c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "boston = load_boston()\n", + "boston" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'DESCR': \".. _boston_dataset:\\n\\nBoston house prices dataset\\n---------------------------\\n\\n**Data Set Characteristics:** \\n\\n :Number of Instances: 506 \\n\\n :Number of Attributes: 13 numeric/categorical predictive. 
Median Value (attribute 14) is usually the target.\\n\\n :Attribute Information (in order):\\n - CRIM per capita crime rate by town\\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\\n - INDUS proportion of non-retail business acres per town\\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\\n - NOX nitric oxides concentration (parts per 10 million)\\n - RM average number of rooms per dwelling\\n - AGE proportion of owner-occupied units built prior to 1940\\n - DIS weighted distances to five Boston employment centres\\n - RAD index of accessibility to radial highways\\n - TAX full-value property-tax rate per $10,000\\n - PTRATIO pupil-teacher ratio by town\\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\\n - LSTAT % lower status of the population\\n - MEDV Median value of owner-occupied homes in $1000's\\n\\n :Missing Attribute Values: None\\n\\n :Creator: Harrison, D. and Rubinfeld, D.L.\\n\\nThis is a copy of UCI ML housing dataset.\\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\\n\\n\\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\\n\\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\\nprices and the demand for clean air', J. Environ. Economics & Management,\\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\\n...', Wiley, 1980. N.B. Various transformations are used in the table on\\npages 244-261 of the latter.\\n\\nThe Boston house-price data has been used in many machine learning papers that address regression\\nproblems. \\n \\n.. topic:: References\\n\\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. 
In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\\n\",\n", + " 'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,\n", + " 4.9800e+00],\n", + " [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,\n", + " 9.1400e+00],\n", + " [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,\n", + " 4.0300e+00],\n", + " ...,\n", + " [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n", + " 5.6400e+00],\n", + " [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,\n", + " 6.4800e+00],\n", + " [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n", + " 7.8800e+00]]),\n", + " 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n", + " 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTAT
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.98
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.14
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.03
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.94
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.33
\n", + "" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 136 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQzFW7DUX_KW", + "outputId": "dcf288db-d99d-4d17-c22c-ceb8a9ba4841", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "# Target/response variable\n", + "df_boston['preco'] = load_boston().target\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATpreco
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
\n", + "
" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT preco\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 137 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H71da4bIO4kI" + }, + "source": [ + "### Data Transformation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K-6YOdsTfciO" + }, + "source": [ + "#### Normalizing/standardizing the column names" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L8OJEapufhq4" + }, + "source": [ + "# Rename the dataframe columns to lowercase\n", + "df_boston.columns = [col.lower() for col in df_boston.columns]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uRinX-5ofol_", + "outputId": "2e67bbbd-792f-4786-8c7e-2d0bd16fd249", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
\n", + "
" + ], + "text/plain": [ + " crim zn indus chas nox ... tax ptratio b lstat preco\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 139 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CMDh5jyqekmr" + }, + "source": [ + "#### Outliers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jJIG0jJQf6em" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FgYPzlvfemFc" + }, + "source": [ + "#### Missing values" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BAjw7UhJen0D", + "outputId": "917a8f23-ec31-4f22-9a46-c3a15c1e4563", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Missing values per column/variable\n", + "df_boston.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "crim 0\n", + "zn 0\n", + "indus 0\n", + "chas 0\n", + "nox 0\n", + "rm 0\n", + "age 0\n", + "dis 0\n", + "rad 0\n", + "tax 0\n", + "ptratio 0\n", + "b 0\n", + "lstat 0\n", + "preco 0\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 140 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jo3UWNpbYnNF", + "outputId": "aeefc57a-f1b7-41ac-aa2e-53f828b9be14", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Number of attributes\n", + "len(load_boston().feature_names)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "13" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 141 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0Yp8g7hxfQli", + "outputId": "43795436-0366-4427-ed5a-2deacedf567f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 49 + } + }, + "source": [ + "# Rows with missing values\n", + "df_boston[df_boston.isnull().any(axis = 1)]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: [crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, preco]\n", + "Index: []" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 142 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5qmkTFLrf9MT" + }, + "source": [ + "#### Descriptive Statistics" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Nprn3p_Wf_bn", + "outputId": "16f46af6-ab9a-4d7b-a875-295817b9bf9c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "df_boston.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
count506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000
mean3.61352411.36363611.1367790.0691700.5546956.28463468.5749013.7950439.549407408.23715418.455534356.67403212.65306322.532806
std8.60154523.3224536.8603530.2539940.1158780.70261728.1488612.1057108.707259168.5371162.16494691.2948647.1410629.197104
min0.0063200.0000000.4600000.0000000.3850003.5610002.9000001.1296001.000000187.00000012.6000000.3200001.7300005.000000
25%0.0820450.0000005.1900000.0000000.4490005.88550045.0250002.1001754.000000279.00000017.400000375.3775006.95000017.025000
50%0.2565100.0000009.6900000.0000000.5380006.20850077.5000003.2074505.000000330.00000019.050000391.44000011.36000021.200000
75%3.67708312.50000018.1000000.0000000.6240006.62350094.0750005.18842524.000000666.00000020.200000396.22500016.95500025.000000
max88.976200100.00000027.7400001.0000000.8710008.780000100.00000012.12650024.000000711.00000022.000000396.90000037.97000050.000000
\n", + "
" + ], + "text/plain": [ + " crim zn indus ... b lstat preco\n", + "count 506.000000 506.000000 506.000000 ... 506.000000 506.000000 506.000000\n", + "mean 3.613524 11.363636 11.136779 ... 356.674032 12.653063 22.532806\n", + "std 8.601545 23.322453 6.860353 ... 91.294864 7.141062 9.197104\n", + "min 0.006320 0.000000 0.460000 ... 0.320000 1.730000 5.000000\n", + "25% 0.082045 0.000000 5.190000 ... 375.377500 6.950000 17.025000\n", + "50% 0.256510 0.000000 9.690000 ... 391.440000 11.360000 21.200000\n", + "75% 3.677083 12.500000 18.100000 ... 396.225000 16.955000 25.000000\n", + "max 88.976200 100.000000 27.740000 ... 396.900000 37.970000 50.000000\n", + "\n", + "[8 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 143 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1JimyY3SgECE" + }, + "source": [ + "#### Correlation Analysis" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jScHq7eTgIpm", + "outputId": "50696c9d-c19a-4937-9189-368be5fb291c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 483 + } + }, + "source": [ + "correlacoes = df_boston.corr()\n", + "correlacoes" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
              crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b     lstat     preco
crim      1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734 -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
zn       -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537  0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
indus     0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779 -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
chas     -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518 -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
nox       0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470 -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
rm       -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265  0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
age       0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000 -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
dis      -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881  1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
rad       0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022 -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
tax       0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456 -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
ptratio   0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515 -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
b        -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534  0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
lstat     0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339 -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
preco    -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955  0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000
\n", + "
" + ], + "text/plain": [ + " crim zn indus ... b lstat preco\n", + "crim 1.000000 -0.200469 0.406583 ... -0.385064 0.455621 -0.388305\n", + "zn -0.200469 1.000000 -0.533828 ... 0.175520 -0.412995 0.360445\n", + "indus 0.406583 -0.533828 1.000000 ... -0.356977 0.603800 -0.483725\n", + "chas -0.055892 -0.042697 0.062938 ... 0.048788 -0.053929 0.175260\n", + "nox 0.420972 -0.516604 0.763651 ... -0.380051 0.590879 -0.427321\n", + "rm -0.219247 0.311991 -0.391676 ... 0.128069 -0.613808 0.695360\n", + "age 0.352734 -0.569537 0.644779 ... -0.273534 0.602339 -0.376955\n", + "dis -0.379670 0.664408 -0.708027 ... 0.291512 -0.496996 0.249929\n", + "rad 0.625505 -0.311948 0.595129 ... -0.444413 0.488676 -0.381626\n", + "tax 0.582764 -0.314563 0.720760 ... -0.441808 0.543993 -0.468536\n", + "ptratio 0.289946 -0.391679 0.383248 ... -0.177383 0.374044 -0.507787\n", + "b -0.385064 0.175520 -0.356977 ... 1.000000 -0.366087 0.333461\n", + "lstat 0.455621 -0.412995 0.603800 ... -0.366087 1.000000 -0.737663\n", + "preco -0.388305 0.360445 -0.483725 ... 
0.333461 -0.737663 1.000000\n", + "\n", + "[14 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 144 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AxQp7xqdgTJP" + }, + "source": [ + "##### Gráfico das correlações entre as features/variáveis/colunas\n", + "Source: https://seaborn.pydata.org/examples/many_pairwise_correlations.html\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KOiH2X-WgqmN", + "outputId": "f72007dc-7c99-4ce1-b6bb-b86a9bf617c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 557 + } + }, + "source": [ + "import seaborn as sns\n", + "from string import ascii_letters\n", + "import matplotlib.pyplot as plt\n", + "\n", + "sns.set_theme(style = \"white\")\n", + "\n", + "d = df_boston\n", + "\n", + "# Compute the correlation matrix\n", + "corr = d.corr()\n", + "\n", + "# Generate a mask for the upper triangle\n", + "mask = np.triu(np.ones_like(corr, dtype=bool))\n", + "\n", + "# Set up the matplotlib figure\n", + "f, ax = plt.subplots(figsize=(11, 9))\n", + "\n", + "# Generate a custom diverging colormap\n", + "cmap = sns.diverging_palette(230, 20, as_cmap=True)\n", + "\n", + "# Draw the heatmap with the mask and correct aspect ratio\n", + "sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,\n", + " square=True, linewidths=.5, cbar_kws={\"shrink\": .5})" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 145 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nogPhyfVO70G" + }, + "source": [ + "### Construir e treinar o(s) modelo(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HxYpfyvQaIe1" + }, + "source": [ + "$X = [X_{1}, X_{2}, X_{p}]$ = X_boston abaixo." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0BhLZJhibVNG" + }, + "source": [ + "X_boston = df_boston.drop(columns = ['preco'], axis = 1) # todas as variáveis/atributos, exceto 'preco'\n", + "y_boston = df_boston['preco'] # variável-target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_nC_RGva1Z6", + "outputId": "6a5946c8-62b3-424f-a809-9a2bbc34f191", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "X_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
      crim    zn  indus  chas    nox     rm   age     dis  rad    tax  ptratio       b  lstat
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
\n", + "
" + ], + "text/plain": [ + " crim zn indus chas nox ... rad tax ptratio b lstat\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 147 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nlVJM--Ya5fS", + "outputId": "58037983-175f-47ed-ad47-5826589358b0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_boston[0:10] # Series (coluna)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 24.0\n", + "1 21.6\n", + "2 34.7\n", + "3 33.4\n", + "4 36.2\n", + "5 28.7\n", + "6 22.9\n", + "7 27.1\n", + "8 16.5\n", + "9 18.9\n", + "Name: preco, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 148 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b50_6tv5h1kY" + }, + "source": [ + "# Definindo os dataframes de treinamento e teste:\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_boston, \n", + " y_boston, \n", + " test_size = 0.2, \n", + " random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1U3hpdkDbYTv", + "outputId": "35e8cee1-201a-4a65-a6ec-8fa9e8c7c0a8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "print(f\"Dataframe de treinamento: {X_treinamento.shape[0]} linhas\")\n", + "print(f\"Dataframe de teste......: {X_teste.shape[0]} linhas\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + 
"Dataframe de treinamento: 404 linhas\n", + "Dataframe de teste......: 102 linhas\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SvevXulFiJj1" + }, + "source": [ + "#### Treinamento do modelo de Regressão Linear" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GVwF3vp8iNff" + }, + "source": [ + "# Importa a library LinearRegression --> Para treinamento da Regressão Linear\n", + "from sklearn.linear_model import LinearRegression\n", + "\n", + "# Library para statmodels\n", + "import statsmodels.api as sm" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ibX6bCbViW-v" + }, + "source": [ + "# Instancia o objeto\n", + "regressao_linear = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M-5wRGUribY0", + "outputId": "a67d7355-3d9e-43fc-edf6-8ebd71911935", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Treina o modelo usando as amostras/dataset de treinamento: X_treinamento e y_treinamento \n", + "regressao_linear.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 153 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jri-jA1VjmUl", + "outputId": "3150261d-c264-4273-9c5f-95229529881b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Valor do intercepto\n", + "regressao_linear.intercept_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "35.9020918753502" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 154 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"VOjadxdxjqtT", + "outputId": "49d06bd9-e375-403f-e257-863967f10fd3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 452 + } + }, + "source": [ + "# Coeficientes do modelo de Regressão Linear\n", + "coeficientes_regressao_linear = pd.DataFrame([X_treinamento.columns, regressao_linear.coef_]).T\n", + "coeficientes_regressao_linear = coeficientes_regressao_linear.rename(columns={0: 'Feature/variável/coluna', 1: 'Coeficientes'})\n", + "coeficientes_regressao_linear" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   Feature/variável/coluna Coeficientes
0                     crim   -0.0822083
1                       zn    0.0428002
2                    indus    0.0756011
3                     chas      3.16348
4                      nox     -19.4945
5                       rm      3.98161
6                      age   0.00480929
7                      dis     -1.37396
8                      rad     0.298883
9                      tax   -0.0123962
10                 ptratio    -0.984657
11                       b     0.008949
12                   lstat    -0.526478
\n", + "
" + ], + "text/plain": [ + " Feature/variável/coluna Coeficientes\n", + "0 crim -0.0822083\n", + "1 zn 0.0428002\n", + "2 indus 0.0756011\n", + "3 chas 3.16348\n", + "4 nox -19.4945\n", + "5 rm 3.98161\n", + "6 age 0.00480929\n", + "7 dis -1.37396\n", + "8 rad 0.298883\n", + "9 tax -0.0123962\n", + "10 ptratio -0.984657\n", + "11 b 0.008949\n", + "12 lstat -0.526478" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 155 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jwnkhPwDjkhS" + }, + "source": [ + "#### Usando statmodels" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ltbekHd_k3PH", + "outputId": "a69b057e-75a6-446e-8b7c-ad37604114a5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X2_treinamento = sm.add_constant(X_treinamento)\n", + "lm_sm = sm.OLS(y_treinamento, X2_treinamento).fit()\n", + "print(lm_sm.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.725\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 78.97\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.48e-100\n", + "Time: 11:00:14 Log-Likelihood: -1214.8\n", + "No. 
Observations: 404 AIC: 2458.\n", + "Df Residuals: 390 BIC: 2514.\n", + "Df Model: 13 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.9021 6.037 5.947 0.000 24.033 47.771\n", + "crim -0.0822 0.045 -1.824 0.069 -0.171 0.006\n", + "zn 0.0428 0.016 2.638 0.009 0.011 0.075\n", + "indus 0.0756 0.072 1.054 0.292 -0.065 0.217\n", + "chas 3.1635 0.997 3.174 0.002 1.204 5.123\n", + "nox -19.4945 4.539 -4.295 0.000 -28.418 -10.571\n", + "rm 3.9816 0.510 7.802 0.000 2.978 4.985\n", + "age 0.0048 0.015 0.312 0.755 -0.025 0.035\n", + "dis -1.3740 0.236 -5.827 0.000 -1.838 -0.910\n", + "rad 0.2989 0.079 3.760 0.000 0.143 0.455\n", + "tax -0.0124 0.004 -2.814 0.005 -0.021 -0.004\n", + "ptratio -0.9847 0.156 -6.309 0.000 -1.292 -0.678\n", + "b 0.0089 0.003 2.796 0.005 0.003 0.015\n", + "lstat -0.5265 0.060 -8.764 0.000 -0.645 -0.408\n", + "==============================================================================\n", + "Omnibus: 140.799 Durbin-Watson: 2.083\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 591.650\n", + "Skew: 1.484 Prob(JB): 3.35e-129\n", + "Kurtosis: 8.132 Cond. No. 1.51e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.51e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kpt3A4Q0guHv" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rVUJkfg4gSh7", + "outputId": "eeff1e8f-8ac7-44e8-e0fe-caf0d4a641c7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X3 = X_treinamento.drop(columns = 'age', axis = 1)\n", + "X3_treinamento = sm.add_constant(X3)\n", + "lm_sm2 = sm.OLS(y_treinamento, X3_treinamento).fit()\n", + "print(lm_sm2.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.725\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 85.75\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.64e-101\n", + "Time: 11:00:14 Log-Likelihood: -1214.8\n", + "No. 
Observations: 404 AIC: 2456.\n", + "Df Residuals: 391 BIC: 2508.\n", + "Df Model: 12 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.7325 6.006 5.950 0.000 23.925 47.540\n", + "crim -0.0815 0.045 -1.812 0.071 -0.170 0.007\n", + "zn 0.0422 0.016 2.623 0.009 0.011 0.074\n", + "indus 0.0750 0.072 1.048 0.295 -0.066 0.216\n", + "chas 3.1794 0.994 3.198 0.001 1.225 5.134\n", + "nox -19.1299 4.381 -4.367 0.000 -27.742 -10.517\n", + "rm 4.0153 0.498 8.059 0.000 3.036 4.995\n", + "dis -1.3963 0.224 -6.223 0.000 -1.837 -0.955\n", + "rad 0.2958 0.079 3.755 0.000 0.141 0.451\n", + "tax -0.0123 0.004 -2.802 0.005 -0.021 -0.004\n", + "ptratio -0.9812 0.156 -6.310 0.000 -1.287 -0.675\n", + "b 0.0090 0.003 2.825 0.005 0.003 0.015\n", + "lstat -0.5202 0.057 -9.203 0.000 -0.631 -0.409\n", + "==============================================================================\n", + "Omnibus: 142.363 Durbin-Watson: 2.081\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 608.694\n", + "Skew: 1.496 Prob(JB): 6.67e-133\n", + "Kurtosis: 8.216 Cond. No. 1.48e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.48e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_lcp7m5FmZvG" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'indus'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jEiBywx4hGNB", + "outputId": "fb2abfd1-9019-4e37-f6e1-cf5e54ae1276", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X4 = X3_treinamento.drop(columns = 'indus', axis = 1)\n", + "X4_treinamento = sm.add_constant(X4)\n", + "lm_sm3 = sm.OLS(y_treinamento, X4_treinamento).fit()\n", + "print(lm_sm3.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.724\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 93.42\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 2.86e-102\n", + "Time: 11:00:14 Log-Likelihood: -1215.4\n", + "No. 
Observations: 404 AIC: 2455.\n", + "Df Residuals: 392 BIC: 2503.\n", + "Df Model: 11 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.4757 6.001 5.911 0.000 23.677 47.275\n", + "crim -0.0840 0.045 -1.871 0.062 -0.172 0.004\n", + "zn 0.0407 0.016 2.539 0.012 0.009 0.072\n", + "chas 3.2924 0.989 3.330 0.001 1.349 5.236\n", + "nox -17.9558 4.235 -4.239 0.000 -26.283 -9.629\n", + "rm 3.9674 0.496 7.996 0.000 2.992 4.943\n", + "dis -1.4553 0.217 -6.699 0.000 -1.882 -1.028\n", + "rad 0.2744 0.076 3.606 0.000 0.125 0.424\n", + "tax -0.0103 0.004 -2.603 0.010 -0.018 -0.003\n", + "ptratio -0.9609 0.154 -6.227 0.000 -1.264 -0.658\n", + "b 0.0089 0.003 2.778 0.006 0.003 0.015\n", + "lstat -0.5151 0.056 -9.145 0.000 -0.626 -0.404\n", + "==============================================================================\n", + "Omnibus: 142.123 Durbin-Watson: 2.073\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 605.868\n", + "Skew: 1.494 Prob(JB): 2.74e-132\n", + "Kurtosis: 8.202 Cond. No. 1.47e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.47e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rFejox5XmrEE" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'crim'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DOehOql8hZWr", + "outputId": "cbb71827-f44e-4688-93c4-98a3ec5e3257", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X5 = X4_treinamento.drop(columns = 'crim', axis = 1)\n", + "X5_treinamento = sm.add_constant(X5)\n", + "lm_sm4 = sm.OLS(y_treinamento, X5_treinamento).fit()\n", + "print(lm_sm4.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.721\n", + "Model: OLS Adj. R-squared: 0.714\n", + "Method: Least Squares F-statistic: 101.8\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.55e-102\n", + "Time: 11:00:14 Log-Likelihood: -1217.2\n", + "No. 
Observations: 404 AIC: 2456.\n", + "Df Residuals: 393 BIC: 2500.\n", + "Df Model: 10 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 33.9950 5.968 5.696 0.000 22.262 45.728\n", + "zn 0.0375 0.016 2.349 0.019 0.006 0.069\n", + "chas 3.3959 0.990 3.430 0.001 1.449 5.343\n", + "nox -17.1637 4.228 -4.060 0.000 -25.475 -8.852\n", + "rm 4.0365 0.496 8.132 0.000 3.061 5.012\n", + "dis -1.3999 0.216 -6.484 0.000 -1.824 -0.975\n", + "rad 0.2278 0.072 3.158 0.002 0.086 0.370\n", + "tax -0.0100 0.004 -2.513 0.012 -0.018 -0.002\n", + "ptratio -0.9493 0.155 -6.137 0.000 -1.253 -0.645\n", + "b 0.0101 0.003 3.217 0.001 0.004 0.016\n", + "lstat -0.5315 0.056 -9.523 0.000 -0.641 -0.422\n", + "==============================================================================\n", + "Omnibus: 140.245 Durbin-Watson: 2.070\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 609.563\n", + "Skew: 1.464 Prob(JB): 4.32e-133\n", + "Kurtosis: 8.257 Cond. No. 1.46e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.46e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UafIUrpZB0YP" + }, + "source": [ + "### Conclusion\n", + "* Which variables/columns/attributes stay in the model?\n", + "* **Very important (exercise)**: normalize the covariates (MinMaxScaler) and redo the analysis.\n", + "* In this iteration, after excluding the variables age, indus and crim (in this order), no other variable is insignificant at the 5% level (in fact, the largest p-value is 1.9%)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jx7sOzrrm-H_" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nXeiFtnJO_1u" + }, + "source": [ + "### Validation of the model(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QlGVFA6uPDvr" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PE3aKJ6mPDyJ" + }, + "source": [ + "### Predictions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d3nGiyX8jadH" + }, + "source": [ + "### Deployment of the **analytical** solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YQF4NIlGSLH" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UQfpoo1igFy8" + }, + "source": [ + "# Regularized Regression Methods \n", + "## Ridge Regression - Penalized Regression\n", + "> Reduces model complexity while still using every variable in $X$: the coefficients $w_{i}$ are penalized (with strength given by $\\alpha$) when they lie far from zero, which shrinks them continuously toward small values. 
In this way, we lower the complexity of the model while keeping all the variables in it.\n", + "* Smaller impact of outliers.\n", + "\n", + "### Example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o00xH2MvxvgP" + }, + "source": [ + "# Covariate matrices of the model:\n", + "X_new = [[0, 0], [0, 0], [1, 1]]\n", + "X_new2 = [[0, 0], [0, 1.5], [1, 1]]\n", + "\n", + "y_new = [0, .1, 1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v9U7c03NzW_c", + "outputId": "2652bd10-e6b4-4200-f7f0-a07806564a1d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_new # 2 variables/columns" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[[0, 0], [0, 0], [1, 1]]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 161 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iiVEAPpUzXyN", + "outputId": "a69fe575-57da-459c-f482-41d3185ab76f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_new" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[0, 0.1, 1]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 162 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JDljolA95Hw5" + }, + "source": [ + "### Without outliers" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8mWj2GbPOkHx" + }, + "source": [ + "# Ridge must be imported before it is used (the original cell raised NameError)\n", + "from sklearn.linear_model import Ridge\n", + "\n", + "ridge = Ridge(alpha = .1)\n", + "ridge.fit(X_new, y_new)\n", + "ridge.coef_ # Ridge coefficients" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8yvd4ABY5JjC" + }, + "source": [ + "### With outliers" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"O3sJZ_pe5GQ7" + }, + "source": [ + "ridge = Ridge(alpha = .1)\n", + "ridge.fit(X_new2, y_new)\n", + "ridge.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zZxdCLU_5kKh" + }, + "source": [ + "#### Could you see the impact of the outliers?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u5jsTkUmS9wK" + }, + "source": [ + "### Applying Ridge Regression to the Boston Housing Price dataframe."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Kp4VIJWxgFy8" + }, + "source": [ + "from sklearn.linear_model import Ridge\n", + "ridge = Ridge(alpha = 0.1) # Sets the alpha value of the ridge regression\n", + "lr = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cmRMoOwV6FMt" + }, + "source": [ + "# Instead of: regressao_linear.fit(X_treinamento, y_treinamento)\n", + "ridge.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VPnekyUbK6Xg" + }, + "source": [ + "#### Weight/contribution of the variables to the regression using RIDGE" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "k83RDArjsUrj" + }, + "source": [ + "df_boston.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vMCb0CFjK973" + }, + "source": [ + "ridge.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZqksuIjXypRJ" + }, + "source": [ + "# training the Ridge regression\n", + "ridge.fit(X_treinamento, y_treinamento)\n", + "\n", + "# training the ordinary linear regression (OLS)\n", + "lr.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7r28PBsWLtjA" + }, + "source": [ + "ridge.alpha" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dDZ_TJnhuZno" + }, + "source": [ + "#### $\\alpha = 0.01$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hRMK_QTmNgc1" + }, + "source": [ + "# larger alpha --> stronger penalty, coefficients are shrunk more; \n", + "# smaller alpha --> weaker penalty, and Ridge gets closer to OLS; if alpha = 0 ==> Ridge = OLS.\n", + "rr = Ridge(alpha = 0.01) # The closer to 0 ==> Ridge = OLS\n", + "rr.fit(X_treinamento, 
y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IRuWmBE7Ngc7" + }, + "source": [ + "# MSE = Mean Squared Error\n", + "from sklearn.metrics import mean_squared_error\n", + "\n", + "rr_model=(mean_squared_error(y_true = y_treinamento, y_pred = rr.predict(X_treinamento)))\n", + "lr_model=(mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "L4an-zHetafI" + }, + "source": [ + "print(rr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QsLVzk3EtbGs" + }, + "source": [ + "print(lr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K2sjngo1QhY2" + }, + "source": [ + "### Ridge coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s5i87o3quByz" + }, + "source": [ + "# List of the variables + absolute values of the Ridge coefficients:\n", + "list(zip(X_treinamento.columns, abs(ridge.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s44vo9IjQonE" + }, + "source": [ + "### Try several other values for $\\alpha$, for example $\\alpha = 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CDv5fGPbuUq5" + }, + "source": [ + "#### $\\alpha = 100$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NEaj4QRrNgdA" + }, + "source": [ + "rr100 = Ridge(alpha = 100)\n", + "rr100.fit(X_treinamento, y_treinamento)\n", + "train_score=lr.score(X_treinamento, y_treinamento) # R^2 of OLS on the training set\n", + "test_score=lr.score(X_teste, y_teste) # R^2 of OLS on the test set\n", + "Ridge_treinamento_score = rr100.score(X_treinamento,y_treinamento) # R^2 of Ridge (alpha = 100) on the training set" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zhcfoTEENgdE" + }, + "source": [ + "# MSE\n", + "rr100_model = (mean_squared_error(y_true = y_treinamento, y_pred = rr100.predict(X_treinamento)))\n", + "lr_model = (mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NGDBpfiquxoc" + }, + "source": [ + "print(rr100_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Owami5MVureW" + }, + "source": [ + "print(lr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xk5dN3Owu6Kw" + }, + "source": [ + "### Next step: fit the ridge models with statsmodels." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cEF_3GgUgF0Q" + }, + "source": [ + "# LASSO (Least Absolute Shrinkage And Selection Operator regularization)\n", + "* A widely used regularization method; \n", + "* Reduces overfitting;\n", + "* Takes care of **Feature Selection**, because it can shrink coefficients exactly to zero; among highly correlated variables it tends to keep one and discard the others." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-YiKb9reQdI4" + }, + "source": [ + "* Used in the regularization process, that is, penalizing the variables so that only the most important attributes are kept. Think of how useful this is for a dataframe with many variables;\n", + "* Lasso regression comes with a parameter ($\\alpha$): the larger $\\alpha$ is, the more feature coefficients become zero. When $\\alpha = 0$, Lasso regression produces the same coefficients as a linear regression. When $\\alpha$ is very large, all coefficients are zero."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5p_ZPZ4tTUX1" + }, + "source": [ + "### LASSO example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JD1_M2uw6q0W" + }, + "source": [ + "X_new" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "i5JZTnkTOkI9" + }, + "source": [ + "from sklearn.linear_model import Lasso\n", + "lasso = Lasso(alpha = .1)\n", + "lasso.fit(X_new, y_new)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gEUxSlThOkJD" + }, + "source": [ + "lasso.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EQaGWzzLT9qP" + }, + "source": [ + "### Applying LASSO to the Boston Housing Price data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ME6v6LFlgF0Q" + }, + "source": [ + "from sklearn.linear_model import Lasso\n", + "lasso = Lasso(alpha = .1)\n", + "lasso.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h6DSEHc1gF0V" + }, + "source": [ + "lasso.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8SzYnpVGy4cy" + }, + "source": [ + "### LASSO coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O2w2QDmdxxVe" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(lasso.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UBOCg1H9zn6A" + }, + "source": [ + "### Comparison with the RIDGE coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g1fF-mEZzXpH" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(ridge.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xP1fX1Bi6VdX" + }, + "source": [ + 
"**Conclusion**: Variables whose coefficients are zero can be dropped from the analysis/model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TbtxIWyGSXkH" + }, + "source": [ + "### Effect of the values of $\\alpha$\n", + "* Function adapted from https://chrisalbon.com/machine_learning/linear_regression/effect_of_alpha_on_lasso_regression/." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B4AuWA4LRBE3" + }, + "source": [ + "# Create a function called lasso_coefficients (named so that it does not overwrite the fitted `lasso` model above),\n", + "def lasso_coefficients(alphas):\n", + "    '''\n", + "    Takes in a list of alphas. Outputs a dataframe containing the coefficients of lasso regressions from each alpha.\n", + "    '''\n", + "    # Create an empty data frame\n", + "    df = pd.DataFrame()\n", + "\n", + "    # Create a column of feature names\n", + "    df['Feature Name'] = names\n", + "\n", + "    # For each alpha value in the list of alpha values,\n", + "    for alpha in alphas:\n", + "        # Create a lasso regression with that alpha value,\n", + "        modelo_lasso = Lasso(alpha = alpha)\n", + "\n", + "        # Fit the lasso regression\n", + "        modelo_lasso.fit(X_treinamento, y_treinamento)\n", + "\n", + "        # Create a column name for that alpha value\n", + "        column_name = 'Alpha = %f' % alpha\n", + "\n", + "        # Create a column of coefficient values\n", + "        df[column_name] = modelo_lasso.coef_\n", + "\n", + "    # Return the dataframe\n", + "    return df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VEDvXvuNRK0C" + }, + "source": [ + "names = X_treinamento.columns\n", + "\n", + "# Alpha values (alpha = 0 is plain OLS; scikit-learn warns that Lasso is not advised for it):\n", + "lasso_coefficients([.0001, .001, 0, .01, .1, 1, 10, 100])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xFlvTUJKhwgW" + }, + "source": [ + "### Capturing the most important elements" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4_-sUgMIhzmE" + }, + "source": [ + "r_squared = model.rsquared\n", + "r_squared_adj = model.rsquared_adj\n", + "coeficientes_regressao = model.params" + ], 
+ "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "apGv5ytnimsM" + }, + "source": [ + "SEE: https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uhokzxtcil8w" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jSYw6SdcXa0q" + }, + "source": [ + "### Cross-Validation & GridSearch for LASSO" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E14i4Y3rqEX2" + }, + "source": [ + "### RMSE (Root Mean Squared Error) formula\n", + "$$RMSE = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}(y_{i}-\\hat{y}_{i})^{2}}$$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "irFZAkvVXfya" + }, + "source": [ + "from sklearn.linear_model import LassoCV\n", + "from sklearn.model_selection import RepeatedKFold" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "T3Jjom8RYdly" + }, + "source": [ + "# define model evaluation method\n", + "cv = RepeatedKFold(n_splits = 5, n_repeats = 3, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Cw3lAvRPYgJe" + }, + "source": [ + "# define model\n", + "model = LassoCV(alphas = np.arange(0.001, 10, 0.001), cv = cv, n_jobs = -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oLX3CpThXvkJ" + }, + "source": [ + "# fit model\n", + "model.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U1ubd5huYQ7u" + }, + "source": [ + "# summarize chosen configuration\n", + "print('alpha: %f' % model.alpha_)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9P7hYoo4gF0Z" + }, + "source": [ + "# Elastic Net \n", + "* Combines the strengths of Ridge and LASSO;\n", + "* 
Removes variables with little predictive power (LASSO) or penalizes them (Ridge)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yChNUYs7gF0b" + }, + "source": [ + "from sklearn.linear_model import ElasticNet\n", + "from sklearn.model_selection import GridSearchCV\n", + "\n", + "# Instantiate the estimator\n", + "en = ElasticNet(alpha = .1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S1m3SL2avMbd" + }, + "source": [ + "General pattern: transformation.fit(data_to_be_transformed)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4mbIaAUAF4N6" + }, + "source": [ + "en.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MaUkZw8ngF0h" + }, + "source": [ + "list(zip(X_treinamento.columns, en.coef_))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K7LuPhCtvouJ" + }, + "source": [ + "### GridSearch to find $\\alpha$ for Elastic Net" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xl-Qh9caDyCp" + }, + "source": [ + "# Instantiate the estimator (note: the normalize parameter was removed in scikit-learn 1.2;\n", + "# on newer versions, scale the features beforehand, e.g. with StandardScaler):\n", + "en = ElasticNet(normalize = True)\n", + "\n", + "# Hyperparameter optimization:\n", + "d_hiperparametros = {'alpha': np.logspace(-5, 2, 8),\n", + "                     'l1_ratio': [.2, .4, .6, .8]}\n", + "\n", + "search = GridSearchCV(estimator = en, # Elastic Net\n", + "                      param_grid = d_hiperparametros, # Dictionary with the hyperparameters\n", + "                      scoring = 'neg_mean_squared_error', # Negated MSE (Mean Squared Error): scikit-learn maximizes the score\n", + "                      n_jobs = -1, # Use all available processors\n", + "                      refit = True,\n", + "                      cv = 10) # Number of cross-validation folds" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JvNQyUW_2QLr" + }, + "source": [ + "### Exercise (Statistics): Suggested manual tuning\n", + "* Study the frequency with which each variable is statistically significant (at the 5% level) over 100 fits." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hp1hV5YahsJb" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ng0rPXfA1DgS" + }, + "source": [ + "# Exercise sketch; assumes X is a DataFrame, y is the target, and OLS (statsmodels) is the model:\n", + "from sklearn.model_selection import train_test_split\n", + "import statsmodels.api as sm\n", + "\n", + "n_significante = pd.Series(0, index = X.columns) # times each variable was significant\n", + "for i in range(100):\n", + "    # Separate names (X_tr, ...) keep X_treinamento/X_teste used by the other cells intact\n", + "    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size = 0.2, random_state = i)\n", + "    modelo_i = sm.OLS(y_tr, sm.add_constant(X_tr, has_constant = 'add')).fit()\n", + "    # Intercept and regression coefficients: modelo_i.params\n", + "    # Parameter validation (significance at the 5% level):\n", + "    n_significante += (modelo_i.pvalues[X.columns] < 0.05).astype(int)\n", + "    y_predict = modelo_i.predict(sm.add_constant(X_te, has_constant = 'add'))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "c3_XCQCPGlr3" + }, + "source": [ + "search.fit(X_treinamento, y_treinamento)\n", + "\n", + "# Return the best hyperparameters found:\n", + "search.best_params_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq0_ugQfGrdb" + }, + "source": [ + "en2 = ElasticNet(normalize = True, alpha = 0.001, l1_ratio = 0.6)\n", + "en2.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ILA5lScUx-Ub" + }, + "source": [ + "\n", + "# Metric\n", + "ml2 = (mean_squared_error(y_true = y_teste, y_pred = en2.predict(X_teste)))\n", + "# 'neg_mean_squared_error' is only a scorer name (the negated MSE), not a function in sklearn.metrics; the equivalent value here is -ml2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BzO_dHRixd_L" + }, + "source": [ + "print(f\"MSE: {ml2}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zaEwh3t3zwFc" + }, + "source": [ + "**Conclusion**:\n", + "* MSE comparison: the regression without regularization produced an MSE of 23.94. 
As we can see, Elastic Net produces an MSE of 15.4." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5geUMgC6ztxE" + }, + "source": [ + "### Elastic Net coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LyLdASRqzwCq" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(en2.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "90pfP9-3OkJG" + }, + "source": [ + "Note above that the second coefficient was estimated as 0, so we can exclude that variable from the model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ILCXvYKDOkJH" + }, + "source": [ + "# Elastic Net \n", + "* Combines the strengths of Ridge and LASSO;\n", + "* Removes variables with little predictive power (LASSO) or penalizes them (Ridge)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GaQPDCR2OkJI" + }, + "source": [ + "from sklearn.linear_model import ElasticNet\n", + "\n", + "# Instantiate the estimator\n", + "en = ElasticNet(alpha = .1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xVp16Eu_OkJL" + }, + "source": [ + "en.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kwj018U8OkJO" + }, + "source": [ + "en.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rJRWBzSQCcss" + }, + "source": [ + "# Logistic Regression\n", + "\n", + "* In linear regression we model the relationship between the features ($X_{np} = [X_{1}, X_{2}, ..., X_{p}]$) and the response through the linear equation:\n", + "\n", + "$$\\hat{y}= \\beta_{0}+\\beta_{1}x_{1}+\\beta_{2}x_{2}+...+\\beta_{p}x_{p}$$\n", + "\n", + "For classification, Logistic Regression returns probabilities (between 0 and 1), given by the logistic equation (also known as the **sigmoid function**):\n", + "\n", + "$$P[y = 
1]= \\frac{1}{1+e^{-(\\beta_{0}+\\beta_{1}x_{1}+\\beta_{2}x_{2}+...+\\beta_{p}x_{p})}}$$\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vj83Altwdni7" + }, + "source": [ + "![SigmoidFunction](https://github.com/MathMachado/Materials/blob/master/SigmoidFunction.PNG?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LS1QjQnknqe5" + }, + "source": [ + "## Assumptions of Logistic Regression\n", + "* There are no missing values in the dataset;\n", + "* The response variable $y$ is binary (0 or 1) or ordinal (a categorical variable with ordered values, for example a wine quality rating);\n", + "* The predictor variables $X$ are independent of one another;\n", + "* There are at least 50 observations for each predictor variable in the model (the more, the better), which helps ensure reliable results;\n", + "* The classes of the response variable are balanced." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YGvpGTAd4jO" + }, + "source": [ + "# Example 1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-LBYRG__e_Zv" + }, + "source": [ + "### Load the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XX2GNYWue-iA" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "%matplotlib inline\n", + "\n", + "import statsmodels.api as sm\n", + "import statsmodels.formula.api as smf\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import roc_auc_score, roc_curve, classification_report, accuracy_score, confusion_matrix, auc" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RpNu-JjJfBYe" + }, + "source": [ + "### Load the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dWVj8SmUeBZB", + "outputId": "ddb92623-228d-4621-90f9-b80b1a0d06c9", + 
"colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_Original.csv'\n", + "df_titanic = pd.read_csv(url)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table>\n", + "<thead>\n", + "<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "<tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n", + "<tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n", + "<tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n", + "<tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n", + "<tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " PassengerId Survived Pclass ... Fare Cabin Embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 289 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T9vZGvU5qbsQ", + "outputId": "6ad44206-7129-4e4e-a03d-cec7aeab3449", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "df_titanic.columns = [coluna.lower() for coluna in df_titanic.columns]\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table>\n", + "<thead>\n", + "<tr><th></th><th>passengerid</th><th>survived</th><th>pclass</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>ticket</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "<tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n", + "<tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n", + "<tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n", + "<tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n", + "<tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n", + "</tbody>\n", + "</table>" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 290 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fAYAg5tofDgQ" + }, + "source": [ + "### Understanding the data\n", + "* sibsp - number of siblings/spouses aboard the Titanic;\n", + "* parch - number of parents/children aboard the Titanic;\n", + "* embarked - port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZbijPdpFxdZy" + }, + "source": [ + "#### Is the target variable binary or ordinal categorical?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7hspb3IMe5tx", + "outputId": "684c59df-d788-400f-a179-0774a39e4303", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['survived'].value_counts()/df_titanic.shape[0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 0.616162\n", + "1 0.383838\n", + "Name: survived, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 291 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tsp4t7oxx3zC" + }, + "source": [ + "Below is the plot of the target variable:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vm0BDjw-xrGI", + "outputId": "443def8e-dee6-40f5-8bcf-b33a43bb5c15", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "source": [ + "sns.countplot(x = 'survived', data = df_titanic)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 
292 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XfhFG6Axxj6F" + }, + "source": [ + "As we can see, the response variable 'survived' is binary, so this assumption holds." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zRKhDX6ZraGU" + }, + "source": [ + "### Handling missing values\n", + "* Replace the NaNs with the median of the variable" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qPbILjZyrhRZ", + "outputId": "52f34626-1875-4632-cce3-8a694863ea6c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 177\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 293 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uJUPufRossTo" + }, + "source": [ + "Computing the median of the predictor 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WGW9bW5x4JdT", + "outputId": "c0ce5e66-70fe-4195-c637-2ac8a4cf9d37", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "#df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "<table>\n", + "<thead>\n", + "<tr><th></th><th>passengerid</th><th>survived</th><th>pclass</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>ticket</th><th>fare</th><th>cabin</th><th>embarked</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "<tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n", + "<tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n", + "<tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n", + "<tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n", + "<tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n", + "</tbody>\n", + "</table>" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 294 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DgAwrR8msYv_" + }, + "source": [ + "mediana_age = df_titanic['age'].median()\n", + "mediana_fare = df_titanic['fare'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yqIgckarzwdB", + "outputId": "a190a691-e377-42ba-a8f8-98d430ccabbd", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "mediana_age" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 297 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "czdSVeLjzxAX", + "outputId": "48bdfb0b-a153-482e-9fa6-6706bd44b88d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "mediana_fare" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "14.4542" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 298 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u4vcCshcsv6w" + }, + "source": [ + "Replacing the NaNs in 'age' and 'fare' with their respective medians" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tnOOsaqLsg03", + "outputId": "81837607-f6b7-4bc2-e295-443679d3deb2", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Assign the result back instead of using inplace = True on a selected column\n", + "df_titanic['age'] = df_titanic['age'].fillna(mediana_age)\n", + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + 
"text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 299 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VqAnNxnO0Ghn" + }, + "source": [ + "Since fare has no NaNs, nothing needs to be done for it." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Hi2zG_ms6n-" + }, + "source": [ + "#### Using an imputer\n", + "* A method for handling missing values." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mvCnGfCOri9Y", + "outputId": "6d8b7f52-ca60-4bbd-bc50-9e55f3b9bd17", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from sklearn.impute import SimpleImputer\n", + "\n", + "# fit()\n", + "imputer_mv = SimpleImputer(strategy = 'median')\n", + "imputer_mv.fit(df_titanic_copia[['age', 'fare']])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='median', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 300 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SokJ8HM61FcK", + "outputId": "65052934-48c4-4c6d-b206-b14d8fa10dc3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "imputer_mv" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='median', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 301 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X-qx8QsQthyU", + "outputId": "469c2591-1ea2-4dfd-8395-df11821f5951", + "colab": { + 
"base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "# transform()\n", + "df_titanic_mediana = pd.DataFrame(imputer_mv.transform(df_titanic[['age', 'fare']]), columns = ['age2', 'fare2'])\n", + "df_titanic_mediana.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
age2fare2
022.07.2500
138.071.2833
226.07.9250
335.053.1000
435.08.0500
\n", + "
" + ], + "text/plain": [ + " age2 fare2\n", + "0 22.0 7.2500\n", + "1 38.0 71.2833\n", + "2 26.0 7.9250\n", + "3 35.0 53.1000\n", + "4 35.0 8.0500" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 302 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KS-xYf5BuwEt", + "outputId": "c7e01602-c917-48b7-a7f1-33ce365f21ac", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic_mediana.median()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "age2 28.0000\n", + "fare2 14.4542\n", + "dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 303 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lggbmAD2vN42", + "outputId": "55134b01-f993-4f01-b053-7600c45eec21", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic_copia.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 177\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 304 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8fQ6a7RSvUOp", + "outputId": "75770554-a802-4012-a068-76b2e1ba1578", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 498 + } + }, + "source": [ + "df_titanic['age'] = df_titanic_mediana['age2']\n", + "\n", + "# Não há NaN's na variável fare. Portanto, nenhuma alteração\n", + "#df_titanic['fare'] = df_titanic_mediana['fare']\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 305 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HSncMlT51oM5", + "outputId": "85057833-9c05-4805-d2d0-ad6a3af0d140", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 306 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c48gJg0q4zgj" + }, + "source": [ + "Exclui as colunas que não são mais necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7OzK7DnDg2WY", + "outputId": "462a1c3e-d2b3-4d5d-b745-945706522481", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 498 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 307 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oLpWbzz84ykm", + "outputId": "cdb706ed-5395-47d0-8f52-7887edcde8ee", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic.drop(columns = ['passengerid', 'name', 'ticket', 'cabin'], axis = 1, inplace = True)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked
003male22.0107.2500S
111female38.01071.2833C
213female26.0007.9250S
311female35.01053.1000S
403male35.0008.0500S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked\n", + "0 0 3 male 22.0 1 0 7.2500 S\n", + "1 1 1 female 38.0 1 0 71.2833 C\n", + "2 1 3 female 26.0 0 0 7.9250 S\n", + "3 1 1 female 35.0 1 0 53.1000 S\n", + "4 0 3 male 35.0 0 0 8.0500 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 308 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NZei3VxSxR6g" + }, + "source": [ + "Alternativamente, poderíamos concatenar os dois dataframes usando pd.concat()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ek2qBdOFw2p5", + "outputId": "7777706c-55e4-4bfd-822f-cf3843922f3e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic = df_titanic_copia.copy()\n", + "\n", + "df_titanic.drop(columns = ['passengerid', 'name', 'ticket', 'cabin', 'fare', 'age'], axis = 1, inplace = True)\n", + "df_titanic = pd.concat([df_titanic, df_titanic_mediana], axis = 1)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchembarkedage2fare2
003male10S22.07.2500
111female10C38.071.2833
213female00S26.07.9250
311female10S35.053.1000
403male00S35.08.0500
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp parch embarked age2 fare2\n", + "0 0 3 male 1 0 S 22.0 7.2500\n", + "1 1 1 female 1 0 C 38.0 71.2833\n", + "2 1 3 female 0 0 S 26.0 7.9250\n", + "3 1 1 female 1 0 S 35.0 53.1000\n", + "4 0 3 male 0 0 S 35.0 8.0500" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6omsobg77tRv" + }, + "source": [ + "#### Tratamento dos NaN's da variável 'embarked'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YjeivMbz85gg" + }, + "source": [ + "A seguir, listamos as linhas em que embarked = NaN:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Mc03_AnI8QgV", + "outputId": "ebe1ecc6-2c40-429d-d608-3b72ef10acd2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 110 + } + }, + "source": [ + "embarked_NaN = df_titanic[df_titanic['embarked'].isna()]\n", + "embarked_NaN.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked
6111female38.00080.0NaN
82911female62.00080.0NaN
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked\n", + "61 1 1 female 38.0 0 0 80.0 NaN\n", + "829 1 1 female 62.0 0 0 80.0 NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 309 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xsbeFBFp7zRM", + "outputId": "58859c46-d711-4558-aadf-70480e67c98b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from sklearn.impute import SimpleImputer\n", + "\n", + "# fit()\n", + "imputer_mv = SimpleImputer(strategy = 'most_frequent')\n", + "imputer_mv.fit(df_titanic[['embarked']])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='most_frequent', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 310 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f2kDtHVN761L" + }, + "source": [ + "# transform()\n", + "df_embarked_freq = pd.DataFrame(imputer_mv.transform(df_titanic[['embarked']]), columns = ['embarked2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0JmoLrzD8NwW", + "outputId": "d8ac60c0-a440-42b1-d7d5-c00168c3956f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic = pd.concat([df_titanic, df_embarked_freq], axis = 1)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedembarked2
003male22.0107.2500SS
111female38.01071.2833CC
213female26.0007.9250SS
311female35.01053.1000SS
403male35.0008.0500SS
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked embarked2\n", + "0 0 3 male 22.0 1 0 7.2500 S S\n", + "1 1 1 female 38.0 1 0 71.2833 C C\n", + "2 1 3 female 26.0 0 0 7.9250 S S\n", + "3 1 1 female 35.0 1 0 53.1000 S S\n", + "4 0 3 male 35.0 0 0 8.0500 S S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 312 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FRxX9c4--TCg" + }, + "source": [ + "COMPARE o ANTES e o DEPOIS: Veja a seguir que os valores de [embarked] = NaN foram substituidos por..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oQFDqatz9bMv", + "outputId": "45d6ab98-b832-4844-8d66-6d47cbcee08e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 110 + } + }, + "source": [ + "embarked_NaN = df_titanic[df_titanic['embarked'].isna()]\n", + "embarked_NaN" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedembarked2
6111female38.00080.0NaNS
82911female62.00080.0NaNS
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked embarked2\n", + "61 1 1 female 38.0 0 0 80.0 NaN S\n", + "829 1 1 female 62.0 0 0 80.0 NaN S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 313 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jgCuXei2ZTQl" + }, + "source": [ + "As we can see, every NaN in the embarked variable was replaced by the value 'S', the most frequent category. Does this substitution seem reasonable to you?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r3r8ObKn-nBt" + }, + "source": [ + "df_titanic.drop(columns = ['embarked'], axis = 1, inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OacQvrYeAPBR" + }, + "source": [ + "Final NaN check:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OHBv7CrjARol", + "outputId": "df1e556b-21dd-42a2-df08-4c0046f1f3b1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 0\n", + "pclass 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "fare 0\n", + "embarked2 0\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 315 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ITFMsiBSyAHY" + }, + "source": [ + "### Does the dataframe under analysis have at least 50 observations per predictor?\n", + "* Predictor variables: pclass, sex, age, sibsp, parch, fare, embarked2 (7 predictors).\n", + "* Therefore, our dataframe needs at least 7 x 50 = 350 rows." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4lgVp2N8yE1C", + "outputId": "2dbea822-609b-4576-c3e2-f89b3527db1c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.info()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 891 entries, 0 to 890\n", + "Data columns (total 8 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 survived 891 non-null int64 \n", + " 1 pclass 891 non-null int64 \n", + " 2 sex 891 non-null object \n", + " 3 age 891 non-null float64\n", + " 4 sibsp 891 non-null int64 \n", + " 5 parch 891 non-null int64 \n", + " 6 fare 891 non-null float64\n", + " 7 embarked2 891 non-null object \n", + "dtypes: float64(2), int64(4), object(2)\n", + "memory usage: 55.8+ KB\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rFwtnAcw23gQ", + "outputId": "2b9b006b-3dee-493a-e9a6-adcf90923373", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "891/7" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "127.28571428571429" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 317 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wLqz2V7SytPU" + }, + "source": [ + "Is the assumption met?\n", + "With 891 rows, about 127 per predictor and well above the 350-row minimum, it is, so we can proceed with the analysis." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wm0VycfhovW8" + }, + "source": [ + "#### Checking the assumption of independent predictors\n", + "* Spearman's coefficient (developed by Charles Spearman), also known as Spearman's rank correlation coefficient and denoted by the Greek letter $\\rho$.\n", + "* It is a statistical method for measuring the monotonic association between two variables that are at least ordinal." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "29knlUdcztb1", + "outputId": "adb2e1a5-2436-4327-8eff-4a65b66d4e0b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchfareage2ambarked2
003male107.250022.0S
111female1071.283338.0C
213female007.925026.0S
311female1053.100035.0S
403male008.050035.0S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp parch fare age2 ambarked2\n", + "0 0 3 male 1 0 7.2500 22.0 S\n", + "1 1 1 female 1 0 71.2833 38.0 C\n", + "2 1 3 female 0 0 7.9250 26.0 S\n", + "3 1 1 female 1 0 53.1000 35.0 S\n", + "4 0 3 male 0 0 8.0500 35.0 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 102 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J5EEcU7l0E2B" + }, + "source": [ + "The independence hypothesis we want to test is:\n", + "\n", + "$H_{0}$: the variables are uncorrelated ($\\rho = 0$). If the p-value is below 5%, there is evidence to reject $H_{0}$." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tj8A_Kp0qxp_" + }, + "source": [ + "from scipy.stats import spearmanr" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kFxVGHPUpKLi" + }, + "source": [ + "coef, p = spearmanr(df_titanic['pclass'], df_titanic['age'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fvzvyvK7qzib", + "outputId": "d1e8d723-5048-4360-bad4-9ce0b8172bbf", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "print('Coeficiente de Correlação de Spearman: %.3f' % coef)\n", + "\n", + "# Interpret significance: reject H0 when p <= alpha\n", + "alpha = 0.05\n", + "if p > alpha:\n", + "\tprint('Amostras NÃO correlacionadas (falha em rejeitar H0) p = %.3f' %p)\n", + "else:\n", + "\tprint('Amostras correlacionadas (Rejeita H0) p = %.3f' %p)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Coeficiente de Correlação de Spearman: -0.317\n", + "Amostras correlacionadas (Rejeita H0) p = 0.000\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yespibmf1WVh" + }, + "source": [ + "## Data Transformation\n", + "* MinMaxScaler and RobustScaler\n", + "* Binning (categorization of numeric variables/predictors): fare and age\n", + "* 
Outliers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UwLpj8PXKFuL" + }, + "source": [ + "### Outlier Treatment\n", + "* Variables: age and fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sTTgUx9oiWdJ", + "outputId": "fd14b9f5-7e25-4416-bfc7-e318e79d3249", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "#df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2
003male22.0107.2500S
111female38.01071.2833C
213female26.0007.9250S
311female35.01053.1000S
403male35.0008.0500S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2\n", + "0 0 3 male 22.0 1 0 7.2500 S\n", + "1 1 1 female 38.0 1 0 71.2833 C\n", + "2 1 3 female 26.0 0 0 7.9250 S\n", + "3 1 1 female 35.0 1 0 53.1000 S\n", + "4 0 3 male 35.0 0 0 8.0500 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 321 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-7v8WaB4aEKv" + }, + "source": [ + "import numpy as np\n", + "import seaborn as sns\n", + "from scipy import stats \n", + "\n", + "def trata_outliers(df, coluna):\n", + "    sns.boxplot(x = coluna, data = df)\n", + "\n", + "    # Compute Q1, Q3 and the IQR:\n", + "    Q1 = np.percentile(df[coluna], 25)\n", + "    Q3 = np.percentile(df[coluna], 75)\n", + "    IQR = Q3 - Q1\n", + "    print(f\"IQR: {IQR}\")\n", + "\n", + "    # Simpler alternative:\n", + "    #IQR2 = stats.iqr(df[coluna]) \n", + "    #IQR2 \n", + "\n", + "    # Lower and upper limits for outlier detection:\n", + "    limite_inferior_outliers = Q1 - 1.5*IQR\n", + "    limite_superior_outliers = Q3 + 1.5*IQR\n", + "    print(f\"Limite inferior para outlier: {limite_inferior_outliers}; Limite superior para outliers: {limite_superior_outliers}\")\n", + "\n", + "    # Median of the column\n", + "    mediana = df[coluna].median()\n", + "    print(f\"Mediana: {mediana}\")\n", + "\n", + "    # Replace outliers. The '_o' suffix marks a column that has gone through outlier treatment.\n", + "    # Only values above the upper limit are replaced; the lower limit is reported but not applied.\n", + "    df[coluna+'_o'] = df[coluna]\n", + "\n", + "    df.loc[df[coluna] > limite_superior_outliers, coluna+'_o'] = np.nan\n", + "    df[coluna+'_o'] = df[coluna+'_o'].fillna(mediana) # assign the result instead of calling fillna in place on a column selection\n", + "\n", + "    return df, limite_superior_outliers" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pwAExKTWaOSf", + "outputId": "089cba96-b805-4eef-d504-f8c81551c938", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + } + }, + "source": [ + "df_titanic, limite_superior_outliers = trata_outliers(df = 
df_titanic, coluna = 'age')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "IQR: 13.0\n", + "Limite inferior para outlier: 2.5; Limite superior para outliers: 54.5\n", + "Mediana: 28.0\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEGCAYAAABbzE8LAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAPXUlEQVR4nO3df2ychX3H8c83vtGGpAXioAgctmvlrhla1rSNOlCrzc7CmpLRamqRyA8wIhAmdU4Ck6YC0WJLAW3S5BFlbBKDFJhIWiUtkECUNSHepCGNYrehCSS0t9VtYxWSOi1tfqyryXd/PM+Zu7Nj+xzffR/j90uy8PM8vuf5Xu7uzePHv8zdBQCovxnRAwDAdEWAASAIAQaAIAQYAIIQYAAIkqvmg+fOnev5fL5GowDAe1Nvb+/P3P3KyvVVBTifz6unp2fypgKAacDMfjTSei5BAEAQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABKnqb8Ih1tatW1UoFGqy7/7+fklSU1NTTfZfqbm5We3t7XU5FpBVBHgKKRQKOnTkqN65dM6k77vh7NuSpDd/XfunRMPZUzU/BjAVEOAp5p1L5+jcghsnfb8zj+2VpJrs+0LHAqY7rgEDQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAkEwFeOvWrdq6dWv0GEA4XgvTQy56gFKFQiF6BCATeC1MD5k6AwaA6YQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0CQugR4YGBAq1atUktLi1paWvTII49Ikjo7O9XS0qIHH3ywHmMAU1ahUNDy5cvV29ur1atXq6WlRd3d3WXbCoWCJOm5555TS0uL9uzZM+L2gwcPlt2+p6dHS5YsUW9v77BtlbetXC69rZS81tetW6eBgYGLuo/r1q1TT09P2bFGczHHjdx3XQL85JNPqr+/f2h5586dkjT0IO/fv78eYwBT1ubNm3XmzBlt2rRJx48fl6ShE5fits2bN0uSHn74YUlSV1fXiNsfeuihstt3dHTo/Pnz2rRp07BtlbetXC69rZS81g8fPqynnnrqou7j4cOH1dHRUXas0VzMcSP3XfMADwwM6IUXXhi2fuXKlWXLnAUDIysUCurr65MknT59emj94OCgtm/fPrStr69Pjz32mNxdkuTu2rZtW9n2p59+WoODg0O3f/zxx4f2efr06bJtO3bsKLttd3d32fKe
PXvKbtvd3a19+/bJ3bVv376qzhgr76O7D+27r69v1LPggYGBCR93LLXctyRZ8cEaj8WLF3tPT09VB+jq6tLu3bvH9bFz587VuXPn1NzcXNUxpotCoaBf/Z/rzKJbJn3fM4/tlSSdW3DjpO+70qxDX9MHLjEe51EUCgXNnDlTu3bt0u233z4Up0i5XG4o0JJkZirtRy6Xk5TEO5fLafny5brnnnvGte+x7mM+n9cTTzwx4rauri7t3bt3Qscdy2Tt28x63X1x5foxz4DNbK2Z9ZhZz8mTJ6s+8IEDB6q+DYB3ZSG+ksriK0mVJ2+Dg4NlZ9DVXFoc6z6Otv3AgQMTPu5YarlvScqN9QHu/qikR6XkDLjaAyxdunTcZ8BNTU2SpC1btlR7mGlh/fr16v2ft6LHuGjn3/9BNX94Ho/zKNavXz/0fj6fz0SEqz0DvuGGG8a977HuYz6fv+C2pUuXlp2lVnPcsdRy31IdrgG3tbWpoaFh2Pqrr766bHmy7xjwXrFx48YLblu7dm3Z8urVq8uWb7vttrLlu+66q2z51ltvveC+77777rLlBx54oGz53nvvHbZ9xowkKQ0NDcOOPZrR7uNY29va2iZ83LHUct9SHQLc2Nio5cuXD1u/ffv2suXKBxdAorm5eegMcPbs2UPrc7mcVq5cObQtn8/rzjvvlJlJSs5Q77jjjrLtq1atGjpTzeVyWrNmzdA+Z8+eXbZtxYoVZbdtbW0tW77pppvKbtva2qply5bJzLRs2TI1NjZO+D6a2dC+8/n8qF8vaGxsnPBxx1LLfUt1+ja0tra2ocsLknTzzTdLklpbWyVx9guMZePGjZo1a5Y6Ozs1f/58Se+etBS3Fc8SN2zYIOndM9TK7ffff3/Z7Ts6OjRjxgx1dnYO21Z528rl0ttKyWt94cKFEzpTLL2PCxcuVEdHR9mxRnMxx43cd82/C6IaxeteXBscWfEacC2+U6Ge3wUx89hefZJrwKPitfDeMuHvggAA1AYBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAguegBSjU3N0ePAGQCr4XpIVMBbm9vjx4ByAReC9MDlyAAIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAiSix4A1Wk4e0ozj+2twX4HJKkm+x5+rFOS5tX8OEDWEeAppLm5uWb77u8flCQ1NdUjjPNqel+AqYIATyHt7e3RIwCYRFwDBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASCIufv4P9jspKQfVXmMuZJ+VuVt6iWrszFXdbI6l5Td2ZirOhc71++4+5WVK6sK8ESYWY+7L67pQSYoq7MxV3WyOpeU3dmYqzq1motLEAAQhAADQJB6BPjROhxjorI6G3NVJ6tzSdmdjbmqU5O5an4NGAAwMi5BAEAQAgwAQWoaYDNbZmZvmFnBzL5Sy2ONMcc2MzthZkdK1s0xs/1m9oP0v1cEzHWNmXWb2etm9pqZrc/QbO83s2+b2avpbJ3p+g+Z2cvpY/p1M7skYLYGM/uumT2flZnSOfrM7LCZHTKznnRdFh7Ly81s
l5kdM7OjZnZ9Rub6aPpvVXz7pZltyMhs96TP+yNmtiN9PUz686xmATazBkmPSPqcpGslrTCza2t1vDE8IWlZxbqvSHrR3T8i6cV0ud4GJf2Vu18r6TpJX07/jbIw268lLXH3j0laJGmZmV0n6e8k/YO7N0v6uaQ1AbOtl3S0ZDkLMxW1uvuiku8ZzcJjuUXSPndfIOljSv7twudy9zfSf6tFkj4p6aykZ6JnM7MmSeskLXb335fUIOkW1eJ55u41eZN0vaR/K1m+T9J9tTreOObJSzpSsvyGpKvS96+S9EbUbCUzPSfphqzNJulSSd+R9IdKfhooN9JjXKdZ5it5US6R9Lwki56pZLY+SXMr1oU+lpIuk/RDpV9wz8pcI8z5p5JeysJskpok/UTSHEm59Hn22Vo8z2p5CaJ4J4qOp+uyYp67/zR9/01J8yKHMbO8pI9LelkZmS39VP+QpBOS9kv6b0m/cPfB9EMiHtOHJf21pPPpcmMGZipySd8ys14zW5uui34sPyTppKSvppdtHjOzWRmYq9Itknak74fO5u79kv5e0o8l/VTS25J6VYPnGV+Ek+TJ/9LCvh/PzGZL+oakDe7+y9JtkbO5+zuefHo4X9KnJC2ImKPIzP5M0gl3742cYxSfcfdPKLns9mUz+6PSjUGPZU7SJyT9s7t/XNIZVXxKn4Hn/yWSPi9pZ+W2iNnSa85fUPI/r6slzdLwS5iTopYB7pd0Tcny/HRdVrxlZldJUvrfExFDmNlvKYnv0+7+zSzNVuTuv5DUreTTrsvNLJduqvdj+mlJnzezPklfU3IZYkvwTEPSMye5+wkl1zI/pfjH8rik4+7+crq8S0mQo+cq9TlJ33H3t9Ll6NmWSvqhu590999I+qaS596kP89qGeBXJH0k/crhJUo+xdhdw+NVa7ektvT9NiXXX+vKzEzS45KOuntXxma70swuT9+fqeTa9FElIf5SxGzufp+7z3f3vJLn00F3XxU5U5GZzTKzDxTfV3JN84iCH0t3f1PST8zso+mqP5H0evRcFVbo3csPUvxsP5Z0nZldmr5Gi/9mk/88q/HF7BslfV/JtcMH6nkhvWKOHUqu5fxGyRnBGiXXDl+U9ANJByTNCZjrM0o+vfqepEPp240Zme0PJH03ne2IpL9J139Y0rclFZR8yvi+oMe0RdLzWZkpneHV9O214vM9I4/lIkk96WP5rKQrsjBXOtssSQOSLitZFz6bpE5Jx9Ln/r9Kel8tnmf8KDIABOGLcAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMKYEM3s2/SU3rxV/0Y2ZrTGz76e/t/hfzOwf0/VXmtk3zOyV9O3TsdMDI+MHMTAlmNkcdz+V/lj0K0p+PeBLSn6vwa8kHZT0qrv/pZltl/RP7v6fZvbbSn5t4O+FDQ9cQG7sDwEyYZ2Z/Xn6/jWSbpX0H+5+SpLMbKek3023L5V0bfJj/JKkD5rZbHc/Xc+BgbEQYGSembUoier17n7WzP5dyc/pX+isdoak69z9f+szITAxXAPGVHCZpJ+n8V2g5M83zZL0x2Z2RforAr9Y8vHfktReXDCzRXWdFhgnAoypYJ+knJkdlfS3kv5Lye9ifUjJb6d6ScmfA3o7/fh1khab2ffM7HVJf1H3iYFx4ItwmLKK13XTM+BnJG1z92ei5wLGizNgTGUd6d+sO6LkD08+GzwPUBXOgAEgCGfAABCEAANAEAIMAEEIMAAEIcAAEOT/ASoUtoMb2LNZAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rB3Wh7jldcl-", + "outputId": "34e01f99-642a-45aa-ebd3-bda160d7be2e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 665 + } + }, + "source": [ + "df_titanic.head(20)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
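<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td></tr>\n",
+ "<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td></tr>\n",
+ "<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td></tr>\n",
+ "<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td></tr>\n",
+ "<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td></tr>\n",
+ "<tr><th>5</th><td>0</td><td>3</td><td>male</td><td>28.0</td><td>0</td><td>0</td><td>8.4583</td><td>Q</td><td>28.0</td></tr>\n",
+ "<tr><th>6</th><td>0</td><td>1</td><td>male</td><td>54.0</td><td>0</td><td>0</td><td>51.8625</td><td>S</td><td>54.0</td></tr>\n",
+ "<tr><th>7</th><td>0</td><td>3</td><td>male</td><td>2.0</td><td>3</td><td>1</td><td>21.0750</td><td>S</td><td>2.0</td></tr>\n",
+ "<tr><th>8</th><td>1</td><td>3</td><td>female</td><td>27.0</td><td>0</td><td>2</td><td>11.1333</td><td>S</td><td>27.0</td></tr>\n",
+ "<tr><th>9</th><td>1</td><td>2</td><td>female</td><td>14.0</td><td>1</td><td>0</td><td>30.0708</td><td>C</td><td>14.0</td></tr>\n",
+ "<tr><th>10</th><td>1</td><td>3</td><td>female</td><td>4.0</td><td>1</td><td>1</td><td>16.7000</td><td>S</td><td>4.0</td></tr>\n",
+ "<tr><th>11</th><td>1</td><td>1</td><td>female</td><td>58.0</td><td>0</td><td>0</td><td>26.5500</td><td>S</td><td>28.0</td></tr>\n",
+ "<tr><th>12</th><td>0</td><td>3</td><td>male</td><td>20.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>20.0</td></tr>\n",
+ "<tr><th>13</th><td>0</td><td>3</td><td>male</td><td>39.0</td><td>1</td><td>5</td><td>31.2750</td><td>S</td><td>39.0</td></tr>\n",
+ "<tr><th>14</th><td>0</td><td>3</td><td>female</td><td>14.0</td><td>0</td><td>0</td><td>7.8542</td><td>S</td><td>14.0</td></tr>\n",
+ "<tr><th>15</th><td>1</td><td>2</td><td>female</td><td>55.0</td><td>0</td><td>0</td><td>16.0000</td><td>S</td><td>28.0</td></tr>\n",
+ "<tr><th>16</th><td>0</td><td>3</td><td>male</td><td>2.0</td><td>4</td><td>1</td><td>29.1250</td><td>Q</td><td>2.0</td></tr>\n",
+ "<tr><th>17</th><td>1</td><td>2</td><td>male</td><td>28.0</td><td>0</td><td>0</td><td>13.0000</td><td>S</td><td>28.0</td></tr>\n",
+ "<tr><th>18</th><td>0</td><td>3</td><td>female</td><td>31.0</td><td>1</td><td>0</td><td>18.0000</td><td>S</td><td>31.0</td></tr>\n",
+ "<tr><th>19</th><td>1</td><td>3</td><td>female</td><td>28.0</td><td>0</td><td>0</td><td>7.2250</td><td>C</td><td>28.0</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0\n", + "5 0 3 male 28.0 0 0 8.4583 Q 28.0\n", + "6 0 1 male 54.0 0 0 51.8625 S 54.0\n", + "7 0 3 male 2.0 3 1 21.0750 S 2.0\n", + "8 1 3 female 27.0 0 2 11.1333 S 27.0\n", + "9 1 2 female 14.0 1 0 30.0708 C 14.0\n", + "10 1 3 female 4.0 1 1 16.7000 S 4.0\n", + "11 1 1 female 58.0 0 0 26.5500 S 28.0\n", + "12 0 3 male 20.0 0 0 8.0500 S 20.0\n", + "13 0 3 male 39.0 1 5 31.2750 S 39.0\n", + "14 0 3 female 14.0 0 0 7.8542 S 14.0\n", + "15 1 2 female 55.0 0 0 16.0000 S 28.0\n", + "16 0 3 male 2.0 4 1 29.1250 Q 2.0\n", + "17 1 2 male 28.0 0 0 13.0000 S 28.0\n", + "18 0 3 female 31.0 1 0 18.0000 S 31.0\n", + "19 1 3 female 28.0 0 0 7.2250 C 28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 324 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x6YRvSf5SRR4" + }, + "source": [ + "### Which rows are the 'age' outliers?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2y9BeUnoSU4W", + "outputId": "85968b30-7903-465b-eddf-27d4604acb58", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "age_outlier = df_titanic[df_titanic['age'] > limite_superior_outliers]\n", + "age_outlier.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "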
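<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>11</th><td>1</td><td>1</td><td>female</td><td>58.0</td><td>0</td><td>0</td><td>26.5500</td><td>S</td><td>28.0</td></tr>\n",
+ "<tr><th>15</th><td>1</td><td>2</td><td>female</td><td>55.0</td><td>0</td><td>0</td><td>16.0000</td><td>S</td><td>28.0</td></tr>\n",
+ "<tr><th>33</th><td>0</td><td>2</td><td>male</td><td>66.0</td><td>0</td><td>0</td><td>10.5000</td><td>S</td><td>28.0</td></tr>\n",
+ "<tr><th>54</th><td>0</td><td>1</td><td>male</td><td>65.0</td><td>0</td><td>1</td><td>61.9792</td><td>C</td><td>28.0</td></tr>\n",
+ "<tr><th>94</th><td>0</td><td>3</td><td>male</td><td>59.0</td><td>0</td><td>0</td><td>7.2500</td><td>S</td><td>28.0</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "11 1 1 female 58.0 0 0 26.5500 S 28.0\n", + "15 1 2 female 55.0 0 0 16.0000 S 28.0\n", + "33 0 2 male 66.0 0 0 10.5000 S 28.0\n", + "54 0 1 male 65.0 0 1 61.9792 C 28.0\n", + "94 0 3 male 59.0 0 0 7.2500 S 28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 327 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J0dHnei1TBFc" + }, + "source": [ + "### Treating outliers in the 'fare' variable" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i8YM25uKm8g1", + "outputId": "00e04c37-82d8-4aca-c250-36a4a045df2f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "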
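<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td></tr>\n",
+ "<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td></tr>\n",
+ "<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td></tr>\n",
+ "<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td></tr>\n",
+ "<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>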
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 328 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4Gnr9AEfTSj6", + "outputId": "40a23226-68bd-4288-dbf2-5fabfd4fd32b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + } + }, + "source": [ + "df_titanic, limite_superior_outliers = trata_outliers(df = df_titanic, coluna = 'fare')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "IQR: 23.0896\n", + "Limite inferior para outlier: -26.724; Limite superior para outliers: 65.6344\n", + "Mediana: 14.4542\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEGCAYAAABbzE8LAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAOr0lEQVR4nO3df2zcdR3H8de7PTbGiuA2XEgBb+SIhAQEaRSQP25jbGMbELP9MQLpRvgRjCkTNApbY0YwGvljiI1RCRJMXJwxamTdBDdW/tKArRtssCGHKXELMCg/tGwZFj7+cZ877q7X33f3bu+ej6TZfb/3ve/3+xnfPvvpd+1hIQQBAGqvyfsEAKBREWAAcEKAAcAJAQYAJwQYAJwkJrLxggULQjKZrNKpAEB96uvreyeEcFbp+gkFOJlMqre3t3JnBQANwMxeL7eeWxAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOJvT/hKukrq4uZTKZ/PLRo0clSa2traO+LpVKqaOjo6rnBgC14BbgTCaj/QcP6ePT5kmSmo9/IEl68+TIp9R8/N2anBsA1IJbgCXp49Pm6cSFKyVJcw7vkqT8cjm5bQCgHnAPGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJzUJMBdXV3q6uqqxaEmbDqfG4D6lqjFQTKZTC0OMynT+dwA1DduQQCAEwIMAE4IMAA4IcAA4
IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4aPsDHjx/XgQMHtGnTJqXTaaXTaXV1deUf5z7a29uVTqe1ZMkSLV68WA899JDS6bSuueYaXXfddcpkMmX3n8lktGrVKvX09Gj58uX5/fX19UmS9u7dq3Q6rZ6enhFfv2zZMqXTae3YsWPE/Wcymfy+Cvc/0rbV1tvbqyVLlujWW2/VwMBA0XMDAwO6++67NTAwMOb4S41nDNu2bVM6ndb27dunNAZA+vRaLvc5NVUWQhj3xm1tbaG3t3fCB9m4caMk6ZFHHila1/evt3TiwpWSpDmHd0lSfrmcOYd36fLzFxbtZ6qWL1+ukydPTnk/yWRSTzzxxLD1GzZsUH9/vxKJhIaGhvLrW1pa1N3draVLl2poaEiJREJ79uwZ8fWSZGbDQpV7PplM6siRI/lj5PY/0rblzrWSVq9ercHBQUnSjTfeqHvuuSf/3NatW7Vjxw7dcMMN2rlz56jjLzWeMaTT6fzjZ599dirDAPLXcrnPqfEys74QQlvp+oaeAWcymYrEV5L6+/uHzcoymUw+noXxlaTBwUE99thj+fVDQ0PD4lr4ekkKIRTNgguf7+/vLzrG4OBg0Vfs0m2rOQvu7e3Nx1eSdu7cmZ8FDwwM6KmnnlIIQd3d3aOOv9R4xrBt27aiZWbBmIrCa7n0c6oSajIDXrt2rU6cOKFUKpVfl8lk9N+Pgj68dJ2k8c2A5+7frtNnWdF+puLw4cMVC7A0fBZcOHsdj9JZYLnXF86Cx9p/4Vfs0m2rOQsunP3m5GbBW7du1a5du4Z9QZKGj7/UeMZQOPvNYRaMySq9lic7C570DNjM7jSzXjPrffvttyd84OmskvGVNCyGE4mvNHyWXO71hV8wx9p/4YUz1XObiNL4StLu3bslSXv27CkbX2n4+EvVcgyANPxaLndtT0VirA1CCI9KelTKzoAnc5DW1lZJ5e8BT8Qnp35GqQreA57oDHUsyWRy2PJEZ8Bjvd7Mxr3/lpaWEbctPddKamlpGXahXnvttZKkpUuXjjoDHk0txwBIw6/lws+pSmjoe8CdnZ1V3d9Y+7/llluKljdv3jzm6++9995x7/+BBx6Y9LlNxZYtW4qWE4mE2tvbJUnr169XU1P2smtubi7arnT8pcYzhjvuuKNo+a677hrXOQPllF7LhZ9TldDQAU6lUpo9e3ZF9pVMJofdm06lUvlZWunsrqWlRbfffnt+fSKR0OLFi0d8vZSd/V5//fVln08mk0XHaGlp0eWXXz7itpW6j15OW1tb0Uxh1apVmj9/viRp/vz5WrFihcxMq1evHnX8pcYzhptvvrloed26dVMZChpc4bVc+jlVCQ0dYEk677zz1NTUpKuuuiq/bs2aNWW3k6SmpiaZmVauzP5jYXNzs+bMmTPijLKzs1Nz587V5s2bi2Kf+0q6adMmSSPP/jo7OzVr1ixJxbPf0v13dnbm91W4/5G2rbYtW7aoqalJixYtys9+c9avX6+LL75Y7e3tY46/1HjGkJsFM/tFJeSu5UrPfiV+DrjsuQFAJfFzwAAwzRBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHCSqMVBUqlULQ4zKdP53ADUt5oEuKOjoxaHmZTpfG4A6hu3IADACQEGACcEGACcE
GAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcJLwPHjz8Xc15/Cu+HhAkvLLI20vLazFqQFA1bkFOJVKFS0fPTokSWptHS2wC4e9DgBmKrcAd3R0eB0aAKYF7gEDgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4MRCCOPf2OxtSa9P8lgLJL0zydfONI00VqmxxttIY5Uaa7zVHOvnQwhnla6cUICnwsx6QwhtNTmYs0Yaq9RY422ksUqNNV6PsXILAgCcEGAAcFLLAD9aw2N5a6SxSo013kYaq9RY4635WGt2DxgAUIxbEADghAADgJOqB9jMVpjZK2aWMbP7qn28WjCzx83smJkdLFg3z8x2m9mr8c/PxvVmZj+J43/RzL7kd+YTZ2bnmlmPmb1sZi+Z2ca4vl7He6qZPW9mL8TxPhDXLzKz5+K4fmtms+L62XE5E59Pep7/ZJhZs5ntM7PuuFyXYzWzfjM7YGb7zaw3rnO9jqsaYDNrlvRTSddJukjSTWZ2UTWPWSNPSFpRsu4+Sc+EEC6Q9ExclrJjvyB+3CnpZzU6x0oZkvStEMJFkq6Q9I3437Bex3tS0pIQwhclXSpphZldIelHkh4OIaQkvSfptrj9bZLei+sfjtvNNBslHSpYruexLg4hXFrw876+13EIoWofkq6U9HTB8v2S7q/mMWv1ISkp6WDB8iuSzo6Pz5b0Snz8C0k3ldtuJn5I+pOkaxthvJJOk/QPSV9R9jekEnF9/rqW9LSkK+PjRNzOvM99AmM8R9nwLJHULcnqeKz9khaUrHO9jqt9C6JV0r8Llo/EdfVoYQjhjfj4TUkL4+O6+TuI33JeJuk51fF447fk+yUdk7Rb0muS3g8hDMVNCseUH298/gNJ82t7xlPyY0nfkfRJXJ6v+h1rkPQXM+szszvjOtfrOFHpHUIKIQQzq6uf7zOzFkm/l/TNEMJ/zCz/XL2NN4TwsaRLzexMSX+UdKHzKVWFma2WdCyE0Gdmae/zqYGrQwhHzexzknab2eHCJz2u42rPgI9KOrdg+Zy4rh69ZWZnS1L881hcP+P/DszsFGXjuy2E8Ie4um7HmxNCeF9Sj7Lfhp9pZrkJS+GY8uONz58haaDGpzpZX5V0g5n1S9qu7G2IR1SfY1UI4Wj885iyX1i/LOfruNoB/rukC+K/qs6StE7Sk1U+ppcnJa2Pj9cre680t749/qvqFZI+KPiWZ9qz7FT3l5IOhRC2FjxVr+M9K858ZWZzlL3ffUjZEK+Nm5WON/f3sFbS3hBvGk53IYT7QwjnhBCSyn5u7g0h3Kw6HKuZzTWz03OPJS2TdFDe13ENbnyvlPRPZe+jbfa+EV+hMf1G0huS/qfsvaHblL0X9oykVyXtkTQvbmvK/iTIa5IOSGrzPv8JjvVqZe+dvShpf/xYWcfjvUTSvjjeg5K+F9efL+l5SRlJv5M0O64/NS5n4vPne49hkuNOS+qu17HGMb0QP17Ktcj7OuZXkQHACb8JBwBOCDAAOCHAAOCEAAOAEwIMAE4IMKY9M7vbzA6Z2TbvcwEqiR9Dw7QXf2V0aQjhyDi2TYRP38cAmNaYAWNaM7OfK/tD9H82s++a2d/ie9f+1cy+ELfZYGZPmtleSc/E33p6PL6v7z4zu9F1EMAImAFj2ovvVdAm6SNJx0MIQ2a2VNLXQwhrzGyDpO9LuiSE8K6Z/UDSyyGEX8dfK35e0mUhhA+dhgCUxbuhYSY5Q9KvzOwCZX89+
pSC53aHEN6Nj5cp+yYz347Lp0o6T8VvOg64I8CYSR6U1BNC+Fp8b+JnC54rnN2apDUhhFdqd2rAxHEPGDPJGfr0LQE3jLLd05I64ju5ycwuq/J5AZNCgDGTPCTph2a2T6N/9/agsrcnXjSzl+IyMO3wj3AA4IQZMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgJP/A44KX5vXXCReAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uh7f7nNATSkT" + }, + "source": [ + "### Quem são os outliers de 'fare'?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BdzaUaD0nQnv", + "outputId": "c05a1d46-c91a-4542-92a8-bfb121b44174", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "limite_superior_outliers" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "65.6344" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 330 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P3SAGnYnnQn4", + "outputId": "424347d6-d243-48df-8c3d-4c2a014c0cd9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "fare_outlier = df_titanic[df_titanic['fare'] > limite_superior_outliers]\n", + "fare_outlier.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
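<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th><th>fare_o</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td></tr>\n",
+ "<tr><th>27</th><td>0</td><td>1</td><td>male</td><td>19.0</td><td>3</td><td>2</td><td>263.0000</td><td>S</td><td>19.0</td><td>14.4542</td></tr>\n",
+ "<tr><th>31</th><td>1</td><td>1</td><td>female</td><td>28.0</td><td>1</td><td>0</td><td>146.5208</td><td>C</td><td>28.0</td><td>14.4542</td></tr>\n",
+ "<tr><th>34</th><td>0</td><td>1</td><td>male</td><td>28.0</td><td>1</td><td>0</td><td>82.1708</td><td>C</td><td>28.0</td><td>14.4542</td></tr>\n",
+ "<tr><th>52</th><td>1</td><td>1</td><td>female</td><td>49.0</td><td>1</td><td>0</td><td>76.7292</td><td>C</td><td>49.0</td><td>14.4542</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " survived pclass sex age ... fare embarked2 age_o fare_o\n", + "1 1 1 female 38.0 ... 71.2833 C 38.0 14.4542\n", + "27 0 1 male 19.0 ... 263.0000 S 19.0 14.4542\n", + "31 1 1 female 28.0 ... 146.5208 C 28.0 14.4542\n", + "34 0 1 male 28.0 ... 82.1708 C 28.0 14.4542\n", + "52 1 1 female 49.0 ... 76.7292 C 49.0 14.4542\n", + "\n", + "[5 rows x 10 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 331 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Jh83WTrZDeM_" + }, + "source": [ + "### Binning numeric variables: age and fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JVNVCd7aDjkz", + "outputId": "21f91b71-cfe5-445b-cf2c-8f4bf0de9825", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "#df_titanic_copia = df_titanic.copy()\n", + "df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "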
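<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td></tr>\n",
+ "<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td></tr>\n",
+ "<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td></tr>\n",
+ "<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td></tr>\n",
+ "<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 332 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pUspmjPWFP06" + }, + "source": [ + "#### Using cut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NBVoCBe_2Zmp" + }, + "source": [ + "#df_titanic['age_bins'] = pd.cut(x = df_titanic['age_o'], bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90])\n", + "# Label each bin by its upper edge, e.g. ages in (20, 30] get the label 30\n", + "df_titanic['age_bins'] = pd.cut(x = df_titanic['age_o'], bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90], labels = [10, 20, 30, 40, 50, 60, 70, 80, 90])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2i1jombNDrEO", + "outputId": "7c96358b-3023-4706-813a-3e3e594cc45e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 264 + } + }, + "source": [ + "df_titanic.head(7)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "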
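<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th><th>age_bins</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>30</td></tr>\n",
+ "<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>40</td></tr>\n",
+ "<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>30</td></tr>\n",
+ "<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>40</td></tr>\n",
+ "<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>40</td></tr>\n",
+ "<tr><th>5</th><td>0</td><td>3</td><td>male</td><td>28.0</td><td>0</td><td>0</td><td>8.4583</td><td>Q</td><td>28.0</td><td>30</td></tr>\n",
+ "<tr><th>6</th><td>0</td><td>1</td><td>male</td><td>54.0</td><td>0</td><td>0</td><td>51.8625</td><td>S</td><td>54.0</td><td>60</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " survived pclass sex age ... fare embarked2 age_o age_bins\n", + "0 0 3 male 22.0 ... 7.2500 S 22.0 30\n", + "1 1 1 female 38.0 ... 71.2833 C 38.0 40\n", + "2 1 3 female 26.0 ... 7.9250 S 26.0 30\n", + "3 1 1 female 35.0 ... 53.1000 S 35.0 40\n", + "4 0 3 male 35.0 ... 8.0500 S 35.0 40\n", + "5 0 3 male 28.0 ... 8.4583 Q 28.0 30\n", + "6 0 1 male 54.0 ... 51.8625 S 54.0 60\n", + "\n", + "[7 rows x 10 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 340 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "davIt0UT9tTr" + }, + "source": [ + "#### **Challenge**: What would be the optimal cut for 'age' using a DecisionTree?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i5aAYl2ZDu1f", + "outputId": "0f1d7a99-6cb0-4484-b12b-7907b70feb8a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['age_bins_cut1'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20, 30] 449\n", + "(30, 40] 155\n", + "(10, 20] 115\n", + "(40, 50] 86\n", + "(0, 10] 64\n", + "(50, 60] 22\n", + "(80, 90] 0\n", + "(70, 80] 0\n", + "(60, 70] 0\n", + "Name: age_bins_cut1, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 276 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VAUshOiLFT9-" + }, + "source": [ + "#### Using qcut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RKnb-bI7FL3F", + "outputId": "59742803-dcee-4525-8fd1-2cecff379cab", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "# 4 labels for q = 5: this assumes duplicates = 'drop' collapses one repeated quantile edge, leaving 4 bins\n", + "df_titanic['age_bins_qcut'] = pd.qcut(x = df_titanic['age'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "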
<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>2</td></tr>\n",
+ "<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>3</td></tr>\n",
+ "<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>2</td></tr>\n",
+ "<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>3</td></tr>\n",
+ "<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>3</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 2\n", + "1 1 1 female ... (30, 40] 40 3\n", + "2 1 3 female ... (20, 30] 30 2\n", + "3 1 1 female ... (30, 40] 40 3\n", + "4 0 3 male ... (30, 40] 40 3\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 277 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "boSGroSYN7cP", + "outputId": "9304540f-1a20-41cb-c4a4-70633be1a071", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.dtypes" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived int64\n", + "pclass int64\n", + "sex object\n", + "age float64\n", + "sibsp int64\n", + "parch int64\n", + "fare float64\n", + "embarked2 object\n", + "age_o float64\n", + "age_bins category\n", + "dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 344 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P8s3LzfpNdUz", + "outputId": "b0e4b638-32de-4295-984e-ba207e195661", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "l_colunas_numericas = list(df_titanic.select_dtypes('int').columns)\n", + "l_colunas_numericas\n", + "\n", + "# Apply each treatment step to every integer column of df_titanic\n", + "for coluna in l_colunas_numericas:\n", + " trata_outliers(df_titanic, coluna)\n", + " trata_missing_values(df_titanic, coluna)\n", + " aplica_MMS(df_titanic, coluna)\n", + " aplica_RS(df_titanic, coluna)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['survived', 'pclass', 'sibsp', 'parch']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 346 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ov2_l39mn3FH", + "outputId": "122869f8-018f-4176-b2df-c249efa222b7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + 
"df_titanic['age_bins_qcut'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20.0, 28.0] 360\n", + "(0.419, 20.0] 179\n", + "(38.0, 80.0] 177\n", + "(28.0, 38.0] 175\n", + "Name: age_bins_qcut, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 261 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J60XwHUOGwbr" + }, + "source": [ + "### MinMaxScaler() and RobustScaler()\n", + "* age and fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GRY84U4HHxoQ" + }, + "source": [ + "from sklearn.preprocessing import MinMaxScaler, RobustScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IQC7Bo-DH71s" + }, + "source": [ + "mms = MinMaxScaler()\n", + "rs = RobustScaler()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8O2oM9XdIYF5", + "outputId": "9812b76b-dffc-406b-fcd9-0a7c1b01a61b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
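<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td></tr>\n",
+ "<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n",
+ "<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td></tr>\n",
+ "<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n",
+ "<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>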
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 (20.0, 28.0]\n", + "1 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "2 1 3 female ... (20, 30] 30 (20.0, 28.0]\n", + "3 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "4 0 3 male ... (30, 40] 40 (28.0, 38.0]\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 264 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B-qglHy6NZlg", + "outputId": "96a154a8-678e-48eb-93b2-ce93b7e0258f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table>\n",
+ "<thead>\n",
+ "<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td></tr>\n",
+ "<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n",
+ "<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td></tr>\n",
+ "<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n",
+ "<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 (20.0, 28.0]\n", + "1 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "2 1 3 female ... (20, 30] 30 (20.0, 28.0]\n", + "3 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "4 0 3 male ... (30, 40] 40 (28.0, 38.0]\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 265 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dp9jYZ1OoA9i" + }, + "source": [ + "Next, drop the variables that are no longer needed..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zSSViPY5XokW" + }, + "source": [ + "df_titanic.drop(columns = ['age', 'fare'], inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MNq5a0eUIBGV", + "outputId": "d692f608-f511-46e6-d101-bd9443647b94", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 282 + } + }, + "source": [ + "# Fit both scalers on the outlier-treated columns and store the scaled values\n", + "df_titanic_mms = pd.DataFrame(mms.fit_transform(df_titanic[['age_o', 'fare_o']]), columns = ['age_mms', 'fare_mms'])\n", + "df_titanic_rs = pd.DataFrame(rs.fit_transform(df_titanic[['age_o', 'fare_o']]), columns = ['age_rs', 'fare_rs'])\n", + "\n", + "df_titanic['age_mms'] = df_titanic_mms['age_mms']\n", + "df_titanic['age_rs'] = df_titanic_rs['age_rs']\n", + "\n", + "df_titanic['fare_mms'] = df_titanic_mms['fare_mms']\n", + "df_titanic['fare_rs'] = df_titanic_rs['fare_rs']\n", + "\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
survivedpclasssexsibspparchambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcutage_mmsage_rsfare_mmsfare_rsage_bins_mmsage_bins_rsfare_bins_mmsfare_bins_rs
003male10S22.07.2500(20, 30]30(20.0, 28.0]0.402762-0.5454550.111538-0.443619(0.365, 0.515](-0.727, 0.0](-0.001, 0.121](-0.891, -0.406]
111female10C38.014.4542(30, 40]40(28.0, 38.0]0.7013810.9090910.2223720.000000(0.645, 1.0](0.636, 2.364](0.162, 0.222](-0.243, 0.0]
213female00S26.07.9250(20, 30]30(20.0, 28.0]0.477417-0.1818180.121923-0.402054(0.365, 0.515](-0.727, 0.0](0.121, 0.162](-0.406, -0.243]
311female10S35.053.1000(30, 40]40(28.0, 38.0]0.6453900.6363640.8169232.379726(0.515, 0.645](0.0, 0.636](0.404, 1.0](0.726, 3.113]
403male00S35.08.0500(30, 40]40(28.0, 38.0]0.6453900.6363640.123846-0.394357(0.515, 0.645](0.0, 0.636](0.121, 0.162](-0.406, -0.243]
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_rs fare_bins_mms fare_bins_rs\n", + "0 0 3 male ... (-0.727, 0.0] (-0.001, 0.121] (-0.891, -0.406]\n", + "1 1 1 female ... (0.636, 2.364] (0.162, 0.222] (-0.243, 0.0]\n", + "2 1 3 female ... (-0.727, 0.0] (0.121, 0.162] (-0.406, -0.243]\n", + "3 1 1 female ... (0.0, 0.636] (0.404, 1.0] (0.726, 3.113]\n", + "4 0 3 male ... (0.0, 0.636] (0.121, 0.162] (-0.406, -0.243]\n", + "\n", + "[5 rows x 19 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 268 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UzrdPNO3rIg5", + "outputId": "aaa31937-081d-4af8-f3ec-07002886d2a6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 555 + } + }, + "source": [ + "# Categorizando as variáveis transformadas\n", + "df_titanic['age_bins_mms'] = pd.qcut(x = df_titanic['age_mms'], q = 5, duplicates = 'drop', labels = [1, 2, 3, 4])\n", + "df_titanic['age_bins_rs'] = pd.qcut(x = df_titanic['age_rs'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "\n", + "df_titanic['fare_bins_mms'] = pd.qcut(x = df_titanic['fare_mms'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "df_titanic['fare_bins_rs'] = pd.qcut(x = df_titanic['fare_rs'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2894\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2895\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2896\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'age_mms'", + "\nThe above exception was the direct cause of the following exception:\n", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Categorizando as variáveis transformadas\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_bins_mms'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_mms'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m'drop'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_bins_rs'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_rs'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'drop'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'fare_bins_mms'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'fare_mms'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m'drop'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2900\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2901\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2902\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2903\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2904\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2895\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2896\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2897\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2898\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2899\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'age_mms'" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7smfXya5pmNq", + "outputId": "a942223f-e9b6-4758-c453-73c5c55f91bd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.drop(columns = ['age_o', 'fare_o', 'age_bins_cut2', 'age_mms', 'age_rs', 'fare_mms', 'fare_rs'], axis = 1, inplace = True)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchambarked2age_bins_cut1age_bins_qcutage_bins_mmsage_bins_rsfare_bins_mmsfare_bins_rs
003male10S(20, 30](20.0, 28.0](0.365, 0.515](-0.727, 0.0](-0.001, 0.121](-0.891, -0.406]
111female10C(30, 40](28.0, 38.0](0.645, 1.0](0.636, 2.364](0.162, 0.222](-0.243, 0.0]
213female00S(20, 30](20.0, 28.0](0.365, 0.515](-0.727, 0.0](0.121, 0.162](-0.406, -0.243]
311female10S(30, 40](28.0, 38.0](0.515, 0.645](0.0, 0.636](0.404, 1.0](0.726, 3.113]
403male00S(30, 40](28.0, 38.0](0.515, 0.645](0.0, 0.636](0.121, 0.162](-0.406, -0.243]
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_rs fare_bins_mms fare_bins_rs\n", + "0 0 3 male ... (-0.727, 0.0] (-0.001, 0.121] (-0.891, -0.406]\n", + "1 1 1 female ... (0.636, 2.364] (0.162, 0.222] (-0.243, 0.0]\n", + "2 1 3 female ... (-0.727, 0.0] (0.121, 0.162] (-0.406, -0.243]\n", + "3 1 1 female ... (0.0, 0.636] (0.404, 1.0] (0.726, 3.113]\n", + "4 0 3 male ... (0.0, 0.636] (0.121, 0.162] (-0.406, -0.243]\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 269 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SFPNLDMcU339" + }, + "source": [ + "### Variáveis Dummy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L_Fx1iy7snjF", + "outputId": "24c70d23-5a35-41aa-8a30-04fcc146b7c6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcut
003male22.0107.2500S22.07.2500(20, 30]302
111female38.01071.2833C38.014.4542(30, 40]403
213female26.0007.9250S26.07.9250(20, 30]302
311female35.01053.1000S35.053.1000(30, 40]403
403male35.0008.0500S35.08.0500(30, 40]403
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 2\n", + "1 1 1 female ... (30, 40] 40 3\n", + "2 1 3 female ... (20, 30] 30 2\n", + "3 1 1 female ... (30, 40] 40 3\n", + "4 0 3 male ... (30, 40] 40 3\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 279 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X6aHaJodX0Hi", + "outputId": "f7a26db1-81d3-47f5-dcd3-bbde2b2b6440", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 402 + } + }, + "source": [ + "dummy = pd.get_dummies(df_titanic[['sex', 'ambarked2']])\n", + "dummy" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sex_femalesex_maleambarked2_Cambarked2_Qambarked2_S
001001
110100
210001
310001
401001
..................
88601001
88710001
88810001
88901100
89001010
\n", + "

891 rows × 5 columns

\n", + "
" + ], + "text/plain": [ + " sex_female sex_male ambarked2_C ambarked2_Q ambarked2_S\n", + "0 0 1 0 0 1\n", + "1 1 0 1 0 0\n", + "2 1 0 0 0 1\n", + "3 1 0 0 0 1\n", + "4 0 1 0 0 1\n", + ".. ... ... ... ... ...\n", + "886 0 1 0 0 1\n", + "887 1 0 0 0 1\n", + "888 1 0 0 0 1\n", + "889 0 1 1 0 0\n", + "890 0 1 0 1 0\n", + "\n", + "[891 rows x 5 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 282 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZhLW0lEbs28E", + "outputId": "c772c290-4394-409a-ded7-1eb57b7ac0db", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 215 + } + }, + "source": [ + "df_titanic2 = pd.concat([df_titanic, dummy], axis = 1)\n", + "df_titanic2.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcutsex_femalesex_maleambarked2_Cambarked2_Qambarked2_S
003male22.0107.2500S22.07.2500(20, 30]30201001
111female38.01071.2833C38.014.4542(30, 40]40310100
213female26.0007.9250S26.07.9250(20, 30]30210001
311female35.01053.1000S35.053.1000(30, 40]40310001
403male35.0008.0500S35.08.0500(30, 40]40301001
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... ambarked2_C ambarked2_Q ambarked2_S\n", + "0 0 3 male ... 0 0 1\n", + "1 1 1 female ... 1 0 0\n", + "2 1 3 female ... 0 0 1\n", + "3 1 1 female ... 0 0 1\n", + "4 0 3 male ... 0 0 1\n", + "\n", + "[5 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 283 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_bOYD4gWwGt", + "outputId": "b33c5076-6d0b-4c8f-8e5b-d87cb3e4544d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['pclass'].value_counts() # Quem será a referência?" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "3.0 484\n", + "1.0 189\n", + "2.0 176\n", + "Name: pclass, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 64 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xFKdsFDihApP" + }, + "source": [ + "df_titanic['pclass'] = df_titanic['pclass'].astype('category')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D_mWCqM1ZOgU" + }, + "source": [ + "### Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UPnCuCsLZSjQ", + "outputId": "95b92d55-a895-4c2e-9655-07a3bd753f6d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 162 + } + }, + "source": [ + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mX_treinamento\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mX_teste\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_treinamento\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_teste\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrain_test_split\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 'train_test_split' is not defined" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rk-Zuh5RXJbp", + "outputId": "47c3c005-795d-406d-984a-b0094cd5718c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchambarked2age3fare3age_bins_cut1age_bins_cut2age_bins_qcutage_mmsage_rsfare_mmsfare_rs
00.03.0male1.00.0S22.07.2500(20, 30]30(20.0, 28.0]0.402762-0.5454550.014151-0.323505
11.01.0female1.00.0C38.071.2833(30, 40]40(36.0, 54.0]0.7013810.9090910.1391362.696934
21.03.0female0.00.0S26.07.9250(20, 30]30(20.0, 28.0]0.477417-0.1818180.015469-0.291665
31.01.0female1.00.0S35.053.1000(30, 40]40(28.0, 36.0]0.6453900.6363640.1036441.839231
40.03.0male0.00.0S35.08.0500(30, 40]40(28.0, 36.0]0.6453900.6363640.015713-0.285769
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp ... age_mms age_rs fare_mms fare_rs\n", + "0 0.0 3.0 male 1.0 ... 0.402762 -0.545455 0.014151 -0.323505\n", + "1 1.0 1.0 female 1.0 ... 0.701381 0.909091 0.139136 2.696934\n", + "2 1.0 3.0 female 0.0 ... 0.477417 -0.181818 0.015469 -0.291665\n", + "3 1.0 1.0 female 1.0 ... 0.645390 0.636364 0.103644 1.839231\n", + "4 0.0 3.0 male 0.0 ... 0.645390 0.636364 0.015713 -0.285769\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 69 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XtdQc49LXTfk", + "outputId": "696c23be-e9d0-4ba9-b4f5-0dcba0ffe8ec", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['pclass'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "3.0 484\n", + "1.0 189\n", + "2.0 176\n", + "Name: pclass, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 74 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fbvB30S5hRxH", + "outputId": "e872fe50-fcad-4b33-9456-182b1ca7e62c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "modelo = smf.glm(formula = 'survived ~ age3 + pclass + sex', data = df_titanic, family = sm.families.Binomial()).fit()\n", + "print(modelo.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " Generalized Linear Model Regression Results \n", + "==============================================================================\n", + "Dep. Variable: survived No. Observations: 849\n", + "Model: GLM Df Residuals: 844\n", + "Model Family: Binomial Df Model: 4\n", + "Link Function: logit Scale: 1.0000\n", + "Method: IRLS Log-Likelihood: -386.42\n", + "Date: Thu, 29 Oct 2020 Deviance: 772.85\n", + "Time: 17:17:09 Pearson chi2: 890.\n", + "No. 
Iterations: 5 \n", + "Covariance Type: nonrobust \n", + "=================================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "---------------------------------------------------------------------------------\n", + "Intercept 3.5515 0.382 9.297 0.000 2.803 4.300\n", + "pclass[T.2.0] -1.1389 0.264 -4.315 0.000 -1.656 -0.622\n", + "pclass[T.3.0] -2.3581 0.245 -9.613 0.000 -2.839 -1.877\n", + "sex[T.male] -2.5618 0.189 -13.522 0.000 -2.933 -2.191\n", + "age3 -0.0344 0.009 -4.035 0.000 -0.051 -0.018\n", + "=================================================================================\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p7_gfQXciFs1" + }, + "source": [ + "How significant are the coefficients (p-value below 0.05, adopting a 95% confidence level)?\n", + "\n", + "In the summary above, every reported p-value (column P>|z|) is below 0.05." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xtrh_bYNikTk" + }, + "source": [ + "### Interpreting the coefficients:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FzlGDyeLgL11" + }, + "source": [ + "* Passengers travelling in second class had lower odds of survival than those travelling in first class.\n", + "* Passengers travelling in third class had lower odds still.\n", + "* Men had lower odds of survival than women.\n", + "* The older the passenger, the lower the odds of survival." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CJNgEYY9ioVM" + }, + "source": [ + "### More interpretable coefficients - relative odds of survival" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q0vLh1v3irCz" + }, + "source": [ + "print(np.exp(modelo.params[1:]))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a2fJIOOzi3VF" + }, + "source": [ + "Exponentiating the coefficients from the summary above gives odds ratios:\n", + "\n", + "* Second-class passengers had about 0.32 times the odds of survival of first-class passengers.\n", + "* Third-class passengers had about 0.095 times the odds of first-class passengers.\n", + "* Men had about 0.08 times the odds of women." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dYRkdqNujHFA" + }, + "source": [ + "### Comparing with linear regression" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mKW-aODfjLbm" + }, + "source": [ + "(np.exp(modelo.params[1:]) - 1) * 100" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ODOqZpAgjQ2q" + }, + "source": [ + "* Second-class passengers have about 68% lower odds of survival than first-class passengers.\n", + "* Third-class passengers have about 91% lower odds of survival than first-class passengers.\n", + "* Men have about 92% lower odds of survival than women.\n", + "* Each additional year of age lowers the odds by about 3.4%." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fxnutoD7jp94" + }, + "source": [ + "### Model quality" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-oW8Kg5Ij3Av", + "outputId": "d56cd7ce-e5c0-493d-ccfe-27cd5a6899c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 354 + } + }, + "source": [ + "# Column names follow df_titanic as used in the GLM above (lowercase, with 'age3')\n", + "modelo2 = LogisticRegression(penalty='none', solver='newton-cg')\n", + "df_titanic2 = df_titanic[['survived', 'pclass', 'sex', 'age3']].dropna()\n", + "y = df_titanic2['survived']\n", + "X = pd.get_dummies(df_titanic2[['pclass', 'sex', 'age3']], drop_first=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YK9nIiz_kQQl" + }, + "source": [ + "modelo2.fit(X, y)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FPhHe4SmkZoE" + }, + "source": [ + "y_pred = modelo2.predict_proba(X)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dPrXU0GSknGJ" + }, + "source": [ + "confusion_matrix(y, modelo2.predict(X))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_-Wweiq7kruH" + }, + "source": [ + "acuracia = accuracy_score(y, modelo2.predict(X))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bqcT8XYJkuwH" + }, + "source": [ + "print(classification_report(y, modelo2.predict(X)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4fSN-vLOjseh" + }, + "source": [ + "confusion_matrix(y, modelo2.predict(X)) # using sklearn's function" + ], + "execution_count": null, + "outputs": [] + }, 
+ { + "cell_type": "code", + "metadata": { + "id": "A0408gbPkywR" + }, + "source": [ + "def plot_roc_curve(y_true, y_score, figsize=(10,6)):\n", + " fpr, tpr, _ = roc_curve(y_true, y_score)\n", + " plt.figure(figsize=figsize)\n", + " auc_value = roc_auc_score(y_true, y_score)\n", + " plt.plot(fpr, tpr, color='orange', label='ROC curve (area = %0.2f)' % auc_value)\n", + " plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')\n", + " plt.xlabel('False Positive Rate')\n", + " plt.ylabel('True Positive Rate')\n", + " plt.title('Receiver Operating Characteristic (ROC) Curve')\n", + " plt.legend()\n", + " plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5T4P90hQk1ug" + }, + "source": [ + "plot_roc_curve(y, y_pred)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CANyMIgIjgSb" + }, + "source": [ + "### Predições" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pWJgVcQRlESq" + }, + "source": [ + "eu = pd.DataFrame({'Age':32, 'Pclass_2':0, 'Pclass_3':1, 'Sex_male':1}, index=[0])\n", + "minha_prob = model.predict_proba(eu)\n", + "print('Eu teria {}% de probabilidade de sobrevivência se estivesse no Titanic'\\\n", + " .format(round(minha_prob[:,1][0]*100, 2)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kgpdgkgrlJ-w" + }, + "source": [ + "Eu teria 7.52% de probabilidade de sobrevivência se estivesse no Titanic" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "91GShU9ClMiY" + }, + "source": [ + "coleguinha = pd.DataFrame({'Age':32, 'Pclass_2':0, 'Pclass_3':0, 'Sex_male':1}, index=[0])\n", + "prob_do_coleguinha = model.predict_proba(coleguinha)\n", + "print('Meu coleguinha teria {}% de probabilidade de sobrevivência se estivesse no Titanic'\\\n", + " .format(round(prob_do_coleguinha[:,1][0]*100, 2)))" + ], + "execution_count": null, + "outputs": [] + }, + { 
+ "cell_type": "markdown", + "metadata": { + "id": "c2EHn8volOil" + }, + "source": [ + "Meu coleguinha teria 51.77% de probabilidade de sobrevivência se estivesse no Titanic" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C2PvJoZQlH6u" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XwuMfMD1gFyd" + }, + "source": [ + "# Exemplo 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kFY0TQVgOlvT" + }, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "efF3st3sHxPG" + }, + "source": [ + "# Carrega as bibliotecas\n", + "import numpy as np\n", + "np.set_printoptions(formatter = {'float': lambda x: \"{0:0.2f}\".format(x)})\n", + "\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.model_selection import train_test_split\n", + "import statsmodels.api as sm\n", + "\n", + "%matplotlib inline" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bk9F6JO0IELv" + }, + "source": [ + "# Carregar/ler o banco de dados - Dataframe Diabetes\n", + "from sklearn import datasets\n", + "#Diabetes = datasets.load_diabetes()\n", + "\n", + "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/diabetes.csv'\n", + "diabetes = pd.read_csv(url)\n", + "diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tjRmpaPIDknb" + }, + "source": [ + "# Definir as matrizes X e y\n", + "X_diabetes = diabetes.copy()\n", + "X_diabetes.drop(columns = ['Outcome'], axis = 1, inplace = True)\n", + "y_diabetes = diabetes['Outcome']\n", + "\n", + "X_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jLrx69TH-Mad" + }, + "source": [ + "X_diabetes.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", 
+ "metadata": { + "id": "mdFBioP6-Ply" + }, + "source": [ + "y_diabetes.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fhLySN65IaDF" + }, + "source": [ + "# Define the training and validation matrices\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_diabetes, y_diabetes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J5R8HlnuIGpL" + }, + "source": [ + "# Using statsmodels (the intercept column must be added explicitly):\n", + "x = sm.add_constant(X_treinamento)\n", + "lr_sm = sm.Logit(y_treinamento, x) # Note: the argument order is reversed here: [y, x]\n", + "\n", + "# Train the model\n", + "# (Logit.fit() estimates the coefficients by maximum likelihood)\n", + "resultado_sm = lr_sm.fit()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GlbCaPp1ETNa" + }, + "source": [ + "resultado_sm.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-FJaSnJLKICU" + }, + "source": [ + "# MSE - mean squared error of the predicted probabilities on the test set\n", + "np.mean((resultado_sm.predict(sm.add_constant(X_teste)) - y_teste) ** 2) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bVEUSTUPzOj" + }, + "source": [ + "### Compute y_pred - the predicted values of y" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OjGrNhTNLcr-" + }, + "source": [ + "y_pred = resultado_sm.predict(x)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vfS5RCx_VnGT" + }, + "source": [ + "compara = list(zip(np.array(y_treinamento), resultado_sm.predict()))\n", + "compara[0:30]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pUxasncIFaw4" + }, + "source": [ + "resultado_sm.pred_table()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", 
+ "metadata": { + "id": "_liLYinwFgch" + }, + "source": [ + "confusion_matrix = pd.DataFrame(resultado_sm.pred_table())\n", + "confusion_matrix.columns = ['Predicted No Diabetes', 'Predicted Diabetes']\n", + "confusion_matrix = confusion_matrix.rename(index = {0 : 'Actual No Diabetes', 1 : 'Actual Diabetes'})\n", + "confusion_matrix" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ceH3MODWFm7S" + }, + "source": [ + "cm = np.array(confusion_matrix)\n", + "training_accuracy = (cm[0,0] + cm[1,1])/ cm.sum()\n", + "training_accuracy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CH_iEuzhO109" + }, + "source": [ + "# Exercise 1 - Mall_Customers.csv\n", + "> The target variable of this dataframe is 'Annual Income'. Develop a regression model using OLS, Ridge and LASSO and compare the results.\n", + "\n", + "* Try:\n", + " * Lasso(alpha = 0.01, max_iter = 1000000);\n", + " * Lasso(alpha = 0.0001, max_iter = 1000000);\n", + " * Ridge(alpha = 0.01);\n", + " * Ridge(alpha = 100);" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZfRDEaaRYxFQ" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn import preprocessing\n", + "import matplotlib.pyplot as plt \n", + "plt.rc(\"font\", size=14)\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.model_selection import train_test_split\n", + "import seaborn as sns\n", + "sns.set(style=\"white\")\n", + "sns.set(style=\"whitegrid\", color_codes=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nulrLzUqYxFY" + }, + "source": [ + "## Data\n", + "\n", + "The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4LdrQCwxYxFY" + }, + "source": [ + "This dataset provides the customer information. It includes 41188 records and 21 fields." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qoT6zkoFYxFZ", + "outputId": "b04874af-bf4d-409f-cd1c-ad8c473004e6", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_bank = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/bank-full.csv', header = 0)\n", + "df_bank = df_bank.dropna()\n", + "print(df_bank.shape)\n", + "print(list(df_bank.columns))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "(45211, 1)\n", + "['age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"']\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZD23hMCeYxFc", + "outputId": "f347c846-5f92-4e4f-b468-2bfbc608777c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_bank.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"
058;\"management\";\"married\";\"tertiary\";\"no\";2143...
144;\"technician\";\"single\";\"secondary\";\"no\";29;\"...
233;\"entrepreneur\";\"married\";\"secondary\";\"no\";2...
347;\"blue-collar\";\"married\";\"unknown\";\"no\";1506...
433;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n...
\n", + "
" + ], + "text/plain": [ + " age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"\n", + "0 58;\"management\";\"married\";\"tertiary\";\"no\";2143... \n", + "1 44;\"technician\";\"single\";\"secondary\";\"no\";29;\"... \n", + "2 33;\"entrepreneur\";\"married\";\"secondary\";\"no\";2... \n", + "3 47;\"blue-collar\";\"married\";\"unknown\";\"no\";1506... \n", + "4 33;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n... " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 285 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CtGbim_EYxFh" + }, + "source": [ + "#### Input variables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0pJ7ai5ZYxFh" + }, + "source": [ + "1 - age (numeric)\n", + "\n", + "2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')\n", + "\n", + "3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)\n", + "\n", + "4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')\n", + "\n", + "5 - default: has credit in default? (categorical: 'no','yes','unknown')\n", + "\n", + "6 - housing: has housing loan? (categorical: 'no','yes','unknown')\n", + "\n", + "7 - loan: has personal loan? (categorical: 'no','yes','unknown')\n", + "\n", + "8 - contact: contact communication type (categorical: 'cellular','telephone')\n", + "\n", + "9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')\n", + "\n", + "10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')\n", + "\n", + "11 - duration: last contact duration, in seconds (numeric). 
Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.\n", + "\n", + "12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n", + "\n", + "13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)\n", + "\n", + "14 - previous: number of contacts performed before this campaign and for this client (numeric)\n", + "\n", + "15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')\n", + "\n", + "16 - emp.var.rate: employment variation rate - (numeric)\n", + "\n", + "17 - cons.price.idx: consumer price index - (numeric)\n", + "\n", + "18 - cons.conf.idx: consumer confidence index - (numeric) \n", + "\n", + "19 - euribor3m: euribor 3 month rate - (numeric)\n", + "\n", + "20 - nr.employed: number of employees - (numeric)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YwsaBV_OYxFi" + }, + "source": [ + "#### Predict variable (desired target):\n", + "\n", + "y - has the client subscribed a term deposit? (binary: '1','0')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2SsNWV_SYxFj" + }, + "source": [ + "The education column of the dataset has many categories and we need to reduce the categories for a better modelling. 
The education column has the following categories:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6TFbgh3vYxFk" + }, + "source": [ + "df_bank['education'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "luv7Bdf_YxFn" + }, + "source": [ + "Let us group \"basic.4y\", \"basic.9y\" and \"basic.6y\" together and call them \"Basic\"." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gkOlUOs2YxFn" + }, + "source": [ + "df_bank['education']=np.where(df_bank['education'] =='basic.9y', 'Basic', df_bank['education'])\n", + "df_bank['education']=np.where(df_bank['education'] =='basic.6y', 'Basic', df_bank['education'])\n", + "df_bank['education']=np.where(df_bank['education'] =='basic.4y', 'Basic', df_bank['education'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H-X1WMv2YxFq" + }, + "source": [ + "After grouping, these are the categories:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r9LlgpkjYxFq" + }, + "source": [ + "df_bank['education'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fcnJy3KYYxFt" + }, + "source": [ + "### Data exploration" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qUrTMR8BYxFt" + }, + "source": [ + "df_bank['y'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rpzHnzJKYxFx" + }, + "source": [ + "sns.countplot(x='y',data=df_bank, palette='hls')\n", + "plt.savefig('count_plot') # save before show(), otherwise the saved figure is empty\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C99nOe3mYxF0" + }, + "source": [ + "There are 36548 no's and 4640 yes's in the outcome variable."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8nGaox_kYxF1" + }, + "source": [ + "Let's get a sense of the numbers across the two classes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sQvzA60bYxF1" + }, + "source": [ + "df_bank.groupby('y').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u3xjoceKYxF3" + }, + "source": [ + "Observations:\n", + "\n", + "* The average age of customers who bought the term deposit is higher than that of the customers who didn't.\n", + "* The pdays (days since the customer was last contacted) is understandably lower for the customers who bought it. The lower the pdays, the better the memory of the last call and hence the better the chances of a sale.\n", + "* Surprisingly, the number of contacts or calls made during the current campaign (campaign) is lower for customers who bought the term deposit." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jvzGMePPYxF4" + }, + "source": [ + "We can calculate the means grouped by other categorical variables, such as job, marital status and education, to get a more detailed sense of our data."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RqLVMjoxYxF5" + }, + "source": [ + "df_bank.groupby('job').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GTUeRJAtYxF7" + }, + "source": [ + "df_bank.groupby('marital').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xsxdFumiYxF9" + }, + "source": [ + "df_bank.groupby('education').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3i1DCWV-YxGA" + }, + "source": [ + "Visualizations" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OEArHQPbYxGB" + }, + "source": [ + "%matplotlib inline\n", + "pd.crosstab(df_bank.job,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Job Title')\n", + "plt.xlabel('Job')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('purchase_fre_job')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PNwo5du_YxGD" + }, + "source": [ + "The frequency of purchase of the deposit depends a great deal on the job title. Thus, the job title can be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eM7CWfAZYxGE" + }, + "source": [ + "table=pd.crosstab(df_bank.marital,df_bank.y)\n", + "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n", + "plt.title('Stacked Bar Chart of Marital Status vs Purchase')\n", + "plt.xlabel('Marital Status')\n", + "plt.ylabel('Proportion of Customers')\n", + "plt.savefig('marital_vs_pur_stack')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LWBLh7toYxGG" + }, + "source": [ + "It is hard to see, but marital status does not seem to be a strong predictor of the outcome variable."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vh_u4QphYxGH" + }, + "source": [ + "table=pd.crosstab(df_bank.education,df_bank.y)\n", + "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n", + "plt.title('Stacked Bar Chart of Education vs Purchase')\n", + "plt.xlabel('Education')\n", + "plt.ylabel('Proportion of Customers')\n", + "plt.savefig('edu_vs_pur_stack')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d9AgJroYYxGK" + }, + "source": [ + "Education seems to be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dHI2LT-IYxGL" + }, + "source": [ + "pd.crosstab(df_bank.day_of_week,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Day of Week')\n", + "plt.xlabel('Day of Week')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_dayofweek_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3A2jmS4MYxGR" + }, + "source": [ + "Day of week may not be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bzafDBHpYxGS" + }, + "source": [ + "pd.crosstab(df_bank.month,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Month')\n", + "plt.xlabel('Month')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_fre_month_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x5CBtquEYxGW" + }, + "source": [ + "Month might be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tgF_3SqWYxGY" + }, + "source": [ + "df_bank.age.hist()\n", + "plt.title('Histogram of Age')\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Frequency')\n", + "plt.savefig('hist_age')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + 
"id": "y0FhKYDsYxGc" + }, + "source": [ + "Most of the customers of the bank in this dataset are in the age range of 30-40." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5Nd3yV7DYxGd" + }, + "source": [ + "pd.crosstab(df_bank.poutcome,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Poutcome')\n", + "plt.xlabel('Poutcome')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_fre_pout_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oRKUAGrjYxGh" + }, + "source": [ + "Poutcome seems to be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "63RLRI9uYxGi" + }, + "source": [ + "### Create dummy variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V8S4WUKmYxGj" + }, + "source": [ + "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", + "for var in cat_vars:\n", + " # One-hot encode this column and append the dummy columns to df_bank\n", + " cat_list = pd.get_dummies(df_bank[var], prefix=var)\n", + " df_bank1=df_bank.join(cat_list)\n", + " df_bank=df_bank1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uX3w9i9WYxGl" + }, + "source": [ + "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", + "df_bank_vars=df_bank.columns.values.tolist()\n", + "to_keep=[i for i in df_bank_vars if i not in cat_vars]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cMX_82xaYxGq" + }, + "source": [ + "df_bank_final=df_bank[to_keep]\n", + "df_bank_final.columns.values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LkTjpxYoYxGr" + }, + "source": [ + "df_bank_final_vars=df_bank_final.columns.values.tolist()\n", + "y=['y']\n", + "X=[i for i in 
df_bank_final_vars if i not in y]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2QbKaRcsYxGt" + }, + "source": [ + "### Feature Selection" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EkxjW1AQYxGu" + }, + "source": [ + "from sklearn import datasets\n", + "from sklearn.feature_selection import RFE\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "logreg = LogisticRegression()\n", + "\n", + "rfe = RFE(logreg, n_features_to_select=18)\n", + "rfe = rfe.fit(df_bank_final[X], df_bank_final[y].values.ravel())\n", + "print(rfe.support_)\n", + "print(rfe.ranking_)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2P9hd4jHYxGw" + }, + "source": [ + "The Recursive Feature Elimination (RFE) has helped us select the following features: \"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5PW8WZX_YxGx" + }, + "source": [ + "cols=[\"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \n", + " \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \n", + " \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"] \n", + "X=df_bank_final[cols]\n", + "y=df_bank_final['y']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ix0mN9qxYxG0" + }, + "source": [ + "### Implementing the model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Hbx2bwtiYxG0" + }, + "source": [ + "import statsmodels.api as sm\n", + "logit_model=sm.Logit(y,X)\n", + "result=logit_model.fit()\n", + "print(result.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HR1ui-UcYxG2" + }, + "source": [ + "The p-values for most of the variables are very small, therefore, most of them are significant to the model." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9GHhrsaeYxG3" + }, + "source": [ + "### Logistic Regression Model Fitting" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFQnH5MzYxG3" + }, + "source": [ + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn import metrics\n", + "logreg = LogisticRegression()\n", + "logreg.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YUa3QL7tYxG6" + }, + "source": [ + "#### Predicting the test set results and calculating the accuracy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SD-y2e33YxG6" + }, + "source": [ + "y_pred = logreg.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kkPWzos7YxG-" + }, + "source": [ + "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwC3rt_6YxHA" + }, + "source": [ + "### Cross Validation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Muw50oqSYxHB" + }, + "source": [ + "from sklearn import model_selection\n", + "from sklearn.model_selection import cross_val_score\n", + "kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7) # random_state only takes effect when shuffle=True\n", + "modelCV = LogisticRegression()\n", + "scoring = 'accuracy'\n", + "results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)\n", + "print(\"10-fold cross validation average accuracy: %.3f\" % (results.mean()))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4y8XCTqoYxHE" + }, + "source": [ + "### Confusion Matrix" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"BCza9NkVYxHE" + }, + "source": [ + "from sklearn.metrics import confusion_matrix\n", + "cm = confusion_matrix(y_test, y_pred) # avoid rebinding the imported function name\n", + "print(cm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X9SapwS2YxHG" + }, + "source": [ + "The result is telling us that we have 10872+254 correct predictions and 1122+109 incorrect predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bEWvWScYxHG" + }, + "source": [ + "#### Accuracy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NaH2nESwYxHH" + }, + "source": [ + "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6oxlhbpYxHJ" + }, + "source": [ + "#### Compute precision, recall, F-measure and support\n", + "\n", + "The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.\n", + "\n", + "The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.\n", + "\n", + "The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.\n", + "\n", + "The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.\n", + "\n", + "The support is the number of occurrences of each class in y_test."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mhN5_p4yYxHK" + }, + "source": [ + "from sklearn.metrics import classification_report\n", + "print(classification_report(y_test, y_pred))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xzSFVEnAYxHP" + }, + "source": [ + "#### Interpretation: \n", + "\n", + "Of the entire test set, 88% of the promoted term deposit were the term deposit that the customers liked. Of the entire test set, 90% of the customer's preferred term deposit were promoted." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NGXJ6g2nYxHQ" + }, + "source": [ + "### ROC Curve\n", + "from sklearn import metrics\n", + "from ggplot import *\n", + "prob = clf1.predict_proba(X_test)[:,1]\n", + "fpr, sensitivity, _ = metrics.roc_curve(Y_test, prob)\n", + "\n", + "df = pd.DataFrame(dict(fpr=fpr, sensitivity=sensitivity))\n", + "ggplot(df, aes(x='fpr', y='sensitivity')) +\\\n", + " geom_line() +\\\n", + " geom_abline(linetype='dashed')" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "u9QKDuS0YxHQ" + }, + "source": [ + "from sklearn.metrics import roc_auc_score\n", + "from sklearn.metrics import roc_curve\n", + "logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1]) # AUC needs scores, not hard class labels\n", + "fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])\n", + "plt.figure()\n", + "plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)\n", + "plt.plot([0, 1], [0, 1],'r--')\n", + "plt.xlim([0.0, 1.0])\n", + "plt.ylim([0.0, 1.05])\n", + "plt.xlabel('False Positive Rate')\n", + "plt.ylabel('True Positive Rate')\n", + "plt.title('Receiver operating characteristic')\n", + "plt.legend(loc=\"lower right\")\n", + "plt.savefig('Log_ROC')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git "a/Notebooks/NB15_02__Regress\303\243o Linear_hs3.ipynb" 
"b/Notebooks/NB15_02__Regress\303\243o Linear_hs3.ipynb" new file mode 100644 index 000000000..ffca82972 --- /dev/null +++ "b/Notebooks/NB15_02__Regress\303\243o Linear_hs3.ipynb" @@ -0,0 +1,10651 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.1" + }, + "colab": { + "name": "NB15_02__Regressão Linear.ipynb", + "provenance": [], + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XwQDhId7N6_r" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

\n", + "

SUPERVISED LEARNING

\n", + "

REGRESSION MODELS (LINEAR AND LOGISTIC)

\n", + "\n", + "Source: https://realpython.com/linear-regression-in-python/\n", + "https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PN-dQFJcM1UV" + }, + "source": [ + "Steps to implement Linear Regression:\n", + "\n", + "* (1) Import the required libraries;\n", + "* (2) Load the data;\n", + "* (3) Apply the required transformations: outliers, NaNs, scaling (MinMaxScaler, RobustScaler, StandardScaler, Log, Box-Cox, etc.);\n", + "* (4) Visualize the data: understand the relationships and distributions present in the data;\n", + "* (5) Build and train the predictive model (in this case, a regression model);\n", + "* (6) Validate/check the metrics used to evaluate the model(s);\n", + "* (7) Predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8TldGZxAFV5E" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0QRbxlqaq7pr" + }, + "source": [ + "# Improvements for this section:\n", + "* Compute the correlations before and after RIDGE and LASSO to show the multicollinearity and explain why certain columns \"stop\" being important." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P4sAIblOgFyL" + }, + "source": [ + "# Regression Models with Regularization for Classification and Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o7Y7cuJNgFyU" + }, + "source": [ + "## Simple Linear Regression (using OLS - Ordinary Least Squares)\n", + "\n", + "* Features $X_{np}$: an n x p matrix holding the predictor attributes/variables of the dataframe (independent variables);\n", + "* The target/dependent variable is represented by y;\n", + "* The relationship between X and y is given by the equation below, where $w_{i}$ is the weight (coefficient) of each feature and $w_{0}$ is the intercept."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NpJ580y9gFyU" + }, + "source": [ + "\n", + "\n", + "![X_y](https://github.com/MathMachado/Materials/blob/master/Architecture.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5rhbVGJ0gFyY" + }, + "source": [ + "* Residual Sum of Squares (RSS) - the sum of the squared differences between the observed and predicted values.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u8gA0YkbgFyp" + }, + "source": [ + "## Main parameters of the algorithm:\n", + "* fit_intercept - whether the intercept $w_{0}$ should be fitted. If $X$ and $y$ are already centered (zero mean), the intercept is zero and does not need to be fitted.\n", + "\n", + "* normalize - when True, $X$ is normalized before fitting by subtracting the mean and dividing by the L2 norm (to standardize by the standard deviation, apply StandardScaler before fit). This parameter was deprecated in scikit-learn 1.0 and removed in 1.2;\n", + "\n", + "## Attributes of the Machine Learning regression model\n", + "* coef_ - weight of each independent variable in the model;\n", + "\n", + "* intercept_ - the intercept (bias) $w_{0}$;\n", + "\n", + "## Functions for fitting and using the model:\n", + "* fit - trains the model with the matrices $X$ and $y$;\n", + "* predict - once the model has been trained, for a given $X$, uses the fitted weights to compute the predicted values of $y$ (y_pred).\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A-JG8El1gFy7" + }, + "source": [ + "# Limitations of OLS (Ordinary Least Squares):\n", + "* Sensitive to outliers;\n", + "* Multicollinearity; \n", + "* Heteroscedasticity - the variance of the residuals is not constant, so the spread of the points around the fitted line changes along it;\n", + "\n", + "* References" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xylMYR8COyrw" + }, + "source": [ + "### Import the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2BGgrILlPK6Z" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from scipy import stats" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "263GgbwhO2kQ" + }, + "source": [ + "### Load the data\n", + "* We will load the [Boston House Pricing](https://archive.ics.uci.edu/ml/datasets/housing) dataset" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1h66x_-rXGhi" + }, + "source": [ + "# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version\n", + "from sklearn.datasets import load_boston, load_iris" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rWniNkMpXQFU", + "outputId": "5096d239-2c8c-4327-dbf5-f9128faa589c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "boston = load_boston()\n", + "boston" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'DESCR': \".. _boston_dataset:\\n\\nBoston house prices dataset\\n---------------------------\\n\\n**Data Set Characteristics:** \\n\\n :Number of Instances: 506 \\n\\n :Number of Attributes: 13 numeric/categorical predictive. 
Median Value (attribute 14) is usually the target.\\n\\n :Attribute Information (in order):\\n - CRIM per capita crime rate by town\\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\\n - INDUS proportion of non-retail business acres per town\\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\\n - NOX nitric oxides concentration (parts per 10 million)\\n - RM average number of rooms per dwelling\\n - AGE proportion of owner-occupied units built prior to 1940\\n - DIS weighted distances to five Boston employment centres\\n - RAD index of accessibility to radial highways\\n - TAX full-value property-tax rate per $10,000\\n - PTRATIO pupil-teacher ratio by town\\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\\n - LSTAT % lower status of the population\\n - MEDV Median value of owner-occupied homes in $1000's\\n\\n :Missing Attribute Values: None\\n\\n :Creator: Harrison, D. and Rubinfeld, D.L.\\n\\nThis is a copy of UCI ML housing dataset.\\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\\n\\n\\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\\n\\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\\nprices and the demand for clean air', J. Environ. Economics & Management,\\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\\n...', Wiley, 1980. N.B. Various transformations are used in the table on\\npages 244-261 of the latter.\\n\\nThe Boston house-price data has been used in many machine learning papers that address regression\\nproblems. \\n \\n.. topic:: References\\n\\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. 
In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\\n\",\n", + " 'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,\n", + " 4.9800e+00],\n", + " [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,\n", + " 9.1400e+00],\n", + " [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,\n", + " 4.0300e+00],\n", + " ...,\n", + " [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n", + " 5.6400e+00],\n", + " [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,\n", + " 6.4800e+00],\n", + " [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n", + " 7.8800e+00]]),\n", + " 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n", + " 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTAT
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.98
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.14
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.03
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.94
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.33
\n", + "" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 136 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQzFW7DUX_KW", + "outputId": "dcf288db-d99d-4d17-c22c-ceb8a9ba4841", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "# Target/response variable\n", + "df_boston['preco'] = load_boston().target\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATpreco
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
\n", + "
" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT preco\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 137 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H71da4bIO4kI" + }, + "source": [ + "### Data Transformation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K-6YOdsTfciO" + }, + "source": [ + "#### Standardizing the column names" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L8OJEapufhq4" + }, + "source": [ + "# Rename the dataframe columns to lowercase\n", + "df_boston.columns = [col.lower() for col in df_boston.columns]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uRinX-5ofol_", + "outputId": "2e67bbbd-792f-4786-8c7e-2d0bd16fd249", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
\n", + "
" + ], + "text/plain": [ + " crim zn indus chas nox ... tax ptratio b lstat preco\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 139 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CMDh5jyqekmr" + }, + "source": [ + "#### Outliers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jJIG0jJQf6em" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FgYPzlvfemFc" + }, + "source": [ + "#### Missing values" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BAjw7UhJen0D", + "outputId": "917a8f23-ec31-4f22-9a46-c3a15c1e4563", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Missing values per column/variable\n", + "df_boston.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "crim 0\n", + "zn 0\n", + "indus 0\n", + "chas 0\n", + "nox 0\n", + "rm 0\n", + "age 0\n", + "dis 0\n", + "rad 0\n", + "tax 0\n", + "ptratio 0\n", + "b 0\n", + "lstat 0\n", + "preco 0\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 140 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jo3UWNpbYnNF", + "outputId": "aeefc57a-f1b7-41ac-aa2e-53f828b9be14", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Number of features\n", + "len(load_boston().feature_names)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "13" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 141 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0Yp8g7hxfQli", + "outputId": "43795436-0366-4427-ed5a-2deacedf567f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 49 + } + }, + "source": [ + "# Rows containing at least one missing value\n", + "df_boston[df_boston.isnull().any(axis = 1)]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: [crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, preco]\n", + "Index: []" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 142 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5qmkTFLrf9MT" + }, + "source": [ + "#### Descriptive Statistics" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Nprn3p_Wf_bn", + "outputId": "16f46af6-ab9a-4d7b-a875-295817b9bf9c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "df_boston.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
count506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000
mean3.61352411.36363611.1367790.0691700.5546956.28463468.5749013.7950439.549407408.23715418.455534356.67403212.65306322.532806
std8.60154523.3224536.8603530.2539940.1158780.70261728.1488612.1057108.707259168.5371162.16494691.2948647.1410629.197104
min0.0063200.0000000.4600000.0000000.3850003.5610002.9000001.1296001.000000187.00000012.6000000.3200001.7300005.000000
25%0.0820450.0000005.1900000.0000000.4490005.88550045.0250002.1001754.000000279.00000017.400000375.3775006.95000017.025000
50%0.2565100.0000009.6900000.0000000.5380006.20850077.5000003.2074505.000000330.00000019.050000391.44000011.36000021.200000
75%3.67708312.50000018.1000000.0000000.6240006.62350094.0750005.18842524.000000666.00000020.200000396.22500016.95500025.000000
max88.976200100.00000027.7400001.0000000.8710008.780000100.00000012.12650024.000000711.00000022.000000396.90000037.97000050.000000
\n", + "
" + ], + "text/plain": [ + " crim zn indus ... b lstat preco\n", + "count 506.000000 506.000000 506.000000 ... 506.000000 506.000000 506.000000\n", + "mean 3.613524 11.363636 11.136779 ... 356.674032 12.653063 22.532806\n", + "std 8.601545 23.322453 6.860353 ... 91.294864 7.141062 9.197104\n", + "min 0.006320 0.000000 0.460000 ... 0.320000 1.730000 5.000000\n", + "25% 0.082045 0.000000 5.190000 ... 375.377500 6.950000 17.025000\n", + "50% 0.256510 0.000000 9.690000 ... 391.440000 11.360000 21.200000\n", + "75% 3.677083 12.500000 18.100000 ... 396.225000 16.955000 25.000000\n", + "max 88.976200 100.000000 27.740000 ... 396.900000 37.970000 50.000000\n", + "\n", + "[8 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 143 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1JimyY3SgECE" + }, + "source": [ + "#### Correlation Analysis" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jScHq7eTgIpm", + "outputId": "50696c9d-c19a-4937-9189-368be5fb291c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 483 + } + }, + "source": [ + "correlacoes = df_boston.corr()\n", + "correlacoes" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
crim1.000000-0.2004690.406583-0.0558920.420972-0.2192470.352734-0.3796700.6255050.5827640.289946-0.3850640.455621-0.388305
zn-0.2004691.000000-0.533828-0.042697-0.5166040.311991-0.5695370.664408-0.311948-0.314563-0.3916790.175520-0.4129950.360445
indus0.406583-0.5338281.0000000.0629380.763651-0.3916760.644779-0.7080270.5951290.7207600.383248-0.3569770.603800-0.483725
chas-0.055892-0.0426970.0629381.0000000.0912030.0912510.086518-0.099176-0.007368-0.035587-0.1215150.048788-0.0539290.175260
nox0.420972-0.5166040.7636510.0912031.000000-0.3021880.731470-0.7692300.6114410.6680230.188933-0.3800510.590879-0.427321
rm-0.2192470.311991-0.3916760.091251-0.3021881.000000-0.2402650.205246-0.209847-0.292048-0.3555010.128069-0.6138080.695360
age0.352734-0.5695370.6447790.0865180.731470-0.2402651.000000-0.7478810.4560220.5064560.261515-0.2735340.602339-0.376955
dis-0.3796700.664408-0.708027-0.099176-0.7692300.205246-0.7478811.000000-0.494588-0.534432-0.2324710.291512-0.4969960.249929
rad0.625505-0.3119480.595129-0.0073680.611441-0.2098470.456022-0.4945881.0000000.9102280.464741-0.4444130.488676-0.381626
tax0.582764-0.3145630.720760-0.0355870.668023-0.2920480.506456-0.5344320.9102281.0000000.460853-0.4418080.543993-0.468536
ptratio0.289946-0.3916790.383248-0.1215150.188933-0.3555010.261515-0.2324710.4647410.4608531.000000-0.1773830.374044-0.507787
b-0.3850640.175520-0.3569770.048788-0.3800510.128069-0.2735340.291512-0.444413-0.441808-0.1773831.000000-0.3660870.333461
lstat0.455621-0.4129950.603800-0.0539290.590879-0.6138080.602339-0.4969960.4886760.5439930.374044-0.3660871.000000-0.737663
preco-0.3883050.360445-0.4837250.175260-0.4273210.695360-0.3769550.249929-0.381626-0.468536-0.5077870.333461-0.7376631.000000
\n", + "
" + ], + "text/plain": [ + " crim zn indus ... b lstat preco\n", + "crim 1.000000 -0.200469 0.406583 ... -0.385064 0.455621 -0.388305\n", + "zn -0.200469 1.000000 -0.533828 ... 0.175520 -0.412995 0.360445\n", + "indus 0.406583 -0.533828 1.000000 ... -0.356977 0.603800 -0.483725\n", + "chas -0.055892 -0.042697 0.062938 ... 0.048788 -0.053929 0.175260\n", + "nox 0.420972 -0.516604 0.763651 ... -0.380051 0.590879 -0.427321\n", + "rm -0.219247 0.311991 -0.391676 ... 0.128069 -0.613808 0.695360\n", + "age 0.352734 -0.569537 0.644779 ... -0.273534 0.602339 -0.376955\n", + "dis -0.379670 0.664408 -0.708027 ... 0.291512 -0.496996 0.249929\n", + "rad 0.625505 -0.311948 0.595129 ... -0.444413 0.488676 -0.381626\n", + "tax 0.582764 -0.314563 0.720760 ... -0.441808 0.543993 -0.468536\n", + "ptratio 0.289946 -0.391679 0.383248 ... -0.177383 0.374044 -0.507787\n", + "b -0.385064 0.175520 -0.356977 ... 1.000000 -0.366087 0.333461\n", + "lstat 0.455621 -0.412995 0.603800 ... -0.366087 1.000000 -0.737663\n", + "preco -0.388305 0.360445 -0.483725 ... 
0.333461 -0.737663 1.000000\n", + "\n", + "[14 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 144 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AxQp7xqdgTJP" + }, + "source": [ + "##### Plot of the correlations between the features/variables/columns\n", + "Source: https://seaborn.pydata.org/examples/many_pairwise_correlations.html\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KOiH2X-WgqmN", + "outputId": "f72007dc-7c99-4ce1-b6bb-b86a9bf617c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 557 + } + }, + "source": [ + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "sns.set_theme(style = \"white\")\n", + "\n", + "d = df_boston\n", + "\n", + "# Compute the correlation matrix\n", + "corr = d.corr()\n", + "\n", + "# Generate a mask for the upper triangle\n", + "mask = np.triu(np.ones_like(corr, dtype=bool))\n", + "\n", + "# Set up the matplotlib figure\n", + "f, ax = plt.subplots(figsize=(11, 9))\n", + "\n", + "# Generate a custom diverging colormap\n", + "cmap = sns.diverging_palette(230, 20, as_cmap=True)\n", + "\n", + "# Draw the heatmap with the mask and correct aspect ratio\n", + "sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,\n", + " square=True, linewidths=.5, cbar_kws={\"shrink\": .5})" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 145 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nogPhyfVO70G" + }, + "source": [ + "### Build and train the model(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HxYpfyvQaIe1" + }, + "source": [ + "$X = [X_{1}, X_{2}, \\ldots, X_{p}]$ corresponds to X_boston below." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0BhLZJhibVNG" + }, + "source": [ + "X_boston = df_boston.drop(columns = ['preco']) # all variables/features except 'preco'\n", + "y_boston = df_boston['preco'] # target variable" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_nC_RGva1Z6", + "outputId": "6a5946c8-62b3-424f-a809-9a2bbc34f191", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "X_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstat
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.98
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.14
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.03
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.94
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.33
\n", + "
" + ], + "text/plain": [ + " crim zn indus chas nox ... rad tax ptratio b lstat\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 147 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nlVJM--Ya5fS", + "outputId": "58037983-175f-47ed-ad47-5826589358b0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_boston[0:10] # Series (column)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 24.0\n", + "1 21.6\n", + "2 34.7\n", + "3 33.4\n", + "4 36.2\n", + "5 28.7\n", + "6 22.9\n", + "7 27.1\n", + "8 16.5\n", + "9 18.9\n", + "Name: preco, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 148 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b50_6tv5h1kY" + }, + "source": [ + "# Define the training and test dataframes:\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_boston, \n", + " y_boston, \n", + " test_size = 0.2, \n", + " random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1U3hpdkDbYTv", + "outputId": "35e8cee1-201a-4a65-a6ec-8fa9e8c7c0a8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "print(f\"Training dataframe: {X_treinamento.shape[0]} rows\")\n", + "print(f\"Test dataframe....: {X_teste.shape[0]} rows\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + 
"Training dataframe: 404 rows\n", + "Test dataframe....: 102 rows\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SvevXulFiJj1" + }, + "source": [ + "#### Training the Linear Regression model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GVwF3vp8iNff" + }, + "source": [ + "# Import LinearRegression, used to train the linear regression model\n", + "from sklearn.linear_model import LinearRegression\n", + "\n", + "# statsmodels library\n", + "import statsmodels.api as sm" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ibX6bCbViW-v" + }, + "source": [ + "# Instantiate the model object\n", + "regressao_linear = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M-5wRGUribY0", + "outputId": "a67d7355-3d9e-43fc-edf6-8ebd71911935", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Train the model on the training samples: X_treinamento and y_treinamento\n", + "regressao_linear.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 153 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jri-jA1VjmUl", + "outputId": "3150261d-c264-4273-9c5f-95229529881b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Intercept value\n", + "regressao_linear.intercept_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "35.9020918753502" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 154 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"VOjadxdxjqtT", + "outputId": "49d06bd9-e375-403f-e257-863967f10fd3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 452 + } + }, + "source": [ + "# Coefficients of the Linear Regression model\n", + "coeficientes_regressao_linear = pd.DataFrame([X_treinamento.columns, regressao_linear.coef_]).T\n", + "coeficientes_regressao_linear = coeficientes_regressao_linear.rename(columns={0: 'Feature/variável/coluna', 1: 'Coeficientes'})\n", + "coeficientes_regressao_linear" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Feature/variável/colunaCoeficientes
0crim-0.0822083
1zn0.0428002
2indus0.0756011
3chas3.16348
4nox-19.4945
5rm3.98161
6age0.00480929
7dis-1.37396
8rad0.298883
9tax-0.0123962
10ptratio-0.984657
11b0.008949
12lstat-0.526478
\n", + "
" + ], + "text/plain": [ + " Feature/variável/coluna Coeficientes\n", + "0 crim -0.0822083\n", + "1 zn 0.0428002\n", + "2 indus 0.0756011\n", + "3 chas 3.16348\n", + "4 nox -19.4945\n", + "5 rm 3.98161\n", + "6 age 0.00480929\n", + "7 dis -1.37396\n", + "8 rad 0.298883\n", + "9 tax -0.0123962\n", + "10 ptratio -0.984657\n", + "11 b 0.008949\n", + "12 lstat -0.526478" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 155 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jwnkhPwDjkhS" + }, + "source": [ + "#### Usando statmodels" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ltbekHd_k3PH", + "outputId": "a69b057e-75a6-446e-8b7c-ad37604114a5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X2_treinamento = sm.add_constant(X_treinamento)\n", + "lm_sm = sm.OLS(y_treinamento, X2_treinamento).fit()\n", + "print(lm_sm.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.725\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 78.97\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.48e-100\n", + "Time: 11:00:14 Log-Likelihood: -1214.8\n", + "No. 
Observations: 404 AIC: 2458.\n", + "Df Residuals: 390 BIC: 2514.\n", + "Df Model: 13 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.9021 6.037 5.947 0.000 24.033 47.771\n", + "crim -0.0822 0.045 -1.824 0.069 -0.171 0.006\n", + "zn 0.0428 0.016 2.638 0.009 0.011 0.075\n", + "indus 0.0756 0.072 1.054 0.292 -0.065 0.217\n", + "chas 3.1635 0.997 3.174 0.002 1.204 5.123\n", + "nox -19.4945 4.539 -4.295 0.000 -28.418 -10.571\n", + "rm 3.9816 0.510 7.802 0.000 2.978 4.985\n", + "age 0.0048 0.015 0.312 0.755 -0.025 0.035\n", + "dis -1.3740 0.236 -5.827 0.000 -1.838 -0.910\n", + "rad 0.2989 0.079 3.760 0.000 0.143 0.455\n", + "tax -0.0124 0.004 -2.814 0.005 -0.021 -0.004\n", + "ptratio -0.9847 0.156 -6.309 0.000 -1.292 -0.678\n", + "b 0.0089 0.003 2.796 0.005 0.003 0.015\n", + "lstat -0.5265 0.060 -8.764 0.000 -0.645 -0.408\n", + "==============================================================================\n", + "Omnibus: 140.799 Durbin-Watson: 2.083\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 591.650\n", + "Skew: 1.484 Prob(JB): 3.35e-129\n", + "Kurtosis: 8.132 Cond. No. 1.51e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.51e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kpt3A4Q0guHv" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rVUJkfg4gSh7", + "outputId": "eeff1e8f-8ac7-44e8-e0fe-caf0d4a641c7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X3 = X_treinamento.drop(columns = 'age', axis = 1)\n", + "X3_treinamento = sm.add_constant(X3)\n", + "lm_sm2 = sm.OLS(y_treinamento, X3_treinamento).fit()\n", + "print(lm_sm2.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.725\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 85.75\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.64e-101\n", + "Time: 11:00:14 Log-Likelihood: -1214.8\n", + "No. 
Observations: 404 AIC: 2456.\n", + "Df Residuals: 391 BIC: 2508.\n", + "Df Model: 12 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.7325 6.006 5.950 0.000 23.925 47.540\n", + "crim -0.0815 0.045 -1.812 0.071 -0.170 0.007\n", + "zn 0.0422 0.016 2.623 0.009 0.011 0.074\n", + "indus 0.0750 0.072 1.048 0.295 -0.066 0.216\n", + "chas 3.1794 0.994 3.198 0.001 1.225 5.134\n", + "nox -19.1299 4.381 -4.367 0.000 -27.742 -10.517\n", + "rm 4.0153 0.498 8.059 0.000 3.036 4.995\n", + "dis -1.3963 0.224 -6.223 0.000 -1.837 -0.955\n", + "rad 0.2958 0.079 3.755 0.000 0.141 0.451\n", + "tax -0.0123 0.004 -2.802 0.005 -0.021 -0.004\n", + "ptratio -0.9812 0.156 -6.310 0.000 -1.287 -0.675\n", + "b 0.0090 0.003 2.825 0.005 0.003 0.015\n", + "lstat -0.5202 0.057 -9.203 0.000 -0.631 -0.409\n", + "==============================================================================\n", + "Omnibus: 142.363 Durbin-Watson: 2.081\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 608.694\n", + "Skew: 1.496 Prob(JB): 6.67e-133\n", + "Kurtosis: 8.216 Cond. No. 1.48e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.48e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_lcp7m5FmZvG" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'indus'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jEiBywx4hGNB", + "outputId": "fb2abfd1-9019-4e37-f6e1-cf5e54ae1276", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X4 = X3_treinamento.drop(columns = 'indus', axis = 1)\n", + "X4_treinamento = sm.add_constant(X4)\n", + "lm_sm3 = sm.OLS(y_treinamento, X4_treinamento).fit()\n", + "print(lm_sm3.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.724\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 93.42\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 2.86e-102\n", + "Time: 11:00:14 Log-Likelihood: -1215.4\n", + "No. 
Observations: 404 AIC: 2455.\n", + "Df Residuals: 392 BIC: 2503.\n", + "Df Model: 11 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.4757 6.001 5.911 0.000 23.677 47.275\n", + "crim -0.0840 0.045 -1.871 0.062 -0.172 0.004\n", + "zn 0.0407 0.016 2.539 0.012 0.009 0.072\n", + "chas 3.2924 0.989 3.330 0.001 1.349 5.236\n", + "nox -17.9558 4.235 -4.239 0.000 -26.283 -9.629\n", + "rm 3.9674 0.496 7.996 0.000 2.992 4.943\n", + "dis -1.4553 0.217 -6.699 0.000 -1.882 -1.028\n", + "rad 0.2744 0.076 3.606 0.000 0.125 0.424\n", + "tax -0.0103 0.004 -2.603 0.010 -0.018 -0.003\n", + "ptratio -0.9609 0.154 -6.227 0.000 -1.264 -0.658\n", + "b 0.0089 0.003 2.778 0.006 0.003 0.015\n", + "lstat -0.5151 0.056 -9.145 0.000 -0.626 -0.404\n", + "==============================================================================\n", + "Omnibus: 142.123 Durbin-Watson: 2.073\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 605.868\n", + "Skew: 1.494 Prob(JB): 2.74e-132\n", + "Kurtosis: 8.202 Cond. No. 1.47e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.47e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rFejox5XmrEE" + }, + "source": [ + "#### Exclusão da variável menos significativa para o modelo: 'crim'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DOehOql8hZWr", + "outputId": "cbb71827-f44e-4688-93c4-98a3ec5e3257", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X5 = X4_treinamento.drop(columns = 'crim', axis = 1)\n", + "X5_treinamento = sm.add_constant(X5)\n", + "lm_sm4 = sm.OLS(y_treinamento, X5_treinamento).fit()\n", + "print(lm_sm4.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.721\n", + "Model: OLS Adj. R-squared: 0.714\n", + "Method: Least Squares F-statistic: 101.8\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.55e-102\n", + "Time: 11:00:14 Log-Likelihood: -1217.2\n", + "No. 
Observations: 404 AIC: 2456.\n", + "Df Residuals: 393 BIC: 2500.\n", + "Df Model: 10 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 33.9950 5.968 5.696 0.000 22.262 45.728\n", + "zn 0.0375 0.016 2.349 0.019 0.006 0.069\n", + "chas 3.3959 0.990 3.430 0.001 1.449 5.343\n", + "nox -17.1637 4.228 -4.060 0.000 -25.475 -8.852\n", + "rm 4.0365 0.496 8.132 0.000 3.061 5.012\n", + "dis -1.3999 0.216 -6.484 0.000 -1.824 -0.975\n", + "rad 0.2278 0.072 3.158 0.002 0.086 0.370\n", + "tax -0.0100 0.004 -2.513 0.012 -0.018 -0.002\n", + "ptratio -0.9493 0.155 -6.137 0.000 -1.253 -0.645\n", + "b 0.0101 0.003 3.217 0.001 0.004 0.016\n", + "lstat -0.5315 0.056 -9.523 0.000 -0.641 -0.422\n", + "==============================================================================\n", + "Omnibus: 140.245 Durbin-Watson: 2.070\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 609.563\n", + "Skew: 1.464 Prob(JB): 4.32e-133\n", + "Kurtosis: 8.257 Cond. No. 1.46e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.46e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UafIUrpZB0YP" + }, + "source": [ + "### Conclusão\n", + "* Quais variáveis/colunas/atributos ficam no modelo?\n", + "* **Muito importante (exercício)**: normalizar (MinMaxScaler) as covariáveis e refazer a análise.\n", + "* Nesta iteração, depois de excluirmos (nesta ordem) as variáveis age, indus e crim, não resta nenhuma variável insignificante ao nível de 5% (na verdade, o maior p-valor é 1.9%, da variável zn)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jx7sOzrrm-H_" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nXeiFtnJO_1u" + }, + "source": [ + "### Validação do(s) modelo(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QlGVFA6uPDvr" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PE3aKJ6mPDyJ" + }, + "source": [ + "### Predições" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d3nGiyX8jadH" + }, + "source": [ + "### Deployment da solução **analítica**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YQF4NIlGSLH" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UQfpoo1igFy8" + }, + "source": [ + "# Regularized Regression Methods \n", + "## Ridge Regression - Penalized Regression\n", + "> Reduz a complexidade do modelo através do uso de todas as variáveis de $X$, mas penalizando (valor de $\\alpha$) os coeficientes $w_{i}$ quando estiverem muito longe de zero, forçando-os a serem pequenos de maneira contínua. 
Dessa forma, diminuímos a complexidade do modelo enquanto mantemos todas as variáveis no modelo.\n", + "* Menor impacto dos outliers.\n", + "\n", + "### Exemplo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o00xH2MvxvgP" + }, + "source": [ + "# Matriz de covariáveis do modelo:\n", + "X_new = [[0, 0], [0, 0], [1, 1]]\n", + "X_new2 = [[0, 0], [0, 1.5], [1, 1]]\n", + "\n", + "y_new = [0, .1, 1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v9U7c03NzW_c", + "outputId": "2652bd10-e6b4-4200-f7f0-a07806564a1d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_new # 2 variáveis/colunas na matriz" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[[0, 0], [0, 0], [1, 1]]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 161 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iiVEAPpUzXyN", + "outputId": "a69fe575-57da-459c-f482-41d3185ab76f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_new" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[0, 0.1, 1]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 162 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JDljolA95Hw5" + }, + "source": [ + "### Sem outliers" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8mWj2GbPOkHx", + "outputId": "3b433090-a588-449a-af69-5f2da53a9b60", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + } + }, + "source": [ + "# Ridge precisa ser importado antes do primeiro uso (sem isso: NameError)\n", + "from sklearn.linear_model import Ridge\n", + "\n", + "ridge = Ridge(alpha = .1)\n", + "ridge.fit(X_new, y_new)\n", + "ridge.coef_ # Coeficientes da Ridge" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8yvd4ABY5JjC" + }, + "source": [ + "### Com outliers" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O3sJZ_pe5GQ7" + }, + "source": [ + "ridge = Ridge(alpha = .1)\n", + "ridge.fit(X_new2, y_new)\n", + "ridge.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zZxdCLU_5kKh" + }, + "source": [ + "#### Conseguiram visualizar o impacto dos outliers?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u5jsTkUmS9wK" + }, + "source": [ + "### Aplicação da Regressão Ridge no dataframe Boston Housing Price." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Kp4VIJWxgFy8" + }, + "source": [ + "from sklearn.linear_model import Ridge\n", + "ridge = Ridge(alpha = 0.1) # Definição do valor de alpha da regressão ridge\n", + "lr = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cmRMoOwV6FMt" + }, + "source": [ + "# Ao invés de: regressao_linear.fit(X_treinamento, y_treinamento)\n", + "ridge.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VPnekyUbK6Xg" + }, + "source": [ + "#### Peso/contribuição das variáveis para a regressão usando RIDGE" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "k83RDArjsUrj" + }, + "source": [ + "df_boston.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vMCb0CFjK973" + }, + "source": [ + "ridge.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZqksuIjXypRJ" + }, + "source": [ + "# Treinando a regressão Ridge\n", + "ridge.fit(X_treinamento, y_treinamento)\n", + "\n", + "# Treinando a regressão linear sem regularização (OLS)\n", + "lr.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7r28PBsWLtjA" + }, + "source": [ + "ridge.alpha" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dDZ_TJnhuZno" + }, + "source": [ + "#### $\\alpha = 0.01$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hRMK_QTmNgc1" + }, + "source": [ + "# Maior alpha --> mais restrição (encolhimento) aos coeficientes;\n", + "# Menor alpha --> menos restrição, e a Ridge se assemelha à OLS; se alpha = 0 ==> Ridge = OLS.\n", + "rr = Ridge(alpha = 0.01) # Quanto mais próximo de 0 ==> Ridge = OLS\n", + "rr.fit(X_treinamento, 
y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IRuWmBE7Ngc7" + }, + "source": [ + "# MSE = Erro Quadrático Médio\n", + "from sklearn.metrics import mean_squared_error\n", + "\n", + "rr_model=(mean_squared_error(y_true = y_treinamento, y_pred = rr.predict(X_treinamento)))\n", + "lr_model=(mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "L4an-zHetafI" + }, + "source": [ + "print(rr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QsLVzk3EtbGs" + }, + "source": [ + "print(lr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K2sjngo1QhY2" + }, + "source": [ + "### Coeficientes da Ridge:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s5i87o3quByz" + }, + "source": [ + "# Lista das variáveis + coeficientes da Ridge:\n", + "list(zip(X_treinamento.columns, abs(ridge.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s44vo9IjQonE" + }, + "source": [ + "### Experimente vários outros valores para $\\alpha$ como, por exemplo, $\\alpha = 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CDv5fGPbuUq5" + }, + "source": [ + "#### $\\alpha = 100$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NEaj4QRrNgdA" + }, + "source": [ + "rr100 = Ridge(alpha = 100)\n", + "rr100.fit(X_treinamento, y_treinamento)\n", + "train_score=lr.score(X_treinamento, y_treinamento)\n", + "test_score=lr.score(X_teste, y_teste)\n", + "Ridge_treinamento_score = rr100.score(X_treinamento,y_treinamento) # score da Ridge com alpha = 100" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"zhcfoTEENgdE" + }, + "source": [ + "# MSE\n", + "rr100_model = (mean_squared_error(y_true = y_treinamento, y_pred = rr100.predict(X_treinamento)))\n", + "lr_model = (mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NGDBpfiquxoc" + }, + "source": [ + "print(rr100_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Owami5MVureW" + }, + "source": [ + "print(lr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xk5dN3Owu6Kw" + }, + "source": [ + "### Próximo passo: analisar os modelos Ridge com o statsmodels." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cEF_3GgUgF0Q" + }, + "source": [ + "# LASSO (Least Absolute Shrinkage And Selection Operator regularization)\n", + "* Um dos métodos mais usados para regularização;\n", + "* Reduz overfitting;\n", + "* Se encarrega do **Feature Selection**, pois zera os coeficientes das variáveis pouco relevantes (entre variáveis altamente correlacionadas, tende a manter apenas uma)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-YiKb9reQdI4" + }, + "source": [ + "* Usado no processo de regularização, isto é, o processo de penalizar os coeficientes das variáveis para manter somente os atributos mais importantes. Pense na utilidade disso diante de um dataframe com muitas variáveis;\n", + "* A regressão Lasso tem um parâmetro ($\\alpha$): quanto maior o $\\alpha$, mais coeficientes das features são zerados. Quando $\\alpha = 0$, a regressão Lasso produz os mesmos coeficientes que uma regressão linear (OLS). Quando $\\alpha$ é muito grande, todos os coeficientes são zero." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5p_ZPZ4tTUX1" + }, + "source": [ + "### Exemplo LASSO" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JD1_M2uw6q0W" + }, + "source": [ + "X_new" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "i5JZTnkTOkI9" + }, + "source": [ + "from sklearn.linear_model import Lasso\n", + "lasso = Lasso(alpha = .1)\n", + "lasso.fit(X_new, y_new)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gEUxSlThOkJD" + }, + "source": [ + "lasso.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EQaGWzzLT9qP" + }, + "source": [ + "### Aplicação do LASSO no Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ME6v6LFlgF0Q" + }, + "source": [ + "from sklearn.linear_model import Lasso\n", + "lasso = Lasso(alpha = .1)\n", + "lasso.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h6DSEHc1gF0V" + }, + "source": [ + "lasso.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8SzYnpVGy4cy" + }, + "source": [ + "### Coeficientes do LASSO:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O2w2QDmdxxVe" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(lasso.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UBOCg1H9zn6A" + }, + "source": [ + "### Comparação com os coeficientes do RIDGE:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g1fF-mEZzXpH" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(ridge.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xP1fX1Bi6VdX" + }, + "source": [ + 
"**Conclusão**: variáveis com coeficiente zero podem ser excluídas da análise/modelo." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TbtxIWyGSXkH" + }, + "source": [ + "### Efeito dos valores de $\\alpha$\n", + "* Função adaptada de https://chrisalbon.com/machine_learning/linear_regression/effect_of_alpha_on_lasso_regression/." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B4AuWA4LRBE3" + }, + "source": [ + "# Create a function called lasso,\n", + "def lasso(alphas):\n", + "    '''\n", + "    Takes in a list of alphas. Outputs a dataframe containing the coefficients of lasso regressions from each alpha.\n", + "    '''\n", + "    # Create an empty data frame\n", + "    df = pd.DataFrame()\n", + "\n", + "    # Create a column of feature names\n", + "    df['Feature Name'] = names\n", + "\n", + "    # For each alpha value in the list of alpha values,\n", + "    for alpha in alphas:\n", + "        # Create a lasso regression with that alpha value,\n", + "        lasso = Lasso(alpha = alpha)\n", + "\n", + "        # Fit the lasso regression\n", + "        lasso.fit(X_treinamento, y_treinamento)\n", + "\n", + "        # Create a column name for that alpha value\n", + "        column_name = 'Alpha = %f' % alpha\n", + "\n", + "        # Create a column of coefficient values\n", + "        df[column_name] = lasso.coef_\n", + "\n", + "    # Return the dataframe\n", + "    return df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VEDvXvuNRK0C" + }, + "source": [ + "names = X_treinamento.columns\n", + "\n", + "# Valores de alpha:\n", + "lasso([.0001, .001, 0, .01, .1, 1, 10, 100])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xFlvTUJKhwgW" + }, + "source": [ + "### Capturando os elementos mais importantes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4_-sUgMIhzmE" + }, + "source": [ + "# Suposição: usamos lm_sm4, o último modelo OLS ajustado com statsmodels (não há 'model' definido até aqui)\n", + "r_squared = lm_sm4.rsquared\n", + "r_squared_adj = lm_sm4.rsquared_adj\n", + "coeficientes_regressao = lm_sm4.params" + ], 
+ "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "apGv5ytnimsM" + }, + "source": [ + "VEJA: https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uhokzxtcil8w" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jSYw6SdcXa0q" + }, + "source": [ + "### Cross-Validation & GridSearch para LASSO" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E14i4Y3rqEX2" + }, + "source": [ + "### Colocar aqui a fórmula do RMSE." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "irFZAkvVXfya" + }, + "source": [ + "from sklearn.linear_model import LassoCV\n", + "from sklearn.model_selection import RepeatedKFold" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "T3Jjom8RYdly" + }, + "source": [ + "# define model evaluation method\n", + "cv = RepeatedKFold(n_splits = 5, n_repeats = 3, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Cw3lAvRPYgJe" + }, + "source": [ + "# define model\n", + "model = LassoCV(alphas = np.arange(0.001, 10, 0.001), cv = cv, n_jobs = -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oLX3CpThXvkJ" + }, + "source": [ + "# fit model\n", + "model.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U1ubd5huYQ7u" + }, + "source": [ + "# summarize chosen configuration\n", + "print('alpha: %f' % model.alpha_)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9P7hYoo4gF0Z" + }, + "source": [ + "# Elastic Net \n", + "* Combina o poder de Ridge e LASSO;\n", + "* 
Remove variáveis de pouco poder preditivo (LASSO) ou as penaliza (Ridge)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yChNUYs7gF0b" + }, + "source": [ + "from sklearn.linear_model import ElasticNet\n", + "from sklearn.model_selection import GridSearchCV\n", + "\n", + "# Instancia o objeto\n", + "en = ElasticNet(alpha = .1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S1m3SL2avMbd" + }, + "source": [ + "transformacao.fit(dados_que_eu_quero_transformar)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4mbIaAUAF4N6" + }, + "source": [ + "en.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MaUkZw8ngF0h" + }, + "source": [ + "list(zip(X_treinamento, en.coef_))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K7LuPhCtvouJ" + }, + "source": [ + "### GridSearch para encontrar o $\\alpha$ para Elastic Net" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xl-Qh9caDyCp" + }, + "source": [ + "# Instancia o objeto:\n", + "en = ElasticNet(normalize = True)\n", + "\n", + "# Otimização dos hiperparâmetros:\n", + "d_hiperparametros = {'alpha': np.logspace(-5, 2, 8), \n", + " 'l1_ratio': [.2, .4, .6, .8]}\n", + "\n", + "search = GridSearchCV(estimator = en, # Elastic Net\n", + " param_grid = d_hiperparametros, # Dicionário com os hiperparâmetros\n", + " scoring = 'neg_mean_squared_error', # MSE negativo: o scikit-learn maximiza o score, por isso usa -MSE ('mean_squared_error' não é um scorer válido)\n", + " n_jobs = -1, # Usar todos os processadores/computação\n", + " refit = True, \n", + " cv = 10) # Número de folds da validação cruzada (Cross-Validation)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JvNQyUW_2QLr" + }, + "source": [ + "### Exercício (Estatística): Sugestão de ajuste 
manual\n", + "* Estudar estatisticamente a distribuição de frequência em que a variável é significante (ao nível de 5%) em 100 fits." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hp1hV5YahsJb" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ng0rPXfA1DgS" + }, + "source": [ + "for i in range(0, 100):\n", + " X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X, y, 0.2)\n", + " modeloi = fit(X_treinamento, y_treinamento)\n", + " intercepto\n", + " coeficientes da regressão\n", + " validação dos parâmetros (significância)\n", + " y_predict = predict(X_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "c3_XCQCPGlr3" + }, + "source": [ + "search.fit(X_treinamento, y_treinamento)\n", + "\n", + "# Retorna os melhores hiperparâmetros do algoritmo:\n", + "search.best_params_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq0_ugQfGrdb" + }, + "source": [ + "en2 = ElasticNet(normalize = True, alpha = 0.001, l1_ratio = 0.6)\n", + "en2.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ILA5lScUx-Ub" + }, + "source": [ + "\n", + "# Métrica\n", + "ml2 = (mean_squared_error(y_true = y_teste, y_pred = en2.predict(X_teste)))\n", + "# Encontrar a métrica neg_squared_error --> ml3 = (neg_mean_squared_error(y_true = y_teste, y_pred = en2.predict(X_teste)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BzO_dHRixd_L" + }, + "source": [ + "print(f\"MSE: {ml2}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zaEwh3t3zwFc" + }, + "source": [ + "**Conclusão**:\n", + "* Comparação dos MSE - A Regressão sem Regularization produziu MSE de 23.94. 
As we can see, Elastic Net yields an MSE of 15.4." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5geUMgC6ztxE" + }, + "source": [ + "### Elastic Net coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LyLdASRqzwCq" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(en2.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "90pfP9-3OkJG" + }, + "source": [ + "Note above that the second coefficient was estimated as 0, so the corresponding variable can be dropped from the model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ILCXvYKDOkJH" + }, + "source": [ + "# Elastic Net \n", + "* Combines the strengths of Ridge and LASSO;\n", + "* Removes variables with little predictive power (LASSO) or penalizes them (Ridge)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GaQPDCR2OkJI" + }, + "source": [ + "from sklearn.linear_model import ElasticNet\n", + "\n", + "# Instantiate the estimator\n", + "en = ElasticNet(alpha = .1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xVp16Eu_OkJL" + }, + "source": [ + "en.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kwj018U8OkJO" + }, + "source": [ + "en.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rJRWBzSQCcss" + }, + "source": [ + "# Logistic Regression\n", + "\n", + "* In linear regression, we model the linear relationship between the features ($X_{np} = [X_{1}, X_{2}, ..., X_{p}]$) and the response $y$ with the equation:\n", + "\n", + "$$\\hat{y}= \\beta_{0}+\\beta_{1}x_{1}+\\beta_{2}x_{2}+...+\\beta_{p}x_{p}$$\n", + "\n", + "For classification, Logistic Regression returns probabilities (between 0 and 1), given by the logistic equation (also known as the **sigmoid function**):\n", + "\n", + "$$P[y = 
1]= \\frac{1}{1+e^{-(\\beta_{0}+\\beta_{1}x_{1}+\\beta_{2}x_{2}+...+\\beta_{p}x_{p})}}$$\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vj83Altwdni7" + }, + "source": [ + "![SigmoidFunction](https://github.com/MathMachado/Materials/blob/master/SigmoidFunction.PNG?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LS1QjQnknqe5" + }, + "source": [ + "## Assumptions of Logistic Regression\n", + "* There are no missing values in the dataset;\n", + "* The response variable $y$ is binary (0 or 1) or ordinal (a categorical variable with ordered values, for example a wine quality rating);\n", + "* The predictor variables $X$ are independent of one another;\n", + "* There are at least 50 observations for each predictor variable in the model (the more, the better), which helps ensure reliable results;\n", + "* The classes of the response variable are balanced." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YGvpGTAd4jO" + }, + "source": [ + "# Example 1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-LBYRG__e_Zv" + }, + "source": [ + "### Load the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XX2GNYWue-iA" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "%matplotlib inline\n", + "\n", + "import statsmodels.api as sm\n", + "import statsmodels.formula.api as smf\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import roc_auc_score, roc_curve, classification_report, accuracy_score, confusion_matrix, auc" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RpNu-JjJfBYe" + }, + "source": [ + "### Load the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dWVj8SmUeBZB", + "outputId": "ddb92623-228d-4621-90f9-b80b1a0d06c9", + 
"colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_Original.csv'\n", + "df_titanic = pd.read_csv(url)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass ... Fare Cabin Embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 289 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T9vZGvU5qbsQ", + "outputId": "6ad44206-7129-4e4e-a03d-cec7aeab3449", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "df_titanic.columns = [coluna.lower() for coluna in df_titanic.columns]\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 290 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fAYAg5tofDgQ" + }, + "source": [ + "### Entendendo os dados\n", + "* sibsp - número of siblings/esposas abordo do Titanic;\n", + "* parch - número de parentes/crianças abordo do Titanic;\n", + "* embarked - Cidade/Portão de embarque: C = Cherbourg, Q = Queenstown, S = Southampton." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZbijPdpFxdZy" + }, + "source": [ + "#### A variável-target é do tipo binária ou categórica ordinal?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7hspb3IMe5tx", + "outputId": "684c59df-d788-400f-a179-0774a39e4303", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['survived'].value_counts()/df_titanic.shape[0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 0.616162\n", + "1 0.383838\n", + "Name: survived, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 291 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tsp4t7oxx3zC" + }, + "source": [ + "A seguir, o gráfico da variável-target:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vm0BDjw-xrGI", + "outputId": "443def8e-dee6-40f5-8bcf-b33a43bb5c15", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "source": [ + "sns.countplot(x = 'survived', data = df_titanic)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 
292 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEGCAYAAACKB4k+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAPQUlEQVR4nO3dfbDmZV3H8fcHFqR84MHdNtyllpLJoRTFE5HaVJAFZC5jgjgaK+7M1gw1OmZG/ZEPQ42OlmEatRPqQiUgZmxmGrNApgPq2UQeMzeC2A3cI0+KZLn27Y9z7cVhObvcZ9nfuc9y3q+Ze+7rd/2u3+/+3szO+XD9nu5UFZIkARww7gIkSQuHoSBJ6gwFSVJnKEiSOkNBktQtGXcBT8TSpUtr1apV4y5DkvYrmzdv/npVLZtt3X4dCqtWrWJycnLcZUjSfiXJnbtb5+EjSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUrdf39G8L7zwty4edwlagDa/++xxlyCNhTMFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkbNBSS3JHkpiQ3JJlsfUckuSrJV9v74a0/Sd6XZEuSG5McP2RtkqTHmo+Zws9W1fOraqItnwdsqqpjgE1tGeBU4Jj2WgdcOA+1SZJmGMfho9XAhtbeAJw+o//imnY9cFiSI8dQnyQtWkOHQgH/mGRzknWtb3lV3d3a9wDLW3sFcNeMbbe2vkdJsi7JZJLJqampoeqWpEVp6J/jfElVbUvyfcBVSf515sqqqiQ1lx1W1XpgPcDExMSctpUk7dmgM4Wq2tbetwMfB04AvrbzsFB7396GbwOOmrH5ytYnSZong4VCkqcmefrONvDzwM3ARmBNG7YGuLK1NwJnt6uQTgQenHGYSZI0D4Y8fLQc+HiSnZ/z11X1qSRfBC5Psha4Ezizjf8kcBqwBXgYOGfA2iRJsxgsFKrqduC4WfrvBU6epb+Ac4eqR5L0+LyjWZLUGQqSpM5QkCR1hoIkqTMUJEmdoSBJ6gwFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJnaEgSeoMBUlSZyhIkjpDQZLUGQqSpM5QkCR1hoIkqTMUJEnd4KGQ5MAkX0ryibZ8dJLPJ9mS5LIkB7f+p7TlLW39qqFrkyQ92nzMFN4A3DZj+V3Ae6vq2cD9wNrWvxa4v/W/t42TJM2jQUMhyUrgF4G/aMsBTgKuaEM2AKe39uq2TFt/chsvSZonQ88U/hh4C/B/bfmZwANVtaMtbwVWtPYK4C6Atv7BNv5RkqxLMplkcmpqasjaJWnRGSwUkrwM2F5Vm/flfqtqfVVNVNXEsmXL9uWuJWnRWzLgvl8MvDzJacAhwDOAC4DDkixps4GVwLY2fhtwFLA1yRLgUODeAeuTJO1isJlCVf1OVa2sqlXAWcDVVfUa4BrglW3YGuDK1t7Ylmnrr66qGqo+SdJjjeM+hd8G3pRkC9PnDC5q/RcBz2z9bwLOG0NtkrSoDXn4qKuqa4FrW/t24IRZxnwbOGM+6pEkzc47miVJnaEgSeoMBUlSZyhIkjpDQZLUGQqSpM5QkCR1hoIkqTMUJEmdoSBJ6gwFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpm5cf2ZE0d//5jueOuwQtQD/wezcNun9nCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1I0UCkk2jdInSdq/7fGO5iSHAN8LLE1yOJC26hnAioFrkyTNs8d7zMWvAm8EngVs5pFQ+Abw/gHrkiSNwR4PH1
XVBVV1NPDmqvqhqjq6vY6rqj2GQpJDknwhyZeT3JLk7a3/6CSfT7IlyWVJDm79T2nLW9r6VfvoO0qSRjTSA/Gq6k+SvAhYNXObqrp4D5v9D3BSVT2U5CDgs0n+AXgT8N6qujTJnwFrgQvb+/1V9ewkZwHvAl61N19KkrR3Rj3RfAnwHuAlwI+318SetqlpD7XFg9qrgJOAK1r/BuD01l7dlmnrT06y83CVJGkejPro7Ang2Kqquew8yYFMn4t4NvAB4N+BB6pqRxuylUdOWK8A7gKoqh1JHgSeCXx9Lp8pSdp7o96ncDPw/XPdeVV9t6qeD6wETgCeM9d97CrJuiSTSSanpqae6O4kSTOMOlNYCtya5AtMnysAoKpePsrGVfVAkmuAnwQOS7KkzRZWAtvasG3AUcDWJEuAQ4F7Z9nXemA9wMTExJxmLpKkPRs1FN421x0nWQZ8pwXC9wAvZfrk8TXAK4FLgTXAlW2TjW35urb+6rkerpIkPTGjXn30T3ux7yOBDe28wgHA5VX1iSS3ApcmOR/4EnBRG38RcEmSLcB9wFl78ZmSpCdgpFBI8k2mrxwCOJjpK4m+VVXP2N02VXUj8IJZ+m9n+vzCrv3fBs4YpR5J0jBGnSk8fWe7XSa6GjhxqKIkSeMx56ektvsP/hb4hQHqkSSN0aiHj14xY/EApu9b+PYgFUmSxmbUq49+aUZ7B3AH04eQJElPIqOeUzhn6EIkSeM36rOPVib5eJLt7fWxJCuHLk6SNL9GPdH8IaZvLntWe/1d65MkPYmMGgrLqupDVbWjvT4MLBuwLknSGIwaCvcmeW2SA9vrtczyXCJJ0v5t1FB4PXAmcA9wN9PPJnrdQDVJksZk1EtS3wGsqar7AZIcwfSP7rx+qMIkSfNv1JnC83YGAkBV3ccszzWSJO3fRg2FA5IcvnOhzRRGnWVIkvYTo/5h/0PguiQfbctnAL8/TEmSpHEZ9Y7mi5NMAie1rldU1a3DlSVJGoeRDwG1EDAIJOlJbM6PzpYkPXkZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJ3WChkOSoJNckuTXJLUne0PqPSHJVkq+298Nbf5K8L8mWJDcmOX6o2iRJsxtyprAD+M2qOhY4ETg3ybHAecCmqjoG2NSWAU4FjmmvdcCFA9YmSZrFYKFQVXdX1b+09jeB24AVwGpgQxu2ATi9tVcDF9e064HDkhw5VH2SpMeal3MKSVYBLwA+DyyvqrvbqnuA5a29ArhrxmZbW9+u+1qXZDLJ5NTU1GA1S9JiNHgoJHka8DHgjVX1jZnrqqqAmsv+qmp9VU1U1cSyZcv2YaWSpEFDIclBTAfCX1XV37Tur+08LNTet7f+bcBRMzZf2fokSfNkyKuPAlwE3FZVfzRj1UZgTWuvAa6c0X92uwrpRODBGYeZJEnzYMmA+34x8CvATUluaH2/C7wTuDzJWuBO4My27pPAacAW4GHgnAFrkyTNYrBQqKrPAtnN6pNnGV/AuUPVI0l6fN7RLEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJnaEgSeoMBUlSZyhIkjpDQZLUGQqSpM5QkCR1hoIkqTMUJEmdoSBJ6gwFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqRusFBI8sEk25PcPKPviCRXJflqez+89SfJ+5JsSXJjkuOHqkuStHtDzhQ+DJyyS995wKaqOgbY1JYBTgWOaa91wIUD1iVJ2o3BQqGqPgPct0v3amBDa28ATp/Rf3FNux44LMmRQ9UmSZrdfJ9TWF5Vd7f2PcDy1l4B3DVj3NbW9xhJ1iWZTDI5NTU1XKWStAiN7URzVRVQe7Hd+qqaqKqJZcuWDVCZJC1e8x0KX9t5WKi9b2/924CjZoxb2fokSfNovkNhI7CmtdcAV87oP7
tdhXQi8OCMw0ySpHmyZKgdJ/kI8DPA0iRbgbcC7wQuT7IWuBM4sw3/JHAasAV4GDhnqLokSbs3WChU1at3s+rkWcYWcO5QtUiSRuMdzZKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJnaEgSeoMBUlSZyhIkjpDQZLUGQqSpM5QkCR1hoIkqTMUJEmdoSBJ6gwFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqFlQoJDklyVeSbEly3rjrkaTFZsGEQpIDgQ8ApwLHAq9Ocux4q5KkxWXBhAJwArClqm6vqv8FLgVWj7kmSVpUloy7gBlWAHfNWN4K/MSug5KsA9a1xYeSfGUealsslgJfH3cRC0Hes2bcJejR/Le501uzL/byg7tbsZBCYSRVtR5YP+46noySTFbVxLjrkHblv835s5AOH20DjpqxvLL1SZLmyUIKhS8CxyQ5OsnBwFnAxjHXJEmLyoI5fFRVO5L8OvBp4EDgg1V1y5jLWmw8LKeFyn+b8yRVNe4aJEkLxEI6fCRJGjNDQZLUGQry8SJasJJ8MMn2JDePu5bFwlBY5Hy8iBa4DwOnjLuIxcRQkI8X0YJVVZ8B7ht3HYuJoaDZHi+yYky1SBozQ0GS1BkK8vEikjpDQT5eRFJnKCxyVbUD2Pl4kduAy328iBaKJB8BrgN+JMnWJGvHXdOTnY+5kCR1zhQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkK0kCSvHxfPXU2yUP7Yj/S4/GSVOkJSLKk3esx9Oc8VFVPG/pzJGcKEpDkqUn+PsmXk9yc5FVJ7kiytK2fSHJta78tySVJPgdckuT6JD86Y1/XtvGvS/L+JIcmuTPJATM+664kByX54SSfSrI5yT8neU4bc3SS65LclOT8+f8vosXKUJCmnQL8V1UdV1U/BnzqccYfC/xcVb0auAw4EyDJkcCRVTW5c2BVPQjcAPx063oZ8Omq+g7TP0j/G1X1QuDNwJ+2MRcAF1bVc4G798UXlEZhKEjTbgJemuRdSX6q/SHfk41V9d+tfTnwytY+E7hilvGXAa9q7bOAy5I8DXgR8NEkNwB/DhzZxrwY+EhrXzLnbyPtpSXjLkBaCKrq35IcD5wGnJ9kE7CDR/7H6ZBdNvnWjG23Jbk3yfOY/sP/a7N8xEbgD5IcAbwQuBp4KvBAVT1/d2Xt9ReS9pIzBQlI8izg4ar6S+DdwPHAHUz/AQf45cfZxWXAW4BDq+rGXVdW1UNMP5H2AuATVfXdqvoG8B9Jzmg1JMlxbZPPMT2jAHjNXn8xaY4MBWnac4EvtMM4bwXOB94OXJBkEvju42x/BdN/xC/fw5jLgNe2951eA6xN8mXgFh75KdQ3AOcmuQl/CU/zyEtSJUmdMwVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJ3f8DThe6X9gR+9IAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XfhFG6Axxj6F" + }, + "source": [ + "Como podemos ver, a variável-resposta 'survived' é binária. Portanto, tudo ok até agora." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zRKhDX6ZraGU" + }, + "source": [ + "### Tratamento dos Missing Values\n", + "* Substituir os NaN's por mediana da variável" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qPbILjZyrhRZ", + "outputId": "52f34626-1875-4632-cce3-8a694863ea6c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 177\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 293 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uJUPufRossTo" + }, + "source": [ + "Cálculo da mediana da variável/preditora 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WGW9bW5x4JdT", + "outputId": "c0ce5e66-70fe-4195-c637-2ac8a4cf9d37", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "#df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 294 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DgAwrR8msYv_" + }, + "source": [ + "mediana_age = df_titanic['age'].median()\n", + "mediana_fare = df_titanic['fare'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yqIgckarzwdB", + "outputId": "a190a691-e377-42ba-a8f8-98d430ccabbd", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "mediana_age" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 297 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "czdSVeLjzxAX", + "outputId": "48bdfb0b-a153-482e-9fa6-6706bd44b88d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "mediana_fare" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "14.4542" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 298 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u4vcCshcsv6w" + }, + "source": [ + "Substituição dos NaN's da variável 'age' e 'fare' pela respetiva mediana" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tnOOsaqLsg03", + "outputId": "81837607-f6b7-4bc2-e295-443679d3deb2", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['age'].fillna(mediana_age, inplace = True)\n", + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + 
"text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 299 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VqAnNxnO0Ghn" + }, + "source": [ + "Dado que fare não possui NaN's, então nada a fazer." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Hi2zG_ms6n-" + }, + "source": [ + "#### Usando Imputer\n", + "* Método para tratamento de Missing Values." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mvCnGfCOri9Y", + "outputId": "6d8b7f52-ca60-4bbd-bc50-9e55f3b9bd17", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from sklearn.impute import SimpleImputer\n", + "\n", + "# fit()\n", + "imputer_mv = SimpleImputer(strategy = 'median')\n", + "imputer_mv.fit(df_titanic_copia[['age', 'fare']])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='median', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 300 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SokJ8HM61FcK", + "outputId": "65052934-48c4-4c6d-b206-b14d8fa10dc3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "imputer_mv" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='median', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 301 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X-qx8QsQthyU", + "outputId": "469c2591-1ea2-4dfd-8395-df11821f5951", + "colab": { + 
"base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "# transform()\n", + "df_titanic_mediana = pd.DataFrame(imputer_mv.transform(df_titanic[['age', 'fare']]), columns = ['age2', 'fare2'])\n", + "df_titanic_mediana.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
age2fare2
022.07.2500
138.071.2833
226.07.9250
335.053.1000
435.08.0500
\n", + "
" + ], + "text/plain": [ + " age2 fare2\n", + "0 22.0 7.2500\n", + "1 38.0 71.2833\n", + "2 26.0 7.9250\n", + "3 35.0 53.1000\n", + "4 35.0 8.0500" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 302 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KS-xYf5BuwEt", + "outputId": "c7e01602-c917-48b7-a7f1-33ce365f21ac", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic_mediana.median()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "age2 28.0000\n", + "fare2 14.4542\n", + "dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 303 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lggbmAD2vN42", + "outputId": "55134b01-f993-4f01-b053-7600c45eec21", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic_copia.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 177\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 304 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8fQ6a7RSvUOp", + "outputId": "75770554-a802-4012-a068-76b2e1ba1578", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 498 + } + }, + "source": [ + "df_titanic['age'] = df_titanic_mediana['age2']\n", + "\n", + "# Não há NaN's na variável fare. Portanto, nenhuma alteração\n", + "#df_titanic['fare'] = df_titanic_mediana['fare']\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 305 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HSncMlT51oM5", + "outputId": "85057833-9c05-4805-d2d0-ad6a3af0d140", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 306 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c48gJg0q4zgj" + }, + "source": [ + "Exclui as colunas que não são mais necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7OzK7DnDg2WY", + "outputId": "462a1c3e-d2b3-4d5d-b745-945706522481", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 498 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 307 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oLpWbzz84ykm", + "outputId": "cdb706ed-5395-47d0-8f52-7887edcde8ee", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic.drop(columns = ['passengerid', 'name', 'ticket', 'cabin'], axis = 1, inplace = True)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked
003male22.0107.2500S
111female38.01071.2833C
213female26.0007.9250S
311female35.01053.1000S
403male35.0008.0500S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked\n", + "0 0 3 male 22.0 1 0 7.2500 S\n", + "1 1 1 female 38.0 1 0 71.2833 C\n", + "2 1 3 female 26.0 0 0 7.9250 S\n", + "3 1 1 female 35.0 1 0 53.1000 S\n", + "4 0 3 male 35.0 0 0 8.0500 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 308 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NZei3VxSxR6g" + }, + "source": [ + "Alternativamente, poderíamos concatenar os dois dataframes usando pd.concat()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ek2qBdOFw2p5", + "outputId": "7777706c-55e4-4bfd-822f-cf3843922f3e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic = df_titanic_copia.copy()\n", + "\n", + "df_titanic.drop(columns = ['passengerid', 'name', 'ticket', 'cabin', 'fare', 'age'], axis = 1, inplace = True)\n", + "df_titanic = pd.concat([df_titanic, df_titanic_mediana], axis = 1)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchembarkedage2fare2
003male10S22.07.2500
111female10C38.071.2833
213female00S26.07.9250
311female10S35.053.1000
403male00S35.08.0500
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp parch embarked age2 fare2\n", + "0 0 3 male 1 0 S 22.0 7.2500\n", + "1 1 1 female 1 0 C 38.0 71.2833\n", + "2 1 3 female 0 0 S 26.0 7.9250\n", + "3 1 1 female 1 0 S 35.0 53.1000\n", + "4 0 3 male 0 0 S 35.0 8.0500" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6omsobg77tRv" + }, + "source": [ + "#### Tratamento dos NaN's da variável 'embarked'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YjeivMbz85gg" + }, + "source": [ + "A seguir, listamos as linhas em que embarked = NaN:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Mc03_AnI8QgV", + "outputId": "ebe1ecc6-2c40-429d-d608-3b72ef10acd2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 110 + } + }, + "source": [ + "embarked_NaN = df_titanic[df_titanic['embarked'].isna()]\n", + "embarked_NaN.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked
6111female38.00080.0NaN
82911female62.00080.0NaN
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked\n", + "61 1 1 female 38.0 0 0 80.0 NaN\n", + "829 1 1 female 62.0 0 0 80.0 NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 309 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xsbeFBFp7zRM", + "outputId": "58859c46-d711-4558-aadf-70480e67c98b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from sklearn.impute import SimpleImputer\n", + "\n", + "# fit()\n", + "imputer_mv = SimpleImputer(strategy = 'most_frequent')\n", + "imputer_mv.fit(df_titanic[['embarked']])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='most_frequent', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 310 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f2kDtHVN761L" + }, + "source": [ + "# transform()\n", + "df_embarked_freq = pd.DataFrame(imputer_mv.transform(df_titanic[['embarked']]), columns = ['embarked2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0JmoLrzD8NwW", + "outputId": "d8ac60c0-a440-42b1-d7d5-c00168c3956f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic = pd.concat([df_titanic, df_embarked_freq], axis = 1)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedembarked2
003male22.0107.2500SS
111female38.01071.2833CC
213female26.0007.9250SS
311female35.01053.1000SS
403male35.0008.0500SS
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked embarked2\n", + "0 0 3 male 22.0 1 0 7.2500 S S\n", + "1 1 1 female 38.0 1 0 71.2833 C C\n", + "2 1 3 female 26.0 0 0 7.9250 S S\n", + "3 1 1 female 35.0 1 0 53.1000 S S\n", + "4 0 3 male 35.0 0 0 8.0500 S S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 312 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FRxX9c4--TCg" + }, + "source": [ + "COMPARE o ANTES e o DEPOIS: Veja a seguir que os valores de [embarked] = NaN foram substituidos por..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oQFDqatz9bMv", + "outputId": "45d6ab98-b832-4844-8d66-6d47cbcee08e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 110 + } + }, + "source": [ + "embarked_NaN = df_titanic[df_titanic['embarked'].isna()]\n", + "embarked_NaN" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedembarked2
6111female38.00080.0NaNS
82911female62.00080.0NaNS
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked embarked2\n", + "61 1 1 female 38.0 0 0 80.0 NaN S\n", + "829 1 1 female 62.0 0 0 80.0 NaN S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 313 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jgCuXei2ZTQl" + }, + "source": [ + "Como podemos ver, os NaN's da variável embarked foram todos substituídos pelo valor 'S'. Tudo bem para vocês esta substituição?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r3r8ObKn-nBt" + }, + "source": [ + "df_titanic.drop(columns = ['embarked'], axis = 1, inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OacQvrYeAPBR" + }, + "source": [ + "Verificação final dos NaN's:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OHBv7CrjARol", + "outputId": "df1e556b-21dd-42a2-df08-4c0046f1f3b1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 0\n", + "pclass 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "fare 0\n", + "embarked2 0\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 315 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ITFMsiBSyAHY" + }, + "source": [ + "### O dataframe sob análise possui (pelo menos) 50 observações para cada preditora?\n", + "* Variáveis preditoras: pclass, sex, age, sibsp, parch, fare, embarked2 --> 7 variáveis preditoras.\n", + "* Portanto, nosso dataframe precisa de, no mínimo 7 x 50 = 350 linhas." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4lgVp2N8yE1C", + "outputId": "2dbea822-609b-4576-c3e2-f89b3527db1c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.info()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 891 entries, 0 to 890\n", + "Data columns (total 8 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 survived 891 non-null int64 \n", + " 1 pclass 891 non-null int64 \n", + " 2 sex 891 non-null object \n", + " 3 age 891 non-null float64\n", + " 4 sibsp 891 non-null int64 \n", + " 5 parch 891 non-null int64 \n", + " 6 fare 891 non-null float64\n", + " 7 embarked2 891 non-null object \n", + "dtypes: float64(2), int64(4), object(2)\n", + "memory usage: 55.8+ KB\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rFwtnAcw23gQ", + "outputId": "2b9b006b-3dee-493a-e9a6-adcf90923373", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "891/7" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "127.28571428571429" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 317 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wLqz2V7SytPU" + }, + "source": [ + "Pressuposto atendido?\n", + "Se sim, podemos prosseguir com as análises..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wm0VycfhovW8" + }, + "source": [ + "#### Avaliação do pressuposto de variáveis preditoras independentes\n", + "* Coeficiente de Spearman (desenvolvido por Charles Spearman). Também conhecido como Coeficiente de Correlação de Spearman e denotado pela letra greaga $\\rho(p)$.\n", + "* É um método estatístico para avaliar/medir a correlação entre 2 variáveis ordinais." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "29knlUdcztb1", + "outputId": "adb2e1a5-2436-4327-8eff-4a65b66d4e0b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchfareage2ambarked2
003male107.250022.0S
111female1071.283338.0C
213female007.925026.0S
311female1053.100035.0S
403male008.050035.0S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp parch fare age2 ambarked2\n", + "0 0 3 male 1 0 7.2500 22.0 S\n", + "1 1 1 female 1 0 71.2833 38.0 C\n", + "2 1 3 female 0 0 7.9250 26.0 S\n", + "3 1 1 female 1 0 53.1000 35.0 S\n", + "4 0 3 male 0 0 8.0500 35.0 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 102 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J5EEcU7l0E2B" + }, + "source": [ + "A seguir, a hipótese de independência que queremos testar:\n", + "\n", + "$H_{0}$: variáveis são independentes --> Se o p-value < 5% --> Há evidências para rejeitar $H_{0}$." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tj8A_Kp0qxp_" + }, + "source": [ + "from scipy.stats import spearmanr" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kFxVGHPUpKLi" + }, + "source": [ + "coef, p = spearmanr(df_titanic['pclass'], df_titanic['age'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fvzvyvK7qzib", + "outputId": "d1e8d723-5048-4360-bad4-9ce0b8172bbf", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "print('Coeficiente de Correlação de Spearman: %.3f' % coef)\n", + "\n", + "# Interpretação da significância:\n", + "alpha = 0.05\n", + "if p > alpha:\n", + "\tprint('Amostras NÃO correlacionadas (falha em rejeitar H0) p = %.3f' %p)\n", + "else:\n", + "\tprint('Amostras correlacionadas (Rejeita H0) p = %.3f' %p)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Coeficiente de Correlação de Spearman: -0.317\n", + "Amostras correlacionadas (Rejeita H0) p = 0.000\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yespibmf1WVh" + }, + "source": [ + "## Data Transformation\n", + "* MinMaxScaler e RobustScaler\n", + "* Binning (categorização de variáveis/preditoras numéricas): fare e age\n", + "* 
Outliers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UwLpj8PXKFuL" + }, + "source": [ + "### Tratamento dos Outliers\n", + "* variáveis: age e fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sTTgUx9oiWdJ", + "outputId": "fd14b9f5-7e25-4416-bfc7-e318e79d3249", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "#df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2
003male22.0107.2500S
111female38.01071.2833C
213female26.0007.9250S
311female35.01053.1000S
403male35.0008.0500S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2\n", + "0 0 3 male 22.0 1 0 7.2500 S\n", + "1 1 1 female 38.0 1 0 71.2833 C\n", + "2 1 3 female 26.0 0 0 7.9250 S\n", + "3 1 1 female 35.0 1 0 53.1000 S\n", + "4 0 3 male 35.0 0 0 8.0500 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 321 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-7v8WaB4aEKv" + }, + "source": [ + "from scipy import stats \n", + "\n", + "def trata_outliers(df, coluna):\n", + " sns.boxplot(x = coluna, data = df)\n", + "\n", + " # Cálculo de Q1, Q3 e IQR:\n", + " Q1 = np.percentile(df[coluna], 25)\n", + " Q3 = np.percentile(df[coluna], 75)\n", + " IQR = Q3 - Q1\n", + " print(f\"IQR: {IQR}\")\n", + "\n", + " # Jeito mais fácil (menos trabalhoso).\n", + " #IQR2 = stats.iqr(df[coluna]) \n", + " #IQR2 \n", + "\n", + " # Cálculo dos limites inferiores e superiores para detecção de outliers:\n", + " limite_inferior_outliers = Q1 - 1.5*IQR\n", + " limite_superior_outliers = Q3 + 1.5*IQR\n", + " print(f\"Limite inferior para outlier: {limite_inferior_outliers}; Limite superior para outliers: {limite_superior_outliers}\")\n", + "\n", + " # Cálculo da mediana\n", + " mediana = df[coluna].median()\n", + " print(f\"Mediana: {mediana}\")\n", + "\n", + " # Substituição dos outliers:\n", + " df[coluna+'_o'] = df[coluna]\n", + "\n", + " df.loc[df[coluna] > limite_superior_outliers, coluna+'_o'] = np.nan\n", + " df[coluna+'_o'].fillna(mediana, inplace = True) # 'o' significa tratamento outlier --> indicação para mostrar que a coluna passou pelo tratamento dos outliers.\n", + "\n", + " return df, limite_superior_outliers" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pwAExKTWaOSf", + "outputId": "089cba96-b805-4eef-d504-f8c81551c938", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + } + }, + "source": [ + "df_titanic, limite_superior_outliers = trata_outliers(df = 
df_titanic, coluna = 'age')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "IQR: 13.0\n", + "Limite inferior para outlier: 2.5; Limite superior para outliers: 54.5\n", + "Mediana: 28.0\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEGCAYAAABbzE8LAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAPXUlEQVR4nO3df2ychX3H8c83vtGGpAXioAgctmvlrhla1rSNOlCrzc7CmpLRamqRyA8wIhAmdU4Ck6YC0WJLAW3S5BFlbBKDFJhIWiUtkECUNSHepCGNYrehCSS0t9VtYxWSOi1tfqyryXd/PM+Zu7Nj+xzffR/j90uy8PM8vuf5Xu7uzePHv8zdBQCovxnRAwDAdEWAASAIAQaAIAQYAIIQYAAIkqvmg+fOnev5fL5GowDAe1Nvb+/P3P3KyvVVBTifz6unp2fypgKAacDMfjTSei5BAEAQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABKnqb8Ih1tatW1UoFGqy7/7+fklSU1NTTfZfqbm5We3t7XU5FpBVBHgKKRQKOnTkqN65dM6k77vh7NuSpDd/XfunRMPZUzU/BjAVEOAp5p1L5+jcghsnfb8zj+2VpJrs+0LHAqY7rgEDQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAkEwFeOvWrdq6dWv0GEA4XgvTQy56gFKFQiF6BCATeC1MD5k6AwaA6YQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0CQugR4YGBAq1atUktLi1paWvTII49Ikjo7O9XS0qIHH3ywHmMAU1ahUNDy5cvV29ur1atXq6WlRd3d3WXbCoWCJOm5555TS0uL9uzZM+L2gwcPlt2+p6dHS5YsUW9v77BtlbetXC69rZS81tetW6eBgYGLuo/r1q1TT09P2bFGczHHjdx3XQL85JNPqr+/f2h5586dkjT0IO/fv78eYwBT1ubNm3XmzBlt2rRJx48fl6ShE5fits2bN0uSHn74YUlSV1fXiNsfeuihstt3dHTo/Pnz2rRp07BtlbetXC69rZS81g8fPqynnnrqou7j4cOH1dHRUXas0VzMcSP3XfMADwwM6IUXXhi2fuXKlWXLnAUDIysUCurr65MknT59emj94OCgtm/fPrStr69Pjz32mNxdkuTu2rZtW9n2p59+WoODg0O3f/zxx4f2efr06bJtO3bsKLttd3d32fKe
PXvKbtvd3a19+/bJ3bVv376qzhgr76O7D+27r69v1LPggYGBCR93LLXctyRZ8cEaj8WLF3tPT09VB+jq6tLu3bvH9bFz587VuXPn1NzcXNUxpotCoaBf/Z/rzKJbJn3fM4/tlSSdW3DjpO+70qxDX9MHLjEe51EUCgXNnDlTu3bt0u233z4Up0i5XG4o0JJkZirtRy6Xk5TEO5fLafny5brnnnvGte+x7mM+n9cTTzwx4rauri7t3bt3Qscdy2Tt28x63X1x5foxz4DNbK2Z9ZhZz8mTJ6s+8IEDB6q+DYB3ZSG+ksriK0mVJ2+Dg4NlZ9DVXFoc6z6Otv3AgQMTPu5YarlvScqN9QHu/qikR6XkDLjaAyxdunTcZ8BNTU2SpC1btlR7mGlh/fr16v2ft6LHuGjn3/9BNX94Ho/zKNavXz/0fj6fz0SEqz0DvuGGG8a977HuYz6fv+C2pUuXlp2lVnPcsdRy31IdrgG3tbWpoaFh2Pqrr766bHmy7xjwXrFx48YLblu7dm3Z8urVq8uWb7vttrLlu+66q2z51ltvveC+77777rLlBx54oGz53nvvHbZ9xowkKQ0NDcOOPZrR7uNY29va2iZ83LHUct9SHQLc2Nio5cuXD1u/ffv2suXKBxdAorm5eegMcPbs2UPrc7mcVq5cObQtn8/rzjvvlJlJSs5Q77jjjrLtq1atGjpTzeVyWrNmzdA+Z8+eXbZtxYoVZbdtbW0tW77pppvKbtva2qply5bJzLRs2TI1NjZO+D6a2dC+8/n8qF8vaGxsnPBxx1LLfUt1+ja0tra2ocsLknTzzTdLklpbWyVx9guMZePGjZo1a5Y6Ozs1f/58Se+etBS3Fc8SN2zYIOndM9TK7ffff3/Z7Ts6OjRjxgx1dnYO21Z528rl0ttKyWt94cKFEzpTLL2PCxcuVEdHR9mxRnMxx43cd82/C6IaxeteXBscWfEacC2+U6Ge3wUx89hefZJrwKPitfDeMuHvggAA1AYBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAguegBSjU3N0ePAGQCr4XpIVMBbm9vjx4ByAReC9MDlyAAIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAiSix4A1Wk4e0ozj+2twX4HJKkm+x5+rFOS5tX8OEDWEeAppLm5uWb77u8flCQ1NdUjjPNqel+AqYIATyHt7e3RIwCYRFwDBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASCIufv4P9jspKQfVXmMuZJ+VuVt6iWrszFXdbI6l5Td2ZirOhc71++4+5WVK6sK8ESYWY+7L67pQSYoq7MxV3WyOpeU3dmYqzq1motLEAAQhAADQJB6BPjROhxjorI6G3NVJ6tzSdmdjbmqU5O5an4NGAAwMi5BAEAQAgwAQWoaYDNbZmZvmFnBzL5Sy2ONMcc2MzthZkdK1s0xs/1m9oP0v1cEzHWNmXWb2etm9pqZrc/QbO83s2+b2avpbJ3p+g+Z2cvpY/p1M7skYLYGM/uumT2flZnSOfrM7LCZHTKznnRdFh7Ly81s
l5kdM7OjZnZ9Rub6aPpvVXz7pZltyMhs96TP+yNmtiN9PUz686xmATazBkmPSPqcpGslrTCza2t1vDE8IWlZxbqvSHrR3T8i6cV0ud4GJf2Vu18r6TpJX07/jbIw268lLXH3j0laJGmZmV0n6e8k/YO7N0v6uaQ1AbOtl3S0ZDkLMxW1uvuiku8ZzcJjuUXSPndfIOljSv7twudy9zfSf6tFkj4p6aykZ6JnM7MmSeskLXb335fUIOkW1eJ55u41eZN0vaR/K1m+T9J9tTreOObJSzpSsvyGpKvS96+S9EbUbCUzPSfphqzNJulSSd+R9IdKfhooN9JjXKdZ5it5US6R9Lwki56pZLY+SXMr1oU+lpIuk/RDpV9wz8pcI8z5p5JeysJskpok/UTSHEm59Hn22Vo8z2p5CaJ4J4qOp+uyYp67/zR9/01J8yKHMbO8pI9LelkZmS39VP+QpBOS9kv6b0m/cPfB9EMiHtOHJf21pPPpcmMGZipySd8ys14zW5uui34sPyTppKSvppdtHjOzWRmYq9Itknak74fO5u79kv5e0o8l/VTS25J6VYPnGV+Ek+TJ/9LCvh/PzGZL+oakDe7+y9JtkbO5+zuefHo4X9KnJC2ImKPIzP5M0gl3742cYxSfcfdPKLns9mUz+6PSjUGPZU7SJyT9s7t/XNIZVXxKn4Hn/yWSPi9pZ+W2iNnSa85fUPI/r6slzdLwS5iTopYB7pd0Tcny/HRdVrxlZldJUvrfExFDmNlvKYnv0+7+zSzNVuTuv5DUreTTrsvNLJduqvdj+mlJnzezPklfU3IZYkvwTEPSMye5+wkl1zI/pfjH8rik4+7+crq8S0mQo+cq9TlJ33H3t9Ll6NmWSvqhu590999I+qaS596kP89qGeBXJH0k/crhJUo+xdhdw+NVa7ektvT9NiXXX+vKzEzS45KOuntXxma70swuT9+fqeTa9FElIf5SxGzufp+7z3f3vJLn00F3XxU5U5GZzTKzDxTfV3JN84iCH0t3f1PST8zso+mqP5H0evRcFVbo3csPUvxsP5Z0nZldmr5Gi/9mk/88q/HF7BslfV/JtcMH6nkhvWKOHUqu5fxGyRnBGiXXDl+U9ANJByTNCZjrM0o+vfqepEPp240Zme0PJH03ne2IpL9J139Y0rclFZR8yvi+oMe0RdLzWZkpneHV9O214vM9I4/lIkk96WP5rKQrsjBXOtssSQOSLitZFz6bpE5Jx9Ln/r9Kel8tnmf8KDIABOGLcAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMKYEM3s2/SU3rxV/0Y2ZrTGz76e/t/hfzOwf0/VXmtk3zOyV9O3TsdMDI+MHMTAlmNkcdz+V/lj0K0p+PeBLSn6vwa8kHZT0qrv/pZltl/RP7v6fZvbbSn5t4O+FDQ9cQG7sDwEyYZ2Z/Xn6/jWSbpX0H+5+SpLMbKek3023L5V0bfJj/JKkD5rZbHc/Xc+BgbEQYGSembUoier17n7WzP5dyc/pX+isdoak69z9f+szITAxXAPGVHCZpJ+n8V2g5M83zZL0x2Z2RforAr9Y8vHfktReXDCzRXWdFhgnAoypYJ+knJkdlfS3kv5Lye9ifUjJb6d6ScmfA3o7/fh1khab2ffM7HVJf1H3iYFx4ItwmLKK13XTM+BnJG1z92ei5wLGizNgTGUd6d+sO6LkD08+GzwPUBXOgAEgCGfAABCEAANAEAIMAEEIMAAEIcAAEOT/ASoUtoMb2LNZAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rB3Wh7jldcl-", + "outputId": "34e01f99-642a-45aa-ebd3-bda160d7be2e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 665 + } + }, + "source": [ + "df_titanic.head(20)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_o
003male22.0107.2500S22.0
111female38.01071.2833C38.0
213female26.0007.9250S26.0
311female35.01053.1000S35.0
403male35.0008.0500S35.0
503male28.0008.4583Q28.0
601male54.00051.8625S54.0
703male2.03121.0750S2.0
813female27.00211.1333S27.0
912female14.01030.0708C14.0
1013female4.01116.7000S4.0
1111female58.00026.5500S28.0
1203male20.0008.0500S20.0
1303male39.01531.2750S39.0
1403female14.0007.8542S14.0
1512female55.00016.0000S28.0
1603male2.04129.1250Q2.0
1712male28.00013.0000S28.0
1803female31.01018.0000S31.0
1913female28.0007.2250C28.0
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0\n", + "5 0 3 male 28.0 0 0 8.4583 Q 28.0\n", + "6 0 1 male 54.0 0 0 51.8625 S 54.0\n", + "7 0 3 male 2.0 3 1 21.0750 S 2.0\n", + "8 1 3 female 27.0 0 2 11.1333 S 27.0\n", + "9 1 2 female 14.0 1 0 30.0708 C 14.0\n", + "10 1 3 female 4.0 1 1 16.7000 S 4.0\n", + "11 1 1 female 58.0 0 0 26.5500 S 28.0\n", + "12 0 3 male 20.0 0 0 8.0500 S 20.0\n", + "13 0 3 male 39.0 1 5 31.2750 S 39.0\n", + "14 0 3 female 14.0 0 0 7.8542 S 14.0\n", + "15 1 2 female 55.0 0 0 16.0000 S 28.0\n", + "16 0 3 male 2.0 4 1 29.1250 Q 2.0\n", + "17 1 2 male 28.0 0 0 13.0000 S 28.0\n", + "18 0 3 female 31.0 1 0 18.0000 S 31.0\n", + "19 1 3 female 28.0 0 0 7.2250 C 28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 324 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x6YRvSf5SRR4" + }, + "source": [ + "### Quem são os outliers de 'age'?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2y9BeUnoSU4W", + "outputId": "85968b30-7903-465b-eddf-27d4604acb58", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "age_outlier = df_titanic[df_titanic['age'] > limite_superior_outliers]\n", + "age_outlier.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_o
1111female58.00026.5500S28.0
1512female55.00016.0000S28.0
3302male66.00010.5000S28.0
5401male65.00161.9792C28.0
9403male59.0007.2500S28.0
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "11 1 1 female 58.0 0 0 26.5500 S 28.0\n", + "15 1 2 female 55.0 0 0 16.0000 S 28.0\n", + "33 0 2 male 66.0 0 0 10.5000 S 28.0\n", + "54 0 1 male 65.0 0 1 61.9792 C 28.0\n", + "94 0 3 male 59.0 0 0 7.2500 S 28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 327 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J0dHnei1TBFc" + }, + "source": [ + "### Tratamento dos outliers da variável 'fare'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i8YM25uKm8g1", + "outputId": "00e04c37-82d8-4aca-c250-36a4a045df2f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_o
003male22.0107.2500S22.0
111female38.01071.2833C38.0
213female26.0007.9250S26.0
311female35.01053.1000S35.0
403male35.0008.0500S35.0
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 328 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4Gnr9AEfTSj6", + "outputId": "40a23226-68bd-4288-dbf2-5fabfd4fd32b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + } + }, + "source": [ + "df_titanic, limite_superior_outliers = trata_outliers(df = df_titanic, coluna = 'fare')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "IQR: 23.0896\n", + "Limite inferior para outlier: -26.724; Limite superior para outliers: 65.6344\n", + "Mediana: 14.4542\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEGCAYAAABbzE8LAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAOr0lEQVR4nO3df2zcdR3H8de7PTbGiuA2XEgBb+SIhAQEaRSQP25jbGMbELP9MQLpRvgRjCkTNApbY0YwGvljiI1RCRJMXJwxamTdBDdW/tKArRtssCGHKXELMCg/tGwZFj7+cZ877q7X33f3bu+ej6TZfb/3ve/3+xnfPvvpd+1hIQQBAGqvyfsEAKBREWAAcEKAAcAJAQYAJwQYAJwkJrLxggULQjKZrNKpAEB96uvreyeEcFbp+gkFOJlMqre3t3JnBQANwMxeL7eeWxAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOJvT/hKukrq4uZTKZ/PLRo0clSa2traO+LpVKqaOjo6rnBgC14BbgTCaj/QcP6ePT5kmSmo9/IEl68+TIp9R8/N2anBsA1IJbgCXp49Pm6cSFKyVJcw7vkqT8cjm5bQCgHnAPGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJzUJMBdXV3q6uqqxaEmbDqfG4D6lqjFQTKZTC0OMynT+dwA1DduQQCAEwIMAE4IMAA4IcAA4
IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4aPsDHjx/XgQMHtGnTJqXTaaXTaXV1deUf5z7a29uVTqe1ZMkSLV68WA899JDS6bSuueYaXXfddcpkMmX3n8lktGrVKvX09Gj58uX5/fX19UmS9u7dq3Q6rZ6enhFfv2zZMqXTae3YsWPE/Wcymfy+Cvc/0rbV1tvbqyVLlujWW2/VwMBA0XMDAwO6++67NTAwMOb4S41nDNu2bVM6ndb27dunNAZA+vRaLvc5NVUWQhj3xm1tbaG3t3fCB9m4caMk6ZFHHila1/evt3TiwpWSpDmHd0lSfrmcOYd36fLzFxbtZ6qWL1+ukydPTnk/yWRSTzzxxLD1GzZsUH9/vxKJhIaGhvLrW1pa1N3draVLl2poaEiJREJ79uwZ8fWSZGbDQpV7PplM6siRI/lj5PY/0rblzrWSVq9ercHBQUnSjTfeqHvuuSf/3NatW7Vjxw7dcMMN2rlz56jjLzWeMaTT6fzjZ599dirDAPLXcrnPqfEys74QQlvp+oaeAWcymYrEV5L6+/uHzcoymUw+noXxlaTBwUE99thj+fVDQ0PD4lr4ekkKIRTNgguf7+/vLzrG4OBg0Vfs0m2rOQvu7e3Nx1eSdu7cmZ8FDwwM6KmnnlIIQd3d3aOOv9R4xrBt27aiZWbBmIrCa7n0c6oSajIDXrt2rU6cOKFUKpVfl8lk9N+Pgj68dJ2k8c2A5+7frtNnWdF+puLw4cMVC7A0fBZcOHsdj9JZYLnXF86Cx9p/4Vfs0m2rOQsunP3m5GbBW7du1a5du4Z9QZKGj7/UeMZQOPvNYRaMySq9lic7C570DNjM7jSzXjPrffvttyd84OmskvGVNCyGE4mvNHyWXO71hV8wx9p/4YUz1XObiNL4StLu3bslSXv27CkbX2n4+EvVcgyANPxaLndtT0VirA1CCI9KelTKzoAnc5DW1lZJ5e8BT8Qnp35GqQreA57oDHUsyWRy2PJEZ8Bjvd7Mxr3/lpaWEbctPddKamlpGXahXnvttZKkpUuXjjoDHk0txwBIw6/lws+pSmjoe8CdnZ1V3d9Y+7/llluKljdv3jzm6++9995x7/+BBx6Y9LlNxZYtW4qWE4mE2tvbJUnr169XU1P2smtubi7arnT8pcYzhjvuuKNo+a677hrXOQPllF7LhZ9TldDQAU6lUpo9e3ZF9pVMJofdm06lUvlZWunsrqWlRbfffnt+fSKR0OLFi0d8vZSd/V5//fVln08mk0XHaGlp0eWXXz7itpW6j15OW1tb0Uxh1apVmj9/viRp/vz5WrFihcxMq1evHnX8pcYzhptvvrloed26dVMZChpc4bVc+jlVCQ0dYEk677zz1NTUpKuuuiq/bs2aNWW3k6SmpiaZmVauzP5jYXNzs+bMmTPijLKzs1Nz587V5s2bi2Kf+0q6adMmSSPP/jo7OzVr1ixJxbPf0v13dnbm91W4/5G2rbYtW7aoqalJixYtys9+c9avX6+LL75Y7e3tY46/1HjGkJsFM/tFJeSu5UrPfiV+DrjsuQFAJfFzwAAwzRBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHCSqMVBUqlULQ4zKdP53ADUt5oEuKOjoxaHmZTpfG4A6hu3IADACQEGACcEGACcE
GAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcJLwPHjz8Xc15/Cu+HhAkvLLI20vLazFqQFA1bkFOJVKFS0fPTokSWptHS2wC4e9DgBmKrcAd3R0eB0aAKYF7gEDgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4MRCCOPf2OxtSa9P8lgLJL0zydfONI00VqmxxttIY5Uaa7zVHOvnQwhnla6cUICnwsx6QwhtNTmYs0Yaq9RY422ksUqNNV6PsXILAgCcEGAAcFLLAD9aw2N5a6SxSo013kYaq9RY4635WGt2DxgAUIxbEADghAADgJOqB9jMVpjZK2aWMbP7qn28WjCzx83smJkdLFg3z8x2m9mr8c/PxvVmZj+J43/RzL7kd+YTZ2bnmlmPmb1sZi+Z2ca4vl7He6qZPW9mL8TxPhDXLzKz5+K4fmtms+L62XE5E59Pep7/ZJhZs5ntM7PuuFyXYzWzfjM7YGb7zaw3rnO9jqsaYDNrlvRTSddJukjSTWZ2UTWPWSNPSFpRsu4+Sc+EEC6Q9ExclrJjvyB+3CnpZzU6x0oZkvStEMJFkq6Q9I3437Bex3tS0pIQwhclXSpphZldIelHkh4OIaQkvSfptrj9bZLei+sfjtvNNBslHSpYruexLg4hXFrw876+13EIoWofkq6U9HTB8v2S7q/mMWv1ISkp6WDB8iuSzo6Pz5b0Snz8C0k3ldtuJn5I+pOkaxthvJJOk/QPSV9R9jekEnF9/rqW9LSkK+PjRNzOvM99AmM8R9nwLJHULcnqeKz9khaUrHO9jqt9C6JV0r8Llo/EdfVoYQjhjfj4TUkL4+O6+TuI33JeJuk51fF447fk+yUdk7Rb0muS3g8hDMVNCseUH298/gNJ82t7xlPyY0nfkfRJXJ6v+h1rkPQXM+szszvjOtfrOFHpHUIKIQQzq6uf7zOzFkm/l/TNEMJ/zCz/XL2NN4TwsaRLzexMSX+UdKHzKVWFma2WdCyE0Gdmae/zqYGrQwhHzexzknab2eHCJz2u42rPgI9KOrdg+Zy4rh69ZWZnS1L881hcP+P/DszsFGXjuy2E8Ie4um7HmxNCeF9Sj7Lfhp9pZrkJS+GY8uONz58haaDGpzpZX5V0g5n1S9qu7G2IR1SfY1UI4Wj885iyX1i/LOfruNoB/rukC+K/qs6StE7Sk1U+ppcnJa2Pj9cre680t749/qvqFZI+KPiWZ9qz7FT3l5IOhRC2FjxVr+M9K858ZWZzlL3ffUjZEK+Nm5WON/f3sFbS3hBvGk53IYT7QwjnhBCSyn5u7g0h3Kw6HKuZzTWz03OPJS2TdFDe13ENbnyvlPRPZe+jbfa+EV+hMf1G0huS/qfsvaHblL0X9oykVyXtkTQvbmvK/iTIa5IOSGrzPv8JjvVqZe+dvShpf/xYWcfjvUTSvjjeg5K+F9efL+l5SRlJv5M0O64/NS5n4vPne49hkuNOS+qu17HGMb0QP17Ktcj7OuZXkQHACb8JBwBOCDAAOCHAAOCEAAOAEwIMAE4IMKY9M7vbzA6Z2TbvcwEqiR9Dw7QXf2V0aQjhyDi2TYRP38cAmNaYAWNaM7OfK/tD9H82s++a2d/ie9f+1cy+ELfZYGZPmtleSc/E33p6PL6v7z4zu9F1EMAImAFj2ovvVdAm6SNJx0MIQ2a2VNLXQwhrzGyDpO9LuiSE8K6Z/UDSyyGEX8dfK35e0mUhhA+dhgCUxbuhYSY5Q9KvzOwCZX89+
pSC53aHEN6Nj5cp+yYz347Lp0o6T8VvOg64I8CYSR6U1BNC+Fp8b+JnC54rnN2apDUhhFdqd2rAxHEPGDPJGfr0LQE3jLLd05I64ju5ycwuq/J5AZNCgDGTPCTph2a2T6N/9/agsrcnXjSzl+IyMO3wj3AA4IQZMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgJP/A44KX5vXXCReAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uh7f7nNATSkT" + }, + "source": [ + "### Quem são os outliers de 'fare'?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BdzaUaD0nQnv", + "outputId": "c05a1d46-c91a-4542-92a8-bfb121b44174", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "limite_superior_outliers" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "65.6344" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 330 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P3SAGnYnnQn4", + "outputId": "424347d6-d243-48df-8c3d-4c2a014c0cd9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "fare_outlier = df_titanic[df_titanic['fare'] > limite_superior_outliers]\n", + "fare_outlier.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_ofare_o
111female38.01071.2833C38.014.4542
2701male19.032263.0000S19.014.4542
3111female28.010146.5208C28.014.4542
3401male28.01082.1708C28.014.4542
5211female49.01076.7292C49.014.4542
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age ... fare embarked2 age_o fare_o\n", + "1 1 1 female 38.0 ... 71.2833 C 38.0 14.4542\n", + "27 0 1 male 19.0 ... 263.0000 S 19.0 14.4542\n", + "31 1 1 female 28.0 ... 146.5208 C 28.0 14.4542\n", + "34 0 1 male 28.0 ... 82.1708 C 28.0 14.4542\n", + "52 1 1 female 49.0 ... 76.7292 C 49.0 14.4542\n", + "\n", + "[5 rows x 10 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 331 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Jh83WTrZDeM_" + }, + "source": [ + "### Binning variáveis numéricas: age e fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JVNVCd7aDjkz", + "outputId": "21f91b71-cfe5-445b-cf2c-8f4bf0de9825", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "#df_titanic_copia = df_titanic.copy()\n", + "df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_o
003male22.0107.2500S22.0
111female38.01071.2833C38.0
213female26.0007.9250S26.0
311female35.01053.1000S35.0
403male35.0008.0500S35.0
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 332 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pUspmjPWFP06" + }, + "source": [ + "#### Usando cut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NBVoCBe_2Zmp" + }, + "source": [ + "#df_titanic['age_bins'] = pd.cut(x = df_titanic['age_o'], bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90])\n", + "df_titanic['age_bins'] = pd.cut(x = df_titanic['age_o'], bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90], labels = [10, 20, 30, 40, 50, 60, 70, 80, 90])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2i1jombNDrEO", + "outputId": "7c96358b-3023-4706-813a-3e3e594cc45e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 264 + } + }, + "source": [ + "df_titanic.head(7)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_oage_bins
003male22.0107.2500S22.030
111female38.01071.2833C38.040
213female26.0007.9250S26.030
311female35.01053.1000S35.040
403male35.0008.0500S35.040
503male28.0008.4583Q28.030
601male54.00051.8625S54.060
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age ... fare embarked2 age_o age_bins\n", + "0 0 3 male 22.0 ... 7.2500 S 22.0 30\n", + "1 1 1 female 38.0 ... 71.2833 C 38.0 40\n", + "2 1 3 female 26.0 ... 7.9250 S 26.0 30\n", + "3 1 1 female 35.0 ... 53.1000 S 35.0 40\n", + "4 0 3 male 35.0 ... 8.0500 S 35.0 40\n", + "5 0 3 male 28.0 ... 8.4583 Q 28.0 30\n", + "6 0 1 male 54.0 ... 51.8625 S 54.0 60\n", + "\n", + "[7 rows x 10 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 340 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "davIt0UT9tTr" + }, + "source": [ + "#### **Desafio**: Qual seria o corte ótimo para 'age' usando DecisionTree?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i5aAYl2ZDu1f", + "outputId": "0f1d7a99-6cb0-4484-b12b-7907b70feb8a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['age_bins_cut1'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20, 30] 449\n", + "(30, 40] 155\n", + "(10, 20] 115\n", + "(40, 50] 86\n", + "(0, 10] 64\n", + "(50, 60] 22\n", + "(80, 90] 0\n", + "(70, 80] 0\n", + "(60, 70] 0\n", + "Name: age_bins_cut1, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 276 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VAUshOiLFT9-" + }, + "source": [ + "#### Usando qcut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RKnb-bI7FL3F", + "outputId": "59742803-dcee-4525-8fd1-2cecff379cab", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic['age_bins_qcut'] = pd.qcut(x = df_titanic['age'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcut
003male22.0107.2500S22.07.2500(20, 30]302
111female38.01071.2833C38.014.4542(30, 40]403
213female26.0007.9250S26.07.9250(20, 30]302
311female35.01053.1000S35.053.1000(30, 40]403
403male35.0008.0500S35.08.0500(30, 40]403
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 2\n", + "1 1 1 female ... (30, 40] 40 3\n", + "2 1 3 female ... (20, 30] 30 2\n", + "3 1 1 female ... (30, 40] 40 3\n", + "4 0 3 male ... (30, 40] 40 3\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 277 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "boSGroSYN7cP", + "outputId": "9304540f-1a20-41cb-c4a4-70633be1a071", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.dtypes" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived int64\n", + "pclass int64\n", + "sex object\n", + "age float64\n", + "sibsp int64\n", + "parch int64\n", + "fare float64\n", + "embarked2 object\n", + "age_o float64\n", + "age_bins category\n", + "dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 344 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P8s3LzfpNdUz", + "outputId": "b0e4b638-32de-4295-984e-ba207e195661", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "l_colunas_numericas = list(df_titanic.select_dtypes('int').columns)\n", + "l_colunas_numericas\n", + "\n", + "for coluna in l_colunas_numericas:\n", + " trata_outliers(df, coluna)\n", + " trata_missing_values(df, coluna)\n", + " aplica_MMS(df, coluna)\n", + " aplica_RS(df, coluna)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['survived', 'pclass', 'sibsp', 'parch']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 346 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ov2_l39mn3FH", + "outputId": "122869f8-018f-4176-b2df-c249efa222b7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + 
"df_titanic['age_bins_qcut'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20.0, 28.0] 360\n", + "(0.419, 20.0] 179\n", + "(38.0, 80.0] 177\n", + "(28.0, 38.0] 175\n", + "Name: age_bins_qcut, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 261 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J60XwHUOGwbr" + }, + "source": [ + "### MinMaxScaler() e RobustScaler()\n", + "* age e fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GRY84U4HHxoQ" + }, + "source": [ + "from sklearn.preprocessing import MinMaxScaler, RobustScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IQC7Bo-DH71s" + }, + "source": [ + "mms = MinMaxScaler()\n", + "rs = RobustScaler()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8O2oM9XdIYF5", + "outputId": "9812b76b-dffc-406b-fcd9-0a7c1b01a61b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcut
003male22.0107.2500S22.07.2500(20, 30]30(20.0, 28.0]
111female38.01071.2833C38.014.4542(30, 40]40(28.0, 38.0]
213female26.0007.9250S26.07.9250(20, 30]30(20.0, 28.0]
311female35.01053.1000S35.053.1000(30, 40]40(28.0, 38.0]
403male35.0008.0500S35.08.0500(30, 40]40(28.0, 38.0]
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 (20.0, 28.0]\n", + "1 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "2 1 3 female ... (20, 30] 30 (20.0, 28.0]\n", + "3 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "4 0 3 male ... (30, 40] 40 (28.0, 38.0]\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 264 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B-qglHy6NZlg", + "outputId": "96a154a8-678e-48eb-93b2-ce93b7e0258f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcut
003male22.0107.2500S22.07.2500(20, 30]30(20.0, 28.0]
111female38.01071.2833C38.014.4542(30, 40]40(28.0, 38.0]
213female26.0007.9250S26.07.9250(20, 30]30(20.0, 28.0]
311female35.01053.1000S35.053.1000(30, 40]40(28.0, 38.0]
403male35.0008.0500S35.08.0500(30, 40]40(28.0, 38.0]
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 (20.0, 28.0]\n", + "1 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "2 1 3 female ... (20, 30] 30 (20.0, 28.0]\n", + "3 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "4 0 3 male ... (30, 40] 40 (28.0, 38.0]\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 265 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dp9jYZ1OoA9i" + }, + "source": [ + "A seguir, deletar as variáveis que desnecessárias..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zSSViPY5XokW" + }, + "source": [ + "df_titanic.drop(columns = ['age', 'fare'], axis = 1, inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MNq5a0eUIBGV", + "outputId": "d692f608-f511-46e6-d101-bd9443647b94", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 282 + } + }, + "source": [ + "# fit\n", + "df_titanic_mms = pd.DataFrame(mms.fit_transform(df_titanic[['age_o', 'fare_o']]), columns = ['age_mms', 'fare_mms'])\n", + "df_titanic_rs = pd.DataFrame(rs.fit_transform(df_titanic[['age_o', 'fare_o']]), columns = ['age_rs', 'fare_rs'])\n", + "\n", + "df_titanic['age_mms'] = df_titanic_mms['age_mms']\n", + "df_titanic['age_rs'] = df_titanic_rs['age_rs']\n", + "\n", + "df_titanic['fare_mms'] = df_titanic_mms['fare_mms']\n", + "df_titanic['fare_rs'] = df_titanic_rs['fare_rs']\n", + "\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcutage_mmsage_rsfare_mmsfare_rsage_bins_mmsage_bins_rsfare_bins_mmsfare_bins_rs
003male10S22.07.2500(20, 30]30(20.0, 28.0]0.402762-0.5454550.111538-0.443619(0.365, 0.515](-0.727, 0.0](-0.001, 0.121](-0.891, -0.406]
111female10C38.014.4542(30, 40]40(28.0, 38.0]0.7013810.9090910.2223720.000000(0.645, 1.0](0.636, 2.364](0.162, 0.222](-0.243, 0.0]
213female00S26.07.9250(20, 30]30(20.0, 28.0]0.477417-0.1818180.121923-0.402054(0.365, 0.515](-0.727, 0.0](0.121, 0.162](-0.406, -0.243]
311female10S35.053.1000(30, 40]40(28.0, 38.0]0.6453900.6363640.8169232.379726(0.515, 0.645](0.0, 0.636](0.404, 1.0](0.726, 3.113]
403male00S35.08.0500(30, 40]40(28.0, 38.0]0.6453900.6363640.123846-0.394357(0.515, 0.645](0.0, 0.636](0.121, 0.162](-0.406, -0.243]
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_rs fare_bins_mms fare_bins_rs\n", + "0 0 3 male ... (-0.727, 0.0] (-0.001, 0.121] (-0.891, -0.406]\n", + "1 1 1 female ... (0.636, 2.364] (0.162, 0.222] (-0.243, 0.0]\n", + "2 1 3 female ... (-0.727, 0.0] (0.121, 0.162] (-0.406, -0.243]\n", + "3 1 1 female ... (0.0, 0.636] (0.404, 1.0] (0.726, 3.113]\n", + "4 0 3 male ... (0.0, 0.636] (0.121, 0.162] (-0.406, -0.243]\n", + "\n", + "[5 rows x 19 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 268 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UzrdPNO3rIg5", + "outputId": "aaa31937-081d-4af8-f3ec-07002886d2a6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 555 + } + }, + "source": [ + "# Categorizando as variáveis transformadas\n", + "df_titanic['age_bins_mms'] = pd.qcut(x = df_titanic['age_mms'], q = 5, duplicates = 'drop', labels = [1, 2, 3, 4])\n", + "df_titanic['age_bins_rs'] = pd.qcut(x = df_titanic['age_rs'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "\n", + "df_titanic['fare_bins_mms'] = pd.qcut(x = df_titanic['fare_mms'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "df_titanic['fare_bins_rs'] = pd.qcut(x = df_titanic['fare_rs'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2894\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2895\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2896\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'age_mms'", + "\nThe above exception was the direct cause of the following exception:\n", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Categorizando as variáveis transformadas\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_bins_mms'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_mms'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m'drop'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_bins_rs'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_rs'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'drop'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'fare_bins_mms'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'fare_mms'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m'drop'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2900\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2901\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2902\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2903\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2904\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2895\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2896\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2897\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2898\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2899\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'age_mms'" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7smfXya5pmNq", + "outputId": "a942223f-e9b6-4758-c453-73c5c55f91bd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.drop(columns = ['age_o', 'fare_o', 'age_bins_cut2', 'age_mms', 'age_rs', 'fare_mms', 'fare_rs'], axis = 1, inplace = True)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchambarked2age_bins_cut1age_bins_qcutage_bins_mmsage_bins_rsfare_bins_mmsfare_bins_rs
003male10S(20, 30](20.0, 28.0](0.365, 0.515](-0.727, 0.0](-0.001, 0.121](-0.891, -0.406]
111female10C(30, 40](28.0, 38.0](0.645, 1.0](0.636, 2.364](0.162, 0.222](-0.243, 0.0]
213female00S(20, 30](20.0, 28.0](0.365, 0.515](-0.727, 0.0](0.121, 0.162](-0.406, -0.243]
311female10S(30, 40](28.0, 38.0](0.515, 0.645](0.0, 0.636](0.404, 1.0](0.726, 3.113]
403male00S(30, 40](28.0, 38.0](0.515, 0.645](0.0, 0.636](0.121, 0.162](-0.406, -0.243]
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_rs fare_bins_mms fare_bins_rs\n", + "0 0 3 male ... (-0.727, 0.0] (-0.001, 0.121] (-0.891, -0.406]\n", + "1 1 1 female ... (0.636, 2.364] (0.162, 0.222] (-0.243, 0.0]\n", + "2 1 3 female ... (-0.727, 0.0] (0.121, 0.162] (-0.406, -0.243]\n", + "3 1 1 female ... (0.0, 0.636] (0.404, 1.0] (0.726, 3.113]\n", + "4 0 3 male ... (0.0, 0.636] (0.121, 0.162] (-0.406, -0.243]\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 269 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SFPNLDMcU339" + }, + "source": [ + "### Variáveis Dummy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L_Fx1iy7snjF", + "outputId": "24c70d23-5a35-41aa-8a30-04fcc146b7c6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcut
003male22.0107.2500S22.07.2500(20, 30]302
111female38.01071.2833C38.014.4542(30, 40]403
213female26.0007.9250S26.07.9250(20, 30]302
311female35.01053.1000S35.053.1000(30, 40]403
403male35.0008.0500S35.08.0500(30, 40]403
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 2\n", + "1 1 1 female ... (30, 40] 40 3\n", + "2 1 3 female ... (20, 30] 30 2\n", + "3 1 1 female ... (30, 40] 40 3\n", + "4 0 3 male ... (30, 40] 40 3\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 279 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X6aHaJodX0Hi", + "outputId": "f7a26db1-81d3-47f5-dcd3-bbde2b2b6440", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 402 + } + }, + "source": [ + "dummy = pd.get_dummies(df_titanic[['sex', 'ambarked2']])\n", + "dummy" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sex_femalesex_maleambarked2_Cambarked2_Qambarked2_S
001001
110100
210001
310001
401001
..................
88601001
88710001
88810001
88901100
89001010
\n", + "

891 rows × 5 columns

\n", + "
" + ], + "text/plain": [ + " sex_female sex_male ambarked2_C ambarked2_Q ambarked2_S\n", + "0 0 1 0 0 1\n", + "1 1 0 1 0 0\n", + "2 1 0 0 0 1\n", + "3 1 0 0 0 1\n", + "4 0 1 0 0 1\n", + ".. ... ... ... ... ...\n", + "886 0 1 0 0 1\n", + "887 1 0 0 0 1\n", + "888 1 0 0 0 1\n", + "889 0 1 1 0 0\n", + "890 0 1 0 1 0\n", + "\n", + "[891 rows x 5 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 282 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZhLW0lEbs28E", + "outputId": "c772c290-4394-409a-ded7-1eb57b7ac0db", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 215 + } + }, + "source": [ + "df_titanic2 = pd.concat([df_titanic, dummy], axis = 1)\n", + "df_titanic2.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareambarked2age_ofare_oage_bins_cut1age_bins_cut2age_bins_qcutsex_femalesex_maleambarked2_Cambarked2_Qambarked2_S
003male22.0107.2500S22.07.2500(20, 30]30201001
111female38.01071.2833C38.014.4542(30, 40]40310100
213female26.0007.9250S26.07.9250(20, 30]30210001
311female35.01053.1000S35.053.1000(30, 40]40310001
403male35.0008.0500S35.08.0500(30, 40]40301001
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... ambarked2_C ambarked2_Q ambarked2_S\n", + "0 0 3 male ... 0 0 1\n", + "1 1 1 female ... 1 0 0\n", + "2 1 3 female ... 0 0 1\n", + "3 1 1 female ... 0 0 1\n", + "4 0 3 male ... 0 0 1\n", + "\n", + "[5 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 283 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_bOYD4gWwGt", + "outputId": "b33c5076-6d0b-4c8f-8e5b-d87cb3e4544d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['pclass'].value_counts() # Quem será a referência?" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "3.0 484\n", + "1.0 189\n", + "2.0 176\n", + "Name: pclass, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 64 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xFKdsFDihApP" + }, + "source": [ + "df_titanic['pclass'] = df_titanic['pclass'].astype('category')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D_mWCqM1ZOgU" + }, + "source": [ + "### Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UPnCuCsLZSjQ", + "outputId": "95b92d55-a895-4c2e-9655-07a3bd753f6d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 162 + } + }, + "source": [ + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mX_treinamento\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mX_teste\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_treinamento\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_teste\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrain_test_split\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 'train_test_split' is not defined" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rk-Zuh5RXJbp", + "outputId": "47c3c005-795d-406d-984a-b0094cd5718c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchambarked2age3fare3age_bins_cut1age_bins_cut2age_bins_qcutage_mmsage_rsfare_mmsfare_rs
00.03.0male1.00.0S22.07.2500(20, 30]30(20.0, 28.0]0.402762-0.5454550.014151-0.323505
11.01.0female1.00.0C38.071.2833(30, 40]40(36.0, 54.0]0.7013810.9090910.1391362.696934
21.03.0female0.00.0S26.07.9250(20, 30]30(20.0, 28.0]0.477417-0.1818180.015469-0.291665
31.01.0female1.00.0S35.053.1000(30, 40]40(28.0, 36.0]0.6453900.6363640.1036441.839231
40.03.0male0.00.0S35.08.0500(30, 40]40(28.0, 36.0]0.6453900.6363640.015713-0.285769
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp ... age_mms age_rs fare_mms fare_rs\n", + "0 0.0 3.0 male 1.0 ... 0.402762 -0.545455 0.014151 -0.323505\n", + "1 1.0 1.0 female 1.0 ... 0.701381 0.909091 0.139136 2.696934\n", + "2 1.0 3.0 female 0.0 ... 0.477417 -0.181818 0.015469 -0.291665\n", + "3 1.0 1.0 female 1.0 ... 0.645390 0.636364 0.103644 1.839231\n", + "4 0.0 3.0 male 0.0 ... 0.645390 0.636364 0.015713 -0.285769\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 69 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XtdQc49LXTfk", + "outputId": "696c23be-e9d0-4ba9-b4f5-0dcba0ffe8ec", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['pclass'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "3.0 484\n", + "1.0 189\n", + "2.0 176\n", + "Name: pclass, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 74 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fbvB30S5hRxH", + "outputId": "e872fe50-fcad-4b33-9456-182b1ca7e62c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "modelo = smf.glm(formula = 'survived ~ age3 + pclass + sex', data = df_titanic, family = sm.families.Binomial()).fit()\n", + "print(modelo.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " Generalized Linear Model Regression Results \n", + "==============================================================================\n", + "Dep. Variable: survived No. Observations: 849\n", + "Model: GLM Df Residuals: 844\n", + "Model Family: Binomial Df Model: 4\n", + "Link Function: logit Scale: 1.0000\n", + "Method: IRLS Log-Likelihood: -386.42\n", + "Date: Thu, 29 Oct 2020 Deviance: 772.85\n", + "Time: 17:17:09 Pearson chi2: 890.\n", + "No. 
Iterations: 5 \n", + "Covariance Type: nonrobust \n", + "=================================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "---------------------------------------------------------------------------------\n", + "Intercept 3.5515 0.382 9.297 0.000 2.803 4.300\n", + "pclass[T.2.0] -1.1389 0.264 -4.315 0.000 -1.656 -0.622\n", + "pclass[T.3.0] -2.3581 0.245 -9.613 0.000 -2.839 -1.877\n", + "sex[T.male] -2.5618 0.189 -13.522 0.000 -2.933 -2.191\n", + "age3 -0.0344 0.009 -4.035 0.000 -0.051 -0.018\n", + "=================================================================================\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p7_gfQXciFs1" + }, + "source": [ + "Which coefficients are statistically significant (p-value below 0.05, at the 95% confidence level)?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xtrh_bYNikTk" + }, + "source": [ + "### Interpreting the coefficients:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FzlGDyeLgL11" + }, + "source": [ + "* Passengers travelling in second class had lower odds of survival than those in first class.\n", + "* Passengers in third class had lower odds still.\n", + "* Men had lower odds of survival than women.\n", + "* The older the passenger, the lower the odds of survival." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CJNgEYY9ioVM" + }, + "source": [ + "### More interpretable coefficients - relative odds of survival (odds ratios)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q0vLh1v3irCz" + }, + "source": [ + "# Exponentiate the coefficients (skipping the intercept) to obtain odds ratios\n", + "print(np.exp(modelo.params[1:]))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a2fJIOOzi3VF" + }, + "source": [ + "* Second-class passengers had about 0.32 times the odds of survival of first-class passengers (exp(-1.1389)).\n", + "* Third-class passengers had about 0.095 times the odds of first-class passengers (exp(-2.3581)).\n", + "* Men had about 0.077 times the odds of women (exp(-2.5618))." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dYRkdqNujHFA" + }, + "source": [ + "### Comparing with linear regression" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mKW-aODfjLbm" + }, + "source": [ + "# Percentage change in the odds relative to the reference category\n", + "(np.exp(modelo.params[1:]) - 1) * 100" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ODOqZpAgjQ2q" + }, + "source": [ + "* Second-class passengers have about 68% lower odds of survival than first-class passengers.\n", + "* Third-class passengers have about 90.5% lower odds of survival than first-class passengers.\n", + "* Men have about 92% lower odds of survival than women.\n", + "* Each additional year of age lowers the odds by about 3.4%." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fxnutoD7jp94" + }, + "source": [ + "### Qualidade do modelo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-oW8Kg5Ij3Av", + "outputId": "d56cd7ce-e5c0-493d-ccfe-27cd5a6899c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 354 + } + }, + "source": [ + "modelo2 = LogisticRegression(penalty='none', solver='newton-cg')\n", + "df_titanic2 = df_titanic[['Survived', 'Pclass', 'Sex', 'Age']].dropna()\n", + "y = df_titanic2['Survived']\n", + "X = pd.get_dummies(df_titanic2[['Pclass', 'Sex', 'Age']], drop_first=True)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mmodelo2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mLogisticRegression\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpenalty\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'none'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msolver\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'newton-cg'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf_titanic2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Survived'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Pclass'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Sex'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Age'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdropna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mdf_titanic2\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Survived'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf_titanic2\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Pclass'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Sex'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Age'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdrop_first\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2906\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_iterator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2907\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2908\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_listlike_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mraise_missing\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2909\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2910\u001b[0m 
\u001b[0;31m# take() does not accept boolean indexers\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py\u001b[0m in \u001b[0;36m_get_listlike_indexer\u001b[0;34m(self, key, axis, raise_missing)\u001b[0m\n\u001b[1;32m 1252\u001b[0m \u001b[0mkeyarr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindexer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnew_indexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0max\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reindex_non_unique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkeyarr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1253\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1254\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_validate_read_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkeyarr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindexer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mraise_missing\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mraise_missing\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1255\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mkeyarr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindexer\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1256\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py\u001b[0m in \u001b[0;36m_validate_read_indexer\u001b[0;34m(self, key, indexer, axis, raise_missing)\u001b[0m\n\u001b[1;32m 1296\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmissing\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1297\u001b[0m \u001b[0maxis_name\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_axis_name\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1298\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf\"None of [{key}] are in the [{axis_name}]\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1299\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1300\u001b[0m \u001b[0;31m# We (temporarily) allow for some missing keys with .loc, except in\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: \"None of [Index(['Survived', 'Pclass', 'Sex', 'Age'], dtype='object')] are in the [columns]\"" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YK9nIiz_kQQl" + }, + "source": [ + "modelo2.fit(X, y)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FPhHe4SmkZoE" + }, + "source": [ + "# Predicted class probabilities, one column per class\n", + "y_pred = modelo2.predict_proba(X)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dPrXU0GSknGJ" + }, + "source": [ + "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n", + "\n", + "confusion_matrix(y, modelo2.predict(X))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_-Wweiq7kruH" + }, + "source": [ + "acuracia = accuracy_score(y, modelo2.predict(X))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bqcT8XYJkuwH" + }, + "source": [ + "print(classification_report(y, modelo2.predict(X)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4fSN-vLOjseh" + }, + "source": [ + "confusion_matrix(y, modelo2.predict(X)) # using the sklearn function" + ], + "execution_count": null, + "outputs": [] + }, 
+ { + "cell_type": "code", + "metadata": { + "id": "A0408gbPkywR" + }, + "source": [ + "from sklearn.metrics import roc_auc_score, roc_curve\n", + "\n", + "def plot_roc_curve(y_true, y_score, figsize=(10,6)):\n", + "    # y_score must be a 1-D array with the scores of the positive class\n", + "    fpr, tpr, _ = roc_curve(y_true, y_score)\n", + "    plt.figure(figsize=figsize)\n", + "    auc_value = roc_auc_score(y_true, y_score)\n", + "    plt.plot(fpr, tpr, color='orange', label='ROC curve (area = %0.2f)' % auc_value)\n", + "    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')\n", + "    plt.xlabel('False Positive Rate')\n", + "    plt.ylabel('True Positive Rate')\n", + "    plt.title('Receiver Operating Characteristic (ROC) Curve')\n", + "    plt.legend()\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5T4P90hQk1ug" + }, + "source": [ + "# predict_proba returns one column per class; column 1 holds the positive class\n", + "plot_roc_curve(y, y_pred[:, 1])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CANyMIgIjgSb" + }, + "source": [ + "### Predictions" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pWJgVcQRlESq" + }, + "source": [ + "eu = pd.DataFrame({'Age':32, 'Pclass_2':0, 'Pclass_3':1, 'Sex_male':1}, index=[0])\n", + "minha_prob = modelo2.predict_proba(eu)\n", + "print(f'I would have had a {minha_prob[0, 1] * 100:.2f}% probability of surviving the Titanic')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kgpdgkgrlJ-w" + }, + "source": [ + "I would have had a 7.52% probability of surviving the Titanic" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "91GShU9ClMiY" + }, + "source": [ + "coleguinha = pd.DataFrame({'Age':32, 'Pclass_2':0, 'Pclass_3':0, 'Sex_male':1}, index=[0])\n", + "prob_do_coleguinha = modelo2.predict_proba(coleguinha)\n", + "print(f'My classmate would have had a {prob_do_coleguinha[0, 1] * 100:.2f}% probability of surviving the Titanic')" + ], + "execution_count": null, + "outputs": [] + }, + { 
+ "cell_type": "markdown", + "metadata": { + "id": "c2EHn8volOil" + }, + "source": [ + "My classmate would have had a 51.77% probability of surviving the Titanic" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C2PvJoZQlH6u" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XwuMfMD1gFyd" + }, + "source": [ + "# Example 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kFY0TQVgOlvT" + }, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "efF3st3sHxPG" + }, + "source": [ + "# Load the libraries\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import statsmodels.api as sm\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "# Print NumPy floats with two decimal places\n", + "np.set_printoptions(formatter = {'float': lambda x: \"{0:0.2f}\".format(x)})\n", + "\n", + "%matplotlib inline" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bk9F6JO0IELv" + }, + "source": [ + "# Load the diabetes dataset into a DataFrame\n", + "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/diabetes.csv'\n", + "diabetes = pd.read_csv(url)\n", + "diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tjRmpaPIDknb" + }, + "source": [ + "# Define the feature matrix X and the target vector y\n", + "X_diabetes = diabetes.drop(columns = ['Outcome'])\n", + "y_diabetes = diabetes['Outcome']\n", + "\n", + "X_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jLrx69TH-Mad" + }, + "source": [ + "X_diabetes.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", 
+ "metadata": { + "id": "mdFBioP6-Ply" + }, + "source": [ + "y_diabetes.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fhLySN65IaDF" + }, + "source": [ + "# Split the data into training and test sets\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_diabetes, y_diabetes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J5R8HlnuIGpL" + }, + "source": [ + "# Using statsmodels:\n", + "X_treinamento_sm = sm.add_constant(X_treinamento) # add the intercept column\n", + "lr_sm = sm.Logit(y_treinamento, X_treinamento_sm) # Note the argument order: (y, X)\n", + "\n", + "# Train the model\n", + "resultado_sm = lr_sm.fit()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GlbCaPp1ETNa" + }, + "source": [ + "resultado_sm.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-FJaSnJLKICU" + }, + "source": [ + "# MSE - mean squared error between predicted probabilities and observed outcomes (test set)\n", + "X_teste_sm = sm.add_constant(X_teste, has_constant = 'add')\n", + "np.mean((resultado_sm.predict(X_teste_sm) - y_teste) ** 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bVEUSTUPzOj" + }, + "source": [ + "### Compute y_pred - the predicted values of y" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OjGrNhTNLcr-" + }, + "source": [ + "y_pred = resultado_sm.predict(X_treinamento_sm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vfS5RCx_VnGT" + }, + "source": [ + "# Pair each observed training outcome with its fitted probability\n", + "compara = list(zip(np.array(y_treinamento), resultado_sm.predict()))\n", + "compara[0:30]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pUxasncIFaw4" + }, + "source": [ + "resultado_sm.pred_table()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", 
+ "metadata": { + "id": "_liLYinwFgch" + }, + "source": [ + "confusion_matrix = pd.DataFrame(resultado_sm.pred_table())\n", + "confusion_matrix.columns = ['Predicted No Diabetes', 'Predicted Diabetes']\n", + "confusion_matrix = confusion_matrix.rename(index = {0 : 'Actual No Diabetes', 1 : 'Actual Diabetes'})\n", + "confusion_matrix" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ceH3MODWFm7S" + }, + "source": [ + "cm = np.array(confusion_matrix)\n", + "training_accuracy = (cm[0,0] + cm[1,1])/ cm.sum()\n", + "training_accuracy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CH_iEuzhO109" + }, + "source": [ + "# Exercise 1 - Mall_Customers.csv\n", + "> The target variable of this dataframe is 'Annual Income'. Build a regression model using OLS, Ridge and LASSO and compare the results.\n", + "\n", + "* Try:\n", + "  * Lasso(alpha = 0.01, max_iter = 1000000);\n", + "  * Lasso(alpha = 0.0001, max_iter = 1000000);\n", + "  * Ridge(alpha = 0.01);\n", + "  * Ridge(alpha = 100);" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZfRDEaaRYxFQ" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn import preprocessing\n", + "import matplotlib.pyplot as plt \n", + "plt.rc(\"font\", size=14)\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.model_selection import train_test_split\n", + "import seaborn as sns\n", + "sns.set(style=\"white\")\n", + "sns.set(style=\"whitegrid\", color_codes=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nulrLzUqYxFY" + }, + "source": [ + "## Data\n", + "\n", + "The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4LdrQCwxYxFY" + }, + "source": [ + "This dataset provides the customer information. It includes 41188 records and 21 fields." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qoT6zkoFYxFZ", + "outputId": "b04874af-bf4d-409f-cd1c-ad8c473004e6", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# The file is semicolon-delimited; without sep=';' every row is read into a single column,\n", + "# which is what the saved output below shows\n", + "df_bank = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/bank-full.csv', sep = ';', header = 0)\n", + "df_bank = df_bank.dropna()\n", + "print(df_bank.shape)\n", + "print(list(df_bank.columns))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "(45211, 1)\n", + "['age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"']\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZD23hMCeYxFc", + "outputId": "f347c846-5f92-4e4f-b468-2bfbc608777c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_bank.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"
058;\"management\";\"married\";\"tertiary\";\"no\";2143...
144;\"technician\";\"single\";\"secondary\";\"no\";29;\"...
233;\"entrepreneur\";\"married\";\"secondary\";\"no\";2...
347;\"blue-collar\";\"married\";\"unknown\";\"no\";1506...
433;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n...
\n", + "
" + ], + "text/plain": [ + " age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"\n", + "0 58;\"management\";\"married\";\"tertiary\";\"no\";2143... \n", + "1 44;\"technician\";\"single\";\"secondary\";\"no\";29;\"... \n", + "2 33;\"entrepreneur\";\"married\";\"secondary\";\"no\";2... \n", + "3 47;\"blue-collar\";\"married\";\"unknown\";\"no\";1506... \n", + "4 33;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n... " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 285 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CtGbim_EYxFh" + }, + "source": [ + "#### Input variables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0pJ7ai5ZYxFh" + }, + "source": [ + "1 - age (numeric)\n", + "\n", + "2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')\n", + "\n", + "3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)\n", + "\n", + "4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')\n", + "\n", + "5 - default: has credit in default? (categorical: 'no','yes','unknown')\n", + "\n", + "6 - housing: has housing loan? (categorical: 'no','yes','unknown')\n", + "\n", + "7 - loan: has personal loan? (categorical: 'no','yes','unknown')\n", + "\n", + "8 - contact: contact communication type (categorical: 'cellular','telephone')\n", + "\n", + "9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')\n", + "\n", + "10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')\n", + "\n", + "11 - duration: last contact duration, in seconds (numeric). 
Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.\n", + "\n", + "12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n", + "\n", + "13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)\n", + "\n", + "14 - previous: number of contacts performed before this campaign and for this client (numeric)\n", + "\n", + "15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')\n", + "\n", + "16 - emp.var.rate: employment variation rate - (numeric)\n", + "\n", + "17 - cons.price.idx: consumer price index - (numeric)\n", + "\n", + "18 - cons.conf.idx: consumer confidence index - (numeric) \n", + "\n", + "19 - euribor3m: euribor 3 month rate - (numeric)\n", + "\n", + "20 - nr.employed: number of employees - (numeric)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YwsaBV_OYxFi" + }, + "source": [ + "#### Predict variable (desired target):\n", + "\n", + "y - has the client subscribed a term deposit? (binary: '1','0')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2SsNWV_SYxFj" + }, + "source": [ + "The education column of the dataset has many categories and we need to reduce the categories for a better modelling. 
The education column has the following categories:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6TFbgh3vYxFk" + }, + "source": [ + "df_bank['education'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "luv7Bdf_YxFn" + }, + "source": [ + "Let us group \"basic.4y\", \"basic.6y\" and \"basic.9y\" together and call them \"Basic\"." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gkOlUOs2YxFn" + }, + "source": [ + "# Collapse the three basic-education levels into a single 'Basic' category\n", + "df_bank['education'] = df_bank['education'].replace(['basic.4y', 'basic.6y', 'basic.9y'], 'Basic')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H-X1WMv2YxFq" + }, + "source": [ + "After grouping, these are the categories:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r9LlgpkjYxFq" + }, + "source": [ + "df_bank['education'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fcnJy3KYYxFt" + }, + "source": [ + "### Data exploration" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qUrTMR8BYxFt" + }, + "source": [ + "df_bank['y'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rpzHnzJKYxFx" + }, + "source": [ + "sns.countplot(x='y',data=df_bank, palette='hls')\n", + "# Save before plt.show(); saving afterwards can produce an empty image\n", + "plt.savefig('count_plot')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C99nOe3mYxF0" + }, + "source": [ + "There are 36548 no's and 4640 yes's in the outcome variable."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8nGaox_kYxF1" + }, + "source": [ + "Let's get a sense of the numbers across the two classes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sQvzA60bYxF1" + }, + "source": [ + "# numeric_only=True restricts the mean to the numeric columns\n", + "df_bank.groupby('y').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u3xjoceKYxF3" + }, + "source": [ + "Observations:\n", + "\n", + "* The average age of customers who bought the term deposit is higher than that of the customers who didn't.\n", + "* The pdays value (days since the customer was last contacted) is understandably lower for the customers who bought it. The lower the pdays, the better the memory of the last call and hence the better the chances of a sale.\n", + "* Surprisingly, the campaign value (number of contacts or calls made during the current campaign) is lower for customers who bought the term deposit." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jvzGMePPYxF4" + }, + "source": [ + "We can calculate the means grouped by other categorical variables, such as job, marital status and education, to get a more detailed sense of our data."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RqLVMjoxYxF5" + }, + "source": [ + "df_bank.groupby('job').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GTUeRJAtYxF7" + }, + "source": [ + "df_bank.groupby('marital').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xsxdFumiYxF9" + }, + "source": [ + "df_bank.groupby('education').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3i1DCWV-YxGA" + }, + "source": [ + "Visualizations" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OEArHQPbYxGB" + }, + "source": [ + "%matplotlib inline\n", + "pd.crosstab(df_bank.job,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Job Title')\n", + "plt.xlabel('Job')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('purchase_fre_job')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PNwo5du_YxGD" + }, + "source": [ + "The frequency of purchase of the deposit depends a great deal on the job title. Thus, the job title can be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eM7CWfAZYxGE" + }, + "source": [ + "table=pd.crosstab(df_bank.marital,df_bank.y)\n", + "# Divide each row by its total so the bars show proportions\n", + "table.div(table.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True)\n", + "plt.title('Stacked Bar Chart of Marital Status vs Purchase')\n", + "plt.xlabel('Marital Status')\n", + "plt.ylabel('Proportion of Customers')\n", + "plt.savefig('marital_vs_pur_stack')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LWBLh7toYxGG" + }, + "source": [ + "It is hard to tell from the chart, but marital status does not seem to be a strong predictor of the outcome variable."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vh_u4QphYxGH" + }, + "source": [ + "table=pd.crosstab(df_bank.education,df_bank.y)\n", + "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n", + "plt.title('Stacked Bar Chart of Education vs Purchase')\n", + "plt.xlabel('Education')\n", + "plt.ylabel('Proportion of Customers')\n", + "plt.savefig('edu_vs_pur_stack')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d9AgJroYYxGK" + }, + "source": [ + "Education seems to be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dHI2LT-IYxGL" + }, + "source": [ + "pd.crosstab(df_bank.day_of_week,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Day of Week')\n", + "plt.xlabel('Day of Week')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_dayofweek_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3A2jmS4MYxGR" + }, + "source": [ + "Day of week may not be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bzafDBHpYxGS" + }, + "source": [ + "pd.crosstab(df_bank.month,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Month')\n", + "plt.xlabel('Month')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_fre_month_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x5CBtquEYxGW" + }, + "source": [ + "Month might be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tgF_3SqWYxGY" + }, + "source": [ + "df_bank.age.hist()\n", + "plt.title('Histogram of Age')\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Frequency')\n", + "plt.savefig('hist_age')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + 
"id": "y0FhKYDsYxGc" + }, + "source": [ + "Most of the bank's customers in this dataset are in the 30-40 age range." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5Nd3yV7DYxGd" + }, + "source": [ + "pd.crosstab(df_bank.poutcome,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Poutcome')\n", + "plt.xlabel('Poutcome')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_fre_pout_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oRKUAGrjYxGh" + }, + "source": [ + "Poutcome seems to be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "63RLRI9uYxGi" + }, + "source": [ + "### Create dummy variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V8S4WUKmYxGj" + }, + "source": [ + "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", + "for var in cat_vars:\n", + "    # Build the dummy columns for this variable and append them to df_bank,\n", + "    # so that the dummies of every variable accumulate across iterations\n", + "    cat_list = pd.get_dummies(df_bank[var], prefix=var, dtype=int)\n", + "    df_bank = df_bank.join(cat_list)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uX3w9i9WYxGl" + }, + "source": [ + "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", + "df_bank_vars=df_bank.columns.values.tolist()\n", + "to_keep=[i for i in df_bank_vars if i not in cat_vars]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cMX_82xaYxGq" + }, + "source": [ + "df_bank_final=df_bank[to_keep]\n", + "df_bank_final.columns.values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LkTjpxYoYxGr" + }, + "source": [ + "df_bank_final_vars=df_bank_final.columns.values.tolist()\n", + "y=['y']\n", + "X=[i for i in 
df_bank_final_vars if i not in y]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2QbKaRcsYxGt" + }, + "source": [ + "### Feature Selection" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EkxjW1AQYxGu" + }, + "source": [ + "from sklearn import datasets\n", + "from sklearn.feature_selection import RFE\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "logreg = LogisticRegression()\n", + "\n", + "# Pass the number of features by keyword; recent scikit-learn versions reject it as a positional argument\n", + "rfe = RFE(logreg, n_features_to_select=18)\n", + "# ravel() turns the single-column target into the 1-D array that fit expects\n", + "rfe = rfe.fit(df_bank_final[X], df_bank_final[y].values.ravel())\n", + "print(rfe.support_)\n", + "print(rfe.ranking_)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2P9hd4jHYxGw" + }, + "source": [ + "The Recursive Feature Elimination (RFE) has helped us select the following features: \"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5PW8WZX_YxGx" + }, + "source": [ + "cols=[\"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \n", + " \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \n", + " \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"] \n", + "X=df_bank_final[cols]\n", + "y=df_bank_final['y']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ix0mN9qxYxG0" + }, + "source": [ + "### Implementing the model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Hbx2bwtiYxG0" + }, + "source": [ + "import statsmodels.api as sm\n", + "logit_model=sm.Logit(y,X)\n", + "result=logit_model.fit()\n", + "print(result.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HR1ui-UcYxG2" + }, + "source": [ + "The p-values for most of the variables are very small, therefore, most of them are significant to the model." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9GHhrsaeYxG3" + }, + "source": [ + "### Logistic Regression Model Fitting" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFQnH5MzYxG3" + }, + "source": [ + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn import metrics\n", + "logreg = LogisticRegression()\n", + "logreg.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YUa3QL7tYxG6" + }, + "source": [ + "#### Predicting the test set results and calculating the accuracy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SD-y2e33YxG6" + }, + "source": [ + "y_pred = logreg.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kkPWzos7YxG-" + }, + "source": [ + "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwC3rt_6YxHA" + }, + "source": [ + "### Cross Validation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Muw50oqSYxHB" + }, + "source": [ + "from sklearn import model_selection\n", + "from sklearn.model_selection import cross_val_score\n", + "# random_state only has an effect when shuffle=True\n", + "kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)\n", + "modelCV = LogisticRegression()\n", + "scoring = 'accuracy'\n", + "results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)\n", + "print(\"10-fold cross validation average accuracy: %.3f\" % (results.mean()))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4y8XCTqoYxHE" + }, + "source": [ + "### Confusion Matrix" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"BCza9NkVYxHE" + }, + "source": [ + "from sklearn.metrics import confusion_matrix\n", + "# Store the result under a different name so the imported function is not overwritten\n", + "cm = confusion_matrix(y_test, y_pred)\n", + "print(cm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X9SapwS2YxHG" + }, + "source": [ + "The result is telling us that we have 10872+254 correct predictions and 1122+109 incorrect predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bEWvWScYxHG" + }, + "source": [ + "#### Accuracy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NaH2nESwYxHH" + }, + "source": [ + "# The fitted model is named logreg\n", + "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6oxlhbpYxHJ" + }, + "source": [ + "#### Compute precision, recall, F-measure and support\n", + "\n", + "The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.\n", + "\n", + "The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.\n", + "\n", + "The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.\n", + "\n", + "The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.\n", + "\n", + "The support is the number of occurrences of each class in y_test."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mhN5_p4yYxHK" + }, + "source": [ + "from sklearn.metrics import classification_report\n", + "print(classification_report(y_test, y_pred))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xzSFVEnAYxHP" + }, + "source": [ + "#### Interpretation: \n", + "\n", + "Precision: 88% of the term deposits that were promoted were ones the customers liked. Recall: 90% of the term deposits the customers preferred were promoted." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NGXJ6g2nYxHQ" + }, + "source": [ + "### ROC Curve\n", + "\n", + "    from sklearn import metrics\n", + "    from ggplot import *\n", + "\n", + "    prob = clf1.predict_proba(X_test)[:,1]\n", + "    fpr, sensitivity, _ = metrics.roc_curve(Y_test, prob)\n", + "\n", + "    df = pd.DataFrame(dict(fpr=fpr, sensitivity=sensitivity))\n", + "    ggplot(df, aes(x='fpr', y='sensitivity')) +\\\n", + "        geom_line() +\\\n", + "        geom_abline(linetype='dashed')" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "u9QKDuS0YxHQ" + }, + "source": [ + "from sklearn.metrics import roc_auc_score\n", + "from sklearn.metrics import roc_curve\n", + "# Compute the AUC from the predicted probabilities, the same scores used for the curve\n", + "logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])\n", + "fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])\n", + "plt.figure()\n", + "plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)\n", + "plt.plot([0, 1], [0, 1],'r--')\n", + "plt.xlim([0.0, 1.0])\n", + "plt.ylim([0.0, 1.05])\n", + "plt.xlabel('False Positive Rate')\n", + "plt.ylabel('True Positive Rate')\n", + "plt.title('Receiver operating characteristic')\n", + "plt.legend(loc=\"lower right\")\n", + "plt.savefig('Log_ROC')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git "a/Notebooks/NB15_02__Regress\303\243o Linear_hs4.ipynb" 
"b/Notebooks/NB15_02__Regress\303\243o Linear_hs4.ipynb" new file mode 100644 index 000000000..fc1c3f423 --- /dev/null +++ "b/Notebooks/NB15_02__Regress\303\243o Linear_hs4.ipynb" @@ -0,0 +1,10670 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.1" + }, + "colab": { + "name": "NB15_02__Regressão Linear.ipynb", + "provenance": [], + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XwQDhId7N6_r" + }, + "source": [ + "

MACHINE LEARNING WITH PYTHON

\n", + "

SUPERVISED LEARNING

\n", + "

REGRESSION MODELS (LINEAR AND LOGISTIC)

\n", + "\n", + "Sources: https://realpython.com/linear-regression-in-python/\n", + "https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PN-dQFJcM1UV" + }, + "source": [ + "Steps to implement Linear Regression:\n", + "\n", + "* (1) Import the required libraries;\n", + "* (2) Load the data;\n", + "* (3) Apply the required transformations: outliers, NaNs, scaling (MinMaxScaler, RobustScaler, StandardScaler, Log, Box-Cox, etc.);\n", + "* (4) Visualize the data: understand the relationships, distributions, etc. present in the data;\n", + "* (5) Build and train the predictive model (in this case, a regression model);\n", + "* (6) Validate/check the metrics used to evaluate the model(s);\n", + "* (7) Predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8TldGZxAFV5E" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0QRbxlqaq7pr" + }, + "source": [ + "# Improvements for this session:\n", + "* Compute the correlations before and after RIDGE and LASSO to show the multicollinearity and explain why certain columns \"stop\" being important." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P4sAIblOgFyL" + }, + "source": [ + "# Regression Models with Regularization for Classification and Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o7Y7cuJNgFyU" + }, + "source": [ + "## Linear Regression (using OLS - Ordinary Least Squares)\n", + "\n", + "* Features $X_{np}$: an n x p matrix containing the attributes/predictor variables of the dataframe (independent variables);\n", + "* The target/dependent variable is represented by y;\n", + "* The relationship between X and y is given by the equation below, where $w_{i}$ is the weight (coefficient) of each feature and $w_{0}$ is the intercept."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NpJ580y9gFyU" + }, + "source": [ + "\n", + "\n", + "![X_y](https://github.com/MathMachado/Materials/blob/master/Architecture.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5rhbVGJ0gFyY" + }, + "source": [ + "* Residual Sum of Squares (RSS) - the sum of the squared differences between the observed and the predicted values.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u8gA0YkbgFyp" + }, + "source": [ + "## Main parameters of the algorithm:\n", + "* fit_intercept - Indicates whether the intercept $w_{0}$ should be fitted. If the data are already centered, there is no need to fit the intercept $w_{0}$.\n", + "\n", + "* normalize - $X$ is automatically normalized (the mean is subtracted and the result is divided by the l2-norm). This parameter was deprecated in scikit-learn 1.0 and removed in 1.2; scale the data beforehand instead (for example with StandardScaler);\n", + "\n", + "## Attributes of the Machine Learning model for Regression\n", + "* coef_ - weight/factor of each independent variable of the ML model;\n", + "\n", + "* intercept_ ($w_{0}$) - intercept or bias of $y$;\n", + "\n", + "## Functions for fitting the ML model:\n", + "* fit - trains the model with the matrices $X$ and $y$;\n", + "* predict - once the model has been trained, for a given $X$, uses the fitted model to compute the predicted values of $y$ (y_pred).\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A-JG8El1gFy7" + }, + "source": [ + "# Limitations of OLS (Ordinary Least Squares):\n", + "* Sensitive to outliers;\n", + "* Multicollinearity; \n", + "* Heteroscedasticity - the variance of the residuals is not constant, which shows up as an uneven dispersion of the data around the fitted line;\n", + "\n", + "* References" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xylMYR8COyrw" + }, + "source": [ + "### Import the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2BGgrILlPK6Z" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from scipy import stats" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "263GgbwhO2kQ" + }, + "source": [ + "### Load the data\n", + "* Let's load the [Boston House Pricing](https://archive.ics.uci.edu/ml/datasets/housing) dataset" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1h66x_-rXGhi" + }, + "source": [ + "# Note: load_boston was removed in scikit-learn 1.2, so this import requires an older version\n", + "from sklearn.datasets import load_boston, load_iris" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rWniNkMpXQFU", + "outputId": "5096d239-2c8c-4327-dbf5-f9128faa589c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "boston = load_boston()\n", + "boston" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'DESCR': \".. _boston_dataset:\\n\\nBoston house prices dataset\\n---------------------------\\n\\n**Data Set Characteristics:** \\n\\n :Number of Instances: 506 \\n\\n :Number of Attributes: 13 numeric/categorical predictive. 
Median Value (attribute 14) is usually the target.\\n\\n :Attribute Information (in order):\\n - CRIM per capita crime rate by town\\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\\n - INDUS proportion of non-retail business acres per town\\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\\n - NOX nitric oxides concentration (parts per 10 million)\\n - RM average number of rooms per dwelling\\n - AGE proportion of owner-occupied units built prior to 1940\\n - DIS weighted distances to five Boston employment centres\\n - RAD index of accessibility to radial highways\\n - TAX full-value property-tax rate per $10,000\\n - PTRATIO pupil-teacher ratio by town\\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\\n - LSTAT % lower status of the population\\n - MEDV Median value of owner-occupied homes in $1000's\\n\\n :Missing Attribute Values: None\\n\\n :Creator: Harrison, D. and Rubinfeld, D.L.\\n\\nThis is a copy of UCI ML housing dataset.\\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\\n\\n\\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\\n\\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\\nprices and the demand for clean air', J. Environ. Economics & Management,\\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\\n...', Wiley, 1980. N.B. Various transformations are used in the table on\\npages 244-261 of the latter.\\n\\nThe Boston house-price data has been used in many machine learning papers that address regression\\nproblems. \\n \\n.. topic:: References\\n\\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. 
In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\\n\",\n", + " 'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,\n", + " 4.9800e+00],\n", + " [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,\n", + " 9.1400e+00],\n", + " [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,\n", + " 4.0300e+00],\n", + " ...,\n", + " [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n", + " 5.6400e+00],\n", + " [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,\n", + " 6.4800e+00],\n", + " [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n", + " 7.8800e+00]]),\n", + " 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n", + " 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTAT
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.98
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.14
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.03
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.94
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.33
\n", + "" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 136 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pQzFW7DUX_KW", + "outputId": "dcf288db-d99d-4d17-c22c-ceb8a9ba4841", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "# Variável target/resposta\n", + "df_boston['preco'] = load_boston().target\n", + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATpreco
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
\n", + "
" + ], + "text/plain": [ + " CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT preco\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 137 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H71da4bIO4kI" + }, + "source": [ + "### Data Transformation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K-6YOdsTfciO" + }, + "source": [ + "#### Normalização/padronização dos nomes das colunas" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L8OJEapufhq4" + }, + "source": [ + "# Renomear as colunas do dataframe\n", + "df_boston.columns = [col.lower() for col in df_boston.columns]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uRinX-5ofol_", + "outputId": "2e67bbbd-792f-4786-8c7e-2d0bd16fd249", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "df_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2
\n", + "
" + ], + "text/plain": [ + " crim zn indus chas nox ... tax ptratio b lstat preco\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n", + "\n", + "[5 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 139 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CMDh5jyqekmr" + }, + "source": [ + "#### Outliers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jJIG0jJQf6em" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FgYPzlvfemFc" + }, + "source": [ + "#### Missing values" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BAjw7UhJen0D", + "outputId": "917a8f23-ec31-4f22-9a46-c3a15c1e4563", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Missing values por colunas/variáveis\n", + "df_boston.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "crim 0\n", + "zn 0\n", + "indus 0\n", + "chas 0\n", + "nox 0\n", + "rm 0\n", + "age 0\n", + "dis 0\n", + "rad 0\n", + "tax 0\n", + "ptratio 0\n", + "b 0\n", + "lstat 0\n", + "preco 0\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 140 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jo3UWNpbYnNF", + "outputId": "aeefc57a-f1b7-41ac-aa2e-53f828b9be14", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Número de atributos\n", + "len(load_boston().feature_names)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "13" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 141 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0Yp8g7hxfQli", + "outputId": "43795436-0366-4427-ed5a-2deacedf567f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 49 + } + }, + "source": [ + "# Missing Values por linhas\n", + "df_boston[df_boston.isnull().any(axis = 1)]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: [crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, preco]\n", + "Index: []" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 142 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5qmkTFLrf9MT" + }, + "source": [ + "#### Estatísticas Descritivas" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Nprn3p_Wf_bn", + "outputId": "16f46af6-ab9a-4d7b-a875-295817b9bf9c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 297 + } + }, + "source": [ + "df_boston.describe()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
count506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000
mean3.61352411.36363611.1367790.0691700.5546956.28463468.5749013.7950439.549407408.23715418.455534356.67403212.65306322.532806
std8.60154523.3224536.8603530.2539940.1158780.70261728.1488612.1057108.707259168.5371162.16494691.2948647.1410629.197104
min0.0063200.0000000.4600000.0000000.3850003.5610002.9000001.1296001.000000187.00000012.6000000.3200001.7300005.000000
25%0.0820450.0000005.1900000.0000000.4490005.88550045.0250002.1001754.000000279.00000017.400000375.3775006.95000017.025000
50%0.2565100.0000009.6900000.0000000.5380006.20850077.5000003.2074505.000000330.00000019.050000391.44000011.36000021.200000
75%3.67708312.50000018.1000000.0000000.6240006.62350094.0750005.18842524.000000666.00000020.200000396.22500016.95500025.000000
max88.976200100.00000027.7400001.0000000.8710008.780000100.00000012.12650024.000000711.00000022.000000396.90000037.97000050.000000
\n", + "
" + ], + "text/plain": [ + " crim zn indus ... b lstat preco\n", + "count 506.000000 506.000000 506.000000 ... 506.000000 506.000000 506.000000\n", + "mean 3.613524 11.363636 11.136779 ... 356.674032 12.653063 22.532806\n", + "std 8.601545 23.322453 6.860353 ... 91.294864 7.141062 9.197104\n", + "min 0.006320 0.000000 0.460000 ... 0.320000 1.730000 5.000000\n", + "25% 0.082045 0.000000 5.190000 ... 375.377500 6.950000 17.025000\n", + "50% 0.256510 0.000000 9.690000 ... 391.440000 11.360000 21.200000\n", + "75% 3.677083 12.500000 18.100000 ... 396.225000 16.955000 25.000000\n", + "max 88.976200 100.000000 27.740000 ... 396.900000 37.970000 50.000000\n", + "\n", + "[8 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 143 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1JimyY3SgECE" + }, + "source": [ + "#### Análise de Correlação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jScHq7eTgIpm", + "outputId": "50696c9d-c19a-4937-9189-368be5fb291c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 483 + } + }, + "source": [ + "correlacoes = df_boston.corr()\n", + "correlacoes" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstatpreco
crim1.000000-0.2004690.406583-0.0558920.420972-0.2192470.352734-0.3796700.6255050.5827640.289946-0.3850640.455621-0.388305
zn-0.2004691.000000-0.533828-0.042697-0.5166040.311991-0.5695370.664408-0.311948-0.314563-0.3916790.175520-0.4129950.360445
indus0.406583-0.5338281.0000000.0629380.763651-0.3916760.644779-0.7080270.5951290.7207600.383248-0.3569770.603800-0.483725
chas-0.055892-0.0426970.0629381.0000000.0912030.0912510.086518-0.099176-0.007368-0.035587-0.1215150.048788-0.0539290.175260
nox0.420972-0.5166040.7636510.0912031.000000-0.3021880.731470-0.7692300.6114410.6680230.188933-0.3800510.590879-0.427321
rm-0.2192470.311991-0.3916760.091251-0.3021881.000000-0.2402650.205246-0.209847-0.292048-0.3555010.128069-0.6138080.695360
age0.352734-0.5695370.6447790.0865180.731470-0.2402651.000000-0.7478810.4560220.5064560.261515-0.2735340.602339-0.376955
dis-0.3796700.664408-0.708027-0.099176-0.7692300.205246-0.7478811.000000-0.494588-0.534432-0.2324710.291512-0.4969960.249929
rad0.625505-0.3119480.595129-0.0073680.611441-0.2098470.456022-0.4945881.0000000.9102280.464741-0.4444130.488676-0.381626
tax0.582764-0.3145630.720760-0.0355870.668023-0.2920480.506456-0.5344320.9102281.0000000.460853-0.4418080.543993-0.468536
ptratio0.289946-0.3916790.383248-0.1215150.188933-0.3555010.261515-0.2324710.4647410.4608531.000000-0.1773830.374044-0.507787
b-0.3850640.175520-0.3569770.048788-0.3800510.128069-0.2735340.291512-0.444413-0.441808-0.1773831.000000-0.3660870.333461
lstat0.455621-0.4129950.603800-0.0539290.590879-0.6138080.602339-0.4969960.4886760.5439930.374044-0.3660871.000000-0.737663
preco-0.3883050.360445-0.4837250.175260-0.4273210.695360-0.3769550.249929-0.381626-0.468536-0.5077870.333461-0.7376631.000000
\n", + "
" + ], + "text/plain": [ + " crim zn indus ... b lstat preco\n", + "crim 1.000000 -0.200469 0.406583 ... -0.385064 0.455621 -0.388305\n", + "zn -0.200469 1.000000 -0.533828 ... 0.175520 -0.412995 0.360445\n", + "indus 0.406583 -0.533828 1.000000 ... -0.356977 0.603800 -0.483725\n", + "chas -0.055892 -0.042697 0.062938 ... 0.048788 -0.053929 0.175260\n", + "nox 0.420972 -0.516604 0.763651 ... -0.380051 0.590879 -0.427321\n", + "rm -0.219247 0.311991 -0.391676 ... 0.128069 -0.613808 0.695360\n", + "age 0.352734 -0.569537 0.644779 ... -0.273534 0.602339 -0.376955\n", + "dis -0.379670 0.664408 -0.708027 ... 0.291512 -0.496996 0.249929\n", + "rad 0.625505 -0.311948 0.595129 ... -0.444413 0.488676 -0.381626\n", + "tax 0.582764 -0.314563 0.720760 ... -0.441808 0.543993 -0.468536\n", + "ptratio 0.289946 -0.391679 0.383248 ... -0.177383 0.374044 -0.507787\n", + "b -0.385064 0.175520 -0.356977 ... 1.000000 -0.366087 0.333461\n", + "lstat 0.455621 -0.412995 0.603800 ... -0.366087 1.000000 -0.737663\n", + "preco -0.388305 0.360445 -0.483725 ... 
 0.333461 -0.737663 1.000000\n", + "\n", + "[14 rows x 14 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 144 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AxQp7xqdgTJP" + }, + "source": [ + "##### Plot of the correlations between the features/variables/columns\n", + "Source: https://seaborn.pydata.org/examples/many_pairwise_correlations.html\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KOiH2X-WgqmN", + "outputId": "f72007dc-7c99-4ce1-b6bb-b86a9bf617c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 557 + } + }, + "source": [ + "import seaborn as sns\n", + "from string import ascii_letters\n", + "import matplotlib.pyplot as plt\n", + "\n", + "sns.set_theme(style = \"white\")\n", + "\n", + "d = df_boston\n", + "\n", + "# Compute the correlation matrix\n", + "corr = d.corr()\n", + "\n", + "# Generate a mask for the upper triangle\n", + "mask = np.triu(np.ones_like(corr, dtype=bool))\n", + "\n", + "# Set up the matplotlib figure\n", + "f, ax = plt.subplots(figsize=(11, 9))\n", + "\n", + "# Generate a custom diverging colormap\n", + "cmap = sns.diverging_palette(230, 20, as_cmap=True)\n", + "\n", + "# Draw the heatmap with the mask and correct aspect ratio\n", + "sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,\n", + " square=True, linewidths=.5, cbar_kws={\"shrink\": .5})" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 145 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nogPhyfVO70G" + }, + "source": [ + "### Construir e treinar o(s) modelo(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HxYpfyvQaIe1" + }, + "source": [ + "$X = [X_{1}, X_{2}, X_{p}]$ = X_boston abaixo." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0BhLZJhibVNG" + }, + "source": [ + "X_boston = df_boston.drop(columns = ['preco'], axis = 1) # todas as variáveis/atributos, exceto 'preco'\n", + "y_boston = df_boston['preco'] # variável-target" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v_nC_RGva1Z6", + "outputId": "6a5946c8-62b3-424f-a809-9a2bbc34f191", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "X_boston.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
crimzninduschasnoxrmagedisradtaxptratioblstat
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.98
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.14
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.03
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.94
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.33
\n", + "
" + ], + "text/plain": [ + " crim zn indus chas nox ... rad tax ptratio b lstat\n", + "0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98\n", + "1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14\n", + "2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03\n", + "3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94\n", + "4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 147 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nlVJM--Ya5fS", + "outputId": "58037983-175f-47ed-ad47-5826589358b0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_boston[0:10] # Series (coluna)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 24.0\n", + "1 21.6\n", + "2 34.7\n", + "3 33.4\n", + "4 36.2\n", + "5 28.7\n", + "6 22.9\n", + "7 27.1\n", + "8 16.5\n", + "9 18.9\n", + "Name: preco, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 148 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "b50_6tv5h1kY" + }, + "source": [ + "# Definindo os dataframes de treinamento e teste:\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_boston, \n", + " y_boston, \n", + " test_size = 0.2, \n", + " random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1U3hpdkDbYTv", + "outputId": "35e8cee1-201a-4a65-a6ec-8fa9e8c7c0a8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "print(f\"Dataframe de treinamento: {X_treinamento.shape[0]} linhas\")\n", + "print(f\"Dataframe de teste......: {X_teste.shape[0]} linhas\")" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + 
"Dataframe de treinamento: 404 linhas\n", + "Dataframe de teste......: 102 linhas\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SvevXulFiJj1" + }, + "source": [ + "#### Treinamento do modelo de Regressão Linear" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GVwF3vp8iNff" + }, + "source": [ + "# Importa a library LinearRegression --> Para treinamento da Regressão Linear\n", + "from sklearn.linear_model import LinearRegression\n", + "\n", + "# Library para statmodels\n", + "import statsmodels.api as sm" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ibX6bCbViW-v" + }, + "source": [ + "# Instancia o objeto\n", + "regressao_linear = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M-5wRGUribY0", + "outputId": "a67d7355-3d9e-43fc-edf6-8ebd71911935", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Treina o modelo usando as amostras/dataset de treinamento: X_treinamento e y_treinamento \n", + "regressao_linear.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 153 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jri-jA1VjmUl", + "outputId": "3150261d-c264-4273-9c5f-95229529881b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Valor do intercepto\n", + "regressao_linear.intercept_" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "35.9020918753502" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 154 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"VOjadxdxjqtT", + "outputId": "49d06bd9-e375-403f-e257-863967f10fd3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 452 + } + }, + "source": [ + "# Coeficientes do modelo de Regressão Linear\n", + "coeficientes_regressao_linear = pd.DataFrame([X_treinamento.columns, regressao_linear.coef_]).T\n", + "coeficientes_regressao_linear = coeficientes_regressao_linear.rename(columns={0: 'Feature/variável/coluna', 1: 'Coeficientes'})\n", + "coeficientes_regressao_linear" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Feature/variável/colunaCoeficientes
0crim-0.0822083
1zn0.0428002
2indus0.0756011
3chas3.16348
4nox-19.4945
5rm3.98161
6age0.00480929
7dis-1.37396
8rad0.298883
9tax-0.0123962
10ptratio-0.984657
11b0.008949
12lstat-0.526478
\n", + "
" + ], + "text/plain": [ + " Feature/variável/coluna Coeficientes\n", + "0 crim -0.0822083\n", + "1 zn 0.0428002\n", + "2 indus 0.0756011\n", + "3 chas 3.16348\n", + "4 nox -19.4945\n", + "5 rm 3.98161\n", + "6 age 0.00480929\n", + "7 dis -1.37396\n", + "8 rad 0.298883\n", + "9 tax -0.0123962\n", + "10 ptratio -0.984657\n", + "11 b 0.008949\n", + "12 lstat -0.526478" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 155 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jwnkhPwDjkhS" + }, + "source": [ + "#### Usando statmodels" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ltbekHd_k3PH", + "outputId": "a69b057e-75a6-446e-8b7c-ad37604114a5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X2_treinamento = sm.add_constant(X_treinamento)\n", + "lm_sm = sm.OLS(y_treinamento, X2_treinamento).fit()\n", + "print(lm_sm.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.725\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 78.97\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.48e-100\n", + "Time: 11:00:14 Log-Likelihood: -1214.8\n", + "No. 
Observations: 404 AIC: 2458.\n", + "Df Residuals: 390 BIC: 2514.\n", + "Df Model: 13 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.9021 6.037 5.947 0.000 24.033 47.771\n", + "crim -0.0822 0.045 -1.824 0.069 -0.171 0.006\n", + "zn 0.0428 0.016 2.638 0.009 0.011 0.075\n", + "indus 0.0756 0.072 1.054 0.292 -0.065 0.217\n", + "chas 3.1635 0.997 3.174 0.002 1.204 5.123\n", + "nox -19.4945 4.539 -4.295 0.000 -28.418 -10.571\n", + "rm 3.9816 0.510 7.802 0.000 2.978 4.985\n", + "age 0.0048 0.015 0.312 0.755 -0.025 0.035\n", + "dis -1.3740 0.236 -5.827 0.000 -1.838 -0.910\n", + "rad 0.2989 0.079 3.760 0.000 0.143 0.455\n", + "tax -0.0124 0.004 -2.814 0.005 -0.021 -0.004\n", + "ptratio -0.9847 0.156 -6.309 0.000 -1.292 -0.678\n", + "b 0.0089 0.003 2.796 0.005 0.003 0.015\n", + "lstat -0.5265 0.060 -8.764 0.000 -0.645 -0.408\n", + "==============================================================================\n", + "Omnibus: 140.799 Durbin-Watson: 2.083\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 591.650\n", + "Skew: 1.484 Prob(JB): 3.35e-129\n", + "Kurtosis: 8.132 Cond. No. 1.51e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.51e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kpt3A4Q0guHv" + }, + "source": [ + "#### Dropping the least significant variable in the model (highest p-value): 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rVUJkfg4gSh7", + "outputId": "eeff1e8f-8ac7-44e8-e0fe-caf0d4a641c7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X3 = X_treinamento.drop(columns = 'age')\n", + "X3_treinamento = sm.add_constant(X3)\n", + "lm_sm2 = sm.OLS(y_treinamento, X3_treinamento).fit()\n", + "print(lm_sm2.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.725\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 85.75\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.64e-101\n", + "Time: 11:00:14 Log-Likelihood: -1214.8\n", + "No. 
Observations: 404 AIC: 2456.\n", + "Df Residuals: 391 BIC: 2508.\n", + "Df Model: 12 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.7325 6.006 5.950 0.000 23.925 47.540\n", + "crim -0.0815 0.045 -1.812 0.071 -0.170 0.007\n", + "zn 0.0422 0.016 2.623 0.009 0.011 0.074\n", + "indus 0.0750 0.072 1.048 0.295 -0.066 0.216\n", + "chas 3.1794 0.994 3.198 0.001 1.225 5.134\n", + "nox -19.1299 4.381 -4.367 0.000 -27.742 -10.517\n", + "rm 4.0153 0.498 8.059 0.000 3.036 4.995\n", + "dis -1.3963 0.224 -6.223 0.000 -1.837 -0.955\n", + "rad 0.2958 0.079 3.755 0.000 0.141 0.451\n", + "tax -0.0123 0.004 -2.802 0.005 -0.021 -0.004\n", + "ptratio -0.9812 0.156 -6.310 0.000 -1.287 -0.675\n", + "b 0.0090 0.003 2.825 0.005 0.003 0.015\n", + "lstat -0.5202 0.057 -9.203 0.000 -0.631 -0.409\n", + "==============================================================================\n", + "Omnibus: 142.363 Durbin-Watson: 2.081\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 608.694\n", + "Skew: 1.496 Prob(JB): 6.67e-133\n", + "Kurtosis: 8.216 Cond. No. 1.48e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.48e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_lcp7m5FmZvG" + }, + "source": [ + "#### Dropping the least significant variable in the model (highest p-value): 'indus'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jEiBywx4hGNB", + "outputId": "fb2abfd1-9019-4e37-f6e1-cf5e54ae1276", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X4 = X3_treinamento.drop(columns = 'indus')\n", + "X4_treinamento = sm.add_constant(X4)\n", + "lm_sm3 = sm.OLS(y_treinamento, X4_treinamento).fit()\n", + "print(lm_sm3.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.724\n", + "Model: OLS Adj. R-squared: 0.716\n", + "Method: Least Squares F-statistic: 93.42\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 2.86e-102\n", + "Time: 11:00:14 Log-Likelihood: -1215.4\n", + "No. 
Observations: 404 AIC: 2455.\n", + "Df Residuals: 392 BIC: 2503.\n", + "Df Model: 11 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 35.4757 6.001 5.911 0.000 23.677 47.275\n", + "crim -0.0840 0.045 -1.871 0.062 -0.172 0.004\n", + "zn 0.0407 0.016 2.539 0.012 0.009 0.072\n", + "chas 3.2924 0.989 3.330 0.001 1.349 5.236\n", + "nox -17.9558 4.235 -4.239 0.000 -26.283 -9.629\n", + "rm 3.9674 0.496 7.996 0.000 2.992 4.943\n", + "dis -1.4553 0.217 -6.699 0.000 -1.882 -1.028\n", + "rad 0.2744 0.076 3.606 0.000 0.125 0.424\n", + "tax -0.0103 0.004 -2.603 0.010 -0.018 -0.003\n", + "ptratio -0.9609 0.154 -6.227 0.000 -1.264 -0.658\n", + "b 0.0089 0.003 2.778 0.006 0.003 0.015\n", + "lstat -0.5151 0.056 -9.145 0.000 -0.626 -0.404\n", + "==============================================================================\n", + "Omnibus: 142.123 Durbin-Watson: 2.073\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 605.868\n", + "Skew: 1.494 Prob(JB): 2.74e-132\n", + "Kurtosis: 8.202 Cond. No. 1.47e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.47e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rFejox5XmrEE" + }, + "source": [ + "#### Dropping the least significant variable in the model (highest p-value): 'crim'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DOehOql8hZWr", + "outputId": "cbb71827-f44e-4688-93c4-98a3ec5e3257", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X5 = X4_treinamento.drop(columns = 'crim')\n", + "X5_treinamento = sm.add_constant(X5)\n", + "lm_sm4 = sm.OLS(y_treinamento, X5_treinamento).fit()\n", + "print(lm_sm4.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: preco R-squared: 0.721\n", + "Model: OLS Adj. R-squared: 0.714\n", + "Method: Least Squares F-statistic: 101.8\n", + "Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.55e-102\n", + "Time: 11:00:14 Log-Likelihood: -1217.2\n", + "No. 
Observations: 404 AIC: 2456.\n", + "Df Residuals: 393 BIC: 2500.\n", + "Df Model: 10 \n", + "Covariance Type: nonrobust \n", + "==============================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const 33.9950 5.968 5.696 0.000 22.262 45.728\n", + "zn 0.0375 0.016 2.349 0.019 0.006 0.069\n", + "chas 3.3959 0.990 3.430 0.001 1.449 5.343\n", + "nox -17.1637 4.228 -4.060 0.000 -25.475 -8.852\n", + "rm 4.0365 0.496 8.132 0.000 3.061 5.012\n", + "dis -1.3999 0.216 -6.484 0.000 -1.824 -0.975\n", + "rad 0.2278 0.072 3.158 0.002 0.086 0.370\n", + "tax -0.0100 0.004 -2.513 0.012 -0.018 -0.002\n", + "ptratio -0.9493 0.155 -6.137 0.000 -1.253 -0.645\n", + "b 0.0101 0.003 3.217 0.001 0.004 0.016\n", + "lstat -0.5315 0.056 -9.523 0.000 -0.641 -0.422\n", + "==============================================================================\n", + "Omnibus: 140.245 Durbin-Watson: 2.070\n", + "Prob(Omnibus): 0.000 Jarque-Bera (JB): 609.563\n", + "Skew: 1.464 Prob(JB): 4.32e-133\n", + "Kurtosis: 8.257 Cond. No. 1.46e+04\n", + "==============================================================================\n", + "\n", + "Warnings:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 1.46e+04. 
This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UafIUrpZB0YP" + }, + "source": [ + "### Conclusion\n", + "* Which variables/columns/attributes stay in the model?\n", + "* **Very important (exercise)**: normalize the covariates (MinMaxScaler) and redo the analysis.\n", + "* In this iteration, after dropping the variables age, indus and crim (in that order), no remaining variable is insignificant at the 5% level (in fact, the largest p-value is 1.9%)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jx7sOzrrm-H_" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nXeiFtnJO_1u" + }, + "source": [ + "### Model validation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QlGVFA6uPDvr" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PE3aKJ6mPDyJ" + }, + "source": [ + "### Predictions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d3nGiyX8jadH" + }, + "source": [ + "### Deployment of the **analytical** solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YQF4NIlGSLH" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UQfpoo1igFy8" + }, + "source": [ + "# Regularized Regression Methods \n", + "## Ridge Regression - Penalized Regression\n", + "> Reduces model complexity while still using all the variables in $X$: it penalizes (with strength $\\alpha$) the coefficients $w_{i}$ that are far from zero, shrinking them continuously toward small values. 
This lowers the complexity of the model while keeping every variable in it.\n", + "* Less affected by outliers.\n", + "\n", + "### Example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o00xH2MvxvgP" + }, + "source": [ + "# Covariate matrices for the model:\n", + "X_new = [[0, 0], [0, 0], [1, 1]]\n", + "X_new2 = [[0, 0], [0, 1.5], [1, 1]]\n", + "\n", + "y_new = [0, .1, 1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "v9U7c03NzW_c", + "outputId": "2652bd10-e6b4-4200-f7f0-a07806564a1d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_new # 2 variables/columns" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[[0, 0], [0, 0], [1, 1]]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 161 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iiVEAPpUzXyN", + "outputId": "a69fe575-57da-459c-f482-41d3185ab76f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_new" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[0, 0.1, 1]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 162 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JDljolA95Hw5" + }, + "source": [ + "### Without outliers" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8mWj2GbPOkHx" + }, + "source": [ + "# Ridge must be imported before its first use\n", + "from sklearn.linear_model import Ridge\n", + "\n", + "ridge = Ridge(alpha = .1)\n", + "ridge.fit(X_new, y_new)\n", + "ridge.coef_ # Ridge coefficients" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8yvd4ABY5JjC" + }, + "source": [ + "### With outliers" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O3sJZ_pe5GQ7" + }, + "source": [ + "ridge = Ridge(alpha = .1)\n", + "ridge.fit(X_new2, y_new)\n", + "ridge.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zZxdCLU_5kKh" + }, + "source": [ + "#### Could you see the impact of the outliers?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u5jsTkUmS9wK" + }, + "source": [ + "### Applying Ridge regression to the Boston Housing Price dataframe."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Kp4VIJWxgFy8" + }, + "source": [ + "from sklearn.linear_model import LinearRegression, Ridge\n", + "ridge = Ridge(alpha = 0.1) # Sets the alpha value of the ridge regression\n", + "lr = LinearRegression()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cmRMoOwV6FMt" + }, + "source": [ + "# Instead of: regressao_linear.fit(X_treinamento, y_treinamento)\n", + "ridge.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VPnekyUbK6Xg" + }, + "source": [ + "#### Weight/contribution of the variables to the regression using RIDGE" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "k83RDArjsUrj" + }, + "source": [ + "df_boston.columns" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vMCb0CFjK973" + }, + "source": [ + "ridge.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZqksuIjXypRJ" + }, + "source": [ + "# training the Ridge regression\n", + "ridge.fit(X_treinamento, y_treinamento)\n", + "\n", + "# training the plain linear regression (OLS)\n", + "lr.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7r28PBsWLtjA" + }, + "source": [ + "ridge.alpha" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dDZ_TJnhuZno" + }, + "source": [ + "#### $\\alpha = 0.01$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hRMK_QTmNgc1" + }, + "source": [ + "# larger alpha --> stronger penalty on the coefficients;\n", + "# smaller alpha --> weaker penalty, and Ridge approaches OLS; if alpha = 0 ==> Ridge = OLS.\n", + "rr = Ridge(alpha = 0.01) # The closer alpha is to 0, the closer Ridge is to OLS\n", + "rr.fit(X_treinamento, 
y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IRuWmBE7Ngc7" + }, + "source": [ + "# MSE = Mean Squared Error\n", + "from sklearn.metrics import mean_squared_error\n", + "\n", + "rr_model=(mean_squared_error(y_true = y_treinamento, y_pred = rr.predict(X_treinamento)))\n", + "lr_model=(mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "L4an-zHetafI" + }, + "source": [ + "print(rr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QsLVzk3EtbGs" + }, + "source": [ + "print(lr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K2sjngo1QhY2" + }, + "source": [ + "### Ridge coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s5i87o3quByz" + }, + "source": [ + "# List of the variables with the absolute value of their Ridge coefficients:\n", + "list(zip(X_treinamento.columns, abs(ridge.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s44vo9IjQonE" + }, + "source": [ + "### Try several other values for $\\alpha$, for example $\\alpha = 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CDv5fGPbuUq5" + }, + "source": [ + "#### $\\alpha = 100$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NEaj4QRrNgdA" + }, + "source": [ + "rr100 = Ridge(alpha = 100)\n", + "rr100.fit(X_treinamento, y_treinamento)\n", + "train_score=lr.score(X_treinamento, y_treinamento)\n", + "test_score=lr.score(X_teste, y_teste)\n", + "Ridge_treinamento_score = rr.score(X_treinamento,y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"zhcfoTEENgdE" + }, + "source": [ + "# MSE\n", + "rr100_model = (mean_squared_error(y_true = y_treinamento, y_pred = rr100.predict(X_treinamento)))\n", + "lr_model = (mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NGDBpfiquxoc" + }, + "source": [ + "print(rr100_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Owami5MVureW" + }, + "source": [ + "print(lr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xk5dN3Owu6Kw" + }, + "source": [ + "### Next step: fit the ridge models with statsmodels." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cEF_3GgUgF0Q" + }, + "source": [ + "# LASSO (Least Absolute Shrinkage And Selection Operator regularization)\n", + "* One of the most widely used regularization methods;\n", + "* Reduces overfitting;\n", + "* Performs **Feature Selection**: it can set coefficients exactly to zero, which tends to discard redundant, highly correlated variables." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-YiKb9reQdI4" + }, + "source": [ + "* Used for regularization, that is, penalizing the coefficients so that only the most important attributes remain. Think about how useful this is for a dataframe with many variables;\n", + "* Lasso regression has a parameter ($\\alpha$): the larger $\\alpha$ is, the more coefficients are shrunk to exactly zero. When $\\alpha = 0$, Lasso yields the same coefficients as an ordinary linear regression. When $\\alpha$ is very large, all coefficients are zero."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5p_ZPZ4tTUX1" + }, + "source": [ + "### LASSO example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JD1_M2uw6q0W" + }, + "source": [ + "X_new" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "i5JZTnkTOkI9" + }, + "source": [ + "from sklearn.linear_model import Lasso\n", + "lasso = Lasso(alpha = .1)\n", + "lasso.fit(X_new, y_new)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gEUxSlThOkJD" + }, + "source": [ + "lasso.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EQaGWzzLT9qP" + }, + "source": [ + "### Applying LASSO to Boston Housing Price" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ME6v6LFlgF0Q" + }, + "source": [ + "from sklearn.linear_model import Lasso\n", + "lasso = Lasso(alpha = .1)\n", + "lasso.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "h6DSEHc1gF0V" + }, + "source": [ + "lasso.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8SzYnpVGy4cy" + }, + "source": [ + "### LASSO coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "O2w2QDmdxxVe" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(lasso.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UBOCg1H9zn6A" + }, + "source": [ + "### Comparison with the RIDGE coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g1fF-mEZzXpH" + }, + "source": [ + "list(zip(X_treinamento.columns, abs(ridge.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xP1fX1Bi6VdX" + }, + "source": [ + 
"**Conclusion**: variables whose coefficient is zero can be dropped from the analysis/model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TbtxIWyGSXkH" + }, + "source": [ + "### Effect of the values of $\\alpha$\n", + "* Function adapted from https://chrisalbon.com/machine_learning/linear_regression/effect_of_alpha_on_lasso_regression/." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B4AuWA4LRBE3" + }, + "source": [ + "# Create a function that fits one lasso regression per alpha\n", + "# (named lasso_coefs so it does not shadow the fitted `lasso` estimator above)\n", + "def lasso_coefs(alphas):\n", + "    '''\n", + "    Takes in a list of alphas. Outputs a dataframe containing the coefficients of lasso regressions from each alpha.\n", + "    '''\n", + "    # Create an empty data frame\n", + "    df = pd.DataFrame()\n", + "    \n", + "    # Create a column of feature names\n", + "    df['Feature Name'] = names\n", + "    \n", + "    # For each alpha value in the list of alpha values,\n", + "    for alpha in alphas:\n", + "        # Create a lasso regression with that alpha value,\n", + "        lasso_alpha = Lasso(alpha = alpha)\n", + "        \n", + "        # Fit the lasso regression\n", + "        lasso_alpha.fit(X_treinamento, y_treinamento)\n", + "        \n", + "        # Create a column name for that alpha value\n", + "        column_name = 'Alpha = %f' % alpha\n", + "\n", + "        # Create a column of coefficient values\n", + "        df[column_name] = lasso_alpha.coef_\n", + "    \n", + "    # Return the dataframe\n", + "    return df" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VEDvXvuNRK0C" + }, + "source": [ + "names = X_treinamento.columns\n", + "\n", + "# Alpha values (alpha = 0 makes scikit-learn warn and suggest LinearRegression instead):\n", + "lasso_coefs([.0001, .001, 0, .01, .1, 1, 10, 100])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xFlvTUJKhwgW" + }, + "source": [ + "### Capturing the most important elements" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4_-sUgMIhzmE" + }, + "source": [ + "# Assumption: lm_sm4, the last statsmodels OLS fit above, is the model of interest\n", + "r_squared = lm_sm4.rsquared\n", + "r_squared_adj = lm_sm4.rsquared_adj\n", + "coeficientes_regressao = lm_sm4.params" + ], 
+ "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "apGv5ytnimsM" + }, + "source": [ + "SEE: https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uhokzxtcil8w" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jSYw6SdcXa0q" + }, + "source": [ + "### Cross-Validation & GridSearch for LASSO" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E14i4Y3rqEX2" + }, + "source": [ + "### RMSE formula\n", + "$$RMSE = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}(y_{i}-\\hat{y}_{i})^{2}}$$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "irFZAkvVXfya" + }, + "source": [ + "from sklearn.linear_model import LassoCV\n", + "from sklearn.model_selection import RepeatedKFold" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "T3Jjom8RYdly" + }, + "source": [ + "# define model evaluation method\n", + "cv = RepeatedKFold(n_splits = 5, n_repeats = 3, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Cw3lAvRPYgJe" + }, + "source": [ + "# define model\n", + "model = LassoCV(alphas = np.arange(0.001, 10, 0.001), cv = cv, n_jobs = -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oLX3CpThXvkJ" + }, + "source": [ + "# fit model\n", + "model.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "U1ubd5huYQ7u" + }, + "source": [ + "# summarize chosen configuration\n", + "print('alpha: %f' % model.alpha_)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9P7hYoo4gF0Z" + }, + "source": [ + "# Elastic Net \n", + "* Combines the strengths of Ridge and LASSO;\n", + "* 
Removes variables with little predictive power (LASSO) or penalizes them (Ridge)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yChNUYs7gF0b" + }, + "source": [ + "from sklearn.linear_model import ElasticNet\n", + "from sklearn.model_selection import GridSearchCV\n", + "\n", + "# Instantiate the object\n", + "en = ElasticNet(alpha = .1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S1m3SL2avMbd" + }, + "source": [ + "General pattern: transformation.fit(data_I_want_to_transform)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4mbIaAUAF4N6" + }, + "source": [ + "en.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MaUkZw8ngF0h" + }, + "source": [ + "list(zip(X_treinamento.columns, en.coef_))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K7LuPhCtvouJ" + }, + "source": [ + "### GridSearch to find the $\\alpha$ for Elastic Net" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xl-Qh9caDyCp" + }, + "source": [ + "# Instantiate the object:\n", + "# Note: `normalize` was removed in scikit-learn 1.2; on newer versions, scale the features beforehand instead.\n", + "en = ElasticNet(normalize = True)\n", + "\n", + "# Hyperparameter optimization:\n", + "d_hiperparametros = {'alpha': np.logspace(-5, 2, 8), \n", + "                     'l1_ratio': [.2, .4, .6, .8]}\n", + "\n", + "search = GridSearchCV(estimator = en, # Elastic Net\n", + "                      param_grid = d_hiperparametros, # Dictionary with the hyperparameters\n", + "                      scoring = 'neg_mean_squared_error', # Negated MSE (Mean Squared Error): scikit-learn scorers are maximized\n", + "                      n_jobs = -1, # Use all available processors\n", + "                      refit = True, \n", + "                      cv = 10) # Number of cross-validation folds" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JvNQyUW_2QLr" + }, + "source": [ + "### Exercise (Statistics): Suggested 
manual tuning\n", + "* Study statistically the frequency distribution of how often each variable is significant (at the 5% level) over 100 fits." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hp1hV5YahsJb" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ng0rPXfA1DgS" + }, + "source": [ + "# Sketch of the exercise: count how often each variable is significant (5% level) over 100 random splits.\n", + "# Assumption: X (DataFrame of covariates) and y (target) were defined earlier in the notebook.\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "contagem_significancia = pd.Series(0, index = X.columns)\n", + "for i in range(100):\n", + "    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size = 0.2, random_state = i)\n", + "    modelo_i = sm.OLS(y_tr, sm.add_constant(X_tr, has_constant = 'add')).fit() # intercept + regression coefficients\n", + "    p_valores = modelo_i.pvalues.drop('const') # significance of each parameter\n", + "    contagem_significancia += (p_valores < 0.05).astype(int)\n", + "    y_predict = modelo_i.predict(sm.add_constant(X_te, has_constant = 'add'))\n", + "\n", + "contagem_significancia" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "c3_XCQCPGlr3" + }, + "source": [ + "search.fit(X_treinamento, y_treinamento)\n", + "\n", + "# Returns the best hyperparameters found:\n", + "search.best_params_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zq0_ugQfGrdb" + }, + "source": [ + "en2 = ElasticNet(normalize = True, alpha = 0.001, l1_ratio = 0.6)\n", + "en2.fit(X_treinamento, y_treinamento)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ILA5lScUx-Ub" + }, + "source": [ + "\n", + "# Metric\n", + "ml2 = (mean_squared_error(y_true = y_teste, y_pred = en2.predict(X_teste)))\n", + "# 'neg_mean_squared_error' is only a scorer name used in cross-validation; its value is -MSE." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BzO_dHRixd_L" + }, + "source": [ + "print(f\"MSE: {ml2}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zaEwh3t3zwFc" + }, + "source": [ + "**Conclusion**:\n", + "* Comparing the MSEs: the regression without regularization produced an MSE of 23.94. 
As we can see, Elastic Net produces an MSE of 15.4." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5geUMgC6ztxE" + }, + "source": [ + "### Elastic Net coefficients:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LyLdASRqzwCq" + }, + "source": [ + "# Use the fitted Elastic Net (en2), not the Ridge model\n", + "list(zip(X_treinamento.columns, abs(en2.coef_)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "90pfP9-3OkJG" + }, + "source": [ + "Note above that the second coefficient was estimated as 0, so we can drop it from the ML model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ILCXvYKDOkJH" + }, + "source": [ + "### Elastic Net example" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GaQPDCR2OkJI" + }, + "source": [ + "from sklearn.linear_model import ElasticNet\n", + "\n", + "# Instantiate the object\n", + "en = ElasticNet(alpha = .1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xVp16Eu_OkJL" + }, + "source": [ + "en.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kwj018U8OkJO" + }, + "source": [ + "en.coef_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rJRWBzSQCcss" + }, + "source": [ + "# Logistic Regression\n", + "\n", + "* In linear regression we model the linear relationship between the features ($X_{np} = [X_{1}, X_{2}, ..., X_{p}]$) and the response with a line (a hyperplane when $p > 1$) given by the equation:\n", + "\n", + "$$\\hat{y}= \\beta_{0}+\\beta_{1}x_{1}+\\beta_{2}x_{2}+...+\\beta_{p}x_{p}$$\n", + "\n", + "For classification, Logistic Regression returns probabilities (between 0 and 1), given by the logistic equation (also known as the **sigmoid function**):\n", + "\n", + "$$P[y = 
1]= \\frac{1}{1+e^{-(\\beta_{0}+\\beta_{1}x_{1}+\\beta_{2}x_{2}+...+\\beta_{p}x_{p})}}$$\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vj83Altwdni7" + }, + "source": [ + "![SigmoidFunction](https://github.com/MathMachado/Materials/blob/master/SigmoidFunction.PNG?raw=true)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LS1QjQnknqe5" + }, + "source": [ + "## Assumptions of Logistic Regression\n", + "* There are no null values in the dataset;\n", + "* The response variable $y$ is binary (0 or 1) or ordinal (a categorical variable with ordered values, for example when estimating wine quality);\n", + "* The predictor variables $X$ are independent of each other;\n", + "* There are at least 50 observations for each predictor variable in the model (the more, the better), which helps make the results reliable;\n", + "* The classes of the response variable are balanced." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YGvpGTAd4jO" + }, + "source": [ + "# Example 1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-LBYRG__e_Zv" + }, + "source": [ + "### Load the libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XX2GNYWue-iA" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt # Plots\n", + "import seaborn as sns # Plots\n", + "%matplotlib inline\n", + "\n", + "# Classifiers\n", + "import statsmodels.api as sm\n", + "import statsmodels.formula.api as smf\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "# Metrics\n", + "from sklearn.metrics import roc_auc_score, roc_curve, classification_report, accuracy_score, confusion_matrix, auc" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RpNu-JjJfBYe" + }, + "source": [ + "### Load the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"dWVj8SmUeBZB", + "outputId": "ddb92623-228d-4621-90f9-b80b1a0d06c9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_Original.csv'\n", + "df_titanic = pd.read_csv(url)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass ... Fare Cabin Embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 289 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T9vZGvU5qbsQ", + "outputId": "6ad44206-7129-4e4e-a03d-cec7aeab3449", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "df_titanic.columns = [coluna.lower() for coluna in df_titanic.columns]\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 290 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fAYAg5tofDgQ" + }, + "source": [ + "### Entendendo os dados\n", + "* sibsp - número of siblings/esposas abordo do Titanic;\n", + "* parch - número de parentes/crianças abordo do Titanic;\n", + "* embarked - Cidade/Portão de embarque: C = Cherbourg, Q = Queenstown, S = Southampton." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZbijPdpFxdZy" + }, + "source": [ + "#### A variável-target é do tipo binária ou categórica ordinal?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7hspb3IMe5tx", + "outputId": "684c59df-d788-400f-a179-0774a39e4303", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['survived'].value_counts()/df_titanic.shape[0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 0.616162\n", + "1 0.383838\n", + "Name: survived, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 291 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tsp4t7oxx3zC" + }, + "source": [ + "A seguir, o gráfico da variável-target:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vm0BDjw-xrGI", + "outputId": "443def8e-dee6-40f5-8bcf-b33a43bb5c15", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "source": [ + "sns.countplot(x = 'survived', data = df_titanic)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 
292 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEGCAYAAACKB4k+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAPQUlEQVR4nO3dfbDmZV3H8fcHFqR84MHdNtyllpLJoRTFE5HaVJAFZC5jgjgaK+7M1gw1OmZG/ZEPQ42OlmEatRPqQiUgZmxmGrNApgPq2UQeMzeC2A3cI0+KZLn27Y9z7cVhObvcZ9nfuc9y3q+Ze+7rd/2u3+/+3szO+XD9nu5UFZIkARww7gIkSQuHoSBJ6gwFSVJnKEiSOkNBktQtGXcBT8TSpUtr1apV4y5DkvYrmzdv/npVLZtt3X4dCqtWrWJycnLcZUjSfiXJnbtb5+EjSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUrdf39G8L7zwty4edwlagDa/++xxlyCNhTMFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkbNBSS3JHkpiQ3JJlsfUckuSrJV9v74a0/Sd6XZEuSG5McP2RtkqTHmo+Zws9W1fOraqItnwdsqqpjgE1tGeBU4Jj2WgdcOA+1SZJmGMfho9XAhtbeAJw+o//imnY9cFiSI8dQnyQtWkOHQgH/mGRzknWtb3lV3d3a9wDLW3sFcNeMbbe2vkdJsi7JZJLJqampoeqWpEVp6J/jfElVbUvyfcBVSf515sqqqiQ1lx1W1XpgPcDExMSctpUk7dmgM4Wq2tbetwMfB04AvrbzsFB7396GbwOOmrH5ytYnSZong4VCkqcmefrONvDzwM3ARmBNG7YGuLK1NwJnt6uQTgQenHGYSZI0D4Y8fLQc+HiSnZ/z11X1qSRfBC5Psha4Ezizjf8kcBqwBXgYOGfA2iRJsxgsFKrqduC4WfrvBU6epb+Ac4eqR5L0+LyjWZLUGQqSpM5QkCR1hoIkqTMUJEmdoSBJ6gwFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJnaEgSeoMBUlSZyhIkjpDQZLUGQqSpM5QkCR1hoIkqTMUJEnd4KGQ5MAkX0ryibZ8dJLPJ9mS5LIkB7f+p7TlLW39qqFrkyQ92nzMFN4A3DZj+V3Ae6vq2cD9wNrWvxa4v/W/t42TJM2jQUMhyUrgF4G/aMsBTgKuaEM2AKe39uq2TFt/chsvSZonQ88U/hh4C/B/bfmZwANVtaMtbwVWtPYK4C6Atv7BNv5RkqxLMplkcmpqasjaJWnRGSwUkrwM2F5Vm/flfqtqfVVNVNXEsmXL9uWuJWnRWzLgvl8MvDzJacAhwDOAC4DDkixps4GVwLY2fhtwFLA1yRLgUODeAeuTJO1isJlCVf1OVa2sqlXAWcDVVfUa4BrglW3YGuDK1t7Ylmnrr66qGqo+SdJjjeM+hd8G3pRkC9PnDC5q/RcBz2z9bwLOG0NtkrSoDXn4qKuqa4FrW/t24IRZxnwbOGM+6pEkzc47miVJnaEgSeoMBUlSZyhIkjpDQZLUGQqSpM5QkCR1hoIkqTMUJEmdoSBJ6gwFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpm5cf2ZE0d//5jueOuwQtQD/wezcNun9nCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1I0UCkk2jdInSdq/7fGO5iSHAN8LLE1yOJC26hnAioFrkyTNs8d7zMWvAm8EngVs5pFQ+Abw/gHrkiSNwR4PH1
XVBVV1NPDmqvqhqjq6vY6rqj2GQpJDknwhyZeT3JLk7a3/6CSfT7IlyWVJDm79T2nLW9r6VfvoO0qSRjTSA/Gq6k+SvAhYNXObqrp4D5v9D3BSVT2U5CDgs0n+AXgT8N6qujTJnwFrgQvb+/1V9ewkZwHvAl61N19KkrR3Rj3RfAnwHuAlwI+318SetqlpD7XFg9qrgJOAK1r/BuD01l7dlmnrT06y83CVJGkejPro7Ang2Kqquew8yYFMn4t4NvAB4N+BB6pqRxuylUdOWK8A7gKoqh1JHgSeCXx9Lp8pSdp7o96ncDPw/XPdeVV9t6qeD6wETgCeM9d97CrJuiSTSSanpqae6O4kSTOMOlNYCtya5AtMnysAoKpePsrGVfVAkmuAnwQOS7KkzRZWAtvasG3AUcDWJEuAQ4F7Z9nXemA9wMTExJxmLpKkPRs1FN421x0nWQZ8pwXC9wAvZfrk8TXAK4FLgTXAlW2TjW35urb+6rkerpIkPTGjXn30T3ux7yOBDe28wgHA5VX1iSS3ApcmOR/4EnBRG38RcEmSLcB9wFl78ZmSpCdgpFBI8k2mrxwCOJjpK4m+VVXP2N02VXUj8IJZ+m9n+vzCrv3fBs4YpR5J0jBGnSk8fWe7XSa6GjhxqKIkSeMx56ektvsP/hb4hQHqkSSN0aiHj14xY/EApu9b+PYgFUmSxmbUq49+aUZ7B3AH04eQJElPIqOeUzhn6EIkSeM36rOPVib5eJLt7fWxJCuHLk6SNL9GPdH8IaZvLntWe/1d65MkPYmMGgrLqupDVbWjvT4MLBuwLknSGIwaCvcmeW2SA9vrtczyXCJJ0v5t1FB4PXAmcA9wN9PPJnrdQDVJksZk1EtS3wGsqar7AZIcwfSP7rx+qMIkSfNv1JnC83YGAkBV3ccszzWSJO3fRg2FA5IcvnOhzRRGnWVIkvYTo/5h/0PguiQfbctnAL8/TEmSpHEZ9Y7mi5NMAie1rldU1a3DlSVJGoeRDwG1EDAIJOlJbM6PzpYkPXkZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJ3WChkOSoJNckuTXJLUne0PqPSHJVkq+298Nbf5K8L8mWJDcmOX6o2iRJsxtyprAD+M2qOhY4ETg3ybHAecCmqjoG2NSWAU4FjmmvdcCFA9YmSZrFYKFQVXdX1b+09jeB24AVwGpgQxu2ATi9tVcDF9e064HDkhw5VH2SpMeal3MKSVYBLwA+DyyvqrvbqnuA5a29ArhrxmZbW9+u+1qXZDLJ5NTU1GA1S9JiNHgoJHka8DHgjVX1jZnrqqqAmsv+qmp9VU1U1cSyZcv2YaWSpEFDIclBTAfCX1XV37Tur+08LNTet7f+bcBRMzZf2fokSfNkyKuPAlwE3FZVfzRj1UZgTWuvAa6c0X92uwrpRODBGYeZJEnzYMmA+34x8CvATUluaH2/C7wTuDzJWuBO4My27pPAacAW4GHgnAFrkyTNYrBQqKrPAtnN6pNnGV/AuUPVI0l6fN7RLEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJnaEgSeoMBUlSZyhIkjpDQZLUGQqSpM5QkCR1hoIkqTMUJEmdoSBJ6gwFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqRusFBI8sEk25PcPKPviCRXJflqez+89SfJ+5JsSXJjkuOHqkuStHtDzhQ+DJyyS995wKaqOgbY1JYBTgWOaa91wIUD1iVJ2o3BQqGqPgPct0v3amBDa28ATp/Rf3FNux44LMmRQ9UmSZrdfJ9TWF5Vd7f2PcDy1l4B3DVj3NbW9xhJ1iWZTDI5NTU1XKWStAiN7URzVRVQe7Hd+qqaqKqJZcuWDVCZJC1e8x0KX9t5WKi9b2/924CjZoxb2fokSfNovkNhI7CmtdcAV87oP7
tdhXQi8OCMw0ySpHmyZKgdJ/kI8DPA0iRbgbcC7wQuT7IWuBM4sw3/JHAasAV4GDhnqLokSbs3WChU1at3s+rkWcYWcO5QtUiSRuMdzZKkzlCQJHWGgiSpMxQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJnaEgSeoMBUlSZyhIkjpDQZLUGQqSpM5QkCR1hoIkqTMUJEmdoSBJ6gwFSVJnKEiSOkNBktQZCpKkzlCQJHWGgiSpMxQkSZ2hIEnqFlQoJDklyVeSbEly3rjrkaTFZsGEQpIDgQ8ApwLHAq9Ocux4q5KkxWXBhAJwArClqm6vqv8FLgVWj7kmSVpUloy7gBlWAHfNWN4K/MSug5KsA9a1xYeSfGUealsslgJfH3cRC0Hes2bcJejR/Le501uzL/byg7tbsZBCYSRVtR5YP+46noySTFbVxLjrkHblv835s5AOH20DjpqxvLL1SZLmyUIKhS8CxyQ5OsnBwFnAxjHXJEmLyoI5fFRVO5L8OvBp4EDgg1V1y5jLWmw8LKeFyn+b8yRVNe4aJEkLxEI6fCRJGjNDQZLUGQry8SJasJJ8MMn2JDePu5bFwlBY5Hy8iBa4DwOnjLuIxcRQkI8X0YJVVZ8B7ht3HYuJoaDZHi+yYky1SBozQ0GS1BkK8vEikjpDQT5eRFJnKCxyVbUD2Pl4kduAy328iBaKJB8BrgN+JMnWJGvHXdOTnY+5kCR1zhQkSZ2hIEnqDAVJUmcoSJI6Q0GS1BkK0kCSvHxfPXU2yUP7Yj/S4/GSVOkJSLKk3esx9Oc8VFVPG/pzJGcKEpDkqUn+PsmXk9yc5FVJ7kiytK2fSHJta78tySVJPgdckuT6JD86Y1/XtvGvS/L+JIcmuTPJATM+664kByX54SSfSrI5yT8neU4bc3SS65LclOT8+f8vosXKUJCmnQL8V1UdV1U/BnzqccYfC/xcVb0auAw4EyDJkcCRVTW5c2BVPQjcAPx063oZ8Omq+g7TP0j/G1X1QuDNwJ+2MRcAF1bVc4G798UXlEZhKEjTbgJemuRdSX6q/SHfk41V9d+tfTnwytY+E7hilvGXAa9q7bOAy5I8DXgR8NEkNwB/DhzZxrwY+EhrXzLnbyPtpSXjLkBaCKrq35IcD5wGnJ9kE7CDR/7H6ZBdNvnWjG23Jbk3yfOY/sP/a7N8xEbgD5IcAbwQuBp4KvBAVT1/d2Xt9ReS9pIzBQlI8izg4ar6S+DdwPHAHUz/AQf45cfZxWXAW4BDq+rGXVdW1UNMP5H2AuATVfXdqvoG8B9Jzmg1JMlxbZPPMT2jAHjNXn8xaY4MBWnac4EvtMM4bwXOB94OXJBkEvju42x/BdN/xC/fw5jLgNe2951eA6xN8mXgFh75KdQ3AOcmuQl/CU/zyEtSJUmdMwVJUmcoSJI6Q0GS1BkKkqTOUJAkdYaCJKkzFCRJ3f8DThe6X9gR+9IAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XfhFG6Axxj6F" + }, + "source": [ + "Como podemos ver, a variável-resposta 'survived' é binária. Portanto, tudo ok até agora." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zRKhDX6ZraGU" + }, + "source": [ + "### Tratamento dos Missing Values\n", + "* Substituir os NaN's por mediana da variável" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qPbILjZyrhRZ", + "outputId": "52f34626-1875-4632-cce3-8a694863ea6c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 177\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 293 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uJUPufRossTo" + }, + "source": [ + "Cálculo da mediana da variável/preditora 'age'" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WGW9bW5x4JdT", + "outputId": "c0ce5e66-70fe-4195-c637-2ac8a4cf9d37", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "#df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 294 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DgAwrR8msYv_" + }, + "source": [ + "mediana_age = df_titanic['age'].median()\n", + "mediana_fare = df_titanic['fare'].median()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "yqIgckarzwdB", + "outputId": "a190a691-e377-42ba-a8f8-98d430ccabbd", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "mediana_age" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 297 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "czdSVeLjzxAX", + "outputId": "48bdfb0b-a153-482e-9fa6-6706bd44b88d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "mediana_fare" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "14.4542" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 298 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u4vcCshcsv6w" + }, + "source": [ + "Substituição dos NaN's da variável 'age' e 'fare' pela respetiva mediana" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tnOOsaqLsg03", + "outputId": "81837607-f6b7-4bc2-e295-443679d3deb2", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['age'].fillna(mediana_age, inplace = True)\n", + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + 
"text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 299 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VqAnNxnO0Ghn" + }, + "source": [ + "Dado que fare não possui NaN's, então nada a fazer." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Hi2zG_ms6n-" + }, + "source": [ + "#### Usando Imputer\n", + "* Método para tratamento de Missing Values.\n", + " Scikit Learn: https://scikit-learn.org/stable/modules/impute.html" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mvCnGfCOri9Y", + "outputId": "6d8b7f52-ca60-4bbd-bc50-9e55f3b9bd17", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from sklearn.impute import SimpleImputer\n", + "\n", + "# fit()\n", + "imputer_mv = SimpleImputer(strategy = 'median')\n", + "imputer_mv.fit(df_titanic_copia[['age', 'fare']])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='median', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 300 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SokJ8HM61FcK", + "outputId": "65052934-48c4-4c6d-b206-b14d8fa10dc3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "imputer_mv" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='median', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 301 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"X-qx8QsQthyU", + "outputId": "469c2591-1ea2-4dfd-8395-df11821f5951", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "# transform()\n", + "df_titanic_mediana = pd.DataFrame(imputer_mv.transform(df_titanic[['age', 'fare']]), columns = ['age2', 'fare2'])\n", + "df_titanic_mediana.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
age2fare2
022.07.2500
138.071.2833
226.07.9250
335.053.1000
435.08.0500
\n", + "
" + ], + "text/plain": [ + " age2 fare2\n", + "0 22.0 7.2500\n", + "1 38.0 71.2833\n", + "2 26.0 7.9250\n", + "3 35.0 53.1000\n", + "4 35.0 8.0500" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 302 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KS-xYf5BuwEt", + "outputId": "c7e01602-c917-48b7-a7f1-33ce365f21ac", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic_mediana.median()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "age2 28.0000\n", + "fare2 14.4542\n", + "dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 303 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lggbmAD2vN42", + "outputId": "55134b01-f993-4f01-b053-7600c45eec21", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic_copia.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 177\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 304 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8fQ6a7RSvUOp", + "outputId": "75770554-a802-4012-a068-76b2e1ba1578", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 498 + } + }, + "source": [ + "df_titanic['age'] = df_titanic_mediana['age2']\n", + "\n", + "# Não há NaN's na variável fare. Portanto, nenhuma alteração\n", + "#df_titanic['fare'] = df_titanic_mediana['fare']\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 305 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HSncMlT51oM5", + "outputId": "85057833-9c05-4805-d2d0-ad6a3af0d140", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "passengerid 0\n", + "survived 0\n", + "pclass 0\n", + "name 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "ticket 0\n", + "fare 0\n", + "cabin 687\n", + "embarked 2\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 306 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c48gJg0q4zgj" + }, + "source": [ + "Exclui as colunas que não são mais necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7OzK7DnDg2WY", + "outputId": "462a1c3e-d2b3-4d5d-b745-945706522481", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 498 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", + "
" + ], + "text/plain": [ + " passengerid survived pclass ... fare cabin embarked\n", + "0 1 0 3 ... 7.2500 NaN S\n", + "1 2 1 1 ... 71.2833 C85 C\n", + "2 3 1 3 ... 7.9250 NaN S\n", + "3 4 1 1 ... 53.1000 C123 S\n", + "4 5 0 3 ... 8.0500 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 307 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oLpWbzz84ykm", + "outputId": "cdb706ed-5395-47d0-8f52-7887edcde8ee", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic.drop(columns = ['passengerid', 'name', 'ticket', 'cabin'], axis = 1, inplace = True)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked
003male22.0107.2500S
111female38.01071.2833C
213female26.0007.9250S
311female35.01053.1000S
403male35.0008.0500S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked\n", + "0 0 3 male 22.0 1 0 7.2500 S\n", + "1 1 1 female 38.0 1 0 71.2833 C\n", + "2 1 3 female 26.0 0 0 7.9250 S\n", + "3 1 1 female 35.0 1 0 53.1000 S\n", + "4 0 3 male 35.0 0 0 8.0500 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 308 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NZei3VxSxR6g" + }, + "source": [ + "Alternativamente, poderíamos concatenar os dois dataframes usando pd.concat()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ek2qBdOFw2p5", + "outputId": "7777706c-55e4-4bfd-822f-cf3843922f3e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic = df_titanic_copia.copy()\n", + "\n", + "df_titanic.drop(columns = ['passengerid', 'name', 'ticket', 'cabin', 'fare', 'age'], axis = 1, inplace = True)\n", + "df_titanic = pd.concat([df_titanic, df_titanic_mediana], axis = 1)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchembarkedage2fare2
003male10S22.07.2500
111female10C38.071.2833
213female00S26.07.9250
311female10S35.053.1000
403male00S35.08.0500
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp parch embarked age2 fare2\n", + "0 0 3 male 1 0 S 22.0 7.2500\n", + "1 1 1 female 1 0 C 38.0 71.2833\n", + "2 1 3 female 0 0 S 26.0 7.9250\n", + "3 1 1 female 1 0 S 35.0 53.1000\n", + "4 0 3 male 0 0 S 35.0 8.0500" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6omsobg77tRv" + }, + "source": [ + "#### Tratamento dos NaN's da variável 'embarked'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YjeivMbz85gg" + }, + "source": [ + "A seguir, listamos as linhas em que embarked = NaN:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Mc03_AnI8QgV", + "outputId": "ebe1ecc6-2c40-429d-d608-3b72ef10acd2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 110 + } + }, + "source": [ + "embarked_NaN = df_titanic[df_titanic['embarked'].isna()]\n", + "embarked_NaN.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked
6111female38.00080.0NaN
82911female62.00080.0NaN
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked\n", + "61 1 1 female 38.0 0 0 80.0 NaN\n", + "829 1 1 female 62.0 0 0 80.0 NaN" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 309 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xsbeFBFp7zRM", + "outputId": "58859c46-d711-4558-aadf-70480e67c98b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from sklearn.impute import SimpleImputer\n", + "\n", + "# fit()\n", + "imputer_mv = SimpleImputer(strategy = 'most_frequent')\n", + "imputer_mv.fit(df_titanic[['embarked']])" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "SimpleImputer(add_indicator=False, copy=True, fill_value=None,\n", + " missing_values=nan, strategy='most_frequent', verbose=0)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 310 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "f2kDtHVN761L" + }, + "source": [ + "# transform()\n", + "df_embarked_freq = pd.DataFrame(imputer_mv.transform(df_titanic[['embarked']]), columns = ['embarked2'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0JmoLrzD8NwW", + "outputId": "d8ac60c0-a440-42b1-d7d5-c00168c3956f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic = pd.concat([df_titanic, df_embarked_freq], axis = 1)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedembarked2
003male22.0107.2500SS
111female38.01071.2833CC
213female26.0007.9250SS
311female35.01053.1000SS
403male35.0008.0500SS
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked embarked2\n", + "0 0 3 male 22.0 1 0 7.2500 S S\n", + "1 1 1 female 38.0 1 0 71.2833 C C\n", + "2 1 3 female 26.0 0 0 7.9250 S S\n", + "3 1 1 female 35.0 1 0 53.1000 S S\n", + "4 0 3 male 35.0 0 0 8.0500 S S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 312 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FRxX9c4--TCg" + }, + "source": [ + "COMPARE BEFORE and AFTER: the rows below, where [embarked] = NaN, show what the missing values were replaced with in [embarked2]..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oQFDqatz9bMv", + "outputId": "45d6ab98-b832-4844-8d66-6d47cbcee08e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 110 + } + }, + "source": [ + "embarked_NaN = df_titanic[df_titanic['embarked'].isna()]\n", + "embarked_NaN" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarkedembarked2
6111female38.00080.0NaNS
82911female62.00080.0NaNS
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked embarked2\n", + "61 1 1 female 38.0 0 0 80.0 NaN S\n", + "829 1 1 female 62.0 0 0 80.0 NaN S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 313 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jgCuXei2ZTQl" + }, + "source": [ + "As we can see, every NaN in the embarked variable was replaced by 'S', the most frequent value (the imputer was fitted with strategy = 'most_frequent'). Are you comfortable with this substitution?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r3r8ObKn-nBt" + }, + "source": [ + "df_titanic.drop(columns = ['embarked'], inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OacQvrYeAPBR" + }, + "source": [ + "Final check for NaNs:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OHBv7CrjARol", + "outputId": "df1e556b-21dd-42a2-df08-4c0046f1f3b1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.isna().sum()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived 0\n", + "pclass 0\n", + "sex 0\n", + "age 0\n", + "sibsp 0\n", + "parch 0\n", + "fare 0\n", + "embarked2 0\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 315 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ITFMsiBSyAHY" + }, + "source": [ + "### Does the dataframe under analysis have (at least) 50 observations per predictor?\n", + "* Predictor variables: pclass, sex, age, sibsp, parch, fare, embarked2 --> 7 predictor variables.\n", + "* Therefore, our dataframe needs at least 7 x 50 = 350 rows." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4lgVp2N8yE1C", + "outputId": "2dbea822-609b-4576-c3e2-f89b3527db1c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.info()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 891 entries, 0 to 890\n", + "Data columns (total 8 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 survived 891 non-null int64 \n", + " 1 pclass 891 non-null int64 \n", + " 2 sex 891 non-null object \n", + " 3 age 891 non-null float64\n", + " 4 sibsp 891 non-null int64 \n", + " 5 parch 891 non-null int64 \n", + " 6 fare 891 non-null float64\n", + " 7 embarked2 891 non-null object \n", + "dtypes: float64(2), int64(4), object(2)\n", + "memory usage: 55.8+ KB\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rFwtnAcw23gQ", + "outputId": "2b9b006b-3dee-493a-e9a6-adcf90923373", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "891/7" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "127.28571428571429" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 317 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wLqz2V7SytPU" + }, + "source": [ + "Assumption met? With 891 rows, about 127 per predictor and well above the minimum of 50 per predictor, it is.\n", + "So we can proceed with the analysis..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wm0VycfhovW8" + }, + "source": [ + "#### Checking the assumption of independent predictor variables\n", + "* Spearman's coefficient (developed by Charles Spearman), also known as Spearman's rank correlation coefficient and denoted by the Greek letter $\\rho$ (rho).\n", + "* It is a statistical method for measuring the correlation between 2 ordinal variables." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "29knlUdcztb1", + "outputId": "adb2e1a5-2436-4327-8eff-4a65b66d4e0b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexsibspparchfareage2ambarked2
003male107.250022.0S
111female1071.283338.0C
213female007.925026.0S
311female1053.100035.0S
403male008.050035.0S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp parch fare age2 ambarked2\n", + "0 0 3 male 1 0 7.2500 22.0 S\n", + "1 1 1 female 1 0 71.2833 38.0 C\n", + "2 1 3 female 0 0 7.9250 26.0 S\n", + "3 1 1 female 1 0 53.1000 35.0 S\n", + "4 0 3 male 0 0 8.0500 35.0 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 102 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J5EEcU7l0E2B" + }, + "source": [ + "Next, the independence hypothesis we want to test:\n", + "\n", + "$H_{0}$: the variables are independent --> if the p-value < 5% --> there is evidence to reject $H_{0}$." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tj8A_Kp0qxp_" + }, + "source": [ + "from scipy.stats import spearmanr" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kFxVGHPUpKLi" + }, + "source": [ + "coef, p = spearmanr(df_titanic['pclass'], df_titanic['age'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fvzvyvK7qzib", + "outputId": "d1e8d723-5048-4360-bad4-9ce0b8172bbf", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "print('Coeficiente de Correlação de Spearman: %.3f' % coef)\n", + "\n", + "# Interpreting the significance (p is compared against alpha):\n", + "alpha = 0.05\n", + "if p > alpha:\n", + "\tprint('Amostras NÃO correlacionadas (falha em rejeitar H0) p = %.3f' %p)\n", + "else:\n", + "\tprint('Amostras correlacionadas (Rejeita H0) p = %.3f' %p)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Coeficiente de Correlação de Spearman: -0.317\n", + "Amostras correlacionadas (Rejeita H0) p = 0.000\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yespibmf1WVh" + }, + "source": [ + "## Data Transformation\n", + "* MinMaxScaler and RobustScaler\n", + "* Binning (categorization of numeric variables/predictors): fare and age\n", + "* 
Outliers" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UwLpj8PXKFuL" + }, + "source": [ + "### Outlier treatment\n", + "* variables: age and fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sTTgUx9oiWdJ", + "outputId": "fd14b9f5-7e25-4416-bfc7-e318e79d3249", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "#df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2
003male22.0107.2500S
111female38.01071.2833C
213female26.0007.9250S
311female35.01053.1000S
403male35.0008.0500S
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2\n", + "0 0 3 male 22.0 1 0 7.2500 S\n", + "1 1 1 female 38.0 1 0 71.2833 C\n", + "2 1 3 female 26.0 0 0 7.9250 S\n", + "3 1 1 female 35.0 1 0 53.1000 S\n", + "4 0 3 male 35.0 0 0 8.0500 S" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 321 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-7v8WaB4aEKv" + }, + "source": [ + "from scipy import stats \n", + "\n", + "def trata_outliers(df, coluna):\n", + " sns.boxplot(x = coluna, data = df)\n", + "\n", + " # Compute Q1, Q3 and IQR:\n", + " Q1 = np.percentile(df[coluna], 25)\n", + " Q3 = np.percentile(df[coluna], 75)\n", + " IQR = Q3 - Q1\n", + " print(f\"IQR: {IQR}\")\n", + "\n", + " # Easier (less laborious) alternative:\n", + " #IQR2 = stats.iqr(df[coluna]) \n", + " #IQR2 \n", + "\n", + " # Compute the lower and upper limits used to detect outliers:\n", + " limite_inferior_outliers = Q1 - 1.5*IQR\n", + " limite_superior_outliers = Q3 + 1.5*IQR\n", + " print(f\"Limite inferior para outlier: {limite_inferior_outliers}; Limite superior para outliers: {limite_superior_outliers}\")\n", + "\n", + " # Compute the median\n", + " mediana = df[coluna].median()\n", + " print(f\"Mediana: {mediana}\")\n", + "\n", + " # Replace the outliers. The '_o' suffix marks a column that went through outlier treatment.\n", + " # Note: only values above the upper limit are replaced; the lower limit is only reported.\n", + " df[coluna+'_o'] = df[coluna]\n", + "\n", + " df.loc[df[coluna] > limite_superior_outliers, coluna+'_o'] = np.nan\n", + " df[coluna+'_o'] = df[coluna+'_o'].fillna(mediana) # assign back instead of an inplace fillna on a column selection\n", + "\n", + " return df, limite_superior_outliers" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pwAExKTWaOSf", + "outputId": "089cba96-b805-4eef-d504-f8c81551c938", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + } + }, + "source": [ + "df_titanic, limite_superior_outliers = trata_outliers(df = 
df_titanic, coluna = 'age')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "IQR: 13.0\n", + "Limite inferior para outlier: 2.5; Limite superior para outliers: 54.5\n", + "Mediana: 28.0\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEGCAYAAABbzE8LAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAPXUlEQVR4nO3df2ychX3H8c83vtGGpAXioAgctmvlrhla1rSNOlCrzc7CmpLRamqRyA8wIhAmdU4Ck6YC0WJLAW3S5BFlbBKDFJhIWiUtkECUNSHepCGNYrehCSS0t9VtYxWSOi1tfqyryXd/PM+Zu7Nj+xzffR/j90uy8PM8vuf5Xu7uzePHv8zdBQCovxnRAwDAdEWAASAIAQaAIAQYAIIQYAAIkqvmg+fOnev5fL5GowDAe1Nvb+/P3P3KyvVVBTifz6unp2fypgKAacDMfjTSei5BAEAQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABKnqb8Ih1tatW1UoFGqy7/7+fklSU1NTTfZfqbm5We3t7XU5FpBVBHgKKRQKOnTkqN65dM6k77vh7NuSpDd/XfunRMPZUzU/BjAVEOAp5p1L5+jcghsnfb8zj+2VpJrs+0LHAqY7rgEDQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAkEwFeOvWrdq6dWv0GEA4XgvTQy56gFKFQiF6BCATeC1MD5k6AwaA6YQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0AQAgwAQQgwAAQhwAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMAAEIcAAEIQAA0CQugR4YGBAq1atUktLi1paWvTII49Ikjo7O9XS0qIHH3ywHmMAU1ahUNDy5cvV29ur1atXq6WlRd3d3WXbCoWCJOm5555TS0uL9uzZM+L2gwcPlt2+p6dHS5YsUW9v77BtlbetXC69rZS81tetW6eBgYGLuo/r1q1TT09P2bFGczHHjdx3XQL85JNPqr+/f2h5586dkjT0IO/fv78eYwBT1ubNm3XmzBlt2rRJx48fl6ShE5fits2bN0uSHn74YUlSV1fXiNsfeuihstt3dHTo/Pnz2rRp07BtlbetXC69rZS81g8fPqynnnrqou7j4cOH1dHRUXas0VzMcSP3XfMADwwM6IUXXhi2fuXKlWXLnAUDIysUCurr65MknT59emj94OCgtm/fPrStr69Pjz32mNxdkuTu2rZtW9n2p59+WoODg0O3f/zxx4f2efr06bJtO3bsKLttd3d32fKe
PXvKbtvd3a19+/bJ3bVv376qzhgr76O7D+27r69v1LPggYGBCR93LLXctyRZ8cEaj8WLF3tPT09VB+jq6tLu3bvH9bFz587VuXPn1NzcXNUxpotCoaBf/Z/rzKJbJn3fM4/tlSSdW3DjpO+70qxDX9MHLjEe51EUCgXNnDlTu3bt0u233z4Up0i5XG4o0JJkZirtRy6Xk5TEO5fLafny5brnnnvGte+x7mM+n9cTTzwx4rauri7t3bt3Qscdy2Tt28x63X1x5foxz4DNbK2Z9ZhZz8mTJ6s+8IEDB6q+DYB3ZSG+ksriK0mVJ2+Dg4NlZ9DVXFoc6z6Otv3AgQMTPu5YarlvScqN9QHu/qikR6XkDLjaAyxdunTcZ8BNTU2SpC1btlR7mGlh/fr16v2ft6LHuGjn3/9BNX94Ho/zKNavXz/0fj6fz0SEqz0DvuGGG8a977HuYz6fv+C2pUuXlp2lVnPcsdRy31IdrgG3tbWpoaFh2Pqrr766bHmy7xjwXrFx48YLblu7dm3Z8urVq8uWb7vttrLlu+66q2z51ltvveC+77777rLlBx54oGz53nvvHbZ9xowkKQ0NDcOOPZrR7uNY29va2iZ83LHUct9SHQLc2Nio5cuXD1u/ffv2suXKBxdAorm5eegMcPbs2UPrc7mcVq5cObQtn8/rzjvvlJlJSs5Q77jjjrLtq1atGjpTzeVyWrNmzdA+Z8+eXbZtxYoVZbdtbW0tW77pppvKbtva2qply5bJzLRs2TI1NjZO+D6a2dC+8/n8qF8vaGxsnPBxx1LLfUt1+ja0tra2ocsLknTzzTdLklpbWyVx9guMZePGjZo1a5Y6Ozs1f/58Se+etBS3Fc8SN2zYIOndM9TK7ffff3/Z7Ts6OjRjxgx1dnYO21Z528rl0ttKyWt94cKFEzpTLL2PCxcuVEdHR9mxRnMxx43cd82/C6IaxeteXBscWfEacC2+U6Ge3wUx89hefZJrwKPitfDeMuHvggAA1AYBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAguegBSjU3N0ePAGQCr4XpIVMBbm9vjx4ByAReC9MDlyAAIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAiSix4A1Wk4e0ozj+2twX4HJKkm+x5+rFOS5tX8OEDWEeAppLm5uWb77u8flCQ1NdUjjPNqel+AqYIATyHt7e3RIwCYRFwDBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASAIAQaAIAQYAIIQYAAIQoABIAgBBoAgBBgAghBgAAhCgAEgCAEGgCAEGACCEGAACEKAASCIufv4P9jspKQfVXmMuZJ+VuVt6iWrszFXdbI6l5Td2ZirOhc71++4+5WVK6sK8ESYWY+7L67pQSYoq7MxV3WyOpeU3dmYqzq1motLEAAQhAADQJB6BPjROhxjorI6G3NVJ6tzSdmdjbmqU5O5an4NGAAwMi5BAEAQAgwAQWoaYDNbZmZvmFnBzL5Sy2ONMcc2MzthZkdK1s0xs/1m9oP0v1cEzHWNmXWb2etm9pqZrc/QbO83s2+b2avpbJ3p+g+Z2cvpY/p1M7skYLYGM/uumT2flZnSOfrM7LCZHTKznnRdFh7Ly81s
l5kdM7OjZnZ9Rub6aPpvVXz7pZltyMhs96TP+yNmtiN9PUz686xmATazBkmPSPqcpGslrTCza2t1vDE8IWlZxbqvSHrR3T8i6cV0ud4GJf2Vu18r6TpJX07/jbIw268lLXH3j0laJGmZmV0n6e8k/YO7N0v6uaQ1AbOtl3S0ZDkLMxW1uvuiku8ZzcJjuUXSPndfIOljSv7twudy9zfSf6tFkj4p6aykZ6JnM7MmSeskLXb335fUIOkW1eJ55u41eZN0vaR/K1m+T9J9tTreOObJSzpSsvyGpKvS96+S9EbUbCUzPSfphqzNJulSSd+R9IdKfhooN9JjXKdZ5it5US6R9Lwki56pZLY+SXMr1oU+lpIuk/RDpV9wz8pcI8z5p5JeysJskpok/UTSHEm59Hn22Vo8z2p5CaJ4J4qOp+uyYp67/zR9/01J8yKHMbO8pI9LelkZmS39VP+QpBOS9kv6b0m/cPfB9EMiHtOHJf21pPPpcmMGZipySd8ys14zW5uui34sPyTppKSvppdtHjOzWRmYq9Itknak74fO5u79kv5e0o8l/VTS25J6VYPnGV+Ek+TJ/9LCvh/PzGZL+oakDe7+y9JtkbO5+zuefHo4X9KnJC2ImKPIzP5M0gl3742cYxSfcfdPKLns9mUz+6PSjUGPZU7SJyT9s7t/XNIZVXxKn4Hn/yWSPi9pZ+W2iNnSa85fUPI/r6slzdLwS5iTopYB7pd0Tcny/HRdVrxlZldJUvrfExFDmNlvKYnv0+7+zSzNVuTuv5DUreTTrsvNLJduqvdj+mlJnzezPklfU3IZYkvwTEPSMye5+wkl1zI/pfjH8rik4+7+crq8S0mQo+cq9TlJ33H3t9Ll6NmWSvqhu590999I+qaS596kP89qGeBXJH0k/crhJUo+xdhdw+NVa7ektvT9NiXXX+vKzEzS45KOuntXxma70swuT9+fqeTa9FElIf5SxGzufp+7z3f3vJLn00F3XxU5U5GZzTKzDxTfV3JN84iCH0t3f1PST8zso+mqP5H0evRcFVbo3csPUvxsP5Z0nZldmr5Gi/9mk/88q/HF7BslfV/JtcMH6nkhvWKOHUqu5fxGyRnBGiXXDl+U9ANJByTNCZjrM0o+vfqepEPp240Zme0PJH03ne2IpL9J139Y0rclFZR8yvi+oMe0RdLzWZkpneHV9O214vM9I4/lIkk96WP5rKQrsjBXOtssSQOSLitZFz6bpE5Jx9Ln/r9Kel8tnmf8KDIABOGLcAAQhAADQBACDABBCDAABCHAABCEAANAEAIMAEEIMKYEM3s2/SU3rxV/0Y2ZrTGz76e/t/hfzOwf0/VXmtk3zOyV9O3TsdMDI+MHMTAlmNkcdz+V/lj0K0p+PeBLSn6vwa8kHZT0qrv/pZltl/RP7v6fZvbbSn5t4O+FDQ9cQG7sDwEyYZ2Z/Xn6/jWSbpX0H+5+SpLMbKek3023L5V0bfJj/JKkD5rZbHc/Xc+BgbEQYGSembUoier17n7WzP5dyc/pX+isdoak69z9f+szITAxXAPGVHCZpJ+n8V2g5M83zZL0x2Z2RforAr9Y8vHfktReXDCzRXWdFhgnAoypYJ+knJkdlfS3kv5Lye9ifUjJb6d6ScmfA3o7/fh1khab2ffM7HVJf1H3iYFx4ItwmLKK13XTM+BnJG1z92ei5wLGizNgTGUd6d+sO6LkD08+GzwPUBXOgAEgCGfAABCEAANAEAIMAEEIMAAEIcAAEOT/ASoUtoMb2LNZAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rB3Wh7jldcl-", + "outputId": "34e01f99-642a-45aa-ebd3-bda160d7be2e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 665 + } + }, + "source": [ + "df_titanic.head(20)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_o
003male22.0107.2500S22.0
111female38.01071.2833C38.0
213female26.0007.9250S26.0
311female35.01053.1000S35.0
403male35.0008.0500S35.0
503male28.0008.4583Q28.0
601male54.00051.8625S54.0
703male2.03121.0750S2.0
813female27.00211.1333S27.0
912female14.01030.0708C14.0
1013female4.01116.7000S4.0
1111female58.00026.5500S28.0
1203male20.0008.0500S20.0
1303male39.01531.2750S39.0
1403female14.0007.8542S14.0
1512female55.00016.0000S28.0
1603male2.04129.1250Q2.0
1712male28.00013.0000S28.0
1803female31.01018.0000S31.0
1913female28.0007.2250C28.0
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0\n", + "5 0 3 male 28.0 0 0 8.4583 Q 28.0\n", + "6 0 1 male 54.0 0 0 51.8625 S 54.0\n", + "7 0 3 male 2.0 3 1 21.0750 S 2.0\n", + "8 1 3 female 27.0 0 2 11.1333 S 27.0\n", + "9 1 2 female 14.0 1 0 30.0708 C 14.0\n", + "10 1 3 female 4.0 1 1 16.7000 S 4.0\n", + "11 1 1 female 58.0 0 0 26.5500 S 28.0\n", + "12 0 3 male 20.0 0 0 8.0500 S 20.0\n", + "13 0 3 male 39.0 1 5 31.2750 S 39.0\n", + "14 0 3 female 14.0 0 0 7.8542 S 14.0\n", + "15 1 2 female 55.0 0 0 16.0000 S 28.0\n", + "16 0 3 male 2.0 4 1 29.1250 Q 2.0\n", + "17 1 2 male 28.0 0 0 13.0000 S 28.0\n", + "18 0 3 female 31.0 1 0 18.0000 S 31.0\n", + "19 1 3 female 28.0 0 0 7.2250 C 28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 324 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x6YRvSf5SRR4" + }, + "source": [ + "### Which rows are the 'age' outliers?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2y9BeUnoSU4W", + "outputId": "85968b30-7903-465b-eddf-27d4604acb58", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "age_outlier = df_titanic[df_titanic['age'] > limite_superior_outliers]\n", + "age_outlier.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_o
1111female58.00026.5500S28.0
1512female55.00016.0000S28.0
3302male66.00010.5000S28.0
5401male65.00161.9792C28.0
9403male59.0007.2500S28.0
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "11 1 1 female 58.0 0 0 26.5500 S 28.0\n", + "15 1 2 female 55.0 0 0 16.0000 S 28.0\n", + "33 0 2 male 66.0 0 0 10.5000 S 28.0\n", + "54 0 1 male 65.0 0 1 61.9792 C 28.0\n", + "94 0 3 male 59.0 0 0 7.2500 S 28.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 327 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J0dHnei1TBFc" + }, + "source": [ + "### Outlier treatment for the 'fare' variable" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i8YM25uKm8g1", + "outputId": "00e04c37-82d8-4aca-c250-36a4a045df2f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
survivedpclasssexagesibspparchfareembarked2age_o
003male22.0107.2500S22.0
111female38.01071.2833C38.0
213female26.0007.9250S26.0
311female35.01053.1000S35.0
403male35.0008.0500S35.0
\n", + "
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 328 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4Gnr9AEfTSj6", + "outputId": "40a23226-68bd-4288-dbf2-5fabfd4fd32b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + } + }, + "source": [ + "df_titanic, limite_superior_outliers = trata_outliers(df = df_titanic, coluna = 'fare')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "IQR: 23.0896\n", + "Limite inferior para outlier: -26.724; Limite superior para outliers: 65.6344\n", + "Mediana: 14.4542\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEGCAYAAABbzE8LAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAOr0lEQVR4nO3df2zcdR3H8de7PTbGiuA2XEgBb+SIhAQEaRSQP25jbGMbELP9MQLpRvgRjCkTNApbY0YwGvljiI1RCRJMXJwxamTdBDdW/tKArRtssCGHKXELMCg/tGwZFj7+cZ877q7X33f3bu+ej6TZfb/3ve/3+xnfPvvpd+1hIQQBAGqvyfsEAKBREWAAcEKAAcAJAQYAJwQYAJwkJrLxggULQjKZrNKpAEB96uvreyeEcFbp+gkFOJlMqre3t3JnBQANwMxeL7eeWxAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOJvT/hKukrq4uZTKZ/PLRo0clSa2traO+LpVKqaOjo6rnBgC14BbgTCaj/QcP6ePT5kmSmo9/IEl68+TIp9R8/N2anBsA1IJbgCXp49Pm6cSFKyVJcw7vkqT8cjm5bQCgHnAPGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJzUJMBdXV3q6uqqxaEmbDqfG4D6lqjFQTKZTC0OMynT+dwA1DduQQCAEwIMAE4IMAA4IcAA4
IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4aPsDHjx/XgQMHtGnTJqXTaaXTaXV1deUf5z7a29uVTqe1ZMkSLV68WA899JDS6bSuueYaXXfddcpkMmX3n8lktGrVKvX09Gj58uX5/fX19UmS9u7dq3Q6rZ6enhFfv2zZMqXTae3YsWPE/Wcymfy+Cvc/0rbV1tvbqyVLlujWW2/VwMBA0XMDAwO6++67NTAwMOb4S41nDNu2bVM6ndb27dunNAZA+vRaLvc5NVUWQhj3xm1tbaG3t3fCB9m4caMk6ZFHHila1/evt3TiwpWSpDmHd0lSfrmcOYd36fLzFxbtZ6qWL1+ukydPTnk/yWRSTzzxxLD1GzZsUH9/vxKJhIaGhvLrW1pa1N3draVLl2poaEiJREJ79uwZ8fWSZGbDQpV7PplM6siRI/lj5PY/0rblzrWSVq9ercHBQUnSjTfeqHvuuSf/3NatW7Vjxw7dcMMN2rlz56jjLzWeMaTT6fzjZ599dirDAPLXcrnPqfEys74QQlvp+oaeAWcymYrEV5L6+/uHzcoymUw+noXxlaTBwUE99thj+fVDQ0PD4lr4ekkKIRTNgguf7+/vLzrG4OBg0Vfs0m2rOQvu7e3Nx1eSdu7cmZ8FDwwM6KmnnlIIQd3d3aOOv9R4xrBt27aiZWbBmIrCa7n0c6oSajIDXrt2rU6cOKFUKpVfl8lk9N+Pgj68dJ2k8c2A5+7frtNnWdF+puLw4cMVC7A0fBZcOHsdj9JZYLnXF86Cx9p/4Vfs0m2rOQsunP3m5GbBW7du1a5du4Z9QZKGj7/UeMZQOPvNYRaMySq9lic7C570DNjM7jSzXjPrffvttyd84OmskvGVNCyGE4mvNHyWXO71hV8wx9p/4YUz1XObiNL4StLu3bslSXv27CkbX2n4+EvVcgyANPxaLndtT0VirA1CCI9KelTKzoAnc5DW1lZJ5e8BT8Qnp35GqQreA57oDHUsyWRy2PJEZ8Bjvd7Mxr3/lpaWEbctPddKamlpGXahXnvttZKkpUuXjjoDHk0txwBIw6/lws+pSmjoe8CdnZ1V3d9Y+7/llluKljdv3jzm6++9995x7/+BBx6Y9LlNxZYtW4qWE4mE2tvbJUnr169XU1P2smtubi7arnT8pcYzhjvuuKNo+a677hrXOQPllF7LhZ9TldDQAU6lUpo9e3ZF9pVMJofdm06lUvlZWunsrqWlRbfffnt+fSKR0OLFi0d8vZSd/V5//fVln08mk0XHaGlp0eWXXz7itpW6j15OW1tb0Uxh1apVmj9/viRp/vz5WrFihcxMq1evHnX8pcYzhptvvrloed26dVMZChpc4bVc+jlVCQ0dYEk677zz1NTUpKuuuiq/bs2aNWW3k6SmpiaZmVauzP5jYXNzs+bMmTPijLKzs1Nz587V5s2bi2Kf+0q6adMmSSPP/jo7OzVr1ixJxbPf0v13dnbm91W4/5G2rbYtW7aoqalJixYtys9+c9avX6+LL75Y7e3tY46/1HjGkJsFM/tFJeSu5UrPfiV+DrjsuQFAJfFzwAAwzRBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHCSqMVBUqlULQ4zKdP53ADUt5oEuKOjoxaHmZTpfG4A6hu3IADACQEGACcEGACcE
GAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcEKAAcAJAQYAJwQYAJwQYABwQoABwAkBBgAnBBgAnBBgAHBCgAHACQEGACcEGACcEGAAcJLwPHjz8Xc15/Cu+HhAkvLLI20vLazFqQFA1bkFOJVKFS0fPTokSWptHS2wC4e9DgBmKrcAd3R0eB0aAKYF7gEDgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgBMCDABOCDAAOCHAAOCEAAOAEwIMAE4IMAA4IcAA4MRCCOPf2OxtSa9P8lgLJL0zydfONI00VqmxxttIY5Uaa7zVHOvnQwhnla6cUICnwsx6QwhtNTmYs0Yaq9RY422ksUqNNV6PsXILAgCcEGAAcFLLAD9aw2N5a6SxSo013kYaq9RY4635WGt2DxgAUIxbEADghAADgJOqB9jMVpjZK2aWMbP7qn28WjCzx83smJkdLFg3z8x2m9mr8c/PxvVmZj+J43/RzL7kd+YTZ2bnmlmPmb1sZi+Z2ca4vl7He6qZPW9mL8TxPhDXLzKz5+K4fmtms+L62XE5E59Pep7/ZJhZs5ntM7PuuFyXYzWzfjM7YGb7zaw3rnO9jqsaYDNrlvRTSddJukjSTWZ2UTWPWSNPSFpRsu4+Sc+EEC6Q9ExclrJjvyB+3CnpZzU6x0oZkvStEMJFkq6Q9I3437Bex3tS0pIQwhclXSpphZldIelHkh4OIaQkvSfptrj9bZLei+sfjtvNNBslHSpYruexLg4hXFrw876+13EIoWofkq6U9HTB8v2S7q/mMWv1ISkp6WDB8iuSzo6Pz5b0Snz8C0k3ldtuJn5I+pOkaxthvJJOk/QPSV9R9jekEnF9/rqW9LSkK+PjRNzOvM99AmM8R9nwLJHULcnqeKz9khaUrHO9jqt9C6JV0r8Llo/EdfVoYQjhjfj4TUkL4+O6+TuI33JeJuk51fF447fk+yUdk7Rb0muS3g8hDMVNCseUH298/gNJ82t7xlPyY0nfkfRJXJ6v+h1rkPQXM+szszvjOtfrOFHpHUIKIQQzq6uf7zOzFkm/l/TNEMJ/zCz/XL2NN4TwsaRLzexMSX+UdKHzKVWFma2WdCyE0Gdmae/zqYGrQwhHzexzknab2eHCJz2u42rPgI9KOrdg+Zy4rh69ZWZnS1L881hcP+P/DszsFGXjuy2E8Ie4um7HmxNCeF9Sj7Lfhp9pZrkJS+GY8uONz58haaDGpzpZX5V0g5n1S9qu7G2IR1SfY1UI4Wj885iyX1i/LOfruNoB/rukC+K/qs6StE7Sk1U+ppcnJa2Pj9cre680t749/qvqFZI+KPiWZ9qz7FT3l5IOhRC2FjxVr+M9K858ZWZzlL3ffUjZEK+Nm5WON/f3sFbS3hBvGk53IYT7QwjnhBCSyn5u7g0h3Kw6HKuZzTWz03OPJS2TdFDe13ENbnyvlPRPZe+jbfa+EV+hMf1G0huS/qfsvaHblL0X9oykVyXtkTQvbmvK/iTIa5IOSGrzPv8JjvVqZe+dvShpf/xYWcfjvUTSvjjeg5K+F9efL+l5SRlJv5M0O64/NS5n4vPne49hkuNOS+qu17HGMb0QP17Ktcj7OuZXkQHACb8JBwBOCDAAOCHAAOCEAAOAEwIMAE4IMKY9M7vbzA6Z2TbvcwEqiR9Dw7QXf2V0aQjhyDi2TYRP38cAmNaYAWNaM7OfK/tD9H82s++a2d/ie9f+1cy+ELfZYGZPmtleSc/E33p6PL6v7z4zu9F1EMAImAFj2ovvVdAm6SNJx0MIQ2a2VNLXQwhrzGyDpO9LuiSE8K6Z/UDSyyGEX8dfK35e0mUhhA+dhgCUxbuhYSY5Q9KvzOwCZX89+
pSC53aHEN6Nj5cp+yYz347Lp0o6T8VvOg64I8CYSR6U1BNC+Fp8b+JnC54rnN2apDUhhFdqd2rAxHEPGDPJGfr0LQE3jLLd05I64ju5ycwuq/J5AZNCgDGTPCTph2a2T6N/9/agsrcnXjSzl+IyMO3wj3AA4IQZMAA4IcAA4IQAA4ATAgwATggwADghwADghAADgJP/A44KX5vXXCReAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uh7f7nNATSkT" + }, + "source": [ + "### Which rows are the 'fare' outliers?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BdzaUaD0nQnv", + "outputId": "c05a1d46-c91a-4542-92a8-bfb121b44174", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "limite_superior_outliers" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "65.6344" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 330 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P3SAGnYnnQn4", + "outputId": "424347d6-d243-48df-8c3d-4c2a014c0cd9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "fare_outlier = df_titanic[df_titanic['fare'] > limite_superior_outliers]\n", + "fare_outlier.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th><th>fare_o</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td></tr>\n", + "
<tr><th>27</th><td>0</td><td>1</td><td>male</td><td>19.0</td><td>3</td><td>2</td><td>263.0000</td><td>S</td><td>19.0</td><td>14.4542</td></tr>\n", + "
<tr><th>31</th><td>1</td><td>1</td><td>female</td><td>28.0</td><td>1</td><td>0</td><td>146.5208</td><td>C</td><td>28.0</td><td>14.4542</td></tr>\n", + "
<tr><th>34</th><td>0</td><td>1</td><td>male</td><td>28.0</td><td>1</td><td>0</td><td>82.1708</td><td>C</td><td>28.0</td><td>14.4542</td></tr>\n", + "
<tr><th>52</th><td>1</td><td>1</td><td>female</td><td>49.0</td><td>1</td><td>0</td><td>76.7292</td><td>C</td><td>49.0</td><td>14.4542</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " survived pclass sex age ... fare embarked2 age_o fare_o\n", + "1 1 1 female 38.0 ... 71.2833 C 38.0 14.4542\n", + "27 0 1 male 19.0 ... 263.0000 S 19.0 14.4542\n", + "31 1 1 female 28.0 ... 146.5208 C 28.0 14.4542\n", + "34 0 1 male 28.0 ... 82.1708 C 28.0 14.4542\n", + "52 1 1 female 49.0 ... 76.7292 C 49.0 14.4542\n", + "\n", + "[5 rows x 10 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 331 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Jh83WTrZDeM_" + }, + "source": [ + "### Binning variáveis numéricas: age e fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JVNVCd7aDjkz", + "outputId": "21f91b71-cfe5-445b-cf2c-8f4bf0de9825", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + } + }, + "source": [ + "#df_titanic_copia = df_titanic.copy()\n", + "df_titanic = df_titanic_copia.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td></tr>\n", + "
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td></tr>\n", + "
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " survived pclass sex age sibsp parch fare embarked2 age_o\n", + "0 0 3 male 22.0 1 0 7.2500 S 22.0\n", + "1 1 1 female 38.0 1 0 71.2833 C 38.0\n", + "2 1 3 female 26.0 0 0 7.9250 S 26.0\n", + "3 1 1 female 35.0 1 0 53.1000 S 35.0\n", + "4 0 3 male 35.0 0 0 8.0500 S 35.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 332 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pUspmjPWFP06" + }, + "source": [ + "#### Usando cut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NBVoCBe_2Zmp" + }, + "source": [ + "#df_titanic['age_bins'] = pd.cut(x = df_titanic['age_o'], bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90])\n", + "# Obs.: a linha de cima gera a coluna age-bins com o intervalo no seguinte formato: (início, fim], exemplo: (20, 30] -> intervalor aberto e fechado\n", + "df_titanic['age_bins'] = pd.cut(x = df_titanic['age_o'], bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90], labels = [10, 20, 30, 40, 50, 60, 70, 80, 90])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2i1jombNDrEO", + "outputId": "7c96358b-3023-4706-813a-3e3e594cc45e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 264 + } + }, + "source": [ + "df_titanic.head(7)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked2</th><th>age_o</th><th>age_bins</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>30</td></tr>\n", + "
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>40</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>30</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>40</td></tr>\n", + "
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>40</td></tr>\n", + "
<tr><th>5</th><td>0</td><td>3</td><td>male</td><td>28.0</td><td>0</td><td>0</td><td>8.4583</td><td>Q</td><td>28.0</td><td>30</td></tr>\n", + "
<tr><th>6</th><td>0</td><td>1</td><td>male</td><td>54.0</td><td>0</td><td>0</td><td>51.8625</td><td>S</td><td>54.0</td><td>60</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " survived pclass sex age ... fare embarked2 age_o age_bins\n", + "0 0 3 male 22.0 ... 7.2500 S 22.0 30\n", + "1 1 1 female 38.0 ... 71.2833 C 38.0 40\n", + "2 1 3 female 26.0 ... 7.9250 S 26.0 30\n", + "3 1 1 female 35.0 ... 53.1000 S 35.0 40\n", + "4 0 3 male 35.0 ... 8.0500 S 35.0 40\n", + "5 0 3 male 28.0 ... 8.4583 Q 28.0 30\n", + "6 0 1 male 54.0 ... 51.8625 S 54.0 60\n", + "\n", + "[7 rows x 10 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 340 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "davIt0UT9tTr" + }, + "source": [ + "#### **Desafio**: Qual seria o corte ótimo para 'age' usando DecisionTree?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "i5aAYl2ZDu1f", + "outputId": "0f1d7a99-6cb0-4484-b12b-7907b70feb8a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['age_bins_cut1'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20, 30] 449\n", + "(30, 40] 155\n", + "(10, 20] 115\n", + "(40, 50] 86\n", + "(0, 10] 64\n", + "(50, 60] 22\n", + "(80, 90] 0\n", + "(70, 80] 0\n", + "(60, 70] 0\n", + "Name: age_bins_cut1, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 276 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VAUshOiLFT9-" + }, + "source": [ + "#### Usando qcut()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RKnb-bI7FL3F", + "outputId": "59742803-dcee-4525-8fd1-2cecff379cab", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic['age_bins_qcut'] = pd.qcut(x = df_titanic['age'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>2</td></tr>\n", + "
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>3</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>2</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>3</td></tr>\n", + "
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>3</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 2\n", + "1 1 1 female ... (30, 40] 40 3\n", + "2 1 3 female ... (20, 30] 30 2\n", + "3 1 1 female ... (30, 40] 40 3\n", + "4 0 3 male ... (30, 40] 40 3\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 277 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "boSGroSYN7cP", + "outputId": "9304540f-1a20-41cb-c4a4-70633be1a071", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic.dtypes" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "survived int64\n", + "pclass int64\n", + "sex object\n", + "age float64\n", + "sibsp int64\n", + "parch int64\n", + "fare float64\n", + "embarked2 object\n", + "age_o float64\n", + "age_bins category\n", + "dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 344 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "P8s3LzfpNdUz", + "outputId": "b0e4b638-32de-4295-984e-ba207e195661", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# ***************************************************\n", + "# AUTOMATIZAR O TRATAMENTO DE VARIÁVEIS DO MESMO TIPO\n", + "# ***************************************************\n", + "l_colunas_numericas = list(df_titanic.select_dtypes('int').columns)\n", + "l_colunas_numericas\n", + "\n", + "for coluna in l_colunas_numericas:\n", + " trata_outliers(df, coluna)\n", + " trata_missing_values(df, coluna)\n", + " aplica_MMS(df, coluna)\n", + " aplica_RS(df, coluna)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['survived', 'pclass', 'sibsp', 'parch']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 346 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"Ov2_l39mn3FH", + "outputId": "122869f8-018f-4176-b2df-c249efa222b7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['age_bins_qcut'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(20.0, 28.0] 360\n", + "(0.419, 20.0] 179\n", + "(38.0, 80.0] 177\n", + "(28.0, 38.0] 175\n", + "Name: age_bins_qcut, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 261 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J60XwHUOGwbr" + }, + "source": [ + "### MinMaxScaler() e RobustScaler()\n", + "* age e fare" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GRY84U4HHxoQ" + }, + "source": [ + "from sklearn.preprocessing import MinMaxScaler, RobustScaler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IQC7Bo-DH71s" + }, + "source": [ + "mms = MinMaxScaler()\n", + "rs = RobustScaler()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8O2oM9XdIYF5", + "outputId": "9812b76b-dffc-406b-fcd9-0a7c1b01a61b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td></tr>\n", + "
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n", + "
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 (20.0, 28.0]\n", + "1 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "2 1 3 female ... (20, 30] 30 (20.0, 28.0]\n", + "3 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "4 0 3 male ... (30, 40] 40 (28.0, 38.0]\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 264 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B-qglHy6NZlg", + "outputId": "96a154a8-678e-48eb-93b2-ce93b7e0258f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic_copia = df_titanic.copy()\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td></tr>\n", + "
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n", + "
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 (20.0, 28.0]\n", + "1 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "2 1 3 female ... (20, 30] 30 (20.0, 28.0]\n", + "3 1 1 female ... (30, 40] 40 (28.0, 38.0]\n", + "4 0 3 male ... (30, 40] 40 (28.0, 38.0]\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 265 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dp9jYZ1OoA9i" + }, + "source": [ + "A seguir, deletar as variáveis que desnecessárias..." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zSSViPY5XokW" + }, + "source": [ + "df_titanic.drop(columns = ['age', 'fare'], axis = 1, inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MNq5a0eUIBGV", + "outputId": "d692f608-f511-46e6-d101-bd9443647b94", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 282 + } + }, + "source": [ + "# fit\n", + "df_titanic_mms = pd.DataFrame(mms.fit_transform(df_titanic[['age_o', 'fare_o']]), columns = ['age_mms', 'fare_mms'])\n", + "df_titanic_rs = pd.DataFrame(rs.fit_transform(df_titanic[['age_o', 'fare_o']]), columns = ['age_rs', 'fare_rs'])\n", + "\n", + "df_titanic['age_mms'] = df_titanic_mms['age_mms']\n", + "df_titanic['age_rs'] = df_titanic_rs['age_rs']\n", + "\n", + "df_titanic['fare_mms'] = df_titanic_mms['fare_mms']\n", + "df_titanic['fare_rs'] = df_titanic_rs['fare_rs']\n", + "\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>sibsp</th><th>parch</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th><th>age_mms</th><th>age_rs</th><th>fare_mms</th><th>fare_rs</th><th>age_bins_mms</th><th>age_bins_rs</th><th>fare_bins_mms</th><th>fare_bins_rs</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>1</td><td>0</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td><td>0.402762</td><td>-0.545455</td><td>0.111538</td><td>-0.443619</td><td>(0.365, 0.515]</td><td>(-0.727, 0.0]</td><td>(-0.001, 0.121]</td><td>(-0.891, -0.406]</td></tr>\n", + "
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>1</td><td>0</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td><td>0.701381</td><td>0.909091</td><td>0.222372</td><td>0.000000</td><td>(0.645, 1.0]</td><td>(0.636, 2.364]</td><td>(0.162, 0.222]</td><td>(-0.243, 0.0]</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>0</td><td>0</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td><td>0.477417</td><td>-0.181818</td><td>0.121923</td><td>-0.402054</td><td>(0.365, 0.515]</td><td>(-0.727, 0.0]</td><td>(0.121, 0.162]</td><td>(-0.406, -0.243]</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>1</td><td>0</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td><td>0.645390</td><td>0.636364</td><td>0.816923</td><td>2.379726</td><td>(0.515, 0.645]</td><td>(0.0, 0.636]</td><td>(0.404, 1.0]</td><td>(0.726, 3.113]</td></tr>\n", + "
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>0</td><td>0</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>(28.0, 38.0]</td><td>0.645390</td><td>0.636364</td><td>0.123846</td><td>-0.394357</td><td>(0.515, 0.645]</td><td>(0.0, 0.636]</td><td>(0.121, 0.162]</td><td>(-0.406, -0.243]</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_rs fare_bins_mms fare_bins_rs\n", + "0 0 3 male ... (-0.727, 0.0] (-0.001, 0.121] (-0.891, -0.406]\n", + "1 1 1 female ... (0.636, 2.364] (0.162, 0.222] (-0.243, 0.0]\n", + "2 1 3 female ... (-0.727, 0.0] (0.121, 0.162] (-0.406, -0.243]\n", + "3 1 1 female ... (0.0, 0.636] (0.404, 1.0] (0.726, 3.113]\n", + "4 0 3 male ... (0.0, 0.636] (0.121, 0.162] (-0.406, -0.243]\n", + "\n", + "[5 rows x 19 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 268 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UzrdPNO3rIg5", + "outputId": "aaa31937-081d-4af8-f3ec-07002886d2a6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 555 + } + }, + "source": [ + "# Categorizando as variáveis transformadas\n", + "df_titanic['age_bins_mms'] = pd.qcut(x = df_titanic['age_mms'], q = 5, duplicates = 'drop', labels = [1, 2, 3, 4])\n", + "df_titanic['age_bins_rs'] = pd.qcut(x = df_titanic['age_rs'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "\n", + "df_titanic['fare_bins_mms'] = pd.qcut(x = df_titanic['fare_mms'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')\n", + "df_titanic['fare_bins_rs'] = pd.qcut(x = df_titanic['fare_rs'], q = 5, labels = [1, 2, 3, 4], duplicates = 'drop')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2894\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2895\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2896\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'age_mms'", + "\nThe above exception was the direct cause of the following exception:\n", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Categorizando as variáveis transformadas\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_bins_mms'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_mms'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m'drop'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_bins_rs'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'age_rs'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'drop'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'fare_bins_mms'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mqcut\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf_titanic\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'fare_mms'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mduplicates\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m'drop'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2900\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2901\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2902\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2903\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2904\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2895\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2896\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2897\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2898\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2899\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'age_mms'" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7smfXya5pmNq", + "outputId": "a942223f-e9b6-4758-c453-73c5c55f91bd", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.drop(columns = ['age_o', 'fare_o', 'age_bins_cut2', 'age_mms', 'age_rs', 'fare_mms', 'fare_rs'], axis = 1, inplace = True)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>sibsp</th><th>parch</th><th>ambarked2</th><th>age_bins_cut1</th><th>age_bins_qcut</th><th>age_bins_mms</th><th>age_bins_rs</th><th>fare_bins_mms</th><th>fare_bins_rs</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>1</td><td>0</td><td>S</td><td>(20, 30]</td><td>(20.0, 28.0]</td><td>(0.365, 0.515]</td><td>(-0.727, 0.0]</td><td>(-0.001, 0.121]</td><td>(-0.891, -0.406]</td></tr>\n", + "
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>1</td><td>0</td><td>C</td><td>(30, 40]</td><td>(28.0, 38.0]</td><td>(0.645, 1.0]</td><td>(0.636, 2.364]</td><td>(0.162, 0.222]</td><td>(-0.243, 0.0]</td></tr>\n", + "
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>0</td><td>0</td><td>S</td><td>(20, 30]</td><td>(20.0, 28.0]</td><td>(0.365, 0.515]</td><td>(-0.727, 0.0]</td><td>(0.121, 0.162]</td><td>(-0.406, -0.243]</td></tr>\n", + "
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>1</td><td>0</td><td>S</td><td>(30, 40]</td><td>(28.0, 38.0]</td><td>(0.515, 0.645]</td><td>(0.0, 0.636]</td><td>(0.404, 1.0]</td><td>(0.726, 3.113]</td></tr>\n", + "
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>0</td><td>0</td><td>S</td><td>(30, 40]</td><td>(28.0, 38.0]</td><td>(0.515, 0.645]</td><td>(0.0, 0.636]</td><td>(0.121, 0.162]</td><td>(-0.406, -0.243]</td></tr>\n", + "</tbody>\n", + "</table>
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_rs fare_bins_mms fare_bins_rs\n", + "0 0 3 male ... (-0.727, 0.0] (-0.001, 0.121] (-0.891, -0.406]\n", + "1 1 1 female ... (0.636, 2.364] (0.162, 0.222] (-0.243, 0.0]\n", + "2 1 3 female ... (-0.727, 0.0] (0.121, 0.162] (-0.406, -0.243]\n", + "3 1 1 female ... (0.0, 0.636] (0.404, 1.0] (0.726, 3.113]\n", + "4 0 3 male ... (0.0, 0.636] (0.121, 0.162] (-0.406, -0.243]\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 269 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SFPNLDMcU339" + }, + "source": [ + "### Variáveis Dummy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L_Fx1iy7snjF", + "outputId": "24c70d23-5a35-41aa-8a30-04fcc146b7c6", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
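<table>
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th></tr>
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>2</td></tr>
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>3</td></tr>
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>2</td></tr>
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>3</td></tr>
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>3</td></tr>
</table>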
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... age_bins_cut1 age_bins_cut2 age_bins_qcut\n", + "0 0 3 male ... (20, 30] 30 2\n", + "1 1 1 female ... (30, 40] 40 3\n", + "2 1 3 female ... (20, 30] 30 2\n", + "3 1 1 female ... (30, 40] 40 3\n", + "4 0 3 male ... (30, 40] 40 3\n", + "\n", + "[5 rows x 13 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 279 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X6aHaJodX0Hi", + "outputId": "f7a26db1-81d3-47f5-dcd3-bbde2b2b6440", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 402 + } + }, + "source": [ + "dummy = pd.get_dummies(df_titanic[['sex', 'ambarked2']])\n", + "dummy" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
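<table>
<tr><th></th><th>sex_female</th><th>sex_male</th><th>ambarked2_C</th><th>ambarked2_Q</th><th>ambarked2_S</th></tr>
<tr><th>0</th><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>1</th><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
<tr><th>2</th><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>3</th><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>4</th><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>886</th><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>887</th><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>888</th><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>889</th><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>
<tr><th>890</th><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr>
</table>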
\n", + "

891 rows × 5 columns

\n", + "
" + ], + "text/plain": [ + " sex_female sex_male ambarked2_C ambarked2_Q ambarked2_S\n", + "0 0 1 0 0 1\n", + "1 1 0 1 0 0\n", + "2 1 0 0 0 1\n", + "3 1 0 0 0 1\n", + "4 0 1 0 0 1\n", + ".. ... ... ... ... ...\n", + "886 0 1 0 0 1\n", + "887 1 0 0 0 1\n", + "888 1 0 0 0 1\n", + "889 0 1 1 0 0\n", + "890 0 1 0 1 0\n", + "\n", + "[891 rows x 5 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 282 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZhLW0lEbs28E", + "outputId": "c772c290-4394-409a-ded7-1eb57b7ac0db", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 215 + } + }, + "source": [ + "df_titanic2 = pd.concat([df_titanic, dummy], axis = 1)\n", + "df_titanic2.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
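<table>
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>ambarked2</th><th>age_o</th><th>fare_o</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th><th>sex_female</th><th>sex_male</th><th>ambarked2_C</th><th>ambarked2_Q</th><th>ambarked2_S</th></tr>
<tr><th>0</th><td>0</td><td>3</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>2</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>1</th><td>1</td><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>38.0</td><td>14.4542</td><td>(30, 40]</td><td>40</td><td>3</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
<tr><th>2</th><td>1</td><td>3</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>2</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>3</th><td>1</td><td>1</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>3</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><th>4</th><td>0</td><td>3</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>3</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
</table>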
\n", + "
" + ], + "text/plain": [ + " survived pclass sex ... ambarked2_C ambarked2_Q ambarked2_S\n", + "0 0 3 male ... 0 0 1\n", + "1 1 1 female ... 1 0 0\n", + "2 1 3 female ... 0 0 1\n", + "3 1 1 female ... 0 0 1\n", + "4 0 3 male ... 0 0 1\n", + "\n", + "[5 rows x 18 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 283 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_bOYD4gWwGt", + "outputId": "b33c5076-6d0b-4c8f-8e5b-d87cb3e4544d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['pclass'].value_counts() # Quem será a referência?" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "3.0 484\n", + "1.0 189\n", + "2.0 176\n", + "Name: pclass, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 64 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xFKdsFDihApP" + }, + "source": [ + "df_titanic['pclass'] = df_titanic['pclass'].astype('category')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D_mWCqM1ZOgU" + }, + "source": [ + "### Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UPnCuCsLZSjQ", + "outputId": "95b92d55-a895-4c2e-9655-07a3bd753f6d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 162 + } + }, + "source": [ + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "NameError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mX_treinamento\u001b[0m\u001b[0;34m,\u001b[0m 
X_teste, y_treinamento, y_teste = train_test_split()\n", + "NameError: name 'train_test_split' is not defined" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rk-Zuh5RXJbp", + "outputId": "47c3c005-795d-406d-984a-b0094cd5718c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
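<table>
<tr><th></th><th>survived</th><th>pclass</th><th>sex</th><th>sibsp</th><th>parch</th><th>ambarked2</th><th>age3</th><th>fare3</th><th>age_bins_cut1</th><th>age_bins_cut2</th><th>age_bins_qcut</th><th>age_mms</th><th>age_rs</th><th>fare_mms</th><th>fare_rs</th></tr>
<tr><th>0</th><td>0.0</td><td>3.0</td><td>male</td><td>1.0</td><td>0.0</td><td>S</td><td>22.0</td><td>7.2500</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td><td>0.402762</td><td>-0.545455</td><td>0.014151</td><td>-0.323505</td></tr>
<tr><th>1</th><td>1.0</td><td>1.0</td><td>female</td><td>1.0</td><td>0.0</td><td>C</td><td>38.0</td><td>71.2833</td><td>(30, 40]</td><td>40</td><td>(36.0, 54.0]</td><td>0.701381</td><td>0.909091</td><td>0.139136</td><td>2.696934</td></tr>
<tr><th>2</th><td>1.0</td><td>3.0</td><td>female</td><td>0.0</td><td>0.0</td><td>S</td><td>26.0</td><td>7.9250</td><td>(20, 30]</td><td>30</td><td>(20.0, 28.0]</td><td>0.477417</td><td>-0.181818</td><td>0.015469</td><td>-0.291665</td></tr>
<tr><th>3</th><td>1.0</td><td>1.0</td><td>female</td><td>1.0</td><td>0.0</td><td>S</td><td>35.0</td><td>53.1000</td><td>(30, 40]</td><td>40</td><td>(28.0, 36.0]</td><td>0.645390</td><td>0.636364</td><td>0.103644</td><td>1.839231</td></tr>
<tr><th>4</th><td>0.0</td><td>3.0</td><td>male</td><td>0.0</td><td>0.0</td><td>S</td><td>35.0</td><td>8.0500</td><td>(30, 40]</td><td>40</td><td>(28.0, 36.0]</td><td>0.645390</td><td>0.636364</td><td>0.015713</td><td>-0.285769</td></tr>
</table>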
\n", + "
" + ], + "text/plain": [ + " survived pclass sex sibsp ... age_mms age_rs fare_mms fare_rs\n", + "0 0.0 3.0 male 1.0 ... 0.402762 -0.545455 0.014151 -0.323505\n", + "1 1.0 1.0 female 1.0 ... 0.701381 0.909091 0.139136 2.696934\n", + "2 1.0 3.0 female 0.0 ... 0.477417 -0.181818 0.015469 -0.291665\n", + "3 1.0 1.0 female 1.0 ... 0.645390 0.636364 0.103644 1.839231\n", + "4 0.0 3.0 male 0.0 ... 0.645390 0.636364 0.015713 -0.285769\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 69 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XtdQc49LXTfk", + "outputId": "696c23be-e9d0-4ba9-b4f5-0dcba0ffe8ec", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_titanic['pclass'].value_counts()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "3.0 484\n", + "1.0 189\n", + "2.0 176\n", + "Name: pclass, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 74 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Construção da 1ª Versão do modelo (baseline)" + ], + "metadata": { + "id": "yTjb4CyYkktB" + } + }, + { + "cell_type": "code", + "metadata": { + "id": "fbvB30S5hRxH", + "outputId": "e872fe50-fcad-4b33-9456-182b1ca7e62c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "modelo = smf.glm(formula = 'survived ~ age3 + pclass + sex', \n", + " data = df_titanic, \n", + " family = sm.families.Binomial()).fit()\n", + "print(modelo.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + " Generalized Linear Model Regression Results \n", + "==============================================================================\n", + "Dep. Variable: survived No. 
Observations: 849\n", + "Model: GLM Df Residuals: 844\n", + "Model Family: Binomial Df Model: 4\n", + "Link Function: logit Scale: 1.0000\n", + "Method: IRLS Log-Likelihood: -386.42\n", + "Date: Thu, 29 Oct 2020 Deviance: 772.85\n", + "Time: 17:17:09 Pearson chi2: 890.\n", + "No. Iterations: 5 \n", + "Covariance Type: nonrobust \n", + "=================================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "---------------------------------------------------------------------------------\n", + "Intercept 3.5515 0.382 9.297 0.000 2.803 4.300\n", + "pclass[T.2.0] -1.1389 0.264 -4.315 0.000 -1.656 -0.622\n", + "pclass[T.3.0] -2.3581 0.245 -9.613 0.000 -2.839 -1.877\n", + "sex[T.male] -2.5618 0.189 -13.522 0.000 -2.933 -2.191\n", + "age3 -0.0344 0.009 -4.035 0.000 -0.051 -0.018\n", + "=================================================================================\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p7_gfQXciFs1" + }, + "source": [ + "Are the coefficients statistically significant (p-value below 0.05, adopting a 95% confidence level)?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xtrh_bYNikTk" + }, + "source": [ + "### Interpreting the coefficients:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FzlGDyeLgL11" + }, + "source": [ + "* Passengers travelling in second class have lower odds of survival than those travelling in first class.\n", + "* Passengers travelling in third class have lower odds still.\n", + "* Men have lower odds of survival than women. The older the passenger, the lower the odds of survival."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CJNgEYY9ioVM" + }, + "source": [ + "### More interpretable coefficients: relative odds of survival" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q0vLh1v3irCz" + }, + "source": [ + "print(np.exp(modelo.params[1:]))  # odds ratios; [1:] skips the intercept" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a2fJIOOzi3VF" + }, + "source": [ + "* Second-class passengers had about 0.32 times the odds of survival of first-class passengers (exp(-1.1389) ≈ 0.32). \n", + "* Third-class passengers had about 0.095 times the odds of first-class passengers (exp(-2.3581) ≈ 0.095). \n", + "* Men had about 0.077 times the odds of women (exp(-2.5618) ≈ 0.077)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dYRkdqNujHFA" + }, + "source": [ + "### Comparing with linear regression" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mKW-aODfjLbm" + }, + "source": [ + "(np.exp(modelo.params[1:]) - 1) * 100  # percentage change in the odds" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ODOqZpAgjQ2q" + }, + "source": [ + "* Second-class passengers have about 68% lower odds of survival than first-class passengers.\n", + "* Third-class passengers have about 91% lower odds of survival than first-class passengers.\n", + "* Men have about 92% lower odds of survival than women.\n", + "* For each additional year of age, the odds decrease by about 3.4%, holding class and sex fixed."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fxnutoD7jp94" + }, + "source": [ + "### Model quality" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-oW8Kg5Ij3Av", + "outputId": "d56cd7ce-e5c0-493d-ccfe-27cd5a6899c5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 354 + } + }, + "source": [ + "modelo2 = LogisticRegression(penalty='none', solver='newton-cg')\n", + "df_titanic2 = df_titanic[['Survived', 'Pclass', 'Sex', 'Age']].dropna()\n", + "y = df_titanic2['Survived']\n", + "X = pd.get_dummies(df_titanic2[['Pclass', 'Sex', 'Age']], drop_first=True)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "---------------------------------------------------------------------------", + "KeyError Traceback (most recent call last)", + "      1 modelo2 = LogisticRegression(penalty='none', solver='newton-cg')\n----> 2 df_titanic2 = df_titanic[['Survived', 'Pclass', 'Sex', 'Age']].dropna()\n      3 y = df_titanic2['Survived']\n      4 X = pd.get_dummies(df_titanic2[['Pclass', 'Sex', 'Age']], drop_first=True)\n",
+ "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)\n   2906             if is_iterator(key):\n   2907                 key = list(key)\n-> 2908             indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]\n   2909 \n   2910         # take() does not accept boolean indexers\n", + "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)\n   1252             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)\n   1253 \n-> 1254         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)\n   1255         return keyarr, indexer\n   1256 \n",
+ "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)\n   1296             if missing == len(indexer):\n   1297                 axis_name = self.obj._get_axis_name(axis)\n-> 1298                 raise KeyError(f\"None of [{key}] are in the [{axis_name}]\")\n   1299 \n   1300             # We (temporarily) allow for some missing keys with .loc, except in\n", + "KeyError: \"None of [Index(['Survived', 'Pclass', 'Sex', 'Age'], dtype='object')] are in the [columns]\"" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YK9nIiz_kQQl" + }, + "source": [ + "modelo2.fit(X, y)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FPhHe4SmkZoE" + }, + "source": [ + "y_pred = modelo2.predict_proba(X)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dPrXU0GSknGJ" + }, + "source": [ + "from sklearn.metrics import confusion_matrix, accuracy_score, classification_report\n", + "\n", + "confusion_matrix(y, modelo2.predict(X))" + ], + "execution_count": null, + "outputs": [] + },
+ { + "cell_type": "code", + "metadata": { + "id": "_-Wweiq7kruH" + }, + "source": [ + "acuracia = accuracy_score(y, modelo2.predict(X))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bqcT8XYJkuwH" + }, + "source": [ + "print(classification_report(y, modelo2.predict(X)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "4fSN-vLOjseh" + }, + "source": [ + "confusion_matrix(y, modelo2.predict(X))  # using the sklearn function" + ], + "execution_count": null, + "outputs": [] + },
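As a rough illustration of what the confusion matrix, accuracy, and classification report cells above summarize, the sketch below computes the same quantities by hand in plain Python. The labels and probabilities are made-up toy values, and the helper names `confusion_counts` and `summarize` are hypothetical; they are not part of scikit-learn or of this notebook. The 0.5 threshold is an assumption that mirrors the usual default decision rule.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TN, FP, FN, TP after thresholding predicted probabilities."""
    tn = fp = fn = tp = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def summarize(tn, fp, fn, tp):
    """Derive accuracy, precision and recall from the four counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Toy data, assumed for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

tn, fp, fn, tp = confusion_counts(y_true, y_prob)
metrics = summarize(tn, fp, fn, tp)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
print([[tn, fp], [fn, tp]])
print(metrics)
```

The `[[tn, fp], [fn, tp]]` layout follows the same row and column order that `confusion_matrix` uses for binary 0/1 labels, so the hand-computed counts can be compared directly with the library output.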
+ { + "cell_type": "code", + "metadata": { + "id": "A0408gbPkywR" + }, + "source": [ + "from sklearn.metrics import roc_curve, roc_auc_score\n", + "\n", + "# Plot the ROC curve; y_score must hold the positive-class probabilities (1-D)\n", + "def plot_roc_curve(y_true, y_score, figsize=(10,6)):\n", + "    fpr, tpr, _ = roc_curve(y_true, y_score)\n", + "    plt.figure(figsize=figsize)\n", + "    auc_value = roc_auc_score(y_true, y_score)\n", + "    plt.plot(fpr, tpr, color='orange', label='ROC curve (area = %0.2f)' % auc_value)\n", + "    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')\n", + "    plt.xlabel('False Positive Rate')\n", + "    plt.ylabel('True Positive Rate')\n", + "    plt.title('Receiver Operating Characteristic (ROC) Curve')\n", + "    plt.legend()\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5T4P90hQk1ug" + }, + "source": [ + "plot_roc_curve(y, y_pred[:, 1])  # column 1 of predict_proba is P(class = 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CANyMIgIjgSb" + }, + "source": [ + "### Predictions" + ] + },
+ { + "cell_type": "code", + "metadata": { + "id": "pWJgVcQRlESq" + }, + "source": [ + "eu = pd.DataFrame({'Age':32, 'Pclass_2':0, 'Pclass_3':1, 'Sex_male':1}, index=[0])\n", + "minha_prob = modelo2.predict_proba(eu)\n", + "print('I would have had a {}% probability of survival if I had been on the Titanic'\\\n", + "      .format(round(minha_prob[:,1][0]*100, 2)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kgpdgkgrlJ-w" + }, + "source": [ + "I would have had a 7.52% probability of survival if I had been on the Titanic" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "91GShU9ClMiY" + }, + "source": [ + "coleguinha = pd.DataFrame({'Age':32, 'Pclass_2':0, 'Pclass_3':0, 'Sex_male':1}, index=[0])\n", + "prob_do_coleguinha = modelo2.predict_proba(coleguinha)\n", + "print('My classmate would have had a {}% probability of survival if he had been on the Titanic'\\\n", + "      .format(round(prob_do_coleguinha[:,1][0]*100, 2)))" + ], + "execution_count": null, + "outputs": [] + }, + {
+ "cell_type": "markdown", + "metadata": { + "id": "c2EHn8volOil" + }, + "source": [ + "My classmate would have had a 51.77% probability of survival if he had been on the Titanic" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C2PvJoZQlH6u" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XwuMfMD1gFyd" + }, + "source": [ + "# Example 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kFY0TQVgOlvT" + }, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "efF3st3sHxPG" + }, + "source": [ + "# Load the libraries\n", + "import numpy as np\n", + "np.set_printoptions(formatter = {'float': lambda x: \"{0:0.2f}\".format(x)})\n", + "\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.model_selection import train_test_split\n", + "import statsmodels.api as sm\n", + "\n", + "%matplotlib inline" + ], + "execution_count": null, + "outputs": [] + },
+ { + "cell_type": "code", + "metadata": { + "id": "Bk9F6JO0IELv" + }, + "source": [ + "# Load/read the dataset: Diabetes dataframe\n", + "from sklearn import datasets\n", + "#Diabetes = datasets.load_diabetes()\n", + "\n", + "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/diabetes.csv'\n", + "diabetes = pd.read_csv(url)\n", + "diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tjRmpaPIDknb" + }, + "source": [ + "# Define the X and y matrices\n", + "X_diabetes = diabetes.drop(columns = ['Outcome'])  # every column except the target\n", + "y_diabetes = diabetes['Outcome']\n", + "\n", + "X_diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jLrx69TH-Mad" + }, + "source": [ + "X_diabetes.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code",
+ "metadata": { + "id": "mdFBioP6-Ply" + }, + "source": [ + "y_diabetes.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fhLySN65IaDF" + }, + "source": [ + "# Define the training and validation matrices\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_diabetes, y_diabetes)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "J5R8HlnuIGpL" + }, + "source": [ + "# Using statsmodels:\n", + "# sm.Logit does not add an intercept on its own, so add the constant column explicitly\n", + "X_treinamento_sm = sm.add_constant(X_treinamento)\n", + "X_teste_sm = sm.add_constant(X_teste)\n", + "lr_sm = sm.Logit(y_treinamento, X_treinamento_sm) # Note: the argument order here is reversed: (y, X)\n", + "\n", + "# Train the model\n", + "resultado_sm = lr_sm.fit()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GlbCaPp1ETNa" + }, + "source": [ + "resultado_sm.summary()" + ], + "execution_count": null, + "outputs": [] + },
+ { + "cell_type": "code", + "metadata": { + "id": "-FJaSnJLKICU" + }, + "source": [ + "# MSE - mean squared error between predicted probabilities and the 0/1 outcomes (the Brier score)\n", + "np.mean((resultado_sm.predict(X_teste_sm) - y_teste) ** 2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bVEUSTUPzOj" + }, + "source": [ + "### Compute y_pred: the predicted values of y" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OjGrNhTNLcr-" + }, + "source": [ + "y_pred = resultado_sm.predict(X_treinamento_sm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vfS5RCx_VnGT" + }, + "source": [ + "# Pair each training outcome with its predicted probability (predict() with no arguments uses the training data)\n", + "compara = list(zip(np.array(y_treinamento), resultado_sm.predict()))\n", + "compara[0:30]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pUxasncIFaw4" + }, + "source": [ + "resultado_sm.pred_table()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code",
+ "metadata": { + "id": "_liLYinwFgch" + }, + "source": [ + "# pred_table() rows are the actual classes and columns are the predicted classes\n", + "confusion_matrix = pd.DataFrame(resultado_sm.pred_table())\n", + "confusion_matrix.columns = ['Predicted No Diabetes', 'Predicted Diabetes']\n", + "confusion_matrix = confusion_matrix.rename(index = {0 : 'Actual No Diabetes', 1 : 'Actual Diabetes'})\n", + "confusion_matrix" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ceH3MODWFm7S" + }, + "source": [ + "# Accuracy on the training data: correct predictions (diagonal) over all predictions\n", + "cm = np.array(confusion_matrix)\n", + "training_accuracy = (cm[0,0] + cm[1,1])/ cm.sum()\n", + "training_accuracy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CH_iEuzhO109" + }, + "source": [ + "# Exercise 1 - Mall_Customers.csv\n", + "> The target variable of this dataframe is 'Annual Income'. Build a regression model using OLS, Ridge and LASSO and compare the results.\n", + "\n", + "* Try:\n", + "  * Lasso(alpha = 0.01, max_iter = 10e5);\n", + "  * Lasso(alpha = 0.0001, max_iter = 10e5);\n", + "  * Ridge(alpha = 0.01);\n", + "  * Ridge(alpha = 100);" + ] + },
+ { + "cell_type": "code", + "metadata": { + "id": "ZfRDEaaRYxFQ" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn import preprocessing\n", + "import matplotlib.pyplot as plt \n", + "plt.rc(\"font\", size=14)\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.model_selection import train_test_split\n", + "import seaborn as sns\n", + "sns.set(style=\"white\")\n", + "sns.set(style=\"whitegrid\", color_codes=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nulrLzUqYxFY" + }, + "source": [ + "## Data\n", + "\n", + "The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4LdrQCwxYxFY" + }, + "source": [ + "This dataset provides the customer information. It includes 41188 records and 21 fields." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qoT6zkoFYxFZ", + "outputId": "b04874af-bf4d-409f-cd1c-ad8c473004e6", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_bank = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/bank-full.csv', header = 0)\n", + "df_bank = df_bank.dropna()\n", + "print(df_bank.shape)\n", + "print(list(df_bank.columns))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "(45211, 1)\n", + "['age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"']\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZD23hMCeYxFc", + "outputId": "f347c846-5f92-4e4f-b468-2bfbc608777c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 195 + } + }, + "source": [ + "df_bank.head()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"
058;\"management\";\"married\";\"tertiary\";\"no\";2143...
144;\"technician\";\"single\";\"secondary\";\"no\";29;\"...
233;\"entrepreneur\";\"married\";\"secondary\";\"no\";2...
347;\"blue-collar\";\"married\";\"unknown\";\"no\";1506...
433;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n...
\n", + "
" + ], + "text/plain": [ + " age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"\n", + "0 58;\"management\";\"married\";\"tertiary\";\"no\";2143... \n", + "1 44;\"technician\";\"single\";\"secondary\";\"no\";29;\"... \n", + "2 33;\"entrepreneur\";\"married\";\"secondary\";\"no\";2... \n", + "3 47;\"blue-collar\";\"married\";\"unknown\";\"no\";1506... \n", + "4 33;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n... " + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 285 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CtGbim_EYxFh" + }, + "source": [ + "#### Input variables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0pJ7ai5ZYxFh" + }, + "source": [ + "1 - age (numeric)\n", + "\n", + "2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')\n", + "\n", + "3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)\n", + "\n", + "4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')\n", + "\n", + "5 - default: has credit in default? (categorical: 'no','yes','unknown')\n", + "\n", + "6 - housing: has housing loan? (categorical: 'no','yes','unknown')\n", + "\n", + "7 - loan: has personal loan? (categorical: 'no','yes','unknown')\n", + "\n", + "8 - contact: contact communication type (categorical: 'cellular','telephone')\n", + "\n", + "9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')\n", + "\n", + "10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')\n", + "\n", + "11 - duration: last contact duration, in seconds (numeric). 
Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.\n", + "\n", + "12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n", + "\n", + "13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)\n", + "\n", + "14 - previous: number of contacts performed before this campaign and for this client (numeric)\n", + "\n", + "15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')\n", + "\n", + "16 - emp.var.rate: employment variation rate - (numeric)\n", + "\n", + "17 - cons.price.idx: consumer price index - (numeric)\n", + "\n", + "18 - cons.conf.idx: consumer confidence index - (numeric) \n", + "\n", + "19 - euribor3m: euribor 3 month rate - (numeric)\n", + "\n", + "20 - nr.employed: number of employees - (numeric)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YwsaBV_OYxFi" + }, + "source": [ + "#### Predict variable (desired target):\n", + "\n", + "y - has the client subscribed a term deposit? (binary: '1','0')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2SsNWV_SYxFj" + }, + "source": [ + "The education column of the dataset has many categories and we need to reduce the categories for a better modelling. 
The education column has the following categories:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6TFbgh3vYxFk" + }, + "source": [ + "df_bank['education'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "luv7Bdf_YxFn" + }, + "source": [ + "Let us group \"basic.4y\", \"basic.9y\" and \"basic.6y\" together and call them \"Basic\"." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gkOlUOs2YxFn" + }, + "source": [ + "df_bank['education']=np.where(df_bank['education'] =='basic.9y', 'Basic', df_bank['education'])\n", + "df_bank['education']=np.where(df_bank['education'] =='basic.6y', 'Basic', df_bank['education'])\n", + "df_bank['education']=np.where(df_bank['education'] =='basic.4y', 'Basic', df_bank['education'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H-X1WMv2YxFq" + }, + "source": [ + "After grouping, these are the categories:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "r9LlgpkjYxFq" + }, + "source": [ + "df_bank['education'].unique()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fcnJy3KYYxFt" + }, + "source": [ + "### Data exploration" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qUrTMR8BYxFt" + }, + "source": [ + "df_bank['y'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rpzHnzJKYxFx" + }, + "source": [ + "sns.countplot(x='y',data=df_bank, palette='hls')\n", + "# Save before plt.show(): show() clears the current figure, so saving afterwards writes a blank image\n", + "plt.savefig('count_plot')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C99nOe3mYxF0" + }, + "source": [ + "There are 36548 no's and 4640 yes's in the outcome variable."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8nGaox_kYxF1" + }, + "source": [ + "Let's get a sense of the numbers across the two classes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sQvzA60bYxF1" + }, + "source": [ + "# numeric_only=True skips the categorical columns, which recent pandas versions refuse to average\n", + "df_bank.groupby('y').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u3xjoceKYxF3" + }, + "source": [ + "Observations:\n", + "\n", + "* The average age of customers who bought the term deposit is higher than that of the customers who didn't.\n", + "* The pdays (days since the customer was last contacted) is understandably lower for the customers who bought it. The lower the pdays, the better the memory of the last call and hence the better the chances of a sale.\n", + "* Surprisingly, campaigns (number of contacts or calls made during the current campaign) are lower for customers who bought the term deposit." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jvzGMePPYxF4" + }, + "source": [ + "We can calculate categorical means for other categorical variables such as education and marital status to get a more detailed sense of our data."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RqLVMjoxYxF5" + }, + "source": [ + "df_bank.groupby('job').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GTUeRJAtYxF7" + }, + "source": [ + "df_bank.groupby('marital').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xsxdFumiYxF9" + }, + "source": [ + "df_bank.groupby('education').mean(numeric_only=True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3i1DCWV-YxGA" + }, + "source": [ + "Visualizations" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OEArHQPbYxGB" + }, + "source": [ + "%matplotlib inline\n", + "pd.crosstab(df_bank.job,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Job Title')\n", + "plt.xlabel('Job')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('purchase_fre_job')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PNwo5du_YxGD" + }, + "source": [ + "The frequency of purchase of the deposit depends a great deal on the job title. Thus, the job title can be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eM7CWfAZYxGE" + }, + "source": [ + "table=pd.crosstab(df_bank.marital,df_bank.y)\n", + "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n", + "plt.title('Stacked Bar Chart of Marital Status vs Purchase')\n", + "plt.xlabel('Marital Status')\n", + "plt.ylabel('Proportion of Customers')\n", + "plt.savefig('marital_vs_pur_stack')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LWBLh7toYxGG" + }, + "source": [ + "Hard to see, but marital status does not seem to be a strong predictor of the outcome variable."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vh_u4QphYxGH" + }, + "source": [ + "table=pd.crosstab(df_bank.education,df_bank.y)\n", + "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n", + "plt.title('Stacked Bar Chart of Education vs Purchase')\n", + "plt.xlabel('Education')\n", + "plt.ylabel('Proportion of Customers')\n", + "plt.savefig('edu_vs_pur_stack')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d9AgJroYYxGK" + }, + "source": [ + "Education seems a good predictor of the outcome variable." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dHI2LT-IYxGL" + }, + "source": [ + "pd.crosstab(df_bank.day_of_week,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Day of Week')\n", + "plt.xlabel('Day of Week')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_dayofweek_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3A2jmS4MYxGR" + }, + "source": [ + "Day of week may not be a good predictor of the outcome" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bzafDBHpYxGS" + }, + "source": [ + "pd.crosstab(df_bank.month,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Month')\n", + "plt.xlabel('Month')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_fre_month_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x5CBtquEYxGW" + }, + "source": [ + "Month might be a good predictor of the outcome variable" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tgF_3SqWYxGY" + }, + "source": [ + "df_bank.age.hist()\n", + "plt.title('Histogram of Age')\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Frequency')\n", + "plt.savefig('hist_age')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + 
"id": "y0FhKYDsYxGc" + }, + "source": [ + "The most of the customers of the bank in this dataset are in the age range of 30-40." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5Nd3yV7DYxGd" + }, + "source": [ + "pd.crosstab(df_bank.poutcome,df_bank.y).plot(kind='bar')\n", + "plt.title('Purchase Frequency for Poutcome')\n", + "plt.xlabel('Poutcome')\n", + "plt.ylabel('Frequency of Purchase')\n", + "plt.savefig('pur_fre_pout_bar')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oRKUAGrjYxGh" + }, + "source": [ + "Poutcome seems to be a good predictor of the outcome variable." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "63RLRI9uYxGi" + }, + "source": [ + "### Create dummy variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V8S4WUKmYxGj" + }, + "source": [ + "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", + "for var in cat_vars:\n", + " cat_list='var'+'_'+var\n", + " cat_list = pd.get_dummies(df_bank[var], prefix=var)\n", + " df_bank1=df_bank.join(cat_list)\n", + " data=df_bank1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uX3w9i9WYxGl" + }, + "source": [ + "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", + "df_bank_vars=df_bank.columns.values.tolist()\n", + "to_keep=[i for i in df_bank_vars if i not in cat_vars]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cMX_82xaYxGq" + }, + "source": [ + "df_bank_final=df_bank[to_keep]\n", + "df_bank_final.columns.values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LkTjpxYoYxGr" + }, + "source": [ + "df_bank_final_vars=df_bank_final.columns.values.tolist()\n", + "y=['y']\n", + "X=[i for i in 
df_bank_final_vars if i not in y]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2QbKaRcsYxGt" + }, + "source": [ + "### Feature Selection" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EkxjW1AQYxGu" + }, + "source": [ + "from sklearn.feature_selection import RFE\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "logreg = LogisticRegression()\n", + "\n", + "# n_features_to_select must be passed by keyword in recent scikit-learn versions\n", + "rfe = RFE(logreg, n_features_to_select=18)\n", + "# ravel() turns the single-column target frame into the 1-D array that fit() expects\n", + "rfe = rfe.fit(df_bank_final[X], df_bank_final[y].values.ravel())\n", + "print(rfe.support_)\n", + "print(rfe.ranking_)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2P9hd4jHYxGw" + }, + "source": [ + "The Recursive Feature Elimination (RFE) has helped us select the following features: \"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5PW8WZX_YxGx" + }, + "source": [ + "cols=[\"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \n", + " \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \n", + " \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"] \n", + "X=df_bank_final[cols]\n", + "y=df_bank_final['y']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ix0mN9qxYxG0" + }, + "source": [ + "### Implementing the model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Hbx2bwtiYxG0" + }, + "source": [ + "import statsmodels.api as sm\n", + "logit_model=sm.Logit(y,X)\n", + "result=logit_model.fit()\n", + "print(result.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HR1ui-UcYxG2" + }, + "source": [ + "The p-values for most of the variables are very small, therefore, most of them are significant to the model." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9GHhrsaeYxG3" + }, + "source": [ + "### Logistic Regression Model Fitting" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MFQnH5MzYxG3" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn import metrics\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n", + "logreg = LogisticRegression()\n", + "logreg.fit(X_train, y_train)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YUa3QL7tYxG6" + }, + "source": [ + "#### Predicting the test set results and calculating the accuracy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SD-y2e33YxG6" + }, + "source": [ + "y_pred = logreg.predict(X_test)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kkPWzos7YxG-" + }, + "source": [ + "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwC3rt_6YxHA" + }, + "source": [ + "### Cross Validation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Muw50oqSYxHB" + }, + "source": [ + "from sklearn import model_selection\n", + "from sklearn.model_selection import cross_val_score\n", + "# random_state only has an effect when shuffle=True; recent scikit-learn raises an error otherwise\n", + "kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)\n", + "modelCV = LogisticRegression()\n", + "scoring = 'accuracy'\n", + "results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)\n", + "print(\"10-fold cross validation average accuracy: %.3f\" % (results.mean()))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4y8XCTqoYxHE" + }, + "source": [ + "### Confusion Matrix" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"BCza9NkVYxHE" + }, + "source": [ + "from sklearn.metrics import confusion_matrix\n", + "confusion_matrix = confusion_matrix(y_test, y_pred)\n", + "print(confusion_matrix)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X9SapwS2YxHG" + }, + "source": [ + "The result is telling us that we have 10872+254 correct predictions and 1122+109 incorrect predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bEWvWScYxHG" + }, + "source": [ + "#### Accuracy" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NaH2nESwYxHH" + }, + "source": [ + "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6oxlhbpYxHJ" + }, + "source": [ + "#### Compute precision, recall, F-measure and support\n", + "\n", + "The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.\n", + "\n", + "The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.\n", + "\n", + "The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.\n", + "\n", + "The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.\n", + "\n", + "The support is the number of occurrences of each class in y_test." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mhN5_p4yYxHK" + }, + "source": [ + "from sklearn.metrics import classification_report\n", + "print(classification_report(y_test, y_pred))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xzSFVEnAYxHP" + }, + "source": [ + "#### Interpretation: \n", + "\n", + "Of the entire test set, 88% of the promoted term deposit were the term deposit that the customers liked. Of the entire test set, 90% of the customer's preferred term deposit were promoted." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NGXJ6g2nYxHQ" + }, + "source": [ + "### ROC Curve\n", + "\n", + "from sklearn import metrics\n", + "from ggplot import *\n", + "\n", + "prob = clf1.predict_proba(X_test)[:,1]\n", + "fpr, sensitivity, _ = metrics.roc_curve(Y_test, prob)\n", + "\n", + "df = pd.DataFrame(dict(fpr=fpr, sensitivity=sensitivity))\n", + "ggplot(df, aes(x='fpr', y='sensitivity')) +\\\n", + " geom_line() +\\\n", + " geom_abline(linetype='dashed')" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "u9QKDuS0YxHQ" + }, + "source": [ + "from sklearn.metrics import roc_auc_score\n", + "from sklearn.metrics import roc_curve\n", + "# Use predicted probabilities for the AUC so the legend value matches the plotted curve\n", + "y_prob = logreg.predict_proba(X_test)[:,1]\n", + "logit_roc_auc = roc_auc_score(y_test, y_prob)\n", + "fpr, tpr, thresholds = roc_curve(y_test, y_prob)\n", + "plt.figure()\n", + "plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)\n", + "plt.plot([0, 1], [0, 1],'r--')\n", + "plt.xlim([0.0, 1.0])\n", + "plt.ylim([0.0, 1.05])\n", + "plt.xlabel('False Positive Rate')\n", + "plt.ylabel('True Positive Rate')\n", + "plt.title('Receiver operating characteristic')\n", + "plt.legend(loc=\"lower right\")\n", + "plt.savefig('Log_ROC')\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15_XX__ImbalancedSample_hs.ipynb b/Notebooks/NB15_XX__ImbalancedSample_hs.ipynb 
new file mode 100644 index 000000000..6f28ac6ac --- /dev/null +++ b/Notebooks/NB15_XX__ImbalancedSample_hs.ipynb @@ -0,0 +1,1416 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Untitled9.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rq9Q5HxFWLBW" + }, + "source": [ + "# References" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vdzGi1KTxnI2" + }, + "source": [ + "https://www.kaggle.com/saurav9786/feature-engineering-up-and-down-sampling\n", + "\n", + "https://towardsdatascience.com/dealing-with-imbalanced-data-in-churn-analysis-6ea1afba8b5e\n", + "\n", + "https://www.kdnuggets.com/2019/05/fix-unbalanced-dataset.html\n", + "\n", + "https://towardsdatascience.com/having-an-imbalanced-dataset-here-is-how-you-can-solve-it-1640568947eb\n", + "\n", + "https://medium.com/analytics-vidhya/balance-your-data-using-smote-98e4d79fcddb\n", + "\n", + "https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eN3pXeN9ae-D" + }, + "source": [ + "## Resampling the minority class\n", + "* Up-sampling is the process of randomly resampling observations from the minority class."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sq0Z9FPwb2Z8" + }, + "source": [ + "#### Load the libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YLIGeXQYj3Qn" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "import matplotlib.pyplot as plt # plotting library\n", + "import seaborn as sns # seaborn for statistical plots\n", + "from collections import Counter\n", + "\n", + "from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report, recall_score # accuracy measures and confusion matrix\n", + "\n", + "from imblearn.over_sampling import SMOTE" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EXOPkNK-jAwe" + }, + "source": [ + "!pip install imbalanced-learn" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qBeEcJ4lbrHi" + }, + "source": [ + "%matplotlib inline\n", + "\n", + "plt.rcParams['figure.figsize'] = [20.0, 7.0]\n", + "plt.rcParams.update({'font.size': 22,})\n", + "\n", + "sns.set_palette('viridis')\n", + "sns.set_style('white')\n", + "sns.set_context('talk', font_scale = 0.8)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PJLK5dVEb0Oe" + }, + "source": [ + "#### Load the data\n", + "* Credit Card dataframe: https://www.kaggle.com/mlg-ulb/creditcardfraud" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nZQwzb_wqFEs" + }, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DkObFs6Wdxy4" + }, + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0KV_cPVDbxUa" + }, + "source": [ + "url = '/content/drive/My Drive/Datasets4ML/creditcard.csv'\n", + "df_cc = 
pd.read_csv(url)\n", + "\n", + "df_cc.columns = [colunas.lower() for colunas in df_cc.columns]\n", + "\n", + "df_cc.drop(columns = 'time', inplace = True)\n", + "df_cc.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s5me2fqPqgfq" + }, + "source": [ + "### Handling missing values" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Y98jCK55qjyR" + }, + "source": [ + "df_cc.isna().sum()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QY5R8MmkdBsy" + }, + "source": [ + "### What is the ratio of fraud to non-fraud transactions?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0fNYOL_PLMMQ" + }, + "source": [ + "qtd = Counter(df_cc['class'])\n", + "qtd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S9SLjncvizFs" + }, + "source": [ + "# Fraud count as a percentage of the non-fraud count\n", + "np.round(100*qtd[1]/qtd[0], 4)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CELJp_aZy_RB" + }, + "source": [ + "### Normalization" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_2L-7fyaz0so" + }, + "source": [ + "df_cc2 = df_cc.copy()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JLwFSuLtzA9H" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", + "\n", + "l_colunas = df_cc.columns\n", + "l_colunas = l_colunas.drop('class')\n", + "l_colunas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "htApfx0h2Rr_" + }, + "source": [ + "for coluna in l_colunas:\n", + "    # Standardize each predictor into a new '<name>_2' column, then drop the original\n", + "    df_cc2[coluna+'_2'] = StandardScaler().fit_transform(np.array(df_cc2[coluna]).reshape(-1, 1)).ravel()\n", + "    df_cc2 = df_cc2.drop(columns = coluna)\n", + "\n", + "df_cc2.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "HMnKnqif0wDp" + }, + "source": [ + "### Amostra de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zoivyfcs0ye3" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_cc, y_cc, test_size = 0.3, random_state = 20111974) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lceXISCC3kbC" + }, + "source": [ + "### Treinamento do modelo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ikoTEk-W048P" + }, + "source": [ + "# Instancia:\n", + "lr = LogisticRegression() " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tj4vD_qlMVkN" + }, + "source": [ + "# Treina o modelo usando a amostra de treinamento: \n", + "lr.fit(X_treinamento, y_treinamento.ravel()) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G_APl0QY3rfY" + }, + "source": [ + "### Previsão do modelo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SJPiVCzJ3qbg" + }, + "source": [ + "y_pred = lr.predict(X_teste) \n", + " \n", + "# print classification report \n", + "print(classification_report(y_teste, y_pred)) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oBXTrmmN4IYq" + }, + "source": [ + "**Conclusão**: Temos acurácia de 100%.\n", + "\n", + "Observe o recall da classe minoritária: 0.57 ==> Isso mostra que estamos diante de uma amostra desbalanceada e que, neste caso, o modelo está inclinado/viesado pela classe majoritária." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7aEemta6lSTO" + }, + "source": [ + "## Resampling the majority class\n", + "* Down-sampling is the process of randomly sampling observations from the majority class so that its size is closer to that of the minority class." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "26W4RzWG1Tco" + }, + "source": [ + "X_cc = df_cc2.copy()\n", + "X_cc = X_cc.drop(columns = 'class', axis = 1)\n", + "\n", + "y_cc = df_cc2['class']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sZRcM4qqllDo" + }, + "source": [ + "### Process\n", + "1. Separate the observations of each class into different dataframes;\n", + "2. Resample the majority class WITH REPLACEMENT;\n", + "3. Combine the two dataframes holding the minority and majority classes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WB-v_VjHmH9M" + }, + "source": [ + "Below, we select the instances/rows where [class] = 0 ==> Majority class" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mcLiKkpAl30d" + }, + "source": [ + "df_cc_majo = df_cc2[df_cc2['class'] == 0]\n", + "df_cc_majo.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qc-VXhlYmcyi" + }, + "source": [ + "Below, we select the instances/rows where [class] = 1 ==> Minority class" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tbk1OrUzmcyj" + }, + "source": [ + "df_cc_mino = df_cc2[df_cc2['class'] == 1]\n", + "df_cc_mino.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UeCniZsBlYMc" + }, + "source": [ + "## Resampling the majority class (WITH REPLACEMENT)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TJ_I9bceqGYd" + }, + "source": [ + "np.random.seed(20111974)\n", + "df_cc_majo_s = df_cc_majo.sample(n = df_cc_mino.shape[0]+300, replace = True)\n", + "df_cc_majo_s.shape" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1LzjX3dJqzmY" + }, + "source": [ + "#### Combinar os dois dataframes" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0xNGGK-5q3Pd" + }, + "source": [ + "df_cc_s1 = pd.concat([df_cc_majo_s, df_cc_mino])\n", + "Counter(df_cc_s1['class'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EYhxJSfT56F2" + }, + "source": [ + "df_cc_s1.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8q7zxWK6ZE_B" + }, + "source": [ + "Portanto, o dataframe df_cc_s1 é uma das amostras em que tratamos o desbalanceamento." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IeGBIu_X6Pel" + }, + "source": [ + "### Amostra de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q6TqvmQC6Peq" + }, + "source": [ + "X = df_cc_s1.copy()\n", + "X = X.drop(columns = 'class', axis = 1)\n", + "\n", + "y = df_cc_s1['class']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_dO-ASMH6Peu" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X, y, test_size = 0.3, random_state = 20111974) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qn8K_6mQ6Pew" + }, + "source": [ + "### Treinamento do modelo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kDHE-bz46Pex" + }, + "source": [ + "# Instancia\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "lr = LogisticRegression() " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "42NW2MKzMt8v" + }, + "source": [ + "# treina o modelo na amostra de treinamento \n", + 
"lr.fit(X_treinamento, y_treinamento.ravel()) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6bMFgHcO6Pez" + }, + "source": [ + "### Previsão do modelo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xp5Fb3yk6Pez" + }, + "source": [ + "y_pred = lr.predict(X_teste) \n", + " \n", + "# print classification report \n", + "print(classification_report(y_teste, y_pred)) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1DyzKGSu6Pe1" + }, + "source": [ + "**Conclusão**: Temos acurácia de 94%.\n", + "\n", + "Observe o recall da classe minoritária: 0.92 ==> Isso mostra que estamos diante de uma amostra balanceada." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Q2-7aJTmSWJ" + }, + "source": [ + "### Verificar a quantidade de instâncias por preditoras" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SYnnu-lWmYUo" + }, + "source": [ + "X.shape[0]/X.shape[1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TQ_eu8Jyqwe7" + }, + "source": [ + "Temos 44 linhas para cada coluna/variável do dataframe." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e2tY6Xks6vOM" + }, + "source": [ + "## SMOTE (Synthetic Minority Oversampling Technique)\n", + "* Uma forma de se resolver o problema das amostras desbalanceadas é simplesmente reamostrando a classe minoritária e isso pode ser obtido através da duplicação da classe minoritária. Isso resolve o problema do desbalanceamento, mas não traz nenhuma informação adicional ao modelo.\n", + "* Uma alternativa é criar amostras sintéticas da classe minoritária e pode ser efetivo para resolver o problema do desbalanceamento.\n", + "* A estratégia mais utilizada é o SMOTE.\n", + " * Seleciona aleatoriamente amostras que estão próximos (amostras sintéticas). 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g9hW8lTGO4S_" + }, + "source": [ + "SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.\n", + "\n", + "**Imbalanced Learning: Foundations, Algorithms, and Applications, 2013**\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bXKn7na75I_5" + }, + "source": [ + "from imblearn.over_sampling import SMOTE \n", + "sm = SMOTE(random_state = 20111974) # por questões de reproducibilidade" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "no2QQgd66-qy" + }, + "source": [ + "### Amostra de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NKHz8aWl6-qz" + }, + "source": [ + "X_cc = df_cc2.copy()\n", + "X_cc = X_cc.drop(columns = 'class', axis = 1)\n", + "\n", + "y_cc = df_cc2['class']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7Mq3x4Ej6-q3" + }, + "source": [ + "from imblearn.over_sampling import SMOTE \n", + "sm = SMOTE(random_state = 20111974) \n", + "\n", + "X, y = sm.fit_sample(X_cc, y_cc) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OBVGbDEeKaK_" + }, + "source": [ + "def antes_depois(y_cc, y):\n", + " qtd_a = Counter(y_cc)\n", + " qtd_d = Counter(y)\n", + " print(qtd_a)\n", + " print(qtd_d)\n", + "\n", + " # scatter plot: antes\n", + " for label, _ in qtd_a.items():\n", + "\t row_ix = np.where(y_cc == label)[0]\n", + "\t plt.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\n", + " plt.legend()\n", + " plt.show()\n", + "\n", + " # scatter plot: 
depois\n", + " for label, _ in qtd_d.items():\n", + "\t row_ix = np.where(y == label)[0]\n", + "\t plt.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\n", + " plt.legend()\n", + " plt.show()\n", + "\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GkvkEEteKsUT" + }, + "source": [ + "antes_depois(y_cc, y)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CqqwdtDRtXFu" + }, + "source": [ + "sum(y)/sum(y_cc)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6It3vQq1-qdn" + }, + "source": [ + "## Estratégia 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kbQDKCyo_RTs" + }, + "source": [ + "from imblearn.under_sampling import RandomUnderSampler" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dLuoImU9-sGx" + }, + "source": [ + "over = SMOTE(sampling_strategy = 0.1) # Reamostrar a classe minoritária para ter 10% da classe majoritária\n", + "under = RandomUnderSampler(sampling_strategy = 0.5) # Seleciona a classe majoritária para ter 50%" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o4WdMzQ6-iV1" + }, + "source": [ + "Usando um Pileline" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8Hr38-lR_WaG" + }, + "source": [ + "from imblearn.pipeline import Pipeline" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NY0quIhW5bGe" + }, + "source": [ + "steps = [('over', over), ('under', under)]\n", + "pipeline = Pipeline(steps = steps)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E57PQ3isuGGM" + }, + "source": [ + "X_cc e y_cc são nossos dataframes originais." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0viVAtDs5bKs" + }, + "source": [ + "# Apply the pipeline\n", + "X, y = pipeline.fit_resample(X_cc, y_cc)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IjqfnKY3uYeU" + }, + "source": [ + "Before:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iIMZ9UgGQEq0" + }, + "source": [ + "Counter(y_cc)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hNpOBtw5ucEh" + }, + "source": [ + "After:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TYMLs88_QI9k" + }, + "source": [ + "qtd_d = Counter(y)\n", + "qtd_d" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2WHEPLGeSuJL" + }, + "source": [ + "# scatter plot\n", + "for label, _ in qtd_d.items():\n", + "    row_ix = np.where(y == label)[0]\n", + "    plt.scatter(np.asarray(X)[row_ix, 0], np.asarray(X)[row_ix, 1], label=str(label))\n", + "plt.legend()\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xl2O2ChzBuoX" + }, + "source": [ + "# Instantiate the model\n", + "lr = LogisticRegression()\n", + "\n", + "# add the model to the pipeline:\n", + "steps = [('over', over), ('under', under), ('model', lr)]\n", + "pipeline = Pipeline(steps = steps)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JgbhWi-mRMkH" + }, + "source": [ + "from sklearn.model_selection import RepeatedStratifiedKFold\n", + "from sklearn.model_selection import cross_val_score" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fez37c3qB-JP" + }, + "source": [ + "# evaluate the pipeline on the original data (the samplers are applied only when fitting on the training folds)\n", + "cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 20111974)\n", + "scores = cross_val_score(pipeline, X_cc, y_cc, scoring = 'roc_auc', cv = 
cv, n_jobs = -1)\n", + "scores" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-1hRfhf6CJpj" + }, + "source": [ + "Evaluate the pipeline" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WKAz2D8VCI96" + }, + "source": [ + "print(f'Mean ROC AUC: {np.mean(scores)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7W9arWK0AUVJ" + }, + "source": [ + "## SMOTE for classification" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N2a-d3SBAfCq" + }, + "source": [ + "# define pipeline\n", + "steps = [('over', SMOTE()), ('model', LogisticRegression())]\n", + "pipeline = Pipeline(steps = steps)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lGGZ-oReARUp" + }, + "source": [ + "# evaluate the pipeline on the original data\n", + "from sklearn.model_selection import RepeatedStratifiedKFold\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 20111974)\n", + "scores = cross_val_score(pipeline, X_cc, y_cc, scoring = 'roc_auc', cv = cv, n_jobs = -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "q2V9qNCnvkF8" + }, + "source": [ + "scores" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ixo5dDksvmFk" + }, + "source": [ + "np.std(scores)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N5n1lGndCA02" + }, + "source": [ + "Evaluate the pipeline:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1UlycCSeBd5L" + }, + "source": [ + "print(f'Mean ROC AUC: {np.mean(scores)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R7vgx9MUCg7F" + }, + "source": [ + "## Interesting questions\n", 
+ "* Qual o percentual adequado da classe minoritária e da classe majoritária com melhor performance?\n", + "* Qual o valor de k ótimo para o SMOTE (default: 5)?\n", + "```\n", + "for k in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:\n", + " # Defina o pipeline\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "a7NO3Pb9DVrK" + }, + "source": [ + "l_knn = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", + "\n", + "for k in l_knn:\n", + "\t# Pipeline\n", + "\tmodel = LogisticRegression()\n", + "\tover = SMOTE(sampling_strategy = 0.1, k_neighbors = k)\n", + "\tunder = RandomUnderSampler(sampling_strategy = 0.5)\n", + "\tsteps = [('over', over), ('under', under), ('model', model)]\n", + "\tpipeline = Pipeline(steps = steps)\n", + " \n", + "\t# Avalia o pipeline\n", + "\tcv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 20111974)\n", + "\tscores = cross_val_score(pipeline, X_cc, y_cc, scoring = 'roc_auc', cv = cv, n_jobs = -1)\n", + " \n", + " \ty_pred = model.predict(X_teste) \n", + " \n", + "\t# print classification report \n", + "\tprint(classification_report(y_teste, y_pred)) \n", + "\tprint(f'Valor de k: {k}; Mean ROC AUC: {np.mean(scores)}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YJt1bxijE58v" + }, + "source": [ + "## Borderline+SMOTE\n", + "* Uma extensão popular para SMOTE envolve a seleção de instâncias da classe minoritária que foram classificadas incorretamente, como com um modelo de classificação de k-vizinho mais próximo." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bp9xDm39E9Ru" + }, + "source": [ + "from imblearn.over_sampling import BorderlineSMOTE" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8ea3z02PGSly" + }, + "source": [ + "# transform the dataset\n", + "oversample = BorderlineSMOTE()\n", + "X, y = oversample.fit_resample(X_cc, y_cc)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "VsYz3KUgVGa1" + }, + "source": [ + "qtd_a = Counter(y_cc)\n", + "print(qtd_a)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WxSoOawzGzzS" + }, + "source": [ + "qtd_d = Counter(y)\n", + "print(qtd_d)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "k3mnOqs4GcUM" + }, + "source": [ + "# scatter plot\n", + "for label, _ in qtd_d.items():\n", + "    row_ix = np.where(y == label)[0]\n", + "    plt.scatter(np.asarray(X)[row_ix, 0], np.asarray(X)[row_ix, 1], label=str(label))\n", + "plt.legend()\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zHT2IVGyCTSs" + }, + "source": [ + "### BORDERLINE+SVM\n", + "* Uses an SVM (instead of KNN)."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OTnPb0uFCbhD" + }, + "source": [ + "# Load the library:\n", + "from imblearn.over_sampling import SVMSMOTE\n", + "\n", + "oversample = SVMSMOTE()\n", + "X, y = oversample.fit_resample(X_cc, y_cc)\n", + "\n", + "# summarize the new class distribution\n", + "qtd_d = Counter(y)\n", + "print(qtd_d)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ltfp7_XgKCnd" + }, + "source": [ + "# scatter plot\n", + "for label, _ in qtd_d.items():\n", + "    row_ix = np.where(y == label)[0]\n", + "    plt.scatter(np.asarray(X)[row_ix, 0], np.asarray(X)[row_ix, 1], label=str(label))\n", + "plt.legend()\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O4G71LLTJTpS" + }, + "source": [ + "### Adaptive Synthetic Sampling (ADASYN)\n", + "* Generates more synthetic samples in regions where the density of the minority class is low and FEWER (or none) where the density of the minority class is high."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wdepfj4HA6wd" + }, + "source": [ + "# Load the library:\n", + "from imblearn.over_sampling import ADASYN\n", + "\n", + "oversample = ADASYN()\n", + "X, y = oversample.fit_resample(X_cc, y_cc)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QUHQblBuWBJw" + }, + "source": [ + "qtd_d = Counter(y)\n", + "print(qtd_d)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "RJzg9YRdJrLO" + }, + "source": [ + "# scatter plot\n", + "for label, _ in qtd_d.items():\n", + "    row_ix = np.where(y == label)[0]\n", + "    plt.scatter(np.asarray(X)[row_ix, 0], np.asarray(X)[row_ix, 1], label=str(label))\n", + "plt.legend()\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "8bypBo3jJxHR" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "mhavlcftJxFU" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "y8L-a-lbJw8x" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "08ZqlICmfGQq" + }, + "source": [ + "clf = setup(data = df_cc,\n", + "            target = 'class',\n", + "            session_id = 20111974,\n", + "            silent = False,\n", + "            fix_imbalance = False,\n", + "            fix_imbalance_method = None,\n", + "            ignore_features = ['time'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "32yWqmXkfsya" + }, + "source": [ + "compare_models()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BdZk0nnZl6cb" + }, + "source": [ + "evaluate_model(dt)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"qhCV7eMLl0Bs" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7iUqgofnl0QE" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c-n7LuT0l0Sy" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bGMEnl0IktNz" + }, + "source": [ + "### O melhor modelo" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VeMJ11_Jkoq8" + }, + "source": [ + "dt = create_model('dt')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LXTGruYkkfk7" + }, + "source": [ + "evaluate_model(dt)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U9xYILH2ajfe" + }, + "source": [ + "## Reamostragem da classe majoritária" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oS4R_nAcaHl1" + }, + "source": [ + "clf = setup(data = df_credit_card, target = 'Class')" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15__ML_AutoML_pycaret_hs.ipynb b/Notebooks/NB15__ML_AutoML_pycaret_hs.ipynb new file mode 100644 index 000000000..9f3b7e06a --- /dev/null +++ b/Notebooks/NB15__ML_AutoML_pycaret_hs.ipynb @@ -0,0 +1,316 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Untitled8.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FfhCoyP98gDt" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "226lzu3i8kRp" + }, + "source": [ + "url = 
'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_Original.csv'\n", + "df_titanic = pd.read_csv(url)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6mL0RI0V9JmP" + }, + "source": [ + "#!pip install pycaret" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WL9nShOd86Fu" + }, + "source": [ + "from pycaret.classification import *" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YRtIVR7LC9nl" + }, + "source": [ + "https://www.kaggle.com/frtgnn/pycaret-introduction-classification-regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-dLwhmi9jTA" + }, + "source": [ + "### Set up" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jooY5VUr9sqd" + }, + "source": [ + "# Normalize the column names:\n", + "df_titanic.columns = [colunas.lower() for colunas in df_titanic.columns]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "erqtZNz9yZ2T" + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bf_IntG2ygtP" + }, + "source": [ + "### Handling the fare feature/variable" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "buRgX2rucrHT" + }, + "source": [ + "#fare_bins = ['Muito Baixo', 'Baixo', 'Medio', 'Alto', 'Muito Alto']\n", + "fare_bins = ['Baixo', 'Medio', 'Alto']\n", + "\n", + "df_titanic2 = df_titanic.copy()\n", + "\n", + "# Required transformations\n", + "\n", + "#df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = [0, .2, .4, .6, .8, 1], labels = fare_bins)\n", + "#df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = 5, labels = fare_bins)\n", + "df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = 3, labels = fare_bins)\n", + "\n", 
+ "#df_titanic2.drop(columns = [], axis = 1, inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6AWpUmbE7p39" + }, + "source": [ + "df_titanic2['fare_bins'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "W0RtT6Xr9IVL" + }, + "source": [ + "clf = setup(data = df_titanic2,\n", + " target = 'survived', \n", + " numeric_imputation = 'mean', # para tratamento dos missing values\n", + " categorical_features = ['sex', 'embarked'], # lista das variáveis categóricas\n", + " ignore_features = ['name', 'ticket', 'cabin', 'passengerid'], \n", + " silent = False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "mGGubn7k-GNi" + }, + "source": [ + "compare_models()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4YdMCHT92Tij" + }, + "source": [ + "\tModel\tAccuracy\tAUC\tRecall\tPrec.\tF1\tKappa\tMCC\tTT (Sec)\n", + "catboost\tCatBoost Classifier\t0.8218\t0.8634\t0.7857\t0.8275\t0.8154\t0.5996\t0.6150\t1.026\n", + "gbc\tGradient Boosting Classifier\t0.8187\t0.8540\t0.7867\t0.8231\t0.8129\t0.5959\t0.6084\t0.111\n", + "lightgbm\tLight Gradient Boosting Machine\t0.8186\t0.8683\t0.7937\t0.8195\t0.8149\t0.6009\t0.6073\t0.052" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8rcs_jJFCjRW" + }, + "source": [ + "lgbm = create_model('lightgbm') " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5BdvRHPdCq0E" + }, + "source": [ + "tuned_lightgbm = tune_model(lgbm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WwCW_pDYI1hy" + }, + "source": [ + "plot_model(estimator = tuned_lightgbm, plot = 'learning')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": 
"LES2FO1zI4X8" + }, + "source": [ + "plot_model(estimator = tuned_lightgbm, plot = 'auc')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xxGQX3jbI4bN" + }, + "source": [ + "plot_model(estimator = tuned_lightgbm, plot = 'confusion_matrix')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1O_9qDHgJJjw" + }, + "source": [ + "plot_model(estimator = tuned_lightgbm, plot = 'feature')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W1xnpqD-46vh" + }, + "source": [ + "### Painel com todos os outputs" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PluFZQ8bI4hV" + }, + "source": [ + "evaluate_model(tuned_lightgbm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JaffgUyy4bwz" + }, + "source": [ + "!pip install shap" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uez4Gik8JwET" + }, + "source": [ + "interpret_model(tuned_lightgbm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9U2SnEKA41nW" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB15__ML_AutoML_pycaret_hs2.ipynb b/Notebooks/NB15__ML_AutoML_pycaret_hs2.ipynb new file mode 100644 index 000000000..aad4c6f9b --- /dev/null +++ b/Notebooks/NB15__ML_AutoML_pycaret_hs2.ipynb @@ -0,0 +1,317 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Untitled8.ipynb", + "provenance": [], + "private_outputs": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FfhCoyP98gDt" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "226lzu3i8kRp" + }, + "source": [ + "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_Original.csv'\n", + "df_titanic = pd.read_csv(url)\n", + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6mL0RI0V9JmP" + }, + "source": [ + "#!pip install pycaret" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WL9nShOd86Fu" + }, + "source": [ + "from pycaret.classification import *" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YRtIVR7LC9nl" + }, + "source": [ + "https://www.kaggle.com/frtgnn/pycaret-introduction-classification-regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-dLwhmi9jTA" + }, + "source": [ + "### Set up" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jooY5VUr9sqd" + }, + "source": [ + "# Normalize the column names:\n", + "df_titanic.columns = [colunas.lower() for colunas in df_titanic.columns]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "erqtZNz9yZ2T" + }, + "source": [ + "df_titanic.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bf_IntG2ygtP" + }, + "source": [ + "### Handling the fare feature/variable" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "buRgX2rucrHT" + }, + "source": [ + "#fare_bins = ['Muito Baixo', 'Baixo', 'Medio', 'Alto', 'Muito Alto']\n", + "fare_bins = ['Baixo', 'Medio', 'Alto']\n", + "\n", + "df_titanic2 = df_titanic.copy()\n", + "\n", + "# Required transformations\n", + "\n", + 
"#df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = [0, .2, .4, .6, .8, 1], labels = fare_bins\n", + "#df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = 5, labels = fare_bins)\n", + "df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = 3, labels = fare_bins)\n", + "\n", + "#df_titanic2.drop(columns = [], axis = 1, inplace = True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "6AWpUmbE7p39" + }, + "source": [ + "df_titanic2['fare_bins'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "W0RtT6Xr9IVL" + }, + "source": [ + "clf = setup(data = df_titanic2,\n", + " target = 'survived', \n", + " numeric_imputation = 'mean', # para tratamento dos missing values\n", + " categorical_features = ['sex', 'embarked'], # lista das variáveis categóricas\n", + " ignore_features = ['name', 'ticket', 'cabin', 'passengerid'], # variáveis que serão ignoradas\n", + " session_id = 20111974, # Seed por questões de reproducibilidade\n", + " silent = False)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "mGGubn7k-GNi" + }, + "source": [ + "compare_models()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4YdMCHT92Tij" + }, + "source": [ + "\tModel\tAccuracy\tAUC\tRecall\tPrec.\tF1\tKappa\tMCC\tTT (Sec)\n", + "catboost\tCatBoost Classifier\t0.8218\t0.8634\t0.7857\t0.8275\t0.8154\t0.5996\t0.6150\t1.026\n", + "gbc\tGradient Boosting Classifier\t0.8187\t0.8540\t0.7867\t0.8231\t0.8129\t0.5959\t0.6084\t0.111\n", + "lightgbm\tLight Gradient Boosting Machine\t0.8186\t0.8683\t0.7937\t0.8195\t0.8149\t0.6009\t0.6073\t0.052" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8rcs_jJFCjRW" + }, + "source": [ + "lgbm = create_model('lightgbm') " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", 
+ "metadata": { + "id": "5BdvRHPdCq0E" + }, + "source": [ + "tuned_lightgbm = tune_model(lgbm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WwCW_pDYI1hy" + }, + "source": [ + "plot_model(estimator = tuned_lightgbm, plot = 'learning')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LES2FO1zI4X8" + }, + "source": [ + "plot_model(estimator = tuned_lightgbm, plot = 'auc')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "xxGQX3jbI4bN" + }, + "source": [ + "plot_model(estimator = tuned_lightgbm, plot = 'confusion_matrix')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "1O_9qDHgJJjw" + }, + "source": [ + "plot_model(estimator = tuned_lightgbm, plot = 'feature')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W1xnpqD-46vh" + }, + "source": [ + "### Painel com todos os outputs" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PluFZQ8bI4hV" + }, + "source": [ + "evaluate_model(tuned_lightgbm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JaffgUyy4bwz" + }, + "source": [ + "!pip install shap" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uez4Gik8JwET" + }, + "source": [ + "interpret_model(tuned_lightgbm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9U2SnEKA41nW" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB19_Redes_Neurais_hs.ipynb b/Notebooks/NB19_Redes_Neurais_hs.ipynb new file mode 100644 index 000000000..877e810cf --- /dev/null +++ b/Notebooks/NB19_Redes_Neurais_hs.ipynb @@ -0,0 +1,12378 @@ +{ + 
"nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "accelerator": "TPU", + "colab": { + "name": "NB19_Redes_Neurais__V2.ipynb", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ShVXyGj9wkgN" + }, + "source": [ + "

ARTIFICIAL NEURAL NETWORKS (COMPREHENSIVE GUIDE)

\n", + "\n", + "# Why do Data Scientists want to learn and master Neural Networks?\n", + "\n", + "* Neural Networks can learn, model, and solve the complex, non-linear problems found in real life.\n", + "* You have probably heard of Artificial Intelligence, _self-driving cars_, _Deep Learning_, _Computer Vision_, and _Natural Language Processing_ (NLP). All of these topics are closely related to Neural Networks. For example, _Deep Learning_ refers to Neural Networks with many _Hidden Layers_.\n", + "\n", + "This course covers the main topics you need to master Neural Networks. We will also discuss best practices and address the questions students most often raise about Neural Networks. By the end of this course you will be able to:\n", + "\n", + "* build your own Neural Networks;\n", + "* apply the right algorithm to each type of problem;\n", + "* apply activation functions correctly for each type of problem and layer;\n", + "* use the parts of TensorFlow/Keras needed for Neural Networks;\n", + "* use the Python/NumPy commands needed to develop Neural Networks;\n", + "* apply the most suitable metric to each type of problem;\n", + "* understand how Neural Networks learn (_Backpropagation_).\n", + "\n", + "# **AGENDA**\n", + "\n", + "* Introduction to Neural Networks;\n", + "* _Activation Function_;\n", + "* _Loss Function_;\n", + "* Metrics for measuring Neural Network performance;\n", + "* _Dropout_;\n", + "* _Backpropagation_;\n", + "* _Gradient Descent_;\n", + "* _Perceptron_ (Neural Network with a single layer);\n", + "* Example 1: _Perceptron_ Neural Networks for the logical operators AND, OR, and XOR;\n", + "* Multilayer Neural Networks;\n", + "* Practical example: Neural Network to identify sex from weight and height;\n", + "* Neural Network applications:\n", + " * Application 1 - Neural Network to identify flower species (Iris dataset);\n", + " * Application 2 - Neural Network to identify the wine type (_Red or White_);\n", + " * Application 3 - Neural Network to identify breast cancer (_Breast Cancer_ dataset);\n", + " * Application 4 - Neural Network to identify diabetes (Diabetes dataset);\n", + " * Application 5 - Neural Network to predict house prices in Boston (_Boston House Price Prediction_)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYQ4cDfcPu4e" + }, + "source": [ + "___\n", + "# **NOTES AND REMARKS**\n", + "\n", + "1. Consider using StratifiedKFold;\n", + "2. Insert here the counterfeit banknotes example sent by Mónica;\n", + "3. Leave one of the solved items as an exercise.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QgX6n2VDyY1O" + }, + "source": [ + "___\n", + "# **REFERENCES**\n", + "- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html)\n", + "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721)\n", + "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html)\n", + "- [Forward propagation in neural networks — Simplified math and code version](https://towardsdatascience.com/forward-propagation-in-neural-networks-simplified-math-and-code-version-bbcfef6f9250)\n", + "- [Understanding Neural Networks: From Activation Function To Back Propagation](https://medium.com/fintechexplained/neural-networks-activation-function-to-back-propagation-understanding-neural-networks-bdd036c3f29f)\n", + "- [Understanding Gradient Descent](https://medium.com/analytics-vidhya/understanding-gradient-descent-8dd88a4c60e6) - Explains in detail how _Gradient Descent_ works when optimizing the weights $W$;\n", + "- [Backpropagation step by step](https://medium.com/swlh/backpropagation-step-by-step-13f2b6c0b414) - I used this article to readjust the 
weights $W$;\n", + "- [Perceptron Learning Algorithm: A Graphical Explanation Of Why It Works](https://towardsdatascience.com/perceptron-learning-algorithm-d5db0deab975);\n", + "- [Math behind Perceptrons](https://medium.com/@iamask09/math-behind-perceptrons-7241d5dadbfc);\n", + "- [Neural Network: A Complete Beginners Guide from Scratch](https://medium.com/gadictos/neural-network-a-complete-beginners-guide-from-scratch-cf1fc9d5cd12);\n", + "- [Calculating the Backpropagation of a Network](https://medium.com/towards-artificial-intelligence/calculating-back-propagation-of-a-network-1febbcaa2b5d);\n", + "- [Let’s build a simple Neural Net!](https://becominghuman.ai/lets-build-a-simple-neural-net-f4474256647f) - The author builds a simple Neural Network with no _Hidden Layers_;\n", + "- [Coding Neural Network — Forward Propagation and Backpropagtion](https://towardsdatascience.com/coding-neural-network-forward-propagation-and-backpropagtion-ccf8cf369f76);\n", + "- [The Simplest Neural Network: Understanding the non-linearity](https://towardsdatascience.com/the-simplest-neural-network-understanding-the-non-linearity-10846d7d0141) - A great text for understanding non-linearity;\n", + "- [Implementing the XOR Gate using Backpropagation in Neural Networks](https://towardsdatascience.com/implementing-the-xor-gate-using-backpropagation-in-neural-networks-c1f255b4f20d) - I used this text to solve the XOR problem;\n", + "- [Neural Representation of AND, OR, NOT, XOR and XNOR Logic Gates (Perceptron Algorithm)](https://medium.com/@stanleydukor/neural-representation-of-and-or-not-xor-and-xnor-logic-gates-perceptron-algorithm-b0275375fea1) - I used this material to solve the problem of the AND, OR, and XOR operators;\n", + "- [Solving XOR with a single Perceptron](https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182);\n", + "- [Machine Learning 101 — Artificial Neural 
Networks](https://towardsdatascience.com/machine-learning-101-artificial-neural-networks-3-46ccb04cba30) - Step-by-step calculations;\n", + "- [Neural Network from scratch in Python](https://towardsdatascience.com/math-neural-network-from-scratch-in-python-d6da9f29ce65) - This article shows the math behind Neural Networks;\n", + "- [Classical Neural Net: Why/Which Activations Functions?](https://towardsdatascience.com/classical-neural-net-why-which-activations-functions-401159ba01c4) - An article discussing the main activation functions;\n", + "- [Understanding Activation Functions in Neural Networks](https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0);\n", + "- [Mind: How to Build a Neural Network (Part One)](https://becominghuman.ai/mind-how-to-build-a-neural-network-part-one-67b6aea4ce20);\n", + "- [How to build a simple Neural Network from scratch with Python](https://towardsdatascience.com/how-to-build-a-simple-neural-network-from-scratch-with-python-9f011896d2f3);\n", + "- [Comparison of Activation Functions for Deep Neural Networks](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a);\n", + "- [MAE and RMSE — Which Metric is Better?](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d) - A great article that discusses which metric is better.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2StZkTpOZbYo" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING DEVELOPMENT LIFECYCLE**\n", + "\n", + "CRISP-DM stands for _Cross Industry Standard Process for Data Mining_: a set of processes, or phases, for developing _Data Mining_ projects that Data Scientists widely use to build predictive models.\n", + "\n", + "[Figure: CRISP-DM process diagram]\n", + "\n", + "Source: [The steps to a successful machine learning 
project](https://emba.epfl.ch/2018/04/10/steps-successful-machine-learning-project/)\n", + "\n", + "I suggest reading the article [Why using CRISP-DM will make you a better Data Scientist](https://towardsdatascience.com/why-using-crisp-dm-will-make-you-a-better-data-scientist-66efe5b72686)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsCbZd2epfxo" + }, + "source": [ + "___\n", + "# **INTRODUCTION TO NEURAL NETWORKS**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HqqB2vaHXMGt" + }, + "source": [ + "* Neural Networks learn from past experience, loosely mimicking how human neurons work during learning;\n", + "* they are widely used in the following situations (applications):\n", + " * Facial recognition;\n", + " * Natural Language Processing (NLP);\n", + " * _Self-driving cars_;\n", + " * Computer vision;\n", + " * Pattern detection (diseases, tumors, etc.) in images;\n", + "* they are well suited to scenarios with large amounts of data (_Big Data_) and to complex problems." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VzylPHA7BP0x" + }, + "source": [ + "___\n", + "# **_PERCEPTRON_** (Neural Network with a single layer)\n", + "\n", + "* **_PERCEPTRON_** is a _Machine Learning_ algorithm of the _Supervised Learning_ class for binary classification, invented in 1958 by Frank Rosenblatt;\n", + "* It is the simplest Neural Network architecture there is, with a single layer.\n", + "\n", + "**Hence an important question**: if the _Perceptron_ is a simple type of Neural Network born in the 1950s, why should we study it? Why not focus on more complex, current Neural Networks?\n", + "\n", + "**And the answer is**: because the _Perceptron_ lets us clearly understand the mathematical aspects of Neural Networks. In other words, once we understand _Perceptrons_, other types of Neural Networks become easier to understand." 
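+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a minimal sketch of the math behind a _Perceptron_, assuming inputs $X_{1}, ..., X_{n}$, weights $W_{1}, ..., W_{n}$ and the threshold of 1 that this notebook's `ativacao_StepFunction` uses (other texts often use a threshold of 0 together with a bias term):\n", + "\n", + "$$S= \\sum_{i=1}^{n}X_{i}W_{i}, \\qquad f(S)= \\begin{cases} 1, & \\text{if } S \\geq 1 \\\\ 0, & \\text{otherwise} \\end{cases}$$"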
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M5YNraza6jum" + }, + "source": [ + "## Perceptron Example\n", + "\n", + "Below is the architecture of the _Perceptron_: several inputs (_Inputs_) and 1 binary (0 or 1) output layer (_Output Layer_).\n", + "\n", + "* OL stands for **O**utput **L**ayer ==> the value we want to estimate, that is, $\\hat{y}$.\n", + "\n", + "[Figure: _Perceptron_ architecture]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_5LVgImx78xY" + }, + "source": [ + "The **ACTIVATION FUNCTION** $f(S)$ in the figure above is known as the **_STEP FUNCTION_**; it returns a binary response (0 or 1) that depends on the value of $S$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "84zFWve4FkcY" + }, + "source": [ + "Below is an implementation using NumPy:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "htVV-GpgBnw3" + }, + "source": [ + "[**Python**] - Import NumPy:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xBYyZ5ZiByH4" + }, + "source": [ + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-8sR77a4B8Uf" + }, + "source": [ + "[**Python**] - Set NumPy's print precision to 3 decimal places (this affects how arrays are displayed, not individual scalar values):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gj2dioDTaZl-" + }, + "source": [ + "np.set_printoptions(precision=3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oZ6Sw4uuCggF" + }, + "source": [ + "[**Python**] - Define the weights $W$ and the inputs $X$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "z2m6BxQ_DLFV" + }, + "source": [ + "# Weights W:\n", + "W = np.array([0.1, 0.3, 0.2, 0.4])\n", + "\n", + "# Inputs X:\n", + "X = np.array([1, -3, 2, 3])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lBnZP5MKCg8m" + }, + "source": [ + "[**Python**] - Implement the sum function 
$S$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dMGuWhAhDaim" + }, + "source": [ + "def Soma(X, W):\n", + "    S = X.dot(W)  # Dot product: S = X1*W1 + X2*W2 + X3*W3 + X4*W4\n", + "    return S" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EMxMJ05kDhmi" + }, + "source": [ + "[**Python**] - Implement the _Step Function_ activation $f(S)$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dRLYPJl0aZmg" + }, + "source": [ + "def ativacao_StepFunction(S):\n", + "    # Step function with threshold 1: returns 1 if S >= 1, otherwise 0\n", + "    if S >= 1:\n", + "        return 1\n", + "    else:\n", + "        return 0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H4g85O2jDu6S" + }, + "source": [ + "[**Python**] - Compute $S = Soma(X, W)$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zoUMvvzlaZm-", + "outputId": "07b3e65a-176f-4309-beab-24595ccb5f4d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "S = Soma(X, W)\n", + "S" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.8000000000000003" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6LzlyDNaD5yB" + }, + "source": [ + "[**Python**] - Compute $f(S)$, that is, `f = ativacao_StepFunction(S)` (since $S \\approx 0.8 < 1$, the result is 0):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IIe4vIjaZnE", + "outputId": "ba253dca-9955-4eea-8015-eaf002413935", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f = ativacao_StepFunction(S)\n", + "f" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UrRG8e8dDTc_" + }, + "source": [ + "# **EXAMPLE 1: 
BUILD A _PERCEPTRON_ NEURAL NETWORK FOR THE LOGICAL OPERATORS AND, OR, AND XOR**\n", + "\n", + "The examples below were inspired by and adapted from:\n", + "* [Perceptron: Theory and Practice](https://medium.com/data-alchemist/perceptron-theory-and-practice-e71733ed3fa5);\n", + "* [The Perceptron — A Building Block of Neural Networks](https://blog.usejournal.com/the-perceptron-the-building-block-of-neural-networks-5a428d3f451d) - This article shows the calculations in detail;\n", + "* [Mind: How to Build a Neural Network (Part One)](https://becominghuman.ai/mind-how-to-build-a-neural-network-part-one-67b6aea4ce20)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qeZBP3TQN2_1" + }, + "source": [ + "## Example 1.1: _Perceptron_ Neural Network for the Logical Operator AND\n", + "\n", + "Consider the following table:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 0 |\n", + "| 2 | 1 | 0 | 0 |\n", + "| 3 | 1 | 1 | 1 |\n", + "\n", + "The table above represents the logical operator AND (https://en.wikipedia.org/wiki/Truth_table):\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | F | F | F |\n", + "| 1 | F | T | F |\n", + "| 2 | T | F | F |\n", + "| 3 | T | T | T |\n", + "\n", + "\n", + "Consider $W= [W_{1}, W_{2}]= [0, 0]$ as the initial weights and the _Step Function_ activation $f(S)$ which, as implemented in `ativacao_StepFunction`, returns 1 when $S \\geq 1$ and 0 otherwise:\n", + "\n", + "[Figure: _Step Function_ $f(S)$]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "psJh-MUgFAge" + }, + "source": [ + "Below are the manual calculations for the first iteration:\n", + "\n", + "[Figure: manual calculations for the first iteration]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P8x3EvFUQBsU" + }, + "source": [ + "The errors $E_{i}$ are computed with the formula: $E_{i}= ActualValue_{i} - PredictedValue_{i}= y_{i}-\\hat{Y}_{i}$. 
Below is a summary of the calculations:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d5dryrbGBesj" + }, + "source": [ + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) | Sum | PredictedValue ($\\hat{Y}_{i}$) | Error |\n", + "|---|---|---|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 0 | 0 | 0 | 0 |\n", + "| 2 | 1 | 0 | 0 | 0 | 0 | 0 |\n", + "| 3 | 1 | 1 | 1 | 0 | 0 | 1 |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkcRy2RYGLVw" + }, + "source": [ + "### Total Error ($E_{T}$)\n", + "\n", + "$$E_{T}= \\sum_{i=0}^{n-1}E_{i}= E_{0}+E_{1}+...+E_{n-1}$$\n", + "\n", + "In our case, $E_{T}= 0+0+0+1= 1$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fzVxmr9OTfGB" + }, + "source": [ + "### Formula for adjusting the weights $W$\n", + "The following formula will be used to adjust the weights $W$:\n", + "\n", + "$$W_{n+1}= W_{n} + \\alpha*(X*E_{T})$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z1bEDxMhToIj" + }, + "source": [ + "### Learning Rate ($\\alpha$)\n", + "* $\\alpha$ is the _Learning Rate_ and controls how fast the Neural Network learns.\n", + " * The SMALLER the value of $\\alpha$ $\\Longrightarrow$ the slower the convergence toward the global minimum;\n", + " * The LARGER the value of $\\alpha$ $\\Longrightarrow$ the faster the convergence toward a minimum, **but with no guarantee of converging to the global minimum**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "drGfgCIZY4aV" + }, + "source": [ + "To adjust the weights $W$, we will use $\\alpha= 0.1$. 
Formula:\n", + "\n", + "$$W_{n+1}= W_{n} + \\alpha*(X*E_{T})$$\n", + "\n", + "Below are the new weights $W$ for the next iteration of the _Perceptron_ Neural Network:\n", + "\n", + "\\begin{align}\n", + "W_{1}&= 0+ 0.1*1*1= 0.1 \\\\\n", + "W_{2}&= 0+ 0.1*1*1= 0.1\n", + "\\end{align}\n", + "\n", + "Therefore, in the next iteration we will use the weights $W= [W_{1}, W_{2}]= [0.1, 0.1]$. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "33xLPLo-Pq0Y" + }, + "source": [ + "Below are the manual calculations for the second iteration of the Neural Network:\n", + "\n", + "_Step Function_ activation $f(S)$:\n", + "\n", + "[Figure: _Step Function_ $f(S)$]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i3EsH8pN9wJ6" + }, + "source": [ + "[Figure: manual calculations for the second iteration]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mZiiOu1AyW2N" + }, + "source": [ + "Below is a summary of the calculations for the second iteration:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) | Sum | PredictedValue ($\\hat{Y}_{i}$) | Error |\n", + "|---|---|---|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 0 | 0.1 | 0 | 0 |\n", + "| 2 | 1 | 0 | 0 | 0.1 | 0 | 0 |\n", + "| 3 | 1 | 1 | 1 | 0.2 | 0 | 1 |\n", + "\n", + "Hence, $E_{T}= 0+0+0+1= 1$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MAXO38uqUobn" + }, + "source": [ + "### Adjusting the weights $W$\n", + "Formula for adjusting $W$:\n", + "\n", + "$$W_{n+1}= W_{n} + \\alpha*(X*E_{T})$$\n", + "\n", + "Below are the new weights $W$ for the next iteration of the _Perceptron_ Neural Network:\n", + "\n", + "\\begin{align}\n", + "W_{1}&= 0.1+ 0.1*1*1= 0.2 \\\\\n", + "W_{2}&= 0.1+ 0.1*1*1= 0.2\n", + "\\end{align}\n", + "\n", + "Therefore, in the next iteration we will use the weights $W= [W_{1}, W_{2}]= [0.2, 0.2]$. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WX48iRa5VLyk" + }, + "source": [ + "This iterative process is repeated until we find weights $W$ that give 100% accuracy. 
As an example, consider $W= [W_{1}, W_{2}]= [0.5, 0.5]$:\n", + "\n", + "_Step Function_ activation $f(S)$:\n", + "\n", + "[Figure: _Step Function_ $f(S)$]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LZfroCc994oz" + }, + "source": [ + "[Figure: manual calculations with $W= [0.5, 0.5]$]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "McKYXohzXzzA" + }, + "source": [ + "As you can see, the Total Error is $E_{T}= 0$, because $W= [W_{1}, W_{2}]= [0.5, 0.5]$ gives 100% correct predictions (accuracy): the sums are 0, 0.5, 0.5 and 1.0, and only the last one reaches the threshold of 1 used by `ativacao_StepFunction`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wp_tR7h0btDm" + }, + "source": [ + "### Implementing the **_PERCEPTRON_** in Python using NumPy" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3ix5vCKaEWdx" + }, + "source": [ + "[**Python**] - Import NumPy:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "x62R_y89ElPA" + }, + "source": [ + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SYvLGlgZEXWu" + }, + "source": [ + "[**Python**] - Set the number of decimal places:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yEScd0_LEtJc" + }, + "source": [ + "np.set_printoptions(precision=3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P8hLz6GAEYCo" + }, + "source": [ + "[**Python**] - Define the weights $W$, the inputs $X$ and the output $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fD66QeoqXEU3" + }, + "source": [ + "# Weights W:\n", + "W = np.array([0.0, 0.0])\n", + "\n", + "# Inputs X:\n", + "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n", + "\n", + "# Output Y:\n", + "Y = np.array([[0], [0], [0], [1]])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "alRRwxsUvIU6", + "outputId": "f37feaf6-e705-4afe-fae2-546423672321", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X" + ], 
+ "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0, 0],\n", + " [0, 1],\n", + " [1, 0],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VB5n2WNUvND3", + "outputId": "65b497ff-e838-4bad-961c-2c0d761c51ab", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Y" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + " [0],\n", + " [0],\n", + " [1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2jH1EMfdEYwN" + }, + "source": [ + "[**Python**] - Define the Learning Rate $\\alpha$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zd2k0S-BXEU_" + }, + "source": [ + "alpha = 0.1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yvGa7d8LEZD2" + }, + "source": [ + "[**Python**] - Implement the function that trains the Neural Network\n", + "> This function tries to find weights $W$ that lead to 100% accuracy." 
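+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a hedged reading of the code in the next cell (the symbols $S_{i}$, $E_{i}$ and the index $j$ are notation introduced here, not part of the original text): the weights are adjusted after every sample $i$ using that sample's own error, while $E_{T}$, the sum of the per-sample errors in a pass, is only used to decide when to stop:\n", + "\n", + "$$S_{i}= \\sum_{j}X_{ij}W_{j}, \\qquad E_{i}= y_{i}-f(S_{i}), \\qquad W_{j} \\leftarrow W_{j} + \\alpha*X_{ij}*E_{i}$$\n", + "\n", + "For the AND data with $W= [0, 0]$ and $\\alpha= 0.1$, only the sample $X= [1, 1]$ has a non-zero error in the first pass ($E_{3}= 1-0= 1$), which gives:\n", + "\n", + "\\begin{align}\n", + "W_{1}&= 0+ 0.1*1*1= 0.1 \\\\\n", + "W_{2}&= 0+ 0.1*1*1= 0.1\n", + "\\end{align}"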
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JVl0XwBuXEVC" + }, + "source": [ + "def Treinar_RNA(X, Y, W, alpha):\n", + "    ET = 1  # ET = total error (Erro Total) of the current pass\n", + "    N = 0   # number of passes over the data\n", + "    # Stop when a pass ends with zero total error or after 100 passes\n", + "    while (ET != 0) and (N < 100):\n", + "        ET = 0\n", + "        for i in range(len(Y)):\n", + "            S = X[i].dot(W)\n", + "            f = ativacao_StepFunction(S)\n", + "            E = Y[i] - f  # signed error of sample i\n", + "            ET += E       # signed errors are summed, so +1 and -1 can cancel out\n", + "            for j in range(len(W)):\n", + "                W[j] = W[j] + alpha*(X[i][j]*E)  # W is updated in place\n", + "                print(f'Peso Ajustado: {W[j]}')\n", + "        print(f'Erro Total: {ET}')\n", + "        N += 1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pdI7EHnFF4yo" + }, + "source": [ + "[**Python**] - Call the function Treinar_RNA:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gHM5tXEdXEVF", + "outputId": "425fe3b6-38e3-4152-d9fb-938290be5cf1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Treinar_RNA(X, Y, W, alpha)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 
0.30000000000000004\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Erro Total: [0]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TPKEML9cDD0E" + }, + "source": [ + "## Example 1.2: _Perceptron_ Neural Network for the Logical Operator OR\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rSQnOjDWC7Ta" + }, + "source": [ + "Consider the following table:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 1 |\n", + "\n", + "The table above represents the logical operator OR (https://en.wikipedia.org/wiki/Truth_table):\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | F | F | F |\n", + "| 1 | F | T | T |\n", + "| 2 | T | F | T |\n", + "| 3 | T | T | T |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kID13PxSGN6h" + }, + "source": [ + "[**Python**] - Define the weights $W$, the inputs $X$ and the output $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CmuuIX2PGN6l" + }, + "source": [ + "# Weights W:\n", + "W = np.array([0.0, 0.0])\n", + "\n", + "# Inputs X:\n", + "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n", + "\n", + "# Output Y:\n", + "Y = np.array([[0], [1], [1], [1]])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "UDzdS6FX2LOC", + "outputId": 
"f0d325ca-56ff-4a80-bfe6-391e18f69c0f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0, 0],\n", + " [0, 1],\n", + " [1, 0],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ar0dk1eQ2MOD", + "outputId": "3cfefde8-78d4-45c2-beb1-7410e61a6006", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Y" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + " [1],\n", + " [1],\n", + " [1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "agZX698KGeVK" + }, + "source": [ + "[**Python**] - Evocar a função Treinar_RNA:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3GF_W4u0GeVM", + "outputId": "2a58bc18-8d9f-435c-92b4-75d03270e8cc", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Treinar_RNA(X, Y, W, alpha)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Erro Total: [3]\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [3]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + 
"Peso Ajustado: 0.5\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.6\n", + "Peso Ajustado: 0.6\n", + "Peso Ajustado: 0.6\n", + "Peso Ajustado: 0.6\n", + "Peso Ajustado: 0.6\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.6\n", + "Peso Ajustado: 0.6\n", + "Peso Ajustado: 0.6\n", + "Peso Ajustado: 0.7\n", + "Peso Ajustado: 0.7\n", + "Peso Ajustado: 0.7\n", + "Peso Ajustado: 0.7\n", + "Peso Ajustado: 0.7\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.7\n", + "Peso Ajustado: 0.7\n", + "Peso Ajustado: 0.7\n", + "Peso Ajustado: 0.7999999999999999\n", + "Peso Ajustado: 0.7999999999999999\n", + "Peso Ajustado: 0.7999999999999999\n", + "Peso Ajustado: 0.7999999999999999\n", + "Peso Ajustado: 0.7999999999999999\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.7999999999999999\n", + "Peso Ajustado: 0.7999999999999999\n", + "Peso Ajustado: 0.7999999999999999\n", + "Peso Ajustado: 0.8999999999999999\n", + "Peso Ajustado: 0.8999999999999999\n", + "Peso Ajustado: 0.8999999999999999\n", + "Peso Ajustado: 0.8999999999999999\n", + "Peso Ajustado: 0.8999999999999999\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.8999999999999999\n", + "Peso Ajustado: 0.8999999999999999\n", + "Peso Ajustado: 0.8999999999999999\n", + "Peso Ajustado: 0.9999999999999999\n", + "Peso Ajustado: 0.9999999999999999\n", + "Peso Ajustado: 0.9999999999999999\n", + "Peso Ajustado: 0.9999999999999999\n", + "Peso Ajustado: 0.9999999999999999\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.9999999999999999\n", + "Peso Ajustado: 0.9999999999999999\n", + "Peso Ajustado: 0.9999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Erro Total: [2]\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso 
Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Peso Ajustado: 1.0999999999999999\n", + "Erro Total: [0]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u2dZAVVFEpCw" + }, + "source": [ + "## Example 1.3: _Perceptron_ Neural Network for the Logical Operator XOR\n", + "\n", + "The inability of a single-layer _Perceptron_ to represent XOR was analyzed by Minsky and Papert (1969); Rumelhart et al. (1985) later showed that networks with hidden layers trained with _Backpropagation_ can learn it. With the single-layer implementation used here, the weights end up alternating between 0.4 and 0.5 and the Total Error never reaches 0, so training stops only at the limit of 100 passes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EaZIyvvEEpC5" + }, + "source": [ + "Consider the following table:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |\n", + "\n", + "The table above represents the logical operator XOR (https://pt.wikipedia.org/wiki/Ou_exclusivo):\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | F | F | F |\n", + "| 1 | F | T | T |\n", + "| 2 | T | F | T |\n", + "| 3 | T | T | F |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7rc3hc2RGneF" + }, + "source": [ + "[**Python**] - Define the weights $W$, the inputs $X$ and the output $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "u8fAgk3RGneH" + }, + "source": [ + "# Weights W:\n", + "W = np.array([0.0, 0.0])\n", + "\n", + "# Inputs X:\n", + "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n", + "\n", + "# Output Y:\n", + "Y = np.array([[0], [1], [1], [0]])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tFKaIhua3Mr6", + "outputId": "7c4b2fbd-e002-44bb-eedd-ae5ac76ad9bc", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0, 0],\n", + " [0, 1],\n", + " 
[1, 0],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pm-X-dXX3NZW", + "outputId": "6add62ed-22a1-4fcd-f9b5-430b17f921de", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Y" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + " [1],\n", + " [1],\n", + " [0]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "znRL2XozGneM" + }, + "source": [ + "[**Python**] - Call the function Treinar_RNA:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "j8leYHZVGneM", + "outputId": "d5fd8c92-ffe6-4a63-9019-fdedd7bb5b96", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Treinar_RNA(X, Y, W, alpha)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.0\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.1\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.2\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.30000000000000004\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 
0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [2]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso 
Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + 
"Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", 
+ "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 
0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso 
Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + 
"Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", 
+ "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 
0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso 
Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + 
"Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 
0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.5\n", + "Peso Ajustado: 0.4\n", + "Peso Ajustado: 0.4\n", + "Erro Total: [1]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Eu5cVvxM60i" + }, + "source": [ + "## Why can we find weights $W$ for the AND and OR Logical Operators but not for XOR?\n", + "\n", + "* AND and OR operators: linearly separable;\n", + "* XOR operator: NOT linearly separable, that is, no single straight line in the $(X_{1}, X_{2})$ plane separates the points with output 1 from the points with output 0.\n", + "\n", + "[Lucas Araújo](https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182) says in his article [Solving XOR with a single Perceptron](https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182) that:\n", + "\n", + "\"Everyone who has ever studied about neural networks has probably already read that a single perceptron can’t represent the boolean XOR function. The book Artificial Intelligence: A Modern Approach, the leading textbook in AI, says: “[XOR] is not linearly separable so the perceptron cannot learn it”.\"\n", + "\n", + "The figures below illustrate the concepts of \"linearly separable\" and \"not linearly separable\"." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oUrFCMUjFtR1" + }, + "source": [ + "### Graphical representation of the AND Logical Operator\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | Actual value ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 0 |\n", + "| 2 | 1 | 0 | 0 |\n", + "| 3 | 1 | 1 | 1 |\n", + "\n", + "[Figure: graphical representation of the AND Logical Operator]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n9v07MdMF42e" + }, + "source": [ + "### Graphical representation of the OR Logical Operator\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | Actual value ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 1 |\n", + "\n", + "[Figure: graphical representation of the OR Logical Operator]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "56Qp1J6LGBe9" + }, + "source": [ + "### Graphical representation of the XOR Logical Operator\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | Actual value ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |\n", + "\n", + "[Figure: graphical representation of the XOR Logical Operator]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eaQm7zZJbNAc" + }, + "source": [ + "___\n", + "# **WHAT HAVE WE LEARNED SO FAR?**\n", + "\n", + "* Neural Networks adjust the weights $W$ to try to improve their accuracy. In other words, the Neural Network learns from the data through the iterative adjustment of the weights $W$;\n", + "* Training a Neural Network is a computationally intensive task, because the algorithm searches for the weights $W$ that give the best accuracy. For a large dataframe, the computational cost of learning can be high." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f_T35rXZOB4G" + }, + "source": [ + "___\n", + "# **MULTILAYER NEURAL NETWORKS**\n", + "\n", + "* At least 1 _Hidden Layer_. 
Observe the Neural Network below, which contains 20 neurons distributed as follows:\n", + "\n", + " * Number of neurons in the input layer (_Input Layer_): 4;\n", + " * 3 hidden layers (_Hidden Layers_) with 5 neurons each, totaling 15 neurons:\n", + " * Number of neurons in _Hidden Layer 1_: 5;\n", + " * Number of neurons in _Hidden Layer 2_: 5;\n", + " * Number of neurons in _Hidden Layer 3_: 5;\n", + " * Number of neurons in the output layer (_Output Layer_): 1;\n", + "* _Fully connected layers_: every neuron is connected to all neurons of the next layer." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1dXBXuh2-Tuo" + }, + "source": [ + "[Figure: fully connected Multilayer Neural Network with 4 input neurons, 3 _Hidden Layers_ of 5 neurons each and 1 output neuron]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BK4O_Y_l2vev" + }, + "source": [ + "## _Sigmoid_ Function" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M_nn8zELXEVf" + }, + "source": [ + "[Figure: the _Sigmoid_ function, $f(x)= \\frac{1}{1+e^{-x}}$]\n", + "\n", + "See [e (mathematical constant)](https://pt.wikipedia.org/wiki/E_(constante_matem%C3%A1tica)) to learn more about Euler's number $e$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kOWwWR7hOmir" + }, + "source": [ + "## Number of _Hidden Layers_\n", + "\n", + "Researchers point out that a single _Hidden Layer_ is enough for the vast majority of problems and that each _Hidden Layer_ usually has the same number of neurons. Experiments show that more _Hidden Layers_ mean a longer time to train the Neural Network. 
However, [Heaton Research](https://www.heatonresearch.com/2017/06/01/hidden-layers.html) shows that:\n", + "\n", + "![Determinining_number_Hidden_Layers](https://github.com/MathMachado/Materials/blob/master/Determinining_number_Hidden_Layers.png?raw=true)\n", + "\n", + "Source: [Heaton Research](https://www.heatonresearch.com/2017/06/01/hidden-layers.html).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u4_1JCbcPRrn" + }, + "source": [ + "## Number of neurons in the input layer (_Input Layer_): $N_{I}$\n", + "\n", + "$N_{I}$ = number of input columns (variables) in the dataframe, not counting the target column." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fk-lhwhffUZz" + }, + "source": [ + "## Number of neurons in the output layer (_Output Layer_): $N_{O}$\n", + "\n", + "* If the Neural Network performs regression, the number of neurons in the _Output Layer_ is 1, because the _output_ of a regression is a single value;\n", + "* If the Neural Network performs classification and we use a probabilistic activation function (such as _softmax_), the number of neurons in the _Output Layer_ equals the number of classes we want to predict. For example, in the problem of classifying species in the IRIS dataframe there are 3 species (versicolor, virginica and setosa), so with the _softmax_ activation function we will have 3 neurons in the _Output Layer_." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZsrrdLpSfYm9" + }, + "source": [ + "## Number of neurons in the hidden layer (_Hidden Layer_): $N_{H}$\n", + "\n", + "Choosing the number of neurons in the _Hidden Layer_ has largely been a matter of trial and error, but some experiments suggest that a suitable number of neurons in the _Hidden Layer_ can be obtained from the following expression:\n", + "\n", + "$$N_{H}= \\frac{N_{I}+N_{O}}{2}$$\n", + "\n", + "However, the discussion thread [How to choose the number of hidden layers and nodes in a feedforward neural network?](https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw) suggests the following expression instead:\n", + "\n", + "$$N_{H}= \\frac{N}{\\alpha(N_{I}+N_{O})}$$\n", + "\n", + "where $N$ is the number of instances (rows) in the dataframe and $\\alpha$ is a number between 2 and 10; some experiments with $\\alpha= 2$ produce good models without _overfitting_. To learn more about this expression and about $\\alpha$, I suggest reading the thread mentioned above." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Rj6WfilbShX3" + }, + "source": [ + "## Multilayer Neural Network for the XOR Logical Operator\n", + "\n", + "Dataframe representing the XOR Logical Operator:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | Actual value ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uURlcU78LwbH" + }, + "source": [ + "### Architecture of the Multilayer Neural Network we will build for the XOR Logical Operator\n", + "\n", + "The weights $W_{H}= \\begin{bmatrix} W_{H}^{(1, 1)} & W_{H}^{(1, 2)} & W_{H}^{(1, 3)} \\\\ W_{H}^{(2, 1)} & W_{H}^{(2, 2)} & W_{H}^{(2, 3)} \\end{bmatrix}$ and $W_{O}= \\begin{bmatrix} W_{O}^{(1)} \\\\ W_{O}^{(2)} \\\\ W_{O}^{(3)} \\end{bmatrix}$ will be generated randomly. 
Below is the architecture of the Neural Network with 1 _Hidden Layer_ containing 3 neurons:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6XKMdlZr-e9l" + }, + "source": [ + "[Figure: architecture of the Neural Network for XOR, with 2 inputs, 1 _Hidden Layer_ containing 3 neurons and 1 output]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AV2eUQDuLCUL" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uTWYP0V-LGHj" + }, + "source": [ + "import math\n", + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-DG86PgxLDQA" + }, + "source": [ + "[**Python**] - Set the number of decimal places shown when printing arrays:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jsvh5DOkXEVm" + }, + "source": [ + "np.set_printoptions(precision = 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NYIMcp8TLVuq" + }, + "source": [ + "[**Python**] - Define the inputs $X$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U6Mt6zTnXEVq", + "outputId": "67d81400-0c90-42c5-bb92-52421801aaba", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n", + "X" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0, 0],\n", + " [0, 1],\n", + " [1, 0],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tXLd1nZxLbXD" + }, + "source": [ + "[**Python**] - Define the outputs $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Oauq3veAXEVu", + "outputId": "faa3a8cc-7d87-4736-973e-15c36dbcda15", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Y = np.array([[0], [1], [1], [0]])\n", + "Y" + ], + "execution_count": null, + "outputs": [ + { + "output_type": 
"execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + " [1],\n", + " [1],\n", + " [0]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 4 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TC1y0tO1MAU9" + }, + "source": [ + "### Randomly generate the weights $W_{H}= \\begin{bmatrix} W_{H}^{(1, 1)} & W_{H}^{(1, 2)} & W_{H}^{(1, 3)} \\\\ W_{H}^{(2, 1)} & W_{H}^{(2, 2)} & W_{H}^{(2, 3)} \\end{bmatrix}$ and $W_{O}= \\begin{bmatrix} W_{O}^{(1)} \\\\ W_{O}^{(2)} \\\\ W_{O}^{(3)} \\end{bmatrix}$\n", + "\n", + "To make the results reproducible, we will use the following seeds to generate the weights $W_{H}$ and $W_{O}$:\n", + "\n", + "* _seed_= 20111974 to generate $W_{H}$;\n", + "* _seed_= 19741120 to generate $W_{O}$.\n", + "\n", + "With these seeds, we should get $W_{H}= \\begin{bmatrix} 0.531 & 0.570 & 0.543 \\\\ 0.655 & 0.857 & 0.602 \\end{bmatrix}$ and $W_{O}= \\begin{bmatrix} 0.240 \\\\ 0.318 \\\\ 0.142 \\end{bmatrix}$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_U3Id5XXG5tw" + }, + "source": [ + "[**Python**] - Set the seed used to generate $W_{H}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tVXiIpgIHId9" + }, + "source": [ + "np.random.seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XYj0NYofHKkk" + }, + "source": [ + "[**Python**] - Generate the weights $W_{H}$ (randomly):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o1eGsPNQXEVx", + "outputId": "12f09198-1e15-4b8b-c089-23172f243b10", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "W_H = np.array([np.random.random(3), np.random.random(3)])\n", + "W_H" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.531, 0.57 , 0.543],\n", + " [0.655, 0.857, 0.602]])" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cj6KJnP3Hbqf" + }, + "source": [ + "[**Python**] - Set the seed used to generate $W_{O}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AkVw-SWSHbqh" + }, + "source": [ + "np.random.seed(19741120)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r7ZjUT4oHbqk" + }, + "source": [ + "[**Python**] - Generate the weights $W_{O}$ (randomly):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ebs8p8mOXEV1", + "outputId": "1d2dd115-3cbb-48a7-ff99-274747e8a0ba", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "W_O = np.array([np.random.random(1), np.random.random(1), np.random.random(1)])\n", + "W_O" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.24 ],\n", + " [0.318],\n", + " [0.142]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vg1ByKjKsWcE" + }, + "source": [ + "Check the weights shown in the figure below:\n", + "\n", + "[Figure: the Neural Network annotated with the generated weights $W_{H}$ and $W_{O}$]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GiEc1DwPt7Hm" + }, + "source": [ + "### Compute $S = \\sum_{i=1}^{4}X_{i}W_{i}$ and pass the value of $S$ to the activation function $f(S)$ (_Sigmoid_)\n", + "\n", + "_Sigmoid_ function:\n", + "\n", + "$$f(x)= y= \\frac{1}{1+e^{-x}}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mCZsXjIhHqId" + }, + "source": [ + "[**Python**] - Define the _Sigmoid_ activation function:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kB4-UnOGXEV8" + }, + "source": [ + "def FuncaoAtivacao_Sigmoid(x):\n", + " y = 1/(1+np.exp(-x))\n", + " return y" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XkvMHw1KHrjT" + }, + "source": 
[ + "[**Python**] - `MostraCalculos` function, written to validate the manual calculations of $S$ and $f(S)$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fsxHrthYXEWA" + }, + "source": [ + "def MostraCalculos(i):\n", + "    # Forward pass for row i of X (uses the globals X, Y, W_H and W_O).\n", + "    print(f'Array W:\\n {W_H}')\n", + "    print('\\n')\n", + "    print(f'Array X:\\n {X[i]}')\n", + "    S = X[i].dot(W_H)               # weighted sums of the Hidden Layer\n", + "    f = FuncaoAtivacao_Sigmoid(S)   # activations of the Hidden Layer\n", + "    S2= f.dot(W_O)                  # weighted sum of the Output Layer\n", + "    f2= FuncaoAtivacao_Sigmoid(S2)  # network output\n", + "    \n", + "    print('\\n')\n", + "    print(f'*** HIDDEN LAYER ***')\n", + "    print(f'Função Soma S: {S}')\n", + "    print(f'Função de Ativação Sigmoid: {f}')\n", + "    \n", + "    print('\\n')\n", + "    print(f'*** OUTPUT LAYER ***')\n", + "    print(f'Função Soma S: {S2}')\n", + "    print(f'Função de Ativação Sigmoid: {f2}')\n", + "    \n", + "    print('\\n')\n", + "    print(f'*** ERRO ***')\n", + "    E= Y[i]-f2  # error = real value - computed value\n", + "    print(f'Erro da linha i= {i}: {E}')\n", + "    \n", + "    return f  # Hidden Layer activations, reused later to build X2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s80knPTzcIBy" + }, + "source": [ + "___\n", + "The method `A.dot(B)` computes the matrix product of the arrays A and B. To learn more about `dot()`, watch this [video](https://youtu.be/Pb1VIe9657s)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bw0p2m8mbz3C" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 0$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CelKhuoHISyS" + }, + "source": [ + "[**Python**] - Evocar a função f0= MostraCalculos(0):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ar0zOLuUIio1", + "outputId": "4391ee79-2442-4d2e-e625-246838ca7394", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f0 = MostraCalculos(0)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Array W:\n", + " [[0.531 0.57 0.543]\n", + " [0.655 0.857 0.602]]\n", + "\n", + "\n", + "Array X:\n", + " [0 0]\n", + "\n", + "\n", + "*** HIDDEN LAYER ***\n", + "Função Soma S: [0. 0. 0.]\n", + "Função de Ativação Sigmoid: [0.5 0.5 0.5]\n", + "\n", + "\n", + "*** OUTPUT LAYER ***\n", + "Função Soma S: [0.35]\n", + "Função de Ativação Sigmoid: [0.587]\n", + "\n", + "\n", + "*** ERRO ***\n", + "Erro da linha i= 0: [-0.587]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_R1LdY9QvTqb" + }, + "source": [ + "Observe na figura abaixo os cálculos manuais da Soma $S$, função de ativação $f(S)$ e Erro." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hl_RBLiaa4xS" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NOKtMLHoo_Yt" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "S_{H}^{(0, 1)} &= (0)(0.531)+(0)(0.655)= 0 \\Longrightarrow f_{H}^{(0, 1)}(S_{H}^{(0, 1)})= f_{H}^{(0, 1)}(0)= 0.5 \\\\\n", + "S_{H}^{(0, 2)} &= (0)(0.570)+(0)(0.857)= 0 \\Longrightarrow f_{H}^{(0, 2)}(S_{H}^{(0, 2)})= f_{H}^{(0, 2)}(0)= 0.5 \\\\\n", + "S_{H}^{(0, 3)} &= (0)(0.543)+(0)(0.602)= 0 \\Longrightarrow f_{H}^{(0, 3)}(S_{H}^{(0, 3)})= f_{H}^{(0, 3)}(0)= 0.5\n", + "\\end{align}\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Kw-cakYsQGp" + }, + "source": [ + "##### _OUTPUT LAYER_\n", + "\n", + "\\begin{align}\n", + "S_{O}^{(0)}&= (0.5)(0.24)+(0.5)(0.318)+(0.5)(0.142)= 0.35 \\\\\n", + "f_{O}^{(0)}(S_{O}^{(0)})&= f_{O}^{(0)}(0.35)= 0.587 \\\\\n", + "E_{0}&= 0-0.587= -0.587\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TFZ8w1dUdT7A" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 1$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wTz3EfAUIoz-" + }, + "source": [ + "[**Python**] - Evocar a função f1= MostraCalculos(1):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "INUDJ_aMXEWb", + "outputId": "a437faf7-fa53-4146-c2ca-5ac2c50124bc", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f1 = MostraCalculos(1)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Array W:\n", + " [[0.531 0.57 0.543]\n", + " [0.655 0.857 0.602]]\n", + "\n", + "\n", + "Array X:\n", + " [0 1]\n", + "\n", + "\n", + "*** HIDDEN LAYER ***\n", + "Função Soma S: [0.655 0.857 0.602]\n", + "Função de Ativação Sigmoid: [0.658 0.702 0.646]\n", + "\n", + "\n", + "*** OUTPUT LAYER ***\n", + "Função Soma S: [0.473]\n", + "Função de Ativação Sigmoid: [0.616]\n", 
+ "\n", + "\n", + "*** ERRO ***\n", + "Erro da linha i= 1: [0.384]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I91qgS1Uh2T1" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JDuyxsKSvDds" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "S_{H}^{(1, 1)} &= (0)(0.531)+(1)(0.655)= 0.655 \\Longrightarrow f_{H}^{(1, 1)}(S_{H}^{(1, 1)})= f_{H}^{(1, 1)}(0.655)= 0.658 \\\\\n", + "S_{H}^{(1, 2)} &= (0)(0.570)+(1)(0.857)= 0.857 \\Longrightarrow f_{H}^{(1, 2)}(S_{H}^{(1, 2)})= f_{H}^{(1, 2)}(0.857)= 0.702 \\\\\n", + "S_{H}^{(1, 3)} &= (0)(0.543)+(1)(0.602)= 0.602 \\Longrightarrow f_{H}^{(1, 3)}(S_{H}^{(1, 3)})= f_{H}^{(1, 3)}(0.602)= 0.646\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nKPsQA9dvDdt" + }, + "source": [ + "##### _OUTPUT LAYER_\n", + "\n", + "\\begin{align}\n", + "S_{O}^{(1)}&= (0.658)(0.24)+(0.702)(0.318)+(0.646)(0.142)= 0.473 \\\\\n", + "f_{O}^{(1)}(S_{O}^{(1)})&= f_{O}^{(1)}(0.473)= 0.616 \\\\\n", + "E_{1}&= 1-0.616= 0.384\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IBfztHLfeoTR" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 2$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sjcpG53tIvHf" + }, + "source": [ + "[**Python**] - Evocar a função f2= MostraCalculos(2):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RbnG_WxdXEWg", + "outputId": "b81e7106-e5fe-4518-8852-18253ef06354", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f2 = MostraCalculos(2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Array W:\n", + " [[0.531 0.57 0.543]\n", + " [0.655 0.857 0.602]]\n", + "\n", + "\n", + "Array X:\n", + " [1 0]\n", + "\n", + "\n", + "*** HIDDEN LAYER ***\n", + "Função Soma S: [0.531 0.57 0.543]\n", + "Função de Ativação Sigmoid: [0.63 0.639 
0.632]\n", + "\n", + "\n", + "*** OUTPUT LAYER ***\n", + "Função Soma S: [0.444]\n", + "Função de Ativação Sigmoid: [0.609]\n", + "\n", + "\n", + "*** ERRO ***\n", + "Erro da linha i= 2: [0.391]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9g9MegqIh-Vn" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5gES50aaxszM" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "S_{H}^{(2, 1)} &= (1)(0.531)+(0)(0.655)= 0.531 \\Longrightarrow f_{H}^{(2, 1)}(S_{H}^{(2, 1)})= f_{H}^{(2, 1)}(0.531)= 0.630 \\\\\n", + "S_{H}^{(2, 2)} &= (1)(0.570)+(0)(0.857)= 0.570 \\Longrightarrow f_{H}^{(2, 2)}(S_{H}^{(2, 2)})= f_{H}^{(2, 2)}(0.570)= 0.639 \\\\\n", + "S_{H}^{(2, 3)} &= (1)(0.543)+(0)(0.602)= 0.543 \\Longrightarrow f_{H}^{(2, 3)}(S_{H}^{(2, 3)})= f_{H}^{(2, 3)}(0.543)= 0.632\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o7n4Eq-6xszP" + }, + "source": [ + "##### _OUTPUT LAYER_\n", + "\n", + "\\begin{align}\n", + "S_{O}^{(2)}&= (0.630)(0.24)+(0.639)(0.318)+(0.632)(0.142)= 0.444 \\\\\n", + "f_{O}^{(2)}(S_{O}^{(2)})&= f_{O}^{(2)}(0.444)= 0.609 \\\\\n", + "E_{2}&= 1-0.609= 0.391\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cPJQKwBthCkh" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 3$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MVhEsrqJI1T7" + }, + "source": [ + "[**Python**] - Evocar a função f3= MostraCalculos(3):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qU87GWKjXEWo", + "outputId": "5df8df11-feaf-464f-af78-f0f7819f56dd", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f3 = MostraCalculos(3)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Array W:\n", + " [[0.531 0.57 0.543]\n", + " [0.655 0.857 0.602]]\n", + "\n", + "\n", + "Array X:\n", + " [1 1]\n", + 
"\n", + "\n", + "*** HIDDEN LAYER ***\n", + "Função Soma S: [1.186 1.427 1.144]\n", + "Função de Ativação Sigmoid: [0.766 0.806 0.758]\n", + "\n", + "\n", + "*** OUTPUT LAYER ***\n", + "Função Soma S: [0.548]\n", + "Função de Ativação Sigmoid: [0.634]\n", + "\n", + "\n", + "*** ERRO ***\n", + "Erro da linha i= 3: [-0.634]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AjUGJdaYiEH0" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lKptTBkBzysP" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "S_{H}^{(3, 1)} &= (1)(0.531)+(1)(0.655)= 1.186 \\Longrightarrow f_{H}^{(3, 1)}(S_{H}^{(3, 1)})= f_{H}^{(3, 1)}(1.186)= 0.766 \\\\\n", + "S_{H}^{(3, 2)} &= (1)(0.570)+(1)(0.857)= 1.427 \\Longrightarrow f_{H}^{(3, 2)}(S_{H}^{(3, 2)})= f_{H}^{(3, 2)}(1.427)= 0.806 \\\\\n", + "S_{H}^{(3, 3)} &= (1)(0.543)+(1)(0.602)= 1.144 \\Longrightarrow f_{H}^{(3, 3)}(S_{H}^{(3, 3)})= f_{H}^{(3, 3)}(1.144)= 0.758\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ISxS131GzysR" + }, + "source": [ + "##### _OUTPUT LAYER_\n", + "\n", + "\\begin{align}\n", + "S_{O}^{(3)}&= (0.766)(0.24)+(0.806)(0.318)+(0.758)(0.142)= 0.548 \\\\\n", + "f_{O}^{(3)}(S_{O}^{(3)})&= f_{O}^{(3)}(0.548)= 0.634 \\\\\n", + "E_{3}&= 0-0.634= -0.634\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YR3X25venRv5" + }, + "source": [ + "### Resumo dos cálculos com _arrays_\n", + "\n", + "Os cálculos que foram realizados previamente com o NumPy _step by step_ aqui são feitos utilizando produto matricial." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n9D-5dE-I_IS" + }, + "source": [ + "[**Python**] - Funções de ativação da _Hidden Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "efO2aSu8AzMp", + "outputId": "fbbfe562-fcc3-4535-a10b-c7a2fb03c269", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f0" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.5, 0.5, 0.5])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BoDRBC8oXEW0", + "outputId": "d9fc70ea-9f5b-4f66-8271-ad8cb375e5bc", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X2 = np.array([f0, f1, f2, f3])\n", + "X2" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.5 , 0.5 , 0.5 ],\n", + " [0.658, 0.702, 0.646],\n", + " [0.63 , 0.639, 0.632],\n", + " [0.766, 0.806, 0.758]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WVIwTcF1JLIm" + }, + "source": [ + "[**Python**] - Calcular a soma $S$ da _Output Layer_, dado pelo produto matricial de $X2$ por $W_{O}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ddyC0sa6XEW5", + "outputId": "984f6fc2-2fd9-4c70-c8d9-311f3b49ab9e", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "S = X2.dot(W_O)\n", + "S" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.35 ],\n", + " [0.473],\n", + " [0.444],\n", + " [0.548]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NRkyUZN7Jooz" + }, + "source": [ + "[**Python**] - Função de ativação da _Output 
Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jadac2Q3XEW-", + "outputId": "dfeb815e-230d-420c-a4f4-d1454bafbe2a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f = FuncaoAtivacao_Sigmoid(S)\n", + "f" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.587],\n", + " [0.616],\n", + " [0.609],\n", + " [0.634]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZuTe0mHg8Kzk" + }, + "source": [ + "Os resultados das funções de ativação acima conferem com o resumo a seguir:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r2lHqqhmM6rd" + }, + "source": [ + "[**Python**] - Calcular os Erros" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bCu8miA2XEXE", + "outputId": "01ac22ce-5da2-44a7-980e-d7333ce0847a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "E = Y - f\n", + "E" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.587],\n", + " [ 0.384],\n", + " [ 0.391],\n", + " [-0.634]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P2Q019cxotQM" + }, + "source": [ + "Os cálculos estão resumidos na tabela a seguir:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) | ValorCalculado ($\\hat{Y}_{i}$) | $Erro$ | $Erro^{2}$ |\n", + "|---|---|---|---|---|:----------------------:|------------------:|\n", + "| 0 | 0 | 0 | 0 | 0.587 | -0.587 | 0.344 |\n", + "| 1 | 0 | 1 | 1 | 0.616 | 0.384 | 0.147 |\n", + "| 2 | 1 | 0 | 1 | 0.609 | 0.391 | 0.152 |\n", + "| 3 | 1 | 1 | 0 | 0.634 | -0.634 | 0.401 |\n", + "\n", + "Onde:\n", + "\n", + "$Erro= y_{i}-\\hat{Y}_{i}$= ValorReal - ValorCalculado\n", + "\n", + "O 
cálculo do MSE será $MSE= \\frac{0.344+0.147+0.152+0.401}{4}= \\frac{1.044}{4}= 0.261$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lHm_16jEz-kL" + }, + "source": [ + "### Métrica para avaliação da performance da Rede Neural\n", + "\n", + "* O MSE é uma das principais métricas para medir a performance das Redes Neurais. A seguir, o cálculo do MSE:\n", + "\n", + "$$MSE= \\frac{\\sum_{i=1}^{n}(y_{i}-\\hat{Y}_{i})^{2}}{n}= \\frac{(0-0.587)^{2}+(1-0.616)^{2}+(1-0.609)^{2}+(0-0.634)^{2}}{4}= \\frac{0.344+0.147+0.152+0.401}{4}=0.261$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D2Yo6TdpNIPW" + }, + "source": [ + "[**Python**] - Desenvolver função MSE para calcular o MSE = Erro Quadrático Médio:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EENpe-rbXEXL" + }, + "source": [ + "def MSE(Y, f):\n", + " return np.square(Y - f).mean()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ySpVD0-mNQ1s" + }, + "source": [ + "[**Python**] - Evocar a função $MSE(Y, f)$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C0L5ACZnXEXP", + "outputId": "a289fe92-6adb-42b0-b8b9-22ad1b340d10", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "MSE(Y, f)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.2614527354351902" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 22 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xpzv12a48GhA" + }, + "source": [ + "### _Backpropagation_ - Ajuste dos pesos $W_{O}= \\begin{bmatrix} W_{O}^{(1)} \\\\ W_{O}^{(2)} \\\\ W_{O}^{(3)} \\end{bmatrix}$\n", + "\n", + "> _Backpropagation_ (ou simplesmente _Backward_) é o processo que faz com que a Rede Neural aprenda a partir da atualização iterativa dos pesos $W$. 
A ideia do _Backpropagation_ é que podemos melhorar a performance da Rede Neural através da calibração dos pesos $W$ usando _Gradient Descent_, de forma que os _outputs_ ($\\hat{y}_{i}$) serão cada vez mais próximos do valor real ($y_{i}$)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vjBeg2TTcd40" + }, + "source": [ + "As we saw earlier, the formula for updating the weights $W$, given by the expression below\n", + "\n", + "$$W_{n+1}= W_{n}*M-\\alpha \\frac{\\partial L}{\\partial W_{n}}= W_{n}*M+\\alpha*(X*\\Delta)$$\n", + "\n", + "requires the gradient of the _Loss Function_ $L$ with respect to the weights. Since $\\Delta= E*D$ with $E= y-\\hat{y}$, the minus sign of the gradient is absorbed into $\\Delta$ (constant factors are absorbed into $\\alpha$). By the chain rule, this gradient involves the derivative $D$ of the activation function, which here is the _Sigmoid_ function:\n", + "\n", + "$$y(x)= \\frac{1}{1+e^{-x}}$$\n", + "\n", + "The derivative of the _Sigmoid_ function, written in terms of its own output $y$, is:\n", + "\n", + "$$\\frac{dy}{dx}= y^{'}= y(1-y)$$\n", + "\n", + "If you have questions about the derivative of the _Sigmoid_ activation function, I suggest reading this article: [Derivative of the Sigmoid function](https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e)."
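, + "\n", + "\n", + "For a quick check without leaving the notebook, here is a short sketch of why the derivative takes this form. It is standard calculus written in the notation of this cell, not a step taken from the original material:\n", + "\n", + "\\begin{align}\n", + "\\frac{dy}{dx} &= \\frac{e^{-x}}{(1+e^{-x})^{2}}= \\frac{1}{1+e^{-x}}.\\frac{e^{-x}}{1+e^{-x}} \\\\\n", + "1-y &= \\frac{e^{-x}}{1+e^{-x}} \\Longrightarrow \\frac{dy}{dx}= y(1-y)\n", + "\\end{align}"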
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hDKJakFImuRp" + }, + "source": [ + "* $D_{O}$ is the derivative of the neuron of the _Output Layer_;\n", + "* $D_{H}$ is the derivative of the neurons of the _Hidden Layer_;\n", + "* $\\Delta_{O}= E_{i}*D_{O}$;\n", + "* $\\Delta_{H}= D_{H}* W_{O} * \\Delta_{O}$.\n", + "\n", + "Next, the derivative of the _Sigmoid_ function using NumPy:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kDnarWwwNZd0" + }, + "source": [ + "[**Python**] - `Derivada_Sigmoid` function, which computes the derivative of the _Sigmoid_ function:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qSxVsNeDXEXY" + }, + "source": [ + "def Derivada_Sigmoid(y):\n", + "    # y is the Sigmoid output f(x), not the pre-activation x\n", + "    return y*(1-y)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CY6O0qkWNhby" + }, + "source": [ + "[**Python**] - Call `Derivada_Sigmoid(f)`, that is, compute the derivative of the activation functions of the _Output Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WTpQfBTpXEXi", + "outputId": "db0a0bf9-2163-4ce8-df0f-8f7b6a23c195", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "D_O = Derivada_Sigmoid(f)\n", + "D_O" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.242],\n", + " [0.237],\n", + " [0.238],\n", + " [0.232]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 24 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jfVdgCFDf9-X" + }, + "source": [ + "The calculations above were done with NumPy and are reproduced by hand below:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TOslp3YSN70r" + }, + "source": [ + "[**Python**] - `Backpropagation` function, which computes:\n", + "* _Output Layer_:\n", + "    * $D_{O}$ - derivative of the values of the _Output Layer_;\n", + "    * $\\Delta_{O}$ - delta;\n", + "* _Hidden Layer_:\n", + "    * $D_{H}$ - derivative of the values of the _Hidden Layer_;\n", + "    * $\\Delta_{H}$ - delta" + ] + }, +
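{ + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a hand check of the _Output Layer_ quantities listed above, the sketch below works them out for $i = 0$. It uses the rounded values printed in this notebook, so the last digit may differ slightly from NumPy's unrounded result, and the superscript in $D_{O}^{(0)}$ is an assumed notation that mirrors $\\Delta_{O}^{(0)}$:\n", + "\n", + "\\begin{align}\n", + "D_{O}^{(0)} &= f_{O}^{(0)}(1-f_{O}^{(0)})= (0.587)(1-0.587)= 0.242 \\\\\n", + "\\Delta_{O}^{(0)} &= E_{0}.D_{O}^{(0)}= (-0.587)(0.242)= -0.142\n", + "\\end{align}" + ] + }, +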
{ + "cell_type": "code", + "metadata": { + "id": "4PihyM2VXEXq" + }, + "source": [ + "def Backpropagation(i):\n", + "    # Uses the globals f, E, X2 and W_O computed in the cells above.\n", + "    print(f'***** OUTPUT LAYER *****')\n", + "    print(f'*** Função de ativação ***')\n", + "    print(f[i])\n", + "\n", + "    print(f'*** Derivada ***')\n", + "    D_O= Derivada_Sigmoid(f)\n", + "    print(D_O[i])\n", + "\n", + "    print(f'*** Erros ***')\n", + "    print(E[i])\n", + "\n", + "    print(f'*** Delta ***')\n", + "    Delta_O= D_O*E\n", + "    print(Delta_O[i])\n", + "\n", + "    print('\\n')\n", + "    print(f'***** HIDDEN LAYER *****')\n", + "    print(f'*** Função de ativação ***')\n", + "    print(X2[i])\n", + "\n", + "    print(f'*** Derivada ***')\n", + "    D_H= Derivada_Sigmoid(X2)\n", + "    print(D_H[i])\n", + "\n", + "    print(f'*** Delta ***')\n", + "    # Broadcasting: (4, 3) * (1, 3) * (4, 1) -> (4, 3)\n", + "    Delta_H= D_H*W_O.T*Delta_O\n", + "    print(Delta_H[i])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eyGWGHVaFxNG" + }, + "source": [ + "#### $\\Longrightarrow$ For $i = 0$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uiOtWoNWOn24" + }, + "source": [ + "[**Python**] - Call the function `Backpropagation(0)`:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SiNkv_DBXEXu", + "outputId": "477f760d-e1c2-434c-9728-e5ab050f4699", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Backpropagation(0)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "***** OUTPUT LAYER *****\n", + "*** Função de ativação ***\n", + "[0.587]\n", + "*** Derivada ***\n", + "[0.242]\n", + "*** Erros ***\n", + "[-0.587]\n", + "*** Delta ***\n", + "[-0.142]\n", + "\n", + "\n", + "***** HIDDEN LAYER *****\n", + "*** Função de ativação ***\n", + "[0.5 0.5 0.5]\n", + "*** Derivada ***\n", + "[0.25 0.25 0.25]\n", + "*** Delta ***\n", + "[-0.009 -0.011 -0.005]\n" + ], + "name": 
"stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PqZ_CvGI0ySD" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yO4njWZb1V1w" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "\\Delta_{H}^{(0, 1)} &= D_{H}^{(0, 1)}.W_{O}^{(1)}.\\Delta_{O}^{(0)}= (0.25)(0.24)(-0.142)= -0.009 \\\\\n", + "\\Delta_{H}^{(0, 2)} &= D_{H}^{(0, 2)}.W_{O}^{(2)}.\\Delta_{O}^{(0)}= (0.25)(0.318)(-0.142)= -0.011 \\\\\n", + "\\Delta_{H}^{(0, 3)} &= D_{H}^{(0, 3)}.W_{O}^{(3)}.\\Delta_{O}^{(0)}= (0.25)(0.142)(-0.142)= -0.005\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SXpozezsYFCX" + }, + "source": [ + "Na figura acima, temos que $\\Delta_{H}^{(0)}= [\\Delta_{H}^{(0, 1)}, \\Delta_{H}^{(0, 2)}, \\Delta_{H}^{(0, 3)}]= [-0.009, -0.011, -0.005]$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xDZYiujcHGzK" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 1$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ROzKv5VtOuy5" + }, + "source": [ + "[**Python**] - Evocar a função Backpropagation(1):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "S6An6CyUXEX0", + "outputId": "eed1f6c1-7731-4368-c3c1-ea5a6a8524c8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Backpropagation(1)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "***** OUTPUT LAYER *****\n", + "*** Função de ativação ***\n", + "[0.616]\n", + "*** Derivada ***\n", + "[0.237]\n", + "*** Erros ***\n", + "[0.384]\n", + "*** Delta ***\n", + "[0.091]\n", + "\n", + "\n", + "***** HIDDEN LAYER *****\n", + "*** Função de ativação ***\n", + "[0.658 0.702 0.646]\n", + "*** Derivada ***\n", + "[0.225 0.209 0.229]\n", + "*** Delta ***\n", + "[0.005 0.006 0.003]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EwblrxI20ygW" + }, + 
"source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "18bsrlv_4B0Q" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "\\Delta_{H}^{(1, 1)} &= D_{H}^{(1, 1)}.W_{O}^{(1)}.\\Delta_{O}^{(1)}= (0.225)(0.24)(0.091)= 0.005 \\\\\n", + "\\Delta_{H}^{(1, 2)} &= D_{H}^{(1, 2)}.W_{O}^{(2)}.\\Delta_{O}^{(1)}= (0.209)(0.318)(0.091)= 0.006 \\\\\n", + "\\Delta_{H}^{(1, 3)} &= D_{H}^{(1, 3)}.W_{O}^{(3)}.\\Delta_{O}^{(1)}= (0.229)(0.142)(0.091)= 0.003\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cPfQUUUHYw4i" + }, + "source": [ + "Na figura acima, temos que $\\Delta_{H}^{(1)}= [\\Delta_{H}^{(1, 1)}, \\Delta_{H}^{(1, 2)}, \\Delta_{H}^{(1, 3)}]= [0.005, 0.006, 0.003]$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e8qfA8CGHJo8" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 2$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UWxqTLsKOyoK" + }, + "source": [ + "[**Python**] - Evocar a função Backpropagation(2):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "w39YvfOWXEX7", + "outputId": "bff1eb12-46f3-4713-8bfa-060bed4c88a5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Backpropagation(2)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "***** OUTPUT LAYER *****\n", + "*** Função de ativação ***\n", + "[0.609]\n", + "*** Derivada ***\n", + "[0.238]\n", + "*** Erros ***\n", + "[0.391]\n", + "*** Delta ***\n", + "[0.093]\n", + "\n", + "\n", + "***** HIDDEN LAYER *****\n", + "*** Função de ativação ***\n", + "[0.63 0.639 0.632]\n", + "*** Derivada ***\n", + "[0.233 0.231 0.232]\n", + "*** Delta ***\n", + "[0.005 0.007 0.003]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BBZuNcOC0yj9" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sZam1meY48hW" + }, + 
"source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "\\Delta_{H}^{(2, 1)} &= D_{H}^{(2, 1)}.W_{O}^{(1)}.\\Delta_{O}^{(2)}= (0.233)(0.24)(0.093)= 0.005 \\\\\n", + "\\Delta_{H}^{(2, 2)} &= D_{H}^{(2, 2)}.W_{O}^{(2)}.\\Delta_{O}^{(2)}= (0.231)(0.318)(0.093)= 0.007 \\\\\n", + "\\Delta_{H}^{(2, 3)} &= D_{H}^{(2, 3)}.W_{O}^{(3)}.\\Delta_{O}^{(2)}= (0.232)(0.142)(0.093)= 0.003\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dAc8YrceY5kY" + }, + "source": [ + "Na figura acima, temos que $\\Delta_{H}^{(2)}= [\\Delta_{H}^{(2, 1)}, \\Delta_{H}^{(2, 2)}, \\Delta_{H}^{(2, 3)}]= [0.005, 0.007, 0.003]$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PWrv-aRyHMPh" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 3$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MKqR3izrO15N" + }, + "source": [ + "[**Python**] - Evocar a função Backpropagation(3):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1APffWq2XEYA", + "outputId": "24359257-7932-468a-d2c6-c437bab1139a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Backpropagation(3)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "***** OUTPUT LAYER *****\n", + "*** Função de ativação ***\n", + "[0.634]\n", + "*** Derivada ***\n", + "[0.232]\n", + "*** Erros ***\n", + "[-0.634]\n", + "*** Delta ***\n", + "[-0.147]\n", + "\n", + "\n", + "***** HIDDEN LAYER *****\n", + "*** Função de ativação ***\n", + "[0.766 0.806 0.758]\n", + "*** Derivada ***\n", + "[0.179 0.156 0.183]\n", + "*** Delta ***\n", + "[-0.006 -0.007 -0.004]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PGAXyDhW0ynn" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bdgIs_zP5i1y" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "\\Delta_{H}^{(3, 1)} &= D_{H}^{(3, 
1)}.W_{O}^{(1)}.\\Delta_{O}^{(3)}= (0.179)(0.24)(-0.147)= -0.006 \\\\\n", + "\\Delta_{H}^{(3, 2)} &= D_{H}^{(3, 2)}.W_{O}^{(2)}.\\Delta_{O}^{(3)}= (0.156)(0.318)(-0.147)= -0.007 \\\\\n", + "\\Delta_{H}^{(3, 3)} &= D_{H}^{(3, 3)}.W_{O}^{(3)}.\\Delta_{O}^{(3)}= (0.183)(0.142)(-0.147)= -0.004\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Ie99-SqZA6z" + }, + "source": [ + "In the figure above, we have $\\Delta_{H}^{(3)}= [\\Delta_{H}^{(3, 1)}, \\Delta_{H}^{(3, 2)}, \\Delta_{H}^{(3, 3)}]= [-0.006, -0.007, -0.004]$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sndwYO-VbK1C" + }, + "source": [ + "Next, the same calculations using NumPy:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ycvvhnWIO5s9" + }, + "source": [ + "[**Python**] - $D_{O}$: derivative of the _Output Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zkdw8tUKw5vo", + "outputId": "6a0af2a4-d7bd-4fb2-c810-41e90093d43b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.587],\n", + " [0.616],\n", + " [0.609],\n", + " [0.634]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "poTTrvYEXEYE", + "outputId": "c258cfb9-2c0f-4ee3-ea0b-0cf199ac659d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "D_O = Derivada_Sigmoid(f)\n", + "D_O" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.242],\n", + " [0.237],\n", + " [0.238],\n", + " [0.232]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JkdyDN6BPNZT" + }, + "source": [ + "[**Python**] - Show the errors:" + ] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "AO9Qi9U0aWTx", + "outputId": "716ef77a-c70d-4a16-bb54-f83e49b1524b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "E" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.587],\n", + " [ 0.384],\n", + " [ 0.391],\n", + " [-0.634]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 32 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DPPqqxpIPRsT" + }, + "source": [ + "[**Python**] - $\\Delta_{O}$: delta of the _Output Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6fylksvtaT6h", + "outputId": "818bdad7-fa5e-4470-e02e-30743e86397f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Delta_O = D_O*E\n", + "Delta_O" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.142],\n", + " [ 0.091],\n", + " [ 0.093],\n", + " [-0.147]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E9zsXwcWPXn1" + }, + "source": [ + "[**Python**] - $D_{H}$: derivative of the _Hidden Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SCABYAGjaigm", + "outputId": "3ac18557-4a19-4b44-bb81-21ef383a1f85", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "D_H = Derivada_Sigmoid(X2)\n", + "D_H" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.25 , 0.25 , 0.25 ],\n", + " [0.225, 0.209, 0.229],\n", + " [0.233, 0.231, 0.232],\n", + " [0.179, 0.156, 0.183]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 34 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bLa9L88VPdQu" + }, + "source": [ + "[**Python**] - $\\Delta_{H}$: delta of the _Hidden Layer_:" + ] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "58r5kgNwa9xo", + "outputId": "56870605-a017-4c73-aca3-ff3ebacc8264", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Delta_H = D_H*W_O.T*Delta_O\n", + "Delta_H" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.009, -0.011, -0.005],\n", + " [ 0.005, 0.006, 0.003],\n", + " [ 0.005, 0.007, 0.003],\n", + " [-0.006, -0.007, -0.004]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "roh5SVtkQrJE" + }, + "source": [ + "### _Backpropagation_ - Atualizar os pesos da _Output Layer_ $W_{O}= \\begin{bmatrix} W_{O}^{(1)} \\\\ W_{O}^{(2)} \\\\ W_{O}^{(3)} \\end{bmatrix}$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CQ69tO1IPsBQ" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{O})= (X2*\\Delta_{O})$ para atualizar $W_{O}^{(1)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "K991veZeXEYL", + "outputId": "514d0b94-ce57-4bd2-860f-d797bf47e335", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X2.T.dot(Delta_O)[0]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-0.065])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hz-0fQAGd7Aw" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ovgjM8l6Np0e" + }, + "source": [ + "$$(0.5)\\Delta_{O}^{(0)}+(0.658)\\Delta_{O}^{(1)}+(0.630)\\Delta_{O}^{(2)}+(0.766)\\Delta_{O}^{(3)}$$\n", + "$$(0.5)(-0.142)+(0.658)(0.091)+(0.630)(0.093)+(0.766)(-0.147)= -0.065$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BFaNh6NEXEYO" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{O})= (X2*\\Delta_{O})$ para atualizar $W_{O}^{(2)}$:" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eomk5j12XEYT", + "outputId": "dbcc9962-07ef-46d3-a52d-e7cca3523aa4", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X2.T.dot(Delta_O)[1]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-0.067])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M-3gk0erRpSF" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hVCFLfWGPE7W" + }, + "source": [ + "$$(0.5)\\Delta_{O}^{(0)}+(0.702)\\Delta_{O}^{(1)}+(0.639)\\Delta_{O}^{(2)}+(0.806)\\Delta_{O}^{(3)}$$\n", + "$$(0.5)(-0.142)+(0.702)(0.091)+(0.639)(0.093)+(0.806)(-0.147)= -0.067$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MK92KMHYXEYV" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{O})= (X2*\\Delta_{O})$ para atualizar $W_{O}^{(3)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D05BW8CgXEYc", + "outputId": "4cb5fd90-a2fc-444f-db0b-104e6b08098e", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X2.T.dot(Delta_O)[2]" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-0.065])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K754V1CSRtii" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q51biJ5TPkKX" + }, + "source": [ + "$$(0.5)\\Delta_{O}^{(0)}+(0.646)\\Delta_{O}^{(1)}+(0.632)\\Delta_{O}^{(2)}+(0.758)\\Delta_{O}^{(3)}$$\n", + "$$(0.5)(-0.142)+(0.646)(0.091)+(0.632)(0.093)+(0.758)(-0.147)= -0.065$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SaEVJAXGV3Xd" + }, + "source": [ + "###### Implementação com NumPy\n", + "\n", + 
"* Fórmula para atualização dos pesos $W_{O}$:\n", + "\n", + "$$W_{n+1}= W_{n}*M+\\alpha \\frac{\\partial L}{\\partial W_{n}}= W_{n}*M+\\alpha*(X*\\Delta)$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a7dGpwzfRN2M" + }, + "source": [ + "[**Python**] - Calcular/atualizar os pesos $W_{O}$ através da expressão de $W_{n+1}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "er3DprzjXEYg", + "outputId": "1a75110a-8346-4769-d138-be521f4eabdf", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "M = 1\n", + "alpha = 0.1\n", + "\n", + "W_O_New = W_O*M+alpha*(X2.T.dot(Delta_O))\n", + "W_O_New" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.234],\n", + " [0.312],\n", + " [0.136]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 38 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0-2weyIriNqN" + }, + "source": [ + "Abaixo, os pesos atualizados de $W_{O}$ (antes e depois)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GLkZfXbmi9c6" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t4fHsSY3AlFi" + }, + "source": [ + "### _Backpropagation_ - Ajuste dos pesos $W_{H}= \\begin{bmatrix} W_{H}^{(1, 1)} & W_{H}^{(1, 2)} & W_{H}^{(1, 3)} \\\\ W_{H}^{(2, 1)} & W_{H}^{(2, 2)} & W_{H}^{(2, 3)} \\end{bmatrix}$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCED4NKj1_FX" + }, + "source": [ + "#### Ajuste dos pesos $W_{H}^{(1, 1)}, W_{H}^{(1, 2)}, W_{H}^{(1, 3)}$\n", + "\n", + "Para ajustar os pesos $W_{H}^{(1, 1)}, W_{H}^{(1, 2)}, W_{H}^{(1, 3)}$, precisamos dos valores de $\\Delta_{H}$, calculados anteriormente:\n", + "\n", + "* $\\Delta_{H}^{(0)}= [\\Delta_{H}^{(0, 1)}, \\Delta_{H}^{(0, 2)}, \\Delta_{H}^{(0, 3)}]= [-0.009, -0.011, -0.005]$;\n", + "* $\\Delta_{H}^{(1)}= [\\Delta_{H}^{(1, 1)}, \\Delta_{H}^{(1, 2)}, 
\\Delta_{H}^{(1, 3)}]= [0.005, 0.006, 0.003]$;\n", + "* $\\Delta_{H}^{(2)}= [\\Delta_{H}^{(2, 1)}, \\Delta_{H}^{(2, 2)}, \\Delta_{H}^{(2, 3)}]= [0.005, 0.007, 0.003]$;\n", + "* $\\Delta_{H}^{(3)}= [\\Delta_{H}^{(3, 1)}, \\Delta_{H}^{(3, 2)}, \\Delta_{H}^{(3, 3)}]= [-0.006, -0.007, -0.004]$.\n", + "\n", + "Veja abaixo no NumPy:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kNHXkTXLRmzu" + }, + "source": [ + "[**Python**] - Mostrar $\\Delta_{H}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oYxrEVC7XEYn", + "outputId": "484c87ce-47a2-4628-dbdd-d96a464e6521", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Delta_H" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.009, -0.011, -0.005],\n", + " [ 0.005, 0.006, 0.003],\n", + " [ 0.005, 0.007, 0.003],\n", + " [-0.006, -0.007, -0.004]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 42 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XihSawh-1iKI" + }, + "source": [ + "##### Resumo dos valores de $(X*\\Delta_{H})$ calculados manualmente:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DNXt5DAhBiVC" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ para atualizar $W_{H}^{(1, 1)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "06duXU28XEYy", + "outputId": "c4056c7a-7156-4d21-ae23-a6091cbac06d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X.T.dot(Delta_H)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.001, -0. 
, -0.001],\n", + " [-0.001, -0.001, -0.001]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 45 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DTCJ_5O7SeU9" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WeJOgJd5P5BS" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 1)}+(0)\\Delta_{H}^{(1, 1)}+(1)\\Delta_{H}^{(2, 1)}+(1)\\Delta_{H}^{(3, 1)}$$\n", + "$$(0)(-0.009)+(0)(0.005)+(1)(0.005)+(1)(-0.006)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8mbs0ZNTCRKL" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ para atualizar $W_{H}^{(1, 2)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qF1iFyRWXEY9", + "outputId": "b67a06cf-8a2c-427e-8fc1-01290e91ef60", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X.T.dot(Delta_H)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.001, -0. 
, -0.001],\n", + " [-0.001, -0.001, -0.001]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 46 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xWkm7eyLSm6I" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X9LQgX05Qj8M" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 2)}+(0)\\Delta_{H}^{(1, 2)}+(1)\\Delta_{H}^{(2, 2)}+(1)\\Delta_{H}^{(3, 2)}$$\n", + "$$(0)(-0.011)+(0)(0.006)+(1)(0.007)+(1)(-0.007)= 0$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oaVbGCATCd7B" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ para atualizar $W_{H}^{(1, 3)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4UNZFSC5XEZE", + "outputId": "95202ea9-f594-4b74-a964-8b2cbe2a5e62", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X.T.dot(Delta_H)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.001, -0. 
, -0.001],\n", + " [-0.001, -0.001, -0.001]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 47 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4JHAiH5GSqr0" + }, + "source": [ + "\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DDrurArKQ5I_" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 3)}+(0)\\Delta_{H}^{(1, 3)}+(1)\\Delta_{H}^{(2, 3)}+(1)\\Delta_{H}^{(3, 3)}$$\n", + "$$(0)(-0.005)+(0)(0.003)+(1)(0.003)+(1)(-0.004)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GWwWUfiXXlom" + }, + "source": [ + "#### Ajuste dos pesos $W_{H}^{(2, 1)}, W_{H}^{(2, 2)}$ e $W_{H}^{(2, 3)}$\n", + "\n", + "Para ajustar os pesos $W_{H}^{(2, 1)}, W_{H}^{(2, 2)}, W_{H}^{(2, 3)}$, precisamos dos valores de $\\Delta_{H}$, calculados anteriormente:\n", + "\n", + "* $\\Delta_{H}^{(0)}= [\\Delta_{H}^{(0, 1)}, \\Delta_{H}^{(0, 2)}, \\Delta_{H}^{(0, 3)}]= [-0.009, -0.011, -0.005]$;\n", + "* $\\Delta_{H}^{(1)}= [\\Delta_{H}^{(1, 1)}, \\Delta_{H}^{(1, 2)}, \\Delta_{H}^{(1, 3)}]= [0.005, 0.006, 0.003]$;\n", + "* $\\Delta_{H}^{(2)}= [\\Delta_{H}^{(2, 1)}, \\Delta_{H}^{(2, 2)}, \\Delta_{H}^{(2, 3)}]= [0.005, 0.007, 0.003]$;\n", + "* $\\Delta_{H}^{(3)}= [\\Delta_{H}^{(3, 1)}, \\Delta_{H}^{(3, 2)}, \\Delta_{H}^{(3, 3)}]= [-0.006, -0.007, -0.004]$." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DzeytNngSI08" + }, + "source": [ + "[**Python**] - Mostra $\\Delta_{H}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vfiS9bFqXEZH", + "outputId": "e659b581-4c46-4ded-833a-420901f08631", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Delta_H" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.009, -0.011, -0.005],\n", + " [ 0.005, 0.006, 0.003],\n", + " [ 0.005, 0.007, 0.003],\n", + " [-0.006, -0.007, -0.004]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 48 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dSDbe8o8k9yi" + }, + "source": [ + "##### Resumo de $(X*\\Delta_{H})$ calculados manualmente:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D6e2ZoMmDLFN" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ para atualizar $W_{H}^{(2, 1)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SH4yHqoYXEZP", + "outputId": "98c98e25-b3c1-4267-ef4f-b28936ae75fe", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X.T.dot(Delta_H)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.001, -0. 
, -0.001],\n", + " [-0.001, -0.001, -0.001]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 49 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zORHdsEiXwSw" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P11cTnsCRpwj" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 1)}+(1)\\Delta_{H}^{(1, 1)}+(0)\\Delta_{H}^{(2, 1)}+(1)\\Delta_{H}^{(3, 1)}$$\n", + "$$(0)(-0.009)+(1)(0.005)+(0)(0.005)+(1)(-0.006)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W_LMmSEVDXY7" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ para atualizar $W_{H}^{(2, 2)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YE4DH6P_XEZZ", + "outputId": "da667dd2-65a7-4aa1-c45b-d7cefc3627b1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X.T.dot(Delta_H)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.001, -0. 
, -0.001],\n", + " [-0.001, -0.001, -0.001]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 50 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gz7bhUuDX6Me" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OLrwPoE7SGYu" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 2)}+(1)\\Delta_{H}^{(1, 2)}+(0)\\Delta_{H}^{(2, 2)}+(1)\\Delta_{H}^{(3, 2)}$$\n", + "$$(0)(-0.011)+(1)(0.006)+(0)(0.007)+(1)(-0.007)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vzbUzC8FDhuo" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ para atualizar $W_{H}^{(2, 3)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7-epl7I3XEZf", + "outputId": "402d8680-d2ec-4d9a-e77a-65085978f1d3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X.T.dot(Delta_H)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[-0.001, -0. 
, -0.001],\n", + " [-0.001, -0.001, -0.001]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 51 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0gT8_uDQX-NT" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QLz57OEPSWjl" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 3)}+(1)\\Delta_{H}^{(1, 3)}+(0)\\Delta_{H}^{(2, 3)}+(1)\\Delta_{H}^{(3, 3)}$$\n", + "$$(0)(-0.005)+(1)(0.003)+(0)(0.003)+(1)(-0.004)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7G9gWkKWIIOL" + }, + "source": [ + "##### Implementação com NumPy\n", + "\n", + "Usando:\n", + "* M = 1;\n", + "* $\\alpha = 0.1$;\n", + "* Fórmula: $W_{n+1} = (W_{n}*M)+\\alpha*(X*\\Delta_{H})$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5C3NCWcuShqN" + }, + "source": [ + "[**Python**] - Calcular/atualizar os pesos $W_{H}$ usando a expressão $W_{n+1}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ys_y-R0BL7Iw", + "outputId": "dcb98f8d-038e-4957-9b72-4c796929a9af", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "M = 1\n", + "alpha = 0.1\n", + "\n", + "W_H_New = W_H*M+alpha*(X.T.dot(Delta_H))\n", + "W_H_New" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0.531, 0.57 , 0.542],\n", + " [0.655, 0.857, 0.602]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 52 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IvaIx_PKZEmd" + }, + "source": [ + "##### Novos Pesos $W_{H}$ e $W_{O}$ da Rede Neural (Antes x Depois)\n", + "\n", + "\"Drawing\"\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EuO2t22CffE8" + }, + "source": [ + "___\n", + "# **Como as Redes Neurais aprendem?**\n", + "\n", + "> Vimos até agora grande parte dos cálculos matemáticos que envolvem o treinamento das Redes Neurais, que envolvem a 
repetição dos processos _Forward_ e _Backward_:\n", + "\n", + "1. _**Forward**_: Consiste na multiplicação de matrizes entre os _arrays_ da _input layer_ e os pesos $W$ e, na sequência, na aplicação das funções de ativação.\n", + "\n", + "2. _**Backward**_: Consiste em atualizar os pesos $W_{O}$ e $W_{H}$ para minimizar a _Loss Function_ $L$ usando _Gradient Descent_.\n", + "\n", + "* Estes 2 processos foram vistos detalhadamente em aulas anteriores.\n", + " * Cálculos matemáticos passo a passo foram mostrados. Portanto, visite nossas aulas anteriores para aprender mais sobre os aspectos teóricos e matemáticos por trás das Redes Neurais.\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mpKyqSuDdbMr" + }, + "source": [ + "___\n", + "# **_GRADIENT DESCENT_**\n", + "\n", + "_Gradient Descent_ é um algoritmo iterativo utilizado para otimizar (neste caso, minimizar) a _Loss Function_ $L$. \n", + "\n", + "* Minimizar a _Loss Function_ $L$ significa encontrar os pesos $W_{H}$ e $W_{O}$ que fazem com que o MSE seja o menor possível, pois quanto menor o MSE, melhor a performance da Rede Neural. 
\n", + "\n", + "* Para atualizar os pesos $W$, vamos usar a expressão a seguir:\n", + "\n", + "$$W_{n+1}= W_{n}*M+\\alpha \\frac{\\partial L}{\\partial W_{n}}= W_{n}*M+\\alpha*(X*\\Delta)$$\n", + "\n", + "onde:\n", + "\n", + "* $L$ é a _Loss Function_ a ser minimizada;\n", + "* $W_{n}$ são os pesos atuais, que deverão ser atualizados para a próxima iteração;\n", + "* $\\alpha$ é a taxa de aprendizado (_Learning Rate_ em inglês) e diz respeito à velocidade de aprendizagem da Rede Neural.\n", + " * Quanto MENOR o valor de $\\alpha$ $\\Longrightarrow$ mais lenta e demorada será a convergência para o mínimo global;\n", + " * Quanto MAIOR o valor de $\\alpha$ $\\Longrightarrow$ mais rápida será a convergência para o mínimo, mas sem a garantia de convergência para o mínimo global.\n", + "* $M$ é o _Momentum_, que é o artifício para acelerar a otimização (ou minimização) da _Loss Function_ $L$.\n", + " * Valores altos $\\Longrightarrow$ Aumentam a velocidade da aprendizagem;\n", + " * Valores baixos $\\Longrightarrow$ Mais tempo para aprendizagem, mas com maiores chances de se encontrar a solução ótima, evitando os mínimos locais.\n", + "* $\\frac{\\partial L}{\\partial W_{n}}$ é a derivada da _Loss Function_ $L$ em relação ao peso $W_{n}$. Como dito anteriormente, é a contribuição do peso $W$ no Erro. Calcular $(X*\\Delta)$ é a parte mais complicada da fórmula e fizemos estes cálculos passo a passo em aulas anteriores." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tzuFSOV4eboI" + }, + "source": [ + "Observe a figura a seguir: O que o _Gradient Descent_ fará é encontrar o mínimo global da _loss function_, tentando ao máximo possível evitar os mínimos locais.\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z-UbiCxUgvHg" + }, + "source": [ + "A seguir, alguns artigos sobre _Gradient Descent_ caso você queira saber um pouco mais sobre o assunto:\n", + "\n", + "* [An introduction to Gradient Descent Algorithm](https://medium.com/@montjoile/an-introduction-to-gradient-descent-algorithm-34cf3cee752b) - Abrir este artigo para mostrar os efeitos da _Learning Rate_ e os tipos de _Gradient Descent_ disponíveis para Machine Learning;\n", + "* [Machine learning : Gradient Descent](https://medium.com/@arshren/gradient-descent-5a13f385d403);\n", + "* [The Math and Intuition Behind Gradient Descent](https://medium.com/datadriveninvestor/the-math-and-intuition-behind-gradient-descent-13c45f367a11) - Mostra a matemática por trás do _Gradient Descent_;\n", + "* [An Introduction to Gradient Descent](https://towardsdatascience.com/an-introduction-to-gradient-descent-c9cca5739307);\n", + "* [Gradient Descent From Scratch](https://towardsdatascience.com/gradient-descent-from-scratch-e8b75fa986cc);\n", + "* [Gradient Descent Explanation & Implementation](https://towardsdatascience.com/gradient-descent-explanation-implementation-c74005ff7dd1) - Cálculos step-by-step;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "junvtVY4eePi" + }, + "source": [ + "___\n", + "# **_LOSS FUNCTION_ $L$**\n", + "\n", + "> Como vimos anteriormente, nosso objetivo é minimizar a _Loss function_ através do _Gradient Descent_. Em outras palavras, esse processo de otimização busca, à cada iteração (_epoch_), atualizar os pesos $W$ para reduzir a _Loss Function_. 
As _Loss Functions_ mais comuns são:\n", + "\n", + "* **Regressão**: MSE ou MAE;\n", + "* **Classificação**: _cross-entropy_ (quando queremos as probabilidades de cada observação pertencer a uma determinada classe).\n", + " * **Classificação binária**: `tf.keras.losses.BinaryCrossentropy()`;\n", + " * **Classificação multi-classes**: `tf.keras.losses.CategoricalCrossentropy()`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6e4ULmJheePY" + }, + "source": [ + "___\n", + "# **MÉTRICAS PARA MEDIR A PERFORMANCE DAS REDES NEURAIS**\n", + "\n", + "* As métricas medem a qualidade/performance das Redes Neurais e as principais são:\n", + " * **Regressão**: Quanto mais próximo de 0 estiver MAE, MSE ou RMSE, melhor a performance da Rede Neural.\n", + " * MAE significa \"_Mean Absolute Error_\", que é a média dos valores absolutos das diferenças entre os valores reais $y_{i}$ e os valores previstos (ou calculados) $\\hat{Y}_{i}$.\n", + "\n", + " * MSE significa \"_Mean Square Error_\", que é a média dos quadrados das diferenças entre os valores reais $y_{i}$ e os valores previstos (ou calculados) $\\hat{Y}_{i}$.\n", + "\n", + " * RMSE significa \"_Root Mean Square Error_\", que é a raiz quadrada do MSE.\n", + " \n", + " * **Classificação**: Quanto maior a accuracy, melhor a performance da Rede Neural.\n", + " * Accuracy\n", + "\n", + "* Expressões Matemáticas:\n", + "\n", + "\\begin{align}\n", + "MSE &= \\frac{\\sum_{i=1}^{n}(y_{i}-\\hat{Y}_{i})^{2}}{n} \\\\\n", + "RMSE &= \\sqrt{MSE} \\\\\n", + "MAE &= \\frac{\\sum_{i=1}^{n}|y_{i}-\\hat{Y}_{i}|}{n}\n", + "\\end{align}\n", + "\n", + "Para os alunos que estão com dúvidas sobre qual métrica usar, sugiro a leitura do artigo [MAE and RMSE — Which Metric is Better?](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6n4QjH1WeeO8" + }, + "source": [ + "___\n", + "# **_DROPOUT_**\n", + "\n", + "> _Dropout_ significa ignorar aleatoriamente e temporariamente um percentual $p$ de neurônios durante a fase de treinamento. 
Ao \"ignorar\", quero dizer que tais neurônios não serão considerados durante os processos _forward_ e _backpropagation_.\n", + "\n", + "* _Dropout_ força que a Rede Neural aprenda a partir dos dados, mas usando neurônios diferentes, escolhidos aleatoriamente;\n", + "* Recomenda-se $p = 0.20$;\n", + "* Ao se usar _Dropout_, recomenda-se mais épocas para treinar a Rede Neural.\n", + "\n", + "* **Vantagens**:\n", + " * Evita _overfitting_ - Num \"_fully connected layer_\", neurônios desenvolvem dependência durante a fase de treinamento, levando ao _overfitting_. Com _dropout_ é possível reduzir um pouco esta dependência, reduzindo as chances de _overfitting_.\n", + "\n", + "![Dropout](https://github.com/MathMachado/Materials/blob/master/Dropout.png?raw=true)\n", + "\n", + "Fonte: [Dropout in (Deep) Machine learning](https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-_dropout_-in-deep-machine-learning-74334da4bfc5).\n", + "\n", + "TEMPLATE: keras.layers.Dropout(rate, noise_shape=None, seed=None)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9-9Y9562kNNU" + }, + "source": [ + "___\n", + "# **Rede Neural multicamada (1 _Hidden Layer_) para o Operador Lógico XOR usando _Tensorflow_/_Keras_**\n", + "\n", + "* **Observações**:\n", + " * Há vários artigos (no _medium_, por exemplo) a discutir e desenvolver Redes Neurais para o Operador Lógico XOR. 
Então porque eu decidi produzir esta aula usando o dataframe do Operador Lógico XOR?\n", + " * Para explicar didaticamente e passo a passo todos os aspectos matemáticos por trás das Redes Neurais usando um dataframe pequeno e, apesar disso, complexo, pois é um problema linearmente NÃO-separável e, sendo assim, requer uma Rede Neural mais complexa (com pelo menos 1 _Hidden Layer_) para melhorar a acurácia e reduzir a _loss_;\n", + " * Para explicar como é fácil desenvolver Redes Neurais usando Tensorflow/Keras;\n", + " * Para explicar didaticamente e passo a passo os processos _Forward_ e _Backward_ para treinar Redes Neurais;\n", + " * Versão do Tensorflow usada: 2.x;\n", + " * Estou a utilizar o Google Colab;\n", + " * Nesta aula, não se preocupe demasiadamente com a sintaxe dos comandos. Porque?\n", + " * Vamos repetir tudo detalhadamente nas próximas aulas. Portanto, você terá a oportunidade de aprender e praticar muito em breve;\n", + " * O objetivo desta aula é simplesmente fazer uma introdução às Redes Neurais, Tensorflow/Keras e mostrar os passos/processos que vamos seguir aqui e no futuro para desenvolver Redes Neurais. Quando você assistir as aulas subsequentes, tudo ficará mais claro.\n", + "* Todas as aulas do curso de Redes Neurais foram cuidadosamente planejadas e preparadas para trazer conteúdos relevantes para você aprender Redes Neurais no menor tempo possível. Portanto, nesta aula você vai encontrar várias linhas como a linha adiante:\n", + "\n", + "[**Python**] - Comando ou _code_ que deve ser executado.\n", + "\n", + "Estas linhas são uma espécie de _guide_ para não nos esquecermos de nenhum detalhe da aula. 
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rOQbXbCgZZRL" + }, + "source": [ + "A seguir, dataframe do Operador Lógico XOR:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$)|\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KusCpN1S4CtH" + }, + "source": [ + "Vamos obedecer os _steps_ a seguir para construir nossa Rede Neural:\n", + "\n", + "1. Carregar as bibliotecas do Python e Tensorflow;\n", + "2. Carregar os dados para treinar a Rede Neural;\n", + "3. Definir a arquitetura da Rede Neural com Tensorflow/Keras;\n", + "4. Compilar a Rede Neural;\n", + "5. Ajustar a Rede Neural;\n", + "6. Avaliar a performance da Rede Neural;\n", + "7. _Fine tuning_ da Rede Neural;\n", + "8. Fazer Predições com a Rede Neural;\n", + "9. Conclusões." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mq7pF8854cf6" + }, + "source": [ + "### 1. Carregar as bibliotecas do Python e Tensorflow" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Br4REluttJXH" + }, + "source": [ + "[**Python**] - Importar as bibliotecas necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "W-1jl_vnP7n3" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import tensorflow as tf\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UA_bHIYOrNwy" + }, + "source": [ + "[**Python**] - Verificar a versão do Tensorflow\n", + "> Assegurar que está a utilizar a versão 2.x." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ApSwaqVbQVGx", + "outputId": "1f2cda78-ed28-470b-e3fd-973379009795", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'2.3.0'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 54 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CH37RXFLtPpB" + }, + "source": [ + "[**Python**] - Definir o número de casas decimais exibidas pelo NumPy (3):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Pzdu5btatTom" + }, + "source": [ + "np.set_printoptions(precision = 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YbKhVkd4sm2" + }, + "source": [ + "### 2. 
Carregar os dados para treinar a Rede Neural\n", + "\n", + "Segue abaixo o dataframe do Operador Lógico XOR:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$)|\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3uojCCMaWgtI" + }, + "source": [ + "[**Python**] - Definir as entradas (_inputs_) $X$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nTbtpKKdQh9M", + "outputId": "0002c861-cb8d-4ebc-b98a-462c82466f8a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X_XOR = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n", + "X_XOR" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0, 0],\n", + " [0, 1],\n", + " [1, 0],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 56 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + 
"id": "7hc1pYW-WsCp" + }, + "source": [ + "[**Python**] - Definir os _Outputs_ $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gj4zl-JdQ0nR", + "outputId": "0f2d9383-bb95-4e6e-b8b8-2f14d4364d10", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_XOR = np.array([[0], [1], [1], [0]])\n", + "y_XOR" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + " [1],\n", + " [1],\n", + " [0]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 57 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2WF73gTMZtN-" + }, + "source": [ + "### 3. Conceito importante: _Fully connected layer_\n", + "\n", + "> A arquitetura da Rede Neural abaixo é dita _fully connected_, ou seja, cada neurônio da camada anterior se conecta com todos os neurônios da camada subsequente. Observe a figura a seguir:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7X5eys3mxsl2" + }, + "source": [ + "#### Arquitetura da Rede Neural\n", + "\n", + "> A seguir, a arquitetura da Rede Neural que vamos desenvolver neste exemplo:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QpmtGEFgWzhk" + }, + "source": [ + "[**Python**] - Definir a arquitetura, ou seja:\n", + "* $N_{I}$: Número de neurônios na camada de entrada (_Input Layer_);\n", + "* $N_{O}$: Número de neurônios na camada de saída (_Output Layer_);\n", + "* $N_{H}$: Número de neurônios na camada escondida (_Hidden Layer_);\n", + "* FA: Função de ativação:\n", + " * _Hidden Layer_: Há várias opções que podem ser usadas, mas vou tentar resolver este exemplo com a função de ativação _Sigmoid_, que foi a opção escolhida quando explicamos Redes Neurais passo a passo.\n", + " * _Output Layer_: Os valores de $y_{i}$ do dataframe são binários. 
Portanto, nossa opção para a _Output Layer_ é a função de ativação _Sigmoid_." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Id_P910LRRb4" + }, + "source": [ + "# Número de Neurônios na Input Layer:\n", + "N_I = 2 # Número de variáveis/colunas da matriz de preditoras\n", + "\n", + "# Número de neurônios na Output Layer:\n", + "N_O = 1\n", + "\n", + "# Número de neurônios na Hidden Layer:\n", + "N_H = 3\n", + "\n", + "# Função de Ativação da Hidden Layer:\n", + "FA_H = tf.keras.activations.sigmoid\n", + "\n", + "# Função de Ativação da Output Layer:\n", + "FA_O = tf.keras.activations.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n6s9RcjLXqQm" + }, + "source": [ + "[**Python**] - Definir as sementes para NumPy e Tensorflow:\n", + "> Por questões de reprodutibilidade dos resultados, use as sementes abaixo:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3DizOTqQR6-U" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vdcdcNncYB15" + }, + "source": [ + "[**Python**] - Definir a Rede Neural:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E8KJ0f70HEwN" + }, + "source": [ + "**Observação**:\n", + "\n", + "* A opção `kernel_constraint = tf.keras.constraints.UnitNorm()` será utilizada para reduzir _overfitting_, conforme sugere o artigo [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-LYbXfEZYNcC", + "outputId": "1a48709a-93ad-4b5c-f679-a74d48e21f24", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "\n", + "RN = Sequential() # RN is the name of the Neural Network\n", + "RN.add(Dense(units = N_H, \n", + " input_dim = N_I, \n", + " activation = FA_H, \n", + " kernel_constraint = tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units= N_O, activation = FA_O))\n", + "\n", + "# Summary of the Neural Network architecture:\n", + "print(RN.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Model: \"sequential_1\"\n", + "_________________________________________________________________\n", + "Layer (type) Output Shape Param # \n", + "=================================================================\n", + "dense_2 (Dense) (None, 3) 9 \n", + "_________________________________________________________________\n", + "dense_3 (Dense) (None, 1) 4 \n", + "=================================================================\n", + "Total params: 13\n", + "Trainable params: 13\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n", + "None\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OoA-_8A55jMW" + }, + "source": [ + "### 4. 
Compile the Neural Network\n", + "\n", + "> Adam is a gradient-based optimization algorithm with adaptive learning rates.\n", + "\n", + "To learn more about the 'adam' optimization algorithm, see the article [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ifkjrCT6Yki6" + }, + "source": [ + "[**Python**] - Command model.compile(optimizer, loss, metrics):\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OdIerBPAUGbY" + }, + "source": [ + "Algoritmo_Opt = tf.keras.optimizers.Adam() # Optimization algorithm\n", + "Loss_Function = tf.keras.losses.MeanSquaredError() # Loss used to compute the error\n", + "Metrics_Perf = [tf.keras.metrics.binary_accuracy]\n", + "\n", + "RN.compile(optimizer = Algoritmo_Opt, \n", + " loss= Loss_Function, \n", + " metrics = Metrics_Perf)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KVx2w28c5urj" + }, + "source": [ + "### 5. Fit/train the Neural Network\n", + "\n", + "* 1 _Epoch_ = 1 full pass of the Neural Network over the entire training dataframe. Here the 4 samples fit in a single batch, so each epoch consists of 1 _Forward_ pass and 1 _Backward_ pass."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZV3XwUJ8YvxE" + }, + "source": [ + "[**Python**] - Command model.fit(X_train, y_train, epochs):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "45inZ8X3U0Ew", + "outputId": "ad45add5-0027-4f4a-fde0-28bba8bbce54", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "hist = RN.fit(X_XOR, y_XOR, epochs = 100)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Epoch 1/100\n", + "1/1 [==============================] - 0s 1ms/step - loss: 0.3195 - binary_accuracy: 0.5000\n", + "Epoch 2/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3201 - binary_accuracy: 0.5000\n", + "Epoch 3/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3198 - binary_accuracy: 0.5000\n", + "Epoch 4/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3195 - binary_accuracy: 0.5000\n", + "Epoch 5/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3192 - binary_accuracy: 0.5000\n", + "Epoch 6/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3188 - binary_accuracy: 0.5000\n", + "Epoch 7/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.3185 - binary_accuracy: 0.5000\n", + "Epoch 8/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3182 - binary_accuracy: 0.5000\n", + "Epoch 9/100\n", + "1/1 [==============================] - 0s 5ms/step - loss: 0.3179 - binary_accuracy: 0.5000\n", + "Epoch 10/100\n", + "1/1 [==============================] - 0s 1ms/step - loss: 0.3176 - binary_accuracy: 0.5000\n", + "Epoch 11/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3173 - binary_accuracy: 0.5000\n", + "Epoch 12/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3170 - binary_accuracy: 0.5000\n", + "Epoch 13/100\n", + "1/1 [==============================] - 
0s 2ms/step - loss: 0.3166 - binary_accuracy: 0.5000\n", + "Epoch 14/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3163 - binary_accuracy: 0.5000\n", + "Epoch 15/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3160 - binary_accuracy: 0.5000\n", + "Epoch 16/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3157 - binary_accuracy: 0.5000\n", + "Epoch 17/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3154 - binary_accuracy: 0.5000\n", + "Epoch 18/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3151 - binary_accuracy: 0.5000\n", + "Epoch 19/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3148 - binary_accuracy: 0.5000\n", + "Epoch 20/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3145 - binary_accuracy: 0.5000\n", + "Epoch 21/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3142 - binary_accuracy: 0.5000\n", + "Epoch 22/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3138 - binary_accuracy: 0.5000\n", + "Epoch 23/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3135 - binary_accuracy: 0.5000\n", + "Epoch 24/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3132 - binary_accuracy: 0.5000\n", + "Epoch 25/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3129 - binary_accuracy: 0.5000\n", + "Epoch 26/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3126 - binary_accuracy: 0.5000\n", + "Epoch 27/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3123 - binary_accuracy: 0.5000\n", + "Epoch 28/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3120 - binary_accuracy: 0.5000\n", + "Epoch 29/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3117 - binary_accuracy: 0.5000\n", + "Epoch 30/100\n", + "1/1 
[==============================] - 0s 2ms/step - loss: 0.3114 - binary_accuracy: 0.5000\n", + "Epoch 31/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3111 - binary_accuracy: 0.5000\n", + "Epoch 32/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.3108 - binary_accuracy: 0.5000\n", + "Epoch 33/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3105 - binary_accuracy: 0.5000\n", + "Epoch 34/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3102 - binary_accuracy: 0.5000\n", + "Epoch 35/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.3099 - binary_accuracy: 0.5000\n", + "Epoch 36/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3096 - binary_accuracy: 0.5000\n", + "Epoch 37/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3093 - binary_accuracy: 0.5000\n", + "Epoch 38/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3090 - binary_accuracy: 0.5000\n", + "Epoch 39/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3087 - binary_accuracy: 0.5000\n", + "Epoch 40/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3084 - binary_accuracy: 0.5000\n", + "Epoch 41/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3081 - binary_accuracy: 0.5000\n", + "Epoch 42/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3079 - binary_accuracy: 0.5000\n", + "Epoch 43/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3076 - binary_accuracy: 0.5000\n", + "Epoch 44/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.3073 - binary_accuracy: 0.5000\n", + "Epoch 45/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3070 - binary_accuracy: 0.5000\n", + "Epoch 46/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3067 - binary_accuracy: 0.5000\n", + 
"Epoch 47/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3064 - binary_accuracy: 0.5000\n", + "Epoch 48/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3061 - binary_accuracy: 0.5000\n", + "Epoch 49/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.3058 - binary_accuracy: 0.5000\n", + "Epoch 50/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.3055 - binary_accuracy: 0.5000\n", + "Epoch 51/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3053 - binary_accuracy: 0.5000\n", + "Epoch 52/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3050 - binary_accuracy: 0.5000\n", + "Epoch 53/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3047 - binary_accuracy: 0.5000\n", + "Epoch 54/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3044 - binary_accuracy: 0.5000\n", + "Epoch 55/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3041 - binary_accuracy: 0.5000\n", + "Epoch 56/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3038 - binary_accuracy: 0.5000\n", + "Epoch 57/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3036 - binary_accuracy: 0.5000\n", + "Epoch 58/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3033 - binary_accuracy: 0.5000\n", + "Epoch 59/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3030 - binary_accuracy: 0.5000\n", + "Epoch 60/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3027 - binary_accuracy: 0.5000\n", + "Epoch 61/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3025 - binary_accuracy: 0.5000\n", + "Epoch 62/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3022 - binary_accuracy: 0.5000\n", + "Epoch 63/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3019 - 
binary_accuracy: 0.5000\n", + "Epoch 64/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3016 - binary_accuracy: 0.5000\n", + "Epoch 65/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3014 - binary_accuracy: 0.5000\n", + "Epoch 66/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.3011 - binary_accuracy: 0.5000\n", + "Epoch 67/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.3008 - binary_accuracy: 0.5000\n", + "Epoch 68/100\n", + "1/1 [==============================] - 0s 11ms/step - loss: 0.3006 - binary_accuracy: 0.5000\n", + "Epoch 69/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.3003 - binary_accuracy: 0.5000\n", + "Epoch 70/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.3000 - binary_accuracy: 0.5000\n", + "Epoch 71/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2997 - binary_accuracy: 0.5000\n", + "Epoch 72/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2995 - binary_accuracy: 0.5000\n", + "Epoch 73/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.2992 - binary_accuracy: 0.5000\n", + "Epoch 74/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2990 - binary_accuracy: 0.5000\n", + "Epoch 75/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.2987 - binary_accuracy: 0.5000\n", + "Epoch 76/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2984 - binary_accuracy: 0.5000\n", + "Epoch 77/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2982 - binary_accuracy: 0.5000\n", + "Epoch 78/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2979 - binary_accuracy: 0.5000\n", + "Epoch 79/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2976 - binary_accuracy: 0.5000\n", + "Epoch 80/100\n", + "1/1 [==============================] - 0s 
2ms/step - loss: 0.2974 - binary_accuracy: 0.5000\n", + "Epoch 81/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2971 - binary_accuracy: 0.5000\n", + "Epoch 82/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2969 - binary_accuracy: 0.5000\n", + "Epoch 83/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2966 - binary_accuracy: 0.5000\n", + "Epoch 84/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2964 - binary_accuracy: 0.5000\n", + "Epoch 85/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2961 - binary_accuracy: 0.5000\n", + "Epoch 86/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.2959 - binary_accuracy: 0.5000\n", + "Epoch 87/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2956 - binary_accuracy: 0.5000\n", + "Epoch 88/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2954 - binary_accuracy: 0.5000\n", + "Epoch 89/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2951 - binary_accuracy: 0.5000\n", + "Epoch 90/100\n", + "1/1 [==============================] - 0s 5ms/step - loss: 0.2949 - binary_accuracy: 0.5000\n", + "Epoch 91/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2946 - binary_accuracy: 0.5000\n", + "Epoch 92/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2944 - binary_accuracy: 0.5000\n", + "Epoch 93/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.2941 - binary_accuracy: 0.5000\n", + "Epoch 94/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.2939 - binary_accuracy: 0.5000\n", + "Epoch 95/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2936 - binary_accuracy: 0.5000\n", + "Epoch 96/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2934 - binary_accuracy: 0.5000\n", + "Epoch 97/100\n", + "1/1 
[==============================] - 0s 2ms/step - loss: 0.2931 - binary_accuracy: 0.5000\n", + "Epoch 98/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2929 - binary_accuracy: 0.5000\n", + "Epoch 99/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.2927 - binary_accuracy: 0.5000\n", + "Epoch 100/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.2924 - binary_accuracy: 0.5000\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i1bUiekR5q1E" + }, + "source": [ + "### 6. Evaluate the performance of the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wBp4ctbKY8k7" + }, + "source": [ + "[**Python**] - Command model.evaluate(X_test, y_test):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M4HlrjjjVLjB", + "outputId": "a4efc370-b2fb-4b63-88fc-7671b7005fbe", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "RN.evaluate(X_XOR, y_XOR)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "1/1 [==============================] - 0s 1ms/step - loss: 0.2922 - binary_accuracy: 0.5000\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[0.2921895980834961, 0.5]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 64 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iPwANO05VT5m" + }, + "source": [ + "**Result**: The _baseline_ (initial) model gives the following results:\n", + "* loss= 0.2922;\n", + "* accuracy= 50%.\n", + "\n", + "* **Comment**: The Neural Network's results are unsatisfactory: with 4 balanced XOR samples, 50% accuracy is no better than random guessing." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lD2pw9H754ZZ" + }, + "source": [ + "### 7. 
_Fine tuning_ of the Neural Network\n", + "\n", + "Before discussing _fine tuning_, let us revisit CRISP-DM:\n", + "\n", + "CRISP-DM stands for _Cross Industry Standard Process for Data Mining_: a set of processes, or phases, for developing _Data Mining_ projects, widely used by Data Scientists to build predictive models.\n", + "\n", + "[Figure: CRISP-DM process diagram]\n", + "\n", + "Source: [The steps to a successful machine learning project](https://emba.epfl.ch/2018/04/10/steps-successful-machine-learning-project/)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1ssJuKF3FNA3" + }, + "source": [ + "* CRISP-DM:\n", + " 1. _Business Understanding_\n", + " * Focuses on understanding the project objectives and requirements from a business perspective, and then converting that knowledge into a data mining problem definition and a preliminary plan.\n", + "\n", + " 2. _Data Understanding_\n", + " * Covers extracting samples to become familiar with the data, identify quality problems, gain first insights, or detect interesting subsets from which to form hypotheses about hidden information.\n", + "\n", + " 3. _Data Preparation_\n", + "\n", + " * Covers all activities needed to build the final dataset, which will be split into training and validation samples for the predictive model.\n", + "\n", + " 4. _Modeling_\n", + "\n", + " * In this phase, the candidate techniques that could be applied are evaluated.\n", + "\n", + " 5. _Evaluation_\n", + "\n", + " * After building the _baseline_ (initial) model, and with the _Loss Function_ defined in advance, the performance of the predictive models (Neural Networks, in our case) is evaluated or tested to ensure that the model generalizes. 
Of all the models tested in this phase, we must select the champion model.\n", + "\n", + " 6. _Deployment_\n", + "\n", + " * Means implementing the model code in a production system to score or categorize new data as it arrives, and creating a mechanism for using this new information to solve the original business problem. Importantly, the code must also include all the data preparation steps that preceded modeling, so that the model treats new raw data the same way as during model development." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VrgiPmD3jw_o" + }, + "source": [ + "#### Strategies to improve the accuracy of the Neural Network\n", + "\n", + "Our alternatives are:\n", + "\n", + "* a. Increase the number of neurons in the _Hidden Layer_;\n", + "* b. Increase the number of _Hidden Layers_;\n", + "* c. Increase both the number of _Hidden Layers_ and the number of neurons;\n", + "* d. Change the activation function;\n", + "* e. Increase the number of _epochs_;\n", + "* f. Change the optimization algorithm (_optimizer_).\n", + "\n", + "In this example, after several attempts, I succeeded by changing the following parameters:\n", + "* Activation function: changed to tf.keras.activations.relu;\n", + "* Number of neurons in the hidden layer (_Hidden Layer_): increased to 64;\n", + "* Adam learning rate: set to 0.01 when compiling the Neural Network (section 7.4)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V81AQ9t8IA9D" + }, + "source": [ + "#### 7.3. 
Define the Neural Network architecture with TensorFlow/Keras" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yVhR55OoaWeX" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in the hidden layer (_Hidden Layer_);\n", + "* FA: Activation function:\n", + " * _Hidden Layer_: tf.keras.activations.relu;\n", + " * _Output Layer_: The values of $y_{i}$ in the dataframe are binary. Therefore, we use the _Sigmoid_ activation function for the _Output Layer_." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "R26Rf7x_aWeZ" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I = 2 # NOT CHANGED!\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O = 1 # NOT CHANGED!\n", + "\n", + "# CHANGED VARIABLES:\n", + "# Number of neurons in the Hidden Layer:\n", + "N_H = 64 # CHANGED (was 3)!\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H = tf.keras.activations.relu # CHANGED (was sigmoid)!\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O = tf.keras.activations.sigmoid # NOT CHANGED!"
+ ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LtQXjYnvIdJR" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and TensorFlow:\n", + "> For reproducibility of the results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* TensorFlow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WSCCZ6BcIdJZ" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MwjdTXWNawSz" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OWpJNRQjIRA4" + }, + "source": [ + "**Notes**:\n", + "\n", + "To avoid problems related to _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for constraining and initializing the weights $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu', together with kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "khod_vL5awS2", + "outputId": "6a9845e8-bc9a-4274-d7c7-cf72da60efd1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN = Sequential()\n", + "RN.add(Dense(units = N_H, input_dim = N_I, activation = FA_H, kernel_constraint = tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units = N_O, activation = FA_O))\n", + "\n", + "print(RN.summary())" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Model: \"sequential_6\"\n", + "_________________________________________________________________\n", + "Layer (type) Output Shape Param # \n", + "=================================================================\n", + "dense_12 (Dense) (None, 64) 192 \n", + "_________________________________________________________________\n", + "dense_13 (Dense) (None, 1) 65 \n", + "=================================================================\n", + "Total params: 257\n", + "Trainable params: 257\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n", + "None\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8V5EygkgIRA8" + }, + "source": [ + "#### 7.4. 
Compile the Neural Network\n", + "\n", + "> Adam is a gradient-based optimization algorithm with adaptive learning rates.\n", + "\n", + "To learn more about 'adam', see the article [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aEgVJAInbFW0" + }, + "source": [ + "[**Python**] - Command model.compile(optimizer, loss, metrics):\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FRhAlexLW29o" + }, + "source": [ + "# Baseline: tf.keras.optimizers.Adam() with the default learning_rate=0.001; here it is raised to 0.01\n", + "Algoritmo_Opt = tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,\n", + " name='Adam')\n", + "\n", + "Loss_Function = tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = [tf.keras.metrics.binary_accuracy]\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= Metrics_Perf)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jxpOHzSKIRA-" + }, + "source": [ + "#### 7.5. Fit the Neural Network\n", + "\n", + "1 _Epoch_ = 1 full pass of the Neural Network over the entire training dataframe. Here the 4 samples fit in a single batch, so each epoch consists of 1 _Forward_ pass and 1 _Backward_ pass."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o-1iqXLabR4O" + }, + "source": [ + "[**Python**] - Command model.fit(X_train, y_train, epochs):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vqyeqpq5XGAm", + "outputId": "61a23fa8-3b38-44ec-d01e-b5b936b50903", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "RN.fit(X_XOR, y_XOR, epochs = 100)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Epoch 1/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0820 - binary_accuracy: 1.0000\n", + "Epoch 2/100\n", + "1/1 [==============================] - 0s 1ms/step - loss: 0.0798 - binary_accuracy: 1.0000\n", + "Epoch 3/100\n", + "1/1 [==============================] - 0s 1ms/step - loss: 0.0735 - binary_accuracy: 1.0000\n", + "Epoch 4/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0681 - binary_accuracy: 1.0000\n", + "Epoch 5/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0643 - binary_accuracy: 1.0000\n", + "Epoch 6/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0605 - binary_accuracy: 1.0000\n", + "Epoch 7/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0571 - binary_accuracy: 1.0000\n", + "Epoch 8/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0537 - binary_accuracy: 1.0000\n", + "Epoch 9/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0503 - binary_accuracy: 1.0000\n", + "Epoch 10/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0471 - binary_accuracy: 1.0000\n", + "Epoch 11/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0440 - binary_accuracy: 1.0000\n", + "Epoch 12/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0411 - binary_accuracy: 1.0000\n", + "Epoch 13/100\n", + "1/1 [==============================] - 0s 
2ms/step - loss: 0.0384 - binary_accuracy: 1.0000\n", + "Epoch 14/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0359 - binary_accuracy: 1.0000\n", + "Epoch 15/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0336 - binary_accuracy: 1.0000\n", + "Epoch 16/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0313 - binary_accuracy: 1.0000\n", + "Epoch 17/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0292 - binary_accuracy: 1.0000\n", + "Epoch 18/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0272 - binary_accuracy: 1.0000\n", + "Epoch 19/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0253 - binary_accuracy: 1.0000\n", + "Epoch 20/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0235 - binary_accuracy: 1.0000\n", + "Epoch 21/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0217 - binary_accuracy: 1.0000\n", + "Epoch 22/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0201 - binary_accuracy: 1.0000\n", + "Epoch 23/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0187 - binary_accuracy: 1.0000\n", + "Epoch 24/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0173 - binary_accuracy: 1.0000\n", + "Epoch 25/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0159 - binary_accuracy: 1.0000\n", + "Epoch 26/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0147 - binary_accuracy: 1.0000\n", + "Epoch 27/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0136 - binary_accuracy: 1.0000\n", + "Epoch 28/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0126 - binary_accuracy: 1.0000\n", + "Epoch 29/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0116 - binary_accuracy: 1.0000\n", + "Epoch 30/100\n", + "1/1 
[==============================] - 0s 3ms/step - loss: 0.0107 - binary_accuracy: 1.0000\n", + "Epoch 31/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0099 - binary_accuracy: 1.0000\n", + "Epoch 32/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0091 - binary_accuracy: 1.0000\n", + "Epoch 33/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0084 - binary_accuracy: 1.0000\n", + "Epoch 34/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0078 - binary_accuracy: 1.0000\n", + "Epoch 35/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0072 - binary_accuracy: 1.0000\n", + "Epoch 36/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0067 - binary_accuracy: 1.0000\n", + "Epoch 37/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0063 - binary_accuracy: 1.0000\n", + "Epoch 38/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0058 - binary_accuracy: 1.0000\n", + "Epoch 39/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0054 - binary_accuracy: 1.0000\n", + "Epoch 40/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0051 - binary_accuracy: 1.0000\n", + "Epoch 41/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0048 - binary_accuracy: 1.0000\n", + "Epoch 42/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0045 - binary_accuracy: 1.0000\n", + "Epoch 43/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0042 - binary_accuracy: 1.0000\n", + "Epoch 44/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0040 - binary_accuracy: 1.0000\n", + "Epoch 45/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0038 - binary_accuracy: 1.0000\n", + "Epoch 46/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0036 - binary_accuracy: 1.0000\n", + 
"Epoch 47/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0034 - binary_accuracy: 1.0000\n", + "Epoch 48/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0032 - binary_accuracy: 1.0000\n", + "Epoch 49/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0030 - binary_accuracy: 1.0000\n", + "Epoch 50/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0029 - binary_accuracy: 1.0000\n", + "Epoch 51/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0028 - binary_accuracy: 1.0000\n", + "Epoch 52/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0026 - binary_accuracy: 1.0000\n", + "Epoch 53/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0025 - binary_accuracy: 1.0000\n", + "Epoch 54/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0024 - binary_accuracy: 1.0000\n", + "Epoch 55/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0023 - binary_accuracy: 1.0000\n", + "Epoch 56/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0022 - binary_accuracy: 1.0000\n", + "Epoch 57/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0022 - binary_accuracy: 1.0000\n", + "Epoch 58/100\n", + "1/1 [==============================] - 0s 5ms/step - loss: 0.0021 - binary_accuracy: 1.0000\n", + "Epoch 59/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0020 - binary_accuracy: 1.0000\n", + "Epoch 60/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0019 - binary_accuracy: 1.0000\n", + "Epoch 61/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0019 - binary_accuracy: 1.0000\n", + "Epoch 62/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0018 - binary_accuracy: 1.0000\n", + "Epoch 63/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0017 - 
binary_accuracy: 1.0000\n", + "Epoch 64/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0017 - binary_accuracy: 1.0000\n", + "Epoch 65/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0016 - binary_accuracy: 1.0000\n", + "Epoch 66/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0016 - binary_accuracy: 1.0000\n", + "Epoch 67/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0016 - binary_accuracy: 1.0000\n", + "Epoch 68/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0015 - binary_accuracy: 1.0000\n", + "Epoch 69/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0015 - binary_accuracy: 1.0000\n", + "Epoch 70/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0014 - binary_accuracy: 1.0000\n", + "Epoch 71/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0014 - binary_accuracy: 1.0000\n", + "Epoch 72/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0014 - binary_accuracy: 1.0000\n", + "Epoch 73/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0013 - binary_accuracy: 1.0000\n", + "Epoch 74/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0013 - binary_accuracy: 1.0000\n", + "Epoch 75/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0013 - binary_accuracy: 1.0000\n", + "Epoch 76/100\n", + "1/1 [==============================] - 0s 4ms/step - loss: 0.0012 - binary_accuracy: 1.0000\n", + "Epoch 77/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0012 - binary_accuracy: 1.0000\n", + "Epoch 78/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0012 - binary_accuracy: 1.0000\n", + "Epoch 79/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0012 - binary_accuracy: 1.0000\n", + "Epoch 80/100\n", + "1/1 [==============================] - 0s 
3ms/step - loss: 0.0011 - binary_accuracy: 1.0000\n", + "Epoch 81/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0011 - binary_accuracy: 1.0000\n", + "Epoch 82/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0011 - binary_accuracy: 1.0000\n", + "Epoch 83/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 0.0011 - binary_accuracy: 1.0000\n", + "Epoch 84/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0010 - binary_accuracy: 1.0000\n", + "Epoch 85/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 0.0010 - binary_accuracy: 1.0000\n", + "Epoch 86/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 9.9890e-04 - binary_accuracy: 1.0000\n", + "Epoch 87/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 9.8004e-04 - binary_accuracy: 1.0000\n", + "Epoch 88/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 9.6182e-04 - binary_accuracy: 1.0000\n", + "Epoch 89/100\n", + "1/1 [==============================] - 0s 1ms/step - loss: 9.4389e-04 - binary_accuracy: 1.0000\n", + "Epoch 90/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 9.2633e-04 - binary_accuracy: 1.0000\n", + "Epoch 91/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 9.0949e-04 - binary_accuracy: 1.0000\n", + "Epoch 92/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 8.9346e-04 - binary_accuracy: 1.0000\n", + "Epoch 93/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 8.7824e-04 - binary_accuracy: 1.0000\n", + "Epoch 94/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 8.6300e-04 - binary_accuracy: 1.0000\n", + "Epoch 95/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 8.4822e-04 - binary_accuracy: 1.0000\n", + "Epoch 96/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 8.3342e-04 - binary_accuracy: 
1.0000\n", + "Epoch 97/100\n", + "1/1 [==============================] - 0s 3ms/step - loss: 8.1917e-04 - binary_accuracy: 1.0000\n", + "Epoch 98/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 8.0674e-04 - binary_accuracy: 1.0000\n", + "Epoch 99/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 7.9473e-04 - binary_accuracy: 1.0000\n", + "Epoch 100/100\n", + "1/1 [==============================] - 0s 2ms/step - loss: 7.8255e-04 - binary_accuracy: 1.0000\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 101 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C25ZV-x4IRBB" + }, + "source": [ + "#### 7.6. Evaluate the performance of the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tCd2S65ubg_M" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I8-Vr9lXXav4", + "outputId": "41470be9-af29-4751-a957-035de23b7eca", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "RN.evaluate(X_XOR, y_XOR)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "1/1 [==============================] - 0s 1ms/step - loss: 0.1502 - binary_accuracy: 1.0000\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[0.1501958817243576, 1.0]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 70 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6IqEhL-2Xj-t" + }, + "source": [ + "**Result**: After _fine tuning_, the model yields the following results:\n", + "* loss= 0.1502;\n", + "* accuracy= 100%.\n", + "\n", + "* **Comment**: The Neural Network yields satisfactory results." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AZjDavkO58Pu" + }, + "source": [ + "### 8. Make Predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HV4HkNDcbmJ2" + }, + "source": [ + "[**Python**] - Command RN.predict_classes(X_treinamento):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aum69OJENO6V", + "outputId": "cc3c4c3d-6e7e-4c16-baf9-80ed53ba8485", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_pred = RN.predict_classes(X_XOR)\n", + "y_pred" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "WARNING:tensorflow:From :1: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.\n", + "Instructions for updating:\n", + "Please use instead:* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype(\"int32\")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + "       [1],\n", + "       [1],\n", + "       [0]], dtype=int32)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 95 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rNogASabEhz8", + "outputId": "948e70f5-35fd-4932-b106-2b87950c6ca6", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "y_XOR" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + "       [1],\n", + "       [1],\n", + "       [0]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 96 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8xaBkwD15-1d" + }, + "source": [ + "### 9. 
Conclusions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UULwoI-9yPIs" + }, + "source": [ + "The final Neural Network, after the _fine tuning_ phase, yields the results shown in section 7.6. Given these results, we suggest moving on to the _deployment_ phase of the Neural Network, as CRISP-DM recommends." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uFK4SeM5TLOb" + }, + "source": [ + "### **Exercise**\n", + "\n", + "1. Try other activation functions for the _Hidden Layer_, then record and report your results. To learn which activation functions are available, see [Module: tf.keras.activations](https://www.tensorflow.org/api_docs/python/tf/keras/activations);\n", + "\n", + "2. Try other optimization algorithms to train the Neural Network, then record and report your results. To learn which algorithms are available, see [Module: tf.keras.optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers).\n", + "\n", + "3. In this example, we used the 'adam' optimization algorithm. 
Consult the Tensorflow/Keras documentation on 'adam' and you will see that the signature of the optimizer is:\n", + "\n", + "```\n", + "tf.keras.optimizers.Adam(\n", + "    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,\n", + "    name='Adam', **kwargs\n", + ")\n", + "```\n", + "\n", + "Retrain the Neural Network with different values of the _Learning Rate_ and report your results.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pyyiNwm6eeP4" + }, + "source": [ + "___\n", + "# **_ACTIVATION FUNCTION_**\n", + "\n", + "> Activation functions are an important part of Neural Networks because they allow the networks to handle the non-linearity present in most real-world problems.\n", + "\n", + "The most widely used activation functions are:\n", + "* _Sigmoid_;\n", + "* ReLU (_Rectified Linear Unit_);\n", + "* Leaky ReLU;\n", + "* _Generalized_ ReLU;\n", + "* Tanh;\n", + "* _Swish_.\n", + "\n", + "The following articles discuss these main activation functions:\n", + "* [Classical Neural Net: Why/Which Activations Functions?](https://towardsdatascience.com/classical-neural-net-why-which-activations-functions-401159ba01c4);\n", + "* [Intermediate Topics in Neural Networks](https://towardsdatascience.com/comprehensive-introduction-to-neural-network-architecture-c08c6d8e5d98);\n", + "* [Comparison of Activation Functions for Deep Neural Networks](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7jF5SToOKYC" + }, + "source": [ + "### Activation functions for _Hidden Layers_:\n", + "\n", + "Several activation functions can be used in the _Hidden Layer_. The main ones are:\n", + "\n", + "* ReLU\n", + "  * mitigates the _vanishing gradient problem_, which is the main weakness of the _sigmoid_ and _tanh_ activation functions. 
ReLU has a weakness of its own: its derivative is zero whenever the neuron's weighted input, computed from $X = [X_{1}, X_{2}, ..., X_{n}]$, is negative, which can lead to so-called \"dead neurons\";\n", + "  * Almost all _Deep Learning_ models today use ReLU, which **should be used only in the _Hidden Layers_ of Neural Networks**. \n", + "* Leaky ReLU\n", + "  * An alternative to ReLU that keeps a small non-zero slope for negative inputs, which avoids dead neurons;\n", + "* _Swish_\n", + "  * another alternative to ReLU, proposed by Google researchers in 2017;\n", + "  * some papers report better Neural Network results with _Swish_." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8dYvHeYbN_c4" + }, + "source": [ + "### Activation functions for _Output Layers_:\n", + "\n", + "The activation function of the _Output Layer_ depends on the problem:\n", + "\n", + "* _Sigmoid_ for binary classification problems (2 classes).\n", + "  * Example: the Titanic dataframe, where we want to estimate whether a passenger died or survived;\n", + "* _Softmax_ for multi-class classification problems (> 2 classes).\n", + "  * Example: the Iris dataframe, where we want to estimate the species of the flowers, which are versicolor, virginica and setosa;\n", + "* _Linear_ for regression problems. \n", + "  * Example: the Boston Housing Prediction dataframe, where we want to estimate house prices in Boston, which is a continuous variable.\n", + "\n", + "\n", + "The article [Comparison of Activation Functions for Deep Neural Networks](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a) compares and discusses the main activation functions in detail." 
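, + "\n", + "\n", + "A minimal NumPy sketch of the three output activations listed above, given only as an illustration: the function names and the values in `z` are assumptions and do not come from the notebook.\n", + "\n", + "```python\n", + "import numpy as np\n", + "\n", + "def sigmoid(z):\n", + "    # Squashes any real value into (0, 1): probability of the positive class\n", + "    return 1.0 / (1.0 + np.exp(-z))\n", + "\n", + "def softmax(z):\n", + "    # Subtracting the maximum keeps np.exp from overflowing\n", + "    e = np.exp(z - np.max(z, axis=-1, keepdims=True))\n", + "    return e / np.sum(e, axis=-1, keepdims=True)\n", + "\n", + "def linear(z):\n", + "    # Identity: the raw value is the regression estimate\n", + "    return z\n", + "\n", + "z = np.array([2.0, 0.0, -1.0])  # hypothetical pre-activation values\n", + "print(sigmoid(z[0]))  # one probability for binary classification\n", + "print(softmax(z))     # one probability per class, summing to 1\n", + "print(linear(z[0]))   # unchanged value for regression\n", + "```"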
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "clQq59rPIkvB" + }, + "source": [ + "___\n", + "# **EXAMPLE 1: Neural Network to identify sex from weight and height**\n", + "\n", + "> The following dataframe contains 10,000 measurements of height (_height_) and weight (_weight_): 5,000 for males (_males_) and 5,000 for females (_females_).\n", + "\n", + "**Goal**: Estimate sex (the _Gender_ column) as a function of the variables _Height_ and _Weight_.\n", + "\n", + "Dataframe source: Kaggle (weight-height.csv).\n", + "\n", + "In this application, we will follow the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Preprocess and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the Neural Network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the Neural Network;\n", + "6. Fit the Neural Network;\n", + "7. Evaluate the performance of the Neural Network;\n", + "8. _Fine tuning_ of the Neural Network;\n", + "9. Make Predictions with the Neural Network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dh5p2GcvLQQX" + }, + "source": [ + "### 0. Load the main libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bhjAdXgab99r" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kChuTlPddNZv" + }, + "source": [ + "import numpy as np\n", + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "import pandas as pd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ZX00UN5cjvM" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "THWNIk_FCe_g" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PZgQAKqLcLX3" + }, + "source": [ + "[**Python**] - Set the number of decimal places" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tzKor02BCe_d" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M5V4KopjLWOL" + }, + "source": [ + "### 1. Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V_cwAUW3tseE" + }, + "source": [ + "[**Python**] - Load the data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_Bs87IWPtwtm" + }, + "source": [ + "# Read the dataframe:\n", + "df_sexo = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/weight-height.csv')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mBUeMtV7tzw6" + }, + "source": [ + "[**Python**] - Show the first 5 rows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rcH-y4amt3gs" + }, + "source": [ + "df_sexo.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OSa161sPLcAw" + }, + "source": [ + "### 2. Preprocess and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lL2-6wpCuARF" + }, + "source": [ + "[**Python**] - Build the 'sexo' column as follows:\n", + "* If Gender= 'Male' ==> sexo= 1;\n", + "* If Gender= 'Female' ==> sexo= 0." 
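, + "\n", + "\n", + "As a hedged alternative to a row-by-row function, the same rule can be sketched with a vectorized comparison. The three-row `df_exemplo` below is a hypothetical stand-in for the real dataframe and is not part of the notebook:\n", + "\n", + "```python\n", + "import pandas as pd\n", + "\n", + "# Hypothetical sample with the same 'Gender' column as the dataframe\n", + "df_exemplo = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})\n", + "\n", + "# The comparison yields True/False; astype(int) turns that into 1/0\n", + "df_exemplo['sexo'] = (df_exemplo['Gender'] == 'Male').astype(int)\n", + "print(df_exemplo)\n", + "```"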
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ccImSqCqDKre" + }, + "source": [ + "def define_label(row):\n", + "    if row['Gender'] == 'Male':\n", + "        return 1\n", + "    else:\n", + "        return 0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NDYamauZCq77" + }, + "source": [ + "df_sexo['sexo'] = df_sexo.apply(lambda row: define_label(row), axis = 1)\n", + "df_sexo.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hqkOrJnNuZjg" + }, + "source": [ + "[**Python**] - Drop the 'Gender' column and rename the remaining dataframe columns in lowercase:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-dahUMI6DsBz" + }, + "source": [ + "df_sexo = df_sexo.drop(columns= 'Gender')\n", + "df_sexo = df_sexo.rename({'Height': 'altura', 'Weight': 'peso'}, axis= 1)\n", + "df_sexo.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UTISVuZ4ukQO" + }, + "source": [ + "[**Python**] - Define the arrays X_sexo and y_sexo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oMTIn6Zf5LlU" + }, + "source": [ + "X_sexo = df_sexo.copy()\n", + "X_sexo = X_sexo.drop(columns= ['sexo'])\n", + "y_sexo = df_sexo['sexo'].values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iSThKwhj4LsC" + }, + "source": [ + "y_sexo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FiO_F95jc1_s" + }, + "source": [ + "[**Python**] - Standardize the data - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4myPAnSzE7-l" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "\n", + "X_sexo= SS.fit_transform(X_sexo)\n", + "X_sexo" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "jJaJWuUqJCha" + }, + "source": [ + "### 3. Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LoO2iEimu4SQ" + }, + "source": [ + "[**Python**] - Define the training and validation samples" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hTCdm-F9JBGA" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste= train_test_split(X_sexo, y_sexo, test_size = 0.1, random_state = 20111974)\n", + "print(f'X: Training = {X_treinamento.shape}; X: Test = {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "th9CsQpB8VDK" + }, + "source": [ + "print(f'Y: Training = {y_treinamento.shape}; Y: Test = {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2bL-vXiULupD" + }, + "source": [ + "### 4. 
Define the Neural Network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zxETX6dTfyU5" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in the hidden layer (_Hidden Layer_);\n", + "* FA: Activation function (_Função de Ativação_)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "F_MdsLicfyU6" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I = 2\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O = 1\n", + "\n", + "# Number of neurons in the Hidden Layer:\n", + "N_H = 64\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H = tf.keras.activations.swish\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O = tf.keras.activations.swish" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SUMmDuPCcYyB" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of the results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T-echOBmceVy" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7ZceRRdinEM2" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nXQsSYq2DBfI" + }, + "source": [ + "* 1 _dropout_ layer with $p= 0.1$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TRFR5Kr_nDtD" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from 
tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(N_H, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "#RN.add(Dense(N_H2, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "#RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O))\n", + "\n", + "# Summary of the Neural Network architecture (summary() prints by itself)\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4JBZf4ypGO8o" + }, + "source": [ + "### 5. Compile the Neural Network\n", + "\n", + "This is a binary classification problem (_Male_ or _Female_). Therefore, we have:\n", + "* optimizer= tf.keras.optimizers.Adam();\n", + "* loss= tf.keras.losses.MeanSquaredError() or loss= tf.keras.losses.BinaryCrossentropy(). Personally, I like to use loss= tf.keras.losses.MeanSquaredError() because the result is more intuitive;\n", + "* metrics= tf.keras.metrics.binary_accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "USmAuw6f00wL" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "h7KEi1_e6SSF" + }, + "source": [ + "Algoritmo_Opt = tf.keras.optimizers.Adam()\n", + "Loss_Function = tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "RN.compile(optimizer = Algoritmo_Opt, loss = Loss_Function, metrics = [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hc90EeV_GojX" + }, + "source": [ + "### 6. Fit the Neural Network\n", + "\n", + "Note: The callbacks option below implements the concept of _early stopping_. 
This option stops the training of the Neural Network before the number of _epochs_ is reached once the model stops improving, as measured by the val_loss metric. The parameter _patience_= k means that the optimization stops after k consecutive _epochs_ with no improvement in the performance of the Neural Network." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XCCTtUh_vEFP" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EB91J6nrF0db" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 5, min_delta = 0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs = 100, validation_data = (X_teste, y_teste), callbacks = callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "71mX1iwvHMc5" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "o-zJ6GIjHbY8" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J1sL_DTrKmpq" + }, + "source": [ + "### 7. Evaluate the performance of the Neural Network\n", + "\n", + "To evaluate the Neural Network, we simply pass the test samples: X_teste and y_teste. The evaluate() function returns a list with 2 values: loss and accuracy." 
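, + "\n", + "\n", + "A minimal NumPy sketch of how those two values can be computed, assuming the MeanSquaredError loss and the binary_accuracy metric (threshold 0.5) chosen in step 5. The arrays `y_true` and `y_prob` are hypothetical values, not outputs of this notebook:\n", + "\n", + "```python\n", + "import numpy as np\n", + "\n", + "y_true = np.array([1, 0, 1, 1])          # hypothetical labels\n", + "y_prob = np.array([0.9, 0.2, 0.4, 0.7])  # hypothetical network outputs\n", + "\n", + "y_hat = (y_prob > 0.5).astype(int)       # predicted class per sample\n", + "mse = np.mean((y_true - y_prob) ** 2)    # loss: mean squared error\n", + "acc = np.mean(y_hat == y_true)           # accuracy: share of correct classes\n", + "print([mse, acc])\n", + "```"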
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VckQfEFPvMa7" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rUhEiqxfKmpv" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "agO4cGTqKmpz" + }, + "source": [ + "Next, the confusion matrix:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aLIAXu7SN7pV" + }, + "source": [ + "Mostra_ConfusionMatrix()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D5zYHcGuMPZe" + }, + "source": [ + "### 8. _Fine tuning_ of the Neural Network\n", + "\n", + "To increase the accuracy of the Neural Network, I suggest increasing the number of neurons in the _Hidden Layer_ and/or the number of _Hidden Layers_.\n", + "\n", + "However, we obtained a reasonable accuracy with the _baseline_ Neural Network. I therefore leave improving the accuracy of this Neural Network as an exercise for the students." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_ISodOu-Kmp3" + }, + "source": [ + "### 9. Make Predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_xgdL1W4vUrN" + }, + "source": [ + "[**Python**] - Command (RN.predict_classes() is deprecated in Tensorflow 2.x, so we threshold RN.predict() at 0.5 instead):\n", + "* (RN.predict(X_treinamento) > 0.5).astype('int32');\n", + "* (RN.predict(X_teste) > 0.5).astype('int32')." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0qun1-vOKmp4" + }, + "source": [ + "y_pred = (RN.predict(X_teste) > 0.5).astype('int32')\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I7sRwTWGKmp8" + }, + "source": [ + "y_teste[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AvywP0nZMtA-" + }, + "source": [ + "### 10. 
Conclusions\n", + "\n", + "We developed a Neural Network able to identify sex (_Gender_) with accuracy= 0.9120." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g5qOWxPczM1O" + }, + "source": [ + "___\n", + "# **EXAMPLE 2: Distinguishing genuine banknotes from forged ones**\n", + "\n", + "* The following example was taken from the [OpenML](https://www.openml.org/home) site. The problem is to distinguish genuine banknotes from forged ones. The data were extracted from images taken of genuine and forged banknotes. For digitization, an industrial camera normally used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale images with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from these images.\n", + "\n", + "* This is the address of the dataframe: https://www.openml.org/d/1462;\n", + "* Description of the variables - [banknote authentication Data Set](https://archive.ics.uci.edu/ml/datasets/banknote+authentication).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nup7tuLc5kYy" + }, + "source": [ + "> Next, we will develop a Neural Network using _Tensorflow_/_Keras_ to classify forged and genuine banknotes. In this application, we will follow the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Preprocess and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the Neural Network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the Neural Network;\n", + "6. Fit the Neural Network;\n", + "7. Evaluate the performance of the Neural Network;\n", + "8. _Fine tuning_ of the Neural Network;\n", + "9. Make Predictions with the Neural Network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YHi73Pbq5vvU" + }, + "source": [ + "### 0. 
Load the Python libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZsZW7_Ev5vvY" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1U4OySJw5vvb" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lAaecKoj5vv5" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5lPEsFy45vv6" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uNvl-o5w5vvo" + }, + "source": [ + "[**Python**] - Set the number of decimal places" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VqRIBc1J5vvp" + }, + "source": [ + "np.set_printoptions(precision = 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8jo3Y9Hs5vwD" + }, + "source": [ + "### 1. 
Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tL7k4X--5vwE" + }, + "source": [ + "[**Python**] - Load the data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RTuRMwld5vwG" + }, + "source": [ + "df_cedulas= pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/Banknote-authentication-dataset.csv')\n", + "df_cedulas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "501b-Zv38ce7" + }, + "source": [ + "[**Python**] - Rename the dataframe columns in lowercase:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lBKafqZR8jFb" + }, + "source": [ + "df_cedulas.columns= df_cedulas.columns.str.lower()\n", + "df_cedulas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HBNIjNaT5vwM" + }, + "source": [ + "[**Python**] - Show how many classes the target variable has:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c7mZgLDl5vwO" + }, + "source": [ + "df_cedulas['class'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MG4q-8nf2GS_" + }, + "source": [ + "[**Python**] - Redefine the target variable (class 1 becomes 0, any other class becomes 1):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IA7f1C4e1zOS" + }, + "source": [ + "def Redefinir_label(row):\n", + "    if row['class']== 1:\n", + "        return 0\n", + "    else:\n", + "        return 1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2DkBD1FU1zOo" + }, + "source": [ + "df_cedulas['class']= df_cedulas.apply(lambda row: Redefinir_label(row), axis= 1)\n", + "df_cedulas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I0j5o4Iu5vwT" + }, + "source": [ + "[**Python**] - Show the distribution of the target variable:" + ] + }, + { + "cell_type": 
"code", + "metadata": { + "id": "32EZZ8eP5vwV" + }, + "source": [ + "sns.countplot(x=\"class\", data= df_cedulas)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dV8A71C55vwb" + }, + "source": [ + "### 2. Preprocess and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Eg-4TkYSXvuo" + }, + "source": [ + "[**Python**] - Define the arrays X_cedulas and y_cedulas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vn2yMB80Xvux" + }, + "source": [ + "X_cedulas = df_cedulas.copy()\n", + "X_cedulas = X_cedulas.drop(columns = ['class'])\n", + "y_cedulas = df_cedulas['class'].values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nckf3bieXvvC" + }, + "source": [ + "[**Python**] - Standardize the data - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CFTlvOcRXvvE" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "\n", + "X_cedulas = SS.fit_transform(X_cedulas)\n", + "X_cedulas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q_ouZ1it5vwz" + }, + "source": [ + "### 3. 
Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T6NrbTvd5vw1" + }, + "source": [ + "[**Python**] - Define the training and validation samples for the Neural Network:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xw3ZZ2fR5vw1" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_cedulas, y_cedulas, test_size = 0.1, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "trfqJbUg5vw8" + }, + "source": [ + "print(f'X: Training = {X_treinamento.shape}; X: Test = {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "duDt1c7i5vxB" + }, + "source": [ + "print(f'Y: Training = {y_treinamento.shape}; Y: Test = {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e4TKmGtr5vxM" + }, + "source": [ + "### 4. 
Define the Neural Network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f6EQymRK5vxO" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in the hidden layer (_Hidden Layer_);\n", + "* FA: Activation function;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JlUflDN3YkG7" + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bRsHYQp05vxO" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I = X_treinamento.shape[1]\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O = 1\n", + "\n", + "# Number of neurons in Hidden Layer 1:\n", + "N_H1 = 8\n", + "\n", + "# Number of neurons in Hidden Layer 2:\n", + "N_H2 = 8\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H = tf.keras.activations.swish\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O = tf.keras.activations.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sSOj8_9n5vxU" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducible results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wFYGSoKH5vxU" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WG1isER05vxZ" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A6IPYp8l5vxa" + }, + 
"source": [ + "**Observações**: \n", + "\n", + "Para evitar problemas relacionados ao _overfitting_ e _Vanishing or Exploding Gradients in Deep Neural Nets_, os artigos abaixo sugerem as seguintes opções para inicialização dos pesos $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) sugere:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) sugere:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() para activation= 'tf.nn.relu' ou 'tf.nn.leaky_relu' e kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) sugere:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sCRp8O4V5vxa" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN = Sequential()\n", + "RN.add(Dense(units = N_H1, input_dim = N_I, activation = FA_H, kernel_initializer = tf.keras.initializers.GlorotUniform(1)))#, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units = N_O, activation = FA_O))\n", + "\n", + "# Resumo da arquitetura da Rede Neural\n", + "print(RN.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Titw0r-d5vxh" + }, + "source": [ + "### 5. 
Compile the Neural Network\n", + "\n", + "This is a binary classification problem." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oVQsayDq5vxi" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "u686jTkd5vxj" + }, + "source": [ + "Algoritmo_Opt = tf.keras.optimizers.Adam()\n", + "Loss_Function = tf.keras.losses.BinaryCrossentropy()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "# metrics is passed as a list, as documented for compile()\n", + "RN.compile(optimizer= Algoritmo_Opt, loss = Loss_Function, metrics = [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BKN3oCa65vxn" + }, + "source": [ + "### 6. Fit the Neural Network\n", + "\n", + "At this stage, we need to provide:\n", + "* **Epoch**: the number of epochs is a hyperparameter of _Gradient Descent_ that defines how many complete passes over the training dataframe are made while updating the weights $W$. One epoch means that every sample in the training dataframe has contributed to updating the weights $W$ once.\n", + "* **Batch**: number of samples processed by the Neural Network before each update of the weights $W$;\n", + "\n", + "#### Example\n", + "Suppose we have a dataframe with 1,000 rows (instances) and choose _epoch_= 1,000 and _batch_= 5. The dataframe is then split into $\\frac{1000}{5}= 200$ _batches_, so the weights $W$ are updated after every 5 instances (rows), that is, 200 times per _epoch_.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vLHQdKsi5vxn" + }, + "source": [ + "Note: the callbacks option below implements _early stopping_. It stops the training of the Neural Network before the number of _epochs_ is reached once the model stops improving, as measured by the val_loss metric. 
The parameter _patience_= k means that the optimization stops after k consecutive _epochs_ without any improvement in the performance of the Neural Network." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q6UMutI45vxp" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2YhUEbTC5vxq" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 3, min_delta = 0.001)]\n", + "\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 100, validation_data = (X_teste, y_teste), callbacks = callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jFmtvTwd5vxu" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "X8Lu0jh55vxz" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sKh0f7Mc5vx4" + }, + "source": [ + "### 7. Evaluate the performance of the Neural Network\n", + "\n", + "To evaluate the Neural Network, we simply pass the test samples: X_teste and y_teste.\n", + "\n", + "The evaluate() function returns a list containing 2 values: loss and accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p7nsNQoX5vx5" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B1OvhTbf5vx6" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z8v2aody5vx-" + }, + "source": [ + "### 8. 
Make predictions with the Neural Network\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0FQy0bZT5vx_" + }, + "source": [ + "[**Python**] - Command (Sequential.predict_classes() was removed in TensorFlow 2.6, so the sigmoid output of predict() is thresholded at 0.5):\n", + "* (RN.predict(X_treinamento) > 0.5).astype('int32');\n", + "* (RN.predict(X_teste) > 0.5).astype('int32').\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n8e327A_5vx_" + }, + "source": [ + "# Threshold the sigmoid probabilities at 0.5 to obtain the class labels\n", + "y_pred = (RN.predict(X_teste) > 0.5).astype('int32')\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TiVyZ-CG5vyE" + }, + "source": [ + "y_teste" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PawhHD_35vyI" + }, + "source": [ + "### Conclusions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RcIh4qua_eEU" + }, + "source": [ + "___\n", + "# **APPLICATION 1 - Neural Network to identify species (Iris Dataframe)**\n", + "\n", + "> Next, we will build a Neural Network using _Tensorflow_/_Keras_ to classify flowers (Iris). In this application, we will follow the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Preprocess and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the Neural Network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the Neural Network;\n", + "6. Fit the Neural Network;\n", + "7. Evaluate the performance of the Neural Network;\n", + "8. _Fine tuning_ of the Neural Network;\n", + "9. Make predictions with the Neural Network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eXRYOpPR4XF4" + }, + "source": [ + "### 0. 
Load the Python libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pa0ir9C_dgOO" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yNYF_qzydgOR" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "\n", + "# The cells below refer to TensorFlow through the alias tf\n", + "import tensorflow as tf\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ird1VzZudgOU" + }, + "source": [ + "[**Python**] - Set the number of decimal places" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lwj9CGzEdgOV" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zmN5HGLOdgOa" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VI86wuv9dgOa" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NoKZnsJpRA8o" + }, + "source": [ + "Perfect, we are using TensorFlow 2.x." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xkLZgdkjavO-" + }, + "source": [ + "### 1. 
Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b7LLQyA3vgBG" + }, + "source": [ + "[**Python**] - Load the data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ACzNibyKAkx_" + }, + "source": [ + "df_Iris= pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/Iris.csv', index_col= 'Id')\n", + "df_Iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T3Vy41RL-lAQ" + }, + "source": [ + "[**Python**] - Fix or rename the dataframe columns:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xN7nBQWg-lAX" + }, + "source": [ + "df_Iris.columns= df_Iris.columns.str.lower()\n", + "df_Iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Em1wLwdzvkgh" + }, + "source": [ + "[**Python**] - Show how many classes the target variable has:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QhuoPcRuA9Do" + }, + "source": [ + "df_Iris['species'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lvWcxUvru50G" + }, + "source": [ + "[**Python**] - Show the distribution of the target variable 'species':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HMpJiMWJu50J" + }, + "source": [ + "j = sns.countplot(x=\"species\", data= df_Iris)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a55-G14aa_wG" + }, + "source": [ + "### 2. 
Preprocess and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uos9OyewvyMo" + }, + "source": [ + "[**Python**] - Apply the LabelEncoder() transformation to the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X0V0hWnBg0so" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "LE = LabelEncoder()\n", + "\n", + "species_encoded= LE.fit_transform(df_Iris['species'])\n", + "df_Iris= df_Iris.drop(columns= ['species'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d8L-b9gZwB4L" + }, + "source": [ + "[**Python**] - Define the array y_Iris:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zMp2_hJ1wGm3" + }, + "source": [ + "y_Iris= tf.keras.utils.to_categorical(species_encoded)\n", + "y_Iris[:5]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qfFg6cWdv9El" + }, + "source": [ + "[**Python**] - Define the array X_Iris:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7coeWhVRjiQl" + }, + "source": [ + "X_Iris= df_Iris.values\n", + "X_Iris[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cUa2sJOSbUFO" + }, + "source": [ + "### 3. 
Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kDOw-RHux1nS" + }, + "source": [ + "[**Python**] - Define the training and validation samples for the Neural Network:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AJE_6w3KL_2O" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste= train_test_split(X_Iris, y_Iris, test_size= 0.2, random_state= 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NKHTG5IP9nVj" + }, + "source": [ + "print(f'X: Training= {X_treinamento.shape}; X: Test= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qe2mHJhb-PIY" + }, + "source": [ + "print(f'Y: Training= {y_treinamento.shape}; Y: Test= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wHFI_bLXPPvl" + }, + "source": [ + "### 4. 
Define the Neural Network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zRYoZ7hwgejR" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in the hidden layer (_Hidden Layer_);\n", + "* FA: Activation function;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mjF1haRmgejS" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I= X_treinamento.shape[1]\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O= y_Iris.shape[1]\n", + "\n", + "# Number of neurons in the Hidden Layer:\n", + "N_H1= 32\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O= tf.nn.softmax" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tGeDaB3oo02k" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducible results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zGS15afAo02n" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iT9w2tUCo5-X" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nytcmC4BkSz1" + }, + "source": [ + "**Notes**: \n", + "\n", + "To avoid problems related to _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for initializing and constraining the weights $W$:\n", + "\n", + "* [How 
to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + "  * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + "  * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu' and kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + "  * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + "  * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LnLeLMmZoUjU" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O))\n", + "\n", + "# Summary of the Neural Network architecture\n", + "print(RN.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3eT-EHUecTj3" + }, + "source": [ + "### 5. Compile the Neural Network\n", + "\n", + "This is a multi-class classification problem (> 2 classes). Therefore, we have:\n", + "* loss= tf.keras.losses.CategoricalCrossentropy();\n", + "* metrics= tf.keras.metrics.binary_accuracy (this scores each one-hot output separately; tf.keras.metrics.categorical_accuracy is the usual metric for multi-class problems);\n", + "* optimizer= tf.keras.optimizers.Adam()." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fsq0aEtwyAAM" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NDStOKqhcRf4" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.CategoricalCrossentropy()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "# metrics is passed as a list, as documented for compile()\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hZFu65TecabN" + }, + "source": [ + "### 6. Fit the Neural Network\n", + "\n", + "At this stage, we need to provide:\n", + "* **Epoch**: the number of epochs is a hyperparameter of _Gradient Descent_ that defines how many complete passes over the training dataframe are made while updating the weights $W$. One epoch means that every sample in the training dataframe has contributed to updating the weights $W$ once.\n", + "* **Batch**: number of samples processed by the Neural Network before each update of the weights $W$;\n", + "\n", + "#### Example\n", + "Suppose we have a dataframe with 1,000 rows (instances) and choose _epoch_= 1,000 and _batch_= 5. The dataframe is then split into $\\frac{1000}{5}= 200$ _batches_, so the weights $W$ are updated after every 5 instances (rows), that is, 200 times per _epoch_.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "boIs266gaZt1" + }, + "source": [ + "Note: the callbacks option below implements _early stopping_. It stops the training of the Neural Network before the number of _epochs_ is reached once the model stops improving, as measured by the val_loss metric. The parameter _patience_= k means that the optimization stops after k consecutive _epochs_ without any improvement in the performance of the Neural Network." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LpR3dXRZ-jom" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9hDxEwHjca8V" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 3, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 100, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JF7EC-g82Hho" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ea0HHBsY2NZ5" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pqVJMX3xchLF" + }, + "source": [ + "### 7. Evaluate the performance of the Neural Network\n", + "\n", + "To evaluate the Neural Network, we simply pass the test samples: X_teste and y_teste.\n", + "\n", + "The evaluate() function returns a list containing 2 values: loss and accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hKbO1nT0yQM1" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cYqDY9V9chcZ" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qchEpyipcnbE" + }, + "source": [ + "### 8. Make predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9X2SZ5fx2_s5" + }, + "source": [ + "[**Python**] - Command (Sequential.predict_classes() was removed in TensorFlow 2.6, so the class with the highest softmax probability is taken from predict()):\n", + "* np.argmax(RN.predict(X_treinamento), axis= 1);\n", + "* np.argmax(RN.predict(X_teste), axis= 1)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FR6ySksLhRvR" + }, + "source": [ + "# Sequential.predict_classes() was removed in TensorFlow 2.6:\n", + "# take the class with the highest softmax probability instead\n", + "y_pred = np.argmax(RN.predict(X_teste), axis= 1)\n", + "y_pred" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LWnRgBbmmlA2" + }, + "source": [ + "y_teste" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wjCtiKieE_TT" + }, + "source": [ + "### Conclusions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LxMzVIGzzGHH" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1qjSoM5quog1" + }, + "source": [ + "___\n", + "# **APPLICATION 2 - Neural Network to identify the type of wine (_Red or White_)**\n", + "\n", + "> In this application, we will use the [wine quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) dataframe from the UCI Machine Learning Repository. Our goal is to predict the type of wine (red or white) from its chemical properties.\n", + "\n", + "Once again, we will build a Neural Network using _Tensorflow_/_Keras_ and follow the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Preprocess and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the Neural Network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the Neural Network;\n", + "6. Fit the Neural Network;\n", + "7. Evaluate the performance of the Neural Network;\n", + "8. _Fine tuning_ of the Neural Network;\n", + "9. Make predictions with the Neural Network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VuG8zbYS4jyx" + }, + "source": [ + "### 0. 
Load the Python libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O3XD4otFd9Ht" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I5Ok5fhid9Hv" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "\n", + "# The cells below refer to TensorFlow through the alias tf\n", + "import tensorflow as tf\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZBXRTgnCd9Hz" + }, + "source": [ + "[**Python**] - Set the number of decimal places" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NFrOyNUgd9Hz" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cGxW2zn6q8xY" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C36Z6vGD4jy8" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kWJXP5diof5G" + }, + "source": [ + "### 1. 
Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IJJe4r_ITDzv" + }, + "source": [ + "[**Python**] - Load the data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jIjqYXlH4tyG" + }, + "source": [ + "from sklearn.datasets import load_wine\n", + "Wine= load_wine()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5iU33wKQGrFb" + }, + "source": [ + "df_Red= pd.read_table(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\", sep=';')\n", + "df_Red.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Liy2KSr6HJUX" + }, + "source": [ + "df_White= pd.read_table('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep= ';')\n", + "df_White.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4TzUnuM4TMmJ" + }, + "source": [ + "[**Python**] - Show the number of rows and columns of each dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pHNwVRQyIwdv" + }, + "source": [ + "print(f'Shape of df_Red: {df_Red.shape}; Shape of df_White: {df_White.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tBWzMsUHJjTU" + }, + "source": [ + "[**Python**] - Build the target variable 'type_wine' (type of wine):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B0WmX3FYJqMR" + }, + "source": [ + "df_Red['type_wine']= 0\n", + "df_White['type_wine']= 1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EVUc1vbqJ0D9" + }, + "source": [ + "[**Python**] - Stack the two dataframes, df_Red and df_White:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gS9SlpvaJ2Ex" + }, + "source": [ + "df_Wine= 
pd.concat([df_Red, df_White], ignore_index=True)\n", + "df_Wine.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dM19HHENKMtq" + }, + "source": [ + "df_Wine.tail()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mpoNmcqpeQxO" + }, + "source": [ + "[**Python**] - Show the number of rows and columns of the dataframe df_Wine:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YJDazoMMeMet" + }, + "source": [ + "df_Wine.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tNyjXmFzo7oe" + }, + "source": [ + "### 2. Preprocess and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xcV_dlHsTpiP" + }, + "source": [ + "[**Python**] - Rename the columns using lowercase letters and underscores:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jau2OybudGc0" + }, + "source": [ + "df_Wine.columns = df_Wine.columns.str.strip().str.lower().str.replace(' ', '_')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EzOWiaCFdq_C" + }, + "source": [ + "df_Wine.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F1wgkLcvete8" + }, + "source": [ + "[**Python**] - Descriptive statistics of the dataframe df_Wine:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A-EbHitYepeQ" + }, + "source": [ + "df_Wine.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uRo_7tOjex3Z" + }, + "source": [ + "#### _Missing Values Handling_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wJ3K_FmCUCsk" + }, + "source": [ + "[**Python**] - Show the number of _missing values_ in the dataframe df_Wine:" + ] + }, + { + "cell_type": "code", + "metadata": { + 
"id": "PODgu5B6e5qQ" + }, + "source": [ + "df_Wine.info()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "szH_TGJZecXp" + }, + "source": [ + "Como podem ver, o dataframe df_Wine não contem _missing values_." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "64znO089vF_2" + }, + "source": [ + "[**Python**] - Mostrar a distribuição da variável-target 'type_wine':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ztQ17lhWfdAu" + }, + "source": [ + "df_Wine['type_wine'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zDRVQy5avF_4" + }, + "source": [ + "j = sns.countplot(x=\"type_wine\", data= df_Wine)\n", + "plt.show(j)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PxexO1Uyeqnj" + }, + "source": [ + "Como podem ver, temos 6.497 instâncias, das quais 4.898 (75%) instâncias são type_wine= 1 (_white_) e 1.599 (25%) instâncias são type_wine= 0 (_red_). Logo, trata-se de um dataframe desbalanceado. No entanto, não vou me preocupar com este assunto neste Notebook e deixo para os alunos como exercício verificar os efeitos das amostras desbalanceadas." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kr_HZnF0UUh3" + }, + "source": [ + "[**Python**] - Definir os arrays X_Wine e y_Wine:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HNvImi4djc-3" + }, + "source": [ + "X_Wine= df_Wine.copy()\n", + "X_Wine= X_Wine.drop(columns= ['type_wine'])\n", + "y_Wine= df_Wine['type_wine'].values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ViMFMd_SlE3x" + }, + "source": [ + "[**Python**] - Normalizar os dados - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OurlnbX3hzSb" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# Define o scaler \n", + "SS = StandardScaler()\n", + "\n", + "# Scale o dataframe\n", + "X_Wine = SS.fit_transform(X_Wine)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KFkJ7b4U1x3D" + }, + "source": [ + "### 3. Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MBqzBcDL2JH4" + }, + "source": [ + "[**Python**] - Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bCQUxbbxhVTM" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_Wine, y_Wine, test_size= 0.2, random_state= 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HHmTtQ3F9A9c" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pPW0aWd4jw0o" + }, + "source": [ + "### 4. 
Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ea4WTsP-hCqS" + }, + "source": [ + "[**Python**] - Definir a arquitetura, ou seja:\n", + "* $N_{I}$: Número de neurônios na camada de entrada (_Input Layer_);\n", + "* $N_{O}$: Número de neurônios na camada de saída (_Output Layer_);\n", + "* $N_{H}$: Número de neurônios na camada escondida (_Hidden Layer_);\n", + "* FA: Função de ativação." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y1pVFjFThCqU" + }, + "source": [ + "# Número de Neurônios na Input Layer:\n", + "N_I= X_Wine.shape[1]\n", + "\n", + "# Número de neurônios na Output Layer:\n", + "N_O= 1\n", + "\n", + "# Número de neurônios na Hidden Layer:\n", + "N_H1= 7\n", + "\n", + "# Função de Ativação da Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Função de Ativação da Output Layer (sigmoid: saída em [0, 1], como a BinaryCrossentropy espera):\n", + "FA_O= tf.nn.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "url8178EpUNO" + }, + "source": [ + "[**Python**] - Definir as sementes para NumPy e Tensorflow:\n", + "> Por questões de reprodutibilidade de resultados, use as sementes abaixo:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MU-8uWn-pUNQ" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6TLUirr6oXv1" + }, + "source": [ + "[**Python**] - Definir a Rede Neural:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8HOQIkQ3beJm" + }, + "source": [ + "**Observações**: \n", + "\n", + "Para evitar problemas relacionados ao _overfitting_ e _Vanishing or Exploding Gradients in Deep Neural Nets_, os artigos abaixo sugerem as seguintes opções para inicialização dos pesos $W$:\n", + "\n", + "* [How to Reduce 
Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) sugere:\n", + "  * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) sugere:\n", + "  * kernel_initializer= tf.keras.initializers.he_normal() para activation= 'tf.nn.relu' ou 'tf.nn.leaky_relu' e kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) sugere:\n", + "  * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + "  * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bj5uJvgaj43u" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "# Hidden Layer com N_H1 neurônios (variável definida na célula de arquitetura)\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O))\n", + "\n", + "# Resumo da arquitetura da Rede Neural\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LTs8xbKx2O3Z" + }, + "source": [ + "### 5. Compilar a Rede Neural\n", + "\n", + "Este é um problema de classificação binária (_red_ ou _white_). 
Portanto, temos:\n", + "* loss= tf.keras.losses.BinaryCrossentropy();\n", + "* metrics= tf.keras.metrics.binary_accuracy;\n", + "* optimizer= tf.keras.optimizers.Adam().\n", + "\n", + "> **Lembre-se**: se o problema fosse de classificação multiclasse (> 2 classes), deveríamos usar loss= tf.keras.losses.CategoricalCrossentropy()." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kUpBKJkl07KH" + }, + "source": [ + "[**Python**] - Comando modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qGZ1bKCo2L7A" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.BinaryCrossentropy()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jBhkSc582ROC" + }, + "source": [ + "### 6. Ajustar a Rede Neural\n", + "\n", + "Obs.: A opção callbacks abaixo implementa o conceito de _early stopping_. Esta opção interrompe o treinamento da Rede Neural antes de atingirmos o número máximo de _epochs_ quando o modelo para de melhorar, medido pela métrica val_loss. O parâmetro _patience_= k significa que o processo de otimização vai parar se tivermos k _epochs_ consecutivas sem observarmos melhoria da performance da Rede Neural."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ls8FfHz0z2lX" + }, + "source": [ + "[**Python**] - Comando modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5mpbdiuJ2d1W" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 10, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sALJp9FQ2WGC" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cWR2rMLg2ZlQ" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ciXdkbVr2VDG" + }, + "source": [ + "### 7. Avaliar a performance da Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RS4_cV5TyamK" + }, + "source": [ + "[**Python**] - Comando modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hkkxiqPe2gZK" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m8Vs47VV4RsV" + }, + "source": [ + "A Rede Neural tem acurácia= 0.9946." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RHQoDk533TRX" + }, + "source": [ + "[**Python**] - Comando:\n", + "* RN.predict_classes(X_treinamento);\n", + "* RN.predict_classes(X_teste)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vlWfMR-k7Qc1" + }, + "source": [ + "y_pred = RN.predict_classes(X_teste)\n", + "y_pred" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mCpp-MO0nXmj" + }, + "source": [ + "A seguir, outras medidas para avaliarmos a performance da Rede Neural:\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oWF-wWfTnjQh" + }, + "source": [ + "# Import the modules from sklearn.metrics\n", + "from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, cohen_kappa_score" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MHD0CvGQnm0v" + }, + "source": [ + "# Precision \n", + "precision_score(y_teste, y_pred)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ERoX4BeQnosC" + }, + "source": [ + "# Recall\n", + "recall_score(y_teste, y_pred)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "l_2bS23Inq1p" + }, + "source": [ + "# F1 score\n", + "f1_score(y_teste,y_pred)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rLM1BSK2QfZm" + }, + "source": [ + "A seguir, a matriz de confusão:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bGAigaBMQfZn" + }, + "source": [ + "Mostra_ConfusionMatrix()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H3yUb9tP2YfE" + }, + "source": [ + "### 8. Fazer Predições com a Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CupPMVRo3a5q" + }, + "source": [ + "[**Python**] - Comando:\n", + "* RN.predict_classes(X_treinamento);\n", + "* RN.predict_classes(X_teste)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GePOr4EJ3mfn" + }, + "source": [ + "y_pred = RN.predict_classes(X_teste)\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "aO79VSBS3mfv" + }, + "source": [ + "y_teste[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oHlDHVp_KcLn" + }, + "source": [ + "### 10. Conclusão\n", + "\n", + "Desenvolvemos uma Rede Neural com 1 _Hidden Layer_ que atingiu acurácia de 0.998." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xr2VzmwPFRdr" + }, + "source": [ + "___\n", + "# **EXERCÍCIO 1**: Prever a qualidade do vinho\n", + "\n", + "Neste exercício, vamos considerar a variável quality como múltiplas classes.\n", + "\n", + "Novamente, vamos desenvolver uma Rede Neural usando _Tensorflow_/_Keras_ e seguir os passos adiante:\n", + "\n", + "1. Carregar os dados;\n", + "2. Pré-processamento e transformação dos dados;\n", + "3. Definir as amostras de treinamento e validação;\n", + "4. Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_;\n", + "5. Compilar a Rede Neural;\n", + "6. Ajustar a Rede Neural;\n", + "7. Avaliar a performance da Rede Neural;\n", + "8. _Fine tuning_ da Rede Neural;\n", + "9. Fazer Predições com a Rede Neural;\n", + "10. Conclusões." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MYpF4TtpvXp4" + }, + "source": [ + "[**Python**] - Mostrar a distribuição da variável-target 'quality':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WmeFj14PvXp6" + }, + "source": [ + "j = sns.countplot(x=\"quality\", data= df_Wine)\n", + "plt.show(j)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "smWNCLsMQWBI" + }, + "source": [ + "Muitas classes... Abaixo, KMeans para definirmos a quantidade de clusters." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Co4GNVexVUoP" + }, + "source": [ + "# Função adaptada de: https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/\n", + "\n", + "from sklearn.cluster import KMeans\n", + "from scipy.spatial.distance import cdist\n", + "\n", + "def Numero_Clusters_Elbow(X):\n", + "    distortions = []\n", + "    inertias = []\n", + "    mapping1 = {}\n", + "    mapping2 = {}\n", + "    K = range(1, 10)\n", + "    for k in K:\n", + "        # Building and fitting the model\n", + "        kmeanModel = KMeans(n_clusters= k).fit(X)\n", + "        # Distortion: mean distance from each point to its nearest centroid\n", + "        distortion = sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis= 1)) / X.shape[0]\n", + "        distortions.append(distortion)\n", + "        inertias.append(kmeanModel.inertia_)\n", + "        mapping1[k] = distortion\n", + "        mapping2[k] = kmeanModel.inertia_\n", + "\n", + "    # Using the different values of Distortion\n", + "    print('Cálculo da Distorção:')\n", + "    for key, val in mapping1.items():\n", + "        print(str(key) + ' : ' + str(val))\n", + "\n", + "    plt.plot(K, distortions, 'bx-')\n", + "    plt.xlabel('Values of K')\n", + "    plt.ylabel('Distortion')\n", + "    plt.title('The Elbow Method using Distortion')\n", + "    plt.show()\n", + "\n", + "    # Using the different values of Inertia\n", + "    print('Cálculo da Inertia:')\n", + "    for key, val in mapping2.items():\n", + "        print(str(key) + ' : ' + str(val))\n", + "\n", + "    plt.plot(K, inertias, 'bx-')\n", + "    plt.xlabel('Values of K')\n", + "    plt.ylabel('Inertia')\n", + "    plt.title('The Elbow Method using Inertia')\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eTA4O3yII-am" + }, + "source": [ + "X_WineQ= df_Wine.copy()\n", + "X_WineQ= X_WineQ.drop(columns= ['quality'])\n", + "X_WineQ= X_WineQ.values\n", + "#y_WineQ= df_Wine['quality2'].values\n", + "X_WineQ" + 
], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IJDCLdsalKvS" + }, + "source": [ + "[**Python**] - Normalizar os dados - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7D9qCW6SYKwA" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# Define o scaler \n", + "SS = StandardScaler().fit(X_WineQ)\n", + "\n", + "# Scale o dataframe\n", + "X_WineQ = SS.transform(X_WineQ)\n", + "X_WineQ" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "naLIn_ufIZtL" + }, + "source": [ + "Numero_Clusters_Elbow(X_WineQ)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "At5KtD-DV0f6" + }, + "source": [ + "Os gráficos acima mostram que o número de _clusters_ ótimos para o dataframe df_Wine é 3. Portanto, vamos trabalhar com n_cluster= 3 daqui para frente." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-Oz9kAvh5yPk" + }, + "source": [ + "Sugiro fazermos o seguinte agrupamento:\n", + "* Se quality= 1, 2, 3 $\\Longrightarrow$ quality2= 1 (qualidade ruim);\n", + "* Se quality= 4, 5, 6, 7 $\\Longrightarrow$ quality2= 2 (qualidade média);\n", + "* Se quality= 8, 9, 10 $\\Longrightarrow$ quality2= 3 (qualidade excelente)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nZc2OQR87E9n" + }, + "source": [ + "def define_quality2(row):\n", + "    # Agrupa a nota original em 3 classes: 1 (ruim), 2 (média), 3 (excelente)\n", + "    if row['quality'] <= 3:\n", + "        return 1\n", + "    elif row['quality'] <= 7:\n", + "        return 2\n", + "    else:\n", + "        return 3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "f8zbLsN27E9t" + }, + "source": [ + "df_Wine['quality2']= df_Wine.apply(define_quality2, axis= 1)\n", + "df_Wine.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7ugFbdVm77wo" + }, + "source": [ + "df_Wine['quality2'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xf3RYZTwEtAD" + }, + "source": [ + "[**Python**] - Análise de Correlação:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dimuJxsPEsAt" + }, + "source": [ + "corr = df_Wine.corr()['quality2'].drop('quality2')\n", + "print(corr)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "g9KALiQsEsAx" + }, + "source": [ + "plt.figure(figsize=(12,10))\n", + "cor = df_Wine.corr()\n", + "sns.heatmap(cor, annot=True, linewidths=0, vmin=-1, cmap=\"RdBu_r\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "At1XAt6AU2oh" + }, + "source": [ + "[**Python**] - Aplicar a transformação LabelEncoder:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kld7QvCaH1YQ" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "LE = LabelEncoder()\n", + "LE.fit(df_Wine['quality2'])\n", + "quality_Encoded= LE.transform(df_Wine['quality2'])\n", + "\n", + "# One-hot encoding das classes para uso com CategoricalCrossentropy\n", + "y_WineQ= tf.keras.utils.to_categorical(quality_Encoded)\n", + "y_WineQ[:5]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "7qqY5vxeGEMC" + }, + "source": [ + "X_WineQ= df_Wine.copy()\n", + "X_WineQ= X.drop(columns= 'quality2', axis= 1)\n", + "X_WineQ.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oKnfuTQFlQLB" + }, + "source": [ + "[**Python**] - Normalizar os dados - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_VSNAOQzGEMG" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# Define o scaler \n", + "SS = StandardScaler().fit(X_WineQ)\n", + "\n", + "# Scale o dataframe\n", + "X_WineQ = SS.transform(X_WineQ)\n", + "X_WineQ" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zE5SV0ET-Zwv" + }, + "source": [ + "[**Python**] - Aplicar PCA para selecionar somente os atributos mais importantes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I3IRWaIj-feA" + }, + "source": [ + "from sklearn.decomposition import PCA\n", + "\n", + "pca = PCA()\n", + "X_pca = pca.fit_transform(X_WineQ)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NbJ6NBcAVPO1" + }, + "source": [ + "[**Python**] - Proporção de variância explicada por cada componente principal" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kOqLh5BgVVN0" + }, + "source": [ + "pca.explained_variance_ratio_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7j_cYnbSVGaT" + }, + "source": [ + "[**Python**] - Proporção acumulada de cada fator:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ux0lAfy2Ar6g" + }, + "source": [ + "pca.explained_variance_ratio_.cumsum()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UzYqn7-A5He" + }, + "source": [ + "Como podemos ver acima, 8 componentes principais acumulam mais 
que 93% da variância. Portanto, vamos selecionar 8 componentes para treinar a Rede Neural." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RzekEmIMBqe1" + }, + "source": [ + "pca8 = PCA(n_components= 8)\n", + "X_pca8 = pca8.fit_transform(X_WineQ)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DYDhJYdV-pSb" + }, + "source": [ + "A seguir, o gráfico da variância explicada acumulada pelas componentes principais." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_zUN_3wR-nOX" + }, + "source": [ + "plt.figure(figsize=(7, 7))\n", + "plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')\n", + "plt.grid()\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YO3Umy7sGEMJ" + }, + "source": [ + "### 3. Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e2gXueuI2Nb4" + }, + "source": [ + "[**Python**] - Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "giFgdIK7GEMJ" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_pca8, y_WineQ, test_size= 0.2, random_state= 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nr4sryEr-eOy" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l2KSZ8TQGEMM" + }, + "source": [ + "### 4. 
Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tQyKOA8Qhgtd" + }, + "source": [ + "[**Python**] - Definir a arquitetura, ou seja:\n", + "* $N_{I}$: Número de neurônios na camada de entrada (_Input Layer_);\n", + "* $N_{O}$: Número de neurônios na camada de saída (_Output Layer_);\n", + "* $N_{H}$: Número de neurônios na camada escondida (_Hidden Layer_);\n", + "* FA: Função de ativação;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "62E0wT8hhgtd" + }, + "source": [ + "# Número de Neurônios na Input Layer:\n", + "N_I= X_pca8.shape[1]\n", + "\n", + "# Número de neurônios na Output Layer:\n", + "N_O= y_WineQ.shape[1]\n", + "\n", + "# Número de neurônios na Hidden Layer:\n", + "N_H1= 32\n", + "N_H2= 32\n", + "N_H3= 32\n", + "N_H4= 32\n", + "N_H5= 32\n", + "N_H6= 32\n", + "\n", + "# Função de Ativação da Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Função de Ativação da Output Layer:\n", + "FA_O= tf.nn.softmax" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wT0kLyuGpZ4C" + }, + "source": [ + "[**Python**] - Definir as sementes para NumPy e Tensorflow:\n", + "> Por questões de reproducibilidade de resultados, use as sementes abaixo:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C3Wdq7XvpZ4F" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_6ZBeC4EnZyQ" + }, + "source": [ + "[**Python**] - Definir a Rede Neural:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vjUyPpoJbg20" + }, + "source": [ + "**Observações**: \n", + "\n", + "Para evitar problemas relacionados ao _overfitting_ e _Vanishing or Exploding Gradients in Deep Neural Nets_, os artigos abaixo sugerem 
as seguintes opções para inicialização dos pesos $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) sugere:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) sugere:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() para activation= 'tf.nn.relu' ou 'tf.nn.leaky_relu' e kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) sugere:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ry43SCP4nbvp" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H2, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H3, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H4, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= 
tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H5, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H6, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units= N_O, activation= FA_O)) # Atenção à Função de Ativação!\n", + "\n", + "# Resumo da arquitetura da Rede Neural\n", + "print(RN.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zQwY5qQLGEMR" + }, + "source": [ + "### 5. Compilar a Rede Neural\n", + "\n", + "Este é um problema de classificação multi-classes (> 2 classes). Portanto, temos:\n", + "* loss= tf.keras.losses.CategoricalCrossentropy();\n", + "* optimizer= tf.keras.optimizers.Adam();\n", + "* metrics= tf.keras.metrics.binary_accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HVp9pT8n1AQC" + }, + "source": [ + "[**Python**] - Comando modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "el86VgZ-GEMS" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.CategoricalCrossentropy()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= Metrics_Perf)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qNyxbvjsGEMV" + }, + "source": [ + "### 6. Ajustar a Rede Neural\n", + "\n", + "Obs.: A opção callbacks abaixo implementa o conceito de _early stopping_. Esta opção vai parar o processo de treinamento da Rede Neural antes de atingirmos o númerco de _epochs_ quando o modelo pára de melhorar, medido pela métrica val_loss. 
O parâmetro _patience_= k significa que o processo de otimização vai parar se tivermos k _epochs_ consecutivas sem observarmos melhoria da performance da Rede Neural." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3ZtHza72z_q9" + }, + "source": [ + "[**Python**] - Comando modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tikYg2CrGEMV" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 100, validation_data= (X_teste, y_teste), batch_size= 12, callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "w6aoEBw92p86" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ei7XWVel2qNW" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tovtcp98GEMa" + }, + "source": [ + "### 7. Avaliar a performance da Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DAC9nKCvyiKu" + }, + "source": [ + "[**Python**] - Comando modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZmhkMbmpGEMb" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JjON99B1B9F0" + }, + "source": [ + "* Rede Neural com Accuracy= 0.9692 SEM PCA.\n", + "* Rede Neural com Accuracy= 0.9677 COM PCA." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_VzIx2_EGEMp" + }, + "source": [ + "### 8. 
Fazer Predições com a Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jpLRlBTI3gvF" + }, + "source": [ + "[**Python**] - Comando (RN.predict_classes foi removido a partir do Tensorflow 2.6):\n", + "* np.argmax(RN.predict(X_treinamento), axis= 1);\n", + "* np.argmax(RN.predict(X_teste), axis= 1)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1c1K5vGjGEMq" + }, + "source": [ + "# Classe predita = índice da maior probabilidade da saída softmax\n", + "y_pred = np.argmax(RN.predict(X_teste), axis= 1)\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5b-hDZ04GEMs" + }, + "source": [ + "# y_teste está em one-hot; converte para o índice da classe para comparar com y_pred\n", + "np.argmax(y_teste[:10], axis= 1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h3mgEOcXu20F" + }, + "source": [ + "___\n", + "# **APLICAÇÃO 3 - Rede Neural para identificar Câncer de Mama (_Breast Cancer Dataframe_)**\n", + "\n", + "Fonte: [Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)\n", + "\n", + "Vamos desenvolver uma Rede Neural usando _Tensorflow_/_Keras_ e seguir os passos adiante:\n", + "\n", + "1. Carregar os dados;\n", + "2. Pré-processamento e transformação dos dados;\n", + "3. Definir as amostras de treinamento e validação;\n", + "4. Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_;\n", + "5. Compilar a Rede Neural;\n", + "6. Ajustar a Rede Neural;\n", + "7. Avaliar a performance da Rede Neural;\n", + "8. _Fine tuning_ da Rede Neural;\n", + "9. Fazer Predições com a Rede Neural;\n", + "10. Conclusões." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QVZQ1meF9l_-" + }, + "source": [ + "### 0. 
Carregar bibliotecas do Python" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cd4zQSkseKzv" + }, + "source": [ + "[**Python**] - Importar as bibliotecas necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OzMsBroueKzx" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wJSbJx9rrBG3" + }, + "source": [ + "[**Python**] - Verificar a versão do Tensorflow\n", + "> Assegurar que está a utilizar a versão 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "alCVjy_JCWUo" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-Ra_PjiFeKz0" + }, + "source": [ + "[**Python**] - Definir o número de casas decimais" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y4I2Eh2YeKz1" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1nXjn5KM94lI" + }, + "source": [ + "### 1. 
Carregar os dados" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ovtqPCcWBgfI" + }, + "source": [ + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "Cancer = load_breast_cancer()\n", + "X_cancer= Cancer.data\n", + "y_Cancer= Cancer.target\n", + "Col_Names= Cancer.feature_names" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5qbOErH6Dkcu" + }, + "source": [ + "Col_Names" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "weKuFIHqDWRX" + }, + "source": [ + "X_cancer" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9kix-c9TDY_R" + }, + "source": [ + "y_Cancer[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qg_IuwHp946c" + }, + "source": [ + "### 2. Pré-processamento e transformação dos dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7wnkuDnMlVe8" + }, + "source": [ + "[**Python**] - Normalizar os dados - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uVEHT9sqG9WR" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "X_cancer = SS.fit_transform(X_cancer)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hNeVqHH295FA" + }, + "source": [ + "### 3. 
Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Rbc3MwX2RDU" + }, + "source": [ + "[**Python**] - Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xR6XSrbsFX-P" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_cancer, y_Cancer, test_size = 0.1, random_state = 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NRQ8zW1m-g6l" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "775KoeQz95O6" + }, + "source": [ + "### 4. Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fFaKVel2ilZs" + }, + "source": [ + "[**Python**] - Definir a arquitetura, ou seja:\n", + "* $N_{I}$: Número de neurônios na camada de entrada (_Input Layer_);\n", + "* $N_{O}$: Número de neurônios na camada de saída (_Output Layer_);\n", + "* $N_{H}$: Número de neurônios na camada escondida (_Hidden Layer_);\n", + "* FA: Função de ativação;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aB7xcQFWilZs" + }, + "source": [ + "# Número de Neurônios na Input Layer:\n", + "N_I= X_cancer.shape[1]\n", + "\n", + "# Número de neurônios na Output Layer:\n", + "N_O= 1\n", + "\n", + "# Número de neurônios na Hidden Layer:\n", + "N_H1= 16\n", + "N_H2= 16\n", + "N_H3= 16\n", + "\n", + "# Função de Ativação da Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Função de Ativação da Output Layer (com N_O= 1, softmax devolveria sempre 1; usamos sigmoid):\n", + "FA_O= tf.nn.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "rm1PeihTRlgz" + }, + "source": [ + "[**Python**] - Definir as sementes para NumPy e Tensorflow:\n", + "> Por questões de reproducibilidade de resultados, use as sementes abaixo:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U3oTC94GRlg3" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J5GBW-Ucnh9T" + }, + "source": [ + "[**Python**] - Definir a Rede Neural:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "my2xjCniIb4J" + }, + "source": [ + "Vamos adicionar duas camadas _Dropout_ com $p= 0.1$ para evitarmos _overfitting_." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kne06wmjnTYD" + }, + "source": [ + "**Observações**: \n", + "\n", + "Para evitar problemas relacionados ao _overfitting_ e _Vanishing or Exploding Gradients in Deep Neural Nets_, os artigos abaixo sugerem as seguintes opções para inicialização dos pesos $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) sugere:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) sugere:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() para activation= 'tf.nn.relu' ou 'tf.nn.leaky_relu' e kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) sugere:\n", + " * kernel_initializer= 
tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s6x7mo4dnjlE" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H2, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H3, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units= N_O, activation= FA_O))\n", + "\n", + "# Resumo da arquitetura da Rede Neural\n", + "print(RN.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "04cKraWZ9mMb" + }, + "source": [ + "### 5. Compilar a Rede Neural\n", + "\n", + "Este é um problema de classificação binária (maligno ou benigno). Portanto, temos:\n", + "* loss= tf.keras.losses.BinaryCrossentropy();\n", + "* metrics= tf.keras.metrics.binary_accuracy;\n", + "* optimizer= tf.keras.optimizers.Adam()." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OGRjWcsm1FM9" + }, + "source": [ + "[**Python**] - Comando modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bLIoA8FrJJCx" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.BinaryCrossentropy()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= Metrics_Perf)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cnL12eaF9mU6" + }, + "source": [ + "### 6. Ajustar a Rede Neural\n", + "\n", + "Obs.: A opção callbacks abaixo implementa o conceito de _early stopping_. Esta opção interrompe o treinamento da Rede Neural antes de atingirmos o número máximo de _epochs_ quando o modelo para de melhorar, medido pela métrica val_loss. O parâmetro _patience_= k significa que o treinamento para após k _epochs_ consecutivas sem melhoria de pelo menos min_delta em val_loss."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nVsziJfk0FIv" + }, + "source": [ + "[**Python**] - Comando modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "apQY6cQjJb-z" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 100, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "avd2cXpO20cY" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FCd8xFxA25Lc" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3zToEvUs-pCt" + }, + "source": [ + "### 7. Avaliar a performance da Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IqzKH7jsymwL" + }, + "source": [ + "[**Python**] - Comando modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pmjuk6OqJ7zD" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "04ZGPI6DKcnz" + }, + "source": [ + "A seguir, a matriz de confusão:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-MZyagwaKfkM" + }, + "source": [ + "Mostra_ConfusionMatrix()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KasqSFWG-pTG" + }, + "source": [ + "### 8. _Fine tuning_ da Rede Neural\n", + "\n", + "Não é necessário, pois obtivemos 0.9825 de acurácia." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GLxgJP3L-pdZ" + }, + "source": [ + "### 9. 
Fazer Predições com a Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0iXGBnNZYb4V" + }, + "source": [ + "[**Python**] - Comando:\n", + "* RN.predict_classes(X_treinamento);\n", + "* RN.predict_classes(X_teste)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nqBFwxg5Yb4b" + }, + "source": [ + "y_pred = RN.predict_classes(X_teste)\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4CHdWhgD-plr" + }, + "source": [ + "### 10. Conclusões" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T2AQ4uDShdgE" + }, + "source": [ + "___\n", + "# **APLICAÇÃO 4 - Rede Neural para identificar Diabetes (Diabetes Dataframe)**\n", + "\n", + "> Vamos desenvolver uma Rede Neural usando _Tensorflow_/_Keras_ e seguir os passos adiante:\n", + "\n", + "1. Carregar os dados;\n", + "2. Pré-processamento e transformação dos dados;\n", + "3. Definir as amostras de treinamento e validação;\n", + "4. Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_;\n", + "5. Compilar a Rede Neural;\n", + "6. Ajustar a Rede Neural;\n", + "7. Avaliar a performance da Rede Neural;\n", + "8. _Fine tuning_ da Rede Neural;\n", + "9. Fazer Predições com a Rede Neural;\n", + "10. Conclusões." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HOEJGtAzQfX3" + }, + "source": [ + "### 0. 
Carregar bibliotecas do Python" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mxa5UaIXeRgN" + }, + "source": [ + "[**Python**] - Importar as bibliotecas necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ylfhuYeveRgO" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uG9B3WTkeRgR" + }, + "source": [ + "[**Python**] - Definir o número de casas decimais:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TZkm0YVoeRgR" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mDhEsSJ1rFpy" + }, + "source": [ + "[**Python**] - Verificar a versão do Tensorflow\n", + "> Assegurar que está a utilizar a versão 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KfKcNLZ3QfYJ" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rIT9N7jSQfYO" + }, + "source": [ + "### 1. 
Carregar os dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r9QUJZgbSWDG" + }, + "source": [ + "[**Python**] - Carregar os dados:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ofSJNoyfQfYR" + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "Diabetes = load_diabetes()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fo7q0BnyShVG" + }, + "source": [ + "[**Python**] - Definir os arrays X_Diabetes e y_Diabetes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UTnrDMPLQfYW" + }, + "source": [ + "X_Diabetes= Diabetes.data\n", + "y_Diabetes= Diabetes.target\n", + "Col_Names= Diabetes.feature_names" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZjrdZUwp_l40" + }, + "source": [ + "[**Python**] - Corrigir ou renomear as colunas do dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "skgDY4Lu_l46" + }, + "source": [ + "# Diabetes.data é um array NumPy (sem atributo columns); convertemos para DataFrame antes de renomear.\n", + "X_Diabetes= pd.DataFrame(X_Diabetes, columns= Col_Names)\n", + "X_Diabetes.columns= X_Diabetes.columns.str.lower()\n", + "X_Diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "a5NQO8b-QfYb" + }, + "source": [ + "X_Diabetes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "adpBpNDeQfYj" + }, + "source": [ + "y_Diabetes[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQA4fN4HQfYo" + }, + "source": [ + "### 2. 
Pré-processamento e transformação dos dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dTnBpjwalbVG" + }, + "source": [ + "[**Python**] - Normalizar os dados - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hS_unh4wQfYp" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "X_Diabetes = SS.fit_transform(X_Diabetes)\n", + "y_Diabetes= SS.fit_transform(y_Diabetes.reshape(-1, 1))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZSviMMrISt96" + }, + "source": [ + "y_Diabetes[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pWmVyMF0QfYu" + }, + "source": [ + "### 3. Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cEMSNMJu2VqI" + }, + "source": [ + "[**Python**] - Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2WUQMh2HQfYx" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_Diabetes, y_Diabetes, test_size = 0.1, random_state = 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zvx80NPT-j0S" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wk_CG4H5QfY2" + }, + "source": [ + "### 4. 
Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V1bDqK5vi49C" + }, + "source": [ + "[**Python**] - Definir a arquitetura, ou seja:\n", + "* $N_{I}$: Número de neurônios na camada de entrada (_Input Layer_);\n", + "* $N_{O}$: Número de neurônios na camada de saída (_Output Layer_);\n", + "* $N_{H}$: Número de neurônios na camada escondida (_Hidden Layer_);\n", + "* FA: Função de ativação;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "por467-ci49D" + }, + "source": [ + "# Número de Neurônios na Input Layer:\n", + "N_I= X_Diabetes.shape[1]\n", + "\n", + "# Número de neurônios na Output Layer:\n", + "N_O= 1\n", + "\n", + "# Número de neurônios na Hidden Layer:\n", + "N_H1= 6\n", + "\n", + "# Função de Ativação da Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Função de Ativação da Output Layer (tf.nn não possui 'linear'; usamos a ativação do Keras):\n", + "FA_O= tf.keras.activations.linear" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-r7VC-7lpkkC" + }, + "source": [ + "[**Python**] - Definir as sementes para NumPy e Tensorflow:\n", + "> Por questões de reprodutibilidade de resultados, use as sementes abaixo:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "43f-ZPW7pkkD" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2qv8lJmHnqi3" + }, + "source": [ + "[**Python**] - Definir a Rede Neural:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Veeqccdbnks" + }, + "source": [ + "**Observações**: \n", + "\n", + "Para evitar problemas relacionados ao _overfitting_ e _Vanishing or Exploding Gradients in Deep Neural Nets_, os artigos abaixo sugerem as seguintes opções para inicialização dos pesos $W$:\n", + "\n", + "* [How to Reduce 
Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) sugere:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) sugere:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() para activation= 'tf.nn.relu' ou 'tf.nn.leaky_relu' e kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) sugere:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WMOoG_0bnsSD" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units= N_O, activation= FA_O)) # Se não definirmos o parâmetro activation, então por default será 'linear'.\n", + "\n", + "# Resumo da arquitetura da Rede Neural\n", + "print(RN.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7W8VtONlQfZP" + }, + "source": [ + "### 5. Compilar a Rede Neural\n", + "\n", + "Este é um problema de regressão. Portanto, temos:\n", + "* loss= tf.keras.losses.MeanSquaredError;\n", + "* metrics= 'mse';\n", + "* optimizer= tf.keras.optimizers.Adam()." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g97mJCSr1Kat" + }, + "source": [ + "[**Python**] - Comando modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cXJFtlcEQfZQ" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.MeanSquaredError()\n", + "# name= 'mse' para que hist.history tenha as chaves 'mse' e 'val_mse' usadas em Model_MSE().\n", + "Metrics_Perf = tf.keras.metrics.MeanSquaredError(name= 'mse')\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= Metrics_Perf)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1s1S7Fn_QfZW" + }, + "source": [ + "### 6. Ajustar a Rede Neural\n", + "\n", + "Obs.: A opção callbacks abaixo implementa o conceito de _early stopping_. Esta opção interrompe o treinamento da Rede Neural antes de atingirmos o número máximo de _epochs_ quando o modelo para de melhorar, medido pela métrica val_loss. O parâmetro _patience_= k significa que o treinamento para após k _epochs_ consecutivas sem melhoria de pelo menos min_delta em val_loss."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PNgR4ihA0JMy" + }, + "source": [ + "[**Python**] - Comando modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JUaqK4j-QfZY" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 200, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3bFg6kut1jkb" + }, + "source": [ + "A seguir, funções para plotarmos os gráficos das métricas MSE, _Loss_ e _Accuracy_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VPASVWaR1sWN" + }, + "source": [ + "def Model_Loss(hist):\n", + " print(hist.history.keys())\n", + " plt.plot(hist.history['loss'])\n", + " plt.plot(hist.history['val_loss'])\n", + " plt.title('Model Loss')\n", + " plt.ylabel('Loss')\n", + " plt.xlabel('Epochs')\n", + " plt.legend(['Training', 'Validation'], loc= 'upper right')\n", + " plt.show()\n", + "\n", + "def Model_Accuracy(hist):\n", + " print(hist.history.keys())\n", + " plt.plot(hist.history['accuracy'])\n", + " plt.plot(hist.history['val_accuracy'])\n", + " plt.title('Model Accuracy')\n", + " plt.ylabel('Accuracy')\n", + " plt.xlabel('Epochs')\n", + " plt.legend(['Training', 'Validation'], loc= 'upper right')\n", + " plt.show()\n", + "\n", + "def Model_MSE(hist):\n", + " print(hist.history.keys())\n", + " plt.plot(hist.history['mse'])\n", + " plt.plot(hist.history['val_mse'])\n", + " plt.title('Model MSE')\n", + " plt.ylabel('MSE')\n", + " plt.xlabel('Epochs')\n", + " plt.legend(['Training', 'Validation'], loc= 'upper right')\n", + " plt.show()\n", + "\n", + "def Mostra_ConfusionMatrix():\n", + " y_pred = RN.predict_classes(X_teste)\n", + " mc = confusion_matrix(y_teste, y_pred)\n", + " #sns.heatmap(mc,annot=True, annot_kws={\"size\": 
10},fmt=\"d\")\n", + " sns.heatmap(mc/np.sum(mc), annot=True, annot_kws={\"size\": 10}, fmt='.2%', cmap='Blues')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uWhJUP0v2_fm" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M8IZFKGyCvqO" + }, + "source": [ + "Model_MSE(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "37_0RhXLQfZc" + }, + "source": [ + "### 7. Avaliar a performance da Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8mMrIS9JyriW" + }, + "source": [ + "[**Python**] - Comando modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cRjEkvWzQfZe" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MA6_RkjgQfZs" + }, + "source": [ + "### 8. _Fine tuning_ da Rede Neural\n", + "\n", + "Esta fase fica como exercício para o aluno." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vPJtuCzXQfZu" + }, + "source": [ + "### 9. Fazer Predições com a Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p1EptFS1Yi-D" + }, + "source": [ + "[**Python**] - Comando (problema de regressão, portanto usamos predict em vez de predict_classes):\n", + "* RN.predict(X_treinamento);\n", + "* RN.predict(X_teste)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fbrvwgyvYi-I" + }, + "source": [ + "y_pred = RN.predict(X_teste)\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JOsTSHwoQfZ0" + }, + "source": [ + "### 10. 
Conclusões" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EoQ5nySZmLDP" + }, + "source": [ + "___\n", + "# **APLICAÇÃO 5 - Rede Neural para prever os preços das casas (_Boston House Price Prediction_)**\n", + "\n", + "Vamos desenvolver uma Rede Neural usando _Tensorflow_/_Keras_ e seguir os passos adiante:\n", + "\n", + "1. Carregar os dados;\n", + "2. Pré-processamento e transformação dos dados;\n", + "3. Definir as amostras de treinamento e validação;\n", + "4. Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_;\n", + "5. Compilar a Rede Neural;\n", + "6. Ajustar a Rede Neural;\n", + "7. Avaliar a performance da Rede Neural;\n", + "8. _Fine tuning_ da Rede Neural;\n", + "9. Fazer Predições com a Rede Neural;\n", + "10. Conclusões." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8vdRpBS8VTw_" + }, + "source": [ + "### 0. Carregar bibliotecas do Python" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v629ZppSeY5T" + }, + "source": [ + "[**Python**] - Importar as bibliotecas necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uVXroVLTeY5U" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qYCcNW9qeY5W" + }, + "source": [ + "[**Python**] - Definir o número de casas decimais" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zNn-kwlGeY5X" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YnlZU1rLrJwt" + }, + "source": [ + "[**Python**] - Verificar a versão do Tensorflow\n", + "> Assegurar que está a utilizar a versão 2.x."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "445U8OKgVTxW" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-1Ckhzf0VTxc" + }, + "source": [ + "### 1. Carregar os dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aAz0_L0e1mxX" + }, + "source": [ + "[**Python**] - Carregar os dados:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SOdpdceAVTxd" + }, + "source": [ + "from sklearn.datasets import load_boston\n", + "\n", + "Boston = load_boston()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K0P23sJs1raX" + }, + "source": [ + "[**Python**] - Definir as matrizes X_Boston e y_Boston:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rPpJOsgJ1y7J" + }, + "source": [ + "X_Boston= Boston.data\n", + "y_Boston= Boston.target\n", + "Col_Names= Boston.feature_names\n", + "Col_Names" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5XBRc6og_ySA" + }, + "source": [ + "[**Python**] - Corrigir ou renomear as colunas do dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VPGDwXSF_ySE" + }, + "source": [ + "# Boston.data é um array NumPy (sem atributo columns); convertemos para DataFrame antes de renomear.\n", + "X_Boston= pd.DataFrame(X_Boston, columns= Col_Names)\n", + "X_Boston.columns= X_Boston.columns.str.lower()\n", + "X_Boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fKiKT-fkVTxq" + }, + "source": [ + "y_Boston[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T9uYgjz-VTxu" + }, + "source": [ + "### 2. 
Pré-processamento e transformação dos dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5rbpIU5jlgv6" + }, + "source": [ + "[**Python**] - Normalizar os dados - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Kbs-x9a2VTxw" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "X_Boston= SS.fit_transform(X_Boston)\n", + "y_Boston= SS.fit_transform(y_Boston.reshape(-1, 1))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S2w2H9BOXK9u" + }, + "source": [ + "X_Boston" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "DXNIHeS2XM_k" + }, + "source": [ + "y_Boston[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gcbomDeKVTx1" + }, + "source": [ + "### 3. Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gEkX579Q2D2q" + }, + "source": [ + "[**Python**] - Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EZyRBsfYVTx2" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_Boston, y_Boston, test_size = 0.1, random_state = 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "g89c6edL-mBW" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GU-ebO-3VTx7" + }, + "source": [ + "### 4. 
Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gMVzohGHjS_p" + }, + "source": [ + "[**Python**] - Definir a arquitetura, ou seja:\n", + "* $N_{I}$: Número de neurônios na camada de entrada (_Input Layer_);\n", + "* $N_{O}$: Número de neurônios na camada de saída (_Output Layer_);\n", + "* $N_{H}$: Número de neurônios na camada escondida (_Hidden Layer_);\n", + "* FA: Função de ativação;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lf32pQtWjS_u" + }, + "source": [ + "# Número de Neurônios na Input Layer:\n", + "N_I= X_Boston.shape[1]\n", + "\n", + "# Número de neurônios na Output Layer:\n", + "N_O= 1\n", + "\n", + "# Número de neurônios na Hidden Layer:\n", + "N_H1= 7\n", + "\n", + "# Função de Ativação da Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Função de Ativação da Output Layer (regressão: ativação linear):\n", + "FA_O= tf.keras.activations.linear" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qOI4_BPYVTyE" + }, + "source": [ + "Vamos adicionar uma camada _Dropout_ com $p= 0.1$ para evitarmos _overfitting_."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yN9lxrXspp-m" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of the results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cWcQ3OS5pp-n" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "73PnOLbon3Jh" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zjKR3qgEneJr" + }, + "source": [ + "**Notes**: \n", + "\n", + "To avoid problems related to _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for initializing and constraining the weights $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + "  * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + "  * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu', and kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + "  * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + "  * kernel_initializer= tf.keras.initializers.GlorotNormal()."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GnZuQZZTn4_W" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O)) # If the activation parameter is not set, it defaults to 'linear'.\n", + "\n", + "# Summary of the Neural Network architecture\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h-hyQiokVTyM" + }, + "source": [ + "### 5. Compile the Neural Network\n", + "\n", + "This is a regression problem. Therefore, we have:\n", + "* loss= tf.keras.losses.MeanSquaredError() or tf.keras.losses.MeanAbsoluteError();\n", + "* metrics= 'mse';\n", + "* optimizer= tf.keras.optimizers.Adam() or 'rmsprop'." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JwQOPOhr1Oh0" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QY2aKnL_VTyN" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.RMSprop()\n", + "Loss_Function= tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = tf.keras.metrics.MeanSquaredError()\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ygJi0ux5VTyT" + }, + "source": [ + "### 6. Fit the Neural Network\n", + "\n", + "Note: The callbacks option below implements the concept of _early stopping_. 
This option stops the training of the Neural Network before the maximum number of _epochs_ is reached, as soon as the model stops improving as measured by the val_loss metric. The parameter _patience_= k means that the optimization process stops after k consecutive _epochs_ without any observed improvement in the performance of the Neural Network." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vz0urLrq0NPG" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9HoQZUl8VTyU" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 200, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "R_StfUsUzbto" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I1FIaMx_zzVW" + }, + "source": [ + "Model_MAE(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_t3k3oqg0pXW" + }, + "source": [ + "Model_MSE(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LH0llgTsVTyY" + }, + "source": [ + "### 7. 
Evaluate the performance of the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yZZPMFXvyvtG" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iZGhNF5vVTyZ" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BkosiHm6lmww" + }, + "source": [ + "Note that the _baseline_ Neural Network (the initial model) has MSE= 0.0795. Even so, we will try to reduce the _Loss Function_ further and reach an even lower MSE." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HcLONQpPVTyi" + }, + "source": [ + "### 8. _Fine tuning_ the Neural Network\n", + "\n", + "What can be done to improve the performance of the Neural Network?\n", + "* increase the number of _Hidden Layers_ and the number of neurons in each of them." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g7Uxk3j_ndFX" + }, + "source": [ + "N_I= X_Boston.shape[1]\n", + "N_H1= 32\n", + "N_H2= 32\n", + "N_H3= 32\n", + "N_O= 1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fF3Eb_5dp2PZ" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of the results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X48MWaa_p2Pb" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mHbjgPT7nP-q" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j8s-XRqdbuXP" + }, + "source": [ + "**Notes**: \n", + "\n", + "To avoid problems related to 
_overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for initializing and constraining the weights $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + "  * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + "  * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu', and kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + "  * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + "  * kernel_initializer= tf.keras.initializers.GlorotNormal()."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3HcNOQoFnR3W" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H2, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H3, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O)) # If the activation parameter is not set, it defaults to 'linear'.\n", + "\n", + "# Summary of the Neural Network architecture\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JBolPrXZnZ5i" + }, + "source": [ + "#### 8.5. Compile the Neural Network (_Fine tuning_ the Neural Network)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rMBCiUTC1W2H" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YlBccXXmnZ5k" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = tf.keras.metrics.MeanSquaredError()\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SIOA5UFfnZ5p" + }, + "source": [ + "#### 8.6. 
Fit the Neural Network (_Fine tuning_ the Neural Network)\n", + "\n", + "Note: The callbacks option below implements the concept of _early stopping_. This option stops the training of the Neural Network before the maximum number of _epochs_ is reached, as soon as the model stops improving as measured by the val_loss metric. The parameter _patience_= k means that the optimization process stops after k consecutive _epochs_ without any observed improvement in the performance of the Neural Network." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ktlrSmGQ0Qrq" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kM5x90ArnZ5r" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 500, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AfxvOccmnZ5z" + }, + "source": [ + "#### 8.7. Evaluate the performance of the Neural Network (_Fine tuning_ the Neural Network)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4XZmb9zIy1Xf" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "belFKJQSnZ51" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7whymUw5VTyq" + }, + "source": [ + "### 10. Conclusions\n", + "\n", + "The performance of the Neural Network improved slightly, as shown by the lower MSE."
+ ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB19_Redes_Neurais_hs2.ipynb b/Notebooks/NB19_Redes_Neurais_hs2.ipynb new file mode 100644 index 000000000..a59ab8885 --- /dev/null +++ b/Notebooks/NB19_Redes_Neurais_hs2.ipynb @@ -0,0 +1,9977 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "accelerator": "TPU", + "colab": { + "name": "NB19_Redes_Neurais__V2.ipynb", + "provenance": [], + "collapsed_sections": [], + "include_colab_link": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ShVXyGj9wkgN" + }, + "source": [ + "

ARTIFICIAL NEURAL NETWORKS (COMPREHENSIVE GUIDE)

\n", + "\n", + "# Why do Data Scientists want to learn and master Neural Networks?\n", + "\n", + "* Neural Networks can learn, model and solve the complex, non-linear problems found in real life.\n", + "* You have probably heard of Artificial Intelligence, _self-driving cars_, _Deep Learning_, _Computer Vision_ and _Natural Language Processing_ (NLP). All of these subjects are closely related to Neural Networks. For example, _Deep Learning_ refers to Neural Networks with many _Hidden Layers_.\n", + "\n", + "This course covers the main topics you need to master Neural Networks. In addition, we will discuss best practices and address the students' most common questions about Neural Networks. By the end of this course you will therefore be able to:\n", + "\n", + "* develop your own Neural Networks;\n", + "* apply the right algorithm to each type of problem;\n", + "* apply the right activation functions to each type of problem and layer;\n", + "* learn as much Tensorflow/Keras as Neural Networks require;\n", + "* learn the Python/NumPy commands needed to develop Neural Networks;\n", + "* apply the ideal metric to each type of problem;\n", + "* understand how Neural Networks learn (_Backpropagation_).\n", + "\n", + "# **AGENDA**\n", + "\n", + "* Introduction to Neural Networks;\n", + "* _Activation Function_;\n", + "* _Loss Function_;\n", + "* Metrics for measuring the performance of Neural Networks;\n", + "* _Dropout_;\n", + "* _Backpropagation_;\n", + "* _Gradient Descent_;\n", + "* _Perceptron_ (Neural Networks with a single layer);\n", + "* Example 1: _Perceptron_ Neural Networks for the logical operators AND, OR and XOR;\n", + "* Multilayer Neural Networks;\n", + "* Practical example: Neural Network to identify sex from weight and height;\n", + "* Neural Network applications:\n", + "  * Application 1 - Neural Network to identify flower species (Iris Dataframe);\n", + "  * Application 2 - Neural 
Network to identify the type of wine (_Red or White_);\n", + "  * Application 3 - Neural Network to identify Breast Cancer (_Breast Cancer_ Dataframe);\n", + "  * Application 4 - Neural Network to identify Diabetes (Diabetes Dataframe);\n", + "  * Application 5 - Neural Network to predict house prices in Boston (_Boston House Price Prediction_)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYQ4cDfcPu4e" + }, + "source": [ + "___\n", + "# **NOTES AND REMARKS**\n", + "\n", + "1. Consider using StratifiedKFold;\n", + "2. Insert here the counterfeit banknotes example sent by Mónica;\n", + "3. Leave one of the solved items as an exercise.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QgX6n2VDyY1O" + }, + "source": [ + "___\n", + "# **REFERENCES**\n", + "- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html);\n", + "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721);\n", + "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html);\n", + "- [Forward propagation in neural networks — Simplified math and code version](https://towardsdatascience.com/forward-propagation-in-neural-networks-simplified-math-and-code-version-bbcfef6f9250);\n", + "- [Understanding Neural Networks: From Activation Function To Back Propagation](https://medium.com/fintechexplained/neural-networks-activation-function-to-back-propagation-understanding-neural-networks-bdd036c3f29f);\n", + "- [Understanding Gradient Descent](https://medium.com/analytics-vidhya/understanding-gradient-descent-8dd88a4c60e6) - Explains in detail how _Gradient Descent_ works when optimizing the weights $W$;\n", + "- [Backpropagation step by step](https://medium.com/swlh/backpropagation-step-by-step-13f2b6c0b414) - I used this article to readjust the 
weights $W$;\n", + "- [Perceptron Learning Algorithm: A Graphical Explanation Of Why It Works](https://towardsdatascience.com/perceptron-learning-algorithm-d5db0deab975);\n", + "- [Math behind Perceptrons](https://medium.com/@iamask09/math-behind-perceptrons-7241d5dadbfc);\n", + "- [Neural Network: A Complete Beginners Guide from Scratch](https://medium.com/gadictos/neural-network-a-complete-beginners-guide-from-scratch-cf1fc9d5cd12);\n", + "- [Calculating the Backpropagation of a Network](https://medium.com/towards-artificial-intelligence/calculating-back-propagation-of-a-network-1febbcaa2b5d);\n", + "- [Let’s build a simple Neural Net!](https://becominghuman.ai/lets-build-a-simple-neural-net-f4474256647f) - The author builds a simple Neural Network without _Hidden Layers_;\n", + "- [Coding Neural Network — Forward Propagation and Backpropagtion](https://towardsdatascience.com/coding-neural-network-forward-propagation-and-backpropagtion-ccf8cf369f76);\n", + "- [The Simplest Neural Network: Understanding the non-linearity](https://towardsdatascience.com/the-simplest-neural-network-understanding-the-non-linearity-10846d7d0141) - Great text for understanding non-linearity;\n", + "- [Implementing the XOR Gate using Backpropagation in Neural Networks](https://towardsdatascience.com/implementing-the-xor-gate-using-backpropagation-in-neural-networks-c1f255b4f20d) - I used this text to solve the XOR problem;\n", + "- [Neural Representation of AND, OR, NOT, XOR and XNOR Logic Gates (Perceptron Algorithm)](https://medium.com/@stanleydukor/neural-representation-of-and-or-not-xor-and-xnor-logic-gates-perceptron-algorithm-b0275375fea1) - I used this material to solve the AND, OR and XOR operator problems;\n", + "- [Solving XOR with a single Perceptron](https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182);\n", + "- [Machine Learning 101 — Artificial Neural 
Networks](https://towardsdatascience.com/machine-learning-101-artificial-neural-networks-3-46ccb04cba30) - Step-by-step calculations;\n", + "- [Neural Network from scratch in Python](https://towardsdatascience.com/math-neural-network-from-scratch-in-python-d6da9f29ce65) - This article shows the mathematics behind Neural Networks;\n", + "- [Classical Neural Net: Why/Which Activations Functions?](https://towardsdatascience.com/classical-neural-net-why-which-activations-functions-401159ba01c4) - Article discussing the main activation functions;\n", + "- [Understanding Activation Functions in Neural Networks](https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0);\n", + "- [Mind: How to Build a Neural Network (Part One)](https://becominghuman.ai/mind-how-to-build-a-neural-network-part-one-67b6aea4ce20);\n", + "- [How to build a simple Neural Network from scratch with Python](https://towardsdatascience.com/how-to-build-a-simple-neural-network-from-scratch-with-python-9f011896d2f3);\n", + "- [Comparison of Activation Functions for Deep Neural Networks](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a);\n", + "- [MAE and RMSE — Which Metric is Better?](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d) - Great article, as it discusses which metric is better.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2StZkTpOZbYo" + }, + "source": [ + "___\n", + "# **MACHINE LEARNING DEVELOPMENT LIFECYCLE**\n", + "\n", + "CRISP-DM stands for _Cross Industry Standard Process for Data Mining_: a set of processes or phases for developing _Data Mining_ projects, widely used by Data Scientists to develop predictive models.\n", + "\n", + "\"Drawing\"\n", + "\n", + "Source: [The steps to a successful machine learning 
project](https://emba.epfl.ch/2018/04/10/steps-successful-machine-learning-project/)\n", + "\n", + "I suggest reading the article [Why using CRISP-DM will make you a better Data Scientist](https://towardsdatascience.com/why-using-crisp-dm-will-make-you-a-better-data-scientist-66efe5b72686)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TsCbZd2epfxo" + }, + "source": [ + "___\n", + "# **INTRODUCTION TO NEURAL NETWORKS**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HqqB2vaHXMGt" + }, + "source": [ + "* Neural Networks learn from past experience, imitating how human neurons work during the learning process;\n", + "* They are widely used in the following situations (applications):\n", + "  * Facial Recognition;\n", + "  * Natural Language Processing (NLP);\n", + "  * _Self-driving cars_;\n", + "  * Computer Vision;\n", + "  * Pattern detection (diseases, tumors, etc.) in images;\n", + "* They are ideal for scenarios with large amounts of data (_Big Data_) and for solving complex problems." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VzylPHA7BP0x" + }, + "source": [ + "___\n", + "# **_PERCEPTRON_** (Neural Network with a single layer)\n", + "\n", + "* The **_PERCEPTRON_** is a _Machine Learning_ algorithm of the _Supervised Learning_ class for binary classification, invented in 1958 by Frank Rosenblatt;\n", + "* It is the simplest Neural Network architecture there is, with a single layer;\n", + "\n", + "**Hence an important question**: If the _Perceptron_ is a simple type of Neural Network born in the 1950s, why should we study it? Why not focus on more complex and current Neural Networks?\n", + "\n", + "**And the answer is**: because the _Perceptron_ lets us clearly understand the mathematical aspects of Neural Networks. In other words, once we understand _Perceptrons_, other types of Neural Networks become easier to understand."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M5YNraza6jum" + }, + "source": [ + "## Perceptron example\n", + "\n", + "Below is the architecture of the _Perceptron_: several inputs (_Inputs_) and 1 binary output layer (_Output Layer_) that returns 0 or 1.\n", + "\n", + "* OL stands for **O**utput **L**ayer ==> the value we want to estimate, that is, $\\hat{y}$.\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_5LVgImx78xY" + }, + "source": [ + "The **ACTIVATION FUNCTION** $f(S)$ above is known as the **_STEP FUNCTION_** and, as we can see, it returns a binary response (0 or 1) that depends on the value of $S$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "84zFWve4FkcY" + }, + "source": [ + "Below is an implementation using NumPy:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "htVV-GpgBnw3" + }, + "source": [ + "[**Python**] - Import NumPy:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xBYyZ5ZiByH4" + }, + "source": [ + "import numpy as np" + ], + "execution_count": 1, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-8sR77a4B8Uf" + }, + "source": [ + "[**Python**] - Set the number of decimal places to 3:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gj2dioDTaZl-" + }, + "source": [ + "np.set_printoptions(precision = 3)" + ], + "execution_count": 2, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oZ6Sw4uuCggF" + }, + "source": [ + "[**Python**] - Define the weights $W$ and the inputs $X$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "z2m6BxQ_DLFV" + }, + "source": [ + "# Weights W:\n", + "W = np.array([0.1, 0.3, 0.2, 0.4])\n", + "\n", + "# Inputs X:\n", + "X = np.array([1, -3, 2, 3])" + ], + "execution_count": 3, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lBnZP5MKCg8m" + }, + "source": [ + "[**Python**] - Write the sum function $S$:" + ] 
+ }, + { + "cell_type": "code", + "metadata": { + "id": "dMGuWhAhDaim" + }, + "source": [ + "def Soma(X, W):\n", + "    S = X.dot(W) # Computes: S = X1*W1 + X2*W2 + X3*W3 + X4*W4\n", + "    return S" + ], + "execution_count": 4, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EMxMJ05kDhmi" + }, + "source": [ + "[**Python**] - Write the _Step Function_ activation function $f(S)$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dRLYPJl0aZmg" + }, + "source": [ + "def ativacao_StepFunction(S):\n", + "    if S >= 1:\n", + "        return 1\n", + "    else:\n", + "        return 0" + ], + "execution_count": 5, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H4g85O2jDu6S" + }, + "source": [ + "[**Python**] - Compute `S = Soma(X, W)`:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zoUMvvzlaZm-", + "outputId": "3d7a4be8-fe0a-4fb4-b8ce-d5a49c14bf3e", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "S = Soma(X, W)\n", + "S" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.8000000000000003" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6LzlyDNaD5yB" + }, + "source": [ + "[**Python**] - Compute $f(S)$, that is, `f = ativacao_StepFunction(S)`:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6IIe4vIjaZnE", + "outputId": "25b64d72-d559-4458-8131-9d33b760e560", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "f = ativacao_StepFunction(S)\n", + "f" + ], + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UrRG8e8dDTc_" + }, + "source": [ + "# **EXAMPLE 1: DEVELOPING A NEURAL NETWORK 
_PERCEPTRON_ FOR THE LOGICAL OPERATORS AND, OR AND XOR**\n", + "\n", + "The examples below were inspired by and adapted from:\n", + "* [Perceptron: Theory and Practice](https://medium.com/data-alchemist/perceptron-theory-and-practice-e71733ed3fa5)\n", + "* [The Perceptron — A Building Block of Neural Networks](https://blog.usejournal.com/the-perceptron-the-building-block-of-neural-networks-5a428d3f451d) - This article shows the calculations in detail\n", + "* [Mind: How to Build a Neural Network (Part One)](https://becominghuman.ai/mind-how-to-build-a-neural-network-part-one-67b6aea4ce20)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qeZBP3TQN2_1" + }, + "source": [ + "## Example 1.1: _Perceptron_ Neural Network for the logical operator AND\n", + "\n", + "Consider the following dataframe:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 0 |\n", + "| 2 | 1 | 0 | 0 |\n", + "| 3 | 1 | 1 | 1 |\n", + "\n", + "The dataframe above represents the logical operator AND (https://en.wikipedia.org/wiki/Truth_table):\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$)|\n", + "|---|---|---|---|\n", + "| 0 | F | F | F |\n", + "| 1 | F | T | F |\n", + "| 2 | T | F | F |\n", + "| 3 | T | T | T |\n", + "\n", + "\n", + "Consider $W= [W_{1}, W_{2}]= [0, 0]$ as the initial weights and the _Step Function_ activation function $f(S)$ given below:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "psJh-MUgFAge" + }, + "source": [ + "Below are the manual calculations for the first iteration:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P8x3EvFUQBsU" + }, + "source": [ + "The errors $E_{i}$ are computed with the formula: $E_{i}= ActualValue_{i} - ComputedValue_{i}= y_{i}-\\hat{Y}_{i}$. 
Below is a summary of the calculations:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d5dryrbGBesj" + }, + "source": [ + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) | Sum | ComputedValue ($\\hat{Y}_{i}$) | Error |\n", + "|---|---|---|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 0 | 0 | 0 | 0 |\n", + "| 2 | 1 | 0 | 0 | 0 | 0 | 0 |\n", + "| 3 | 1 | 1 | 1 | 0 | 0 | 1 |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkcRy2RYGLVw" + }, + "source": [ + "### Total Error ($E_{T}$)\n", + "\n", + "$$E_{T}= \\sum_{i=1}^{n}E_{i}= E_{1}+E_{2}+...+E_{n}$$\n", + "\n", + "In our case, $E_{T}= 0+0+0+1= 1$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fzVxmr9OTfGB" + }, + "source": [ + "### Formula for adjusting the weights $W$\n", + "The following formula will be used to adjust the weights $W$:\n", + "\n", + "$$W_{n+1}= W_{n} + \\alpha*(X*E_{T})$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z1bEDxMhToIj" + }, + "source": [ + "### Learning Rate ($\\alpha$)\n", + "* $\\alpha$ is the _Learning Rate_ and controls how fast the Neural Network learns.\n", + "  * The SMALLER the value of $\\alpha$ $\\Longrightarrow$ the slower and longer the convergence to the global minimum;\n", + "  * The LARGER the value of $\\alpha$ $\\Longrightarrow$ the faster the convergence to a minimum, **but with no guarantee of convergence to the global minimum**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "drGfgCIZY4aV" + }, + "source": [ + "To adjust the weights $W$, we will use $\\alpha= 0.1$. 
Formula:\n", + "\n", + "$$W_{n+1}= W_{n} + \\alpha*(X*E_{T})$$\n", + "\n", + "Below are the new weights $W$ for the next iteration of the _Perceptron_ Neural Network, computed with the inputs $X_{1}= X_{2}= 1$ of the only misclassified sample ($i= 3$) and $E_{T}= 1$:\n", + "\n", + "\\begin{align}\n", + "W_{1}&= 0+ 0.1*1*1= 0.1 \\\\\n", + "W_{2}&= 0+ 0.1*1*1= 0.1 \\\\\n", + "\\end{align}\n", + "\n", + "Therefore, in the next iteration we will use the weights $W= [W_{1}, W_{2}]= [0.1, 0.1]$. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "33xLPLo-Pq0Y" + }, + "source": [ + "Below are the manual calculations for the second iteration of the Neural Network:\n", + "\n", + "_Step Function_ activation function $f(S)$:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i3EsH8pN9wJ6" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mZiiOu1AyW2N" + }, + "source": [ + "Below is a summary of the calculations for the second iteration:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ActualValue ($y_{i}$) | Sum | ComputedValue ($\\hat{Y}_{i}$) | Error |\n", + "|---|---|---|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 | 0.0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 0 | 0.1 | 0 | 0 |\n", + "| 2 | 1 | 0 | 0 | 0.1 | 0 | 0 |\n", + "| 3 | 1 | 1 | 1 | 0.2 | 0 | 1 |\n", + "\n", + "Hence, $E_{T}= 0+0+0+1= 1$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MAXO38uqUobn" + }, + "source": [ + "### Adjusting the weights $W$\n", + "Formula for adjusting $W$:\n", + "\n", + "$$W_{n+1}= W_{n} + \\alpha*(X*E_{T})$$\n", + "\n", + "Below are the new weights $W$ for the next iteration of the _Perceptron_ Neural Network:\n", + "\n", + "\\begin{align}\n", + "W_{1}&= 0.1+ 0.1*1*1= 0.2 \\\\\n", + "W_{2}&= 0.1+ 0.1*1*1= 0.2 \\\\\n", + "\\end{align}\n", + "\n", + "Therefore, in the next iteration we will use the weights $W= [W_{1}, W_{2}]= [0.2, 0.2]$. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WX48iRa5VLyk" + }, + "source": [ + "This iterative process is repeated until we find the weights $W$ that give 100% accuracy. 
A título de exemplo, considere $W= [W_{1}, W_{2}]= [0.5, 0.5]$:\n", + "\n", + "Função de ativação $f(S)$ _Step Function_:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LZfroCc994oz" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "McKYXohzXzzA" + }, + "source": [ + "Como podem ver, o Erro Total $E_{T}= 0$, pois temos 100% de acertos (acurácia) usando $W= [W_{1}, W_{2}]= [0.5, 0.5]$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wp_tR7h0btDm" + }, + "source": [ + "### Implementar o **_PERCEPTRON_** no Python usando NumPy" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3ix5vCKaEWdx" + }, + "source": [ + "[**Python**] - Importar NumPy:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "x62R_y89ElPA" + }, + "source": [ + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SYvLGlgZEXWu" + }, + "source": [ + "[**Python**] - Definir o número de casas decimais:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yEScd0_LEtJc" + }, + "source": [ + "np.set_printoptions(precision = 3)" + ], + "execution_count": 8, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P8hLz6GAEYCo" + }, + "source": [ + "[**Python**] - Definir os pesos $W$, entradas (_inputs_) $X$ e Output $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fD66QeoqXEU3" + }, + "source": [ + "# Pesos W:\n", + "W = np.array([0.0, 0.0])\n", + "\n", + "# Entradas X:\n", + "X = np.array([[0, 0], [0,1], [1, 0], [1, 1]])\n", + "\n", + "# Output Y:\n", + "Y = np.array([[0], [0], [0], [1]])" + ], + "execution_count": 36, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "alRRwxsUvIU6", + "outputId": "b82c65a6-eba4-47c9-e1c6-c40ae5b90ee4", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X" + ], + 
"execution_count": 37, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0, 0],\n", + " [0, 1],\n", + " [1, 0],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VB5n2WNUvND3", + "outputId": "21cb912b-cf9e-425d-caba-27b57e71875d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Y" + ], + "execution_count": 39, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + " [0],\n", + " [0],\n", + " [1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 39 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2jH1EMfdEYwN" + }, + "source": [ + "[**Python**] - Definir a Taxa de Aprendizagem $\\alpha$:\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zd2k0S-BXEU_" + }, + "source": [ + "alpha = 0.1" + ], + "execution_count": 40, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yvGa7d8LEZD2" + }, + "source": [ + "[**Python**] - Desenvolver a função para treinar a Rede Neural\n", + "> Esta função tenta encontrar os pesos $W$ que levem a 100% de acurácia." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JVl0XwBuXEVC" + }, + "source": [ + "def Treinar_RNA(X, Y, W, alpha):\n", + "    ET= 1 # ET= Erro Total (soma dos erros absolutos de uma passada pelos dados)\n", + "    N= 0 # N= número de passadas já executadas (limite: 100)\n", + "    while ((ET != 0) and (N < 100)):\n", + "        ET= 0\n", + "        for i in range(len(Y)):\n", + "            S = X[i].dot(W)\n", + "            f = ativacao_StepFunction(S)\n", + "            E= Y[i]-f\n", + "            ET+= abs(E) # |E| evita que erros de sinais opostos se cancelem\n", + "            # O ajuste é feito linha a linha, usando o erro E da linha i:\n", + "            for j in range(len(W)):\n", + "                W[j]= W[j] + alpha*(X[i][j]*E)\n", + "                print(f'i: {i} | j: {j} | X: {X[i][j]} | E: {E} | Peso Ajustado W[{j}]: {W[j]}') #incluído hs\n", + "                # print(f'Peso Ajustado: {W[j]}') # comentado hs\n", + "        print(f'Erro Total: {ET}')\n", + "        N+= 1" + ], + "execution_count": 35, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pdI7EHnFF4yo" + }, + "source": [ + "[**Python**] - Evocar a função Treinar_RNA:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gHM5tXEdXEVF", + "outputId": "c2f4083a-61eb-4e8e-beea-afb748db1292", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Treinar_RNA(X, Y, W, alpha)" + ], + "execution_count": 41, + "outputs": [ + { + "output_type": "stream", + "text": [ + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.0\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.0\n", + "i: 1 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.0\n", + "i: 1 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.0\n", + "i: 2 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.0\n", + "i: 2 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.0\n", + "i: 3 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.1\n", + "i: 3 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.1\n", + "Erro Total: [1]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.1\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.1\n", + "i: 1 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.1\n", + "i: 1 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.1\n", + "i: 2 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.1\n", + "i: 2 | j: 1 | X: 0 | E: [0] | 
Peso Ajustado W[1]: 0.1\n", + "i: 3 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.2\n", + "i: 3 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.2\n", + "Erro Total: [1]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.2\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.2\n", + "i: 1 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.2\n", + "i: 1 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.2\n", + "i: 2 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.2\n", + "i: 2 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.2\n", + "i: 3 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.30000000000000004\n", + "i: 3 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.30000000000000004\n", + "Erro Total: [1]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.30000000000000004\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.30000000000000004\n", + "i: 1 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.30000000000000004\n", + "i: 1 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.30000000000000004\n", + "i: 2 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.30000000000000004\n", + "i: 2 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.30000000000000004\n", + "i: 3 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.4\n", + "i: 3 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.4\n", + "Erro Total: [1]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.4\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.4\n", + "i: 1 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.4\n", + "i: 1 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.4\n", + "i: 2 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.4\n", + "i: 2 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.4\n", + "i: 3 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.5\n", + "i: 3 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.5\n", + "Erro Total: [1]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.5\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.5\n", + "i: 1 | j: 0 | X: 0 | E: [0] | Peso 
Ajustado W[0]: 0.5\n", + "i: 1 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.5\n", + "i: 2 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.5\n", + "i: 2 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.5\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.5\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.5\n", + "Erro Total: [0]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TPKEML9cDD0E" + }, + "source": [ + "## Exemplo 1.2: Rede Neural _Perceptron_ para o Operador Lógico OU\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rSQnOjDWC7Ta" + }, + "source": [ + "Considere o dataframe a seguir:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 1 |\n", + "\n", + "O dataframe acima representa o operador lógico OU (https://en.wikipedia.org/wiki/Truth_table):\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$)|\n", + "|---|---|---|---|\n", + "| 0 | F | F | F |\n", + "| 1 | F | T | T |\n", + "| 2 | T | F | T |\n", + "| 3 | T | T | T |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kID13PxSGN6h" + }, + "source": [ + "[**Python**] - Definir os pesos $W$, entradas (_inputs_) $X$ e Output $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CmuuIX2PGN6l" + }, + "source": [ + "# Pesos W:\n", + "W = np.array([0.0, 0.0])\n", + "\n", + "# Entradas X:\n", + "X = np.array([[0, 0], [0,1], [1, 0], [1, 1]])\n", + "\n", + "# Output Y:\n", + "Y = np.array([[0], [1], [1], [1]])" + ], + "execution_count": 42, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "UDzdS6FX2LOC", + "outputId": "c52d5e55-3dd8-4630-92d2-e48b02ec7fc5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "X" + ], + "execution_count": 43, + "outputs": [ + { + "output_type": "execute_result", + "data": { + 
"text/plain": [ + "array([[0, 0],\n", + " [0, 1],\n", + " [1, 0],\n", + " [1, 1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 43 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ar0dk1eQ2MOD", + "outputId": "40dbead6-ef86-4e84-f6f2-d2e78ce400a8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Y" + ], + "execution_count": 44, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[0],\n", + " [1],\n", + " [1],\n", + " [1]])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 44 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "agZX698KGeVK" + }, + "source": [ + "[**Python**] - Evocar a função Treinar_RNA:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3GF_W4u0GeVM", + "outputId": "ca418c8c-3117-4587-e27a-19bcaa5f38ec", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "Treinar_RNA(X, Y, W, alpha)" + ], + "execution_count": 45, + "outputs": [ + { + "output_type": "stream", + "text": [ + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.0\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.0\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.0\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.1\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.1\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 0.1\n", + "i: 3 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.2\n", + "i: 3 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.2\n", + "Erro Total: [3]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.2\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.2\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.2\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.30000000000000004\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.30000000000000004\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 
0.30000000000000004\n", + "i: 3 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.4\n", + "i: 3 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.4\n", + "Erro Total: [3]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.4\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.4\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.4\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.5\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.5\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 0.5\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.5\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.5\n", + "Erro Total: [2]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.5\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.5\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.5\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.6\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.6\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 0.6\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.6\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.6\n", + "Erro Total: [2]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.6\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.6\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.6\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.7\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.7\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 0.7\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.7\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.7\n", + "Erro Total: [2]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.7\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.7\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.7\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.7999999999999999\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso 
Ajustado W[0]: 0.7999999999999999\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 0.7999999999999999\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.7999999999999999\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.7999999999999999\n", + "Erro Total: [2]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.7999999999999999\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.7999999999999999\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.7999999999999999\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.8999999999999999\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.8999999999999999\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 0.8999999999999999\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.8999999999999999\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.8999999999999999\n", + "Erro Total: [2]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.8999999999999999\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.8999999999999999\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.8999999999999999\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 0.9999999999999999\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 0.9999999999999999\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 0.9999999999999999\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 0.9999999999999999\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 0.9999999999999999\n", + "Erro Total: [2]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 0.9999999999999999\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 0.9999999999999999\n", + "i: 1 | j: 0 | X: 0 | E: [1] | Peso Ajustado W[0]: 0.9999999999999999\n", + "i: 1 | j: 1 | X: 1 | E: [1] | Peso Ajustado W[1]: 1.0999999999999999\n", + "i: 2 | j: 0 | X: 1 | E: [1] | Peso Ajustado W[0]: 1.0999999999999999\n", + "i: 2 | j: 1 | X: 0 | E: [1] | Peso Ajustado W[1]: 
1.0999999999999999\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 1.0999999999999999\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 1.0999999999999999\n", + "Erro Total: [2]\n", + "i: 0 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 1.0999999999999999\n", + "i: 0 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 1.0999999999999999\n", + "i: 1 | j: 0 | X: 0 | E: [0] | Peso Ajustado W[0]: 1.0999999999999999\n", + "i: 1 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 1.0999999999999999\n", + "i: 2 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 1.0999999999999999\n", + "i: 2 | j: 1 | X: 0 | E: [0] | Peso Ajustado W[1]: 1.0999999999999999\n", + "i: 3 | j: 0 | X: 1 | E: [0] | Peso Ajustado W[0]: 1.0999999999999999\n", + "i: 3 | j: 1 | X: 1 | E: [0] | Peso Ajustado W[1]: 1.0999999999999999\n", + "Erro Total: [0]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u2dZAVVFEpCw" + }, + "source": [ + "## Exemplo 1.3: Rede Neural _Perceptron_ para o Operador Lógico XOR\n", + "\n", + "Problema proposto e demonstrado por Rumelhart et al. (1985)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EaZIyvvEEpC5" + }, + "source": [ + "Considere o dataframe a seguir:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |\n", + "\n", + "O dataframe acima representa o operador lógico XOR (https://pt.wikipedia.org/wiki/Ou_exclusivo):\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | F | F | F |\n", + "| 1 | F | T | T |\n", + "| 2 | T | F | T |\n", + "| 3 | T | T | F |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7rc3hc2RGneF" + }, + "source": [ + "[**Python**] - Definir os pesos $W$, entradas (_inputs_) $X$ e Output $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "u8fAgk3RGneH" + }, + "source": [ + "# Pesos W:\n", + "W = np.array([0.0, 0.0])\n", + "\n", + "# Entradas X:\n", + "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n", + "\n", + "# Output Y:\n", + "Y = np.array([[0], [1], [1], [0]])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tFKaIhua3Mr6" + }, + "source": [ + "X" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "pm-X-dXX3NZW" + }, + "source": [ + "Y" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "znRL2XozGneM" + }, + "source": [ + "[**Python**] - Evocar a função Treinar_RNA:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "j8leYHZVGneM" + }, + "source": [ + "Treinar_RNA(X, Y, W, alpha)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Eu5cVvxM60i" + }, + "source": [ + "## Por que conseguimos pesos $W$ para os Operadores Lógicos E e OU e não para XOR?\n", + "\n", + "* Operadores E e OU: Linearmente Separáveis;\n", + "* Operador 
XOR: Linearmente NÃO-Separável.\n", + "\n", + "[Lucas Araújo](https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182) diz em seu artigo [Solving XOR with a single Perceptron](https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182) que:\n", + "\n", + "\"Everyone who has ever studied about neural networks has probably already read that a single perceptron can’t represent the boolean XOR function. The book Artificial Intelligence: A Modern Approach, the leading textbook in AI, says: “[XOR] is not linearly separable so the perceptron cannot learn it”.\"\n", + "\n", + "As figuras abaixo ilustram claramente os conceitos \"linearmente separável\" e \"NÃO linearmente separável\"." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oUrFCMUjFtR1" + }, + "source": [ + "### Representação gráfica do Operador Lógico E\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 0 |\n", + "| 2 | 1 | 0 | 0 |\n", + "| 3 | 1 | 1 | 1 |\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n9v07MdMF42e" + }, + "source": [ + "### Representação gráfica do Operador Lógico OU\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 1 |\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "56Qp1J6LGBe9" + }, + "source": [ + "### Representação gráfica do Operador Lógico XOR\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eaQm7zZJbNAc" + }, + "source": [ + "___\n", + "# **O QUE APRENDEMOS ATÉ AQUI?**\n", + "\n", +
"* Redes Neurais tentam ajustar os pesos $W$ para tentar melhorar a taxa de acerto. Ou seja, a Rede Neural aprende com os dados através do ajuste iterativo dos pesos $W$;\n", + "* Treinar uma Rede Neural é uma tarefa computacionalmente intensivo, pois o algoritmo tenta encontrar os pesos $W$ que apresentam melhor acurácia. Para um dataframe grande, o custo conputacional do aprendizado pode ser alto." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f_T35rXZOB4G" + }, + "source": [ + "___\n", + "# **REDES NEURAIS MULTICAMADA**\n", + "\n", + "* Pelo menos 1 _Hidden Layer_. Observe a Rede Neural a seguir contendo 20 neurônios distribuídos da seguinte forma:\n", + "\n", + " * Número de neurônios na camada de entrada (_Input Layer_): 4;\n", + " * 3 camadas escondidas (_Hidden Layers_) com 5 neurônios cada, totalizando 15 neurônios:\n", + " * Número de neurônios na _Hidden Layer 1_: 5;\n", + " * Número de neurônios na _Hidden Layer 2_: 5;\n", + " * Número de neurônios na _Hidden Layer 3_: 5;\n", + " * Número de neurônios na camada de saída (_Output Layer_): 1;\n", + "* _Fully connected layer_." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1dXBXuh2-Tuo" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BK4O_Y_l2vev" + }, + "source": [ + "## Função _Sigmoid_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M_nn8zELXEVf" + }, + "source": [ + "\"Drawing\"\n", + "\n", + "Consulte [e (constante matemática)](https://pt.wikipedia.org/wiki/E_(constante_matem%C3%A1tica)) para saber mais sobre a constante de Euler." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kOWwWR7hOmir" + }, + "source": [ + "## Número de _Hidden Layers_\n", + "\n", + "Pesquisadores apontam que 1 única _Hidden Layer_ é suficiente para a grande maioria dos problemas e que usualmente cada _Hidden Layer_ possui o mesmo número de neurônios. 
Experimentos mostram que mais _Hidden Layers_ implicam maior tempo para treinar a Rede Neural. No entanto, [Heaton Research](https://www.heatonresearch.com/2017/06/01/hidden-layers.html) mostra que:\n", + "\n", + "![Determinining_number_Hidden_Layers](https://github.com/MathMachado/Materials/blob/master/Determinining_number_Hidden_Layers.png?raw=true)\n", + "\n", + "Fonte: [Heaton Research](https://www.heatonresearch.com/2017/06/01/hidden-layers.html).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u4_1JCbcPRrn" + }, + "source": [ + "## Número de neurônios na camada de entrada (_Input Layer_): $N_{I}$\n", + "\n", + "$N_{I}$= Número de colunas de entrada (variáveis preditoras) no dataframe." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fk-lhwhffUZz" + }, + "source": [ + "## Número de neurônios na camada de saída (_Output Layer_): $N_{O}$\n", + "\n", + "* Se a Rede Neural resolve um problema de regressão, então o número de neurônios na _Output Layer_ é 1, pois o _output_ de uma regressão é um único valor;\n", + "* Se a Rede Neural resolve um problema de classificação e usamos uma função de ativação probabilística (como _softmax_, por exemplo), então o número de neurônios na _Output Layer_ é igual ao número de classes que queremos prever. Por exemplo, no problema de classificar espécies no dataframe IRIS temos 3 espécies (versicolor, virginica e setosa). Ao utilizarmos a função de ativação _softmax_, teremos então 3 neurônios na _Output Layer_." 
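, + "\n", + "\n", + "Apenas como referência, segue a definição usual da função _softmax_ para uma _Output Layer_ com $K$ neurônios (os símbolos $z_{k}$ e $K$ são uma notação adotada aqui por suposição; não aparecem no material original):\n", + "\n", + "$$\\text{softmax}(z)_{k}= \\frac{e^{z_{k}}}{\\sum_{j=1}^{K}e^{z_{j}}}, \\quad k= 1, ..., K$$\n", + "\n", + "No exemplo do IRIS, $K= 3$: cada neurônio da _Output Layer_ devolve a probabilidade estimada de uma espécie e as três saídas somam 1."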
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZsrrdLpSfYm9" + }, + "source": [ + "## Número de neurônios na camada escondida (_Hidden Layer_): $N_{H}$\n", + "\n", + "Determinar o número de neurônios na _Hidden Layer_ tem sido um exercício de tentativa e erro, mas alguns experimentos têm sugerido que um número adequado de neurônios na _Hidden Layer_ pode ser obtido através da expressão a seguir:\n", + "\n", + "$$N_{H}= \\frac{N_{I}+N_{O}}{2}$$\n", + "\n", + "No entanto, o artigo [How to choose the number of hidden layers and nodes in a feedforward neural network?](https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw) sugere o uso da seguinte expressão:\n", + "\n", + "$$N_{H}= \\frac{N}{\\alpha(N_{I}+N_{O})}$$\n", + "\n", + "onde $N$ é o número de instâncias (linhas) do dataframe e $\\alpha$ é um fator de escala entre 2 e 10 (não confundir com a taxa de aprendizagem $\\alpha$ usada anteriormente), sendo que alguns experimentos com $\\alpha= 2$ produzem bons modelos sem _overfitting_. Para saber mais sobre esta expressão e sobre $\\alpha$, sugiro a leitura do artigo mencionado." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Rj6WfilbShX3" + }, + "source": [ + "## Rede Neural Multicamada para o Operador Lógico XOR\n", + "\n", + "Dataframe que representa o Operador Lógico XOR:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) |\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uURlcU78LwbH" + }, + "source": [ + "### Arquitetura da Rede Neural Multicamada que vamos desenvolver para o Operador Lógico XOR\n", + "\n", + "Os pesos $W_{H}= \\begin{bmatrix} W_{H}^{(1, 1)} & W_{H}^{(1, 2)} & W_{H}^{(1, 3)} \\\\ W_{H}^{(2, 1)} & W_{H}^{(2, 2)} & W_{H}^{(2, 3)} \\end{bmatrix}$ e $W_{O}= \\begin{bmatrix} W_{O}^{(1)} \\\\ W_{O}^{(2)} \\\\ W_{O}^{(3)} \\end{bmatrix}$ serão gerados aleatoriamente. 
A seguir, a arquitetura da Rede Neural com 1 _Hidden Layer_ contendo 3 neurônios:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6XKMdlZr-e9l" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AV2eUQDuLCUL" + }, + "source": [ + "[**Python**] - Importar as bibliotecas necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uTWYP0V-LGHj" + }, + "source": [ + "import math\n", + "import numpy as np" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-DG86PgxLDQA" + }, + "source": [ + "[**Python**] - Definir o número de casas decimais:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jsvh5DOkXEVm" + }, + "source": [ + "np.set_printoptions(precision = 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NYIMcp8TLVuq" + }, + "source": [ + "[**Python**] - Definir as entradas (_inputs_) $X$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U6Mt6zTnXEVq" + }, + "source": [ + "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n", + "X" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tXLd1nZxLbXD" + }, + "source": [ + "[**Python**] - Definir os _Outputs_ $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Oauq3veAXEVu" + }, + "source": [ + "Y = np.array([[0], [1], [1], [0]])\n", + "Y" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TC1y0tO1MAU9" + }, + "source": [ + "### Gerar os pesos $W_{H}= \\begin{bmatrix} W_{H}^{(1, 1)} & W_{H}^{(1, 2)} & W_{H}^{(1, 3)} \\\\ W_{H}^{(2, 1)} & W_{H}^{(2, 2)} & W_{H}^{(2, 3)} \\end{bmatrix}$ e $W_{O}= \\begin{bmatrix} W_{O}^{(1)} \\\\ W_{O}^{(2)} \\\\ W_{O}^{(3)} \\end{bmatrix}$ aleatoriamente\n", + "\n", + "Por questões de reproducibilidade de resultados, vamos usar as sementes a 
seguir para gerar os pesos $W_{H}$ e $W_{O}$:\n", + "\n", + "* _seed_= 20111974 para gerar $W_{H}$;\n", + "* _seed_= 19741120 para gerar $W_{O}$.\n", + "\n", + "Ao usarmos estas sementes, deveremos ter $W_{H}= \\begin{bmatrix} 0.531 & 0.570 & 0.543 \\\\ 0.655 & 0.857 & 0.602 \\end{bmatrix}$ e $W_{O}= \\begin{bmatrix} 0.240 \\\\ 0.318 \\\\ 0.142 \\end{bmatrix}$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_U3Id5XXG5tw" + }, + "source": [ + "[**Python**] - Sementes para gerar $W_{H}$ (aleatoriamente)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tVXiIpgIHId9" + }, + "source": [ + "np.random.seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XYj0NYofHKkk" + }, + "source": [ + "[**Python**] - Gerar os pesos $W_{H}$ (aleatoriamente)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "o1eGsPNQXEVx" + }, + "source": [ + "W_H = np.array([np.random.random(3), np.random.random(3)])\n", + "W_H" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cj6KJnP3Hbqf" + }, + "source": [ + "[**Python**] - Sementes para gerar $W_{O}$ (aleatoriamente):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AkVw-SWSHbqh" + }, + "source": [ + "np.random.seed(19741120)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r7ZjUT4oHbqk" + }, + "source": [ + "[**Python**] - Gerar os pesos $W_{O}$ (aleatoriamente):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ebs8p8mOXEV1" + }, + "source": [ + "W_O = np.array([np.random.random(1), np.random.random(1), np.random.random(1)])\n", + "W_O" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vg1ByKjKsWcE" + }, + "source": [ + "Confira os pesos dispostos na figura a seguir:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", 
+ "metadata": { + "id": "GiEc1DwPt7Hm" + }, + "source": [ + "### Calcular $S = \\sum_{i=1}^{4}X_{i}W_{i}$ e passar o valor de $S$ para a função de ativação $f(S)$ (_Sigmoid_)\n", + "\n", + "Função _Sigmoid_:\n", + "\n", + "$$f(x)= y= \\frac{1}{1+e^{-x}}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mCZsXjIhHqId" + }, + "source": [ + "[**Python**] - Definir a função de ativação $Sigmoid$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kB4-UnOGXEV8" + }, + "source": [ + "def FuncaoAtivacao_Sigmoid(x):\n", + " y = 1/(1+np.exp(-x))\n", + " return y" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XkvMHw1KHrjT" + }, + "source": [ + "[**Python**] - Função MostraCalculos, desenvolvida para validarmos os cálculos manuais de $S$ e $f(S)$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fsxHrthYXEWA" + }, + "source": [ + "def MostraCalculos(i):\n", + " print(f'Array W:\\n {W_H}')\n", + " print('\\n')\n", + " print(f'Array X:\\n {X[i]}')\n", + " S = X[i].dot(W_H)\n", + " f = FuncaoAtivacao_Sigmoid(S)\n", + " S2= f.dot(W_O)\n", + " f2= FuncaoAtivacao_Sigmoid(S2)\n", + " \n", + " print('\\n')\n", + " print(f'*** HIDDEN LAYER ***')\n", + " print(f'Função Soma S: {S}')\n", + " print(f'Função de Ativação Sigmoid: {f}')\n", + " \n", + " print('\\n')\n", + " print(f'*** OUTPUT LAYER ***')\n", + " print(f'Função Soma S: {S2}')\n", + " print(f'Função de Ativação Sigmoid: {f2}')\n", + " \n", + " print('\\n')\n", + " print(f'*** ERRO ***')\n", + " E= Y[i]-f2\n", + " print(f'Erro da linha i= {i}: {E}')\n", + " \n", + " return f " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s80knPTzcIBy" + }, + "source": [ + "___\n", + "O Operador A.dot(B) faz o produto matricial entre os arrays A e B. Para saber mais sobre a função dot(), assista este [vídeo](https://youtu.be/Pb1VIe9657s)." 
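, + "\n", + "\n", + "Como esboço, e supondo os pesos $W_{H}$ arredondados para 3 casas decimais apresentados anteriormente, a chamada `X[1].dot(W_H)` corresponde ao produto a seguir:\n", + "\n", + "$$\\begin{bmatrix} 0 & 1 \\end{bmatrix} \\begin{bmatrix} 0.531 & 0.570 & 0.543 \\\\ 0.655 & 0.857 & 0.602 \\end{bmatrix}= \\begin{bmatrix} 0.655 & 0.857 & 0.602 \\end{bmatrix}$$\n", + "\n", + "Cada componente do resultado é a soma $S$ de um dos 3 neurônios da _Hidden Layer_ para a linha $i= 1$."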
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bw0p2m8mbz3C" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 0$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CelKhuoHISyS" + }, + "source": [ + "[**Python**] - Evocar a função f0= MostraCalculos(0):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ar0zOLuUIio1" + }, + "source": [ + "f0 = MostraCalculos(0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_R1LdY9QvTqb" + }, + "source": [ + "Observe na figura abaixo os cálculos manuais da Soma $S$, função de ativação $f(S)$ e Erro." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hl_RBLiaa4xS" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NOKtMLHoo_Yt" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "S_{H}^{(0, 1)} &= (0)(0.531)+(0)(0.655)= 0 \\Longrightarrow f_{H}^{(0, 1)}(S_{H}^{(0, 1)})= f_{H}^{(0, 1)}(0)= 0.5 \\\\\n", + "S_{H}^{(0, 2)} &= (0)(0.570)+(0)(0.857)= 0 \\Longrightarrow f_{H}^{(0, 2)}(S_{H}^{(0, 2)})= f_{H}^{(0, 2)}(0)= 0.5 \\\\\n", + "S_{H}^{(0, 3)} &= (0)(0.543)+(0)(0.602)= 0 \\Longrightarrow f_{H}^{(0, 3)}(S_{H}^{(0, 3)})= f_{H}^{(0, 3)}(0)= 0.5\n", + "\\end{align}\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Kw-cakYsQGp" + }, + "source": [ + "##### _OUTPUT LAYER_\n", + "\n", + "\\begin{align}\n", + "S_{O}^{(0)}&= (0.5)(0.24)+(0.5)(0.318)+(0.5)(0.142)= 0.35 \\\\\n", + "f_{O}^{(0)}(S_{O}^{(0)})&= f_{O}^{(0)}(0.35)= 0.587 \\\\\n", + "E_{0}&= 0-0.587= -0.587\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TFZ8w1dUdT7A" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 1$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wTz3EfAUIoz-" + }, + "source": [ + "[**Python**] - Evocar a função f1= MostraCalculos(1):" + ] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "INUDJ_aMXEWb" + }, + "source": [ + "f1 = MostraCalculos(1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I91qgS1Uh2T1" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JDuyxsKSvDds" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "S_{H}^{(1, 1)} &= (0)(0.531)+(1)(0.655)= 0.655 \\Longrightarrow f_{H}^{(1, 1)}(S_{H}^{(1, 1)})= f_{H}^{(1, 1)}(0.655)= 0.658 \\\\\n", + "S_{H}^{(1, 2)} &= (0)(0.570)+(1)(0.857)= 0.857 \\Longrightarrow f_{H}^{(1, 2)}(S_{H}^{(1, 2)})= f_{H}^{(1, 2)}(0.857)= 0.702 \\\\\n", + "S_{H}^{(1, 3)} &= (0)(0.543)+(1)(0.602)= 0.602 \\Longrightarrow f_{H}^{(1, 3)}(S_{H}^{(1, 3)})= f_{H}^{(1, 3)}(0.602)= 0.646\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nKPsQA9dvDdt" + }, + "source": [ + "##### _OUTPUT LAYER_\n", + "\n", + "\\begin{align}\n", + "S_{O}^{(1)}&= (0.658)(0.24)+(0.702)(0.318)+(0.646)(0.142)= 0.473 \\\\\n", + "f_{O}^{(1)}(S_{O}^{(1)})&= f_{O}^{(1)}(0.473)= 0.616 \\\\\n", + "E_{1}&= 1-0.616= 0.384\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IBfztHLfeoTR" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 2$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sjcpG53tIvHf" + }, + "source": [ + "[**Python**] - Evocar a função f2= MostraCalculos(2):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RbnG_WxdXEWg" + }, + "source": [ + "f2 = MostraCalculos(2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9g9MegqIh-Vn" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5gES50aaxszM" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "S_{H}^{(2, 1)} &= (1)(0.531)+(0)(0.655)= 0.531 \\Longrightarrow f_{H}^{(2, 1)}(S_{H}^{(2, 1)})= f_{H}^{(2, 
1)}(0.531)= 0.630 \\\\\n", + "S_{H}^{(2, 2)} &= (1)(0.570)+(0)(0.857)= 0.570 \\Longrightarrow f_{H}^{(2, 2)}(S_{H}^{(2, 2)})= f_{H}^{(2, 2)}(0.570)= 0.639 \\\\\n", + "S_{H}^{(2, 3)} &= (1)(0.543)+(0)(0.602)= 0.543 \\Longrightarrow f_{H}^{(2, 3)}(S_{H}^{(2, 3)})= f_{H}^{(2, 3)}(0.543)= 0.632\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o7n4Eq-6xszP" + }, + "source": [ + "##### _OUTPUT LAYER_\n", + "\n", + "\\begin{align}\n", + "S_{O}^{(2)}&= (0.630)(0.24)+(0.639)(0.318)+(0.632)(0.142)= 0.444 \\\\\n", + "f_{O}^{(2)}(S_{O}^{(2)})&= f_{O}^{(2)}(0.444)= 0.609 \\\\\n", + "E_{2}&= 1-0.609= 0.391\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cPJQKwBthCkh" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 3$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MVhEsrqJI1T7" + }, + "source": [ + "[**Python**] - Evocar a função f3= MostraCalculos(3):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qU87GWKjXEWo" + }, + "source": [ + "f3 = MostraCalculos(3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AjUGJdaYiEH0" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lKptTBkBzysP" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "S_{H}^{(3, 1)} &= (1)(0.531)+(1)(0.655)= 1.186 \\Longrightarrow f_{H}^{(3, 1)}(S_{H}^{(3, 1)})= f_{H}^{(3, 1)}(1.186)= 0.766 \\\\\n", + "S_{H}^{(3, 2)} &= (1)(0.570)+(1)(0.857)= 1.427 \\Longrightarrow f_{H}^{(3, 2)}(S_{H}^{(3, 2)})= f_{H}^{(3, 2)}(1.427)= 0.806 \\\\\n", + "S_{H}^{(3, 3)} &= (1)(0.543)+(1)(0.602)= 1.144 \\Longrightarrow f_{H}^{(3, 3)}(S_{H}^{(3, 3)})= f_{H}^{(3, 3)}(1.144)= 0.758\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ISxS131GzysR" + }, + "source": [ + "##### _OUTPUT LAYER_\n", + "\n", + "\\begin{align}\n", + "S_{O}^{(3)}&= 
(0.766)(0.24)+(0.806)(0.318)+(0.758)(0.142)= 0.548 \\\\\n", + "f_{O}^{(3)}(S_{O}^{(3)})&= f_{O}^{(3)}(0.548)= 0.634 \\\\\n", + "E_{3}&= 0-0.634= -0.634\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YR3X25venRv5" + }, + "source": [ + "### Resumo dos cálculos com _arrays_\n", + "\n", + "Os cálculos que foram realizados previamente com o NumPy _step by step_ aqui são feitos utilizando produto matricial." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n9D-5dE-I_IS" + }, + "source": [ + "[**Python**] - Funções de ativação da _Hidden Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "efO2aSu8AzMp" + }, + "source": [ + "f0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BoDRBC8oXEW0" + }, + "source": [ + "X2 = np.array([f0, f1, f2, f3])\n", + "X2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WVIwTcF1JLIm" + }, + "source": [ + "[**Python**] - Calcular a soma $S$ da _Output Layer_, dado pelo produto matricial de $X2$ por $W_{O}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ddyC0sa6XEW5" + }, + "source": [ + "S = X2.dot(W_O)\n", + "S" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NRkyUZN7Jooz" + }, + "source": [ + "[**Python**] - Função de ativação da _Output Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jadac2Q3XEW-" + }, + "source": [ + "f = FuncaoAtivacao_Sigmoid(S)\n", + "f" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZuTe0mHg8Kzk" + }, + "source": [ + "Os resultados das funções de ativação acima conferem com o resumo a seguir:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r2lHqqhmM6rd" + }, + "source": [ + "[**Python**] - Calcular os Erros" + ] + }, + { + 
"cell_type": "code", + "metadata": { + "id": "bCu8miA2XEXE" + }, + "source": [ + "E = Y - f\n", + "E" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P2Q019cxotQM" + }, + "source": [ + "Os cálculos estão resumidos na tabela a seguir:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | ValorReal ($y_{i}$) | ValorCalculado ($\\hat{Y}_{i}$) | $Erro$ | $Erro^{2}$ |\n", + "|---|---|---|---|---|:----------------------:|------------------:|\n", + "| 0 | 0 | 0 | 0 | 0.587 | -0.587 | 0.344 |\n", + "| 1 | 0 | 1 | 1 | 0.616 | 0.384 | 0.147 |\n", + "| 2 | 1 | 0 | 1 | 0.609 | 0.391 | 0.152 |\n", + "| 3 | 1 | 1 | 0 | 0.634 | -0.634 | 0.401 |\n", + "\n", + "Onde:\n", + "\n", + "$Erro= y_{i}-\\hat{Y}_{i}$= ValorReal - ValorCalculado\n", + "\n", + "O cálculo do MSE será $MSE= \\frac{0.344+0.147+0.152+0.401}{4}= \\frac{1.044}{4}= 0.261$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lHm_16jEz-kL" + }, + "source": [ + "### Métrica para avaliação da performance da Rede Neural\n", + "\n", + "* O MSE é uma das principais métricas para medir a performance das Redes Neurais. 
A seguir, o cálculo do MSE:\n", + "\n", + "$$MSE= \\frac{\\sum_{i=1}^{n}(y_{i}-\\hat{Y}_{i})^{2}}{n}= \\frac{(0-0.587)^{2}+(1-0.616)^{2}+(1-0.609)^{2}+(0-0.634)^{2}}{4}= \\frac{0.344+0.147+0.152+0.401}{4}=0.261$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D2Yo6TdpNIPW" + }, + "source": [ + "[**Python**] - Desenvolver função MSE para calcular o MSE = Erro Quadrático Médio:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EENpe-rbXEXL" + }, + "source": [ + "def MSE(Y, f):\n", + " return np.square(Y - f).mean()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ySpVD0-mNQ1s" + }, + "source": [ + "[**Python**] - Evocar a função $MSE(Y, f)$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C0L5ACZnXEXP" + }, + "source": [ + "MSE(Y, f)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xpzv12a48GhA" + }, + "source": [ + "### _Backpropagation_ - Ajuste dos pesos $W_{O}= \\begin{bmatrix} W_{O}^{(1)} \\\\ W_{O}^{(2)} \\\\ W_{O}^{(3)} \\end{bmatrix}$\n", + "\n", + "> _Backpropagation_ (ou simplesmente _Backward_) é o processo que faz com que a Rede Neural aprenda a partir da atualização iterativa dos pesos $W$. A ideia do _Backpropagation_ é que podemos melhorar a performance da Rede Neural através da calibração dos pesos $W$ usando _Gradient Descent_, de forma que os _outputs_ ($\\hat{y}_{i}$) serão cada vez mais próximos do valor real ($y_{i}$)." 
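+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the weight calibration described above relies on the derivative of the _Sigmoid_ activation function, the short derivation below (standard calculus, added as a reference sketch rather than taken from the original lesson) shows why that derivative can be written using only the output $y$:\n", + "\n", + "\\begin{align}\n", + "y &= \\frac{1}{1+e^{-x}}= (1+e^{-x})^{-1} \\\\\n", + "\\frac{dy}{dx} &= \\frac{e^{-x}}{(1+e^{-x})^{2}}= \\frac{1}{1+e^{-x}} \\cdot \\frac{e^{-x}}{1+e^{-x}} \\\\\n", + "&= y(1-y), \\quad \\text{since } 1-y= \\frac{e^{-x}}{1+e^{-x}}\n", + "\\end{align}"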
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vjBeg2TTcd40" + }, + "source": [ + "As we saw earlier, the weights $W$ are updated with the expression\n", + "\n", + "$$W_{n+1}= W_{n}*M+\\alpha \\frac{\\partial L}{\\partial W_{n}}= W_{n}*M+\\alpha*(X*\\Delta)$$\n", + "\n", + "which requires the derivative of the _Loss Function_ $L$. Because every neuron output passes through the _Sigmoid_ activation function, that derivative depends in turn on the derivative of the _Sigmoid_, whose expression is given below:\n", + "\n", + "$$y(x)= \\frac{1}{1+e^{-x}}$$\n", + "\n", + "The derivative of the _Sigmoid_ function, written in terms of its own output $y$, is:\n", + "\n", + "$$\\frac{dy}{dx}= y^{'}= y(1-y)$$\n", + "\n", + "If you have questions about the derivative of the _Sigmoid_ activation function, I suggest reading this article: [Derivative of the Sigmoid function](https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hDKJakFImuRp" + }, + "source": [ + "* $D_{O}$ is the derivative of the _Output Layer_ neuron and $D_{H}$ is the derivative of the _Hidden Layer_ neurons;\n", + "* $\\Delta_{O}= E_{i}*D_{O}$;\n", + "* $\\Delta_{H}= D_{H}* W_{O} * \\Delta_{O}$.\n", + "\n", + "Next, the derivative of the _Sigmoid_ function using NumPy:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kDnarWwwNZd0" + }, + "source": [ + "[**Python**] - Function Derivada_Sigmoid, which computes the derivative of the _Sigmoid_ function:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qSxVsNeDXEXY" + }, + "source": [ + "def Derivada_Sigmoid(y):\n", + "    # y must already be a Sigmoid output f(x), not the raw input x\n", + "    return y*(1-y)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CY6O0qkWNhby" + }, + "source": [ + "[**Python**] - Call Derivada_Sigmoid(f), that is, compute the derivative of the _Output Layer_ activations:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WTpQfBTpXEXi" + }, + "source": [ + "D_O = Derivada_Sigmoid(f)\n", + "D_O" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", +
"metadata": { + "id": "jfVdgCFDf9-X" + }, + "source": [ + "Os cálculos acima foram feitos no NumPy e são reproduzidos manualmente abaixo:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TOslp3YSN70r" + }, + "source": [ + "[**Python**] - Função Backpropagation para calcular:\n", + "* _Output Layer_:\n", + " * $D_{O}$ - Derivada dos valores da _Output Layer_;\n", + " * $\\Delta_{O}$ - Delta;\n", + "* _Hidden Layer_:\n", + " * $D_{H}$ - Derivada dos valores da _Hidden Layer_;\n", + " * $\\Delta_{H}$ - Delta" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4PihyM2VXEXq" + }, + "source": [ + "def Backpropagation(i):\n", + " print(f'***** OUTPUT LAYER *****')\n", + " print(f'*** Função de ativação ***')\n", + " print(f[i])\n", + " \n", + " #print('\\n')\n", + " print(f'*** Derivada ***')\n", + " D_O= Derivada_Sigmoid(f)\n", + " print(D_O[i])\n", + "\n", + " #print('\\n') \n", + " print(f'*** Erros ***')\n", + " print(E[i])\n", + "\n", + " #print('\\n')\n", + " print(f'*** Delta ***')\n", + " Delta_O= D_O*E\n", + " print(Delta_O[i])\n", + " \n", + " print('\\n')\n", + " print(f'***** HIDDEN LAYER *****')\n", + " print(f'*** Função de ativação ***')\n", + " print(X2[i])\n", + "\n", + " #print('\\n')\n", + " print(f'*** Derivada ***')\n", + " D_H= Derivada_Sigmoid(X2)\n", + " D_H\n", + " print(D_H[i]) \n", + " \n", + " #print('\\n')\n", + " print(f'*** Delta ***')\n", + " Delta_H= D_H*W_O.T*Delta_O\n", + " print(Delta_H[i]) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eyGWGHVaFxNG" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 0$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uiOtWoNWOn24" + }, + "source": [ + "[**Python**] - Evocar a função Backpropagation(0):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SiNkv_DBXEXu" + }, + "source": [ + "Backpropagation(0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "PqZ_CvGI0ySD" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yO4njWZb1V1w" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "\\Delta_{H}^{(0, 1)} &= D_{H}^{(0, 1)}.W_{O}^{(1)}.\\Delta_{O}^{(0)}= (0.25)(0.24)(-0.142)= -0.009 \\\\\n", + "\\Delta_{H}^{(0, 2)} &= D_{H}^{(0, 2)}.W_{O}^{(2)}.\\Delta_{O}^{(0)}= (0.25)(0.318)(-0.142)= -0.011 \\\\\n", + "\\Delta_{H}^{(0, 3)} &= D_{H}^{(0, 3)}.W_{O}^{(3)}.\\Delta_{O}^{(0)}= (0.25)(0.142)(-0.142)= -0.005\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SXpozezsYFCX" + }, + "source": [ + "Na figura acima, temos que $\\Delta_{H}^{(0)}= [\\Delta_{H}^{(0, 1)}, \\Delta_{H}^{(0, 2)}, \\Delta_{H}^{(0, 3)}]= [-0.009, -0.011, -0.005]$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xDZYiujcHGzK" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 1$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ROzKv5VtOuy5" + }, + "source": [ + "[**Python**] - Evocar a função Backpropagation(1):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "S6An6CyUXEX0" + }, + "source": [ + "Backpropagation(1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EwblrxI20ygW" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "18bsrlv_4B0Q" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "\\Delta_{H}^{(1, 1)} &= D_{H}^{(1, 1)}.W_{O}^{(1)}.\\Delta_{O}^{(1)}= (0.225)(0.24)(0.091)= 0.005 \\\\\n", + "\\Delta_{H}^{(1, 2)} &= D_{H}^{(1, 2)}.W_{O}^{(2)}.\\Delta_{O}^{(1)}= (0.209)(0.318)(0.091)= 0.006 \\\\\n", + "\\Delta_{H}^{(1, 3)} &= D_{H}^{(1, 3)}.W_{O}^{(3)}.\\Delta_{O}^{(1)}= (0.229)(0.142)(0.091)= 0.003\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cPfQUUUHYw4i" + }, + "source": [ + "Na figura 
acima, temos que $\\Delta_{H}^{(1)}= [\\Delta_{H}^{(1, 1)}, \\Delta_{H}^{(1, 2)}, \\Delta_{H}^{(1, 3)}]= [0.005, 0.006, 0.003]$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e8qfA8CGHJo8" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 2$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UWxqTLsKOyoK" + }, + "source": [ + "[**Python**] - Evocar a função Backpropagation(2):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "w39YvfOWXEX7" + }, + "source": [ + "Backpropagation(2)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BBZuNcOC0yj9" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sZam1meY48hW" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "\\Delta_{H}^{(2, 1)} &= D_{H}^{(2, 1)}.W_{O}^{(1)}.\\Delta_{O}^{(2)}= (0.233)(0.24)(0.093)= 0.005 \\\\\n", + "\\Delta_{H}^{(2, 2)} &= D_{H}^{(2, 2)}.W_{O}^{(2)}.\\Delta_{O}^{(2)}= (0.231)(0.318)(0.093)= 0.007 \\\\\n", + "\\Delta_{H}^{(2, 3)} &= D_{H}^{(2, 3)}.W_{O}^{(3)}.\\Delta_{O}^{(2)}= (0.232)(0.142)(0.093)= 0.003\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dAc8YrceY5kY" + }, + "source": [ + "Na figura acima, temos que $\\Delta_{H}^{(2)}= [\\Delta_{H}^{(2, 1)}, \\Delta_{H}^{(2, 2)}, \\Delta_{H}^{(2, 3)}]= [0.005, 0.007, 0.003]$." 
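+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The $D_{H}^{(1, k)}$ factors used in the _Hidden Layer_ deltas for $i = 1$ can be checked by hand from $D_{H}= f_{H}(1-f_{H})$, using the rounded _Hidden Layer_ activations of the forward pass (a sketch with three-decimal rounding, so the last digit may differ slightly from the NumPy output):\n", + "\n", + "\\begin{align}\n", + "D_{H}^{(1, 1)} &= (0.658)(1-0.658) \\approx 0.225 \\\\\n", + "D_{H}^{(1, 2)} &= (0.702)(1-0.702) \\approx 0.209 \\\\\n", + "D_{H}^{(1, 3)} &= (0.646)(1-0.646) \\approx 0.229\n", + "\\end{align}"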
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PWrv-aRyHMPh" + }, + "source": [ + "#### $\\Longrightarrow$ Para $i = 3$:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MKqR3izrO15N" + }, + "source": [ + "[**Python**] - Evocar a função Backpropagation(3):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1APffWq2XEYA" + }, + "source": [ + "Backpropagation(3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PGAXyDhW0ynn" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bdgIs_zP5i1y" + }, + "source": [ + "##### _HIDDEN LAYER_\n", + "\\begin{align}\n", + "\\Delta_{H}^{(3, 1)} &= D_{H}^{(3, 1)}.W_{O}^{(1)}.\\Delta_{O}^{(3)}= (0.179)(0.24)(-0.147)= -0.006 \\\\\n", + "\\Delta_{H}^{(3, 2)} &= D_{H}^{(3, 2)}.W_{O}^{(2)}.\\Delta_{O}^{(3)}= (0.156)(0.318)(-0.147)= -0.007 \\\\\n", + "\\Delta_{H}^{(3, 3)} &= D_{H}^{(3, 3)}.W_{O}^{(3)}.\\Delta_{O}^{(3)}= (0.183)(0.142)(-0.147)= -0.004\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Ie99-SqZA6z" + }, + "source": [ + "Na figura acima, temos que $\\Delta_{H}^{(3)}= [\\Delta_{H}^{(3, 1)}, \\Delta_{H}^{(3, 2)}, \\Delta_{H}^{(3, 3)}]= [-0.006, -0.007, -0.004]$." 
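+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The $\\Delta_{O}^{(i)}$ values used in the _Hidden Layer_ deltas can also be checked by hand from $\\Delta_{O}= E_{i}*D_{O}$ with $D_{O}= f_{O}(1-f_{O})$, using the rounded errors and _Output Layer_ activations of the forward pass (a sketch with three-decimal rounding, so the last digit may differ slightly from the NumPy output):\n", + "\n", + "\\begin{align}\n", + "\\Delta_{O}^{(0)} &= (-0.587)(0.587)(1-0.587) \\approx -0.142 \\\\\n", + "\\Delta_{O}^{(1)} &= (0.384)(0.616)(1-0.616) \\approx 0.091 \\\\\n", + "\\Delta_{O}^{(2)} &= (0.391)(0.609)(1-0.609) \\approx 0.093 \\\\\n", + "\\Delta_{O}^{(3)} &= (-0.634)(0.634)(1-0.634) \\approx -0.147\n", + "\\end{align}"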
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sndwYO-VbK1C" + }, + "source": [ + "Next, the same calculations using NumPy:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ycvvhnWIO5s9" + }, + "source": [ + "[**Python**] - $D_{O}$: derivative of the _Output Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zkdw8tUKw5vo" + }, + "source": [ + "f" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "poTTrvYEXEYE" + }, + "source": [ + "D_O = Derivada_Sigmoid(f)\n", + "D_O" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JkdyDN6BPNZT" + }, + "source": [ + "[**Python**] - Show the errors:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AO9Qi9U0aWTx" + }, + "source": [ + "E" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DPPqqxpIPRsT" + }, + "source": [ + "[**Python**] - $\\Delta_{O}$: delta of the _Output Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6fylksvtaT6h" + }, + "source": [ + "Delta_O = D_O*E\n", + "Delta_O" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E9zsXwcWPXn1" + }, + "source": [ + "[**Python**] - $D_{H}$: derivative of the _Hidden Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SCABYAGjaigm" + }, + "source": [ + "D_H = Derivada_Sigmoid(X2)\n", + "D_H" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bLa9L88VPdQu" + }, + "source": [ + "[**Python**] - $\\Delta_{H}$: delta of the _Hidden Layer_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "58r5kgNwa9xo" + }, + "source": [ + "Delta_H = D_H*W_O.T*Delta_O\n", + "Delta_H" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "roh5SVtkQrJE" + }, + "source": [ +
"### _Backpropagation_ - Updating the _Output Layer_ weights $W_{O}= \\begin{bmatrix} W_{O}^{(1)} \\\\ W_{O}^{(2)} \\\\ W_{O}^{(3)} \\end{bmatrix}$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CQ69tO1IPsBQ" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{O})= (X2*\\Delta_{O})$ to update $W_{O}^{(1)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "K991veZeXEYL" + }, + "source": [ + "X2.T.dot(Delta_O)[0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hz-0fQAGd7Aw" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ovgjM8l6Np0e" + }, + "source": [ + "$$(0.5)\\Delta_{O}^{(0)}+(0.658)\\Delta_{O}^{(1)}+(0.630)\\Delta_{O}^{(2)}+(0.766)\\Delta_{O}^{(3)}$$\n", + "$$(0.5)(-0.142)+(0.658)(0.091)+(0.630)(0.093)+(0.766)(-0.147)= -0.065$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BFaNh6NEXEYO" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{O})= (X2*\\Delta_{O})$ to update $W_{O}^{(2)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eomk5j12XEYT" + }, + "source": [ + "X2.T.dot(Delta_O)[1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M-3gk0erRpSF" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hVCFLfWGPE7W" + }, + "source": [ + "$$(0.5)\\Delta_{O}^{(0)}+(0.702)\\Delta_{O}^{(1)}+(0.639)\\Delta_{O}^{(2)}+(0.806)\\Delta_{O}^{(3)}$$\n", + "$$(0.5)(-0.142)+(0.702)(0.091)+(0.639)(0.093)+(0.806)(-0.147)= -0.067$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MK92KMHYXEYV" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{O})= (X2*\\Delta_{O})$ to update $W_{O}^{(3)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "D05BW8CgXEYc" + }, + "source": [ + "X2.T.dot(Delta_O)[2]" + ], + "execution_count": null, + "outputs": [] + }, + { +
"cell_type": "markdown", + "metadata": { + "id": "K754V1CSRtii" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q51biJ5TPkKX" + }, + "source": [ + "$$(0.5)\\Delta_{O}^{(0)}+(0.646)\\Delta_{O}^{(1)}+(0.632)\\Delta_{O}^{(2)}+(0.758)\\Delta_{O}^{(3)}$$\n", + "$$(0.5)(-0.142)+(0.646)(0.091)+(0.632)(0.093)+(0.758)(-0.147)= -0.065$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SaEVJAXGV3Xd" + }, + "source": [ + "###### Implementation with NumPy\n", + "\n", + "* Formula for updating the weights $W_{O}$:\n", + "\n", + "$$W_{n+1}= W_{n}*M+\\alpha \\frac{\\partial L}{\\partial W_{n}}= W_{n}*M+\\alpha*(X*\\Delta)$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a7dGpwzfRN2M" + }, + "source": [ + "[**Python**] - Compute/update the weights $W_{O}$ using the expression for $W_{n+1}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "er3DprzjXEYg" + }, + "source": [ + "# M is the momentum factor and alpha is the learning rate\n", + "M = 1\n", + "alpha = 0.1\n", + "\n", + "W_O_New = W_O*M+alpha*(X2.T.dot(Delta_O))\n", + "W_O_New" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0-2weyIriNqN" + }, + "source": [ + "Below, the weights $W_{O}$ before and after the update:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GLkZfXbmi9c6" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t4fHsSY3AlFi" + }, + "source": [ + "### _Backpropagation_ - Adjusting the weights $W_{H}= \\begin{bmatrix} W_{H}^{(1, 1)} & W_{H}^{(1, 2)} & W_{H}^{(1, 3)} \\\\ W_{H}^{(2, 1)} & W_{H}^{(2, 2)} & W_{H}^{(2, 3)} \\end{bmatrix}$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCED4NKj1_FX" + }, + "source": [ + "#### Adjusting the weights $W_{H}^{(1, 1)}, W_{H}^{(1, 2)}, W_{H}^{(1, 3)}$\n", + "\n", + "To adjust the weights $W_{H}^{(1, 1)}, W_{H}^{(1, 2)}, W_{H}^{(1, 3)}$, we need the values of $\\Delta_{H}$ computed earlier:\n", + "\n", +
"* $\\Delta_{H}^{(0)}= [\\Delta_{H}^{(0, 1)}, \\Delta_{H}^{(0, 2)}, \\Delta_{H}^{(0, 3)}]= [-0.009, -0.011, -0.005]$;\n", + "* $\\Delta_{H}^{(1)}= [\\Delta_{H}^{(1, 1)}, \\Delta_{H}^{(1, 2)}, \\Delta_{H}^{(1, 3)}]= [0.005, 0.006, 0.003]$;\n", + "* $\\Delta_{H}^{(2)}= [\\Delta_{H}^{(2, 1)}, \\Delta_{H}^{(2, 2)}, \\Delta_{H}^{(2, 3)}]= [0.005, 0.007, 0.003]$;\n", + "* $\\Delta_{H}^{(3)}= [\\Delta_{H}^{(3, 1)}, \\Delta_{H}^{(3, 2)}, \\Delta_{H}^{(3, 3)}]= [-0.006, -0.007, -0.004]$.\n", + "\n", + "See them below in NumPy:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kNHXkTXLRmzu" + }, + "source": [ + "[**Python**] - Show $\\Delta_{H}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oYxrEVC7XEYn" + }, + "source": [ + "Delta_H" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XihSawh-1iKI" + }, + "source": [ + "##### Summary of the manually computed values of $(X*\\Delta_{H})$:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DNXt5DAhBiVC" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ to update $W_{H}^{(1, 1)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "06duXU28XEYy" + }, + "source": [ + "X.T.dot(Delta_H)[0, 0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DTCJ_5O7SeU9" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WeJOgJd5P5BS" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 1)}+(0)\\Delta_{H}^{(1, 1)}+(1)\\Delta_{H}^{(2, 1)}+(1)\\Delta_{H}^{(3, 1)}$$\n", + "$$(0)(-0.009)+(0)(0.005)+(1)(0.005)+(1)(-0.006)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8mbs0ZNTCRKL" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ to update $W_{H}^{(1, 2)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qF1iFyRWXEY9" + }, +
"source": [ + "X.T.dot(Delta_H)[0, 1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xWkm7eyLSm6I" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X9LQgX05Qj8M" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 2)}+(0)\\Delta_{H}^{(1, 2)}+(1)\\Delta_{H}^{(2, 2)}+(1)\\Delta_{H}^{(3, 2)}$$\n", + "$$(0)(-0.011)+(0)(0.006)+(1)(0.007)+(1)(-0.007)= 0$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oaVbGCATCd7B" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ to update $W_{H}^{(1, 3)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4UNZFSC5XEZE" + }, + "source": [ + "X.T.dot(Delta_H)[0, 2]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4JHAiH5GSqr0" + }, + "source": [ + "\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DDrurArKQ5I_" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 3)}+(0)\\Delta_{H}^{(1, 3)}+(1)\\Delta_{H}^{(2, 3)}+(1)\\Delta_{H}^{(3, 3)}$$\n", + "$$(0)(-0.005)+(0)(0.003)+(1)(0.003)+(1)(-0.004)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GWwWUfiXXlom" + }, + "source": [ + "#### Adjusting the weights $W_{H}^{(2, 1)}, W_{H}^{(2, 2)}$ and $W_{H}^{(2, 3)}$\n", + "\n", + "To adjust the weights $W_{H}^{(2, 1)}, W_{H}^{(2, 2)}, W_{H}^{(2, 3)}$, we need the values of $\\Delta_{H}$ computed earlier:\n", + "\n", + "* $\\Delta_{H}^{(0)}= [\\Delta_{H}^{(0, 1)}, \\Delta_{H}^{(0, 2)}, \\Delta_{H}^{(0, 3)}]= [-0.009, -0.011, -0.005]$;\n", + "* $\\Delta_{H}^{(1)}= [\\Delta_{H}^{(1, 1)}, \\Delta_{H}^{(1, 2)}, \\Delta_{H}^{(1, 3)}]= [0.005, 0.006, 0.003]$;\n", + "* $\\Delta_{H}^{(2)}= [\\Delta_{H}^{(2, 1)}, \\Delta_{H}^{(2, 2)}, \\Delta_{H}^{(2, 3)}]= [0.005, 0.007, 0.003]$;\n", + "* $\\Delta_{H}^{(3)}= [\\Delta_{H}^{(3, 1)}, \\Delta_{H}^{(3, 2)}, \\Delta_{H}^{(3, 3)}]= [-0.006, -0.007, -0.004]$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DzeytNngSI08" + }, + "source": [ + "[**Python**] - Show $\\Delta_{H}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vfiS9bFqXEZH" + }, + "source": [ + "Delta_H" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dSDbe8o8k9yi" + }, + "source": [ + "##### Summary of the manually computed values of $(X*\\Delta_{H})$:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D6e2ZoMmDLFN" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ to update $W_{H}^{(2, 1)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SH4yHqoYXEZP" + }, + "source": [ + "X.T.dot(Delta_H)[1, 0]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zORHdsEiXwSw" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P11cTnsCRpwj" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 1)}+(1)\\Delta_{H}^{(1, 1)}+(0)\\Delta_{H}^{(2, 1)}+(1)\\Delta_{H}^{(3, 1)}$$\n", + "$$(0)(-0.009)+(1)(0.005)+(0)(0.005)+(1)(-0.006)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W_LMmSEVDXY7" + }, + "source": [ + "[**Python**] - $(X*\\Delta_{H})$ to update $W_{H}^{(2, 2)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YE4DH6P_XEZZ" + }, + "source": [ + "X.T.dot(Delta_H)[1, 1]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gz7bhUuDX6Me" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OLrwPoE7SGYu" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 2)}+(1)\\Delta_{H}^{(1, 2)}+(0)\\Delta_{H}^{(2, 2)}+(1)\\Delta_{H}^{(3, 2)}$$\n", + "$$(0)(-0.011)+(1)(0.006)+(0)(0.007)+(1)(-0.007)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vzbUzC8FDhuo" + }, + "source":
[ + "[**Python**] - $(X*\\Delta_{H})$ para atualizar $W_{H}^{(2, 3)}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7-epl7I3XEZf" + }, + "source": [ + "X.T.dot(Delta_H)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0gT8_uDQX-NT" + }, + "source": [ + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QLz57OEPSWjl" + }, + "source": [ + "$$(0)\\Delta_{H}^{(0, 3)}+(1)\\Delta_{H}^{(1, 3)}+(0)\\Delta_{H}^{(2, 3)}+(1)\\Delta_{H}^{(3, 3)}$$\n", + "$$(0)(-0.005)+(1)(0.003)+(0)(0.003)+(1)(-0.004)= -0.001$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7G9gWkKWIIOL" + }, + "source": [ + "##### Implementação com NumPy\n", + "\n", + "Usando:\n", + "* M = 1;\n", + "* $\\alpha = 0.1$;\n", + "* Fórmula: $W_{n+1} = (W_{n}*M)+\\alpha*(X*\\Delta_{H})$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5C3NCWcuShqN" + }, + "source": [ + "[**Python**] - Calcular/atualizar os pesos $W_{H}$ usando a expressão $W_{n+1}$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ys_y-R0BL7Iw" + }, + "source": [ + "M = 1\n", + "alpha = 0.1\n", + "\n", + "W_H_New = W_H*M+alpha*(X.T.dot(Delta_H))\n", + "W_H_New" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IvaIx_PKZEmd" + }, + "source": [ + "##### Novos Pesos $W_{H}$ e $W_{O}$ da Rede Neural (Antes x Depois)\n", + "\n", + "\"Drawing\"\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EuO2t22CffE8" + }, + "source": [ + "___\n", + "# **Como as Redes Neurais aprendem?**\n", + "\n", + "> Vimos até agora grande parte dos cálculos matemáticos que envolvem o treinamento das Redes Neurais, que envolvem a repetição dos processos _Forward_ e _Backward_:\n", + "\n", + "1. 
_**Forward**_: Consiste na multiplicação de matrizes entre os _arrays_ da _input layer_, pesos $W$ e, na sequência, aplicar as funções de ativação.\n", + "\n", + "2. _**Backward**_: Consiste em atualizar os pesos $W_{O}$ e $W_{H}$ para minimizar a _Loss Function_ $L$ usando _Gradient Descent_.\n", + "\n", + "* Estes 2 processos foram vistos detalhadamente em aulas anteriores.\n", + " * Cálculos matemáticos passo a passo foram mostrados. Portanto, visite nossas aulas anteriores para aprender mais sobre os aspectos teóricos e matemáticos por trás das Redes Neurais.\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mpKyqSuDdbMr" + }, + "source": [ + "___\n", + "# **_GRADIENT DESCENT_**\n", + "\n", + "_Gradient Descent_ é um algoritmo interativo utilizado para otimizar (neste caso, minimizar) a _Loss Function_ $L$. \n", + "\n", + "* Minimizar a _Loss Function_ $L$ significa encontrar os pesos $W_{H}$ e $W_{O}$ que faz com que MSE seja o menor possível, pois quanto menor o MSE, melhor a performance da Rede Neural. 
\n", + "\n", + "* To update the weights $W$, we use the following expression:\n", + "\n", + "$$W_{n+1}= W_{n}*M+\\alpha \\frac{\\partial L}{\\partial W_{n}}= W_{n}*M+\\alpha*(X*\\Delta)$$\n", + "\n", + "where:\n", + "\n", + "* $L$ is the _Loss Function_ to be minimized;\n", + "* $W_{n}$ are the current weights, which will be updated for the next iteration;\n", + "* $\\alpha$ is the _Learning Rate_, which controls how fast the Neural Network learns.\n", + " * The SMALLER the value of $\\alpha$ $\\Longrightarrow$ the slower the convergence to the global minimum;\n", + " * The LARGER the value of $\\alpha$ $\\Longrightarrow$ the faster the convergence to a minimum, but with no guarantee that it is the global minimum.\n", + "* $M$ is the _Momentum_, a device for speeding up the optimization (minimization) of the _Loss Function_ $L$.\n", + " * High values $\\Longrightarrow$ faster learning;\n", + " * Low values $\\Longrightarrow$ slower learning, but with a better chance of finding the optimal solution and avoiding local minima.\n", + "* $\\frac{\\partial L}{\\partial W_{n}}$ is the derivative of the _Loss Function_ $L$ with respect to the weight $W_{n}$. As stated earlier, it is the contribution of the weight $W$ to the error. Computing $(X*\\Delta)$ is the hardest part of the formula, and we worked through these calculations step by step in previous lessons."
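, + "\n", + "\n", + "As a small worked illustration (a sketch that reuses only numbers already shown in this notebook: $M = 1$, $\\alpha = 0.1$ and the entry $(X*\\Delta_{H}) = -0.001$ computed earlier for $W_{H}^{(2, 3)}$), the update of that single weight reduces to:\n", + "\n", + "$$W_{H}^{(2, 3)} \\leftarrow W_{H}^{(2, 3)}*1 + 0.1*(-0.001) = W_{H}^{(2, 3)} - 0.0001$$\n", + "\n", + "The current value of $W_{H}^{(2, 3)}$ is left symbolic here; the point is only that, with a small $\\alpha$, each iteration moves the weight by a small step."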
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tzuFSOV4eboI" + }, + "source": [ + "Look at the figure below: _Gradient Descent_ tries to find the global minimum of the _loss function_, avoiding local minima as far as possible.\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z-UbiCxUgvHg" + }, + "source": [ + "Below are some articles on _Gradient Descent_ in case you want to learn more about the subject:\n", + "\n", + "* [An introduction to Gradient Descent Algorithm](https://medium.com/@montjoile/an-introduction-to-gradient-descent-algorithm-34cf3cee752b) - Open this article to see the effects of the _Learning Rate_ and the types of _Gradient Descent_ available for Machine Learning;\n", + "* [Machine learning : Gradient Descent](https://medium.com/@arshren/gradient-descent-5a13f385d403);\n", + "* [The Math and Intuition Behind Gradient Descent](https://medium.com/datadriveninvestor/the-math-and-intuition-behind-gradient-descent-13c45f367a11) - Shows the mathematics behind _Gradient Descent_;\n", + "* [An Introduction to Gradient Descent](https://towardsdatascience.com/an-introduction-to-gradient-descent-c9cca5739307);\n", + "* [Gradient Descent From Scratch](https://towardsdatascience.com/gradient-descent-from-scratch-e8b75fa986cc);\n", + "* [Gradient Descent Explanation & Implementation](https://towardsdatascience.com/gradient-descent-explanation-implementation-c74005ff7dd1) - Step-by-step calculations;" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "junvtVY4eePi" + }, + "source": [ + "___\n", + "# **_LOSS FUNCTION_ $L$**\n", + "\n", + "> As we saw earlier, our goal is to minimize the _Loss function_ through _Gradient Descent_. In other words, at each iteration (_epoch_) this optimization process updates the weights $W$ to reduce the _Loss Function_. 
The most common _Loss Functions_ are:\n", + "\n", + "* **Regression**: mse or mae;\n", + "* **Classification**: _cross-entropy_ (when we want the probability of each observation belonging to a given class).\n", + " * **Binary classification**: tf.keras.losses.BinaryCrossentropy();\n", + " * **Multi-class classification**: tf.keras.losses.CategoricalCrossentropy()." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6e4ULmJheePY" + }, + "source": [ + "___\n", + "# **METRICS FOR MEASURING NEURAL NETWORK PERFORMANCE**\n", + "\n", + "* Metrics measure the quality/performance of Neural Networks. The main ones are:\n", + " * **Regression**: The closer MAE, MSE or RMSE is to 0, the better the Neural Network performs.\n", + " * MAE stands for \"_Mean Absolute Error_\": the mean of the absolute differences between the actual values $y_{i}$ and the predicted values $\\hat{Y}_{i}$.\n", + "\n", + " * MSE stands for \"_Mean Square Error_\": the mean of the squared differences between the actual values $y_{i}$ and the predicted (or computed) values $\\hat{Y}_{i}$.\n", + "\n", + " * RMSE stands for \"_Root Mean Square Error_\": the square root of the MSE.\n", + " \n", + " * **Classification**: The higher the accuracy, the better the Neural Network performs.\n", + " * Accuracy\n", + "\n", + "* Mathematical expressions:\n", + "\n", + "\\begin{align}\n", + "MSE &= \\frac{\\sum_{i=1}^{n}(y_{i}-\\hat{Y}_{i})^{2}}{n} \\\\\n", + "RMSE &= \\sqrt{MSE} \\\\\n", + "MAE &= \\frac{\\sum_{i=1}^{n}|y_{i}-\\hat{Y}_{i}|}{n}\n", + "\\end{align}\n", + "\n", + "For students who are unsure which metric to use, I suggest reading the article [MAE and RMSE — Which Metric is Better?](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6n4QjH1WeeO8" + }, + "source": [ + "___\n", + "# **_DROPOUT_**\n", + "\n", + "> _Dropout_ means randomly and temporarily ignoring a fraction $p$ of the neurons during training. 
By \"ignoring\" I mean that those neurons are not considered during the _forward_ and _backpropagation_ passes.\n", + "\n", + "* _Dropout_ forces the Neural Network to learn from the data using different, randomly chosen neurons;\n", + "* $p = 0.20$ is recommended. \n", + "* When using _Dropout_, more epochs are recommended to train the Neural Network;\n", + "\n", + "* **Advantages**:\n", + " * Avoids _overfitting_ - In a \"_fully connected layer_\", neurons develop co-dependencies during training, which leads to _overfitting_. _Dropout_ reduces this dependency somewhat and, with it, the chances of _overfitting_;\n", + "\n", + "![Dropout](https://github.com/MathMachado/Materials/blob/master/Dropout.png?raw=true)\n", + "\n", + "Source: [Dropout in (Deep) Machine learning](https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-_dropout_-in-deep-machine-learning-74334da4bfc5).\n", + "\n", + "TEMPLATE: keras.layers.Dropout(rate, noise_shape=None, seed=None)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9-9Y9562kNNU" + }, + "source": [ + "___\n", + "# **Multilayer Neural Network (1 _Hidden Layer_) for the XOR logical operator using _Tensorflow_/_Keras_**\n", + "\n", + "* **Notes**:\n", + " * There are several articles (on _Medium_, for example) that discuss and build Neural Networks for the XOR logical operator. 
So why did I decide to build this lesson around the XOR dataframe?\n", + " * To explain, didactically and step by step, all the mathematics behind Neural Networks using a dataframe that is small yet complex: the problem is NOT linearly separable and therefore requires a more complex Neural Network (with at least 1 _Hidden Layer_) to improve accuracy and reduce the _loss_;\n", + " * To show how easy it is to build Neural Networks with Tensorflow/Keras;\n", + " * To explain, didactically and step by step, the _Forward_ and _Backward_ processes used to train Neural Networks;\n", + " * Tensorflow version used: 2.x;\n", + " * I am using Google Colab;\n", + " * In this lesson, do not worry too much about the syntax of the commands. Why?\n", + " * We will go over everything in detail in the next lessons, so you will have the chance to learn and practice very soon;\n", + " * The goal of this lesson is simply to introduce Neural Networks and Tensorflow/Keras and to show the steps we will follow here and in the future to build Neural Networks. Everything will become clearer as you go through the subsequent lessons.\n", + "* All lessons in the Neural Networks course were carefully planned and prepared to bring you relevant content so that you can learn Neural Networks in the shortest possible time. For that reason, in this lesson you will find several lines like the one below:\n", + "\n", + "[**Python**] - Command or _code_ that must be executed.\n", + "\n", + "These lines act as a kind of _guide_ so that we do not forget any detail of the lesson. 
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rOQbXbCgZZRL" + }, + "source": [ + "Below is the dataframe for the XOR logical operator:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | Actual value ($y_{i}$)|\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KusCpN1S4CtH" + }, + "source": [ + "We will follow the steps below to build our Neural Network:\n", + "\n", + "1. Load the Python and Tensorflow libraries;\n", + "2. Load the data to train the Neural Network;\n", + "3. Define the Neural Network architecture with Tensorflow/Keras;\n", + "4. Compile the Neural Network;\n", + "5. Fit the Neural Network;\n", + "6. Evaluate the performance of the Neural Network;\n", + "7. _Fine tuning_ of the Neural Network;\n", + "8. Make predictions with the Neural Network;\n", + "9. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mq7pF8854cf6" + }, + "source": [ + "### 1. Load the Python and Tensorflow libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Br4REluttJXH" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "W-1jl_vnP7n3" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import tensorflow as tf\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UA_bHIYOrNwy" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ApSwaqVbQVGx" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CH37RXFLtPpB" + }, + "source": [ + "[**Python**] - Set the number of decimal places to 3" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Pzdu5btatTom" + }, + "source": [ + "np.set_printoptions(precision = 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5YbKhVkd4sm2" + }, + "source": [ + "### 2. Load the data to train the Neural Network\n", + "\n", + "Below is the dataframe for the XOR logical operator:\n", + "\n", + "| i | $X_{1}$ | $X_{2}$ | Actual value ($y_{i}$)|\n", + "|---|---|---|---|\n", + "| 0 | 0 | 0 | 0 |\n", + "| 1 | 0 | 1 | 1 |\n", + "| 2 | 1 | 0 | 1 |\n", + "| 3 | 1 | 1 | 0 |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3uojCCMaWgtI" + }, + "source": [ + "[**Python**] - Define the _inputs_ $X$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nTbtpKKdQh9M" + }, + "source": [ + "X_XOR = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n", + "X_XOR" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7hc1pYW-WsCp" + }, + "source": [ + "[**Python**] - Define the _outputs_ $Y$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gj4zl-JdQ0nR" + }, + "source": [ + "y_XOR = np.array([[0], [1], [1], [0]])\n", + "y_XOR" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2WF73gTMZtN-" + }, + "source": [ + "### 3. Key concept: _Fully connected layer_\n", + "\n", + "> The Neural Network architecture below is said to be _fully connected_, that is, every neuron in one layer connects to every neuron in the next layer. 
Observe the figure below:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7X5eys3mxsl2" + }, + "source": [ + "#### Neural Network architecture\n", + "\n", + "> Below is the architecture of the Neural Network we will build in this example:\n", + "\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QpmtGEFgWzhk" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the _Input Layer_;\n", + "* $N_{O}$: Number of neurons in the _Output Layer_;\n", + "* $N_{H}$: Number of neurons in the _Hidden Layer_;\n", + "* FA: Activation function:\n", + " * _Hidden Layer_: Several options could be used, but I will try to solve this example with the _Sigmoid_ activation function, the one we chose when explaining Neural Networks step by step.\n", + " * _Output Layer_: The values of $y_{i}$ in the dataframe are binary, so we use the _Sigmoid_ activation function for the _Output Layer_."
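, + "\n", + "\n", + "As a quick check of this architecture (a sketch assuming the values used in the next cell, $N_{I} = 2$, $N_{H} = 3$ and $N_{O} = 1$, and that each _Dense_ layer has one bias per neuron, which is the Keras default), the number of trainable parameters is:\n", + "\n", + "$$\\underbrace{N_{I}*N_{H}+N_{H}}_{\\text{Hidden Layer}} + \\underbrace{N_{H}*N_{O}+N_{O}}_{\\text{Output Layer}} = (2*3+3)+(3*1+1) = 9+4 = 13$$\n", + "\n", + "This total should match the number of parameters reported by the model summary."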
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Id_P910LRRb4" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I = 2 # Number of variables/columns in the predictor matrix\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O = 1\n", + "\n", + "# Number of neurons in the Hidden Layer:\n", + "N_H = 3\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H = tf.keras.activations.sigmoid\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O = tf.keras.activations.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n6s9RcjLXqQm" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3DizOTqQR6-U" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vdcdcNncYB15" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E8KJ0f70HEwN" + }, + "source": [ + "**Note**:\n", + "\n", + "* The option kernel_constraint= tf.keras.constraints.UnitNorm() will be used to reduce _overfitting_, as suggested by the article [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/)."
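, + "\n", + "\n", + "To make the constraint concrete (a sketch assuming the Keras default axis=0, so that the norm is taken over the incoming weights of each neuron), after every weight update _UnitNorm_ rescales each entry $w_{ij}$ of the kernel as:\n", + "\n", + "$$w_{ij} \\leftarrow \\frac{w_{ij}}{\\epsilon + \\sqrt{\\sum_{k} w_{kj}^{2}}}$$\n", + "\n", + "where $\\epsilon$ is a small constant for numerical stability. Each neuron of the layer therefore keeps an incoming weight vector of (approximately) unit length."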
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-LYbXfEZYNcC" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "\n", + "RN = Sequential() # name of the Neural Network\n", + "RN.add(Dense(units = N_H, \n", + " input_dim = N_I, \n", + " activation = FA_H, \n", + " kernel_constraint = tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units= N_O, activation = FA_O))\n", + "\n", + "# Summary of the Neural Network architecture (summary() prints directly and returns None):\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OoA-_8A55jMW" + }, + "source": [ + "### 4. Compile the Neural Network\n", + "\n", + "> Adam is an optimization algorithm.\n", + "\n", + "To learn more about the 'adam' optimization algorithm, see the article [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ifkjrCT6Yki6" + }, + "source": [ + "[**Python**] - Command model.compile(optimizer, loss, metrics):\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OdIerBPAUGbY" + }, + "source": [ + "Algoritmo_Opt = tf.keras.optimizers.Adam() # Optimization algorithm\n", + "Loss_Function = tf.keras.losses.MeanSquaredError() # Loss used to compute the error\n", + "Metrics_Perf = [tf.keras.metrics.binary_accuracy]\n", + "\n", + "RN.compile(optimizer = Algoritmo_Opt, \n", + " loss= Loss_Function, \n", + " metrics = Metrics_Perf)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KVx2w28c5urj" + }, + "source": [ + "### 5. Fit/train the Neural Network\n", + "\n", + "* 1 _Epoch_ = 1 iteration of the Neural Network over the entire training dataframe, where 1 iteration comprises 1 _Forward_ process and 1 _Backward_ process."
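, + "\n", + "\n", + "As a rough count for this example (a sketch assuming the Keras default batch_size = 32 in fit, since the call used in this lesson does not set it), the 4 rows of the XOR dataframe fit in a single batch, so the number of weight updates per _epoch_ is:\n", + "\n", + "$$\\left\\lceil \\frac{4}{32} \\right\\rceil = 1$$\n", + "\n", + "Training for 100 _epochs_ therefore corresponds to 100 weight updates."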
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZV3XwUJ8YvxE" + }, + "source": [ + "[**Python**] - Command model.fit(X_train, y_train, epochs):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "45inZ8X3U0Ew" + }, + "source": [ + "hist = RN.fit(X_XOR, y_XOR, epochs = 100)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i1bUiekR5q1E" + }, + "source": [ + "### 6. Evaluate the performance of the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wBp4ctbKY8k7" + }, + "source": [ + "[**Python**] - Command model.evaluate(X_test, y_test):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M4HlrjjjVLjB" + }, + "source": [ + "RN.evaluate(X_XOR, y_XOR)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iPwANO05VT5m" + }, + "source": [ + "**Result**: The _baseline_ (initial) model gives the following results:\n", + "* loss= 0.2515;\n", + "* accuracy= 50%.\n", + "\n", + "* **Comment**: The Neural Network gives unsatisfactory results." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lD2pw9H754ZZ" + }, + "source": [ + "### 7. _Fine tuning_ of the Neural Network\n", + "\n", + "Before discussing _fine tuning_, let us return to CRISP-DM:\n", + "\n", + "CRISP-DM stands for _Cross Industry Standard Process for Data Mining_: a set of processes or phases for developing _Data Mining_ projects, which has been widely used by Data Scientists to develop predictive models.\n", + "\n", + "\"Drawing\"\n", + "\n", + "Source: [The steps to a successful machine learning project](https://emba.epfl.ch/2018/04/10/steps-successful-machine-learning-project/)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1ssJuKF3FNA3" + }, + "source": [ + "* CRISP-DM:\n", + " 1. 
_Business Understanding_\n", + " * Focuses on understanding the project objectives and requirements from a business perspective, and then converting that knowledge into a data mining problem definition and a preliminary plan.\n", + "\n", + " 2. _Data Understanding_\n", + " * Covers extracting samples to become familiar with the data, identify quality problems, discover first insights, or detect interesting subsets that suggest hypotheses about hidden information.\n", + "\n", + " 3. _Data Preparation_\n", + "\n", + " * Covers all the activities needed to build the final dataset, which will be split into training and validation samples for the predictive model.\n", + "\n", + " 4. _Modeling_\n", + "\n", + " * In this phase, the candidate techniques that can be applied are evaluated.\n", + "\n", + " 5. _Evaluation_\n", + "\n", + " * After building the _baseline_ (initial) model, and with the _Loss Function_ defined in advance, the performance of the predictive models (Neural Networks, in our case) is evaluated or tested to make sure the model generalizes. Of all the models tested in this phase, we must select the champion model.\n", + "\n", + " 6. _Deployment_\n", + "\n", + " * Means implementing the model code in an operational (production) system to score or categorize new data as it arrives, and creating a mechanism for using this new information to solve the original business problem. Importantly, the deployed code must also include all the data preparation steps that preceded modeling, so that the model treats new raw data the same way as during model development."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VrgiPmD3jw_o" + }, + "source": [ + "#### Strategies to improve the accuracy of the Neural Network\n", + "\n", + "Our options are:\n", + "\n", + "* a. Increase the number of neurons in the _Hidden Layer_;\n", + "* b. Increase the number of _Hidden Layers_;\n", + "* c. Increase both the number of _Hidden Layers_ and the number of neurons;\n", + "* d. Change the activation function;\n", + "* e. Increase the number of _epochs_;\n", + "* f. Change the optimization algorithm (_optimizer_);\n", + "\n", + "In this example, after several attempts, I succeeded by changing the following parameters: \n", + "* Activation function: changed to tf.keras.activations.relu;\n", + "* Number of neurons in the _Hidden Layer_: increased to 64;\n", + "* _Learning Rate_ of the Adam optimizer: set to 0.01 (see section 7.4)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V81AQ9t8IA9D" + }, + "source": [ + "#### 7.3. Define the Neural Network architecture with Tensorflow/Keras" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yVhR55OoaWeX" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the _Input Layer_;\n", + "* $N_{O}$: Number of neurons in the _Output Layer_;\n", + "* $N_{H}$: Number of neurons in the _Hidden Layer_;\n", + "* FA: Activation function:\n", + " * _Hidden Layer_: tf.keras.activations.relu;\n", + " * _Output Layer_: The values of $y_{i}$ in the dataframe are binary, so we use the _Sigmoid_ activation function for the _Output Layer_."
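, + "\n", + "\n", + "For comparison with the baseline (a sketch assuming the values used in the next cell, $N_{I} = 2$, $N_{H} = 64$ and $N_{O} = 1$, with one bias per neuron as in the Keras default), the number of trainable parameters of the tuned network is:\n", + "\n", + "$$(N_{I}*N_{H}+N_{H})+(N_{H}*N_{O}+N_{O}) = (2*64+64)+(64*1+1) = 192+65 = 257$$\n", + "\n", + "This total should match the number of parameters reported by the model summary."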
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "R26Rf7x_aWeZ" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I = 2 # NOT CHANGED!\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O = 1 # NOT CHANGED!\n", + "\n", + "# CHANGED VARIABLES:\n", + "# Number of neurons in the Hidden Layer:\n", + "N_H = 64\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H = tf.keras.activations.relu # CHANGED!\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O = tf.keras.activations.sigmoid # NOT CHANGED!" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LtQXjYnvIdJR" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WSCCZ6BcIdJZ" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MwjdTXWNawSz" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OWpJNRQjIRA4" + }, + "source": [ + "**Notes**: \n", + "\n", + "To avoid problems related to _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for constraining and initializing the weights $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight 
Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu', and kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "khod_vL5awS2" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN = Sequential()\n", + "RN.add(Dense(units = N_H, input_dim = N_I, activation = FA_H, kernel_constraint = tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units = N_O, activation = FA_O))\n", + "\n", + "# summary() prints directly and returns None, so no print() is needed:\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8V5EygkgIRA8" + }, + "source": [ + "#### 7.4. 
Compile the Neural Network\n", + "\n", + "> Adam is an optimization algorithm.\n", + "\n", + "To learn more about 'adam', see the article [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aEgVJAInbFW0" + }, + "source": [ + "[**Python**] - Command model.compile(optimizer, loss, metrics):\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FRhAlexLW29o" + }, + "source": [ + "#Algoritmo_Opt = tf.keras.optimizers.Adam()\n", + "Algoritmo_Opt = tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,\n", + " name='Adam')\n", + "\n", + "Loss_Function = tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = [tf.keras.metrics.binary_accuracy]\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= Metrics_Perf)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jxpOHzSKIRA-" + }, + "source": [ + "#### 7.5. Fit the Neural Network\n", + "\n", + "1 _Epoch_ = 1 iteration of the Neural Network over the entire training dataframe, where 1 iteration comprises 1 _Forward_ process and 1 _Backward_ process." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o-1iqXLabR4O" + }, + "source": [ + "[**Python**] - Command model.fit(X_train, y_train, epochs):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vqyeqpq5XGAm" + }, + "source": [ + "RN.fit(X_XOR, y_XOR, epochs = 100)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C25ZV-x4IRBB" + }, + "source": [ + "#### 7.6. 
Evaluate the performance of the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tCd2S65ubg_M" + }, + "source": [ + "[**Python**] - Command model.evaluate(X_test, y_test):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I8-Vr9lXXav4" + }, + "source": [ + "RN.evaluate(X_XOR, y_XOR)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6IqEhL-2Xj-t" + }, + "source": [ + "**Result**: After _fine tuning_, the model gives the following results:\n", + "* loss= 0.1502;\n", + "* accuracy= 100%.\n", + "\n", + "* **Comment**: The Neural Network gives satisfactory results." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AZjDavkO58Pu" + }, + "source": [ + "### 8. Make predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HV4HkNDcbmJ2" + }, + "source": [ + "[**Python**] - Command RN.predict(X_train), thresholded at 0.5 to obtain the predicted classes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aum69OJENO6V" + }, + "source": [ + "# predict_classes() was removed in later Tensorflow 2.x releases;\n", + "# thresholding the sigmoid output at 0.5 gives the same binary classes.\n", + "y_pred = (RN.predict(X_XOR) > 0.5).astype('int32')\n", + "y_pred" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "rNogASabEhz8" + }, + "source": [ + "y_XOR" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8xaBkwD15-1d" + }, + "source": [ + "### 9. Conclusions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UULwoI-9yPIs" + }, + "source": [ + "The final Neural Network, after the _fine tuning_ phase, gives the results shown in section 7.6. Given these results, we suggest moving on to the _deployment_ phase of the Neural Network, as CRISP-DM suggests." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uFK4SeM5TLOb" + }, + "source": [ + "### **Exercise**\n", + "\n", + "1. 
Try other activation functions for the _Hidden Layer_, then record and report your results. To find out which activation functions can be used, see [Module: tf.keras.activations](https://www.tensorflow.org/api_docs/python/tf/keras/activations);\n", + "\n", + "2. Try other optimization algorithms to train the Neural Network, then record and report your results. To find out which algorithms can be used, see [Module: tf.keras.optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers).\n", + "\n", + "3. In this example, we used the 'adam' optimization algorithm. Check the Tensorflow/Keras documentation for 'adam' and you will see that the syntax of the algorithm is:\n", + "\n", + "```\n", + "tf.keras.optimizers.Adam(\n", + " learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,\n", + " name='Adam', **kwargs\n", + ")\n", + "```\n", + "\n", + "Retrain the Neural Network with different values of the _Learning Rate_ and report your results.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pyyiNwm6eeP4" + }, + "source": [ + "___\n", + "# **_ACTIVATION FUNCTION_**\n", + "\n", + "> Activation functions are an important part of Neural Networks because they allow the network to handle the non-linearity present in most real-world problems.\n", + "\n", + "The most widely used activation functions are:\n", + "* _Sigmoid_;\n", + "* ReLU (_Rectified Linear Unit_);\n", + "* Leaky ReLU;\n", + "* _Generalized_ ReLU;\n", + "* Tanh;\n", + "* _Swish_.\n", + "\n", + "The following articles discuss these main activation functions:\n", + "* [Classical Neural Net: Why/Which Activations Functions?](https://towardsdatascience.com/classical-neural-net-why-which-activations-functions-401159ba01c4);\n", + "* [Intermediate Topics in Neural 
Networks](https://towardsdatascience.com/comprehensive-introduction-to-neural-network-architecture-c08c6d8e5d98);\n", + "* [Comparison of Activation Functions for Deep Neural Networks](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7jF5SToOKYC" + }, + "source": [ + "### Activation functions for _Hidden Layers_:\n", + "\n", + "Several activation functions can be used in the _Hidden Layer_. The main ones are:\n", + "\n", + "* ReLU\n", + " * mitigates the _vanishing gradient problem_, which is the main weakness of the _sigmoid_ and _tanh_ activation functions. On the other hand, the derivative of ReLU is zero for all negative inputs, which can lead to so-called \"dead neurons\";\n", + " * Almost all _Deep Learning_ models today use ReLU, which **should be used only in the _Hidden Layers_ of Neural Networks**. \n", + "* Leaky ReLU\n", + " * A variant of ReLU that keeps a small slope for negative inputs, which reduces the dead-neuron problem;\n", + "* _Swish_\n", + " * another alternative to ReLU, proposed by Google in 2017;\n", + " * some articles report improved Neural Network results with _Swish_." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8dYvHeYbN_c4" + }, + "source": [ + "### Activation functions for _Output Layers_:\n", + "\n", + "The activation function of the _Output Layer_ depends on the problem:\n", + "\n", + "* _Sigmoid_ for binary classification problems (2 classes).\n", + " * Example: Dataframe: Titanic, since we want to estimate whether the passenger died or survived;\n", + "* _Softmax_ for multi-class classification problems (> 2 classes).\n", + " * Example: Dataframe: Iris, since we want to estimate the species of the flowers, which are versicolor, virginica and setosa;\n", + "* _Linear_ for regression problems. 
\n", + " * Exemplo: Dataframe: Boston Housing Prediction, pois queremos estimar o preço das casas em Boston, que é uma variável contínua.\n", + "\n", + "\n", + "O artigo [Comparison of Activation Functions for Deep Neural Networks](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a) compara e discute as principais funções de ativação de forma pormenorizada." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "clQq59rPIkvB" + }, + "source": [ + "___\n", + "# **EXEMPLO 1: Rede Neural para identificar o sexo a partir de peso e altura**\n", + "\n", + "> O dataframe a seguir contem 10.000 medidas de altura (_height_) e peso (_weight_), sendo 5.000 medidas para o sexo masculino (_males_) e 5.000 para o sexo feminino (_females_).\n", + "\n", + "**Objetivo**: Estimar gênero (sexo) (_Gender_, em inglês) em função das variáveis _Height_ e _Weight_.\n", + "\n", + "Fonte do dataframe: Kaggle (weight-height.csv).\n", + "\n", + "Nesta aplicação, vamos seguir os passos adiante:\n", + "\n", + "1. Carregar os dados;\n", + "2. Pré-processamento e transformação dos dados;\n", + "3. Definir as amostras de treinamento e validação;\n", + "4. Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_;\n", + "5. Compilar a Rede Neural;\n", + "6. Ajustar a Rede Neural;\n", + "7. Avaliar a performance da Rede Neural;\n", + "8. _Fine tuning_ da Rede Neural;\n", + "9. Fazer Predições com a Rede Neural;\n", + "10. Conclusões." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dh5p2GcvLQQX" + }, + "source": [ + "### 0. 
Load the main libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bhjAdXgab99r" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kChuTlPddNZv" + }, + "source": [ + "import numpy as np\n", + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "import pandas as pd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ZX00UN5cjvM" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "THWNIk_FCe_g" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PZgQAKqLcLX3" + }, + "source": [ + "[**Python**] - Set the number of decimal places used when printing NumPy arrays" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tzKor02BCe_d" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M5V4KopjLWOL" + }, + "source": [ + "### 1. 
Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V_cwAUW3tseE" + }, + "source": [ + "[**Python**] - Load the data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_Bs87IWPtwtm" + }, + "source": [ + "# Read the dataframe:\n", + "df_sexo = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/weight-height.csv')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mBUeMtV7tzw6" + }, + "source": [ + "[**Python**] - Show the first 5 rows:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rcH-y4amt3gs" + }, + "source": [ + "df_sexo.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OSa161sPLcAw" + }, + "source": [ + "### 2. Pre-process and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lL2-6wpCuARF" + }, + "source": [ + "[**Python**] - Build the column 'sexo' as follows:\n", + "* If Gender= 'Male' ==> sexo= 1;\n", + "* If Gender= 'Female' ==> sexo= 0." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ccImSqCqDKre" + }, + "source": [ + "def define_label(row):\n", + "    if row['Gender'] == 'Male':\n", + "        return 1\n", + "    else:\n", + "        return 0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NDYamauZCq77" + }, + "source": [ + "df_sexo['sexo'] = df_sexo.apply(define_label, axis = 1)\n", + "df_sexo.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hqkOrJnNuZjg" + }, + "source": [ + "[**Python**] - Drop the column 'Gender' and rename the remaining columns of the dataframe in lowercase:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-dahUMI6DsBz" + }, + "source": [ + "df_sexo = df_sexo.drop(columns= 'Gender')\n", + "df_sexo = df_sexo.rename({'Height': 'altura', 'Weight': 'peso'}, axis= 1)\n", + "df_sexo.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UTISVuZ4ukQO" + }, + "source": [ + "[**Python**] - Define the arrays X_sexo and y_sexo:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oMTIn6Zf5LlU" + }, + "source": [ + "X_sexo = df_sexo.copy()\n", + "X_sexo = X_sexo.drop(columns= ['sexo'])\n", + "y_sexo = df_sexo['sexo'].values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "iSThKwhj4LsC" + }, + "source": [ + "y_sexo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FiO_F95jc1_s" + }, + "source": [ + "[**Python**] - Standardize the data - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4myPAnSzE7-l" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "\n", + "X_sexo= SS.fit_transform(X_sexo)\n", + "X_sexo" + ], + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "jJaJWuUqJCha" + }, + "source": [ + "### 3. Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LoO2iEimu4SQ" + }, + "source": [ + "[**Python**] - Define the training and validation samples" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hTCdm-F9JBGA" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste= train_test_split(X_sexo, y_sexo, test_size = 0.1, random_state = 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "th9CsQpB8VDK" + }, + "source": [ + "print(f'Y: Treinamento = {y_treinamento.shape}; Y: Teste = {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2bL-vXiULupD" + }, + "source": [ + "### 4. 
Define the Neural Network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zxETX6dTfyU5" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in the hidden layer (_Hidden Layer_);\n", + "* FA: Activation function;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "F_MdsLicfyU6" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I = 2\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O = 1\n", + "\n", + "# Number of neurons in the Hidden Layer:\n", + "N_H = 64\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H = tf.keras.activations.swish\n", + "\n", + "# Activation function of the Output Layer (sigmoid, since this is binary classification):\n", + "FA_O = tf.keras.activations.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SUMmDuPCcYyB" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of the results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "T-echOBmceVy" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7ZceRRdinEM2" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nXQsSYq2DBfI" + }, + "source": [ + "* 1 _dropout_ layer with $p= 0.1$:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TRFR5Kr_nDtD" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers 
import Dense\n", + "from 
tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(N_H, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "#RN.add(Dense(N_H2, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "#RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O))\n", + "\n", + "# Summary of the Neural Network architecture (summary() prints the table itself)\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4JBZf4ypGO8o" + }, + "source": [ + "### 5. Compile the Neural Network\n", + "\n", + "This is a binary classification problem (_Male_ or _Female_). Therefore, we have:\n", + "* optimizer= tf.keras.optimizers.Adam();\n", + "* loss= tf.keras.losses.MeanSquaredError() or loss= tf.keras.losses.BinaryCrossentropy(). Personally, I like to use loss= tf.keras.losses.MeanSquaredError() because the result is more intuitive;\n", + "* metrics= tf.keras.metrics.binary_accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "USmAuw6f00wL" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "h7KEi1_e6SSF" + }, + "source": [ + "Algoritmo_Opt = tf.keras.optimizers.Adam()\n", + "Loss_Function = tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "RN.compile(optimizer = Algoritmo_Opt, loss = Loss_Function, metrics = [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hc90EeV_GojX" + }, + "source": [ + "### 6. Fit the Neural Network\n", + "\n", + "Note: The callbacks option below implements the concept of _early stopping_. 
This option stops the training of the Neural Network before the maximum number of _epochs_ is reached when the model stops improving, as measured by the val_loss metric. The parameter _patience_= k means that the optimization process stops after k consecutive _epochs_ without any improvement in the performance of the Neural Network." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XCCTtUh_vEFP" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EB91J6nrF0db" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 5, min_delta = 0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs = 100, validation_data = (X_teste, y_teste), callbacks = callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "71mX1iwvHMc5" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "o-zJ6GIjHbY8" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J1sL_DTrKmpq" + }, + "source": [ + "### 7. Evaluate the performance of the Neural Network\n", + "\n", + "To evaluate the Neural Network, we simply pass the test samples: X_teste and y_teste. The evaluate() function returns a list with 2 values: loss and accuracy." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VckQfEFPvMa7" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rUhEiqxfKmpv" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "agO4cGTqKmpz" + }, + "source": [ + "Next, the confusion matrix:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aLIAXu7SN7pV" + }, + "source": [ + "Mostra_ConfusionMatrix()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D5zYHcGuMPZe" + }, + "source": [ + "### 8. _Fine tuning_ of the Neural Network\n", + "\n", + "To increase the accuracy of the Neural Network, I suggest increasing the number of neurons in the _Hidden Layer_ and/or the number of _Hidden Layers_.\n", + "\n", + "However, the _baseline_ Neural Network already reached a reasonable accuracy. I therefore leave improving the accuracy of this Neural Network as an exercise for the students." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_ISodOu-Kmp3" + }, + "source": [ + "### 9. Make predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_xgdL1W4vUrN" + }, + "source": [ + "[**Python**] - Command (predict_classes() is no longer available in recent Tensorflow 2.x releases, so we threshold the output of predict() at 0.5):\n", + "* (RN.predict(X_treinamento) > 0.5).astype('int32');\n", + "* (RN.predict(X_teste) > 0.5).astype('int32')." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0qun1-vOKmp4" + }, + "source": [ + "y_pred = (RN.predict(X_teste) > 0.5).astype('int32')\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I7sRwTWGKmp8" + }, + "source": [ + "y_teste[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AvywP0nZMtA-" + }, + "source": [ + "### 10. 
Conclusions\n", + "\n", + "We developed a Neural Network able to identify sex (_Gender_) with accuracy= 0.9120 (the exact value may vary with the configuration and the run)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g5qOWxPczM1O" + }, + "source": [ + "___\n", + "# **EXAMPLE 2: Distinguishing genuine banknotes from forged ones**\n", + "\n", + "* The following example was taken from the [OpenML](https://www.openml.org/home) site. The problem is to distinguish genuine banknotes from forged ones. The data were extracted from images of genuine and forged banknotes. For digitization, an industrial camera normally used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale images with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from these images.\n", + "\n", + "* This is the address of the dataframe: https://www.openml.org/d/1462;\n", + "* Description of the variables - [banknote authentication Data Set](https://archive.ics.uci.edu/ml/datasets/banknote+authentication).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nup7tuLc5kYy" + }, + "source": [ + "> Next, we develop a Neural Network using _Tensorflow_/_Keras_ to classify forged and genuine banknotes. In this application, we follow the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Pre-process and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the Neural Network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the Neural Network;\n", + "6. Fit the Neural Network;\n", + "7. Evaluate the performance of the Neural Network;\n", + "8. _Fine tuning_ of the Neural Network;\n", + "9. Make predictions with the Neural Network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YHi73Pbq5vvU" + }, + "source": [ + "### 0. 
Load the Python libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZsZW7_Ev5vvY" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1U4OySJw5vvb" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lAaecKoj5vv5" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5lPEsFy45vv6" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uNvl-o5w5vvo" + }, + "source": [ + "[**Python**] - Set the number of decimal places used when printing NumPy arrays" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VqRIBc1J5vvp" + }, + "source": [ + "np.set_printoptions(precision = 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8jo3Y9Hs5vwD" + }, + "source": [ + "### 1. 
Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tL7k4X--5vwE" + }, + "source": [ + "[**Python**] - Load the data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RTuRMwld5vwG" + }, + "source": [ + "df_cedulas= pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/Banknote-authentication-dataset.csv')\n", + "df_cedulas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "501b-Zv38ce7" + }, + "source": [ + "[**Python**] - Convert the column names of the dataframe to lowercase:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lBKafqZR8jFb" + }, + "source": [ + "df_cedulas.columns= df_cedulas.columns.str.lower()\n", + "df_cedulas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HBNIjNaT5vwM" + }, + "source": [ + "[**Python**] - Show how many classes the target variable has:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c7mZgLDl5vwO" + }, + "source": [ + "df_cedulas['class'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MG4q-8nf2GS_" + }, + "source": [ + "[**Python**] - Redefine the target variable:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IA7f1C4e1zOS" + }, + "source": [ + "def Redefinir_label(row):\n", + "    if row['class']== 1:\n", + "        return 0\n", + "    else:\n", + "        return 1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2DkBD1FU1zOo" + }, + "source": [ + "df_cedulas['class']= df_cedulas.apply(Redefinir_label, axis= 1)\n", + "df_cedulas.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I0j5o4Iu5vwT" + }, + "source": [ + "[**Python**] - Show the distribution of the target variable:" + ] + }, + { + 
"cell_type": 
"code", + "metadata": { + "id": "32EZZ8eP5vwV" + }, + "source": [ + "sns.countplot(x=\"class\", data= df_cedulas)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dV8A71C55vwb" + }, + "source": [ + "### 2. Pre-process and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Eg-4TkYSXvuo" + }, + "source": [ + "[**Python**] - Define the arrays X_cedulas and y_cedulas:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vn2yMB80Xvux" + }, + "source": [ + "X_cedulas = df_cedulas.copy()\n", + "X_cedulas = X_cedulas.drop(columns = ['class'])\n", + "y_cedulas = df_cedulas['class'].values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nckf3bieXvvC" + }, + "source": [ + "[**Python**] - Standardize the data - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CFTlvOcRXvvE" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "\n", + "X_cedulas = SS.fit_transform(X_cedulas)\n", + "X_cedulas" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q_ouZ1it5vwz" + }, + "source": [ + "### 3. 
Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T6NrbTvd5vw1" + }, + "source": [ + "[**Python**] - Define the training and validation samples of the Neural Network:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xw3ZZ2fR5vw1" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_cedulas, y_cedulas, test_size = 0.1, random_state = 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "trfqJbUg5vw8" + }, + "source": [ + "print(f'X: Treinamento = {X_treinamento.shape}; X: Teste = {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "duDt1c7i5vxB" + }, + "source": [ + "print(f'Y: Treinamento = {y_treinamento.shape}; Y: Teste = {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e4TKmGtr5vxM" + }, + "source": [ + "### 4. 
Define the Neural Network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f6EQymRK5vxO" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in the hidden layer (_Hidden Layer_);\n", + "* FA: Activation function;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JlUflDN3YkG7" + }, + "source": [ + "X_treinamento.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bRsHYQp05vxO" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I = X_treinamento.shape[1]\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O = 1\n", + "\n", + "# Number of neurons in Hidden Layer 1:\n", + "N_H1 = 8\n", + "\n", + "# Number of neurons in Hidden Layer 2:\n", + "N_H2 = 8\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H = tf.keras.activations.swish\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O = tf.keras.activations.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sSOj8_9n5vxU" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of the results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wFYGSoKH5vxU" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WG1isER05vxZ" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A6IPYp8l5vxa" + }, + 
"source": [ + "**Notes**: \n", + "\n", + "To avoid problems related to _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for initializing and constraining the weights $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + "  * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + "  * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu' and kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + "  * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + "  * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sCRp8O4V5vxa" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN = Sequential()\n", + "RN.add(Dense(units = N_H1, input_dim = N_I, activation = FA_H, kernel_initializer = tf.keras.initializers.GlorotUniform(1)))#, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units = N_O, activation = FA_O))\n", + "\n", + "# Summary of the Neural Network architecture (summary() prints the table itself)\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Titw0r-d5vxh" + }, + "source": [ + "### 5. 
Compile the Neural Network\n", + "\n", + "This is a binary classification problem." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oVQsayDq5vxi" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "u686jTkd5vxj" + }, + "source": [ + "Algoritmo_Opt = tf.keras.optimizers.Adam()\n", + "Loss_Function = tf.keras.losses.BinaryCrossentropy()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss = Loss_Function, metrics = [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BKN3oCa65vxn" + }, + "source": [ + "### 6. Fit the Neural Network\n", + "\n", + "At this stage, we need to specify:\n", + "* **Epoch**: The number of epochs is a hyperparameter of _Gradient Descent_ that defines how many complete passes over the training dataframe are made while updating the weights $W$. One epoch means that every sample in the training dataframe has contributed to updating the weights $W$ once.\n", + "* **Batch**: number of samples processed by the Neural Network before each update of the weights $W$;\n", + "\n", + "#### Example\n", + "Suppose we have a dataframe with 1,000 rows (instances) and we choose _epoch_= 1,000 and _batch_= 5. This means the dataframe is split into $\\frac{1000}{5}= 200$ _batches_ of 5 rows each. The weights $W$ are therefore updated after every 5 instances (rows), that is, 200 times per _epoch_ and 200,000 times over the 1,000 _epochs_.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vLHQdKsi5vxn" + }, + "source": [ + "Note: The callbacks option below implements the concept of _early stopping_. This option stops the training of the Neural Network before the maximum number of _epochs_ is reached when the model stops improving, as measured by the val_loss metric. 
The parameter _patience_= k means that the optimization process stops after k consecutive _epochs_ without any improvement in the performance of the Neural Network." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q6UMutI45vxp" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2YhUEbTC5vxq" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 3, min_delta = 0.001)]\n", + "\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 100, validation_data = (X_teste, y_teste), callbacks = callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "jFmtvTwd5vxu" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "X8Lu0jh55vxz" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sKh0f7Mc5vx4" + }, + "source": [ + "### 7. Evaluate the performance of the Neural Network\n", + "\n", + "To evaluate the Neural Network, we simply pass the test samples: X_teste and y_teste.\n", + "\n", + "The evaluate() function returns a list with 2 values: loss and accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p7nsNQoX5vx5" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B1OvhTbf5vx6" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z8v2aody5vx-" + }, + "source": [ + "### 8. 
Make predictions with the Neural Network\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0FQy0bZT5vx_" + }, + "source": [ + "[**Python**] - Command (predict_classes() is no longer available in recent Tensorflow 2.x releases, so we threshold the output of predict() at 0.5):\n", + "* (RN.predict(X_treinamento) > 0.5).astype('int32');\n", + "* (RN.predict(X_teste) > 0.5).astype('int32').\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n8e327A_5vx_" + }, + "source": [ + "y_pred = (RN.predict(X_teste) > 0.5).astype('int32')\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TiVyZ-CG5vyE" + }, + "source": [ + "y_teste[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PawhHD_35vyI" + }, + "source": [ + "### Conclusions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RcIh4qua_eEU" + }, + "source": [ + "___\n", + "# **APPLICATION 1 - Neural Network to identify species (Iris Dataframe)**\n", + "\n", + "> Next, we develop a Neural Network using _Tensorflow_/_Keras_ to classify flowers (Iris). In this application, we follow the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Pre-process and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the Neural Network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the Neural Network;\n", + "6. Fit the Neural Network;\n", + "7. Evaluate the performance of the Neural Network;\n", + "8. _Fine tuning_ of the Neural Network;\n", + "9. Make predictions with the Neural Network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eXRYOpPR4XF4" + }, + "source": [ + "### 0. 
Load the Python libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pa0ir9C_dgOO" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yNYF_qzydgOR" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ird1VzZudgOU" + }, + "source": [ + "[**Python**] - Set the number of decimal places used when printing NumPy arrays" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lwj9CGzEdgOV" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zmN5HGLOdgOa" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VI86wuv9dgOa" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NoKZnsJpRA8o" + }, + "source": [ + "Good, we are using TensorFlow 2.x." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xkLZgdkjavO-" + }, + "source": [ + "### 1. 
Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b7LLQyA3vgBG" + }, + "source": [ + "[**Python**] - Load the data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ACzNibyKAkx_" + }, + "source": [ + "df_Iris= pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/Iris.csv', index_col= 'Id')\n", + "df_Iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T3Vy41RL-lAQ" + }, + "source": [ + "[**Python**] - Fix or rename the dataframe columns:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xN7nBQWg-lAX" + }, + "source": [ + "df_Iris.columns= df_Iris.columns.str.lower()\n", + "df_Iris.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Em1wLwdzvkgh" + }, + "source": [ + "[**Python**] - Show how many classes the target variable has:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QhuoPcRuA9Do" + }, + "source": [ + "df_Iris['species'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lvWcxUvru50G" + }, + "source": [ + "[**Python**] - Show the distribution of the target variable 'species':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HMpJiMWJu50J" + }, + "source": [ + "j = sns.countplot(x=\"species\", data= df_Iris)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a55-G14aa_wG" + }, + "source": [ + "### 2. 
Preprocess and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uos9OyewvyMo" + }, + "source": [ + "[**Python**] - Apply the LabelEncoder() transformation to the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X0V0hWnBg0so" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "LE = LabelEncoder()\n", + "\n", + "species_encoded= LE.fit_transform(df_Iris['species'])\n", + "df_Iris= df_Iris.drop(columns= ['species'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d8L-b9gZwB4L" + }, + "source": [ + "[**Python**] - Define the y_Iris array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zMp2_hJ1wGm3" + }, + "source": [ + "y_Iris= tf.keras.utils.to_categorical(species_encoded)\n", + "y_Iris[:5]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qfFg6cWdv9El" + }, + "source": [ + "[**Python**] - Define the X_Iris array:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7coeWhVRjiQl" + }, + "source": [ + "X_Iris= df_Iris.values\n", + "X_Iris[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cUa2sJOSbUFO" + }, + "source": [ + "### 3. 
Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kDOw-RHux1nS" + }, + "source": [ + "[**Python**] - Define the neural network's training and validation samples:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AJE_6w3KL_2O" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste= train_test_split(X_Iris, y_Iris, test_size= 0.2, random_state= 20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NKHTG5IP9nVj" + }, + "source": [ + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "qe2mHJhb-PIY" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wHFI_bLXPPvl" + }, + "source": [ + "### 4. 
Define the neural network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zRYoZ7hwgejR" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in the hidden layer (_Hidden Layer_);\n", + "* FA: Activation function;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mjF1haRmgejS" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I= X_treinamento.shape[1]\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O= y_Iris.shape[1]\n", + "\n", + "# Number of neurons in the Hidden Layer (the model definition below reads N_H):\n", + "N_H= 32\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O= tf.nn.softmax" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tGeDaB3oo02k" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zGS15afAo02n" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iT9w2tUCo5-X" + }, + "source": [ + "[**Python**] - Define the neural network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nytcmC4BkSz1" + }, + "source": [ + "**Notes**: \n", + "\n", + "To avoid problems related to _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for initializing and constraining the weights $W$:\n", + "\n", + "* [How 
to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + "    * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + "    * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu', together with kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + "    * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + "    * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LnLeLMmZoUjU" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O))\n", + "\n", + "# Summary of the neural network architecture (summary() prints it directly)\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3eT-EHUecTj3" + }, + "source": [ + "### 5. Compile the neural network\n", + "\n", + "This is a multi-class classification problem (> 2 classes) with one-hot encoded targets. Therefore, we use:\n", + "* loss= tf.keras.losses.CategoricalCrossentropy();\n", + "* metrics= tf.keras.metrics.categorical_accuracy;\n", + "* optimizer= tf.keras.optimizers.Adam()." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fsq0aEtwyAAM" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NDStOKqhcRf4" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.CategoricalCrossentropy()\n", + "# categorical_accuracy compares the argmax of the prediction with the argmax of the one-hot target\n", + "Metrics_Perf = tf.keras.metrics.categorical_accuracy\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hZFu65TecabN" + }, + "source": [ + "### 6. Fit the neural network\n", + "\n", + "At this stage, we need to specify:\n", + "* **Epoch**: the number of epochs is a hyperparameter of _Gradient Descent_ that sets how many complete passes over the training dataframe are used to update the weights $W$. One epoch means every sample in the training dataframe has contributed to updating $W$ once.\n", + "* **Batch**: the number of samples the neural network processes before each update of the weights $W$;\n", + "\n", + "#### Example\n", + "Suppose we have a dataframe with 1,000 rows (instances) and choose _epoch_= 1,000 and _batch_= 5. The dataframe is then split into $\\frac{1000}{5}= 200$ _batches_, so the weights $W$ are updated after every 5 instances (rows), that is, 200 times per _epoch_.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "boIs266gaZt1" + }, + "source": [ + "Note: the callbacks option below implements _early stopping_. It stops the training of the neural network before the configured number of _epochs_ is reached once the model stops improving, as measured by val_loss. The parameter _patience_= k means the optimization stops after k consecutive _epochs_ without any improvement in that metric." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LpR3dXRZ-jom" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9hDxEwHjca8V" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 3, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 100, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JF7EC-g82Hho" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ea0HHBsY2NZ5" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pqVJMX3xchLF" + }, + "source": [ + "### 7. Evaluate the neural network's performance\n", + "\n", + "To evaluate the neural network, we simply pass the test samples: X_teste and y_teste.\n", + "\n", + "The evaluate() function returns a list with 2 values: loss and accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hKbO1nT0yQM1" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cYqDY9V9chcZ" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qchEpyipcnbE" + }, + "source": [ + "### 8. Make Predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9X2SZ5fx2_s5" + }, + "source": [ + "[**Python**] - Command:\n", + "* RN.predict_classes(X_treinamento);\n", + "* RN.predict_classes(X_teste)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FR6ySksLhRvR" + }, + "source": [ + "y_pred = RN.predict_classes(X_teste)\n", + "y_pred" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "LWnRgBbmmlA2" + }, + "source": [ + "y_teste" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wjCtiKieE_TT" + }, + "source": [ + "### Conclusions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LxMzVIGzzGHH" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1qjSoM5quog1" + }, + "source": [ + "___\n", + "# **APPLICATION 2 - Neural network to identify the wine type (_Red or White_)**\n", + "\n", + "> In this application, we use the [wine quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) dataframe from the UCI Machine Learning Repository. Our goal is to predict the wine type (red or white) from its chemical properties.\n", + "\n", + "Again, we build a neural network using _Tensorflow_/_Keras_ and follow the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Preprocess and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the neural network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the neural network;\n", + "6. Fit the neural network;\n", + "7. Evaluate the neural network's performance;\n", + "8. _Fine tuning_ of the neural network;\n", + "9. Make predictions with the neural network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VuG8zbYS4jyx" + }, + "source": [ + "### 0. 
Load the Python libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O3XD4otFd9Ht" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I5Ok5fhid9Hv" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "\n", + "# tf is referenced directly in later cells (tf.__version__, tf.keras, tf.nn)\n", + "import tensorflow as tf\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZBXRTgnCd9Hz" + }, + "source": [ + "[**Python**] - Set the number of decimal places" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NFrOyNUgd9Hz" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cGxW2zn6q8xY" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C36Z6vGD4jy8" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kWJXP5diof5G" + }, + "source": [ + "### 1. 
Load the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IJJe4r_ITDzv" + }, + "source": [ + "[**Python**] - Load the data:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jIjqYXlH4tyG" + }, + "source": [ + "from sklearn.datasets import load_wine\n", + "Wine= load_wine()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5iU33wKQGrFb" + }, + "source": [ + "df_Red= pd.read_table(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\", sep=';')\n", + "df_Red.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Liy2KSr6HJUX" + }, + "source": [ + "df_White= pd.read_table('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep= ';')\n", + "df_White.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4TzUnuM4TMmJ" + }, + "source": [ + "[**Python**] - Show the number of rows and columns of each dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pHNwVRQyIwdv" + }, + "source": [ + "print(f'Dimensão de df_Red: {df_Red.shape}; Dimensão de df_White: {df_White.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tBWzMsUHJjTU" + }, + "source": [ + "[**Python**] - Build the target variable 'type_wine' (wine type):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "B0WmX3FYJqMR" + }, + "source": [ + "df_Red['type_wine']= 0\n", + "df_White['type_wine']= 1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EVUc1vbqJ0D9" + }, + "source": [ + "[**Python**] - Stack the two dataframes, df_Red and df_White:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gS9SlpvaJ2Ex" + }, + "source": [ + "df_Wine= 
pd.concat([df_Red, df_White], ignore_index=True)\n", + "df_Wine.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dM19HHENKMtq" + }, + "source": [ + "df_Wine.tail()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mpoNmcqpeQxO" + }, + "source": [ + "[**Python**] - Show the number of rows and columns of the df_Wine dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YJDazoMMeMet" + }, + "source": [ + "df_Wine.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tNyjXmFzo7oe" + }, + "source": [ + "### 2. Preprocess and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xcV_dlHsTpiP" + }, + "source": [ + "[**Python**] - Rename the columns using lowercase letters:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Jau2OybudGc0" + }, + "source": [ + "df_Wine.columns = df_Wine.columns.str.strip().str.lower().str.replace(' ', '_')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "EzOWiaCFdq_C" + }, + "source": [ + "df_Wine.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F1wgkLcvete8" + }, + "source": [ + "[**Python**] - Descriptive statistics of the df_Wine dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A-EbHitYepeQ" + }, + "source": [ + "df_Wine.describe()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uRo_7tOjex3Z" + }, + "source": [ + "#### _Missing Values Handling_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wJ3K_FmCUCsk" + }, + "source": [ + "[**Python**] - Show the number of _missing values_ in the df_Wine dataframe:" + ] + }, + { + "cell_type": "code", + "metadata": { + 
"id": "PODgu5B6e5qQ" + }, + "source": [ + "df_Wine.info()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "szH_TGJZecXp" + }, + "source": [ + "As you can see, the df_Wine dataframe contains no _missing values_." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "64znO089vF_2" + }, + "source": [ + "[**Python**] - Show the distribution of the target variable 'type_wine':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ztQ17lhWfdAu" + }, + "source": [ + "df_Wine['type_wine'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zDRVQy5avF_4" + }, + "source": [ + "j = sns.countplot(x=\"type_wine\", data= df_Wine)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PxexO1Uyeqnj" + }, + "source": [ + "As you can see, we have 6,497 instances, of which 4,898 (75%) are type_wine= 1 (_white_) and 1,599 (25%) are type_wine= 0 (_red_). The dataframe is therefore imbalanced. This Notebook does not address the issue; checking the effects of imbalanced samples is left as an exercise for the students." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kr_HZnF0UUh3" + }, + "source": [ + "[**Python**] - Define the X_Wine and y_Wine arrays:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HNvImi4djc-3" + }, + "source": [ + "X_Wine= df_Wine.copy()\n", + "X_Wine= X_Wine.drop(columns= ['type_wine'])\n", + "y_Wine= df_Wine['type_wine'].values" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ViMFMd_SlE3x" + }, + "source": [ + "[**Python**] - Standardize the data - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OurlnbX3hzSb" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# Define the scaler\n", + "SS = StandardScaler()\n", + "\n", + "# Scale the dataframe (zero mean and unit variance per column)\n", + "X_Wine = SS.fit_transform(X_Wine)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KFkJ7b4U1x3D" + }, + "source": [ + "### 3. Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MBqzBcDL2JH4" + }, + "source": [ + "[**Python**] - Define the training and validation samples" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bCQUxbbxhVTM" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_Wine, y_Wine, test_size= 0.2, random_state= 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "HHmTtQ3F9A9c" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pPW0aWd4jw0o" + }, + "source": [ + "### 4. 
Define the neural network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ea4WTsP-hCqS" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in the hidden layer (_Hidden Layer_);\n", + "* FA: Activation function;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y1pVFjFThCqU" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I= X_Wine.shape[1]\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O= 1\n", + "\n", + "# Number of neurons in the Hidden Layer (the model definition below reads N_H):\n", + "N_H= 7\n", + "\n", + "# Activation function of the Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Activation function of the Output Layer:\n", + "# sigmoid keeps the output in [0, 1], which BinaryCrossentropy expects by default\n", + "FA_O= tf.nn.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "url8178EpUNO" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducibility of results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MU-8uWn-pUNQ" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6TLUirr6oXv1" + }, + "source": [ + "[**Python**] - Define the neural network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8HOQIkQ3beJm" + }, + "source": [ + "**Notes**: \n", + "\n", + "To avoid problems related to _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for initializing and constraining the weights $W$:\n", + "\n", + "* [How to Reduce 
Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + "    * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + "    * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu', together with kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + "    * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + "    * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Bj5uJvgaj43u" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O))\n", + "\n", + "# Summary of the neural network architecture (summary() prints it directly)\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LTs8xbKx2O3Z" + }, + "source": [ + "### 5. Compile the neural network\n", + "\n", + "This is a binary classification problem (_red_ or _white_). 
Therefore, we use:\n", + "* loss= tf.keras.losses.BinaryCrossentropy();\n", + "* metrics= tf.keras.metrics.binary_accuracy;\n", + "* optimizer= tf.keras.optimizers.Adam().\n", + "\n", + "> **Remember**: if this were a multi-class classification problem (> 2 classes), we would use loss= tf.keras.losses.CategoricalCrossentropy()." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kUpBKJkl07KH" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qGZ1bKCo2L7A" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.BinaryCrossentropy()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jBhkSc582ROC" + }, + "source": [ + "### 6. Fit the neural network\n", + "\n", + "Note: the callbacks option below implements _early stopping_. It stops the training of the neural network before the configured number of _epochs_ is reached once the model stops improving, as measured by val_loss. The parameter _patience_= k means the optimization stops after k consecutive _epochs_ without any improvement in that metric." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ls8FfHz0z2lX" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5mpbdiuJ2d1W" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 10, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sALJp9FQ2WGC" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "cWR2rMLg2ZlQ" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ciXdkbVr2VDG" + }, + "source": [ + "### 7. Evaluate the neural network's performance" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RS4_cV5TyamK" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hkkxiqPe2gZK" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m8Vs47VV4RsV" + }, + "source": [ + "The neural network has accuracy= 0.9946." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RHQoDk533TRX" + }, + "source": [ + "[**Python**] - Command:\n", + "* RN.predict_classes(X_treinamento);\n", + "* RN.predict_classes(X_teste)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vlWfMR-k7Qc1" + }, + "source": [ + "y_pred = RN.predict_classes(X_teste)\n", + "y_pred" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mCpp-MO0nXmj" + }, + "source": [ + "Below are other measures for evaluating the neural network's performance:\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oWF-wWfTnjQh" + }, + "source": [ + "# Import the modules from sklearn.metrics\n", + "from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, cohen_kappa_score" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MHD0CvGQnm0v" + }, + "source": [ + "# Precision \n", + "precision_score(y_teste, y_pred)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ERoX4BeQnosC" + }, + "source": [ + "# Recall\n", + "recall_score(y_teste, y_pred)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "l_2bS23Inq1p" + }, + "source": [ + "# F1 score\n", + "f1_score(y_teste,y_pred)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rLM1BSK2QfZm" + }, + "source": [ + "Below is the confusion matrix:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bGAigaBMQfZn" + }, + "source": [ + "Mostra_ConfusionMatrix()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H3yUb9tP2YfE" + }, + "source": [ + "### 8. Make Predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CupPMVRo3a5q" + }, + "source": [ + "[**Python**] - Command:\n", + "* RN.predict_classes(X_treinamento);\n", + "* RN.predict_classes(X_teste)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GePOr4EJ3mfn" + }, + "source": [ + "y_pred = RN.predict_classes(X_teste)\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "aO79VSBS3mfv" + }, + "source": [ + "y_teste[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oHlDHVp_KcLn" + }, + "source": [ + "### 10. Conclusion\n", + "\n", + "We built a neural network with 1 _Hidden Layer_ that reached an accuracy of 0.998." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xr2VzmwPFRdr" + }, + "source": [ + "___\n", + "# **EXERCISE 1**: Predict the wine quality\n", + "\n", + "In this exercise, we treat the quality variable as multiple classes.\n", + "\n", + "Again, we build a neural network using _Tensorflow_/_Keras_ and follow the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Preprocess and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the neural network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the neural network;\n", + "6. Fit the neural network;\n", + "7. Evaluate the neural network's performance;\n", + "8. _Fine tuning_ of the neural network;\n", + "9. Make predictions with the neural network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MYpF4TtpvXp4" + }, + "source": [ + "[**Python**] - Show the distribution of the target variable 'quality':" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WmeFj14PvXp6" + }, + "source": [ + "j = sns.countplot(x=\"quality\", data= df_Wine)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "smWNCLsMQWBI" + }, + "source": [ + "Many classes... Below, we use KMeans to choose the number of clusters." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Co4GNVexVUoP" + }, + "source": [ + "# Function adapted from: https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/\n", + "\n", + "from sklearn.cluster import KMeans\n", + "from scipy.spatial.distance import cdist\n", + "\n", + "def Numero_Clusters_Elbow(X):\n", + "    distortions = []\n", + "    inertias = []\n", + "    mapping1 = {}\n", + "    mapping2 = {}\n", + "    K = range(1, 10)\n", + "    for k in K:\n", + "        # Build and fit the model (a single fit is enough)\n", + "        kmeanModel = KMeans(n_clusters=k).fit(X)\n", + "        # Distortion: mean Euclidean distance from each sample to its nearest centroid\n", + "        distortion = sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0]\n", + "        distortions.append(distortion)\n", + "        inertias.append(kmeanModel.inertia_)\n", + "        mapping1[k] = distortion\n", + "        mapping2[k] = kmeanModel.inertia_\n", + "\n", + "    # Using the different values of Distortion\n", + "    print('Cálculo da Distorção:')\n", + "    for key, val in mapping1.items():\n", + "        print(str(key)+' : '+str(val))\n", + "\n", + "    plt.plot(K, distortions, 'bx-')\n", + "    plt.xlabel('Values of K')\n", + "    plt.ylabel('Distortion')\n", + "    plt.title('The Elbow Method using Distortion')\n", + "    plt.show()\n", + "\n", + "    # Using the different values of Inertia\n", + "    print('Cálculo da Inertia:')\n", + "    for key, val in mapping2.items():\n", + "        print(str(key)+' : '+str(val))\n", + "\n", + "    plt.plot(K, inertias, 'bx-')\n", + "    plt.xlabel('Values of K')\n", + "    plt.ylabel('Inertia')\n", + "    plt.title('The Elbow Method using Inertia')\n", + "    plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eTA4O3yII-am" + }, + "source": [ + "X_WineQ= df_Wine.copy()\n", + "X_WineQ= X_WineQ.drop(columns= ['quality'])\n", + "X_WineQ= X_WineQ.values\n", + "#y_WineQ= df_Wine['quality2'].values\n", + "X_WineQ" + 
], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IJDCLdsalKvS" + }, + "source": [ + "[**Python**] - Standardize the data - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7D9qCW6SYKwA" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# Fit the scaler\n", + "SS = StandardScaler().fit(X_WineQ)\n", + "\n", + "# Scale the data\n", + "X_WineQ = SS.transform(X_WineQ)\n", + "X_WineQ" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "naLIn_ufIZtL" + }, + "source": [ + "Numero_Clusters_Elbow(X_WineQ)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "At5KtD-DV0f6" + }, + "source": [ + "The plots above indicate that the optimal number of _clusters_ for the dataframe df_Wine is 3. We will therefore work with n_cluster= 3 from here on." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-Oz9kAvh5yPk" + }, + "source": [ + "I suggest the following grouping:\n", + "* If quality= 1, 2, 3 $\\Longrightarrow$ quality2= 1 (poor quality);\n", + "* If quality= 4, 5, 6, 7 $\\Longrightarrow$ quality2= 2 (average quality);\n", + "* If quality= 8, 9, 10 $\\Longrightarrow$ quality2= 3 (excellent quality)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nZc2OQR87E9n" + }, + "source": [ + "def define_quality2(row):\n", + "    # Map the original quality score onto three classes\n", + "    if row['quality'] <= 3:\n", + "        return 1\n", + "    elif row['quality'] <= 7:\n", + "        return 2\n", + "    else:\n", + "        return 3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "f8zbLsN27E9t" + }, + "source": [ + "df_Wine['quality2']= df_Wine.apply(define_quality2, axis= 1)\n", + "df_Wine.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "7ugFbdVm77wo" + }, + "source": [ + "df_Wine['quality2'].value_counts()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xf3RYZTwEtAD" + }, + "source": [ + "[**Python**] - Correlation analysis:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dimuJxsPEsAt" + }, + "source": [ + "corr = df_Wine.corr()['quality2'].drop('quality2')\n", + "print(corr)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "g9KALiQsEsAx" + }, + "source": [ + "plt.figure(figsize=(12,10))\n", + "cor = df_Wine.corr()\n", + "sns.heatmap(cor, annot=True, linewidths=0, vmin=-1, cmap=\"RdBu_r\")\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "At1XAt6AU2oh" + }, + "source": [ + "[**Python**] - Apply the LabelEncoder transformation:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kld7QvCaH1YQ" + }, + "source": [ + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "LE = LabelEncoder()\n", + "LE.fit(df_Wine['quality2'])\n", + "quality_Encoded= LE.transform(df_Wine['quality2'])\n", + "\n", + "y_WineQ= tf.keras.utils.to_categorical(quality_Encoded)\n", + "y_WineQ[:5]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "7qqY5vxeGEMC" + }, + "source": [ + "X_WineQ= df_Wine.copy()\n", + "# Note: 'quality' stays among the features; quality2 is derived from it, so consider dropping it too to avoid target leakage\n", + "X_WineQ= X_WineQ.drop(columns= 'quality2')\n", + "X_WineQ.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oKnfuTQFlQLB" + }, + "source": [ + "[**Python**] - Standardize the data - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_VSNAOQzGEMG" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# Fit the scaler\n", + "SS = StandardScaler().fit(X_WineQ)\n", + "\n", + "# Scale the data\n", + "X_WineQ = SS.transform(X_WineQ)\n", + "X_WineQ" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zE5SV0ET-Zwv" + }, + "source": [ + "[**Python**] - Apply PCA to reduce the features to the most important principal components:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I3IRWaIj-feA" + }, + "source": [ + "from sklearn.decomposition import PCA\n", + "\n", + "pca = PCA()\n", + "X_pca = pca.fit_transform(X_WineQ)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NbJ6NBcAVPO1" + }, + "source": [ + "[**Python**] - Proportion of variance explained by each principal component" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kOqLh5BgVVN0" + }, + "source": [ + "pca.explained_variance_ratio_" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7j_cYnbSVGaT" + }, + "source": [ + "[**Python**] - Cumulative proportion of variance across the components:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Ux0lAfy2Ar6g" + }, + "source": [ + "pca.explained_variance_ratio_.cumsum()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1UzYqn7-A5He" + }, + "source": [ + "As shown above, 8 principal components account for more than 93% of the variance. We therefore keep 8 components to train the Neural Network." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "RzekEmIMBqe1" + }, + "source": [ + "pca8 = PCA(n_components= 8)\n", + "X_pca8 = pca8.fit_transform(X_WineQ)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DYDhJYdV-pSb" + }, + "source": [ + "Next, the plot of the cumulative explained variance by principal component." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_zUN_3wR-nOX" + }, + "source": [ + "plt.figure(figsize=(7, 7))\n", + "plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')\n", + "plt.grid()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YO3Umy7sGEMJ" + }, + "source": [ + "### 3. Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e2gXueuI2Nb4" + }, + "source": [ + "[**Python**] - Define the training and validation samples" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "giFgdIK7GEMJ" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_pca8, y_WineQ, test_size= 0.2, random_state= 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nr4sryEr-eOy" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l2KSZ8TQGEMM" + }, + "source": [ + "### 4. 
Define the Neural Network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tQyKOA8Qhgtd" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in each hidden layer (_Hidden Layer_);\n", + "* FA: Activation function." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "62E0wT8hhgtd" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I= X_pca8.shape[1]\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O= y_WineQ.shape[1]\n", + "\n", + "# Number of neurons in each Hidden Layer:\n", + "N_H1= 32\n", + "N_H2= 32\n", + "N_H3= 32\n", + "N_H4= 32\n", + "N_H5= 32\n", + "N_H6= 32\n", + "\n", + "# Activation function of the Hidden Layers:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Activation function of the Output Layer:\n", + "FA_O= tf.nn.softmax" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wT0kLyuGpZ4C" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducible results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C3Wdq7XvpZ4F" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_6ZBeC4EnZyQ" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vjUyPpoJbg20" + }, + "source": [ + "**Notes**: \n", + "\n", + "To mitigate _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for initializing and constraining the weights $W$:\n", 
+ "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu', together with kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ry43SCP4nbvp" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H2, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H3, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H4, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= 
tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H5, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H6, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units= N_O, activation= FA_O)) # Mind the activation function!\n", + "\n", + "# Summary of the Neural Network architecture (summary() prints by itself)\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zQwY5qQLGEMR" + }, + "source": [ + "### 5. Compile the Neural Network\n", + "\n", + "This is a multi-class classification problem (> 2 classes). We therefore use:\n", + "* loss= tf.keras.losses.CategoricalCrossentropy();\n", + "* optimizer= tf.keras.optimizers.Adam();\n", + "* metrics= tf.keras.metrics.categorical_accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HVp9pT8n1AQC" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "el86VgZ-GEMS" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.CategoricalCrossentropy()\n", + "# categorical_accuracy compares the argmax of one-hot targets and softmax outputs\n", + "Metrics_Perf = tf.keras.metrics.categorical_accuracy\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qNyxbvjsGEMV" + }, + "source": [ + "### 6. Fit the Neural Network\n", + "\n", + "Note: The callbacks option below implements _early stopping_. It halts training before the configured number of _epochs_ is reached once the model stops improving, as measured by val_loss. 
The parameter _patience_= k means that training stops after k consecutive _epochs_ without an improvement of at least min_delta in the monitored metric." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3ZtHza72z_q9" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tikYg2CrGEMV" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 100, validation_data= (X_teste, y_teste), batch_size= 12, callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "w6aoEBw92p86" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ei7XWVel2qNW" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tovtcp98GEMa" + }, + "source": [ + "### 7. Evaluate the performance of the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DAC9nKCvyiKu" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZmhkMbmpGEMb" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JjON99B1B9F0" + }, + "source": [ + "* Neural Network with Accuracy= 0.9692 WITHOUT PCA.\n", + "* Neural Network with Accuracy= 0.9677 WITH PCA.\n", + "\n", + "These figures appear to have been recorded with the binary_accuracy metric, which scores each one-hot output element separately, so they are likely higher than a per-sample categorical accuracy would be." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_VzIx2_EGEMp" + }, + "source": [ + "### 8. 
Make predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jpLRlBTI3gvF" + }, + "source": [ + "[**Python**] - Command (class labels are the argmax of the predicted probabilities):\n", + "* RN.predict(X_treinamento);\n", + "* RN.predict(X_teste)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1c1K5vGjGEMq" + }, + "source": [ + "# predict_classes was removed in TensorFlow 2.6; take the argmax of the softmax probabilities instead\n", + "y_pred = np.argmax(RN.predict(X_teste), axis= -1)\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5b-hDZ04GEMs" + }, + "source": [ + "# y_teste is one-hot encoded; convert it to class indices to compare with y_pred\n", + "np.argmax(y_teste[:10], axis= -1)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h3mgEOcXu20F" + }, + "source": [ + "___\n", + "# **APPLICATION 3 - Neural Network to Identify Breast Cancer (_Breast Cancer Dataframe_)**\n", + "\n", + "Source: [Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)\n", + "\n", + "We will develop a Neural Network using _Tensorflow_/_Keras_, following the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Preprocess and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the Neural Network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the Neural Network;\n", + "6. Fit the Neural Network;\n", + "7. Evaluate the performance of the Neural Network;\n", + "8. _Fine tuning_ of the Neural Network;\n", + "9. Make predictions with the Neural Network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QVZQ1meF9l_-" + }, + "source": [ + "### 0. 
Load Python libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cd4zQSkseKzv" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OzMsBroueKzx" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wJSbJx9rrBG3" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "alCVjy_JCWUo" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-Ra_PjiFeKz0" + }, + "source": [ + "[**Python**] - Set the number of decimal places" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y4I2Eh2YeKz1" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1nXjn5KM94lI" + }, + "source": [ + "### 1. 
Load the data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ovtqPCcWBgfI" + }, + "source": [ + "from sklearn.datasets import load_breast_cancer\n", + "\n", + "Cancer = load_breast_cancer()\n", + "X_cancer= Cancer.data\n", + "y_Cancer= Cancer.target\n", + "Col_Names= Cancer.feature_names" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5qbOErH6Dkcu" + }, + "source": [ + "Col_Names" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "weKuFIHqDWRX" + }, + "source": [ + "X_cancer" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9kix-c9TDY_R" + }, + "source": [ + "y_Cancer[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qg_IuwHp946c" + }, + "source": [ + "### 2. Preprocess and transform the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7wnkuDnMlVe8" + }, + "source": [ + "[**Python**] - Standardize the data - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uVEHT9sqG9WR" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "X_cancer = SS.fit_transform(X_cancer)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hNeVqHH295FA" + }, + "source": [ + "### 3. 
Define the training and validation samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Rbc3MwX2RDU" + }, + "source": [ + "[**Python**] - Define the training and validation samples" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xR6XSrbsFX-P" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_cancer, y_Cancer, test_size = 0.1, random_state = 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "NRQ8zW1m-g6l" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "775KoeQz95O6" + }, + "source": [ + "### 4. Define the Neural Network architecture with _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fFaKVel2ilZs" + }, + "source": [ + "[**Python**] - Define the architecture, that is:\n", + "* $N_{I}$: Number of neurons in the input layer (_Input Layer_);\n", + "* $N_{O}$: Number of neurons in the output layer (_Output Layer_);\n", + "* $N_{H}$: Number of neurons in each hidden layer (_Hidden Layer_);\n", + "* FA: Activation function." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aB7xcQFWilZs" + }, + "source": [ + "# Number of neurons in the Input Layer:\n", + "N_I= X_cancer.shape[1]\n", + "\n", + "# Number of neurons in the Output Layer:\n", + "N_O= 1\n", + "\n", + "# Number of neurons in each Hidden Layer:\n", + "N_H1= 16\n", + "N_H2= 16\n", + "N_H3= 16\n", + "\n", + "# Activation function of the Hidden Layers:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Activation function of the Output Layer:\n", + "# a single output unit needs sigmoid; softmax over one unit always returns 1.0\n", + "FA_O= tf.nn.sigmoid" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "rm1PeihTRlgz" + }, + "source": [ + "[**Python**] - Set the seeds for NumPy and Tensorflow:\n", + "> For reproducible results, use the seeds below:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U3oTC94GRlg3" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J5GBW-Ucnh9T" + }, + "source": [ + "[**Python**] - Define the Neural Network:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "my2xjCniIb4J" + }, + "source": [ + "We add two _Dropout_ layers with $p= 0.1$ to reduce _overfitting_." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kne06wmjnTYD" + }, + "source": [ + "**Notes**: \n", + "\n", + "To mitigate _overfitting_ and _Vanishing or Exploding Gradients in Deep Neural Nets_, the articles below suggest the following options for initializing and constraining the weights $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) suggests:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) suggests:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() for activation= 'tf.nn.relu' or 'tf.nn.leaky_relu', together with kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) suggests:\n", + " * kernel_initializer= 
tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s6x7mo4dnjlE" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H2, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H3, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units= N_O, activation= FA_O))\n", + "\n", + "# Summary of the Neural Network architecture (summary() prints by itself)\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "04cKraWZ9mMb" + }, + "source": [ + "### 5. Compile the Neural Network\n", + "\n", + "This is a binary classification problem (malignant or benign). We therefore use:\n", + "* loss= tf.keras.losses.BinaryCrossentropy();\n", + "* metrics= tf.keras.metrics.binary_accuracy;\n", + "* optimizer= tf.keras.optimizers.Adam()." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OGRjWcsm1FM9" + }, + "source": [ + "[**Python**] - Command modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bLIoA8FrJJCx" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.BinaryCrossentropy()\n", + "Metrics_Perf = tf.keras.metrics.binary_accuracy\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cnL12eaF9mU6" + }, + "source": [ + "### 6. Fit the Neural Network\n", + "\n", + "Note: The callbacks option below implements _early stopping_. It halts training before the configured number of _epochs_ is reached once the model stops improving, as measured by val_loss. The parameter _patience_= k means that training stops after k consecutive _epochs_ without an improvement of at least min_delta in the monitored metric." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nVsziJfk0FIv" + }, + "source": [ + "[**Python**] - Command modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "apQY6cQjJb-z" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 100, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "avd2cXpO20cY" + }, + "source": [ + "Model_Accuracy(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FCd8xFxA25Lc" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3zToEvUs-pCt" + }, + "source": [ + "### 7. Evaluate the performance of the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IqzKH7jsymwL" + }, + "source": [ + "[**Python**] - Command modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pmjuk6OqJ7zD" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "04ZGPI6DKcnz" + }, + "source": [ + "Next, the confusion matrix:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-MZyagwaKfkM" + }, + "source": [ + "Mostra_ConfusionMatrix()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KasqSFWG-pTG" + }, + "source": [ + "### 8. _Fine tuning_ of the Neural Network\n", + "\n", + "Not needed here, since we already obtained an accuracy of 0.9825." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GLxgJP3L-pdZ" + }, + "source": [ + "### 9. 
Make predictions with the Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0iXGBnNZYb4V" + }, + "source": [ + "[**Python**] - Command (class labels come from thresholding the predicted probabilities at 0.5):\n", + "* RN.predict(X_treinamento);\n", + "* RN.predict(X_teste)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nqBFwxg5Yb4b" + }, + "source": [ + "# predict_classes was removed in TensorFlow 2.6; threshold the single output probability at 0.5 instead\n", + "y_pred = (RN.predict(X_teste) > 0.5).astype('int32')\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4CHdWhgD-plr" + }, + "source": [ + "### 10. Conclusions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T2AQ4uDShdgE" + }, + "source": [ + "___\n", + "# **APPLICATION 4 - Neural Network to Identify Diabetes (Diabetes Dataframe)**\n", + "\n", + "> We will develop a Neural Network using _Tensorflow_/_Keras_, following the steps below:\n", + "\n", + "1. Load the data;\n", + "2. Preprocess and transform the data;\n", + "3. Define the training and validation samples;\n", + "4. Define the Neural Network architecture with _Tensorflow_/_Keras_;\n", + "5. Compile the Neural Network;\n", + "6. Fit the Neural Network;\n", + "7. Evaluate the performance of the Neural Network;\n", + "8. _Fine tuning_ of the Neural Network;\n", + "9. Make predictions with the Neural Network;\n", + "10. Conclusions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HOEJGtAzQfX3" + }, + "source": [ + "### 0. 
Load Python libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mxa5UaIXeRgN" + }, + "source": [ + "[**Python**] - Import the required libraries:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ylfhuYeveRgO" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "import tensorflow as tf  # needed for tf.__version__ and the tf.keras calls below\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uG9B3WTkeRgR" + }, + "source": [ + "[**Python**] - Set the number of decimal places:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TZkm0YVoeRgR" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mDhEsSJ1rFpy" + }, + "source": [ + "[**Python**] - Check the Tensorflow version\n", + "> Make sure you are using version 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KfKcNLZ3QfYJ" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rIT9N7jSQfYO" + }, + "source": [ + "### 1. 
Carregar os dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r9QUJZgbSWDG" + }, + "source": [ + "[**Python**] - Carregar os dados:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ofSJNoyfQfYR" + }, + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "Diabetes = load_diabetes()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fo7q0BnyShVG" + }, + "source": [ + "[**Python**] - Definir os arrays X_Diabetes e y_Diabetes:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "UTnrDMPLQfYW" + }, + "source": [ + "X_Diabetes= Diabetes.data\n", + "y_Diabetes= Diabetes.target\n", + "Col_Names= Diabetes.feature_names" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZjrdZUwp_l40" + }, + "source": [ + "[**Python**] - Corrigir ou renomear as colunas do dataframe (X_Diabetes é um array NumPy, por isso criamos um dataframe auxiliar):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "skgDY4Lu_l46" + }, + "source": [ + "df_Diabetes= pd.DataFrame(X_Diabetes, columns= [s_col.lower() for s_col in Col_Names])\n", + "df_Diabetes.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "a5NQO8b-QfYb" + }, + "source": [ + "X_Diabetes" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "adpBpNDeQfYj" + }, + "source": [ + "y_Diabetes[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YQA4fN4HQfYo" + }, + "source": [ + "### 2. 
Pré-processamento e transformação dos dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dTnBpjwalbVG" + }, + "source": [ + "[**Python**] - Normalizar os dados - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hS_unh4wQfYp" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "X_Diabetes = SS.fit_transform(X_Diabetes)\n", + "y_Diabetes= SS.fit_transform(y_Diabetes.reshape(-1, 1))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZSviMMrISt96" + }, + "source": [ + "y_Diabetes[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pWmVyMF0QfYu" + }, + "source": [ + "### 3. Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cEMSNMJu2VqI" + }, + "source": [ + "[**Python**] - Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2WUQMh2HQfYx" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_Diabetes, y_Diabetes, test_size = 0.1, random_state = 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Zvx80NPT-j0S" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wk_CG4H5QfY2" + }, + "source": [ + "### 4. 
Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V1bDqK5vi49C" + }, + "source": [ + "[**Python**] - Definir a arquitetura, ou seja:\n", + "* $N_{I}$: Número de neurônios na camada de entrada (_Input Layer_);\n", + "* $N_{O}$: Número de neurônios na camada de saída (_Output Layer_);\n", + "* $N_{H}$: Número de neurônios na camada escondida (_Hidden Layer_);\n", + "* FA: Função de ativação." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "por467-ci49D" + }, + "source": [ + "# Número de Neurônios na Input Layer:\n", + "N_I= X_Diabetes.shape[1]\n", + "\n", + "# Número de neurônios na Output Layer:\n", + "N_O= 1\n", + "\n", + "# Número de neurônios na Hidden Layer:\n", + "N_H1= 6\n", + "\n", + "# Função de Ativação da Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Função de Ativação da Output Layer:\n", + "FA_O= tf.keras.activations.linear" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-r7VC-7lpkkC" + }, + "source": [ + "[**Python**] - Definir as sementes para NumPy e Tensorflow:\n", + "> Por questões de reprodutibilidade de resultados, use as sementes abaixo:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "43f-ZPW7pkkD" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2qv8lJmHnqi3" + }, + "source": [ + "[**Python**] - Definir a Rede Neural:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Veeqccdbnks" + }, + "source": [ + "**Observações**: \n", + "\n", + "Para evitar problemas relacionados ao _overfitting_ e _Vanishing or Exploding Gradients in Deep Neural Nets_, os artigos abaixo sugerem as seguintes opções para inicialização dos pesos $W$:\n", + "\n", + "* [How to Reduce 
Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) sugere:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) sugere:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() para activation= 'tf.nn.relu' ou 'tf.nn.leaky_relu' e kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) sugere:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WMOoG_0bnsSD" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dense(units= N_O, activation= FA_O)) # Se não definirmos o parâmetro activation, então por default será 'linear'.\n", + "\n", + "# Resumo da arquitetura da Rede Neural\n", + "print(RN.summary())" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7W8VtONlQfZP" + }, + "source": [ + "### 5. Compilar a Rede Neural\n", + "\n", + "Este é um problema de regressão. Portanto, temos:\n", + "* loss= tf.keras.losses.MeanSquaredError;\n", + "* metrics= 'mse';\n", + "* optimizer= tf.keras.optimizers.Adam()." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g97mJCSr1Kat" + }, + "source": [ + "[**Python**] - Comando modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cXJFtlcEQfZQ" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = tf.keras.metrics.MeanSquaredError(name= 'mse') # name= 'mse' gera as chaves 'mse' e 'val_mse' usadas por Model_MSE()\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1s1S7Fn_QfZW" + }, + "source": [ + "### 6. Ajustar a Rede Neural\n", + "\n", + "Obs.: A opção callbacks abaixo implementa o conceito de _early stopping_. Esta opção interrompe o treinamento da Rede Neural antes de atingirmos o número máximo de _epochs_ quando o modelo para de melhorar, medido pela métrica val_loss. O parâmetro _patience_= k significa que o processo de otimização vai parar se tivermos k _epochs_ consecutivas sem observarmos melhoria da performance da Rede Neural." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PNgR4ihA0JMy" + }, + "source": [ + "[**Python**] - Comando modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JUaqK4j-QfZY" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 200, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3bFg6kut1jkb" + }, + "source": [ + "A seguir, funções para plotarmos os gráficos das métricas MSE, _Loss_ e _Accuracy_:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VPASVWaR1sWN" + }, + "source": [ + "def Model_Loss(hist):\n", + " print(hist.history.keys())\n", + " plt.plot(hist.history['loss'])\n", + " plt.plot(hist.history['val_loss'])\n", + " plt.title('Model Loss')\n", + " plt.ylabel('Loss')\n", + " plt.xlabel('Epochs')\n", + " plt.legend(['Training', 'Validation'], loc= 'upper right')\n", + " plt.show()\n", + "\n", + "def Model_Accuracy(hist):\n", + " print(hist.history.keys())\n", + " plt.plot(hist.history['accuracy'])\n", + " plt.plot(hist.history['val_accuracy'])\n", + " plt.title('Model Accuracy')\n", + " plt.ylabel('Accuracy')\n", + " plt.xlabel('Epochs')\n", + " plt.legend(['Training', 'Validation'], loc= 'upper right')\n", + " plt.show()\n", + "\n", + "def Model_MSE(hist):\n", + " print(hist.history.keys())\n", + " plt.plot(hist.history['mse'])\n", + " plt.plot(hist.history['val_mse'])\n", + " plt.title('Model MSE')\n", + " plt.ylabel('MSE')\n", + " plt.xlabel('Epochs')\n", + " plt.legend(['Training', 'Validation'], loc= 'upper right')\n", + " plt.show()\n", + "\n", + "def Mostra_ConfusionMatrix():\n", + " y_pred = RN.predict_classes(X_teste)\n", + " mc = confusion_matrix(y_teste, y_pred)\n", + " #sns.heatmap(mc,annot=True, annot_kws={\"size\": 
10},fmt=\"d\")\n", + "    sns.heatmap(mc/np.sum(mc), annot=True, annot_kws={\"size\": 10}, fmt='.2%', cmap='Blues')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "uWhJUP0v2_fm" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "M8IZFKGyCvqO" + }, + "source": [ + "Model_MSE(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "37_0RhXLQfZc" + }, + "source": [ + "### 7. Avaliar a performance da Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8mMrIS9JyriW" + }, + "source": [ + "[**Python**] - Comando modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cRjEkvWzQfZe" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MA6_RkjgQfZs" + }, + "source": [ + "### 8. _Fine tuning_ da Rede Neural\n", + "\n", + "Esta fase fica como exercício para o aluno." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vPJtuCzXQfZu" + }, + "source": [ + "### 9. Fazer Predições com a Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p1EptFS1Yi-D" + }, + "source": [ + "[**Python**] - Comando (como este é um problema de regressão, usamos predict em vez de predict_classes):\n", + "* RN.predict(X_treinamento);\n", + "* RN.predict(X_teste)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "fbrvwgyvYi-I" + }, + "source": [ + "y_pred = RN.predict(X_teste)\n", + "y_pred[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JOsTSHwoQfZ0" + }, + "source": [ + "### 10. 
Conclusões" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EoQ5nySZmLDP" + }, + "source": [ + "___\n", + "# **APLICAÇÃO 5 - Rede Neural para prever os preços das casas (_Boston House Price Prediction_)**\n", + "\n", + "Vamos desenvolver uma Rede Neural usando _Tensorflow_/_Keras_ e seguir os passos adiante:\n", + "\n", + "1. Carregar os dados;\n", + "2. Pré-processamento e transformação dos dados;\n", + "3. Definir as amostras de treinamento e validação;\n", + "4. Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_;\n", + "5. Compilar a Rede Neural;\n", + "6. Ajustar a Rede Neural;\n", + "7. Avaliar a performance da Rede Neural;\n", + "8. _Fine tuning_ da Rede Neural;\n", + "9. Fazer Predições com a Rede Neural;\n", + "10. Conclusões." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8vdRpBS8VTw_" + }, + "source": [ + "### 0. Carregar bibliotecas do Python" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v629ZppSeY5T" + }, + "source": [ + "[**Python**] - Importar as bibliotecas necessárias:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uVXroVLTeY5U" + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.metrics import confusion_matrix\n", + "\n", + "from tensorflow import keras" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qYCcNW9qeY5W" + }, + "source": [ + "[**Python**] - Definir o número de casas decimais" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zNn-kwlGeY5X" + }, + "source": [ + "np.set_printoptions(precision= 3)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YnlZU1rLrJwt" + }, + "source": [ + "[**Python**] - Verificar a versão do Tensorflow\n", + "> Assegurar que está a utilizar a versão 2.x." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "445U8OKgVTxW" + }, + "source": [ + "tf.__version__" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-1Ckhzf0VTxc" + }, + "source": [ + "### 1. Carregar os dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aAz0_L0e1mxX" + }, + "source": [ + "[**Python**] - Carregar os dados:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SOdpdceAVTxd" + }, + "source": [ + "from sklearn.datasets import load_boston\n", + "\n", + "Boston = load_boston()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K0P23sJs1raX" + }, + "source": [ + "[**Python**] - Definir as matrizes X_Boston e y_Boston:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rPpJOsgJ1y7J" + }, + "source": [ + "X_Boston= Boston.data\n", + "y_Boston= Boston.target\n", + "Col_Names= Boston.feature_names\n", + "Col_Names" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5XBRc6og_ySA" + }, + "source": [ + "[**Python**] - Corrigir ou renomear as colunas do dataframe (X_Boston é um array NumPy, por isso criamos um dataframe auxiliar):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VPGDwXSF_ySE" + }, + "source": [ + "df_Boston= pd.DataFrame(X_Boston, columns= [s_col.lower() for s_col in Col_Names])\n", + "df_Boston.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "fKiKT-fkVTxq" + }, + "source": [ + "y_Boston[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T9uYgjz-VTxu" + }, + "source": [ + "### 2. 
Pré-processamento e transformação dos dados" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5rbpIU5jlgv6" + }, + "source": [ + "[**Python**] - Normalizar os dados - StandardScaler()" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Kbs-x9a2VTxw" + }, + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "SS = StandardScaler()\n", + "X_Boston= SS.fit_transform(X_Boston)\n", + "y_Boston= SS.fit_transform(y_Boston.reshape(-1, 1))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "S2w2H9BOXK9u" + }, + "source": [ + "X_Boston" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "DXNIHeS2XM_k" + }, + "source": [ + "y_Boston[:10]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gcbomDeKVTx1" + }, + "source": [ + "### 3. Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gEkX579Q2D2q" + }, + "source": [ + "[**Python**] - Definir as amostras de treinamento e validação" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EZyRBsfYVTx2" + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_Boston, y_Boston, test_size = 0.1, random_state = 20111974)\n", + "print(f'X: Treinamento= {X_treinamento.shape}; X: Teste= {X_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "g89c6edL-mBW" + }, + "source": [ + "print(f'Y: Treinamento= {y_treinamento.shape}; Y: Teste= {y_teste.shape}')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GU-ebO-3VTx7" + }, + "source": [ + "### 4. 
Definir a arquitetura da Rede Neural com _Tensorflow_/_Keras_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gMVzohGHjS_p" + }, + "source": [ + "[**Python**] - Definir a arquitetura, ou seja:\n", + "* $N_{I}$: Número de neurônios na camada de entrada (_Input Layer_);\n", + "* $N_{O}$: Número de neurônios na camada de saída (_Output Layer_);\n", + "* $N_{H}$: Número de neurônios na camada escondida (_Hidden Layer_);\n", + "* FA: Função de ativação." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lf32pQtWjS_u" + }, + "source": [ + "# Número de Neurônios na Input Layer:\n", + "N_I= X_Boston.shape[1]\n", + "\n", + "# Número de neurônios na Output Layer:\n", + "N_O= 1\n", + "\n", + "# Número de neurônios na Hidden Layer:\n", + "N_H1= 7\n", + "\n", + "# Função de Ativação da Hidden Layer:\n", + "FA_H= tf.nn.leaky_relu\n", + "\n", + "# Função de Ativação da Output Layer:\n", + "FA_O= tf.keras.activations.linear" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qOI4_BPYVTyE" + }, + "source": [ + "Vamos adicionar uma camada _Dropout_ com $p= 0.1$ para evitarmos _overfitting_." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yN9lxrXspp-m" + }, + "source": [ + "[**Python**] - Definir as sementes para NumPy e Tensorflow:\n", + "> Por questões de reproducibilidade de resultados, use as sementes abaixo:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cWcQ3OS5pp-n" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "73PnOLbon3Jh" + }, + "source": [ + "[**Python**] - Definir a Rede Neural:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zjKR3qgEneJr" + }, + "source": [ + "**Observações**: \n", + "\n", + "Para evitar problemas relacionados ao _overfitting_ e _Vanishing or Exploding Gradients in Deep Neural Nets_, os artigos abaixo sugerem as seguintes opções para inicialização dos pesos $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) sugere:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) sugere:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() para activation= 'tf.nn.relu' ou 'tf.nn.leaky_relu' e kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) sugere:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GnZuQZZTn4_W" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O)) # Se não definirmos o parâmetro activation, então por default será 'linear'.\n", + "\n", + "# Resumo da arquitetura da Rede Neural\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h-hyQiokVTyM" + }, + "source": [ + "### 5. Compilar a Rede Neural\n", + "\n", + "Este é um problema de regressão. Portanto, temos:\n", + "* loss= tf.keras.losses.MeanSquaredError() ou tf.keras.losses.MeanAbsoluteError();\n", + "* metrics= 'mse';\n", + "* optimizer= tf.keras.optimizers.Adam() ou 'rmsprop'." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JwQOPOhr1Oh0" + }, + "source": [ + "[**Python**] - Comando modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QY2aKnL_VTyN" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.RMSprop()\n", + "Loss_Function= tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = tf.keras.metrics.MeanSquaredError(name= 'mse') # name= 'mse' gera as chaves 'mse' e 'val_mse' usadas por Model_MSE()\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ygJi0ux5VTyT" + }, + "source": [ + "### 6. Ajustar a Rede Neural\n", + "\n", + "Obs.: A opção callbacks abaixo implementa o conceito de _early stopping_. 
Esta opção vai parar o processo de treinamento da Rede Neural antes de atingirmos o númerco de _epochs_ quando o modelo pára de melhorar, medido pela métrica val_loss. O parâmetro _patience_= k significa que o processo de otimização vai parar se tivermos k _epochs_ consecutivas sem observarmos melhoria da performance da Rede Neural." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vz0urLrq0NPG" + }, + "source": [ + "[**Python**] - Comando modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9HoQZUl8VTyU" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 200, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "R_StfUsUzbto" + }, + "source": [ + "Model_Loss(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "I1FIaMx_zzVW" + }, + "source": [ + "Model_MAE(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_t3k3oqg0pXW" + }, + "source": [ + "Model_MSE(hist)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LH0llgTsVTyY" + }, + "source": [ + "### 7. 
Avaliar a performance da Rede Neural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yZZPMFXvyvtG" + }, + "source": [ + "[**Python**] - Comando modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "iZGhNF5vVTyZ" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BkosiHm6lmww" + }, + "source": [ + "Observe que a Rede Neural _baseline_ (modelo inicial) apresenta MSE= 0.0795. Ainda assim, vamos tentar reduzir a _Loss Function_ e tentar chegar à um MSE ainda menor." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HcLONQpPVTyi" + }, + "source": [ + "### 8. _Fine tuning_ da Rede Neural\n", + "\n", + "O que pode ser feito para melhorar a performance da Rede Neural?\n", + "* aumentar o número de _Hidden Layers_ e o número de neurônios em cada uma delas." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g7Uxk3j_ndFX" + }, + "source": [ + "N_I= X_Boston.shape[1]\n", + "N_H1= 32\n", + "N_H2= 32\n", + "N_H3= 32\n", + "N_O= 1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fF3Eb_5dp2PZ" + }, + "source": [ + "[**Python**] - Definir as sementes para NumPy e Tensorflow:\n", + "> Por questões de reproducibilidade de resultados, use as sementes abaixo:\n", + "\n", + "* NumPy: 20111974;\n", + "* Tensorflow: 20111974;" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "X48MWaa_p2Pb" + }, + "source": [ + "np.random.seed(20111974)\n", + "tf.random.set_seed(20111974)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mHbjgPT7nP-q" + }, + "source": [ + "[**Python**] - Definir a Rede Neural:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j8s-XRqdbuXP" + }, + "source": [ + "**Observações**: \n", + "\n", + "Para evitar problemas relacionados ao 
_overfitting_ e _Vanishing or Exploding Gradients in Deep Neural Nets_, os artigos abaixo sugerem as seguintes opções para inicialização dos pesos $W$:\n", + "\n", + "* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/) sugere:\n", + " * kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Deep Learning Best Practices (1) — Weight Initialization](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) sugere:\n", + " * kernel_initializer= tf.keras.initializers.he_normal() para activation= 'tf.nn.relu' ou 'tf.nn.leaky_relu' e kernel_constraint= tf.keras.constraints.UnitNorm();\n", + "* [Vanishing/ Exploding Gradients in Deep Neural Nets and solving them](https://medium.com/swlh/vanishing-exploding-gradients-in-deep-neural-nets-and-solving-them-9d6070f28b29) sugere:\n", + " * kernel_initializer= tf.keras.initializers.GlorotUniform();\n", + " * kernel_initializer= tf.keras.initializers.GlorotNormal()." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3HcNOQoFnR3W" + }, + "source": [ + "from tensorflow.keras import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Dropout\n", + "\n", + "RN= Sequential()\n", + "RN.add(Dense(units= N_H1, input_dim= N_I, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H2, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_H3, kernel_initializer= tf.keras.initializers.GlorotNormal(), activation= FA_H, kernel_constraint= tf.keras.constraints.UnitNorm()))\n", + "RN.add(Dropout(0.1))\n", + "RN.add(Dense(units= N_O, activation= FA_O)) # Se não definirmos o parâmetro activation, então por default será 'linear'.\n", + "\n", + "# Resumo da arquitetura da Rede Neural\n", + "RN.summary()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JBolPrXZnZ5i" + }, + "source": [ + "#### 8.5. Compilar a Rede Neural (_Fine tuning_ da Rede Neural)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rMBCiUTC1W2H" + }, + "source": [ + "[**Python**] - Comando modelo.compile(optimizer, loss, metrics):" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YlBccXXmnZ5k" + }, + "source": [ + "Algoritmo_Opt= tf.keras.optimizers.Adam()\n", + "Loss_Function= tf.keras.losses.MeanSquaredError()\n", + "Metrics_Perf = tf.keras.metrics.MeanSquaredError(name= 'mse') # name= 'mse' gera as chaves 'mse' e 'val_mse' usadas por Model_MSE()\n", + "\n", + "RN.compile(optimizer= Algoritmo_Opt, loss= Loss_Function, metrics= [Metrics_Perf])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SIOA5UFfnZ5p" + }, + "source": [ + "#### 8.6. 
Ajustar a Rede Neural (_Fine tuning_ da Rede Neural)\n", + "\n", + "Obs.: A opção callbacks abaixo implementa o conceito de _early stopping_. Esta opção interrompe o treinamento da Rede Neural antes de atingirmos o número máximo de _epochs_ quando o modelo para de melhorar, medido pela métrica val_loss. O parâmetro _patience_= k significa que o processo de otimização vai parar se tivermos k _epochs_ consecutivas sem observarmos melhoria da performance da Rede Neural." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ktlrSmGQ0Qrq" + }, + "source": [ + "[**Python**] - Comando modelo.fit(X_treinamento, y_treinamento, epochs)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kM5x90ArnZ5r" + }, + "source": [ + "callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 5, min_delta=0.001)]\n", + "hist= RN.fit(X_treinamento, y_treinamento, epochs= 500, validation_data= (X_teste, y_teste), callbacks= callbacks)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AfxvOccmnZ5z" + }, + "source": [ + "#### 8.7. Avaliar a performance da Rede Neural (_Fine tuning_ da Rede Neural)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4XZmb9zIy1Xf" + }, + "source": [ + "[**Python**] - Comando modelo.evaluate(X_teste, y_teste)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "belFKJQSnZ51" + }, + "source": [ + "RN.evaluate(X_teste, y_teste)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7whymUw5VTyq" + }, + "source": [ + "### 10. Conclusões\n", + "\n", + "A performance da Rede Neural melhorou um pouco com a redução do MSE." 
+ ] + } + ] +} \ No newline at end of file diff --git a/Notebooks/NB23__Tools para DataViz_hs.ipynb b/Notebooks/NB23__Tools para DataViz_hs.ipynb new file mode 100644 index 000000000..8e38d4e1b --- /dev/null +++ b/Notebooks/NB23__Tools para DataViz_hs.ipynb @@ -0,0 +1,694 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "intro_to_fairness.ipynb", + "provenance": [], + "collapsed_sections": [ + "J8daw3YOIAXH", + "xFxZOg55lWJE", + "l-K-xqksm-X3", + "TXkkHYyJ98_k", + "91wjnZFpPWw-", + "KlF-lQ8yQ69b", + "qZ-9vJgSEpHj", + "7YVH8hYfSjer", + "2lx4JuLdi7jw", + "TF3B5h3c-7Fb" + ], + "include_colab_link": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "84x4Fxc5lzFv" + }, + "source": [ + "# Entendendo os dados\n", + "***" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xFxZOg55lWJE" + }, + "source": [ + "## Objetivo\n", + "\n", + "* Explorar e entender os diferentes bias presentes nos dados;\n", + "* Explorar as features e identificar potenciais Bias antes de treinar o ML;\n", + "* Avaliar a performance por subgrupos antes de agregar." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TXkkHYyJ98_k" + }, + "source": [ + "## About the dataframe\n", + "\n", + "In this exercise we will work with the dataframe...\n", + "\n", + "### Numeric features:\n", + "\n", + "\n", + "### Categorical features\n", + "\n", + "### Objective\n", + "The objective is to estimate the probability of churn.\n", + "\n", + "### Label (target variable)\n", + "* `churn`: indicates whether or not the customer left the company.\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I0RMIktKy8xX" + }, + "source": [ + "## Setup: Load the libraries needed for the analysis\n", + "* We need to use TensorFlow 2.x." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XJoEoDDPIZpi" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import seaborn as sns\n", + "from matplotlib import pyplot as plt\n", + "from matplotlib import rcParams" + ], + "execution_count": 1, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "MelAK2u6d-xx", + "outputId": "21e39c12-3b80-4450-c750-30b8782b1475", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "\n", + "tf.__version__" + ], + "execution_count": 2, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'2.3.0'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jUsgiVsUeKRR" + }, + "source": [ + "Next, we will use an open-source tool called [Facets](https://pair-code.github.io/facets/), created by [PAIR](https://research.google/teams/brain/pair/).
Facets offers two important tools for understanding the data and, consequently, for helping us with Machine Learning." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wgqSegn9JQ3X" + }, + "source": [ + "### Adjust the report granularity" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2e_0DJJ8zE29", + "outputId": "aa8f6b7a-bd42-45cf-e426-176a2b86541b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "pd.options.display.max_rows = 10\n", + "pd.options.display.float_format = \"{:.1f}\".format\n", + "\n", + "from google.colab import widgets\n", + "\n", + "# Facets\n", + "from IPython.display import display, HTML\n", + "import base64\n", + "\n", + "!pip install facets-overview==1.0.0\n", + "from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Collecting facets-overview==1.0.0\n", + " Downloading https://files.pythonhosted.org/packages/df/8a/0042de5450dbd9e7e0773de93fe84c999b5b078b1f60b4c19ac76b5dd889/facets_overview-1.0.0-py2.py3-none-any.whl\n", + "Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from facets-overview==1.0.0) (1.18.5)\n", + "Requirement already satisfied: pandas>=0.22.0 in /usr/local/lib/python3.6/dist-packages (from facets-overview==1.0.0) (1.1.4)\n", + "Requirement already satisfied: protobuf>=3.7.0 in /usr/local/lib/python3.6/dist-packages (from facets-overview==1.0.0) (3.12.4)\n", + "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.22.0->facets-overview==1.0.0) (2018.9)\n", + "Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.22.0->facets-overview==1.0.0) (2.8.1)\n", + "Requirement already satisfied: six>=1.9 in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.7.0->facets-overview==1.0.0)
(1.15.0)\n", + "Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.7.0->facets-overview==1.0.0) (50.3.2)\n", + "Installing collected packages: facets-overview\n", + "Successfully installed facets-overview-1.0.0\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-xgIRapb5LaQ" + }, + "source": [ + "### Load the data\n", + "* We will load the data from Kaggle/LabData." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3uu5rmeXJ0J8" + }, + "source": [ + "url_treinamento = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/adult.data'\n", + "url_teste = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/adult.test'" + ], + "execution_count": 4, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bxWbuxspWPa0" + }, + "source": [ + "nomes_colunas = [\"Age\", \"Workclass\", \"fnlwgt\", \"Education\", \"Education-Num\", \"Marital Status\",\n", + " \"Occupation\", \"Relationship\", \"Race\", \"Sex\", \"Capital Gain\", \"Capital Loss\",\n", + " \"Hours per week\", \"Country\", \"Target\"]" + ], + "execution_count": 5, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kHova3CrMA35" + }, + "source": [ + "df_50T = pd.read_csv(url_treinamento, na_values=\" ?\", names = nomes_colunas)\n", + "df_50V = pd.read_csv(url_teste, na_values=\" ?\", names = nomes_colunas)" + ], + "execution_count": 11, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "p2wEBPSSMSNQ", + "outputId": "8a051abe-a832-4e1b-9dd7-e9eec3fdf873", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "df_50T.head()" + ], + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AgeWorkclassfnlwgtEducationEducation-NumMarital StatusOccupationRelationshipRaceSexCapital GainCapital LossHours per weekCountryTarget
039State-gov77516Bachelors13Never-marriedAdm-clericalNot-in-familyWhiteMale2174040United-States<=50K
150Self-emp-not-inc83311Bachelors13Married-civ-spouseExec-managerialHusbandWhiteMale0013United-States<=50K
238Private215646HS-grad9DivorcedHandlers-cleanersNot-in-familyWhiteMale0040United-States<=50K
353Private23472111th7Married-civ-spouseHandlers-cleanersHusbandBlackMale0040United-States<=50K
428Private338409Bachelors13Married-civ-spouseProf-specialtyWifeBlackFemale0040Cuba<=50K
\n", + "
" + ], + "text/plain": [ + " Age Workclass fnlwgt ... Hours per week Country Target\n", + "0 39 State-gov 77516 ... 40 United-States <=50K\n", + "1 50 Self-emp-not-inc 83311 ... 13 United-States <=50K\n", + "2 38 Private 215646 ... 40 United-States <=50K\n", + "3 53 Private 234721 ... 40 United-States <=50K\n", + "4 28 Private 338409 ... 40 Cuba <=50K\n", + "\n", + "[5 rows x 15 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "a2ZNB8CRPhxb", + "outputId": "95edbe44-af8d-4d62-e6d2-6f78dbc100e0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_50T.shape" + ], + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(32561, 15)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yaZt9NIenavk", + "outputId": "4f30dae8-e611-4322-9c20-f505857103f1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "list(df_50T.dtypes)" + ], + "execution_count": 14, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[dtype('int64'),\n", + " dtype('O'),\n", + " dtype('int64'),\n", + " dtype('O'),\n", + " dtype('int64'),\n", + " dtype('O'),\n", + " dtype('O'),\n", + " dtype('O'),\n", + " dtype('O'),\n", + " dtype('O'),\n", + " dtype('int64'),\n", + " dtype('int64'),\n", + " dtype('int64'),\n", + " dtype('O'),\n", + " dtype('O')]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "OvIqZeNnnibH", + "outputId": "cffb8655-1913-411e-f719-0b879022df12", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "df_50T.isna().sum()" + ], + "execution_count": 15, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Age 0\n", + "Workclass 
1836\n", + "fnlwgt 0\n", + "Education 0\n", + "Education-Num 0\n", + " ... \n", + "Capital Gain 0\n", + "Capital Loss 0\n", + "Hours per week 0\n", + "Country 583\n", + "Target 0\n", + "Length: 15, dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "coilRN-hooja" + }, + "source": [ + "## Analyzing the data with Facets\n", + "* As mentioned earlier, it is very important to understand the data (a step that takes about 80% of the time) before the modeling phase.\n", + "\n", + "Answer the following questions:\n", + "\n", + "* Are there missing values?\n", + "* Does any variable have an unexpected value?\n", + "* Are there signs of skewness in the data?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9yCIuAqWA1Pm" + }, + "source": [ + "Below is a short tour of [Facets Overview](https://pair-code.github.io/facets/), an interactive data visualization tool that helps us explore the data."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MW-qryqs1gig", + "outputId": "8d4ce332-ca7b-4400-e260-fc77d090be8e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + } + }, + "source": [ + "fsg = FeatureStatisticsGenerator()\n", + "dataframes = [\n", + " {'table': df_50T, 'name': 'trainData'}]\n", + "\n", + "censusProto = fsg.ProtoFromDataFrames(dataframes)\n", + "protostr = base64.b64encode(censusProto.SerializeToString()).decode(\"utf-8\")\n", + "\n", + "HTML_TEMPLATE = \"\"\"\n", + " \n", + " \n", + " \"\"\"\n", + "html = HTML_TEMPLATE.format(protostr=protostr)\n", + "display(HTML(html))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " " + ], + "text/plain": [ + "<IPython.core.display.HTML object>" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KlF-lQ8yQ69b" + }, + "source": [ + "### Actions\n", + "* What should we do with the missing values?\n", + " * Can we ignore the missing values?\n", + "* What should we do with the outliers?\n", + "* What should we do with variables whose distributions are far from normal?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hKj2hz-Sql7V" + }, + "source": [ + "## A Deeper Dive\n", + "\n", + "After forming first impressions of the data, it is time for a deeper dive. We will use the second tool, [Facets Dive](https://pair-code.github.io/facets/), an interactive data visualization tool that will help us understand our data even better."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "stlklrG_xssF", + "outputId": "a593f36f-850b-4ade-f820-38c80cf649eb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 617 + } + }, + "source": [ + "#@title Number of points to visualize in Facets\n", + "\n", + "SAMPLE_SIZE = df_50T.shape[0] #@param\n", + "train_dive = df_50T.sample(SAMPLE_SIZE).to_json(orient = 'records')\n", + "\n", + "HTML_TEMPLATE = \"\"\"\n", + " \n", + " \n", + " \"\"\"\n", + "html = HTML_TEMPLATE.format(jsonstr = train_dive)\n", + "display(HTML(html))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " " + ], + "text/plain": [ + "<IPython.core.display.HTML object>" + ] + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LxqAPDcRDFB2" + }, + "source": [ + "## Exercises\n", + "\n", + "1. In the **Binning | X-Axis** menu, select **Education**; in the **Color By** menu, select **Target**; and in the **Label By** menu, select **Target** as well.\n", + "\n", + "* How would you describe the relationship between these variables?" + ] + } + ] +} \ No newline at end of file