From e9dab5f76b560f1740776e7e3e8995cb4458c90e Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Thu, 8 Oct 2020 10:53:08 -0300
Subject: [PATCH 01/21] Criado usando o Colaboratory
---
...B09_01__Functions_Exerc\303\255cios.ipynb" | 1547 +++++++++++++++++
1 file changed, 1547 insertions(+)
create mode 100644 "Notebooks/NB09_01__Functions_Exerc\303\255cios.ipynb"
diff --git "a/Notebooks/NB09_01__Functions_Exerc\303\255cios.ipynb" "b/Notebooks/NB09_01__Functions_Exerc\303\255cios.ipynb"
new file mode 100644
index 000000000..87b3f66e8
--- /dev/null
+++ "b/Notebooks/NB09_01__Functions_Exerc\303\255cios.ipynb"
@@ -0,0 +1,1547 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "NB09_01__Functions.ipynb",
+ "provenance": [],
+ "private_outputs": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "d_YndS20uqkK"
+ },
+ "source": [
+ "# **FUNCTIONS**\n",
+ "\n",
+ "\n",
+ "\n",
+ "# **AGENDA**:\n",
+ "\n",
+ "> See the **table of contents** for the topics covered in this chapter.\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e0UKAZQvJ_c2"
+ },
+ "source": [
+ "___\n",
+ "# **INTRODUCTION TO FUNCTIONS**\n",
+ "> A function is a named sequence of statements that performs a task.\n",
+ ">> Pay attention to what PEP 8 recommends about writing functions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Z4-gPTjZUP50"
+ },
+ "source": [
+ "# Do not run this code! It only sketches the general syntax of a function definition.\n",
+ "def funcao(arg1, arg2, ..., argN):\n",
+ " "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "etxNlyRYo39A"
+ },
+ "source": [
+ "def show_hello_world():\n",
+ " print('Hello World!')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "G6I9PFvZpBgR"
+ },
+ "source": [
+ "type(show_hello_world)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_meNdNygpIbv"
+ },
+ "source": [
+ "show_hello_world()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6zfLd8HwpPpg"
+ },
+ "source": [
+ "___\n",
+ "# **DOCUMENTING FUNCTIONS WITH COMMENTS/DOCSTRINGS**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3yzgBxtNpRi_"
+ },
+ "source": [
+ "def show_hello_world():\n",
+ "    '''\n",
+ "    Prints the greeting 'Hello World!'.\n",
+ "    Inputs:\n",
+ "        none (this function takes no parameters)\n",
+ "    Returns: None (the greeting is only printed)\n",
+ "    '''\n",
+ "    print('Hello World!')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0rBaxjpmpbm1"
+ },
+ "source": [
+ "show_hello_world()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6ThOwDQp4TfR"
+ },
+ "source": [
+ "# To see a function's documentation, read its __doc__ attribute:\n",
+ "show_hello_world.__doc__"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9YZ2afpNA4st"
+ },
+ "source": [
+ "OR..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uSnwA4BVA5_t"
+ },
+ "source": [
+ "help(show_hello_world)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "whbnnMA5p1Jw"
+ },
+ "source": [
+ "___\n",
+ "# **FUNCTIONS WITH ARGUMENTS**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "O3bSjLA_qTTc"
+ },
+ "source": [
+ "Define the function mostra_nome with two parameters, s_primeiro_nome and s_ultimo_nome:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9jWyCCPPp4yS"
+ },
+ "source": [
+ "def mostra_nome(s_primeiro_nome, s_ultimo_nome):\n",
+ " print(f'Olá, meu nome é {s_primeiro_nome} {s_ultimo_nome}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VOB3Ip63qIzr"
+ },
+ "source": [
+ "mostra_nome('Nelio', 'Machado')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Oi0c_GuesfcL"
+ },
+ "source": [
+ "Here the first parameter (s_primeiro_nome) receives the value 'Nelio' and the second parameter (s_ultimo_nome) receives 'Machado'."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qkMblpnLsITO"
+ },
+ "source": [
+ "However, we can also call the function with keyword arguments:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TTli7e6xsMCo"
+ },
+ "source": [
+ "mostra_nome(s_ultimo_nome = 'Machado', s_primeiro_nome = 'Nelio')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rmatMmhTsaVc"
+ },
+ "source": [
+ "Note that the result is the same. This time, however, we state explicitly which value each parameter receives, so the order of the arguments no longer matters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PnNYrgJ6VQo9"
+ },
+ "source": [
+ "## PEP 8 + annotations = code that is easier to understand and maintain\n",
+ "\n",
+ "> Below we combine PEP 8 with annotations to make the Python code more explicit. _Annotations_ make the code clearer without changing the behavior of the function, since Python does not enforce them at run time. In the example below, the parameters s_primeiro_nome and s_ultimo_nome are of type _str_ and the function returns an _output_ of type _str_."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aU2Sob37VVmi"
+ },
+ "source": [
+ "def mostra_nome2(s_primeiro_nome: str, s_ultimo_nome: str) -> str:\n",
+ "    return f'Olá, meu nome é {s_primeiro_nome} {s_ultimo_nome}'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iIvqS73mXNam"
+ },
+ "source": [
+ "mostra_nome2(s_ultimo_nome = 'Machado', s_primeiro_nome = 'Nelio')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rSnrtFNtXrbN"
+ },
+ "source": [
+ "# **\\*args**\n",
+ "> \\*args lets you pass more positional arguments than the number of formal parameters you declared; the extra values arrive in the function as a tuple."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aT0_PeuEvXiP"
+ },
+ "source": [
+ "## Example 1\n",
+ "> Consider a (simple) function that prints a customer's full name."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Npbi_Hy0bUec"
+ },
+ "source": [
+ "# We define the function mostra_nome3 as follows:\n",
+ "def mostra_nome3(*args):\n",
+ " nome = ' '.join(args)\n",
+ " print(f'Olá, meu nome é {nome}.')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dFzM0gA3_9za"
+ },
+ "source": [
+ "mostra_nome3('Nelio', 'Machado')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "370bpgaSvDbJ"
+ },
+ "source": [
+ "Now the function accepts any number of arguments."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4kYcu6PEX-Nz"
+ },
+ "source": [
+ "mostra_nome3('Pedro', 'de', 'Alcantara', 'Francisco', 'Antonio', 'Joao', 'Carlos', 'Xavier', 'de', 'Paula', 'Miguel', 'Rafael', 'Joaquim', 'Jose', 'Gonzaga', 'Pascoal', 'Cipriano', 'Serafim')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KMgngPmFimxb"
+ },
+ "source": [
+ "Note that, written this way, it does not matter how many arguments we pass to the function."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Y9pDa6ZRjo0U"
+ },
+ "source": [
+ "## Example 2\n",
+ "* Suppose we want to write a function that multiplies two numbers (passed as arguments)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1A-vhsHxv1YE"
+ },
+ "source": [
+ "Before looking at the solution with \\*args, let us see what the function would look like if \\*args did not exist."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cCDwruF8j5i5"
+ },
+ "source": [
+ "### The \"normal\" way"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_R03BiwLjtwB"
+ },
+ "source": [
+ "# Function definition\n",
+ "def multiplicar_numeros(x1, x2):\n",
+ "    '''\n",
+ "    Purpose: multiplies the TWO numbers passed as arguments.\n",
+ "    Author: Nelio Machado\n",
+ "    Date: 04/10/2020\n",
+ "    '''\n",
+ "    return x1 * x2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0eVm1Qj9kDtd"
+ },
+ "source": [
+ "print(multiplicar_numeros(3, 4))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4h9Nhkickf_8"
+ },
+ "source": [
+ "### Using \\*args"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9Kf89meJkjw8"
+ },
+ "source": [
+ "def multiplicar_numeros2(*args):\n",
+ "    '''\n",
+ "    Purpose: multiplies all the numbers passed as arguments.\n",
+ "    Author: Nelio Machado\n",
+ "    Date: 04/10/2020\n",
+ "    '''\n",
+ "    print(args)        # the positional arguments arrive as a tuple\n",
+ "    print(type(args))\n",
+ "    x = 1\n",
+ "    for n in args:\n",
+ "        x *= n\n",
+ "\n",
+ "    return x"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZuIzwitWk7by"
+ },
+ "source": [
+ "print(multiplicar_numeros2(1, 2, 3, 4, 5))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "U5kyPu792gMN"
+ },
+ "source": [
+ "We can also unpack an existing tuple into the call:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "oc2NJmJf2s7X"
+ },
+ "source": [
+ "args= (1, 2, 3, 4, 5)\n",
+ "print(multiplicar_numeros2(*args))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "38jVie_IjMXI"
+ },
+ "source": [
+ "# \\**kwargs\n",
+ "\n",
+ "* \\**kwargs lets a function accept a variable number of keyword arguments, which it receives as a dictionary.\n",
+ "* Each argument has the form {key: value}.\n",
+ "\n",
+ "* To illustrate \\**kwargs, we use part of the dFruits dictionary defined in the [Dictionaries](Dictionaries.ipynb) chapter. If in doubt, go back to that chapter to review the main concepts."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yAntQ724nMbv"
+ },
+ "source": [
+ "# Define a function that receives its arguments as a dictionary:\n",
+ "def imprime_frutas(**kwargs):\n",
+ "    '''\n",
+ "    Purpose: prints the fruits contained in kwargs.\n",
+ "    Author: Nelio Machado\n",
+ "    Date: 04/10/2020\n",
+ "    '''\n",
+ "    for key, value in kwargs.items():\n",
+ "        print(f'O valor de {key} é {value}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jpmSk9mfxww3"
+ },
+ "source": [
+ "Pay attention to how the items are passed to the function!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "88-1lStInaVs"
+ },
+ "source": [
+ "imprime_frutas(Avocado = 0.35, Apple = 0.4, Apricot = 0.25, Banana = 0.30)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-jb_kkLiyQt8"
+ },
+ "source": [
+ "However, we can also build a dictionary in the usual way and unpack it into the call with \\**:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JZJNiLz7wgCy"
+ },
+ "source": [
+ "d_frutas = {'Apple': 0.4, 'Avocado': 0.3, 'Orange': 0.5, 'Lemon': 0.25}"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eUCum4JPEcxD"
+ },
+ "source": [
+ "imprime_frutas(**d_frutas)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iK8-e7a1sXmn"
+ },
+ "source": [
+ "___\n",
+ "# **Python return**\n",
+ "> A Python function may or may not return a value explicitly; when it does not, the call evaluates to None."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HS0dGA55siWw"
+ },
+ "source": [
+ "def par_ou_impar(i_numero1, i_numero2):\n",
+ "    '''\n",
+ "    Checks whether the sum of two numbers is even or odd.\n",
+ "    Returns 'Odd' or 'Even'.\n",
+ "    '''\n",
+ "    i_soma = i_numero1 + i_numero2\n",
+ "    i_modulo = i_soma % 2\n",
+ "    print(f'A soma é {i_soma}')\n",
+ "    if i_modulo > 0:\n",
+ "        return 'Odd'\n",
+ "    else:\n",
+ "        return 'Even'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mZTG2tDJuIZQ"
+ },
+ "source": [
+ "i_numero1 = int(input('Por favor, informe o primeiro número: '))\n",
+ "i_numero2 = int(input('Por favor, informe o segundo número.: '))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7p_9pq3Du18a"
+ },
+ "source": [
+ "type(i_numero1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4oO7aAjcvCAe"
+ },
+ "source": [
+ "type(i_numero2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Br7yT8UHuKYY"
+ },
+ "source": [
+ "s_resultado = par_ou_impar(i_numero1, i_numero2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "601QnggJuhf-"
+ },
+ "source": [
+ "print(f'O resultado é {s_resultado}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t6HNf9j9yKcT"
+ },
+ "source": [
+ "Now try to display the value of i_modulo (or i_soma):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Yu8RsyDAyXne"
+ },
+ "source": [
+ "i_modulo"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nx3twrLRyaeJ"
+ },
+ "source": [
+ "Python reports that i_modulo is not defined.\n",
+ "Is that correct?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "imkyRO4kyvgV"
+ },
+ "source": [
+ "Consider the following example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kwRiXDA5y19h"
+ },
+ "source": [
+ "i_modulo = 0\n",
+ "\n",
+ "def par_ou_impar_v2(i_numero1, i_numero2):\n",
+ "    '''\n",
+ "    Checks whether the sum of two numbers is even or odd.\n",
+ "    Returns 'Odd' or 'Even'.\n",
+ "    '''\n",
+ "    i_soma = i_numero1 + i_numero2\n",
+ "    i_modulo = i_soma % 2\n",
+ "    print(f'A soma é {i_soma}')\n",
+ "    if i_modulo > 0:\n",
+ "        return 'Odd'\n",
+ "    else:\n",
+ "        return 'Even'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GYxLSGQLy_Ai"
+ },
+ "source": [
+ "i_numero1 = int(input('Por favor, informe o primeiro número: '))\n",
+ "i_numero2 = int(input('Por favor, informe o segundo número.: '))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NMtv99fjzHGs"
+ },
+ "source": [
+ "s_resultado = par_ou_impar_v2(i_numero1, i_numero2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qjOHnYDVzNGK"
+ },
+ "source": [
+ "print(f'O resultado é {s_resultado}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pPTecxRfzQUc"
+ },
+ "source": [
+ "Now let us check the value of i_modulo..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jkQb2mQzzTEo"
+ },
+ "source": [
+ "i_modulo"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oOlyGxBAzjE3"
+ },
+ "source": [
+ "Why does Python recognize the variable i_modulo now, and why is its value still 0?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dceSkt9Z0BZh"
+ },
+ "source": [
+ "___\n",
+ "# **VARIABLE SCOPE: LOCAL & GLOBAL**\n",
+ "* **Local** - A variable assigned inside the function. In other words, it exists only for the function's own use.\n",
+ "\n",
+ "* **Global** - A variable assigned outside the function. It can be read from anywhere in the program. However, assigning to that name inside a function creates a new local variable instead of changing the global one. To change the global variable from inside the function, declare the name with the reserved word 'global'."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0tIjI9GScPxu"
+ },
+ "source": [
+ "## Example 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QRojHHJ20iTY"
+ },
+ "source": [
+ "def exemplo1():\n",
+ " i_valor = 20\n",
+ " i_valor += 1\n",
+ " print(i_valor)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RdhElmTs0y1c"
+ },
+ "source": [
+ "exemplo1()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Tytq7PnH08pz"
+ },
+ "source": [
+ "The scope of the variable 'i_valor' is local, that is, restricted to the function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "299AK0PA1lIg"
+ },
+ "source": [
+ "i_valor"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gGP4cx17y8EZ"
+ },
+ "source": [
+ "The error above therefore makes sense: the variable i_valor is restricted to the function, so outside the function Python does not know this name."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KTV_6Gzxfvpc"
+ },
+ "source": [
+ "## Example 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zyi9AyJwfxTm"
+ },
+ "source": [
+ "i_valor= 100\n",
+ "\n",
+ "def exemplo2():\n",
+ " i_valor = 20\n",
+ " i_valor += 1\n",
+ " print(i_valor)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iEWrboG6gBSs"
+ },
+ "source": [
+ "exemplo2()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JPvT0BHG-vxE"
+ },
+ "source": [
+ "This looks a bit strange! Outside the function we set i_valor = 100 and, inside the function, we set i_valor = 20. Yet, as we saw, exemplo2() prints 21."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "N_t8tIDC-149"
+ },
+ "source": [
+ "Next, outside the function, we ask for the value of i_valor and get 100."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I46Bn4FlgJLu"
+ },
+ "source": [
+ "i_valor"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IQlP5nbngL6E"
+ },
+ "source": [
+ "Can you explain what is happening?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "h8PHd6rLgtwK"
+ },
+ "source": [
+ "## Example 3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qB7_zPQVgvVT"
+ },
+ "source": [
+ "i_valor = 100\n",
+ "\n",
+ "def exemplo3():\n",
+ " global i_valor\n",
+ " i_valor = 20\n",
+ " i_valor += 1\n",
+ " print(i_valor)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2KgQSbYCg8Eq"
+ },
+ "source": [
+ "exemplo3()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Y7yWoojrg_9Z"
+ },
+ "source": [
+ "i_valor"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cGlmbIJGzWG6"
+ },
+ "source": [
+ "Can you explain what happens in this example?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "X8qFfIoxhFOp"
+ },
+ "source": [
+ "## Example 4"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZM-yTLuO1bFh"
+ },
+ "source": [
+ "i_valor = 20\n",
+ "\n",
+ "def exemplo4():\n",
+ " i_valor += 1\n",
+ " print(i_valor)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "oLvfPO8w1zwL"
+ },
+ "source": [
+ "exemplo4()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2V7QzpZp2QcM"
+ },
+ "source": [
+ "What causes this error?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "w9qI8kln1_C7"
+ },
+ "source": [
+ "i_valor"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AQFFGqLI1FWn"
+ },
+ "source": [
+ "___\n",
+ "# **DEFAULT ARGUMENTS**\n",
+ "> Consider the following example: every time we go to the supermarket we buy 1 pack of milk (4 bottles) and 1 five-liter jug of water. We can therefore define our function, in a simple way, as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HbcSTiBI4nOj"
+ },
+ "source": [
+ "# Define the function with the parameters arroz, feijao, leite and agua.\n",
+ "def lista_de_compras(arroz, feijao, leite=1, agua=1):\n",
+ "    '''\n",
+ "    Function documentation: purpose, author and date.\n",
+ "    '''\n",
+ "    print('Lista de Compras:')\n",
+ "    print(f'Quantidade de arroz.: {arroz} quilos.')\n",
+ "    print(f'Quantidade de feijão: {feijao} quilos.')\n",
+ "    print(f'Quantidade de leite.: {leite} pack com 4.')\n",
+ "    print(f'Quantidade de água..: {agua} garrafa de 5 litros.')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vwZnDgoq5pgB"
+ },
+ "source": [
+ "lista_de_compras(5, 3)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "l7bY5BSO7eJF"
+ },
+ "source": [
+ "Since leite=1 and agua=1 are default values, we do not need to pass these arguments: Python already knows which value to use. However, if in a given week we need 2 packs of milk instead of 1, we must tell Python the new value:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YY4OrFuH7yXi"
+ },
+ "source": [
+ "lista_de_compras(5, 3, 2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-nfrZAvN73YT"
+ },
+ "source": [
+ "Likewise, if in another week we also need 2 jugs of water instead of 1 (in addition to the 2 packs of milk), we tell Python as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Vpoh6TdM7_xb"
+ },
+ "source": [
+ "lista_de_compras(5, 3, 2, 2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q3qZn9FuVQly"
+ },
+ "source": [
+ "___\n",
+ "# **map()**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Dav8k0JYWi4B"
+ },
+ "source": [
+ "## Example 1\n",
+ "> Suppose we want the square of each number passed to a function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "R6NC0i2OVktM"
+ },
+ "source": [
+ "l_numeros= [0, 1, 2, 3, 4, 5]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AVjYlN44Vw2k"
+ },
+ "source": [
+ "def quadrado_do_numero(i_numero):\n",
+ " return i_numero**2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "i_4CHiehV7lD"
+ },
+ "source": [
+ "list(map(quadrado_do_numero, l_numeros))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5tq8QDSPWNf6"
+ },
+ "source": [
+ "OR..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZAfkybybWOcG"
+ },
+ "source": [
+ "for i in map(quadrado_do_numero, l_numeros):\n",
+ " print(i)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c01V5CEzWlGF"
+ },
+ "source": [
+ "## Example 2\n",
+ "> Replace every True value in the list below with 1 and every False with 0."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qH1ackDZWvKp"
+ },
+ "source": [
+ "import random\n",
+ "\n",
+ "l_dados = []\n",
+ "for i in range(50):\n",
+ " random.seed(i)\n",
+ " l_dados.append(random.choice([True, False]))\n",
+ " \n",
+ "l_dados"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Dt2UKC-WXsxr"
+ },
+ "source": [
+ "def substituir_true(s_String):\n",
+ "    if s_String == True:\n",
+ "        return 1\n",
+ "    else:\n",
+ "        return 0"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BIIkPuDEXaM0"
+ },
+ "source": [
+ "list(map(substituir_true, l_dados))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TzkLIH1gYpFQ"
+ },
+ "source": [
+ "___\n",
+ "# **filter()**\n",
+ "* Filters elements based on a condition."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cjU8YznfZai1"
+ },
+ "source": [
+ "Suppose we now want to keep only the True items of the list l_dados."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "a3SeaKJgZlAZ"
+ },
+ "source": [
+ "def filtrar_true(item):\n",
+ "    if item == True:\n",
+ "        return True\n",
+ "    else:\n",
+ "        return False"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1Z1APDQtZyXs"
+ },
+ "source": [
+ "list(filter(filtrar_true, l_dados))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xPpFqVUnKEH7"
+ },
+ "source": [
+ "___\n",
+ "# **EXERCISES**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RDgCRPRs0W6C"
+ },
+ "source": [
+ "## Exercise 1\n",
+ "Write a function that returns the day of the week for a given number, where:\n",
+ "\n",
+ "* 1 - Dom\n",
+ "* 2 - Seg\n",
+ "* 3 - Ter\n",
+ "* 4 - Qua\n",
+ "* 5 - Qui\n",
+ "* 6 - Sex\n",
+ "* 7 - Sab"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "H17JO6sLOrG7"
+ },
+ "source": [
+ "### My solution"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PuCJNigKWir3"
+ },
+ "source": [
+ "# Solution 08/10/2020\n",
+ "#\n",
+ "# Receives the number of the day of the week and returns its 3-character abbreviation.\n",
+ "# The argument must be between 1 and 7; otherwise an error message is returned.\n",
+ "#\n",
+ "def f_dias_da_semana(i_dia_da_sem):\n",
+ "    d_sem_num_txt = {1: \"Dom\",\n",
+ "                     2: \"Seg\",\n",
+ "                     3: \"Ter\",\n",
+ "                     4: \"Qua\",\n",
+ "                     5: \"Qui\",\n",
+ "                     6: \"Sex\",\n",
+ "                     7: \"Sab\"}\n",
+ "    return d_sem_num_txt.get(i_dia_da_sem, 'Intervalo deve ser informado entre 1(Dom) e 7(Sáb)! Valor informado: ' + str(i_dia_da_sem))\n",
+ "\n",
+ "f_dias_da_semana(0)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6PAucd9vZxMZ"
+ },
+ "source": [
+ "f_dias_da_semana(3) "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wX_7XDyB0XSy"
+ },
+ "source": [
+ "def dia_da_semana(dia):\n",
+ "    d_palavra = {1: 'Domingo',\n",
+ "                 2: 'Segunda',\n",
+ "                 3: 'Terça',\n",
+ "                 4: 'Quarta',\n",
+ "                 5: 'Quinta',\n",
+ "                 6: 'Sexta',\n",
+ "                 7: 'Sábado'}\n",
+ "    return d_palavra.get(dia, \"Dia da semana inválido. Informe um número de 1 a 7\")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "39toyCRU1Q5T"
+ },
+ "source": [
+ "dia_da_semana(1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wt5hQq__1UEd"
+ },
+ "source": [
+ "dia_da_semana(0)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "N53NOsZjOv9m"
+ },
+ "source": [
+ "## Exercise 2\n",
+ "* Write a function that returns True if s_palavra occurs in a string and False otherwise."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "amq_LdBBby2T"
+ },
+ "source": [
+ "# Searches for a word in the given text and returns a message with the position where it was found.\n",
+ "# Parameter 1: text\n",
+ "# Parameter 2: word\n",
+ "# Returns: message with the position\n",
+ "def fc_busca_texto(s_prm_texto, s_prm_palavra):\n",
+ "    if s_prm_palavra in s_prm_texto:\n",
+ "        return 'Encontrou, parabéns! Posição: ' + str(s_prm_texto.find(s_prm_palavra)) + ' de ' + str(len(s_prm_texto)) + ' / Palavra pesquisada : ' + s_prm_palavra\n",
+ "    else:\n",
+ "        return 'Não encontrou palavra: ' + s_prm_palavra\n",
+ "\n",
+ "s_texto_informado = 'Expressão regular ou RegEx do inglês ‘Regular Expression’ é uma poderosa ferramenta para manipulação de strings. Esta ferramenta visa a identificação de padrões textuais ou padrões de caracteres que casam com um determinado padrão especificado.'\n",
+ "s_palavra_buscar = 'string'\n",
+ "fc_busca_texto(s_texto_informado,s_palavra_buscar) "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vrBZ_68-PBWl"
+ },
+ "source": [
+ "### My solution:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m4Pi4S8hPC_u"
+ },
+ "source": [
+ "def check_palavra(s_frase, s_palavra):\n",
+ "    if s_palavra in s_frase:\n",
+ "        return True\n",
+ "    else:\n",
+ "        return False"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NJeqwxDjPxub"
+ },
+ "source": [
+ "The sentence below was taken from [+ Bíblia + Camões + Legião Urbana - (Guerra) = Monte Castelo](http://compondoletras.blogspot.com/2013/11/biblia-camoes-legiao-urbana-guerra.html)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Dj_n_beIPRBN"
+ },
+ "source": [
+ "s_frase = 'O amor é o fogo que arde sem se ver. É ferida que dói e não se sente. É um contentamento descontente. É dor que desatina sem doer'\n",
+ "s_frase"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "s40FJ9iCPPY0"
+ },
+ "source": [
+ "s_palavra = 'fogo'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tzc2eaM7QUFE"
+ },
+ "source": [
+ "Is the word s_palavra in s_frase?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2tlravrMQXn2"
+ },
+ "source": [
+ "check_palavra(s_frase, s_palavra)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pMx9E0xMu1lc"
+ },
+ "source": [
+ "## Exercise 3\n",
+ "For more exercises on functions, see [Python functions - Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/python-functions-exercises.php)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Mw6Wg5hFvFMR"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From 8c95dd754d51bf1c3f7afb564b6d477bac859771 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Wed, 14 Oct 2020 15:36:13 -0300
Subject: [PATCH 02/21] Criado usando o Colaboratory
---
Notebooks/NB10_01__Pandas.ipynb | 235 +++++++++++++++++++++++++++++++-
1 file changed, 233 insertions(+), 2 deletions(-)
diff --git a/Notebooks/NB10_01__Pandas.ipynb b/Notebooks/NB10_01__Pandas.ipynb
index a2a03a488..18614be36 100644
--- a/Notebooks/NB10_01__Pandas.ipynb
+++ b/Notebooks/NB10_01__Pandas.ipynb
@@ -21,7 +21,7 @@
"colab_type": "text"
},
"source": [
- ""
+ ""
]
},
{
@@ -5231,7 +5231,238 @@
"id": "ldWQd9j4NhPS"
},
"source": [
- ""
+ "# Load the pandas library and its companion libraries\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pynSV0viI8CA"
+ },
+ "source": [
+ "# Display configuration\n",
+ "d_configuracao = {\n",
+ " 'display.max_columns': 1000,\n",
+ " 'display.expand_frame_repr': True,\n",
+ " 'display.max_rows': 10,\n",
+ " 'display.precision': 2,\n",
+ " 'display.show_dimensions': True\n",
+ " }\n",
+ "\n",
+ "for op, value in d_configuracao.items():\n",
+ " pd.set_option(op, value)\n",
+ " print(op, value)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AswYS-ILI-qW"
+ },
+ "source": [
+ "url = 'https://raw.githubusercontent.com/Celso-Omoto/DSWP/master/Dataframes/FIFA.csv'\n",
+ "# Alternative (assuming the CSV has an 'ID' column): pd.read_csv(url, index_col='ID')\n",
+ "df_Fifa2018 = pd.read_csv(url)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "K7xLrlPuKsAW"
+ },
+ "source": [
+ "df_Fifa2018.head()\n",
+ "\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LTZGbOHiKxsW"
+ },
+ "source": [
+ "df_Fifa2018.set_index('ID', inplace = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1y9oN-IeU7Sb"
+ },
+ "source": [
+ "def transformacao_lower(df):\n",
+ "    # First transformation: apply lower() to the COLUMN names.\n",
+ "    # Note: this modifies the dataframe passed in as df, in place.\n",
+ "    df.columns = [col.lower() for col in df.columns]\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uRS02MeCVVID"
+ },
+ "source": [
+ "transformacao_lower(df_Fifa2018)\n",
+ "df_Fifa2018.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sI_2oz3uMQFF"
+ },
+ "source": [
+ "# 17 - What insights can we draw about the variable overall (the player's average rating) by age, club and country?\n",
+ "# Column names are lowercase after transformacao_lower(), hence 'shotpower'.\n",
+ "df_Fifa2018.sort_values('shotpower', ascending=False).head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-9jCfEaSNJS1"
+ },
+ "source": [
+ "df_Fifa2018.groupby('overall').agg({'age':'mean', 'nationality': 'count'}).sort_values('overall').head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PMrTNLV1Oe5P"
+ },
+ "source": [
+ "df_Fifa2018.groupby('club').agg({'overall':'mean'}).sort_values('overall').head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dckaTDS9NaqG"
+ },
+ "source": [
+ "df_Fifa2018.groupby('club').agg({'potential':'mean', 'overall':'mean'}).head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OBuycRCzRbyG"
+ },
+ "source": [
+ "del df_Fifa2018['photo']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1crol52URcmt"
+ },
+ "source": [
+ "del df_Fifa2018['club logo']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "W0X7-q1CSNfM"
+ },
+ "source": [
+ "del df_Fifa2018['flag']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LMtzccaxSK29"
+ },
+ "source": [
+ "df_Fifa2018.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZPwu5sLnSyAc"
+ },
+ "source": [
+ "df_Fifa2018.dtypes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PBq3jr8nTUS0"
+ },
+ "source": [
+ "df_Fifa2018.select_dtypes(include=['object', 'string']).columns "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "82JaHKYATgdD"
+ },
+ "source": [
+ "df_Fifa2018[df_Fifa2018.select_dtypes(include=['object', 'string']).columns ]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NAyzCiTMW7YT"
+ },
+ "source": [
+ "df_Fifa2018.groupby('nationality').agg({'age':['count','mean']}).head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eivUHr17ZiHs"
+ },
+ "source": [
+ "df_Fifa2018.sort_values('age', ascending=False).groupby('nationality').agg({'age':['count','mean']})"
],
"execution_count": null,
"outputs": []
From 9a622e97075b23bc74b208e6f2e132dbfbc35757 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Wed, 14 Oct 2020 15:53:18 -0300
Subject: [PATCH 03/21] Created using Colaboratory
---
Notebooks/NB10_01__Pandas_Fifa.ipynb | 5493 ++++++++++++++++++++++++++
1 file changed, 5493 insertions(+)
create mode 100644 Notebooks/NB10_01__Pandas_Fifa.ipynb
diff --git a/Notebooks/NB10_01__Pandas_Fifa.ipynb b/Notebooks/NB10_01__Pandas_Fifa.ipynb
new file mode 100644
index 000000000..073a7477f
--- /dev/null
+++ b/Notebooks/NB10_01__Pandas_Fifa.ipynb
@@ -0,0 +1,5493 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Copy of NB10_01__Pandas.ipynb",
+ "provenance": [],
+ "private_outputs": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8fpUiw8PwC7_"
+ },
+ "source": [
+ "PANDAS FOR DATA ANALYSIS\n",
+ "\n",
+ "\n",
+ "\n",
+ "# **AGENDA**:\n",
+ "\n",
+ "> See the **index** of the topics covered in this chapter.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vo7mtiNSr_Wk"
+ },
+ "source": [
+ "___\n",
+ "# **REFERENCES**\n",
+ "* [Learn Aggregation and Data Wrangling with Python](https://data-flair.training/blogs/data-wrangling-with-python/)\n",
+ "* [Python Data Cleansing by Pandas & Numpy | Python Data Operations](https://data-flair.training/blogs/python-data-cleansing/)\n",
+ "* [Pandas from basic to advanced for Data Scientists](https://towardsdatascience.com/pandas-from-basic-to-advanced-for-data-scientists-aee4eed19cfe)\n",
+ "* [Feature engineering and ensembled models for the top 10 in Kaggle “Housing Prices Competition”](https://towardsdatascience.com/feature-engineering-and-ensembled-models-for-the-top-10-in-kaggle-housing-prices-competition-efb35828eef0)\n",
+ "* [Pandas.Series Methods for Machine Learning](https://towardsdatascience.com/pandas-series-methods-for-machine-learning-fd83709368ff)\n",
+ "* [Gaining a solid understanding of Pandas series](https://towardsdatascience.com/gaining-a-solid-understanding-of-pandas-series-893fb8f785aa)\n",
+ "* [Variáveis Dummy: o que é? Quando usar? E como usar?](https://medium.com/data-hackers/vari%C3%A1veis-dummy-o-que-%C3%A9-quando-usar-e-como-usar-78de66cfcca9)\n",
+ "* [Exploratory Data Analysis Made Easy Using Pandas Profiling](https://towardsdatascience.com/exploratory-data-analysis-made-easy-using-pandas-profiling-86e347ef5b65)\n",
+ "* [Data Handling using Pandas; Machine Learning in Real Life](https://towardsdatascience.com/data-handling-using-pandas-machine-learning-in-real-life-be76a697418c)\n",
+ "* [Exploratory Data Analysis Tutorial in Python](https://towardsdatascience.com/exploratory-data-analysis-tutorial-in-python-15602b417445)\n",
+ "* [Exploring the data using python](https://towardsdatascience.com/exploring-the-data-using-python-47c4bc7b8fa2)\n",
+ "* [A better EDA with Pandas-profiling](https://towardsdatascience.com/a-better-eda-with-pandas-profiling-e842a00e1136)\n",
+ "* [Exploratory Data Analysis: Haberman’s Cancer Survival Dataset](https://towardsdatascience.com/exploratory-data-analysis-habermans-cancer-survival-dataset-c511255d62cb)\n",
+ "* [Exploring Exploratory Data Analysis](https://towardsdatascience.com/exploring-exploratory-data-analysis-1aa72908a5df)\n",
+ "* [Getting started with Data Analysis with Python Pandas](https://towardsdatascience.com/getting-started-to-data-analysis-with-python-pandas-with-titanic-dataset-a195ab043c77)\n",
+ "* [A Gentle Introduction to Exploratory Data Analysis](https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184)\n",
+ "* [Exploratory Data Analysis (EDA) techniques for Kaggle competition beginners](https://towardsdatascience.com/exploratory-data-analysis-eda-techniques-for-kaggle-competition-beginners-be4237c3c3a9)\n",
+ "* [What is Exploratory Data Analysis?](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)\n",
+ "* [Exploring real estate investment opportunity in Boston and Seattle](https://towardsdatascience.com/exploring-real-estate-investment-opportunity-in-boston-and-seattle-9d89d0c9bed2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BUEbp88oD1Km"
+ },
+ "source": [
+ "___\n",
+ "# **DATA ANALYSIS WITH PANDAS**\n",
+ "## Highlights\n",
+ "\n",
+ "* A fast and efficient library for data manipulation;\n",
+ "* Tools for reading and writing many data formats: CSV, txt, Microsoft Excel, SQL databases, JSON and HDF5;\n",
+ "* pandas is one of the most popular Python libraries for data analysis. The main things we will do with pandas are:\n",
+ "    * Read/write different data formats;\n",
+ "    * Select subsets of data;\n",
+ "    * Perform various calculations across the columns or rows of tables;\n",
+ "    * Find and handle missing values;\n",
+ "    * Combine multiple dataframes."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wkxQFPPmeKLl"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eKawOG-neqaD"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TLdSmsJZwlcQ"
+ },
+ "source": [
+ "___\n",
+ "# **UP TO WHAT DATA VOLUME CAN WE USE PANDAS?**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "O7YKF5gB2x0K"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Sources\n",
+ "### Dask\n",
+ "* [Pandas, Dask or PySpark? What Should You Choose for Your Dataset?](https://medium.com/datadriveninvestor/pandas-dask-or-pyspark-what-should-you-choose-for-your-dataset-c0f67e1b1d36)\n",
+ "* [Processing Data with Dask](https://medium.com/when-i-work-data/processing-data-with-dask-47e4233cf165)\n",
+ "* [Pandas, Fast and Slow](https://medium.com/when-i-work-data/pandas-fast-and-slow-b6d8dde6862e)\n",
+ "* [Por que Parquet](https://medium.com/when-i-work-data/por-que-parquet-2a3ec42141c6)\n",
+ "* [How to Run Parallel Data Analysis in Python using Dask Dataframes](https://towardsdatascience.com/trying-out-dask-dataframes-in-python-for-fast-data-analysis-in-parallel-aa960c18a915)\n",
+ "* [Why every Data Scientist should use Dask?](https://towardsdatascience.com/why-every-data-scientist-should-use-dask-81b2b850e15b)\n",
+ "\n",
+ "### Spark, Koalas\n",
+ "* [Databricks Koalas-Python Pandas for Spark](https://medium.com/future-vision/databricks-koalas-python-pandas-for-spark-ce20fc8a7d08)\n",
+ "* [Bye Pandas, Meet Koalas: Pandas APIs on Apache Spark (Ep. 4)](https://medium.com/@kyleake/bye-pandas-meet-koalas-pandas-apis-on-apache-spark-ep-4-aedcd363cf4e)\n",
+ "* [Koalas: Easy Transition from pandas to Apache Spark](https://databricks.com/blog/2019/04/24/koalas-easy-transition-from-pandas-to-apache-spark.html?source=post_page-----aedcd363cf4e----------------------)\n",
+ "* [Use PySpark for Your Next Big Problem](https://medium.com/swlh/use-pyspark-for-your-next-big-problem-8aa288d5ecfa)\n",
+ "* [A Neanderthal’s Guide to Apache Spark in Python](https://towardsdatascience.com/a-neanderthals-guide-to-apache-spark-in-python-9ef1f156d427)\n",
+ "* [The Jungle of Koalas, Pandas, Optimus and Spark](https://towardsdatascience.com/the-jungle-of-koalas-pandas-optimus-and-spark-dd486f873aa4)\n",
+ "* [From Pandas to PySpark with Koalas](https://towardsdatascience.com/from-pandas-to-pyspark-with-koalas-e40f293be7c8)\n",
+ "\n",
+ "# What is Dask?\n",
+ "\n",
+ "\"Dask is designed to extend the numpy and pandas packages to work on data processing problems that are too large to be kept in memory. It breaks the larger processing job into many smaller tasks that are handled by numpy or pandas and then it reassembles the results into a coherent whole.\" - Eric Ness ([Processing Data with Dask](https://medium.com/when-i-work-data/processing-data-with-dask-47e4233cf165))\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yEyzjGUfG33-"
+ },
+ "source": [
+ "___\n",
+ "# **Load the pandas library and check the version**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "oVMjT3DrG97K"
+ },
+ "source": [
+ "# Load the pandas library\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n",
+ "\n",
+ "print(f'pandas version: {pd.__version__}')\n",
+ "print(f'NumPy version.: {np.__version__}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OxoDsaKUVHdH"
+ },
+ "source": [
+ "# Settings\n",
+ "> We can configure pandas to make our work more productive. For example, we can set the number of ROWS and COLUMNS displayed and the precision shown for float numbers. We will look at this in more detail below.\n",
+ "\n",
+ "Source: [5 Advanced Features of Pandas and How to Use Them](https://www.kdnuggets.com/2019/10/5-advanced-features-pandas.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IOdqrf7uVlhC"
+ },
+ "source": [
+ "d_configuracao = {\n",
+ " 'display.max_columns': 1000,\n",
+ " 'display.expand_frame_repr': True,\n",
+ " 'display.max_rows': 5,\n",
+ " 'display.precision': 2,\n",
+ " 'display.show_dimensions': True\n",
+ " }\n",
+ "\n",
+ "for op, value in d_configuracao.items():\n",
+ " pd.set_option(op, value)\n",
+ " print(op, value)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
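+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Options set with pd.set_option stay in effect for the rest of the session. As a minimal sketch, assuming we only want different display settings for one block of code, pd.option_context applies them temporarily and restores the previous values on exit. The values 3 and 1 below are arbitrary choices for illustration.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Remember the current value so we can compare it afterwards.\n",
+ "before = pd.get_option('display.max_rows')\n",
+ "\n",
+ "# Inside the block the options are overridden; on exit they are restored.\n",
+ "with pd.option_context('display.max_rows', 3, 'display.precision', 1):\n",
+ "    print(pd.get_option('display.max_rows'))  # 3\n",
+ "\n",
+ "print(pd.get_option('display.max_rows') == before)  # True\n",
+ "```\n",
+ "\n",
+ "This keeps notebook-wide settings untouched while a single cell shows more or fewer rows."
+ ]
+ },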
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Paz-R-FOAJ7F"
+ },
+ "source": [
+ "___\n",
+ "# **Create a dataframe from other objects**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L4Jc0C2qPAQz"
+ },
+ "source": [
+ "## Create a dataframe from dictionaries"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Sa5rKwq6Fscj"
+ },
+ "source": [
+ "### Example 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0ofIGkiSSuYq"
+ },
+ "source": [
+ "d_frutas = {'Apple': [5, 6, 6, 8, 10, 3, 2],\n",
+ " 'Avocado': [6, 6, 3, 9, 3, 2, 1]}"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iJCNvPlUTzTI"
+ },
+ "source": [
+ "d_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7Y_0O_tJTfm3"
+ },
+ "source": [
+ "# index=['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'] below sets the row labels.\n",
+ "df_frutas = pd.DataFrame(d_frutas, index = ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'])\n",
+ "df_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "l2ll8ktfUKz2"
+ },
+ "source": [
+ "What was bought on Friday ('Sex')?\n",
+ "\n",
+ "* The indexer df.loc[label] returns the value(s) associated with a label. In our case, the labels (the index we passed to pd.DataFrame) are 'Seg', 'Ter', ..., 'Dom'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9Voor8_PUJum"
+ },
+ "source": [
+ "df_frutas.loc['Sex'] # Here, label = 'Sex'."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LMh4DTfebwAr"
+ },
+ "source": [
+ "* That is, the label 'Sex', which sits at position 4, has the values:\n",
+ "    * Apple..: 10\n",
+ "    * Avocado: 3\n",
+ "\n",
+ "Likewise, we can use the indexer df.iloc[position] to return the contents of the row at an integer position."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GJxawdh6bvJN"
+ },
+ "source": [
+ "df_frutas.iloc[4]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "obJt9OPGcL-x"
+ },
+ "source": [
+ "So df.loc['Sex'] and df.iloc[4] return the same row. Right?\n",
+ "\n",
+ "To help memorize this, consider that:\n",
+ "\n",
+ "* df.loc[label] --> loc starts with the letter **l**, which recalls the row's label.\n",
+ "* df.iloc[position] --> iloc starts with the letter **i**, which recalls the row's integer position."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "v7QlCcEorEIX"
+ },
+ "source": [
+ "#### What is the output of the code below?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kRRdQShrrKHk"
+ },
+ "source": [
+ "df_frutas.loc[4]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
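+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The distinction between labels and positions can be sketched on a small frame. This is an illustration only: df_demo is a hypothetical dataframe that reuses the weekday labels from this example, and the KeyError shown is what we would expect when asking loc for a label that does not exist.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical small frame with the same weekday labels used above.\n",
+ "df_demo = pd.DataFrame({'Apple': [5, 6, 6, 8, 10, 3, 2]},\n",
+ "                       index=['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'])\n",
+ "\n",
+ "print(df_demo.iloc[4]['Apple'])     # position 4 -> 10\n",
+ "print(df_demo.loc['Sex']['Apple'])  # label 'Sex' -> 10\n",
+ "\n",
+ "try:\n",
+ "    df_demo.loc[4]                  # there is no label 4 in this index\n",
+ "except KeyError as err:\n",
+ "    print('KeyError:', err)\n",
+ "```\n",
+ "\n",
+ "With string labels, loc only accepts those labels, while iloc always counts rows from 0."
+ ]
+ },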
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EkjAtbrRF01h"
+ },
+ "source": [
+ "### Example 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2EOX5MC4E1xL"
+ },
+ "source": [
+ "In practice, we deal with large databases and, in those cases, we often do not have ROW labels defined. To illustrate, consider the same example we have just seen, with a small change:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RC_OXmdjrkQm"
+ },
+ "source": [
+ "d_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D6FckgDPFFs0"
+ },
+ "source": [
+ "df_frutas = pd.DataFrame(d_frutas) # Note that we do not define the index here\n",
+ "df_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tkGc4JQcFPkp"
+ },
+ "source": [
+ "Notice that the labels are now integers from 0 to N-1, where N is the number of rows."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ri-EdUYAovLG"
+ },
+ "source": [
+ "#### What is the content of the row whose label is 4?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5YgWG_vlFVe_"
+ },
+ "source": [
+ "df_frutas.loc[4]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rFQxcAcVo2KD"
+ },
+ "source": [
+ "#### What is the content of the row at position 4?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xB1j4n6HFank"
+ },
+ "source": [
+ "df_frutas.iloc[4]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jEbCke3TFf_q"
+ },
+ "source": [
+ "In other words, with the default integer index, df.loc[4] and df.iloc[4] return the same row. Got it?"
+ ]
+ },
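+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One caveat worth sketching: even with the default integer index, loc and iloc differ when slicing. A label slice with loc includes the end label, while a position slice with iloc excludes the end position. The frame df_demo below is hypothetical.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "df_demo = pd.DataFrame({'Apple': [5, 6, 6, 8, 10, 3, 2]})  # default index 0..6\n",
+ "\n",
+ "print(len(df_demo.loc[0:4]))   # label slice includes the end label -> 5\n",
+ "print(len(df_demo.iloc[0:4]))  # position slice excludes the end -> 4\n",
+ "```\n",
+ "\n",
+ "So the two indexers agree for a single integer row here, but not for ranges."
+ ]
+ },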
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bKHw_VBKjkoL"
+ },
+ "source": [
+ "### Example 3 - Set the dataframe index using df.set_index()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "13ArWIhYju6s"
+ },
+ "source": [
+ "d_frutas= {'Dia_Semana': ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom'],\n",
+ " 'Apple': [5, 6, 6, 8, 10, 3, 2],\n",
+ " 'Avocado': [6, 6, 3, 9, 3, 2, 1]}\n",
+ "\n",
+ "d_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Evw9w16gk5h0"
+ },
+ "source": [
+ "# Create the dataframe df_frutas:\n",
+ "df_frutas = pd.DataFrame(d_frutas) # No index was given, so one is created automatically from 0 to N-1.\n",
+ "df_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NLbbRrdYoclw"
+ },
+ "source": [
+ "#### What is the content of row 4?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lB-ngbutl_0c"
+ },
+ "source": [
+ "df_frutas.iloc[4]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1aJLGapZlUFI"
+ },
+ "source": [
+ "# Set 'Dia_Semana' as the index (row labels) of the dataframe df_frutas\n",
+ "df_frutas.set_index('Dia_Semana', inplace = True)\n",
+ "df_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L1-U_sD-jAoO"
+ },
+ "source": [
+ "The expression above is equivalent to the following, applied to the dataframe as it was before the index was set:\n",
+ "\n",
+ "```\n",
+ "df_frutas2 = df_frutas.set_index('Dia_Semana') # Note that there is no 'inplace' here\n",
+ "df_frutas2\n",
+ "```\n",
+ "\n",
+ "* So what does 'inplace=True' do in the first option?"
+ ]
+ },
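+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of the difference, using a hypothetical two-row frame df_demo: without inplace, set_index returns a new dataframe and leaves the original untouched; with inplace=True, the original is modified and the call returns None.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "df_demo = pd.DataFrame({'Dia_Semana': ['Seg', 'Ter'], 'Apple': [5, 6]})\n",
+ "\n",
+ "# Without inplace: a new dataframe is returned, df_demo is untouched.\n",
+ "df_new = df_demo.set_index('Dia_Semana')\n",
+ "print(list(df_demo.columns))  # ['Dia_Semana', 'Apple']\n",
+ "print(list(df_new.columns))   # ['Apple']\n",
+ "\n",
+ "# With inplace=True: df_demo itself changes and the call returns None.\n",
+ "result = df_demo.set_index('Dia_Semana', inplace=True)\n",
+ "print(result)                 # None\n",
+ "print(list(df_demo.columns))  # ['Apple']\n",
+ "```\n",
+ "\n",
+ "Because the in-place call returns None, writing df = df.set_index(..., inplace=True) would replace the dataframe with None."
+ ]
+ },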
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oXeFjJonpQfB"
+ },
+ "source": [
+ "#### What is the content of row 4?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "MMXg3vVQpUhh"
+ },
+ "source": [
+ "df_frutas.iloc[4]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fhoYuGMlpVFj"
+ },
+ "source": [
+ "#### What is the content of the row whose label is 'Sex'?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fmcWbrEspdYW"
+ },
+ "source": [
+ "df_frutas.loc['Sex']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bobggpoCTRkj"
+ },
+ "source": [
+ "### What is the difference between the next two calls, df_frutas.mean() and df_frutas.mean(1)?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SjiYgbNrsvpl"
+ },
+ "source": [
+ "df_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OFhzE7hgTD0a"
+ },
+ "source": [
+ "df_frutas.mean()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "V42I3807TNte"
+ },
+ "source": [
+ "df_frutas.mean(1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
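+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The argument of mean is the axis. As a small sketch on a hypothetical frame df_demo: the default axis=0 collapses the rows and gives one mean per COLUMN, while axis=1 collapses the columns and gives one mean per ROW.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "df_demo = pd.DataFrame({'Apple': [5, 6], 'Avocado': [6, 6]}, index=['Seg', 'Ter'])\n",
+ "\n",
+ "print(df_demo.mean())        # axis=0 (default): Apple 5.5, Avocado 6.0\n",
+ "print(df_demo.mean(axis=1))  # axis=1: Seg 5.5, Ter 6.0\n",
+ "```\n",
+ "\n",
+ "Writing axis=1 by keyword is easier to read than the bare positional 1."
+ ]
+ },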
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6iUCthsbtLV8"
+ },
+ "source": [
+ "df_frutas.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YdkmYePYtcON"
+ },
+ "source": [
+ "df_frutas.dtypes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2RmgCIC2HZFp"
+ },
+ "source": [
+ "### Example 4"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kbHHuMzzAR1A"
+ },
+ "source": [
+ "d_estudantes = {'nome': ['Jack', 'Richard', 'Tommy', 'Ana'],\n",
+ "                'age': [25, 34, 18, 21],\n",
+ "                'city': ['Sydney', 'Rio de Janeiro', 'Lisbon', 'New York'],\n",
+ "                'country': ['Australia', 'Brazil', 'Portugal', 'United States']}"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ayKqLmHTANOu"
+ },
+ "source": [
+ "# Show the content of the dictionary d_estudantes...\n",
+ "d_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0ONA8QsBBP6R"
+ },
+ "source": [
+ "# Keys of the dictionary d_estudantes\n",
+ "d_estudantes.keys()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "k8mmvKQ_BjO6"
+ },
+ "source": [
+ "# Items (key-value pairs) of the dictionary d_estudantes\n",
+ "d_estudantes.items()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hcm8V_UmBr1Y"
+ },
+ "source": [
+ "# Values of the dictionary d_estudantes\n",
+ "d_estudantes.values()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KK7IejsPDkWC"
+ },
+ "source": [
+ "We have a key 'nome'. What is stored under this key?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eHvPpeiTBwoR"
+ },
+ "source": [
+ "d_estudantes['nome']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "S1y7p8CcDsXl"
+ },
+ "source": [
+ "What is the output of the following expression?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "26WIDl-HB3Bq"
+ },
+ "source": [
+ "d_estudantes['nome'][0]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gV68kQ5HCIif"
+ },
+ "source": [
+ "Creating the dataframe df_estudantes from the dictionary d_estudantes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2oa808hkCSaq"
+ },
+ "source": [
+ "df_estudantes = pd.DataFrame(d_estudantes)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7HLp0FYpCiSc"
+ },
+ "source": [
+ "# Show the content of the dataframe df_estudantes...\n",
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "en06lfazciE0"
+ },
+ "source": [
+ "**Note**: In this case we did not define labels for the ROWS. In practice this is the most common situation: the labels coincide with the positions, which here are integers from 0 to N-1."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gFaPp-S-cy1-"
+ },
+ "source": [
+ "Once again, let's use df.loc[] and df.iloc[]..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mT9vwRBidGXX"
+ },
+ "source": [
+ "# Showing the content of row 3 using df.loc[]\n",
+ "df_estudantes.loc[3]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Zj88AwHUdix0"
+ },
+ "source": [
+ "OR"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SP2mG8todkMe"
+ },
+ "source": [
+ "# Showing the content of row 3 using df.iloc[]\n",
+ "df_estudantes.iloc[3]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hzbLO0EDGWTf"
+ },
+ "source": [
+ "OK, we have discussed this before. When the ROWS have no custom labels, iloc[] and loc[] return the same row for a given integer."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VvzVg7SpeOOB"
+ },
+ "source": [
+ "___\n",
+ "## Create dataframes from lists\n",
+ "* Consider the following list of fruits:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0_PY9OROeUiT"
+ },
+ "source": [
+ "l_frutas = [('Melon', 6, 8, 5, 4 ,6, 2, 8), ('Avocado', 6, 6, 3, 8, 9, 3, 1), ('Blueberry', 7, 5, 9, 3, 1, 0, 4)]\n",
+ "l_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AfE_rHq5g4_P"
+ },
+ "source": [
+ "type(l_frutas)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZpdPSi7RgVjK"
+ },
+ "source": [
+ "l_frutas[0]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NMyIpVW8gZTH"
+ },
+ "source": [
+ "l_frutas[0][0]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-cyZVqQFhjjg"
+ },
+ "source": [
+ "# List with the COLUMN names of the dataframe:\n",
+ "l_colunas = ['Frutas', 'Dom', 'Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab']\n",
+ "l_colunas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wplKvgayfZm_"
+ },
+ "source": [
+ "# Converting the list of tuples into a dataframe\n",
+ "df_frutas = pd.DataFrame(l_frutas, columns = l_colunas) # Note that here the COLUMN names are given as a list.\n",
+ "df_frutas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GojgsAXTFZmB"
+ },
+ "source": [
+ "___\n",
+ "# **Copy dataframes**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "g_Tda4ZwjWIW"
+ },
+ "source": [
+ "The dataframe df_estudantes has the following content:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "P5y0aVkdkA8o"
+ },
+ "source": [
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Cp3bvPEqj5fS"
+ },
+ "source": [
+ "If we do..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "J2PT5L11j8O0"
+ },
+ "source": [
+ "df_estudantes2 = df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2D29pGuikBBK"
+ },
+ "source": [
+ "then df_estudantes2 has the same content as df_estudantes, ok?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_IseZEpLkGS4"
+ },
+ "source": [
+ "df_estudantes2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "29MpozLrkI83"
+ },
+ "source": [
+ "Now change the value 'Rio de Janeiro' to 'Sao Paulo' in the dataframe df_estudantes2."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TXCqFiGFkmyv"
+ },
+ "source": [
+ "df_estudantes2['city'] = df_estudantes2['city'].replace({'Rio de Janeiro': 'Sao Paulo'})\n",
+ "df_estudantes2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I_0mgT7-8Fsl"
+ },
+ "source": [
+ "# OR\n",
+ "alteracoes = {'Rio de Janeiro': 'Sao Paulo'}\n",
+ "df_estudantes2['city'] = df_estudantes2['city'].replace(alteracoes)\n",
+ "df_estudantes2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BN8ZGu2Xk6vt"
+ },
+ "source": [
+ "OK, we replaced the value 'Rio de Janeiro' with 'Sao Paulo', as we wanted. Let's look at the content of df_estudantes (**which should be intact, since we made the change in the dataframe df_estudantes2**)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "thNAWoDflRoQ"
+ },
+ "source": [
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VkIS8wVmlAyq"
+ },
+ "source": [
+ "Ooooops... df_estudantes was changed? How, if we made the change in df_estudantes2 and NOT in df_estudantes???\n",
+ "\n",
+ "* **Are the operations we perform on df_estudantes2 also applied to df_estudantes**?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e9u-Z9NMltC9"
+ },
+ "source": [
+ "**Answer**: YES, because df_estudantes2 is just another name for the same object as df_estudantes (the assignment copies the reference, not the data). That is, **any in-place change we make through df_estudantes2 is also visible through df_estudantes**."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IDwvsxhhmlE4"
+ },
+ "source": [
+ "An easy way to see this is to compare the identities (memory addresses, in CPython) of the two (**supposedly different**) dataframes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ePFwKua8mu7k"
+ },
+ "source": [
+ "id(df_estudantes2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bMvY_E0mmwQH"
+ },
+ "source": [
+ "id(df_estudantes)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "K5qC5BuzmyF0"
+ },
+ "source": [
+ "**Conclusion**: df_estudantes2 and df_estudantes refer to the same object."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZZ50ejRImAQ8"
+ },
+ "source": [
+ "## The correct way to copy a dataframe"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oTbzxNkDmQiJ"
+ },
+ "source": [
+ "First, let's rebuild df_estudantes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DmVq0vM0mTtQ"
+ },
+ "source": [
+ "df_estudantes = pd.DataFrame(d_estudantes)\n",
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oZrlwtqJmYB_"
+ },
+ "source": [
+ "Making the copy of the dataframe (**the correct way**):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "No5A7nHDFbsy"
+ },
+ "source": [
+ "df_estudantes_Copy = df_estudantes.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NvKNFr8RnEft"
+ },
+ "source": [
+ "Let's check the memory addresses of the two dataframes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0_OO90SFki4f"
+ },
+ "source": [
+ "id(df_estudantes_Copy)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "T0BibX8rkes5"
+ },
+ "source": [
+ "id(df_estudantes)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Fbm-8cCUFgJa"
+ },
+ "source": [
+ "Now df_estudantes_Copy is an independent copy of df_estudantes: the two ids are different."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SuL8WUxL-u6-"
+ },
+ "source": [
+ "___\n",
+ "# **Rename dataframe COLUMNS**\n",
+ "> **Snippet**: \n",
+ "\n",
+ "  * df.rename(columns = {'Old_Name': 'New_Name'}, inplace = True)\n",
+ "  * OR df = df.rename(columns = {'Old_Name': 'New_Name'})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IvpCfmQnIZKl"
+ },
+ "source": [
+ "Suppose I want to rename the COLUMN 'nome' to 'nome_cliente', which is a more descriptive name."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "o54Fa-yxnmuz"
+ },
+ "source": [
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FwzXjYJgCvGk"
+ },
+ "source": [
+ "df_estudantes= df_estudantes.rename(columns = {'nome': 'nome_cliente'})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gOolGiWt4A18"
+ },
+ "source": [
+ "The command below produces the same result as the previous cell:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Y6jjAFRd341e"
+ },
+ "source": [
+ "```\n",
+ "df_estudantes.rename(columns= {'nome': 'nome_cliente'}, inplace = True)\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DwVMldKiF5gS"
+ },
+ "source": [
+ "# Showing df_estudantes after renaming the column/variable 'nome' to 'nome_cliente'...\n",
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "m-WZBLWqELOv"
+ },
+ "source": [
+ "Now suppose we want to rename 'age' to 'idade_cliente', 'city' to 'cidade_cliente' and 'country' to 'pais_cliente'..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VS6ua4u1EX5g"
+ },
+ "source": [
+ "df_estudantes.rename(columns = {'age': 'idade_cliente', 'city': 'cidade_cliente', 'country': 'pais_cliente'}, inplace = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "i_7LW07y4SvO"
+ },
+ "source": [
+ "The command below produces the same result as the previous cell:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9X-cv9RL4WjV"
+ },
+ "source": [
+ "```\n",
+ "df_estudantes = df_estudantes.rename(columns= {'age': 'idade_cliente', 'city': 'cidade_cliente', 'country': 'pais_cliente'})\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "EOb1-TEKGM9I"
+ },
+ "source": [
+ "# Showing df_estudantes after renaming several columns at once...\n",
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q0IZZjLRJlU6"
+ },
+ "source": [
+ "Any questions so far?\n",
+ "Is everything clear up to this point?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5LwL2m5KbLYz"
+ },
+ "source": [
+ "## Challenge\n",
+ "* Apply lower() to all COLUMN names of the dataframe df_estudantes. How would you do it?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MURfzmeLbUzF"
+ },
+ "source": [
+ "### My solution:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "r-FgBY-3xBi9"
+ },
+ "source": [
+ "df_estudantes2 = df_estudantes.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hlSlfcoub8gH"
+ },
+ "source": [
+ "# Store the COLUMN names (a pandas Index) in a variable:\n",
+ "l_colunas = df_estudantes2.columns\n",
+ "l_colunas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I_IGvEK4bdQP"
+ },
+ "source": [
+ "# Lowercase all COLUMN names\n",
+ "df_estudantes2.columns = [col.lower() for col in l_colunas]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0qzzAa3ycKmF"
+ },
+ "source": [
+ "# Showing the contents of the dataframe df_estudantes2\n",
+ "df_estudantes2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c-u-ndMPV_KX"
+ },
+ "source": [
+ "___\n",
+ "# **Add/append new ROWS to the dataframe**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MDkWbukBLhw7"
+ },
+ "source": [
+ "## Using dictionaries\n",
+ "* You must provide {'Column_Name': value} for each inserted field. For example, I will add the following record to the dataframe:\n",
+ "  * nome_cliente= 'Anderson';\n",
+ "  * idade_cliente= 22;\n",
+ "  * cidade_cliente= 'Porto';\n",
+ "  * pais_cliente= 'Portugal'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GECPO7iyK9UU"
+ },
+ "source": [
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XQKqqC93LoQ_"
+ },
+ "source": [
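+ "df_estudantes_Copia = df_estudantes.copy()\n",
+ "# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.\n",
+ "novo_registro = pd.DataFrame([{'nome_cliente': 'Anderson',\n",
+ "                               'idade_cliente': 22,\n",
+ "                               'cidade_cliente': 'Porto',\n",
+ "                               'pais_cliente': 'Portugal'}])\n",
+ "pd.concat([df_estudantes, novo_registro], ignore_index=True)"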
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bdBttsHNLjd-"
+ },
+ "source": [
+ "Is this the result we wanted?\n",
+ "Can you explain what went wrong?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6jDoq6CCMerp"
+ },
+ "source": [
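+ "**HINT**: Check that the dictionary keys match the current COLUMN names (remember that in the previous step we rewrote column names with lower()), and note that the result was not assigned back to df_estudantes."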
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ffReAaUHLvEF"
+ },
+ "source": [
+ "# Restoring df_estudantes from the copy df_estudantes_Copia\n",
+ "df_estudantes = df_estudantes_Copia.copy()\n",
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EzTo-IvmM2Fg"
+ },
+ "source": [
+ "OK, we have restored df_estudantes from the copy. Now let's do it the correct way:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IRhE76i4M6d6"
+ },
+ "source": [
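+ "# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.\n",
+ "novo_registro = pd.DataFrame([{'nome_cliente': 'Anderson',\n",
+ "                               'idade_cliente': 22,\n",
+ "                               'cidade_cliente': 'Porto',\n",
+ "                               'pais_cliente': 'Portugal'}])\n",
+ "df_estudantes = pd.concat([df_estudantes, novo_registro], ignore_index=True)"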
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jAojB2MMNDRJ"
+ },
+ "source": [
+ "Good, this is the result we were expecting..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5czZb-5wNp_F"
+ },
+ "source": [
+ "## Using Series\n",
+ "* As an example, suppose we want to add the following data:\n",
+ "  * nome_cliente= 'Bill';\n",
+ "  * idade_cliente= 30;\n",
+ "  * cidade_cliente= 'São Paulo';\n",
+ "  * pais_cliente= 'Brazil'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "J3qCydqMNtGt"
+ },
+ "source": [
+ "novo_estudante = pd.Series(['Bill', 30, 'São Paulo', 'Brazil'], index= df_estudantes2.columns) # Note: index=df_estudantes2.columns labels the values with the dataframe's column names."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "g_DyMDrNPrmC"
+ },
+ "source": [
+ "Let's look at the contents of novo_estudante:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jDQUl0RBPoLB"
+ },
+ "source": [
+ "novo_estudante"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zMKRNQrsPvxp"
+ },
+ "source": [
+ "Finally, append novo_estudante to the dataframe df_estudantes2..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5mEQg26iPw4A"
+ },
+ "source": [
+ "df_estudantes2 = pd.concat([df_estudantes2, novo_estudante.to_frame().T], ignore_index=True) # DataFrame.append was removed in pandas 2.0\n",
+ "df_estudantes2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Biwk2McAWW1Z"
+ },
+ "source": [
+ "___\n",
+ "# **Add/append new COLUMNS to the dataframe**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EZFTH7A-Wpw5"
+ },
+ "source": [
+ "## Using lists\n",
+ "* Suppose we want to add the column/variable 'score'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YzBKQo5epXP5"
+ },
+ "source": [
+ "df_estudantes2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pPoObAKJW6YF"
+ },
+ "source": [
+ "# Adding (creating) the column/variable 'score' in the dataframe from a list object\n",
+ "df_estudantes2['score'] = [500, 300, 200, 800, 700, 100]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Ocbh8sZqWsoW"
+ },
+ "source": [
+ "# Show the contents of the dataframe df_estudantes2...\n",
+ "df_estudantes2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZxfCMcVxYQgL"
+ },
+ "source": [
+ "> **Attention**:\n",
+ "\n",
+ "* If the number of values in the list differs from the number of ROWS in the dataframe, pandas raises a ValueError.\n",
+ "* If the column/variable we are inserting already exists in the dataframe, its values are overwritten with the new ones."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "34ntllD_YbNa"
+ },
+ "source": [
+ "## Using a default value\n",
+ "* Add the column 'total' with the same value for every ROW"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "T7QSMJMQYous"
+ },
+ "source": [
+ "df_estudantes['total'] = 500\n",
+ "df_estudantes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gll-gJt7as3C"
+ },
+ "source": [
+ "## Add a COLUMN computed from other COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "T_pB_isBaw-E"
+ },
+ "source": [
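+ "# The 'score' column was added to df_estudantes2 above, so the percentage is computed on that dataframe.\n",
+ "df_estudantes2['percentagem'] = 100 * df_estudantes2['score'] / df_estudantes2['score'].sum()\n",
+ "df_estudantes2"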
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "D9TNylt84hle"
+ },
+ "source": [
+ "___\n",
+ "# **Read/load data in Python**\n",
+ "* pandas can read many file formats:\n",
+ "\n",
+ "| Format Type | Data Description | Reader | Writer |\n",
+ "|---|---|---|---|\n",
+ "| text | CSV | read_csv | to_csv |\n",
+ "| text | JSON | read_json | to_json |\n",
+ "| text | HTML | read_html | to_html |\n",
+ "| text | Local clipboard | read_clipboard | to_clipboard |\n",
+ "| binary | MS Excel | read_excel | to_excel |\n",
+ "| binary | HDF5 Format | read_hdf | to_hdf |\n",
+ "| binary | Stata | read_stata | to_stata |\n",
+ "| binary | SAS | read_sas | |\n",
+ "| binary | Python Pickle Format | read_pickle | to_pickle |\n",
+ "| SQL | SQL | read_sql | to_sql |\n",
+ "| SQL | Google BigQuery | read_gbq | to_gbq |\n",
+ "\n",
+ "* Source: [IO tools (text, CSV, HDF5, …)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ss8jLEUSblDm"
+ },
+ "source": [
+ "___\n",
+ "# **Read/load a CSV file**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "n8e9aphab_oe"
+ },
+ "source": [
+ "# import the pandas library\n",
+ "import pandas as pd"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "R2fRd_MSQ2Xa"
+ },
+ "source": [
+ "Next, we will:\n",
+ "* Read the Titanic CSV file into a dataframe;\n",
+ "* Set 'PassengerId' as the index/key of the table with the argument index_col= 'PassengerId'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1R9YoFJ02TR7"
+ },
+ "source": [
+ "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_With_MV.csv?token=AGDJQ67OZ36XJUJPE77Z7LC7RBCAU'\n",
+ "df_Titanic = pd.read_csv(url, index_col = 'PassengerId')\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VS7_V15u0MgR"
+ },
+ "source": [
+ "df_Titanic.iloc[4] # NOT the same as df_Titanic.loc[4]: iloc selects by position (the 5th row), loc selects by index label (PassengerId == 4)."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WJ9RlRDSkk0_"
+ },
+ "source": [
+ "* Data dictionary for the dataframe df_Titanic:\n",
+ "  * PassengerId: Passenger ID;\n",
+ "  * Survived: Indicator, where 1= passenger survived and 0= passenger died;\n",
+ "  * Pclass: Ticket class;\n",
+ "  * Age: Passenger's age;\n",
+ "  * SibSp: Number of siblings/spouses aboard;\n",
+ "  * Parch: Number of parents/children aboard;\n",
+ "  * Fare: Fare paid by the passenger;\n",
+ "  * Cabin: Passenger's cabin;\n",
+ "  * Embarked: Port where the passenger embarked;\n",
+ "  * Name: Passenger's name;\n",
+ "  * Sex: Passenger's sex."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wz7Qd9mqMrfY"
+ },
+ "source": [
+ "# Show the dataframe df_Titanic:\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nDlANdnm4iod"
+ },
+ "source": [
+ "### Tip 1\n",
+ "Suppose the file we want to read is located in:\n",
+ "\n",
+ "```\n",
+ "/home/nsolucoes4ds/Dropbox/Data_Science/Python/Python_RFB/Python_RFB-DS_Python_020919_2244/Dataframes\n",
+ "```\n",
+ "\n",
+ "To read this (local) file into a dataframe, just use the following command:\n",
+ "\n",
+ "```\n",
+ "url = '/home/nsolucoes4ds/Dropbox/Data_Science/Python/Python_RFB/Python_RFB-DS_Python_020919_2244/Dataframes/creditcard.csv'\n",
+ "df_Titanic = pd.read_csv(url)\n",
+ "```\n",
+ "\n",
+ "### Tip 2\n",
+ "On Windows, a directory looks like this, for example: \n",
+ "```\n",
+ "C:\\nsolucoes4ds\\Data_Science\n",
+ "```\n",
+ "Note the '\\\\' (**backslashes**). In this case, use the following command:\n",
+ "\n",
+ "```\n",
+ "url= r'C:\\nsolucoes4ds\\Data_Science\\creditcard.csv'\n",
+ "df_Titanic = pd.read_csv(url)\n",
+ "```\n",
+ "\n",
+ "Did you notice the r'...' prefix? It marks a raw string, so the backslashes are not interpreted as escape characters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HubfewY8NgUv"
+ },
+ "source": [
+ "___\n",
+ "# **Fix (or standardize) COLUMN names**\n",
+ "* For example, rewrite the COLUMN names in lowercase using lower()."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4f_pEEOjvwjk"
+ },
+ "source": [
+ "To make our analyses easier, we will apply lower() to the COLUMN names and to all values of the object/string COLUMNS. To do so, consider the function below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Ft13IahH1kVX"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "G-UlaHFPv7kp"
+ },
+ "source": [
+ "def transformacao_lower(df):\n",
+ "    # First transformation: apply lower() to the COLUMN names (modifies df in place):\n",
+ "    df.columns = [col.lower() for col in df.columns]\n",
+ "\n",
+ "    # Second transformation: apply .str.lower() to the values of the object/string COLUMNS:\n",
+ "    l_cols_objeto = df.select_dtypes(include = ['object']).columns\n",
+ "\n",
+ "    for col in l_cols_objeto:\n",
+ "        df[col] = df[col].str.lower()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hNixsW8M7n1X"
+ },
+ "source": [
+ "To learn more about the method df[col].str.lower(), see [pandas.Series.str.lower](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.lower.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hz90zejtbxYj"
+ },
+ "source": [
+ "transformacao_lower(df_Titanic)\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UE5P1W-CPePM"
+ },
+ "source": [
+ "# **Select a subset of columns**\n",
+ "Suppose I want to select only the columns 'name' and 'sex' (the COLUMN names are lowercase after transformacao_lower)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "P7HJa4x7P0bQ"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3jLZUCfePsBs"
+ },
+ "source": [
+ "df_Titanic2 = df_Titanic[['name', 'sex']] # COLUMN names were lowercased by transformacao_lower\n",
+ "df_Titanic2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PyNsYTilnL2r"
+ },
+ "source": [
+ "# map()\n",
+ "> A technique for transforming the values of a column using a dictionary: {key: value}."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6z4FcyyAiTfF"
+ },
+ "source": [
+ "# Building a more intuitive variable to help with our analyses:\n",
+ "df_Titanic['survived2'] = df_Titanic['survived']\n",
+ "df_Titanic['survived2'] = df_Titanic['survived2'].map({0:'died', 1:'survived'})\n",
+ "df_Titanic[['survived', 'survived2']].head(3)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jwBWkaJOdhCv"
+ },
+ "source": [
+ "___\n",
+ "# **Select COLUMNS of the dataframe**\n",
+ "* Suppose we want to select only the COLUMNS 'survived', 'sex' and 'embarked':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Ivvj8JU2pBTq"
+ },
+ "source": [
+ "df_Titanic2 = df_Titanic[['survived', 'sex', 'embarked']]\n",
+ "df_Titanic2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Nf-Wnof_fdTR"
+ },
+ "source": [
+ "___\n",
+ "# **Create a dictionary from a dataframe**\n",
+ "> Consider the example dataframe below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lxf6Lgp4fit8"
+ },
+ "source": [
+ "df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "l7yzJu1y5huV"
+ },
+ "source": [
+ "From dataframe to dictionary..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_6V0qFZGhEoF"
+ },
+ "source": [
+ "df.to_dict('dict')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0GIe6xtqPA1Z"
+ },
+ "source": [
+ "___\n",
+ "# **Create lists from a dataframe**\n",
+ "> Consider the example dataframe below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fZxgejTtPLzX"
+ },
+ "source": [
+ "df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JoShm6oF5qLV"
+ },
+ "source": [
+ "From dataframe to lists (to_dict('list') returns a dictionary that maps each column name to the list of its values)..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gigPpSH_hlXu"
+ },
+ "source": [
+ "df.to_dict('list')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GpJDX-5xUUC0"
+ },
+ "source": [
+ "___\n",
+ "# **Show the first k ROWS of the dataframe**\n",
+ "> df.head(k), where k is the number of ROWS we want to display. By default, k= 5."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RwC9j_OxUbIR"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "G9cp2QrsA5M0"
+ },
+ "source": [
+ "___\n",
+ "# **Show the last k ROWS of the dataframe**\n",
+ "> df.tail(k), where k is the number of ROWS we want to see. By default, k= 5."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9mPxyhqoA4Wc"
+ },
+ "source": [
+ "df_Titanic.tail()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Odwm2qSLA_Ro"
+ },
+ "source": [
+ "By default, df.tail() shows the last 5 ROWS/instances of the dataframe. However, you can display any number of ROWS k, for example k= 10 as shown below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pUAnR00WA8ma"
+ },
+ "source": [
+ "df_Titanic.tail(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cZ64LfWv4zxo"
+ },
+ "source": [
+ "___\n",
+ "# **Show the COLUMN names of the dataframe**\n",
+ "* df.columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CKUUrX5n4zFW"
+ },
+ "source": [
+ "df_Titanic.columns"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6m7ukrOu5Inv"
+ },
+ "source": [
+ "___\n",
+ "# **Show the types of the dataframe COLUMNS**\n",
+ "* Attribute: df.dtypes --> no parentheses!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "S4NIHAPPl9lc"
+ },
+ "source": [
+ "df_Titanic.dtypes # dtypes is an attribute, so it takes no \"()\". Methods, on the other hand, require \"(arg1, arg2, ..., argN)\""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DGc6m-UBdHlE"
+ },
+ "source": [
+ "___\n",
+ "# **Automatically select dataframe COLUMNS by type**\n",
+ "> Snippet: df.select_dtypes(include=[type]).columns\n",
+ "\n",
+ "| Type | What it selects | Syntax |\n",
+ "|------|-----------------|---------|\n",
+ "| number | numeric columns | df.select_dtypes(include=['number']).columns |\n",
+ "| float | float columns | df.select_dtypes(include=['float']).columns |\n",
+ "| bool | boolean columns | df.select_dtypes(include=['bool']).columns |\n",
+ "| object | categorical/string columns | df.select_dtypes(include=['object']).columns |\n",
+ "\n",
+ "* To select more than one type, pass a list of types. \n",
+ "  * Example: df.select_dtypes(include=['object', 'float']).columns\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "O88YRCqIdYFL"
+ },
+ "source": [
+ "## Automatically select the numeric COLUMNS of the dataframe"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xG4a9ZfRnxPW"
+ },
+ "source": [
+ "### List of the numeric COLUMNS of the dataframe:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "C87uga35dKsF"
+ },
+ "source": [
+ "l_cols_numericas = df_Titanic.select_dtypes(include = ['number']).columns # \".columns\" returns the names of the numeric columns (as a pandas Index)\n",
+ "l_cols_numericas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5W6kbIVNn2UA"
+ },
+ "source": [
+ "### DataFrame with the numeric COLUMNS:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iTieUd_-eDmW"
+ },
+ "source": [
+ "df_numericas = df_Titanic.select_dtypes(include = ['number']) # Note: no .columns here --> in this case, the return value is the dataframe itself.\n",
+ "df_numericas.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xh4BFs_lds80"
+ },
+ "source": [
+ "## Automatically select the float COLUMNS of the dataframe"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Tw3FD74MoC6q"
+ },
+ "source": [
+ "### List of the float COLUMNS:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5clAUAIrd3UR"
+ },
+ "source": [
+ "l_cols_float = df_Titanic.select_dtypes(include = ['float']).columns\n",
+ "l_cols_float"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IZPROG6IoHwy"
+ },
+ "source": [
+ "### DataFrame with the float COLUMNS:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "osJDsyMHeXX4"
+ },
+ "source": [
+ "df_float = df_Titanic.select_dtypes(include = ['float']) # Note: no .columns here\n",
+ "df_float.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5uObezIIfuJ4"
+ },
+ "source": [
+ "## Automatically select the boolean COLUMNS of the dataframe"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xMKP5HhgoeMg"
+ },
+ "source": [
+ "### List of the boolean COLUMNS:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3Pn2IPBkf7k-"
+ },
+ "source": [
+ "l_cols_booleanas = df_Titanic.select_dtypes(include = ['bool']).columns\n",
+ "l_cols_booleanas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "k3sdiuXYokBE"
+ },
+ "source": [
+ "### DataFrame with the boolean COLUMNS:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Oem-M-17f7lG"
+ },
+ "source": [
+ "df_booleanas = df_Titanic.select_dtypes(include=['bool']) # Note: no .columns here\n",
+ "df_booleanas.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ObHYW92-gOXz"
+ },
+ "source": [
+ "## Automatically select the string (object) COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IzM5CIKXoxHO"
+ },
+ "source": [
+ "### List of the object/string COLUMNS:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rdYThBingOX1"
+ },
+ "source": [
+ "l_cols_objeto = df_Titanic.select_dtypes(include=['object']).columns\n",
+ "l_cols_objeto"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2ZGB5d36o21t"
+ },
+ "source": [
+ "### DataFrame with the object/string COLUMNS:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kWTtxeU4gOX4"
+ },
+ "source": [
+ "df_cols_obs = df_Titanic.select_dtypes(include=['object']) # Note: no .columns here\n",
+ "df_cols_obs.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SEBKHKRLkbUK"
+ },
+ "source": [
+ "___\n",
+ "# **Reorder the COLUMNS of the dataframe**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XRWfelWEkhae"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KBGDeR_JkyCc"
+ },
+ "source": [
+ "* Suppose we want to reorder the COLUMNS of the dataframe df_Titanic alphabetically, as follows:\n",
+ "  * age;\n",
+ "  * embarked;\n",
+ "  * fare;\n",
+ "  * parch;\n",
+ "  * pclass;\n",
+ "  * sex;\n",
+ "  * sibsp;\n",
+ "  * survived."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "d9jJi6qllnq_"
+ },
+ "source": [
+ "# Dataframe with the COLUMNS sorted alphabetically\n",
+ "df_Titanic = df_Titanic.reindex(sorted(df_Titanic.columns), axis = 1)\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Cj4MREti-izC"
+ },
+ "source": [
+ "___\n",
+ "# **Show the dimensions of the dataframe**\n",
+ "* Dimensions = number of ROWS and COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "50Tij93l-n7B"
+ },
+ "source": [
+ "df_Titanic.shape"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZQo4YeH_-qfL"
+ },
+ "source": [
+ "How do you interpret this result?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "klHcwpPEALP8"
+ },
+ "source": [
+ "## **Split the dimensions into two parts: number of ROWS and COLUMNS**\n",
+ "* Number of rows in the dataframe: df_Titanic.shape[0]\n",
+ "* Number of columns in the dataframe: df_Titanic.shape[1]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qjR8OEdDAOog"
+ },
+ "source": [
+ "f'The dataframe df_Titanic has {df_Titanic.shape[0]} rows and {df_Titanic.shape[1]} columns.'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pIsf_nDtyAvF"
+ },
+ "source": [
+ "___\n",
+ "# **Combine dataframes: Merge, Join & Concatenate**\n",
+ "* Source: [Merge, join, and concatenate](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "s1fSplrlEMHK"
+ },
+ "source": [
+ "* Below are three ways to combine dataframes:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6DYtWxuIrdzF"
+ },
+ "source": [
+ "## Concatenate\n",
+ "* Joins/stacks dataframes\n",
+ "* Source: https://github.com/aakankshaws/Pandas-exercises"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nnP5VuWkri_b"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n",
+ " 'B': ['B0', 'B1', 'B2', 'B3'],\n",
+ " 'C': ['C0', 'C1', 'C2', 'C3'],\n",
+ " 'D': ['D0', 'D1', 'D2', 'D3']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rkJvSGYSrm8b"
+ },
+ "source": [
+ "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n",
+ " 'B': ['B4', 'B5', 'B6', 'B7'],\n",
+ " 'C': ['C4', 'C5', 'C6', 'C7'],\n",
+ " 'D': ['D4', 'D5', 'D6', 'D7']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NCgdYvJIrqx1"
+ },
+ "source": [
+ "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n",
+ " 'B': ['B8', 'B9', 'B10', 'B11'],\n",
+ " 'C': ['C8', 'C9', 'C10', 'C11'],\n",
+ " 'D': ['D8', 'D9', 'D10', 'D11']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gUoyjyjur5Zn"
+ },
+ "source": [
+ "df1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xU6Rh10Gr7NA"
+ },
+ "source": [
+ "df2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qKwmOWsQr9wA"
+ },
+ "source": [
+ "df3"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-MNn-XdlsjJS"
+ },
+ "source": [
+ "df= pd.concat([df1, df2, df3])\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BV6HgxSYtG6Z"
+ },
+ "source": [
+ "Notice that we basically stacked the dataframes on top of each other. However, if we do..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Dp-oh-7ftLo5"
+ },
+ "source": [
+ "df = pd.concat([df1, df2, df3], axis = 1) # axis = 1 concatenates side by side (column-wise)\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iyDZt2XEtmVs"
+ },
+ "source": [
+ "If, however, we have:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5PAhjjVZtpP5"
+ },
+ "source": [
+ "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n",
+ " 'B': ['B0', 'B1', 'B2', 'B3'],\n",
+ " 'C': ['C0', 'C1', 'C2', 'C3'],\n",
+ " 'D': ['D0', 'D1', 'D2', 'D3']},\n",
+ " index=[0, 1, 2, 3])\n",
+ "\n",
+ "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n",
+ " 'B': ['B4', 'B5', 'B6', 'B7'],\n",
+ " 'C': ['C4', 'C5', 'C6', 'C7'],\n",
+ " 'D': ['D4', 'D5', 'D6', 'D7']},\n",
+ " index=[4, 5, 6, 7])\n",
+ "\n",
+ "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n",
+ " 'B': ['B8', 'B9', 'B10', 'B11'],\n",
+ " 'C': ['C8', 'C9', 'C10', 'C11'],\n",
+ " 'D': ['D8', 'D9', 'D10', 'D11']},\n",
+ " index=[8, 9, 10, 11])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zGDHd-kPt3-T"
+ },
+ "source": [
+ "Then..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3bTl2Nr2t5WM"
+ },
+ "source": [
+ "df = pd.concat([df1, df2, df3], axis = 1)\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sUXjlp_Jt925"
+ },
+ "source": [
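+ "Why does this happen? (Hint: with axis = 1, concat aligns the rows by index label, and these three dataframes have no index labels in common.)"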
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JdKXY873HrYt"
+ },
+ "source": [
+ "## Merge\n",
+ "> First, let's go through all the possible kinds of joins.\n",
+ "\n",
+ "### Example\n",
+ "> The example below was inspired by the one presented in [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins). Consider the following dataframes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "g4pmhk2t3x8s"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "d_Tabela_A = {'indices': [1,2,3,6,7,5,4,10], 'valores': ['A','B','C','D','E','F','G','H']}\n",
+ "d_Tabela_B = {'indices': [1,2,3,6,7,8,9,11], 'valores': ['AA', 'BB','CC','DD','EE','FF','GG','HH']}"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XxfUULxY52ns"
+ },
+ "source": [
+ "df_conjunto_A = pd.DataFrame(d_Tabela_A).set_index('indices')\n",
+ "df_conjunto_B = pd.DataFrame(d_Tabela_B).set_index('indices')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gGdU36Vi0Yso"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5w7ox7LV9cuG"
+ },
+ "source": [
+ "df_conjunto_A"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TPhmKw-F9fWX"
+ },
+ "source": [
+ "df_conjunto_B"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5AaTlCPy9FBZ"
+ },
+ "source": [
+ "df_inner_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'inner')\n",
+ "df_inner_join"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "U3OjFM0E0af-"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-efYd9c69k4L"
+ },
+ "source": [
+ "df_conjunto_A"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SqFbNStz9k4S"
+ },
+ "source": [
+ "df_conjunto_B"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rUpc2k729KA-"
+ },
+ "source": [
+ "df_left_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'left')\n",
+ "df_left_join"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WioSBHjW06Hg"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IrzPjGNp9o2n"
+ },
+ "source": [
+ "df_conjunto_A"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tFFTp_yG9o2s"
+ },
+ "source": [
+ "df_conjunto_B"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_D4tF7E-9PCx"
+ },
+ "source": [
+ "df_right_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'right')\n",
+ "df_right_join"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "E9xFrurZ0ksg"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kQCBAfj_9rO_"
+ },
+ "source": [
+ "df_conjunto_A"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FTDHYsgc9rP0"
+ },
+ "source": [
+ "df_conjunto_B"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hJqyAs0U9XwO"
+ },
+ "source": [
+ "df_outer_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how = 'outer')\n",
+ "df_outer_join"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fHEgLynu0vve"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZA8CcERE-RRS"
+ },
+ "source": [
+ "df_conjunto_A"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IZiAa9X6-UL0"
+ },
+ "source": [
+ "df_conjunto_B"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jdUt63rA-Vjo"
+ },
+ "source": [
+ "df_left_excluding_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how =\"outer\", indicator=True).query('_merge==\"left_only\"')\n",
+ "df_left_excluding_join"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CShcqL-h1MqK"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ECjUDoYf_C9x"
+ },
+ "source": [
+ "df_conjunto_A"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xym7VsXi_FXa"
+ },
+ "source": [
+ "df_conjunto_B"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-zFalmly_HJ7"
+ },
+ "source": [
+ "df_right_excluding_join = pd.merge(df_conjunto_A, df_conjunto_B, on = 'indices', how =\"outer\", indicator=True).query('_merge==\"right_only\"')\n",
+ "df_right_excluding_join"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "T8v4-zUt1WQz"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Visual Representation of SQL Joins](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8HeMgBqyAYjW"
+ },
+ "source": [
+ "### Challenge: how would you solve this one?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SkCbLsoktgKl"
+ },
+ "source": [
+ "### Notes:\n",
+ "\n",
+ "* In some cases, the key column has a different name in each of the two dataframes being joined. In that case, use 'left_on' and 'right_on' to give the name of the key COLUMN in the left and right dataframes:\n",
+ " * pd.merge(df1, df2, left_on =\"employee\", right_on =\"nome\")\n",
+ " * In the example above, dataframe df1 (the left dataframe) has the key 'employee', whereas dataframe df2 (the right dataframe) has the key 'nome'. Using 'left_on' and 'right_on' makes the join key of each dataframe explicit."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6Obc0fHUwIpu"
+ },
+ "source": [
+ "## Joining"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DQOa89_cwLyd"
+ },
+ "source": [
+ "df_esquerdo = pd.DataFrame({'A': ['A0', 'A1', 'A2'],\n",
+ " 'B': ['B0', 'B1', 'B2']},\n",
+ " index=['K0', 'K1', 'K2']) \n",
+ "\n",
+ "df_direito = pd.DataFrame({'C': ['C0', 'C2', 'C3'],\n",
+ " 'D': ['D0', 'D2', 'D3']},\n",
+ " index=['K0', 'K2', 'K3'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "UHnX9rxzwMmx"
+ },
+ "source": [
+ "df_esquerdo"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GBc1Mr0Qwff3"
+ },
+ "source": [
+ "df_direito"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TmIk3Kjlwg-7"
+ },
+ "source": [
+ "df_esquerdo.join(df_direito)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "h609fbjjwoZ3"
+ },
+ "source": [
+ "df_esquerdo.join(df_direito, how ='outer')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Y8W2kP-VCB3E"
+ },
+ "source": [
+ "___\n",
+ "# **Select ROWS of the dataframe based on the index**\n",
+ "### Further reading\n",
+ "* [pandas loc vs. iloc vs. ix vs. at vs. iat?\n",
+ "](https://stackoverflow.com/questions/28757389/pandas-loc-vs-iloc-vs-ix-vs-at-vs-iat/47098873#47098873)\n",
+ "* [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NN1R1ngAG61x"
+ },
+ "source": [
+ "## 1st Approach - df.loc[]\n",
+ "* To retrieve the contents of row k, use df.loc[row_indexer, column_indexer]."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "oduXMUtIUvkN"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JX9nGPWcVLgE"
+ },
+ "source": [
+ "\n",
+ "For example, the command below shows the contents of row 1, all COLUMNS (:)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "U5-I2NgYC2fD"
+ },
+ "source": [
+ "df2= df_Titanic.loc[1,:]\n",
+ "df2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tDSJcQLTDyJw"
+ },
+ "source": [
+ "Showing the contents of ROWS k = 1:2 (that is, ROWS 1 and 2, since .loc slices include both endpoints), all COLUMNS (:)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JD1TDTqAD_5r"
+ },
+ "source": [
+ "df_Titanic.loc[1:2, :]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EoAmcdfnEIho"
+ },
+ "source": [
+ "Show the contents of row k = 1, column 'pclass':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8vjc5z3_EQfY"
+ },
+ "source": [
+ "df_Titanic.loc[1, ['pclass']]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7bC8-H-QFLgd"
+ },
+ "source": [
+ "Show the contents of row k = 1 and COLUMNS ['pclass', 'sex']:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LYFTrZr_FR5g"
+ },
+ "source": [
+ "df_Titanic.loc[0, ['pclass', 'sex']]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UtUsmU8sXYTU"
+ },
+ "source": [
+ "Why do we get an error here?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CRy5sDx-XbBL"
+ },
+ "source": [
+ "Correct version below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5Lfw0HEnXdn0"
+ },
+ "source": [
+ "df_Titanic.loc[1, ['pclass', 'sex']]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Tjw3vjkDZg1Z"
+ },
+ "source": [
+ "Show the contents of ROWS k = 1:5 and COLUMNS ['pclass', 'sex']:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4GuAE5MSZjNb"
+ },
+ "source": [
+ "df_Titanic.loc[1:5, ['pclass', 'sex']]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xRZxqE6RFnJI"
+ },
+ "source": [
+ "Now suppose we want to select the entire 'sex' column. How do we do that?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JdeD_uzfFrp5"
+ },
+ "source": [
+ "df_sex= df_Titanic.loc[:, 'sex']\n",
+ "df_sex.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "z_WUjYxsX-Av"
+ },
+ "source": [
+ "Selecting what we want with .loc[] and .iloc[] is easy, right?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RKk0zollHFbp"
+ },
+ "source": [
+ "## 2nd Approach - Using [] (list-style slicing)\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jhwoY6LmGzC0"
+ },
+ "source": [
+ "df_Titanic[0:2] # Show the ROWS at positions 0 and 1 (the end of the slice is exclusive)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I6EOVIDxGiy-"
+ },
+ "source": [
+ "df_Titanic[:3] # Show the first 3 ROWS (positions 0 to 2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VOHp77F8H9t1"
+ },
+ "source": [
+ "df_Titanic['sex'].head() # Show the first rows of the 'sex' column"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8nvHNdhPZ040"
+ },
+ "source": [
+ "df_Titanic[0:5]['sex'].head() # Show positions 0 to 4 of the 'sex' column"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GMFso1jaYXgN"
+ },
+ "source": [
+ "___\n",
+ "# **Select/Filter/Replace ROWS of the dataframe based on conditions**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BKljSpS5ou-i"
+ },
+ "source": [
+ "## Example 1\n",
+ "> Building on the previous example, we want to select only the passengers whose sex is 'male'."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jek8Ru3Aam23"
+ },
+ "source": [
+ "### Approach 1: df.loc[] and df.iloc[]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eysZoBX2YKb-"
+ },
+ "source": [
+ "df_sexo_m_1 = df_Titanic.loc[df_Titanic['sex'] == 'male', 'sex']\n",
+ "df_sexo_m_1.head() "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uLDOHKGfaq-Z"
+ },
+ "source": [
+ "### Approach 2: Using []"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QncrZwHkasiu"
+ },
+ "source": [
+ "df_sexo_m_2 = df_Titanic[df_Titanic['sex'] == 'male']['sex']\n",
+ "df_sexo_m_2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ot6UBTYJF-AJ"
+ },
+ "source": [
+ "### Approach 3: df.isin()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OBRF0be3VuTi"
+ },
+ "source": [
+ "#### Example 1 - Simple filter"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LeTDiGICGOzb"
+ },
+ "source": [
+ "df_sexo_m_3 = df_Titanic['sex'].isin(['male'])\n",
+ "df_sexo_m_3.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Q6emu30nGmpt"
+ },
+ "source": [
+ "#### Example 2 - Double filter = Two conditions\n",
+ "> Select all ROWS where sex = 'male' and passenger class = 1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TRaiCYMRGpgl"
+ },
+ "source": [
+ "# Filters using isin()\n",
+ "filtro_m = df_Titanic[\"sex\"].isin([\"male\"])\n",
+ "filtro_class1 = df_Titanic[\"pclass\"].isin([1])\n",
+ "\n",
+ "# Show the results\n",
+ "df_Titanic[filtro_m & filtro_class1].head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Sh0DDj1xcPaI"
+ },
+ "source": [
+ "df_sexo_m_class = df_Titanic[((df_Titanic['sex'] == 'male') & (df_Titanic['pclass'] == 1))]\n",
+ "df_sexo_m_class.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ujrYHyOsfW7n"
+ },
+ "source": [
+ "### Approach 4 - Filter with df[col].str.contains('s_substr')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gntbfHgTfanx"
+ },
+ "source": [
+ "# Show all ROWS where the string 'Mr' appears in the passenger's name:\n",
+ "df2 = df_Titanic[df_Titanic['Name'].str.contains('Mr')]\n",
+ "df2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eaRtQ8Ja8MOH"
+ },
+ "source": [
+ "To learn more about the df[col].str.contains() method, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FyJ-gEjzQI2Y"
+ },
+ "source": [
+ "## Replace values in the dataframe\n",
+ "> Suppose we want to replace all the values of pclass as follows:\n",
+ "* If pclass = 1 --> pclass2 = 'Classe1';\n",
+ "* If pclass = 2 --> pclass2 = 'Classe2';\n",
+ "* If pclass = 3 --> pclass2 = 'Classe3';\n",
+ "\n",
+ "How do we do that?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Pi8MFiUPQQb7"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "19mynzdfQqVf"
+ },
+ "source": [
+ "df_Titanic['pclass2'] = df_Titanic['pclass'].astype(object) # object dtype so the column can hold strings\n",
+ "df_Titanic.loc[df_Titanic['pclass'] == 1, 'pclass2'] = 'Classe1' # .loc avoids chained assignment\n",
+ "df_Titanic.loc[df_Titanic['pclass'] == 2, 'pclass2'] = 'Classe2'\n",
+ "df_Titanic.loc[df_Titanic['pclass'] == 3, 'pclass2'] = 'Classe3'\n",
+ "df_Titanic['pclass2'].head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KVSAYeU0KA2V"
+ },
+ "source": [
+ "___\n",
+ "# **Select random samples from the dataframe**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "U502dAs3OfOH"
+ },
+ "source": [
+ "We saw that the df_Titanic dataframe is quite large. So let's randomly select 100 ROWS."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0BrKUnAiPcAy"
+ },
+ "source": [
+ "import random \n",
+ "\n",
+ "# Library used to measure the processing time of each alternative\n",
+ "import time"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iJ1G8lYgKGsc"
+ },
+ "source": [
+ "# Using sample\n",
+ "t0= time.time()\n",
+ "df_Titanic_a100= df_Titanic.sample(100, replace= False, random_state= 20111974)\n",
+ "t1= time.time()\n",
+ "t= t1-t0\n",
+ "df_Titanic_a100.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8DvWOKizZQr8"
+ },
+ "source": [
+ "f'Processing time: {t} seconds'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nAHLTjpvYKPS"
+ },
+ "source": [
+ "# Using NumPy\n",
+ "import numpy as np\n",
+ "\n",
+ "t0 = time.time()\n",
+ "np.random.seed(20111974)\n",
+ "indices = np.random.choice(df_Titanic.shape[0], replace = False, size=100)\n",
+ "df_Titanic_a100_2 = df_Titanic.iloc[indices]\n",
+ "t1 = time.time()\n",
+ "t = t1-t0\n",
+ "df_Titanic_a100_2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "U8PEDMJ4a52P"
+ },
+ "source": [
+ "f'Processing time: {t} seconds'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wYeuJWdEdMPd"
+ },
+ "source": [
+ "df_Titanic_a100_2.shape"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vNMiRkjCQ9Mu"
+ },
+ "source": [
+ "___\n",
+ "# **Describe the Dataframe**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GllUFj56RHuD"
+ },
+ "source": [
+ "df_Titanic_a100.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "izbpIEi1d1sx"
+ },
+ "source": [
+ "df_Titanic_a100_2.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "H40G3QzWbG9N"
+ },
+ "source": [
+ "___\n",
+ "# **Identify and handle duplicated ROWS**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_OoM_HS5ZgxG"
+ },
+ "source": [
+ "## Example 1\n",
+ "* Considers duplicates across all COLUMNS of the dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5XOOdOZBbLc_"
+ },
+ "source": [
+ "df = pd.DataFrame({'A':[1,1,3,4,5,1], 'B':[1,1,3,7,8,1], 'C':[3,1,1,6,7,1]})\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Gio08BkTbTOp"
+ },
+ "source": [
+ "# Flag the duplicated rows as a boolean Series\n",
+ "df.duplicated()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "obgbM4d_hJ_J"
+ },
+ "source": [
+ "Look at row 5, which is flagged as duplicated. Indeed, row 5 is identical to row 1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LHhOIb-EbWfn"
+ },
+ "source": [
+ "# Show the duplicated ROWS\n",
+ "df[df.duplicated()]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IyJS70_kZ-Jk"
+ },
+ "source": [
+ "# Drop row 5 which, as we saw, was duplicated (a copy of row 1).\n",
+ "df= df.drop_duplicates()\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3Q05mxOSaEjX"
+ },
+ "source": [
+ "## Example 2\n",
+ "* Considers only some COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jiqyjcqdaQ1y"
+ },
+ "source": [
+ "df = pd.DataFrame({'A':[1,1,3,4,5,1], 'B':[1,1,3,7,8,1], 'C':[3,1,1,6,7,1]})\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "F_118d7vbZ9Y"
+ },
+ "source": [
+ "# Show the duplicated ROWS, comparing only COLUMNS 'A' and 'B'\n",
+ "df[df.duplicated(subset=['A','B'])]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_1w_ZZO4vF3A"
+ },
+ "source": [
+ "# Drop ROWS 1 and 5 because, as we can see, they duplicate row 0 in COLUMNS 'A' and 'B'\n",
+ "df= df.drop_duplicates(subset = ['A', 'B'])\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qVx6p8u36jhD"
+ },
+ "source": [
+ "___\n",
+ "# **Working with text data**\n",
+ "* Sources:\n",
+ " * [Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)\n",
+ " * [Using String Methods](https://www.ritchieng.com/pandas-string-methods/)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JLG3cVA1e8-B"
+ },
+ "source": [
+ "Preparing the data for the example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "G_CEULoyeP8C"
+ },
+ "source": [
+ "# Build a dictionary with the data:\n",
+ "import numpy as np\n",
+ "\n",
+ "l_idade = []\n",
+ "for i in range(6):\n",
+ " np.random.seed(i) \n",
+ " l_idade.append(np.random.randint(10, 40))\n",
+ " \n",
+ "\n",
+ "d_exemplo = {'Nome':['Mr. Antonio dos Santos', 'Mr. Joao Pedro', 'Miss. Priscila Alvarenga', 'Mr. fagner NoVAES', 'Miss. Danielle Aparecida', 'Mr. Paullo Amarantes'], \n",
+ " 'Idade': l_idade, \n",
+ " 'Cidade':['lisboa', 'Sintra', 'Braga', 'Guimaraes', 'Mafra', 'Nazare']} \n",
+ " \n",
+ "# Convert the dictionary into a dataframe\n",
+ "df = pd.DataFrame(d_exemplo) \n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "or-Kzaqmdn2b"
+ },
+ "source": [
+ "* Suggestions for what we can do with the 'Nome' column of dataframe df:\n",
+ " * Extract the title from the name: Mr., Miss, etc.\n",
+ " * Build the COLUMNS primeiro_nome and ultimo_nome.\n",
+ " * Create the variable classe_idade."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Vd99ksvcg7uy"
+ },
+ "source": [
+ "## Extract the title from the name"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rNsANzFAg_Kn"
+ },
+ "source": [
+ "df_Nome= df['Nome'].str.split(' ', n = 2, expand = True) \n",
+ "df_Nome"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ianqsxLol008"
+ },
+ "source": [
+ "Change the value of n to 3 and explain how it works..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5NDAkEqCl6H5"
+ },
+ "source": [
+ "# Capture the title (Mr., Miss.) from the name:\n",
+ "df['tamanho_nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[0]\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B1QoH4LyrpVI"
+ },
+ "source": [
+ "## Build the COLUMNS primeiro_nome and ultimo_nome"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cbi4eRN2mOu9"
+ },
+ "source": [
+ "# Capture the first name and the remainder of the name:\n",
+ "df['primeiro_nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[1]\n",
+ "df['ultimo_nome'] = df['Nome'].str.split(' ', n = 2, expand = True)[2]\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7eagWhgZrwOh"
+ },
+ "source": [
+ "## Build the variable classe_idade\n",
+ "\n",
+ " | Lower Bound | Upper Bound | Class  |\n",
+ " |-------------|-------------|--------|\n",
+ " | Inf         | 15          | Inf_15 |\n",
+ " | 15          | 20          | 15_20  |\n",
+ " | 20          | 30          | 20_30  |\n",
+ " | 30          | 40          | 30_40  |\n",
+ " | 40          | 50          | 40_50  |\n",
+ " | 50          | Sup         | 50_Sup |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lBjRBGBWr2AH"
+ },
+ "source": [
+ "def classe_idade(Idade):\n",
+ "    if (Idade <= 15):\n",
+ "        return 'Inf_15'\n",
+ "    elif (15 < Idade <= 20):\n",
+ "        return '15_20'\n",
+ "    elif (20 < Idade <= 30):\n",
+ "        return '20_30'\n",
+ "    elif (30 < Idade <= 40):\n",
+ "        return '30_40'\n",
+ "    elif (40 < Idade <= 50):\n",
+ "        return '40_50'\n",
+ "    elif (Idade > 50):\n",
+ "        return '50_Sup'\n",
+ "    else:\n",
+ "        return 'Outros'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OogrvjCrsdoh"
+ },
+ "source": [
+ "df['classe_idade'] = df['Idade'].map(classe_idade)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JDtxz_eaRcmi"
+ },
+ "source": [
+ "___\n",
+ "# **Grouping Information: df.groupby()**\n",
+ "* Source: [Group By: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)\n",
+ "\n",
+ "* The components of a groupby() call\n",
+ " * **Grouping_Column** - Categorical column by which the data will be grouped;\n",
+ " * **Aggregating_Column** - Numeric column whose values will be aggregated;\n",
+ " * **Aggregating_Function** - Aggregation function, such as sum, min, max, mean, median, etc.\n",
+ "\n",
+ "> Syntax: \n",
+ "\n",
+ "```\n",
+ "df.groupby('Grouping_Column').agg({'Aggregating_Column': 'Aggregating_Function'})\n",
+ "\n",
+ "OR\n",
+ "\n",
+ "df['Aggregating_Column'].groupby(df['Grouping_Column']).Aggregating_Function()\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bmFf-273XPXj"
+ },
+ "source": [
+ "## Example 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wteEveUsd36C"
+ },
+ "source": [
+ "transformacao_lower(df_Titanic)\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "buF5DhkFfqVA"
+ },
+ "source": [
+ "# Grouping df_Titanic by 'sex' and 'pclass'\n",
+ "df_Titanic.groupby(['sex', 'pclass']).agg({'fare': ['min', 'median', 'mean','max'], 'age': ['count', 'mean','max']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YP3GDwq0gR_V"
+ },
+ "source": [
+ "# Grouping df_Titanic by 'sex3' and 'Pclass'\n",
+ "df_Titanic.groupby(['sex3','Pclass']).agg({'Fare': ['max', 'min']}).round(0)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "se4tQ3ETeUfv"
+ },
+ "source": [
+ "df_Titanic.groupby(['sex3']).agg({'Age': ['mean','min','max']}).round(0)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zUj82I7Cm220"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OrLZjm9bXTOr"
+ },
+ "source": [
+ "## Example 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "x8aPZPT6XZVP"
+ },
+ "source": [
+ "### Preparing the example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KrCe6RgOXaFx"
+ },
+ "source": [
+ "l_coluna = []\n",
+ "\n",
+ "for i in range(1,6):\n",
+ " np.random.seed(i)\n",
+ " l_coluna.append(np.random.randint(0, 10, 10))\n",
+ " \n",
+ "np.random.seed(6)\n",
+ "l_coluna.append(np.random.rand(10))\n",
+ "\n",
+ "l_coluna"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tXaHjmfSXeCw"
+ },
+ "source": [
+ "l_coluna[0]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "U_aEVMTHq6ee"
+ },
+ "source": [
+ "df = pd.DataFrame({'coluna6' : ['a', 'a', 'b', 'b', 'a', 'b', 'b', 'b', 'a', 'a'],\n",
+ " 'coluna7' : ['um', 'dois', 'um', 'dois', 'um', 'dois', 'dois', 'um', 'um', 'dois'],\n",
+ " 'coluna1' : l_coluna[0],\n",
+ " 'coluna2' : l_coluna[1],\n",
+ " 'coluna3' : l_coluna[2],\n",
+ " 'coluna4' : l_coluna[3],\n",
+ " 'coluna5' : l_coluna[4],\n",
+ " 'coluna8' : l_coluna[5],\n",
+ " 'Pessoas' : ['Jose','Maria','Pedro','Carlos','Joao','Ana','Manoel','Mafalda','Antonio','Ricardo'],\n",
+ " 'sexo' : ['m','f','m','m','m','f','m','f','m','m']})\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ok4a28lGlVC5"
+ },
+ "source": [
+ "Grouping by 'coluna6':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Vx77lyzlZIFW"
+ },
+ "source": [
+ "df.groupby('coluna6').agg({'coluna1': ['min','mean','median','max']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "T6i-R2KemadE"
+ },
+ "source": [
+ "Now let's repeat the process using two key COLUMNS, 'coluna6' and 'coluna7':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WxmHQnQSZrXA"
+ },
+ "source": [
+ "df_estatisticas_descritivas = df.groupby(['coluna6','coluna7']).agg({'coluna1': ['min','mean','median','max']})\n",
+ "df_estatisticas_descritivas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ipw5EROwaaCX"
+ },
+ "source": [
+ "Note that df_estatisticas_descritivas is a dataframe. Therefore, we can select ROWS and/or COLUMNS from it any way we like."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qk5uSdVwb7dH"
+ },
+ "source": [
+ "# Index of the dataframe:\n",
+ "df_estatisticas_descritivas.index"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "brIgUFlkalix"
+ },
+ "source": [
+ "# Selecting the contents for coluna6 = 'a' and coluna7 = 'um':\n",
+ "df_estatisticas_descritivas.loc[('a', 'um')]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fQUs2PVHc6iR"
+ },
+ "source": [
+ "# Selecting the contents for coluna6 = 'a' and coluna7 = 'um', first value:\n",
+ "df_estatisticas_descritivas.loc[('a', 'um')].iloc[0] # that is, we select min"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zT0xiee6dDpK"
+ },
+ "source": [
+ "# Selecting the contents for coluna6 = 'a' and coluna7 = 'um', second value:\n",
+ "df_estatisticas_descritivas.loc[('a', 'um')].iloc[1] # that is, we select mean"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vXlcjPM6dQKi"
+ },
+ "source": [
+ "And so on..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EMxFMqn9dm3g"
+ },
+ "source": [
+ "To learn more about working with two index levels in a dataframe, see [Hierarchical indices, groupby and pandas](https://www.datacamp.com/community/tutorials/pandas-multi-index)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gNHyH7M0pGDy"
+ },
+ "source": [
+ "___\n",
+ "## Example 3\n",
+ "### Group operations and transformations"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ywl3k_l8pGD0"
+ },
+ "source": [
+ "# Show the example dataframe:\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AF8cbNsjpGD5"
+ },
+ "source": [
+ "# Build the dataframe df_Medias (means of the numeric columns only)\n",
+ "df_Medias = df.groupby('coluna6').mean(numeric_only=True).add_prefix('mean_')\n",
+ "df_Medias"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JGlA6ufLpGD9"
+ },
+ "source": [
+ "# Merge the group means back into the dataframe df:\n",
+ "pd.merge(df, df_Medias, left_on ='coluna6', right_index=True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1MjZu3sVpGEd"
+ },
+ "source": [
+ "___\n",
+ "# **Discretizing numeric COLUMNS**\n",
+ "* pd.cut() - bins based on values (equal-width intervals when a bin count is given);\n",
+ "* pd.qcut() - bins based on sample quantiles, so each bin holds approximately the same number of items.\n",
+ "\n",
+ "> This technique is widely used in Machine Learning when we want to build classes for numeric variables (integer or float). Follow along below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yK772hiSfZaE"
+ },
+ "source": [
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wi-nv6fshKIX"
+ },
+ "source": [
+ "## pd.cut()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SVExQmzDpGEe"
+ },
+ "source": [
+ "# Build 4 classes for the float variable 'coluna8':\n",
+ "Bucket_cut = pd.cut(df['coluna8'], 4) # here we are building 4 equal-width buckets\n",
+ "Bucket_cut"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OOD38I6ug1AY"
+ },
+ "source": [
+ "# Which buckets did we build, and how many items fall in each:\n",
+ "Bucket_cut.value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9s2eaZGtfsxu"
+ },
+ "source": [
+ "As you can see, we did build 4 buckets. **Note that the classes do not hold the same number of items!**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "T7u0pS64hPHC"
+ },
+ "source": [
+ "## pd.qcut()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cJTQTHA6pGEm"
+ },
+ "source": [
+ "Bucket_qcut = pd.qcut(df['coluna8'], 4, labels=False)\n",
+ "Bucket_qcut"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vM30Td_8hZre"
+ },
+ "source": [
+ "# Which buckets did we build, and how many items fall in each:\n",
+ "Bucket_qcut.value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jhf6V5LTh4G7"
+ },
+ "source": [
+ "## Comments\n",
+ "* pd.qcut() produces a more uniform distribution of values across classes (ties in the data can still leave the counts slightly unequal). This means you are less likely to end up with one class holding most of the data and another holding very little.\n",
+ "* I prefer to use pd.qcut()."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RNsR0NsS5iIU"
+ },
+ "source": [
+ "___\n",
+ "# **Joint distribution - crosstabs**\n",
+ "> Suppose we want to analyze the number of survivors in relation to the COLUMN embarked."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LKQv6YtSfGSU"
+ },
+ "source": [
+ "df_Titanic2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ANhb5rBffTh6"
+ },
+ "source": [
+ "pd.crosstab(df_Titanic2['survived'], df_Titanic2['embarked'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WIlHAYEVqSjT"
+ },
+ "source": [
+ "___\n",
+ "# **Deleting COLUMNS from the dataframe**\n",
+ "> Delete the COLUMNS 'coluna2' and 'coluna5' from the dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YssOMF_Vqso5"
+ },
+ "source": [
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rVF_1p0Gq3gZ"
+ },
+ "source": [
+ "## Using inplace = True"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7BjRIX1jqWQT"
+ },
+ "source": [
+ "df.drop(['coluna2','coluna5'], axis =1, inplace =True)\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "POC2fnTlq8mK"
+ },
+ "source": [
+ "## Using assignment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YRSwEbnfq7s_"
+ },
+ "source": [
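+ "df = df.drop(['coluna2', 'coluna5'], axis=1, errors='ignore') # errors='ignore': the columns may already have been dropped by the inplace cell above\n",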
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bHth6KSv7k0G"
+ },
+ "source": [
+ "___\n",
+ "# **Creating dummy COLUMNS for categorical data**\n",
+ "> Our goal is to build dummy variables for our categorical COLUMNS.\n",
+ "\n",
+ "* Sources: \n",
+ " * [Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)\n",
+ " * [Creating Dummy Variables](https://www.ritchieng.com/pandas-creating-dummy-variables/)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GOqcARHqjMr_"
+ },
+ "source": [
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yNqvwEu9jbuW"
+ },
+ "source": [
+ "Let's build dummy variables for the COLUMNS 'coluna6' and 'coluna7', as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "16osZsMEjmDh"
+ },
+ "source": [
+ "pd.get_dummies(df['coluna6'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Cb1gp_Y1jxz2"
+ },
+ "source": [
+ "How should the result above be interpreted?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Cic19l-Mj39q"
+ },
+ "source": [
+ "pd.get_dummies(df['coluna7'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "44FDXcoyj-tT"
+ },
+ "source": [
+ "How should the result above be interpreted?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cxHc6BvDkCWl"
+ },
+ "source": [
+ "df = pd.get_dummies(df, columns =['coluna6', 'coluna7', 'sexo'])\n",
+ "df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "A2m25N4znZ2O"
+ },
+ "source": [
+ "df.columns"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "x0uXu0RRlB2a"
+ },
+ "source": [
+ "___\n",
+ "# **Computing correlation (Correlation Analysis)**\n",
+ "> Correlation can be computed with the df.corr() method. For more details on the existing types of correlation and when each one applies, see the links below:\n",
+ "\n",
+ "* [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)\n",
+ "* [Kendall rank correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient)\n",
+ "* [Spearman's rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).\n",
+ "\n",
+ "To learn more about generating heatmaps, see [Seaborn Heatmap Tutorial (Python Data Visualization)](https://likegeeks.com/seaborn-heatmap-tutorial/)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AgoigF8AnYG0"
+ },
+ "source": [
+ "## Generating the example dataframe:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6scRm8kNnbby"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n",
+ "\n",
+ "# Generate a dataframe with 15 columns: 9 informative and 6 redundant:\n",
+ "from sklearn.datasets import make_classification\n",
+ "X, y = make_classification(n_samples=1000, n_features=15, n_informative=9,\n",
+ "                           n_redundant=6, n_repeated=0, n_classes=2, n_clusters_per_class=1,\n",
+ "                           random_state=20111974)\n",
+ "\n",
+ "df_X = pd.DataFrame(X, columns= ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15'])\n",
+ "df_y = pd.DataFrame(y, columns= ['target'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NsuhsZCTmqEm"
+ },
+ "source": [
+ "# View the data\n",
+ "df_X.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D0JNMHqYoSMs"
+ },
+ "source": [
+ "# Show the correlation matrix (Pearson correlation), keeping only the upper triangle\n",
+ "set_Colunas_Correlacionadas = set()\n",
+ "matriz_correlacao = df_X.corr().where(np.triu(np.ones(df_X.corr().shape), k = 1).astype(bool))\n",
+ "matriz_correlacao"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Vnj6A8z6r7nM"
+ },
+ "source": [
+ "### Which columns are highly correlated?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "a_YUD-dOr_p-"
+ },
+ "source": [
+ "for i in range(len(matriz_correlacao.columns)):\n",
+ "    for j in range(i):\n",
+ "        if abs(matriz_correlacao.iloc[j, i]) > 0.8:  # upper triangle: row j < column i (the lower triangle is NaN)\n",
+ "            colnome = matriz_correlacao.columns[i]\n",
+ "            set_Colunas_Correlacionadas.add(colnome)\n",
+ "\n",
+ "set_Colunas_Correlacionadas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3-0Xe6GdozYT"
+ },
+ "source": [
+ "Next, a more visual view of the correlation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5-_Qadx1o1U9"
+ },
+ "source": [
+ "fig, ax = plt.subplots(figsize = (12, 12)) \n",
+ "mask = np.zeros_like(df_X.corr().abs())\n",
+ "mask[np.triu_indices_from(mask)] = 1\n",
+ "sns.heatmap(df_X.corr().abs(), mask= mask, ax= ax, cmap='coolwarm', annot= True, fmt= '.2f', center= 0)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5ZOp9ZGgtqFQ"
+ },
+ "source": [
+ "# **Scatterplot**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eReJJjG8tuKV"
+ },
+ "source": [
+ "## With regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tVmdSo6ztruA"
+ },
+ "source": [
+ "sns.pairplot(df_X, kind = \"reg\")\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xG9A6b32twv-"
+ },
+ "source": [
+ "## Without regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fyTOS3zVtz-O"
+ },
+ "source": [
+ "sns.pairplot(df_X, kind = \"scatter\")\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "f-1bpipc6bMh"
+ },
+ "source": [
+ "___\n",
+ "# **Saving a dataframe as csv**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "64CoM1aY6gf6"
+ },
+ "source": [
+ "df_X.to_csv('example.csv')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oy646p33DJV0"
+ },
+ "source": [
+ "# **Word dictionary**\n",
+ "> Widely used in NLP and Machine Learning.\n",
+ "* Use case: Insurers --> When a policyholder contacts the insurer to describe an accident (for example), an algorithm transcribes the audio into text for text mining."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DQR906rVD1V-"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sHvDaztJDPP7"
+ },
+ "source": [
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
+ "vectorizer = CountVectorizer() # separate name, so the CountVectorizer class is not shadowed when the cell is re-run\n",
+ "matriz_contagens = vectorizer.fit_transform(df_Titanic['name']) # Pass the text/string column we want to analyze\n",
+ "print(matriz_contagens)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jwT-56dED8VJ"
+ },
+ "source": [
+ "df_dicionario_palavras = pd.DataFrame(vectorizer.get_feature_names_out(), columns = ['palavra']) # get_feature_names_out() needs scikit-learn >= 1.0\n",
+ "df_dicionario_palavras[\"vezes_que_aparece\"] = matriz_contagens.sum(axis = 0).tolist()[0]\n",
+ "df_dicionario_palavras = df_dicionario_palavras.sort_values(\"vezes_que_aparece\", ascending = False) #.reset_index(drop = True) # sort_values orders the dataframe rows\n",
+ "df_dicionario_palavras.head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nx65RmEAGTvd"
+ },
+ "source": [
+ "# Challenge\n",
+ "> Turn the Python code from the **Word dictionary** section into a function for future use."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iwd1lhq9mrD3"
+ },
+ "source": [
+ "___\n",
+ "# **Exercises**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o_cl0kFgQfFh"
+ },
+ "source": [
+ "## Exercise 1\n",
+ "* From the dataframes USA_Abbrev, USA_Area and USA_Population, build the dataframe USA containing the COLUMNS state, abbreviation, area, ages, year, population.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "s8rQUo7yHKJ1"
+ },
+ "source": [
+ "* Note: The easiest way to read a CSV file (exported from Excel, for example) from GitHub is to click the csv file in your GitHub repository and then click 'raw'. Then copy the address shown in the browser and paste it into the variable 'url'. If in doubt, read the following document: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KTun1uSLuJ-A"
+ },
+ "source": [
+ "## Exercise 2\n",
+ "Source: https://github.com/aakankshaws/Pandas-exercises\n",
+ "\n",
+ "* Consider the dataframes below and merge the dataframe df_esquerdo with the dataframe df_direito:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Soq7GVZnuREq"
+ },
+ "source": [
+ "df_esquerdo = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n",
+ " 'A': ['A0', 'A1', 'A2', 'A3'],\n",
+ " 'B': ['B0', 'B1', 'B2', 'B3']})\n",
+ " \n",
+ "df_direito = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n",
+ " 'C': ['C0', 'C1', 'C2', 'C3'],\n",
+ " 'D': ['D0', 'D1', 'D2', 'D3']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6KEsTARfvM1C"
+ },
+ "source": [
+ "## Exercise 3\n",
+ "Source: https://github.com/aakankshaws/Pandas-exercises\n",
+ "\n",
+ "* Consider the dataframes below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hgxE5gZ9vMEg"
+ },
+ "source": [
+ "df_esquerdo = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],\n",
+ " 'key2': ['K0', 'K1', 'K0', 'K1'],\n",
+ " 'A': ['A0', 'A1', 'A2', 'A3'],\n",
+ " 'B': ['B0', 'B1', 'B2', 'B3']})\n",
+ " \n",
+ "df_direito = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],\n",
+ " 'key2': ['K0', 'K0', 'K0', 'K0'],\n",
+ " 'C': ['C0', 'C1', 'C2', 'C3'],\n",
+ " 'D': ['D0', 'D1', 'D2', 'D3']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iv7AmZ1ivm8R"
+ },
+ "source": [
+ "### Questions\n",
+ "* What is the output and the interpretation of each of the commands below?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TWAW_1tuvvSO"
+ },
+ "source": [
+ "pd.merge(df_esquerdo, df_direito, on = ['key1', 'key2'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QjM7pBONvzCJ"
+ },
+ "source": [
+ "pd.merge(df_esquerdo, df_direito, how = 'outer', on = ['key1', 'key2'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D1Rr3Ghsv2iS"
+ },
+ "source": [
+ "pd.merge(df_esquerdo, df_direito, how = 'right', on = ['key1', 'key2'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vXQwLjT-v3Iu"
+ },
+ "source": [
+ "pd.merge(df_esquerdo, df_direito, how = 'left', on = ['key1', 'key2'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EIdltTC-t_lF"
+ },
+ "source": [
+ "## Exercise 5\n",
+ "5.1. Identify and delete the attributes of the dataframe df_Titanic that can be dropped right at the start of the data analysis."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bMwPLgWclWBq"
+ },
+ "source": [
+ "___\n",
+ "## Exercise 6\n",
+ "* (a) Load the dataframe Titanic_With_MV.csv and inspect it for inconsistencies and Missing Values (NaN).\n",
+ "\n",
+ "### Feature Engineering\n",
+ "* (b) From the column 'cabin', build the columns:\n",
+ "  * deck - the letter in Cabin;\n",
+ "  * seat - the number in Cabin\n",
+ "* (c) Create the column 'sozinho_parch', where sozinho_parch= 1 means the passenger travels alone and 0 otherwise.\n",
+ "* (d) Create the column 'sozinho_sibsp', where sozinho_sibsp= 1 means the passenger travels alone and 0 otherwise.\n",
+ "* (e) Discretize the column 'fare' into 10 buckets.\n",
+ "* (f) Discretize the column 'age'.\n",
+ "* (g) Capture the titles 'Ms', 'Mr', etc. contained in the column 'Title';\n",
+ "* (h) What is the relationship between the variables and the target variable?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "V7KUGAX6lilP"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "df_Titanic = pd.read_csv('https://raw.githubusercontent.com/MathMachado/Python4DS/DS_Python/Dataframes/Titanic_With_MV.csv?token=AGDJQ63MNPPPROFNSO2BZW25XSR72', index_col= 'PassengerId')\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "m3UnAPJakCLR"
+ },
+ "source": [
+ "* Data dictionary for the Titanic dataframe:\n",
+ "  * PassengerID: Passenger ID;\n",
+ "  * survived: Indicator, where 1= Passenger survived and 0= Passenger died;\n",
+ "  * Pclass: Ticket class;\n",
+ "  * Age: Passenger's age;\n",
+ "  * SibSp: Number of siblings/spouses aboard;\n",
+ "  * Parch: Number of parents/children aboard;\n",
+ "  * Fare: Fare paid by the Passenger;\n",
+ "  * Cabin: Passenger's cabin;\n",
+ "  * Embarked: Port where the Passenger embarked.\n",
+ "  * Name: Passenger's name;\n",
+ "  * sex: Passenger's sex\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B_3s5cgxfNKQ"
+ },
+ "source": [
+ "## Answer to item (a)\n",
+ "### Column XPTO\n",
+ "\n",
+ "\n",
+ "### Column XPTO2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q3oLgyhdL6xd"
+ },
+ "source": [
+ "## Answer to item (b)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UbexhGtayV4X"
+ },
+ "source": [
+ "## Exercise 7\n",
+ "See the page [Pandas Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/pandas/index.php) for more exercises related to this topic."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Iia0ikd_KBtH"
+ },
+ "source": [
+ "## Exercise 8\n",
+ "Create the column 'aleatorio' in the dataframe df_Titanic, in which each row receives a random value generated with np.random.random()."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HPiLKUkWNYs3"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TUVTlE9WYW8C"
+ },
+ "source": [
+ "## Exercise 9\n",
+ "The file FIFA.csv contains data from the FIFA 2018 edition (one of the most famous video games) and includes a wide range of data about the players, for example: age, nationality, potential, wage and so on. Do the following:\n",
+ "\n",
+ "1. Load the file FIFA.csv (it is in the Dataframes area of the course);\n",
+ "2. Which columns can be dropped from the analysis up front? Why is it important to identify what can be dropped?\n",
+ "3. What is the dtype of each variable/attribute of the dataframe?\n",
+ "4. If a variable/attribute is of type string (object) but should be numeric, how do we change its type?\n",
+ "5. Normalize the column names, i.e., rename the columns to lowercase;\n",
+ "6. Are there Missing values in the data? If so, what is your proposal (the group's proposal) for handling these Missing values?\n",
+ "7. What is the distribution of the number of players by country? Present a table with the distribution.\n",
+ "8. What is the average age of the players by country (variable/attribute 'Nationality')?\n",
+ "9. What is the number of players by age?\n",
+ "10. How many players does each club have?\n",
+ "11. What is the average age by club?\n",
+ "12. What is the average wage by country?\n",
+ "13. What is the average wage by club?\n",
+ "14. What is the average wage by age?\n",
+ "15. How much does each club spend on wages?\n",
+ "16. What insights (what can you discover) do you find regarding the variable 'Potential' (measures the players' potential)?\n",
+ "17. What insights do you find regarding the variable overall (the athlete's average rating) by age, club and country?\n",
+ "18. Which are the best clubs if we take into account the variables Potential and Overall?\n",
+ "19. Present the ranking of goalkeepers (use the variable/attribute 'Preferred Positions') by Potential and Overall. We are looking for 'GK'.\n",
+ "20. Who are the fastest players (variable/attribute 'Sprint speed')?\n",
+ "21. Who are the 5 best players in terms of shooting (shot strength) (use the variable/attribute 'Shot power')?\n",
+ "22. Who are the outliers in terms of wage?\n",
+ "23. Who are the outliers in terms of shot power?\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ldWQd9j4NhPS"
+ },
+ "source": [
+ "# Load the libraries (Pandas, NumPy, Matplotlib)\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pynSV0viI8CA"
+ },
+ "source": [
+ "# Display configuration\n",
+ "d_configuracao = {\n",
+ " 'display.max_columns': 1000,\n",
+ " 'display.expand_frame_repr': True,\n",
+ " 'display.max_rows': 10,\n",
+ " 'display.precision': 2,\n",
+ " 'display.show_dimensions': True\n",
+ " }\n",
+ "\n",
+ "for op, value in d_configuracao.items():\n",
+ " pd.set_option(op, value)\n",
+ " print(op, value)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AswYS-ILI-qW"
+ },
+ "source": [
+ "url = 'https://raw.githubusercontent.com/Celso-Omoto/DSWP/master/Dataframes/FIFA.csv'\n",
+ "#df_Fifa2018 = pd.read_csv(url, index_col = 'PassengerId')\n",
+ "df_Fifa2018 = pd.read_csv(url)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "K7xLrlPuKsAW"
+ },
+ "source": [
+ "df_Fifa2018.head()\n",
+ "\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LTZGbOHiKxsW"
+ },
+ "source": [
+ "df_Fifa2018.set_index('ID', inplace = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1y9oN-IeU7Sb"
+ },
+ "source": [
+ "def transformacao_lower(df):\n",
+ "    # First transformation: apply lower() to the COLUMN names of the dataframe passed in:\n",
+ "    df.columns = [col.lower() for col in df.columns]\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uRS02MeCVVID"
+ },
+ "source": [
+ "transformacao_lower(df_Fifa2018)\n",
+ "df_Fifa2018.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sI_2oz3uMQFF"
+ },
+ "source": [
+ "#21 - Who are the 5 best players in terms of shooting (use the variable/attribute 'Shot power')?\n",
+ "df_Fifa2018.sort_values('shotpower', ascending=False).head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-9jCfEaSNJS1"
+ },
+ "source": [
+ "df_Fifa2018.groupby('overall').agg({'age':'mean', 'nationality': 'count'}).sort_values('overall').head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PMrTNLV1Oe5P"
+ },
+ "source": [
+ "df_Fifa2018.groupby('club').agg({'overall':'mean'}).sort_values('overall').head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dckaTDS9NaqG"
+ },
+ "source": [
+ "df_Fifa2018.groupby('club').agg({'potential':'mean','overall':'mean'}).head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OBuycRCzRbyG"
+ },
+ "source": [
+ "del df_Fifa2018['photo']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1crol52URcmt"
+ },
+ "source": [
+ "del df_Fifa2018['club logo']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "W0X7-q1CSNfM"
+ },
+ "source": [
+ "del df_Fifa2018['flag']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LMtzccaxSK29"
+ },
+ "source": [
+ "df_Fifa2018.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZPwu5sLnSyAc"
+ },
+ "source": [
+ "df_Fifa2018.dtypes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PBq3jr8nTUS0"
+ },
+ "source": [
+ "df_Fifa2018.select_dtypes(include=['object', 'string']).columns "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "82JaHKYATgdD"
+ },
+ "source": [
+ "df_Fifa2018[df_Fifa2018.select_dtypes(include=['object', 'string']).columns ]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NAyzCiTMW7YT"
+ },
+ "source": [
+ "df_Fifa2018.groupby('nationality').agg({'age':['count','mean']}).head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mkNA4xskc1b5"
+ },
+ "source": [
+ "df_Fifa2018.groupby('nationality').agg({'age':'mean','nationality':'count'}).sort_values('nationality').head(10)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eivUHr17ZiHs"
+ },
+ "source": [
+ "df_Fifa2018.sort_values('age', ascending=False).groupby('nationality').agg({'age':['count','mean']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FCs2gPK8ckS7"
+ },
+ "source": [
+ "\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From 7efafa0dfac3f2c1178e9309595129b897e4b742 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Wed, 14 Oct 2020 16:42:04 -0300
Subject: [PATCH 04/21] Criado usando o Colaboratory
---
...__DataViz_Matplotlib & Seaborn_aluno.ipynb | 865 ++++++++++++++++++
1 file changed, 865 insertions(+)
create mode 100644 Notebooks/NB11__DataViz_Matplotlib & Seaborn_aluno.ipynb
diff --git a/Notebooks/NB11__DataViz_Matplotlib & Seaborn_aluno.ipynb b/Notebooks/NB11__DataViz_Matplotlib & Seaborn_aluno.ipynb
new file mode 100644
index 000000000..8e2d66760
--- /dev/null
+++ b/Notebooks/NB11__DataViz_Matplotlib & Seaborn_aluno.ipynb
@@ -0,0 +1,865 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Untitled31.ipynb",
+ "provenance": [],
+ "private_outputs": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oRokSLxEMgDN"
+ },
+ "source": [
+ "## Reference\n",
+ "* [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IFiAWdKnFS5A"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "\n",
+ "plt.rcParams[\"figure.figsize\"] = [15, 12]\n",
+ "%matplotlib inline"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UfrAHnWpJTwD"
+ },
+ "source": [
+ "## Simple time series"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_PV_kTGRMq4B"
+ },
+ "source": [
+ "#### Simulated series/data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_yVTB9v0KQxp"
+ },
+ "source": [
+ "from datetime import datetime\n",
+ "\n",
+ "dt_hoje = datetime.strptime('2020-10-14', '%Y-%m-%d')\n",
+ "dt_inicio = datetime.strptime('2020-01-01', '%Y-%m-%d')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gMQx3JSlJz0R"
+ },
+ "source": [
+ "# How many days since the start date?\n",
+ "i_quantidade_dias = abs((dt_hoje - dt_inicio).days)\n",
+ "i_quantidade_dias"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Tb70ycS_JWvQ"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "\n",
+ "i_qtd_ativos = 4\n",
+ "df_series_temporais = pd.DataFrame(np.random.randn(i_quantidade_dias, i_qtd_ativos), index = pd.date_range(dt_inicio, periods = i_quantidade_dias)) #, columns = list('ABCD'))\n",
+ "df_series_temporais.columns = ['Ativo1', 'Ativo2', 'Ativo3', 'Ativo4']\n",
+ "\n",
+ "#serie_temporal = pd.Series(np.random.randn(i_quantidade_dias), index = pd.date_range(dt_inicio, periods = i_quantidade_dias))\n",
+ "df_series_temporais.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hPq0XtirNMhm"
+ },
+ "source": [
+ "## Time series plot"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kEu3wDl9L92i"
+ },
+ "source": [
+ "df_series_temporais2 = df_series_temporais.cumsum()\n",
+ "plt.figure()\n",
+ "df_series_temporais2.plot()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oEQQHUG8KtAv"
+ },
+ "source": [
+ "Plot of a single time series"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xqNCkZdIKh3L"
+ },
+ "source": [
+ "df_series_temporais3 = df_series_temporais['Ativo1']\n",
+ "plt.figure()\n",
+ "df_series_temporais3.plot()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m5rMpulVKrSe"
+ },
+ "source": [
+ "df_series_temporais3 = df_series_temporais['Ativo1'].cumsum()\n",
+ "plt.figure()\n",
+ "df_series_temporais3.plot()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Wa4sXjcMNkzS"
+ },
+ "source": [
+ "Try kind = {'line', 'box', 'hist', 'kde'}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8eAETzNARsxo"
+ },
+ "source": [
+ "### If we want to compare horizontally\n",
+ "* In the case below, I am comparing the columns 'Ativo1', 'Ativo2', 'Ativo3' and 'Ativo4' by the contents of the row at position 3 --> iloc[3]."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6a0FB-SPReD9"
+ },
+ "source": [
+ "plt.figure()\n",
+ "df_series_temporais2.iloc[3].plot(kind ='bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ov7fKVO3So9v"
+ },
+ "source": [
+ "df_series_temporais2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rM32swu3S046"
+ },
+ "source": [
+ "df_series_temporais2.iloc[3]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qJ8SBoT6SSu0"
+ },
+ "source": [
+ "### Comparing groups\n",
+ "* In this case, I will select (zoom in on) only a few days of the dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kKeby_vwTB5j"
+ },
+ "source": [
+ "df_series_temporais2_zoom = df_series_temporais2[0:10]\n",
+ "df_series_temporais2_zoom"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I_XBwdn_Sa8h"
+ },
+ "source": [
+ "df_series_temporais2_zoom.plot(kind = 'bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Zru6GuoYTuzd"
+ },
+ "source": [
+ "#### Another way to view the same result:\n",
+ "* stacked bar plot --> just pass the parameter stacked = True"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lHY7A1RLTzaT"
+ },
+ "source": [
+ "df_series_temporais2_zoom.plot(kind = 'bar', stacked = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UWP6hLn8US1M"
+ },
+ "source": [
+ "### To display the chart horizontally..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7dtzx-vOUWNG"
+ },
+ "source": [
+ "df_series_temporais2_zoom.plot(kind = 'barh', stacked = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IqAIybxdUbOH"
+ },
+ "source": [
+ "df_series_temporais2_zoom.plot(kind = 'barh', stacked = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Z22k7IOyU6la"
+ },
+ "source": [
+ "### Histograms"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LKLWYWYeU8UV"
+ },
+ "source": [
+ "df_series_temporais2.plot(kind = 'hist', bins = 100) # What are bins?"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dG4zhQExVbY1"
+ },
+ "source": [
+ "#### Individual histogram"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZNGWjh9LVdb7"
+ },
+ "source": [
+ "plt.figure()\n",
+ "df_series_temporais2['Ativo3'].diff().hist() # See below for more on the diff(axis, periods) method"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3LQlM_qjWd7g"
+ },
+ "source": [
+ "df_series_temporais2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "x3N6q_fTWl60"
+ },
+ "source": [
+ "df_series_temporais2.diff(axis = 0, periods= 1).head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LGknpyFaWqcZ"
+ },
+ "source": [
+ "df_series_temporais2.iloc[1, 0] - df_series_temporais2.iloc[0, 0]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Yq6TtAU2XAHL"
+ },
+ "source": [
+ "#### diff(axis = 1, periods = 1) takes the difference across columns! See below:\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6QRBLyBQXKq8"
+ },
+ "source": [
+ "df_series_temporais2.diff(axis = 1, periods = 1).head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "niDjEkSpYgAj"
+ },
+ "source": [
+ "### Histograms in multiple subplots"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4ie8toFUYlF-"
+ },
+ "source": [
+ "plt.figure()\n",
+ "df_series_temporais2.diff(axis = 0, periods = 1).hist(color ='k', alpha = 0.5, bins = 50)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "r7W97FztGTMl"
+ },
+ "source": [
+ "## Boxplot"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Q-19pTLZZKVj"
+ },
+ "source": [
+ "plt.figure()\n",
+ "boxplot = df_series_temporais2.boxplot(vert = True) # Note the parameter vert = True"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aQ2qQetiGU8f"
+ },
+ "source": [
+ "plt.figure()\n",
+ "boxplot = df_series_temporais2.boxplot(vert = False) # Note the parameter vert = False"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Wo6AFzOPMvMf"
+ },
+ "source": [
+ "#### Wine quality data - White vs Red\n",
+ "\n",
+ "* The goal is to assess wine quality (red vs. white) on a scale of 0–100. The ratings map to scores as follows:\n",
+ "\n",
+ "* 95–100 Classic: a great wine\n",
+ "* 90–94 Outstanding: a wine of superior character and style\n",
+ "* 85–89 Very good: a wine with special qualities\n",
+ "* 80–84 Good: a solid, well-made wine\n",
+ "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n",
+ "* 50–74 Not recommended"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aO9K8R9Qa9Uj"
+ },
+ "source": [
+ "url_tinto = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Wine_red.csv?token=AGDJQ64FIW7QA6DNJTVT6JC7SACV6'\n",
+ "url_branco = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Wine_white.csv?token=AGDJQ67RPQDN45RZYZHV5SK7SACXY'\n",
+ "df_vinho_tinto = pd.read_csv(url_tinto)\n",
+ "df_vinho_tinto[\"color\"] = 1 # --> Red wine\n",
+ "\n",
+ "df_vinho_branco = pd.read_csv(url_branco)\n",
+ "df_vinho_branco[\"color\"] = 0 # --> White wine"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "owdOjksbg7Dc"
+ },
+ "source": [
+ "df_vinhos = pd.concat([df_vinho_tinto, df_vinho_branco], axis = 0)\n",
+ "df_vinhos.shape"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zYniNn5PfGx9"
+ },
+ "source": [
+ "df_vinho_tinto.columns"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KL7iW5mtgCre"
+ },
+ "source": [
+ "df_vinhos['quality'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "G_yOZ-Gqmscv"
+ },
+ "source": [
+ "df_vinhos['color'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IKTEbTW2jMVv"
+ },
+ "source": [
+ "#### Cleaning up the column names"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1Oo-6k2jh3bx"
+ },
+ "source": [
+ "df_vinhos.columns = [col.lower() for col in df_vinhos.columns]\n",
+ "\n",
+ "# replace ' ' with '_' in the column names:\n",
+ "df_vinhos.columns = [col.replace(' ', '_') for col in df_vinhos.columns]\n",
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eiMHK6aJjoZl"
+ },
+ "source": [
+ "df_vinhos.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sUNEzoC7j0PV"
+ },
+ "source": [
+ "print(f\"White wine mean: {df_vinho_branco['quality'].mean()}\")\n",
+ "print(f\"Red wine mean..: {df_vinho_tinto['quality'].mean()}\")\n",
+ "print(f\"Overall mean...: {df_vinhos['quality'].mean()}\")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tIBDUBI4n78b"
+ },
+ "source": [
+ "Below, the same calculation, this time filtering the dataframe to select the wine type we want:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "X1Nllwpxl228"
+ },
+ "source": [
+ "print(f\"White wine mean: {df_vinhos[df_vinhos['color'] == 0]['quality'].mean()}\")\n",
+ "print(f\"Red wine mean..: {df_vinhos[df_vinhos['color'] == 1]['quality'].mean()}\")\n",
+ "print(f\"Overall mean...: {df_vinhos['quality'].mean()}\")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GHjfSmExmg0u"
+ },
+ "source": [
+ "df_vinhos.columns"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "J3ZsHlrWmLDQ"
+ },
+ "source": [
+ "df_vinhos[df_vinhos['color'] == 1]['quality']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7HjKZ6Z1bkct"
+ },
+ "source": [
+ "Next, something more elaborate, with a chart title, annotations, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jB9BTwBOa7UA"
+ },
+ "source": [
+ "fig, ax = plt.subplots(figsize=(10, 6))\n",
+ "df_vinhos['quality'].value_counts().plot(kind = 'bar')\n",
+ "\n",
+ "# Title and X/Y axis labels\n",
+ "plt.title('Wine quality ratings', fontsize = 25)\n",
+ "plt.xlabel('Attribute: quality', fontsize = 10)\n",
+ "plt.ylabel('Count', fontsize = 10)\n",
+ "\n",
+ "# Add a grid to the chart\n",
+ "ax.grid(True)\n",
+ "\n",
+ "# Configure the legend\n",
+ "plt.legend()\n",
+ "\n",
+ "# Set the Y-axis limits\n",
+ "plt.ylim(0, 3000)\n",
+ "\n",
+ "# Set the X-axis limits\n",
+ "#plt.xlim(0, 3000)\n",
+ " \n",
+ "# Show graphic\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "w1CyCXVkmrFV"
+ },
+ "source": [
+ "df_vinhos['color'].value_counts().plot(kind = 'bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jU1AY-_wpU2h"
+ },
+ "source": [
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "e0ayzbRanNDq"
+ },
+ "source": [
+ "df_vinhos['fixed_acidity'].value_counts().sort_index().plot(kind = 'area')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RjzkMuPTn0yI"
+ },
+ "source": [
+ "# generate several plots\n",
+ "l_colunas = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'ph', 'sulphates', 'alcohol']\n",
+ "for caracteristica in l_colunas:\n",
+ "    plt.figure() # Remove this line and see what happens\n",
+ "    df_vinhos[caracteristica].value_counts().sort_index().plot(kind = 'area')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PYIjyMkVnWnr"
+ },
+ "source": [
+ "### Correlations"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IOCk4vhpnYn9"
+ },
+ "source": [
+ "correlacoes = df_vinhos.corr()\n",
+ "\n",
+ "top_correlacoes_cols = correlacoes.color.sort_values(ascending=False).keys()\n",
+ "top_correlacoes = correlacoes.loc[top_correlacoes_cols, top_correlacoes_cols]\n",
+ "dropSelf = np.zeros_like(top_correlacoes, dtype = bool)\n",
+ "dropSelf[np.triu_indices_from(dropSelf)] = True\n",
+ "plt.figure(figsize=(18, 10))\n",
+ "sns.heatmap(top_correlacoes, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=\".2f\", mask=dropSelf)\n",
+ "sns.set(font_scale=1.5)\n",
+ "plt.show()\n",
+ "del correlacoes, dropSelf, top_correlacoes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SFqklDJf-8le"
+ },
+ "source": [
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "H7hKbxfdBV8w"
+ },
+ "source": [
+ "### Assessing bivariate behavior, or the relationship between the target variable and a predictor variable"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LllKqLx3_IIG"
+ },
+ "source": [
+ "sns.jointplot(x = df_vinhos['alcohol'], y = df_vinhos['fixed_acidity'], kind = \"kde\")\n",
+ "plt.savefig('minha_figura.png')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4ixcDmeXIFQ1"
+ },
+ "source": [
+ "### Pairplot\n",
+ "* Check pairwise relationships in the dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lWqwaZ_lArji"
+ },
+ "source": [
+ "#sns.pairplot(df_vinhos, hue = \"color\") # Compare the plots with and without the hue option\n",
+ "sns.pairplot(df_vinhos)\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vtOH-mTHLGC-"
+ },
+ "source": [
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dcaQ8aPaHwBB"
+ },
+ "source": [
+ "sns.lmplot(x = \"alcohol\", y = \"sulphates\", data = df_vinhos, hue = \"color\", fit_reg = False)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pWsCs585LPyn"
+ },
+ "source": [
+ "sns.lmplot(x = \"alcohol\", y = \"sulphates\", data = df_vinhos, hue = \"quality\", fit_reg = False)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5RwOiYi3OfD5"
+ },
+ "source": [
+ "### Boxplot"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZqIP5xUOMAqL"
+ },
+ "source": [
+ "df_vinhos.boxplot(column = 'alcohol', by = 'quality', figsize = (12, 8))\n",
+ "plt.xlabel('Quality', fontsize = 10, color= 'blue')\n",
+ "plt.ylabel('alcohol', fontsize = 10, color= 'blue')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From ca6697322dbfac2596b1142734ca3aa3ea3f6cd4 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Thu, 15 Oct 2020 16:15:57 -0300
Subject: [PATCH 05/21] Created using Colaboratory
---
...1__DataViz_Matplotlib & Seaborn_Fifa.ipynb | 1360 +++++++++++++++++
1 file changed, 1360 insertions(+)
create mode 100644 Notebooks/NB11__DataViz_Matplotlib & Seaborn_Fifa.ipynb
diff --git a/Notebooks/NB11__DataViz_Matplotlib & Seaborn_Fifa.ipynb b/Notebooks/NB11__DataViz_Matplotlib & Seaborn_Fifa.ipynb
new file mode 100644
index 000000000..afae826b7
--- /dev/null
+++ b/Notebooks/NB11__DataViz_Matplotlib & Seaborn_Fifa.ipynb
@@ -0,0 +1,1360 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Untitled31.ipynb",
+ "provenance": [],
+ "private_outputs": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oRokSLxEMgDN"
+ },
+ "source": [
+ "## References\n",
+ "* [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)\n",
+ "* [Python Graph Gallery](https://python-graph-gallery.com/all-charts/)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IFiAWdKnFS5A"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "import bokeh # Required library ***\n",
+ "\n",
+ "plt.rcParams[\"figure.figsize\"] = [15, 12]\n",
+ "%matplotlib inline"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UfrAHnWpJTwD"
+ },
+ "source": [
+ "## Simple time series"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_PV_kTGRMq4B"
+ },
+ "source": [
+ "#### Simulated series/data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_yVTB9v0KQxp"
+ },
+ "source": [
+ "from datetime import datetime\n",
+ "\n",
+ "dt_hoje = datetime.strptime('2020-10-14', '%Y-%m-%d')\n",
+ "dt_inicio = datetime.strptime('2020-01-01', '%Y-%m-%d')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gMQx3JSlJz0R"
+ },
+ "source": [
+ "# How many days since the start date?\n",
+ "i_quantidade_dias = abs((dt_hoje - dt_inicio).days)\n",
+ "i_quantidade_dias"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Tb70ycS_JWvQ"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "\n",
+ "i_qtd_ativos = 4\n",
+ "df_series_temporais = pd.DataFrame(np.random.randn(i_quantidade_dias, i_qtd_ativos), index = pd.date_range(dt_inicio, periods = i_quantidade_dias)) #, columns = list('ABCD'))\n",
+ "df_series_temporais.columns = ['Ativo1', 'Ativo2', 'Ativo3', 'Ativo4']\n",
+ "\n",
+ "#serie_temporal = pd.Series(np.random.randn(i_quantidade_dias), index = pd.date_range(dt_inicio, periods = i_quantidade_dias))\n",
+ "df_series_temporais.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hPq0XtirNMhm"
+ },
+ "source": [
+ "## Time series plot"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kEu3wDl9L92i"
+ },
+ "source": [
+ "df_series_temporais2 = df_series_temporais.cumsum()\n",
+ "plt.figure()\n",
+ "df_series_temporais2.plot()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oEQQHUG8KtAv"
+ },
+ "source": [
+ "Plot of a single time series"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xqNCkZdIKh3L"
+ },
+ "source": [
+ "df_series_temporais3 = df_series_temporais['Ativo1']\n",
+ "plt.figure()\n",
+ "df_series_temporais3.plot()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m5rMpulVKrSe"
+ },
+ "source": [
+ "df_series_temporais3 = df_series_temporais['Ativo1'].cumsum()\n",
+ "plt.figure()\n",
+ "df_series_temporais3.plot(kind = 'line')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Wa4sXjcMNkzS"
+ },
+ "source": [
+ "Try kind = {'line', 'box', 'hist', 'kde'}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8eAETzNARsxo"
+ },
+ "source": [
+ "### If we want to compare horizontally\n",
+ "* In the example below, I compare the columns 'Ativo1', 'Ativo2', 'Ativo3' and 'Ativo4' using the values in the row at position 3 --> iloc[3]."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "APnKHRMSbYMO"
+ },
+ "source": [
+ "df_series_temporais2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6a0FB-SPReD9"
+ },
+ "source": [
+ "plt.figure()\n",
+ "df_series_temporais2.iloc[3].plot(kind = 'bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qJ8SBoT6SSu0"
+ },
+ "source": [
+ "### Comparing groups\n",
+ "* Here I select (zoom in on) only a few days of the dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kKeby_vwTB5j"
+ },
+ "source": [
+ "df_series_temporais2_zoom = df_series_temporais2[0:10]\n",
+ "df_series_temporais2_zoom"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I_XBwdn_Sa8h"
+ },
+ "source": [
+ "df_series_temporais2_zoom.plot(kind = 'bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Zru6GuoYTuzd"
+ },
+ "source": [
+ "#### Another way to view the same result:\n",
+ "* stacked bar plot --> just pass the parameter stacked = True"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lHY7A1RLTzaT"
+ },
+ "source": [
+ "df_series_temporais2_zoom.plot(kind = 'bar', stacked = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UWP6hLn8US1M"
+ },
+ "source": [
+ "### To display the chart horizontally..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7dtzx-vOUWNG"
+ },
+ "source": [
+ "df_series_temporais2_zoom.plot(kind = 'barh', stacked = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Z22k7IOyU6la"
+ },
+ "source": [
+ "### Histograms"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LKLWYWYeU8UV"
+ },
+ "source": [
+ "df_series_temporais2.plot(kind = 'hist', bins = 10) # What are bins?"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MjLO8BqUeQvP"
+ },
+ "source": [
+ "#### What are bins?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dG4zhQExVbY1"
+ },
+ "source": [
+ "#### Individual histogram"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZNGWjh9LVdb7"
+ },
+ "source": [
+ "plt.figure()\n",
+ "df_series_temporais2['Ativo3'].diff().hist() # See below for more on the diff(axis, periods) method"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3LQlM_qjWd7g"
+ },
+ "source": [
+ "df_series_temporais2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "x3N6q_fTWl60"
+ },
+ "source": [
+ "df_series_temporais2.diff(axis = 0, periods = 1).head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LGknpyFaWqcZ"
+ },
+ "source": [
+ "df_series_temporais2.iloc[1, 0] - df_series_temporais2.iloc[0, 0]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TdjsYr4Wer73"
+ },
+ "source": [
+ "df_series_temporais2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Yq6TtAU2XAHL"
+ },
+ "source": [
+ "#### diff(axis = 1, periods = 1) takes the difference across columns! See below:\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6QRBLyBQXKq8"
+ },
+ "source": [
+ "df_series_temporais2.diff(axis = 1, periods = 1).head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "niDjEkSpYgAj"
+ },
+ "source": [
+ "### Histograms in multiple subplots"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4ie8toFUYlF-"
+ },
+ "source": [
+ "plt.figure()\n",
+ "df_series_temporais2.diff(axis = 0, periods = 1).hist(color ='k', alpha = 0.5, bins = 50)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "r7W97FztGTMl"
+ },
+ "source": [
+ "## Boxplot"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Q-19pTLZZKVj"
+ },
+ "source": [
+ "plt.figure()\n",
+ "boxplot = df_series_temporais2.boxplot(vert = True) # Note the parameter vert = True"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aQ2qQetiGU8f"
+ },
+ "source": [
+ "plt.figure()\n",
+ "boxplot = df_series_temporais2.boxplot(vert = False) # Note the parameter vert = False"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Wo6AFzOPMvMf"
+ },
+ "source": [
+ "#### Wine quality data - White vs Red\n",
+ "\n",
+ "* The goal is to assess wine quality (red vs. white) on a scale of 0–100. The ratings map to scores as follows:\n",
+ "\n",
+ "* 95–100 Classic: a great wine\n",
+ "* 90–94 Outstanding: a wine of superior character and style\n",
+ "* 85–89 Very good: a wine with special qualities\n",
+ "* 80–84 Good: a solid, well-made wine\n",
+ "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n",
+ "* 50–74 Not recommended"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aO9K8R9Qa9Uj"
+ },
+ "source": [
+ "url_tinto = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Wine_red.csv?token=AGDJQ64FIW7QA6DNJTVT6JC7SACV6'\n",
+ "url_branco = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Wine_white.csv?token=AGDJQ67RPQDN45RZYZHV5SK7SACXY'\n",
+ "df_vinho_tinto = pd.read_csv(url_tinto)\n",
+ "df_vinho_tinto[\"color\"] = 1 # --> Red wine\n",
+ "\n",
+ "df_vinho_branco = pd.read_csv(url_branco)\n",
+ "df_vinho_branco[\"color\"] = 0 # --> White wine"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "owdOjksbg7Dc"
+ },
+ "source": [
+ "# Stack the dataframes df_vinho_tinto and df_vinho_branco:\n",
+ "df_vinhos = pd.concat([df_vinho_tinto, df_vinho_branco], axis = 0)\n",
+ "df_vinhos.shape"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zYniNn5PfGx9"
+ },
+ "source": [
+ "df_vinho_tinto.columns"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KL7iW5mtgCre"
+ },
+ "source": [
+ "df_vinhos['quality'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "G_yOZ-Gqmscv"
+ },
+ "source": [
+ "df_vinhos['color'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IKTEbTW2jMVv"
+ },
+ "source": [
+ "#### Cleaning up the column names"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JeXjuKNIm39F"
+ },
+ "source": [
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1Oo-6k2jh3bx"
+ },
+ "source": [
+ "df_vinhos.columns = [col.lower() for col in df_vinhos.columns]\n",
+ "\n",
+ "# replace ' ' with '_' in the column names:\n",
+ "df_vinhos.columns = [col.replace(' ', '_') for col in df_vinhos.columns]\n",
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eiMHK6aJjoZl"
+ },
+ "source": [
+ "df_vinhos.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sUNEzoC7j0PV"
+ },
+ "source": [
+ "print(f\"White wine mean: {df_vinho_branco['quality'].mean()}\")\n",
+ "print(f\"Red wine mean..: {df_vinho_tinto['quality'].mean()}\")\n",
+ "print(f\"Overall mean...: {df_vinhos['quality'].mean()}\")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tIBDUBI4n78b"
+ },
+ "source": [
+ "Below, the same calculation, this time filtering the dataframe to select the wine type we want:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "X1Nllwpxl228"
+ },
+ "source": [
+ "print(f\"White wine mean: {df_vinhos[df_vinhos['color'] == 0]['quality'].mean()}\")\n",
+ "print(f\"Red wine mean..: {df_vinhos[df_vinhos['color'] == 1]['quality'].mean()}\")\n",
+ "print(f\"Overall mean...: {df_vinhos['quality'].mean()}\")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GHjfSmExmg0u"
+ },
+ "source": [
+ "df_vinhos.columns"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "J3ZsHlrWmLDQ"
+ },
+ "source": [
+ "df_vinhos[df_vinhos['color'] == 1]['quality']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "a-4XRBelnKCW"
+ },
+ "source": [
+ "fig, ax = plt.subplots(figsize=(10, 6))\n",
+ "df_vinhos['quality'].value_counts().plot(kind = 'bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7HjKZ6Z1bkct"
+ },
+ "source": [
+ "Next, something more elaborate, with a chart title, annotations, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jB9BTwBOa7UA"
+ },
+ "source": [
+ "fig, ax = plt.subplots(figsize = (10, 6))\n",
+ "df_vinhos['quality'].value_counts().plot(kind = 'bar')\n",
+ "\n",
+ "# Title and X/Y axis labels\n",
+ "plt.title('Wine quality ratings', fontsize = 25)\n",
+ "plt.xlabel('Attribute: quality', fontsize = 10)\n",
+ "plt.ylabel('Count', fontsize = 10)\n",
+ "\n",
+ "# Add a grid to the chart\n",
+ "ax.grid(True)\n",
+ "\n",
+ "# Configure the legend\n",
+ "#plt.legend()\n",
+ "\n",
+ "# Set the Y-axis limits\n",
+ "#plt.ylim(0, 5000)\n",
+ "\n",
+ "# Set the X-axis limits\n",
+ "#plt.xlim(0, 3000)\n",
+ " \n",
+ "# Show graphic\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "w1CyCXVkmrFV"
+ },
+ "source": [
+ "df_vinhos['color'].value_counts().plot(kind = 'bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jU1AY-_wpU2h"
+ },
+ "source": [
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "e0ayzbRanNDq"
+ },
+ "source": [
+ "df_vinhos['fixed_acidity'].value_counts().sort_index().plot(kind = 'area')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eSxvaczjoll-"
+ },
+ "source": [
+ "### Challenge: make the chart below more informative\n",
+ "* For example, show which variable is being analyzed, label the X and Y axes, add titles, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RjzkMuPTn0yI"
+ },
+ "source": [
+ "l_colunas = df_vinhos.columns # automate: loop over every column\n",
+ "for caracteristica in l_colunas:\n",
+ "    plt.figure() # Remove this line and see what happens\n",
+ "    df_vinhos[caracteristica].value_counts().sort_index().plot(kind = 'area')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PYIjyMkVnWnr"
+ },
+ "source": [
+ "### Correlations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gn7xXclM7ewN"
+ },
+ "source": [
+ "### Introduction\n",
+ "The code below generates sample data so that we can assess the correlations between variables."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4un3dsyZ7fFU"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "i_simulacoes = 5000\n",
+ "\n",
+ "# Set the seed --> reproducibility\n",
+ "np.random.seed(19741120)\n",
+ "\n",
+ "# Array of sample means:\n",
+ "a_media = np.array([0.0, 5.0, 10.0])\n",
+ "\n",
+ "# Array holding the covariance matrix:\n",
+ "a_covariancia = np.array([\n",
+ " [ 3.40, -2.75, -2.00],\n",
+ " [ -2.75, 5.50, 1.50],\n",
+ " [ -2.00, 1.50, 1.25]\n",
+ " ])\n",
+ "\n",
+ "# Generate the random samples using a_media and a_covariancia.\n",
+ "a_amostras = np.random.multivariate_normal(a_media, a_covariancia, size = i_simulacoes)\n",
+ "a_amostras"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "akHw3Mym_FgQ"
+ },
+ "source": [
+ "Next, a plot showing the correlation between a_amostras[:, 0] and a_amostras[:, 1]:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iTLIn1uwJoVi"
+ },
+ "source": [
+ "plt.figure(figsize= (12, 8))\n",
+ "ax = sns.regplot(x = a_amostras[:,0], y = a_amostras[:,1], color = 'g')\n",
+ "plt.xlabel('a_amostras[0]')\n",
+ "plt.ylabel('a_amostras[1]')\n",
+ "#plt.axis('equal')\n",
+ "plt.grid(True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JcermWt-Ar5c"
+ },
+ "source": [
+ "np.corrcoef(a_amostras[:, 0], a_amostras[:, 1])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ryLXMQ66_fce"
+ },
+ "source": [
+ "Plot of the correlation between a_amostras[:, 0] and a_amostras[:, 2]:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8Xp69Xgg9iRV"
+ },
+ "source": [
+ "plt.figure(figsize= (12, 8))\n",
+ "ax = sns.regplot(x = a_amostras[:,0], y = a_amostras[:,2], color = 'g')\n",
+ "plt.xlabel('a_amostras[0]')\n",
+ "plt.ylabel('a_amostras[2]')\n",
+ "#plt.axis('equal')\n",
+ "plt.grid(True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Gw6OpxFBA5Sp"
+ },
+ "source": [
+ "np.corrcoef(a_amostras[:, 0], a_amostras[:, 2])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GmnKTqxQ_uZ9"
+ },
+ "source": [
+ "And finally, a plot of the correlation between a_amostras[:, 1] and a_amostras[:, 2]:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yjWoFPhR_t3I"
+ },
+ "source": [
+ "plt.figure(figsize= (12, 8))\n",
+ "ax = sns.regplot(x = a_amostras[:, 1], y = a_amostras[:, 2], color = 'g')\n",
+ "plt.xlabel('a_amostras[:, 1]')\n",
+ "plt.ylabel('a_amostras[:, 2]')\n",
+ "#plt.axis('equal')\n",
+ "plt.grid(True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xnJkxZ25C7kX"
+ },
+ "source": [
+ "np.corrcoef(a_amostras[:, 1], a_amostras[:, 2])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qEttRQwgDGq_"
+ },
+ "source": [
+ "Next, a pairplot to inspect all columns at once:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mkAJivoPC_OM"
+ },
+ "source": [
+ "sns.pairplot(pd.DataFrame(a_amostras))\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6FVQwuNP8w6s"
+ },
+ "source": [
+ "### Analysis of the df_vinhos dataframe"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "N-Aa8wnh6rky"
+ },
+ "source": [
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZhtIILrs6vUT"
+ },
+ "source": [
+ "### Correlation between a pair of variables X and Y"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lJh2Comx6a_k"
+ },
+ "source": [
+ "np.corrcoef(df_vinhos['fixed_acidity'], df_vinhos['alcohol'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ifZybEAE68V9"
+ },
+ "source": [
+ "### Correlations across the dataframe"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IOCk4vhpnYn9"
+ },
+ "source": [
+ "correlacoes = df_vinhos.corr()\n",
+ "\n",
+ "top_correlacoes_cols = correlacoes.color.sort_values(ascending = False).keys()\n",
+ "top_correlacoes = correlacoes.loc[top_correlacoes_cols, top_correlacoes_cols]\n",
+ "dropSelf = np.zeros_like(top_correlacoes)\n",
+ "dropSelf[np.triu_indices_from(dropSelf)] = True\n",
+ "plt.figure(figsize = (15, 9))\n",
+ "sns.heatmap(top_correlacoes, cmap=sns.diverging_palette(220, 10, as_cmap = True), annot = True, fmt = \".2f\", mask = dropSelf)\n",
+ "sns.set(font_scale=1.5)\n",
+ "plt.show()\n",
+ "del correlacoes, dropSelf, top_correlacoes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SFqklDJf-8le"
+ },
+ "source": [
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "H7hKbxfdBV8w"
+ },
+ "source": [
+ "### Assessing bivariate behavior\n",
+ "* 2D Density Plot\n",
+ " * Useful for assessing the relationship between 2 numeric variables. The central plot shows how the two variables relate, while the marginal plots show the distribution of each variable using histograms or density plots."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LllKqLx3_IIG"
+ },
+ "source": [
+ "sns.jointplot(x = df_vinhos['alcohol'], y = df_vinhos['density'], kind = \"scatter\", color = 'm', s=50, edgecolor = \"skyblue\", linewidth = 2)\n",
+ "plt.savefig('minha_figura.png')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "33yTNYN2K40X"
+ },
+ "source": [
+ "Same data, different plot --> Explore the available options: https://python-graph-gallery.com/82-marginal-plot-with-seaborn/"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BVmAt0wCK1Ob"
+ },
+ "source": [
+ "sns.jointplot(x = df_vinhos['alcohol'], y = df_vinhos['density'], kind = \"reg\", color = 'm', )\n",
+ "plt.savefig('minha_figura.png')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4ixcDmeXIFQ1"
+ },
+ "source": [
+ "### Pairplot\n",
+ "* Check pairwise relationships in the dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lWqwaZ_lArji"
+ },
+ "source": [
+ "sns.pairplot(df_vinhos)\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vAhaEgyYtfX9"
+ },
+ "source": [
+ "Below, the same plot segmented by color:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jnu-giD_tcwd"
+ },
+ "source": [
+ "sns.pairplot(df_vinhos, hue = \"color\") # Compare the plots with and without the hue option\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vtOH-mTHLGC-"
+ },
+ "source": [
+ "df_vinhos.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dcaQ8aPaHwBB"
+ },
+ "source": [
+ "sns.lmplot(x = \"alcohol\", y = \"density\", data = df_vinhos, hue = \"color\", fit_reg = False)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pWsCs585LPyn"
+ },
+ "source": [
+ "sns.lmplot(x = \"alcohol\", y = \"density\", data = df_vinhos, hue = \"quality\", fit_reg = False)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5RwOiYi3OfD5"
+ },
+ "source": [
+ "### Boxplot"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZqIP5xUOMAqL"
+ },
+ "source": [
+ "df_vinhos.boxplot(column = 'alcohol', by = 'quality', figsize = (12, 8))\n",
+ "plt.xlabel('Quality', fontsize = 10, color= 'blue')\n",
+ "plt.ylabel('alcohol', fontsize = 10, color= 'blue')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lWypAe78YQNm"
+ },
+ "source": [
+ "## Exercises"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YD8jgEZyYSHP"
+ },
+ "source": [
+ "### Exercise 1\n",
+ "* Graphical analysis of the variables in the IRIS dataframe.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "h0F7uXixYVqx"
+ },
+ "source": [
+ "from sklearn.datasets import load_iris\n",
+ "\n",
+ "iris = load_iris()\n",
+ "X= iris['data']\n",
+ "y= iris['target']\n",
+ "\n",
+ "df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))\n",
+ "df_iris['target2'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n",
+ "df_iris.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yV5gDSF1YdGL"
+ },
+ "source": [
+ "### Exercise 2\n",
+ "* Using the FIFA dataframe:\n",
+ " * (1) Show a bar chart with the number of players per club;\n",
+ " * (2) Show the boxplot/histogram of player salaries for the clubs Real Madrid, Barcelona, Paris Saint-Germain, Bayern Munich and Juventus."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "27NbnlDkYoeH"
+ },
+ "source": [
+ "# Load pandas, NumPy and Matplotlib\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E3EVKJFRjiyA"
+ },
+ "source": [
+ "# Configuration\n",
+ "d_configuracao = {\n",
+ " 'display.max_columns': 1000,\n",
+ " 'display.expand_frame_repr': True,\n",
+ " 'display.max_rows': 10,\n",
+ " 'display.precision': 2,\n",
+ " 'display.show_dimensions': True\n",
+ " }\n",
+ "\n",
+ "for op, value in d_configuracao.items():\n",
+ " pd.set_option(op, value)\n",
+ " print(op, value)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "r5K0504rjmum"
+ },
+ "source": [
+ "url = 'https://raw.githubusercontent.com/Celso-Omoto/DSWP/master/Dataframes/FIFA.csv'\n",
+ "df_Fifa2018 = pd.read_csv(url)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_H9fgfk_lOIv"
+ },
+ "source": [
+ "df_Fifa2018.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zer-di19l3tf"
+ },
+ "source": [
+ "from sklearn.datasets import load_iris\n",
+ "\n",
+ "iris = load_iris()\n",
+ "X= iris['data']\n",
+ "y= iris['target']\n",
+ "\n",
+ "df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))\n",
+ "df_iris['target2'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n",
+ "df_iris.head()\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "oNMpCMIUnJAA"
+ },
+ "source": [
+ "def transformacao_lower(df):\n",
+ " # First transformation: apply lower() to the COLUMN names of the dataframe passed in (modified in place):\n",
+ " df.columns = [col.lower() for col in df.columns]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gHxJ5xlFnLiV"
+ },
+ "source": [
+ "transformacao_lower(df_Fifa2018)\n",
+ "df_Fifa2018.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bykPdka9jukN"
+ },
+ "source": [
+ "df_Fifa2018[['club', 'name','value']]\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0zZTRJo6lYAU"
+ },
+ "source": [
+ "df_conta_jogadores = df_Fifa2018[['club', 'name','value']]\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HwTYS2FmqOJP"
+ },
+ "source": [
+ "# Bar chart of the FIFA data: number of players per club\n",
+ "df_conta_jogadores['club'].value_counts().sort_values().plot(xlabel = 'Clube',ylabel = 'Quant',kind = 'bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kkpO458zq-dG"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From d0d7604777e3aaf055849d3e22d384d118d80103 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Thu, 15 Oct 2020 16:24:48 -0300
Subject: [PATCH 06/21] Criado usando o Colaboratory
---
Notebooks/NB10_01__Pandas_Fifa.ipynb | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/Notebooks/NB10_01__Pandas_Fifa.ipynb b/Notebooks/NB10_01__Pandas_Fifa.ipynb
index 073a7477f..e3813afea 100644
--- a/Notebooks/NB10_01__Pandas_Fifa.ipynb
+++ b/Notebooks/NB10_01__Pandas_Fifa.ipynb
@@ -5484,7 +5484,10 @@
"id": "FCs2gPK8ckS7"
},
"source": [
- "\n"
+ "#df_estudantes['percentagem'] = 100*(df_estudantes['score']/sum(df_estudantes['score']))\n",
+ "#lista\n",
+ "#df_Fifa2018['salario'] = df_Fifa2018['value']*.str.split('/w', n = 2, expand = True) \n",
+ "#df_Fifa2018\n"
],
"execution_count": null,
"outputs": []
From 462124c2de270ef80b97fbd799b4e3337eb86fb5 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Fri, 16 Oct 2020 09:47:40 -0300
Subject: [PATCH 07/21] Criado usando o Colaboratory
---
.../NB02__Numpy_exercicio_resolvido.ipynb | 8418 +++++++++++++++++
1 file changed, 8418 insertions(+)
create mode 100644 Notebooks/NB02__Numpy_exercicio_resolvido.ipynb
diff --git a/Notebooks/NB02__Numpy_exercicio_resolvido.ipynb b/Notebooks/NB02__Numpy_exercicio_resolvido.ipynb
new file mode 100644
index 000000000..f6fa37aea
--- /dev/null
+++ b/Notebooks/NB02__Numpy_exercicio_resolvido.ipynb
@@ -0,0 +1,8418 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "NB02__Numpy.ipynb",
+ "provenance": [],
+ "collapsed_sections": [
+ "n8BIbzQbNWUo",
+ "7eS94uQ4NhVR",
+ "SYOgJpGYVLUu",
+ "CaHFxk98W5if",
+ "ReWUyWiHXCnc",
+ "CqszHxaKHr2h",
+ "tXgF1Wl9gHKY",
+ "Fotx7XUquAo8",
+ "36kmLUYDvsUI",
+ "SWO2GdNovxAp",
+ "vpN54l4vxze5",
+ "u4HOf9SNytSq",
+ "6BQ9oZiD9hg5",
+ "tz5-QdrX9vct",
+ "p1muBgMX8NK4",
+ "FxTC2-U88ajk",
+ "z8EYn0pP25Rh"
+ ],
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "accelerator": "GPU"
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6QhLXoatkvKR"
+ },
+ "source": [
+ "NUMPY\n",
+ "\n",
+ "> NumPy is a package for scientific computing and linear algebra in Python.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "b8EZupp68vW8"
+ },
+ "source": [
+ "# **AGENDA**:\n",
+ "> In this chapter, we will cover the following topics:\n",
+ "\n",
+ "* NumPy\n",
+ "* Creating arrays\n",
+ "* Creating multidimensional arrays\n",
+ "* Selecting items\n",
+ "* Applying functions such as max(), min(), etc.\n",
+ "* Computing descriptive statistics: mean and variance\n",
+ "* Reshaping\n",
+ "* Transpose of an array\n",
+ "* Eigenvalues and eigenvectors\n",
+ "* Wrap Up\n",
+ "* Exercises"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cO5t3xCO8kyK"
+ },
+ "source": [
+ "___\n",
+ "# **NOTES AND REMARKS**\n",
+ "\n",
+ "* Our goal with NumPy is to make working with Pandas easier;"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "z2IFUG4GSB0Z"
+ },
+ "source": [
+ "___\n",
+ "# **CHEATSHEET**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jYLeDVH-SNCg"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0mKvExmgUFOk"
+ },
+ "source": [
+ "# **SCALARS, VECTORS, MATRICES AND TENSORS**\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [PyTorch for Deep Learning: A Quick Guide for Starters](https://towardsdatascience.com/pytorch-for-deep-learning-a-quick-guide-for-starters-5b60d2dbb564)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o00pYRIkXiAU"
+ },
+ "source": [
+ "## Import Statement - First examples\n",
+ "> As an example, consider generating a random sample of size 10 from the Normal(0, 1) distribution:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "l_XuvcUDWNDk"
+ },
+ "source": [
+ "## Importing the NumPy library"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "am_ZTIGaapCo"
+ },
+ "source": [
+ "### **Option 1**: Import the NumPy library WITH an alias"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "b4irLw6BWVVZ"
+ },
+ "source": [
+ "import numpy as np"
+ ],
+ "execution_count": 2,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JK54ga7dXnJu",
+ "outputId": "e87ad67b-79c4-4ec8-ce74-cbb12851720a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Set the number of decimal places NumPy prints:\n",
+ "np.set_printoptions(precision = 2, suppress = True)\n",
+ "\n",
+ "'''\n",
+ "Set the seed for reproducibility, i.e.\n",
+ "make sure we all generate the same random numbers\n",
+ "'''\n",
+ "np.random.seed(seed = 20111974)\n",
+ "\n",
+ "# Draw 10 random numbers from the Normal(media, desvio_padrao) distribution\n",
+ "media = 0\n",
+ "desvio_padrao = 1\n",
+ "a_conjunto1 = np.random.normal(media, desvio_padrao, size = 10) # 1D array with size = 10\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06, 1.14,\n",
+ " 1.38])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 3
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7GfmDF43NrHd",
+ "outputId": "a193bbbd-7711-428a-bc72-0f12801050c1",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "a_tab=np.random.normal(8,1,size=(3,10))\n",
+ "a_tab"
+ ],
+ "execution_count": 12,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[8.74, 8.18, 8.1 , 7.22, 7.96, 9.67, 6.93, 7.45, 6.17, 8.12],\n",
+ " [9.39, 7.71, 8.32, 7.3 , 7.56, 5.97, 7.86, 9.66, 7.42, 7.21],\n",
+ " [7.19, 8.06, 8.87, 7.65, 9.37, 8.88, 6.52, 7.59, 7.81, 8.47]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 12
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3-0934isZUm6"
+ },
+ "source": [
+ "**Note**: Change the value of [precision] to 4, 2 and 0 and observe what happens."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9ob_8S_bYYa2"
+ },
+ "source": [
+ "### **Option 2**: Import the NumPy library WITHOUT an alias"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NcGd1ho_XDXU"
+ },
+ "source": [
+ "import numpy"
+ ],
+ "execution_count": 14,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zFYH6J5-Ydjl",
+ "outputId": "0f5ae787-4691-4e29-edb0-5912ae39d0e9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Set the number of decimal places NumPy prints:\n",
+ "numpy.set_printoptions(precision = 2, suppress = True)\n",
+ "\n",
+ "'''\n",
+ "Set the seed for reproducibility, i.e.\n",
+ "make sure we all generate the same random numbers\n",
+ "'''\n",
+ "numpy.random.seed(seed = 20111974)\n",
+ "\n",
+ "# Draw 10 random numbers from the Normal(media, desvio_padrao) distribution\n",
+ "media = 0\n",
+ "desvio_padrao = 1\n",
+ "numpy.random.normal(media, desvio_padrao, size = 10)"
+ ],
+ "execution_count": 15,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06, 1.14,\n",
+ " 1.38])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 15
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AwWSzYrZWfvA"
+ },
+ "source": [
+ "### **Option 3**: Import specific functions from the NumPy library"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bfYJzcqRa5eu"
+ },
+ "source": [
+ "from numpy import set_printoptions\n",
+ "from numpy.random import seed, normal"
+ ],
+ "execution_count": 16,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Xj6fbpvubH_p",
+ "outputId": "00836057-014f-44c7-a6ae-f41895de17cc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Set the number of decimal places NumPy prints:\n",
+ "set_printoptions(precision = 2, suppress = True)\n",
+ "\n",
+ "'''\n",
+ "Set the seed for reproducibility, i.e.\n",
+ "make sure we all generate the same random numbers\n",
+ "'''\n",
+ "seed(seed = 20111974)\n",
+ "\n",
+ "# Draw 10 random numbers from the Normal(media, desvio_padrao) distribution\n",
+ "media = 0\n",
+ "desvio_padrao = 1\n",
+ "normal(media, desvio_padrao, size = 10)"
+ ],
+ "execution_count": 17,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06, 1.14,\n",
+ " 1.38])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 17
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "00RerJPChnuP"
+ },
+ "source": [
+ "___\n",
+ "# **Descriptive Statistics with NumPy**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Qa6ro1VJlShd"
+ },
+ "source": [
+ "## Example 1\n",
+ "> Let's go back to the previous example, this time using option 1 (with alias):\n",
+ "\n",
+ "* Generate a random sample of size 10 from the Normal(0, 1) distribution."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "31dSBU8khvFk",
+ "outputId": "14d42b6f-1a70-4d38-b84a-7cf71e5a8a44",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Set the number of decimal places NumPy prints:\n",
+ "np.set_printoptions(precision = 2, suppress = True)\n",
+ "\n",
+ "# Set the seed\n",
+ "np.random.seed(seed = 20111974)\n",
+ "\n",
+ "# Draw 10 random numbers from the Normal(media, desvio_padrao) distribution\n",
+ "media = 0\n",
+ "desvio_padrao = 1\n",
+ "\n",
+ "np.random\n",
+ "a_conjunto1 = np.random.normal(media, desvio_padrao, size = 10) # 1D array with size = 10\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 18,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06, 1.14,\n",
+ " 1.38])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 18
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wa2t0P3nevTh"
+ },
+ "source": [
+ "Checking the mean and standard deviation of the generated array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "drUyk3f5ekDq",
+ "outputId": "f12428fe-0c82-4595-f725-20b02a1c9138",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'Distribuição N({np.mean(a_conjunto1)}, {np.std(a_conjunto1)})'"
+ ],
+ "execution_count": 19,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Distribuição N(1.1043374540652753, 0.735246705657231)'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 19
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HUIlWDMsPFfG",
+ "outputId": "a1a90dc8-a74e-4c9b-d29f-6d77fb6001f7",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "# Set the number of decimal places NumPy prints:\n",
+ "np.set_printoptions(precision = 2, suppress = True)\n",
+ "\n",
+ "# Set the seed\n",
+ "np.random.seed(seed = 20111974)\n",
+ "\n",
+ "# Draw i_size random numbers from the Normal(media, desvio_padrao) distribution\n",
+ "media = 0\n",
+ "desvio_padrao = 1\n",
+ "i_size = 100\n",
+ "\n",
+ "np.random\n",
+ "a_conjunto1 = np.random.normal(media, desvio_padrao, size = i_size) # 1D array with size = i_size\n",
+ "f'Distribuição : número de elementos: {i_size} / Média {np.mean(a_conjunto1)} / DP {np.std(a_conjunto1)})'"
+ ],
+ "execution_count": 26,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Distribuição : número de elementos: 100 / Média -0.016996335492713833 / DP 1.0055613764417128)'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 26
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wfhWpOtjPogD",
+ "outputId": "d74b48e4-0fca-4cdf-efc8-a3fc2152dff7",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "# Set the number of decimal places NumPy prints:\n",
+ "np.set_printoptions(precision = 2, suppress = True)\n",
+ "\n",
+ "# Set the seed\n",
+ "np.random.seed(seed = 20111974)\n",
+ "\n",
+ "# Draw i_size random numbers from the Normal(media, desvio_padrao) distribution, for each size in a_elementos\n",
+ "media = 0\n",
+ "desvio_padrao = 1\n",
+ "a_elementos = [10,100,1000,10000,100000]\n",
+ "\n",
+ "for i_size in a_elementos:\n",
+ " a_conjunto1 = np.random.normal(media, desvio_padrao, size = i_size) # 1D array with size = i_size\n",
+ " print(f'Distribuição : número de elementos: {i_size} / Média {np.mean(a_conjunto1)} / DP {np.std(a_conjunto1)})')"
+ ],
+ "execution_count": 30,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Distribuição : número de elementos: 10 / Média 1.1043374540652753 / DP 0.735246705657231)\n",
+ "Distribuição : número de elementos: 100 / Média -0.14020525697186714 / DP 0.9254100654233511)\n",
+ "Distribuição : número de elementos: 1000 / Média 0.021644923462910873 / DP 1.0054417533501039)\n",
+ "Distribuição : número de elementos: 10000 / Média 0.015499353804764507 / DP 0.9970905566844254)\n",
+ "Distribuição : número de elementos: 100000 / Média 0.002039323041103302 / DP 0.9960906293570095)\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XSp7Hd-Gib67"
+ },
+ "source": [
+ "We were expecting media = 0 and sigma = 1, right? Why didn't that happen?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HP_8VSgygXOF"
+ },
+ "source": [
+ "## **Lab 1**\n",
+ "> Change the value of [size] to 100, 1,000, 10,000, 100,000 and 1,000,000 and report what happens to the mean and standard deviation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4TbmVbdcg6iU"
+ },
+ "source": [
+ "## **My solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-qdiqBVHg-gd",
+ "outputId": "d2ea4d17-6db3-4113-d94f-91273460ebd8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 153
+ }
+ },
+ "source": [
+ "# Set the mean and the standard deviation\n",
+ "media = 0\n",
+ "desvio_padrao = 1\n",
+ "\n",
+ "# Set the seed\n",
+ "np.random.seed(seed = 20111974)\n",
+ "l_lista_conjunto = [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000] # stops at 10**8: 10**9 float64 samples would need about 8 GB\n",
+ "\n",
+ "for i_size in l_lista_conjunto:\n",
+ " a_conjunto1 = np.random.normal(media, desvio_padrao, size = i_size)\n",
+ " print(f'Size: {i_size}--> Distribuição: N({np.mean(a_conjunto1)}, {np.std(a_conjunto1)})')"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Size: 10--> Distribuição: N(1.1043374540652753, 0.735246705657231)\n",
+ "Size: 100--> Distribuição: N(-0.14020525697186714, 0.9254100654233511)\n",
+ "Size: 1000--> Distribuição: N(0.021644923462910873, 1.0054417533501039)\n",
+ "Size: 10000--> Distribuição: N(0.015499353804764507, 0.9970905566844254)\n",
+ "Size: 100000--> Distribuição: N(0.002039323041103302, 0.9960906293570095)\n",
+ "Size: 1000000--> Distribuição: N(-1.1062145143945444e-06, 0.999473966169304)\n",
+ "Size: 10000000--> Distribuição: N(0.0002892972723094128, 1.0001202837422036)\n",
+ "Size: 100000000--> Distribuição: N(0.00011967896623555603, 0.999944390106086)\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bp-YuviQwWqE"
+ },
+ "source": [
+ "For the Normal($\\mu, \\sigma$) distribution, we have:\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Normal Distribution](https://towardsdatascience.com/understanding-the-68-95-99-7-rule-for-a-normal-distribution-b7b7cbf760c2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KwHBY3Enk04N"
+ },
+ "source": [
+ "## Strong Law of Large Numbers (Lei Forte dos Grandes Números, LFGN)\n",
+ "> Please read the [Law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) article. --> 3 minutes.\n",
+ "\n",
+ "* What did you learn from it?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BhwmSkAjlszT"
+ },
+ "source": [
+ "## Example 2\n",
+ "> Let's dig a little deeper into what the LFGN says. To do so, we will simulate rolling a die. As we know, a die has 6 faces numbered 1 to 6, each with equal probability.\n",
+ "\n",
+ "The LFGN tells us that as N (the sample size, i.e. the number of rolls) grows, the sample mean converges to the expected value. For a fair die, the expected value is:\n",
+ "\n",
+ "$$\\frac{1+2+3+4+5+6}{6}= \\frac{21}{6}= 3.5$$\n",
+ "\n",
+ "In other words, as N (the sample size) grows, we expect the mean of the rolls to approach 3.5.\n",
+ "\n",
+ "Let's check whether this holds..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-QcJXf6roj0D"
+ },
+ "source": [
+ "We will use the np.random.randint function (the randint function defined in the np.random module), as follows:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "A2u0RzLOrRE2"
+ },
+ "source": [
+ "What does the result below mean, i.e. how should it be interpreted?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "B3-X_VBerUfa",
+ "outputId": "2c0dde81-a718-4523-80cf-a8b546fc1a49",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "# Set the seed\n",
+ "import numpy as np\n",
+ "np.random.seed(seed = 20111974)\n",
+ "\n",
+ "# Simulate 100 rolls of a die:\n",
+ "a_dados_simulados = np.random.randint(1, 7, size = 100)\n",
+ "a_dados_simulados"
+ ],
+ "execution_count": 2,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([4, 5, 3, 1, 1, 4, 3, 1, 2, 2, 1, 1, 6, 4, 5, 3, 1, 4, 1, 6, 2, 4,\n",
+ " 6, 2, 4, 3, 2, 6, 3, 6, 2, 6, 1, 3, 1, 2, 4, 2, 4, 6, 3, 2, 6, 1,\n",
+ " 4, 3, 6, 5, 2, 3, 3, 3, 3, 2, 1, 6, 2, 1, 2, 3, 1, 5, 6, 6, 6, 6,\n",
+ " 5, 6, 6, 5, 6, 3, 3, 2, 4, 2, 6, 1, 2, 3, 4, 5, 5, 3, 1, 6, 6, 5,\n",
+ " 5, 1, 4, 6, 2, 2, 4, 3, 6, 1, 5, 5])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 2
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m8Of2MMIrbF3",
+ "outputId": "dca5755e-c2bb-44e9-dc49-e7569bcbcfe2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 136
+ }
+ },
+ "source": [
+ "# Import pandas, since we need Series.value_counts():\n",
+ "import pandas as pd\n",
+ "pd.Series(a_dados_simulados).value_counts()"
+ ],
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "6 22\n",
+ "3 18\n",
+ "2 18\n",
+ "1 17\n",
+ "4 13\n",
+ "5 12\n",
+ "dtype: int64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 3
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "54VwED8Br8rx"
+ },
+ "source": [
+ "**Interpretation**: We simulated rolling a die 100 times. Above is the frequency with which each face came up.\n",
+ "\n",
+ "I was expecting roughly the same frequency for each face, that is, around 16 or 17:\n",
+ "\n",
+ "$$\\frac{100}{6} \\approx 16.67$$\n",
+ "\n",
+ "But fine, let's carry on with our experiment..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HT_Dak-umC6I",
+ "outputId": "a4bcdc1f-0366-48cb-a2da-905d5751c994",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 170
+ }
+ },
+ "source": [
+ "# Set the seed\n",
+ "np.random.seed(20111974)\n",
+ "\n",
+ "for i_size in [10, 30, 50, 75, 100, 1000, 10000, 100000, 1000000]:\n",
+ " a_dados_simulados = np.random.randint(1, 7, size = i_size)\n",
+ " print(f'Size: {i_size} --> Média: {np.mean(a_dados_simulados)}')"
+ ],
+ "execution_count": 4,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Size: 10 --> Média: 2.6\n",
+ "Size: 30 --> Média: 3.3666666666666667\n",
+ "Size: 50 --> Média: 3.72\n",
+ "Size: 75 --> Média: 3.2666666666666666\n",
+ "Size: 100 --> Média: 3.42\n",
+ "Size: 1000 --> Média: 3.461\n",
+ "Size: 10000 --> Média: 3.5259\n",
+ "Size: 100000 --> Média: 3.50794\n",
+ "Size: 1000000 --> Média: 3.50151\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "edWNNOnXtbtd"
+ },
+ "source": [
+ "And now, how do you interpret these results?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eL6gXThkYcSf"
+ },
+ "source": [
+ "## Computing percentiles\n",
+ "> Boxplot"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jlGOQfXfPf0D"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "grtEXG2BoNRt"
+ },
+ "source": [
+ "Consider the array of (simulated) returns below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zgAh7gWeRews"
+ },
+ "source": [
+ "# Using the data from the source article: Understanding Boxplots\n",
+ "import pandas as pd\n",
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n",
+ "# Dataset hosted in the mGalarnyk/Python_Tutorials GitHub repo\n",
+ "df = pd.read_csv('https://raw.githubusercontent.com/mGalarnyk/Python_Tutorials/master/Kaggle/BreastCancerWisconsin/data/data.csv')"
+ ],
+ "execution_count": 5,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jKjcOB_1RseM",
+ "outputId": "434f2866-7c1e-478b-b382-29ccf082104a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 241
+ }
+ },
+ "source": [
+ "df.head(5)"
+ ],
+ "execution_count": 8,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>id</th><th>diagnosis</th><th>radius_mean</th><th>texture_mean</th><th>perimeter_mean</th><th>area_mean</th><th>smoothness_mean</th><th>compactness_mean</th><th>concavity_mean</th><th>concave points_mean</th><th>symmetry_mean</th><th>fractal_dimension_mean</th><th>radius_se</th><th>texture_se</th><th>perimeter_se</th><th>area_se</th><th>smoothness_se</th><th>compactness_se</th><th>concavity_se</th><th>concave points_se</th><th>symmetry_se</th><th>fractal_dimension_se</th><th>radius_worst</th><th>texture_worst</th><th>perimeter_worst</th><th>area_worst</th><th>smoothness_worst</th><th>compactness_worst</th><th>concavity_worst</th><th>concave points_worst</th><th>symmetry_worst</th><th>fractal_dimension_worst</th><th>Unnamed: 32</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>842302</td><td>M</td><td>17.99</td><td>10.38</td><td>122.80</td><td>1001.0</td><td>0.11840</td><td>0.27760</td><td>0.3001</td><td>0.14710</td><td>0.2419</td><td>0.07871</td><td>1.0950</td><td>0.9053</td><td>8.589</td><td>153.40</td><td>0.006399</td><td>0.04904</td><td>0.05373</td><td>0.01587</td><td>0.03003</td><td>0.006193</td><td>25.38</td><td>17.33</td><td>184.60</td><td>2019.0</td><td>0.1622</td><td>0.6656</td><td>0.7119</td><td>0.2654</td><td>0.4601</td><td>0.11890</td><td>NaN</td></tr>\n",
+ "    <tr><th>1</th><td>842517</td><td>M</td><td>20.57</td><td>17.77</td><td>132.90</td><td>1326.0</td><td>0.08474</td><td>0.07864</td><td>0.0869</td><td>0.07017</td><td>0.1812</td><td>0.05667</td><td>0.5435</td><td>0.7339</td><td>3.398</td><td>74.08</td><td>0.005225</td><td>0.01308</td><td>0.01860</td><td>0.01340</td><td>0.01389</td><td>0.003532</td><td>24.99</td><td>23.41</td><td>158.80</td><td>1956.0</td><td>0.1238</td><td>0.1866</td><td>0.2416</td><td>0.1860</td><td>0.2750</td><td>0.08902</td><td>NaN</td></tr>\n",
+ "    <tr><th>2</th><td>84300903</td><td>M</td><td>19.69</td><td>21.25</td><td>130.00</td><td>1203.0</td><td>0.10960</td><td>0.15990</td><td>0.1974</td><td>0.12790</td><td>0.2069</td><td>0.05999</td><td>0.7456</td><td>0.7869</td><td>4.585</td><td>94.03</td><td>0.006150</td><td>0.04006</td><td>0.03832</td><td>0.02058</td><td>0.02250</td><td>0.004571</td><td>23.57</td><td>25.53</td><td>152.50</td><td>1709.0</td><td>0.1444</td><td>0.4245</td><td>0.4504</td><td>0.2430</td><td>0.3613</td><td>0.08758</td><td>NaN</td></tr>\n",
+ "    <tr><th>3</th><td>84348301</td><td>M</td><td>11.42</td><td>20.38</td><td>77.58</td><td>386.1</td><td>0.14250</td><td>0.28390</td><td>0.2414</td><td>0.10520</td><td>0.2597</td><td>0.09744</td><td>0.4956</td><td>1.1560</td><td>3.445</td><td>27.23</td><td>0.009110</td><td>0.07458</td><td>0.05661</td><td>0.01867</td><td>0.05963</td><td>0.009208</td><td>14.91</td><td>26.50</td><td>98.87</td><td>567.7</td><td>0.2098</td><td>0.8663</td><td>0.6869</td><td>0.2575</td><td>0.6638</td><td>0.17300</td><td>NaN</td></tr>\n",
+ "    <tr><th>4</th><td>84358402</td><td>M</td><td>20.29</td><td>14.34</td><td>135.10</td><td>1297.0</td><td>0.10030</td><td>0.13280</td><td>0.1980</td><td>0.10430</td><td>0.1809</td><td>0.05883</td><td>0.7572</td><td>0.7813</td><td>5.438</td><td>94.44</td><td>0.011490</td><td>0.02461</td><td>0.05688</td><td>0.01885</td><td>0.01756</td><td>0.005115</td><td>22.54</td><td>16.67</td><td>152.20</td><td>1575.0</td><td>0.1374</td><td>0.2050</td><td>0.4000</td><td>0.1625</td><td>0.2364</td><td>0.07678</td><td>NaN</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " id diagnosis ... fractal_dimension_worst Unnamed: 32\n",
+ "0 842302 M ... 0.11890 NaN\n",
+ "1 842517 M ... 0.08902 NaN\n",
+ "2 84300903 M ... 0.08758 NaN\n",
+ "3 84348301 M ... 0.17300 NaN\n",
+ "4 84358402 M ... 0.07678 NaN\n",
+ "\n",
+ "[5 rows x 33 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 8
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "plyx1XniRgzD",
+ "outputId": "29317060-c75d-4e26-97de-b8df9a92d1d4",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 296
+ }
+ },
+ "source": [
+ "sns.boxplot(x='diagnosis', y='area_mean', data=df)"
+ ],
+ "execution_count": 6,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 6
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAEGCAYAAACUzrmNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAWr0lEQVR4nO3de7SddX3n8feHBBEBC4SUlQY02pPq8jKCPaKd2tYLkYhVoONy4VozRGvFWVVIaWfNQIcRxmKrHdsuiNYpjGgYrRRHLRmHiU0Yq+0aLwSk3B3OYJBkEGIIN7k1yXf+2M8ph5Bznh04+zz7nPN+rbXXfvbvuX131sn5nN/zey6pKiRJmsp+XRcgSRp+hoUkqZVhIUlqZVhIkloZFpKkVgu7LmAQjjjiiFq2bFnXZUjSrHLttdf+pKoW723enAyLZcuWsWnTpq7LkKRZJcmdk83zMJQkqZVhIUlqZVhIkloZFpKkVoaFpjQ2Nsbb3vY2xsbGui5FUocGFhZJjk7yjSS3JLk5yeqm/fwkW5Nc37xOnLDOOUnGkvwgyQkT2lc2bWNJzh5UzXq6Cy64gJ/+9KdccMEFXZciqUODPHV2J/B7VXVdkkOAa5NsaOb9WVV9YuLCSV4GnAq8HPg5YGOSX2hmfwpYAWwBrkmyrqpuGWDtoter2Lx5MwCbN29mbGyMkZGRbouS1ImB9Syq6u6quq6Zfgi4FVg6xSonAZdX1eNV9UNgDDiueY1V1R1V9QRwebOsBmzP3oS9C2n+mpExiyTLgGOB7zZNH0pyQ5JLkxzWtC0F7pqw2pambbL2PfdxepJNSTZt27Ztmr/B/DTeq5jss6T5Y+BhkeRg4MvA71TVg8CngZ8HjgHuBv5kOvZTVRdX1WhVjS5evNer1bWP9rxlirdQkeavgYZFkv3pBcUXquorAFV1T1XtqqrdwCX0DjMBbAWOnrD6UU3bZO0asHPPPXfKz5Lmj0GeDRXgM8CtVfWnE9qXTFjsFOCmZnodcGqSA5K8CFgOfA+4Blie5EVJnkNvEHzdoOrWk0ZGRv6pN7Fs2TIHt6V5bJA9i18G/hXwpj1Ok/3jJDcmuQF4I3AWQFXdDFwB3AKsBz7Y9EB2Ah8Cvk5vkPyKZlnNgHPPPZeDDjrIXoU0z6Wquq5h2o2OjpZ3nZWkfZPk2qoa3ds8r+CWJLUyLCRJrQwLSVIrw0KS1MqwkCS1MiwkSa0MC0lSK8NCktTKsJAktTIsJEmtDAtJUivDQlPavn07Z555Jtu3b++6FEkdMiw0pbVr13LjjTdy2WWXdV2KpA4ZFprU9u3bWb9+PVXF+vXr7V1I85hhoUmtXbuW3bt3A7Br1y57F9I8ZlhoUhs3bmTnzp0A7Ny5kw0bNnRckaSuGBaa1PHHH8/ChQsBWLhwIStWrOi4IkldMSw0qVWrVrHffr0fkQULFnDaaad1XJGkrhgWmtSiRYtYuXIlSVi5ciWLFi3quiRJHVnYdQEabqtWrWLz5s32KqR5zrDQlBYtWsRFF13UdRmSOuZhKElSK8NCktTKsJAktTIsJEmtDAtJUivDQpLUyrCQJLUyLCRJrQwLSVIrw0KS1Mqw0JR8BrckMCzUwmdwS4IBhkWSo5N8I8ktSW5OsrppPzzJhiS3N++HNe1JclGSsSQ3JHn1hG2tapa/PcmqQdWsp/IZ3JLGDbJnsRP4vap6GfA64INJXgacDVxdVcuBq5vPAG8Fljev04FPQy9cgPOA1wLHAeeNB4wGy2dwSxo3sLCoqrur6rpm+iHgVmApcBKwtllsLXByM30ScFn1fAc4NMkS4ARgQ1XdV1U7gA3AykHVrSf5DG5J42ZkzCLJMuBY4LvAkVV1dzPrx8CRzfRS4K4Jq21p2iZr33MfpyfZlGTTtm3bprX++cpncEsaN/CwSHIw8GXgd6rqwYnzqqqAmo79VNXFVTVaVaOLFy+ejk3Oez6D
W9K4gYZFkv3pBcUXquorTfM9zeElmvd7m/atwNETVj+qaZusXQPmM7gljRvk2VABPgPcWlV/OmHWOmD8jKZVwJUT2k9rzop6HfBAc7jq68BbkhzWDGy/pWnTDFi1ahWvfOUr7VVI81x6R4IGsOHk9cDfATcCu5vm36c3bnEF8ALgTuBdVXVfEy6fpDd4/Qjw3qra1GzrN5t1AT5aVZ+dat+jo6O1adOmaf5GkjS3Jbm2qkb3Om9QYdElw0KS9t1UYeEV3JKkVoaFJKmVYSFJarWw6wI0uTVr1jA2NtZpDVu39s5SXrr0addBzriRkRHOOOOMrsuQ5iXDQlN69NFHuy5B0hAwLIbYMPwVvXr1agAuvPDCjiuR1CXHLCRJrQwLSVIrw0KS1MqwkCS1MiwkSa0MC0lSK8NCktTKsJAktTIsJEmtDAtJUivDQpLUyrCQJLUyLCRJrQwLSVIrw0KS1MqwkCS1MiwkSa0MC0lSK8NCktTKsJAktTIsJEmtFrYtkGQx8H5g2cTlq+o3B1eWJGmYtIYFcCXwd8BGYNdgy5EkDaN+wuJ5VfXvBl6JJGlo9TNm8bUkJw68EknS0OonLFbTC4xHkzyY5KEkDw66MEnS8Gg9DFVVh8xEIZKk4dXXqbNJDktyXJJfHX/1sc6lSe5NctOEtvOTbE1yffM6ccK8c5KMJflBkhMmtK9s2saSnL2vX1CS9Oz1c+rsb9E7FHUUcD3wOuDbwJtaVv0c8Engsj3a/6yqPrHHPl4GnAq8HPg5YGOSX2hmfwpYAWwBrkmyrqpuaatbkjR9+h2zeA1wZ1W9ETgWuL9tpar6FnBfn3WcBFxeVY9X1Q+BMeC45jVWVXdU1RPA5c2ykqQZ1E9YPFZVjwEkOaCqbgNe8iz2+aEkNzSHqQ5r2pYCd01YZkvTNln70yQ5PcmmJJu2bdv2LMqTJO2pn7DYkuRQ4K+BDUmuBO58hvv7NPDzwDHA3cCfPMPtPE1VXVxVo1U1unjx4unarCSJ/s6GOqWZPD/JN4CfAdY/k51V1T3j00kuAb7WfNwKHD1h0aOaNqZolyTNkH7Phnp9kvdW1TfpDW7v9VBQH9tZMuHjKcD4mVLrgFOTHJDkRcBy4HvANcDyJC9K8hx6g+Drnsm+JUnPXD9nQ50HjNIbp/gssD/weeCXW9b7IvAG4IgkW4DzgDckOQYoYDPwAYCqujnJFcAtwE7gg1W1q9nOh4CvAwuAS6vq5n3+lpKkZ6Wfe0OdQu8MqOsAqur/JWm9UK+q3r2X5s9MsfxHgY/upf0q4Ko+6pQkDUg/h6GeqKqi1xsgyUGDLUmSNGz6CYsrkvwFcGiS99O7Vfklgy1LkjRM+jkb6hNJVgAP0hu3+HBVbRh4ZZKkodHPmAVVtSHJd8eXT3J4VfV7dbYkaZbr52yoDwD/EXgM2A2E3vjFiwdbmiRpWPTTs/g3wCuq6ieDLkaSNJz6GeD+v8Ajgy5EkjS8+ulZnAP872bM4vHxxqo6c2BVSZKGSj9h8RfA/wJupDdmIUmaZ/oJi/2r6ncHXokkaWj1M2bxP5tnRSxJcvj4a+CVSZKGRj89i/F7PJ0zoc1TZyVpHunnCu4XTTU/yQqv6Jakua2v51m0+Pg0bEOSNMSmIywyDduQJA2x6QiLmoZtSJKG2HSEhSRpjpuOsNg8DduQJA2xvm5RnuQVwMuA5463VdVlzftvDKa07qxZs4axsbGuyxgK4/8Oq1ev7riS4TAyMsIZZ5zRdRnSjOvnFuXnAW+gFxZXAW8F/h64bKCVdWhsbIzrb7qVXc/z2sP9nugNSV17xz0dV9K9BY/4CBfNX/30LN4JvAr4flW9N8mRwOcHW1b3dj3vcB596Yldl6EhcuBtV3VdgtSZfsYsHq2q3cDOJM8H7gWOHmxZkqRh0k/PYlOSQ4FLgGuBh4FvD7QqSdJQ6ed2H7/dTP7nJOuB51fVDYMtS5I0TFoPQ6XnXyb5cFVtBu5PctzgS5MkDYt+
xiz+HPglnrz77EPApwZWkSRp6PQzZvHaqnp1ku8DVNWOJM8ZcF2SpCHST8/iH5MsoLkHVJLF+HhVSZpX+gmLi4CvAj+b5KP0Lsj7w4FWJUkaKlMehkqyH/BD4N8Cb6Z3O/KTq+rWGahNkjQkpgyLqtqd5FNVdSxw2wzVJEkaMv0chro6yb9I4kOOJGme6icsPgB8CXg8yYNJHkry4IDrkiQNkdawqKpDgCOAXwHeDvx68z6lJJcmuTfJTRPaDk+yIcntzfthTXuSXJRkLMkNSV49YZ1VzfK3J1n1DL6jpDlq+/btnHnmmWzfvr3rUua8fq7g/i3gm8B64Pzm/cN9bPtzwMo92s4Grq6q5cDVzWfo3fZ8efM6Hfh0s+/DgfOA1wLHAeeNB4wkrV27lhtvvJHLLpuzT0wYGv0chloNvAa4s6reCBwLPNC2UlV9C9jzAQAnAWub6bXAyRPaL6ue7wCHJlkCnABsqKr7qmoHsIGnB5CkeWj79u2sX7+eqmL9+vX2Lgasn7B4rKoeA0hyQFXdBrzkGe7vyKq6u5n+MXBkM70UuGvCcluatsnanybJ6Uk2Jdm0bdu2Z1iepNli7dq17N7duz54165d9i4GrJ+w2NLcovyvgQ1JrgTufLY7rqqiuSp8OlTVxVU1WlWjixcvnq7NShpSGzduZOfOnQDs3LmTDRs2dFzR3NbPAPcpVXV/VZ0P/AfgMzx5+Ghf3dMcXqJ5v7dp38pTH6h0VNM2Wbukee74449n4cLepWILFy5kxYoVHVc0t/XTs/gnVfXNqlpXVU88w/2tA8bPaFoFXDmh/bTmrKjXAQ80h6u+DrwlyWHNwPZbmjZJ89yqVavYb7/er7AFCxZw2mmndVzR3LZPYbEvknyR3hP1XpJkS5L3AR8DViS5HTi++QxwFXAHMEbviXy/DVBV9wF/AFzTvD7StEma5xYtWsTKlStJwsqVK1m0aFHXJc1p/dyi/BmpqndPMuvNe1m2gA9Osp1LgUunsbRWW7duZcEjD3DgbVfN5G415BY8sp2tW3d2XYYmWLVqFZs3b7ZXMQMG1rOQJM0dA+tZzGZLly7lx48v5NGXnth1KRoiB952FUuXHtm+oGbMxIvyzjrrrK7LmdPsWUialbwob2YZFpJmJS/Km1mGhaRZyYvyZpZhIWlWOv7445/y2YvyBsuwkDQrveMd73jK57e/vfXJCXoWDAtJs9IVV1zxlM9f+tKXOqpkfjAsJM1KV1999VM+b9y4saNK5gfDQtKslGTKz5peXpQnaZ+tWbOGsbGxTms45JBD2LFjx1M+r169upNaRkZGOOOMMzrZ90yxZyFpVlqyZMmUnzW97FlI2mfD8lf0Kaecwo4dOzjhhBM455xzui5nTjMsJM1aS5Ys4YknnuD000/vupQ5z8NQkmat/fffn5GREZ9lMQMMC0lSK8NCktTKsJAktXKAexILHrnPx6oC+z32IAC7n/v8jivp3oJH7gN8+JHmJ8NiL0ZGRrouYWiMjT0EwMiL/SUJR/qzoXnLsNiLYTmHfBiMXxF74YUXdlyJpC45ZiFJamVYSJJaGRaSpFaGhSSplWEhSWplWEiSWhkWkqRWhoUkqZVhIUlqZVhIklp5uw9pllmzZg1jY2NdlzEUxv8dxm9LM9+NjIwM7HZFhoU0y4yNjXH7zd/nBQfv6rqUzj3nH3sHRx6/c1PHlXTvRw8vGOj2OwmLJJuBh4BdwM6qGk1yOPBXwDJgM/CuqtqRJMCFwInAI8B7quq6LuqWhsULDt7F77/6wa7L0BD5w+sG+xiBLscs3lhVx1TVaPP5bODqqloOXN18BngrsLx5nQ58esYrlaR5bpgGuE8C1jbTa4GTJ7RfVj3fAQ5NsqSLAiVpvuoqLAr4myTXJjm9aTuyqu5upn/Mk48kWwrcNWHdLU3bUyQ5PcmmJJu2bds2qLolaV7qaoD79VW1NcnPAhuS3DZxZlVVktqXDVbVxcDFAKOjo/u0riRp
ap30LKpqa/N+L/BV4DjgnvHDS837vc3iW4GjJ6x+VNMmSZohM96zSHIQsF9VPdRMvwX4CLAOWAV8rHm/slllHfChJJcDrwUemHC4Spp3tm7dyk8fWjDws180u9z50AIO2jq4v6O7OAx1JPDV3hmxLAT+sqrWJ7kGuCLJ+4A7gXc1y19F77TZMXqnzr535kuWpPltxsOiqu4AXrWX9u3Am/fSXsAHZ6A0aVZYunQpj++82+ss9BR/eN3zOWDp0879mTbDdOqsJGlIGRaSpFaGhSSplTcSlGahHz3s2VAA9zzS+3v3yOft7riS7v3o4QUsH+D2DQtplhkZGem6hKHxRHOL8gNe6L/Jcgb7s5HeyUZzy+joaG3aNPtvWTwMzy0Y3/8w/IIa5L36NTuNP8fiwgsv7LiSuSHJtRNu7voU9iw0pQMPPLDrEiQNAcNiiPlXtKRh4dlQkqRWhoUkqZVhIUlqZVhIkloZFpKkVoaFJKmVYSFJamVYSJJaGRaSpFaGhSSplWEhSWplWEiSWhkWkqRWhoUkqZVhIUlq5fMsJO2zYXiKIzz5JMfxJ+Z1ZT48xdGwkDRr+STHmWNYSNpnc/2vaD2dYxaSpFaGhSSplWEhSWplWEiSWhkWkqRWhoUkqZVhIUlqZVhIklqlqrquYdol2Qbc2XUdc8gRwE+6LkKahD+f0+eFVbV4bzPmZFhoeiXZVFWjXdch7Y0/nzPDw1CSpFaGhSSplWGhflzcdQHSFPz5nAGOWUiSWtmzkCS1MiwkSa0MC+1Vkkry+QmfFybZluRrXdYlASTZleT6JP+Q5Lok/7zrmuY6n5SnyfwUeEWSA6vqUWAFsLXjmqRxj1bVMQBJTgD+CPi1bkua2+xZaCpXAW9rpt8NfLHDWqTJPB/Y0XURc51hoalcDpya5LnAPwO+23E90rgDm8NQtwH/BfiDrgua6zwMpUlV1Q1JltHrVVzVbTXSU0w8DPVLwGVJXlFeCzAw9izUZh3wCTwEpSFVVd+mdzPBvd4AT9PDnoXaXArcX1U3JnlD18VIe0ryUmABsL3rWuYyw0JTqqotwEVd1yHt4cAk1zfTAVZV1a4uC5rrvN2HJKmVYxaSpFaGhSSplWEhSWplWEiSWhkWkqRWnjor9SHJ+cDD9O5D9K2q2thhLR/pugbNP4aFtA+q6sPWoPnIw1DSJJL8+yT/J8nfAy9p2j6X5J3N9IeTXJPkpiQXJ0nT/pokNzQ3uvtPSW5q2t+T5CtJ1ie5PckfT9jXu5Pc2Gzr403bgmZ/NzXzztpLDR9Lckuzv0/M6D+Q5hV7FtJeJPlF4FTgGHr/T64Drt1jsU9W1Uea5f8r8OvAfwc+C7y/qr6d5GN7rHMMcCzwOPCDJGuAXcDHgV+kd6vtv0lyMnAXsLSqXtHs49A9alwEnAK8tKpqz/nSdLJnIe3drwBfrapHqupBejdU3NMbk3w3yY3Am4CXN7+wD2lubgfwl3usc3VVPVBVjwG3AC8EXgP8bVVtq6qdwBeAXwXuAF6cZE2SlcCDe2zrAeAx4DNJfgN45Fl/a2kShoX0DDTP+Phz4J1V9UrgEuC5faz6+ITpXUzRu6+qHcCrgL8F/jW95zZMnL8TOA74b/R6Nev7/wbSvjEspL37FnBykgOTHAK8fY/548HwkyQHA+8EqKr7gYeSvLaZf2of+/oe8GtJjkiygN7zQ76Z5Ahgv6r6MnAu8OqJKzX7/Zmqugo4i16wSAPhmIW0F1V1XZK/Av4BuBe4Zo/59ye5BLgJ+PEe898HXJJkN/BNeoeLptrX3UnOBr5B7w6q/6OqrkzyKuCzScb/qDtnj1UPAa5sejkBfvcZfFWpL951VppmSQ6uqoeb6bOBJVW1uuOypGfFnoU0/d6W5Bx6/7/uBN7TbTnSs2fPQpLUygFuSVIrw0KS1MqwkCS1MiwkSa0MC0lSq/8PC8ak6d0hMjMAAAAASUVORK5CYII=\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OXRRKDX-R5_L",
+ "outputId": "8fc79075-ad94-4ca4-9064-fec56e6c73c2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 520
+ }
+ },
+ "source": [
+ "df_malignant = df[df['diagnosis']=='M']['area_mean']\n",
+ "df_benign = df[df['diagnosis']=='B']['area_mean']\n",
+ "fig = plt.figure()\n",
+ "ax = fig.add_subplot(111)\n",
+ "ax.boxplot([df_malignant,df_benign], labels=['M', 'B'])"
+ ],
+ "execution_count": 24,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "{'boxes': [,\n",
+ " ],\n",
+ " 'caps': [,\n",
+ " ,\n",
+ " ,\n",
+ " ],\n",
+ " 'fliers': [,\n",
+ " ],\n",
+ " 'means': [],\n",
+ " 'medians': [,\n",
+ " ],\n",
+ " 'whiskers': [,\n",
+ " ,\n",
+ " ,\n",
+ " ]}"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 24
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAQ2ElEQVR4nO3df2xdd3nH8fczN4vRIDRWs6hrqgWhSHNjTYVdFbZZWo00+oM/Cv9UzaQlMIsQiVpUVFpK/EdbkFvQBoxaUKtdKlqNuIsElAh1Y1XliVkaUKfqSluDGpFWTRRak2SAColC+uwPH3c3qX/GP058v++XdOVzn3Puvc9Vbj/39HvOPd/ITCRJZfi9uhuQJK0cQ1+SCmLoS1JBDH1JKoihL0kFuaTuBmZz2WWX5ebNm+tuQ5JWlYMHD/4iMzdMt+6iDv3NmzczNjZWdxuStKpExMszrXN4R5IKYuhLUkEMfUkqiKEvSQUx9CWpIHOGfkRcGREjEfFCRDwfEZ+q6ndFxNGIeKa63dj0mM9ExKGI+GlEXNdUv76qHYqIO5bnLel8fX19tLe3ExG0t7fT19dXd0uSajKfPf3fAbdn5lXA+4FPRsRV1bovZ+bV1e1xgGrdLcBW4HrgaxHRFhFtwFeBG4CrgG1Nz6Nl0tfXx9DQEPfccw+vv/4699xzD0NDQwa/VKg5z9PPzGPAsWr51xExDlwxy0NuAh7NzNPA4Yg4BFxTrTuUmT8DiIhHq21fWET/msODDz7IF77wBT796U8DvPl3z549DA4O1tmapBosaEw/IjYD7wF+WJVujYhnI+KhiFhf1a4AXml62JGqNlP9/NfYGRFjETE2MTGxkPY0jdOnT7Nr165zart27eL06dM1dSSpTvMO/Yh4O/BN4LbM/BVwP/Bu4Gom/0/gi0vRUGY+kJmNzGxs2DDtr4i1AGvXrmVoaOic2tDQEGvXrq2pI0l1mtdlGCJiDZOB/43M/BZAZr7atP5B4LvV3aPAlU0P31TVmKWuZfLxj3+c3bt3A5N7+ENDQ+zevfste/+SyjBn6EdEAHuB8cz8UlP98mq8H+AjwHPV8gFgX0R8CfgjYAvwIyCALRHxLibD/hbgb5bqjWh6U+P2e/bs4fbbb2ft2rXs2rXL8XypUDHXHLkR0Q38F/Bj4I2qvAfYxuTQTgIvAZ+Y+hKIiH7g75g88+e2zPy3qn4j8E9AG/BQZg7M9tqNRiO94JokLUxEHMzMxrTrLuaJ0Q19SVq42ULfX+RKUkEMfUkqiKEvSQUx9CWpIIa+JBXE0Jekghj6klQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGfgGGh4fp6uqira2Nrq4uhoeH625JUk3mNYmKVq/h4WH6+/vZu3cv3d3djI6O0tvbC8C2bdtq7k7SSvPSyi2uq6uLwcFBenp63qyNjIzQ19fHc889N8sjJa1WXk+/YG1tbZw6dYo1a9a8WTtz5gzt7e2cPXu2xs4kLRevp1+wzs5ORkdHz6mNjo7S2dlZU0eS6mTot7j+/n56e3sZGRnhzJkzjIyM0NvbS39/f92tSaqBB3Jb3NTB2r6+PsbHx+ns7GRgYMCDuFKhHNOXpBbjmL4kCTD0Jakohr4kFcTQl6SCGPqSVBBDX5IKYuhLUkEMfUkqiKEvSQUx9CWpIIa+JBXE0Jekghj6BXCOXElT5gz9iLgyIkYi4oWIeD4iPlXVOyLiiYh4sfq7vqpHRNwXEYci4tmIeG/Tc+2otn8xInYs39vSlKk5cgcHBzl16hSDg4P09/cb/FKh5ry0ckRcDlyemU9HxDuAg8CHgY8CJzLz8xFxB7A+M3dHxI1AH3Aj8D7gK5n5vojoAMaABpDV8/xZZp6c6bW9tPLiOUeuVJ5FXVo5M49l5tPV8q+BceAK4Cbg4Wqzh5n8IqCqP5KTfgBcWn1xXAc8kZknqqB/Arh+Ee9L8zA+
Pk53d/c5te7ubsbHx2vqSFKdFjSmHxGbgfcAPwQ2ZuaxatXPgY3V8hXAK00PO1LVZqqf/xo7I2IsIsYmJiYW0p6m4Ry5kprNO/Qj4u3AN4HbMvNXzetycoxoSabgyswHMrORmY0NGzYsxVMWzTlyJTWb1xy5EbGGycD/RmZ+qyq/GhGXZ+axavjmtap+FLiy6eGbqtpR4Nrz6v954a1rPpwjV1Kz+RzIDSbH7E9k5m1N9X8AjjcdyO3IzL+PiA8Bt/L/B3Lvy8xrqgO5B4Gps3meZvJA7omZXtsDuZK0cLMdyJ3Pnv5fAn8L/Dginqlqe4DPA/sjohd4Gbi5Wvc4k4F/CPgN8DGAzDwREZ8Dnqq2++xsgS9JWnpz7unXyT19SVq4RZ2yKUlqHYa+JBXE0Jekghj6klSQeZ2nr9Vl8izbhbmYD+hLWjqGfguaKcAjwnCXCufwjiQVxNCXpIIY+pJUEENfkgpi6EtSQQx9SSqIoS9JBTH0Jakghr4kFcTQl6SCGPqSVBBDX5IKYuhLUkEMfUkqiKEvSQUx9CWpIIa+JBXE0Jekghj6klQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIIY+pJUkDlDPyIeiojXIuK5ptpdEXE0Ip6pbjc2rftMRByKiJ9GxHVN9eur2qGIuGPp34okaS7z2dP/OnD9NPUvZ+bV1e1xgIi4CrgF2Fo95msR0RYRbcBXgRuAq4Bt1baSpBV0yVwbZOb3I2LzPJ/vJuDRzDwNHI6IQ8A11bpDmfkzgIh4tNr2hQV3LEm6YIsZ0781Ip6thn/WV7UrgFeatjlS1Waqv0VE7IyIsYgYm5iYWER7kqTzXWjo3w+8G7gaOAZ8cakayswHMrORmY0NGzYs1dNKkpjH8M50MvPVqeWIeBD4bnX3KHBl06abqhqz1CVJK+SC9vQj4vKmux8Bps7sOQDcEhFrI+JdwBbgR8BTwJaIeFdE/D6TB3sPXHjbkqQLMeeefkQMA9cCl0XEEeBO4NqIuBpI4CXgEwCZ+XxE7GfyAO3vgE9m5tnqeW4Fvge0AQ9l5vNL/m4kSbOKzKy7hxk1Go0cGxuru42WERFczP/ekpZGRBzMzMZ06/xFriQVxNCXpIIY+pJUEENfkgpi6EtSQQx9SSqIoS9JBTH0Jakghr4kFcTQl6SCGPqSVBBDX5IKYuhLUkEMfUkqiKEvSQUx9CWpIIa+JBXE0Jekghj6klQQQ38V6+joICLmfQMWtH1E0NHRUfO7lLSULqm7AV24kydPLvtE51NfFpJag3v6klQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIIY+pJUEENfkgpi6EtSQQx9SSrInKEfEQ9FxGsR8VxTrSMinoiIF6u/66t6RMR9EXEoIp6NiPc2PWZHtf2LEbFjed6OJGk289nT/zpw/Xm1O4AnM3ML8GR1H+AGYEt12wncD5NfEsCdwPuAa4A7p74oJJVteHiYrq4u2tra6OrqYnh4uO6WWtqcoZ+Z3wdOnFe+CXi4Wn4Y+HBT/ZGc9APg0oi4HLgOeCIzT2TmSeAJ3vpFIqkww8PD9Pf3Mzg4yKlTpxgcHKS/v9/gX0YXOqa/MTOPVcs/BzZWy1cArzRtd6SqzVR/i4jYGRFjETE2MTFxge1JWg0GBgbYu3cvPT09rFmzhp6eHvbu3cvAwEDdrbWsRR/IzclZPJZsJo/MfCAzG5nZ2LBhw1I9raSL0Pj4ON3d3efUuru7GR8fr6mj1nehof9qNWxD9fe1qn4UuLJpu01Vbaa6pIJ1dnYyOjp6Tm10dJTOzs6aOmp9Fxr6B4CpM3B2AN9pqm+vzuJ5P/DLahjoe8AHI2J9dQD3g1VNUsH6+/vp7e1lZGSEM2fOMDIyQm9vL/39/XW31rLmnCM3IoaBa4HLIuIIk2fhfB7YHxG9wMvAzdXmjwM3AoeA3wAfA8jMExHxOeCparvPZub5B4clFWbb
tm0A9PX1MT4+TmdnJwMDA2/WtfRiuSfWXoxGo5FjY2N1t3HRiogVmRj9Yv6MSHqriDiYmY3p1s25p6+LV965Du565/K/hqSWYeivYnH3r1ZmT/+uZX0JFW54eJiBgYE3h3f6+/sd3llGhr6k2kz9OGvv3r10d3czOjpKb28vgMG/TLzgmqTa+OOsleeB3FXMA7la7dra2jh16hRr1qx5s3bmzBna29s5e/ZsjZ2tbrMdyHVPX1JtOjs7ufvuu8+54Nrdd9/tj7OWkaEvqTY9PT3ce++9HD9+HIDjx49z77330tPTU3NnrcvQl1Sbxx57jHXr1tHe3k5m0t7ezrp163jsscfqbq1lGfqSanPkyBH279/P4cOHeeONNzh8+DD79+/nyJEjdbfWsgx9SSqIoS+pNps2bWL79u3nXHBt+/btbNq0qe7WWpY/zpK0YiJi2voHPvCBWbf1tOGl456+pBWTmW+57du3j61btwKwdetW9u3b95ZttHT8cdYq5o+z1Er8rC0df5wlSQIMfUkqiqEvSQUx9CWpIIa+JBXE8/RXuZnOe14q69evX9bnl7SyDP1VbKGnt3lKnCSHdySpIIa+JBXE0Jekghj6klQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIIY+pKWXEdHBxGxoBuwoO07Ojpqfper06IuuBYRLwG/Bs4Cv8vMRkR0AP8KbAZeAm7OzJMx+a/6FeBG4DfARzPz6cW8vqSL08mTJ1dk/mYt3FLs6fdk5tVNk/DeATyZmVuAJ6v7ADcAW6rbTuD+JXhtSdICLMfwzk3Aw9Xyw8CHm+qP5KQfAJdGxOXL8PqSpBksNvQT+I+IOBgRO6vaxsw8Vi3/HNhYLV8BvNL02CNV7RwRsTMixiJibGJiYpHtSZKaLXYSle7MPBoRfwg8ERE/aV6ZmRkRCxrYy8wHgAcAGo2GM35I0hJa1J5+Zh6t/r4GfBu4Bnh1atim+vtatflR4Mqmh2+qapKkFXLBoR8RfxAR75haBj4IPAccAHZUm+0AvlMtHwC2x6T3A79sGgaSJK2AxQzvbAS+XZ02dQmwLzP/PSKeAvZHRC/wMnBztf3jTJ6ueYjJUzY/tojXlnQRyzvXwV3vXP7X0ILFxTxRdqPRyLGxsbrbaBlOjK6VshKfNT/PM4uIg02n0Z/DX+RKUkEMfUkqiKEvSQUx9CWpIIv9cZYkTWu5L4i2fv36ZX3+VmXoS1pyF3JWjWfjrAxDvwXNtoc10zr/Y5PKYOi3IANc0kw8kCtJBTH0Jakghr4kFcTQl6SCGPqSVBBDX5IKYuhLUkEMfUkqiKEvSQUx9CWpIIa+JBXE0Jekghj6klQQQ1+SCmLoS1JBDH1JKoiTqEhaMXPNm+vMbsvP0Je0Ygzv+jm8I0kFMfQlqSCGviQVxNCXpIIY+pJUEENfkgpi6EtSQQx9SSpIXMw/loiICeDluvtoIZcBv6i7CWkGfj6Xzh9n5obpVlzUoa+lFRFjmdmouw9pOn4+V4bDO5JUEENfkgpi6JflgbobkGbh53MFOKYvSQVxT1+SCmLoS1JBDP0WFxEZEf/SdP+SiJiIiO/W2ZcEEBFnI+KZiPifiHg6Iv6i7p5anTNntb7Xga6IeFtm/hb4a+BozT1JU36bmVcDRMR1wL3AX9XbUmtzT78MjwMfqpa3AcM19iLNZB1wsu4mWp2hX4ZHgVsioh34U+CHNfcjTXlbNbzzE+Cfgc/V3VCrc3inAJn5bERsZnIv//F6u5HO0Ty88+fAIxHRlZ5Lvmzc0y/HAeAfcWhHF6nM/G8mL7o27YXCtDTc0y/HQ8D/ZuaPI+LaupuRzhcRfwK0Acfr7qWVGfqFyMwjwH119yGd520R8Uy1HMCOzDxbZ0OtzsswSFJBHNOXpIIY+pJUEENfkgpi6EtSQQx9SSqIoS9JBTH0Jakg/wcOCx6LlH8FlQAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "oe8yTLWjR_ab",
+ "outputId": "b18523cb-2dd3-4ebf-9a4d-bbaf355cc050",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "df_malignant.head(5)"
+ ],
+ "execution_count": 11,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0 1001.0\n",
+ "1 1326.0\n",
+ "2 1203.0\n",
+ "3 386.1\n",
+ "4 1297.0\n",
+ "Name: area_mean, dtype: float64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 11
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nNlhW6FsSN8z",
+ "outputId": "f2f8a202-32cd-4e03-b75f-0c0ec398b5de",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "df_benign.head(5)\n"
+ ],
+ "execution_count": 12,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "19 566.3\n",
+ "20 520.0\n",
+ "21 273.9\n",
+ "37 523.8\n",
+ "46 201.9\n",
+ "Name: area_mean, dtype: float64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 12
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DjPKKq01YjF9",
+ "outputId": "372960aa-7d85-4514-8f12-28b3a2ec4259",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "import numpy as np\n",
+ "np.random.seed(20111974)\n",
+ "\n",
+ "# Simulating financial asset returns drawn from a Normal(0, 1) distribution:\n",
+ "a_retornos = np.random.normal(0, 1, 100)\n",
+ "print(f'Mean: {np.mean(a_retornos)}')"
+ ],
+ "execution_count": 13,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Mean: -0.016996335492713833\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ajjlfqgssLVO",
+ "outputId": "cb581b84-1c27-4c09-fee9-2d1a8ab56034",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 357
+ }
+ },
+ "source": [
+ "a_retornos"
+ ],
+ "execution_count": 14,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n",
+ " 1.04930857, -0.12607366, 1.06227632, 1.13807032, 1.37966044,\n",
+ " -2.05995563, 0.67474814, 0.72722843, -0.33923852, 0.43613107,\n",
+ " 0.59135489, -1.29281877, 1.17712036, -0.98644163, -1.79034143,\n",
+ " -1.08913605, -0.90712825, -1.02291108, -1.36445713, -0.29429164,\n",
+ " 0.06343709, -1.14196185, -0.50706079, -0.83539436, -1.41492946,\n",
+ " -0.2159062 , -1.16519474, -0.60767518, -0.61510925, 1.0771542 ,\n",
+ " 0.5043687 , 0.02674197, 1.83494644, 0.34728874, -1.14671885,\n",
+ " -0.59841423, -0.42698353, 0.10901983, -0.75168457, 0.71689294,\n",
+ " -0.50810299, 0.47524103, -0.38248511, -1.37491973, 1.5355728 ,\n",
+ " -0.27356178, 0.68072592, -1.80454873, 1.16995833, -0.37988822,\n",
+ " 0.19305861, 1.53792436, -0.11802807, -0.97621103, -1.23463994,\n",
+ " 1.0504434 , 1.91481015, 0.80359454, 0.35869561, 1.03409992,\n",
+ " -0.37200685, 0.32947575, 0.70038627, -0.98085533, -1.21072144,\n",
+ " 0.74366412, 0.18372348, 0.10430302, -0.78160841, -0.0423915 ,\n",
+ " 1.67094293, -1.07256479, -0.5493723 , -1.83082917, 0.11510819,\n",
+ " 1.3911365 , -0.28940563, 0.31904722, -0.70009623, -0.4353552 ,\n",
+ " -2.0301258 , -0.14205882, 1.66292963, -0.57691495, -0.78963384,\n",
+ " -0.80660503, 0.05581487, 0.8715663 , -0.3499477 , 1.37366912,\n",
+ " 0.88027638, -1.47925906, -0.40657104, -0.18789895, 0.47475142])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 14
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XZ3m06gv9lei"
+ },
+ "source": [
+ "Below is the boxplot of the a_retornos array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QtuwJP449tBQ",
+ "outputId": "c0267932-7470-483d-daca-f1913a5f4772",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 269
+ }
+ },
+ "source": [
+ "# Import seaborn, one of the main data visualization libraries (another is matplotlib)\n",
+ "import seaborn as sns\n",
+ "\n",
+ "sns.boxplot(y = a_retornos)"
+ ],
+ "execution_count": 15,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 15
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXIAAADrCAYAAAB0Oh02AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAIPUlEQVR4nO3d3YtdVxnH8d/TxJeIikiHCqMxyogiIgiDIF4I6kXtjSgIeiGIQvTCYQRBlP4JghAGbwKKN6I3WhSM+AKCCCpORKS1UQ6C2MGX0YIWEpXY5YUR25pmzuRsZ5+n+XwgkLNnZu2HkHxZWdknU2OMANDXXXMPAMBqhBygOSEHaE7IAZoTcoDmhBygudNz3PTuu+8e586dm+PWAG1dvnz5T2OMjadenyXk586dy/7+/hy3Bmirqn5zs+uOVgCaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoLlZniNnfezt7WWxWMw9xlo4ODhIkmxubs48yXrY2trKzs7O3GOwBCGHG65duzb3CHBbhPwOZ8f1X7u7u0mSCxcuzDwJHI8zcoDmhBygOSEHaE7IAZoTcoDmhBygOSEHaE7IAZoTcoDmhBygOSEHaE7IAZoTcoDmhBygOSEHaE7IAZpbOeRV9bKq+l5V/aKqHqqq3SkGA2A5U3yHoOtJPj7G+GlVvSDJ5ar6zhjjFxOsDcARVt6RjzF+N8b46Y2fP5bk4SS+ey3ACZn0jLyqziV5Q5IfT7kuAE9vspBX1fOTfCXJx8YYf73Jx89X1X5V7R8eHk51W4A73iQhr6pn5d8R/+IY46s3+5wxxsUxxvYYY3tjY2OK2wKQaZ5aqSSfS/LwGOMzq48EwHFMsSN/c5L3J3lrVf3sxo/7JlgXgCWs/PjhGOMHSWqCWQC4Dd7ZCdCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0J+QAzQk5QHNCDtCckAM0N0nIq+rzVfXHqnpwivUAWN5UO/IvJLl3orUAOIZJQj7G+H6SR6dYC4DjcUYO0NyJhbyqzlfVflXtHx4entRtAZ7xTizkY4yLY4ztMcb2xsbGSd0W4BnP0QpAc1M9fvilJD9M8uqqeqSqPjTFugAc7fQUi4wx3jfFOgAcn6MVgOaEHKA5IQdoTsgBmhNygOaEHKC5SR4/7GZvby+LxWLuMVgz//k9sbu7O/MkrJutra3s7OzMPcbTuiNDvlgs8rMHH84/n/fiuUdhjdz1j5EkufzrP8w8Cevk1NX1/49d78iQJ8k/n/fiXHvNfXOPAay5M1cuzT3CkZyRAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNCTlAc0IO0JyQAzQn5ADNTRLyqrq3qn5ZVYuq+uQUawKwnJVDXlWnknw2yTuSvDbJ+6rqtauuC8ByptiRvzHJYozx6zHGP5J8Ock7J1gXgCVMEfLNJL99wutHblx7kqo6X1X7VbV/eHg4wW0BSE7wHzvHGBfHGNtjjO2NjY2Tui3AM94UIT9I8rInvH7pjWsAnIApQv6TJK+qqldU1bOTvDfJ1ydYF4AlnF51gTHG9ar6aJJvJTmV5PNjjIdWngyApawc8iQZY1xKcmmKtU7CwcFBTl39S85caTMyMJNTV/+cg4Prc49xS97ZCdDcJDvybjY3N/P7v5/OtdfcN/cowJo7c+VSNjfvmXuMW7IjB2hOyAGaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoDkhB2hOyAGaE3KA5oQcoDkhB2hOyAGaE3KA
5oQcoDkhB2hOyAGaE3KA5oQcoDkhB2ju9NwDzOXU1Udz5sqlucdgjdz1t78mSR5/7gtnnoR1curqo0numXuMW7ojQ761tTX3CKyhxeKxJMnWK9f7Dy0n7Z61b8YdGfKdnZ25R2AN7e7uJkkuXLgw8yRwPM7IAZoTcoDmhBygOSEHaG6lkFfVe6rqoap6vKq2pxoKgOWtuiN/MMm7k3x/glkAuA0rPX44xng4SapqmmkAODZn5ADNHbkjr6rvJnnJTT50/xjja8veqKrOJzmfJGfPnl16QABu7ciQjzHePsWNxhgXk1xMku3t7THFmgA4WgFob9XHD99VVY8keVOSb1TVt6YZC4BlrfrUygNJHphoFgBug6MVgOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKA5IQdoTsgBmhNygOaEHKC5lUJeVZ+uqitV9fOqeqCqXjTVYAAsZ9Ud+XeSvG6M8fokv0ryqdVHAuA4Vgr5GOPbY4zrN17+KMlLVx8JgOOY8oz8g0m+OeF6ACzh9FGfUFXfTfKSm3zo/jHG1258zv1Jrif54i3WOZ/kfJKcPXv2toYF4H/VGGO1Bao+kOTDSd42xri6zNdsb2+P/f39le7LNPb29rJYLOYeYy3859dha2tr5knWw9bWVnZ2duYegyeoqstjjO2nXj9yR37Eovcm+USStywbcVhXZ86cmXsEuC0r7cirapHkOUn+fOPSj8YYHznq6+zIAY7v/7IjH2P4OyjAzLyzE6A5IQdoTsgBmhNygOaEHKA5IQdoTsgBmlv5Lfq3ddOqwyS/OfEbw9HuTvKnuYeAp/HyMcbGUy/OEnJYV1W1f7N3zsE6c7QC0JyQAzQn5PBkF+ceAI7LGTlAc3bkAM0JOUBzQg7QnJADNCfkAM39C46PaZwmexaoAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "o9ujdjxNY6qE"
+ },
+ "source": [
+ "# Use the function np.percentile(array, q=[p1, p2, ..., pk]), where each p is a percentile between 0 and 100\n",
+ "percentis = np.percentile(a_retornos, q = [1, 5, 25, 50, 55, 75, 99])\n",
+ "\n",
+ "# First quartile: the 25th percentile, at index 2 of the q list above\n",
+ "q1 = percentis[2]"
+ ],
+ "execution_count": 17,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c75g2Egco2lc"
+ },
+ "source": [
+ "At which position of the percentis array is Q3 (the 75th percentile) located?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nZr-A82Zo8Kb",
+ "outputId": "29c67f4b-eeb2-4072-916f-e7abaed09521",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "q3 = percentis[5]\n",
+ "\n",
+ "# or, indexing from the end of the array:\n",
+ "q3_2 = percentis[-2]\n",
+ "print(q3, q3_2)"
+ ],
+ "execution_count": 18,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "0.7194768106252311 0.7194768106252311\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
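+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Picking quartiles out of `percentis` by position is easy to get wrong when the list of percentiles changes. As a hedged sketch (the helper name `percentile_table` and the sample data are illustrative assumptions, not part of the original notebook), the results can be keyed by percentile instead:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "def percentile_table(values, qs):\n",
+ "    # Hypothetical helper: map each requested percentile to its value\n",
+ "    results = np.percentile(values, q=qs)\n",
+ "    return {q: float(v) for q, v in zip(qs, results)}\n",
+ "\n",
+ "# Illustrative data: the integers 1..9\n",
+ "table = percentile_table(np.arange(1, 10), [25, 50, 75])\n",
+ "q1, q3 = table[25], table[75]\n",
+ "print(table)\n",
+ "```\n",
+ "\n",
+ "Looking values up as `table[75]` keeps the code correct even if more percentiles are added to the list later."
+ ]
+ },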
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sWrnESPQT4JM"
+ },
+ "source": [
+ "# Lower and upper limits for outlier detection: 1.5 * IQR below Q1 and above Q3\n",
+ "lim_inferior_outlier = q1 - 1.5 * (q3 - q1)\n",
+ "lim_superior_outlier = q3 + 1.5 * (q3 - q1)"
+ ],
+ "execution_count": 19,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Yb4-ZJlUUYsi",
+ "outputId": "a93c3b50-4537-4a59-dd59-b5e981ed9e74",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'Lower limit: {lim_inferior_outlier}; Upper limit: {lim_superior_outlier}'"
+ ],
+ "execution_count": 20,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Lower limit: -3.0382521297304486; Upper limit: 2.974114174838639'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 20
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Jr6oXIHlUxOe",
+ "outputId": "fb66a413-161e-4556-b635-cff4c11b973e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.min(a_retornos)"
+ ],
+ "execution_count": 21,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "-2.0599556303504514"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 21
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "UxE47cN0U54X",
+ "outputId": "a1a6bb8b-6cae-4af3-85f9-d19c9e6a7897",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.max(a_retornos)"
+ ],
+ "execution_count": 22,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "2.506276801325959"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 22
+ }
+ ]
+ },
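+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The minimum and maximum above both fall inside the limits, so this sample has no outliers. As a rough sketch of how the same rule can flag outliers in general (the function name `iqr_outliers` and the toy data are assumptions made for illustration), a boolean mask does the filtering:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "def iqr_outliers(values, k=1.5):\n",
+ "    # Hypothetical helper: returns (lower limit, upper limit, outliers)\n",
+ "    values = np.asarray(values, dtype=float)\n",
+ "    q1, q3 = np.percentile(values, q=[25, 75])\n",
+ "    iqr = q3 - q1\n",
+ "    lower, upper = q1 - k * iqr, q3 + k * iqr\n",
+ "    # True where a value lies outside the [lower, upper] interval\n",
+ "    mask = (values < lower) | (values > upper)\n",
+ "    return lower, upper, values[mask]\n",
+ "\n",
+ "# Toy data with one obvious outlier\n",
+ "lower, upper, outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])\n",
+ "print(lower, upper, outliers)\n",
+ "```\n",
+ "\n",
+ "The same function could be called with `a_retornos` to confirm that no value is flagged."
+ ]
+ },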
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OTB9HnIac499"
+ },
+ "source": [
+ "___\n",
+ "# **Sorting the items of an array**\n",
+ "> Consider the following array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Jgj8Yw46dBMx",
+ "outputId": "06107adb-7ff6-4d9e-edc8-27d82e39c8e7",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.random.random(10)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 25,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.53097233, 0.56965626, 0.54252938, 0.65478409, 0.85708456,\n",
+ " 0.60174181, 0.87298309, 0.45573342, 0.67336717, 0.64300912])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 25
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cC9272GFdRln"
+ },
+ "source": [
+ "Sorting the items of a_conjunto1..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YUP90nBVdUeF",
+ "outputId": "b781aeb1-2418-4d6f-ddbb-a122451aab4f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "np.sort(a_conjunto1)"
+ ],
+ "execution_count": 26,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.45573342, 0.53097233, 0.54252938, 0.56965626, 0.60174181,\n",
+ " 0.64300912, 0.65478409, 0.67336717, 0.85708456, 0.87298309])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 26
+ }
+ ]
+ },
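+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`np.sort` returns a sorted copy and leaves `a_conjunto1` unchanged. A small sketch of related options, using made-up values rather than the notebook's array: descending order, the indices that would sort the array, and in-place sorting.\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "a = np.array([0.53, 0.57, 0.46, 0.65])  # made-up values\n",
+ "\n",
+ "ascending = np.sort(a)         # sorted copy; a is unchanged\n",
+ "descending = np.sort(a)[::-1]  # reverse the sorted copy\n",
+ "order = np.argsort(a)          # indices that would sort a\n",
+ "\n",
+ "b = a.copy()\n",
+ "b.sort()                       # sorts b in place and returns None\n",
+ "print(ascending, descending, order)\n",
+ "```\n",
+ "\n",
+ "Use `np.sort` when the original order still matters, and the in-place `.sort()` method when it does not."
+ ]
+ },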
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lG763cDGj-yB"
+ },
+ "source": [
+ "___\n",
+ "# **Getting help**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ehxPlD3EkEYL",
+ "outputId": "c63a079a-285f-4c41-fe27-84630a824b58",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ }
+ },
+ "source": [
+ "help(np.random.normal)"
+ ],
+ "execution_count": 27,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Help on built-in function normal:\n",
+ "\n",
+ "normal(...) method of numpy.random.mtrand.RandomState instance\n",
+ " normal(loc=0.0, scale=1.0, size=None)\n",
+ " \n",
+ " Draw random samples from a normal (Gaussian) distribution.\n",
+ " \n",
+ " The probability density function of the normal distribution, first\n",
+ " derived by De Moivre and 200 years later by both Gauss and Laplace\n",
+ " independently [2]_, is often called the bell curve because of\n",
+ " its characteristic shape (see the example below).\n",
+ " \n",
+ " The normal distributions occurs often in nature. For example, it\n",
+ " describes the commonly occurring distribution of samples influenced\n",
+ " by a large number of tiny, random disturbances, each with its own\n",
+ " unique distribution [2]_.\n",
+ " \n",
+ " .. note::\n",
+ " New code should use the ``normal`` method of a ``default_rng()``\n",
+ " instance instead; see `random-quick-start`.\n",
+ " \n",
+ " Parameters\n",
+ " ----------\n",
+ " loc : float or array_like of floats\n",
+ " Mean (\"centre\") of the distribution.\n",
+ " scale : float or array_like of floats\n",
+ " Standard deviation (spread or \"width\") of the distribution. Must be\n",
+ " non-negative.\n",
+ " size : int or tuple of ints, optional\n",
+ " Output shape. If the given shape is, e.g., ``(m, n, k)``, then\n",
+ " ``m * n * k`` samples are drawn. If size is ``None`` (default),\n",
+ " a single value is returned if ``loc`` and ``scale`` are both scalars.\n",
+ " Otherwise, ``np.broadcast(loc, scale).size`` samples are drawn.\n",
+ " \n",
+ " Returns\n",
+ " -------\n",
+ " out : ndarray or scalar\n",
+ " Drawn samples from the parameterized normal distribution.\n",
+ " \n",
+ " See Also\n",
+ " --------\n",
+ " scipy.stats.norm : probability density function, distribution or\n",
+ " cumulative density function, etc.\n",
+ " Generator.normal: which should be used for new code.\n",
+ " \n",
+ " Notes\n",
+ " -----\n",
+ " The probability density for the Gaussian distribution is\n",
+ " \n",
+ " .. math:: p(x) = \\frac{1}{\\sqrt{ 2 \\pi \\sigma^2 }}\n",
+ " e^{ - \\frac{ (x - \\mu)^2 } {2 \\sigma^2} },\n",
+ " \n",
+ " where :math:`\\mu` is the mean and :math:`\\sigma` the standard\n",
+ " deviation. The square of the standard deviation, :math:`\\sigma^2`,\n",
+ " is called the variance.\n",
+ " \n",
+ " The function has its peak at the mean, and its \"spread\" increases with\n",
+ " the standard deviation (the function reaches 0.607 times its maximum at\n",
+ " :math:`x + \\sigma` and :math:`x - \\sigma` [2]_). This implies that\n",
+ " normal is more likely to return samples lying close to the mean, rather\n",
+ " than those far away.\n",
+ " \n",
+ " References\n",
+ " ----------\n",
+ " .. [1] Wikipedia, \"Normal distribution\",\n",
+ " https://en.wikipedia.org/wiki/Normal_distribution\n",
+ " .. [2] P. R. Peebles Jr., \"Central Limit Theorem\" in \"Probability,\n",
+ " Random Variables and Random Signal Principles\", 4th ed., 2001,\n",
+ " pp. 51, 51, 125.\n",
+ " \n",
+ " Examples\n",
+ " --------\n",
+ " Draw samples from the distribution:\n",
+ " \n",
+ " >>> mu, sigma = 0, 0.1 # mean and standard deviation\n",
+ " >>> s = np.random.normal(mu, sigma, 1000)\n",
+ " \n",
+ " Verify the mean and the variance:\n",
+ " \n",
+ " >>> abs(mu - np.mean(s))\n",
+ " 0.0 # may vary\n",
+ " \n",
+ " >>> abs(sigma - np.std(s, ddof=1))\n",
+ " 0.1 # may vary\n",
+ " \n",
+ " Display the histogram of the samples, along with\n",
+ " the probability density function:\n",
+ " \n",
+ " >>> import matplotlib.pyplot as plt\n",
+ " >>> count, bins, ignored = plt.hist(s, 30, density=True)\n",
+ " >>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *\n",
+ " ... np.exp( - (bins - mu)**2 / (2 * sigma**2) ),\n",
+ " ... linewidth=2, color='r')\n",
+ " >>> plt.show()\n",
+ " \n",
+ " Two-by-four array of samples from N(3, 6.25):\n",
+ " \n",
+ " >>> np.random.normal(3, 2.5, size=(2, 4))\n",
+ " array([[-4.49401501, 4.00950034, -1.81814867, 7.29718677], # random\n",
+ " [ 0.39924804, 4.68456316, 4.99394529, 4.84057254]]) # random\n",
+ "\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1Q_konJVaBsV"
+ },
+ "source": [
+ "___\n",
+ "# **Creating 1D arrays**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DddZT5kadYJ7"
+ },
+ "source": [
+ "import numpy as np\n",
+ "np.set_printoptions(precision = 2, suppress = True)\n",
+ "np.random.seed(seed = 20111974)"
+ ],
+ "execution_count": 28,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jaqd-VnF3yIt"
+ },
+ "source": [
+ "Create the 1D array a_conjunto1 with the following numbers:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E3niz_zHaF3e",
+ "outputId": "75a9c3a7-2fd8-4b30-bde7-0978cebc5f69",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 29,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 29
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DyfXbW_ZKJBS"
+ },
+ "source": [
+ "How many dimensions does a_conjunto1 have?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gbHlydALKB3R",
+ "outputId": "c15659d6-930f-4369-a015-a4830b141152",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Number of dimensions of the array\n",
+ "a_conjunto1.ndim"
+ ],
+ "execution_count": 30,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 30
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "am9otElpKNPa"
+ },
+ "source": [
+ "What is the shape (the size along each axis) of the array a_conjunto1?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "juJJ74d2wale",
+ "outputId": "116b39b3-f216-4863-93c5-3b14875116ae",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Number of items along each axis of the array\n",
+ "a_conjunto1.shape"
+ ],
+ "execution_count": 31,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(10,)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 31
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BHg4Rre3GwPy"
+ },
+ "source": [
+ "The array a_conjunto1 could also have been created with the function np.arange(start, stop, step):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I3fyusN7G5Zn"
+ },
+ "source": [
+ "# Remember that the stop value 10 is exclusive.\n",
+ "a_conjunto2 = np.arange(start = 0, stop = 10, step = 1)"
+ ],
+ "execution_count": 32,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IHCEpmUxXsaK"
+ },
+ "source": [
+ "Another alternative is np.linspace(start = 0, stop = 9, num = 10). See below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JB9Y_x3RX1GX"
+ },
+ "source": [
+ "# With np.linspace, the stop value 9 is inclusive.\n",
+ "a_conjunto3 = np.linspace(0, 9, 10)"
+ ],
+ "execution_count": 33,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "P6MR8MPeYOZm"
+ },
+ "source": [
+ "Compare the results of a_conjunto1, a_conjunto2 and a_conjunto3 below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tWEzge6HYSFu",
+ "outputId": "33db7b05-003a-457c-f58e-dbd7458e2921",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 34,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 34
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lUNlFVKYYT9f",
+ "outputId": "f41f95ef-c618-4b58-f910-365d543978d8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto2"
+ ],
+ "execution_count": 35,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 35
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Xo8Lid5fYVPW",
+ "outputId": "2d2c9a69-d58c-4878-c36c-d778b6077657",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto3"
+ ],
+ "execution_count": 36,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 36
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "V9aW7C4vHAcF"
+ },
+ "source": [
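+ "In other words, a_conjunto1, a_conjunto2 and a_conjunto3 hold the same values. Note, however, that a_conjunto3 is an array of floats, because np.linspace returns floats by default.\n",
+ "\n",
+ "**ATTENTION**: The syntax used to create a_conjunto3 is slightly different from the one used for a_conjunto1 and a_conjunto2. The basic syntax of np.linspace is:\n",
+ "\n",
+ "`np.linspace(start, stop, num)`, where `num` is the number of evenly spaced values to generate and `stop` is included by default.\n",
+ "\n",
+ "Source: [HOW TO USE THE NUMPY LINSPACE FUNCTION](https://www.sharpsightlabs.com/blog/numpy-linspace/)"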
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KNnwZa3uvYqE"
+ },
+ "source": [
+ "Add 2 to each item of a_conjunto1:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Jt2KVyviw0bp",
+ "outputId": "23ebdf9d-aa2d-4e6a-fd90-77624f196a41",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 37,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 37
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "arROkhWXbdTW",
+ "outputId": "b104cb80-440f-4af8-8632-624a8213984c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto2 = a_conjunto1 + 2\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 38,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 38
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZJx2vG86vdVi"
+ },
+ "source": [
+ "Multiply each item of a_conjunto1 by 10:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Vm7abO6Ebkun",
+ "outputId": "32f8afea-e559-4d67-e4f2-53c17b28185a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1 = a_conjunto1*10\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 39,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 39
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0Ev1xnBwaYJG"
+ },
+ "source": [
+ "___\n",
+ "# **Creating multidimensional arrays**\n",
+ "> A 2D array, for example, is usually called a matrix."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gHaeAug5vjjd"
+ },
+ "source": [
+ "Create an array with 2 rows and 3 columns filled with random numbers:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PJaLyBO5TkgM"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VDi0vIPSYR4F",
+ "outputId": "ef57de03-9fdf-420a-c3c5-cdd981c03be6",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.random.randn(2, 3)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 40,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[2.51, 1.11, 2.06],\n",
+ " [0.56, 0.3 , 1.05]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 40
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DIdd-nA3tJjV"
+ },
+ "source": [
+ "## Shape of an array\n",
+ "> The shape gives the number of rows and columns of the matrix."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pKvjjnkrK-v7",
+ "outputId": "ccab898f-5653-46b7-e1e1-ae78032592bf",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1.shape"
+ ],
+ "execution_count": 41,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(2, 3)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 41
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-DHS5jXELCfa"
+ },
+ "source": [
+ "a_conjunto1 is a 2D array (or matrix) with 2 rows, where each row has 3 elements."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HJI6X1wvv4Bg"
+ },
+ "source": [
+ "Create an array with 3 rows and 3 columns:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hXPbWh3Tv26T",
+ "outputId": "65a25251-38bc-44b0-9cd0-53b2a35dbb92",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "a_conjunto2 = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 42,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[1, 2, 3],\n",
+ " [4, 5, 6],\n",
+ " [7, 8, 9]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 42
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "we6ZJOICc7bQ",
+ "outputId": "05b14f8d-89b2-4eea-9e9c-c003cbdcaf7b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Number of rows and columns of a_conjunto1:\n",
+ "a_conjunto1.shape"
+ ],
+ "execution_count": 43,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(2, 3)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 43
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "f0ocwuI1dED6",
+ "outputId": "e438e744-a40f-4b60-8af1-062e763170ba",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Number of rows and columns of a_conjunto2\n",
+ "a_conjunto2.shape"
+ ],
+ "execution_count": 44,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(3, 3)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 44
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CApPtnW0YuRP",
+ "outputId": "ac8254a1-4de1-4541-ac74-90525d79463e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "# Add 2 to each element of a_conjunto2\n",
+ "a_conjunto2 = a_conjunto2+2\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 45,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 3, 4, 5],\n",
+ " [ 6, 7, 8],\n",
+ " [ 9, 10, 11]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 45
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "M87aGmxRY3RW",
+ "outputId": "be71ec2c-fd81-424c-c842-1ab347cbec8f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "# Multiply each element of a_conjunto2 by 10\n",
+ "a_conjunto2 = a_conjunto2*10\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 46,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 30, 40, 50],\n",
+ " [ 60, 70, 80],\n",
+ " [ 90, 100, 110]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 46
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qZt93y1IL_v7"
+ },
+ "source": [
+ "___\n",
+ "# **Copying arrays**\n",
+ "> Consider the array below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sH2FTXj5MRRC",
+ "outputId": "9507249f-6a01-4244-898e-836b4a72f72f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.random.randn(2, 3)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 47,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[2.51, 1.11, 2.06],\n",
+ " [0.56, 0.3 , 1.05]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 47
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VtgKeMt6MYrr"
+ },
+ "source": [
+ "Making a copy of a_conjunto1..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "K0hOHR3IMa-o",
+ "outputId": "998e10d1-d94b-4772-8937-05a7ed96142a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "a_salarios_copia = a_conjunto1.copy()\n",
+ "a_salarios_copia"
+ ],
+ "execution_count": 48,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[2.51, 1.11, 2.06],\n",
+ " [0.56, 0.3 , 1.05]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 48
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lFpmcR0HkCar"
+ },
+ "source": [
+ "___\n",
+ "# **Operations with arrays**\n",
+ "> Consider an array of temperatures in Fahrenheit given by:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "N1ewYIYrUFFz",
+ "outputId": "a183084c-7fdc-4f7b-e1bd-85e52077b8c2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "np.random.randint(0, 100, 10)"
+ ],
+ "execution_count": 56,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([60, 42, 40, 8, 27, 2, 46, 88, 81, 88])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 56
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gEF497YLUOd7",
+ "outputId": "59ba5395-d847-4555-fbb5-c0cb2120e120",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_temperatura_farenheit = np.random.randint(0, 100, 10)\n",
+ "a_temperatura_farenheit "
+ ],
+ "execution_count": 57,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([60, 42, 40, 8, 27, 2, 46, 88, 81, 88])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 57
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VnagcUqVkLhW",
+ "outputId": "76d71142-f0cf-4fc4-ffd3-1c7790d2adcd",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Set the seed\n",
+ "np.random.seed(20111974)\n",
+ "\n",
+ "a_temperatura_farenheit = np.array(np.random.randint(0, 100, 10))\n",
+ "a_temperatura_farenheit "
+ ],
+ "execution_count": 55,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([60, 42, 40, 8, 27, 2, 46, 88, 81, 88])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 55
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VrjNKfXxk1yv",
+ "outputId": "2874cced-5cbf-4f04-87d5-c60f598dbfbe",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "type(a_temperatura_farenheit)"
+ ],
+ "execution_count": 58,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "numpy.ndarray"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 58
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o1STejhrk0kZ"
+ },
+ "source": [
+ "Converting the temperatures from Fahrenheit to Celsius..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E_jXflR_lNy3",
+ "outputId": "616374ed-3843-49c9-c490-1a3d31dfb179",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "a_temperatura_celsius = 5*a_temperatura_farenheit/9 - 5*32/9\n",
+ "a_temperatura_celsius"
+ ],
+ "execution_count": 59,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 15.56, 5.56, 4.44, -13.33, -2.78, -16.67, 7.78, 31.11,\n",
+ " 27.22, 31.11])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 59
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "U4pCv0pNqPZI",
+ "outputId": "07201a68-67f6-42ad-9023-75dcb0d00702",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# The same result, written in a different way:\n",
+ "a_temperatura_celsius = (5/9)*a_temperatura_farenheit - (160/9)\n",
+ "a_temperatura_celsius"
+ ],
+ "execution_count": 60,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 15.56, 5.56, 4.44, -13.33, -2.78, -16.67, 7.78, 31.11,\n",
+ " 27.22, 31.11])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 60
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1UT4YD2FawUA"
+ },
+ "source": [
+ "___\n",
+ "# **Selecting items**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pqOv8P1za1m8",
+ "outputId": "977f643f-370b-4e53-d0fc-a8d7a9ac5ecc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Select the second item of a_conjunto1 (remember that in Python array indices start at 0). Since a_conjunto1 is 2D, this is its second row.\n",
+ "a_conjunto1[1]"
+ ],
+ "execution_count": 61,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.56, 0.3 , 1.05])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 61
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TIwVKk6AyRv6"
+ },
+ "source": [
+ "Given a_conjunto2 below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zoDmbXo6bCeu",
+ "outputId": "2e1e5ac4-cd9e-4561-9d34-40f24ee665cf",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "a_conjunto2"
+ ],
+ "execution_count": 62,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 30, 40, 50],\n",
+ " [ 60, 70, 80],\n",
+ " [ 90, 100, 110]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 62
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iJXSPp-0yb4w"
+ },
+ "source": [
+ "... select the item in row 2, column 3 of the array a_conjunto2:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sJiVfnlzcjRv",
+ "outputId": "f8265b6e-f9b7-489c-9c61-c61689ef76bc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto2[1, 2]"
+ ],
+ "execution_count": 63,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "80"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 63
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Xl5HwJIMcv2e",
+ "outputId": "7c7d7bec-c8ca-4941-8b2b-fba1c448ad52",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Select the last element of a_conjunto1. Since a_conjunto1 is a 2D array, its last element is its last row.\n",
+ "a_conjunto1[-1]"
+ ],
+ "execution_count": 64,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.56, 0.3 , 1.05])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 64
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ezTH0HsyrnAl"
+ },
+ "source": [
+ "See..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OBv9EM54rYX3",
+ "outputId": "03083e15-bce9-4df3-aea8-3b22183d3fcd",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 65,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[2.51, 1.11, 2.06],\n",
+ " [0.56, 0.3 , 1.05]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 65
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Po3WLFC-rod8",
+ "outputId": "35018018-52be-4e0f-f0df-3c2b7dd59757",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_temperatura_celsius[-1]"
+ ],
+ "execution_count": 66,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "31.111111111111114"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 66
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4qJJ2HCedW4h"
+ },
+ "source": [
+ "___\n",
+ "# **Applying functions such as max() and min()**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_meTJdUsda4e",
+ "outputId": "38b377c5-d7e6-4121-80f5-cd8e5e6e5d42",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'The maximum of a_conjunto1 is: {np.max(a_conjunto1)}'"
+ ],
+ "execution_count": 67,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'The maximum of a_conjunto1 is: 2.506276801325959'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 67
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m-wiBkAidnhN",
+ "outputId": "c57d17c1-8501-4cc7-b1f2-4eb5305747f4",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'The minimum of a_conjunto1 is: {np.min(a_conjunto1)}'"
+ ],
+ "execution_count": 68,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'The minimum of a_conjunto1 is: 0.29897275739745677'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 68
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lmupnRHQdtwh",
+ "outputId": "1fb75802-b7fb-46d7-efef-d2c27492408b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'The maximum of a_conjunto2 is: {np.max(a_conjunto2)}'"
+ ],
+ "execution_count": 69,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'The maximum of a_conjunto2 is: 110'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 69
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "H2z7oB6Bd786",
+ "outputId": "73dc5fb9-0d53-4e97-cd40-16d1e38d7b03",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'The maximum of each ROW of a_conjunto2 is: {np.max(a_conjunto2, axis = 1)}' # axis = 1 tells NumPy to take the maximum of each row"
+ ],
+ "execution_count": 70,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'The maximum of each ROW of a_conjunto2 is: [ 50 80 110]'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 70
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gj2ZBDsWeMyk",
+ "outputId": "c879bbd9-d17f-4dd1-a9db-dc08b93f6735",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'The maximum of each COLUMN of a_conjunto2 is: {np.max(a_conjunto2, axis = 0)}' # axis = 0 tells NumPy to take the maximum of each column."
+ ],
+ "execution_count": 71,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'The maximum of each COLUMN of a_conjunto2 is: [ 90 100 110]'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 71
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7_tEfm2IecIU"
+ },
+ "source": [
+ "___\n",
+ "# **Computing descriptive statistics: mean and standard deviation**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lIY5jx3ueh7q",
+ "outputId": "cf275100-6c2d-4278-bc0b-aeea828a9197",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'The mean of a_conjunto1 is: {np.mean(a_conjunto1)}'"
+ ],
+ "execution_count": 72,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'The mean of a_conjunto1 is: 1.2649068535973844'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 72
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VmqSELRReuAW",
+ "outputId": "d4d64bc2-3d95-4101-e8aa-bcb99feaaaf9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'The mean of a_conjunto2 is: {np.mean(a_conjunto2)}'"
+ ],
+ "execution_count": 73,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'The mean of a_conjunto2 is: 70.0'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 73
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Gxap-Wg5e2_H",
+ "outputId": "513165e1-1e98-43f5-9278-112ce6973892",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'The standard deviation of a_conjunto2 is: {np.std(a_conjunto2)}'"
+ ],
+ "execution_count": 74,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'The standard deviation of a_conjunto2 is: 25.81988897471611'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 74
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "R0GcljGtfBvP"
+ },
+ "source": [
+ "___\n",
+ "# **Reshaping**\n",
+ "> Very useful in Machine Learning."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vfEmw01j8zux"
+ },
+ "source": [
+ "## Example 1\n",
+ "* The array a_conjunto2 has the following form:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-Lb3VZCCfK_a",
+ "outputId": "16aa9de9-0959-4594-e096-472d051dd7ba",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "a_conjunto2"
+ ],
+ "execution_count": 78,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 30, 40, 50],\n",
+ " [ 60, 70, 80],\n",
+ " [ 90, 100, 110]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 78
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YWN_nN-4fD7u",
+ "outputId": "a2a67789-17e1-4237-8b04-30f498f47db4",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 170
+ }
+ },
+ "source": [
+ "# Reshape to 9 rows and 1 column:\n",
+ "a_conjunto2.reshape(9, 1) # a_conjunto2.reshape(9, -1) produces the same result."
+ ],
+ "execution_count": 76,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 30],\n",
+ " [ 40],\n",
+ " [ 50],\n",
+ " [ 60],\n",
+ " [ 70],\n",
+ " [ 80],\n",
+ " [ 90],\n",
+ " [100],\n",
+ " [110]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 76
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "id9ILRRt7SwY"
+ },
+ "source": [
+ "## Another reshape example\n",
+ "> Given the 1D array below, reshape it into a 2D array with 2 columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9RA9Ht2b7Swd",
+ "outputId": "43c0fa78-3313-4ec9-eb2d-b097d6606b3e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Define seed\n",
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.array(np.random.randint(1, 10, size = 15))\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 79,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([9, 9, 3, 9, 2, 9, 1, 5, 3, 1, 9, 4, 8, 2, 4])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 79
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8KxR4xZT7cRv"
+ },
+ "source": [
+ "### Solution\n",
+ "> We have 15 elements in a_conjunto1 to reshape into a 2D array with 2 columns.\n",
+ "\n",
+ "At first glance, the solution would be..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VMdHl1Il7wLw",
+ "outputId": "d51c7263-f523-4af8-9606-ee93cab66f1c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 163
+ }
+ },
+ "source": [
+ "a_conjunto1.reshape(-1, 2) # The value \"-1\" in the row position asks NumPy to compute the number of rows automatically."
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "error",
+ "ename": "ValueError",
+ "evalue": "ignored",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0ma_numeros1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# O valor \"-1\" na posição das linhas pede ao NumPy para calcular o número de linhas automaticamente.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+ "\u001b[0;31mValueError\u001b[0m: cannot reshape array of size 15 into shape (2)"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pZS4b4-y708q"
+ },
+ "source": [
+ "Why do we get this error?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4disywvR8HeH"
+ },
+ "source": [
+ "What if we do..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3oEAAXTp8I7Z",
+ "outputId": "e8c8a90f-c34a-4304-d9b4-fd7f04ce224f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Define seed\n",
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.array(np.random.randint(1, 10, size = 16)) # Note that we now have 16 elements\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([9, 9, 3, 9, 2, 9, 1, 5, 3, 1, 9, 4, 8, 2, 4, 3])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 21
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iUhth0QV8Rpt"
+ },
+ "source": [
+ "Reshaping..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9D1y7uD88Qip",
+ "outputId": "e7d22bcd-c10f-4ea3-e41b-03f6f98a054f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 153
+ }
+ },
+ "source": [
+ "a_conjunto1.reshape(-1, 2) # The value \"-1\" in the row position asks NumPy to compute the number of rows automatically."
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[9, 9],\n",
+ " [3, 9],\n",
+ " [2, 9],\n",
+ " [1, 5],\n",
+ " [3, 1],\n",
+ " [9, 4],\n",
+ " [8, 2],\n",
+ " [4, 3]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 22
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ALh-sq7DMnN5",
+ "outputId": "db373349-7910-4f1f-93f3-8ac8f67da8b8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 153
+ }
+ },
+ "source": [
+ "# OR --> In this case, we reshape the array into 8 rows and let NumPy infer the 2 columns\n",
+ "a_conjunto1.reshape(8, -1)"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[9, 9],\n",
+ " [3, 9],\n",
+ " [2, 9],\n",
+ " [1, 5],\n",
+ " [3, 1],\n",
+ " [9, 4],\n",
+ " [8, 2],\n",
+ " [4, 3]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 26
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yvTnrszn8Yk0"
+ },
+ "source": [
+ "Why did it work this time?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LeQ9LqIE8baG"
+ },
+ "source": [
+ "## Last reshape example\n",
+ "> Consider the following array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OQOC9iiN8hZT",
+ "outputId": "f2405e24-36b7-48e7-a815-bb6059d5f9d8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.random.randn(2, 3)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 80,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[2.51, 1.11, 2.06],\n",
+ " [0.56, 0.3 , 1.05]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 80
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Cvce8qBl9Cvq"
+ },
+ "source": [
+ "We now want to turn it into an array with 3 rows and 2 columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QDDsYoVt9Klz",
+ "outputId": "eba2e377-bdaa-4586-8512-4e5ad4f52456",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "a_conjunto1.reshape(-1, 2)"
+ ],
+ "execution_count": 81,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[2.51, 1.11],\n",
+ " [2.06, 0.56],\n",
+ " [0.3 , 1.05]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 81
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AdwU5ygt9Svq"
+ },
+ "source": [
+ "It could also be..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5uBeokKc9Uo-"
+ },
+ "source": [
+ "a_conjunto1.reshape(3, -1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OeRBsobc9aKj"
+ },
+ "source": [
+ "And finally, it could also be..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "MDt8UYYH9dBw"
+ },
+ "source": [
+ "a_conjunto1.reshape(3, 2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "91o5vycQfdKW"
+ },
+ "source": [
+ "___\n",
+ "# **Transpose**\n",
+ "* The array a_conjunto2 looks like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RsZwyuhoffjb",
+ "outputId": "f0eec83e-02a7-4931-cc33-bd72b8b73dc0",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "a_conjunto2"
+ ],
+ "execution_count": 82,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 30, 40, 50],\n",
+ " [ 60, 70, 80],\n",
+ " [ 90, 100, 110]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 82
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "A3MzTVoGfiyO",
+ "outputId": "aac33a4c-a98d-4379-fba0-ce27022418c9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "# The transpose of the array a_conjunto2 is given by:\n",
+ "a_conjunto2.T"
+ ],
+ "execution_count": 83,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 30, 60, 90],\n",
+ " [ 40, 70, 100],\n",
+ " [ 50, 80, 110]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 83
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ij-ZW5IyzXIb"
+ },
+ "source": [
+ "In other words, rows became columns. Ok?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qLy6ajgpt3lU"
+ },
+ "source": [
+ "# **Inverse of a square matrix**\n",
+ "> If a matrix is non-singular, then its inverse exists.\n",
+ "\n",
+ "* If the determinant of a matrix is not equal to zero, then the matrix is non-singular."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-u7jRq34t9_x"
+ },
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "a_conjunto1 = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])\n",
+ "a_conjunto2 = np.array([[6, 2], [5, 3]])\n",
+ "a_conjunto3 = np.array([[1, 3, 5],[2, 5, 1],[2, 3, 8]])"
+ ],
+ "execution_count": 84,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7zmHHWWlvaYB",
+ "outputId": "ff3bd49c-1701-4b80-ec4c-6f3d85c8825a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 85,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[1, 2, 3],\n",
+ " [4, 5, 6],\n",
+ " [7, 8, 9]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 85
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3fHKyhOJvcak",
+ "outputId": "49a239c0-baa5-41af-9afd-87eaec5d9cb5",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "a_conjunto2"
+ ],
+ "execution_count": 86,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[6, 2],\n",
+ " [5, 3]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 86
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vQG7yyfjwLg9",
+ "outputId": "2acfc295-6770-48df-e5f2-c792e39498fc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "a_conjunto3"
+ ],
+ "execution_count": 87,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[1, 3, 5],\n",
+ " [2, 5, 1],\n",
+ " [2, 3, 8]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 87
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qa2Yre2rwgRk"
+ },
+ "source": [
+ "## Determinants of the square matrices"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "N6jwuC6twkyc",
+ "outputId": "93b6c505-1e4a-499c-cea4-5de2afe46dc2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.linalg.det(a_conjunto1)"
+ ],
+ "execution_count": 88,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0.0"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 88
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QSvViNwzwnhI",
+ "outputId": "a5fa2582-734e-45c4-fc82-916e0bd8072e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.linalg.det(a_conjunto2)"
+ ],
+ "execution_count": 89,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "8.000000000000002"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 89
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "o8jwsnccw5id",
+ "outputId": "6d0fa7b3-097a-4c2d-b83c-4288f9b09886",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.linalg.det(a_conjunto3)"
+ ],
+ "execution_count": 90,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "-25.000000000000007"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 90
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kkVaTgzgw_XJ"
+ },
+ "source": [
+ "Next, we compute the inverses of the matrices defined above..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "b9FgWvTYvpik",
+ "outputId": "c486ab74-85c9-4718-c010-83b1e9c067ca",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "np.linalg.inv(a_conjunto2)"
+ ],
+ "execution_count": 91,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 0.38, -0.25],\n",
+ " [-0.62, 0.75]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 91
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KsdEt1kIvsM_",
+ "outputId": "1872d30a-4c4e-44b2-cec4-3c6251c278ee",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 340
+ }
+ },
+ "source": [
+ "np.linalg.inv(a_conjunto1)"
+ ],
+ "execution_count": 92,
+ "outputs": [
+ {
+ "output_type": "error",
+ "ename": "LinAlgError",
+ "evalue": "ignored",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mLinAlgError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlinalg\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma_conjunto1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+ "\u001b[0;32m<__array_function__ internals>\u001b[0m in \u001b[0;36minv\u001b[0;34m(*args, **kwargs)\u001b[0m\n",
+ "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py\u001b[0m in \u001b[0;36minv\u001b[0;34m(a)\u001b[0m\n\u001b[1;32m 545\u001b[0m \u001b[0msignature\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'D->D'\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misComplexType\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mt\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m'd->d'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 546\u001b[0m \u001b[0mextobj\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mget_linalg_error_extobj\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_raise_linalgerror_singular\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 547\u001b[0;31m \u001b[0mainv\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_umath_linalg\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msignature\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msignature\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mextobj\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mextobj\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 548\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mwrap\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mainv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult_t\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 549\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py\u001b[0m in \u001b[0;36m_raise_linalgerror_singular\u001b[0;34m(err, flag)\u001b[0m\n\u001b[1;32m 95\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 96\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_raise_linalgerror_singular\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflag\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 97\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mLinAlgError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Singular matrix\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 98\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 99\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_raise_linalgerror_nonposdef\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflag\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;31mLinAlgError\u001b[0m: Singular matrix"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VA_F7_7kccpn"
+ },
+ "source": [
+ "Why is there no inverse for a_conjunto1?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ANPBCnmVwOf4",
+ "outputId": "bea278b5-3959-47c7-e76c-216cd2c7d5cd",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "np.linalg.inv(a_conjunto3)"
+ ],
+ "execution_count": 93,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[-1.48, 0.36, 0.88],\n",
+ " [ 0.56, 0.08, -0.36],\n",
+ " [ 0.16, -0.12, 0.04]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 93
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XAf9k1egxcdF"
+ },
+ "source": [
+ "# **Solving systems of linear equations**\n",
+ "> Consider the system of linear equations below:\n",
+ "\n",
+ "\\begin{equation}\n",
+ "x + 3y + 5z = 10\\\\\n",
+ "2x + 5y + z = 8\\\\\n",
+ "2x + 3y + 8z = 3\n",
+ "\\end{equation}\n",
+ "\n",
+ "That is, $Ax = b$. The solution of this system of equations is given by $A^{-1}b$."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oNf5nqaLxhBY"
+ },
+ "source": [
+ "In other words, we just need to find the inverse of A and multiply it by b."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "omzC5dGA0btc",
+ "outputId": "50af27a9-e578-4e39-882a-0d2a5e51ba8c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "A= np.array([[1, 3, 5], [2, 5, 1], [2, 3, 8]])\n",
+ "np.linalg.inv(A)"
+ ],
+ "execution_count": 94,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[-1.48, 0.36, 0.88],\n",
+ " [ 0.56, 0.08, -0.36],\n",
+ " [ 0.16, -0.12, 0.04]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 94
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AiXI3oxB05iE"
+ },
+ "source": [
+ "Now we just multiply the inverse matrix $A^{-1}$ above by b."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XoGebKDa2Fcd"
+ },
+ "source": [
+ "A_Inv = np.linalg.inv(A)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sKaP0a1QZG-P"
+ },
+ "source": [
+ "b= np.array([10, 8, 3]).reshape(3, -1)\n",
+ "b"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3dAVq8dg19VI"
+ },
+ "source": [
+ "A_Inv.dot(b)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zso6hTnB17cm"
+ },
+ "source": [
+ "An easier way to do this is to use the expression below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ptQHIVll1E4P"
+ },
+ "source": [
+ "b= np.array([[10], [8], [3]])\n",
+ "b"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "X4VL8lyY1Xus"
+ },
+ "source": [
+ "np.linalg.solve(A, b)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fJKmwTS59-Bc"
+ },
+ "source": [
+ "# **Stacking arrays**\n",
+ "\n",
+ "## Example 1\n",
+ "\n",
+ "\n",
+ "\n",
+ "## Example 2\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rhPTt3EwXden"
+ },
+ "source": [
+ "## Generate the arrays for Example 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zEI-yBy3-E46"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.random.randn(5, 8)\n",
+ "\n",
+ "np.random.seed(19741120)\n",
+ "a_conjunto2 = np.random.randn(8, 8)"
+ ],
+ "execution_count": 95,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UYsAqBRp--79"
+ },
+ "source": [
+ "## Method 1 - np.concatenate([A, B])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HgO1ujvhObyE",
+ "outputId": "0aa2aa36-0468-434c-81fc-a9641932564a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 96,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06],\n",
+ " [ 1.14, 1.38, -2.06, 0.67, 0.73, -0.34, 0.44, 0.59],\n",
+ " [-1.29, 1.18, -0.99, -1.79, -1.09, -0.91, -1.02, -1.36],\n",
+ " [-0.29, 0.06, -1.14, -0.51, -0.84, -1.41, -0.22, -1.17],\n",
+ " [-0.61, -0.62, 1.08, 0.5 , 0.03, 1.83, 0.35, -1.15]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 96
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2aQY_klZOeg9",
+ "outputId": "38317a5e-5d47-4570-af47-781f03a8dd5a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 153
+ }
+ },
+ "source": [
+ "a_conjunto2"
+ ],
+ "execution_count": 97,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[-0.77, -1.11, 0.1 , -1.15, -2.15, -0.75, -2.15, -0.33],\n",
+ " [-1.1 , 0.33, 0.01, -1.33, -0.34, -0.01, 0.05, -0.19],\n",
+ " [ 0.39, -0.89, -0.51, -0.75, 1.84, -1.21, 1.2 , 0.51],\n",
+ " [-0.57, -0.93, -0.25, 0.98, 1.19, 2.3 , 0.17, 0.71],\n",
+ " [-0.45, 0.92, 0.73, 2.18, -0.06, 1.25, -0.37, 1.44],\n",
+ " [ 0.86, -0.11, -0.35, 0.94, -0.09, -1.49, 0.01, 0.87],\n",
+ " [ 1.63, 1.36, -0.02, -0.45, -0.37, -0.05, -2.27, 0.95],\n",
+ " [ 0.71, -0.8 , -0.32, -1.58, -0.38, -0.3 , -0.73, -0.56]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 97
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bK70vaq8_KMH",
+ "outputId": "40e9b550-ba17-4896-ffb9-8a9b72ff6055",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 238
+ }
+ },
+ "source": [
+ "np.concatenate([a_conjunto1, a_conjunto2], axis = 0) # axis=0 tells NumPy to stack the rows"
+ ],
+ "execution_count": 98,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 2.51, 1.11, 2.06, 0.56, 0.3 , 1.05, -0.13, 1.06],\n",
+ " [ 1.14, 1.38, -2.06, 0.67, 0.73, -0.34, 0.44, 0.59],\n",
+ " [-1.29, 1.18, -0.99, -1.79, -1.09, -0.91, -1.02, -1.36],\n",
+ " [-0.29, 0.06, -1.14, -0.51, -0.84, -1.41, -0.22, -1.17],\n",
+ " [-0.61, -0.62, 1.08, 0.5 , 0.03, 1.83, 0.35, -1.15],\n",
+ " [-0.77, -1.11, 0.1 , -1.15, -2.15, -0.75, -2.15, -0.33],\n",
+ " [-1.1 , 0.33, 0.01, -1.33, -0.34, -0.01, 0.05, -0.19],\n",
+ " [ 0.39, -0.89, -0.51, -0.75, 1.84, -1.21, 1.2 , 0.51],\n",
+ " [-0.57, -0.93, -0.25, 0.98, 1.19, 2.3 , 0.17, 0.71],\n",
+ " [-0.45, 0.92, 0.73, 2.18, -0.06, 1.25, -0.37, 1.44],\n",
+ " [ 0.86, -0.11, -0.35, 0.94, -0.09, -1.49, 0.01, 0.87],\n",
+ " [ 1.63, 1.36, -0.02, -0.45, -0.37, -0.05, -2.27, 0.95],\n",
+ " [ 0.71, -0.8 , -0.32, -1.58, -0.38, -0.3 , -0.73, -0.56]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 98
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CpaXBkm8_BF8"
+ },
+ "source": [
+ "## Method 2 - np.r_[A, B]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3QnVUzAY_teZ",
+ "outputId": "e8adfd85-e760-40f5-d9ac-48353d24ccd2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 459
+ }
+ },
+ "source": [
+ "np.r_[a_conjunto1, a_conjunto2]"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n",
+ " 1.04930857, -0.12607366, 1.06227632],\n",
+ " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n",
+ " -0.33923852, 0.43613107, 0.59135489],\n",
+ " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n",
+ " -0.90712825, -1.02291108, -1.36445713],\n",
+ " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n",
+ " -1.41492946, -0.2159062 , -1.16519474],\n",
+ " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n",
+ " 1.83494644, 0.34728874, -1.14671885],\n",
+ " [-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n",
+ " -0.75255725, -2.1529949 , -0.33017773],\n",
+ " [-1.10465731, 0.32889675, 0.01010198, -1.33213633, -0.33945805,\n",
+ " -0.01299007, 0.05342823, -0.18641201],\n",
+ " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n",
+ " -1.20536871, 1.20184886, 0.51160897],\n",
+ " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n",
+ " 2.29956497, 0.16657022, 0.71357415],\n",
+ " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n",
+ " 1.25326 , -0.37039248, 1.43855202],\n",
+ " [ 0.85646091, -0.11257239, -0.35400297, 0.94136671, -0.08696163,\n",
+ " -1.49000701, 0.00848666, 0.86705275],\n",
+ " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n",
+ " -0.04716069, -2.27337435, 0.95318738],\n",
+ " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n",
+ " -0.29760341, -0.73424207, -0.55703223]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 36
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XmSPbDP6_20W"
+ },
+ "source": [
+ "**Note**: I prefer this method!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dzVKW_wX_Dzw"
+ },
+ "source": [
+ "## Method 3 - np.vstack([A, B]), equivalent to np.r_[A, B]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uL7lEN_mABID",
+ "outputId": "d1ea4d86-2cc1-4e2d-af72-b3a292ef15fd",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 459
+ }
+ },
+ "source": [
+ "np.vstack([a_conjunto1, a_conjunto2])"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 2.5062768 , 1.11440422, 2.05565501, 0.56482376, 0.29897276,\n",
+ " 1.04930857, -0.12607366, 1.06227632],\n",
+ " [ 1.13807032, 1.37966044, -2.05995563, 0.67474814, 0.72722843,\n",
+ " -0.33923852, 0.43613107, 0.59135489],\n",
+ " [-1.29281877, 1.17712036, -0.98644163, -1.79034143, -1.08913605,\n",
+ " -0.90712825, -1.02291108, -1.36445713],\n",
+ " [-0.29429164, 0.06343709, -1.14196185, -0.50706079, -0.83539436,\n",
+ " -1.41492946, -0.2159062 , -1.16519474],\n",
+ " [-0.60767518, -0.61510925, 1.0771542 , 0.5043687 , 0.02674197,\n",
+ " 1.83494644, 0.34728874, -1.14671885],\n",
+ " [-0.77337752, -1.10547465, 0.10062807, -1.14571729, -2.15266227,\n",
+ " -0.75255725, -2.1529949 , -0.33017773],\n",
+ " [-1.10465731, 0.32889675, 0.01010198, -1.33213633, -0.33945805,\n",
+ " -0.01299007, 0.05342823, -0.18641201],\n",
+ " [ 0.39473805, -0.89354231, -0.50667323, -0.74660913, 1.83586365,\n",
+ " -1.20536871, 1.20184886, 0.51160897],\n",
+ " [-0.56952286, -0.93343871, -0.24972528, 0.98487133, 1.19333367,\n",
+ " 2.29956497, 0.16657022, 0.71357415],\n",
+ " [-0.45251078, 0.92163918, 0.73421263, 2.17811191, -0.05655212,\n",
+ " 1.25326 , -0.37039248, 1.43855202],\n",
+ " [ 0.85646091, -0.11257239, -0.35400297, 0.94136671, -0.08696163,\n",
+ " -1.49000701, 0.00848666, 0.86705275],\n",
+ " [ 1.6340906 , 1.36321063, -0.02175361, -0.45301645, -0.37111236,\n",
+ " -0.04716069, -2.27337435, 0.95318738],\n",
+ " [ 0.7100548 , -0.79883269, -0.3165779 , -1.58352824, -0.37751484,\n",
+ " -0.29760341, -0.73424207, -0.55703223]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 37
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "68icJ-2ZAdRj"
+ },
+ "source": [
+ "# **Concatenating arrays**\n",
+ "\n",
+ "## Example 1\n",
+ "\n",
+ "\n",
+ "\n",
+ "## Example 2\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OplgK9YoQi9o"
+ },
+ "source": [
+ "## Concatenate the elements of two arrays - np.c_[A, B]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lpdsbTEKQ9EY"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.random.randint(0, 10, 100).reshape(-1, 10)\n",
+ "a_conjunto2 = np.random.randint(0, 2, 10).reshape(-1, 1)"
+ ],
+ "execution_count": 99,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JPxhGsaSSMk2",
+ "outputId": "d7e2784d-7896-48ce-f8d9-e81b8254102a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 187
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 100,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[8, 8, 2, 8, 9, 1, 8, 0, 4, 2],\n",
+ " [0, 8, 9, 3, 7, 1, 3, 2, 9, 7],\n",
+ " [7, 9, 5, 6, 8, 7, 0, 9, 3, 9],\n",
+ " [3, 1, 8, 6, 3, 5, 4, 1, 2, 9],\n",
+ " [8, 6, 6, 1, 0, 9, 2, 0, 7, 5],\n",
+ " [5, 4, 4, 2, 7, 2, 7, 9, 3, 1],\n",
+ " [5, 0, 1, 2, 3, 8, 7, 5, 4, 0],\n",
+ " [5, 9, 6, 6, 1, 3, 6, 0, 4, 9],\n",
+ " [2, 1, 0, 9, 1, 4, 2, 9, 7, 9],\n",
+ " [5, 3, 7, 6, 3, 9, 8, 4, 3, 0]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 100
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9ZyUPfybTfej",
+ "outputId": "9e66a78e-3fc1-4ba1-972d-8044325d7de5",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 187
+ }
+ },
+ "source": [
+ "a_conjunto2"
+ ],
+ "execution_count": 101,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[1],\n",
+ " [0],\n",
+ " [0],\n",
+ " [0],\n",
+ " [0],\n",
+ " [1],\n",
+ " [0],\n",
+ " [0],\n",
+ " [0],\n",
+ " [1]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 101
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nS1cPG3aRug1",
+ "outputId": "1758003d-b435-4fc2-b094-6870b64039b1",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 187
+ }
+ },
+ "source": [
+ "# Place the array a_conjunto2 beside a_conjunto1.\n",
+ "np.c_[a_conjunto1, a_conjunto2]"
+ ],
+ "execution_count": 102,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[8, 8, 2, 8, 9, 1, 8, 0, 4, 2, 1],\n",
+ " [0, 8, 9, 3, 7, 1, 3, 2, 9, 7, 0],\n",
+ " [7, 9, 5, 6, 8, 7, 0, 9, 3, 9, 0],\n",
+ " [3, 1, 8, 6, 3, 5, 4, 1, 2, 9, 0],\n",
+ " [8, 6, 6, 1, 0, 9, 2, 0, 7, 5, 0],\n",
+ " [5, 4, 4, 2, 7, 2, 7, 9, 3, 1, 1],\n",
+ " [5, 0, 1, 2, 3, 8, 7, 5, 4, 0, 0],\n",
+ " [5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 0],\n",
+ " [2, 1, 0, 9, 1, 4, 2, 9, 7, 9, 0],\n",
+ " [5, 3, 7, 6, 3, 9, 8, 4, 3, 0, 1]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 102
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kIgU1YBw0OeM"
+ },
+ "source": [
+ "___\n",
+ "# **Selecting items that satisfy conditions**\n",
+ "> Consider the following array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "e2pL5anBV0DI",
+ "outputId": "59325971-d3be-4eed-faf5-67991a242054",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1 = np.arange(10, 0, -1)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 103,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 103
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "i9HuZZAfV302"
+ },
+ "source": [
+ "Select only the items > 7:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZCESvr7iXMkV"
+ },
+ "source": [
+ "## Using np.where()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BdrAQLHkTS-v",
+ "outputId": "af66ee9d-531e-415c-b019-71a548e0b741",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 104,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 104
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "O_ZBaWxfWA9o",
+ "outputId": "3f6482f5-69b2-48a3-e21e-5300b2bd940e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Indices of the array that satisfy the condition\n",
+ "l_indices = np.where(a_conjunto1 > 7)\n",
+ "l_indices"
+ ],
+ "execution_count": 105,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(array([0, 1, 2]),)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 105
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EdWlfPOZWPME"
+ },
+ "source": [
+ "**Note**: so far we have only captured the indices. To select the items themselves, index the array with them:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tOxs3iYQWWxu",
+ "outputId": "8799d399-938f-4c69-fca8-dd355a41f42a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto2 = a_conjunto1[l_indices]\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 106,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([10, 9, 8])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 106
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PGsENqkaXRjh"
+ },
+ "source": [
+ "## Alternative: using []"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YbdRNk1WXTLT",
+ "outputId": "370c4930-10f1-45cc-8214-a0684eb42312",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1[a_conjunto1 > 7]"
+ ],
+ "execution_count": 107,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([10, 9, 8])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 107
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jijpzFxcSQC8"
+ },
+ "source": [
+ "It is worth breaking this solution into steps to better understand how it works:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rujhP2LQSWsq"
+ },
+ "source": [
+ "# First, evaluate the result of a_conjunto1 > 7:"
+ ],
+ "execution_count": 108,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FYZaBsasSb3N",
+ "outputId": "c2b1a20f-b10e-4c5d-db31-ba6522a55ead",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "a_conjunto1 > 7"
+ ],
+ "execution_count": 109,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ True, True, True, False, False, False, False, False, False,\n",
+ " False])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 109
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mvEof-UKaaVG",
+ "outputId": "4fcc225d-d235-4a9d-dcf2-ead48e241e00",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1[a_conjunto1 > 7]"
+ ],
+ "execution_count": 110,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([10, 9, 8])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 110
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nO4FiBmDUZOT",
+ "outputId": "7eb0a17f-69f6-4041-a03a-561dab4fd525",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 111,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 111
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ci5lT9nmSfsX"
+ },
+ "source": [
+ "With this result, it is easy to understand how NumPy selects the elements. Can you explain it?"
+ ]
+ },
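+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of what the two approaches above have in common, assuming a small illustrative array (`a_exemplo`, `a_mascara` and `t_indices` are hypothetical names, not part of the original notebook). The comparison builds a boolean mask, indexing with the mask keeps the positions where it is `True`, and `np.where(condition)` returns the indices of those same positions, so both routes should select the same items."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# All names in this cell are illustrative only.\n",
+ "a_exemplo = np.arange(10, 0, -1)\n",
+ "\n",
+ "a_mascara = a_exemplo > 7        # boolean mask, one entry per element\n",
+ "t_indices = np.where(a_mascara)  # tuple with one index array per dimension\n",
+ "\n",
+ "# Both selections keep the elements at the positions where the mask is True.\n",
+ "a_por_mascara = a_exemplo[a_mascara]\n",
+ "a_por_indices = a_exemplo[t_indices]\n",
+ "\n",
+ "print(a_mascara.sum())  # number of True entries in the mask\n",
+ "print(np.array_equal(a_por_mascara, a_por_indices))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },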
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1v5Lfin0GGKD"
+ },
+ "source": [
+ "# Replacing items based on conditions\n",
+ "> Replace the negative values in the array below with 0."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CLY_u0ePWdN7"
+ },
+ "source": [
+ "## Generate the example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NUANFy-fNXf5",
+ "outputId": "21a681f7-1fb6-4feb-bc24-565589292f1a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.random.randint(0, 10, size=100)\n",
+ "\n",
+ "# Random indices whose values will be negated\n",
+ "np.random.seed(20111974)\n",
+ "l_indices = np.random.randint(0, 99, 9)\n",
+ "\n",
+ "for i in l_indices:\n",
+ " a_conjunto1[i] = -1*a_conjunto1[i]\n",
+ "\n",
+ "a_conjunto2 = a_conjunto1.copy()\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 112,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 8, 8, -2, 8, 9, 1, 8, 0, -4, 2, 0, 8, 9, 3, 7, 1, 3,\n",
+ " 2, 9, 7, 7, 9, 5, 6, 8, 7, 0, -9, 3, 9, 3, 1, 8, 6,\n",
+ " 3, 5, 4, 1, 2, 9, -8, 6, -6, 1, 0, 9, -2, 0, 7, 5, 5,\n",
+ " 4, 4, 2, 7, 2, 7, 9, 3, 1, -5, 0, 1, 2, 3, 8, 7, 5,\n",
+ " 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, -1, 0, 9, 1,\n",
+ " 4, 2, 9, -7, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 112
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Bo44elZsWhBN",
+ "outputId": "b2f38dac-a1a8-44d7-cae3-06e1a2e4a7ab",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 117,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 8, 8, -2, 8, 9, 1, 8, 0, -4, 2, 0, 8, 9, 3, 7, 1, 3,\n",
+ " 2, 9, 7, 7, 9, 5, 6, 8, 7, 0, -9, 3, 9, 3, 1, 8, 6,\n",
+ " 3, 5, 4, 1, 2, 9, -8, 6, -6, 1, 0, 9, -2, 0, 7, 5, 5,\n",
+ " 4, 4, 2, 7, 2, 7, 9, 3, 1, -5, 0, 1, 2, 3, 8, 7, 5,\n",
+ " 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, -1, 0, 9, 1,\n",
+ " 4, 2, 9, -7, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 117
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dWVyI40uN2d2",
+ "outputId": "5d605357-c96e-471c-a870-b09616ffbb95",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Indices whose values were multiplied by -1 (sorted in place):\n",
+ "l_indices.sort()\n",
+ "l_indices"
+ ],
+ "execution_count": 116,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 2, 8, 27, 40, 42, 46, 60, 81, 88])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 116
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3Whuu854OJDZ"
+ },
+ "source": [
+ "## Replace the negative values with 0"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sr268Rp8b-Se",
+ "outputId": "dfb6923e-2ff2-4e92-bcbf-bc81a38cc4b9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 221
+ }
+ },
+ "source": [
+ "a_conjunto2 < 0"
+ ],
+ "execution_count": 118,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([False, False, True, False, False, False, False, False, True,\n",
+ " False, False, False, False, False, False, False, False, False,\n",
+ " False, False, False, False, False, False, False, False, False,\n",
+ " True, False, False, False, False, False, False, False, False,\n",
+ " False, False, False, False, True, False, True, False, False,\n",
+ " False, True, False, False, False, False, False, False, False,\n",
+ " False, False, False, False, False, False, True, False, False,\n",
+ " False, False, False, False, False, False, False, False, False,\n",
+ " False, False, False, False, False, False, False, False, False,\n",
+ " True, False, False, False, False, False, False, True, False,\n",
+ " False, False, False, False, False, False, False, False, False,\n",
+ " False])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 118
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iaFIC3JFWp0T",
+ "outputId": "3927203e-9cc4-48ed-9c27-eb8f684cf48e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto2[a_conjunto2 < 0]"
+ ],
+ "execution_count": 119,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([-2, -4, -9, -8, -6, -2, -5, -1, -7])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 119
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "C-eKqPrfOQF6",
+ "outputId": "27239b4f-b921-4d35-f82e-a69caac4a362",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "a_conjunto2[a_conjunto2 < 0] = 0\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 120,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([8, 8, 0, 8, 9, 1, 8, 0, 0, 2, 0, 8, 9, 3, 7, 1, 3, 2, 9, 7, 7, 9,\n",
+ " 5, 6, 8, 7, 0, 0, 3, 9, 3, 1, 8, 6, 3, 5, 4, 1, 2, 9, 0, 6, 0, 1,\n",
+ " 0, 9, 0, 0, 7, 5, 5, 4, 4, 2, 7, 2, 7, 9, 3, 1, 0, 0, 1, 2, 3, 8,\n",
+ " 7, 5, 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, 0, 0, 9, 1, 4, 2, 9,\n",
+ " 0, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 120
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eDLM0_JSZlfB"
+ },
+ "source": [
+ "Note above that the negative values were replaced with 0, as intended."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AEHJ0rA3dHHU"
+ },
+ "source": [
+ "## Replace the negative values with 0 and the positive values with 1\n",
+ "> Zeros are also mapped to 0, because the condition used below is `<= 0`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "y32J8SRNZwRF",
+ "outputId": "ffeee753-67ba-4ccc-fa51-8c8df224b9f8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "a_conjunto2 = a_conjunto1.copy()\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 121,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 8, 8, -2, 8, 9, 1, 8, 0, -4, 2, 0, 8, 9, 3, 7, 1, 3,\n",
+ " 2, 9, 7, 7, 9, 5, 6, 8, 7, 0, -9, 3, 9, 3, 1, 8, 6,\n",
+ " 3, 5, 4, 1, 2, 9, -8, 6, -6, 1, 0, 9, -2, 0, 7, 5, 5,\n",
+ " 4, 4, 2, 7, 2, 7, 9, 3, 1, -5, 0, 1, 2, 3, 8, 7, 5,\n",
+ " 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, -1, 0, 9, 1,\n",
+ " 4, 2, 9, -7, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 121
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1bSD9Fs6P5wW",
+ "outputId": "828e9490-2cd9-4e11-cd03-c26bee1562b5",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "a_conjunto2 = np.where(a_conjunto2 <= 0, 0, 1)\n",
+ "a_conjunto2"
+ ],
+ "execution_count": 122,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,\n",
+ " 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,\n",
+ " 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 122
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "i027scjl0qkm"
+ },
+ "source": [
+ "___\n",
+ "# Outliers\n",
+ "> Any point/observation that is unusual when compared with all the other points/observations."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UnDTqRnZHQ3W"
+ },
+ "source": [
+ "## Z-Score\n",
+ "\n",
+ "* The Z-score can be used to detect outliers.\n",
+ "* It is the difference between a value and the sample mean, expressed as a number of standard deviations: $z = (x - \\mu)/\\sigma$, where $\\mu$ is the mean and $\\sigma$ the standard deviation of the sample.\n",
+ "* If the z-score is less than -2.5 or greater than 2.5, the value lies among the most extreme observations: for normally distributed data, only about 1.2% of values fall beyond these limits (about 0.6% in each tail). However, it is common practice to use 3 instead of 2.5.\n",
+ "\n",
+ ""
+ ]
+ },
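+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of z-score outlier detection, assuming a synthetic sample with two planted extremes and the common cutoff of 3 mentioned above. The names `a_amostra`, `a_z` and `f_limite` are illustrative and not part of the original notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Synthetic sample for illustration only: 200 values around 50, plus two planted extremes.\n",
+ "rng = np.random.default_rng(0)\n",
+ "a_amostra = np.concatenate([rng.normal(50, 5, size=200), [120.0, -30.0]])\n",
+ "\n",
+ "# z-score: distance from the mean, measured in standard deviations.\n",
+ "a_z = (a_amostra - a_amostra.mean()) / a_amostra.std()\n",
+ "\n",
+ "f_limite = 3  # common cutoff; a cutoff of 2.5 would flag more points\n",
+ "a_outliers = a_amostra[np.abs(a_z) > f_limite]\n",
+ "a_outliers"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },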
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "N7gb2zhtd0uM"
+ },
+ "source": [
+ "## IQR Score\n",
+ "\n",
+ "* The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the 75th percentile (Q3) and the 25th percentile (Q1), that is, between the upper and lower quartiles: IQR = Q3 - Q1."
+ ]
+ },
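+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of the IQR rule on a small hand-written array, assuming the usual fences `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`. The array and the variable names are illustrative, and `np.clip` is just one way to pull values that fall beyond the fences back onto them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Illustrative data: mostly values near 10, with one low and one high extreme.\n",
+ "a_dados = np.array([9.0, 10.0, 10.5, 11.0, 9.5, 10.2, 9.8, 30.0, -5.0, 10.1])\n",
+ "\n",
+ "f_q1, f_q3 = np.percentile(a_dados, [25, 75])\n",
+ "f_iqr = f_q3 - f_q1\n",
+ "f_inferior = f_q1 - 1.5 * f_iqr  # lower fence\n",
+ "f_superior = f_q3 + 1.5 * f_iqr  # upper fence\n",
+ "\n",
+ "# Boolean mask of the points outside the fences.\n",
+ "a_e_outlier = (a_dados < f_inferior) | (a_dados > f_superior)\n",
+ "print(a_dados[a_e_outlier])\n",
+ "\n",
+ "# Replace each outlier with the nearest fence.\n",
+ "a_ajustado = np.clip(a_dados, f_inferior, f_superior)\n",
+ "print(a_ajustado)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },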
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lMmWOKNvghI7"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DUw_a-MjWvBc"
+ },
+ "source": [
+ "### Challenge for you to solve\n",
+ "> **Goal**: Randomly simulate the salaries of 1,000 people with distribution N(1045, 100), that is, mean 1045 and standard deviation 100. \n",
+ "* Identify the _outliers_ of the distribution we have just simulated;\n",
+ "* What is the mean of the simulated distribution?\n",
+ "* What is the standard deviation?\n",
+ "* Plot the boxplot of the data;\n",
+ "* How many people have a salary > `Q3 + 1.5*(Q3-Q1)`?\n",
+ "* Replace the outliers in the array with:\n",
+ " * `Q1 - 1.5*(Q3-Q1)`, if the point < `Q1 - 1.5*(Q3-Q1)`\n",
+ " * `Q3 + 1.5*(Q3-Q1)`, if the point > `Q3 + 1.5*(Q3-Q1)`\n",
+ "\n",
+ "Note: use np.random.seed(20111974)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L9ntAdS_oOAh"
+ },
+ "source": [
+ "### Random generation of the array a_salarios with distribution $N(\\mu, \\sigma)$"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZVexc7iwW-uz",
+ "outputId": "9578d344-bd54-48d0-81b6-78be7642347c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
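+ "import numpy as np\n",
+ "\n",
+ "# Note: this run uses seed 202020202 and a standard deviation of 500,\n",
+ "# not the seed 20111974 and standard deviation 100 stated in the challenge.\n",
+ "np.random.seed(202020202)\n",
+ "np.set_printoptions(precision=2, suppress=True)\n",
+ "\n",
+ "i_media = 1045    # mean\n",
+ "i_dp = 500        # standard deviation\n",
+ "i_size = 1000     # sample size (used in the summaries below)\n",
+ "i_tamanho = 1000  # same sample size (used when drawing the indices below)\n",
+ "\n",
+ "a_salarios = np.random.normal(i_media, i_dp, size=i_size)\n",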
+ "a_salarios[:30]"
+ ],
+ "execution_count": 132,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 488.91, 1227.09, 641.47, 575.7 , 1664.34, 1254.27, 931.62,\n",
+ " 1550.25, 1242.86, 794.05, 924.31, 1184.53, 723.31, 46.76,\n",
+ " 1336.84, 1323.31, 1480.29, 754.29, 1105.24, 1254.35, 930.3 ,\n",
+ " 889.64, 954.45, 716.47, 848.33, 925.16, 1021.7 , 978.52,\n",
+ " 588.65, 1682.72])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 132
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RL0Zb0fyDory",
+ "outputId": "8f932586-43c2-427a-b445-7b377c8cdb93",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 104
+ }
+ },
+ "source": [
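+ "# Variant that follows the challenge statement (seed 20111974, standard deviation 100).\n",
+ "# Running it overwrites a_salarios; the outputs shown in the cells below come from the previous cell.\n",
+ "import numpy as np\n",
+ "np.random.seed(20111974)\n",
+ "np.set_printoptions(precision=2, suppress=True)\n",
+ "\n",
+ "media = 1045\n",
+ "desvio_padrao = 100\n",
+ "i_tamanho = 1000\n",
+ "\n",
+ "a_salarios = np.random.normal(media, desvio_padrao, size=i_tamanho)\n",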
+ "a_salarios[:30]"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([1295.63, 1156.44, 1250.57, 1101.48, 1074.9 , 1149.93, 1032.39,\n",
+ " 1151.23, 1158.81, 1182.97, 839. , 1112.47, 1117.72, 1011.08,\n",
+ " 1088.61, 1104.14, 915.72, 1162.71, 946.36, 865.97, 936.09,\n",
+ " 954.29, 942.71, 908.55, 1015.57, 1051.34, 930.8 , 994.29,\n",
+ " 961.46, 903.51])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 5
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Fc3a-yhViCTs"
+ },
+ "source": [
+ "### Random generation of the indices that will be (manually) altered"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Iakt6i1cgEcB",
+ "outputId": "555936b6-9a67-4220-a78f-9abf464ad966",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Random indices whose salaries will be altered\n",
+ "np.random.seed(19741120)\n",
+ "l_indices = np.random.randint(0, i_tamanho, 10)\n",
+ "\n",
+ "# These are the positions that will be altered (manually)\n",
+ "np.sort(l_indices)"
+ ],
+ "execution_count": 133,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 14, 105, 208, 349, 484, 567, 615, 616, 622, 847])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 133
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oXwME1rciHkw"
+ },
+ "source": [
+ "### Copy of the salaries so we can compare BEFORE and AFTER"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BEtnua7sgp_y",
+ "outputId": "e7e6f685-4d47-49ba-e04e-4571617bd07f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "# Copies of the array a_salarios (one for each solution below)\n",
+ "a_salarios_copia = a_salarios.copy()\n",
+ "a_salarios_copia2 = a_salarios.copy()\n",
+ "\n",
+ "a_salarios[:30]"
+ ],
+ "execution_count": 134,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 488.91, 1227.09, 641.47, 575.7 , 1664.34, 1254.27, 931.62,\n",
+ " 1550.25, 1242.86, 794.05, 924.31, 1184.53, 723.31, 46.76,\n",
+ " 1336.84, 1323.31, 1480.29, 754.29, 1105.24, 1254.35, 930.3 ,\n",
+ " 889.64, 954.45, 716.47, 848.33, 925.16, 1021.7 , 978.52,\n",
+ " 588.65, 1682.72])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 134
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "So8qj3Yrh-Az"
+ },
+ "source": [
+ "### (Manual) alteration of the salaries: 2 alternatives\n",
+ "> Let's time both to see which one is faster. Which solution is faster?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Z0613on8z5VH"
+ },
+ "source": [
+ "from timeit import default_timer as timer\n",
+ "from datetime import timedelta"
+ ],
+ "execution_count": 135,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NpvvholVxMhs",
+ "outputId": "2f23cbcd-2667-4ec7-a2b8-6d10480d2fde",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Indices to be altered\n",
+ "l_indices"
+ ],
+ "execution_count": 136,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([567, 14, 616, 484, 208, 105, 349, 615, 622, 847])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 136
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BqXsmMdm1yF-"
+ },
+ "source": [
+ "#### Solution 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FiiOrlnbgKOD",
+ "outputId": "9c77735d-8362-455c-830b-9f0c468d1f93",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Alter the salaries at the selected indices\n",
+ "start = timer()\n",
+ "for i_indice in l_indices:\n",
+ "    a_salarios_copia[i_indice] = 2*a_salarios[i_indice] # Loop over the indices to be altered (manually)\n",
+ "\n",
+ "end = timer()\n",
+ "a_salarios_copia[:30]\n",
+ "print(timedelta(seconds=end-start))"
+ ],
+ "execution_count": 137,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "0:00:00.000071\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FgvKC-aFzWpZ"
+ },
+ "source": [
+ "#### Solution 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XWlQC5Jazt26",
+ "outputId": "8445dc45-c409-4681-de94-84a4e7ec4957",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "start = timer()\n",
+ "a_salarios_copia2[l_indices] = 2*a_salarios_copia2[l_indices] # Vectorized update with an index array (no Python loop)\n",
+ "end = timer()\n",
+ "a_salarios_copia2[:30]\n",
+ "\n",
+ "print(timedelta(seconds=end-start))"
+ ],
+ "execution_count": 138,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "0:00:00.000079\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
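+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A single `timer()` measurement of a few microseconds can be dominated by noise, so the two timings above are hard to compare. A minimal sketch of a steadier comparison, assuming the standard-library `timeit.repeat` and illustrative data (`a_base`, `a_idx` and both function names are hypothetical, not part of the original notebook): each variant runs many times and the best total is kept."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import timeit\n",
+ "import numpy as np\n",
+ "\n",
+ "# Illustrative data only.\n",
+ "rng = np.random.default_rng(0)\n",
+ "a_base = rng.normal(1045, 100, size=1000)\n",
+ "a_idx = rng.integers(0, a_base.size, size=10)\n",
+ "\n",
+ "def dobrar_com_loop():\n",
+ "    # Doubles the selected positions one at a time.\n",
+ "    a_copia = a_base.copy()\n",
+ "    for i in a_idx:\n",
+ "        a_copia[i] = 2 * a_base[i]\n",
+ "    return a_copia\n",
+ "\n",
+ "def dobrar_vetorizado():\n",
+ "    # Doubles the selected positions in one indexed assignment.\n",
+ "    a_copia = a_base.copy()\n",
+ "    a_copia[a_idx] = 2 * a_base[a_idx]\n",
+ "    return a_copia\n",
+ "\n",
+ "# Best of 5 repetitions, each one calling the function 1000 times.\n",
+ "f_loop = min(timeit.repeat(dobrar_com_loop, number=1000, repeat=5))\n",
+ "f_vet = min(timeit.repeat(dobrar_vetorizado, number=1000, repeat=5))\n",
+ "print(f'loop: {f_loop:.4f} s   vectorized: {f_vet:.4f} s')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },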
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "U92w03afhrmC"
+ },
+ "source": [
+ "### Compare"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Ls-jCFCYhtD8",
+ "outputId": "9cbfe609-9393-428f-94df-ffdae2064bda",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Before\n",
+ "a_salarios[l_indices]"
+ ],
+ "execution_count": 139,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 614.6 , 1336.84, 1365.54, 1112.75, 1086.39, 549.15, 2017.39,\n",
+ " 1065.89, 917.95, 1085.65])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 139
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nwwU06OahzD2",
+ "outputId": "6ca35eba-5e8d-46b1-a8ae-0b34608bae3f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# After\n",
+ "a_salarios_copia[l_indices]"
+ ],
+ "execution_count": 140,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([1229.19, 2673.67, 2731.07, 2225.51, 2172.78, 1098.31, 4034.79,\n",
+ " 2131.78, 1835.91, 2171.3 ])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 140
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qyUUdHmtisJS",
+ "outputId": "3963e188-69cc-4550-e1bf-23d17167f5a2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "# First 30 elements of a_salarios\n",
+ "a_salarios[:30]"
+ ],
+ "execution_count": 141,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 488.91, 1227.09, 641.47, 575.7 , 1664.34, 1254.27, 931.62,\n",
+ " 1550.25, 1242.86, 794.05, 924.31, 1184.53, 723.31, 46.76,\n",
+ " 1336.84, 1323.31, 1480.29, 754.29, 1105.24, 1254.35, 930.3 ,\n",
+ " 889.64, 954.45, 716.47, 848.33, 925.16, 1021.7 , 978.52,\n",
+ " 588.65, 1682.72])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 141
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CJ1FEjlCi0-n",
+ "outputId": "20177975-5f0f-4b66-cf42-aa7bb2c08148",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "# First 30 elements of a_salarios_copia\n",
+ "a_salarios_copia[:30]"
+ ],
+ "execution_count": 142,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 488.91, 1227.09, 641.47, 575.7 , 1664.34, 1254.27, 931.62,\n",
+ " 1550.25, 1242.86, 794.05, 924.31, 1184.53, 723.31, 46.76,\n",
+ " 2673.67, 1323.31, 1480.29, 754.29, 1105.24, 1254.35, 930.3 ,\n",
+ " 889.64, 954.45, 716.47, 848.33, 925.16, 1021.7 , 978.52,\n",
+ " 588.65, 1682.72])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 142
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wKbSUgxxiOUL"
+ },
+ "source": [
+ "### Some descriptive statistics:\n",
+ "#### Before"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZnmykyahLWX9",
+ "outputId": "90196058-0220-45d8-d87e-ab7756131fb6",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'Sample: {i_size}; Mean: {np.mean(a_salarios)}; Median: {np.median(a_salarios)}; STD: {np.std(a_salarios)}'"
+ ],
+ "execution_count": 144,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Sample: 1000; Mean: 1044.877512524266; Median: 1032.846999013294; STD: 500.9747087649826'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 144
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ow7MHjgmPIty"
+ },
+ "source": [
+ "#### After"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5iO-BAikieHJ",
+ "outputId": "2e04ed85-08be-44b6-b75b-5d4ce2f292b7",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'Sample: {i_size}; Mean: {np.mean(a_salarios_copia)}; Median: {np.median(a_salarios_copia)}; STD: {np.std(a_salarios_copia)}'"
+ ],
+ "execution_count": 145,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Sample: 1000; Mean: 1056.0296676634293; Median: 1035.8622742229763; STD: 519.1113559585269'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 145
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ILhNe80xW5C6"
+ },
+ "source": [
+ "### Solution to the challenge"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OyFbWs-APowd",
+ "outputId": "9dad8981-0c5b-415a-9712-b8c77f429248",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 270
+ }
+ },
+ "source": [
+ "# Import the seaborn library:\n",
+ "import seaborn as sns\n",
+ "\n",
+ "# Boxplot before introducing the \"outliers\"\n",
+ "sns.boxplot(y = a_salarios)"
+ ],
+ "execution_count": 146,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 146
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAADsCAYAAACcwaY+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAQiklEQVR4nO3dX4hd53nv8e9Po6bH/UfssWpcWRz5RCrFIRw3DE6gvUiPJXtsSJTcBOdANYSAfGHL6qEXdc5NgkshlCbB0skxTIjoCNoaQ1uiFCFXMoVyLtJ4XIzkPwkeHBtrkO3puCQBuykjPediluwtV7L3nhnPmun7/cBmr/2sd6/9bGP99Opda++dqkKS1JYtfTcgSVp/hr8kNcjwl6QGGf6S1CDDX5IaZPhLUoO29t3AMK6//vrauXNn321I0qby1FNP/UtVbbvSvk0R/jt37mR2drbvNiRpU0ny8tX2uewjSQ0y/CWpQYa/JDXI8JekBhn+0iosLi7ywAMPsLi42Hcr0kgMf2kVZmZmOHv2LMeOHeu7FWkkhr+0QouLi5w8eZKq4uTJk87+takY/tIKzczMcPHiRQAuXLjg7F+biuEvrdDp06dZWloCYGlpiVOnTvXckTQ8w19aoT179pAEgCTs3bu3546k4Rn+0gp95jOf4dLPoFYVn/70p3vuSBqe4S+t0PHjxy+b+X/ve9/ruSNpeIa/tEKnT5++bObvmr82E8NfWqE9e/awdevyF+Nu3brVNX9tKoa/tEJTU1Ns2bL8R2hsbIz9+/f33JE0PMNfWqHx8XEmJydJwuTkJOPj4323JA1tU/yYi7RRTU1N8dJLLznr16Yz9Mw/yY4k/5DkuSTPJjnU1b+aZD7J093t7oHnfDnJXJIfJblzoD7Z1eaSPLi2b0laP+Pj4xw+fNhZvzadUWb+S8AfVtU/J/lV4Kkkly5v+GZV/dng4CS3APcAHwV+Azid5De73d8C9gLngCeTHK+q51bzRiRJwxs6/KvqPHC+2/5ZkueB7e/xlH3Ao1X1c+DHSeaA27p9c1X1IkCSR7uxhr8krZMVnfBNshP4beCfutL9Sc4kOZrk2q62HXhl4GnnutrV6pKkdTJy+Cf5FeCvgT+oqp8CjwAfAW5l+V8GX1+LxpIcSDKbZHZhYWEtDimtOX/MRZvVSOGf5BdYDv6/qKq/Aaiq16rqQlVdBL7NO0s788COgaff1NWuVr9MVU1X1URVTWzbtm2UNqV1Mz09zZkzZ5ienu67FWkko1ztE+A7wPNV9Y2B+o0Dwz4HPNNtHwfuSfKLSW4GdgM/AJ4Edie5OcmHWD4pfHx1b0Naf4uLi29/pcOpU6ec/WtTGWXm/zvA7wP/412Xdf5pkrNJzgC/B/wvgKp6FniM5RO5J4H7un8hLAH3A48DzwOPdWOlTWV6evrtH3O5ePGis39tKrn0xVQb2cTERM3OzvbdhnSZPXv2vP1jLrD8/T6nT5/usSPpckmeqqqJK+3z6x2kFXr3xGkzTKSkSwx/aYVuv/32yx7v2bOnp06k0Rn+0grde++9b3+r55YtWzhw4EDPHUnDM/ylFRofH397tr93716/30ebit/qKa3Cvffey6uvvuqsX5uO4a+RHTlyhLm5ub7b2BDm55c/n/jQQw/13MnGsGvXLg4ePNh3GxqC4S+twltvvdV3C9KKGP4amTO7dxw6dAiAhx9+uOdOpNF4wleSGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGDR3+SXYk+YckzyV5Nsmhrn5dklNJXujur+3qSXI4yVySM0k+PnCsqW78C0mm1v5tSZLeyygz/yXgD6vqFuCTwH1JbgEeBJ6oqt3AE91jgLuA3d3tAPAILP9lAXwF+ARwG/CVS39hSJLWx9DhX1Xnq+qfu+2f
Ac8D24F9wEw3bAb4bLe9DzhWy74PfDjJjcCdwKmqeqOq/hU4BUyuybuRJA1lRWv+SXYCvw38E3BDVZ3vdr0K3NBtbwdeGXjaua52tbokaZ2MHP5JfgX4a+APquqng/uqqoBai8aSHEgym2R2YWFhLQ4pSeqMFP5JfoHl4P+Lqvqbrvxat5xDd/96V58Hdgw8/aaudrX6Zapquqomqmpi27Zto7QpSXofo1ztE+A7wPNV9Y2BXceBS1fsTAHfHajv7676+STwk2556HHgjiTXdid67+hqkqR1snWEsb8D/D5wNsnTXe1/A18DHkvyJeBl4PPdvhPA3cAc8CbwRYCqeiPJHwNPduMeqqo3VvUuJEkjGTr8q+r/AbnK7tuvML6A+65yrKPA0WFfW5K0tvyEryQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNGjr8kxxN8nqSZwZqX00yn+Tp7nb3wL4vJ5lL8qMkdw7UJ7vaXJIH1+6tSJKGNcrM/8+BySvUv1lVt3a3EwBJbgHuAT7aPef/JhlLMgZ8C7gLuAX4QjdWkrSOtg47sKr+McnOIYfvAx6tqp8DP04yB9zW7ZurqhcBkjzajX1u6I4lSau2Fmv+9yc50y0LXdvVtgOvDIw519WuVpckraPVhv8jwEeAW4HzwNdX3VEnyYEks0lmFxYW1uqwkiRWGf5V9VpVXaiqi8C3eWdpZx7YMTD0pq52tfqVjj1dVRNVNbFt27bVtClJepdVhX+SGwcefg64dCXQceCeJL+Y5GZgN/AD4Elgd5Kbk3yI5ZPCx1fTgyRpdEOf8E3yV8CngOuTnAO+Anwqya1AAS8B9wJU1bNJHmP5RO4ScF9VXeiOcz/wODAGHK2qZ9fs3UiShjLK1T5fuEL5O+8x/k+AP7lC/QRwYtjXlSStPT/hK0kNMvwlqUFDL/u07siRI8zNzfXdhjaYS/9PHDp0qOdOtNHs2rWLgwcP9t3GVRn+Q5qbm+PpZ57nwi9d13cr2kC2/HsB8NSLr/XciTaSsTff6LuF92X4j+DCL13HW7919/sPlNS0a3648a9pcc1fkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDhg7/JEeTvJ7kmYHadUlOJXmhu7+2qyfJ4SRzSc4k+fjAc6a68S8kmVrbtyNJGsYov+H758D/AY4N1B4EnqiqryV5sHv8R8BdwO7u9gngEeATSa4DvgJMAAU8leR4Vf3rat/IB21+fp6xN3+yKX6bU1K/xt5cZH5+qe823tPQM/+q+kfg3T9Jvw+Y6bZngM8O1I/Vsu8DH05yI3AncKqq3ugC/xQwuZo3IEka3Sgz/yu5oarOd9uvAjd029uBVwbGnetqV6tveNu3b+fVn2/lrd+6u+9WJG1w1/zwBNu33/D+A3u0Zid8q6pYXspZE0kOJJlNMruwsLBWh5Uksfrwf61bzqG7f72rzwM7Bsbd1NWuVv8Pqmq6qiaqamLbtm2rbFOSNGi14X8cuHTFzhTw3YH6/u6qn08CP+mWhx4H7khybXdl0B1dTZK0joZe80/yV8CngOuTnGP5qp2vAY8l+RLwMvD5bvgJ4G5gDngT+CJAVb2R5I+BJ7txD1XVu08iS5I+YEOHf1V94Sq7br/C2ALuu8pxjgJHh31dSdLa8xO+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDVrtzzg2ZezNN/wBd11my7/9FICL/+XXeu5EG8nYm2/wzq/abkyG/5B27drVdwvagObmfgbArv+2sf+ga73d
sOEzw/Af0sGDB/tuQRvQoUOHAHj44Yd77kQajWv+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAatSfgneSnJ2SRPJ5ntatclOZXkhe7+2q6eJIeTzCU5k+Tja9GDJGl4aznz/72qurWqJrrHDwJPVNVu4InuMcBdwO7udgB4ZA17kCQN4YNc9tkHzHTbM8BnB+rHatn3gQ8nufED7EOS9C5rFf4F/H2Sp5Ic6Go3VNX5bvtV3vmKu+3AKwPPPdfVLpPkQJLZJLMLCwtr1KYkCdbui91+t6rmk/w6cCrJDwd3VlUlqVEOWFXTwDTAxMTESM+VJL23NZn5V9V8d/868LfAbcBrl5ZzuvvXu+HzwI6Bp9/U1SRJ62TV4Z/kl5P86qVt4A7gGeA4MNUNmwK+220fB/Z3V/18EvjJwPKQJGkdrMWyzw3A3ya5dLy/rKqTSZ4EHkvyJeBl4PPd+BPA3cAc8CbwxTXoQZI0glWHf1W9CPz3K9QXgduvUC/gvtW+riRp5fyEryQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kN6i38k0wm+VGSuSQP9tWHJLWol/BPMgZ8C7gLuAX4QpJb+uhFklrU18z/NmCuql6sqn8HHgX29dSLJDWnr/DfDrwy8PhcV5MkrYMNe8I3yYEks0lmFxYW+m5Hkv5T6Sv854EdA49v6mpvq6rpqpqoqolt27ata3OS9J9dX+H/JLA7yc1JPgTcAxzvqRdJas7WPl60qpaS3A88DowBR6vq2T56kaQW9RL+AFV1AjjR1+tLUss27AlfSdIHx/CXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1KDevttHm9eRI0eYm5vru40N4dJ/h0OHDvXcycawa9cuDh482HcbGoLhL63CNddc03cL0ooY/hqZMztp83PNX5IaZPhLq7C4uMgDDzzA4uJi361IIzH8pVWYmZnh7NmzHDt2rO9WpJEY/tIKLS4ucvLkSaqKkydPOvvXpmL4Sys0MzPDxYsXAbhw4YKzf20qhr+0QqdPn2ZpaQmApaUlTp061XNH0vAMf2mF9uzZw9aty1dLb926lb179/bckTQ8w19aoampKbZsWf4jNDY2xv79+3vuSBqe4S+t0Pj4OJOTkyRhcnKS8fHxvluShraq8E/y1STzSZ7ubncP7PtykrkkP0py50B9sqvNJXlwNa8v9W1qaoqPfexjzvq16azF1zt8s6r+bLCQ5BbgHuCjwG8Ap5P8Zrf7W8Be4BzwZJLjVfXcGvQhrbvx8XEOHz7cdxvSyD6o7/bZBzxaVT8HfpxkDrit2zdXVS8CJHm0G2v4S9I6Wos1//uTnElyNMm1XW078MrAmHNd7Wp1SdI6et/wT3I6yTNXuO0DHgE+AtwKnAe+vlaNJTmQZDbJ7MLCwlodVpLEEMs+VbVnmAMl+Tbwd93DeWDHwO6buhrvUX/3604D092xF5K8PEwfUg+uB/6l7yakK/ivV9uxqjX/JDdW1fnu4eeAZ7rt48BfJvkGyyd8dwM/AALsTnIzy6F/D/A/3+91qmrbavqUPkhJZqtqou8+pFGs9oTvnya5FSjgJeBegKp6NsljLJ/IXQLuq6oLAEnuBx4HxoCjVfXsKnuQJI0oVdV3D9Km5sxfm5Gf8JVWb7rvBqRROfOXpAY585ekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kN+v+2BmjqiYSXHQAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "U993i1GJg2hk",
+ "outputId": "ae9aa1c4-5054-40b2-9e53-54aeaeda95fe",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 269
+ }
+ },
+ "source": [
+ "# Boxplot of the array a_salarios_copia after introducing the \"outliers\"\n",
+ "sns.boxplot(y = a_salarios_copia)"
+ ],
+ "execution_count": 147,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 147
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAADrCAYAAACFMUa7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAO40lEQVR4nO3dX2zdZ33H8fc3dhOoptH2xKoqp1o6ORIqQmPIKp24mUicumw0vUCoCC0WqhQuutRDk0bZDRpwATftkmwgRWs1B00rLZvUgNqAW5CmXfDHGYzQFtQzaJWGPzUnpTCyhTr+7sJPisvs+BzXOb9jnvdLsvx7/pxzvr8o/vjR7zzn58hMJEl12NJ0AZKk/jH0Jakihr4kVcTQl6SKGPqSVBFDX5IqMtx0AZeyffv23LlzZ9NlSNKmcvLkyZ9m5shKYwMd+jt37mRubq7pMiRpU4mI51Yb8/KOJFXE0Jekihj6klQRQ1+SKmLoS+vQ6XS4++676XQ6TZci9cTQl9ZhZmaGU6dOcezYsaZLkXrSdehHxFBEfDMivlDaN0TE1yKiHRGfjYitpX9babfL+M5lz/Hh0v+9iLhlo09G6odOp8OJEyfITE6cOOFqX5tKLyv9aeDpZe1PAvdl5hjwInBn6b8TeLH031fmERE3AncAbwImgU9FxNBrK1/qv5mZGRYXFwG4cOGCq31tKl2FfkTsAP4E+IfSDuAdwOfKlBng9nK8r7Qp47vL/H3Ag5l5PjN/ALSBmzbiJKR+evzxx1lYWABgYWGB2dnZhiuSutftSv9vgb8CFku7BfwsMxdK+3lgtByPAqcByvhLZf4r/Ss85hURcSAi5iJibn5+vodTkfpjz549DA8vfZh9eHiYiYmJhiuSurdm6EfEnwIvZObJPtRDZh7NzPHMHB8ZWfHWEVKjpqam2LJl6UdnaGiI/fv3N1yR1L1uVvpvB26LiGeBB1m6rHMIuCoiLt67ZwdwphyfAa4HKONvADrL+1d4jLRptFotJicniQgmJydptVpNlyR1bc3Qz8wPZ+aOzNzJ0huxX87M9wFfAd5dpk0Bj5Tj46VNGf9yLv319ePAHWV3zw3ALuDrG3YmUh9NTU3x5je/2VW+Np3XcpfNDwEPRsTHgW8C95f++4HPREQbOMvSLwoy88mIeAh4ClgA7srMC6/h9aXGtFotDh8+3HQZUs9iaRE+mMbHx9NbK0tSbyLiZGaOrzTmJ3IlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqiKEvSRUx9CWpIoa+JFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKrBn6EfG6iPh6RPxnRDwZEX9T+m+IiK9FRDsiPhsRW0v/ttJul/Gdy57rw6X/exFxy+U6KUnSyrpZ6Z8H3pGZfwC8BZiMiJuBTwL3ZeYY8CJwZ5l/J/Bi6b+vzCMibgTuAN4ETAKfioihjTwZSdKlrRn6ueS/S/OK8pXAO4DPlf4Z4PZyvK+0KeO7IyJK/4OZeT4zfwC0gZs25CwkSV3p6pp+RAxFxLeAF4BZ4L+An2XmQpnyPDBajkeB0wBl/CWgtbx/hcdIkvqgq9DPzAuZ+RZgB0ur8zderoIi4kBEzEXE3Pz8/OV6GUmqUk+7dzLzZ8BXgD8CroqI4TK0AzhTjs8A1wOU8TcAneX9Kzxm+WsczczxzBwfGRnppTxJ0hq62b0zEhFXlePXAxPA0yyF/7vLtCngkXJ8vLQp41/OzCz9d5TdPTcAu4Cvb9SJSJLWNrz2FK4DZspOmy3AQ5n5hYh4CngwIj4OfBO4v8y/H/hMRLSBsyzt2CEzn4yIh4CngAXgrsy8
sLGnI0m6lFhahA+m8fHxnJuba7oMSdpUIuJkZo6vNOYnciWpIoa+JFXE0Jekihj60jp0Oh3uvvtuOp1O06VIPTH0pXWYmZnh1KlTHDt2rOlSpJ4Y+lKPOp0Ojz32GJnJY4895mpfm4qhL/VoZmaGhYWl2069/PLLrva1qRj6Uo9mZ2e5+PmWzORLX/pSwxVJ3TP0pR5t3779km1pkBn6Uo9++MMfXrItDTJDX5IqYuhLPbruuusu2ZYGmaEv9eg3t2i6ZVObiaEv9WhiYuJV7b179zZUidQ7Q1/q0dTUFFu3bgVg69at7N+/v+GKpO4Z+lKPWq0Wk5OTRAS33norrVar6ZKkrhn60jrcdtttXHnllbzrXe9quhSpJ4a+tA4PP/wwv/zlL3n44YebLkXqiaEv9ajT6TA7Owss3ZLB3TvaTAx9qUdHjx5lcXERgMXFRY4ePdpwRVL3DH2pR0888cQl29IgM/SlHl28w+ZqbWmQGfpSj3bv3v2q9p49exqqROqdoS/16AMf+ABbtiz96GzZsoUDBw40XJHUPUNf6lGr1XpldT8xMeGHs7SpDDddgDaPI0eO0G63my5jIJw+fZrh4WFOnz7N9PR00+U0bmxsjIMHDzZdhrrgSl9ah/Pnz7Nt2zauuOKKpkuReuJKX11zJfdrF1f3hw4dargSqTeu9CWpIoa+JFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqoihL0kVWTP0I+L6iPhKRDwVEU9GxHTpvyYiZiPimfL96tIfEXE4ItoR8e2IeOuy55oq85+JiKnLd1qSpJV0s9JfAP4yM28EbgbuiogbgXuAJzJzF/BEaQPcCuwqXweAT8PSLwngI8DbgJuAj1z8RSFJ6o81Qz8zf5SZ/1GOfwE8DYwC+4CZMm0GuL0c7wOO5ZKvAldFxHXALcBsZp7NzBeBWWByQ89GknRJPV3Tj4idwB8CXwOuzcwflaEfA9eW41Hg9LKHPV/6VuuXJPVJ16EfEb8D/AvwF5n58+VjufT34jbkb8ZFxIGImIuIufn5+Y14SklS0VXoR8QVLAX+P2Xmv5bun5TLNpTvL5T+M8D1yx6+o/St1v8qmXk0M8czc3xkZKSXc5EkraGb3TsB3A88nZn3Lhs6DlzcgTMFPLKsf3/ZxXMz8FK5DPRFYG9EXF3ewN1b+iRJfdLN/fTfDvwZcCoivlX6/hr4BPBQRNwJPAe8p4w9CrwTaAPngPcDZObZiPgY8I0y76OZeXZDzkKS1JU1Qz8z/x2IVYZ3rzA/gbtWea4HgAd6KVCStHH8RK4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqiKEvSRUx9CWpIoa+JFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SarIcNMFDLojR47QbrebLkMD5uL/ienp6YYr0aAZGxvj4MGDTZexKkN/De12m29952kuXHlN06VogGz5VQJw8vs/abgSDZKhc2ebLmFNhn4XLlx5Df/zxnc2XYakAff67z7adAlr8pq+JFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUkTVDPyIeiIgXIuI7y/quiYjZiHimfL+69EdEHI6IdkR8OyLeuuwxU2X+MxExdXlOR5J0Kd3ccO0fgb8Dji3ruwd4IjM/ERH3lPaHgFuBXeXrbcCngbdFxDXAR4BxIIGTEXE8M1/cqBO5XM6cOcPQuZc2xY2UJDVr6FyHM2cWmi7jktZc6WfmvwG/eb/QfcBMOZ4Bbl/WfyyXfBW4KiKuA24BZjPzbAn6WWByI05AktS99d5a+drM/FE5/jFwbTkeBU4vm/d86Vutf+CNjo7y4/PD3lpZ0ppe/91HGR29du2JDXrNb+RmZrJ0yWZDRMSBiJiLiLn5+fmNelpJEusP/Z+UyzaU7y+U/jPA9cvm7Sh9
q/X/P5l5NDPHM3N8ZGRkneVJklay3tA/DlzcgTMFPLKsf3/ZxXMz8FK5DPRFYG9EXF12+uwtfZKkPlrzmn5E/DPwx8D2iHiepV04nwAeiog7geeA95TpjwLvBNrAOeD9AJl5NiI+BnyjzPtoZg7+H5OUpN8ya4Z+Zr53laHdK8xN4K5VnucB4IGeqpMkbSg/kStJFTH0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkXWe8O1qgydO+utlfUqW/735wAsvu53G65Eg2To3Fl+ff/JwWTor2FsbKzpEjSA2u1fADD2+4P9A65+u3bgM8PQX8PBgwebLkEDaHp6GoBDhw41XInUG6/pS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqiKEvSRUx9CWpIoa+JFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqkjfQz8iJiPiexHRjoh7+v36klSzvoZ+RAwBfw/cCtwIvDcibuxnDZJUs36v9G8C2pn5/cz8FfAgsK/PNUhStfod+qPA6WXt50ufJKkPBu6N3Ig4EBFzETE3Pz/fdDmS9Ful36F/Brh+WXtH6XtFZh7NzPHMHB8ZGelrcZL0267fof8NYFdE3BARW4E7gON9rkGSqjXczxfLzIWI+HPgi8AQ8EBmPtnPGiSpZn0NfYDMfBR4tN+vK0kawDdyJUmXj6EvSRUx9CWpIoa+JFXE0Jekihj6klSRvm/Z1OZ15MgR2u1202UMhIv/DtPT0w1XMhjGxsY4ePBg02WoC670pXXYtm0b58+f5+WXX266FKknrvTVNVdyv3bvvffy+c9/nl27dvHBD36w6XKkrrnSl3rU6XQ4ceIEmcmJEyfodDpNlyR1zdCXejQzM8Pi4iIAFy5c4NixYw1XJHXP0Jd69Pjjj7OwsADAwsICs7OzDVckdc/Ql3q0Z88ehoeX3g4bHh5mYmKi4Yqk7hn6Uo+mpqbYsmXpR2doaIj9+/c3XJHUPUNf6lGr1WJycpKIYHJyklar1XRJUtfcsimtw9TUFM8++6yrfG06hr60Dq1Wi8OHDzddhtQzL+9IUkUMfUmqiKEvSRUx9CWpIpGZTdewqoiYB55rug5pFduBnzZdhLSC38vMkZUGBjr0pUEWEXOZOd50HVIvvLwjSRUx9CWpIoa+tH5Hmy5A6pXX9CWpIq70Jakihr4kVcTQl6SKGPqSVBFDX5Iq8n8KAb3y/Z8O2QAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VtenLK1uK1Pi"
+ },
+ "source": [
+ "Can you spot the outliers in the array?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e3sHuGVGFBdW"
+ },
+ "source": [
+ "## Objective\n",
+ "> Identify the outliers and replace them with the median of the data.\n",
+ "\n",
+ "* How do we do that?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RSegPNKCI-dS"
+ },
+ "source": [
+ "### Follow these steps\n",
+ "1. Compute descriptive statistics before the transformations so that their impact can be assessed;\n",
+ " * Compute the mean, median and standard deviation of the original data;\n",
+ "2. Compute the following values:\n",
+ " * Q1, Q3\n",
+ " * IQR = Q3-Q1\n",
+ " * lim_inferior_outlier = Q1-1.5\\*IQR\n",
+ " * lim_superior_outlier = Q3+1.5\\*IQR\n",
+ "3. Perform the replacement:\n",
+ " * If a_salarios_copia[i] < lim_inferior_outlier then a_salarios_copia[i] = median\n",
+ " * If a_salarios_copia[i] > lim_superior_outlier then a_salarios_copia[i] = median\n",
+ "4. Compute the descriptive statistics after the replacements and compare them with the values from before the transformations."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9DQ7YnWaFn4v"
+ },
+ "source": [
+ "### My solution\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RBXJbTeGLC7Q"
+ },
+ "source": [
+ "1. Descriptive statistics before the transformations:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QueKYn7MLG12",
+ "outputId": "75489f71-3f1e-4819-b5fe-21f134bf2b1e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "# Some descriptive statistics:\n",
+ "f'Média: {np.mean(a_salarios_copia)}; Mediana: {np.median(a_salarios_copia)}; STD: {np.std(a_salarios_copia)}'"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Média: 1057.4744151862524; Mediana: 1048.089607774499; STD: 144.64306489539533'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 35
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oOBJ8INWL5fo"
+ },
+ "source": [
+ "Notice how far our data deviate from the values originally used."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MX-fJeh2MBTD"
+ },
+ "source": [
+ "2. Compute Q1, Q3 and the IQR"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JlsPiQeGMGeU"
+ },
+ "source": [
+ "Q1 = np.percentile(a_salarios_copia, q = [25])\n",
+ "Q3 = np.percentile(a_salarios_copia, q = [75])\n",
+ "Q2 = np.percentile(a_salarios_copia, q = [50])\n",
+ "p99 = np.percentile(a_salarios_copia, q = [99])\n",
+ "p95 = np.percentile(a_salarios_copia, q = [95])\n",
+ "\n",
+ "IQR = Q3-Q1 # Interquartile range\n",
+ "lim_inferior_outlier = Q1-1.5*IQR\n",
+ "lim_superior_outlier = Q3+1.5*IQR"
+ ],
+ "execution_count": 148,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VF2NJ3rCeI1_",
+ "outputId": "34a2097c-334f-472f-9fe0-7198ea827e47",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "f'Q1: {Q1}; Q3: {Q3}; lim_inferior_outlier: {lim_inferior_outlier}; lim_superior_outlier: {lim_superior_outlier}'"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Q1: [974.41]; Q3: [1119.81]; lim_inferior: [756.33]; lim_superior: [1337.89]'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 37
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JjnwJ7HwMxcl"
+ },
+ "source": [
+ "3. Replace\n",
+ "* If a_salarios2[i] < lim_inferior_outlier --> a_salarios2[i] = median\n",
+ "* If a_salarios2[i] > lim_superior_outlier --> a_salarios2[i] = median"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hcAn-IwVfbcI"
+ },
+ "source": [
+ "a_salarios2 = a_salarios_copia.copy()"
+ ],
+ "execution_count": 149,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "M4UJY4vbRics",
+ "outputId": "eb6ac1e0-de5a-4545-f955-c4794c699de2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "Q2[0]"
+ ],
+ "execution_count": 150,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "1035.8622742229763"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 150
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "J3SSE45oM9oh",
+ "outputId": "0396ad3e-13a7-4680-cc05-7c497fa8e754",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "a_salarios2[a_salarios2 < lim_inferior_outlier[0]] = Q2[0] # Assign the median\n",
+ "a_salarios2[a_salarios2 > lim_superior_outlier[0]] = Q2[0] # Assign the median\n",
+ "a_salarios2[:30]"
+ ],
+ "execution_count": 151,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 488.91, 1227.09, 641.47, 575.7 , 1664.34, 1254.27, 931.62,\n",
+ " 1550.25, 1242.86, 794.05, 924.31, 1184.53, 723.31, 46.76,\n",
+ " 1035.86, 1323.31, 1480.29, 754.29, 1105.24, 1254.35, 930.3 ,\n",
+ " 889.64, 954.45, 716.47, 848.33, 925.16, 1021.7 , 978.52,\n",
+ " 588.65, 1682.72])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 151
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VEGFio0Nfj7O"
+ },
+ "source": [
+ "4. Descriptive statistics to assess the impact of the changes on the sample:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gX1LZHFqfjFQ",
+ "outputId": "5b72b7fc-d5ba-4692-c1a7-b3394004004e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "# Some descriptive statistics - after the outlier treatment:\n",
+ "f'Média: {np.mean(a_salarios2)}; Mediana: {np.median(a_salarios2)}; STD: {np.std(a_salarios2)}'"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Média: 1047.3019702056902; Mediana: 1048.089607774499; STD: 98.3265929249586'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 43
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cSXrg2PFSYKY",
+ "outputId": "fbde7b45-c3b6-4533-cfa8-34ee1cbc40b7",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "# Some descriptive statistics - before the outlier treatment:\n",
+ "f'Média: {np.mean(a_salarios)}; Mediana: {np.median(a_salarios)}; STD: {np.std(a_salarios)}'"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'Média: 1047.150212238584; Mediana: 1047.631166829137; STD: 101.18708333868835'"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 44
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZVc6_nsGS_J2"
+ },
+ "source": [
+ "### Exercise: Redo the replacement and discuss with your group what changes when we replace:\n",
+ "* Q2[0] with the mean.\n",
+ "* Q2[0] with the 95th and the 99th percentile."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-xnguZ7XgyvK",
+ "outputId": "48e05f4e-1b0a-4b6f-97f6-de2d49205697",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 269
+ }
+ },
+ "source": [
+ "# Import the seaborn library:\n",
+ "import seaborn as sns\n",
+ "sns.boxplot(y = a_salarios2)"
+ ],
+ "execution_count": 152,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 152
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAADrCAYAAACFMUa7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAL6UlEQVR4nO3dX6ik9X3H8fcna9Nu+geVPV3kKF3bsxDsRa0cVGgvLIL/brQ3ohd1EWF7ocsp9KK2N5aEgDdt0W0qbMkShTYitMG9WGoXKYRe2HosYjQaHGzEPah7kg0msDZB8+3FPlsn6Tl7/uzZmTHf9wuGmfk9z8x8R/R9hmeec0xVIUnq4TPTHkCSNDlGX5IaMfqS1IjRl6RGjL4kNWL0JamRS6Y9wPns2bOn9u3bN+0xJOlT5aWXXvpuVc2ttW2mo79v3z6Wl5enPYYkfaokeXu9bR7ekaRGjL4kNWL0JakRoy9JjRh9SWrE6EtSI0ZfkhqZ6fP0NVsOHz7MaDSa9hgzYWVlBYD5+fkpTzIbFhYWOHTo0LTH0CYYfWkbPvzww2mPIG2L0dem+UnuE0tLSwA89thjU55E2hqP6UtSI0Zfkhox+pLUiNGXpEaMviQ1YvQlqRGjL0mNGH1JasToS1IjRl+SGjH6ktSI0ZekRoy+JDVi9CWpEaMvSY0YfUlqxOhLUiNGX5Ia2TD6Sa5K8m9JvpXktSRLw/rlSU4keXO4vmxYT5LHk4ySvJLkurHnOjDs/2aSAxfvbUmS1rKZT/ofAX9aVdcANwIPJrkGeBh4vqr2A88P9wFuB/YPl4PAE3D2hwTwCHADcD3wyLkfFJKkydgw+lX1blX913D7h8DrwDxwJ/DksNuTwF3D7TuBp+qsF4BLk1wB3AqcqKrTVfV94ARw246+G0nSeW3pmH6SfcDvAv8B7K2qd4dN7wF7h9vzwDtjDzs5rK23LkmakE1HP8mvAP8E/ElV/WB8W1UVUDsxUJKDSZaTLK+uru7EU0qSBpuKfpJf4Gzw/6Gq/nlYfn84bMNwfWpYXwGuGnv4lcPaeus/paqOVNViVS3Ozc1t5b1IkjawmbN3AnwFeL2q/nps0zHg3Bk4B4Bnx9bvG87iuRH4YDgM9BxwS5LLhi9wbxnWJEkTcskm9vk94I+AbyZ5eVj7C+BR4JkkDwBvA3cP244DdwAj4AxwP0BVnU7yReDFYb8vVNXpHXkXkqRN2TD6VfXvQNbZfPMa+xfw4DrPdRQ4upUBJUk7x9/IlaRGjL4kNWL0JakRoy9JjRh9SWrE6EtSI0Zfkhox+pLUiNGXpEaMviQ1YvQlqRGjL0mNGH1JasToS1IjRl+SGjH6ktSI0ZekRoy+JDVi9CWpEaMvSY0YfUlqxOhLUiNGX5IaMfqS1IjRl6RGjL4kNWL0JakRoy9JjRh9SWpkw+gnOZrkVJJXx9b+MslKkpeHyx1j2/48ySjJt5PcOrZ+27A2SvLwzr8VSdJGNvNJ/6vAbWus/01VXTtcjgMkuQa4B/jt4TF/l2RXkl3Al4HbgWuAe4d9JUkTdMlGO1TVN5Ls2+Tz3Qk8XVU/Av47yQi4ftg2qqq3AJI8Pez7rS1PLEnatgs5pv9QkleGwz+XDWvzwDtj+5wc1tZblyRN0Haj/wTwW8C1wLvAX+3UQEkOJllOsry6urpTTytJYpvRr6r3q+rjqvoJ8Pd8cghnBbhqbNcrh7X11td67iNVtVhVi3Nzc9sZT5K0jm1FP8kVY3f/EDh3Zs8x4J4kv5jkamA/8J/Ai8D+JFcn+Sxnv+w9tv2xJUnbseEXuUm+BtwE7ElyEngEuCnJtUAB3wH+GKCqXkvyDGe/oP0IeLCqPh6e5yHgOWAXcLSqXtvxdyNJOq/NnL1z7xrLXznP/l8CvrTG+nHg+JamkyTtKH8jV5IaMfqS1IjRl6RGjL4kNbLhF7ndHT58mNFoNO0xNGPO/Tux
tLQ05Uk0axYWFjh06NC0x1iX0d/AaDTi5Vdf5+PPXT7tUTRDPvPjAuClt96f8iSaJbvOnJ72CBsy+pvw8ecu58PP37HxjpJa2/3G7J+V7jF9SWrE6EtSI0Zfkhox+pLUiNGXpEaMviQ1YvQlqRGjL0mNGH1JasToS1IjRl+SGjH6ktSI0ZekRoy+JDVi9CWpEaMvSY0YfUlqxOhLUiNGX5IaMfqS1IjRl6RGjL4kNXLJtAeYdSsrK+w68wG73zg+7VEkzbhdZ77HyspH0x7jvDb8pJ/kaJJTSV4dW7s8yYkkbw7Xlw3rSfJ4klGSV5JcN/aYA8P+byY5cHHejiTpfDbzSf+rwN8CT42tPQw8X1WPJnl4uP9nwO3A/uFyA/AEcEOSy4FHgEWggJeSHKuq7+/UG7lY5ufnee9Hl/Dh5++Y9iiSZtzuN44zP7932mOc14af9KvqG8Dpn1m+E3hyuP0kcNfY+lN11gvApUmuAG4FTlTV6SH0J4DbduINSJI2b7tf5O6tqneH2+8B5360zQPvjO13clhbb/3/SXIwyXKS5dXV1W2OJ0laywWfvVNVxdlDNjuiqo5U1WJVLc7Nze3U00qS2H703x8O2zBcnxrWV4Crxva7clhbb12SNEHbjf4x4NwZOAeAZ8fW7xvO4rkR+GA4DPQccEuSy4YzfW4Z1iRJE7Th2TtJvgbcBOxJcpKzZ+E8CjyT5AHgbeDuYffjwB3ACDgD3A9QVaeTfBF4cdjvC1X1s18OS5Iusg2jX1X3rrPp5jX2LeDBdZ7nKHB0S9NJknaUf4ZBkhox+pLUiNGXpEaMviQ1YvQlqRGjL0mNGH1JasToS1IjRl+SGjH6ktSI0ZekRoy+JDVi9CWpEaMvSY0YfUlqZMO/py/YdeY0u984Pu0xNEM+8z8/AOAnv/RrU55Es2TXmdPA3mmPcV5GfwMLCwvTHkEzaDT6IQALvznb/4Fr0vbOfDOM/gYOHTo07RE0g5aWlgB47LHHpjyJtDUe05ekRoy+JDVi9CWpEaMvSY0YfUlqxOhLUiNGX5IaMfqS1IjRl6RGjL4kNWL0JamRC4p+ku8k+WaSl5MsD2uXJzmR5M3h+rJhPUkeTzJK8kqS63biDUiSNm8nPun/QVVdW1WLw/2Hgeeraj/w/HAf4HZg/3A5CDyxA68tSdqCi3F4507gyeH2k8BdY+tP1VkvAJcmueIivL4kaR0XGv0C/jXJS0kODmt7q+rd4fZ7fPJ/FJgH3hl77Mlh7ackOZhkOcny6urqBY4nSRp3oX9P//eraiXJrwMnkrwxvrGqKklt5Qmr6ghwBGBxcXFLj5Uknd8FfdKvqpXh+hTwdeB64P1zh22G61PD7ivAVWMPv3JYkyRNyLajn+SXk/zqudvALcCrwDHgwLDbAeDZ4fYx4L7hLJ4bgQ/GDgNJkibgQg7v7AW+nuTc8/xjVf1LkheBZ5I8ALwN3D3sfxy4AxgBZ4D7L+C1JUnbsO3oV9VbwO+ssf494OY11gt4cLuvJ0m6cP5GriQ1YvQlqRGjL0mNGH1JasToS1IjRl+SGjH6ktSI0ZekRoy+JDVi9CWpEaMvSY0YfUlqxOhLUiNGX5IaMfqS1IjRl6RGjL4kNWL0JakRoy9JjRh9SWrE6EtSI0Zfkhox+pLUiNGXpEaMviQ1YvQlqRGjL0mNGH1JasToS1IjE49+ktuSfDvJKMnDk359SepsotFPsgv4MnA7cA1wb5JrJjmDJHU26U/61wOjqnqrqn4MPA3cOeEZJKmtSUd/Hnhn7P7JYU2SNAEz90VukoNJlpMsr66uTnscSfq5MunorwBXjd2/clj7P1V1pKoWq2pxbm5uosNJ0s+7SUf/RWB/kquTfBa4Bzg24Rkkqa1LJvliVfVRkoeA54BdwNGqem2SM0hSZxONPkBVHQeOT/p1JUkz+EWuJOniMfqS1IjRl6RGjL4kNWL0JakRoy9JjRh9SWrE6EtSI0Zfkhox+pLUiNGX
pEaMviQ1YvQlqRGjL0mNTPxPK+vT6/Dhw4xGo2mPMRPO/XNYWlqa8iSzYWFhgUOHDk17DG2C0Ze2Yffu3dMeQdoWo69N85Oc9OnnMX1JasToS1IjRl+SGjH6ktSI0ZekRoy+JDVi9CWpEaMvSY2kqqY9w7qSrAJvT3sOaR17gO9OewhpDb9RVXNrbZjp6EuzLMlyVS1Oew5pKzy8I0mNGH1JasToS9t3ZNoDSFvlMX1JasRP+pLUiNGXpEaMviQ1YvQlqRGjL0mN/C+FC6tQUSJSIgAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uEPFcBjFhETQ"
+ },
+ "source": [
+ "As you can see, the outliers are gone, as intended."
+ ]
+ },
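+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since this notebook is about functions, the four steps above can be wrapped in a single reusable function. The cell below is only a minimal sketch: the function name `replace_outliers_with_median`, its local variable names and the five-value sample are hypothetical and not part of the original solution. It computes the limits once and replaces both tails with the median through a single boolean mask."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "def replace_outliers_with_median(a_dados):\n",
+ "    \"\"\"Return a copy of a_dados with IQR outliers replaced by the median.\"\"\"\n",
+ "    a_saida = np.asarray(a_dados, dtype=float).copy()\n",
+ "    q1, mediana, q3 = np.percentile(a_saida, [25, 50, 75])\n",
+ "    iqr = q3 - q1\n",
+ "    lim_inferior = q1 - 1.5 * iqr\n",
+ "    lim_superior = q3 + 1.5 * iqr\n",
+ "    # True wherever a value falls outside the two limits\n",
+ "    mascara = (a_saida < lim_inferior) | (a_saida > lim_superior)\n",
+ "    a_saida[mascara] = mediana\n",
+ "    return a_saida\n",
+ "\n",
+ "# Made-up sample: 500.0 lies far above the other values and is replaced by the median (12.0)\n",
+ "replace_outliers_with_median([10.0, 11.0, 12.0, 13.0, 500.0])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },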
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tHfzjW_ymKuR"
+ },
+ "source": [
+ "___\n",
+ "# **Unique values**\n",
+ "> Consider the a_idades array below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HzmQgWZVmUUD",
+ "outputId": "1a594c37-8240-415c-ac4f-a578bee17113",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_idades = np.random.randint(18, 100, 100)\n",
+ "a_idades"
+ ],
+ "execution_count": 153,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([78, 60, 58, 26, 45, 20, 64, 99, 98, 31, 48, 81, 97, 90, 31, 85, 51,\n",
+ " 91, 95, 60, 73, 63, 59, 39, 40, 26, 80, 28, 18, 33, 27, 85, 53, 60,\n",
+ " 26, 44, 23, 86, 92, 75, 58, 40, 63, 24, 99, 18, 43, 68, 98, 94, 47,\n",
+ " 25, 39, 23, 70, 49, 96, 79, 68, 68, 25, 59, 21, 51, 65, 23, 34, 51,\n",
+ " 37, 78, 74, 73, 71, 46, 34, 45, 40, 56, 67, 31, 22, 43, 65, 64, 36,\n",
+ " 76, 19, 82, 75, 35, 38, 68, 43, 73, 91, 92, 61, 37, 73, 72])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 153
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Dm9ky1F1mrNA"
+ },
+ "source": [
+ "What are the unique values in the array?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "G-LPRqc-mS5j",
+ "outputId": "6f8e357e-066d-42f3-a09c-8733ce7fa833",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 85
+ }
+ },
+ "source": [
+ "np.unique(a_idades)"
+ ],
+ "execution_count": 154,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 31, 33, 34, 35, 36, 37,\n",
+ " 38, 39, 40, 43, 44, 45, 46, 47, 48, 49, 51, 53, 56, 58, 59, 60, 61,\n",
+ " 63, 64, 65, 67, 68, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82,\n",
+ " 85, 86, 90, 91, 92, 94, 95, 96, 97, 98, 99])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 154
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uXZZoTd6nMuq"
+ },
+ "source": [
+ "___\n",
+ "# **Difference between two arrays**\n",
+ "> The result is an array with the **unique values of A that are not in B**. In set theory we write $A - B = A - (A \\cap B)$.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Python Set](https://www.learnbyexample.org/python-set/)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uW6i3m9q1ZNs"
+ },
+ "source": [
+ "\n",
+ "* Let's see how this works in practice:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vw05sfe22mfk"
+ },
+ "source": [
+ "## Example 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Qqw2do90nQ7k"
+ },
+ "source": [
+ "a_conjunto1 = np.array([0, 1, 2, 4, 5, 7, 8, 8])\n",
+ "# Values to be excluded from a_conjunto1. Note that 3 and 6 do not belong to a_conjunto1.\n",
+ "a_conjunto2 = np.array([1, 6, 7, 3])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zXJ00pOMorM-",
+ "outputId": "c3108557-ad55-45cc-f707-3af35cf456c1",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "np.setdiff1d(a_conjunto1, a_conjunto2)"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 2, 4, 5, 8])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 50
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8GXZNgjfo8lO"
+ },
+ "source": [
+ "Note that the result contains the elements of a_conjunto1 that do not belong to a_conjunto2. But what happens to the '3'?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aJSu6VKb2oc_"
+ },
+ "source": [
+ "## Example 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "N1wahElXTqoB"
+ },
+ "source": [
+ "a_conjunto1 = np.arange(10)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nxDpCMg7T7Rj"
+ },
+ "source": [
+ "a_conjunto2 = np.array([1, 5, 7])\n",
+ "a_conjunto2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3LU3qYyiUXqm"
+ },
+ "source": [
+ "np.setdiff1d(a_conjunto1, a_conjunto2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mzZEytrRUioU"
+ },
+ "source": [
+ "Note that the result is a_conjunto1 without the elements of a_conjunto2; a_conjunto1 itself is not modified."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gJRcoVRUnaY9"
+ },
+ "source": [
+ "___\n",
+ "# Symmetric Difference\n",
+ "* In set theory, this is called the symmetric difference, written $(A \\cup B) - (A \\cap B)$.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Python Set](https://www.learnbyexample.org/python-set/)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2Uzzm85Kup3H"
+ },
+ "source": [
+ "* Let's see how this works in practice:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1z5wZ8VwpsWN"
+ },
+ "source": [
+ "import numpy as np\n",
+ "a_conjunto1 = np.array([0, 1, 2, 4, 5, 7, 8])\n",
+ "# Note that 1, 4 and 7 belong to a_conjunto1, but 3 does not:\n",
+ "a_conjunto2 = np.array([1, 4, 7, 3])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Tqd_9XO5p7bo",
+ "outputId": "d7670965-e38f-40a1-9864-8ec850143245",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "np.setxor1d(a_conjunto1, a_conjunto2)"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 2, 3, 5, 8])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 52
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_meurG3mqS5Y"
+ },
+ "source": [
+ "How do we explain or interpret this result?"
+ ]
+ },
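+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One possible way to read the result: the symmetric difference keeps the values that appear in exactly one of the two arrays. The sketch below rebuilds it from `np.setdiff1d` and `np.union1d` and compares it with `np.setxor1d`. The names `a_only_in_1`, `a_only_in_2` and `a_sym_diff` are hypothetical helpers introduced only for this illustration."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "a_conjunto1 = np.array([0, 1, 2, 4, 5, 7, 8])\n",
+ "a_conjunto2 = np.array([1, 4, 7, 3])\n",
+ "\n",
+ "# Values that appear in only one of the two arrays\n",
+ "a_only_in_1 = np.setdiff1d(a_conjunto1, a_conjunto2)\n",
+ "a_only_in_2 = np.setdiff1d(a_conjunto2, a_conjunto1)\n",
+ "\n",
+ "# Their union is the symmetric difference\n",
+ "a_sym_diff = np.union1d(a_only_in_1, a_only_in_2)\n",
+ "\n",
+ "# Compare with NumPy's built-in symmetric difference\n",
+ "np.array_equal(a_sym_diff, np.setxor1d(a_conjunto1, a_conjunto2))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },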
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Kc8JoKe2nj2n"
+ },
+ "source": [
+ "___\n",
+ "# **Union of two arrays**\n",
+ "> Returns the sorted **unique** values that appear in either array. In set theory, we write:\n",
+ "\n",
+ "$$A \\cup B$$\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Python Set](https://www.learnbyexample.org/python-set/)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1LZxorw2p2mg"
+ },
+ "source": [
+ "a_conjunto1 = np.array([0, 1, 2, 4, 5, 7, 8, 8])\n",
+ "\n",
+ "# Note that 1, 4 and 7 belong to a_conjunto1, but 3 does not:\n",
+ "a_conjunto2 = np.array([1, 4, 7, 3])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "COsZEmSwuY5L"
+ },
+ "source": [
+ "np.union1d(a_conjunto1, a_conjunto2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "b53bR-GYRu_3"
+ },
+ "source": [
+ "___\n",
+ "# **Selecting the items common to arrays X and Y**\n",
+ "* In set theory, this is called the intersection, written $X \\cap Y$.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Python Set](https://www.learnbyexample.org/python-set/)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "n2ec2tqqR1Gw"
+ },
+ "source": [
+ "* Consider the arrays below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rXVQQvBqR4J-",
+ "outputId": "c1332edd-af01-45cb-d3e1-c6e3ba30e157",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "a_conjunto1 = np.arange(10)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 53
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pZTHhHxGSRfB",
+ "outputId": "2c93501a-3ed8-4297-d58e-990c529a5a3d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "a_conjunto2 = np.arange(8, 18)\n",
+ "a_conjunto2"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 54
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MxB2_qHpScMB"
+ },
+ "source": [
+ "Which elements are common to a_conjunto1 and a_conjunto2?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "e-rncJHtSfw0",
+ "outputId": "11f0b85d-c634-419a-cc62-e0899f9cef31",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "np.intersect1d(a_conjunto1, a_conjunto2)"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([8, 9])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 55
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3Bb39sWdfqaF"
+ },
+ "source": [
+ "___\n",
+ "# **Eigenvalues and Eigenvectors**\n",
+ "> Eigenvalues and eigenvectors are among the most important topics in Machine Learning.\n",
+ "\n",
+ "By definition, the scalar $\\lambda$ and the nonzero vector $v$ are an eigenvalue and an eigenvector of the matrix $A$ if\n",
+ "\n",
+ "$$Av = \\lambda v$$\n",
+ "\n",
+ "## Further reading:\n",
+ "\n",
+ "* [Machine Learning & Linear Algebra — Eigenvalue and eigenvector](https://medium.com/@jonathan_hui/machine-learning-linear-algebra-eigenvalue-and-eigenvector-f8d0493564c9)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XZBKq8nGCUbL"
+ },
+ "source": [
+ "* `np.linalg.eig` requires a square 2-D array, so we redefine a_conjunto2 as a small square matrix (values chosen only for illustration):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iYlZGKFUfw-R"
+ },
+ "source": [
+ "a_conjunto2 = np.array([[2, 0], [0, 3]])\n",
+ "a_conjunto2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6EfvIbBNf02Z"
+ },
+ "source": [
+ "# Compute the eigenvalues and eigenvectors:\n",
+ "a_autovalores, a_autovetores = np.linalg.eig(a_conjunto2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "v3GtQQvAz9QU"
+ },
+ "source": [
+ "The eigenvalues of the array a_conjunto2 are:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WvZGyBR1f9vP"
+ },
+ "source": [
+ "a_autovalores"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AuuDRJVh0FC8"
+ },
+ "source": [
+ "The eigenvectors of the array a_conjunto2 (one per column) are:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6m4YFAwsf_rA"
+ },
+ "source": [
+ "a_autovetores"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
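+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To connect the computation back to the definition $Av = \\lambda v$, the sketch below checks the equation numerically for each eigenpair. The matrix `a_matriz` and the names `a_vals` and `a_vecs` are hypothetical, chosen only for this illustration. `np.linalg.eig` returns the eigenvectors as the columns of its second output, so column `i` pairs with eigenvalue `i`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Hypothetical symmetric 2x2 matrix, chosen only for illustration\n",
+ "a_matriz = np.array([[2.0, 1.0], [1.0, 2.0]])\n",
+ "a_vals, a_vecs = np.linalg.eig(a_matriz)\n",
+ "\n",
+ "# Column a_vecs[:, i] is the eigenvector paired with a_vals[i]\n",
+ "for i in range(len(a_vals)):\n",
+ "    v = a_vecs[:, i]\n",
+ "    # A v and lambda v should agree up to floating-point error\n",
+ "    print(np.allclose(a_matriz @ v, a_vals[i] * v))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },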
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DASn2Un9ZNV-"
+ },
+ "source": [
+ "___\n",
+ "# **Finding Missing Values (NaN)**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TKilWBsSXtR4"
+ },
+ "source": [
+ "## Generate the example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lqLI2ER_ZUMY"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.random.random(100)\n",
+ "\n",
+ "# Draw 15 random indices and set those positions to NaN (a repeated index yields fewer distinct NaNs):\n",
+ "np.random.seed(20111974)\n",
+ "l_indices_aleatorios= np.random.randint(0, 100, size = 15)\n",
+ "\n",
+ "for i_indices in l_indices_aleatorios:\n",
+ " #print(i_indices)\n",
+ " a_conjunto1[i_indices] = np.nan"
+ ],
+ "execution_count": 155,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gw--poMaadv3",
+ "outputId": "115842f8-f789-4ab2-dda2-17eae01d7e70",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "l_indices_aleatorios.sort()\n",
+ "l_indices_aleatorios"
+ ],
+ "execution_count": 158,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 2, 8, 13, 27, 30, 40, 42, 46, 60, 80, 81, 82, 88, 88, 96])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 158
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2ZkbMPXMawYh",
+ "outputId": "7dc30b68-52c8-474c-c070-da8be55f2bf3",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 187
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": 157,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.53, 0.57, nan, 0.65, 0.86, 0.6 , 0.87, 0.46, nan, 0.64, 0.55,\n",
+ " 0.35, 0.32, nan, 0.85, 0.76, 0.66, 0.33, 0.35, 0.42, 0.31, 0.27,\n",
+ " 0.31, 0.36, 0.6 , 0.02, 0.36, nan, 0.28, 0.37, nan, 0.44, 0.2 ,\n",
+ " 0.21, 0.65, 0.82, 0.72, 0.5 , 0.17, 0.6 , nan, 0.14, nan, 0.71,\n",
+ " 0.07, 0.56, nan, 0.84, 0.21, 0.85, 0.63, 0.38, 0.91, 0.34, 0.07,\n",
+ " 0.1 , 0.85, 0.12, 0.94, 0.16, nan, 0.91, 0.59, 0.37, 0.72, 0.07,\n",
+ " 0.48, 0.78, 0.97, 0.72, 0.29, 0.33, 0.95, 0.24, 0.98, 0.85, 0.63,\n",
+ " 0.57, 0.67, 0.88, nan, nan, nan, 0.68, 0.29, 0.33, 0.98, 0.17,\n",
+ " nan, 0.92, 0.98, 0.76, 0.31, 0.97, 0.08, 0.56, nan, 0.49, 0.07,\n",
+ " 0.11])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 157
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Z7Bs75NvbSjx"
+ },
+ "source": [
+ "We drew 15 random indices, but index 88 came up twice, so a_conjunto1 now holds 14 NaNs. Now let's count them (we already know the answer!)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hL1Wn0vdX8ur"
+ },
+ "source": [
+ "## Identify the NaNs"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5R-n3H0xbd6d",
+ "outputId": "16f7be91-47f2-4c61-dca1-6b55619f7b17",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.isnan(a_conjunto1).sum()"
+ ],
+ "execution_count": 159,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "14"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 159
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Y7hh5uowoa3U"
+ },
+ "source": [
+ "So there are 14 NaNs in a_conjunto1."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iVLQf_bqbyNU"
+ },
+ "source": [
+ "Next, we want the indices of those NaNs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kJHxjZiwb5HM",
+ "outputId": "f9f26416-c184-4db9-b114-6521b5a28a8d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "i_indices = np.where(np.isnan(a_conjunto1))\n",
+ "i_indices"
+ ],
+ "execution_count": 160,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(array([ 2, 8, 13, 27, 30, 40, 42, 46, 60, 80, 81, 82, 88, 96]),)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 160
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "W_jHGNImok7L",
+ "outputId": "703ea6e2-0580-4cfa-9fd0-47531b686215",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Checking...\n",
+ "a_conjunto1[2]"
+ ],
+ "execution_count": 161,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "nan"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 161
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iPhHAhDYcMWO"
+ },
+ "source": [
+ "To check that this is correct, compare these indices with l_indices_aleatorios (printed above)."
+ ]
+ },
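+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The comparison can also be done in code rather than by eye. The cell below is a minimal sketch that assumes `a_conjunto1` and `l_indices_aleatorios` from the cells above are still in memory; the name `a_indices_nan` is hypothetical. Because index 88 was drawn twice, the drawn indices are passed through `np.unique` before comparing."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Assumes a_conjunto1 and l_indices_aleatorios were created by the cells above\n",
+ "a_indices_nan = np.where(np.isnan(a_conjunto1))[0]\n",
+ "\n",
+ "# np.unique sorts the drawn indices and drops the repeated one\n",
+ "np.array_equal(a_indices_nan, np.unique(l_indices_aleatorios))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },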
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gxQYslRCe11G"
+ },
+ "source": [
+ "___\n",
+ "# **Removing NaNs from an array**\n",
+ "> Consider the same array we have just worked with. Now we want to remove the NaNs we identified."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AeBARFqNfNnN",
+ "outputId": "bc4f82a3-0212-452d-c149-10b119903d8b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 191
+ }
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.53, 0.57, nan, 0.65, 0.86, 0.6 , 0.87, 0.46, nan, 0.64, 0.55,\n",
+ " 0.35, 0.32, nan, 0.85, 0.76, 0.66, 0.33, 0.35, 0.42, 0.31, 0.27,\n",
+ " 0.31, 0.36, 0.6 , 0.02, 0.36, nan, 0.28, 0.37, nan, 0.44, 0.2 ,\n",
+ " 0.21, 0.65, 0.82, 0.72, 0.5 , 0.17, 0.6 , nan, 0.14, nan, 0.71,\n",
+ " 0.07, 0.56, nan, 0.84, 0.21, 0.85, 0.63, 0.38, 0.91, 0.34, 0.07,\n",
+ " 0.1 , 0.85, 0.12, 0.94, 0.16, nan, 0.91, 0.59, 0.37, 0.72, 0.07,\n",
+ " 0.48, 0.78, 0.97, 0.72, 0.29, 0.33, 0.95, 0.24, 0.98, 0.85, 0.63,\n",
+ " 0.57, 0.67, 0.88, nan, nan, nan, 0.68, 0.29, 0.33, 0.98, 0.17,\n",
+ " nan, 0.92, 0.98, 0.76, 0.31, 0.97, 0.08, 0.56, nan, 0.49, 0.07,\n",
+ " 0.11])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 66
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ck1w6_Tvb72M",
+ "outputId": "c9f3469a-5fcb-4794-882b-9a882871061f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 225
+ }
+ },
+ "source": [
+ "np.isnan(a_conjunto1)"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([False, False, True, False, False, False, False, False, True,\n",
+ " False, False, False, False, True, False, False, False, False,\n",
+ " False, False, False, False, False, False, False, False, False,\n",
+ " True, False, False, True, False, False, False, False, False,\n",
+ " False, False, False, False, True, False, True, False, False,\n",
+ " False, True, False, False, False, False, False, False, False,\n",
+ " False, False, False, False, False, False, True, False, False,\n",
+ " False, False, False, False, False, False, False, False, False,\n",
+ " False, False, False, False, False, False, False, False, True,\n",
+ " True, True, False, False, False, False, False, True, False,\n",
+ " False, False, False, False, False, False, True, False, False,\n",
+ " False])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 67
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "e497B492fFru",
+ "outputId": "03020338-a360-4f1f-b025-838b0738509d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 156
+ }
+ },
+ "source": [
+ "a_conjunto1[~np.isnan(a_conjunto1)]"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.53, 0.57, 0.65, 0.86, 0.6 , 0.87, 0.46, 0.64, 0.55, 0.35, 0.32,\n",
+ " 0.85, 0.76, 0.66, 0.33, 0.35, 0.42, 0.31, 0.27, 0.31, 0.36, 0.6 ,\n",
+ " 0.02, 0.36, 0.28, 0.37, 0.44, 0.2 , 0.21, 0.65, 0.82, 0.72, 0.5 ,\n",
+ " 0.17, 0.6 , 0.14, 0.71, 0.07, 0.56, 0.84, 0.21, 0.85, 0.63, 0.38,\n",
+ " 0.91, 0.34, 0.07, 0.1 , 0.85, 0.12, 0.94, 0.16, 0.91, 0.59, 0.37,\n",
+ " 0.72, 0.07, 0.48, 0.78, 0.97, 0.72, 0.29, 0.33, 0.95, 0.24, 0.98,\n",
+ " 0.85, 0.63, 0.57, 0.67, 0.88, 0.68, 0.29, 0.33, 0.98, 0.17, 0.92,\n",
+ " 0.98, 0.76, 0.31, 0.97, 0.08, 0.56, 0.49, 0.07, 0.11])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 68
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RpvKfJU_fmA6"
+ },
+ "source": [
+ "Note that the NaNs have been removed."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "60-l91_ccJxt"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kywe-SmtcLpF"
+ },
+ "source": [
+ "### **Exercise**: Assign the median to the NaN values in the sample."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_Dv8MmNYg8zN"
+ },
+ "source": [
+ "___\n",
+ "# **Convert a list to an array**\n",
+ "> Consider the following list:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "but6T9dVhFYb"
+ },
+ "source": [
+ "l_Lista = [np.random.randint(0, 10, 10)]\n",
+ "l_Lista"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xytj4Eo4hTh9"
+ },
+ "source": [
+ "type(l_Lista)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qrINdcruhWcH"
+ },
+ "source": [
+ "Converting my list to an array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RoSyaX0OhZSE"
+ },
+ "source": [
+ "a_conjunto = np.asarray(l_Lista)\n",
+ "a_conjunto"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dMjTdbBUhlrk"
+ },
+ "source": [
+ "type(a_conjunto)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Mbm3ZP9DhxDI"
+ },
+ "source": [
+ "___\n",
+ "# **Convert a tuple to an array**\n",
+ "> Consider the following tuple:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cZxEFYLAh3S_",
+ "outputId": "701203e5-2a45-4dd8-d9fd-ae4fe665ce7d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "t_numeros = ([np.random.randint(0, 10, 3)], [np.random.randint(0, 10, 3)], [np.random.randint(0, 10, 3)])\n",
+ "t_numeros"
+ ],
+ "execution_count": 162,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "([array([8, 8, 2])], [array([8, 9, 1])], [array([8, 0, 4])])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 162
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vlTXUJviiAml",
+ "outputId": "77a7e854-37de-425e-9cd5-ccfe50702954",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "type(t_numeros)"
+ ],
+ "execution_count": 163,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "tuple"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 163
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yEaOlq8oh3oh",
+ "outputId": "287e061f-2a9f-461b-831d-b3844e301e7a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "a_conjunto = np.asarray(t_numeros)\n",
+ "a_conjunto"
+ ],
+ "execution_count": 164,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[[8, 8, 2]],\n",
+ "\n",
+ " [[8, 9, 1]],\n",
+ "\n",
+ " [[8, 0, 4]]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 164
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PSgQDmRWh3g5",
+ "outputId": "bdcf61ef-a4f5-40b2-def3-e1affcc58446",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "type(a_conjunto)"
+ ],
+ "execution_count": 165,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "numpy.ndarray"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 165
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pH-Ht6yMiqJN"
+ },
+ "source": [
+ "___\n",
+ "# **Append elements to an array**\n",
+ "> Consider the following array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dFaDZInZiwoo",
+ "outputId": "58f1f504-476b-4641-d24a-4ababdc366c5",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_conjunto1 = np.arange(5)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 166,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 166
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "d3zrlf_Ci73Z",
+ "outputId": "0329a4e8-ad8e-4faa-b200-8382e7ecda3b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.append(a_conjunto1, [np.random.randint(0, 10, 3), np.random.randint(0, 10, 3), np.random.randint(0, 10, 3)])\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 167,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4, 8, 8, 2, 8, 9, 1, 8, 0, 4])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 167
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eFRhtk13ojqA"
+ },
+ "source": [
+ "___\n",
+ "# **Convert 1D arrays into a 2D array**\n",
+ "> Consider the following arrays:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wYhBgW9Zu6ZP"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.array(np.random.randint(0, 10, 6))\n",
+ "\n",
+ "np.random.seed(19741120)\n",
+ "a_conjunto2 = np.array(np.random.randint(0, 10, 6))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "febs9AUHvs6n"
+ },
+ "source": [
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "C9OEd-iavvBm"
+ },
+ "source": [
+ "a_conjunto2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KJWjtaWKv0MJ"
+ },
+ "source": [
+ "np.column_stack((a_conjunto1, a_conjunto2)) # Note the parentheses in (a_conjunto1, a_conjunto2): column_stack takes a single tuple of arrays."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xr_WZXJ7pi2D"
+ },
+ "source": [
+ "___\n",
+ "# **Delete specific elements from an array using indices**\n",
+ "> Consider the following array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tS0ZzOs8w0dw",
+ "outputId": "92cb94a2-f2ac-4717-a4fd-b6cd3cc86f74",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.array(np.random.randint(0, 10, 6))\n",
+ "print(a_conjunto1)"
+ ],
+ "execution_count": 171,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "[8 8 2 8 9 1]\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7bOJiKDKxEsC"
+ },
+ "source": [
+ "Suppose I want to remove the values '8' from a_conjunto1. The indices of the values '8' are: [0, 1, 3]. Therefore, we have:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SSjueEvjxTJO"
+ },
+ "source": [
+ "a_conjunto1 = np.delete(a_conjunto1, [0, 1, 3])\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pfqzCfhHaxsL",
+ "outputId": "17d563dd-3c04-40b2-cc4c-aa5f280bd690",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Alternative way to remove the value 8 from the array\n",
+ "np.random.seed(20113374)\n",
+ "a_conjunto1 = np.array(np.random.randint(0, 10, 6))\n",
+ "print(a_conjunto1)\n",
+ "\n",
+ "# Store the indices of the elements equal to 8\n",
+ "i_indice_excl = np.where(a_conjunto1==8)\n",
+ "\n",
+ "# Remove the elements at those indices\n",
+ "a_conjunto1 = np.delete(a_conjunto1, i_indice_excl)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": 174,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "[3 3 8 4 5 8]\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([3, 3, 4, 5])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 174
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mZkGZ2Rgp--5"
+ },
+ "source": [
+ "___\n",
+ "# **Frequency of the unique values in an array**\n",
+ "> Consider the following array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Z2BWKfH0xvQ8",
+ "outputId": "169b34a0-56d6-48ea-cffe-fbd7a757372c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 104
+ }
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.array(np.random.randint(0, 10, 100))\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([8, 8, 2, 8, 9, 1, 8, 0, 4, 2, 0, 8, 9, 3, 7, 1, 3, 2, 9, 7, 7, 9,\n",
+ " 5, 6, 8, 7, 0, 9, 3, 9, 3, 1, 8, 6, 3, 5, 4, 1, 2, 9, 8, 6, 6, 1,\n",
+ " 0, 9, 2, 0, 7, 5, 5, 4, 4, 2, 7, 2, 7, 9, 3, 1, 5, 0, 1, 2, 3, 8,\n",
+ " 7, 5, 4, 0, 5, 9, 6, 6, 1, 3, 6, 0, 4, 9, 2, 1, 0, 9, 1, 4, 2, 9,\n",
+ " 7, 9, 5, 3, 7, 6, 3, 9, 8, 4, 3, 0])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 69
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "s_tdQBsax4rQ"
+ },
+ "source": [
+ "Suppose I want to know how many times the number/element '2' appears in a_conjunto1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JhsN15T5cm55"
+ },
+ "source": [
+ "a = np.unique(a_conjunto1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6yIlk7pWyAtf",
+ "outputId": "31d8d842-d6c2-4955-a3ed-76ec3badac9a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "l_itens_unicos, i_count = np.unique(a_conjunto1, return_counts = True)\n",
+ "l_itens_unicos"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 70
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DyvrIwS9yZIR"
+ },
+ "source": [
+ "What does the output above mean?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uO-MPMhXyV9H",
+ "outputId": "0cb620a1-d0ac-46b5-f379-9b0b563fdd71",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ }
+ },
+ "source": [
+ "i_count"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([10, 10, 10, 11, 8, 8, 8, 10, 10, 15])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 71
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zwoezXrPyofK"
+ },
+ "source": [
+ "How should we interpret the output above?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HgYycSG7yr5e",
+ "outputId": "2091104b-45db-4d13-d65e-80d8af3acd7f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 52
+ }
+ },
+ "source": [
+ "np.asarray((l_itens_unicos, i_count))"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],\n",
+ " [10, 10, 10, 11, 8, 8, 8, 10, 10, 15]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 72
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SwIZiJAiy06T"
+ },
+ "source": [
+ "How should we interpret the output above?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JpNRpN2Dql3N"
+ },
+ "source": [
+ "___\n",
+ "# **Possible combinations of the elements of other arrays**\n",
+ "> Consider the following example:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BUr89dH4zLXD"
+ },
+ "source": [
+ "a_conjunto1 = [2, 4, 6]\n",
+ "a_conjunto2 = [0, 8]\n",
+ "a_conjunto4 = [1, 5]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cEZH6l-Czx7y"
+ },
+ "source": [
+ "np.meshgrid(a_conjunto1, a_conjunto2, a_conjunto4)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "btvmDkEcz0tH"
+ },
+ "source": [
+ "np.array(np.meshgrid(a_conjunto1, a_conjunto2, a_conjunto4))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Z0xhO7rGz059"
+ },
+ "source": [
+ "np.array(np.meshgrid(a_conjunto1, a_conjunto2, a_conjunto4)).T"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eMv4lFnD0Enn"
+ },
+ "source": [
+ "# Final result\n",
+ "a_conjunto3 = np.array(np.meshgrid(a_conjunto1, a_conjunto2, a_conjunto4)).T.reshape(-1,3)\n",
+ "a_conjunto3"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Rz80YANfAh2k"
+ },
+ "source": [
+ "___\n",
+ "# **Wrap Up**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_cyhMsAVXxGC"
+ },
+ "source": [
+ "___\n",
+ "# **Exercises**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kNjovMw3uJ3R"
+ },
+ "source": [
+ "## Exercise 1 - Select the even numbers\n",
+ "> Given the 1D array below, select only the even numbers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bRA8nKHBbPAD",
+ "outputId": "849729a5-2223-4118-d34d-e2529f2bce9f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 85
+ }
+ },
+ "source": [
+ "# Create an array with 20 random integers between 2 and 99\n",
+ "a_cj1 = np.random.randint(2,100,20)\n",
+ "a_cj1.sort()\n",
+ "print(a_cj1)\n",
+ "# Get the indices of the even values\n",
+ "i_ind_pares = np.where(a_cj1 % 2 ==0)\n",
+ "print(i_ind_pares)\n",
+ "# Print the even values\n",
+ "print(a_cj1[i_ind_pares])\n",
+ "# Doing it in a single step\n",
+ "print(a_cj1[np.where(a_cj1 % 2 ==0)])"
+ ],
+ "execution_count": 178,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "[ 3 7 8 13 14 20 20 26 32 33 39 42 43 46 57 57 61 67 85 92]\n",
+ "(array([ 2, 4, 5, 6, 7, 8, 11, 13, 19]),)\n",
+ "[ 8 14 20 20 26 32 42 46 92]\n",
+ "[ 8 14 20 20 26 32 42 46 92]\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "isDzQjwjBX3V"
+ },
+ "source": [
+ "a_conjunto1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Kq1zt-uO1HXv"
+ },
+ "source": [
+ "### **My solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YFmK_n2M1Ks9"
+ },
+ "source": [
+ "a_conjunto1[a_conjunto1 % 2 == 0]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sScYG0hp05vb"
+ },
+ "source": [
+ "___\n",
+ "## Exercise 2 - Replace with the median\n",
+ "> Given the 1D array below, replace the even numbers with the median of a_conjunto1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "g-Yq251IcQF9",
+ "outputId": "2e0e249e-9b4b-4178-d092-47bf85ff9cb6",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "# Create an array with 20 random integers between 2 and 99\n",
+ "a_cj1 = np.random.randint(2,100,20)\n",
+ "a_cj1.sort()\n",
+ "print(a_cj1)\n",
+ "# Get the indices of the even values\n",
+ "i_ind_pares = np.where(a_cj1 % 2 ==0)\n",
+ "print(i_ind_pares)\n",
+ "\n",
+ "# Compute the median that will replace the even values\n",
+ "i_mediana = np.median(a_cj1)\n",
+ "print(f'Mediana {i_mediana}')\n",
+ "\n",
+ "#a_cj1[i_ind_pares] = i_mediana\n",
+ "\n",
+ "# Print the elements at the indices of the even values\n",
+ "#print(a_cj1[i_ind_pares])\n",
+ "#print(a_cj1)\n",
+ "\n",
+ "# Doing it in a single step (a_cj1 is an integer array, so the median is truncated on assignment)\n",
+ "a_cj1[np.where(a_cj1 % 2 ==0)] = np.median(a_cj1)\n",
+ "print(a_cj1[i_ind_pares])\n",
+ "print(a_cj1)\n"
+ ],
+ "execution_count": 182,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "[ 2 11 11 24 27 30 33 39 42 44 53 67 67 73 76 83 83 85 89 90]\n",
+ "(array([ 0, 3, 5, 8, 9, 14, 19]),)\n",
+ "Mediana 48.5\n",
+ "[48 48 48 48 48 48 48]\n",
+ "[48 11 11 48 27 48 33 39 48 48 53 67 67 73 48 83 83 85 89 48]\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XLZ-DIWU1WFs"
+ },
+ "source": [
+ "a_conjunto1 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9c4QWJno1WVB"
+ },
+ "source": [
+ "### **My solution**\n",
+ "* First, we need to compute the median.\n",
+ "* Then, we replace the even values of a_conjunto1 with the median computed in the previous step."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rx7NGAO01Wfb"
+ },
+ "source": [
+ "a_conjunto1[a_conjunto1 % 2 == 0] = np.median(a_conjunto1)\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2c_AphX82qp8"
+ },
+ "source": [
+ "Checking..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9kVta0Cr13Z9"
+ },
+ "source": [
+ "f'The median of a_conjunto1 is: {np.median(a_conjunto1)}'"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L9O-Hf5x26TY"
+ },
+ "source": [
+ "___\n",
+ "## Exercise 3 - Reshape\n",
+ "> Given the 1D array below, reshape it into a 2D array with 3 columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0_laUvtB4Wl-"
+ },
+ "source": [
+ "# Define seed\n",
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.array(np.random.randint(1, 10, size = 15))\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dKzEX8TK5b4Z"
+ },
+ "source": [
+ "### **My solution**\n",
+ "* The 1D array a_conjunto1 above has 15 elements. Since we want to turn it into a 2D array with 3 columns, each column will have 5 elements (that is, the result has 5 rows)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I-j5yVD04249"
+ },
+ "source": [
+ "a_conjunto1.reshape(5, 3) \n",
+ "# This could also be a_conjunto1.reshape(-1, 3), where \"-1\" asks NumPy to compute the number of rows. "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "F1vfS8jE6L0_"
+ },
+ "source": [
+ "___\n",
+ "## Exercise 4 - Reshape\n",
+ "> Given the 1D array below, reshape it into a 3D array with 2 columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xcN-bez56L1D"
+ },
+ "source": [
+ "# Define seed\n",
+ "np.random.seed(20111974)\n",
+ "a_conjunto1 = np.array(np.random.randint(1, 10, size = 16))\n",
+ "a_conjunto1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7iICnOyG6fcj"
+ },
+ "source": [
+ "### **My solution**\n",
+ "* The 1D array a_conjunto1 above has 16 elements. We want to turn it into a 3D array with 2 columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vdq5ybuD6fcn"
+ },
+ "source": [
+ "a_conjunto1.reshape(2, -1, 2) # 3D result with shape (2, 4, 2): 2 blocks of 2 columns each; \"-1\" asks NumPy to compute the number of rows per block automatically."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "haQfWPcCs_H0"
+ },
+ "source": [
+ "## Exercise 5\n",
+ "For more exercises involving arrays, visit the page [Python: Array Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/array/)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LQQL0JS2tnc0"
+ },
+ "source": [
+ "## Exercise 6\n",
+ "For more exercises involving math, visit the page [Python Math: - Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/math/index.php)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qNskKFy9t4D5"
+ },
+ "source": [
+ "## Exercise 7\n",
+ "For more exercises involving NumPy in general, visit the page [NumPy Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/numpy/index.php)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qqc1AiHXuKZ5"
+ },
+ "source": [
+ "## Exercise 8\n"
+ ]
+ }
+ ]
+}
\ No newline at end of file
From cb93ec3045e84eaaaf3a7069c63d5a2e4e6a3156 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Fri, 16 Oct 2020 15:02:53 -0300
Subject: [PATCH 08/21] Criado usando o Colaboratory
---
Notebooks/NB10_01__Pandas_Fifa.ipynb | 55 +++++++++++++++++++++++++---
1 file changed, 50 insertions(+), 5 deletions(-)
diff --git a/Notebooks/NB10_01__Pandas_Fifa.ipynb b/Notebooks/NB10_01__Pandas_Fifa.ipynb
index e3813afea..9ae31c427 100644
--- a/Notebooks/NB10_01__Pandas_Fifa.ipynb
+++ b/Notebooks/NB10_01__Pandas_Fifa.ipynb
@@ -729,7 +729,7 @@
"id": "eHvPpeiTBwoR"
},
"source": [
- "d_estudantes['nome']"
+ "d_estudantes['Nome']"
],
"execution_count": null,
"outputs": []
@@ -749,7 +749,7 @@
"id": "26WIDl-HB3Bq"
},
"source": [
- "d_estudantes['nome'][0]"
+ "d_estudantes['Nome'][0]"
],
"execution_count": null,
"outputs": []
@@ -5281,7 +5281,7 @@
"id": "K7xLrlPuKsAW"
},
"source": [
- "df_Fifa2018.head()\n",
+ "df_Fifa2018.head(5)\n",
"\n"
],
"execution_count": null,
@@ -5304,6 +5304,9 @@
"id": "1y9oN-IeU7Sb"
},
"source": [
+ "'''\n",
+ "5. Normalize the column names, that is, rename the columns to lowercase; 6. Are there missing values in the data? If so, what is your proposal (the group's proposal) for handling these missing values?\n",
+ "'''\n",
"def transformacao_lower(df):\n",
" # Primeira transformação: Aplicar lower() nos nomes das COLUNAS:\n",
" df_Fifa2018.columns = [col.lower() for col in df.columns]\n"
@@ -5318,7 +5321,7 @@
},
"source": [
"transformacao_lower(df_Fifa2018)\n",
- "df_Fifa2018.head()"
+ "df_Fifa2018.head(5)"
],
"execution_count": null,
"outputs": []
@@ -5374,6 +5377,9 @@
"id": "OBuycRCzRbyG"
},
"source": [
+ "'''\n",
+ "02. Which columns can be dropped from the analysis up front? Why is it important to identify what can be dropped?\n",
+ "'''\n",
"del df_Fifa2018['photo']"
],
"execution_count": null,
@@ -5418,6 +5424,9 @@
"id": "ZPwu5sLnSyAc"
},
"source": [
+ "'''\n",
+ "03. What is the dtype of each variable/attribute in the dataframe?\n",
+ "'''\n",
"df_Fifa2018.dtypes"
],
"execution_count": null,
@@ -5429,6 +5438,9 @@
"id": "PBq3jr8nTUS0"
},
"source": [
+ "'''\n",
+ "05. If a variable/attribute is of type string (object) but should be numeric, how do we change its type?\n",
+ "'''\n",
"df_Fifa2018.select_dtypes(include=['object', 'string']).columns "
],
"execution_count": null,
@@ -5440,7 +5452,40 @@
"id": "82JaHKYATgdD"
},
"source": [
- "df_Fifa2018[df_Fifa2018.select_dtypes(include=['object', 'string']).columns ]"
+ "df_Fifa2018[df_Fifa2018.select_dtypes(include=['object', 'string']).columns].head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "clV9YL8RkKI_"
+ },
+ "source": [
+ "df_Fifa2018[['name','value']].head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cbzgZuAtlXpl"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YPp5kJDhkfWt"
+ },
+ "source": [
+ ""
],
"execution_count": null,
"outputs": []
From 9c442fa7e51b2bb08116bebadc2acc291a38250f Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Fri, 16 Oct 2020 17:32:49 -0300
Subject: [PATCH 09/21] Criado usando o Colaboratory
---
...3DP_3_Data_Transformation_exercicios.ipynb | 1314 +++++++++++++++++
1 file changed, 1314 insertions(+)
create mode 100644 Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
diff --git a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
new file mode 100644
index 000000000..0ac9de626
--- /dev/null
+++ b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
@@ -0,0 +1,1314 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "NB10_04__3DP_3_Data_Transformation.ipynb",
+ "provenance": [],
+ "private_outputs": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5CgDLvphxfcX"
+ },
+ "source": [
+ "3DP_3 - DATA TRANSFORMATION\n",
+ "\n",
+ "* **Objective**: Prepare the data for Machine Learning."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PvW689ZBxbxH"
+ },
+ "source": [
+ "# **AGENDA**:\n",
+ "\n",
+ "> See the **Table of contents**.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GNiuYCCxGe8v"
+ },
+ "source": [
+ "# **Improvements for this section**\n",
+ "* Develop the section on WOE."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-TdSY74U0XS9"
+ },
+ "source": [
+ "___\n",
+ "# **References**\n",
+ "* [Why, How and When to Scale your Features](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e)\n",
+ "* [Demonstrating the different strategies of KBinsDiscretizer](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html#sphx-glr-auto-examples-preprocessing-plot-discretization-strategies-py)\n",
+ "* [Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?](https://medium.com/@contactsunny/why-do-we-need-feature-scaling-in-machine-learning-and-how-to-do-it-using-scikit-learn-d8314206fe73)\n",
+ "* [Importance of Feature Scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py) --> Very important because it demonstrates the effects and the importance of transforming numeric columns.\n",
+ "* [Feature discretization](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html#sphx-glr-auto-examples-preprocessing-plot-discretization-classification-py) --> Shows the impact on model accuracy with and without discretization. In other words, discretizing makes sense!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "l9DGifbWSmW3"
+ },
+ "source": [
+ "___\n",
+ "# **Machine Learning with Python (Scikit-Learn)**\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Vg82Iouo_Qm2"
+ },
+ "source": [
+ "# Why do scaling, standardizing, and normalizing matter?\n",
+ "* Because many Machine Learning algorithms perform better or converge faster when the attributes/columns/variables are on the same scale and have a distribution \"close\" to Normal."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q-chlATnKSza"
+ },
+ "source": [
+ "## Load the (generic) Python libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kQGVQB18-tM_"
+ },
+ "source": [
+ "#!pip install category_encoders\n",
+ "#!pip install update"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7FJxrZckYxk6"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "import numpy as np\n",
+ "from sklearn import preprocessing\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "%matplotlib inline\n",
+ "\n",
+ "import category_encoders as ce # library for applying WOE (Weight of Evidence) to assess attribute importance\n",
+ "\n",
+ "# remove warnings to keep notebook clean\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CyuWQM2NTMls"
+ },
+ "source": [
+ "pd.options.display.float_format = '{:.2f}'.format"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "R0fuDyI8_UPf"
+ },
+ "source": [
+ "## Load the data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9oRWtarakgMY"
+ },
+ "source": [
+ "### Randomly generated dataframe - variables with a Normal distribution"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7BXPXo3k0VDI"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "\n",
+ "i_N = 10000\n",
+ "\n",
+ "df_A1 = pd.DataFrame({\n",
+ " 'coluna1': np.random.normal(0, 2, i_N), # Note that the columns have different means\n",
+ " 'coluna2': np.random.normal(50, 3, i_N),\n",
+ " 'coluna3': np.random.normal(-5, 5, i_N),\n",
+ " 'coluna4': np.random.normal(-10, 10, i_N)\n",
+ "})\n",
+ "\n",
+ "df_A1.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "93ST1JnoRZKm"
+ },
+ "source": [
+ "**Tip**: We can use other distributions (if we want), such as the Exponential (shown below)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XUqjo5QcQH99"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "\n",
+ "df_A2 = pd.DataFrame({\n",
+ " 'coluna1': np.random.normal(0, 2, i_N),\n",
+ " 'coluna2': np.random.normal(50, 3, i_N),\n",
+ " 'coluna3': np.random.exponential(1, i_N), # coluna3 follows an exponential distribution\n",
+ " 'coluna4': np.random.normal(-10, 10, i_N)\n",
+ "})\n",
+ "\n",
+ "df_A2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "J8MZNLbUkp8R"
+ },
+ "source": [
+ "### Randomly generated dataframe 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BR-fDDujcTup"
+ },
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "\n",
+ "dados, classe = make_classification(n_samples = i_N, n_features = 4, n_informative = 3, n_redundant = 1, n_classes = 3)\n",
+ "\n",
+ "df_A3 = pd.DataFrame({'coluna1': dados[:,0],\n",
+ " 'coluna2':dados[:,1],\n",
+ " 'coluna3':dados[:,2],\n",
+ " 'coluna4':dados[:,3]}) #, 'coluna5':classe})\n",
+ "\n",
+ "df_A3.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Zq1cnpwLKvjS"
+ },
+ "source": [
+ "df_A4 = pd.DataFrame({ \n",
+ " 'coluna1': np.random.beta(5, 1, i_N) * 25, \n",
+ " 'coluna2': np.random.exponential(10, i_N),\n",
+ " 'coluna3': np.random.normal(10, 2, i_N),\n",
+ " 'coluna4': np.random.normal(10, 10, i_N), \n",
+ "})\n",
+ "\n",
+ "df_A4.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "O7sXQjvYRfhb"
+ },
+ "source": [
+ "#### Draw samples for comparison"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rjVHsnnHRkIo"
+ },
+ "source": [
+ "df_A1_test = df_A1.sample(n = 100)\n",
+ "df_A2_test = df_A2.sample(n = 100)\n",
+ "df_A3_test = df_A3.sample(n = 100)\n",
+ "df_A4_test = df_A4.sample(n = 100)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t0v0uXFRl-yG"
+ },
+ "source": [
+ "___\n",
+ "# **Transformations**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pkzTO0fdz93b"
+ },
+ "source": [
+ "## (1) StandardScaler\n",
+ "* StandardScaler centers the data by subtracting the mean of each column and then rescales it by dividing by the standard deviation;\n",
+ "* After the transformation, each column has mean 0 and standard deviation 1;\n",
+ "* It works best when the columns to be transformed are approximately normally distributed;\n",
+ "* If the data are far from Normal (for example, strongly skewed or with outliers), this is usually not a good transformation to apply.\n",
+ "\n",
+ "$$z_{i}= \\frac{x_{i}-mean(x)}{std(x)}$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "v1UOOWeQ0R_Y"
+ },
+ "source": [
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "y1Lzx3xN6wpZ"
+ },
+ "source": [
+ "df_A3.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9cPq_7Vu2HCS"
+ },
+ "source": [
+ "Histogram:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZYW9WwBC3hd_"
+ },
+ "source": [
+ "plt.figure(figsize = (12, 8))\n",
+ "plt.hist(df_A1['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n",
+ "\n",
+ "# Add the title\n",
+ "plt.title('Histogram of coluna3')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "h8ogcQvvT5zK"
+ },
+ "source": [
+ "plt.figure(figsize = (12, 8))\n",
+ "plt.hist(df_A2['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n",
+ "\n",
+ "# Add the title\n",
+ "plt.title('Histogram of coluna3')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RrgxkESc-Uaq"
+ },
+ "source": [
+ "Consider the following plot:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "U7dHTF1W-Xsn"
+ },
+ "source": [
+ "df_A1.plot(kind = 'kde') # KDE (kernel density estimate) helps us visualize the distribution of the data, much like a histogram."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hMS72n14-hDO"
+ },
+ "source": [
+ "How would you interpret the plot above?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "izqGNcNILdaX"
+ },
+ "source": [
+ "df_A1.plot()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZEkAqlZg-p0v"
+ },
+ "source": [
+ "Next, the StandardScaler transformation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "N4u3T_BX-oc_"
+ },
+ "source": [
+ "from sklearn.preprocessing import StandardScaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "voFQ4odSzzPZ"
+ },
+ "source": [
+ "Ideally, the predictors are arranged in an array of the form:\n",
+ "X = [coluna1, coluna2, ..., colunaN]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rPa4-SCt-ynX"
+ },
+ "source": [
+ "np.set_printoptions(precision = 3)\n",
+ "\n",
+ "A1_scale = StandardScaler().fit_transform(df_A1) # fit() and transform() combined in a single call\n",
+ "\n",
+ "A1_scale_fit = StandardScaler().fit(df_A1) # Apply fit() on its own\n",
+ "A1_scale_transform = A1_scale_fit.transform(df_A1) # Apply transform() on its own\n",
+ "A1_scale_fit_transform = StandardScaler().fit(df_A1).transform(df_A1) # Apply fit().transform() chained\n",
+ "\n",
+ "A2_scale = StandardScaler().fit_transform(df_A2)\n",
+ "\n",
+ "A3_scale = StandardScaler().fit_transform(df_A3)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SGR9-bG0q-SI"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ioZ_IN3Z6d39"
+ },
+ "source": [
+ "Note below that A1_scale = A1_scale_transform = A1_scale_fit_transform --> they are multidimensional NumPy arrays!\n",
+ "\n",
+ "**It is important to save the fitted StandardScaler parameters (and those of other transformers) so that they do not have to be recomputed on every run.**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "v4xQR4cu5D1J"
+ },
+ "source": [
+ "A1_scale"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "j6GtN2KF4E_A"
+ },
+ "source": [
+ "A1_scale_transform"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0q2bvSqb6T4g"
+ },
+ "source": [
+ "A1_scale_fit_transform"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WIhaErnA46Fi"
+ },
+ "source": [
+ "Converting to a dataframe:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HAhRvPze44JW"
+ },
+ "source": [
+ "df_A1_scale = pd.DataFrame(A1_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n",
+ "df_A2_scale = pd.DataFrame(A2_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n",
+ "df_A3_scale = pd.DataFrame(A3_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bmQp8wDO_E88"
+ },
+ "source": [
+ "Now compare with the new plot below --> the transformed df_A1 columns follow a Normal(0, 1) distribution:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "csfqRhDH2zUb"
+ },
+ "source": [
+ "df_A1.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-krh1pDg22RF"
+ },
+ "source": [
+ "df_A1_scale.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D2fTPWsm_Hq3"
+ },
+ "source": [
+ "df_A1_scale.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9oN-829l3277"
+ },
+ "source": [
+ "df_A2.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Jqh8L5BeUHT-"
+ },
+ "source": [
+ "df_A2_scale.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Yvz6O1zk4XNE"
+ },
+ "source": [
+ "df_A3.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ffU-fQxCUSmm"
+ },
+ "source": [
+ "df_A3_scale.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "y24MOLL83w9j"
+ },
+ "source": [
+ "### Exercise: compute the mean and the standard deviation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1Aa25gVlSdOi"
+ },
+ "source": [
+ "df_A1.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "EXZUiZImSmOE"
+ },
+ "source": [
+ "df_A1_scale.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uIUQw5dpRwvA"
+ },
+ "source": [
+ "#### Correlation between columns\n",
+ "* Note that the correlations between the variables do not change with the transformations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uj1UerjORq9q"
+ },
+ "source": [
+ "df_A1.corr()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jp6vPK0aR_p0"
+ },
+ "source": [
+ "df_A1_scale.corr()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4fuURrao_M0c"
+ },
+ "source": [
+ "What is the conclusion?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "f0A9U7rs_RAT"
+ },
+ "source": [
+ "## (2) MinMaxScaler\n",
+ "* **A very popular and widely used transformation**.\n",
+ "* Rescales the data to the interval [0, 1];\n",
+ "* If StandardScaler is not applicable, this transformation works well.\n",
+ "* Sensitive to outliers, so ideally the outliers are handled beforehand.\n",
+ "* A similar transformation is MaxAbsScaler(), which divides each column by its maximum absolute value, so the data fall in the interval [-1, 1]; it does not shift or center the data.\n",
+ "\n",
+ "$$z_{i}= \\frac{x_{i}-min(x)}{max(x)-min(x)}$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "C0HbeuP-AU_p"
+ },
+ "source": [
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mgeLckzxAWaC"
+ },
+ "source": [
+ "from sklearn.preprocessing import MinMaxScaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "S_W9bTO2AbEg"
+ },
+ "source": [
+ "df_A1.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PJRFbUpBAg5J"
+ },
+ "source": [
+ "A1_MinMaxScaler = MinMaxScaler().fit_transform(df_A1)\n",
+ "df_A1_MinMaxScaler = pd.DataFrame(A1_MinMaxScaler,columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n",
+ "\n",
+ "# Plot\n",
+ "df_A1_MinMaxScaler.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7g8GA4LTA40U"
+ },
+ "source": [
+ "What is the conclusion?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4Z6D3vfnB9Nm"
+ },
+ "source": [
+ "## (3) RobustScaler\n",
+ "* A transformation well suited to data with outliers: it centers on the median and scales by the interquartile range.\n",
+ "\n",
+ "$$z_{i}= \\frac{x_{i}-median(x)}{Q_{3}(x)-Q_{1}(x)}$$"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m3oyuxLeCW1D"
+ },
+ "source": [
+ "df_A1.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zeDF7-w_CcBy"
+ },
+ "source": [
+ "from sklearn.preprocessing import RobustScaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vLoqSKijCf2v"
+ },
+ "source": [
+ "A1_RobustScaler = RobustScaler().fit_transform(df_A1)\n",
+ "df_A1_RobustScaler = pd.DataFrame(A1_RobustScaler, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n",
+ "\n",
+ "# Plot\n",
+ "df_A1_RobustScaler.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-YVMgt-WEFif"
+ },
+ "source": [
+ "## Encoding Categorical Variables"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xHYvLc8T_jxQ"
+ },
+ "source": [
+ "### Encoding Ordinal Variables\n",
+ "* Example: variables with ordinal values: low (baixo), medium (medio), or high (alto)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "i1BgGiGdSTcG"
+ },
+ "source": [
+ "#### Generate an example dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kdVahfJAEkuO"
+ },
+ "source": [
+ "# Here we use randint, which returns random integers, including the lower bound and excluding the upper bound.\n",
+ "\n",
+ "l_idade= [np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40),\n",
+ " np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40)]\n",
+ "\n",
+ "l_salario = ['baixo', 'medio', 'alto']\n",
+ "l_salario2 = np.random.choice(l_salario, 10, p = [0.6, 0.3, 0.1])\n",
+ "\n",
+ "df_A4 = pd.DataFrame({\n",
+ " 'idade': l_idade,\n",
+ " 'salario': l_salario2})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m_15P2eUHSBY"
+ },
+ "source": [
+ "df_A4"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "R1g9pEuyHe2q"
+ },
+ "source": [
+ "In this example, we recode the ordinal categorical variable 'salario' as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bkwFuEa8HnMV"
+ },
+ "source": [
+ "df_A4['salario_cat'] = df_A4['salario'].map({'baixo': 1, 'medio': 2, 'alto': 3})\n",
+ "df_A4"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DlaIFiWIIPAl"
+ },
+ "source": [
+ "### Encoding Nominal Variables\n",
+ "* Example: variables with nominal values: sex (female, male).\n",
+ "\n",
+ "* Use One-Hot Encoding or pd.get_dummies()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ffNoJQbgJRoY"
+ },
+ "source": [
+ "We will use the dataframe created in the previous step:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PMCoUWZOI7c0"
+ },
+ "source": [
+ "df_A4['salario'].unique()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bdIEyBkaJeN8"
+ },
+ "source": [
+ "from sklearn.preprocessing import LabelEncoder, OneHotEncoder"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4MwK4cUEKeK4"
+ },
+ "source": [
+ "#### Apply LabelEncoder()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6X6VXDsHJiII"
+ },
+ "source": [
+ "le = LabelEncoder()\n",
+ "df_A4['salario_le'] = le.fit_transform(df_A4['salario'])\n",
+ "df_A4"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RY80x59J8Ham"
+ },
+ "source": [
+ "df_A4['salario'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Dgv2Zz07Kqfj"
+ },
+ "source": [
+ "#### Apply pd.get_dummies()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WSZRIEs6K5sP"
+ },
+ "source": [
+ "dummies = pd.get_dummies(df_A4['salario'])\n",
+ "df_A4 = pd.concat([df_A4, dummies], axis = 1)\n",
+ "df_A4"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CY8GZ-HlNOgm"
+ },
+ "source": [
+ "# **Wrap Up**\n",
+ "\n",
+ "\n",
+ "* Use MinMaxScaler as the default transformation, since it does not distort the data;\n",
+ "* Use RobustScaler if your data/columns/variables contain outliers and you want to reduce their effect. Even so, the best approach is to study the outliers carefully and handle them appropriately;\n",
+ "* Use StandardScaler if your data/columns/variables are normally distributed (or at least approximate a Normal distribution well)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Mwh0alhdgrE3"
+ },
+ "source": [
+ "___\n",
+ "# **Exercises**\n",
+ "> For each of the dataframes below, apply the following steps:\n",
+ "\n",
+ "* Standardize the column names\n",
+ " * Remove spaces from the column names;\n",
+ " * Remove special characters from the column names;\n",
+ " * Rename the columns with lower() (or upper());\n",
+ "* Apply the StandardScaler and MinMaxScaler transformations to each column of the dataframe;\n",
+ "* DataViz - Show the distribution of the columns so we can compare the results before and after the transformations.\n",
+ "* Do the correlations between the columns change with the transformations?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hSTKrd992LtI"
+ },
+ "source": [
+ "## Exercise 1 - Iris --> **Solved**\n",
+ "* [Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you will find more information about the iris dataset. Take a look."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mThqvGGr2Vuk"
+ },
+ "source": [
+ "from sklearn.datasets import load_iris\n",
+ "\n",
+ "iris = load_iris()\n",
+ "X= iris['data']\n",
+ "y= iris['target']\n",
+ "\n",
+ "df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))\n",
+ "df_iris['target2'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n",
+ "df_iris.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eU5FaJhdYblP"
+ },
+ "source": [
+ "df_iris.columns = [c.replace(' ', '_') for c in df_iris.columns]\n",
+ "df_iris.columns = [c.replace('_(cm)', '') for c in df_iris.columns]\n",
+ "df_iris.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PGmZjd_Y79lY"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "K9DPAakJZQHH"
+ },
+ "source": [
+ "df_iris.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YYYmVq68Y8bB"
+ },
+ "source": [
+ "# Apply the transformation:\n",
+ "df_iris_MinMaxScaler = MinMaxScaler().fit_transform(df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])\n",
+ "\n",
+ "# Convert to a dataframe:\n",
+ "df_iris_MinMaxScaler = pd.DataFrame(df_iris_MinMaxScaler, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])\n",
+ "\n",
+ "# Plot\n",
+ "df_iris_MinMaxScaler.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IwPH8-258JrF"
+ },
+ "source": [
+ "Apply the other transformations and compare the plots."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "caFkC6oCmUKK"
+ },
+ "source": [
+ "## Exercise 2 - Breast Cancer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vhOM-Z9zmf-f"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "from sklearn.datasets import load_breast_cancer\n",
+ "\n",
+ "cancer = load_breast_cancer()\n",
+ "X= cancer['data']\n",
+ "y= cancer['target']\n",
+ "\n",
+ "df_A1_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n",
+ "df_A1_cancer['target'] = df_A1_cancer['target'].map({0: 'malignant', 1: 'benign'})\n",
+ "df_A1_cancer.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1qruqUDqnvMc"
+ },
+ "source": [
+ "## Exercise 3 - Boston Housing Price"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "trxK8YXNnsam"
+ },
+ "source": [
+ "from sklearn.datasets import load_boston # note: load_boston was removed in scikit-learn 1.2; this cell needs an older version\n",
+ "\n",
+ "boston = load_boston()\n",
+ "X= boston['data']\n",
+ "y= boston['target']\n",
+ "\n",
+ "df_A1_boston = pd.DataFrame(np.c_[X, y], columns= np.append(boston['feature_names'], ['target']))\n",
+ "df_A1_boston.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nzu0Dz33c8ds"
+ },
+ "source": [
+ "## Exercise 4 - Diabetes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "d6ahBZmqc_-1"
+ },
+ "source": [
+ "from sklearn.datasets import load_diabetes\n",
+ "\n",
+ "diabetes = load_diabetes()\n",
+ "X= diabetes['data']\n",
+ "y= diabetes['target']\n",
+ "\n",
+ "df_A1_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))\n",
+ "df_A1_diabetes.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NyunIr6oaWEl"
+ },
+ "source": [
+ "## Exercise 6 - 120 years of Olympic history: athletes and results\n",
+ "* [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)\n",
+ " * Handle the variables 'sex', 'season', 'team', 'city', 'sport' and 'medal' appropriately;\n",
+ " * Apply the transformations we have just studied to the numeric columns 'height' and 'weight'. Watch out for the missing values in these variables!\n",
+ " * Assess the impact of outliers on these columns.\n",
+ " * In this case, which transformation is most appropriate given the outliers?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o5fDp1Ib_Dg8"
+ },
+ "source": [
+ "# WOE - Weight Of Evidence\n",
+ "* The advantages of the WOE transformation are:\n",
+ " * It handles NaNs well;\n",
+ " * It handles outliers well;\n",
+ " * The transformation is based on the logarithm of the ratio of the distributions.\n",
+ " * With an appropriate binning technique, it can establish a monotonic (increasing or decreasing) relationship between the dependent and independent variables."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wXEsP96A9TSd"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "import numpy as np\n",
+ "from sklearn import preprocessing\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "%matplotlib inline\n",
+ "\n",
+ "#import category_encoders as ce # library for applying WOE (Weight of Evidence) to assess attribute importance\n",
+ "\n",
+ "# remove warnings to keep notebook clean\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gGdOGDZAHu-V"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Z2W9PXAc-vHY"
+ },
+ "source": [
+ "from google.colab import drive\n",
+ "drive.mount('/content/drive', force_remount=True)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "g45JU2LXHwkz"
+ },
+ "source": [
+ "import pandas as pd \n",
+ "df=pd.read_csv('gdrive/My Drive/athlete_events.zip', compression='zip')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JLxrnkJw_f7m"
+ },
+ "source": [
+ "pd.read_csv('/content/drive/My Drive/file/d/athlete_events.zip', compression='zip')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WcPQhh04E2du"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From ace6545c39971edf7798503b90f0275850dfef78 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Fri, 16 Oct 2020 17:35:07 -0300
Subject: [PATCH 10/21] Criado usando o Colaboratory
---
.../NB10_04__3DP_3_Data_Transformation_exercicios.ipynb | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
index 0ac9de626..ddc2ee109 100644
--- a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
+++ b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
@@ -1282,8 +1282,9 @@
"id": "g45JU2LXHwkz"
},
"source": [
- "import pandas as pd \n",
- "df=pd.read_csv('gdrive/My Drive/athlete_events.zip', compression='zip')"
+ "url = '/content/drive/My Drive/athlete_events.csv'\n",
+ "import pandas as pd\n",
+ "df_olympics = pd.read_csv(url)"
],
"execution_count": null,
"outputs": []
From b9811c398dca8a47084883f53beffc5b2b3f55cb Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Mon, 19 Oct 2020 14:54:57 -0300
Subject: [PATCH 11/21] Criado usando o Colaboratory
---
...xercicios_exerc\303\255cio Olympics.ipynb" | 1478 +++++++++++++++++
1 file changed, 1478 insertions(+)
create mode 100644 "Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
diff --git "a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb" "b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
new file mode 100644
index 000000000..92769c16a
--- /dev/null
+++ "b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
@@ -0,0 +1,1478 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "NB10_04__3DP_3_Data_Transformation.ipynb",
+ "provenance": [],
+ "private_outputs": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5CgDLvphxfcX"
+ },
+ "source": [
+ "3DP_3 - DATA TRANSFORMATION\n",
+ "\n",
+ "* **Goal**: Prepare the data for Machine Learning."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PvW689ZBxbxH"
+ },
+ "source": [
+ "# **AGENDA**:\n",
+ "\n",
+ "> See the **Table of contents**.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GNiuYCCxGe8v"
+ },
+ "source": [
+ "# **Improvements to this section**\n",
+ "* Develop the section on WOE."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-TdSY74U0XS9"
+ },
+ "source": [
+ "___\n",
+ "# **References**\n",
+ "* [Why, How and When to Scale your Features](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e)\n",
+ "* [Demonstrating the different strategies of KBinsDiscretizer](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html#sphx-glr-auto-examples-preprocessing-plot-discretization-strategies-py);\n",
+ "* [Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?](https://medium.com/@contactsunny/why-do-we-need-feature-scaling-in-machine-learning-and-how-to-do-it-using-scikit-learn-d8314206fe73)\n",
+ "* [Importance of Feature Scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py) --> Very important because it demonstrates the effects and the importance of transforming numeric columns.\n",
+ "* [Feature discretization](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html#sphx-glr-auto-examples-preprocessing-plot-discretization-classification-py) --> Shows the impact on model accuracy with and without discretization. In other words, discretization makes sense!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "l9DGifbWSmW3"
+ },
+ "source": [
+ "___\n",
+ "# **Machine Learning with Python (Scikit-Learn)**\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Vg82Iouo_Qm2"
+ },
+ "source": [
+ "# Why do scaling, standardizing, and normalizing matter?\n",
+ "* Because many Machine Learning algorithms perform better or converge faster when the attributes/columns/variables are on the same scale and have a distribution \"close\" to Normal."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q-chlATnKSza"
+ },
+ "source": [
+ "## Load the (general-purpose) Python libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kQGVQB18-tM_"
+ },
+ "source": [
+ "#!pip install category_encoders\n",
+ "#!pip install update"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7FJxrZckYxk6"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "import numpy as np\n",
+ "from sklearn import preprocessing\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "%matplotlib inline\n",
+ "\n",
+ "import category_encoders as ce # library for applying WOE (Weight of Evidence) to assess attribute importance\n",
+ "\n",
+ "# remove warnings to keep notebook clean\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CyuWQM2NTMls"
+ },
+ "source": [
+ "pd.options.display.float_format = '{:.2f}'.format"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "R0fuDyI8_UPf"
+ },
+ "source": [
+ "## Load the data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9oRWtarakgMY"
+ },
+ "source": [
+ "### Randomly generated dataframe - normally distributed variables"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7BXPXo3k0VDI"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "\n",
+ "i_N = 10000\n",
+ "\n",
+ "df_A1 = pd.DataFrame({\n",
+ " 'coluna1': np.random.normal(0, 2, i_N), # Note that the columns have different means\n",
+ " 'coluna2': np.random.normal(50, 3, i_N),\n",
+ " 'coluna3': np.random.normal(-5, 5, i_N),\n",
+ " 'coluna4': np.random.normal(-10, 10, i_N)\n",
+ "})\n",
+ "\n",
+ "df_A1.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "93ST1JnoRZKm"
+ },
+ "source": [
+ "**Tip**: We can use other distributions if we wish, such as the exponential (shown below)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XUqjo5QcQH99"
+ },
+ "source": [
+ "np.random.seed(20111974)\n",
+ "\n",
+ "df_A2 = pd.DataFrame({\n",
+ " 'coluna1': np.random.normal(0, 2, i_N),\n",
+ " 'coluna2': np.random.normal(50, 3, i_N),\n",
+ " 'coluna3': np.random.exponential(1, i_N), # coluna3 follows an exponential distribution\n",
+ " 'coluna4': np.random.normal(-10, 10, i_N)\n",
+ "})\n",
+ "\n",
+ "df_A2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "J8MZNLbUkp8R"
+ },
+ "source": [
+ "### Randomly generated dataframe 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BR-fDDujcTup"
+ },
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "\n",
+ "dados, classe = make_classification(n_samples = i_N, n_features = 4, n_informative = 3, n_redundant = 1, n_classes = 3)\n",
+ "\n",
+ "df_A3 = pd.DataFrame({'coluna1': dados[:,0],\n",
+ " 'coluna2':dados[:,1],\n",
+ " 'coluna3':dados[:,2],\n",
+ " 'coluna4':dados[:,3]}) #, 'coluna5':classe})\n",
+ "\n",
+ "df_A3.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Zq1cnpwLKvjS"
+ },
+ "source": [
+ "df_A4 = pd.DataFrame({ \n",
+ " 'coluna1': np.random.beta(5, 1, i_N) * 25, \n",
+ " 'coluna2': np.random.exponential(10, i_N),\n",
+ " 'coluna3': np.random.normal(10, 2, i_N),\n",
+ " 'coluna4': np.random.normal(10, 10, i_N), \n",
+ "})\n",
+ "\n",
+ "df_A4.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "O7sXQjvYRfhb"
+ },
+ "source": [
+ "#### Drawing samples for comparison"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rjVHsnnHRkIo"
+ },
+ "source": [
+ "df_A1_test = df_A1.sample(n = 100)\n",
+ "df_A2_test = df_A2.sample(n = 100)\n",
+ "df_A3_test = df_A3.sample(n = 100)\n",
+ "df_A4_test = df_A4.sample(n = 100)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t0v0uXFRl-yG"
+ },
+ "source": [
+ "___\n",
+ "# **Transformations**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pkzTO0fdz93b"
+ },
+ "source": [
+ "## (1) StandardScaler\n",
+ "* StandardScaler centers the data by subtracting the mean of each column and then rescales it by dividing by the standard deviation;\n",
+ "* After the transformation, each column has mean 0 and standard deviation 1;\n",
+ "* It works best when the data (the columns to be transformed) are approximately normally distributed;\n",
+ "* If the data are far from Normal (for example, strongly skewed or with many outliers), this is not a good transformation to apply.\n",
+ "\n",
+ "$$z_{i}= \\frac{x_{i}-mean(x)}{std(x)}$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "v1UOOWeQ0R_Y"
+ },
+ "source": [
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "y1Lzx3xN6wpZ"
+ },
+ "source": [
+ "df_A3.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9cPq_7Vu2HCS"
+ },
+ "source": [
+ "Histogram:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZYW9WwBC3hd_"
+ },
+ "source": [
+ "plt.figure(figsize = (12, 8))\n",
+ "plt.hist(df_A1['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n",
+ "\n",
+ "# Add a title\n",
+ "plt.title('Histogram of coluna3 (df_A1)')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "h8ogcQvvT5zK"
+ },
+ "source": [
+ "plt.figure(figsize = (12, 8))\n",
+ "plt.hist(df_A2['coluna3'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n",
+ "\n",
+ "# Add a title\n",
+ "plt.title('Histogram of coluna3 (df_A2)')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RrgxkESc-Uaq"
+ },
+ "source": [
+ "Consider the following plot:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "U7dHTF1W-Xsn"
+ },
+ "source": [
+ "df_A1.plot(kind = 'kde') # KDE (kernel density estimate) helps us visualize the distribution of the data, analogous to a histogram."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hMS72n14-hDO"
+ },
+ "source": [
+ "How would you interpret the plot above?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "izqGNcNILdaX"
+ },
+ "source": [
+ "df_A1.plot()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZEkAqlZg-p0v"
+ },
+ "source": [
+ "Next, the StandardScaler transformation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "N4u3T_BX-oc_"
+ },
+ "source": [
+ "from sklearn.preprocessing import StandardScaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "voFQ4odSzzPZ"
+ },
+ "source": [
+ "Ideally, the predictors are arranged in an array of the following form:\n",
+ "X = [coluna1, coluna2, ..., colunaN]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rPa4-SCt-ynX"
+ },
+ "source": [
+ "np.set_printoptions(precision = 3)\n",
+ "\n",
+ "A1_scale = StandardScaler().fit_transform(df_A1) # Combines fit() and transform() in one call\n",
+ "\n",
+ "A1_scale_fit = StandardScaler().fit(df_A1) # Applies fit() separately\n",
+ "A1_scale_transform = A1_scale_fit.transform(df_A1) # Applies transform() separately\n",
+ "A1_scale_fit_transform = StandardScaler().fit(df_A1).transform(df_A1) # Chains fit().transform()\n",
+ "\n",
+ "A2_scale = StandardScaler().fit_transform(df_A2)\n",
+ "\n",
+ "A3_scale = StandardScaler().fit_transform(df_A3)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SGR9-bG0q-SI"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ioZ_IN3Z6d39"
+ },
+ "source": [
+ "Note below that A1_scale = A1_scale_transform = A1_scale_fit_transform; all three are multidimensional NumPy arrays.\n",
+ "\n",
+ "**It is important to keep the fitted StandardScaler (and any other fitted transformer), that is, the parameters it learned, so that the values do not have to be recomputed for every processing run.**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "v4xQR4cu5D1J"
+ },
+ "source": [
+ "A1_scale"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "j6GtN2KF4E_A"
+ },
+ "source": [
+ "A1_scale_transform"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0q2bvSqb6T4g"
+ },
+ "source": [
+ "A1_scale_fit_transform"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WIhaErnA46Fi"
+ },
+ "source": [
+ "Converting to a dataframe:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HAhRvPze44JW"
+ },
+ "source": [
+ "df_A1_scale = pd.DataFrame(A1_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n",
+ "df_A2_scale = pd.DataFrame(A2_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n",
+ "df_A3_scale = pd.DataFrame(A3_scale, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bmQp8wDO_E88"
+ },
+ "source": [
+ "Now compare with the new plot below: the transformed columns of df_A1 have mean 0 and standard deviation 1 and, because the original columns were Normal, they now follow a Normal(0, 1) distribution:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "csfqRhDH2zUb"
+ },
+ "source": [
+ "df_A1.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-krh1pDg22RF"
+ },
+ "source": [
+ "df_A1_scale.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D2fTPWsm_Hq3"
+ },
+ "source": [
+ "df_A1_scale.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9oN-829l3277"
+ },
+ "source": [
+ "df_A2.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Jqh8L5BeUHT-"
+ },
+ "source": [
+ "df_A2_scale.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Yvz6O1zk4XNE"
+ },
+ "source": [
+ "df_A3.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ffU-fQxCUSmm"
+ },
+ "source": [
+ "df_A3_scale.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "y24MOLL83w9j"
+ },
+ "source": [
+ "### Exercise: Compute the mean and the standard deviation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1Aa25gVlSdOi"
+ },
+ "source": [
+ "df_A1.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "EXZUiZImSmOE"
+ },
+ "source": [
+ "df_A1_scale.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uIUQw5dpRwvA"
+ },
+ "source": [
+ "#### Correlation between columns\n",
+ "* Note that the correlations between the variables are not changed by these transformations (they are linear rescalings of each column)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uj1UerjORq9q"
+ },
+ "source": [
+ "df_A1.corr()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jp6vPK0aR_p0"
+ },
+ "source": [
+ "df_A1_scale.corr()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4fuURrao_M0c"
+ },
+ "source": [
+ "What is the conclusion?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "f0A9U7rs_RAT"
+ },
+ "source": [
+ "## (2) MinMaxScaler\n",
+ "* **A very popular and widely used transformation**.\n",
+ "* Rescales the data to the interval [0, 1];\n",
+ "* If StandardScaler is not applicable, this transformation usually works well.\n",
+ "* Sensitive to outliers. Ideally, outliers should therefore be handled beforehand.\n",
+ "* A similar transformation is MaxAbsScaler(), which divides each column by its maximum absolute value so that the data fall in the interval [-1, 1]; it does not shift or center the data.\n",
+ "\n",
+ "$$z_{i}= \\frac{x_{i}-min(x)}{max(x)-min(x)}$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "C0HbeuP-AU_p"
+ },
+ "source": [
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mgeLckzxAWaC"
+ },
+ "source": [
+ "from sklearn.preprocessing import MinMaxScaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "S_W9bTO2AbEg"
+ },
+ "source": [
+ "df_A1.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PJRFbUpBAg5J"
+ },
+ "source": [
+ "A1_MinMaxScaler = MinMaxScaler().fit_transform(df_A1)\n",
+ "df_A1_MinMaxScaler = pd.DataFrame(A1_MinMaxScaler,columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n",
+ "\n",
+ "# Plot\n",
+ "df_A1_MinMaxScaler.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7g8GA4LTA40U"
+ },
+ "source": [
+ "What is the conclusion?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4Z6D3vfnB9Nm"
+ },
+ "source": [
+ "## (3) RobustScaler\n",
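+ "* A suitable transformation for data with outliers: it centers on the median and scales by the interquartile range (IQR), both of which are robust to extreme values.\n",
+ "\n",
+ "$$z_{i}= \\frac{x_{i}-median(x)}{Q_{3}(x)-Q_{1}(x)}$$"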
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m3oyuxLeCW1D"
+ },
+ "source": [
+ "df_A1.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zeDF7-w_CcBy"
+ },
+ "source": [
+ "from sklearn.preprocessing import RobustScaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vLoqSKijCf2v"
+ },
+ "source": [
+ "A1_RobustScaler = RobustScaler().fit_transform(df_A1)\n",
+ "df_A1_RobustScaler = pd.DataFrame(A1_RobustScaler, columns = ['coluna1', 'coluna2', 'coluna3', 'coluna4'])\n",
+ "\n",
+ "# Plot\n",
+ "df_A1_RobustScaler.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-YVMgt-WEFif"
+ },
+ "source": [
+ "## Encoding Categorical Variables"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xHYvLc8T_jxQ"
+ },
+ "source": [
+ "### Encoding Ordinal Variables\n",
+ "* Example: variables with ordinal values: low (baixo), medium (medio), or high (alto)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "i1BgGiGdSTcG"
+ },
+ "source": [
+ "#### Generate an example dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kdVahfJAEkuO"
+ },
+ "source": [
+ "# Here we use randint, which returns random integers including the lower bound and excluding the upper bound.\n",
+ "\n",
+ "l_idade= [np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40),\n",
+ " np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40), np.random.randint(20, 40)]\n",
+ "\n",
+ "l_salario = ['baixo', 'medio', 'alto']\n",
+ "l_salario2 = np.random.choice(l_salario, 10, p = [0.6, 0.3, 0.1])\n",
+ "\n",
+ "df_A4 = pd.DataFrame({\n",
+ " 'idade': l_idade,\n",
+ " 'salario': l_salario2})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m_15P2eUHSBY"
+ },
+ "source": [
+ "df_A4"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "R1g9pEuyHe2q"
+ },
+ "source": [
+ "In this example, we will recode the ordinal categorical variable 'salario' as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bkwFuEa8HnMV"
+ },
+ "source": [
+ "df_A4['salario_cat'] = df_A4['salario'].map({'baixo': 1, 'medio': 2, 'alto': 3})\n",
+ "df_A4"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DlaIFiWIIPAl"
+ },
+ "source": [
+ "### Encoding Nominal Variables\n",
+ "* Example: variables with nominal values: Sex (Female, Male).\n",
+ "\n",
+ "* Use One-Hot Encoding or pd.get_dummies()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ffNoJQbgJRoY"
+ },
+ "source": [
+ "We will use the dataframe created in the previous step:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PMCoUWZOI7c0"
+ },
+ "source": [
+ "df_A4['salario'].unique()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bdIEyBkaJeN8"
+ },
+ "source": [
+ "from sklearn.preprocessing import LabelEncoder, OneHotEncoder"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4MwK4cUEKeK4"
+ },
+ "source": [
+ "#### Apply LabelEncoder()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6X6VXDsHJiII"
+ },
+ "source": [
+ "le = LabelEncoder()\n",
+ "df_A4['salario_le'] = le.fit_transform(df_A4['salario'])\n",
+ "df_A4"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RY80x59J8Ham"
+ },
+ "source": [
+ "df_A4['salario'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Dgv2Zz07Kqfj"
+ },
+ "source": [
+ "#### Apply pd.get_dummies()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WSZRIEs6K5sP"
+ },
+ "source": [
+ "dummies = pd.get_dummies(df_A4['salario'])\n",
+ "df_A4 = pd.concat([df_A4, dummies], axis = 1)\n",
+ "df_A4"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CY8GZ-HlNOgm"
+ },
+ "source": [
+ "# **Wrap Up**\n",
+ "\n",
+ "\n",
+ "* Use MinMaxScaler as the default transformation, since it does not distort the shape of the data;\n",
+ "* Use RobustScaler if your data/column/variable has outliers and you want to reduce their effect. Even so, the best approach is to study the outliers carefully and treat them appropriately;\n",
+ "* Use StandardScaler if your data/columns/variables follow a Normal distribution (or at least approximate it well)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Mwh0alhdgrE3"
+ },
+ "source": [
+ "___\n",
+ "# **Exercises**\n",
+ "> For each of the dataframes below, apply the following steps:\n",
+ "\n",
+ "* Standardize the column names\n",
+ " * Remove spaces from the column names;\n",
+ " * Remove special characters from the column names;\n",
+ " * Rename the columns with lower() (or upper());\n",
+ "* Apply the StandardScaler and MinMaxScaler transformations to each column of the dataframe;\n",
+ "* DataViz - Show the distribution of the columns so we can compare the results before and after the transformations.\n",
+ "* Do the correlations between columns change with the transformations?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hSTKrd992LtI"
+ },
+ "source": [
+ "## Exercise 1 - Iris --> **Solved**\n",
+ "* [Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you can find more information about the iris dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mThqvGGr2Vuk"
+ },
+ "source": [
+ "from sklearn.datasets import load_iris\n",
+ "\n",
+ "iris = load_iris()\n",
+ "X= iris['data']\n",
+ "y= iris['target']\n",
+ "\n",
+ "df_iris = pd.DataFrame(np.c_[X, y], columns= np.append(iris['feature_names'], ['target']))\n",
+ "df_iris['target2'] = df_iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n",
+ "df_iris.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eU5FaJhdYblP"
+ },
+ "source": [
+ "df_iris.columns = [c.replace(' ', '_') for c in df_iris.columns]\n",
+ "df_iris.columns = [c.replace('_(cm)', '') for c in df_iris.columns]\n",
+ "df_iris.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PGmZjd_Y79lY"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "K9DPAakJZQHH"
+ },
+ "source": [
+ "df_iris.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YYYmVq68Y8bB"
+ },
+ "source": [
+ "# Apply the transformation:\n",
+ "df_iris_MinMaxScaler = MinMaxScaler().fit_transform(df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])\n",
+ "\n",
+ "# Convert to a dataframe:\n",
+ "df_iris_MinMaxScaler = pd.DataFrame(df_iris_MinMaxScaler, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])\n",
+ "\n",
+ "# Plot\n",
+ "df_iris_MinMaxScaler.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IwPH8-258JrF"
+ },
+ "source": [
+ "Apply the other transformations and compare the plots."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "caFkC6oCmUKK"
+ },
+ "source": [
+ "## Exercise 2 - Breast Cancer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vhOM-Z9zmf-f"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "from sklearn.datasets import load_breast_cancer\n",
+ "\n",
+ "cancer = load_breast_cancer()\n",
+ "X= cancer['data']\n",
+ "y= cancer['target']\n",
+ "\n",
+ "df_A1_cancer = pd.DataFrame(np.c_[X, y], columns= np.append(cancer['feature_names'], ['target']))\n",
+ "df_A1_cancer['target'] = df_A1_cancer['target'].map({0: 'malignant', 1: 'benign'})\n",
+ "df_A1_cancer.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1qruqUDqnvMc"
+ },
+ "source": [
+ "## Exercise 3 - Boston Housing Price"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "trxK8YXNnsam"
+ },
+ "source": [
+ "from sklearn.datasets import load_boston  # Note: load_boston was removed in scikit-learn 1.2; this cell needs an older version\n",
+ "\n",
+ "boston = load_boston()\n",
+ "X= boston['data']\n",
+ "y= boston['target']\n",
+ "\n",
+ "df_A1_boston = pd.DataFrame(np.c_[X, y], columns= np.append(boston['feature_names'], ['target']))\n",
+ "df_A1_boston.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nzu0Dz33c8ds"
+ },
+ "source": [
+ "## Exercise 4 - Diabetes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "d6ahBZmqc_-1"
+ },
+ "source": [
+ "from sklearn.datasets import load_diabetes\n",
+ "\n",
+ "diabetes = load_diabetes()\n",
+ "X= diabetes['data']\n",
+ "y= diabetes['target']\n",
+ "\n",
+ "df_A1_diabetes = pd.DataFrame(np.c_[X, y], columns= np.append(diabetes['feature_names'], ['target']))\n",
+ "df_A1_diabetes.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FOrG-8-o2qdG"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NyunIr6oaWEl"
+ },
+ "source": [
+ "## Exercise 6 - 120 years of Olympic history: athletes and results\n",
+ "* [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)\n",
+ " * Handle the variables 'sex', 'season', 'team', 'city', 'sport' and 'medal' appropriately;\n",
+ " * Apply the transformations we have just studied to the numeric columns 'height' and 'weight'. Watch out for the missing values in these variables!\n",
+ " * Check and assess the impact of outliers on these columns.\n",
+ " * In this case, which transformation is most suitable given the outliers?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "H94gYFj12xjW"
+ },
+ "source": [
+ "# Load pandas, NumPy and matplotlib\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yOknNCLW3gcg"
+ },
+ "source": [
+ "# Display configuration\n",
+ "d_configuracao = {\n",
+ " 'display.max_columns':1000,\n",
+ " 'display.expand_frame_repr':True,\n",
+ " 'display.max_rows':10,\n",
+ " 'display.precision':2,\n",
+ " 'display.show_dimensions':True\n",
+ " }"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qcLRA_uh35J-"
+ },
+ "source": [
+ "for op,value in d_configuracao.items():\n",
+ " pd.set_option(op, value)\n",
+ " print(op, value)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2bNfE4K24OjG"
+ },
+ "source": [
+ "from google.colab import drive\n",
+ "drive.mount('/content/drive')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZkmDSUlU7JTw"
+ },
+ "source": [
+ "!ls \"/content/drive/My Drive/\""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fhajXFlq4dwm"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "url='/content/drive/My Drive/athlete_events.zip' \n",
+ "df_olympics = pd.read_csv(url, compression='zip')\n",
+ "\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xR4Hr3829zA3"
+ },
+ "source": [
+ "df_olympics.head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CQA_Wdxe-aOF"
+ },
+ "source": [
+ "df_olympics[['Sex','Season','Team','City','Sport','Medal','Height','Weight']].head(5)  # column names are still capitalized here; they are lowercased in the next cells"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zBRZbYAC-Jdt"
+ },
+ "source": [
+ "def transformacao_lower(df):\n",
+ " # First transformation: apply lower() to the COLUMN names of the dataframe passed in (modifies it in place):\n",
+ " df.columns = [col.lower() for col in df.columns]\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ruCg9AXN-K6l"
+ },
+ "source": [
+ "transformacao_lower(df_olympics)\n",
+ "df_olympics.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o5fDp1Ib_Dg8"
+ },
+ "source": [
+ "# WOE - Weight Of Evidence\n",
+ "* The advantages of the WOE transformation are:\n",
+ " * It handles NaNs well;\n",
+ " * It handles outliers well;\n",
+ " * The transformation is based on the logarithm of the ratio between the distributions of the two target classes.\n",
+ " * With an appropriate binning technique, it can establish a monotonic (increasing or decreasing) relationship between the dependent and the independent variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wXEsP96A9TSd"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "import numpy as np\n",
+ "from sklearn import preprocessing\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "%matplotlib inline\n",
+ "\n",
+ "#import category_encoders as ce # library for applying WOE (Weight Of Evidence) to assess feature importance\n",
+ "\n",
+ "# remove warnings to keep notebook clean\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gGdOGDZAHu-V"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Z2W9PXAc-vHY"
+ },
+ "source": [
+ "from google.colab import drive\n",
+ "drive.mount('/content/drive', force_remount=True)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "g45JU2LXHwkz"
+ },
+ "source": [
+ "url = '/content/drive/My Drive/athlete_events.csv'\n",
+ "import pandas as pd\n",
+ "df_olympics = pd.read_csv(url)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JLxrnkJw_f7m"
+ },
+ "source": [
+ "pd.read_csv('/content/drive/My Drive/file/d/athlete_events.zip', compression='zip')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WcPQhh04E2du"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TlgG1t392nBY"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dl_Jma-i33XO"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From 00d3029fd0b858712ae72b96273f2a6e9f2124e5 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Mon, 19 Oct 2020 15:34:53 -0300
Subject: [PATCH 12/21] Criado usando o Colaboratory
---
...3DP_3_Data_Transformation_exercicios.ipynb | 291 ++++++++++++++++++
1 file changed, 291 insertions(+)
diff --git a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
index ddc2ee109..69fd2406a 100644
--- a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
+++ b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios.ipynb
@@ -1202,6 +1202,17 @@
"execution_count": null,
"outputs": []
},
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FOrG-8-o2qdG"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
{
"cell_type": "markdown",
"metadata": {
@@ -1216,6 +1227,253 @@
" * Neste caso, qual transformação é mais adequado diante dos outliers?"
]
},
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "H94gYFj12xjW"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "import numpy as np\n",
+ "from sklearn import preprocessing\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "%matplotlib inline\n",
+ "\n",
+ "#import category_encoders as ce # library for applying WOE (Weight Of Evidence) to assess feature importance\n",
+ "\n",
+ "# remove warnings to keep notebook clean\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yOknNCLW3gcg"
+ },
+ "source": [
+ "# Display configuration\n",
+ "d_configuracao = {\n",
+ " 'display.max_columns':1000,\n",
+ " 'display.expand_frame_repr':True,\n",
+ " 'display.max_rows':10,\n",
+ " 'display.precision':2,\n",
+ " 'display.show_dimensions':True\n",
+ " }"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qcLRA_uh35J-"
+ },
+ "source": [
+ "for op,value in d_configuracao.items():\n",
+ " pd.set_option(op, value)\n",
+ " print(op, value)\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2bNfE4K24OjG"
+ },
+ "source": [
+ "from google.colab import drive\n",
+ "drive.mount('/content/drive')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZkmDSUlU7JTw"
+ },
+ "source": [
+ "!ls \"/content/drive/My Drive/athlete_events.zip\""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fhajXFlq4dwm"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "url='/content/drive/My Drive/athlete_events.zip' \n",
+ "df_olympics = pd.read_csv(url, compression='zip')\n",
+ "\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xR4Hr3829zA3"
+ },
+ "source": [
+ "df_olympics.head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CQA_Wdxe-aOF"
+ },
+ "source": [
+ "df_olympics[['Sex','Season','Team','City','Sport','Medal','Height','Weight']].head(5)  # column names are still capitalized here; they are lowercased in the next cells"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zBRZbYAC-Jdt"
+ },
+ "source": [
+ "def transformacao_lower(df):\n",
+ " # First transformation: apply lower() to the COLUMN names of the dataframe passed in (modifies it in place):\n",
+ " df.columns = [col.lower() for col in df.columns]\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ruCg9AXN-K6l"
+ },
+ "source": [
+ "transformacao_lower(df_olympics)\n",
+ "df_olympics.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "a7h-jRIiEGxr"
+ },
+ "source": [
+ "df_atleta2 = df_olympics[['height','weight']].copy()  # copy so the in-place dropna below does not act on a slice of df_olympics"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8IRA3xp_EUA0"
+ },
+ "source": [
+ "df_atleta2.isnull().sum()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "UGEqcw_7EYg0"
+ },
+ "source": [
+ "df_atleta2.dropna(inplace=True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "MrvFs8s5EdqM"
+ },
+ "source": [
+ "df_atleta2.hist(bins=100)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "i8CQyOKQE6CM"
+ },
+ "source": [
+ "from sklearn.preprocessing import StandardScaler\n",
+ "atleta2_standard = StandardScaler().fit_transform(df_atleta2)\n",
+ "df_atleta2_standard = pd.DataFrame(atleta2_standard, columns=list(df_atleta2.keys()))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jAlPzqM4IgN0"
+ },
+ "source": [
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "atleta2_minmax = MinMaxScaler().fit_transform(df_atleta2)\n",
+ "df_atleta2_minmax = pd.DataFrame(atleta2_minmax, columns=list(df_atleta2.keys()))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "u1DCyxneIiIc"
+ },
+ "source": [
+ "from sklearn.preprocessing import RobustScaler\n",
+ "atleta2_robust = RobustScaler().fit_transform(df_atleta2)\n",
+ "df_atleta2_robust = pd.DataFrame(atleta2_robust, columns=list(df_atleta2.keys()))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GQWOmU0tFDh9"
+ },
+ "source": [
+ "df_atleta2_robust['tipo'] = 'robust'\n",
+ "df_atleta2_minmax['tipo'] = 'minmax'\n",
+ "df_atleta2_standard['tipo'] = 'standard'\n",
+ "df_atletax = pd.concat([df_atleta2_minmax,df_atleta2_robust,df_atleta2_standard])\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_jo551PjHxKX"
+ },
+ "source": [
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
{
"cell_type": "markdown",
"metadata": {
@@ -1310,6 +1568,39 @@
],
"execution_count": null,
"outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TlgG1t392nBY"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dl_Jma-i33XO"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GzO_OcAsEAEH"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
}
]
}
\ No newline at end of file
From 0507eb634af5a01392dfda16493e341055f09c2d Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Mon, 19 Oct 2020 15:43:46 -0300
Subject: [PATCH 13/21] Criado usando o Colaboratory
---
...ndas__Resposta_Exercicios_Aluno_Fifa.ipynb | 2788 +++++++++++++++++
1 file changed, 2788 insertions(+)
create mode 100644 Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
diff --git a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
new file mode 100644
index 000000000..7ea9a913f
--- /dev/null
+++ b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
@@ -0,0 +1,2788 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "NB10_01__Pandas.ipynb",
+ "provenance": [],
+ "private_outputs": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8fpUiw8PwC7_"
+ },
+ "source": [
+ "PANDAS PARA DATA ANALYSIS
\n",
+ "\n",
+ "\n",
+ "\n",
+ "# **Exercise Answers**\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wkxQFPPmeKLl"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eKawOG-neqaD"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iwd1lhq9mrD3"
+ },
+ "source": [
+ "___\n",
+ "# **Exercises**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o_cl0kFgQfFh"
+ },
+ "source": [
+ "## Exercise 1\n",
+ "* From the dataframes USA_Abbrev, USA_Area and USA_Population, build the dataframe USA containing the COLUMNS state, abbreviation, area, ages, year, population.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "s8rQUo7yHKJ1"
+ },
+ "source": [
+ "* Note: The easiest way to read a CSV file (exported from Excel, for example) from GitHub is to click the CSV file in your GitHub repository and then click 'Raw'. Next, copy the address shown in the browser and paste it into the variable 'url'. If in doubt, read the following document: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KTun1uSLuJ-A"
+ },
+ "source": [
+ "## Exercise 2\n",
+ "Source: https://github.com/aakankshaws/Pandas-exercises\n",
+ "\n",
+ "* Consider the dataframes below and merge the dataframe df_esquerdo with the dataframe df_direito:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Soq7GVZnuREq"
+ },
+ "source": [
+ "df_esquerdo = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n",
+ " 'A': ['A0', 'A1', 'A2', 'A3'],\n",
+ " 'B': ['B0', 'B1', 'B2', 'B3']})\n",
+ " \n",
+ "df_direito = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n",
+ " 'C': ['C0', 'C1', 'C2', 'C3'],\n",
+ " 'D': ['D0', 'D1', 'D2', 'D3']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6KEsTARfvM1C"
+ },
+ "source": [
+ "## Exercise 3\n",
+ "Source: https://github.com/aakankshaws/Pandas-exercises\n",
+ "\n",
+ "* Consider the dataframes below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hgxE5gZ9vMEg"
+ },
+ "source": [
+ "df_esquerdo = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],\n",
+ " 'key2': ['K0', 'K1', 'K0', 'K1'],\n",
+ " 'A': ['A0', 'A1', 'A2', 'A3'],\n",
+ " 'B': ['B0', 'B1', 'B2', 'B3']})\n",
+ " \n",
+ "df_direito = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],\n",
+ " 'key2': ['K0', 'K0', 'K0', 'K0'],\n",
+ " 'C': ['C0', 'C1', 'C2', 'C3'],\n",
+ " 'D': ['D0', 'D1', 'D2', 'D3']})"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iv7AmZ1ivm8R"
+ },
+ "source": [
+ "### Questions\n",
+ "* What is the output of each of the following commands, and how do you interpret it?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TWAW_1tuvvSO"
+ },
+ "source": [
+ "pd.merge(df_esquerdo, df_direito, on = ['key1', 'key2'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QjM7pBONvzCJ"
+ },
+ "source": [
+ "pd.merge(df_esquerdo, df_direito, how = 'outer', on = ['key1', 'key2'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D1Rr3Ghsv2iS"
+ },
+ "source": [
+ "pd.merge(df_esquerdo, df_direito, how = 'right', on = ['key1', 'key2'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vXQwLjT-v3Iu"
+ },
+ "source": [
+ "pd.merge(df_esquerdo, df_direito, how = 'left', on = ['key1', 'key2'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EIdltTC-t_lF"
+ },
+ "source": [
+ "## Exercise 5\n",
+ "5.1. Identify and delete the attributes of the dataframe df_Titanic that can be dropped at the very start of the data analysis."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bMwPLgWclWBq"
+ },
+ "source": [
+ "___\n",
+ "## Exercise 6 - Solved\n",
+ "* Load the dataframe Titanic_With_MV.csv and analyze it for inconsistencies and Missing Values (NaN)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ej6WjQX90n1E"
+ },
+ "source": [
+ "### Identifying and handling Missing Values\n",
+ "* In general, we delete variables with more than 50% Missing Values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nuaM4JKNLeSI"
+ },
+ "source": [
+ "df4.shape"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GaYc-HXNJ1TQ"
+ },
+ "source": [
+ "pd.set_option('display.max_rows', 500)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "v5s71jcHIGch"
+ },
+ "source": [
+ "df_missing_values = pd.DataFrame(df4.isnull().sum())\n",
+ "df_missing_values['mv_percent'] = 100*df_missing_values[0]/df4.shape[0]\n",
+ "df_missing_values.sort_values(by='mv_percent', ascending=False)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "V7KUGAX6lilP"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "df_Titanic = pd.read_csv('https://raw.githubusercontent.com/MathMachado/Python4DS/DS_Python/Dataframes/Titanic_With_MV.csv?token=AGDJQ63MNPPPROFNSO2BZW25XSR72', index_col='PassengerId')\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "m3UnAPJakCLR"
+ },
+ "source": [
+ "* Data dictionary for the Titanic dataframe:\n",
+ "    * PassengerID: passenger ID;\n",
+ "    * survived: indicator, where 1 = passenger survived and 0 = passenger died;\n",
+ "    * Pclass: ticket class;\n",
+ "    * Age: passenger age;\n",
+ "    * SibSp: number of siblings/spouses aboard;\n",
+ "    * Parch: number of parents/children aboard;\n",
+ "    * Fare: fare paid by the passenger;\n",
+ "    * Cabin: passenger cabin;\n",
+ "    * Embarked: port where the passenger embarked;\n",
+ "    * Name: passenger name;\n",
+ "    * sex: passenger sex.\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_6RvRCXgwomw"
+ },
+ "source": [
+ "### Checking the COLUMNS for inconsistencies"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PToomnfRxxI5"
+ },
+ "source": [
+ "import seaborn as sns\n",
+ "import pandas as pd\n",
+ "import numpy as np"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3nc_iuRR1Tju"
+ },
+ "source": [
+ "# Standardizing the COLUMN names\n",
+ "df_Titanic.columns= [cols.lower() for cols in df_Titanic.columns]\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "G9jteCnAxdnK"
+ },
+ "source": [
+ "### Column 'pclass'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wUk0YNlxsgvf"
+ },
+ "source": [
+ "df_Titanic['pclass'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9vPrB3AAx0Ym"
+ },
+ "source": [
+ "sns.countplot(x = 'survived', hue ='pclass', data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2n8s9Ad1m7od"
+ },
+ "source": [
+ "Nothing looks odd about the variable 'pclass'. Or do you see something strange?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "m8EGM6gSxrzS"
+ },
+ "source": [
+ "### Column 'sex'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BRRgcLtinIRz"
+ },
+ "source": [
+ "sns.countplot(x = 'survived', hue ='sex', data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8SQ8v2Wnspfb"
+ },
+ "source": [
+ "df_Titanic['sex'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wpp0iL0kyGTl"
+ },
+ "source": [
+ "What do you think of how this column is filled in?\n",
+ "\n",
+ "Any problems?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jxx06kJFnNrP"
+ },
+ "source": [
+ "Oops... We have several problems here. Looking at these results, do you agree that 'male', 'm', 'MALE', 'M', 'mALE' and 'Men' all carry the same information?\n",
+ "\n",
+ "Likewise, do 'female', 'f', 'F', 'Female', 'fEMALE', 'Woman', 'w' and 'W' also carry the same information?\n",
+ "\n",
+ "So let's do the following:\n",
+ "\n",
+ "* Whenever I find one of these values: ['m', 'MALE', 'M', 'mALE', 'Men'], I will replace it with 'male';\n",
+ "* Whenever I find one of these values: ['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W'], I will replace it with 'female'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "oQbEVi1t2tfR"
+ },
+ "source": [
+ "df_Titanic2= df_Titanic.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "apc-ccODyZ-d"
+ },
+ "source": [
+ "#### Fixing with df.replace()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CwoyLBK9oME5"
+ },
+ "source": [
+ "df_Titanic['sex2'] = df_Titanic['sex'].replace(['m', 'MALE', 'M', 'mALE', 'Men'], 'male')\n",
+ "df_Titanic['sex3'] = df_Titanic['sex2'].replace(['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W'], 'female') "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RC35I-Njp4vh"
+ },
+ "source": [
+ "Let's look at the distribution of the data in the plot again:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1eGvEhA9qAN6"
+ },
+ "source": [
+ "sns.countplot(x = 'survived', hue ='sex3', data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IY3TaKUcszTQ"
+ },
+ "source": [
+ "df_Titanic['sex3'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2nOAcv3iqEaK"
+ },
+ "source": [
+ "OK, we have indeed fixed the data-entry problems in the variable 'sex'."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dqLqmrTWylY3"
+ },
+ "source": [
+ "#### Fixing with Series.map()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dRvuNo4E3Ewx"
+ },
+ "source": [
+ "df_Titanic= df_Titanic2.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3X0_ZdwCyquk"
+ },
+ "source": [
+ "d_sexo= {}\n",
+ "d_sexo.update(dict.fromkeys(['m', 'MALE', 'M', 'mALE', 'Men', 'male'], 'male'))\n",
+ "d_sexo.update(dict.fromkeys(['f', 'F', 'Female', 'fEMALE', 'Woman', 'w', 'W', 'female'], 'female'))\n",
+ "d_sexo"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YQ3lwKRKbsx0"
+ },
+ "source": [
+ "Apply the transformation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "idBwRNI7bvCC"
+ },
+ "source": [
+ "df_Titanic['sex2'] = df_Titanic['sex'].map(d_sexo)\n",
+ "df_Titanic['sex2'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FzDl78rfb3p5"
+ },
+ "source": [
+ "What is the conclusion? Does this result make more sense than the previous one?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SvrZtKRpzIDc"
+ },
+ "source": [
+ "# Drop the original 'sex' column and rename 'sex2' to 'sex':\n",
+ "df_Titanic = df_Titanic.drop(columns=['sex']).rename(columns={'sex2': 'sex'})\n",
+ "\n",
+ "# Show the data:\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6URC6h8xzfc5"
+ },
+ "source": [
+ "sns.catplot(x=\"sex\", kind=\"count\", data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "k_spkJbmqdRW"
+ },
+ "source": [
+ "sns.countplot(x = 'survived', hue ='sex', data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bgBNoXUNzoWZ"
+ },
+ "source": [
+ "### Feature Engineering\n",
+ "#### Column 'cabin'\n",
+ "* Build the COLUMNS:\n",
+ "    * deck - the letter in Cabin;\n",
+ "    * seat - the number in Cabin"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8fHsLrnut6mk"
+ },
+ "source": [
+ "Suggestions:\n",
+ "1) Do not discard any information (Fábio);\n",
+ "\n",
+ "2) A column with the number of cabins booked (Thomaz)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "p0NFFxx8z-vq"
+ },
+ "source": [
+ "set(df_Titanic['cabin'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7E6yje89u7KF"
+ },
+ "source": [
+ "As we can see, this is a categorical variable with many levels. So let's capture only the first letter of the variable 'cabin'. To do that, we will use the str.slice() method.\n",
+ "\n",
+ "> str.slice() - Extracts a substring from each element of a string Series."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wmZLlSaArR6F"
+ },
+ "source": [
+ "Next, we capture the first letter of the variable 'cabin':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hUZTJU0MvVxP"
+ },
+ "source": [
+ "# define the variable 'deck', which holds the first letter of the variable 'cabin'\n",
+ "df_Titanic[\"deck\"] = df_Titanic[\"cabin\"].str.slice(0, 1) # slice(start, stop)\n",
+ "df_Titanic['deck'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6myhrth0rZ6t"
+ },
+ "source": [
+ "Next, let's extract the numeric part of the variable 'cabin' using Regular Expressions:\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8UXkACPmsfwN"
+ },
+ "source": [
+ "# Import the Regular Expressions library\n",
+ "import re"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QKk-fnW4rf4o"
+ },
+ "source": [
+ "# First, we use split() to separate the contents of the variable into COLUMNS:\n",
+ "new = df_Titanic[\"cabin\"].str.split(\" \", n = 3, expand = True) \n",
+ "new.head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dFqoR-Xew9gX"
+ },
+ "source": [
+ "Note above that the command produces as many splits of the variable as I ask for (the n parameter). For simplicity, however, I am only interested in the first split."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_M7vA6WoVG05"
+ },
+ "source": [
+ "Now I will extract the passenger's seat number using Regular Expressions:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rVH5o9KT_IH3"
+ },
+ "source": [
+ "# Here is the content of new[0]:\n",
+ "new[0].head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "P7NTcsGOxxSX"
+ },
+ "source": [
+ "new2 = new[0].str.extract(r'(\\d+)', expand=False)\n",
+ "new2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bf8vw2Mc18bQ"
+ },
+ "source": [
+ "Finally, I will add this information to the dataframe df_Titanic:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6l6EoRvsxRXn"
+ },
+ "source": [
+ "df_Titanic[\"seat\"] = new2\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LK4V61uy3N9s"
+ },
+ "source": [
+ "Finally, drop the variable 'cabin':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4uAr55J43NY7"
+ },
+ "source": [
+ "df_Titanic= df_Titanic.drop(columns= [\"cabin\"], axis =1, errors=\"ignore\")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qZuH7YJXZCgY"
+ },
+ "source": [
+ "### Column 'embarked'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nTPikhrIZGya"
+ },
+ "source": [
+ "df_Titanic['embarked'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ixbZsuqOZsOc"
+ },
+ "source": [
+ "sns.catplot(x=\"embarked\", kind=\"count\", data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VvdU8aAwZNvG"
+ },
+ "source": [
+ "I don't see any problems with this variable. Let's move on..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "k2SLRAhrub_B"
+ },
+ "source": [
+ "sns.countplot(x = 'survived', hue ='embarked', data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YRJcWaYkuxK4"
+ },
+ "source": [
+ "sns.countplot(x = 'pclass', hue ='embarked', data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rzrOUULUu6-P"
+ },
+ "source": [
+ "sns.countplot(x = 'sex', hue ='embarked', data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DfSMcYYZ5yLV"
+ },
+ "source": [
+ "### Column 'pclass'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Q2uU0k-G5yLN"
+ },
+ "source": [
+ "df_Titanic['pclass'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Gue26Y3A5yLL"
+ },
+ "source": [
+ "Any problems with this variable?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "q3P82wPp5yK8"
+ },
+ "source": [
+ "sns.catplot(x=\"pclass\", kind=\"count\", data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Qrnc6VUKSTNp"
+ },
+ "source": [
+ "### Column 'parch'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2i4ed-0zSvJc"
+ },
+ "source": [
+ "df_Titanic['parch'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qd7u__6KZ6DM"
+ },
+ "source": [
+ "sns.catplot(x=\"parch\", kind=\"count\", data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Z9vM3vktC7BG"
+ },
+ "source": [
+ "### Feature Engineering\n",
+ "* Create the column 'sozinho_parch', where sozinho_parch = 1 means the passenger is travelling alone (no parents/children aboard) and 0 means otherwise."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Nd4TyOYjs-HW"
+ },
+ "source": [
+ "# Return 1 when variavel is 0 (travelling alone) and 0 otherwise\n",
+ "def sozinho(variavel):\n",
+ "    if variavel == 0:\n",
+ "        return 1\n",
+ "    else:\n",
+ "        return 0"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5oByiBuos_B3"
+ },
+ "source": [
+ "df_Titanic['sozinho_parch'] = df_Titanic['parch'].map(sozinho)\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "C1ICby1oSd41"
+ },
+ "source": [
+ "### Column 'sibsp'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5n7JNEQqTNjz"
+ },
+ "source": [
+ "df_Titanic['sibsp'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NLfMhiy0x4u5"
+ },
+ "source": [
+ "* Any problems?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nayYFRK9g8iV"
+ },
+ "source": [
+ "sns.catplot(x=\"sibsp\", kind=\"count\", data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KzCX2MTmE9Tw"
+ },
+ "source": [
+ "sns.countplot(x = 'survived', hue ='sibsp', data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_58rZqMaDzf-"
+ },
+ "source": [
+ "### Feature Engineering:\n",
+ "* Create the attribute 'sozinho_sibsp', where sozinho_sibsp = 1 means the passenger is travelling alone (no siblings/spouse aboard) and 0 means otherwise."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HUrJ4IywrEoA"
+ },
+ "source": [
+ "df_Titanic['sozinho_sibsp'] = df_Titanic['sibsp'].map(sozinho)\n",
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0MO9jj2NvGp_"
+ },
+ "source": [
+ "### Column 'fare'\n",
+ "> Discretize the column 'fare' into 10 buckets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4-qO2Xk76Buz"
+ },
+ "source": [
+ "df_Titanic['fare_class'] = pd.qcut(df_Titanic['fare'], 10, labels=False)\n",
+ "df_Titanic['fare_class'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "boAj64RHvQHu"
+ },
+ "source": [
+ "sns.catplot(x=\"fare_class\", kind=\"count\", data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3CIqHUJpvcPa"
+ },
+ "source": [
+ "### Column 'age'\n",
+ "> Discretize the column 'age' into 10 buckets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rCRnbKX57VN-"
+ },
+ "source": [
+ "df_Titanic['age_class'] = pd.qcut(df_Titanic['age'], 10, labels=False)\n",
+ "df_Titanic['age_class'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uFsZLYDi7VOH"
+ },
+ "source": [
+ "sns.catplot(x=\"age_class\", kind=\"count\", data = df_Titanic)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DIY-sL337uje"
+ },
+ "source": [
+ "#### Alternative way to discretize 'age'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "W66GkyuKkhFe"
+ },
+ "source": [
+ "def Age_Category(age):\n",
+ "    if (age <= 1):\n",
+ "        return 1\n",
+ "    elif (age <= 5):\n",
+ "        return 2\n",
+ "    elif (age <= 10):\n",
+ "        return 3\n",
+ "    elif (age <= 15):\n",
+ "        return 4\n",
+ "    elif (age <= 20):\n",
+ "        return 5\n",
+ "    elif (age <= 25):\n",
+ "        return 6\n",
+ "    elif (age < 30):\n",
+ "        return 7\n",
+ "    elif (age < 35):\n",
+ "        return 8\n",
+ "    elif (age < 40):\n",
+ "        return 9\n",
+ "    elif (age < 45):\n",
+ "        return 10\n",
+ "    elif (age < 50):\n",
+ "        return 11\n",
+ "    elif (age < 60):\n",
+ "        return 12\n",
+ "    elif (age < 70):\n",
+ "        return 13\n",
+ "    elif (age < 80):\n",
+ "        return 14\n",
+ "    else:\n",
+ "        return 15"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TnLzC6hCkuBL"
+ },
+ "source": [
+ "df_Titanic['age_class2'] = df_Titanic['age'].map(Age_Category)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kG8td6HPsNlP"
+ },
+ "source": [
+ "set(df_Titanic['age_class2']) # Shows the distinct categories; note that NaN ages fall into the else branch (15)."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B_3s5cgxfNKQ"
+ },
+ "source": [
+ "### Column 'title'\n",
+ "\n",
+ "* For Data Manipulation purposes, let's capture the passengers' form of address contained in the variable 'name', that is, 'Mr.', 'Mrs.', 'Miss' and so on.\n",
+ "\n",
+ "> Source: The functions get_title and title_map were taken from https://www.kaggle.com/tjsauer/titanic-survival-python-solution"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gslSjRdDoJFY"
+ },
+ "source": [
+ "df_Titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XjqEVVnr8R4d"
+ },
+ "source": [
+ "First, let's understand how it works, step by step..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D6gjWc3XozK7"
+ },
+ "source": [
+ "'Allen, Mr. William Henry'.split(',')[1].split('.')[0].strip()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nfIG6toGfhd5"
+ },
+ "source": [
+ "def get_title(nome):\n",
+ "    if '.' in nome:\n",
+ "        return nome.split(',')[1].split('.')[0].strip()\n",
+ "    else:\n",
+ "        return 'Unknown'\n",
+ "\n",
+ "def title_map(title):\n",
+ "    if title in ['Mr', 'Ms']:\n",
+ "        return 1\n",
+ "    elif title in ['Master']:\n",
+ "        return 2\n",
+ "    elif title in ['Ms','Mlle','Miss']:\n",
+ "        return 3\n",
+ "    elif title in [\"Mme\", \"Ms\", \"Mrs\"]:\n",
+ "        return 4\n",
+ "    elif title in [\"Jonkheer\", \"Don\", \"Sir\", \"the Countess\", \"Dona\", \"Lady\"]:\n",
+ "        return 5\n",
+ "    elif title in [\"Capt\", \"Col\", \"Major\", \"Dr\", \"Rev\"]:\n",
+ "        return 6\n",
+ "    else:\n",
+ "        return 7"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HLQoJwf0rjrf"
+ },
+ "source": [
+ "Exercises\n",
+ "* Improve the function title_map."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7qNUwnCepe_x"
+ },
+ "source": [
+ "Capture the passengers' form of address:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "r-Ltf33vgJ6Q"
+ },
+ "source": [
+ "df_Titanic['title'] = df_Titanic['name'].apply(get_title).apply(title_map)\n",
+ "set(df_Titanic['title']) # Shows the distinct title codes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "D3hY0WVhpRYK"
+ },
+ "source": [
+ "Drop the column 'name', since we will no longer need it in our analyses:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Y8i3xKCes5WF"
+ },
+ "source": [
+ "df_Titanic = df_Titanic.drop(columns=[\"name\"])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7Sl1uFdwpW3y"
+ },
+ "source": [
+ "Show the contents of the dataframe:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2uFnw-pZpan-"
+ },
+ "source": [
+ "df_Titanic.head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B0fZMKKpdHIl"
+ },
+ "source": [
+ "## Missing Value\n",
+ "> Handle the NaNs in the COLUMNS of the dataframe df_Titanic appropriately.\n",
+ "\n",
+ "**Question**: In the column 'value', are the 0 (zero) values considered Missing Values?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "UHzKFytXsNkh"
+ },
+ "source": [
+ "df_Titanic['age'].isna().sum()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZC1ULWd883t2"
+ },
+ "source": [
+ "## Cause --> effect relationship"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_WCbklv0bDlp"
+ },
+ "source": [
+ "The functions below will help us with Data Visualization by crossing the response variable 'survived' with any other variable passed to them:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "epxI-F2UbGGS"
+ },
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "def taxa_sobrevivencia(df, column):\n",
+ "    title_xt = pd.crosstab(df[column], df['survived'])\n",
+ "    print(pd.crosstab(df[column], df['survived'], margins=True))\n",
+ "    title_xt_pct = title_xt.div(title_xt.sum(axis=1).astype(float), axis=0)\n",
+ "    title_xt_pct.plot(kind='bar', stacked=True, title='Passenger Survival Rate', color=['r', 'g'])\n",
+ "    plt.xlabel(column)\n",
+ "    plt.ylabel('Survival Rate')\n",
+ "    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), shadow=True, ncol=2)\n",
+ "    plt.show()\n",
+ "\n",
+ "def grafico_catplot(x, y, hue='survived', col=None):\n",
+ "    plt.rcdefaults()\n",
+ "    g = sns.catplot(x=x, y=y, hue=hue, palette={0: 'red', 1: 'blue'}, col=col, data=df_Titanic, kind='bar', height=4, aspect=.7)\n",
+ "    plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "34-Qbd_QrC8W"
+ },
+ "source": [
+ "What is the relationship between the variable 'sex' and the response variable?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bhY8-UjyrC8Z"
+ },
+ "source": [
+ "taxa_sobrevivencia(df_Titanic, 'sex')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UbexhGtayV4X"
+ },
+ "source": [
+ "## Exercise 7\n",
+ "See the page [Pandas Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/pandas/index.php) for more exercises on this topic."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "P62MXm3tK8Ty"
+ },
+ "source": [
+ "## Exercise 8\n",
+ "Create the column 'aleatorio' in the dataframe df_Titanic, where each row receives a random value from np.random.random()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Du7Y8E4uFmiu"
+ },
+ "source": [
+ "i_linhas_Titanic = df_Titanic.shape[0]\n",
+ "\n",
+ "df_Titanic['aleatorio'] = np.random.random(i_linhas_Titanic)\n",
+ "df_Titanic.head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LMD3HksDL0PQ"
+ },
+ "source": [
+ "## Exercise 9\n",
+ "\n",
+ "1. Load the file FIFA.csv (it is in the course's Dataframes area);\n",
+ "2. Which columns can be dropped from the analysis up front? Why is it important to identify what can be dropped?\n",
+ "3. What is the dtype of each variable/attribute of the dataframe?\n",
+ "4. If a variable/attribute is a string (object) but should be numeric, how do we change its type?\n",
+ "5. Normalize the column names, that is, rename the columns to lowercase;\n",
+ "6. Are there Missing values in the data? If so, what is your (the group's) proposal for handling them?\n",
+ "7. What is the distribution of the number of players by country? Present a table with the distribution.\n",
+ "8. What is the average player age by country (variable/attribute 'Nationality')?\n",
+ "9. How many players are there at each age?\n",
+ "10. How many players does each club have?\n",
+ "11. What is the average age by club?\n",
+ "12. What is the average wage by country?\n",
+ "13. What is the average wage by club?\n",
+ "14. What is the average wage by age?\n",
+ "15. How much does each club spend on wages?\n",
+ "16. What insights (what can you discover) do you get about the variable 'Potential' (which measures the players' potential)?\n",
+ "17. What insights do you get about the variable overall (the athlete's average rating) by age, club and country?\n",
+ "18. Which are the best clubs if we take the variables Potential and Overall into account?\n",
+ "19. Present the ranking of goalkeepers (use the variable/attribute 'Preferred Positions') by Potential and Overall. We are looking for 'GK'.\n",
+ "20. Who are the fastest players (variable/attribute 'Sprint speed')?\n",
+ "21. Who are the 5 best players in terms of shooting (shot power) (use the variable/attribute 'Shot power')?\n",
+ "22. Who are the outliers in terms of wage?\n",
+ "23. Who are the outliers in terms of shot power?\n",
+ "24. What is the correlation between the variable 'value' and the other numeric variables of the dataframe, and how do you interpret it?\n",
+ "25. Build dummy variables for the columns preferred_foot and work_rate (for example, preferred_foot_left).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "70Ml5KyZ04mk"
+ },
+ "source": [
+ "The values of the variable \"Position\" mean the following:\n",
+ "* GK = Goalkeeper – Goleiro.\n",
+ "* RB = Right Back – Zagueiro Direito.\n",
+ "* CB = Central Back – Zagueiro Central.\n",
+ "* LB = Left Back – Zagueiro Esquerdo.\n",
+ "* SW = Sweeper – Líbero.\n",
+ "* RWB = Right Wing Back – Lateral Direito.\n",
+ "* LWB = Left Wing Back – Lateral Esquerdo.\n",
+ "* CDM = Central Defensive Midfielder – Meio Campo Defensivo / Volante.\n",
+ "* CM = Central Midfielder – Meia Central.\n",
+ "* CAM = Central Attacking Midfielder – Meio Campo Ofensivo / Armador.\n",
+ "* OM = Offensive Midfielder – Meia Ofensivo.\n",
+ "* LOM = Left Offensive Midfielder – Meia Esquerda Ofensivo.\n",
+ "* ROM = Right Offensive Midfielder – Meia Direita Ofensivo.\n",
+ "* LM = Left Midfielder – Meia Esquerda.\n",
+ "* RM = Right Midfielder – Meia Direita.\n",
+ "* LWM = Left Wing Midfielder – Meio Ala Esquerdo.\n",
+ "* RWM = Right Wing Midfielder – Meio Ala Direito.\n",
+ "* RW = Right Winger – Ala Direito.\n",
+ "* LW = Left Winger – Ala Esquerdo.\n",
+ "* LF = Left Forward – Atacante Esquerdo.\n",
+ "* RF = Right Forward – Atacante Direito.\n",
+ "* ST = Striker – Atacante.\n",
+ "* CF = Center Forward – Centro Avante.\n",
+ "* RS = Right Striker – Atacante Direito.\n",
+ "* LS = Left Striker – Atacante Esquerdo."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tjHDjj68zawa"
+ },
+ "source": [
+ "## 1. Load the file FIFA.csv (it is in the Dataframes area of the course);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Wzosi4Ue1vDs"
+ },
+ "source": [
+ "### Load the required libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "B0fqR6rzMAa3"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vgoLTamaOC50"
+ },
+ "source": [
+ "#### Configure the environment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RRwi_z8JOFrD"
+ },
+ "source": [
+ "d_configuracao = {\n",
+ " 'display.max_columns': 1000,\n",
+ " 'display.expand_frame_repr': True,\n",
+ " 'display.max_rows': 10,\n",
+ " 'display.precision': 2,\n",
+ " 'display.show_dimensions': True\n",
+ " }\n",
+ "\n",
+ "for op, value in d_configuracao.items():\n",
+ " pd.set_option(op, value)\n",
+ " print(op, value)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MdVljEbcMGU9"
+ },
+ "source": [
+ "#### Load the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GMivDUHEMFKp"
+ },
+ "source": [
+ "df = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/FIFA.csv?token=AGDJQ63GC7SPIHTGNW73QB27RXRN6')\n",
+ "df.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7pDUpFVLTOfl"
+ },
+ "source": [
+ "#### Set the 'ID' column as the dataframe index"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TEue20CbMp9U"
+ },
+ "source": [
+ "df.set_index('ID', inplace = True)\n",
+ "df.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "G8CDrpI1_wMd"
+ },
+ "source": [
+ "### Function to remove the \"+\" or \"-\" signs from some columns/variables:\n",
+ "* Did you notice that some columns have a \"+\" sign in their values?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7zqHkNCsEDpJ"
+ },
+ "source": [
+ "Below are some examples of columns with this problem:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_hUvJbCqCBBl"
+ },
+ "source": [
+ "df[['RS', 'LS', 'ST']].head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "78QhptWdEIB0"
+ },
+ "source": [
+ "Next, we define a dataframe called df_string holding the parts of 'RS' obtained by splitting on the \"+\" sign. Note that we get at most 2 columns. Why?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DzeSvQMGF4G7"
+ },
+ "source": [
+ "df_string = df['RS'].str.split(r'\\+', n = 4, expand = True) # n is the maximum number of splits performed.\n",
+ "df_string.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PEzqRR5CEUru"
+ },
+ "source": [
+ "df_string[0] = pd.to_numeric(df_string[0])\n",
+ "df_string[1] = pd.to_numeric(df_string[1])\n",
+ "df_string['RS2'] = df_string[0]+df_string[1]\n",
+ "\n",
+ "df_string.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2t4rnjRWFPON"
+ },
+ "source": [
+ "df_string.dtypes"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "MAYju4f6GFzw"
+ },
+ "source": [
+ "df_string.drop(columns = [0, 1], inplace = True)\n",
+ "df = pd.merge(df, df_string, how = 'left', on = 'ID')\n",
+ "df.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sm5lOGrrHoDp"
+ },
+ "source": [
+ " **Challenge**: Next step: turn this into a function to handle the remaining variables!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QtmOlKNpzbOz"
+ },
+ "source": [
+ "## 2. Which columns can be dropped from the analysis up front? Why is it important to identify what can be dropped?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7TzcuD2GxfBP"
+ },
+ "source": [
+ "### Columns that could be dropped up front:\n",
+ "* Photo\n",
+ "* Flag\n",
+ "* Club Logo\n",
+ "* Unnamed: 0"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kXDe_AdEx3DD"
+ },
+ "source": [
+ "df2 = df.copy()\n",
+ "\n",
+ "l_cols_drop = ['Unnamed: 0', 'Photo', 'Flag', 'Club Logo']\n",
+ "df2.drop(columns = l_cols_drop, inplace = True)\n",
+ "df2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "m97dcDy9zbSO"
+ },
+ "source": [
+ "## 3. What is the dtype of each variable/attribute of the dataframe?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GEbvITXR2U17"
+ },
+ "source": [
+ "# Function that prints the dtype of each column:\n",
+ "def mostra_tipo(df):\n",
+ " d_tipos = dict(zip(df.columns, df.dtypes))\n",
+ " for item in d_tipos.items():\n",
+ " print(item)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3B9vxmbl9HNP"
+ },
+ "source": [
+ "mostra_tipo(df2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5XKcxC0Pzshm"
+ },
+ "source": [
+ "## 4. If a variable/attribute is of type string (object) but should be numeric, how do we change its type?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "A7T31nFiPdDu"
+ },
+ "source": [
+ "### Change the type of some columns\n",
+ "* Example: 'Wage', 'Value' and 'Release Clause'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VJSsvOpK71n7"
+ },
+ "source": [
+ "df4 = df2.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xyV-_MY9688C"
+ },
+ "source": [
+ "def transforma_monetarias(coluna):\n",
+ " if 'M' in coluna:\n",
+ " return int(float(coluna.replace('M', '')) * 1000000)\n",
+ "\n",
+ " elif 'K' in coluna:\n",
+ " return int(float(coluna.replace('K', '')) * 1000)\n",
+ " \n",
+ " else:\n",
+ " return int(coluna) "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AJ9-8sVS6MXj"
+ },
+ "source": [
+ "Replacing the \"€\" symbol with '':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ArgK2NVe6vqz"
+ },
+ "source": [
+ "l_colunas_monetarias = ['Value', 'Wage']\n",
+ "\n",
+ "for coluna in l_colunas_monetarias:\n",
+ " df4[coluna] = df4[coluna].str.replace('€', '')\n",
+ " df4[coluna] = df4[coluna].apply(lambda x: transforma_monetarias(x))\n",
+ "\n",
+ "df4.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c_lznTRHzbV9"
+ },
+ "source": [
+ "## 5. Normalize the column names, that is, rename the columns to lowercase;"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "usM674sR8Gv9"
+ },
+ "source": [
+ "df5 = df4.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "N6LCmJ0QUsJo"
+ },
+ "source": [
+ "### Column names --> Replace \" \" with \"_\" in the names"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NWJYqphfUxn1"
+ },
+ "source": [
+ "df5.columns = [c.replace(' ', '_') for c in df5.columns]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lXUOzLWmVTNZ"
+ },
+ "source": [
+ "### Rename the columns using lower()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZwwLMOYRVXnr"
+ },
+ "source": [
+ "df5.columns = [c.lower() for c in df5.columns]\n",
+ "mostra_tipo(df5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Uc12gBThz1nD"
+ },
+ "source": [
+ "## 6. Are there missing values in the data? If so, what is your proposal (the group's proposal) for handling them?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nYgvxvcT8QIT"
+ },
+ "source": [
+ "df6 = df5.copy()\n",
+ "df6.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9STC9fsWJAHn"
+ },
+ "source": [
+ "# Save a permanent copy of selected columns of df6 for future use\n",
+ "df6[['overall', 'potential', 'value', 'wage', 'nationality', 'position', 'age', 'preferred_foot']].to_csv('FIFA_algumas_features.csv')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ESFYFvOy8XOM"
+ },
+ "source": [
+ "Here I replace the missing values with the median. Feel free to use other alternatives such as the min, max, mean, or the upper and lower outlier limits."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "j7zDrRvi8iay"
+ },
+ "source": [
+ "l_colunas_numericas = df6.select_dtypes(np.number).columns.tolist()\n",
+ "l_colunas_numericas"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mZEM0N2f9vi7"
+ },
+ "source": [
+ "# Median before the replacement:\n",
+ "df6[l_colunas_numericas].median()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dzfw0kp69dK2"
+ },
+ "source": [
+ "# Replacement with the median (fillna returns a new Series, so assign it back)\n",
+ "for coluna in l_colunas_numericas:\n",
+ " df6[coluna] = df6[coluna].fillna(df6[coluna].median())\n",
+ "\n",
+ "# Median after the replacement:\n",
+ "df6[l_colunas_numericas].median()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jpQR9zDC-nEj"
+ },
+ "source": [
+ "Below, I identified 252 records with value = 0 --> In these cases, I will assign the median as well."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "s1Zj3gBJ-Z5c"
+ },
+ "source": [
+ "df6[df6['value'] == 0]['value'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HjuNw2u6-7i9"
+ },
+ "source": [
+ "# Median before\n",
+ "df6['value'].median()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VWEp0Tc_-vLD"
+ },
+ "source": [
+ "# Assign the median to the 0 values of 'value'\n",
+ "df6.loc[df6['value'] == 0, 'value'] = df6['value'].median()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HynCT_Yu_JL-"
+ },
+ "source": [
+ "# Median after\n",
+ "df6['value'].median()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B4O5kw6h_z3H"
+ },
+ "source": [
+ "What if we had used the mean instead of the median? Would anything have changed?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eU7ybhA2zbZh"
+ },
+ "source": [
+ "## 7. What is the distribution of the number of players by country? Present a table with the distribution."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "A34BwvXrXAqU"
+ },
+ "source": [
+ "df7 = df6.copy()\n",
+ "df7.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fu87YSiudcM_"
+ },
+ "source": [
+ "df7.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IQQ7AvgBYZmx"
+ },
+ "source": [
+ "df_jogadores_por_paises = pd.DataFrame(df7.groupby(by=['nationality']).size())\n",
+ "df_jogadores_por_paises.columns = ['numero_jogadores']\n",
+ "df_jogadores_por_paises.sort_values(by = ['numero_jogadores'], ascending = False, inplace= True)\n",
+ "df_jogadores_por_paises = df_jogadores_por_paises.reset_index()\n",
+ "df_jogadores_por_paises\n",
+ "\n",
+ "# In a single line it would look like this:\n",
+ "df_jogadores_por_paises2 = pd.DataFrame(df7.groupby(by=['nationality']).size(), columns= ['numero_jogadores']).sort_values(by = ['numero_jogadores'], ascending = False).reset_index()\n",
+ "df_jogadores_por_paises2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JfyDUEC2zbcv"
+ },
+ "source": [
+ "## 8. What is the average age of the players by country (variable/attribute 'Nationality')?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0a9MvyWPcu-C"
+ },
+ "source": [
+ "df_media_idade_por_paises = df7.groupby(by = ['nationality']).agg({'age': ['count', 'mean']}).reset_index()\n",
+ "df_media_idade_por_paises.columns = ['nationality', 'numero_jogadores', 'media_idade']\n",
+ "df_media_idade_por_paises.sort_values(by = ['media_idade'], ascending = False, inplace = True)\n",
+ "df_media_idade_por_paises.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vNmu0xyg0CW4"
+ },
+ "source": [
+ "## 9. How many players are there of each age?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DRVvPgpRf9vw"
+ },
+ "source": [
+ "df_jogadores_por_idade = df7.groupby(by = ['age']).agg({'age': ['count']}).reset_index()\n",
+ "df_jogadores_por_idade.columns = ['age', 'numero_jogadores']\n",
+ "df_jogadores_por_idade.sort_values(by = ['numero_jogadores'], ascending = False, inplace = True)\n",
+ "df_jogadores_por_idade.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8eChi2NW0CZp"
+ },
+ "source": [
+ "## 10. How many players does each club have?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JpNI3ZlHgUx1"
+ },
+ "source": [
+ "df_jogadores_por_clube = df7.groupby(by = ['club']).size().reset_index()\n",
+ "df_jogadores_por_clube.columns = ['clube', 'numero_jogadores']\n",
+ "df_jogadores_por_clube.sort_values(by = ['numero_jogadores'], ascending = False, inplace = True)\n",
+ "df_jogadores_por_clube.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gMiibNwW0Cck"
+ },
+ "source": [
+ "## 11. What is the average age by club?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D9rF9frzgqSr"
+ },
+ "source": [
+ "df_media_idade_por_clube = df7.groupby(by = ['club']).agg({'age': ['count', 'mean']}).reset_index()\n",
+ "df_media_idade_por_clube.columns = ['clube', 'numero_jogadores', 'media_idade']\n",
+ "df_media_idade_por_clube.sort_values(by = ['media_idade'], ascending = False, inplace = True)\n",
+ "df_media_idade_por_clube.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uE_o76xH0QU-"
+ },
+ "source": [
+ "## 12. What is the average wage by country?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "keQXqnU7hJy4"
+ },
+ "source": [
+ "df_media_salario_por_pais = df7.groupby(by = ['nationality']).agg({'wage': ['count', 'mean']}).reset_index()\n",
+ "df_media_salario_por_pais.columns = ['nationality', 'numero_jogadores', 'media_salario']\n",
+ "df_media_salario_por_pais.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n",
+ "df_media_salario_por_pais.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vqT1ozNA0Cfd"
+ },
+ "source": [
+ "## 13. What is the average wage by club?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "54_Q2IGchmN-"
+ },
+ "source": [
+ "df_media_salario_por_clube = df7.groupby(by = ['club']).agg({'wage': ['count', 'mean']}).reset_index()\n",
+ "df_media_salario_por_clube.columns = ['clube', 'numero_jogadores', 'media_salario']\n",
+ "df_media_salario_por_clube.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n",
+ "df_media_salario_por_clube.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4eflozOo0Cif"
+ },
+ "source": [
+ "## 14. What is the average wage by age?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Xtq9Am60hwGr"
+ },
+ "source": [
+ "df_media_salario_por_idade = df7.groupby(by = ['age']).agg({'wage': ['count', 'mean']}).reset_index()\n",
+ "df_media_salario_por_idade.columns = ['age', 'numero_jogadores', 'media_salario']\n",
+ "df_media_salario_por_idade.sort_values(by = ['media_salario'], ascending = False, inplace = True)\n",
+ "df_media_salario_por_idade.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L0yRSSIb0WYj"
+ },
+ "source": [
+ "## 15. How much does each club spend on wages?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "C9N7_pLfh_uq"
+ },
+ "source": [
+ "df_soma_salario_por_clube = df7.groupby(by = ['club']).agg({'wage': ['count', 'mean', 'sum']}).reset_index()\n",
+ "df_soma_salario_por_clube.columns = ['clube', 'numero_jogadores', 'media_salario', 'soma_salario']\n",
+ "df_soma_salario_por_clube.sort_values(by = ['soma_salario'], ascending = False, inplace = True)\n",
+ "df_soma_salario_por_clube.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1c7NGMg90YMi"
+ },
+ "source": [
+ "## 16. What insights (what can you discover) are there about the variable 'Potential' (which measures the players' potential)?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RAU41Iyaihvc"
+ },
+ "source": [
+ "df7.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bM_ePTWfiTFq"
+ },
+ "source": [
+ "df_potential_por_clube = df7.groupby(by = ['potential', 'club', 'nationality']).agg({'potential': ['count']}).reset_index()\n",
+ "df_potential_por_clube.columns = ['potential', 'club', 'nationality', 'numero_jogadores']\n",
+ "df_potential_por_clube.sort_values(by = ['potential'], ascending = False, inplace = True)\n",
+ "df_potential_por_clube.head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HytWPvfvjTON"
+ },
+ "source": [
+ "#### Who is the player with potential = 95?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Fk2X1q7LjWJE"
+ },
+ "source": [
+ "df7.loc[df7['potential'] == 95]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "W2o4oLzujnHj"
+ },
+ "source": [
+ "#### Who are the players with potential = 94?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GOCyMr-qjsL7"
+ },
+ "source": [
+ "df7.loc[df7['potential'] == 94]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LHDJimdw0ClU"
+ },
+ "source": [
+ "## 17. What insights are there about the variable overall (the athlete's average rating) by age, club and country?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FXFp5nxrj9Yc"
+ },
+ "source": [
+ "df_overall = df7.groupby(by = ['overall', 'club', 'nationality']).agg({'overall': ['count']}).reset_index()\n",
+ "df_overall.columns = ['overall', 'club', 'nationality', 'numero_jogadores']\n",
+ "df_overall.sort_values(by = ['overall'], ascending = False, inplace = True)\n",
+ "df_overall.head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4LTooiIdk1XV"
+ },
+ "source": [
+ "#### Who is the player with overall = 94?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QieAKyi7k5Bb"
+ },
+ "source": [
+ "df7.loc[df7['overall'] == 94]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JFH54d1D0b5B"
+ },
+ "source": [
+ "## 18. Which are the best clubs if we take the variables Potential and Overall into account?\n",
+ "* To answer this question, I took the mean of overall and potential."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0JZ7PTFTle_d"
+ },
+ "source": [
+ "df18 = df7.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "s25u8RoplMZ8"
+ },
+ "source": [
+ "df18['overall_potential'] = ((df18['potential']+df18['overall'])/2)\n",
+ "df18[['name', 'overall', 'potential', 'overall_potential']].head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8gJFzhhIlDCH"
+ },
+ "source": [
+ "df_overall_potential = df18.groupby(by = ['club', 'nationality', 'age']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n",
+ "df_overall_potential.columns = ['club', 'nationality', 'age', 'numero_jogadores', 'media_overall_potential']\n",
+ "df_overall_potential.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n",
+ "df_overall_potential.head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "adpxQpWlmvac"
+ },
+ "source": [
+ "In general:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fzn_81eomrj2"
+ },
+ "source": [
+ "df_overall_potential2 = df18.groupby(by = ['club']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n",
+ "df_overall_potential2.columns = ['club', 'numero_jogadores', 'media_overall_potential']\n",
+ "df_overall_potential2.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n",
+ "df_overall_potential2.head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dM8FehYC0df7"
+ },
+ "source": [
+ "## 19. Present the ranking of goalkeepers (use the variable/attribute 'Preferred Positions') by Potential and Overall. We are looking for 'GK'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_967BF6MnD4U"
+ },
+ "source": [
+ "df19 = df18.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wXPah5zOmkXc"
+ },
+ "source": [
+ "df_goleiros = df19[df19['position'] == 'GK']\n",
+ "df_goleiros.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "77ehyNmSnTIB"
+ },
+ "source": [
+ "df_overall_potential_goleiros = df_goleiros.groupby(by = ['club']).agg({'overall_potential': ['count', 'mean']}).reset_index()\n",
+ "df_overall_potential_goleiros.columns = ['club', 'numero_jogadores', 'media_overall_potential']\n",
+ "df_overall_potential_goleiros.sort_values(by = ['media_overall_potential'], ascending = False, inplace = True)\n",
+ "df_overall_potential_goleiros.head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-dEtuBtF0fiZ"
+ },
+ "source": [
+ "## 20. Who are the fastest players (variable/attribute 'Sprint speed')?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KWMU1hMMnxTI"
+ },
+ "source": [
+ "df20 = df19.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sezEQIjqnwCZ"
+ },
+ "source": [
+ "df20.sort_values(by = 'sprintspeed', ascending = False).head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aEg0eaFO0lF6"
+ },
+ "source": [
+ "## 21. Who are the 5 best players in terms of shooting (shot strength) (use the variable/attribute 'Shot power')?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xXuj-dc7oA-0"
+ },
+ "source": [
+ "df21 = df20.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8HGT_dM2oEES"
+ },
+ "source": [
+ "df21.sort_values(by = 'shotpower', ascending = False).head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bRk42JIf0moZ"
+ },
+ "source": [
+ "## 22. Who are the outliers in terms of wage?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qRNaog7y0qI4"
+ },
+ "source": [
+ "### Identifying and handling outliers\n",
+ "* What is the average Overall of Barcelona, Juventus and Real Madrid?\n",
+ "* And what is the average overall after handling the outliers?\n",
+ "* Which athletes are influencing the average?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_bIxG1Sw9OUB"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qEiikIxNoZkl"
+ },
+ "source": [
+ "df22 = df21.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lGYYvE0BoOoV"
+ },
+ "source": [
+ "q1_salario, q3_salario = df22['wage'].quantile([0.25,0.75]).to_list()\n",
+ "iqr_salario = q3_salario - q1_salario\n",
+ "print(q1_salario, q3_salario)\n",
+ "print(iqr_salario)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PB44VV9pogT1"
+ },
+ "source": [
+ "outlier_salario_inferior = q1_salario - 1.5 * iqr_salario\n",
+ "outlier_salario_superior = q3_salario + 1.5 * iqr_salario\n",
+ "\n",
+ "df_outliers_salario = df22[['name', 'club', 'nationality', 'wage', 'overall', 'potential']]\n",
+ "\n",
+ "# Lower wage outliers\n",
+ "df_outliers_salario[df_outliers_salario['wage'] < outlier_salario_inferior]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9867KNNBqG7Z"
+ },
+ "source": [
+ "# Top 10 upper wage outliers\n",
+ "df_outliers_salario[df_outliers_salario['wage'] > outlier_salario_superior].sort_values(by = ['wage'], ascending = False).head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gT2zGwq90oQ5"
+ },
+ "source": [
+ "## 23. Who are the outliers in terms of shot power?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "05uYj7cwqrdW"
+ },
+ "source": [
+ "df23 = df22.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GzbVRU9HqrdZ"
+ },
+ "source": [
+ "q1_chute, q3_chute = df23['shotpower'].quantile([0.25,0.75]).to_list()\n",
+ "iqr_chute = q3_chute - q1_chute\n",
+ "print(q1_chute, q3_chute)\n",
+ "print(iqr_chute)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "V5TQX_yGqrda"
+ },
+ "source": [
+ "outlier_chute_inferior = q1_chute - 1.5 * iqr_chute\n",
+ "outlier_chute_superior = q3_chute + 1.5 * iqr_chute\n",
+ "\n",
+ "df_outliers_chute = df23[['name', 'club', 'nationality', 'shotpower', 'overall', 'potential']]\n",
+ "\n",
+ "# Lower outliers - shotpower\n",
+ "df_outliers_chute[df_outliers_chute['shotpower'] < outlier_chute_inferior]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "URj1SYXxqrdc"
+ },
+ "source": [
+ "# Top 10 upper outliers - shotpower\n",
+ "df_outliers_chute[df_outliers_chute['shotpower'] > outlier_chute_superior].sort_values(by = ['shotpower'], ascending = False).head(10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eHm1qeHx0pza"
+ },
+ "source": [
+ "## 24. What is the correlation between the variable 'value' and the other numeric variables of the dataframe, and how do you interpret it?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OR6MD2GQ0rNq"
+ },
+ "source": [
+ "## 25. Build dummy variables for the columns preferred_foot and work_rate (for example, preferred_foot_left);"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Z2RTzTxbLzW_"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From 40557e076dcdc9e9d917858ef657899fc7745574 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Mon, 19 Oct 2020 17:35:29 -0300
Subject: [PATCH 14/21] Criado usando o Colaboratory
---
...xercicios_exerc\303\255cio Olympics.ipynb" | 371 +++++++++++++++++-
1 file changed, 367 insertions(+), 4 deletions(-)
diff --git "a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb" "b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
index 92769c16a..e434f3bb5 100644
--- "a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
+++ "b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
@@ -108,8 +108,8 @@
"id": "kQGVQB18-tM_"
},
"source": [
- "#!pip install category_encoders\n",
- "#!pip install update"
+ "!pip install category_encoders\n",
+ "!pip install update"
],
"execution_count": null,
"outputs": []
@@ -1291,7 +1291,7 @@
"id": "ZkmDSUlU7JTw"
},
"source": [
- "!ls \"/content/drive/My Drive/\""
+ "!ls \"/content/drive/My Drive/athlete_events.zip\""
],
"execution_count": null,
"outputs": []
@@ -1327,7 +1327,7 @@
"id": "CQA_Wdxe-aOF"
},
"source": [
- "df_olympics[['sex','season','team','city','sport','medal','height','weight']].head(5)"
+ ""
],
"execution_count": null,
"outputs": []
@@ -1357,6 +1357,347 @@
"execution_count": null,
"outputs": []
},
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fg929P3DblPr"
+ },
+ "source": [
+ "df_olympics[['sex','season','team','city','sport','medal','height','weight']].head(5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hJ7lAgAPcZ-J"
+ },
+ "source": [
+ "df_atleta2 = df_olympics[['height','weight']].copy() # copy() avoids SettingWithCopyWarning in the in-place dropna below\n",
+ "df_atleta2.isnull().sum()\n",
+ "df_atleta2.dropna(inplace=True)\n",
+ "# Histograms\n",
+ "df_atleta2.hist(bins=100)\n",
+ "\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zlka9jjlfmgJ"
+ },
+ "source": [
+ "plt.figure(figsize = (12, 8))\n",
+ "plt.hist(df_atleta2['height'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n",
+ "\n",
+ "# Add title and labels\n",
+ "plt.title('height - Normal Distribution')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AW4xWnuVf235"
+ },
+ "source": [
+ "plt.figure(figsize = (12, 8))\n",
+ "plt.hist(df_atleta2['weight'], color = 'blue', edgecolor = 'black', bins = int(180/5))\n",
+ "\n",
+ "# Add title and labels\n",
+ "plt.title('weight - Normal Distribution')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QRXHQJ9DfEzr"
+ },
+ "source": [
+ "df_atleta2.plot(kind = 'kde') # KDE (Kernel Density Estimate) helps visualize the distribution of the data, analogous to a histogram."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jbsJR08RgFOx"
+ },
+ "source": [
+ "df_atleta2.plot()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wThfuOfwgNDY"
+ },
+ "source": [
+ "# Standardize with StandardScaler\n",
+ "from sklearn.preprocessing import StandardScaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vHTiFkm_gTjw"
+ },
+ "source": [
+ "np.set_printoptions(precision = 3)\n",
+ "# Use alternative 1 or alternative 2\n",
+ "# Alternative 1\n",
+ "atleta2_scale = StandardScaler().fit_transform(df_atleta2) # fit() and transform() combined in one call\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PmpW2Ny9hoH4"
+ },
+ "source": [
+ "# Alternative 2\n",
+ "atleta2_scale_fit = StandardScaler().fit(df_atleta2) # Apply fit() separately\n",
+ "atleta2_scale_transform = atleta2_scale_fit.transform(df_atleta2) # Apply transform() separately\n",
+ "atleta2_scale_fit_transform = StandardScaler().fit(df_atleta2).transform(df_atleta2) # Apply fit().transform() chained\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "S1OltLNZhvHw"
+ },
+ "source": [
+ "#atleta2_scale\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "i7A4nquHhy8J"
+ },
+ "source": [
+ "#atleta2_scale_fit\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "y-hJ8ofQh3lB"
+ },
+ "source": [
+ "#atleta2_scale_transform"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xj_lxSTOh7gI"
+ },
+ "source": [
+ "#atleta2_scale_fit_transform"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0mLy9gKliCRA"
+ },
+ "source": [
+ "# Alternative 1\n",
+ "df_atleta2_scale = pd.DataFrame(atleta2_scale, columns = ['height', 'weight'])\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1TWjsfDrihii"
+ },
+ "source": [
+ "# Alternative 2\n",
+ "df_atleta2_scale = pd.DataFrame(atleta2_scale, columns=list(df_atleta2.keys()))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hHzx5QQvitVv"
+ },
+ "source": [
+ "df_atleta2_scale.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VZsaqzfjdvLS"
+ },
+ "source": [
+ "# StandardScaler solution in one condensed cell\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "atleta2_standard = StandardScaler().fit_transform(df_atleta2)\n",
+ "df_atleta2_standard = pd.DataFrame(atleta2_standard, columns=list(df_atleta2.keys()))\n",
+ "df_atleta2_standard.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WJLETmD9jN1w"
+ },
+ "source": [
+ "# Scale with MinMaxScaler\n",
+ "from sklearn.preprocessing import MinMaxScaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I73dw7TgjZhg"
+ },
+ "source": [
+ "# Use alternative 1 or alternative 2\n",
+ "# Alternative 1\n",
+ "atleta2_MinMax = MinMaxScaler().fit_transform(df_atleta2) # fit() and transform() combined in one call"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4cEc-1CRjj2H"
+ },
+ "source": [
+ "# Alternative 2\n",
+ "atleta2_MinMax_fit = MinMaxScaler().fit(df_atleta2) # Apply fit() separately\n",
+ "atleta2_MinMax_transform = atleta2_MinMax_fit.transform(df_atleta2) # Apply transform() separately, using the fitted MinMaxScaler\n",
+ "atleta2_MinMax_fit_transform = MinMaxScaler().fit(df_atleta2).transform(df_atleta2) # Apply fit().transform() chained"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eW5t52NCj8qv"
+ },
+ "source": [
+ "# Alternative 1\n",
+ "df_atleta2_MinMax = pd.DataFrame(atleta2_MinMax, columns = ['height', 'weight'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Fe6KupnMkFGg"
+ },
+ "source": [
+ "# Alternative 2\n",
+ "df_atleta2_MinMax = pd.DataFrame(atleta2_MinMax, columns=list(df_atleta2.keys()))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "y9XHXj0Ujyan"
+ },
+ "source": [
+ "df_atleta2_MinMax.plot(kind = 'kde')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "omhfhcOKdxVh"
+ },
+ "source": [
+ "# MinMaxScaler solution in one condensed cell\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "atleta2_MinMax = MinMaxScaler().fit_transform(df_atleta2)\n",
+ "df_atleta2_MinMax = pd.DataFrame(atleta2_MinMax, columns=list(df_atleta2.keys()))\n",
+ "df_atleta2_MinMax.plot(kind = 'kde')\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rr7mMCqdjLVg"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6JwKjpiRd2Oh"
+ },
+ "source": [
+ "# RobustScaler solution in one condensed cell\n",
+ "from sklearn.preprocessing import RobustScaler\n",
+ "atleta2_Robust = RobustScaler().fit_transform(df_atleta2)\n",
+ "df_atleta2_Robust = pd.DataFrame(atleta2_Robust, columns=list(df_atleta2.keys()))\n",
+ "df_atleta2_Robust.plot(kind = 'kde')\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m2lU3aJIeDvR"
+ },
+ "source": [
+ "atleta2_robust = RobustScaler().fit_transform(df_atleta2)\n",
+ "df_atleta2_robust = pd.DataFrame(atleta2_robust, columns = ['height', 'weight'])\n",
+ "df_atleta2_robust['tipo'] = 'robust' # assumption: 'tipo' is a string label naming the scaler"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
{
"cell_type": "markdown",
"metadata": {
@@ -1371,6 +1712,17 @@
" * Usando a técnica de binning apropriada, pode estabelecer uma relação monotônica (aumentar ou diminuir) entre a variável dependente e independente."
]
},
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hEnjv-AleB6R"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
{
"cell_type": "code",
"metadata": {
@@ -1473,6 +1825,17 @@
],
"execution_count": null,
"outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2PWN5TF2jv_I"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
}
]
}
\ No newline at end of file
From ab0233acdc7af24f693ba89e409e3c6673916407 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Mon, 19 Oct 2020 17:53:57 -0300
Subject: [PATCH 15/21] Criado usando o Colaboratory
---
...xercicios_exerc\303\255cio Olympics.ipynb" | 59 +++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git "a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb" "b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
index e434f3bb5..e0b482ff6 100644
--- "a/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
+++ "b/Notebooks/NB10_04__3DP_3_Data_Transformation_exercicios_exerc\303\255cio Olympics.ipynb"
@@ -1368,6 +1368,65 @@
"execution_count": null,
"outputs": []
},
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZZw7Nm1EmEEi"
+ },
+ "source": [
+ "# Handle the variables 'sex', 'season', 'team', 'city', 'sport' and 'medal' appropriately\n",
+ "# Encoding categorical variables\n",
+ "df_atleta1 = df_olympics[['sex','season','team','city','sport','medal']].copy() # copy() so new columns can be added without SettingWithCopyWarning\n",
+ "\n",
+ "l_atributos = ['sex','season','team','city','sport','medal']\n",
+ "for s_atributo in l_atributos:\n",
+ " s_result = df_atleta1[s_atributo].unique()\n",
+ " print(f'{s_atributo} {s_result}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JJFDaCguoj-X"
+ },
+ "source": [
+ "from sklearn.preprocessing import LabelEncoder, OneHotEncoder"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-nvvR_5doIsR"
+ },
+ "source": [
+ "#l_atributos = ['sex','season','medal']\n",
+ "le = LabelEncoder()\n",
+ "df_atleta1['sex-cat'] = le.fit_transform(df_atleta1['sex'])\n",
+ "df_atleta1['season-cat'] = le.fit_transform(df_atleta1['season'])\n",
+ "#df_atleta1['medal-cat'] = le.fit_transform(df_atleta1['medal'])\n",
+ "df_atleta1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "w204Is8ipdDq"
+ },
+ "source": [
+ "#l_atributos = ['sex','season','medal']\n",
+ "le = LabelEncoder()\n",
+ "df_atleta1['medal-cat'] = le.fit_transform(df_atleta1['medal'].fillna('No medal')) # assumption: NaN means no medal; fillna avoids mixing str and float labels\n",
+ "df_atleta1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
{
"cell_type": "code",
"metadata": {
From ae2dd56db24f64c2942ff75344feeeae02740405 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Tue, 20 Oct 2020 14:14:51 -0300
Subject: [PATCH 16/21] Criado usando o Colaboratory
---
.../NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
index 7ea9a913f..468408c32 100644
--- a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
+++ b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
@@ -1556,7 +1556,8 @@
"id": "vgoLTamaOC50"
},
"source": [
- "#### Configurar ambiente"
+ " \n",
+ " #### Configurar ambiente"
]
},
{
@@ -1595,7 +1596,7 @@
"id": "GMivDUHEMFKp"
},
"source": [
- "df = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/FIFA.csv?token=AGDJQ63GC7SPIHTGNW73QB27RXRN6') #, index_col= 'PassengerId')\n",
+ "df = pd.read_csv('https://raw.githubusercontent.com/Celso-Omoto/DSWP/master/Dataframes/FIFA.csv') #, index_col= 'PassengerId')\n",
"df.head()"
],
"execution_count": null,
From c0db1f15988ada695f4ab0c9a823e017997b8389 Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Tue, 20 Oct 2020 14:39:29 -0300
Subject: [PATCH 17/21] Criado usando o Colaboratory
---
...ndas__Resposta_Exercicios_Aluno_Fifa.ipynb | 21 +++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
index 468408c32..cb7b4c537 100644
--- a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
+++ b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
@@ -1648,7 +1648,7 @@
"id": "_hUvJbCqCBBl"
},
"source": [
- "df[['RS', 'LS', 'ST']].head()"
+ "df[['LS','ST','RS','LW','LF','CF','RF','RW','LAM','CAM','RAM','LM','LCM','CM','RCM','RM','LWB','LDM','CDM','RDM','RWB','LB','LCB','CB','RCB','RB']].head()"
],
"execution_count": null,
"outputs": []
@@ -1819,7 +1819,8 @@
"id": "VJSsvOpK71n7"
},
"source": [
- "df4 = df2.copy()"
+ "df4 = df2.copy()\n",
+ "df4.head()"
],
"execution_count": null,
"outputs": []
@@ -1858,9 +1859,10 @@
"id": "ArgK2NVe6vqz"
},
"source": [
- "l_colunas_monetarias = ['Value', 'Wage']\n",
+ "l_colunas_monetarias = ['Value','Wage']\n",
"\n",
"for coluna in l_colunas_monetarias:\n",
+ " print(coluna)\n",
" df4[coluna] = df4[coluna].str.replace('€', '')\n",
" df4[coluna] = df4[coluna].apply(lambda x: transforma_monetarias(x))\n",
"\n",
@@ -2123,7 +2125,18 @@
"\n",
"# Numa única linha ficaria assim:\n",
"df_jogadores_por_paises2 = pd.DataFrame(df7.groupby(by=['nationality']).size(), columns= ['numero_jogadores']).sort_values(by = ['numero_jogadores'], ascending = False).reset_index()\n",
- "df_jogadores_por_paises2"
+ "df_jogadores_por_paises2\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Y_h9sFgiFaCJ"
+ },
+ "source": [
+ ""
],
"execution_count": null,
"outputs": []
From 8fb038cec89d93c2671bd7800507d2a9b1aca2dd Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Fri, 23 Oct 2020 17:22:35 -0300
Subject: [PATCH 18/21] Criado usando o Colaboratory
---
...0__Machine_Learning___DSWP_exercicio.ipynb | 8349 +++++++++++++++++
1 file changed, 8349 insertions(+)
create mode 100644 Notebooks/NB15_00__Machine_Learning___DSWP_exercicio.ipynb
diff --git a/Notebooks/NB15_00__Machine_Learning___DSWP_exercicio.ipynb b/Notebooks/NB15_00__Machine_Learning___DSWP_exercicio.ipynb
new file mode 100644
index 000000000..5264c6e3b
--- /dev/null
+++ b/Notebooks/NB15_00__Machine_Learning___DSWP_exercicio.ipynb
@@ -0,0 +1,8349 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "colab": {
+ "name": "NB15_00__Machine_Learning.ipynb",
+ "provenance": [],
+ "include_colab_link": true
+ },
+ "accelerator": "TPU"
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ShVXyGj9wkgN"
+ },
+ "source": [
+ "# **MACHINE LEARNING WITH PYTHON**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aYQ4cDfcPu4e"
+ },
+ "source": [
+ "___\n",
+ "# **NOTES AND REMARKS**\n",
+ "* Cover the impact of class imbalance in the sample;\n",
+ "* Add AUROC to the material and show the cut-off for classifying between 0 and 1;\n",
+ "* Statistical concepts of bias & variance;\n",
+ "* See Sklearn.optimize: https://web.telegram.org/#/im?p=g497957288"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5YvhLC_uf4_G"
+ },
+ "source": [
+ "___\n",
+ "# **AGENDA**\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QgX6n2VDyY1O"
+ },
+ "source": [
+ "___\n",
+ "# **REFERENCES**\n",
+ "* [scikit-learn - Machine Learning With Python](https://scikit-learn.org/stable/);\n",
+ "* [An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples](https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer)\n",
+ "* [The Difference Between Artificial Intelligence, Machine Learning, and Deep Learning](https://medium.com/iotforall/the-difference-between-artificial-intelligence-machine-learning-and-deep-learning-3aa67bff5991)\n",
+ "* [A Gentle Guide to Machine Learning](https://blog.monkeylearn.com/a-gentle-guide-to-machine-learning/)\n",
+ "* [A Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n",
+ "* [Introduction to Machine Learning](http://alex.smola.org/drafts/thebook.pdf)\n",
+ "* [The 10 Statistical Techniques Data Scientists Need to Master](https://medium.com/cracking-the-data-science-interview/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7)\n",
+ "* [Tune: a library for fast hyperparameter tuning at any scale](https://towardsdatascience.com/fast-hyperparameter-tuning-at-scale-d428223b081c)\n",
+ "* [How to lie with Data Science](https://towardsdatascience.com/how-to-lie-with-data-science-5090f3891d9c)\n",
+ "* [5 Reasons “Logistic Regression” should be the first thing you learn when becoming a Data Scientist](https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4)\n",
+ "* [Machine learning on categorical variables](https://towardsdatascience.com/machine-learning-on-categorical-variables-3b76ffe4a7cb)\n",
+ "\n",
+ "## Deep Learning & Neural Networks\n",
+ "\n",
+ "- [An Introduction to Neural Networks](http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html)\n",
+ "- [An Introduction to Image Recognition with Deep Learning](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721)\n",
+ "- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TsCbZd2epfxo"
+ },
+ "source": [
+ "___\n",
+ "# **INTRODUCTION**\n",
+ "\n",
+ "* \"__Information is the oil of the 21st century, and analytics is the combustion engine__.\" - Peter Sondergaard, SVP, Gartner Research;\n",
+ "\n",
+ "\n",
+ ">This chapter focuses on:\n",
+ "* Linear, Logistic Regression, Decision Tree, Random Forest, Support Vector Machine and XGBoost algorithms for building Machine Learning models;\n",
+ "* Understanding how to solve classification and regression problems;\n",
+ "* Applying ensemble techniques such as Bagging and Boosting;\n",
+ "* Measuring the accuracy of Machine Learning models;\n",
+ "* Learning the main Machine Learning algorithms for both supervised and unsupervised learning.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HqqB2vaHXMGt"
+ },
+ "source": [
+ "___\n",
+ "# **ARTIFICIAL INTELLIGENCE VS MACHINE LEARNING VS DEEP LEARNING**\n",
+ "* **Machine Learning** - gives computers the ability to learn without being explicitly programmed. Computers improve at a task by practicing it, usually on large datasets.\n",
+ "* **Deep Learning** - is a Machine Learning method that relies on artificial neural networks, allowing computer systems to learn from examples, much as humans do."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "P961GcguXFFA"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://github.com/MathMachado/P4ML/blob/DS_Python/Material/Evolution%20of%20AI.PNG)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lkqGtO88ZkPr"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xesQpzfmaqj6"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "Source: [Artificial Intelligence vs. Machine Learning vs. Deep Learning](https://towardsdatascience.com/artificial-intelligence-vs-machine-learning-vs-deep-learning-2210ba8cc4ac)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KeIVR59IIS7f"
+ },
+ "source": [
+ "___\n",
+ "# **MACHINE LEARNING - TECHNIQUES**\n",
+ "\n",
+ "* Supervised Learning\n",
+ "* Unsupervised Learning\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rvwp5UHdBiup"
+ },
+ "source": [
+ "___\n",
+ "# **OUR FOCUS HERE WILL BE...**\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/?source=post_page-----885aa35db58b----------------------)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cBLSvJTXHBjK"
+ },
+ "source": [
+ "___\n",
+ "# **CHEAT SHEET**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZdjR3nahUuKq"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MkBSvyorGXQz"
+ },
+ "source": [
+ "___\n",
+ "# **CROSS-VALIDATION**\n",
+ "* K-fold is the best-known and most widely used Cross-Validation (CV) method;\n",
+ "* How it works: it splits the training dataframe into k parts;\n",
+ " * It uses k-1 parts to train the model and the remaining part to validate it;\n",
+ " * It repeats this process k times, computing the desired metrics (for example, accuracy) in each iteration;\n",
+ " * After the k iterations we have k values of each metric, from which we compute the mean and standard deviation.\n",
+ "\n",
+ " The figure below helps illustrate how CV works:\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n",
+ "\n",
+ "* **Value of k**:\n",
+ " * typical values of k (folds): between 5 and 10; there is no general rule for choosing k;\n",
+ " * The larger the value of k, the lower the bias of the CV estimate; a statistical experiment can show this effect.\n",
+ "\n",
+ "[Applied Predictive Modeling, 2013](https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/ref=as_li_ss_tl?ie=UTF8&qid=1520380699&sr=8-1&keywords=applied+predictive+modeling&linkCode=sl1&tag=inspiredalgor-20&linkId=1af1f3de89c11e4a7fd49de2b05e5ebf)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HscfN-a1V043"
+ },
+ "source": [
+ "* **Advantages of using CV**:\n",
+ " * More reliable estimates of model accuracy, which helps in selecting better models;\n",
+ " * Better use of the data, since every observation is used for both training and validation. Any problem with the data will therefore surface at this stage.\n",
+ "\n",
+ "* **Further Reading**\n",
+ " * [Cross-Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f)\n",
+ " * [5 Reasons why you should use Cross-Validation in your Data Science Projects](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79)\n",
+ " * [Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XRukccWQSklx"
+ },
+ "source": [
+ "## Measures for assessing the variability present in the data\n",
+ "* The main measures of variability are range, variance, standard deviation and coefficient of variation;\n",
+ "* These measures let us conclude whether the data are homogeneous (lower dispersion/variability) or heterogeneous (higher dispersion/variability).\n",
+ "\n",
+ "* **In the next version, bring these concepts into the Notebook and use Python to compute these measures**."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yBR8tWV_lhQq"
+ },
+ "source": [
+ "___\n",
+ "# **ENSEMBLE METHODS** (= Combining predictive models)\n",
+ "* Methods\n",
+ " * **Bagging** (Bootstrap AGGregatING)\n",
+ " * **Boosting**\n",
+ " * Stacking --> Not widely used\n",
+ "* Helps reduce overfitting (overfitting occurs when the model/function fits the training data very well but fails to generalize to other samples or to the population).\n",
+ "* Builds meta-classifiers: combines the results of several algorithms to produce predictions that are more accurate and robust than those of any individual classifier.\n",
+ "* Ensembles reduce the effects of the main sources of error in Machine Learning models:\n",
+ " * noise;\n",
+ " * bias;\n",
+ " * variance --> the main measure of the variability present in the data.\n",
+ "\n",
+ "# References\n",
+ "* [Simple guide for ensemble learning methods](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2) - Explains in an accessible way how ensembles work."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "25RW8u-Sj780"
+ },
+ "source": [
+ "### Further Reading\n",
+ "* [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)\n",
+ "* [Ensemble Methods in Machine Learning: What are They and Why Use Them?](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)\n",
+ "* [Ensemble Learning Using Scikit-learn](https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a)\n",
+ "* [Let’s Talk About Machine Learning Ensemble Learning In Python](https://medium.com/fintechexplained/lets-talk-about-machine-learning-ensemble-learning-in-python-382747e5fba8)\n",
+ "* [Boosting, Bagging, and Stacking — Ensemble Methods with sklearn and mlens](https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FugME1HSl4jJ"
+ },
+ "source": [
+ "___\n",
+ "# **PARAMETER TUNING** (= Optimal parameters of Machine Learning models)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "u_147cIRl9F1"
+ },
+ "source": [
+ "## GridSearch (the tool we will use to optimize the parameters of the ML models)\n",
+ "* Finds the optimal parameters (hyperparameter tuning) that improve model accuracy.\n",
+ "* Requires the following inputs:\n",
+ " * The matrix $X_{p}$ with the $p$ COLUMNS (variables or features) of the dataframe;\n",
+ " * The vector $y_{p}$ with the target COLUMN (response variable);\n",
+ " * The estimator to tune, for example DecisionTreeClassifier, RandomForestClassifier or XGBClassifier;\n",
+ " * A dictionary with the parameters to be optimized;\n",
+ " * The number of folds for the Cross-Validation method."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "39Sg77fbTWCO"
+ },
+ "source": [
+ "___\n",
+ "# **MODEL SELECTION & EVALUATION**\n",
+ "> In this phase we identify and apply the most suitable metrics (Accuracy, Sensitivity, Specificity, F-Score, AUC, R-Sq, Adj R-SQ, RMSE (Root Mean Square Error)) to evaluate the performance of the ML models.\n",
+ ">> We train the ML models on the training sample and evaluate their performance on the test/validation sample.\n",
+ "\n",
+ "* Further Reading\n",
+ " * [The 5 Classification Evaluation metrics every Data Scientist must know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)\n",
+ " * [Confusion matrix and other metrics in machine learning](https://medium.com/hugo-ferreiras-blog/confusion-matrix-and-other-metrics-in-machine-learning-894688cb1c0a)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oQQVzZ2ZTYrB"
+ },
+ "source": [
+ "## Confusion Matrix\n",
+ "* Terms associated with the Confusion Matrix:\n",
+ " * **True Positive** (TP): the observed value is True and the model predicts True. That is, the model got it right.\n",
+ " * Example: **Observed**: Fraud (Positive); **Model**: Fraud (Positive) --> The model got it right!\n",
+ " * **True Negative** (TN): the observed value is False and the model predicts False. That is, the model got it right.\n",
+ " * Example: **Observed**: NOT Fraud (Negative); **Model**: NOT Fraud (Negative) --> The model got it right!\n",
+ " * **False Positive** (FP): the observed value is False and the model predicts True. That is, the model got it wrong.\n",
+ " * Example: **Observed**: NOT Fraud (Negative); **Model**: Fraud (Positive) --> The model got it wrong!\n",
+ " * **False Negative** (FN): the observed value is True and the model predicts False. That is, the model got it wrong.\n",
+ " * Example: **Observed**: Fraud (Positive); **Model**: NOT Fraud (Negative) --> The model got it wrong!\n",
+ "\n",
+ "* See [Confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Confusion Matrix](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix)\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ci-6eiqBTgbL"
+ },
+ "source": [
+ "## Accuracy\n",
+ "> Accuracy - is the proportion of predictions the model gets right.\n",
+ "\n",
+ "It answers the following question:\n",
+ "\n",
+ "```\n",
+ "How often does the classifier (predictive model) classify correctly?\n",
+ "```\n",
+ "\n",
+ "$$Accuracy= \\frac{TP+TN}{TP+TN+FP+FN}$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "F7YI8X5TRx-R"
+ },
+ "source": [
+ "## Precision (or Positive Predictive Value)\n",
+ "> **Precision** - tells us how the classifier performs with respect to False Positives (how many of the flagged cases are real).\n",
+ "\n",
+ "It answers the following question:\n",
+ "\n",
+ "```\n",
+ "When the classifier predicts Positive, how often is it correct?\n",
+ "```\n",
+ "\n",
+ "\n",
+ "$$Precision= \\frac{TP}{TP+FP}$$\n",
+ "\n",
+ "**Example**: Precision is the proportion of customers the model predicted as Fraud that really are Fraud.\n",
+ "\n",
+ "**Comment**: If our focus is instead on minimizing False Negatives (FN), we need to aim for a Recall as close to 100% as possible."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zO39n8x_Sz3L"
+ },
+ "source": [
+ "## Recall (or Sensitivity)\n",
+ "> **Recall** - tells us how the classifier performs with respect to False Negatives (how many real cases we missed).\n",
+ "\n",
+ "It answers the following question:\n",
+ "\n",
+ "```\n",
+ "When the observed value is Positive, how often is the classifier correct?\n",
+ "```\n",
+ "\n",
+ "$$Recall = Sensitivity = \\frac{TP}{TP+FN}$$\n",
+ "\n",
+ "**Example**: Recall is the proportion of customers observed as Fraud that the model predicts as Fraud.\n",
+ "\n",
+ "**Comment**: If our focus is instead on minimizing False Positives (FP), we need to aim for a Precision as close to 100% as possible."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "htS6rdHVVXRG"
+ },
+ "source": [
+ "## Specificity\n",
+ "> **Specificity** - the ratio of TN to TN+FP.\n",
+ "\n",
+ "It answers the following question:\n",
+ "\n",
+ "```\n",
+ "When the observed value is Negative, how often is the classifier correct?\n",
+ "```\n",
+ "\n",
+ "**Example**: Specificity is the proportion of NOT Fraud customers that the model predicts as NOT Fraud.\n",
+ "\n",
+ "$$Specificity= \\frac{TN}{TN+FP}$$\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mNn0twadTacc"
+ },
+ "source": [
+ "## F1-Score\n",
+ "> F1-Score is the harmonic mean of Recall and Precision and lies between 0 and 1. The closer to 1, the better; the closer to 0, the worse. In other words, it balances Recall and Precision.\n",
+ "\n",
+ "$$F1\\_Score= 2\\left(\\frac{Recall*Precision}{Recall+Precision}\\right)$$"
+ ]
+ },
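+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The four formulas above can be checked by hand with a minimal sketch. The counts `tp`, `fp`, `fn` and `tn` below are hypothetical values chosen only for illustration; they are not results from this notebook.\n",
+ "\n",
+ "```python\n",
+ "# Hypothetical confusion-matrix counts (illustrative only)\n",
+ "tp, fp, fn, tn = 40, 10, 20, 130\n",
+ "\n",
+ "precision = tp / (tp + fp)      # of the predicted positives, how many are real\n",
+ "recall = tp / (tp + fn)         # of the observed positives, how many were found\n",
+ "specificity = tn / (tn + fp)    # of the observed negatives, how many were found\n",
+ "f1_score = 2 * precision * recall / (precision + recall)  # harmonic mean\n",
+ "\n",
+ "print(f'Precision   = {precision:.3f}')\n",
+ "print(f'Recall      = {recall:.3f}')\n",
+ "print(f'Specificity = {specificity:.3f}')\n",
+ "print(f'F1-Score    = {f1_score:.3f}')\n",
+ "```\n",
+ "\n",
+ "Because F1-Score is a harmonic mean, it always lies between Precision and Recall and is pulled towards the smaller of the two."
+ ]
+ },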
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rsH9dMxazWCg"
+ },
+ "source": [
+ "# **EXAMPLE DATAFRAME USED IN THIS TUTORIAL**\n",
+ "> Generate a dataframe with 18 columns: 9 informative, 6 redundant and 3 repeated.\n",
+ "\n",
+ "To learn more about generating example (toy) dataframes, see [Synthetic data generation — a must-have skill for new data scientists](https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GEyDo_EIV_jV"
+ },
+ "source": [
+ "## Define global variables"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TdwgpZ76WFaT"
+ },
+ "source": [
+ "i_CV = 10 # Number of Cross-Validation folds\n",
+ "i_Seed = 20111974 # seed, for reproducibility\n",
+ "f_Test_Size = 0.3 # Proportion of the dataframe held out for testing (other common values are 0.15, 0.20 or 0.25)"
+ ],
+ "execution_count": 3,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gJTJfpwWzykS"
+ },
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "\n",
+ "X, y = make_classification(n_samples = 1000, \n",
+ " n_features = 18, \n",
+ " n_informative = 9, \n",
+ " n_redundant = 6, \n",
+ " n_repeated = 3, \n",
+ " n_classes = 2, \n",
+ " n_clusters_per_class = 1, \n",
+ " random_state=i_Seed)"
+ ],
+ "execution_count": 4,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gWy2IZh3s-o3",
+ "outputId": "d64728be-3319-42df-aa1a-4feab2728aa4",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 238
+ }
+ },
+ "source": [
+ "X"
+ ],
+ "execution_count": 5,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 0.06844089, 4.21184154, -2.5583024 , ..., -0.63061895,\n",
+ " -0.97831983, -0.88826977],\n",
+ " [-4.8240213 , 0.17950903, -2.98447332, ..., 0.33992045,\n",
+ " 1.89153784, -6.10967565],\n",
+ " [ 1.38953042, -0.226476 , 1.8774004 , ..., -1.47784549,\n",
+ " 0.96052606, 2.06020368],\n",
+ " ...,\n",
+ " [ 1.62548685, 0.43377848, 4.93537285, ..., -4.61990917,\n",
+ " 0.18310709, 6.16040231],\n",
+ " [-2.40619087, -1.65474635, 2.64196493, ..., -1.21427845,\n",
+ " 0.83745861, 0.8254619 ],\n",
+ " [-4.00041881, 2.52475556, -4.15290177, ..., -0.51680266,\n",
+ " 1.72224835, -5.59558306]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 5
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ccjhGnzxtAaV",
+ "outputId": "9e86bfa9-321d-4d94-8e07-1be6453beb79",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "y[0:30] # Binary labels, as in a fraud problem: {0, 1}"
+ ],
+ "execution_count": 6,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n",
+ " 1, 1, 0, 1, 0, 1, 0, 1])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 6
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OHO2befKJxR3"
+ },
+ "source": [
+ "___\n",
+ "# **DECISION TREE**\n",
+ "> Decision Trees have a tree-shaped structure.\n",
+ "\n",
+ "* **Main advantages**:\n",
+ "  * They are easy to understand, visualize and interpret;\n",
+ "  * They easily capture non-linear patterns in the data;\n",
+ "  * They require little computational power to train;\n",
+ "  * They handle numerical or categorical COLUMNS well;\n",
+ "  * They do not require the data to be normalized;\n",
+ "  * They can be used for Feature Engineering when dealing with Missing Values;\n",
+ "  * They can be used for Feature Selection;\n",
+ "  * They require no assumptions about the distribution of the data, because the algorithm is non-parametric.\n",
+ "\n",
+ "* **Main disadvantages**:\n",
+ "  * Prone to Overfitting, because Decision Trees can grow complex trees that do not generalize well. Things get much worse when the training sample contains outliers. Therefore, **treating outliers beforehand is strongly recommended**.\n",
+ "  * Can create biased trees when the dataframe is imbalanced or one class is dominant. For this reason, **balancing the dataframe beforehand is recommended to avoid this problem**.\n",
+ "\n",
+ "* **Main split criteria (the `criterion` parameter)**:\n",
+ "  * **Gini Index** - a metric that measures how often a randomly selected point/observation would be incorrectly classified.\n",
+ "    * Therefore, the lower the Gini Index, the better the COLUMN;\n",
+ "  * **Entropy** - a metric that measures the randomness of the information present in the data.\n",
+ "    * Therefore, the higher the entropy of the COLUMN, the less it helps us reach a conclusion (to classify, for example).\n",
+ "\n",
+ "## **References**:\n",
+ "* [1.10. Decision Trees](https://scikit-learn.org/stable/modules/tree.html).\n",
+ "* [Decision Tree Algorithm With Hands On Example](https://medium.com/datadriveninvestor/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38) - a great tutorial to learn, understand, interpret and compute the Gini and entropy indices.\n",
+ "* [Intuitive Guide to Understanding Decision Trees](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-decision-trees-adb2165ccab7) - a great tutorial to learn, understand, interpret and compute the Gini and entropy indices.\n",
+ "* [The Complete Guide to Decision Trees](https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14)\n",
+ "* [Creating and Visualizing Decision Tree Algorithm in Machine Learning Using Sklearn](https://intellipaat.com/blog/decision-tree-algorithm-in-machine-learning/) - Very didactic!\n",
+ "* [Decision Trees in Machine Learning](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)\n",
+ "\n",
+ "\n"
+ ]
+ },
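+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the Gini Index and Entropy concrete, here is a minimal sketch that computes both for a node, given the class proportions in that node. It assumes the standard definitions, Gini = 1 - sum(p^2) and Entropy = -sum(p * log2(p)); the function names `gini` and `entropy` are illustrative and are not part of scikit-learn.\n",
+ "\n",
+ "```python\n",
+ "from math import log2\n",
+ "\n",
+ "def gini(proportions):\n",
+ "    # Probability of mislabeling a random observation when labels are drawn from the node's class distribution\n",
+ "    return 1.0 - sum(p * p for p in proportions)\n",
+ "\n",
+ "def entropy(proportions):\n",
+ "    # Randomness of the class distribution, in bits; classes with p = 0 contribute nothing\n",
+ "    return sum(-p * log2(p) for p in proportions if p > 0)\n",
+ "\n",
+ "# A pure node, a mostly pure node and a perfectly mixed node (two classes)\n",
+ "for node in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):\n",
+ "    print(node, round(gini(node), 3), round(entropy(node), 3))\n",
+ "```\n",
+ "\n",
+ "Both measures are zero for a pure node and reach their maximum for a perfectly mixed node, which is why a split that lowers them is preferred."
+ ]
+ },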
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FrMkPN5aLp0Y"
+ },
+ "source": [
+ "## Load the libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FVU1CM0PKgO4"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "import warnings\n",
+ "warnings.filterwarnings(\"ignore\")"
+ ],
+ "execution_count": 7,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "15clh4XrISpz"
+ },
+ "source": [
+ "## Load/read the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "UMPL46w2IWJw"
+ },
+ "source": [
+ "l_colunas = ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18']\n",
+ "\n",
+ "df_X = pd.DataFrame(X, columns = l_colunas)\n",
+ "df_y = pd.DataFrame(y, columns = ['target'])"
+ ],
+ "execution_count": 8,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "MFaQF2MGFl_M",
+ "outputId": "b0ac8b65-3fe7-47b0-e120-bcf2b21e441b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 224
+ }
+ },
+ "source": [
+ "df_X.head()"
+ ],
+ "execution_count": 9,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>v1</th><th>v2</th><th>v3</th><th>v4</th><th>v5</th><th>v6</th><th>v7</th><th>v8</th><th>v9</th><th>v10</th><th>v11</th><th>v12</th><th>v13</th><th>v14</th><th>v15</th><th>v16</th><th>v17</th><th>v18</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>0.068441</td><td>4.211842</td><td>-2.558302</td><td>3.665482</td><td>-3.835158</td><td>3.499851</td><td>2.490856</td><td>3.665482</td><td>0.245117</td><td>0.867172</td><td>2.865546</td><td>0.493956</td><td>-5.148596</td><td>2.865546</td><td>3.499851</td><td>-0.630619</td><td>-0.978320</td><td>-0.888270</td></tr>\n",
+ "    <tr><th>1</th><td>-4.824021</td><td>0.179509</td><td>-2.984473</td><td>1.033618</td><td>-3.893426</td><td>3.428734</td><td>-3.334605</td><td>1.033618</td><td>-0.882780</td><td>-0.753281</td><td>1.441522</td><td>-1.395514</td><td>-4.002880</td><td>1.441522</td><td>3.428734</td><td>0.339920</td><td>1.891538</td><td>-6.109676</td></tr>\n",
+ "    <tr><th>2</th><td>1.389530</td><td>-0.226476</td><td>1.877400</td><td>2.713426</td><td>4.630257</td><td>0.516455</td><td>-3.743027</td><td>2.713426</td><td>1.284039</td><td>2.030797</td><td>-1.095536</td><td>1.560159</td><td>-1.014211</td><td>-1.095536</td><td>0.516455</td><td>-1.477845</td><td>0.960526</td><td>2.060204</td></tr>\n",
+ "    <tr><th>3</th><td>1.145809</td><td>2.255946</td><td>0.207364</td><td>4.665817</td><td>2.294678</td><td>6.501306</td><td>0.964770</td><td>4.665817</td><td>0.119410</td><td>3.196354</td><td>1.894787</td><td>3.519138</td><td>-4.757807</td><td>1.894787</td><td>6.501306</td><td>-3.789029</td><td>0.579491</td><td>1.397106</td></tr>\n",
+ "    <tr><th>4</th><td>-0.936646</td><td>3.697163</td><td>-3.363617</td><td>3.805126</td><td>-1.754430</td><td>4.954346</td><td>0.406605</td><td>3.805126</td><td>-0.824738</td><td>1.382591</td><td>1.665704</td><td>-0.649758</td><td>-3.513036</td><td>1.665704</td><td>4.954346</td><td>0.257052</td><td>0.904244</td><td>-3.071354</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " v1 v2 v3 ... v16 v17 v18\n",
+ "0 0.068441 4.211842 -2.558302 ... -0.630619 -0.978320 -0.888270\n",
+ "1 -4.824021 0.179509 -2.984473 ... 0.339920 1.891538 -6.109676\n",
+ "2 1.389530 -0.226476 1.877400 ... -1.477845 0.960526 2.060204\n",
+ "3 1.145809 2.255946 0.207364 ... -3.789029 0.579491 1.397106\n",
+ "4 -0.936646 3.697163 -3.363617 ... 0.257052 0.904244 -3.071354\n",
+ "\n",
+ "[5 rows x 18 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 9
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "s-ibdD2ZG7tm",
+ "outputId": "07b35596-7af9-4594-d74e-2d5fb6a0826c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "df_X.shape"
+ ],
+ "execution_count": 10,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(1000, 18)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 10
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "f9cqRaywa_TR",
+ "outputId": "9a4bcdd2-1ed2-4d82-c470-d6fe7d640726",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "set(df_y['target'])"
+ ],
+ "execution_count": 11,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "{0, 1}"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 11
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BN6jbpn6Iwmu"
+ },
+ "source": [
+ "## Basic descriptive statistics of the dataframe - df.describe()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KlwhxxUNIyYs",
+ "outputId": "9348f412-5000-427a-a5ed-addbc0f53bd8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 317
+ }
+ },
+ "source": [
+ "df_X.describe()"
+ ],
+ "execution_count": 12,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " v1 | \n",
+ " v2 | \n",
+ " v3 | \n",
+ " v4 | \n",
+ " v5 | \n",
+ " v6 | \n",
+ " v7 | \n",
+ " v8 | \n",
+ " v9 | \n",
+ " v10 | \n",
+ " v11 | \n",
+ " v12 | \n",
+ " v13 | \n",
+ " v14 | \n",
+ " v15 | \n",
+ " v16 | \n",
+ " v17 | \n",
+ " v18 | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | count | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ " 1000.000000 | \n",
+ "
\n",
+ " \n",
+ " | mean | \n",
+ " -0.085159 | \n",
+ " 1.034227 | \n",
+ " 0.657408 | \n",
+ " 1.405317 | \n",
+ " 0.687279 | \n",
+ " 1.131560 | \n",
+ " 0.108053 | \n",
+ " 1.405317 | \n",
+ " 1.007023 | \n",
+ " 1.048801 | \n",
+ " 0.079248 | \n",
+ " 0.001650 | \n",
+ " -0.365438 | \n",
+ " 0.079248 | \n",
+ " 1.131560 | \n",
+ " -0.027751 | \n",
+ " 0.984606 | \n",
+ " 0.633624 | \n",
+ "
\n",
+ " \n",
+ " | std | \n",
+ " 2.002247 | \n",
+ " 1.631507 | \n",
+ " 3.608772 | \n",
+ " 2.256857 | \n",
+ " 4.019598 | \n",
+ " 4.481832 | \n",
+ " 1.981307 | \n",
+ " 2.256857 | \n",
+ " 1.863288 | \n",
+ " 1.643900 | \n",
+ " 1.949273 | \n",
+ " 1.932641 | \n",
+ " 4.160668 | \n",
+ " 1.949273 | \n",
+ " 4.481832 | \n",
+ " 2.065455 | \n",
+ " 1.850593 | \n",
+ " 3.552991 | \n",
+ "
\n",
+ " \n",
+ " | min | \n",
+ " -6.944169 | \n",
+ " -4.620754 | \n",
+ " -16.300139 | \n",
+ " -6.235192 | \n",
+ " -12.454256 | \n",
+ " -14.305401 | \n",
+ " -6.152747 | \n",
+ " -6.235192 | \n",
+ " -5.484992 | \n",
+ " -3.293216 | \n",
+ " -7.135349 | \n",
+ " -5.705500 | \n",
+ " -9.120941 | \n",
+ " -7.135349 | \n",
+ " -14.305401 | \n",
+ " -6.009023 | \n",
+ " -5.035184 | \n",
+ " -11.439074 | \n",
+ "
\n",
+ " \n",
+ " | 25% | \n",
+ " -1.305566 | \n",
+ " -0.089052 | \n",
+ " -1.623657 | \n",
+ " -0.152888 | \n",
+ " -1.854645 | \n",
+ " -1.684751 | \n",
+ " -1.216983 | \n",
+ " -0.152888 | \n",
+ " -0.240908 | \n",
+ " -0.012710 | \n",
+ " -1.209675 | \n",
+ " -1.292162 | \n",
+ " -3.555363 | \n",
+ " -1.209675 | \n",
+ " -1.684751 | \n",
+ " -1.436673 | \n",
+ " -0.261610 | \n",
+ " -1.691346 | \n",
+ "
\n",
+ " \n",
+ " | 50% | \n",
+ " 0.052523 | \n",
+ " 0.994150 | \n",
+ " 0.573849 | \n",
+ " 1.449931 | \n",
+ " 0.812364 | \n",
+ " 1.281504 | \n",
+ " 0.167091 | \n",
+ " 1.449931 | \n",
+ " 1.066125 | \n",
+ " 1.012899 | \n",
+ " 0.180344 | \n",
+ " 0.035237 | \n",
+ " -0.966638 | \n",
+ " 0.180344 | \n",
+ " 1.281504 | \n",
+ " -0.000190 | \n",
+ " 0.975793 | \n",
+ " 0.844784 | \n",
+ "
\n",
+ " \n",
+ " | 75% | \n",
+ " 1.383853 | \n",
+ " 2.071995 | \n",
+ " 3.038586 | \n",
+ " 2.887141 | \n",
+ " 3.413952 | \n",
+ " 4.008103 | \n",
+ " 1.438719 | \n",
+ " 2.887141 | \n",
+ " 2.288188 | \n",
+ " 2.187202 | \n",
+ " 1.439199 | \n",
+ " 1.315342 | \n",
+ " 2.745806 | \n",
+ " 1.439199 | \n",
+ " 4.008103 | \n",
+ " 1.365369 | \n",
+ " 2.256504 | \n",
+ " 3.109330 | \n",
+ "
\n",
+ " \n",
+ " | max | \n",
+ " 4.997172 | \n",
+ " 7.354860 | \n",
+ " 11.720165 | \n",
+ " 8.494566 | \n",
+ " 12.844418 | \n",
+ " 15.999803 | \n",
+ " 6.293550 | \n",
+ " 8.494566 | \n",
+ " 8.146559 | \n",
+ " 6.523180 | \n",
+ " 6.252448 | \n",
+ " 5.538216 | \n",
+ " 11.259350 | \n",
+ " 6.252448 | \n",
+ " 15.999803 | \n",
+ " 6.531561 | \n",
+ " 7.646802 | \n",
+ " 12.090528 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " v1 v2 ... v17 v18\n",
+ "count 1000.000000 1000.000000 ... 1000.000000 1000.000000\n",
+ "mean -0.085159 1.034227 ... 0.984606 0.633624\n",
+ "std 2.002247 1.631507 ... 1.850593 3.552991\n",
+ "min -6.944169 -4.620754 ... -5.035184 -11.439074\n",
+ "25% -1.305566 -0.089052 ... -0.261610 -1.691346\n",
+ "50% 0.052523 0.994150 ... 0.975793 0.844784\n",
+ "75% 1.383853 2.071995 ... 2.256504 3.109330\n",
+ "max 4.997172 7.354860 ... 7.646802 12.090528\n",
+ "\n",
+ "[8 rows x 18 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 12
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "N_QhFqyZOKFB"
+ },
+ "source": [
+ "## Select the training and test samples\n",
+ "\n",
+ "* Split the data/samples into:\n",
+ "  * **Training sample**: used to train the model and tune the hyperparameters;\n",
+ "  * **Test sample**: used to check whether the tuned model works on completely unseen data. This is where we evaluate how well the model generalizes (works with data it was never shown);\n",
+ "* We usually use 70% of the sample for training and 30% for testing. Other common splits are 80/20 or 75/25 (the scikit-learn default).\n",
+ "* See [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for more details.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8sKBgs-QOOfn"
+ },
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)"
+ ],
+ "execution_count": 13,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TPTKBBHgOpoA",
+ "outputId": "8f521310-6ec5-49f2-fdfb-1acb336a3dd1",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "X_treinamento.shape"
+ ],
+ "execution_count": 14,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(700, 18)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 14
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lEn_LLs2OtRI",
+ "outputId": "163839bc-7ff2-4fb8-c217-4752d857eb3e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "y_treinamento.shape"
+ ],
+ "execution_count": 15,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(700, 1)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 15
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_uAw8EcyOvrG",
+ "outputId": "9ecadc3e-2370-438b-ff14-9fdff53733aa",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "X_teste.shape"
+ ],
+ "execution_count": 16,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(300, 18)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 16
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "A2LYI-9hOyXI",
+ "outputId": "6ecf6284-b13f-4816-bb63-d5afd536d92e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "y_teste.shape"
+ ],
+ "execution_count": 17,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(300, 1)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 17
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "npgoBSX2dd4l"
+ },
+ "source": [
+ "## Train the algorithm with the training data\n",
+ "### Load the algorithms/libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hcvzrtolGfnQ",
+ "outputId": "07e2ad08-50ab-4bd0-a586-d47c3f46833e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "!pip install graphviz\n",
+ "!pip install pydotplus"
+ ],
+ "execution_count": 18,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (0.10.1)\n",
+ "Requirement already satisfied: pydotplus in /usr/local/lib/python3.6/dist-packages (2.0.2)\n",
+ "Requirement already satisfied: pyparsing>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from pydotplus) (2.4.7)\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "v_pF-HH3JKL2"
+ },
+ "source": [
+ "from sklearn.metrics import accuracy_score # to measure the accuracy of the predictive model\n",
+ "#from sklearn.model_selection import train_test_split\n",
+ "#from sklearn.metrics import classification_report\n",
+ "from sklearn.metrics import confusion_matrix # to plot the confusion matrix\n",
+ "\n",
+ "from sklearn.model_selection import GridSearchCV # to tune the parameters of the predictive models\n",
+ "from sklearn.model_selection import cross_val_score # for CV (Cross-Validation)\n",
+ "from time import time\n",
+ "from operator import itemgetter\n",
+ "from scipy.stats import randint\n",
+ "\n",
+ "from sklearn.tree import export_graphviz\n",
+ "from io import StringIO # sklearn.externals.six was removed from recent scikit-learn versions\n",
+ "from IPython.display import Image \n",
+ "import pydotplus\n",
+ "\n",
+ "np.set_printoptions(suppress=True)"
+ ],
+ "execution_count": 19,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9ROlyvgij2yl"
+ },
+ "source": [
+ "Function to plot the Confusion Matrix, taken from [Confusion Matrix Visualization](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "klQ0FLOIgeX1"
+ },
+ "source": [
+ "def mostra_confusion_matrix(cf,\n",
+ "                            group_names = None,\n",
+ "                            categories = 'auto',\n",
+ "                            count = True,\n",
+ "                            percent = True,\n",
+ "                            cbar = True,\n",
+ "                            xyticks = False,\n",
+ "                            xyplotlabels = True,\n",
+ "                            sum_stats = True,\n",
+ "                            figsize = (8, 8),\n",
+ "                            cmap = 'Blues'):\n",
+ "    '''\n",
+ "    This function will make a pretty plot of an sklearn Confusion Matrix cf using a Seaborn heatmap visualization.\n",
+ "    Arguments\n",
+ "    ---------\n",
+ "    cf:            confusion matrix to be passed in\n",
+ "    group_names:   List of strings that represent the labels row by row to be shown in each square.\n",
+ "    categories:    List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'\n",
+ "    count:         If True, show the raw number in the confusion matrix. Default is True.\n",
+ "    percent:       If True, show the proportions for each category. Default is True.\n",
+ "    cbar:          If True, show the color bar. The cbar values are based off the values in the confusion matrix.\n",
+ "                   Default is True.\n",
+ "    xyticks:       If True, show x and y ticks. Default is False.\n",
+ "    xyplotlabels:  If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.\n",
+ "    sum_stats:     If True, display summary statistics below the figure. Default is True.\n",
+ "    figsize:       Tuple representing the figure size. Default is (8, 8); pass None to use the matplotlib rcParams value.\n",
+ "    cmap:          Colormap of the values displayed from matplotlib.pyplot.cm. Default is 'Blues'\n",
+ "                   See http://matplotlib.org/examples/color/colormaps_reference.html\n",
+ "    '''\n",
+ "\n",
+ "    # CODE TO GENERATE TEXT INSIDE EACH SQUARE\n",
+ "    blanks = ['' for i in range(cf.size)]\n",
+ "\n",
+ "    if group_names and len(group_names)==cf.size:\n",
+ "        group_labels = [\"{}\\n\".format(value) for value in group_names]\n",
+ "    else:\n",
+ "        group_labels = blanks\n",
+ "\n",
+ "    if count:\n",
+ "        group_counts = [\"{0:0.0f}\\n\".format(value) for value in cf.flatten()]\n",
+ "    else:\n",
+ "        group_counts = blanks\n",
+ "\n",
+ "    if percent:\n",
+ "        group_percentages = [\"{0:.2%}\".format(value) for value in cf.flatten()/np.sum(cf)]\n",
+ "    else:\n",
+ "        group_percentages = blanks\n",
+ "\n",
+ "    box_labels = [f\"{v1}{v2}{v3}\".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]\n",
+ "    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])\n",
+ "\n",
+ "    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS\n",
+ "    if sum_stats:\n",
+ "        #Accuracy is sum of diagonal divided by total observations\n",
+ "        accuracy = np.trace(cf) / float(np.sum(cf))\n",
+ "\n",
+ "        #if it is a binary confusion matrix, show some more stats\n",
+ "        if len(cf)==2:\n",
+ "            #Metrics for Binary Confusion Matrices\n",
+ "            precision = cf[1,1] / sum(cf[:,1])\n",
+ "            recall = cf[1,1] / sum(cf[1,:])\n",
+ "            f1_score = 2*precision*recall / (precision + recall)\n",
+ "            stats_text = \"\\n\\nAccuracy={:0.3f}\\nPrecision={:0.3f}\\nRecall={:0.3f}\\nF1 Score={:0.3f}\".format(accuracy,precision,recall,f1_score)\n",
+ "        else:\n",
+ "            stats_text = \"\\n\\nAccuracy={:0.3f}\".format(accuracy)\n",
+ "    else:\n",
+ "        stats_text = \"\"\n",
+ "\n",
+ "    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS\n",
+ "    if figsize is None:\n",
+ "        #Get default figure size if not set\n",
+ "        figsize = plt.rcParams.get('figure.figsize')\n",
+ "\n",
+ "    if not xyticks:\n",
+ "        #Do not show categories if xyticks is False\n",
+ "        categories=False\n",
+ "\n",
+ "    # MAKE THE HEATMAP VISUALIZATION\n",
+ "    plt.figure(figsize=figsize)\n",
+ "    sns.heatmap(cf,annot=box_labels,fmt=\"\",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)\n",
+ "\n",
+ "    if xyplotlabels:\n",
+ "        plt.ylabel('True label')\n",
+ "        plt.xlabel('Predicted label' + stats_text)\n",
+ "    else:\n",
+ "        plt.xlabel(stats_text)"
+ ],
+ "execution_count": 20,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YJMS9ePQ6B6t"
+ },
+ "source": [
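+ "**Note**: `min_samples_split = 2` is the scikit-learn default for DecisionTreeClassifier and the most permissive value, so it does not by itself prevent overfitting. Larger values of `min_samples_split` and `min_samples_leaf` constrain the tree; they are explored in the parameter tuning section."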
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nNeRHYePJc-r"
+ },
+ "source": [
+ "from sklearn.tree import DecisionTreeClassifier # Decision Tree library (classification)\n",
+ "\n",
+ "# Instantiate the Decision Tree with scikit-learn's default hyperparameters (plus a fixed random_state) as a baseline.\n",
+ "# The deprecated arguments min_impurity_split and presort were dropped: recent scikit-learn versions no longer accept them.\n",
+ "ml_DT = DecisionTreeClassifier(criterion = 'gini',\n",
+ "                               splitter = 'best',\n",
+ "                               max_depth = None,\n",
+ "                               min_samples_split = 2,\n",
+ "                               min_samples_leaf = 1,\n",
+ "                               min_weight_fraction_leaf = 0.0,\n",
+ "                               max_features = None,\n",
+ "                               random_state = i_Seed,\n",
+ "                               max_leaf_nodes = None,\n",
+ "                               min_impurity_decrease = 0.0,\n",
+ "                               class_weight = None)"
+ ],
+ "execution_count": 21,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gVLZznprx2YX",
+ "outputId": "c0aa7ba6-132c-4b36-844a-703c1e5facf4",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "# Configured object/classifier\n",
+ "ml_DT"
+ ],
+ "execution_count": 22,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n",
+ " max_depth=None, max_features=None, max_leaf_nodes=None,\n",
+ " min_impurity_decrease=0.0, min_impurity_split=None,\n",
+ " min_samples_leaf=1, min_samples_split=2,\n",
+ " min_weight_fraction_leaf=0.0, presort=False,\n",
+ " random_state=20111974, splitter='best')"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 22
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OgAHfXVo-Nw8",
+ "outputId": "e312d5ca-a17e-46a7-bbc4-f0e7c7d92e5a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "# Train the algorithm: fit(X, y)\n",
+ "ml_DT.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": 23,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n",
+ " max_depth=None, max_features=None, max_leaf_nodes=None,\n",
+ " min_impurity_decrease=0.0, min_impurity_split=None,\n",
+ " min_samples_leaf=1, min_samples_split=2,\n",
+ " min_weight_fraction_leaf=0.0, presort=False,\n",
+ " random_state=20111974, splitter='best')"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 23
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ohmGCDpfyhvV",
+ "outputId": "e57e03a4-b85d-4fd0-879b-d905f65b8168",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "i_CV"
+ ],
+ "execution_count": 24,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "10"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 24
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6exa9D8R2fDJ",
+ "outputId": "d5f76459-4c2c-4d06-d77e-195913cbb16f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Cross-Validation with k = 10 folds\n",
+ "a_scores_CV = cross_val_score(ml_DT, X_treinamento, y_treinamento, cv = i_CV)\n",
+ "\n",
+ "print(f'Mean of the accuracies computed by CV.........: {100*round(a_scores_CV.mean(),4)}')\n",
+ "print(f'Std deviation of the accuracies computed by CV: {100*round(a_scores_CV.std(),4)}')"
+ ],
+ "execution_count": 25,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Mean of the accuracies computed by CV.........: 91.43\n",
+ "Std deviation of the accuracies computed by CV: 3.44\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Uxoplcea0byV",
+ "outputId": "9cea4e91-5b15-4323-9a20-6edbbd7d9637",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "a_scores_CV # array with the score of each CV iteration"
+ ],
+ "execution_count": 26,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.9 , 0.98571429, 0.85714286, 0.92857143, 0.88571429,\n",
+ " 0.94285714, 0.92857143, 0.9 , 0.88571429, 0.92857143])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 26
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "y3k-PcbN0o_i",
+ "outputId": "c844a899-f07e-4db6-e35b-38ec468497ce",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "a_scores_CV.mean()"
+ ],
+ "execution_count": 27,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0.9142857142857144"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 27
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6_rYker2gzeG"
+ },
+ "source": [
+ "**Interpretation**: Our classifier (DecisionTreeClassifier) reaches a mean cross-validated accuracy of 91.43% on the training sample. The standard deviation across folds is about 3.44%, which is small. Next, we try to improve the classifier's accuracy with parameter tuning (GridSearchCV)."
+ ]
+ },
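+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a sanity check, the mean and standard deviation can be re-derived from the ten fold scores printed above. This is a minimal sketch using only the standard library; the list `fold_scores` is copied by hand from the `a_scores_CV` output, and `pstdev` is used on the assumption that it matches the population standard deviation NumPy's `.std()` computes by default.\n",
+ "\n",
+ "```python\n",
+ "from statistics import mean, pstdev\n",
+ "\n",
+ "# Fold accuracies copied from the cross_val_score output above\n",
+ "fold_scores = [0.9, 0.98571429, 0.85714286, 0.92857143, 0.88571429,\n",
+ "               0.94285714, 0.92857143, 0.9, 0.88571429, 0.92857143]\n",
+ "\n",
+ "cv_mean = mean(fold_scores)\n",
+ "cv_std = pstdev(fold_scores)  # population standard deviation (divides by n)\n",
+ "\n",
+ "print(f'Mean accuracy.....: {100 * cv_mean:.2f}%')\n",
+ "print(f'Standard deviation: {100 * cv_std:.2f}%')\n",
+ "```\n",
+ "\n",
+ "A small spread across folds suggests the accuracy estimate does not depend strongly on which observations land in each fold."
+ ]
+ },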
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tkwchmkP3p_A",
+ "outputId": "732065c8-85f0-4169-9526-0c68123c6ee8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "print(f'Accuracies: {a_scores_CV}')"
+ ],
+ "execution_count": 29,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Accuracies: [0.9 0.98571429 0.85714286 0.92857143 0.88571429 0.94285714\n",
+ " 0.92857143 0.9 0.88571429 0.92857143]\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sI31WkZs2ht_"
+ },
+ "source": [
+ "# Use the classifier (Decision Tree) to make predictions on the test sample:\n",
+ "y_pred = ml_DT.predict(X_teste)"
+ ],
+ "execution_count": 30,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rfapj3OG13PG",
+ "outputId": "6f217d76-774a-4b20-8b75-e40e0eeb27d8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "y_pred[0:30]"
+ ],
+ "execution_count": 31,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,\n",
+ " 1, 0, 0, 1, 1, 0, 1, 1])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 31
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sc88ofqh16RT",
+ "outputId": "d1261312-3bd7-414f-cfb0-9a2a6d44788e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
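+ "# Note: y[0:30] holds the first 30 labels of the full dataset, which are not aligned with y_pred;\n",
+ "# to compare against the predictions, use the test labels, e.g. y_teste['target'].values[0:30].\n",
+ "y[0:30]"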
+ ],
+ "execution_count": 32,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,\n",
+ " 1, 1, 0, 1, 0, 1, 0, 1])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 32
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fSaVzJ9xFpwW",
+ "outputId": "0317eccb-2f78-484c-c383-24950f8e5a72",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 538
+ }
+ },
+ "source": [
+ "# Confusion Matrix\n",
+ "cf_matrix = confusion_matrix(y_teste, y_pred)\n",
+ "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n",
+ "cf_categories = ['Zero', 'One']\n",
+ "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)"
+ ],
+ "execution_count": 33,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "p8D975NqsGtj"
+ },
+ "source": [
+ "## Parameter tuning\n",
+ "### References\n",
+ "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n",
+ "* [Decision Tree Adventures 2 — Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - A didactic, step-by-step explanation of how to do parameter tuning."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Bfdq5zEhlVsk"
+ },
+ "source": [
+ "# Parameter dictionary for the parameter tuning. In total, 2 x 15 x 5 x 5 x 7 = 5,250 candidate models are evaluated. With 10 folds in the cross-validation, that makes 52,500 fits.\n",
+ "d_parametros_DT = {\"criterion\": [\"gini\", \"entropy\"], \n",
+ " \"min_samples_split\": [2, 5, 10, 30, 50, 70, 90, 120, 150, 180, 210, 240, 270, 350, 400], \n",
+ " \"max_depth\": [None, 2, 5, 9, 15], \n",
+ " \"min_samples_leaf\": [20, 40, 60, 80, 100], \n",
+ " \"max_leaf_nodes\": [None, 2, 3, 4, 5, 10, 15]}"
+ ],
+ "execution_count": 34,
+ "outputs": []
+ },
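+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The size of a grid is the product of the number of values listed for each parameter, so it can be checked before launching the search. A minimal sketch, assuming scikit-learn's `ParameterGrid` and a dictionary with the same values as `d_parametros_DT` (the names `d_grid`, `n_candidates` and `n_folds` are only illustrative):\n",
+ "\n",
+ "```python\n",
+ "from sklearn.model_selection import ParameterGrid\n",
+ "\n",
+ "# Illustrative copy of the grid defined above\n",
+ "d_grid = {'criterion': ['gini', 'entropy'],\n",
+ "          'min_samples_split': [2, 5, 10, 30, 50, 70, 90, 120, 150, 180, 210, 240, 270, 350, 400],\n",
+ "          'max_depth': [None, 2, 5, 9, 15],\n",
+ "          'min_samples_leaf': [20, 40, 60, 80, 100],\n",
+ "          'max_leaf_nodes': [None, 2, 3, 4, 5, 10, 15]}\n",
+ "\n",
+ "n_candidates = len(ParameterGrid(d_grid))  # 2 * 15 * 5 * 5 * 7\n",
+ "n_folds = 10                               # assumed number of CV folds\n",
+ "print(n_candidates, n_candidates * n_folds)  # 5250 52500\n",
+ "```"
+ ]
+ },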
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BtajXuuUpGwq",
+ "outputId": "57b0fe72-b999-4d1c-d9e3-2a48dc9c28ad",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 340
+ }
+ },
+ "source": [
+ "d_parametros_DT"
+ ],
+ "execution_count": 35,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "{'criterion': ['gini', 'entropy'],\n",
+ " 'max_depth': [None, 2, 5, 9, 15],\n",
+ " 'max_leaf_nodes': [None, 2, 3, 4, 5, 10, 15],\n",
+ " 'min_samples_leaf': [20, 40, 60, 80, 100],\n",
+ " 'min_samples_split': [2,\n",
+ " 5,\n",
+ " 10,\n",
+ " 30,\n",
+ " 50,\n",
+ " 70,\n",
+ " 90,\n",
+ " 120,\n",
+ " 150,\n",
+ " 180,\n",
+ " 210,\n",
+ " 240,\n",
+ " 270,\n",
+ " 350,\n",
+ " 400]}"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 35
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "H8gNSs0G0A-L"
+ },
+ "source": [
+ "```\n",
+ "grid_search = GridSearchCV(ml_DT, param_grid= d_parametros_DT, cv = i_CV, n_jobs= -1)\n",
+ "start = time()\n",
+ "grid_search.fit(X_treinamento, y_treinamento)\n",
+ "tempo_elapsed= time()-start\n",
+ "print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos para estimar {len(grid_search.cv_results_)} modelos candidatos\")\n",
+ "\n",
+ "GridSearchCV levou 1999.12 segundos para estimar 23 modelos candidatos\n",
+ "```\n",
+ "\n"
+ ]
+ },
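+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that `cv_results_` is a dictionary of arrays, so `len(grid_search.cv_results_)` counts its keys (timing, parameter, per-fold score and summary entries) rather than the candidate models, which is why the message above reports 23. A minimal sketch on synthetic data, where the dataset, the small grid and the names `X_demo`, `y_demo`, `d_grid_demo` and `gs_demo` are assumptions used only for illustration:\n",
+ "\n",
+ "```python\n",
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "\n",
+ "# Synthetic data, only for illustration\n",
+ "X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)\n",
+ "d_grid_demo = {'max_depth': [2, 5], 'min_samples_leaf': [5, 10, 20]}\n",
+ "\n",
+ "gs_demo = GridSearchCV(DecisionTreeClassifier(random_state=0), d_grid_demo, cv=3)\n",
+ "gs_demo.fit(X_demo, y_demo)\n",
+ "\n",
+ "# One entry per candidate: 2 * 3 = 6 parameter combinations\n",
+ "print(len(gs_demo.cv_results_['params']))\n",
+ "# Number of keys in the results dictionary, not the number of candidates\n",
+ "print(len(gs_demo.cv_results_))\n",
+ "```"
+ ]
+ },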
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ap3WMXqDthu9"
+ },
+ "source": [
+ "# Defines the helper function that runs GridSearchCV\n",
+ "def GridSearchOptimizer(modelo, ml_Opt, d_Parametros, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV):\n",
+ "    ml_GridSearchCV = GridSearchCV(modelo, d_Parametros, cv = cv, n_jobs= -1, verbose= 10, scoring= 'accuracy')\n",
+ "    start = time()\n",
+ "    ml_GridSearchCV.fit(X_treinamento, y_treinamento)\n",
+ "    tempo_elapsed= time()-start\n",
+ "    #print(f\"\\nGridSearchCV took {tempo_elapsed:.2f} seconds.\")\n",
+ "\n",
+ "    # Parameters that optimize the classification:\n",
+ "    print(f'\\nOptimized parameters: {ml_GridSearchCV.best_params_}')\n",
+ "\n",
+ "    if ml_Opt == 'ml_DT2':\n",
+ "        print(f'\\nDecisionTreeClassifier *********************************************************************************************************')\n",
+ "        ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'],\n",
+ "                                        max_depth= ml_GridSearchCV.best_params_['max_depth'],\n",
+ "                                        max_leaf_nodes= ml_GridSearchCV.best_params_['max_leaf_nodes'],\n",
+ "                                        min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n",
+ "                                        min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n",
+ "                                        random_state= i_Seed)\n",
+ "\n",
+ "    elif ml_Opt == 'ml_RF2':\n",
+ "        print(f'\\nRandomForestClassifier *********************************************************************************************************')\n",
+ "        ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'],\n",
+ "                                        max_depth= ml_GridSearchCV.best_params_['max_depth'],\n",
+ "                                        max_features= ml_GridSearchCV.best_params_['max_features'],\n",
+ "                                        min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n",
+ "                                        min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n",
+ "                                        n_estimators= ml_GridSearchCV.best_params_['n_estimators'],\n",
+ "                                        random_state= i_Seed)\n",
+ "\n",
+ "    elif ml_Opt == 'ml_AB2':\n",
+ "        print(f'\\nAdaBoostClassifier *********************************************************************************************************')\n",
+ "        ml_Opt = AdaBoostClassifier(algorithm='SAMME.R',\n",
+ "                                    base_estimator=RandomForestClassifier(bootstrap = False,\n",
+ "                                                                          max_depth = 10,\n",
+ "                                                                          max_features = 'auto',\n",
+ "                                                                          min_samples_leaf = 1,\n",
+ "                                                                          min_samples_split = 2,\n",
+ "                                                                          n_estimators = 400),\n",
+ "                                    learning_rate = ml_GridSearchCV.best_params_['learning_rate'],\n",
+ "                                    n_estimators = ml_GridSearchCV.best_params_['n_estimators'],\n",
+ "                                    random_state = i_Seed)\n",
+ "\n",
+ "    elif ml_Opt == 'ml_GB2':\n",
+ "        print(f'\\nGradientBoostingClassifier *********************************************************************************************************')\n",
+ "        ml_Opt = GradientBoostingClassifier(learning_rate = ml_GridSearchCV.best_params_['learning_rate'],\n",
+ "                                            n_estimators = ml_GridSearchCV.best_params_['n_estimators'],\n",
+ "                                            max_depth = ml_GridSearchCV.best_params_['max_depth'],\n",
+ "                                            min_samples_split = ml_GridSearchCV.best_params_['min_samples_split'],\n",
+ "                                            min_samples_leaf = ml_GridSearchCV.best_params_['min_samples_leaf'],\n",
+ "                                            max_features = ml_GridSearchCV.best_params_['max_features'])\n",
+ "\n",
+ "    elif ml_Opt == 'ml_XGB2':\n",
+ "        print(f'\\nXGBClassifier *********************************************************************************************************')\n",
+ "        # Assumes `from xgboost import XGBClassifier` was run earlier in the notebook\n",
+ "        ml_Opt = XGBClassifier(learning_rate= ml_GridSearchCV.best_params_['learning_rate'],\n",
+ "                               max_depth= ml_GridSearchCV.best_params_['max_depth'],\n",
+ "                               colsample_bytree= ml_GridSearchCV.best_params_['colsample_bytree'],\n",
+ "                               subsample= ml_GridSearchCV.best_params_['subsample'],\n",
+ "                               gamma= ml_GridSearchCV.best_params_['gamma'],\n",
+ "                               min_child_weight= ml_GridSearchCV.best_params_['min_child_weight'])\n",
+ "\n",
+ "    else:\n",
+ "        raise ValueError(f'Unknown model key: {ml_Opt}')\n",
+ "\n",
+ "    # Trains again using the optimized parameters...\n",
+ "    ml_Opt.fit(X_treinamento, y_treinamento)\n",
+ "\n",
+ "    # Cross-validation with the requested number of folds\n",
+ "    print(f'\\n********* CROSS-VALIDATION ***********')\n",
+ "    a_scores_CV = cross_val_score(ml_Opt, X_treinamento, y_treinamento, cv = cv)\n",
+ "    print(f'Mean of the CV accuracies...: {100*a_scores_CV.mean():.2f}')\n",
+ "    print(f'Std of the CV accuracies....: {100*a_scores_CV.std():.2f}')\n",
+ "\n",
+ "    # Makes predictions with the optimized parameters...\n",
+ "    y_pred = ml_Opt.predict(X_teste)\n",
+ "\n",
+ "    # Importance of the COLUMNS\n",
+ "    print(f'\\n********* COLUMN IMPORTANCE ***********')\n",
+ "    df_importancia_variaveis = pd.DataFrame(zip(l_colunas, ml_Opt.feature_importances_), columns= ['coluna', 'importancia'])\n",
+ "    df_importancia_variaveis = df_importancia_variaveis.sort_values(by= ['importancia'], ascending=False)\n",
+ "    print(df_importancia_variaveis)\n",
+ "\n",
+ "    # Confusion matrix\n",
+ "    print(f'\\n********* CONFUSION MATRIX - PARAMETER TUNING ***********')\n",
+ "    cf_matrix = confusion_matrix(y_teste, y_pred)\n",
+ "    cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n",
+ "    cf_categories = ['Zero', 'One']\n",
+ "    mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)\n",
+ "\n",
+ "    return ml_Opt, ml_GridSearchCV.best_params_"
+ ],
+ "execution_count": 36,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "44-BRnNjBT25",
+ "outputId": "a73bf3b0-2831-4396-f98e-fa67b06c9835",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ }
+ },
+ "source": [
+ "# Calls the function with the baseline model\n",
+ "ml_DT2, best_params = GridSearchOptimizer(ml_DT, 'ml_DT2', d_parametros_DT, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)"
+ ],
+ "execution_count": 37,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Fitting 10 folds for each of 5250 candidates, totalling 52500 fits\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "stream",
+ "text": [
+ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n",
+ "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.3s\n",
+ "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 1.4s\n",
+ "[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 1.6s\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1796s.) Setting batch_size=2.\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0489s.) Setting batch_size=4.\n",
+ "[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 1.7s\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0863s.) Setting batch_size=8.\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1844s.) Setting batch_size=16.\n",
+ "[Parallel(n_jobs=-1)]: Done 58 tasks | elapsed: 2.0s\n",
+ "[Parallel(n_jobs=-1)]: Done 186 tasks | elapsed: 2.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 330 tasks | elapsed: 3.8s\n",
+ "[Parallel(n_jobs=-1)]: Done 506 tasks | elapsed: 4.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 682 tasks | elapsed: 5.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 890 tasks | elapsed: 7.1s\n",
+ "[Parallel(n_jobs=-1)]: Done 1098 tasks | elapsed: 8.3s\n",
+ "[Parallel(n_jobs=-1)]: Done 1338 tasks | elapsed: 9.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 1578 tasks | elapsed: 10.8s\n",
+ "[Parallel(n_jobs=-1)]: Done 1850 tasks | elapsed: 12.4s\n",
+ "[Parallel(n_jobs=-1)]: Done 2122 tasks | elapsed: 13.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 2426 tasks | elapsed: 15.6s\n",
+ "[Parallel(n_jobs=-1)]: Done 2730 tasks | elapsed: 17.3s\n",
+ "[Parallel(n_jobs=-1)]: Done 3066 tasks | elapsed: 19.2s\n",
+ "[Parallel(n_jobs=-1)]: Done 3402 tasks | elapsed: 21.2s\n",
+ "[Parallel(n_jobs=-1)]: Done 3770 tasks | elapsed: 23.3s\n",
+ "[Parallel(n_jobs=-1)]: Done 4138 tasks | elapsed: 25.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 4538 tasks | elapsed: 27.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 4938 tasks | elapsed: 30.4s\n",
+ "[Parallel(n_jobs=-1)]: Done 5370 tasks | elapsed: 32.7s\n",
+ "[Parallel(n_jobs=-1)]: Done 5802 tasks | elapsed: 35.0s\n",
+ "[Parallel(n_jobs=-1)]: Done 6266 tasks | elapsed: 37.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 6730 tasks | elapsed: 39.8s\n",
+ "[Parallel(n_jobs=-1)]: Done 7226 tasks | elapsed: 42.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 7722 tasks | elapsed: 45.2s\n",
+ "[Parallel(n_jobs=-1)]: Done 8250 tasks | elapsed: 47.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 8778 tasks | elapsed: 50.8s\n",
+ "[Parallel(n_jobs=-1)]: Done 9338 tasks | elapsed: 53.8s\n",
+ "[Parallel(n_jobs=-1)]: Done 9898 tasks | elapsed: 56.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 10490 tasks | elapsed: 1.0min\n",
+ "[Parallel(n_jobs=-1)]: Done 11082 tasks | elapsed: 1.1min\n",
+ "[Parallel(n_jobs=-1)]: Done 11706 tasks | elapsed: 1.1min\n",
+ "[Parallel(n_jobs=-1)]: Done 12330 tasks | elapsed: 1.2min\n",
+ "[Parallel(n_jobs=-1)]: Done 12986 tasks | elapsed: 1.2min\n",
+ "[Parallel(n_jobs=-1)]: Done 13642 tasks | elapsed: 1.3min\n",
+ "[Parallel(n_jobs=-1)]: Done 14330 tasks | elapsed: 1.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 15018 tasks | elapsed: 1.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 15738 tasks | elapsed: 1.5min\n",
+ "[Parallel(n_jobs=-1)]: Done 16458 tasks | elapsed: 1.6min\n",
+ "[Parallel(n_jobs=-1)]: Done 17210 tasks | elapsed: 1.6min\n",
+ "[Parallel(n_jobs=-1)]: Done 17962 tasks | elapsed: 1.7min\n",
+ "[Parallel(n_jobs=-1)]: Done 18746 tasks | elapsed: 1.8min\n",
+ "[Parallel(n_jobs=-1)]: Done 19530 tasks | elapsed: 1.8min\n",
+ "[Parallel(n_jobs=-1)]: Done 20346 tasks | elapsed: 1.9min\n",
+ "[Parallel(n_jobs=-1)]: Done 21162 tasks | elapsed: 2.0min\n",
+ "[Parallel(n_jobs=-1)]: Done 22010 tasks | elapsed: 2.1min\n",
+ "[Parallel(n_jobs=-1)]: Done 22858 tasks | elapsed: 2.2min\n",
+ "[Parallel(n_jobs=-1)]: Done 23738 tasks | elapsed: 2.2min\n",
+ "[Parallel(n_jobs=-1)]: Done 24618 tasks | elapsed: 2.3min\n",
+ "[Parallel(n_jobs=-1)]: Done 25530 tasks | elapsed: 2.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 26442 tasks | elapsed: 2.5min\n",
+ "[Parallel(n_jobs=-1)]: Done 27386 tasks | elapsed: 2.6min\n",
+ "[Parallel(n_jobs=-1)]: Done 28330 tasks | elapsed: 2.7min\n",
+ "[Parallel(n_jobs=-1)]: Done 29306 tasks | elapsed: 2.8min\n",
+ "[Parallel(n_jobs=-1)]: Done 30282 tasks | elapsed: 2.9min\n",
+ "[Parallel(n_jobs=-1)]: Done 31290 tasks | elapsed: 3.1min\n",
+ "[Parallel(n_jobs=-1)]: Done 32298 tasks | elapsed: 3.2min\n",
+ "[Parallel(n_jobs=-1)]: Done 33338 tasks | elapsed: 3.3min\n",
+ "[Parallel(n_jobs=-1)]: Done 34378 tasks | elapsed: 3.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 35450 tasks | elapsed: 3.5min\n",
+ "[Parallel(n_jobs=-1)]: Done 36522 tasks | elapsed: 3.6min\n",
+ "[Parallel(n_jobs=-1)]: Done 37626 tasks | elapsed: 3.8min\n",
+ "[Parallel(n_jobs=-1)]: Done 38730 tasks | elapsed: 3.9min\n",
+ "[Parallel(n_jobs=-1)]: Done 39866 tasks | elapsed: 4.0min\n",
+ "[Parallel(n_jobs=-1)]: Done 41002 tasks | elapsed: 4.1min\n",
+ "[Parallel(n_jobs=-1)]: Done 42170 tasks | elapsed: 4.3min\n",
+ "[Parallel(n_jobs=-1)]: Done 43338 tasks | elapsed: 4.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 44538 tasks | elapsed: 4.5min\n",
+ "[Parallel(n_jobs=-1)]: Done 45738 tasks | elapsed: 4.7min\n",
+ "[Parallel(n_jobs=-1)]: Done 46970 tasks | elapsed: 4.8min\n",
+ "[Parallel(n_jobs=-1)]: Done 48202 tasks | elapsed: 5.0min\n",
+ "[Parallel(n_jobs=-1)]: Done 49466 tasks | elapsed: 5.1min\n",
+ "[Parallel(n_jobs=-1)]: Done 50730 tasks | elapsed: 5.3min\n",
+ "[Parallel(n_jobs=-1)]: Done 52026 tasks | elapsed: 5.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 52500 out of 52500 | elapsed: 5.5min finished\n"
+ ],
+ "name": "stderr"
+ },
+ {
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Optimized parameters: {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 20, 'min_samples_split': 70}\n",
+ "\n",
+ "DecisionTreeClassifier *********************************************************************************************************\n",
+ "\n",
+ "********* CROSS-VALIDATION ***********\n",
+ "Mean of the CV accuracies...: 87.14\n",
+ "Std of the CV accuracies....: 4.33\n",
+ "\n",
+ "********* COLUMN IMPORTANCE ***********\n",
+ " coluna importancia\n",
+ "12 v13 0.735896\n",
+ "0 v1 0.135030\n",
+ "9 v10 0.090888\n",
+ "6 v7 0.025768\n",
+ "1 v2 0.012418\n",
+ "3 v4 0.000000\n",
+ "4 v5 0.000000\n",
+ "5 v6 0.000000\n",
+ "7 v8 0.000000\n",
+ "8 v9 0.000000\n",
+ "10 v11 0.000000\n",
+ "11 v12 0.000000\n",
+ "2 v3 0.000000\n",
+ "13 v14 0.000000\n",
+ "14 v15 0.000000\n",
+ "15 v16 0.000000\n",
+ "16 v17 0.000000\n",
+ "17 v18 0.000000\n",
+ "\n",
+ "********* CONFUSION MATRIX - PARAMETER TUNING ***********\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gmCkjGjPJMLr"
+ },
+ "source": [
+ "### Visualize the result"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cIc3ZgaISEd0",
+ "outputId": "db25ff3b-7957-438b-9a1d-c8327cf1b71c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 753
+ }
+ },
+ "source": [
+ "from sklearn.tree import export_graphviz\n",
+ "from io import StringIO\n",
+ "from IPython.display import Image\n",
+ "import pydotplus\n",
+ "\n",
+ "dot_data = StringIO()\n",
+ "export_graphviz(ml_DT2, out_file = dot_data, filled = True, rounded = True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n",
+ "\n",
+ "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n",
+ "graph.write_png('DecisionTree.png')\n",
+ "Image(graph.create_png())"
+ ],
+ "execution_count": 40,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 40
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e1R2GBkbnV37"
+ },
+ "source": [
+ "## Select the important/relevant COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vv7GKBvs6Ybf"
+ },
+ "source": [
+ "# Helper function to select the relevant COLUMNS\n",
+ "from sklearn.feature_selection import SelectFromModel\n",
+ "\n",
+ "def seleciona_colunas_relevantes(modelo, X_treinamento, X_teste, threshold = 0.05):\n",
+ "    # Creates a selector that keeps the COLUMNS with importance >= threshold\n",
+ "    sfm = SelectFromModel(modelo, threshold = threshold)\n",
+ "\n",
+ "    # Fits the selector (relies on the global y_treinamento)\n",
+ "    sfm.fit(X_treinamento, y_treinamento)\n",
+ "\n",
+ "    # Shows the indices of the most important COLUMNS\n",
+ "    print(f'\\n********** Relevant COLUMNS ******')\n",
+ "    print(sfm.get_support(indices=True))\n",
+ "\n",
+ "    # Keeps only the relevant COLUMNS\n",
+ "    X_treinamento_I = sfm.transform(X_treinamento)\n",
+ "    X_teste_I = sfm.transform(X_teste)\n",
+ "    return X_treinamento_I, X_teste_I"
+ ],
+ "execution_count": 41,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ukMLoEr7nbUf",
+ "outputId": "482ef28c-0ce4-4c0b-cdfa-c27970870520",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "X_treinamento_DT, X_teste_DT = seleciona_colunas_relevantes(ml_DT2, X_treinamento, X_teste)"
+ ],
+ "execution_count": 42,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "********** Relevant COLUMNS ******\n",
+ "[ 0 9 12]\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8JjePRQAoqkk"
+ },
+ "source": [
+ "## Train the classifier with the relevant COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Gt3aCPpfKRxm",
+ "outputId": "42773aa9-2e3e-48b4-8ca6-f1e3b7bdc0c7",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "best_params"
+ ],
+ "execution_count": 43,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "{'criterion': 'entropy',\n",
+ " 'max_depth': None,\n",
+ " 'max_leaf_nodes': None,\n",
+ " 'min_samples_leaf': 20,\n",
+ " 'min_samples_split': 70}"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 43
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zq6uCVtzovMt",
+ "outputId": "273a2602-243a-41d8-ebea-48f8808dada0",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Train using the relevant COLUMNS...\n",
+ "ml_DT2.fit(X_treinamento_DT, y_treinamento)\n",
+ "\n",
+ "# 10-fold cross-validation\n",
+ "a_scores_CV = cross_val_score(ml_DT2, X_treinamento_DT, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean of the CV accuracies...: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std of the CV accuracies....: {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": 44,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Mean of the CV accuracies...: 88.71\n",
+ "Std of the CV accuracies....: 2.51\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Tc7esxqtq-Og",
+ "outputId": "d1b27e41-5785-4d81-b96e-241b1cc10c9b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 129
+ }
+ },
+ "source": [
+ "# ****************************************************************"
+ ],
+ "execution_count": 45,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "znWy3LE1q-Z3",
+ "outputId": "b32aaec0-77e2-4347-ec14-1d5dd852a8e9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ }
+ },
+ "source": [
+ "ml_DT3, best_params2 = GridSearchOptimizer(ml_DT2, 'ml_DT2', d_parametros_DT, X_treinamento_DT, y_treinamento, X_teste_DT, y_teste, cv = i_CV)"
+ ],
+ "execution_count": 46,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Fitting 10 folds for each of 5250 candidates, totalling 52500 fits\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "stream",
+ "text": [
+ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n",
+ "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0151s.) Setting batch_size=2.\n",
+ "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.0s\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0266s.) Setting batch_size=4.\n",
+ "[Parallel(n_jobs=-1)]: Done 16 tasks | elapsed: 0.1s\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0472s.) Setting batch_size=8.\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.0610s.) Setting batch_size=16.\n",
+ "[Parallel(n_jobs=-1)]: Done 44 tasks | elapsed: 0.2s\n",
+ "[Parallel(n_jobs=-1)]: Batch computation too fast (0.1373s.) Setting batch_size=32.\n",
+ "[Parallel(n_jobs=-1)]: Done 156 tasks | elapsed: 0.6s\n",
+ "[Parallel(n_jobs=-1)]: Done 380 tasks | elapsed: 1.2s\n",
+ "[Parallel(n_jobs=-1)]: Done 668 tasks | elapsed: 2.0s\n",
+ "[Parallel(n_jobs=-1)]: Done 956 tasks | elapsed: 2.7s\n",
+ "[Parallel(n_jobs=-1)]: Done 1308 tasks | elapsed: 3.6s\n",
+ "[Parallel(n_jobs=-1)]: Done 1660 tasks | elapsed: 4.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 2076 tasks | elapsed: 5.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 2492 tasks | elapsed: 6.6s\n",
+ "[Parallel(n_jobs=-1)]: Done 2972 tasks | elapsed: 7.8s\n",
+ "[Parallel(n_jobs=-1)]: Done 3452 tasks | elapsed: 9.1s\n",
+ "[Parallel(n_jobs=-1)]: Done 3996 tasks | elapsed: 10.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 4540 tasks | elapsed: 12.0s\n",
+ "[Parallel(n_jobs=-1)]: Done 5148 tasks | elapsed: 13.6s\n",
+ "[Parallel(n_jobs=-1)]: Done 5756 tasks | elapsed: 15.2s\n",
+ "[Parallel(n_jobs=-1)]: Done 6428 tasks | elapsed: 17.0s\n",
+ "[Parallel(n_jobs=-1)]: Done 7100 tasks | elapsed: 18.8s\n",
+ "[Parallel(n_jobs=-1)]: Done 7836 tasks | elapsed: 20.7s\n",
+ "[Parallel(n_jobs=-1)]: Done 8572 tasks | elapsed: 22.5s\n",
+ "[Parallel(n_jobs=-1)]: Done 9372 tasks | elapsed: 24.6s\n",
+ "[Parallel(n_jobs=-1)]: Done 10172 tasks | elapsed: 26.7s\n",
+ "[Parallel(n_jobs=-1)]: Done 11036 tasks | elapsed: 29.0s\n",
+ "[Parallel(n_jobs=-1)]: Done 11900 tasks | elapsed: 31.2s\n",
+ "[Parallel(n_jobs=-1)]: Done 12828 tasks | elapsed: 33.6s\n",
+ "[Parallel(n_jobs=-1)]: Done 13756 tasks | elapsed: 36.1s\n",
+ "[Parallel(n_jobs=-1)]: Done 14748 tasks | elapsed: 38.7s\n",
+ "[Parallel(n_jobs=-1)]: Done 15740 tasks | elapsed: 41.3s\n",
+ "[Parallel(n_jobs=-1)]: Done 16796 tasks | elapsed: 44.2s\n",
+ "[Parallel(n_jobs=-1)]: Done 17852 tasks | elapsed: 46.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 18972 tasks | elapsed: 49.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 20092 tasks | elapsed: 52.9s\n",
+ "[Parallel(n_jobs=-1)]: Done 21276 tasks | elapsed: 56.2s\n",
+ "[Parallel(n_jobs=-1)]: Done 22460 tasks | elapsed: 59.3s\n",
+ "[Parallel(n_jobs=-1)]: Done 23708 tasks | elapsed: 1.0min\n",
+ "[Parallel(n_jobs=-1)]: Done 24956 tasks | elapsed: 1.1min\n",
+ "[Parallel(n_jobs=-1)]: Done 26268 tasks | elapsed: 1.2min\n",
+ "[Parallel(n_jobs=-1)]: Done 27580 tasks | elapsed: 1.2min\n",
+ "[Parallel(n_jobs=-1)]: Done 28956 tasks | elapsed: 1.3min\n",
+ "[Parallel(n_jobs=-1)]: Done 30332 tasks | elapsed: 1.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 31772 tasks | elapsed: 1.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 33212 tasks | elapsed: 1.5min\n",
+ "[Parallel(n_jobs=-1)]: Done 34716 tasks | elapsed: 1.6min\n",
+ "[Parallel(n_jobs=-1)]: Done 36220 tasks | elapsed: 1.6min\n",
+ "[Parallel(n_jobs=-1)]: Done 37788 tasks | elapsed: 1.7min\n",
+ "[Parallel(n_jobs=-1)]: Done 39356 tasks | elapsed: 1.8min\n",
+ "[Parallel(n_jobs=-1)]: Done 40988 tasks | elapsed: 1.9min\n",
+ "[Parallel(n_jobs=-1)]: Done 42620 tasks | elapsed: 1.9min\n",
+ "[Parallel(n_jobs=-1)]: Done 44316 tasks | elapsed: 2.0min\n",
+ "[Parallel(n_jobs=-1)]: Done 46012 tasks | elapsed: 2.1min\n",
+ "[Parallel(n_jobs=-1)]: Done 47772 tasks | elapsed: 2.2min\n",
+ "[Parallel(n_jobs=-1)]: Done 49532 tasks | elapsed: 2.3min\n",
+ "[Parallel(n_jobs=-1)]: Done 51356 tasks | elapsed: 2.4min\n",
+ "[Parallel(n_jobs=-1)]: Done 52500 out of 52500 | elapsed: 2.4min finished\n"
+ ],
+ "name": "stderr"
+ },
+ {
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Optimized parameters: {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 60, 'min_samples_split': 2}\n",
+ "\n",
+ "DecisionTreeClassifier *********************************************************************************************************\n",
+ "\n",
+ "********* CROSS-VALIDATION ***********\n",
+ "Mean of the CV accuracies...: 89.29\n",
+ "Std of the CV accuracies....: 2.73\n",
+ "\n",
+ "********* COLUMN IMPORTANCE ***********\n",
+ " coluna importancia\n",
+ "2 v3 0.691283\n",
+ "0 v1 0.177569\n",
+ "1 v2 0.131148\n",
+ "\n",
+ "********* CONFUSION MATRIX - PARAMETER TUNING ***********\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6IhCC6pfq-jL",
+ "outputId": "eeec1e49-71fa-44ce-b9f0-05d4f83245bd",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "best_params"
+ ],
+ "execution_count": 47,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "{'criterion': 'entropy',\n",
+ " 'max_depth': None,\n",
+ " 'max_leaf_nodes': None,\n",
+ " 'min_samples_leaf': 20,\n",
+ " 'min_samples_split': 70}"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 47
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qw6Dk3kesT0q",
+ "outputId": "771c6d2a-364d-45a5-ded1-935d556d9fde",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "best_params2"
+ ],
+ "execution_count": 48,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "{'criterion': 'entropy',\n",
+ " 'max_depth': None,\n",
+ " 'max_leaf_nodes': None,\n",
+ " 'min_samples_leaf': 60,\n",
+ " 'min_samples_split': 2}"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 48
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SbS4ZKN8s-ee",
+ "outputId": "804f7297-dd42-4573-8c56-b882c11de1de",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# 10-fold cross-validation\n",
+ "a_scores_CV = cross_val_score(ml_DT3, X_treinamento_DT, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean of the CV accuracies...: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std of the CV accuracies....: {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": 49,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Mean of the CV accuracies...: 89.29\n",
+ "Std of the CV accuracies....: 2.73\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_at3XP1Bq-qb"
+ },
+ "source": [
+ "# ***************************************************************"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MZ1-vGRcxJoN"
+ },
+ "source": [
+ "## Validate the model using the X_teste dataframe"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ig9GiUAEw9jr"
+ },
+ "source": [
+ "y_pred_DT = ml_DT2.predict(X_teste_DT)"
+ ],
+ "execution_count": 50,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7UZz4UzHDqae",
+ "outputId": "9ddde91a-ce92-4ad2-e0e1-eda71009c9da",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Computes the accuracy\n",
+ "accuracy_score(y_teste, y_pred_DT)"
+ ],
+ "execution_count": 51,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0.9333333333333333"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 51
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "K3EUMAxxKBur"
+ },
+ "source": [
+ "___\n",
+ "# **RANDOM FOREST**\n",
+ "* Decision Trees have a tree-shaped structure; a Random Forest combines many of them.\n",
+ "* Random Forest can be used both for classification (RandomForestClassifier) and for regression (RandomForestRegressor).\n",
+ "\n",
+ "* **Advantages**:\n",
+ "  * Does not require much data preprocessing;\n",
+ "  * Handles both categorical and numerical COLUMNS well;\n",
+ "  * It is a Bagging Ensemble Method, not a boosting one: it builds many trees, each trained on a bootstrap sample of the rows with a random subset of COLUMNS considered at each split, and combines their predictions by majority vote (classification) or averaging (regression);\n",
+ "  * More robust than a single Decision Tree. **Why?**\n",
+ "  * Tends to control overfitting automatically (**why?**) and often produces very robust, high-performance models.\n",
+ "  * Can be used for Feature Selection, since it produces the importance of each attribute (`feature_importances_`). In scikit-learn the importances sum to 1, that is, 100%;\n",
+ "  * Like Decision Trees, these models easily capture non-linear patterns in the data;\n",
+ "  * Does not require the data to be normalized;\n",
+ "  * Some implementations handle Missing Values natively; with scikit-learn, check whether your version supports them or impute the values first;\n",
+ "  * Requires no assumptions about the distribution of the data, because of the non-parametric nature of the algorithm.\n",
+ "\n",
+ "* **Disadvantages**\n",
+ "  * Can be biased toward the majority class when the classes are imbalanced. **Balancing the dataframe beforehand is recommended to avoid this problem**.\n",
+ "\n",
+ "* **Main parameters**\n",
+ "\n",
+ "## **References**:\n",
+ "* [Running Random Forests? Inspect the feature importances with this code](https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e)\n",
+ "* [Feature importances with forests of trees](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)\n",
+ "* [Understanding Random Forests Classifiers in Python](https://www.datacamp.com/community/tutorials/random-forests-classifier-python)\n",
+ "* [Understanding Random Forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)\n",
+ "* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76)\n",
+ "* [Random Forest Simple Explanation](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)\n",
+ "* [Random Forest Explained](https://www.youtube.com/watch?v=eM4uJ6XGnSM)\n",
+ "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) - Explains the main Random Forest parameters."
+ ]
+ },
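+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The feature importances mentioned above can be inspected with a minimal sketch. It assumes a synthetic dataset built with `make_classification` (not this notebook's data), and the names `X_demo`, `y_demo` and `ml_demo` are hypothetical, chosen only for this illustration. In scikit-learn the importances are fractions that add up to 1 (100%), so they can be used to rank the columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "\n",
+ "# Synthetic data, used only for illustration (not the notebook's dataset)\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)\n",
+ "\n",
+ "ml_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)\n",
+ "\n",
+ "# One importance per column; the values are non-negative and add up to 1\n",
+ "importances = ml_demo.feature_importances_\n",
+ "print(f'Sum of the importances: {importances.sum():.4f}')\n",
+ "\n",
+ "# Columns ranked from most to least important\n",
+ "for idx in np.argsort(importances)[::-1]:\n",
+ "    print(f'column {idx}: {importances[idx]:.3f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },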
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cnfDw_GEKBuu"
+ },
+ "source": [
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "\n",
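+ "# Instantiate... ('sqrt' is what max_features='auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3)\n",
+ "ml_RF= RandomForestClassifier(n_estimators=100, min_samples_split= 2, max_features=\"sqrt\", random_state= i_Seed)\n",
+ "\n",
+ "# Train...\n",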
+ "ml_RF.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lYa9oaZW__o6"
+ },
+ "source": [
+ "# Cross-Validation with 10 folds (cv = i_CV)\n",
+ "a_scores_CV = cross_val_score(ml_RF, X_treinamento, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean of the CV accuracies....: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std of the CV accuracies.....: {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AouWUu8vANdb"
+ },
+ "source": [
+ "**Interpretation**: Our classifier (RandomForestClassifier) has a mean cross-validated accuracy of 96.44% on the training set. The standard deviation across folds is about 2.77%, which is small. Let's try to improve the accuracy of the classifier with parameter tuning (GridSearchCV)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vbducxlgAa85"
+ },
+ "source": [
+ "print(f'Accuracies: {a_scores_CV}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_lxx-LUw_5sd"
+ },
+ "source": [
+ "# Make predictions...\n",
+ "y_pred = ml_RF.predict(X_teste)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pQIRO_LpGAkw"
+ },
+ "source": [
+ "# Confusion Matrix\n",
+ "cf_matrix = confusion_matrix(y_teste, y_pred)\n",
+ "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n",
+ "cf_categories = ['Zero', 'One']\n",
+ "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yKLHZ5_C6FJ8"
+ },
+ "source": [
+ "## Parameter tuning\n",
+ "### References\n",
+ "* [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)\n",
+ "* [Decision Tree Adventures 2 - Explanation of Decision Tree Classifier Parameters](https://medium.com/datadriveninvestor/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28) - A didactic, step-by-step explanation of how to do parameter tuning.\n",
+ "* [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) - Another approach to understanding parameter tuning. Strongly recommended reading!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XOa9naju6FKA"
+ },
+ "source": [
+ "# Parameter dictionary for the parameter tuning.\n",
+ "d_parametros_RF= {'bootstrap': [True, False]} #,\n",
+ "# 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],\n",
+ "# 'max_features': ['auto', 'sqrt'],\n",
+ "# 'min_samples_leaf': [1, 2, 4],\n",
+ "# 'min_samples_split': [2, 5, 10],\n",
+ "# 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6__f2jZaTQat"
+ },
+ "source": [
+ "# Call the function\n",
+ "ml_RF2, best_params = GridSearchOptimizer(ml_RF, 'ml_RF2', d_parametros_RF, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "crfn-n--KG4n"
+ },
+ "source": [
+ "### Result of running the Random Forest\n",
+ "\n",
+ "```\n",
+ "[Parallel(n_jobs=-1)]: Done 7920 out of 7920 | elapsed: 194.0min finished\n",
+ "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SGTOe5PaRw59"
+ },
+ "source": [
+ "# The grid search above took 194 minutes to run, so ml_RF2 is built below from the parameters it found ('auto' is written as 'sqrt', its equivalent for classifiers, because 'auto' was removed in scikit-learn 1.3)\n",
+ "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n",
+ "\n",
+ "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n",
+ " max_depth= best_params['max_depth'], \n",
+ " max_features= best_params['max_features'], \n",
+ " min_samples_leaf= best_params['min_samples_leaf'], \n",
+ " min_samples_split= best_params['min_samples_split'], \n",
+ " n_estimators= best_params['n_estimators'], \n",
+ " random_state= i_Seed)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HMJcAdLlTQa0"
+ },
+ "source": [
+ "## Visualize the result\n",
+ "> TODO: implement the visualization of the Random Forest."
+ ]
+ },
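+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One possible way to visualize a Random Forest is to draw one of its trees with `sklearn.tree.plot_tree`. The cell below is only a sketch under assumptions: it uses synthetic data, hypothetical names (`X_demo`, `y_demo`, `ml_demo`) and shallow trees (`max_depth=3`) so that the plot stays readable. To apply the same idea to this notebook's model, pass `estimators_[0]` of a forest that has already been fitted."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "from sklearn.tree import plot_tree\n",
+ "\n",
+ "# Synthetic data, used only for illustration (not the notebook's dataset)\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)\n",
+ "ml_demo = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X_demo, y_demo)\n",
+ "\n",
+ "# A forest has no single picture, so draw its first tree as an example\n",
+ "plt.figure(figsize=(14, 6))\n",
+ "plot_tree(ml_demo.estimators_[0], filled=True, class_names=['Zero', 'One'])\n",
+ "plt.title('First tree of the forest')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },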
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WWNiy7Z0TQa3"
+ },
+ "source": [
+ "## Select the important/relevant COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kOi11YOKTQa4"
+ },
+ "source": [
+ "X_treinamento_RF, X_teste_RF = seleciona_colunas_relevantes(ml_RF2, X_treinamento, X_teste)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Zn_O7c_DTQbE"
+ },
+ "source": [
+ "## Train the classifier with the relevant COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "UwEOwzSGTQbF"
+ },
+ "source": [
+ "best_params"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Rr8qDrgvTQbL"
+ },
+ "source": [
+ "# Train with the relevant COLUMNS...\n",
+ "ml_RF2.fit(X_treinamento_RF, y_treinamento)\n",
+ "\n",
+ "# Cross-Validation with 10 folds (cv = i_CV)\n",
+ "a_scores_CV = cross_val_score(ml_RF2, X_treinamento_RF, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean accuracy: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std..........: {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-mYfQLlsTQbQ"
+ },
+ "source": [
+ "## Validate the model using the X_teste dataframe"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sSD5o1JQTQbR"
+ },
+ "source": [
+ "y_pred_RF = ml_RF2.predict(X_teste_RF)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wywF6LymDzKr"
+ },
+ "source": [
+ "# Compute the accuracy\n",
+ "accuracy_score(y_teste, y_pred_RF)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hJJsL0IJb6iO"
+ },
+ "source": [
+ "## Study of how the algorithm's parameters behave\n",
+ "> See [Optimizing Hyperparameters in Random Forest Classification](https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6) for more details."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "navUWMwHi44D"
+ },
+ "source": [
+ "param_range = np.arange(1, 250, 2)\n",
+ "\n",
+ "# Calculate accuracy on training and test set using range of parameter values\n",
+ "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n",
+ " X_treinamento, \n",
+ " y_treinamento, \n",
+ " param_name=\"n_estimators\", \n",
+ " param_range = param_range, \n",
+ " cv = i_CV, \n",
+ " scoring = \"accuracy\", \n",
+ " n_jobs = -1)\n",
+ "\n",
+ "\n",
+ "# Calculate mean and standard deviation for training set a_scores_CV\n",
+ "train_mean = np.mean(train_a_scores_CV, axis = 1)\n",
+ "train_std = np.std(train_a_scores_CV, axis = 1)\n",
+ "\n",
+ "# Calculate mean and standard deviation for test set a_scores_CV\n",
+ "test_mean = np.mean(test_a_scores_CV, axis = 1)\n",
+ "test_std = np.std(test_a_scores_CV, axis = 1)\n",
+ "\n",
+ "# Plot mean accuracy a_scores_CV for training and test sets\n",
+ "plt.plot(param_range, train_mean, label = \"Training score\", color = \"black\")\n",
+ "plt.plot(param_range, test_mean, label = \"Cross-validation score\", color = \"dimgrey\")\n",
+ "\n",
+ "# Plot accuracy bands for training and test sets\n",
+ "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color = \"gray\")\n",
+ "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color = \"gainsboro\")\n",
+ "\n",
+ "# Create plot\n",
+ "plt.title(\"Validation Curve With Random Forest\")\n",
+ "plt.xlabel(\"Number Of Trees\")\n",
+ "plt.ylabel(\"Accuracy Score\")\n",
+ "plt.tight_layout()\n",
+ "plt.legend(loc = \"best\")\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rv7TIM9kjsud"
+ },
+ "source": [
+ "param_range = np.arange(1, 250, 2)\n",
+ "\n",
+ "# Calculate accuracy on training and test set using range of parameter values\n",
+ "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n",
+ " X_treinamento, \n",
+ " y_treinamento, \n",
+ " param_name = \"max_depth\", \n",
+ " param_range = param_range, \n",
+ " cv = i_CV, \n",
+ " scoring = \"accuracy\", \n",
+ " n_jobs = -1)\n",
+ "\n",
+ "# Calculate mean and standard deviation for training set a_scores_CV\n",
+ "train_mean = np.mean(train_a_scores_CV, axis = 1)\n",
+ "train_std = np.std(train_a_scores_CV, axis = 1)\n",
+ "\n",
+ "# Calculate mean and standard deviation for test set a_scores_CV\n",
+ "test_mean = np.mean(test_a_scores_CV, axis = 1)\n",
+ "test_std = np.std(test_a_scores_CV, axis = 1)\n",
+ "\n",
+ "# Plot mean accuracy a_scores_CV for training and test sets\n",
+ "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n",
+ "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n",
+ "\n",
+ "# Plot accuracy bands for training and test sets\n",
+ "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n",
+ "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n",
+ "\n",
+ "# Create plot\n",
+ "plt.title(\"Validation Curve With Random Forest\")\n",
+ "plt.xlabel(\"Maximum Tree Depth (max_depth)\")\n",
+ "plt.ylabel(\"Accuracy Score\")\n",
+ "plt.tight_layout()\n",
+ "plt.legend(loc=\"best\")\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lm_fPGYwkJYc"
+ },
+ "source": [
+ "param_range = np.arange(1, 250, 2)\n",
+ "\n",
+ "# Calculate accuracy on training and test set using range of parameter values\n",
+ "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n",
+ " X_treinamento, \n",
+ " y_treinamento, \n",
+ " param_name='min_samples_leaf', \n",
+ " param_range=param_range,\n",
+ " cv = i_CV, \n",
+ " scoring=\"accuracy\", \n",
+ " n_jobs=-1)\n",
+ "\n",
+ "\n",
+ "# Calculate mean and standard deviation for training set a_scores_CV\n",
+ "train_mean = np.mean(train_a_scores_CV, axis = 1)\n",
+ "train_std = np.std(train_a_scores_CV, axis = 1)\n",
+ "\n",
+ "# Calculate mean and standard deviation for test set a_scores_CV\n",
+ "test_mean = np.mean(test_a_scores_CV, axis = 1)\n",
+ "test_std = np.std(test_a_scores_CV, axis = 1)\n",
+ "\n",
+ "# Plot mean accuracy a_scores_CV for training and test sets\n",
+ "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n",
+ "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n",
+ "\n",
+ "# Plot accuracy bands for training and test sets\n",
+ "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n",
+ "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n",
+ "\n",
+ "# Create plot\n",
+ "plt.title(\"Validation Curve With Random Forest\")\n",
+ "plt.xlabel(\"Minimum Samples Per Leaf (min_samples_leaf)\")\n",
+ "plt.ylabel(\"Accuracy Score\")\n",
+ "plt.tight_layout()\n",
+ "plt.legend(loc=\"best\")\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CAqdiSaVlAB8"
+ },
+ "source": [
+ "param_range = np.arange(0.05, 1, 0.05)\n",
+ "\n",
+ "# Calculate accuracy on training and test set using range of parameter values\n",
+ "train_a_scores_CV, test_a_scores_CV = validation_curve(RandomForestClassifier(), \n",
+ " X_treinamento, \n",
+ " y_treinamento, \n",
+ " param_name='min_samples_split', \n",
+ " param_range=param_range,\n",
+ " cv = i_CV, \n",
+ " scoring=\"accuracy\", \n",
+ " n_jobs=-1)\n",
+ "\n",
+ "\n",
+ "# Calculate mean and standard deviation for training set a_scores_CV\n",
+ "train_mean = np.mean(train_a_scores_CV, axis = 1)\n",
+ "train_std = np.std(train_a_scores_CV, axis = 1)\n",
+ "\n",
+ "# Calculate mean and standard deviation for test set a_scores_CV\n",
+ "test_mean = np.mean(test_a_scores_CV, axis = 1)\n",
+ "test_std = np.std(test_a_scores_CV, axis = 1)\n",
+ "\n",
+ "# Plot mean accuracy a_scores_CV for training and test sets\n",
+ "plt.plot(param_range, train_mean, label=\"Training score\", color=\"black\")\n",
+ "plt.plot(param_range, test_mean, label=\"Cross-validation score\", color=\"dimgrey\")\n",
+ "\n",
+ "# Plot accuracy bands for training and test sets\n",
+ "plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color=\"gray\")\n",
+ "plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color=\"gainsboro\")\n",
+ "\n",
+ "# Create plot\n",
+ "plt.title(\"Validation Curve With Random Forest\")\n",
+ "plt.xlabel(\"Minimum Fraction Of Samples To Split (min_samples_split)\")\n",
+ "plt.ylabel(\"Accuracy Score\")\n",
+ "plt.tight_layout()\n",
+ "plt.legend(loc=\"best\")\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cX_gfsbQSdNd"
+ },
+ "source": [
+ "___\n",
+ "# **BOOSTING MODELS**\n",
+ "* These algorithms are widely used in Kaggle competitions;\n",
+ "* They are used to improve the performance of Machine Learning algorithms;\n",
+ "* Models:\n",
+ " - [X] AdaBoost\n",
+ " - [X] XGBoost\n",
+ " - [X] LightGBM\n",
+ " - [X] GradientBoosting\n",
+ " - [X] CatBoost\n",
+ "\n",
+ "## Bagging vs Boosting vs Stacking\n",
+ "### **Bagging**\n",
+ "* The goal is to reduce variance;\n",
+ "\n",
+ "#### How it works\n",
+ "* Draws several samples **WITH REPLACEMENT** from the training dataframe. Each sample is used to train one model (a Decision Tree). The result is an ensemble of many different models (Decision Trees). Their predictions are averaged (or combined by majority vote, for classification) to produce the final result;\n",
+ "* The final result is more robust than that of a single Decision Tree.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm](https://hackernoon.com/how-to-develop-a-robust-algorithm-c38e08f32201).\n",
+ "\n",
+ "#### Steps\n",
+ "* Assume a dataframe X_treinamento (the training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n",
+ " 1. Bagging randomly draws a sample **WITH REPLACEMENT** from X_treinamento;\n",
+ " 2. M2 (M2 < M) COLUMNS are randomly selected from the dataframe drawn in step (1). This column subsampling is what Random Forest adds on top of plain Bagging;\n",
+ " 3. A Decision Tree is built with the M2 COLUMNS from step (2) and the dataframe from step (1), and the COLUMNS are evaluated by their ability to classify the observations;\n",
+ " 4. Steps (1) --> (2) --> (3) are repeated K times (that is, K Decision Trees), so that the COLUMNS are ranked by their predictive power and the final result (accuracy, for example) is obtained by aggregating the predictions of the K Decision Trees.\n",
+ "\n",
+ "#### Advantages\n",
+ "* Reduces overfitting;\n",
+ "* Handles dataframes with many COLUMNS well (high dimensionality);\n",
+ "* Can handle Missing Values, depending on the implementation;\n",
+ "\n",
+ "#### Disadvantage\n",
+ "* The final prediction is based on the average of the K Decision Trees, which can compromise the final accuracy.\n",
+ "\n",
+ "___ \n",
+ "### **Boosting**\n",
+ "* The goal is to improve accuracy;\n",
+ "\n",
+ "#### How it works\n",
+ "* The classifiers are trained sequentially, so that the classifier at step N learns from the errors of the classifier at step N-1. In other words, the goal is to improve accuracy at each step by learning from the previous ones.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205), Joseph Rocca.\n",
+ "\n",
+ "\n",
+ "#### Steps\n",
+ "* Assume a dataframe X_treinamento (the training dataframe) with N observations (instances, points, rows) and M COLUMNS (features, attributes).\n",
+ " 1. Boosting randomly draws a sample D1 WITHOUT replacement from X_treinamento;\n",
+ " 2. Boosting trains classifier C1 on D1;\n",
+ " 3. Boosting randomly draws a SECOND sample D2 WITHOUT replacement from X_treinamento, adds to D2 50% of the observations that were misclassified, and trains classifier C2;\n",
+ " 4. Boosting finds in X_treinamento the sample D3 on which classifiers C1 and C2 disagree, and trains C3 on it;\n",
+ " 5. The predictions of C1, C2 and C3 are combined (by vote) to produce the final result.\n",
+ "\n",
+ "#### Advantages\n",
+ "* Handles dataframes with many COLUMNS well (high dimensionality);\n",
+ "* Can handle Missing Values, depending on the implementation;\n",
+ "\n",
+ "#### Disadvantages\n",
+ "* Prone to overfitting. Treating outliers beforehand is recommended;\n",
+ "* Requires careful tuning of the hyperparameters."
+ ]
+ },
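+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The difference between the two families described above can be tried out with a minimal sketch. It assumes synthetic data from `make_classification` and hypothetical names (`X_demo`, `y_demo`, `models`); it is not this notebook's dataset, and the scores it prints say nothing about which method is better in general. `BaggingClassifier` trains independent trees on bootstrap samples, while `AdaBoostClassifier` trains its weak learners in sequence. The base tree is passed positionally to `BaggingClassifier` because the keyword name changed between scikit-learn versions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "\n",
+ "# Synthetic data, used only for illustration (not the notebook's dataset)\n",
+ "X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)\n",
+ "\n",
+ "models = {\n",
+ "    'Single Decision Tree': DecisionTreeClassifier(random_state=0),\n",
+ "    # Bagging: independent trees, each fitted on rows drawn with replacement\n",
+ "    'Bagging (50 trees)': BaggingClassifier(DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0),\n",
+ "    # Boosting: weak learners fitted in sequence, each reweighting the previous errors\n",
+ "    'AdaBoost (50 learners)': AdaBoostClassifier(n_estimators=50, random_state=0),\n",
+ "}\n",
+ "\n",
+ "for name, model in models.items():\n",
+ "    scores = cross_val_score(model, X_demo, y_demo, cv=5)\n",
+ "    print(f'{name}: {100*scores.mean():.2f} +/- {100*scores.std():.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },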
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9fgUrkmPk4dr"
+ },
+ "source": [
+ "___\n",
+ "# STACKING\n",
+ "\n",
+ "\n",
+ "\n",
+ "TODO: add the reference for this figure."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B0jxx3ETpOdm"
+ },
+ "source": [
+ "___\n",
+ "# **BOOTSTRAPPING METHODS**\n",
+ "> Before discussing Boosting or Bagging, we first need to understand Bootstrap: Bagging (Bootstrap AGGregatING) is built directly on it, and some Boosting variants also resample the training data.\n",
+ "\n",
+ "* In Statistics (and in Machine Learning), Bootstrap refers to drawing random samples WITH replacement from the population X."
+ ]
+ },
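+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Sampling with replacement, as described above, can be sketched with NumPy. The array `population` and the other names are hypothetical, used only for this illustration. Because rows can repeat, a bootstrap sample of the same size as the original data contains, on average, about 63.2% (1 - 1/e) of the distinct rows; the rows left out are what Bagging methods call the out-of-bag observations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "rng = np.random.default_rng(0)\n",
+ "\n",
+ "# Hypothetical population of 10 row indices\n",
+ "population = np.arange(10)\n",
+ "\n",
+ "# Bootstrap sample: same size as the population, drawn WITH replacement\n",
+ "sample = rng.choice(population, size=population.size, replace=True)\n",
+ "print('Bootstrap sample:', sample)\n",
+ "print('Distinct rows...:', np.unique(sample).size, 'of', population.size)\n",
+ "\n",
+ "# With many rows, the fraction of distinct rows approaches 1 - 1/e\n",
+ "n = 10_000\n",
+ "fraction = np.unique(rng.integers(0, n, size=n)).size / n\n",
+ "print(f'Fraction of distinct rows with n={n}: {fraction:.3f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },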
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SyqazmUuifkE"
+ },
+ "source": [
+ "___\n",
+ "# **ADABOOST (Adaptive Boosting)**\n",
+ "* Rule of thumb: when nothing else works, AdaBoost often does!\n",
+ "* It was one of the first Boosting algorithms (1995);\n",
+ "* AdaBoost can be used for both classification (AdaBoostClassifier) and regression (AdaBoostRegressor);\n",
+ "* By default, AdaBoost uses a Decision Tree as its base_estimator (for AdaBoostClassifier, a one-level tree, also called a decision stump)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RU-vzkXqrFVw"
+ },
+ "source": [
+ "## References\n",
+ "* [AdaBoost Classifier Example In Python](https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464) - Didactic; explains exactly how AdaBoost works.\n",
+ "* [Adaboost for Dummies: Breaking Down the Math (and its Equations) into Simple Terms](https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf) - For those who want to understand the math behind the algorithm.\n",
+ "* [Gradient Boosting and XGBoost](https://medium.com/hackernoon/gradient-boosting-and-xgboost-90862daa6c77)\n",
+ "* [Understanding AdaBoost](https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe), Akash Desarda."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6EMrjQDZIMl_"
+ },
+ "source": [
+ "## What is AdaBoost (Adaptive Boosting)?\n",
+ "* It is an ensemble classifier (it combines several classifiers to increase accuracy).\n",
+ "* AdaBoost is an iterative, strong classifier that combines (ensembles) several weak classifiers to improve accuracy.\n",
+ "* Any machine learning algorithm that supports sample weights in `fit` can be used as the base classifier (parameter base_estimator);\n",
+ "\n",
+ "## Most important AdaBoost parameters:\n",
+ "* base_estimator - The classifier used to train the model. By default, AdaBoost uses DecisionTreeClassifier. As mentioned above, other algorithms can be used for this purpose. In scikit-learn 1.2 and later this parameter is named `estimator`.\n",
+ "* n_estimators - Number of base_estimator instances to train iteratively.\n",
+ "* learning_rate - Controls the contribution of each base_estimator to the final solution/combination."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TzLtHzWNJBix"
+ },
+ "source": [
+ "## Using different algorithms for base_estimator\n",
+ "> As mentioned above, several types of base_estimator can be used in AdaBoost. For example, to use SVM (Support Vector Machines), proceed as follows:\n",
+ "\n",
+ "\n",
+ "```\n",
+ "# Import AdaBoost and the class used as base_estimator\n",
+ "from sklearn.ensemble import AdaBoostClassifier\n",
+ "from sklearn.svm import SVC\n",
+ "\n",
+ "# Instantiate the base classifier (algorithm)\n",
+ "ml_SVC= SVC(probability=True, kernel='linear')\n",
+ "\n",
+ "# Build the AdaBoost model (in scikit-learn >= 1.2, pass estimator= instead of base_estimator=)\n",
+ "ml_AB = AdaBoostClassifier(n_estimators= 50, base_estimator=ml_SVC, learning_rate=1)\n",
+ "```\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hrj4a4s6hMMB"
+ },
+ "source": [
+ "## Advantages\n",
+ "* AdaBoost is easy to implement;\n",
+ "* AdaBoost iteratively corrects the errors of the base_estimator and improves accuracy;\n",
+ "* Performs Feature Selection automatically (**Why**?);\n",
+ "* Many algorithms can be used as base_estimator;\n",
+ "* Because it is an ensemble method, the final model is not very prone to overfitting (but see the note on outliers below).\n",
+ "\n",
+ "## Disadvantages\n",
+ "* AdaBoost is sensitive to noise in the data;\n",
+ "* Strongly affected by outliers (which contributes to overfitting), because the algorithm tries to fit every point as well as possible;\n",
+ "* AdaBoost is slower than XGBoost."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bgJmu7YLiyv7"
+ },
+ "source": [
+ "In the following example, RandomForestClassifier with the optimized parameters is used as the base estimator, that is:\n",
+ "\n",
+ "```\n",
+ "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}\n",
+ "```\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5VCRNyZT3qvc"
+ },
+ "source": [
+ "best_params= {'bootstrap': False, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}  # 'sqrt' replaces 'auto' (same meaning for classifiers; 'auto' was removed in scikit-learn 1.3)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1gIboJdriq61"
+ },
+ "source": [
+ "from sklearn.ensemble import AdaBoostClassifier\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "\n",
+ "# Instantiate RandomForestClassifier - optimized parameters!\n",
+ "ml_RF2= RandomForestClassifier(bootstrap= best_params['bootstrap'], \n",
+ " max_depth= best_params['max_depth'], \n",
+ " max_features= best_params['max_features'], \n",
+ " min_samples_leaf= best_params['min_samples_leaf'], \n",
+ " min_samples_split= best_params['min_samples_split'], \n",
+ " n_estimators= best_params['n_estimators'], \n",
+ " random_state= i_Seed)\n",
+ "# Instantiate AdaBoostClassifier (in scikit-learn >= 1.2 the argument is named estimator instead of base_estimator)\n",
+ "ml_AB= AdaBoostClassifier(n_estimators=100, base_estimator= ml_RF2, random_state= i_Seed)\n",
+ "\n",
+ "# Train...\n",
+ "ml_AB.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "A4Cs81OLD40y"
+ },
+ "source": [
+ "# Cross-Validation with 10 folds (cv = i_CV)\n",
+ "a_scores_CV = cross_val_score(ml_AB, X_treinamento, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean of the CV accuracies....: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std of the CV accuracies.....: {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "F7Ce5L38ECoC"
+ },
+ "source": [
+ "**Interpretation**: Our classifier (AdaBoostClassifier) has a mean cross-validated accuracy of 96.72% on the training set. The standard deviation across folds is about 2.54%, which is small. Let's try to improve the accuracy of the classifier with parameter tuning (GridSearchCV)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "t5GfnBwEifkO"
+ },
+ "source": [
+ "print(f'Accuracies: {a_scores_CV}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Q9rSpuXyEPA5"
+ },
+ "source": [
+ "# Make predictions with the optimized parameters...\n",
+ "y_pred = ml_AB.predict(X_teste)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2F9k-_eXGDLa"
+ },
+ "source": [
+ "# Confusion Matrix\n",
+ "cf_matrix = confusion_matrix(y_teste, y_pred)\n",
+ "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n",
+ "cf_categories = ['Zero', 'One']\n",
+ "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XweWTjQ9EXLw"
+ },
+ "source": [
+ "## Parameter tuning"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fcrKzse9EbL_"
+ },
+ "source": [
+ "# Parameter dictionary for the parameter tuning.\n",
+ "d_parametros_AB = {'n_estimators':[50, 100, 200], 'learning_rate':[.001, 0.01, 0.05, 0.1, 0.3,1]}"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Susc3I7mFDQX"
+ },
+ "source": [
+ "# Call the function\n",
+ "ml_AB2, best_params= GridSearchOptimizer(ml_AB, 'ml_AB2', d_parametros_AB, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "w4JjWsusjNS8"
+ },
+ "source": [
+ "___\n",
+ "# **GRADIENT BOOSTING**\n",
+ "* Gradient Boosting can be used for classification (GradientBoostingClassifier) and regression (GradientBoostingRegressor) problems;\n",
+ "* Gradient Boosting is a refinement of AdaBoost (recall that AdaBoost was one of the first Boosting methods, created in 1995). What Gradient Boosting adds is the explicit minimization of a loss function, i.e., a measure of the difference between the observed values of y and the predicted values.\n",
+ "* It uses Gradient Descent to find the shortcomings in the previous step's predictions: each new tree is fitted to the negative gradient of the loss (the residuals, in the squared-error case). Gradient Descent is a popular and powerful algorithm, also used in Neural Networks;\n",
+ "* The goal of Gradient Boosting is to minimize the loss function. Therefore, Gradient Boosting depends on the choice of loss function.\n",
+ "* Gradient Boosting uses Decision Trees (regression trees) as its weak learners;\n",
+ "\n",
+ "## Advantages\n",
+ "* Requires little preprocessing (no feature scaling, for example);\n",
+ "* Works with numeric or categorical COLUMNS (in scikit-learn, categorical columns must first be encoded as numbers);\n",
+ "* Some implementations (XGBoost, LightGBM, scikit-learn's HistGradientBoostingClassifier) handle Missing Values natively, so no Missing Value Imputation is needed; scikit-learn's GradientBoostingClassifier, used below, typically expects the missing values to be imputed first;\n",
+ "\n",
+ "## Disadvantages\n",
+ "* Because Gradient Boosting continually tries to minimize the errors at each iteration, it can overemphasize outliers and cause overfitting. Therefore, one should:\n",
+ " * Treat the outliers beforehand OR\n",
+ " * Use Cross-Validation to detect and limit the effect of the outliers (**I prefer this method, because it takes less time**);\n",
+ "* Computationally expensive. Many trees (> 1000) are often needed to get good results;\n",
+ "* Because of its flexibility (many parameters to tune), a search such as GridSearchCV is needed to find the best combination of hyperparameters;\n",
+ "\n",
+ "## References\n",
+ "* [Gradient Boosting Decision Tree Algorithm Explained](https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4) - Didactic and detailed.\n",
+ "* [Predicting Wine Quality with Gradient Boosting Machines](https://towardsdatascience.com/predicting-wine-quality-with-gradient-boosting-machines-a-gmb-tutorial-d950b1542065)\n",
+ "* [Parameter Tuning in Gradient Boosting (GBM) with Python](https://www.datacareer.de/blog/parameter-tuning-in-gradient-boosting-gbm/)\n",
+ "* [Tune Learning Rate for Gradient Boosting with XGBoost in Python](https://machinelearningmastery.com/tune-learning-rate-for-gradient-boosting-with-xgboost-in-python/)\n",
+ "* [In Depth: Parameter tuning for Gradient Boosting](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae) - Very good\n",
+ "* [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)"
+ ]
+ },
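+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The idea of fitting each new tree to the errors of the previous step can be sketched by hand for a regression problem with squared-error loss. This is a simplified illustration under assumptions, not the implementation inside GradientBoostingClassifier: the data are synthetic and the names (`X_demo`, `y_demo`, `pred`, `residuals`) are hypothetical. For the loss 0.5 * (y - pred)**2, the negative gradient with respect to the prediction is simply the residual y - pred, so each tree is trained on the current residuals and added with a small learning rate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.tree import DecisionTreeRegressor\n",
+ "\n",
+ "# Synthetic regression data, used only for illustration\n",
+ "rng = np.random.default_rng(0)\n",
+ "X_demo = rng.uniform(-3, 3, size=(200, 1))\n",
+ "y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)\n",
+ "\n",
+ "learning_rate = 0.1\n",
+ "\n",
+ "# Step 0: start from a constant prediction (the mean of y)\n",
+ "pred = np.full_like(y_demo, y_demo.mean())\n",
+ "print(f'Training MSE at the start...: {np.mean((y_demo - pred) ** 2):.4f}')\n",
+ "\n",
+ "for m in range(100):\n",
+ "    # Negative gradient of 0.5 * (y - pred)**2 with respect to pred\n",
+ "    residuals = y_demo - pred\n",
+ "    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)\n",
+ "    # Move the predictions a small step in the direction that reduces the loss\n",
+ "    pred += learning_rate * tree.predict(X_demo)\n",
+ "\n",
+ "print(f'Training MSE after 100 trees: {np.mean((y_demo - pred) ** 2):.4f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },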
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Q4bUCZs2jNTA"
+ },
+ "source": [
+ "from sklearn.ensemble import GradientBoostingClassifier\n",
+ "\n",
+ "# Instantiate...\n",
+ "ml_GB=GradientBoostingClassifier(n_estimators=100, min_samples_split= 2)\n",
+ "\n",
+ "# Train...\n",
+ "ml_GB.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-dr6dyjdXwvd"
+ },
+ "source": [
+ "# Cross-Validation with 10 folds (cv = i_CV)\n",
+ "a_scores_CV = cross_val_score(ml_GB, X_treinamento, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean of the CV accuracies....: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std of the CV accuracies.....: {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VlC3y3M5YaGG"
+ },
+ "source": [
+ "print(f'Accuracies: {a_scores_CV}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vnLvQ0ZDYNjB"
+ },
+ "source": [
+ "**Interpretation**: Our classifier (GradientBoostingClassifier) has a mean cross-validated accuracy of 96.86% on the training set. The standard deviation across folds is about 2.52%, which is small. Let's try to improve the accuracy of the classifier with parameter tuning (GridSearchCV)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D2n1RKZuXq3D"
+ },
+ "source": [
+ "# Make predictions...\n",
+ "y_pred = ml_GB.predict(X_teste)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8r6JCzQRGFa0"
+ },
+ "source": [
+ "# Confusion Matrix\n",
+ "cf_matrix = confusion_matrix(y_teste, y_pred)\n",
+ "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n",
+ "cf_categories = ['Zero', 'One']\n",
+ "mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KFv-Q2AD5uCk"
+ },
+ "source": [
+ "## Parameter tuning\n",
+ "> See [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/) for details on the parameters and their meaning."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wgU040AcjNTF"
+ },
+ "source": [
+ "# Parameter grid for tuning. Only learning_rate is active; the other entries are commented out.\n",
+ "d_parametros_GB= {'learning_rate': [1, 0.5, 0.25, 0.1, 0.05, 0.01]} #,\n",
+ "# 'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200],\n",
+ "# 'max_depth': [5, 10, 15, 20, 25, 30],\n",
+ "# 'min_samples_split': [0.1, 0.3, 0.5, 0.7, 0.9],\n",
+ "# 'min_samples_leaf': [0.1, 0.2, 0.3, 0.4, 0.5],\n",
+ "# 'max_features': list(range(1, X_treinamento.shape[1]))}"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "v5KLFlpTjNTH"
+ },
+ "source": [
+ "# Call the grid-search helper function\n",
+ "ml_GB2, best_params= GridSearchOptimizer(ml_GB, 'ml_GB2', d_parametros_GB, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YQ6ERz3fi9i2"
+ },
+ "source": [
+ "### Result of the Gradient Boosting run"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RSa7uKw13mKG"
+ },
+ "source": [
+ "```\n",
+ "[Parallel(n_jobs=-1)]: Done 275400 out of 275400 | elapsed: 93.7min finished\n",
+ "\n",
+ "Parametros otimizados: {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n",
+ "```\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "wiJpA2PyjDjR"
+ },
+ "source": [
+ "# The search above took about 93 minutes to run, so ml_GB2 is rebuilt below directly from the parameters it found\n",
+ "best_params= {'learning_rate': 1, 'max_depth': 30, 'max_features': 11, 'min_samples_leaf': 0.1, 'min_samples_split': 0.1, 'n_estimators': 100}\n",
+ "\n",
+ "#ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n",
+ "# max_depth= best_params['max_depth'],\n",
+ "# max_features= best_params['max_features'],\n",
+ "# min_samples_leaf= best_params['min_samples_leaf'],\n",
+ "# min_samples_split= best_params['min_samples_split'],\n",
+ "# n_estimators= best_params['n_estimators'],\n",
+ "# random_state= i_Seed)\n",
+ "\n",
+ "ml_GB2= GradientBoostingClassifier(learning_rate= best_params['learning_rate'], \n",
+ " max_depth= best_params['max_depth'],\n",
+ " min_samples_leaf= best_params['min_samples_leaf'],\n",
+ " min_samples_split= best_params['min_samples_split'],\n",
+ " n_estimators= best_params['n_estimators'],\n",
+ " random_state= i_Seed)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mb14gJ7-jbVM"
+ },
+ "source": [
+ "## Select the important/relevant COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TAqGZIFYm2sU"
+ },
+ "source": [
+ "X_treinamento_GB, X_teste_GB = seleciona_colunas_relevantes(ml_GB2, X_treinamento, X_teste)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6yiu6dahnBvC"
+ },
+ "source": [
+ "## Train the classifier with the relevant COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "APrtWN18nc4t"
+ },
+ "source": [
+ "best_params"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VS0mLdOmnXAY"
+ },
+ "source": [
+ "# Train with the relevant COLUMNS\n",
+ "ml_GB2.fit(X_treinamento_GB, y_treinamento)\n",
+ "\n",
+ "# Cross-validation with 10 folds (cv = i_CV)\n",
+ "a_scores_CV = cross_val_score(ml_GB2, X_treinamento_GB, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean accuracy (%)..: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std of accuracy (%): {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vmc9PP_Rn1TN"
+ },
+ "source": [
+ "## Validate the model using the X_teste dataframe"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "e3mnIALvnzP2"
+ },
+ "source": [
+ "y_pred_GB = ml_GB2.predict(X_teste_GB)\n",
+ "\n",
+ "# Compute the accuracy on the test set\n",
+ "accuracy_score(y_teste, y_pred_GB)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kwP9Z2GnkV7r"
+ },
+ "source": [
+ "___\n",
+ "# **XGBOOST (eXtreme Gradient Boosting)**\n",
+ "* XGBoost is an improvement on Gradient Boosting. The gains are in speed and performance, and it addresses inefficiencies of GradientBoosting.\n",
+ "* A favorite algorithm among Kaggle Grandmasters;\n",
+ "* Parallelizable;\n",
+ "* State of the art for many Machine Learning problems on tabular data;\n",
+ "\n",
+ "## Relevant parameters and their starting values\n",
+ "See [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) for full details on the parameters and their meaning.\n",
+ "\n",
+ "* n_estimators = 100 (use 100 if the dataframe is large; if it is medium/small, use 1000) - The number of trees to build;\n",
+ "* max_depth = 3 - How deep each tree may grow in any training round. Typical values are in the range [3, 10];\n",
+ "* learning_rate = 0.01 - Shrinks the contribution of each tree, which helps avoid overfitting. Range: [0, 1];\n",
+ "* alpha (reg_alpha in the scikit-learn API) - L1 regularization on the weights. Higher values mean more regularization;\n",
+ "* lambda (reg_lambda in the scikit-learn API) - L2 regularization on the weights;\n",
+ "* colsample_bytree = 1 - Fraction of COLUMNS sampled for each tree. A high value can cause overfitting;\n",
+ "* subsample = 0.8 - Fraction of samples used for each tree. Lower values make the model more conservative and help prevent overfitting, but values that are too low can lead to underfitting;\n",
+ "* gamma = 1 - Controls whether a given node is split, based on the expected loss reduction after the split. A higher value leads to fewer splits;\n",
+ "* objective - Defines the loss function. The options include:\n",
+ "  * reg:squarederror (named reg:linear in older XGBoost versions) - For regression problems;\n",
+ "  * reg:logistic - Logistic regression;\n",
+ "  * binary:logistic - For binary classification problems, returning probabilities;\n",
+ "\n",
+ "# References\n",
+ "* [How exactly XGBoost Works?](https://medium.com/@pushkarmandot/how-exactly-xgboost-works-a320d9b8aeef)\n",
+ "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n",
+ "* [Gentle Introduction of XGBoost Library](https://medium.com/@imoisharma/gentle-introduction-of-xgboost-library-2b1ac2669680)\n",
+ "* [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7)\n",
+ "* [Exploring XGBoost](https://towardsdatascience.com/exploring-xgboost-4baf9ace0cf6)\n",
+ "* [Feature Importance and Feature Selection With XGBoost in Python](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/)\n",
+ "* [Ensemble Learning case study: Running XGBoost on Google Colab free GPU](https://towardsdatascience.com/running-xgboost-on-google-colab-free-gpu-a-case-study-841c90fef101) - Recommended\n",
+ "* [Predicting movie revenue with AdaBoost, XGBoost and LightGBM](https://towardsdatascience.com/predicting-movie-revenue-with-adaboost-xgboost-and-lightgbm-262eadee6daa)\n",
+ "* [Tuning XGBoost Hyperparameters with Scikit Optimize](https://towardsdatascience.com/how-to-improve-the-performance-of-xgboost-models-1af3995df8ad)\n",
+ "* [An Example of Hyperparameter Optimization on XGBoost, LightGBM and CatBoost using Hyperopt](https://towardsdatascience.com/an-example-of-hyperparameter-optimization-on-xgboost-lightgbm-and-catboost-using-hyperopt-12bc41a271e) - Interesting\n",
+ "* [XGBOOST vs LightGBM: Which algorithm wins the race !!!](https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d) - LightGBM has been showing promising results.\n",
+ "* [From Zero to Hero in XGBoost Tuning](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58) - I liked this one\n",
+ "* [Build XGBoost / LightGBM models on large datasets — what are the possible solutions?](https://towardsdatascience.com/build-xgboost-lightgbm-models-on-large-datasets-what-are-the-possible-solutions-bf882da2c27d)\n",
+ "* [Selecting Optimal Parameters for XGBoost Model Training](https://towardsdatascience.com/selecting-optimal-parameters-for-xgboost-model-training-c7cd9ed5e45e) - Very good!\n",
+ "* [CatBoost vs. Light GBM vs. XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db)\n"
+ ]
+ },
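+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a rough guide to how the parameters listed above enter the model (a sketch in standard gradient-boosting notation; the symbols $F_m$, $h_m$, $T$ and $w_j$ are introduced here for illustration and are not used elsewhere in this notebook), each boosting round adds a new tree $h_m$ scaled by the learning rate $\\eta$:\n",
+ "\n",
+ "$$F_m(x) = F_{m-1}(x) + \\eta \\, h_m(x)$$\n",
+ "\n",
+ "and each tree with $T$ leaves and leaf weights $w_j$ is penalized by\n",
+ "\n",
+ "$$\\Omega(h) = \\gamma T + \\frac{1}{2} \\lambda \\sum_{j=1}^{T} w_j^2 + \\alpha \\sum_{j=1}^{T} |w_j|$$\n",
+ "\n",
+ "Read this way, a smaller $\\eta$ usually calls for more trees (n_estimators), a larger $\\gamma$ discourages extra leaves, and $\\lambda$ and $\\alpha$ shrink the leaf weights."
+ ]
+ },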
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iMM_R4_ukV7x"
+ },
+ "source": [
+ "from xgboost import XGBClassifier\n",
+ "import xgboost as xgb\n",
+ "\n",
+ "# Instantiate the classifier (verbosity replaces the deprecated silent flag)\n",
+ "ml_XGB= XGBClassifier(verbosity=1, \n",
+ " scale_pos_weight=1,\n",
+ " learning_rate=0.01, \n",
+ " colsample_bytree = 1,\n",
+ " subsample = 0.8,\n",
+ " objective='binary:logistic', \n",
+ " n_estimators=1000, \n",
+ " reg_alpha = 0.3,\n",
+ " max_depth= 3, \n",
+ " gamma=1, \n",
+ " max_delta_step=5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E4wQMlDEFINR"
+ },
+ "source": [
+ "# Train on the training set\n",
+ "ml_XGB.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zAhsTtwGqMkG"
+ },
+ "source": [
+ "# Cross-validation with 10 folds (cv = i_CV)\n",
+ "a_scores_CV = cross_val_score(ml_XGB, X_treinamento, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean cross-validation accuracy (%)..: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std of cross-validation accuracy (%): {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JNyKX6PkrXOk"
+ },
+ "source": [
+ "**Interpretation**: Our classifier (XGBClassifier) has a mean cross-validated accuracy of 96.72% on the training set, with a standard deviation of about 2.02%, which is small. Next, we try to improve the accuracy through parameter tuning (GridSearchCV)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_h0QYv3FkV73"
+ },
+ "source": [
+ "print(f'Accuracies: {a_scores_CV}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AKhhAZLjkV76"
+ },
+ "source": [
+ "# Make predictions on the test set\n",
+ "y_pred = ml_XGB.predict(X_teste)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Ir2Kd1PqGHgz"
+ },
+ "source": [
+ "# Confusion Matrix\n",
+ "cf_matrix = confusion_matrix(y_teste, y_pred)\n",
+ "cf_labels = ['True_Negative','False_Positive','False_Negative','True_Positive']\n",
+ "cf_categories = ['Zero', 'One']\n",
+ "mostra_confusion_matrix(cf_matrix, group_names= cf_labels, categories= cf_categories)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jEC7gW4qYpWw"
+ },
+ "source": [
+ "## Parameter tuning\n",
+ "### Further reading:\n",
+ "* [Fine-tuning XGBoost in Python like a boss](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e)\n",
+ "* [Complete Guide to Parameter Tuning in XGBoost with codes in Python](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)\n",
+ "\n",
+ "> Looking at the results above, which is the best model?\n",
+ "\n",
+ "XGBoost? Assuming it is, we now fine-tune the model's parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "n3MsUONPwIV9"
+ },
+ "source": [
+ "# Parameter grid for XGBoost. Only min_child_weight is active; the other entries are commented out.\n",
+ "d_parametros_XGB = {'min_child_weight': [i for i in np.arange(1, 13)]} #,\n",
+ "# 'gamma': [i for i in np.arange(0, 5, 0.5)],\n",
+ "# 'subsample': [0.6, 0.8, 1.0],\n",
+ "# 'colsample_bytree': [0.6, 0.8, 1.0],\n",
+ "# 'max_depth': [3, 4, 5, 7, 9],\n",
+ "# 'learning_rate': [i for i in np.arange(0.01, 1, 0.1)]}"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CX27FCKmwSni"
+ },
+ "source": [
+ "# Call the grid-search helper function; the tuned model is stored as ml_XGB2, mirroring ml_GB2 above\n",
+ "ml_XGB2, best_params= GridSearchOptimizer(ml_XGB, 'ml_XGB2', d_parametros_XGB, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9b7uCuF74Hjv"
+ },
+ "source": [
+ "### Result of the XGBClassifier run\n",
+ "\n",
+ "```\n",
+ "[Parallel(n_jobs=-1)]: Done 108000 out of 108000 | elapsed: 372.0min finished\n",
+ "\n",
+ "Parametros otimizados: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n",
+ "```\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "n7E0oyxEtbGi"
+ },
+ "source": [
+ "# The search above took about 372 minutes to run, so ml_XGB2 is rebuilt below directly from the parameters it found\n",
+ "best_params= {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.51, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}\n",
+ "\n",
+ "ml_XGB2= XGBClassifier(min_child_weight= best_params['min_child_weight'], \n",
+ " gamma= best_params['gamma'], \n",
+ " subsample= best_params['subsample'], \n",
+ " colsample_bytree= best_params['colsample_bytree'], \n",
+ " max_depth= best_params['max_depth'], \n",
+ " learning_rate= best_params['learning_rate'], \n",
+ " random_state= i_Seed)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CuqyLHTU5Z-j"
+ },
+ "source": [
+ "## Select the important/relevant COLUMNS\n",
+ "* [The Multiple faces of ‘Feature importance’ in XGBoost](https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QPG3JZIpRZ-T"
+ },
+ "source": [
+ "# plot feature importance\n",
+ "from xgboost import plot_importance\n",
+ "\n",
+ "xgb.plot_importance(ml_XGB2, color = 'red')\n",
+ "plt.title('importance', fontsize = 20)\n",
+ "plt.yticks(fontsize = 10)\n",
+ "plt.ylabel('features', fontsize = 20)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "EmpRC2lHW-KP"
+ },
+ "source": [
+ "ml_XGB2"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4f9MIEBiyq-5"
+ },
+ "source": [
+ "X_treinamento_XGB, X_teste_XGB= seleciona_colunas_relevantes(ml_XGB2, X_treinamento, X_teste)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "F6EayWaY5nMm"
+ },
+ "source": [
+ "## Train the classifier with the relevant COLUMNS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Huy18gKI5qad"
+ },
+ "source": [
+ "best_params"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E3-PaTdc5vZk"
+ },
+ "source": [
+ "# Train with the relevant COLUMNS\n",
+ "ml_XGB2.fit(X_treinamento_XGB, y_treinamento)\n",
+ "\n",
+ "# Cross-validation with 10 folds (cv = i_CV)\n",
+ "a_scores_CV = cross_val_score(ml_XGB2, X_treinamento_XGB, y_treinamento, cv = i_CV)\n",
+ "print(f'Mean accuracy (%)..: {100*a_scores_CV.mean():.2f}')\n",
+ "print(f'Std of accuracy (%): {100*a_scores_CV.std():.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tBdYikDU6NhD"
+ },
+ "source": [
+ "## Validate the model using the X_teste dataframe"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GcvY-VdL6VIZ"
+ },
+ "source": [
+ "y_pred_XGB = ml_XGB2.predict(X_teste_XGB)\n",
+ "\n",
+ "# Compute the accuracy on the test set\n",
+ "accuracy_score(y_teste, y_pred_XGB)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8oLtdH-vTSbC"
+ },
+ "source": [
+ "xgb.to_graphviz(ml_XGB2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "czXQG3MCHfHM"
+ },
+ "source": [
+ "# KNN - KNEIGHBORSCLASSIFIER"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "llTTXNeyHiwx"
+ },
+ "source": [
+ "# BAGGINGCLASSIFIER"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Fbkekd4QHoZO"
+ },
+ "source": [
+ "# EXTRATREESCLASSIFIER"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "widavwR4HzwE"
+ },
+ "source": [
+ "# SVM\n",
+ "https://data-flair.training/blogs/svm-support-vector-machine-tutorial/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "id_Ubulns6We"
+ },
+ "source": [
+ "# NAIVE BAYES"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3e0m7lEnYOV9"
+ },
+ "source": [
+ "# **COLUMN IMPORTANCE**\n",
+ "Source: [Plotting Feature Importances](https://www.kaggle.com/grfiv4/plotting-feature-importances)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fjco0HnNYr-N"
+ },
+ "source": [
+ "def mostra_feature_importances(clf, X_treinamento, y_treinamento=None, \n",
+ " top_n=10, figsize=(8,8), print_table=False, title=\"Feature Importances\"):\n",
+ " '''\n",
+ " plot feature importances of a tree-based sklearn estimator\n",
+ " \n",
+ " Note: X_treinamento and y_treinamento are pandas DataFrames\n",
+ " \n",
+ " Note: Scikit-plot is a lovely package but I sometimes have issues\n",
+ " 1. flexibility/extendibility\n",
+ " 2. complicated models/datasets\n",
+ " But for many situations Scikit-plot is the way to go\n",
+ " see https://scikit-plot.readthedocs.io/en/latest/Quickstart.html\n",
+ " \n",
+ " Parameters\n",
+ " ----------\n",
+ " clf (sklearn estimator) if not fitted, this routine will fit it\n",
+ " \n",
+ " X_treinamento (pandas DataFrame)\n",
+ " \n",
+ " y_treinamento (pandas DataFrame) optional\n",
+ " required only if clf has not already been fitted \n",
+ " \n",
+ " top_n (int) Plot the top_n most-important features\n",
+ " Default: 10\n",
+ " \n",
+ " figsize ((int,int)) The physical size of the plot\n",
+ " Default: (8,8)\n",
+ " \n",
+ " print_table (boolean) If True, print out the table of feature importances\n",
+ " Default: False\n",
+ " \n",
+ " Returns\n",
+ " -------\n",
+ " the pandas dataframe with the features and their importance\n",
+ " \n",
+ " Author\n",
+ " ------\n",
+ " George Fisher\n",
+ " '''\n",
+ " \n",
+ " __name__ = \"mostra_feature_importances\"\n",
+ " \n",
+ " import pandas as pd\n",
+ " import numpy as np\n",
+ " import matplotlib.pyplot as plt\n",
+ " \n",
+ " from xgboost.core import XGBoostError\n",
+ " from lightgbm.sklearn import LightGBMError\n",
+ " \n",
+ " try: \n",
+ " if not hasattr(clf, 'feature_importances_'):\n",
+ " clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n",
+ "\n",
+ " if not hasattr(clf, 'feature_importances_'):\n",
+ " raise AttributeError(\"{} does not have feature_importances_ attribute\".\n",
+ " format(clf.__class__.__name__))\n",
+ " \n",
+ " except (XGBoostError, LightGBMError, ValueError):\n",
+ " clf.fit(X_treinamento.values, y_treinamento.values.ravel())\n",
+ " \n",
+ " feat_imp = pd.DataFrame({'importance':clf.feature_importances_}) \n",
+ " feat_imp['feature'] = X_treinamento.columns\n",
+ " feat_imp.sort_values(by ='importance', ascending = False, inplace = True)\n",
+ " feat_imp = feat_imp.iloc[:top_n]\n",
+ " \n",
+ " feat_imp.sort_values(by='importance', inplace = True)\n",
+ " feat_imp = feat_imp.set_index('feature', drop = True)\n",
+ " feat_imp.plot.barh(title=title, figsize=figsize)\n",
+ " plt.xlabel('Feature Importance Score')\n",
+ " plt.show()\n",
+ " \n",
+ " if print_table:\n",
+ " from IPython.display import display\n",
+ " print(\"Top {} features in descending order of importance\".format(top_n))\n",
+ " display(feat_imp.sort_values(by = 'importance', ascending = False))\n",
+ " \n",
+ " return feat_imp"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ycu_EIGlYUYn"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "from xgboost import XGBClassifier\n",
+ "from sklearn.ensemble import ExtraTreesClassifier\n",
+ "from sklearn.tree import ExtraTreeClassifier\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "from sklearn.ensemble import GradientBoostingClassifier\n",
+ "from sklearn.ensemble import BaggingClassifier\n",
+ "from sklearn.ensemble import AdaBoostClassifier\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from lightgbm import LGBMClassifier\n",
+ "\n",
+ "clfs = [XGBClassifier(), LGBMClassifier(), \n",
+ " ExtraTreesClassifier(), ExtraTreeClassifier(),\n",
+ " BaggingClassifier(), DecisionTreeClassifier(),\n",
+ " GradientBoostingClassifier(), LogisticRegression(),\n",
+ " AdaBoostClassifier(), RandomForestClassifier()]\n",
+ "\n",
+ "for clf in clfs:\n",
+ " try:\n",
+ " _ = mostra_feature_importances(clf, X_treinamento, y_treinamento, top_n=X_treinamento.shape[1], title=clf.__class__.__name__)\n",
+ " except AttributeError as e:\n",
+ " print(e)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EwWkjfC8KEZH"
+ },
+ "source": [
+ "# ENSEMBLE METHODS\n",
+ "https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3Uf1RML7xETY"
+ },
+ "source": [
+ "# WOE (Weight of Evidence) and IV (Information Value)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TBNRfYZCyhMP"
+ },
+ "source": [
+ "## Building the example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gIIroyyP4ZRZ"
+ },
+ "source": [
+ "df_y.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PzQQdrkf1ohX"
+ },
+ "source": [
+ "from random import choices\n",
+ "\n",
+ "df_X2= df_X.copy()\n",
+ "df_X2['tipo']= choices(['A', 'B', 'C', 'D'], k= len(df_X2))  # one random category per row\n",
+ "df_X2['idade']= np.random.randint(10, 15, size= len(df_X2))  # random integer in [10, 14] per row\n",
+ "df_X2['target']= df_y['target']\n",
+ "df_X2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "v-OpwIpx4hXJ"
+ },
+ "source": [
+ "df_X2['target'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yZfqSvbKzeJ3"
+ },
+ "source": [
+ "def Constroi_Buckets(df, i, k= 10):\n",
+ "    # Bin column v<i> into k equal-width buckets labelled 1..k, then drop the original column\n",
+ "    coluna= 'v'+ str(i)\n",
+ "    df[coluna+'_Bucket']= pd.cut(df[coluna], bins= k, labels= np.arange(1, k+1))\n",
+ "    df= df.drop(columns= [coluna])\n",
+ "    return df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "V6Nrpsx60HD3"
+ },
+ "source": [
+ "for i in np.arange(1,19):\n",
+ " df_X2= Constroi_Buckets(df_X2, i)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "J2Fbh41-03OB"
+ },
+ "source": [
+ "df_X2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "O9r5BeWVxIr3"
+ },
+ "source": [
+ "# Function to compute WOE (Weight of Evidence) and IV (Information Value)\n",
+ "def calculate_woe_iv(dataset, feature, target):\n",
+ "\n",
+ " def codethem(IV):\n",
+ " if IV < 0.02: return 'Useless'\n",
+ " elif IV >= 0.02 and IV < 0.1: return 'Weak'\n",
+ " elif IV >= 0.1 and IV < 0.3: return 'Medium'\n",
+ " elif IV >= 0.3 and IV < 0.5: return 'Strong'\n",
+ " elif IV >= 0.5: return 'Suspicious'\n",
+ " else: return 'None'\n",
+ "\n",
+ " lst = []\n",
+ " for i in range(dataset[feature].nunique()):\n",
+ " val = list(dataset[feature].unique())[i]\n",
+ " lst.append({\n",
+ " 'Value': val,\n",
+ " 'All': dataset[dataset[feature] == val].count()[feature],\n",
+ " 'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],\n",
+ " 'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]\n",
+ " })\n",
+ " \n",
+ " dset = pd.DataFrame(lst)\n",
+ " dset['Distr_Good'] = dset['Good']/dset['Good'].sum()\n",
+ " dset['Distr_Bad'] = dset['Bad']/dset['Bad'].sum()\n",
+ " dset['Mean']= dset['All']/dset['All'].sum()\n",
+ " dset['WoE'] = np.log(dset['Distr_Good']/dset['Distr_Bad'])\n",
+ " dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})\n",
+ " dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']\n",
+ " #dset= dset.drop(columns= ['Distr_Good', 'Distr_Bad'], axis= 1)\n",
+ "\n",
+ " dset['Predictive_Power']= dset['IV'].map(codethem)\n",
+ " iv = dset['IV'].sum() \n",
+ " dset = dset.sort_values(by='IV') \n",
+ " return dset, iv"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
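+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The quantities computed by `calculate_woe_iv` above can be written compactly. This is a sketch in the function's own terms: for each value (bucket) $i$ of the feature, $g_i$ stands for `Distr_Good` and $b_i$ for `Distr_Bad` (the symbols $g_i$ and $b_i$ are shorthand introduced here):\n",
+ "\n",
+ "$$\\mathrm{WoE}_i = \\ln\\left(\\frac{g_i}{b_i}\\right), \\qquad \\mathrm{IV} = \\sum_i (g_i - b_i)\\,\\mathrm{WoE}_i$$\n",
+ "\n",
+ "Here $g_i$ is the share of all Good rows (target = 0) that fall in bucket $i$, and $b_i$ is the share of all Bad rows (target = 1). When a bucket has no Good or no Bad rows the logarithm is infinite, and the function replaces that WoE with 0. Note that the `Predictive_Power` label is assigned to each bucket's own contribution to IV; such thresholds are more commonly applied to the total IV of the variable, so the per-bucket labels should be read with that in mind."
+ ]
+ },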
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Y8WGjWH63nx_"
+ },
+ "source": [
+ "df_Lab = df_X2.copy()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-N6xr1MgxTiz"
+ },
+ "source": [
+ "def calcula_Predictive_Power(df_Lab, coluna):\n",
+ " print('WoE and IV for column: {}'.format(coluna))\n",
+ " df, iv = calculate_woe_iv(df_Lab, coluna, 'target')\n",
+ " print(df)\n",
+ " print('IV score: {:.2f}'.format(iv))\n",
+ " print('\\n')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ayqN_7WnxVq9"
+ },
+ "source": [
+ "for i in np.arange(1,19):\n",
+ " coluna= 'v'+str(i)+'_Bucket'\n",
+ " calcula_Predictive_Power(df_Lab, coluna)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qtoJVI4Pyx3I"
+ },
+ "source": [
+ "# **IMBALANCED SAMPLE**\n",
+ "> Some tasks, such as detecting fraud in banking transactions or intrusions in a network, share one trait: the class of interest (what we want to detect) is usually a rare event.\n",
+ "\n",
+ "## Example: Detecting fraud\n",
+ "The ratio of FRAUD to NON-FRAUD is roughly 1%/99%. If we build a fraud-detection model and it classifies every instance as NON-FRAUD, it will reach 99% accuracy. Yet this model is useless to us, because it never detects a single fraud.\n",
+ "\n",
+ "## The need for other metrics\n",
+ "> It is recommended to use other metrics (in fact, it is good practice to use more than one metric to measure model performance) such as F1-Score, Precision, Recall (Sensitivity), Specificity and AUROC.\n",
+ "\n",
+ "## How to deal with an imbalanced sample?\n",
+ "* Under-sampling\n",
+ "> Randomly sample the MAJORITY class (NON-FRAUD in our example) down to the number of instances of the MINORITY class (FRAUD);\n",
+ "\n",
+ "* Over-sampling\n",
+ "> Randomly resample the MINORITY class (FRAUD in our example) up to the number of instances of the MAJORITY class (NON-FRAUD), or to a proportion of it. See also SMOTE (Synthetic Minority Over-sampling Technique), available in the imbalanced-learn library;\n",
+ "\n",
+ "\n"
+ ]
+ },
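+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the fraud example above concrete (a worked illustration; the sample size $N$ and the confusion-matrix symbols are introduced here), take $N$ transactions of which $0.01N$ are FRAUD, and a model that labels everything NON-FRAUD. Then $TP = 0$, $FP = 0$, $FN = 0.01N$ and $TN = 0.99N$, so\n",
+ "\n",
+ "$$\\text{Accuracy} = \\frac{TP + TN}{N} = \\frac{0 + 0.99N}{N} = 0.99, \\qquad \\text{Recall} = \\frac{TP}{TP + FN} = \\frac{0}{0.01N} = 0$$\n",
+ "\n",
+ "Precision, $TP / (TP + FP)$, is undefined here because the model never predicts FRAUD. The high accuracy only reflects the class proportions, whereas recall shows that no fraud is caught."
+ ]
+ },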
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2o45zx8zw-aB"
+ },
+ "source": [
+ "## EFFECTS OF AN IMBALANCED SAMPLE"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cCVTPCB-Xkbd"
+ },
+ "source": [
+ "# TPOT\n",
+ "https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2ulXii6JXpWd"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_TWUq-z4X4yZ"
+ },
+ "source": [
+ "___\n",
+ "# FEATURETOOLS\n",
+ "https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183\n",
+ "\n",
+ "https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/\n",
+ "\n",
+ "https://mlwhiz.com/blog/2019/05/19/feature_extraction/\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aZUNOgmSgAmq"
+ },
+ "source": [
+ "!pip install featuretools"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_sxdONzsh9rb"
+ },
+ "source": [
+ "df_X.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "p5_ynGo1dBJJ"
+ },
+ "source": [
+ "df_X.shape"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TqJRJXUhiDqf"
+ },
+ "source": [
+ "from random import choices\n",
+ "\n",
+ "df_X2= df_X.copy()\n",
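+ "df_X2['tipo'] = choices(['A', 'B', 'C', 'D'], k = len(df_X2))  # one random category per row\n",
+ "df_X2['idade'] = np.random.randint(10, 15, size = len(df_X2))  # random integer in [10, 14] per row\n",
+ "df_X2['id'] = range(0, len(df_X2))  # unique row id, used as the entity index below\n",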
+ "df_X2.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nR56bGGngk-W"
+ },
+ "source": [
+ "# Automated feature engineering\n",
+ "import featuretools as ft\n",
+ "import featuretools.variable_types as vtypes\n",
+ "\n",
+ "es= ft.EntitySet(id = 'simulacao')\n",
+ "\n",
+ "# adding a dataframe \n",
+ "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id')\n",
+ "es"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IOJ4Tr5Ogk6M"
+ },
+ "source": [
+ "es['df_X2'].variables"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1uXPqHDZgkys"
+ },
+ "source": [
+ "variable_types = {'idade': vtypes.Categorical}\n",
+ " \n",
+ "es.entity_from_dataframe(entity_id = 'df_X2', dataframe = df_X2, index = 'id', variable_types= variable_types)\n",
+ "\n",
+ "# Each new entity is keyed by the column it is split out on\n",
+ "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'tipo', index='tipo')\n",
+ "es = es.normalize_entity(base_entity_id='df_X2', new_entity_id= 'idade', index='idade')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dnbYTBqugkvm"
+ },
+ "source": [
+ "es"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "I2v_jetdgkr7"
+ },
+ "source": [
+ "feature_matrix, feature_names = ft.dfs(entityset=es, target_entity = 'df_X2', max_depth = 3, verbose = 3, n_jobs= 1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zZiRBvHXgkoJ"
+ },
+ "source": [
+ "feature_matrix.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aWiahwKe2d6U"
+ },
+ "source": [
+ "# **EXERCISES**\n",
+ "> Find suitable algorithms to apply to the following problems:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XbSLkbDB2mzK"
+ },
+ "source": [
+ "## Exercise 1 - Credit Card Fraud Detection\n",
+ "Source: [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)\n",
+ "\n",
+ "### Supporting reading\n",
+ "* [Detecting Credit Card Fraud Using Machine Learning](https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8)\n",
+ "* [Credit Card Fraud Detection](https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59)\n",
+ "\n",
+ "### Dataframe\n",
+ "* [Creditcard.csv](https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JYVM3StS-g0E"
+ },
+ "source": [
+ "### Import the required libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dyliPChh-jPk"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import seaborn as sns\n",
+ "\n",
+ "from sklearn.metrics import accuracy_score # to measure the accuracy of the predictive model\n",
+ "#from sklearn.model_selection import train_test_split\n",
+ "#from sklearn.metrics import classification_report\n",
+ "from sklearn.metrics import confusion_matrix # to compute the confusion matrix\n",
+ "\n",
+ "from sklearn.model_selection import GridSearchCV # to tune the hyperparameters of the predictive models\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "from time import time\n",
+ "from operator import itemgetter\n",
+ "from scipy.stats import randint\n",
+ "\n",
+ "from sklearn.tree import export_graphviz\n",
+ "from io import StringIO # sklearn.externals.six was removed in scikit-learn 0.23\n",
+ "from IPython.display import Image\n",
+ "import pydotplus\n",
+ "\n",
+ "np.set_printoptions(suppress=True)"
+ ],
+ "execution_count": 52,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kpy5coZZ-BLx"
+ },
+ "source": [
+ "from sklearn.metrics import confusion_matrix # to compute the confusion matrix"
+ ],
+ "execution_count": 91,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lAl9ZwP_0-d0"
+ },
+ "source": [
+ "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/creditcard.csv'\n",
+ "df_cc = pd.read_csv(url)"
+ ],
+ "execution_count": 53,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "w6lN8FjJ12VU",
+ "outputId": "eee4d215-282a-44e7-b4a8-2123e6d391c8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 379
+ }
+ },
+ "source": [
+ "df_cc.head(10)"
+ ],
+ "execution_count": 54,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>Time</th><th>V1</th><th>V2</th><th>V3</th><th>V4</th><th>V5</th><th>V6</th><th>V7</th><th>V8</th><th>V9</th><th>V10</th><th>V11</th><th>V12</th><th>V13</th><th>V14</th><th>V15</th><th>V16</th><th>V17</th><th>V18</th><th>V19</th><th>V20</th><th>V21</th><th>V22</th><th>V23</th><th>V24</th><th>V25</th><th>V26</th><th>V27</th><th>V28</th><th>Amount</th><th>Class</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>0</td><td>-1.359807</td><td>-0.072781</td><td>2.536347</td><td>1.378155</td><td>-0.338321</td><td>0.462388</td><td>0.239599</td><td>0.098698</td><td>0.363787</td><td>0.090794</td><td>-0.551600</td><td>-0.617801</td><td>-0.991390</td><td>-0.311169</td><td>1.468177</td><td>-0.470401</td><td>0.207971</td><td>0.025791</td><td>0.403993</td><td>0.251412</td><td>-0.018307</td><td>0.277838</td><td>-0.110474</td><td>0.066928</td><td>0.128539</td><td>-0.189115</td><td>0.133558</td><td>-0.021053</td><td>149.62</td><td>0.0</td></tr>\n",
+ "    <tr><th>1</th><td>0</td><td>1.191857</td><td>0.266151</td><td>0.166480</td><td>0.448154</td><td>0.060018</td><td>-0.082361</td><td>-0.078803</td><td>0.085102</td><td>-0.255425</td><td>-0.166974</td><td>1.612727</td><td>1.065235</td><td>0.489095</td><td>-0.143772</td><td>0.635558</td><td>0.463917</td><td>-0.114805</td><td>-0.183361</td><td>-0.145783</td><td>-0.069083</td><td>-0.225775</td><td>-0.638672</td><td>0.101288</td><td>-0.339846</td><td>0.167170</td><td>0.125895</td><td>-0.008983</td><td>0.014724</td><td>2.69</td><td>0.0</td></tr>\n",
+ "    <tr><th>2</th><td>1</td><td>-1.358354</td><td>-1.340163</td><td>1.773209</td><td>0.379780</td><td>-0.503198</td><td>1.800499</td><td>0.791461</td><td>0.247676</td><td>-1.514654</td><td>0.207643</td><td>0.624501</td><td>0.066084</td><td>0.717293</td><td>-0.165946</td><td>2.345865</td><td>-2.890083</td><td>1.109969</td><td>-0.121359</td><td>-2.261857</td><td>0.524980</td><td>0.247998</td><td>0.771679</td><td>0.909412</td><td>-0.689281</td><td>-0.327642</td><td>-0.139097</td><td>-0.055353</td><td>-0.059752</td><td>378.66</td><td>0.0</td></tr>\n",
+ "    <tr><th>3</th><td>1</td><td>-0.966272</td><td>-0.185226</td><td>1.792993</td><td>-0.863291</td><td>-0.010309</td><td>1.247203</td><td>0.237609</td><td>0.377436</td><td>-1.387024</td><td>-0.054952</td><td>-0.226487</td><td>0.178228</td><td>0.507757</td><td>-0.287924</td><td>-0.631418</td><td>-1.059647</td><td>-0.684093</td><td>1.965775</td><td>-1.232622</td><td>-0.208038</td><td>-0.108300</td><td>0.005274</td><td>-0.190321</td><td>-1.175575</td><td>0.647376</td><td>-0.221929</td><td>0.062723</td><td>0.061458</td><td>123.50</td><td>0.0</td></tr>\n",
+ "    <tr><th>4</th><td>2</td><td>-1.158233</td><td>0.877737</td><td>1.548718</td><td>0.403034</td><td>-0.407193</td><td>0.095921</td><td>0.592941</td><td>-0.270533</td><td>0.817739</td><td>0.753074</td><td>-0.822843</td><td>0.538196</td><td>1.345852</td><td>-1.119670</td><td>0.175121</td><td>-0.451449</td><td>-0.237033</td><td>-0.038195</td><td>0.803487</td><td>0.408542</td><td>-0.009431</td><td>0.798278</td><td>-0.137458</td><td>0.141267</td><td>-0.206010</td><td>0.502292</td><td>0.219422</td><td>0.215153</td><td>69.99</td><td>0.0</td></tr>\n",
+ "    <tr><th>5</th><td>2</td><td>-0.425966</td><td>0.960523</td><td>1.141109</td><td>-0.168252</td><td>0.420987</td><td>-0.029728</td><td>0.476201</td><td>0.260314</td><td>-0.568671</td><td>-0.371407</td><td>1.341262</td><td>0.359894</td><td>-0.358091</td><td>-0.137134</td><td>0.517617</td><td>0.401726</td><td>-0.058133</td><td>0.068653</td><td>-0.033194</td><td>0.084968</td><td>-0.208254</td><td>-0.559825</td><td>-0.026398</td><td>-0.371427</td><td>-0.232794</td><td>0.105915</td><td>0.253844</td><td>0.081080</td><td>3.67</td><td>0.0</td></tr>\n",
+ "    <tr><th>6</th><td>4</td><td>1.229658</td><td>0.141004</td><td>0.045371</td><td>1.202613</td><td>0.191881</td><td>0.272708</td><td>-0.005159</td><td>0.081213</td><td>0.464960</td><td>-0.099254</td><td>-1.416907</td><td>-0.153826</td><td>-0.751063</td><td>0.167372</td><td>0.050144</td><td>-0.443587</td><td>0.002821</td><td>-0.611987</td><td>-0.045575</td><td>-0.219633</td><td>-0.167716</td><td>-0.270710</td><td>-0.154104</td><td>-0.780055</td><td>0.750137</td><td>-0.257237</td><td>0.034507</td><td>0.005168</td><td>4.99</td><td>0.0</td></tr>\n",
+ "    <tr><th>7</th><td>7</td><td>-0.644269</td><td>1.417964</td><td>1.074380</td><td>-0.492199</td><td>0.948934</td><td>0.428118</td><td>1.120631</td><td>-3.807864</td><td>0.615375</td><td>1.249376</td><td>-0.619468</td><td>0.291474</td><td>1.757964</td><td>-1.323865</td><td>0.686133</td><td>-0.076127</td><td>-1.222127</td><td>-0.358222</td><td>0.324505</td><td>-0.156742</td><td>1.943465</td><td>-1.015455</td><td>0.057504</td><td>-0.649709</td><td>-0.415267</td><td>-0.051634</td><td>-1.206921</td><td>-1.085339</td><td>40.80</td><td>0.0</td></tr>\n",
+ "    <tr><th>8</th><td>7</td><td>-0.894286</td><td>0.286157</td><td>-0.113192</td><td>-0.271526</td><td>2.669599</td><td>3.721818</td><td>0.370145</td><td>0.851084</td><td>-0.392048</td><td>-0.410430</td><td>-0.705117</td><td>-0.110452</td><td>-0.286254</td><td>0.074355</td><td>-0.328783</td><td>-0.210077</td><td>-0.499768</td><td>0.118765</td><td>0.570328</td><td>0.052736</td><td>-0.073425</td><td>-0.268092</td><td>-0.204233</td><td>1.011592</td><td>0.373205</td><td>-0.384157</td><td>0.011747</td><td>0.142404</td><td>93.20</td><td>0.0</td></tr>\n",
+ "    <tr><th>9</th><td>9</td><td>-0.338262</td><td>1.119593</td><td>1.044367</td><td>-0.222187</td><td>0.499361</td><td>-0.246761</td><td>0.651583</td><td>0.069539</td><td>-0.736727</td><td>-0.366846</td><td>1.017614</td><td>0.836390</td><td>1.006844</td><td>-0.443523</td><td>0.150219</td><td>0.739453</td><td>-0.540980</td><td>0.476677</td><td>0.451773</td><td>0.203711</td><td>-0.246914</td><td>-0.633753</td><td>-0.120794</td><td>-0.385050</td><td>-0.069733</td><td>0.094199</td><td>0.246219</td><td>0.083076</td><td>3.68</td><td>0.0</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Time V1 V2 V3 ... V27 V28 Amount Class\n",
+ "0 0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0.0\n",
+ "1 0 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0.0\n",
+ "2 1 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0.0\n",
+ "3 1 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0.0\n",
+ "4 2 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0.0\n",
+ "5 2 -0.425966 0.960523 1.141109 ... 0.253844 0.081080 3.67 0.0\n",
+ "6 4 1.229658 0.141004 0.045371 ... 0.034507 0.005168 4.99 0.0\n",
+ "7 7 -0.644269 1.417964 1.074380 ... -1.206921 -1.085339 40.80 0.0\n",
+ "8 7 -0.894286 0.286157 -0.113192 ... 0.011747 0.142404 93.20 0.0\n",
+ "9 9 -0.338262 1.119593 1.044367 ... 0.246219 0.083076 3.68 0.0\n",
+ "\n",
+ "[10 rows x 31 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 54
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "M47GS1cK2NdD",
+ "outputId": "91cb28e1-5fd6-4ce4-b733-db59fcace09e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "df_cc.shape"
+ ],
+ "execution_count": 55,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(12842, 31)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 55
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NFlXj1OAzvPS",
+ "outputId": "228e6a58-8a52-4f1c-944b-ebe25b89b339",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 102
+ }
+ },
+ "source": [
+ "df_cc.columns"
+ ],
+ "execution_count": 58,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',\n",
+ " 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',\n",
+ " 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',\n",
+ " 'Class'],\n",
+ " dtype='object')"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 58
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vpVRbkTV0AI_"
+ },
+ "source": [
+ "#df_cc.columns = [coluna.lower() for coluna in df_cc.columns]"
+ ],
+ "execution_count": 60,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "b2QBZFbR3W_q",
+ "outputId": "b2076b85-21cc-49ea-a603-586036f5fae5",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "df_cc['Class'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0.0 12785\n",
+ "1.0 56\n",
+ "Name: Class, dtype: int64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 126
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pzjW3Bf_3h7J",
+ "outputId": "c87f7be2-0bf6-4dea-b77a-26bee019992a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Fraction of fraudulent transactions (Class == 1) in the dataset\n",
+ "56/12842"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0.004360691481077714"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 129
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9bWDX9H12k5g"
+ },
+ "source": [
+ "### Drop NaN"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "27ob8tRR21TB",
+ "outputId": "59e26e44-a89c-4af8-9129-2452b3483ae9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 561
+ }
+ },
+ "source": [
+ "df_cc.isna().sum()\n"
+ ],
+ "execution_count": 61,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "Time 0\n",
+ "V1 0\n",
+ "V2 0\n",
+ "V3 0\n",
+ "V4 0\n",
+ "V5 0\n",
+ "V6 0\n",
+ "V7 0\n",
+ "V8 0\n",
+ "V9 0\n",
+ "V10 1\n",
+ "V11 1\n",
+ "V12 1\n",
+ "V13 1\n",
+ "V14 1\n",
+ "V15 1\n",
+ "V16 1\n",
+ "V17 1\n",
+ "V18 1\n",
+ "V19 1\n",
+ "V20 1\n",
+ "V21 1\n",
+ "V22 1\n",
+ "V23 1\n",
+ "V24 1\n",
+ "V25 1\n",
+ "V26 1\n",
+ "V27 1\n",
+ "V28 1\n",
+ "Amount 1\n",
+ "Class 1\n",
+ "dtype: int64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 61
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "o1yH1ve00Y1P",
+ "outputId": "ca196cce-a22e-4229-d3b6-fb9d99a3a0fc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 400
+ }
+ },
+ "source": [
+ "# DataViz\n",
+ "sns.catplot(x='Class', y='V1', data=df_cc, kind='box')"
+ ],
+ "execution_count": 64,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 64
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAFuCAYAAAChovKPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAWU0lEQVR4nO3df4xlZX3H8c9nZmTLaqxyIcAOrmBnSQuoiBNam8Vuy2y8EivVFINN3WttsiXR3a0xtSImNiaYqCUW11+MSp1trNRIkN2Kw+6gIKShOiCwIGAHXJWZrS4Xo8KuK7Pz7R9z2c6sswMze899znPP+5Xc5D7n3L33s5nhw9nnPuccR4QAAPnoSR0AALA0FDcAZIbiBoDMUNwAkBmKGwAy05c6QDvU6/UYHR1NHQMA2s0LbeyKI+7HH388dQQA6JiuKG4AqBKKGwAyQ3EDQGYobgDIDMUNAJmhuAEgMxQ3AGSG4gaAzFDcAJAZihsAMkNxA0BmKG4AyExXXB0wV1u3btXExESSz56cnJQk9ff3J/n8gYEBbdq0KclnA7mjuCvqwIEDqSMAWCZ3w13eBwcHY3x8PHWMrGzZskWSdPXVVydOAmAR3Xs9bgCoEoobADJDcQNAZihuAMgMq0qACmDpaXctPa18caf8hU7pmb/zM6tLqqQb/0MuM5aetl/li3tiYkL33P+gDq08IXWUjur5zewy0Lse/WniJJ3Vu/+J1BGSSPk/Kpaetl/li1uSDq08QQd+/6LUMdABxz90U+oIwDHjy0kAyAzFDQCZobgBIDMUNwBkhuIGgMxQ3ACQmdIWt+267YdtT9h+X+o8AFAWpSxu272SPiXp9ZLOkvRW22elTQUA5VDWE3DOlzQREY9Kku3rJF0s6ftJUwHHgMsrcHmFdilrcfdL+smc8WOS/nDuC2xvlLRRklavXt25ZMAyTUxM6H8e+J5Wv+BQ6igdddzTs/+wP/ijat2l6sdP9hb23mUt7mcVEcOShqXZW5cljgM8J6tfcEjvP++XqWOgAz589wsLe++yFvekpJfMGZ/W2tb+D5qcVO/+X3ANi4ro3d/U5OR06hjAMSnll5OSvitpje0zbB8n6VJJ2xNnAoBSKOURd0RM236XpJsl9Uq6NiIeKOKz+vv79b8H+7g6YEUc/9BN6u8/OXUM4JiUsrglKSJuksT8BQAcoaxTJQCAo6C4ASAzFDcAZKa0c9yd1Lv/icotB+z59exa4pnfKW6taRnN3nOSLyeRt8oX98DAQOoISUxM/EqSNPCyqpXYyZX9maN7VL64U979OiXuvA3kizluAMgMxQ0AmaG4ASAzFDcAZIbiBoDMUNwAkBmKGwAyQ3EDQGYobgDIDMUNAJmhuAEgMxQ3AGSG4gaAzFT+6oBAp0xOTuqpX/Xqw3dX6xroVfWjX/Xq+ZOThbw3R9wAkBmOuIEO6e/v18HpvXr/eb9MHQUd8OG7X6gV/f2FvDdH3ACQGYobADJDcQNAZihuAMgMxQ0AmaG4ASAzFDcAZIbiBoDMUNwAkBmKu6L27dune++9Vzt27EgdBcASUdwVNTU1JUm66qqrEicBsFQUdwXdeOON88YcdQN54SJTCW3dulUTExMd/9x777133viqq67S2NhYRzMMDAxo06ZNHf1MoFtwxA0AmeGIO6FUR5zr1q37rW1XX31154MAWBaOuAEgMxQ3AGSG4gaAzJSuuG3/k+1J2/e0HhelzgQAZVLWLyc/HhH/nDoEAJRR6Y64AQCLK2txv8v2fbavtf3i1GEAoEySFLftMdv3L/C4WNJnJP2epHMl7ZW04MU0bG+0PW57fN++fR1MDwBpJZnjjoih5/I625+T9J9HeY9hScOSNDg4GO1LBxTnx0/26sN3vzB1jI766f7Z48OTV84kTtJZP36yV2sKeu/SfTlp+9SI2NsavknS/SnzAO0yMDCQOkISv2ldj2fFS6v1
91+j4n7mpStuSR+1fa6kkLRH0t+ljQO0R1UvqrVlyxZJXFahnUpX3BHxttQZAKDMyrqqBABwFBQ3AGSG4gaAzFDcAJAZihsAMkNxA0BmKG4AyAzFDQCZobgBIDMUNwBkhuIGgMxQ3ACQGYobADJDcQNAZihuAMgMxQ0AmaG4ASAzFDcAZIbiBoDMUNwAkBmKGwAyQ3EDQGYobgDIDMUNAJmhuAEgMxQ3AGSG4gaAzFDcAJAZihsAMkNxA0BmKG4AyAzFDQCZobgr6MUvfvG88QknnJAoCYDloLgraP/+/fPGTz31VKIkAJaD4q6ggwcPLjoGUG4UNwBkhuIGgMxQ3ACQGYq7gk488cRFxwDKjeKuoMcff3zRMYByo7gBIDNJitv2JbYfsD1je/CIfZfbnrD9sO3XpcgHAGXWl+hz75f0ZknXzN1o+yxJl0o6W9IqSWO2z4yIQ52PCADllOSIOyIejIiHF9h1saTrIuJgRPxQ0oSk8zubrvs9//nPX3QMoNzKNsfdL+knc8aPtbb9FtsbbY/bHt+3b19HwnWL6enpRccAyq2w4rY9Zvv+BR4Xt+P9I2I4IgYjYvCkk05qx1tWxpEXlarVaomSAFiOwua4I2JoGX9sUtJL5oxPa21DG+3du3feeGpqKlESAMtRtqmS7ZIutb3C9hmS1kj6TuJMAFAqqZYDvsn2Y5JeI+nrtm+WpIh4QNJXJH1f0qikd7KiBADmS7IcMCJukHTDUfZdKenKziaqllNPPXXedMmqVasSpgGwVGWbKkEHvOc971l0DKDcKO4K2rFjx6JjAOVGcVfQbbfdNm986623pgkCYFkobgDIDMVdQX19fYuOAZQbxV1BnPIO5I3irqDTTz990TGAcqO4K+gDH/jAomMA5UZxV9DAwIBWrlwpSVq5cqUGBgYSJwKwFBR3BTWbTR04cECSdODAATWbzcSJACwFxV1B11xzjSJCkhQRGh4eTpwIwFJQ3BV0yy23zBuPjY0lSgJgOSjuCrK96BhAuVHcFbR27dpFxwDKjeKuoOOOO27eeMWKFYmSAFgOiruC7rjjjnnj22+/PVESAMvBRSoqaO3atdq5c+fh8QUXXJAwDTph69atmpiYSPLZz3zuli1bknz+wMCANm3alOSzi0JxVxBfRqKTjj/++NQRuo6fWc+bs8HBwRgfH08dIxsXXXSR9u/ff3i8cuVK3XTTTQkTATiKBY+yljXHbXv9sWVBSkNDQ4cv5drX16f16/lxAjlZ7peTX2hrCnRUo9FQT8/sj763t1cbNmxInAjAUhx1jtv29qPtklQrJg46oVarad26ddq5c6fWrVunWo0fJ5CTxb6cvEDSX0t68ojtlnR+YYnQEXxBCeRrsamSOyXtj4jbjnjcKunhzsRDEZrNpr71rW9Jmr1RMFcHBPKyWHH/UNLTC+2IiNcWEwedMDIyopmZGUnSoUOHtG3btsSJACzFYsX9sKSP2d5j+6O2X9WpUCjW2NjY4ftMTk9Pa9euXYkTAViKoxZ3RFwdEa+R9CeSmpKutf2Q7Q/aPrNjCdF2LAcE8vasywEj4kcR8ZGIeJWkt0r6C0kPFp4MhWk0Goe/nOzp6WE5IJCZZy1u2322/9z2lyR9Q7NTKG8uPBkKU6vV1N/fL0latWoVywGBzCy2jnu9Zo+wL5L0HUnXSdoYEU91KBsK0mw2NTU1JUmamppSs9mkvIGMLHbEfbmk/5L0BxHxxoj4d0q7O8xdVTIzM8OqEiAzi305+WcR8fmI+HknA6F4rCpBJzWbTW3evJnzBdqIGylUEKtK0EkjIyPavXs3/7JrI4q7grjIFDql2WxqdHRUEaHR0VGOutuE4q6gWq2mer0u26rX63wxicJwlm4xKO6KajQaevnLX87RNgrF9ynFoLgrqlar6ROf+ARH2yjUkfcz5f6m7UFxAyhMN9wasYwobgCFueOOO+aNb7/99kRJugvFDaAwa9eunTdmqqQ9khS37UtsP2B7
xvbgnO2n2z5g+57W47Mp8gFoD+60VIxUR9z3a/ZCVd9eYN8jEXFu63FZh3MBaKMjp0aYKmmPJMUdEQ9GBLc/A7rc0NCQent7Jc2e7MVZuu1RxjnuM2x/z/Ztto86IWZ7o+1x2+P79u3rZD4Az1Gj0Ti8siQiOG+gTRa7y/sxsT0m6ZQFdl0RETce5Y/tlbQ6Ipq2Xy3pa7bPjohfHvnCiBiWNCxJg4ODrDkCSmpucaM9CjvijoihiDhngcfRSlsRcTAimq3nd0l6RBK3SQMyNTw8PK+4h4eHEyfqDqWaKrF9ku3e1vOXSVoj6dG0qQAs1y233LLoGMuTajngm2w/Juk1kr5u++bWrtdKus/2PZK+KumyiHgiRUYAx+7I6RGmS9oj1aqSGyLitIhYEREnR8TrWtuvj4izW0sBz4uIHSnyAWiPCy+8cN54aGgoUZLuUqqpEgDd5S1vecu88SWXXJIoSXehuAEUZvv27YfPnrStHTv4R3Q7UNwACjM2NjZvVQnX424PihtAYbi/aTEobgCFaTQah6dKenp6OHOyTShuAIWp1Wrq7++XJK1atYo7LrUJxQ2gMM1mU1NTU5Kkqakp7vLeJhQ3gMLMvcv7zMwMd3lvE4obQGG4y3sxKG4AheF63MWguAEUhutxF4PiBoDMUNwACjMyMqKentma6enp4cvJNqG4ARSGLyeLQXEDKAynvBeD4gZQGE55LwbFDaAwnPJeDIobQGE45b0YFDeAwnDKezEobgCFYVVJMShuAIXhlPdiUNwACsMp78WguAEgMxQ3gMJwynsxKG4AheHLyWJQ3AAKwynvxaC4ARSm0Wgcnirp7e3ly8k2obgBFKZWq6ler8u26vU6p7y3SV/qAAC6W6PR0J49ezjabiM/s8YyZ4ODgzE+Pp46BgC0mxfayFQJAGSG4gaAzFDcAJAZihsAMkNxA0BmKG4AyAzFDQCZobgBIDMUNwBkhuIGgMwkKW7bH7P9kO37bN9g+0Vz9l1ue8L2w7ZflyIfAJRZqiPuXZLOiYhXSPqBpMslyfZZki6VdLakuqRP2+5NlBEASilJcUfEzoiYbg3vlHRa6/nFkq6LiIMR8UNJE5LOT5ERAMqqDHPc75D0jdbzfkk/mbPvsdY2AEBLYdfjtj0m6ZQFdl0RETe2XnOFpGlJX1rG+2+UtFGSVq9efQxJASAvhRV3RAwttt/22yW9QdKF8f8XBZ+U9JI5LzuttW2h9x+WNCzNXo/7WPMCQC5SrSqpS3qvpDdGxP45u7ZLutT2CttnSFoj6TspMgJAWaW6ddknJa2QtMu2JN0ZEZdFxAO2vyLp+5qdQnlnRBxKlBEASilJcUfEwCL7rpR0ZQfjAEBWyrCqBACwBBQ3AGSG4gaAzFDcAJAZihsAMkNxA0BmKG4AyAzFDQCZobgBIDMUNwBkhuIGgMxQ3ACQGYobADJDcQNAZihuAMgMxQ0AmaG4ASAzFDcAZIbiBoDMUNwAkBmKGwAyQ3FXVLPZ1ObNm9VsNlNHAbBEFHdFjYyMaPfu3dq2bVvqKACWiOKuoGazqdHRUUWERkdHOeoGMkNxV9DIyIhmZmYkSYcOHeKoG8gMxV1BY2Njmp6eliRNT09r165diRMBWAqKu4KGhobU19cnSerr69P69esTJwKwFBR3BTUaDfX0zP7oe3t7tWHDhsSJACwFxV1BtVpN9XpdtlWv11Wr1VJHArAEfakDII1Go6E9e/ZwtA1kyBGROsMxGxwcjPHx8dQxAKDdvNBGpkoAIDMUNwBkhuIGgMxQ3ACQGYobADJDcQNAZihuAMgMxQ0AmaG4ASAzFDcAZCZJcdv+mO2HbN9n+wbbL2ptP932Adv3tB6fTZEPAMos1RH3LknnRMQrJP1A0uVz9j0SEee2HpeliQcA5ZWkuCNiZ0RMt4Z3SjotRQ4AyFEZ5rjfIekbc8Zn2P6e7dtsX3C0P2R7o+1x2+P79u0rPiUAlERhl3W1PSbplAV2XRER
N7Zec4WkQUlvjoiwvULSCyKiafvVkr4m6eyI+OVin8VlXQF0qQUv61rYjRQiYmix/bbfLukNki6M1v89IuKgpIOt53fZfkTSmZJoZQBoSbWqpC7pvZLeGBH752w/yXZv6/nLJK2R9GiKjABQVqluXfZJSSsk7bItSXe2VpC8VtKHbD8taUbSZRHxRKKMAFBKSYo7IgaOsv16Sdd3OA4AZKUMq0oAAEtAcQNAZihuAMgMxQ0AmaG4ASAzFDcAZIbiBoDMUNwAkBmKGwAyQ3EDQGYobgDIDMUNAJmhuAEgMxQ3AGSG4gaAzFDcAJAZiruims2mNm/erGazmToKgCWiuCtqZGREu3fv1rZt21JHAbBEFHcFNZtNjY6OKiI0OjrKUTeQGYq7gkZGRjQzMyNJOnToEEfdQGYo7goaGxvT9PS0JGl6elq7du1KnAjAUlDcFTQ0NKS+vj5JUl9fn9avX584EYCloLgrqNFoqKdn9kff29urDRs2JE4EYCko7gqq1Wqq1+uyrXq9rlqtljoSgCXoSx0AaTQaDe3Zs4ejbSBDjojUGY7Z4OBgjI+Pp44BAO3mhTYyVQIAmaG4ASAzFDcAZIbiBoDMUNwAkBmKGwAyQ3EDQGYobgDIDMUNAJnpijMnbe+T9KPUOTJ0oqTHU4dAJfC7tjyPR0T9yI1dUdxYHtvjETGYOge6H79r7cVUCQBkhuIGgMxQ3NU2nDoAKoPftTZijhsAMsMRNwBkhuIGgMxQ3BVgu277YdsTtt+3wP4Vtv+jtf+/bZ/e+ZTIne1rbf/M9v1H2W/bn2j9nt1n+7xOZ+wWFHeXs90r6VOSXi/pLElvtX3WES/7W0k/j4gBSR+X9JHOpkSX+KKk3zpZZI7XS1rTemyU9JkOZOpKFHf3O1/SREQ8GhG/kXSdpIuPeM3FkkZaz78q6ULbC97rDjiaiPi2pCcWecnFkrbFrDslvcj2qZ1J110o7u7XL+knc8aPtbYt+JqImJb0C0m1jqRDlTyX30U8BxQ3AGSG4u5+k5JeMmd8Wmvbgq+x3SfpdyU1O5IOVfJcfhfxHFDc3e+7ktbYPsP2cZIulbT9iNdsl9RoPf9LSd8MzsxC+22XtKG1uuSPJP0iIvamDpWjvtQBUKyImLb9Lkk3S+qVdG1EPGD7Q5LGI2K7pC9I+jfbE5r9cunSdImRK9tflrRO0om2H5P0QUnPk6SI+KykmyRdJGlC0n5Jf5Mmaf445R0AMsNUCQBkhuIGgMxQ3ACQGYobADJDcQNAZihuVJ7tU2xfZ/sR23fZvsn2mUe7yh2QGuu4UWmti2ndIGkkIi5tbXulpJOTBgMWwRE3qu5PJT3dOkFEkhQR92rOxZBsn277dtt3tx5/3Np+qu1v277H9v22L7Dda/uLrfFu2+/u/F8J3Y4jblTdOZLuepbX/EzS+oj4te01kr4saVDSX0m6OSKubF33fKWkcyX1R8Q5kmT7RcVFR1VR3MCze56kT9o+V9IhSWe2tn9X0rW2nyfpaxFxj+1HJb3M9lZJX5e0M0lidDWmSlB1D0h69bO85t2SfirplZo90j5OOnzjgNdq9gp3X7S9ISJ+3nrdrZIuk/T5YmKjyihuVN03Ja2wvfGZDbZfofmXH/1dSXsjYkbS2zR7sS7Zfqmkn0bE5zRb0OfZPlFST0RcL+kDkrivItqOqRJUWkSE7TdJ+hfb/yjp15L2SPr7OS/7tKTrbW+QNCrpqdb2dZL+wfbTkp6UtEGzd3T5V9vPHBRdXvhfApXD1QEBIDNMlQBAZihuAMgMxQ0AmaG4ASAzFDcAZIbiBoDMUNwAkJn/A6mkPUGHxSxTAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8tiBGhSX1tKn",
+ "outputId": "71fb07d6-e90f-4195-a52a-55c5810c201f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 400
+ }
+ },
+ "source": [
+ "# DataViz\n",
+ "sns.catplot(x='Class', y='V2', data=df_cc, kind='box')"
+ ],
+ "execution_count": 65,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 65
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAFuCAYAAAChovKPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAVQ0lEQVR4nO3df5DcdX3H8dc7t0iDDgWOTKAXorEXbZEBhC3VzpDG5k4PphDtr4nO9K7qzMUZTYA/WkU7daYdHPHHtDH+6KRK2XOoFOtYg4aFO53O4bRUTksMKNQtJpCThGSRXyYE9u7dP26TuQuXDXu53c/3vd/nYybDfb677r0zOZ58/Oa73zV3FwAgjiWpBwAANIdwA0AwhBsAgiHcABAM4QaAYAqpB1gMAwMDXi6XU48BAIvN5jvYETvugwcPph4BANqmI8INAHlCuAEgGMINAMEQbgAIhnADQDCEGwCCIdwAEAzhBoBgCDcABEO4ASAYwg0AwRBuAAimI+4OCKCxrVu3qlKpJPnek5OTkqSenp4k37+3t1ebNm1K8r1bJemO28xuMbMnzezBWcfOMbNRM/tZ/Z9np5wRwKk5fPiwDh8+nHqMjmIpP+XdzNZIel7SiLtfVD/2KUlPufsnzewjks529w83ep1isegTExOtHxhA06677jpJ0pYtWxJPElL27sft7uOSnjru8HpJpfrXJUnvbOtQAJBxWfzLyeXu/kT9632SlqccBgCyJovhPsZnzuPMey7HzIbNbMLMJg4cONDmyQAgnSyGe7+ZnS9J9X8+Od+T3H2buxfdvbhs2bK2DtgJqtWqNm/erGq1mnoUAE3KYri3Sxqqfz0k6VsJZ+lYpVJJu3bt0sjISOpRADQp9eWAX5P0X5LeaGZ7zez9kj4pqd/Mfiapr77GIqpWqyqXy3J3lctldt1AMKmvKnm3u5/v7qe5+wp3/4q7V919nbuvdvc+dz/+qhOcolKppOnpaUnS1NQUu24gmCyeKkGLjY2NqVarSZJqtZpGR0cTTwSgGYQ7h6688sqGawDZRrhzKOW7ZQGcOsKdQ9///vfnrO+9995EkwBYCMKdQ5wqAWIj3DnEqRIgNsKdQ8efGhkfH080CYCFINw5tHz58oZrANlGuHNo//79DdcAso1w59CaNWsargFkG+HOoRdeeGHO+siRI4kmAbAQhDuHuI4biI1wA0AwhdQDoP3OOuusObdyPfvssxNOkx9bt25VpVJJPUbbHf09H/3Q4Dzp7e3Vpk2bFv11CXcOHX//7YMHDyaaJF8qlYp+9tD/aOVrplKP0lavemnm/9gf2TOReJL2euz5rpa9NuEG2mjla6b00cueTT0G2uATPzqzZa/NOW4ACIZwA0AwhBsAgiHcABAM4QaAYAg3AARDuAEgGMINAMEQbgAIhnADQDCEGwCCIdwAEAzhBoBgCDcABEO4ASAYwg0AwfBBCkCbTE5O6lfPdbX0BvvIjj3PdenVk5MteW123AAQDDtuoE16enp0pPYEH12WE5/40Zk6vaenJa/NjhsAgiHcABAM4QaAYAg3AARDuAEgGMINAMFwOSDQRo89n7834Ow/NLM/XH7GdOJJ2uux57u0ukWvTbgT2rp1qyqVSuoxJEnXXXddW79fb2+vNm3a1NbvmVpvb2/qEZJ4sf4zfvpr8/X7X63W/ZkTbqBN8vYfqqOObgq2bNmSeJLOQbgTSvUv8tq1a192jH+pgDj4y8kc+sxnPjNn/dnPfjbRJAAWgnDnULFYnLO+/PLLE00CYCEId06tWrVKErttICLCnVNnnnmmLrnkEnbbQECEGwCCIdwAEAzhBoBgCDcABJP7N+Bk6W3n7XT099zut7pnQR7fbo/OkvtwVyoVPfDgTzV1xjmpR2mrJS+6JOmHj+5PPEl7dR16KvUIwCnLfbglaeqMc3T4t65OPQbaYOnDO1KPAJyyzJ7jNrMBM3vE
zCpm9pHU8wBAVmRyx21mXZK+IKlf0l5J95vZdnf/yWJ/r8nJSXUdeoadWE50HapqcrKWegzglGR1x32FpIq7P+ruL0q6XdL6xDMBQCZkcsctqUfS47PWeyX97uwnmNmwpGFJWrly5cK/UU+P9h0pcI47J5Y+vEM9PctTjwGckqyG+6TcfZukbZJULBb9VF6r69BTuTtVsuSFZyVJ07+Wr4/RmrmqhHAjtqyGe1LSBbPWK+rHFl1eP06qUnlOktT7+rxFbHlu/8zRObIa7vslrTazVZoJ9gZJ72nFN8rrGzH4OCkgrkyG291rZvYhSXdL6pJ0i7s/lHgsAMiETIZbktx9h6R8nXgGgFcgq5cDAgBOgHADQDCEGwCCIdw5tWfPHu3cuZMPCwYCItw59fTTT0uS7rzzzsSTAGgW4c6hm2++ec6aXTcQS2YvB8yDVJ++s3PnzjnrO++8U4899lhbZ+BTaICFY8cNAMGw404o1Y5z7dq1LzvGW9+BONhxA0AwhBsAgiHcABAM4QaAYAg3AATDVSVADqR6z4CkY9/36Id3tFsnvmeAcANoqaVLl6YeoeMQbiAHOm3HmXec4waAYAg3AARDuAEgGMINAMEQbgAIhnADQDCEGwCCIdwAEAzhBoBgCDcABEO4ASAYwg0AwRBuAAiGcANAMIQbAIIh3AAQDOEGgGAINwAEQ7gBIBjCDQDBEG4ACIZwA0AwhDuHzj333IZrANlGuHPo4MGDDdcAso1wA0AwhDuHli5d2nANINsIdw4dOXKk4RpAthHuHJqenm64BpBthBsAgiHcObRixYqGawDZRrhz6Prrr5+zvuGGGxJNAmAhCHcOjY+PN1wDyDbCnUOjo6Nz1vfcc0+iSQAsBOHOoeXLlzdcA8g2wp1D+/bta7gGkG2EO4fOO++8hmsA2Ua4c4gdNxAb4c6hc845p+EaQLYR7hz6xS9+0XANINuShNvM/tTMHjKzaTMrHvfYjWZWMbNHzOwdKeYDgCxLteN+UNIfSZrzzg8zu1DSBklvkjQg6Ytm1tX+8TrbGWec0XANINuShNvdf+ruj8zz0HpJt7v7EXf/uaSKpCvaO13n47auQGxZO8fdI+nxWeu99WMvY2bDZjZhZhMHDhxoy3AAkAUtC7eZjZnZg/P8Wr8Yr+/u29y96O7FZcuWLcZL5sa6devmrPv6+hJNAmAhCq16YXdfSA0mJV0wa72ifgyLaOPGjXPuTzI8PJxwGgDNytqpku2SNpjZ6Wa2StJqST9IPBMAZEqqywHfZWZ7Jb1V0nfM7G5JcveHJN0h6SeSypI+6O5TKWbsZKVSSWYmSTIzjYyMJJ4IQDPM3VPPcMqKxaJPTEykHiOMq666SocPHz62Xrp0qe66666EEwE4AZvvYNZOlaANuK0rEBvhzqH9+/c3XAPINsKdQ/39/XPWb3/72xNNAmAhCHcOXXvttXPW11xzTaJJACxEw3Cb2Zlm9pvzHL+4dSOh1e644445669//euJJgGwECcMt5n9maSHJX2jfie/35n18K2tHgytw4cFA7E12nF/VNLl7n6ppPdK+qqZvav+2LyXqCCG4y8B7YRLQoE8afSW94K7PyFJ7v4DM3ubpG+b2QWS+DcdABJptON+dvb57XrE12rm1qtvavFcAIATaBTupyWdP/uAuz+nmQ84eF8rh0JrdXV1NVwDyLZG4b5b0qfNbLeZfcrM3ixJ7v6Su9/WnvHQCsffxpXbugKxnDDc7r7F3d8q6fclVSXdYmYPm9nHzWx12ybEohseHtaSJTN/9EuWLOG2rkAwJ30Djrvvcfeb3f3Nkt4t6Z2auUwQQXV3d2vNmjWSpDVr1qi7uzvxRACacdJwm1nBzK4xs9sk3SXpEc180C8AIIETXg5oZv2a2WFfrZkPM7hd0rC7/6pNs6FFqtWqxsfHJUnj4+OqVqvsuoFAGu24b5T0n5J+292vdfd/IdqdYdu2bZqenpYkTU9Pa9u2bYknAtCM
Rn85+Qfu/mV3/2U7B0Lrffe73224BpBt3B0wh3jLOxAb4c6hdevWzVlzHTcQC+HOoY0bN3IdNxAY4c6h7u7uY7vs/v5+rigBgml0d0B0sI0bN2rfvn3stoGArBP+YqpYLPrExETqMQBgsc372QecKgGAYAg3AARDuAEgGMINAMEQ7pyqVqvavHmzqtVq6lEANIlw51SpVNKuXbs0MjKSehQATSLcOVStVlUul+XuKpfL7LqBYAh3DpVKpWO3dZ2ammLXDQRDuHNobGxMtVpNklSr1TQ6Opp4IgDNINw51NfXp0Jh5m4HhUJB/f39iScC0AzCnUNDQ0PH7g7Y1dWlwcHBxBMBaAbhzqHu7m4NDAzIzDQwMMDdAYFguDtgTg0NDWn37t3stoGAuDsgAGQXdwcEgE5AuAEgGMINAMEQbgAIhnADQDCEGwCCIdwAEAzhBoBgCDcABEO4ASAYwg0AwRBuAAiGcANAMIQbAIIh3AAQDOEGgGAINwAEQ7gBIJgk4TazT5vZw2b2YzP7ppmdNeuxG82sYmaPmNk7UswHAFmWasc9Kukid79Y0v9KulGSzOxCSRskvUnSgKQvmllXohkBIJOShNvd73H3Wn15n6QV9a/XS7rd3Y+4+88lVSRdkWJGAMiqLJzjfp+ku+pf90h6fNZje+vHAAB1hVa9sJmNSTpvnoc+5u7fqj/nY5Jqkm5bwOsPSxqWpJUrV57CpAAQS8vC7e59jR43s7+Q9IeS1rm71w9PSrpg1tNW1I/N9/rbJG2TpGKx6PM9BwA6UaqrSgYk/ZWka9390KyHtkvaYGanm9kqSasl/SDFjACQVS3bcZ/E5yWdLmnUzCTpPnf/gLs/ZGZ3SPqJZk6hfNDdpxLNCACZlCTc7t7b4LGbJN3UxnEAIJQsXFUCAGgC4QaAYAg3AARDuAEgGMINAMEQbgAIhnADQDCEGwCCIdwAEAzhBoBgCDcABEO4ASAYwg0AwRBuAAiGcANAMIQbAIIh3AAQDOEGgGAINwAEQ7gBIBjCDQDBEG4ACIZwA0AwhBsAgiHcABAM4QaAYAg3AARDuHOqWq1q8+bNqlarqUcB0CTCnVOlUkm7du3SyMhI6lEANIlw51C1WlW5XJa7q1wus+sGgiHcOVQqlTQ9PS1JmpqaYtcNBEO4c2hsbEy1Wk2SVKvVNDo6mngiAM0g3DnU19enQqEgSSoUCurv7088EYBmEO4cGhoa0pIlM3/0XV1dGhwcTDwRgGYQ7hzq7u7WwMCAzEwDAwPq7u5OPRKAJhRSD4A0hoaGtHv3bnbbQEDm7qlnOGXFYtEnJiZSjwEAi83mO8ipEgAIhnADQDCEGwCCIdwAEAzhBoBgCDcABEO4ASAYwg0AwRBuAAiGcANAMIQbAIIh3AAQDOEGgGAINwAEQ7gBIBjCDQDBEG4ACIZwA0AwhBsAgkkSbjP7OzP7sZk9YGb3mNlv1I+bmX3OzCr1xy9LMR8AZFmqHfen3f1id79U0rcl/U39+FWSVtd/DUv6UqL5ACCzkoTb3Z+dtXy1pKMfNb9e0ojPuE/SWWZ2ftsHBIAMK6T6xmZ2k6RBSc9Ielv9cI+kx2c9bW/92BPz/O+HNbMr18qVK1s6KwBkSct23GY2ZmYPzvNrvSS5+8fc/QJJt0n6ULOv7+7b3L3o7sVly5Yt9vgAkFkt23G7e98rfOptknZI+rikSUkXzHpsRf0YAKAu1VUlq2ct10t6uP71dkmD9atL3iLpGXd/2WkSAMizVOe4P2lmb5Q0LWmPpA/Uj++QdLWkiqRDkt6bZjwAyK4k4Xb3Pz7BcZf0wTaPAwCh8M5JAAiGcANAMIQbAIIh3AAQDOEGgGAINwAEQ7gBIBjCDQDBEG4ACIZwA0AwhBsAgiHcABAM4QaAYAg3AARDuAEgGMINAMEQbgAIhnADQDCEGwCCIdwAEAzhzqlqtarNmzerWq2mHgVAkwh3TpVKJe3atUsjIyOpRwHQJMKdQ9VqVeVy
We6ucrnMrhsIhnDnUKlU0vT0tCRpamqKXTcQDOHOobGxMdVqNUlSrVbT6Oho4okANINw51BfX58KhYIkqVAoqL+/P/FEAJpBuHNoaGhIS5bM/NF3dXVpcHAw8UQAmkG4c6i7u1sDAwMyMw0MDKi7uzv1SACaUEg9ANIYGhrS7t272W0DAZm7p57hlBWLRZ+YmEg9BgAsNpvvIKdKACAYwg0AwRBuAAiGcANAMIQbAIIh3AAQDOEGgGAINwAEQ7gBIJiOeOekmR2QtCf1HAGdK+lg6iGQC/ysLcxBdx84/mBHhBsLY2YT7l5MPQc6Hz9ri4tTJQAQDOEGgGAId75tSz0AcoOftUXEOW4ACIYdNwAEQ7gBIBjCnQNmNmBmj5hZxcw+Ms/jp5vZv9Yf/28ze137p0R0ZnaLmT1pZg+e4HEzs8/Vf85+bGaXtXvGTkG4O5yZdUn6gqSrJF0o6d1mduFxT3u/pF+6e6+kv5d0c3unRIe4VdLL3iwyy1WSVtd/DUv6Uhtm6kiEu/NdIani7o+6+4uSbpe0/rjnrJdUqn/9b5LWmdm8n3UHnIi7j0t6qsFT1ksa8Rn3STrLzM5vz3SdhXB3vh5Jj89a760fm/c57l6T9Iyk7rZMhzx5JT+LeAUINwAEQ7g736SkC2atV9SPzfscMytI+nVJ1bZMhzx5JT+LeAUId+e7X9JqM1tlZq+StEHS9uOes13SUP3rP5H0PeedWVh82yUN1q8ueYukZ9z9idRDRVRIPQBay91rZvYhSXdL6pJ0i7s/ZGZ/K2nC3bdL+oqkr5pZRTN/ubQh3cSIysy+JmmtpHPNbK+kj0s6TZLc/R8l7ZB0taSKpEOS3ptm0vh4yzsABMOpEgAIhnADQDCEGwCCIdwAEAzhBoBgCDdyz8zOM7Pbzez/zOyHZrbDzN5worvcAalxHTdyrX4zrW9KKrn7hvqxSyQtTzoY0AA7buTd2yS9VH+DiCTJ3Xdq1s2QzOx1Znavmf2o/uv36sfPN7NxM3vAzB40syvNrMvMbq2vd5nZDe3/LaHTseNG3l0k6Ycnec6Tkvrd/QUzWy3pa5KKkt4j6W53v6l+3/MzJF0qqcfdL5IkMzurdaMjrwg3cHKnSfq8mV0qaUrSG+rH75d0i5mdJunf3f0BM3tU0uvNbKuk70i6J8nE6GicKkHePSTp8pM85wZJ+yVdopmd9qukYx8csEYzd7i71cwG3f2X9ef9h6QPSPpya8ZGnhFu5N33JJ1uZsNHD5jZxZp7+9Ffl/SEu09L+nPN3KxLZvZaSfvd/Z80E+jLzOxcSUvc/RuS/loSn6uIRcepEuSau7uZvUvSP5jZhyW9IGm3pOtnPe2Lkr5hZoOSypJ+VT++VtJfmtlLkp6XNKiZT3T5ZzM7uim6seW/CeQOdwcEgGA4VQIAwRBuAAiGcANAMIQbAIIh3AAQDOEGgGAINwAE8//mylqAeH1/jgAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HNZd3Ha0104H",
+ "outputId": "6434a04d-1679-4a27-b31f-cba37402bede",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 400
+ }
+ },
+ "source": [
+ "# DataViz\n",
+ "sns.catplot(x='Class', y='V3', data=df_cc, kind='box')"
+ ],
+ "execution_count": 66,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 66
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eYf95Muj2CB_",
+ "outputId": "890172a9-faa2-4ecb-d55c-4a6dce44e6a0",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ }
+ },
+ "source": [
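+ "figura = plt.figure(figsize = (16,20))\n",
+ "# One panel per column V1..V28 on a 5x6 grid\n",
+ "for i in range(1,29):\n",
+ "    plt.subplot(5,6,i)\n",
+ "    plt.plot(df_cc['V'+str(i)])\n",
+ "# Give Amount its own panel (slot 29) instead of drawing it over V28\n",
+ "plt.subplot(5,6,29)\n",
+ "plt.plot(df_cc['Amount'])"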
+ ],
+ "execution_count": 68,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 68
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rhsQe_Tc3yLy",
+ "outputId": "b002ee2b-4859-4206-c102-d1367b015ce6",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 317
+ }
+ },
+ "source": [
+ "df_cc.describe()"
+ ],
+ "execution_count": 70,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>Time</th>\n",
+ "      <th>V1</th>\n",
+ "      <th>V2</th>\n",
+ "      <th>V3</th>\n",
+ "      <th>V4</th>\n",
+ "      <th>V5</th>\n",
+ "      <th>V6</th>\n",
+ "      <th>V7</th>\n",
+ "      <th>V8</th>\n",
+ "      <th>V9</th>\n",
+ "      <th>V10</th>\n",
+ "      <th>V11</th>\n",
+ "      <th>V12</th>\n",
+ "      <th>V13</th>\n",
+ "      <th>V14</th>\n",
+ "      <th>V15</th>\n",
+ "      <th>V16</th>\n",
+ "      <th>V17</th>\n",
+ "      <th>V18</th>\n",
+ "      <th>V19</th>\n",
+ "      <th>V20</th>\n",
+ "      <th>V21</th>\n",
+ "      <th>V22</th>\n",
+ "      <th>V23</th>\n",
+ "      <th>V24</th>\n",
+ "      <th>V25</th>\n",
+ "      <th>V26</th>\n",
+ "      <th>V27</th>\n",
+ "      <th>V28</th>\n",
+ "      <th>Amount</th>\n",
+ "      <th>Class</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>count</th>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12842.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "      <td>12841.000000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>mean</th>\n",
+ "      <td>8949.099984</td>\n",
+ "      <td>-0.216783</td>\n",
+ "      <td>0.275675</td>\n",
+ "      <td>0.875939</td>\n",
+ "      <td>0.280864</td>\n",
+ "      <td>-0.111409</td>\n",
+ "      <td>0.134556</td>\n",
+ "      <td>-0.147548</td>\n",
+ "      <td>-0.031229</td>\n",
+ "      <td>0.966379</td>\n",
+ "      <td>-0.320292</td>\n",
+ "      <td>0.842062</td>\n",
+ "      <td>-1.494439</td>\n",
+ "      <td>0.965904</td>\n",
+ "      <td>0.815934</td>\n",
+ "      <td>-0.177547</td>\n",
+ "      <td>-0.036491</td>\n",
+ "      <td>0.393921</td>\n",
+ "      <td>-0.012121</td>\n",
+ "      <td>-0.072788</td>\n",
+ "      <td>0.021230</td>\n",
+ "      <td>-0.062996</td>\n",
+ "      <td>-0.147793</td>\n",
+ "      <td>-0.035406</td>\n",
+ "      <td>0.015229</td>\n",
+ "      <td>0.113644</td>\n",
+ "      <td>0.043892</td>\n",
+ "      <td>0.011375</td>\n",
+ "      <td>0.000744</td>\n",
+ "      <td>62.219386</td>\n",
+ "      <td>0.004361</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>std</th>\n",
+ "      <td>6914.588371</td>\n",
+ "      <td>1.653324</td>\n",
+ "      <td>1.338732</td>\n",
+ "      <td>1.453434</td>\n",
+ "      <td>1.495532</td>\n",
+ "      <td>1.232153</td>\n",
+ "      <td>1.307300</td>\n",
+ "      <td>1.202073</td>\n",
+ "      <td>1.243610</td>\n",
+ "      <td>1.217670</td>\n",
+ "      <td>1.209806</td>\n",
+ "      <td>1.189877</td>\n",
+ "      <td>1.544952</td>\n",
+ "      <td>1.171303</td>\n",
+ "      <td>1.331752</td>\n",
+ "      <td>0.981075</td>\n",
+ "      <td>0.949526</td>\n",
+ "      <td>1.158501</td>\n",
+ "      <td>0.833046</td>\n",
+ "      <td>0.823737</td>\n",
+ "      <td>0.571957</td>\n",
+ "      <td>0.894494</td>\n",
+ "      <td>0.621258</td>\n",
+ "      <td>0.496013</td>\n",
+ "      <td>0.589287</td>\n",
+ "      <td>0.426605</td>\n",
+ "      <td>0.563938</td>\n",
+ "      <td>0.401608</td>\n",
+ "      <td>0.257426</td>\n",
+ "      <td>175.780115</td>\n",
+ "      <td>0.065897</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>min</th>\n",
+ "      <td>0.000000</td>\n",
+ "      <td>-27.670569</td>\n",
+ "      <td>-34.607649</td>\n",
+ "      <td>-24.667741</td>\n",
+ "      <td>-4.657545</td>\n",
+ "      <td>-32.092129</td>\n",
+ "      <td>-23.496714</td>\n",
+ "      <td>-26.548144</td>\n",
+ "      <td>-23.632502</td>\n",
+ "      <td>-7.175097</td>\n",
+ "      <td>-14.166795</td>\n",
+ "      <td>-2.595325</td>\n",
+ "      <td>-17.769143</td>\n",
+ "      <td>-3.389510</td>\n",
+ "      <td>-19.214325</td>\n",
+ "      <td>-4.152532</td>\n",
+ "      <td>-12.227189</td>\n",
+ "      <td>-18.587366</td>\n",
+ "      <td>-8.061208</td>\n",
+ "      <td>-4.932733</td>\n",
+ "      <td>-13.276034</td>\n",
+ "      <td>-11.468435</td>\n",
+ "      <td>-8.593642</td>\n",
+ "      <td>-19.254328</td>\n",
+ "      <td>-2.512377</td>\n",
+ "      <td>-4.781606</td>\n",
+ "      <td>-1.338556</td>\n",
+ "      <td>-7.976100</td>\n",
+ "      <td>-3.575312</td>\n",
+ "      <td>0.000000</td>\n",
+ "      <td>0.000000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>25%</th>\n",
+ "      <td>2789.250000</td>\n",
+ "      <td>-0.966739</td>\n",
+ "      <td>-0.280216</td>\n",
+ "      <td>0.420540</td>\n",
+ "      <td>-0.631430</td>\n",
+ "      <td>-0.713310</td>\n",
+ "      <td>-0.618201</td>\n",
+ "      <td>-0.612138</td>\n",
+ "      <td>-0.180510</td>\n",
+ "      <td>0.252389</td>\n",
+ "      <td>-0.773721</td>\n",
+ "      <td>0.041619</td>\n",
+ "      <td>-2.442860</td>\n",
+ "      <td>0.148875</td>\n",
+ "      <td>0.218535</td>\n",
+ "      <td>-0.759873</td>\n",
+ "      <td>-0.525637</td>\n",
+ "      <td>-0.078526</td>\n",
+ "      <td>-0.447914</td>\n",
+ "      <td>-0.554832</td>\n",
+ "      <td>-0.159392</td>\n",
+ "      <td>-0.265563</td>\n",
+ "      <td>-0.534956</td>\n",
+ "      <td>-0.171847</td>\n",
+ "      <td>-0.334567</td>\n",
+ "      <td>-0.134560</td>\n",
+ "      <td>-0.372507</td>\n",
+ "      <td>-0.077529</td>\n",
+ "      <td>-0.015002</td>\n",
+ "      <td>5.490000</td>\n",
+ "      <td>0.000000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>50%</th>\n",
+ "      <td>7605.500000</td>\n",
+ "      <td>-0.319439</td>\n",
+ "      <td>0.245807</td>\n",
+ "      <td>0.962057</td>\n",
+ "      <td>0.205730</td>\n",
+ "      <td>-0.195153</td>\n",
+ "      <td>-0.146859</td>\n",
+ "      <td>-0.109023</td>\n",
+ "      <td>0.017735</td>\n",
+ "      <td>0.944073</td>\n",
+ "      <td>-0.373792</td>\n",
+ "      <td>0.782630</td>\n",
+ "      <td>-1.817630</td>\n",
+ "      <td>1.053044</td>\n",
+ "      <td>1.090067</td>\n",
+ "      <td>-0.041798</td>\n",
+ "      <td>0.034157</td>\n",
+ "      <td>0.392384</td>\n",
+ "      <td>0.044854</td>\n",
+ "      <td>-0.069879</td>\n",
+ "      <td>-0.035732</td>\n",
+ "      <td>-0.129139</td>\n",
+ "      <td>-0.115690</td>\n",
+ "      <td>-0.044329</td>\n",
+ "      <td>0.067107</td>\n",
+ "      <td>0.153360</td>\n",
+ "      <td>-0.022228</td>\n",
+ "      <td>-0.000787</td>\n",
+ "      <td>0.015907</td>\n",
+ "      <td>15.300000</td>\n",
+ "      <td>0.000000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>75%</th>\n",
+ "      <td>14441.750000</td>\n",
+ "      <td>1.162983</td>\n",
+ "      <td>0.875673</td>\n",
+ "      <td>1.610908</td>\n",
+ "      <td>1.169584</td>\n",
+ "      <td>0.337192</td>\n",
+ "      <td>0.508040</td>\n",
+ "      <td>0.420298</td>\n",
+ "      <td>0.265311</td>\n",
+ "      <td>1.643443</td>\n",
+ "      <td>0.133831</td>\n",
+ "      <td>1.648365</td>\n",
+ "      <td>-0.248728</td>\n",
+ "      <td>1.836428</td>\n",
+ "      <td>1.544681</td>\n",
+ "      <td>0.504464</td>\n",
+ "      <td>0.534263</td>\n",
+ "      <td>0.873357</td>\n",
+ "      <td>0.485550</td>\n",
+ "      <td>0.448875</td>\n",
+ "      <td>0.140779</td>\n",
+ "      <td>0.020585</td>\n",
+ "      <td>0.234024</td>\n",
+ "      <td>0.071117</td>\n",
+ "      <td>0.397918</td>\n",
+ "      <td>0.388684</td>\n",
+ "      <td>0.391314</td>\n",
+ "      <td>0.101575</td>\n",
+ "      <td>0.071500</td>\n",
+ "      <td>50.000000</td>\n",
+ "      <td>0.000000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>max</th>\n",
+ "      <td>22549.000000</td>\n",
+ "      <td>1.960497</td>\n",
+ "      <td>10.558600</td>\n",
+ "      <td>4.101716</td>\n",
+ "      <td>11.927512</td>\n",
+ "      <td>34.099309</td>\n",
+ "      <td>21.393069</td>\n",
+ "      <td>34.303177</td>\n",
+ "      <td>8.675685</td>\n",
+ "      <td>10.392889</td>\n",
+ "      <td>12.259949</td>\n",
+ "      <td>12.018913</td>\n",
+ "      <td>3.774837</td>\n",
+ "      <td>4.465413</td>\n",
+ "      <td>7.692209</td>\n",
+ "      <td>3.635042</td>\n",
+ "      <td>4.816252</td>\n",
+ "      <td>9.253526</td>\n",
+ "      <td>4.295648</td>\n",
+ "      <td>4.555359</td>\n",
+ "      <td>8.012574</td>\n",
+ "      <td>22.614889</td>\n",
+ "      <td>4.534454</td>\n",
+ "      <td>13.876221</td>\n",
+ "      <td>3.200201</td>\n",
+ "      <td>5.525093</td>\n",
+ "      <td>3.517346</td>\n",
+ "      <td>8.254376</td>\n",
+ "      <td>4.860769</td>\n",
+ "      <td>7712.430000</td>\n",
+ "      <td>1.000000</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Time V1 ... Amount Class\n",
+ "count 12842.000000 12842.000000 ... 12841.000000 12841.000000\n",
+ "mean 8949.099984 -0.216783 ... 62.219386 0.004361\n",
+ "std 6914.588371 1.653324 ... 175.780115 0.065897\n",
+ "min 0.000000 -27.670569 ... 0.000000 0.000000\n",
+ "25% 2789.250000 -0.966739 ... 5.490000 0.000000\n",
+ "50% 7605.500000 -0.319439 ... 15.300000 0.000000\n",
+ "75% 14441.750000 1.162983 ... 50.000000 0.000000\n",
+ "max 22549.000000 1.960497 ... 7712.430000 1.000000\n",
+ "\n",
+ "[8 rows x 31 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 70
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fnqd6WKc62z7"
+ },
+ "source": [
+ "limite_superior_outlier = q3+1.5*iqr\n"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "X9k16WLI49JI",
+ "outputId": "63775512-a869-46b2-a566-53059de569ce",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# dropna() returns a new DataFrame, so no explicit copy is needed\n",
+ "df_cc2 = df_cc.dropna()\n",
+ "df_cc2.shape"
+ ],
+ "execution_count": 71,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(12841, 31)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 71
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IM_qvu0G7DDg",
+ "outputId": "1b7b1084-75d4-4b0e-fb0b-aa669c4d9e3e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 439
+ }
+ },
+ "source": [
+ "l_preditoras = df_cc2.iloc[:,1:30]\n",
+ "l_preditoras"
+ ],
+ "execution_count": 72,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>V1</th>\n",
+ "      <th>V2</th>\n",
+ "      <th>V3</th>\n",
+ "      <th>V4</th>\n",
+ "      <th>V5</th>\n",
+ "      <th>V6</th>\n",
+ "      <th>V7</th>\n",
+ "      <th>V8</th>\n",
+ "      <th>V9</th>\n",
+ "      <th>V10</th>\n",
+ "      <th>V11</th>\n",
+ "      <th>V12</th>\n",
+ "      <th>V13</th>\n",
+ "      <th>V14</th>\n",
+ "      <th>V15</th>\n",
+ "      <th>V16</th>\n",
+ "      <th>V17</th>\n",
+ "      <th>V18</th>\n",
+ "      <th>V19</th>\n",
+ "      <th>V20</th>\n",
+ "      <th>V21</th>\n",
+ "      <th>V22</th>\n",
+ "      <th>V23</th>\n",
+ "      <th>V24</th>\n",
+ "      <th>V25</th>\n",
+ "      <th>V26</th>\n",
+ "      <th>V27</th>\n",
+ "      <th>V28</th>\n",
+ "      <th>Amount</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>-1.359807</td>\n",
+ "      <td>-0.072781</td>\n",
+ "      <td>2.536347</td>\n",
+ "      <td>1.378155</td>\n",
+ "      <td>-0.338321</td>\n",
+ "      <td>0.462388</td>\n",
+ "      <td>0.239599</td>\n",
+ "      <td>0.098698</td>\n",
+ "      <td>0.363787</td>\n",
+ "      <td>0.090794</td>\n",
+ "      <td>-0.551600</td>\n",
+ "      <td>-0.617801</td>\n",
+ "      <td>-0.991390</td>\n",
+ "      <td>-0.311169</td>\n",
+ "      <td>1.468177</td>\n",
+ "      <td>-0.470401</td>\n",
+ "      <td>0.207971</td>\n",
+ "      <td>0.025791</td>\n",
+ "      <td>0.403993</td>\n",
+ "      <td>0.251412</td>\n",
+ "      <td>-0.018307</td>\n",
+ "      <td>0.277838</td>\n",
+ "      <td>-0.110474</td>\n",
+ "      <td>0.066928</td>\n",
+ "      <td>0.128539</td>\n",
+ "      <td>-0.189115</td>\n",
+ "      <td>0.133558</td>\n",
+ "      <td>-0.021053</td>\n",
+ "      <td>149.62</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>1.191857</td>\n",
+ "      <td>0.266151</td>\n",
+ "      <td>0.166480</td>\n",
+ "      <td>0.448154</td>\n",
+ "      <td>0.060018</td>\n",
+ "      <td>-0.082361</td>\n",
+ "      <td>-0.078803</td>\n",
+ "      <td>0.085102</td>\n",
+ "      <td>-0.255425</td>\n",
+ "      <td>-0.166974</td>\n",
+ "      <td>1.612727</td>\n",
+ "      <td>1.065235</td>\n",
+ "      <td>0.489095</td>\n",
+ "      <td>-0.143772</td>\n",
+ "      <td>0.635558</td>\n",
+ "      <td>0.463917</td>\n",
+ "      <td>-0.114805</td>\n",
+ "      <td>-0.183361</td>\n",
+ "      <td>-0.145783</td>\n",
+ "      <td>-0.069083</td>\n",
+ "      <td>-0.225775</td>\n",
+ "      <td>-0.638672</td>\n",
+ "      <td>0.101288</td>\n",
+ "      <td>-0.339846</td>\n",
+ "      <td>0.167170</td>\n",
+ "      <td>0.125895</td>\n",
+ "      <td>-0.008983</td>\n",
+ "      <td>0.014724</td>\n",
+ "      <td>2.69</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>-1.358354</td>\n",
+ "      <td>-1.340163</td>\n",
+ "      <td>1.773209</td>\n",
+ "      <td>0.379780</td>\n",
+ "      <td>-0.503198</td>\n",
+ "      <td>1.800499</td>\n",
+ "      <td>0.791461</td>\n",
+ "      <td>0.247676</td>\n",
+ "      <td>-1.514654</td>\n",
+ "      <td>0.207643</td>\n",
+ "      <td>0.624501</td>\n",
+ "      <td>0.066084</td>\n",
+ "      <td>0.717293</td>\n",
+ "      <td>-0.165946</td>\n",
+ "      <td>2.345865</td>\n",
+ "      <td>-2.890083</td>\n",
+ "      <td>1.109969</td>\n",
+ "      <td>-0.121359</td>\n",
+ "      <td>-2.261857</td>\n",
+ "      <td>0.524980</td>\n",
+ "      <td>0.247998</td>\n",
+ "      <td>0.771679</td>\n",
+ "      <td>0.909412</td>\n",
+ "      <td>-0.689281</td>\n",
+ "      <td>-0.327642</td>\n",
+ "      <td>-0.139097</td>\n",
+ "      <td>-0.055353</td>\n",
+ "      <td>-0.059752</td>\n",
+ "      <td>378.66</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>-0.966272</td>\n",
+ "      <td>-0.185226</td>\n",
+ "      <td>1.792993</td>\n",
+ "      <td>-0.863291</td>\n",
+ "      <td>-0.010309</td>\n",
+ "      <td>1.247203</td>\n",
+ "      <td>0.237609</td>\n",
+ "      <td>0.377436</td>\n",
+ "      <td>-1.387024</td>\n",
+ "      <td>-0.054952</td>\n",
+ "      <td>-0.226487</td>\n",
+ "      <td>0.178228</td>\n",
+ "      <td>0.507757</td>\n",
+ "      <td>-0.287924</td>\n",
+ "      <td>-0.631418</td>\n",
+ "      <td>-1.059647</td>\n",
+ "      <td>-0.684093</td>\n",
+ "      <td>1.965775</td>\n",
+ "      <td>-1.232622</td>\n",
+ "      <td>-0.208038</td>\n",
+ "      <td>-0.108300</td>\n",
+ "      <td>0.005274</td>\n",
+ "      <td>-0.190321</td>\n",
+ "      <td>-1.175575</td>\n",
+ "      <td>0.647376</td>\n",
+ "      <td>-0.221929</td>\n",
+ "      <td>0.062723</td>\n",
+ "      <td>0.061458</td>\n",
+ "      <td>123.50</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>-1.158233</td>\n",
+ "      <td>0.877737</td>\n",
+ "      <td>1.548718</td>\n",
+ "      <td>0.403034</td>\n",
+ "      <td>-0.407193</td>\n",
+ "      <td>0.095921</td>\n",
+ "      <td>0.592941</td>\n",
+ "      <td>-0.270533</td>\n",
+ "      <td>0.817739</td>\n",
+ "      <td>0.753074</td>\n",
+ "      <td>-0.822843</td>\n",
+ "      <td>0.538196</td>\n",
+ "      <td>1.345852</td>\n",
+ "      <td>-1.119670</td>\n",
+ "      <td>0.175121</td>\n",
+ "      <td>-0.451449</td>\n",
+ "      <td>-0.237033</td>\n",
+ "      <td>-0.038195</td>\n",
+ "      <td>0.803487</td>\n",
+ "      <td>0.408542</td>\n",
+ "      <td>-0.009431</td>\n",
+ "      <td>0.798278</td>\n",
+ "      <td>-0.137458</td>\n",
+ "      <td>0.141267</td>\n",
+ "      <td>-0.206010</td>\n",
+ "      <td>0.502292</td>\n",
+ "      <td>0.219422</td>\n",
+ "      <td>0.215153</td>\n",
+ "      <td>69.99</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>...</th>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>12836</th>\n",
+ "      <td>1.216532</td>\n",
+ "      <td>-0.314522</td>\n",
+ "      <td>1.134570</td>\n",
+ "      <td>0.302071</td>\n",
+ "      <td>-1.047467</td>\n",
+ "      <td>-0.226341</td>\n",
+ "      <td>-0.808963</td>\n",
+ "      <td>0.011571</td>\n",
+ "      <td>2.484110</td>\n",
+ "      <td>-0.749128</td>\n",
+ "      <td>-0.113215</td>\n",
+ "      <td>-2.463177</td>\n",
+ "      <td>1.217232</td>\n",
+ "      <td>1.078202</td>\n",
+ "      <td>-0.353184</td>\n",
+ "      <td>0.264467</td>\n",
+ "      <td>0.560170</td>\n",
+ "      <td>0.070299</td>\n",
+ "      <td>0.162726</td>\n",
+ "      <td>-0.101807</td>\n",
+ "      <td>-0.289677</td>\n",
+ "      <td>-0.451358</td>\n",
+ "      <td>0.021372</td>\n",
+ "      <td>0.025676</td>\n",
+ "      <td>0.112433</td>\n",
+ "      <td>0.974426</td>\n",
+ "      <td>-0.067625</td>\n",
+ "      <td>0.007633</td>\n",
+ "      <td>23.27</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>12837</th>\n",
+ "      <td>-1.730579</td>\n",
+ "      <td>2.510772</td>\n",
+ "      <td>-3.816998</td>\n",
+ "      <td>1.981314</td>\n",
+ "      <td>-0.013296</td>\n",
+ "      <td>-2.005823</td>\n",
+ "      <td>-0.761365</td>\n",
+ "      <td>1.439695</td>\n",
+ "      <td>1.029358</td>\n",
+ "      <td>-1.507332</td>\n",
+ "      <td>0.574218</td>\n",
+ "      <td>-3.026862</td>\n",
+ "      <td>1.041710</td>\n",
+ "      <td>-0.878371</td>\n",
+ "      <td>0.412419</td>\n",
+ "      <td>1.463627</td>\n",
+ "      <td>4.071892</td>\n",
+ "      <td>2.149217</td>\n",
+ "      <td>-0.544000</td>\n",
+ "      <td>0.013777</td>\n",
+ "      <td>-0.256248</td>\n",
+ "      <td>-0.705186</td>\n",
+ "      <td>0.012378</td>\n",
+ "      <td>-0.531591</td>\n",
+ "      <td>-0.260890</td>\n",
+ "      <td>-0.398332</td>\n",
+ "      <td>0.078616</td>\n",
+ "      <td>-0.176480</td>\n",
+ "      <td>1.00</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>12838</th>\n",
+ "      <td>0.092413</td>\n",
+ "      <td>0.707487</td>\n",
+ "      <td>1.468534</td>\n",
+ "      <td>0.835819</td>\n",
+ "      <td>0.077369</td>\n",
+ "      <td>0.319184</td>\n",
+ "      <td>-0.309622</td>\n",
+ "      <td>-0.926561</td>\n",
+ "      <td>1.308510</td>\n",
+ "      <td>-0.798995</td>\n",
+ "      <td>-0.374541</td>\n",
+ "      <td>-1.944870</td>\n",
+ "      <td>2.612574</td>\n",
+ "      <td>1.113404</td>\n",
+ "      <td>-0.813064</td>\n",
+ "      <td>-0.454353</td>\n",
+ "      <td>0.802003</td>\n",
+ "      <td>-0.030830</td>\n",
+ "      <td>0.925470</td>\n",
+ "      <td>-0.062606</td>\n",
+ "      <td>0.440212</td>\n",
+ "      <td>-0.720211</td>\n",
+ "      <td>-0.648152</td>\n",
+ "      <td>-0.415473</td>\n",
+ "      <td>1.544434</td>\n",
+ "      <td>0.696797</td>\n",
+ "      <td>0.053918</td>\n",
+ "      <td>0.133374</td>\n",
+ "      <td>10.00</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>12839</th>\n",
+ "      <td>1.105940</td>\n",
+ "      <td>-0.093522</td>\n",
+ "      <td>0.775855</td>\n",
+ "      <td>0.797238</td>\n",
+ "      <td>-0.601505</td>\n",
+ "      <td>-0.372565</td>\n",
+ "      <td>-0.332458</td>\n",
+ "      <td>-0.138450</td>\n",
+ "      <td>1.685372</td>\n",
+ "      <td>-0.491947</td>\n",
+ "      <td>0.192390</td>\n",
+ "      <td>-2.428955</td>\n",
+ "      <td>1.946760</td>\n",
+ "      <td>1.473073</td>\n",
+ "      <td>0.616349</td>\n",
+ "      <td>0.699512</td>\n",
+ "      <td>0.009298</td>\n",
+ "      <td>0.317635</td>\n",
+ "      <td>-0.366135</td>\n",
+ "      <td>0.052355</td>\n",
+ "      <td>-0.220249</td>\n",
+ "      <td>-0.562235</td>\n",
+ "      <td>-0.029329</td>\n",
+ "      <td>-0.164029</td>\n",
+ "      <td>0.174923</td>\n",
+ "      <td>0.197386</td>\n",
+ "      <td>-0.048843</td>\n",
+ "      <td>0.029153</td>\n",
+ "      <td>87.00</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>12840</th>\n",
+ "      <td>1.169834</td>\n",
+ "      <td>0.017381</td>\n",
+ "      <td>0.836739</td>\n",
+ "      <td>1.035206</td>\n",
+ "      <td>-0.671919</td>\n",
+ "      <td>-0.362198</td>\n",
+ "      <td>-0.373648</td>\n",
+ "      <td>-0.005412</td>\n",
+ "      <td>1.745210</td>\n",
+ "      <td>-0.393025</td>\n",
+ "      <td>1.911392</td>\n",
+ "      <td>-1.687593</td>\n",
+ "      <td>0.686310</td>\n",
+ "      <td>1.605153</td>\n",
+ "      <td>-1.768634</td>\n",
+ "      <td>-0.221949</td>\n",
+ "      <td>0.689228</td>\n",
+ "      <td>0.232808</td>\n",
+ "      <td>0.201811</td>\n",
+ "      <td>-0.210741</td>\n",
+ "      <td>-0.182634</td>\n",
+ "      <td>-0.105628</td>\n",
+ "      <td>-0.054807</td>\n",
+ "      <td>0.527694</td>\n",
+ "      <td>0.479663</td>\n",
+ "      <td>0.371689</td>\n",
+ "      <td>-0.051277</td>\n",
+ "      <td>-0.005160</td>\n",
+ "      <td>6.99</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>12841 rows × 29 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " V1 V2 V3 ... V27 V28 Amount\n",
+ "0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62\n",
+ "1 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69\n",
+ "2 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66\n",
+ "3 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50\n",
+ "4 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99\n",
+ "... ... ... ... ... ... ... ...\n",
+ "12836 1.216532 -0.314522 1.134570 ... -0.067625 0.007633 23.27\n",
+ "12837 -1.730579 2.510772 -3.816998 ... 0.078616 -0.176480 1.00\n",
+ "12838 0.092413 0.707487 1.468534 ... 0.053918 0.133374 10.00\n",
+ "12839 1.105940 -0.093522 0.775855 ... -0.048843 0.029153 87.00\n",
+ "12840 1.169834 0.017381 0.836739 ... -0.051277 -0.005160 6.99\n",
+ "\n",
+ "[12841 rows x 29 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 72
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4kZIJ9iN7OTJ",
+ "outputId": "9f2ebae9-940c-4619-91bd-9f68f324afe2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 544
+ }
+ },
+ "source": [
+ "l_preditoras = df_cc2.iloc[:,1:30].columns\n",
+ "l_preditoras2 = list(df_cc2.columns)\n",
+ "l_preditoras2"
+ ],
+ "execution_count": 73,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "['Time',\n",
+ " 'V1',\n",
+ " 'V2',\n",
+ " 'V3',\n",
+ " 'V4',\n",
+ " 'V5',\n",
+ " 'V6',\n",
+ " 'V7',\n",
+ " 'V8',\n",
+ " 'V9',\n",
+ " 'V10',\n",
+ " 'V11',\n",
+ " 'V12',\n",
+ " 'V13',\n",
+ " 'V14',\n",
+ " 'V15',\n",
+ " 'V16',\n",
+ " 'V17',\n",
+ " 'V18',\n",
+ " 'V19',\n",
+ " 'V20',\n",
+ " 'V21',\n",
+ " 'V22',\n",
+ " 'V23',\n",
+ " 'V24',\n",
+ " 'V25',\n",
+ " 'V26',\n",
+ " 'V27',\n",
+ " 'V28',\n",
+ " 'Amount',\n",
+ " 'Class']"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 73
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Sg9IN-m27kbY",
+ "outputId": "e61bea7a-70c0-4bd8-db4b-2d562e497a87",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 163
+ }
+ },
+ "source": [
+ "l_preditoras2.remove('Class')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ttD6UhA27YNB"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OY-DYRKg34ZX"
+ },
+ "source": [
+ "### Define the global variables"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KVhHgV_s3_5f"
+ },
+ "source": [
+ "i_CV = 10 # Number of cross-validation folds\n",
+ "i_Seed = 20111974 # Seed, for reproducibility\n",
+ "f_Test_Size = 0.3 # Proportion of the dataframe held out for validation (other values could be 0.15, 0.20 or 0.25)"
+ ],
+ "execution_count": 74,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wKbqrF4Q2nBq"
+ },
+ "source": [
+ "### Define the training and test samples"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "N8CUAiA57OhS",
+ "outputId": "12824dd9-f1d6-4627-e602-d438047196d1",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 224
+ }
+ },
+ "source": [
+ "df_cc.head()"
+ ],
+ "execution_count": 75,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>Time</th>\n",
+ "      <th>V1</th>\n",
+ "      <th>V2</th>\n",
+ "      <th>V3</th>\n",
+ "      <th>V4</th>\n",
+ "      <th>V5</th>\n",
+ "      <th>V6</th>\n",
+ "      <th>V7</th>\n",
+ "      <th>V8</th>\n",
+ "      <th>V9</th>\n",
+ "      <th>V10</th>\n",
+ "      <th>V11</th>\n",
+ "      <th>V12</th>\n",
+ "      <th>V13</th>\n",
+ "      <th>V14</th>\n",
+ "      <th>V15</th>\n",
+ "      <th>V16</th>\n",
+ "      <th>V17</th>\n",
+ "      <th>V18</th>\n",
+ "      <th>V19</th>\n",
+ "      <th>V20</th>\n",
+ "      <th>V21</th>\n",
+ "      <th>V22</th>\n",
+ "      <th>V23</th>\n",
+ "      <th>V24</th>\n",
+ "      <th>V25</th>\n",
+ "      <th>V26</th>\n",
+ "      <th>V27</th>\n",
+ "      <th>V28</th>\n",
+ "      <th>Amount</th>\n",
+ "      <th>Class</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>0</td>\n",
+ "      <td>-1.359807</td>\n",
+ "      <td>-0.072781</td>\n",
+ "      <td>2.536347</td>\n",
+ "      <td>1.378155</td>\n",
+ "      <td>-0.338321</td>\n",
+ "      <td>0.462388</td>\n",
+ "      <td>0.239599</td>\n",
+ "      <td>0.098698</td>\n",
+ "      <td>0.363787</td>\n",
+ "      <td>0.090794</td>\n",
+ "      <td>-0.551600</td>\n",
+ "      <td>-0.617801</td>\n",
+ "      <td>-0.991390</td>\n",
+ "      <td>-0.311169</td>\n",
+ "      <td>1.468177</td>\n",
+ "      <td>-0.470401</td>\n",
+ "      <td>0.207971</td>\n",
+ "      <td>0.025791</td>\n",
+ "      <td>0.403993</td>\n",
+ "      <td>0.251412</td>\n",
+ "      <td>-0.018307</td>\n",
+ "      <td>0.277838</td>\n",
+ "      <td>-0.110474</td>\n",
+ "      <td>0.066928</td>\n",
+ "      <td>0.128539</td>\n",
+ "      <td>-0.189115</td>\n",
+ "      <td>0.133558</td>\n",
+ "      <td>-0.021053</td>\n",
+ "      <td>149.62</td>\n",
+ "      <td>0.0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>0</td>\n",
+ "      <td>1.191857</td>\n",
+ "      <td>0.266151</td>\n",
+ "      <td>0.166480</td>\n",
+ "      <td>0.448154</td>\n",
+ "      <td>0.060018</td>\n",
+ "      <td>-0.082361</td>\n",
+ "      <td>-0.078803</td>\n",
+ "      <td>0.085102</td>\n",
+ "      <td>-0.255425</td>\n",
+ "      <td>-0.166974</td>\n",
+ "      <td>1.612727</td>\n",
+ "      <td>1.065235</td>\n",
+ "      <td>0.489095</td>\n",
+ "      <td>-0.143772</td>\n",
+ "      <td>0.635558</td>\n",
+ "      <td>0.463917</td>\n",
+ "      <td>-0.114805</td>\n",
+ "      <td>-0.183361</td>\n",
+ "      <td>-0.145783</td>\n",
+ "      <td>-0.069083</td>\n",
+ "      <td>-0.225775</td>\n",
+ "      <td>-0.638672</td>\n",
+ "      <td>0.101288</td>\n",
+ "      <td>-0.339846</td>\n",
+ "      <td>0.167170</td>\n",
+ "      <td>0.125895</td>\n",
+ "      <td>-0.008983</td>\n",
+ "      <td>0.014724</td>\n",
+ "      <td>2.69</td>\n",
+ "      <td>0.0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>1</td>\n",
+ "      <td>-1.358354</td>\n",
+ "      <td>-1.340163</td>\n",
+ "      <td>1.773209</td>\n",
+ "      <td>0.379780</td>\n",
+ "      <td>-0.503198</td>\n",
+ "      <td>1.800499</td>\n",
+ "      <td>0.791461</td>\n",
+ "      <td>0.247676</td>\n",
+ "      <td>-1.514654</td>\n",
+ "      <td>0.207643</td>\n",
+ "      <td>0.624501</td>\n",
+ "      <td>0.066084</td>\n",
+ "      <td>0.717293</td>\n",
+ "      <td>-0.165946</td>\n",
+ "      <td>2.345865</td>\n",
+ "      <td>-2.890083</td>\n",
+ "      <td>1.109969</td>\n",
+ "      <td>-0.121359</td>\n",
+ "      <td>-2.261857</td>\n",
+ "      <td>0.524980</td>\n",
+ "      <td>0.247998</td>\n",
+ "      <td>0.771679</td>\n",
+ "      <td>0.909412</td>\n",
+ "      <td>-0.689281</td>\n",
+ "      <td>-0.327642</td>\n",
+ "      <td>-0.139097</td>\n",
+ "      <td>-0.055353</td>\n",
+ "      <td>-0.059752</td>\n",
+ "      <td>378.66</td>\n",
+ "      <td>0.0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>1</td>\n",
+ "      <td>-0.966272</td>\n",
+ "      <td>-0.185226</td>\n",
+ "      <td>1.792993</td>\n",
+ "      <td>-0.863291</td>\n",
+ "      <td>-0.010309</td>\n",
+ "      <td>1.247203</td>\n",
+ "      <td>0.237609</td>\n",
+ "      <td>0.377436</td>\n",
+ "      <td>-1.387024</td>\n",
+ "      <td>-0.054952</td>\n",
+ "      <td>-0.226487</td>\n",
+ "      <td>0.178228</td>\n",
+ "      <td>0.507757</td>\n",
+ "      <td>-0.287924</td>\n",
+ "      <td>-0.631418</td>\n",
+ "      <td>-1.059647</td>\n",
+ "      <td>-0.684093</td>\n",
+ "      <td>1.965775</td>\n",
+ "      <td>-1.232622</td>\n",
+ "      <td>-0.208038</td>\n",
+ "      <td>-0.108300</td>\n",
+ "      <td>0.005274</td>\n",
+ "      <td>-0.190321</td>\n",
+ "      <td>-1.175575</td>\n",
+ "      <td>0.647376</td>\n",
+ "      <td>-0.221929</td>\n",
+ "      <td>0.062723</td>\n",
+ "      <td>0.061458</td>\n",
+ "      <td>123.50</td>\n",
+ "      <td>0.0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>2</td>\n",
+ "      <td>-1.158233</td>\n",
+ "      <td>0.877737</td>\n",
+ "      <td>1.548718</td>\n",
+ "      <td>0.403034</td>\n",
+ "      <td>-0.407193</td>\n",
+ "      <td>0.095921</td>\n",
+ "      <td>0.592941</td>\n",
+ "      <td>-0.270533</td>\n",
+ "      <td>0.817739</td>\n",
+ "      <td>0.753074</td>\n",
+ "      <td>-0.822843</td>\n",
+ "      <td>0.538196</td>\n",
+ "      <td>1.345852</td>\n",
+ "      <td>-1.119670</td>\n",
+ "      <td>0.175121</td>\n",
+ "      <td>-0.451449</td>\n",
+ "      <td>-0.237033</td>\n",
+ "      <td>-0.038195</td>\n",
+ "      <td>0.803487</td>\n",
+ "      <td>0.408542</td>\n",
+ "      <td>-0.009431</td>\n",
+ "      <td>0.798278</td>\n",
+ "      <td>-0.137458</td>\n",
+ "      <td>0.141267</td>\n",
+ "      <td>-0.206010</td>\n",
+ "      <td>0.502292</td>\n",
+ "      <td>0.219422</td>\n",
+ "      <td>0.215153</td>\n",
+ "      <td>69.99</td>\n",
+ "      <td>0.0</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Time V1 V2 V3 ... V27 V28 Amount Class\n",
+ "0 0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0.0\n",
+ "1 0 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0.0\n",
+ "2 1 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0.0\n",
+ "3 1 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0.0\n",
+ "4 2 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0.0\n",
+ "\n",
+ "[5 rows x 31 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 75
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LZjNUDNb7s1t",
+ "outputId": "ede96084-76d8-47e0-e88e-8a2827ecd2e2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 224
+ }
+ },
+ "source": [
+ "# DataFrame containing the predictor variables:\n",
+ "df_X = df_cc2.drop(columns=['Class'])\n",
+ "df_X.head()"
+ ],
+ "execution_count": 77,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " Time | \n",
+ " V1 | \n",
+ " V2 | \n",
+ " V3 | \n",
+ " V4 | \n",
+ " V5 | \n",
+ " V6 | \n",
+ " V7 | \n",
+ " V8 | \n",
+ " V9 | \n",
+ " V10 | \n",
+ " V11 | \n",
+ " V12 | \n",
+ " V13 | \n",
+ " V14 | \n",
+ " V15 | \n",
+ " V16 | \n",
+ " V17 | \n",
+ " V18 | \n",
+ " V19 | \n",
+ " V20 | \n",
+ " V21 | \n",
+ " V22 | \n",
+ " V23 | \n",
+ " V24 | \n",
+ " V25 | \n",
+ " V26 | \n",
+ " V27 | \n",
+ " V28 | \n",
+ " Amount | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " 0 | \n",
+ " -1.359807 | \n",
+ " -0.072781 | \n",
+ " 2.536347 | \n",
+ " 1.378155 | \n",
+ " -0.338321 | \n",
+ " 0.462388 | \n",
+ " 0.239599 | \n",
+ " 0.098698 | \n",
+ " 0.363787 | \n",
+ " 0.090794 | \n",
+ " -0.551600 | \n",
+ " -0.617801 | \n",
+ " -0.991390 | \n",
+ " -0.311169 | \n",
+ " 1.468177 | \n",
+ " -0.470401 | \n",
+ " 0.207971 | \n",
+ " 0.025791 | \n",
+ " 0.403993 | \n",
+ " 0.251412 | \n",
+ " -0.018307 | \n",
+ " 0.277838 | \n",
+ " -0.110474 | \n",
+ " 0.066928 | \n",
+ " 0.128539 | \n",
+ " -0.189115 | \n",
+ " 0.133558 | \n",
+ " -0.021053 | \n",
+ " 149.62 | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " 0 | \n",
+ " 1.191857 | \n",
+ " 0.266151 | \n",
+ " 0.166480 | \n",
+ " 0.448154 | \n",
+ " 0.060018 | \n",
+ " -0.082361 | \n",
+ " -0.078803 | \n",
+ " 0.085102 | \n",
+ " -0.255425 | \n",
+ " -0.166974 | \n",
+ " 1.612727 | \n",
+ " 1.065235 | \n",
+ " 0.489095 | \n",
+ " -0.143772 | \n",
+ " 0.635558 | \n",
+ " 0.463917 | \n",
+ " -0.114805 | \n",
+ " -0.183361 | \n",
+ " -0.145783 | \n",
+ " -0.069083 | \n",
+ " -0.225775 | \n",
+ " -0.638672 | \n",
+ " 0.101288 | \n",
+ " -0.339846 | \n",
+ " 0.167170 | \n",
+ " 0.125895 | \n",
+ " -0.008983 | \n",
+ " 0.014724 | \n",
+ " 2.69 | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " 1 | \n",
+ " -1.358354 | \n",
+ " -1.340163 | \n",
+ " 1.773209 | \n",
+ " 0.379780 | \n",
+ " -0.503198 | \n",
+ " 1.800499 | \n",
+ " 0.791461 | \n",
+ " 0.247676 | \n",
+ " -1.514654 | \n",
+ " 0.207643 | \n",
+ " 0.624501 | \n",
+ " 0.066084 | \n",
+ " 0.717293 | \n",
+ " -0.165946 | \n",
+ " 2.345865 | \n",
+ " -2.890083 | \n",
+ " 1.109969 | \n",
+ " -0.121359 | \n",
+ " -2.261857 | \n",
+ " 0.524980 | \n",
+ " 0.247998 | \n",
+ " 0.771679 | \n",
+ " 0.909412 | \n",
+ " -0.689281 | \n",
+ " -0.327642 | \n",
+ " -0.139097 | \n",
+ " -0.055353 | \n",
+ " -0.059752 | \n",
+ " 378.66 | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " 1 | \n",
+ " -0.966272 | \n",
+ " -0.185226 | \n",
+ " 1.792993 | \n",
+ " -0.863291 | \n",
+ " -0.010309 | \n",
+ " 1.247203 | \n",
+ " 0.237609 | \n",
+ " 0.377436 | \n",
+ " -1.387024 | \n",
+ " -0.054952 | \n",
+ " -0.226487 | \n",
+ " 0.178228 | \n",
+ " 0.507757 | \n",
+ " -0.287924 | \n",
+ " -0.631418 | \n",
+ " -1.059647 | \n",
+ " -0.684093 | \n",
+ " 1.965775 | \n",
+ " -1.232622 | \n",
+ " -0.208038 | \n",
+ " -0.108300 | \n",
+ " 0.005274 | \n",
+ " -0.190321 | \n",
+ " -1.175575 | \n",
+ " 0.647376 | \n",
+ " -0.221929 | \n",
+ " 0.062723 | \n",
+ " 0.061458 | \n",
+ " 123.50 | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " 2 | \n",
+ " -1.158233 | \n",
+ " 0.877737 | \n",
+ " 1.548718 | \n",
+ " 0.403034 | \n",
+ " -0.407193 | \n",
+ " 0.095921 | \n",
+ " 0.592941 | \n",
+ " -0.270533 | \n",
+ " 0.817739 | \n",
+ " 0.753074 | \n",
+ " -0.822843 | \n",
+ " 0.538196 | \n",
+ " 1.345852 | \n",
+ " -1.119670 | \n",
+ " 0.175121 | \n",
+ " -0.451449 | \n",
+ " -0.237033 | \n",
+ " -0.038195 | \n",
+ " 0.803487 | \n",
+ " 0.408542 | \n",
+ " -0.009431 | \n",
+ " 0.798278 | \n",
+ " -0.137458 | \n",
+ " 0.141267 | \n",
+ " -0.206010 | \n",
+ " 0.502292 | \n",
+ " 0.219422 | \n",
+ " 0.215153 | \n",
+ " 69.99 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " Time V1 V2 V3 ... V26 V27 V28 Amount\n",
+ "0 0 -1.359807 -0.072781 2.536347 ... -0.189115 0.133558 -0.021053 149.62\n",
+ "1 0 1.191857 0.266151 0.166480 ... 0.125895 -0.008983 0.014724 2.69\n",
+ "2 1 -1.358354 -1.340163 1.773209 ... -0.139097 -0.055353 -0.059752 378.66\n",
+ "3 1 -0.966272 -0.185226 1.792993 ... -0.221929 0.062723 0.061458 123.50\n",
+ "4 2 -1.158233 0.877737 1.548718 ... 0.502292 0.219422 0.215153 69.99\n",
+ "\n",
+ "[5 rows x 30 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 77
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "d3DDsN2V7IOU",
+ "outputId": "1b7041e4-631a-47db-bbe3-27d6df2d923d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "df_y = df_cc2['Class'] # Response variable\n",
+ "df_y.head()"
+ ],
+ "execution_count": 78,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0 0.0\n",
+ "1 0.0\n",
+ "2 0.0\n",
+ "3 0.0\n",
+ "4 0.0\n",
+ "Name: Class, dtype: float64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 78
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aMthdXHD8vnh",
+ "outputId": "df66e9bc-b2b8-4902-ad05-defb09aad589",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "df_y.shape"
+ ],
+ "execution_count": 79,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(12841,)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 79
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "EiJRftpZ2103"
+ },
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(df_X, df_y, test_size = f_Test_Size, random_state = i_Seed)"
+ ],
+ "execution_count": 80,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TmSkPzNt8O6I",
+ "outputId": "c9da0fd9-379e-4550-e983-83f8a09e8722",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "X_treinamento.shape"
+ ],
+ "execution_count": 81,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(8988, 30)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 81
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9h1PjPKh8Xb1",
+ "outputId": "5a1e7d6b-c646-43d7-d52a-0372e3292085",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "X_teste.shape\n"
+ ],
+ "execution_count": 82,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(3853, 30)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 82
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "b1w3NuJ5-mKK",
+ "outputId": "1836686d-230c-46f8-93e8-e0d79e667eef",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 221
+ }
+ },
+ "source": [
+ "y_teste"
+ ],
+ "execution_count": 95,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "1224 0.0\n",
+ "11994 0.0\n",
+ "5408 0.0\n",
+ "4385 0.0\n",
+ "8164 0.0\n",
+ " ... \n",
+ "2944 0.0\n",
+ "10949 0.0\n",
+ "12248 0.0\n",
+ "7380 0.0\n",
+ "12327 0.0\n",
+ "Name: Class, Length: 3853, dtype: float64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 95
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NbCN_puI2qk1"
+ },
+ "source": [
+ "### Fit the model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hjRwSI079ADn"
+ },
+ "source": [
+ "# Import the classifier (model, algorithm, ...)\n",
+ "from sklearn.tree import DecisionTreeClassifier # This is our classifier"
+ ],
+ "execution_count": 83,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "HuhKJOQA22bR",
+ "outputId": "b4c73a73-3d0c-4f75-abb5-5f3c8d66a38a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "ml_DT = DecisionTreeClassifier(max_depth = 5, min_samples_split = 2, random_state = i_Seed)\n",
+ "ml_DT"
+ ],
+ "execution_count": 84,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n",
+ " max_depth=5, max_features=None, max_leaf_nodes=None,\n",
+ " min_impurity_decrease=0.0, min_impurity_split=None,\n",
+ " min_samples_leaf=1, min_samples_split=2,\n",
+ " min_weight_fraction_leaf=0.0, presort='deprecated',\n",
+ " random_state=20111974, splitter='best')"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 84
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Zai1d6eM93VQ",
+ "outputId": "cfe02da8-397a-4a0f-8fe5-424dbcdf0282",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "# Train the algorithm/classifier: fit(X, y)\n",
+ "ml_DT.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": 85,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n",
+ " max_depth=5, max_features=None, max_leaf_nodes=None,\n",
+ " min_impurity_decrease=0.0, min_impurity_split=None,\n",
+ " min_samples_leaf=1, min_samples_split=2,\n",
+ " min_weight_fraction_leaf=0.0, presort='deprecated',\n",
+ " random_state=20111974, splitter='best')"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 85
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bCRgHxUu2s7c"
+ },
+ "source": [
+ "### Cross-Validation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2wMWm-p5229A",
+ "outputId": "290582fe-a2e4-4bb8-fed1-7638a8442c0d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# 10-fold cross-validation\n",
+ "a_scores_CV = cross_val_score(ml_DT, X_treinamento, y_treinamento, cv = i_CV)\n",
+ "\n",
+ "print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(), 4)}')\n",
+ "print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(), 4)}')"
+ ],
+ "execution_count": 88,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Média das Acurácias calculadas pelo CV....: 99.9\n",
+ "std médio das Acurácias calculadas pelo CV: 0.09\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m0Jbi5XQ8OZb",
+ "outputId": "ea01e17d-c5b9-4f7f-ce3f-55f317c1aaf5",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "a_scores_CV # array with the score from each CV fold"
+ ],
+ "execution_count": 89,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([1. , 0.99888765, 0.99777531, 0.99777531, 1. ,\n",
+ " 0.99888765, 1. , 0.99888765, 0.99777283, 1. ])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 89
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "g83qQ9RW-Lox",
+ "outputId": "e86d65f5-47cd-4faa-befc-feab86b05378",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "y_pred = ml_DT.predict(X_teste)\n",
+ "y_pred[0:30]"
+ ],
+ "execution_count": 98,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
+ " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 98
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CPGhZKJf84F5",
+ "outputId": "dbf436ce-12b0-497b-a3d8-053e044dd023",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 572
+ }
+ },
+ "source": [
+ "# Confusion matrix for the baseline model\n",
+ "print(f'\\n********* CONFUSION MATRIX - BASELINE MODEL ***********')\n",
+ "cf_matrix = confusion_matrix(y_teste, y_pred)\n",
+ "cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n",
+ "cf_categories = ['Zero', 'One']\n",
+ "mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)"
+ ],
+ "execution_count": 99,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "********* CONFUSION MATRIX - BASELINE MODEL ***********\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Am_UELOg2vDh"
+ },
+ "source": [
+ "### Parameter fine-tuning"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "lF9mxe7y23hr"
+ },
+ "source": [
+ "from sklearn.tree import DecisionTreeClassifier # Decision tree classifier\n",
+ "\n",
+ "# Instantiate the baseline Decision Tree. These values match scikit-learn's defaults; limiting overfitting is left to the grid search:\n",
+ "ml_DT = DecisionTreeClassifier(criterion = 'gini',\n",
+ "                               splitter = 'best',\n",
+ "                               max_depth = None,\n",
+ "                               min_samples_split = 2,\n",
+ "                               min_samples_leaf = 1,\n",
+ "                               min_weight_fraction_leaf = 0.0,\n",
+ "                               max_features = None,\n",
+ "                               random_state = i_Seed,\n",
+ "                               max_leaf_nodes = None,\n",
+ "                               min_impurity_decrease = 0.0,\n",
+ "                               min_impurity_split = None,\n",
+ "                               class_weight = None,\n",
+ "                               presort = False)"
+ ],
+ "execution_count": 100,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8nDOkHnf_AsB",
+ "outputId": "aeceed17-6365-428f-e88a-67c59bba857a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 119
+ }
+ },
+ "source": [
+ "# Train the algorithm: fit(X, y)\n",
+ "ml_DT.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": 101,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n",
+ " max_depth=None, max_features=None, max_leaf_nodes=None,\n",
+ " min_impurity_decrease=0.0, min_impurity_split=None,\n",
+ " min_samples_leaf=1, min_samples_split=2,\n",
+ " min_weight_fraction_leaf=0.0, presort=False,\n",
+ " random_state=20111974, splitter='best')"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 101
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ftIRtZph_f3R"
+ },
+ "source": [
+ "# Parameter dictionary for the parameter tuning. The grid has 2 x 5 x 5 x 5 x 6 = 1,500 candidate combinations; with 10 folds in the cross-validation, that is 15,000 model fits.\n",
+ "d_parametros_DT = {\"criterion\": [\"gini\", \"entropy\"], \n",
+ " \"min_samples_split\": [2, 5, 10, 30, 50], \n",
+ " \"max_depth\": [None, 2, 5, 9, 15], \n",
+ " \"min_samples_leaf\": [20, 40, 60, 80, 100], \n",
+ " \"max_leaf_nodes\": [None, 2, 3, 4, 5, 10]}\n"
+ ],
+ "execution_count": 105,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Fkn7vDRe_igh"
+ },
+ "source": [
+ "# Define the function that runs GridSearchCV and refits the best model\n",
+ "def GridSearchOptimizer(modelo, ml_Opt, d_Parametros, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV):\n",
+ "    ml_GridSearchCV = GridSearchCV(modelo, d_Parametros, cv = cv, n_jobs= -1, verbose= 10, scoring= 'accuracy')\n",
+ "    start = time()\n",
+ "    ml_GridSearchCV.fit(X_treinamento, y_treinamento)\n",
+ "    tempo_elapsed= time()-start\n",
+ "    #print(f\"\\nGridSearchCV levou {tempo_elapsed:.2f} segundos.\")\n",
+ "\n",
+ "    # Parameters that optimize the classification:\n",
+ "    print(f'\\nParametros otimizados: {ml_GridSearchCV.best_params_}')\n",
+ "\n",
+ "    # ml_Opt arrives as a string tag and is replaced by the refitted estimator\n",
+ "    if ml_Opt == 'ml_DT2':\n",
+ "        print(f'\\nDecisionTreeClassifier *********************************************************************************************************')\n",
+ "        ml_Opt = DecisionTreeClassifier(criterion= ml_GridSearchCV.best_params_['criterion'],\n",
+ "                                        max_depth= ml_GridSearchCV.best_params_['max_depth'],\n",
+ "                                        max_leaf_nodes= ml_GridSearchCV.best_params_['max_leaf_nodes'],\n",
+ "                                        min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n",
+ "                                        min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n",
+ "                                        random_state= i_Seed)\n",
+ "\n",
+ "    elif ml_Opt == 'ml_RF2':\n",
+ "        print(f'\\nRandomForestClassifier *********************************************************************************************************')\n",
+ "        ml_Opt = RandomForestClassifier(bootstrap= ml_GridSearchCV.best_params_['bootstrap'],\n",
+ "                                        max_depth= ml_GridSearchCV.best_params_['max_depth'],\n",
+ "                                        max_features= ml_GridSearchCV.best_params_['max_features'],\n",
+ "                                        min_samples_leaf= ml_GridSearchCV.best_params_['min_samples_leaf'],\n",
+ "                                        min_samples_split= ml_GridSearchCV.best_params_['min_samples_split'],\n",
+ "                                        n_estimators= ml_GridSearchCV.best_params_['n_estimators'],\n",
+ "                                        random_state= i_Seed)\n",
+ "\n",
+ "    elif ml_Opt == 'ml_AB2':\n",
+ "        print(f'\\nAdaBoostClassifier *********************************************************************************************************')\n",
+ "        ml_Opt = AdaBoostClassifier(algorithm='SAMME.R',\n",
+ "                                    base_estimator=RandomForestClassifier(bootstrap = False,\n",
+ "                                                                          max_depth = 10,\n",
+ "                                                                          max_features = 'auto',\n",
+ "                                                                          min_samples_leaf = 1,\n",
+ "                                                                          min_samples_split = 2,\n",
+ "                                                                          n_estimators = 400),\n",
+ "                                    learning_rate = ml_GridSearchCV.best_params_['learning_rate'],\n",
+ "                                    n_estimators = ml_GridSearchCV.best_params_['n_estimators'],\n",
+ "                                    random_state = i_Seed)\n",
+ "\n",
+ "    elif ml_Opt == 'ml_GB2':\n",
+ "        print(f'\\nGradientBoostingClassifier *********************************************************************************************************')\n",
+ "        ml_Opt = GradientBoostingClassifier(learning_rate = ml_GridSearchCV.best_params_['learning_rate'],\n",
+ "                                            n_estimators = ml_GridSearchCV.best_params_['n_estimators'],\n",
+ "                                            max_depth = ml_GridSearchCV.best_params_['max_depth'],\n",
+ "                                            min_samples_split = ml_GridSearchCV.best_params_['min_samples_split'],\n",
+ "                                            min_samples_leaf = ml_GridSearchCV.best_params_['min_samples_leaf'],\n",
+ "                                            max_features = ml_GridSearchCV.best_params_['max_features'])\n",
+ "\n",
+ "    elif ml_Opt == 'ml_XGB2':\n",
+ "        print(f'\\nXGBoostingClassifier *********************************************************************************************************')\n",
+ "        # XGBClassifier is xgboost's scikit-learn wrapper (assumes: from xgboost import XGBClassifier)\n",
+ "        ml_Opt = XGBClassifier(learning_rate= ml_GridSearchCV.best_params_['learning_rate'],\n",
+ "                               max_depth= ml_GridSearchCV.best_params_['max_depth'],\n",
+ "                               colsample_bytree= ml_GridSearchCV.best_params_['colsample_bytree'],\n",
+ "                               subsample= ml_GridSearchCV.best_params_['subsample'],\n",
+ "                               gamma= ml_GridSearchCV.best_params_['gamma'],\n",
+ "                               min_child_weight= ml_GridSearchCV.best_params_['min_child_weight'])\n",
+ "\n",
+ "    else:\n",
+ "        raise ValueError(f'Unknown model tag: {ml_Opt}')\n",
+ "\n",
+ "    # Train again using the optimized parameters...\n",
+ "    ml_Opt.fit(X_treinamento, y_treinamento)\n",
+ "\n",
+ "    # Cross-validation with cv folds\n",
+ "    print(f'\\n********* CROSS-VALIDATION ***********')\n",
+ "    a_scores_CV = cross_val_score(ml_Opt, X_treinamento, y_treinamento, cv = cv)\n",
+ "    print(f'Média das Acurácias calculadas pelo CV....: {100*round(a_scores_CV.mean(),4)}')\n",
+ "    print(f'std médio das Acurácias calculadas pelo CV: {100*round(a_scores_CV.std(),4)}')\n",
+ "\n",
+ "    # Make predictions with the optimized parameters...\n",
+ "    y_pred = ml_Opt.predict(X_teste)\n",
+ "\n",
+ "    # Column importances, paired with the column names of the training DataFrame\n",
+ "    print(f'\\n********* IMPORTÂNCIA DAS COLUNAS ***********')\n",
+ "    df_importancia_variaveis = pd.DataFrame(zip(X_treinamento.columns, ml_Opt.feature_importances_), columns= ['coluna', 'importancia'])\n",
+ "    df_importancia_variaveis = df_importancia_variaveis.sort_values(by= ['importancia'], ascending=False)\n",
+ "    print(df_importancia_variaveis)\n",
+ "\n",
+ "    # Confusion matrix\n",
+ "    print(f'\\n********* CONFUSION MATRIX - PARAMETER TUNNING ***********')\n",
+ "    cf_matrix = confusion_matrix(y_teste, y_pred)\n",
+ "    cf_labels = ['True_Negative', 'False_Positive', 'False_Negative', 'True_Positive']\n",
+ "    cf_categories = ['Zero', 'One']\n",
+ "    mostra_confusion_matrix(cf_matrix, group_names = cf_labels, categories = cf_categories)\n",
+ "\n",
+ "    return ml_Opt, ml_GridSearchCV.best_params_"
+ ],
+ "execution_count": 106,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IOp0SmbC_l7h",
+ "outputId": "d9cf147e-a5d9-451d-98c2-563fa129e11c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ }
+ },
+ "source": [
+ "# Call the function with the baseline model\n",
+ "ml_DT2, best_params = GridSearchOptimizer(ml_DT, 'ml_DT2', d_parametros_DT, X_treinamento, y_treinamento, X_teste, y_teste, cv = i_CV)\n"
+ ],
+ "execution_count": 107,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Fitting 10 folds for each of 1500 candidates, totalling 15000 fits\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "stream",
+ "text": [
+ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n",
+ "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.4s\n",
+ "[Parallel(n_jobs=-1)]: Done 15000 out of 15000 | elapsed: 21.1min finished\n"
+ ],
+ "name": "stderr"
+ },
+ {
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Parametros otimizados: {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 20, 'min_samples_split': 2}\n",
+ "\n",
+ "DecisionTreeClassifier *********************************************************************************************************\n",
+ "\n",
+ "********* CROSS-VALIDATION ***********\n",
+ "Média das Acurácias calculadas pelo CV....: 99.83999999999999\n",
+ "std médio das Acurácias calculadas pelo CV: 0.06999999999999999\n",
+ "\n",
+ "********* IMPORTÂNCIA DAS COLUNAS ***********\n",
+ " coluna importancia\n",
+ "12 v13 0.834026\n",
+ "17 v18 0.110131\n",
+ "4 v5 0.015970\n",
+ "14 v15 0.007955\n",
+ "15 v16 0.000761\n",
+ "10 v11 0.000000\n",
+ "16 v17 0.000000\n",
+ "13 v14 0.000000\n",
+ "11 v12 0.000000\n",
+ "0 v1 0.000000\n",
+ "1 v2 0.000000\n",
+ "8 v9 0.000000\n",
+ "7 v8 0.000000\n",
+ "6 v7 0.000000\n",
+ "5 v6 0.000000\n",
+ "3 v4 0.000000\n",
+ "2 v3 0.000000\n",
+ "9 v10 0.000000\n",
+ "\n",
+ "********* CONFUSION MATRIX - PARAMETER TUNNING ***********\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": [],
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QfordmeT_uxi",
+ "outputId": "4fd2292f-378d-4ee4-9cd5-44f96686e610",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 391
+ }
+ },
+ "source": [
+ "from sklearn.tree import export_graphviz\n",
+ "from io import StringIO\n",
+ "from IPython.display import Image \n",
+ "import pydotplus\n",
+ "\n",
+ "dot_data = StringIO()\n",
+ "export_graphviz(ml_DT2, out_file = dot_data, filled = True, rounded = True, special_characters = True, feature_names = l_colunas, class_names = ['0','1'])\n",
+ "\n",
+ "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n",
+ "graph.write_png('DecisionTree.png')\n",
+ "Image(graph.create_png())"
+ ],
+ "execution_count": 110,
+ "outputs": [
+ {
+ "output_type": "error",
+ "ename": "ValueError",
+ "evalue": "ignored",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mdot_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mStringIO\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mexport_graphviz\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mml_DT2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout_file\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdot_data\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfilled\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrounded\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mspecial_characters\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfeature_names\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0ml_colunas\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mclass_names\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'0'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'1'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0mgraph\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpydotplus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgraph_from_dot_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdot_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetvalue\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/sklearn/tree/_export.py\u001b[0m in \u001b[0;36mexport_graphviz\u001b[0;34m(decision_tree, out_file, max_depth, feature_names, class_names, label, filled, leaves_parallel, impurity, node_ids, proportion, rotate, rounded, special_characters, precision)\u001b[0m\n\u001b[1;32m 762\u001b[0m \u001b[0mrounded\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrounded\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mspecial_characters\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mspecial_characters\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 763\u001b[0m precision=precision)\n\u001b[0;32m--> 764\u001b[0;31m \u001b[0mexporter\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexport\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdecision_tree\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 765\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 766\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mreturn_string\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/sklearn/tree/_export.py\u001b[0m in \u001b[0;36mexport\u001b[0;34m(self, decision_tree)\u001b[0m\n\u001b[1;32m 397\u001b[0m \u001b[0;34m\"does not match number of features, %d\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 398\u001b[0m % (len(self.feature_names),\n\u001b[0;32m--> 399\u001b[0;31m decision_tree.n_features_))\n\u001b[0m\u001b[1;32m 400\u001b[0m \u001b[0;31m# each part writes to out_file\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 401\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;31mValueError\u001b[0m: Length of feature_names, 18 does not match number of features, 30"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bG31I7_n4RQg"
+ },
+ "source": [
+ "### Apply the (main) transformations studied and re-estimate the model\n",
+ "* What is the impact of the transformations?\n",
+ "* Does the conclusion change?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oYgK6JXd3MgA"
+ },
+ "source": [
+ "## Exercise 2 - Predicting species on the Iris dataset\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "si0rsJvu3O6O"
+ },
+ "source": [
+ "from sklearn import datasets\n",
+ "import xgboost as xgb\n",
+ "\n",
+ "iris = datasets.load_iris()\n",
+ "X_iris = iris.data\n",
+ "y_iris = iris.target"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zom8t4yWC_UC"
+ },
+ "source": [
+ "## Exercise 3 - Predict Wine Quality\n",
+ "> Estimate the quality of wines on a scale of 0–100. The quality bands for that scale are:\n",
+ "\n",
+ "* 95–100 Classic: a great wine\n",
+ "* 90–94 Outstanding: a wine of superior character and style\n",
+ "* 85–89 Very good: a wine with special qualities\n",
+ "* 80–84 Good: a solid, well-made wine\n",
+ "* 75–79 Mediocre: a drinkable wine that may have minor flaws\n",
+ "* 50–74 Not recommended\n",
+ "\n",
+ "Source: [Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "klL2Q9Ria96n"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from sklearn import datasets\n",
+ "\n",
+ "Wine = datasets.load_wine()\n",
+ "X_vinho = Wine.data\n",
+ "y_vinho = Wine.target"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lhVhSWBgGijq"
+ },
+ "source": [
+ "## Exercise 4 - Predict Parkinson's disease\n",
+ "Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SVCxHqv0VBJn"
+ },
+ "source": [
+ "## Exercise 5 - Predict survivors of the Titanic tragedy\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CwvB8us4eKNi"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import seaborn as sns\n",
+ "\n",
+ "df_titanic = sns.load_dataset('titanic')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZJrT9YIXVdtx"
+ },
+ "source": [
+ "## Exercise 6 - Predict Loan\n",
+ "> The data must be obtained directly from the source: [Loan Default Prediction - Imperial College London](https://www.kaggle.com/c/loan-default-prediction/data)\n",
+ "\n",
+ "* [Bank Loan Default Prediction](https://medium.com/@wutianhao910/bank-loan-default-prediction-94d4902db740)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "R8-GVu7ZWeA8"
+ },
+ "source": [
+ "## Exercise 7 - Predict the sales of a store.\n",
+ "* [Predicting expected sales for Bigmart’s stores](https://medium.com/diogo-menezes-borges/project-1-bigmart-sale-prediction-fdc04f07dc1e)\n",
+ "* Dataframes\n",
+ " * [Training](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_train.txt)\n",
+ " * [Validation](https://raw.githubusercontent.com/MathMachado/DataFrames/master/Big_Mart_Sales_III_test.txt)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fv9w86j4Wnwj"
+ },
+ "source": [
+ "## Exercise 8 - [The Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)\n",
+ "> Predict the median value of owner-occupied homes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5HYRt8-ug1BT"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from sklearn import datasets\n",
+ "\n",
+ "Boston = datasets.load_boston()\n",
+ "X_boston = Boston.data\n",
+ "y_boston = Boston.target"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1UDIaqmtXQ0T"
+ },
+ "source": [
+ "## Exercise 9 - Predict the height or weight of a person.\n",
+ "\n",
+ "http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-7R146nIXmMT"
+ },
+ "source": [
+ "## Exercise 10 - Black Friday Sales Prediction - Predict purchase amount.\n",
+ "\n",
+ "This dataset comprises sales transactions captured at a retail store. It’s a classic dataset for exploring and expanding your feature engineering skills and your day-to-day understanding of multiple shopping experiences. This is a regression problem. The dataset has 550,069 rows and 12 columns.\n",
+ "\n",
+ "https://github.com/MathMachado/DataFrames/blob/master/blackfriday.zip\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mQ8FPbuLZlIh"
+ },
+ "source": [
+ "## Exercise 11 - Predict the income class of the US population.\n",
+ "\n",
+ "http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Af4NRrchgPlM"
+ },
+ "source": [
+ "## Exercise 12 - Predicting Cancer\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "c4LOlgZW3P40"
+ },
+ "source": [
+ "from sklearn import datasets\n",
+ "cancer = datasets.load_breast_cancer()\n",
+ "X_cancer = cancer.data\n",
+ "y_cancer = cancer.target"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "74PmpT8Ix0tD"
+ },
+ "source": [
+ "## Exercise 13\n",
+ "Source: [Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WY8GZMixZ9W9"
+ },
+ "source": [
+ "## Exercise 14 - Predict Diabetes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "y92t6tbOge0S"
+ },
+ "source": [
+ "from sklearn import datasets\n",
+ "Diabetes = datasets.load_diabetes()\n",
+ "\n",
+ "X_diabetes = Diabetes.data\n",
+ "y_diabetes = Diabetes.target"
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From 2b6779711bfe5ded5add4317c6bc8b8d758a299f Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Mon, 26 Oct 2020 18:12:25 -0300
Subject: [PATCH 19/21] Criado usando o Colaboratory
---
...egress\303\243o Linear_apontamentos.ipynb" | 5943 +++++++++++++++++
1 file changed, 5943 insertions(+)
create mode 100644 "Notebooks/NB15_02__Regress\303\243o Linear_apontamentos.ipynb"
diff --git "a/Notebooks/NB15_02__Regress\303\243o Linear_apontamentos.ipynb" "b/Notebooks/NB15_02__Regress\303\243o Linear_apontamentos.ipynb"
new file mode 100644
index 000000000..6c070dd81
--- /dev/null
+++ "b/Notebooks/NB15_02__Regress\303\243o Linear_apontamentos.ipynb"
@@ -0,0 +1,5943 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.1"
+ },
+ "colab": {
+ "name": "NB15_02__Regressão Linear.ipynb",
+ "provenance": [],
+ "include_colab_link": true
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XwQDhId7N6_r"
+ },
+ "source": [
+ "# MACHINE LEARNING WITH PYTHON\n",
+ "## SUPERVISED LEARNING\n",
+ "### REGRESSION MODELS (LINEAR AND LOGISTIC)\n",
+ "\n",
+ "Source: https://realpython.com/linear-regression-in-python/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PN-dQFJcM1UV"
+ },
+ "source": [
+ "Steps for implementing Linear Regression:\n",
+ "\n",
+ "* (1) Import the required libraries;\n",
+ "* (2) Load the data;\n",
+ "* (3) Apply the necessary transformations: outliers, NaNs, scaling (MinMaxScaler, RobustScaler, StandardScaler, Log, Box-Cox, etc.);\n",
+ "* (4) Build and train the predictive model (in this case, a regression model);\n",
+ "* (5) Validate/check the evaluation metrics of the model(s);\n",
+ "* (6) Make predictions."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8TldGZxAFV5E"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0QRbxlqaq7pr"
+ },
+ "source": [
+ "# Improvements for this session:\n",
+ "* "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "P4sAIblOgFyL"
+ },
+ "source": [
+ "# Regression Models with Regularization for Classification and Regression"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o7Y7cuJNgFyU"
+ },
+ "source": [
+ "## Simple Linear Regression (using OLS - Ordinary Least Squares)\n",
+ "\n",
+ "* Features $X_{np}$: a matrix of dimension $n \\times p$;\n",
+ "* The target/dependent variable is represented by $y$;\n",
+ "* The relationship between $X$ and $y$ is given by the equation below, where $w_{i}$ is the weight (coefficient) of each feature and $w_{0}$ is the intercept."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NpJ580y9gFyU"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "$$\\hat{y} = w_{0} + w_{1}x_{1} + w_{2}x_{2} + \\dots + w_{p}x_{p}$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5rhbVGJ0gFyY"
+ },
+ "source": [
+ "* Residual Sum of Squares (RSS): the sum of the squared differences between the observed values $y_{i}$ and the predicted values $\\hat{y}_{i}$.\n",
+ "\n",
+ "$$RSS = \\sum_{i=1}^{n} (y_{i} - \\hat{y}_{i})^{2}$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "u8gA0YkbgFyp"
+ },
+ "source": [
+ "## Main parameters of the algorithm:\n",
+ "* fit_intercept - Whether the intercept $w_{0}$ should be fitted. If the data are already centered, there is no need to fit the intercept $w_{0}$.\n",
+ "\n",
+ "* normalize - $X$ is normalized automatically before fitting (the mean is subtracted and the result is divided by the L2 norm). This parameter was deprecated in scikit-learn 1.0 and removed in 1.2; in newer versions, scale $X$ beforehand, for example with StandardScaler;\n",
+ "\n",
+ "## Attributes of the Machine Learning model for Regression\n",
+ "* coef_ - weight/factor of each independent variable in the ML model;\n",
+ "\n",
+ "* intercept_ - the intercept $w_{0}$, or bias, of $y$;\n",
+ "\n",
+ "## Functions for fitting the ML model:\n",
+ "* fit - trains the model with the matrices $X$ and $y$;\n",
+ "* predict - once the model has been trained, computes the predicted values of $y$ (y_pred) for a given $X$ using the fitted weights.\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "A-JG8El1gFy7"
+ },
+ "source": [
+ "# Limitations of OLS:\n",
+ "* Sensitive to outliers;\n",
+ "* Multicollinearity; \n",
+ "* Heteroscedasticity - the variance of the residuals is not constant, so the dispersion of the points around the fitted line changes across the range of the data;\n",
+ "\n",
+ "* References"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xylMYR8COyrw"
+ },
+ "source": [
+ "### Import the libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2BGgrILlPK6Z"
+ },
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from scipy import stats"
+ ],
+ "execution_count": 1,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "263GgbwhO2kQ"
+ },
+ "source": [
+ "### Load the data\n",
+ "* Let's load the [Boston House Pricing](https://archive.ics.uci.edu/ml/datasets/housing) dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "g2WoVKwkPYEd",
+ "outputId": "2e4ea40b-33b5-4214-cc6b-d768d5d948e9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 204
+ }
+ },
+ "source": [
+ "from sklearn.datasets import load_boston\n",
+ "#url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/housing.csv'\n",
+ "# Predictor variables\n",
+ "df_boston = pd.DataFrame(load_boston().data, columns = load_boston().feature_names)\n",
+ "df_boston['preco'] = load_boston().target\n",
+ "df_boston.head()\n"
+ ],
+ "execution_count": 7,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>CRIM</th><th>ZN</th><th>INDUS</th><th>CHAS</th><th>NOX</th><th>RM</th><th>AGE</th><th>DIS</th><th>RAD</th><th>TAX</th><th>PTRATIO</th><th>B</th><th>LSTAT</th><th>preco</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>0.00632</td><td>18.0</td><td>2.31</td><td>0.0</td><td>0.538</td><td>6.575</td><td>65.2</td><td>4.0900</td><td>1.0</td><td>296.0</td><td>15.3</td><td>396.90</td><td>4.98</td><td>24.0</td></tr>\n",
+ "    <tr><th>1</th><td>0.02731</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>6.421</td><td>78.9</td><td>4.9671</td><td>2.0</td><td>242.0</td><td>17.8</td><td>396.90</td><td>9.14</td><td>21.6</td></tr>\n",
+ "    <tr><th>2</th><td>0.02729</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>7.185</td><td>61.1</td><td>4.9671</td><td>2.0</td><td>242.0</td><td>17.8</td><td>392.83</td><td>4.03</td><td>34.7</td></tr>\n",
+ "    <tr><th>3</th><td>0.03237</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>6.998</td><td>45.8</td><td>6.0622</td><td>3.0</td><td>222.0</td><td>18.7</td><td>394.63</td><td>2.94</td><td>33.4</td></tr>\n",
+ "    <tr><th>4</th><td>0.06905</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>7.147</td><td>54.2</td><td>6.0622</td><td>3.0</td><td>222.0</td><td>18.7</td><td>396.90</td><td>5.33</td><td>36.2</td></tr>\n",
+ "  </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ " CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT preco\n",
+ "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n",
+ "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n",
+ "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n",
+ "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n",
+ "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n",
+ "\n",
+ "[5 rows x 14 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 7
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kinipZB4Yq4f",
+ "outputId": "53d258d1-1b82-4fe9-b3af-c4b324e86971",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 799
+ }
+ },
+ "source": [
+ "load_boston().target"
+ ],
+ "execution_count": 8,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,\n",
+ " 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,\n",
+ " 15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,\n",
+ " 13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,\n",
+ " 21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,\n",
+ " 35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,\n",
+ " 19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,\n",
+ " 20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,\n",
+ " 23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,\n",
+ " 33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,\n",
+ " 21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,\n",
+ " 20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,\n",
+ " 23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,\n",
+ " 15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,\n",
+ " 17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,\n",
+ " 25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,\n",
+ " 23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,\n",
+ " 32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,\n",
+ " 34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,\n",
+ " 20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,\n",
+ " 26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,\n",
+ " 31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,\n",
+ " 22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,\n",
+ " 42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,\n",
+ " 36.5, 22.8, 30.7, 50. , 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,\n",
+ " 32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,\n",
+ " 20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,\n",
+ " 20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,\n",
+ " 22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,\n",
+ " 21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,\n",
+ " 19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,\n",
+ " 32.7, 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9, 24.1,\n",
+ " 18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6, 25. , 19.9, 20.8,\n",
+ " 16.8, 21.9, 27.5, 21.9, 23.1, 50. , 50. , 50. , 50. , 50. , 13.8,\n",
+ " 13.8, 15. , 13.9, 13.3, 13.1, 10.2, 10.4, 10.9, 11.3, 12.3, 8.8,\n",
+ " 7.2, 10.5, 7.4, 10.2, 11.5, 15.1, 23.2, 9.7, 13.8, 12.7, 13.1,\n",
+ " 12.5, 8.5, 5. , 6.3, 5.6, 7.2, 12.1, 8.3, 8.5, 5. , 11.9,\n",
+ " 27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3, 7. , 7.2, 7.5, 10.4,\n",
+ " 8.8, 8.4, 16.7, 14.2, 20.8, 13.4, 11.7, 8.3, 10.2, 10.9, 11. ,\n",
+ " 9.5, 14.5, 14.1, 16.1, 14.3, 11.7, 13.4, 9.6, 8.7, 8.4, 12.8,\n",
+ " 10.5, 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. , 13.4,\n",
+ " 15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9, 20. , 16.4, 17.7,\n",
+ " 19.5, 20.2, 21.4, 19.9, 19. , 19.1, 19.1, 20.1, 19.9, 19.6, 23.2,\n",
+ " 29.8, 13.8, 13.3, 16.7, 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8,\n",
+ " 20.6, 21.2, 19.1, 20.6, 15.2, 7. , 8.1, 13.6, 20.1, 21.8, 24.5,\n",
+ " 23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 8
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4cYeEC6tYnFC",
+ "outputId": "0e936cda-5853-4ead-99e5-2c61080ddda9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "load_boston().feature_names"
+ ],
+ "execution_count": 9,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n",
+ " 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>crim</th><th>zn</th><th>indus</th><th>chas</th><th>nox</th><th>rm</th><th>age</th><th>dis</th><th>rad</th><th>tax</th><th>ptratio</th><th>b</th><th>lstat</th><th>preco</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>0.00632</td><td>18.0</td><td>2.31</td><td>0.0</td><td>0.538</td><td>6.575</td><td>65.2</td><td>4.0900</td><td>1.0</td><td>296.0</td><td>15.3</td><td>396.90</td><td>4.98</td><td>24.0</td></tr>\n",
+ "    <tr><th>1</th><td>0.02731</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>6.421</td><td>78.9</td><td>4.9671</td><td>2.0</td><td>242.0</td><td>17.8</td><td>396.90</td><td>9.14</td><td>21.6</td></tr>\n",
+ "    <tr><th>2</th><td>0.02729</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>7.185</td><td>61.1</td><td>4.9671</td><td>2.0</td><td>242.0</td><td>17.8</td><td>392.83</td><td>4.03</td><td>34.7</td></tr>\n",
+ "    <tr><th>3</th><td>0.03237</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>6.998</td><td>45.8</td><td>6.0622</td><td>3.0</td><td>222.0</td><td>18.7</td><td>394.63</td><td>2.94</td><td>33.4</td></tr>\n",
+ "    <tr><th>4</th><td>0.06905</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>7.147</td><td>54.2</td><td>6.0622</td><td>3.0</td><td>222.0</td><td>18.7</td><td>396.90</td><td>5.33</td><td>36.2</td></tr>\n",
+ "  </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ " crim zn indus chas nox ... tax ptratio b lstat preco\n",
+ "0 0.00632 18.0 2.31 0.0 0.538 ... 296.0 15.3 396.90 4.98 24.0\n",
+ "1 0.02731 0.0 7.07 0.0 0.469 ... 242.0 17.8 396.90 9.14 21.6\n",
+ "2 0.02729 0.0 7.07 0.0 0.469 ... 242.0 17.8 392.83 4.03 34.7\n",
+ "3 0.03237 0.0 2.18 0.0 0.458 ... 222.0 18.7 394.63 2.94 33.4\n",
+ "4 0.06905 0.0 2.18 0.0 0.458 ... 222.0 18.7 396.90 5.33 36.2\n",
+ "\n",
+ "[5 rows x 14 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 12
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CMDh5jyqekmr"
+ },
+ "source": [
+ "#### Outliers"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jJIG0jJQf6em"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FgYPzlvfemFc"
+ },
+ "source": [
+ "#### Missing values"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BAjw7UhJen0D",
+ "outputId": "d4adbe43-869a-4f90-f65a-1606db0d5768",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 272
+ }
+ },
+ "source": [
+ "# Missing values per column/variable\n",
+ "df_boston.isna().sum()"
+ ],
+ "execution_count": 13,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "crim 0\n",
+ "zn 0\n",
+ "indus 0\n",
+ "chas 0\n",
+ "nox 0\n",
+ "rm 0\n",
+ "age 0\n",
+ "dis 0\n",
+ "rad 0\n",
+ "tax 0\n",
+ "ptratio 0\n",
+ "b 0\n",
+ "lstat 0\n",
+ "preco 0\n",
+ "dtype: int64"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 13
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0Yp8g7hxfQli",
+ "outputId": "51d084ad-f980-4c58-8614-3cd93e122728",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 49
+ }
+ },
+ "source": [
+ "# Missing values per row\n",
+ "df_boston[df_boston.isnull().any(axis = 1)]"
+ ],
+ "execution_count": 14,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>crim</th><th>zn</th><th>indus</th><th>chas</th><th>nox</th><th>rm</th><th>age</th><th>dis</th><th>rad</th><th>tax</th><th>ptratio</th><th>b</th><th>lstat</th><th>preco</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "  </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ "Empty DataFrame\n",
+ "Columns: [crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, preco]\n",
+ "Index: []"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 14
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5qmkTFLrf9MT"
+ },
+ "source": [
+ "#### Descriptive Statistics"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Nprn3p_Wf_bn",
+ "outputId": "e3654337-020d-4109-9c9f-9dd65b3a10e8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 317
+ }
+ },
+ "source": [
+ "df_boston.describe()"
+ ],
+ "execution_count": 15,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>crim</th><th>zn</th><th>indus</th><th>chas</th><th>nox</th><th>rm</th><th>age</th><th>dis</th><th>rad</th><th>tax</th><th>ptratio</th><th>b</th><th>lstat</th><th>preco</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>count</th><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td><td>506.000000</td></tr>\n",
+ "    <tr><th>mean</th><td>3.613524</td><td>11.363636</td><td>11.136779</td><td>0.069170</td><td>0.554695</td><td>6.284634</td><td>68.574901</td><td>3.795043</td><td>9.549407</td><td>408.237154</td><td>18.455534</td><td>356.674032</td><td>12.653063</td><td>22.532806</td></tr>\n",
+ "    <tr><th>std</th><td>8.601545</td><td>23.322453</td><td>6.860353</td><td>0.253994</td><td>0.115878</td><td>0.702617</td><td>28.148861</td><td>2.105710</td><td>8.707259</td><td>168.537116</td><td>2.164946</td><td>91.294864</td><td>7.141062</td><td>9.197104</td></tr>\n",
+ "    <tr><th>min</th><td>0.006320</td><td>0.000000</td><td>0.460000</td><td>0.000000</td><td>0.385000</td><td>3.561000</td><td>2.900000</td><td>1.129600</td><td>1.000000</td><td>187.000000</td><td>12.600000</td><td>0.320000</td><td>1.730000</td><td>5.000000</td></tr>\n",
+ "    <tr><th>25%</th><td>0.082045</td><td>0.000000</td><td>5.190000</td><td>0.000000</td><td>0.449000</td><td>5.885500</td><td>45.025000</td><td>2.100175</td><td>4.000000</td><td>279.000000</td><td>17.400000</td><td>375.377500</td><td>6.950000</td><td>17.025000</td></tr>\n",
+ "    <tr><th>50%</th><td>0.256510</td><td>0.000000</td><td>9.690000</td><td>0.000000</td><td>0.538000</td><td>6.208500</td><td>77.500000</td><td>3.207450</td><td>5.000000</td><td>330.000000</td><td>19.050000</td><td>391.440000</td><td>11.360000</td><td>21.200000</td></tr>\n",
+ "    <tr><th>75%</th><td>3.677083</td><td>12.500000</td><td>18.100000</td><td>0.000000</td><td>0.624000</td><td>6.623500</td><td>94.075000</td><td>5.188425</td><td>24.000000</td><td>666.000000</td><td>20.200000</td><td>396.225000</td><td>16.955000</td><td>25.000000</td></tr>\n",
+ "    <tr><th>max</th><td>88.976200</td><td>100.000000</td><td>27.740000</td><td>1.000000</td><td>0.871000</td><td>8.780000</td><td>100.000000</td><td>12.126500</td><td>24.000000</td><td>711.000000</td><td>22.000000</td><td>396.900000</td><td>37.970000</td><td>50.000000</td></tr>\n",
+ "  </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ " crim zn indus ... b lstat preco\n",
+ "count 506.000000 506.000000 506.000000 ... 506.000000 506.000000 506.000000\n",
+ "mean 3.613524 11.363636 11.136779 ... 356.674032 12.653063 22.532806\n",
+ "std 8.601545 23.322453 6.860353 ... 91.294864 7.141062 9.197104\n",
+ "min 0.006320 0.000000 0.460000 ... 0.320000 1.730000 5.000000\n",
+ "25% 0.082045 0.000000 5.190000 ... 375.377500 6.950000 17.025000\n",
+ "50% 0.256510 0.000000 9.690000 ... 391.440000 11.360000 21.200000\n",
+ "75% 3.677083 12.500000 18.100000 ... 396.225000 16.955000 25.000000\n",
+ "max 88.976200 100.000000 27.740000 ... 396.900000 37.970000 50.000000\n",
+ "\n",
+ "[8 rows x 14 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 15
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1JimyY3SgECE"
+ },
+ "source": [
+ "#### Correlation Analysis"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jScHq7eTgIpm",
+ "outputId": "554ca659-b49c-4a12-9dea-cf910c558dbd",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 483
+ }
+ },
+ "source": [
+ "correlacoes = df_boston.corr()\n",
+ "correlacoes"
+ ],
+ "execution_count": 16,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " crim | \n",
+ " zn | \n",
+ " indus | \n",
+ " chas | \n",
+ " nox | \n",
+ " rm | \n",
+ " age | \n",
+ " dis | \n",
+ " rad | \n",
+ " tax | \n",
+ " ptratio | \n",
+ " b | \n",
+ " lstat | \n",
+ " preco | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | crim | \n",
+ " 1.000000 | \n",
+ " -0.200469 | \n",
+ " 0.406583 | \n",
+ " -0.055892 | \n",
+ " 0.420972 | \n",
+ " -0.219247 | \n",
+ " 0.352734 | \n",
+ " -0.379670 | \n",
+ " 0.625505 | \n",
+ " 0.582764 | \n",
+ " 0.289946 | \n",
+ " -0.385064 | \n",
+ " 0.455621 | \n",
+ " -0.388305 | \n",
+ "
\n",
+ " \n",
+ " | zn | \n",
+ " -0.200469 | \n",
+ " 1.000000 | \n",
+ " -0.533828 | \n",
+ " -0.042697 | \n",
+ " -0.516604 | \n",
+ " 0.311991 | \n",
+ " -0.569537 | \n",
+ " 0.664408 | \n",
+ " -0.311948 | \n",
+ " -0.314563 | \n",
+ " -0.391679 | \n",
+ " 0.175520 | \n",
+ " -0.412995 | \n",
+ " 0.360445 | \n",
+ "
\n",
+ " \n",
+ " | indus | \n",
+ " 0.406583 | \n",
+ " -0.533828 | \n",
+ " 1.000000 | \n",
+ " 0.062938 | \n",
+ " 0.763651 | \n",
+ " -0.391676 | \n",
+ " 0.644779 | \n",
+ " -0.708027 | \n",
+ " 0.595129 | \n",
+ " 0.720760 | \n",
+ " 0.383248 | \n",
+ " -0.356977 | \n",
+ " 0.603800 | \n",
+ " -0.483725 | \n",
+ "
\n",
+ " \n",
+ " | chas | \n",
+ " -0.055892 | \n",
+ " -0.042697 | \n",
+ " 0.062938 | \n",
+ " 1.000000 | \n",
+ " 0.091203 | \n",
+ " 0.091251 | \n",
+ " 0.086518 | \n",
+ " -0.099176 | \n",
+ " -0.007368 | \n",
+ " -0.035587 | \n",
+ " -0.121515 | \n",
+ " 0.048788 | \n",
+ " -0.053929 | \n",
+ " 0.175260 | \n",
+ "
\n",
+ " \n",
+ " | nox | \n",
+ " 0.420972 | \n",
+ " -0.516604 | \n",
+ " 0.763651 | \n",
+ " 0.091203 | \n",
+ " 1.000000 | \n",
+ " -0.302188 | \n",
+ " 0.731470 | \n",
+ " -0.769230 | \n",
+ " 0.611441 | \n",
+ " 0.668023 | \n",
+ " 0.188933 | \n",
+ " -0.380051 | \n",
+ " 0.590879 | \n",
+ " -0.427321 | \n",
+ "
\n",
+ " \n",
+ " | rm | \n",
+ " -0.219247 | \n",
+ " 0.311991 | \n",
+ " -0.391676 | \n",
+ " 0.091251 | \n",
+ " -0.302188 | \n",
+ " 1.000000 | \n",
+ " -0.240265 | \n",
+ " 0.205246 | \n",
+ " -0.209847 | \n",
+ " -0.292048 | \n",
+ " -0.355501 | \n",
+ " 0.128069 | \n",
+ " -0.613808 | \n",
+ " 0.695360 | \n",
+ "
\n",
+ " \n",
+ " | age | \n",
+ " 0.352734 | \n",
+ " -0.569537 | \n",
+ " 0.644779 | \n",
+ " 0.086518 | \n",
+ " 0.731470 | \n",
+ " -0.240265 | \n",
+ " 1.000000 | \n",
+ " -0.747881 | \n",
+ " 0.456022 | \n",
+ " 0.506456 | \n",
+ " 0.261515 | \n",
+ " -0.273534 | \n",
+ " 0.602339 | \n",
+ " -0.376955 | \n",
+ "
\n",
+ " \n",
+ " | dis | \n",
+ " -0.379670 | \n",
+ " 0.664408 | \n",
+ " -0.708027 | \n",
+ " -0.099176 | \n",
+ " -0.769230 | \n",
+ " 0.205246 | \n",
+ " -0.747881 | \n",
+ " 1.000000 | \n",
+ " -0.494588 | \n",
+ " -0.534432 | \n",
+ " -0.232471 | \n",
+ " 0.291512 | \n",
+ " -0.496996 | \n",
+ " 0.249929 | \n",
+ "
\n",
+ " \n",
+ " | rad | \n",
+ " 0.625505 | \n",
+ " -0.311948 | \n",
+ " 0.595129 | \n",
+ " -0.007368 | \n",
+ " 0.611441 | \n",
+ " -0.209847 | \n",
+ " 0.456022 | \n",
+ " -0.494588 | \n",
+ " 1.000000 | \n",
+ " 0.910228 | \n",
+ " 0.464741 | \n",
+ " -0.444413 | \n",
+ " 0.488676 | \n",
+ " -0.381626 | \n",
+ "
\n",
+ " \n",
+ " | tax | \n",
+ " 0.582764 | \n",
+ " -0.314563 | \n",
+ " 0.720760 | \n",
+ " -0.035587 | \n",
+ " 0.668023 | \n",
+ " -0.292048 | \n",
+ " 0.506456 | \n",
+ " -0.534432 | \n",
+ " 0.910228 | \n",
+ " 1.000000 | \n",
+ " 0.460853 | \n",
+ " -0.441808 | \n",
+ " 0.543993 | \n",
+ " -0.468536 | \n",
+ "
\n",
+ " \n",
+ " | ptratio | \n",
+ " 0.289946 | \n",
+ " -0.391679 | \n",
+ " 0.383248 | \n",
+ " -0.121515 | \n",
+ " 0.188933 | \n",
+ " -0.355501 | \n",
+ " 0.261515 | \n",
+ " -0.232471 | \n",
+ " 0.464741 | \n",
+ " 0.460853 | \n",
+ " 1.000000 | \n",
+ " -0.177383 | \n",
+ " 0.374044 | \n",
+ " -0.507787 | \n",
+ "
\n",
+ " \n",
+ " | b | \n",
+ " -0.385064 | \n",
+ " 0.175520 | \n",
+ " -0.356977 | \n",
+ " 0.048788 | \n",
+ " -0.380051 | \n",
+ " 0.128069 | \n",
+ " -0.273534 | \n",
+ " 0.291512 | \n",
+ " -0.444413 | \n",
+ " -0.441808 | \n",
+ " -0.177383 | \n",
+ " 1.000000 | \n",
+ " -0.366087 | \n",
+ " 0.333461 | \n",
+ "
\n",
+ " \n",
+ " | lstat | \n",
+ " 0.455621 | \n",
+ " -0.412995 | \n",
+ " 0.603800 | \n",
+ " -0.053929 | \n",
+ " 0.590879 | \n",
+ " -0.613808 | \n",
+ " 0.602339 | \n",
+ " -0.496996 | \n",
+ " 0.488676 | \n",
+ " 0.543993 | \n",
+ " 0.374044 | \n",
+ " -0.366087 | \n",
+ " 1.000000 | \n",
+ " -0.737663 | \n",
+ "
\n",
+ " \n",
+ " | preco | \n",
+ " -0.388305 | \n",
+ " 0.360445 | \n",
+ " -0.483725 | \n",
+ " 0.175260 | \n",
+ " -0.427321 | \n",
+ " 0.695360 | \n",
+ " -0.376955 | \n",
+ " 0.249929 | \n",
+ " -0.381626 | \n",
+ " -0.468536 | \n",
+ " -0.507787 | \n",
+ " 0.333461 | \n",
+ " -0.737663 | \n",
+ " 1.000000 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " crim zn indus ... b lstat preco\n",
+ "crim 1.000000 -0.200469 0.406583 ... -0.385064 0.455621 -0.388305\n",
+ "zn -0.200469 1.000000 -0.533828 ... 0.175520 -0.412995 0.360445\n",
+ "indus 0.406583 -0.533828 1.000000 ... -0.356977 0.603800 -0.483725\n",
+ "chas -0.055892 -0.042697 0.062938 ... 0.048788 -0.053929 0.175260\n",
+ "nox 0.420972 -0.516604 0.763651 ... -0.380051 0.590879 -0.427321\n",
+ "rm -0.219247 0.311991 -0.391676 ... 0.128069 -0.613808 0.695360\n",
+ "age 0.352734 -0.569537 0.644779 ... -0.273534 0.602339 -0.376955\n",
+ "dis -0.379670 0.664408 -0.708027 ... 0.291512 -0.496996 0.249929\n",
+ "rad 0.625505 -0.311948 0.595129 ... -0.444413 0.488676 -0.381626\n",
+ "tax 0.582764 -0.314563 0.720760 ... -0.441808 0.543993 -0.468536\n",
+ "ptratio 0.289946 -0.391679 0.383248 ... -0.177383 0.374044 -0.507787\n",
+ "b -0.385064 0.175520 -0.356977 ... 1.000000 -0.366087 0.333461\n",
+ "lstat 0.455621 -0.412995 0.603800 ... -0.366087 1.000000 -0.737663\n",
+ "preco -0.388305 0.360445 -0.483725 ... 0.333461 -0.737663 1.000000\n",
+ "\n",
+ "[14 rows x 14 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 16
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AxQp7xqdgTJP"
+ },
+ "source": [
+ "##### Plot of the correlations between the features/variables/columns\n",
+ "Source: https://seaborn.pydata.org/examples/many_pairwise_correlations.html\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KOiH2X-WgqmN",
+ "outputId": "ef9b58f8-e9fd-4019-92c7-151bfaf4f58f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 557
+ }
+ },
+ "source": [
+ "import seaborn as sns\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "sns.set_theme(style = \"white\")\n",
+ "\n",
+ "d = df_boston\n",
+ "\n",
+ "# Compute the correlation matrix\n",
+ "corr = d.corr()\n",
+ "\n",
+ "# Generate a mask for the upper triangle\n",
+ "mask = np.triu(np.ones_like(corr, dtype=bool))\n",
+ "\n",
+ "# Set up the matplotlib figure\n",
+ "f, ax = plt.subplots(figsize=(11, 9))\n",
+ "\n",
+ "# Generate a custom diverging colormap\n",
+ "cmap = sns.diverging_palette(230, 20, as_cmap=True)\n",
+ "\n",
+ "# Draw the heatmap with the mask and correct aspect ratio\n",
+ "sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,\n",
+ " square=True, linewidths=.5, cbar_kws={\"shrink\": .5})"
+ ],
+ "execution_count": 17,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 17
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nogPhyfVO70G"
+ },
+ "source": [
+ "### Build and train the model(s)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0BhLZJhibVNG"
+ },
+ "source": [
+ "X_boston = df_boston.drop(columns = ['preco'])\n",
+ "y_boston = df_boston['preco']\n"
+ ],
+ "execution_count": 18,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "b50_6tv5h1kY"
+ },
+ "source": [
+ "# Define the training and test dataframes:\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_boston, y_boston, test_size = 0.2, random_state = 20111974)"
+ ],
+ "execution_count": 19,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SvevXulFiJj1"
+ },
+ "source": [
+ "#### Training the Linear Regression model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GVwF3vp8iNff",
+ "outputId": "5157316e-d00f-4ceb-f930-1643e673fd70",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 71
+ }
+ },
+ "source": [
+ "# Import LinearRegression, used to fit the linear regression model\n",
+ "from sklearn.linear_model import LinearRegression\n",
+ "\n",
+ "# statsmodels API, used for the OLS summaries\n",
+ "import statsmodels.api as sm"
+ ],
+ "execution_count": 20,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.\n",
+ " import pandas.util.testing as tm\n"
+ ],
+ "name": "stderr"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ibX6bCbViW-v"
+ },
+ "source": [
+ "# Instantiate the estimator\n",
+ "regressao_linear = LinearRegression()"
+ ],
+ "execution_count": 21,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "M-5wRGUribY0",
+ "outputId": "2e94fc43-b82c-44fd-9580-e8eb0eb325ba",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Fit the model on the training set: X_treinamento and y_treinamento\n",
+ "regressao_linear.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": 22,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 22
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jri-jA1VjmUl",
+ "outputId": "4ff22bd7-5308-482f-a601-f9946ac5106c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Intercept value\n",
+ "regressao_linear.intercept_"
+ ],
+ "execution_count": 23,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "35.9020918753502"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 23
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VOjadxdxjqtT",
+ "outputId": "eb581dc3-3c7a-4b23-f0e9-77b17e2f0153",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 452
+ }
+ },
+ "source": [
+ "# Coefficients of the Linear Regression model\n",
+ "coeficientes_regressao_linear = pd.DataFrame([X_treinamento.columns, regressao_linear.coef_]).T\n",
+ "coeficientes_regressao_linear = coeficientes_regressao_linear.rename(columns={0: 'Feature/variável/coluna', 1: 'Coeficientes'})\n",
+ "coeficientes_regressao_linear"
+ ],
+ "execution_count": 24,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " Feature/variável/coluna | \n",
+ " Coeficientes | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " crim | \n",
+ " -0.0822083 | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " zn | \n",
+ " 0.0428002 | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " indus | \n",
+ " 0.0756011 | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " chas | \n",
+ " 3.16348 | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " nox | \n",
+ " -19.4945 | \n",
+ "
\n",
+ " \n",
+ " | 5 | \n",
+ " rm | \n",
+ " 3.98161 | \n",
+ "
\n",
+ " \n",
+ " | 6 | \n",
+ " age | \n",
+ " 0.00480929 | \n",
+ "
\n",
+ " \n",
+ " | 7 | \n",
+ " dis | \n",
+ " -1.37396 | \n",
+ "
\n",
+ " \n",
+ " | 8 | \n",
+ " rad | \n",
+ " 0.298883 | \n",
+ "
\n",
+ " \n",
+ " | 9 | \n",
+ " tax | \n",
+ " -0.0123962 | \n",
+ "
\n",
+ " \n",
+ " | 10 | \n",
+ " ptratio | \n",
+ " -0.984657 | \n",
+ "
\n",
+ " \n",
+ " | 11 | \n",
+ " b | \n",
+ " 0.008949 | \n",
+ "
\n",
+ " \n",
+ " | 12 | \n",
+ " lstat | \n",
+ " -0.526478 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " Feature/variável/coluna Coeficientes\n",
+ "0 crim -0.0822083\n",
+ "1 zn 0.0428002\n",
+ "2 indus 0.0756011\n",
+ "3 chas 3.16348\n",
+ "4 nox -19.4945\n",
+ "5 rm 3.98161\n",
+ "6 age 0.00480929\n",
+ "7 dis -1.37396\n",
+ "8 rad 0.298883\n",
+ "9 tax -0.0123962\n",
+ "10 ptratio -0.984657\n",
+ "11 b 0.008949\n",
+ "12 lstat -0.526478"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 24
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jwnkhPwDjkhS"
+ },
+ "source": [
+ "#### Using statsmodels"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ltbekHd_k3PH",
+ "outputId": "8df0e302-86fa-425d-ee12-c73460e69a8d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 680
+ }
+ },
+ "source": [
+ "X2_treinamento = sm.add_constant(X_treinamento)\n",
+ "lm_sm = sm.OLS(y_treinamento, X2_treinamento).fit()\n",
+ "print(lm_sm.summary())"
+ ],
+ "execution_count": 25,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ " OLS Regression Results \n",
+ "==============================================================================\n",
+ "Dep. Variable: preco R-squared: 0.725\n",
+ "Model: OLS Adj. R-squared: 0.716\n",
+ "Method: Least Squares F-statistic: 78.97\n",
+ "Date: Mon, 26 Oct 2020 Prob (F-statistic): 1.48e-100\n",
+ "Time: 19:31:57 Log-Likelihood: -1214.8\n",
+ "No. Observations: 404 AIC: 2458.\n",
+ "Df Residuals: 390 BIC: 2514.\n",
+ "Df Model: 13 \n",
+ "Covariance Type: nonrobust \n",
+ "==============================================================================\n",
+ " coef std err t P>|t| [0.025 0.975]\n",
+ "------------------------------------------------------------------------------\n",
+ "const 35.9021 6.037 5.947 0.000 24.033 47.771\n",
+ "crim -0.0822 0.045 -1.824 0.069 -0.171 0.006\n",
+ "zn 0.0428 0.016 2.638 0.009 0.011 0.075\n",
+ "indus 0.0756 0.072 1.054 0.292 -0.065 0.217\n",
+ "chas 3.1635 0.997 3.174 0.002 1.204 5.123\n",
+ "nox -19.4945 4.539 -4.295 0.000 -28.418 -10.571\n",
+ "rm 3.9816 0.510 7.802 0.000 2.978 4.985\n",
+ "age 0.0048 0.015 0.312 0.755 -0.025 0.035\n",
+ "dis -1.3740 0.236 -5.827 0.000 -1.838 -0.910\n",
+ "rad 0.2989 0.079 3.760 0.000 0.143 0.455\n",
+ "tax -0.0124 0.004 -2.814 0.005 -0.021 -0.004\n",
+ "ptratio -0.9847 0.156 -6.309 0.000 -1.292 -0.678\n",
+ "b 0.0089 0.003 2.796 0.005 0.003 0.015\n",
+ "lstat -0.5265 0.060 -8.764 0.000 -0.645 -0.408\n",
+ "==============================================================================\n",
+ "Omnibus: 140.799 Durbin-Watson: 2.083\n",
+ "Prob(Omnibus): 0.000 Jarque-Bera (JB): 591.650\n",
+ "Skew: 1.484 Prob(JB): 3.35e-129\n",
+ "Kurtosis: 8.132 Cond. No. 1.51e+04\n",
+ "==============================================================================\n",
+ "\n",
+ "Warnings:\n",
+ "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
+ "[2] The condition number is large, 1.51e+04. This might indicate that there are\n",
+ "strong multicollinearity or other numerical problems.\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YM4pbCw_iMKX",
+ "outputId": "79b06023-8d21-4791-fae7-decb5a82ddc6",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 663
+ }
+ },
+ "source": [
+ "X3 = X2_treinamento.drop(columns = 'age', axis = 1)\n",
+ "X3_treinamento = sm.add_constant(X3)\n",
+ "lm_sm3 = sm.OLS(y_treinamento, X3_treinamento).fit()\n",
+ "print(lm_sm3.summary())\n"
+ ],
+ "execution_count": 29,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ " OLS Regression Results \n",
+ "==============================================================================\n",
+ "Dep. Variable: preco R-squared: 0.725\n",
+ "Model: OLS Adj. R-squared: 0.716\n",
+ "Method: Least Squares F-statistic: 85.75\n",
+ "Date: Mon, 26 Oct 2020 Prob (F-statistic): 1.64e-101\n",
+ "Time: 20:10:30 Log-Likelihood: -1214.8\n",
+ "No. Observations: 404 AIC: 2456.\n",
+ "Df Residuals: 391 BIC: 2508.\n",
+ "Df Model: 12 \n",
+ "Covariance Type: nonrobust \n",
+ "==============================================================================\n",
+ " coef std err t P>|t| [0.025 0.975]\n",
+ "------------------------------------------------------------------------------\n",
+ "const 35.7325 6.006 5.950 0.000 23.925 47.540\n",
+ "crim -0.0815 0.045 -1.812 0.071 -0.170 0.007\n",
+ "zn 0.0422 0.016 2.623 0.009 0.011 0.074\n",
+ "indus 0.0750 0.072 1.048 0.295 -0.066 0.216\n",
+ "chas 3.1794 0.994 3.198 0.001 1.225 5.134\n",
+ "nox -19.1299 4.381 -4.367 0.000 -27.742 -10.517\n",
+ "rm 4.0153 0.498 8.059 0.000 3.036 4.995\n",
+ "dis -1.3963 0.224 -6.223 0.000 -1.837 -0.955\n",
+ "rad 0.2958 0.079 3.755 0.000 0.141 0.451\n",
+ "tax -0.0123 0.004 -2.802 0.005 -0.021 -0.004\n",
+ "ptratio -0.9812 0.156 -6.310 0.000 -1.287 -0.675\n",
+ "b 0.0090 0.003 2.825 0.005 0.003 0.015\n",
+ "lstat -0.5202 0.057 -9.203 0.000 -0.631 -0.409\n",
+ "==============================================================================\n",
+ "Omnibus: 142.363 Durbin-Watson: 2.081\n",
+ "Prob(Omnibus): 0.000 Jarque-Bera (JB): 608.694\n",
+ "Skew: 1.496 Prob(JB): 6.67e-133\n",
+ "Kurtosis: 8.216 Cond. No. 1.48e+04\n",
+ "==============================================================================\n",
+ "\n",
+ "Warnings:\n",
+ "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
+ "[2] The condition number is large, 1.48e+04. This might indicate that there are\n",
+ "strong multicollinearity or other numerical problems.\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jVXj1Cz7i7yy",
+ "outputId": "870d1a70-bc73-4fae-f96d-97dedf69763f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 646
+ }
+ },
+ "source": [
+ "X4 = X3_treinamento.drop(columns = 'indus', axis = 1)\n",
+ "X4_treinamento = sm.add_constant(X4)\n",
+ "lm_sm4 = sm.OLS(y_treinamento, X4_treinamento).fit()\n",
+ "print(lm_sm4.summary())"
+ ],
+ "execution_count": 31,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ " OLS Regression Results \n",
+ "==============================================================================\n",
+ "Dep. Variable: preco R-squared: 0.724\n",
+ "Model: OLS Adj. R-squared: 0.716\n",
+ "Method: Least Squares F-statistic: 93.42\n",
+ "Date: Mon, 26 Oct 2020 Prob (F-statistic): 2.86e-102\n",
+ "Time: 20:11:16 Log-Likelihood: -1215.4\n",
+ "No. Observations: 404 AIC: 2455.\n",
+ "Df Residuals: 392 BIC: 2503.\n",
+ "Df Model: 11 \n",
+ "Covariance Type: nonrobust \n",
+ "==============================================================================\n",
+ " coef std err t P>|t| [0.025 0.975]\n",
+ "------------------------------------------------------------------------------\n",
+ "const 35.4757 6.001 5.911 0.000 23.677 47.275\n",
+ "crim -0.0840 0.045 -1.871 0.062 -0.172 0.004\n",
+ "zn 0.0407 0.016 2.539 0.012 0.009 0.072\n",
+ "chas 3.2924 0.989 3.330 0.001 1.349 5.236\n",
+ "nox -17.9558 4.235 -4.239 0.000 -26.283 -9.629\n",
+ "rm 3.9674 0.496 7.996 0.000 2.992 4.943\n",
+ "dis -1.4553 0.217 -6.699 0.000 -1.882 -1.028\n",
+ "rad 0.2744 0.076 3.606 0.000 0.125 0.424\n",
+ "tax -0.0103 0.004 -2.603 0.010 -0.018 -0.003\n",
+ "ptratio -0.9609 0.154 -6.227 0.000 -1.264 -0.658\n",
+ "b 0.0089 0.003 2.778 0.006 0.003 0.015\n",
+ "lstat -0.5151 0.056 -9.145 0.000 -0.626 -0.404\n",
+ "==============================================================================\n",
+ "Omnibus: 142.123 Durbin-Watson: 2.073\n",
+ "Prob(Omnibus): 0.000 Jarque-Bera (JB): 605.868\n",
+ "Skew: 1.494 Prob(JB): 2.74e-132\n",
+ "Kurtosis: 8.202 Cond. No. 1.47e+04\n",
+ "==============================================================================\n",
+ "\n",
+ "Warnings:\n",
+ "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
+ "[2] The condition number is large, 1.47e+04. This might indicate that there are\n",
+ "strong multicollinearity or other numerical problems.\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "MRKRHcsqjGlc",
+ "outputId": "0215400b-657b-4c9c-cd11-0159e2ca5f6a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 629
+ }
+ },
+ "source": [
+ "X5 = X4_treinamento.drop(columns = 'crim', axis = 1)\n",
+ "X5_treinamento = sm.add_constant(X5)\n",
+ "lm_sm5 = sm.OLS(y_treinamento, X5_treinamento).fit()\n",
+ "print(lm_sm5.summary())"
+ ],
+ "execution_count": 33,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ " OLS Regression Results \n",
+ "==============================================================================\n",
+ "Dep. Variable: preco R-squared: 0.721\n",
+ "Model: OLS Adj. R-squared: 0.714\n",
+ "Method: Least Squares F-statistic: 101.8\n",
+ "Date: Mon, 26 Oct 2020 Prob (F-statistic): 1.55e-102\n",
+ "Time: 20:11:56 Log-Likelihood: -1217.2\n",
+ "No. Observations: 404 AIC: 2456.\n",
+ "Df Residuals: 393 BIC: 2500.\n",
+ "Df Model: 10 \n",
+ "Covariance Type: nonrobust \n",
+ "==============================================================================\n",
+ " coef std err t P>|t| [0.025 0.975]\n",
+ "------------------------------------------------------------------------------\n",
+ "const 33.9950 5.968 5.696 0.000 22.262 45.728\n",
+ "zn 0.0375 0.016 2.349 0.019 0.006 0.069\n",
+ "chas 3.3959 0.990 3.430 0.001 1.449 5.343\n",
+ "nox -17.1637 4.228 -4.060 0.000 -25.475 -8.852\n",
+ "rm 4.0365 0.496 8.132 0.000 3.061 5.012\n",
+ "dis -1.3999 0.216 -6.484 0.000 -1.824 -0.975\n",
+ "rad 0.2278 0.072 3.158 0.002 0.086 0.370\n",
+ "tax -0.0100 0.004 -2.513 0.012 -0.018 -0.002\n",
+ "ptratio -0.9493 0.155 -6.137 0.000 -1.253 -0.645\n",
+ "b 0.0101 0.003 3.217 0.001 0.004 0.016\n",
+ "lstat -0.5315 0.056 -9.523 0.000 -0.641 -0.422\n",
+ "==============================================================================\n",
+ "Omnibus: 140.245 Durbin-Watson: 2.070\n",
+ "Prob(Omnibus): 0.000 Jarque-Bera (JB): 609.563\n",
+ "Skew: 1.464 Prob(JB): 4.32e-133\n",
+ "Kurtosis: 8.257 Cond. No. 1.46e+04\n",
+ "==============================================================================\n",
+ "\n",
+ "Warnings:\n",
+ "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
+ "[2] The condition number is large, 1.46e+04. This might indicate that there are\n",
+ "strong multicollinearity or other numerical problems.\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UafIUrpZB0YP"
+ },
+ "source": [
+ "### Conclusion\n",
+ "* Which variables/columns/attributes remain in the model?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nXeiFtnJO_1u"
+ },
+ "source": [
+ "### Model validation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QlGVFA6uPDvr"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PE3aKJ6mPDyJ"
+ },
+ "source": [
+ "### Predictions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5YQF4NIlGSLH"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UQfpoo1igFy8"
+ },
+ "source": [
+ "# Regularized Regression Methods \n",
+ "## Ridge Regression - Penalized Regression\n",
+ "> Reduces model complexity while still using every variable in $X$: coefficients $w_{i}$ that are far from zero are penalized (with strength $\\alpha$), which shrinks them continuously toward zero. The model becomes less complex while all variables remain in it.\n",
+ "* Lower impact of outliers.\n",
+ "\n",
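+ "A sketch of the penalized objective, assuming scikit-learn's documented `Ridge` formulation (the symbol $y$ for the targets and the product $Xw$ for the linear predictions are labels introduced here, not taken from the notebook):\n",
+ "\n",
+ "$$\\min_{w} \\; \\lVert y - Xw \\rVert_2^2 + \\alpha \\lVert w \\rVert_2^2$$\n",
+ "\n",
+ "With $\\alpha = 0$ this reduces to ordinary least squares; a larger $\\alpha$ shrinks the $w_{i}$ more strongly.\n",
+ "\n",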
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "p1BKT6sRu-1p"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Kp4VIJWxgFy8"
+ },
+ "source": [
+ "from sklearn.linear_model import Ridge\n",
+ "ridge = Ridge(alpha = .1)\n",
+ "lr = LinearRegression()"
+ ],
+ "execution_count": 34,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cmRMoOwV6FMt",
+ "outputId": "8029d6ca-7c6e-4c63-ae18-53ecfad97ada",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "ridge = Ridge(alpha = .1)\n",
+ "ridge.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": 35,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,\n",
+ " normalize=False, random_state=None, solver='auto', tol=0.001)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 35
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VPnekyUbK6Xg"
+ },
+ "source": [
+ "#### Weight/contribution of each variable to the regression using Ridge"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-VxmbdkHtPWb",
+ "outputId": "750557fc-e0af-4cc5-bc8f-43970645e901",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ }
+ },
+ "source": [
+ "df_boston.columns"
+ ],
+ "execution_count": 42,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',\n",
+ " 'ptratio', 'b', 'lstat', 'preco'],\n",
+ " dtype='object')"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 42
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vMCb0CFjK973",
+ "outputId": "930f717d-a475-479b-b738-47bb36e965c8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 85
+ }
+ },
+ "source": [
+ "ridge.coef_"
+ ],
+ "execution_count": 36,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([-8.08728088e-02, 4.31105323e-02, 6.96774448e-02, 3.14478949e+00,\n",
+ " -1.79983020e+01, 3.98675653e+00, 3.54464890e-03, -1.35303958e+00,\n",
+ " 2.95042916e-01, -1.25115273e-02, -9.68282109e-01, 9.02744064e-03,\n",
+ " -5.29135646e-01])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 36
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZqksuIjXypRJ",
+ "outputId": "81ed8a07-36c4-49c8-c99a-b1cb219e744b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Fit the Ridge regression\n",
+ "ridge.fit(X_treinamento, y_treinamento)\n",
+ "\n",
+ "# Fit the ordinary least squares (OLS) linear regression\n",
+ "lr.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": 37,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 37
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7r28PBsWLtjA",
+ "outputId": "45ab4dc6-4090-4ab1-bfe8-9f750ba2e85a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "ridge.alpha"
+ ],
+ "execution_count": 38,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0.1"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 38
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hRMK_QTmNgc1",
+ "outputId": "67d8e823-fabb-42d2-ad45-6b4a9b37e5b7",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
+ }
+ },
+ "source": [
+ "# Larger alpha --> stronger shrinkage of the coefficients;\n",
+ "# smaller alpha --> weaker regularization, so Ridge approaches OLS\n",
+ "rr = Ridge(alpha = 0.01)\n",
+ "rr.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": 58,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,\n",
+ " normalize=False, random_state=None, solver='auto', tol=0.001)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 58
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IRuWmBE7Ngc7"
+ },
+ "source": [
+ "# Training-set MSE for Ridge (rr) and for OLS (lr)\n",
+ "from sklearn.metrics import mean_squared_error\n",
+ "rr_model=(mean_squared_error(y_true = y_treinamento, y_pred = rr.predict(X_treinamento)))\n",
+ "lr_model=(mean_squared_error(y_true = y_treinamento, y_pred = lr.predict(X_treinamento)))\n"
+ ],
+ "execution_count": 59,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "B9pzuxOlukoJ",
+ "outputId": "4fa3f913-32f4-4024-a8c0-3502d94c1eb7",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "print(rr_model)"
+ ],
+ "execution_count": 60,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "23.94639697817076\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RXg_PAaGubh7",
+ "outputId": "9f20b504-fcfd-4f0d-b217-907dbeec84de",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "print(lr_model)"
+ ],
+ "execution_count": 61,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "23.946319854597377\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NEaj4QRrNgdA"
+ },
+ "source": [
+ "rr100 = Ridge(alpha=100)\n",
+ "rr100.fit(X_treinamento, y_treinamento)\n",
+ "train_score=lr.score(X_treinamento, y_treinamento)\n",
+ "test_score=lr.score(X_teste, y_teste)\n",
+ "Ridge_treinamento_score = rr.score(X_treinamento,y_treinamento)"
+ ],
+ "execution_count": 44,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zhcfoTEENgdE",
+ "outputId": "cfc27d00-a0b6-4cde-9e96-2fa541bea2ce",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Training-set MSE for Ridge with alpha = 100\n",
+ "rr100_model = (mean_squared_error(y_true = y_treinamento, y_pred = rr100.predict(X_treinamento)))\n",
+ "print(rr100_model)"
+ ],
+ "execution_count": 45,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "26.460105089888508\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cEF_3GgUgF0Q"
+ },
+ "source": [
+ "# Lasso\n",
+ "* Reduces overfitting;\n",
+ "* Performs feature selection: it shrinks some coefficients to exactly zero and, among highly correlated variables, tends to keep only one."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-YiKb9reQdI4"
+ },
+ "source": [
+ "* Used for regularization: the process of penalizing the coefficients so that only the most important features remain. Consider how useful this is for a dataframe with many variables;\n",
+ "* Lasso regression has a parameter ($\\alpha$): the larger the $\\alpha$, the more coefficients are shrunk to zero. When $\\alpha = 0$, Lasso yields the same coefficients as ordinary linear regression; when $\\alpha$ is very large, all coefficients are zero."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ME6v6LFlgF0Q",
+ "outputId": "dcfa9e54-ffe1-476a-f236-77d92193c7b6",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 67
+ }
+ },
+ "source": [
+ "from sklearn.linear_model import Lasso\n",
+ "lasso = Lasso(alpha = .1)\n",
+ "lasso.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,\n",
+ " normalize=False, positive=False, precompute=False, random_state=None,\n",
+ " selection='cyclic', tol=0.0001, warm_start=False)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 197
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "h6DSEHc1gF0V",
+ "outputId": "03e88d83-e311-4fc6-fadc-db9483c4092f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 84
+ }
+ },
+ "source": [
+ "lasso.coef_"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([-6.53050169e-02, 4.69929493e-02, 2.03045631e-03, 1.56638852e+00,\n",
+ " -0.00000000e+00, 3.77954671e+00, -6.40432403e-03, -1.06129312e+00,\n",
+ " 2.58073061e-01, -1.42708307e-02, -7.81773992e-01, 9.95091849e-03,\n",
+ " -5.87452824e-01])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 198
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xP1fX1Bi6VdX"
+ },
+ "source": [
+ "Variables whose coefficient is zero can be dropped from the analysis/model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9P7hYoo4gF0Z"
+ },
+ "source": [
+ "# Elastic Net \n",
+ "* Combines the strengths of Ridge and Lasso;\n",
+ "* Removes variables with little predictive power (Lasso) or shrinks their coefficients (Ridge)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yChNUYs7gF0b"
+ },
+ "source": [
+ "from sklearn.linear_model import ElasticNet\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "\n",
+ "# Instantiate the estimator\n",
+ "en = ElasticNet(alpha = .1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4mbIaAUAF4N6",
+ "outputId": "0a0b7576-7240-419d-d7b3-871f34f141f4",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 67
+ }
+ },
+ "source": [
+ "en.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.5,\n",
+ " max_iter=1000, normalize=False, positive=False, precompute=False,\n",
+ " random_state=None, selection='cyclic', tol=0.0001, warm_start=False)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 203
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "MaUkZw8ngF0h",
+ "outputId": "5d6db1e6-1d99-46da-bdb8-c4e1054dc2df",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 84
+ }
+ },
+ "source": [
+ "en.coef_"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([-7.14375105e-02, 4.98062892e-02, 3.25764298e-03, 1.32398367e+00,\n",
+ " -1.16648025e-01, 3.29040345e+00, -3.09984870e-03, -1.07673872e+00,\n",
+ " 2.80823236e-01, -1.50703816e-02, -8.13376450e-01, 9.70397656e-03,\n",
+ " -6.21886279e-01])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 204
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xl-Qh9caDyCp"
+ },
+ "source": [
+ "# Instantiate the estimator:\n",
+ "en = ElasticNet(normalize = True)\n",
+ "\n",
+ "# Hyperparameter grid to search over:\n",
+ "d_hiperparametros = {'alpha': np.logspace(-5, 2, 8), \n",
+ " 'l1_ratio': [.2, .4, .6, .8]}\n",
+ "\n",
+ "search = GridSearchCV(estimator = en, \n",
+ " param_grid = d_hiperparametros, \n",
+ " scoring = 'neg_mean_squared_error', \n",
+ " n_jobs = 1,\n",
+ " refit = True, \n",
+ " cv = 10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "c3_XCQCPGlr3",
+ "outputId": "bd72fa59-54a2-43f0-fc29-536fbb41cf99",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "search.fit(X_treinamento, y_treinamento)\n",
+ "search.best_params_"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "{'alpha': 0.0001, 'l1_ratio': 0.4}"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 181
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zq0_ugQfGrdb",
+ "outputId": "9965249d-2280-4d0c-c568-0d7c892f6c3b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
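+ "# Note: the grid search above reported alpha = 0.0001 and l1_ratio = 0.4;\n",
+ "# the values used here differ from that result.\n",
+ "en2 = ElasticNet(normalize = True, alpha = 0.001, l1_ratio = 0.6)\n",
+ "en2.fit(X_treinamento, y_treinamento)\n",
+ "\n",
+ "# Metric: MSE on the test set\n",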
+ "ml2 = (mean_squared_error(y_true = y_teste, y_pred = en2.predict(X_teste)))\n",
+ "print(ml2)"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "15.410850398354446\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aoxf9KKYOjEd"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sUUrajAxOkHg"
+ },
+ "source": [
+ "# Regularized Regression Methods \n",
+ "## Ridge Regression - Penalized Regression\n",
+ "> Reduces model complexity while using every variable in $X$: the coefficients $w_{i}$ are penalized when they are far from zero, which shrinks them continuously toward small values. The model becomes simpler, yet all variables stay in it.\n",
+ "* Smaller impact of outliers.\n",
+ "\n",
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VVBd5g8NOkHh"
+ },
+ "source": [
+ "from sklearn.linear_model import LinearRegression, Ridge\n",
+ "ridge = Ridge(alpha = .1)\n",
+ "lr = LinearRegression()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "o00xH2MvxvgP"
+ },
+ "source": [
+ "# Covariate matrix of the model:\n",
+ "X_new = [[0, 0], [0, 0], [1, 1]]\n",
+ "y_new = [0, .1, 1]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "v9U7c03NzW_c"
+ },
+ "source": [
+ "X_new"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "iiVEAPpUzXyN"
+ },
+ "source": [
+ "y_new"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8mWj2GbPOkHx"
+ },
+ "source": [
+ "ridge = Ridge(alpha = .1)\n",
+ "ridge.fit(X_new, y_new)\n",
+ "ridge.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0kD7Bsq_OkH1"
+ },
+ "source": [
+ "# fit the Ridge regression\n",
+ "ridge.fit(X, y)\n",
+ "\n",
+ "# fit the ordinary least squares (OLS) linear regression\n",
+ "lr.fit(X, y)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aUEyK4lygFy_"
+ },
+ "source": [
+ "ridge.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qYRLUwIugFzC"
+ },
+ "source": [
+ "lr.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "URcHb6uggFzF"
+ },
+ "source": [
+ "# Add some outliers to the data\n",
+ "outliers = y[950:] - 600\n",
+ "outliers"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AqA2dFQWgFzH"
+ },
+ "source": [
+ "import numpy as np\n",
+ "y_outlier = np.append(y[:950], outliers)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "3YRDzmZkgFzL"
+ },
+ "source": [
+ "plt.scatter(X, y_outlier, s=5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_wFZ_AX1gFzU"
+ },
+ "source": [
+ "lr = LinearRegression()\n",
+ "lr.fit(X, y_outlier)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7NYzB9nEgFze"
+ },
+ "source": [
+ "y_pred_outliers= lr.predict(X)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E7aaUXzDgFzh"
+ },
+ "source": [
+ "plt.scatter(X, y_outlier,s=5,label='actual')\n",
+ "plt.scatter(X, y_pred_outliers,s=5,label='prediction with outliers')\n",
+ "plt.scatter(X, y_pred,s=5,c='k', label='prediction without outliers')\n",
+ "plt.legend()\n",
+ "plt.title('Linear Regression')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LON9hAomgFzl"
+ },
+ "source": [
+ "lr.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "m3Yu7pAigFzt"
+ },
+ "source": [
+ "ridge = Ridge(alpha = 1000)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xB9fKEImgFzw"
+ },
+ "source": [
+ "ridge.fit(X, y_outlier)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "U4DxOv8EgFzz"
+ },
+ "source": [
+ "y_pred_ridge = ridge.predict(X)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vVWQuFVegFz2"
+ },
+ "source": [
+ "plt.scatter(X, y_outlier, s = 5,label = 'actual')\n",
+ "plt.scatter(X, y_pred_outliers, s = 5, c = 'r' ,label = 'LinearRegression with outliers')\n",
+ "plt.scatter(X, y_pred_ridge, s = 5, c = 'k', label = 'RidgeRegression with outliers')\n",
+ "plt.legend()\n",
+ "plt.title('Linear vs. Ridge Regression')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "seq5MCvDgFz5"
+ },
+ "source": [
+ "ridge.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0d0DL3YYgFz-"
+ },
+ "source": [
+ "## Effect of $\\alpha$ on Ridge Regression\n",
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vCA4BvRkgFz_"
+ },
+ "source": [
+ "from sklearn.datasets import make_regression\n",
+ "\n",
+ "X, y, w = make_regression(n_samples = 10, \n",
+ " n_features = 10, \n",
+ " coef = True, \n",
+ " random_state = 1, \n",
+ " bias = 3.5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2l2zCIX6gF0D"
+ },
+ "source": [
+ "w"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BSXLQl5COkI8"
+ },
+ "source": [
+ "# Lasso\n",
+ "* Reduces overfitting;\n",
+ "* Performs feature selection: it shrinks some coefficients to exactly zero and, among highly correlated variables, tends to keep only one."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "i5JZTnkTOkI9"
+ },
+ "source": [
+ "from sklearn.linear_model import Lasso\n",
+ "lasso = Lasso(alpha = .1)\n",
+ "lasso.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gEUxSlThOkJD"
+ },
+ "source": [
+ "lasso.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "90pfP9-3OkJG"
+ },
+ "source": [
+ "Note above that the second coefficient was estimated as 0, so the corresponding variable can be dropped from the model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ILCXvYKDOkJH"
+ },
+ "source": [
+ "# Elastic Net \n",
+ "* Combines the strengths of Ridge and Lasso;\n",
+ "* Removes variables with little predictive power (Lasso) or shrinks their coefficients (Ridge)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GaQPDCR2OkJI"
+ },
+ "source": [
+ "from sklearn.linear_model import ElasticNet\n",
+ "\n",
+ "# Instantiate the estimator\n",
+ "en = ElasticNet(alpha = .1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xVp16Eu_OkJL"
+ },
+ "source": [
+ "en.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kwj018U8OkJO"
+ },
+ "source": [
+ "en.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XTvoKmbY_uXM"
+ },
+ "source": [
+ "# Complete example: Ridge\n",
+ "* Adapted from [Ridge and Lasso Regression: A Complete Guide with Python Scikit-Learn](https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "If7A_ceC_wW4"
+ },
+ "source": [
+ "# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.\n",
+ "from sklearn.datasets import load_boston\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n",
+ "\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.linear_model import LinearRegression\n",
+ "from sklearn.linear_model import Ridge\n",
+ "\n",
+ "from sklearn.metrics import mean_squared_error"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "93n7uz0X_367"
+ },
+ "source": [
+ "boston = load_boston()\n",
+ "df_Boston = pd.DataFrame(boston.data, columns = boston.feature_names)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "EqRuXuzB_8Ge"
+ },
+ "source": [
+ "X_boston = boston.data\n",
+ "y_boston = boston.target"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NztUeubIACuA"
+ },
+ "source": [
+ "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_boston, y_boston, test_size = 0.2, random_state = 20111974)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "QNwBSc1FAEoB"
+ },
+ "source": [
+ "lr = LinearRegression()\n",
+ "lr.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "r9BrlLSIAHS3"
+ },
+ "source": [
+ "# Larger alpha --> stronger constraint on the coefficients (more regularization);\n",
+ "# smaller alpha --> weaker regularization, so Ridge approaches OLS\n",
+ "rr = Ridge(alpha = 0.01)\n",
+ "rr.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dchr1UwjEn-A"
+ },
+ "source": [
+ "# MSE\n",
+ "rr_model=(mean_squared_error(y_true = y_treinamento, y_pred = rr.predict(X_treinamento)))\n",
+ "print(rr_model)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Fnic6PY-CV-P"
+ },
+ "source": [
+ "rr100 = Ridge(alpha=100)\n",
+ "rr100.fit(X_treinamento, y_treinamento)\n",
+ "train_score=lr.score(X_treinamento, y_treinamento)\n",
+ "test_score=lr.score(X_teste, y_teste)\n",
+ "Ridge_treinamento_score = rr.score(X_treinamento,y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "TIe76z26EdG5"
+ },
+ "source": [
+ "# MSE\n",
+ "rr100_model = (mean_squared_error(y_true = y_treinamento, y_pred = rr100.predict(X_treinamento)))\n",
+ "print(rr100_model)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dAuvhdNdAJ7C"
+ },
+ "source": [
+ "Ridge_teste_score = rr.score(X_teste, y_teste)\n",
+ "Ridge_treinamento_score100 = rr100.score(X_treinamento, y_treinamento)\n",
+ "Ridge_teste_score100 = rr100.score(X_teste, y_teste)\n",
+ "print(\"linear regression train score:\", train_score)\n",
+ "print(\"linear regression test score:\", test_score)\n",
+ "print(\"ridge regression train score low alpha:\", Ridge_treinamento_score)\n",
+ "print(\"ridge regression test score low alpha:\", Ridge_teste_score)\n",
+ "print(\"ridge regression train score high alpha:\", Ridge_treinamento_score100)\n",
+ "print(\"ridge regression test score high alpha:\", Ridge_teste_score100)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "a0OaweJrCchd"
+ },
+ "source": [
+ "plt.plot(rr.coef_, \n",
+ " alpha = 0.7, \n",
+ " linestyle = 'none', \n",
+ " marker = '*', \n",
+ " markersize = 5, \n",
+ " color = 'red', \n",
+ "         label = r'Ridge; $\\alpha = 0.01$', \n",
+ " zorder = 7) # zorder for ordering the markers\n",
+ "\n",
+ "plt.plot(rr100.coef_,alpha = 0.5, \n",
+ " linestyle = 'none', \n",
+ " marker = 'd', \n",
+ " markersize = 6, \n",
+ " color = 'blue', \n",
+ "         label = r'Ridge; $\\alpha = 100$') # alpha here is for transparency\n",
+ "\n",
+ "plt.plot(lr.coef_, \n",
+ " alpha = 0.4, \n",
+ " linestyle = 'none', \n",
+ " marker = 'o', \n",
+ " markersize = 7, \n",
+ " color = 'green', \n",
+ " label = 'Linear Regression')\n",
+ "\n",
+ "plt.xlabel('Coefficient Index', fontsize = 16)\n",
+ "plt.ylabel('Coefficient Magnitude',fontsize = 16)\n",
+ "plt.legend(fontsize = 13, loc = 4)\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PEtGRcl-EHaF"
+ },
+ "source": [
+ "from sklearn.metrics import mean_squared_error\n",
+ "rr_model = mean_squared_error(y_true = y_treinamento, y_pred = rr.predict(X_treinamento))\n",
+ "print(rr_model)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_dwlPByHDipf"
+ },
+ "source": [
+ "# Complete example - Elastic Net"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yhbIfGnfOkKF"
+ },
+ "source": [
+ "from sklearn.linear_model import ElasticNet\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "\n",
+ "# Instantiate the estimator:\n",
+ "en = ElasticNet(normalize = True)\n",
+ "\n",
+ "# Hyperparameter grid to search over:\n",
+ "d_hiperparametros = {'alpha': np.logspace(-5, 2, 8), \n",
+ " 'l1_ratio': [.2, .4, .6, .8]}\n",
+ "\n",
+ "search = GridSearchCV(estimator = en, \n",
+ " param_grid = d_hiperparametros, \n",
+ " scoring = 'neg_mean_squared_error', \n",
+ " n_jobs = 1,\n",
+ " refit = True, \n",
+ " cv = 10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yDPkRazPOkKJ"
+ },
+ "source": [
+ "search.fit(X_treinamento, y_treinamento)\n",
+ "search.best_params_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "D_K-f8KCOkKM"
+ },
+ "source": [
+ "en2 = ElasticNet(normalize = True, alpha = 0.001, l1_ratio = 0.6)\n",
+ "en2.fit(X_treinamento, y_treinamento)\n",
+ "\n",
+ "# Metric: MSE on the test set\n",
+ "ml2 = (mean_squared_error(y_true = y_teste, y_pred = en2.predict(X_teste)))\n",
+ "print(ml2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "a5o7FiRm9_vb"
+ },
+ "source": [
+ "# Exercise 1 - Linear Regression - Mall_Customers.csv\n",
+ "> The target variable of this dataframe is 'Annual Income'. Build a regression model using OLS, Ridge and Lasso, and compare the results.\n",
+ "\n",
+ "* Try:\n",
+ "    * Lasso(alpha = 0.01, max_iter = 10**6);\n",
+ "    * Lasso(alpha = 0.0001, max_iter = 10**6);\n",
+ "    * Ridge(alpha = 0.01);\n",
+ "    * Ridge(alpha = 100);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rJRWBzSQCcss"
+ },
+ "source": [
+ "# Logistic Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Ucn0pQThO1eN"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XwuMfMD1gFyd"
+ },
+ "source": [
+ "# Logistic regression example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "efF3st3sHxPG"
+ },
+ "source": [
+ "# Load the libraries\n",
+ "import numpy as np\n",
+ "np.set_printoptions(formatter = {'float': lambda x: \"{0:0.2f}\".format(x)})\n",
+ "\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "import statsmodels.api as sm\n",
+ "\n",
+ "%matplotlib inline"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Bk9F6JO0IELv",
+ "outputId": "71d6cdf3-6eb8-4e0a-9829-67d301b60b97",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 195
+ }
+ },
+ "source": [
+ "# Load the dataset - Diabetes dataframe\n",
+ "from sklearn import datasets\n",
+ "#Diabetes = datasets.load_diabetes()\n",
+ "\n",
+ "url = 'https://raw.githubusercontent.com/MathMachado/DSWP/master/Dataframes/diabetes.csv'\n",
+ "diabetes = pd.read_csv(url)\n",
+ "diabetes.head()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>Pregnancies</th><th>Glucose</th><th>BloodPressure</th><th>SkinThickness</th><th>Insulin</th><th>BMI</th><th>DiabetesPedigreeFunction</th><th>Age</th><th>Outcome</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>\n",
+ "    <tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>\n",
+ "    <tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>\n",
+ "    <tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>\n",
+ "    <tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Pregnancies Glucose BloodPressure ... DiabetesPedigreeFunction Age Outcome\n",
+ "0 6 148 72 ... 0.627 50 1\n",
+ "1 1 85 66 ... 0.351 31 0\n",
+ "2 8 183 64 ... 0.672 32 1\n",
+ "3 1 89 66 ... 0.167 21 0\n",
+ "4 0 137 40 ... 2.288 33 1\n",
+ "\n",
+ "[5 rows x 9 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 21
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tjRmpaPIDknb",
+ "outputId": "b5102f14-cfa1-4354-9167-e6d8fbf313cc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 195
+ }
+ },
+ "source": [
+ "# Define the X matrix and the y vector\n",
+ "X_diabetes = diabetes.copy()\n",
+ "X_diabetes.drop(columns = ['Outcome'], axis = 1, inplace = True)\n",
+ "y_diabetes = diabetes['Outcome']\n",
+ "\n",
+ "X_diabetes.head()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>Pregnancies</th><th>Glucose</th><th>BloodPressure</th><th>SkinThickness</th><th>Insulin</th><th>BMI</th><th>DiabetesPedigreeFunction</th><th>Age</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td></tr>\n",
+ "    <tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td></tr>\n",
+ "    <tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td></tr>\n",
+ "    <tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td></tr>\n",
+ "    <tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Pregnancies Glucose BloodPressure ... BMI DiabetesPedigreeFunction Age\n",
+ "0 6 148 72 ... 33.6 0.627 50\n",
+ "1 1 85 66 ... 26.6 0.351 31\n",
+ "2 8 183 64 ... 23.3 0.672 32\n",
+ "3 1 89 66 ... 28.1 0.167 21\n",
+ "4 0 137 40 ... 43.1 2.288 33\n",
+ "\n",
+ "[5 rows x 8 columns]"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 24
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jLrx69TH-Mad",
+ "outputId": "0f802232-d17b-4803-fe0f-ef87135f0e01",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "X_diabetes.shape"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(768, 8)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 25
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mdFBioP6-Ply",
+ "outputId": "ffdc7ca3-045c-46ff-e70a-5f26a7157bca",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "y_diabetes.shape"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "(768,)"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 6
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "fhLySN65IaDF"
+ },
+ "source": [
+ "# Split the data into training and test sets\n",
+ "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_diabetes, y_diabetes)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "J5R8HlnuIGpL",
+ "outputId": "27dbf904-24e8-4013-93b8-007bc1fe36aa",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 67
+ }
+ },
+ "source": [
+ "# Import LinearRegression\n",
+ "from sklearn.linear_model import LinearRegression\n",
+ "\n",
+ "# Instantiate the estimator\n",
+ "lr = LinearRegression()\n",
+ "\n",
+ "# Using statsmodels:\n",
+ "x = sm.add_constant(X_treinamento) # adds an intercept column; note that the model below is fit on X_treinamento, without it\n",
+ "lr_sm = sm.Logit(y_treinamento, X_treinamento) # Note: statsmodels takes the arguments in the order (y, X)\n",
+ "\n",
+ "# Fit the models\n",
+ "lr.fit(X_treinamento, y_treinamento)\n",
+ "resultado_sm = lr_sm.fit()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Optimization terminated successfully.\n",
+ " Current function value: 0.596920\n",
+ " Iterations 5\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GlbCaPp1ETNa",
+ "outputId": "93b95119-9395-4677-87db-6a41dc27b940",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 357
+ }
+ },
+ "source": [
+ "resultado_sm.summary()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<table class=\"simpletable\">\n",
+ "<caption>Logit Regression Results</caption>\n",
+ "<tr><th>Dep. Variable:</th><td>Outcome</td><th>No. Observations:</th><td>576</td></tr>\n",
+ "<tr><th>Model:</th><td>Logit</td><th>Df Residuals:</th><td>568</td></tr>\n",
+ "<tr><th>Method:</th><td>MLE</td><th>Df Model:</th><td>7</td></tr>\n",
+ "<tr><th>Date:</th><td>Mon, 26 Oct 2020</td><th>Pseudo R-squ.:</th><td>0.05860</td></tr>\n",
+ "<tr><th>Time:</th><td>13:23:03</td><th>Log-Likelihood:</th><td>-343.83</td></tr>\n",
+ "<tr><th>converged:</th><td>True</td><th>LL-Null:</th><td>-365.23</td></tr>\n",
+ "<tr><th>Covariance Type:</th><td>nonrobust</td><th>LLR p-value:</th><td>3.632e-07</td></tr>\n",
+ "</table>\n",
+ "<table class=\"simpletable\">\n",
+ "<tr><td></td><th>coef</th><th>std err</th><th>z</th><th>P&gt;|z|</th><th>[0.025</th><th>0.975]</th></tr>\n",
+ "<tr><th>Pregnancies</th><td>0.1447</td><td>0.033</td><td>4.364</td><td>0.000</td><td>0.080</td><td>0.210</td></tr>\n",
+ "<tr><th>Glucose</th><td>0.0116</td><td>0.003</td><td>3.614</td><td>0.000</td><td>0.005</td><td>0.018</td></tr>\n",
+ "<tr><th>BloodPressure</th><td>-0.0318</td><td>0.006</td><td>-5.574</td><td>0.000</td><td>-0.043</td><td>-0.021</td></tr>\n",
+ "<tr><th>SkinThickness</th><td>0.0022</td><td>0.007</td><td>0.300</td><td>0.764</td><td>-0.012</td><td>0.017</td></tr>\n",
+ "<tr><th>Insulin</th><td>0.0014</td><td>0.001</td><td>1.476</td><td>0.140</td><td>-0.000</td><td>0.003</td></tr>\n",
+ "<tr><th>BMI</th><td>-0.0012</td><td>0.013</td><td>-0.094</td><td>0.925</td><td>-0.027</td><td>0.025</td></tr>\n",
+ "<tr><th>DiabetesPedigreeFunction</th><td>0.0411</td><td>0.283</td><td>0.145</td><td>0.885</td><td>-0.514</td><td>0.596</td></tr>\n",
+ "<tr><th>Age</th><td>-0.0145</td><td>0.010</td><td>-1.474</td><td>0.140</td><td>-0.034</td><td>0.005</td></tr>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ "\n",
+ "\"\"\"\n",
+ " Logit Regression Results \n",
+ "==============================================================================\n",
+ "Dep. Variable: Outcome No. Observations: 576\n",
+ "Model: Logit Df Residuals: 568\n",
+ "Method: MLE Df Model: 7\n",
+ "Date: Mon, 26 Oct 2020 Pseudo R-squ.: 0.05860\n",
+ "Time: 13:23:03 Log-Likelihood: -343.83\n",
+ "converged: True LL-Null: -365.23\n",
+ "Covariance Type: nonrobust LLR p-value: 3.632e-07\n",
+ "============================================================================================\n",
+ " coef std err z P>|z| [0.025 0.975]\n",
+ "--------------------------------------------------------------------------------------------\n",
+ "Pregnancies 0.1447 0.033 4.364 0.000 0.080 0.210\n",
+ "Glucose 0.0116 0.003 3.614 0.000 0.005 0.018\n",
+ "BloodPressure -0.0318 0.006 -5.574 0.000 -0.043 -0.021\n",
+ "SkinThickness 0.0022 0.007 0.300 0.764 -0.012 0.017\n",
+ "Insulin 0.0014 0.001 1.476 0.140 -0.000 0.003\n",
+ "BMI -0.0012 0.013 -0.094 0.925 -0.027 0.025\n",
+ "DiabetesPedigreeFunction 0.0411 0.283 0.145 0.885 -0.514 0.596\n",
+ "Age -0.0145 0.010 -1.474 0.140 -0.034 0.005\n",
+ "============================================================================================\n",
+ "\"\"\""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 37
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "e7hWIjr0J8fd",
+ "outputId": "0c1247dd-d6d3-4d38-b9c5-c90a83458580",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 67
+ }
+ },
+ "source": [
+ "# Model coefficients\n",
+ "lr.coef_ "
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([ 10.50312025, -263.32615982, 516.66778363, 356.84510148,\n",
+ " -1037.40954808, 731.51011113, 121.62332809, 163.11261651,\n",
+ " 780.54426871, 66.11245968])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 11
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5DVjyWUUKH4t",
+ "outputId": "a58aca10-7682-4ccd-97d7-11e19b6d7604",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# Model intercept\n",
+ "lr.intercept_"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "155.02945244919295"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 12
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-FJaSnJLKICU",
+ "outputId": "4daf587a-5c70-48bc-b563-f14a23b49d0a",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "# MSE - mean squared error\n",
+ "np.mean((lr.predict(X_teste) - y_teste) ** 2) "
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "2998.4466244258106"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 13
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6bVEUSTUPzOj"
+ },
+ "source": [
+ "### Compute y_pred, the predicted values of y"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OjGrNhTNLcr-",
+ "outputId": "5577e115-8c38-4451-d10a-d25065e3b9cc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 907
+ }
+ },
+ "source": [
+ "y_pred = lr.predict(X_treinamento)\n",
+ "\n",
+ "# Prediction with statsmodels\n",
+ "resultado_sm.predict()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([0.41, 0.34, 0.41, 0.35, 0.18, 0.16, 0.34, 0.56, 0.48, 0.26, 0.16,\n",
+ " 0.34, 0.19, 0.44, 0.23, 0.32, 0.29, 0.11, 0.59, 0.31, 0.33, 0.46,\n",
+ " 0.83, 0.42, 0.16, 0.83, 0.17, 0.40, 0.37, 0.37, 0.35, 0.38, 0.29,\n",
+ " 0.57, 0.37, 0.30, 0.53, 0.37, 0.13, 0.42, 0.93, 0.22, 0.32, 0.52,\n",
+ " 0.23, 0.49, 0.34, 0.20, 0.10, 0.27, 0.28, 0.35, 0.37, 0.27, 0.42,\n",
+ " 0.41, 0.32, 0.54, 0.32, 0.46, 0.32, 0.24, 0.62, 0.49, 0.26, 0.34,\n",
+ " 0.84, 0.60, 0.23, 0.44, 0.48, 0.22, 0.19, 0.23, 0.55, 0.35, 0.13,\n",
+ " 0.38, 0.27, 0.09, 0.44, 0.18, 0.23, 0.35, 0.30, 0.23, 0.36, 0.30,\n",
+ " 0.72, 0.25, 0.24, 0.25, 0.45, 0.75, 0.05, 0.20, 0.67, 0.34, 0.43,\n",
+ " 0.35, 0.32, 0.20, 0.15, 0.17, 0.33, 0.44, 0.52, 0.41, 0.49, 0.27,\n",
+ " 0.20, 0.33, 0.31, 0.41, 0.49, 0.46, 0.82, 0.68, 0.54, 0.30, 0.19,\n",
+ " 0.15, 0.23, 0.38, 0.41, 0.37, 0.32, 0.34, 0.44, 0.17, 0.10, 0.56,\n",
+ " 0.50, 0.32, 0.25, 0.24, 0.27, 0.36, 0.69, 0.40, 0.30, 0.55, 0.49,\n",
+ " 0.35, 0.44, 0.36, 0.30, 0.18, 0.41, 0.15, 0.23, 0.71, 0.17, 0.15,\n",
+ " 0.28, 0.83, 0.56, 0.37, 0.35, 0.32, 0.29, 0.24, 0.27, 0.31, 0.28,\n",
+ " 0.35, 0.30, 0.43, 0.25, 0.23, 0.73, 0.26, 0.35, 0.43, 0.22, 0.32,\n",
+ " 0.33, 0.47, 0.30, 0.82, 0.19, 0.55, 0.54, 0.19, 0.30, 0.27, 0.23,\n",
+ " 0.41, 0.21, 0.61, 0.16, 0.29, 0.34, 0.32, 0.28, 0.24, 0.45, 0.20,\n",
+ " 0.26, 0.24, 0.25, 0.17, 0.28, 0.44, 0.32, 0.42, 0.39, 0.31, 0.25,\n",
+ " 0.35, 0.20, 0.37, 0.54, 0.32, 0.37, 0.41, 0.31, 0.22, 0.09, 0.27,\n",
+ " 0.36, 0.42, 0.19, 0.38, 0.48, 0.42, 0.50, 0.54, 0.31, 0.22, 0.46,\n",
+ " 0.32, 0.22, 0.26, 0.42, 0.35, 0.20, 0.22, 0.18, 0.50, 0.46, 0.25,\n",
+ " 0.48, 0.20, 0.19, 0.16, 0.49, 0.30, 0.70, 0.24, 0.20, 0.33, 0.15,\n",
+ " 0.34, 0.37, 0.14, 0.26, 0.21, 0.91, 0.35, 0.24, 0.20, 0.22, 0.39,\n",
+ " 0.43, 0.50, 0.30, 0.83, 0.21, 0.31, 0.51, 0.35, 0.39, 0.42, 0.30,\n",
+ " 0.48, 0.76, 0.28, 0.19, 0.26, 0.85, 0.30, 0.26, 0.18, 0.17, 0.18,\n",
+ " 0.30, 0.79, 0.57, 0.41, 0.25, 0.27, 0.70, 0.48, 0.56, 0.30, 0.23,\n",
+ " 0.23, 0.49, 0.74, 0.36, 0.31, 0.31, 0.44, 0.39, 0.73, 0.41, 0.36,\n",
+ " 0.91, 0.19, 0.31, 0.29, 0.55, 0.58, 0.47, 0.53, 0.35, 0.51, 0.29,\n",
+ " 0.13, 0.35, 0.52, 0.64, 0.19, 0.24, 0.72, 0.31, 0.42, 0.37, 0.50,\n",
+ " 0.26, 0.23, 0.47, 0.56, 0.31, 0.40, 0.56, 0.37, 0.35, 0.32, 0.36,\n",
+ " 0.65, 0.43, 0.41, 0.28, 0.57, 0.42, 0.22, 0.47, 0.63, 0.15, 0.58,\n",
+ " 0.10, 0.59, 0.16, 0.28, 0.17, 0.43, 0.21, 0.22, 0.22, 0.29, 0.37,\n",
+ " 0.76, 0.73, 0.24, 0.64, 0.45, 0.14, 0.34, 0.44, 0.85, 0.29, 0.45,\n",
+ " 0.63, 0.35, 0.21, 0.38, 0.45, 0.36, 0.19, 0.62, 0.72, 0.29, 0.46,\n",
+ " 0.44, 0.50, 0.26, 0.22, 0.64, 0.26, 0.32, 0.61, 0.67, 0.27, 0.28,\n",
+ " 0.22, 0.36, 0.56, 0.24, 0.36, 0.23, 0.37, 0.50, 0.26, 0.59, 0.15,\n",
+ " 0.29, 0.35, 0.09, 0.09, 0.29, 0.29, 0.43, 0.44, 0.23, 0.20, 0.42,\n",
+ " 0.30, 0.22, 0.19, 0.37, 0.43, 0.29, 0.19, 0.47, 0.26, 0.19, 0.23,\n",
+ " 0.26, 0.42, 0.24, 0.30, 0.38, 0.81, 0.88, 0.44, 0.22, 0.33, 0.29,\n",
+ " 0.51, 0.23, 0.22, 0.48, 0.35, 0.25, 0.45, 0.28, 0.52, 0.32, 0.45,\n",
+ " 0.34, 0.48, 0.46, 0.32, 0.61, 0.26, 0.12, 0.50, 0.48, 0.22, 0.28,\n",
+ " 0.61, 0.35, 0.60, 0.31, 0.44, 0.37, 0.29, 0.87, 0.09, 0.41, 0.50,\n",
+ " 0.29, 0.16, 0.34, 0.29, 0.24, 0.38, 0.32, 0.39, 0.25, 0.56, 0.28,\n",
+ " 0.08, 0.27, 0.37, 0.24, 0.26, 0.35, 0.48, 0.24, 0.33, 0.20, 0.61,\n",
+ " 0.14, 0.31, 0.60, 0.53, 0.62, 0.53, 0.54, 0.35, 0.14, 0.31, 0.42,\n",
+ " 0.21, 0.64, 0.19, 0.38, 0.41, 0.11, 0.27, 0.26, 0.22, 0.36, 0.28,\n",
+ " 0.38, 0.51, 0.08, 0.27, 0.68, 0.38, 0.55, 0.57, 0.49, 0.50, 0.46,\n",
+ " 0.20, 0.28, 0.38, 0.44, 0.37, 0.45, 0.45, 0.22, 0.31, 0.33, 0.26,\n",
+ " 0.21, 0.25, 0.17, 0.33, 0.30, 0.46, 0.26, 0.36, 0.53, 0.52, 0.27,\n",
+ " 0.28, 0.33, 0.27, 0.81, 0.47, 0.27, 0.20, 0.10, 0.27, 0.26, 0.33,\n",
+ " 0.66, 0.58, 0.25, 0.25, 0.29, 0.31, 0.24, 0.35, 0.35, 0.29, 0.24,\n",
+ " 0.69, 0.22, 0.29, 0.55])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 39
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FolXBGbFFUnE",
+ "outputId": "b4e5039d-4e71-40e2-8549-b4bb5f21a7dc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 605
+ }
+ },
+ "source": [
+ "np.array(diabetes['Outcome'])"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,\n",
+ " 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,\n",
+ " 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,\n",
+ " 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,\n",
+ " 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,\n",
+ " 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1,\n",
+ " 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,\n",
+ " 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,\n",
+ " 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,\n",
+ " 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,\n",
+ " 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,\n",
+ " 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,\n",
+ " 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,\n",
+ " 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,\n",
+ " 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,\n",
+ " 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,\n",
+ " 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,\n",
+ " 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0,\n",
+ " 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1,\n",
+ " 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n",
+ " 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,\n",
+ " 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,\n",
+ " 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\n",
+ " 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,\n",
+ " 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,\n",
+ " 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,\n",
+ " 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,\n",
+ " 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,\n",
+ " 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0,\n",
+ " 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 40
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pUxasncIFaw4",
+ "outputId": "e799ed76-bcbd-4620-fc98-11b2b998f67f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 50
+ }
+ },
+ "source": [
+ "resultado_sm.pred_table()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[343.00, 43.00],\n",
+ " [129.00, 61.00]])"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 41
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "_liLYinwFgch",
+ "outputId": "6f5743ab-ae7b-4d4e-d23b-5779be359113",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 106
+ }
+ },
+ "source": [
+ "confusion_matrix = pd.DataFrame(resultado_sm.pred_table())\n",
+ "confusion_matrix.columns = ['Predicted No Diabetes', 'Predicted Diabetes']\n",
+ "confusion_matrix = confusion_matrix.rename(index = {0 : 'Actual No Diabetes', 1 : 'Actual Diabetes'})\n",
+ "confusion_matrix"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
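+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Predicted No Diabetes</th>\n",
+ " <th>Predicted Diabetes</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>Actual No Diabetes</th>\n",
+ " <td>343.0</td>\n",
+ " <td>43.0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Actual Diabetes</th>\n",
+ " <td>129.0</td>\n",
+ " <td>61.0</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"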
+ ],
+ "text/plain": [
+ " Predicted No Diabetes Predicted Diabetes\n",
+ "Actual No Diabetes 343.0 43.0\n",
+ "Actual Diabetes 129.0 61.0"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 42
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ceH3MODWFm7S",
+ "outputId": "52c6473a-7a20-4eed-ee46-05d1bf460f84",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "source": [
+ "cm = np.array(confusion_matrix)\n",
+ "training_accuracy = (cm[0,0] + cm[1,1])/ cm.sum()\n",
+ "training_accuracy"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0.7013888888888888"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 43
+ }
+ ]
+ },
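+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Accuracy is only one summary of the table above. As a minimal sketch, the cell below recomputes it together with precision and recall, assuming the `pred_table()` layout shown above (rows are actual classes, columns are predicted classes). The names `cm_demo`, `tn`, `fp`, `fn` and `tp` are illustrative and are not used elsewhere in this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Counts copied from the pred_table() output above (assumed layout: rows = actual, columns = predicted)\n",
+ "cm_demo = np.array([[343.0, 43.0], [129.0, 61.0]])\n",
+ "tn, fp, fn, tp = cm_demo.ravel()\n",
+ "\n",
+ "accuracy = (tn + tp) / cm_demo.sum() # share of all cases classified correctly\n",
+ "precision = tp / (tp + fp) # share of predicted positives that are actual positives\n",
+ "recall = tp / (tp + fn) # share of actual positives that were predicted positive\n",
+ "print(accuracy, precision, recall)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },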
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kfRda6kWFzHZ"
+ },
+ "source": [
+ "### Testing the model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "R0n-pdnkF3MC"
+ },
+ "source": [
+ "test_cleaned = test_data['Outcome']\n",
+ "test_data = test_data.drop(['Outcome'], axis = 1)\n",
+ "test_data = sm.add_constant(test_data)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "x5hTgpelF5Pu"
+ },
+ "source": [
+ "def confusion_matrix(data, actual_values, model):\n",
+ " predicted_values = model.predict(data)\n",
+ " bins = np.array ([0, 0.5, 1])\n",
+ " cm = np.histogram2d(actual_values, predicted_values, bins = bins)[0]\n",
+ " accuracy = (cm[0,0] + cm[1,1])/cm.sum()\n",
+ " return cm, accuracy"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9PBK6M8yF7JE"
+ },
+ "source": [
+ "conf_matrix = confusion_matrix(test_data, test_cleaned, result)\n",
+ "conf_matrix"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "EQz6ys7EF854"
+ },
+ "source": [
+ "# Use a separate name so the confusion_matrix() function defined above is not overwritten\n",
+ "confusion_matrix_test = pd.DataFrame(conf_matrix[0])\n",
+ "confusion_matrix_test.columns = ['Predicted No Diabetes', 'Predicted Diabetes']\n",
+ "confusion_matrix_test = confusion_matrix_test.rename(index = {0 : 'Actual No Diabetes', 1 : 'Actual Diabetes'})\n",
+ "confusion_matrix_test"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4MeuBR4QO1x3"
+ },
+ "source": [
+ "# Regularized Regression Methods \n",
+ "## Ridge Regression - Penalized Regression\n",
+ "> Reduces model complexity while using every variable in $X$: the coefficients $w_{i}$ are penalized when they are far from zero, which shrinks them continuously toward small values. This lowers the complexity of the model while keeping all variables in it.\n",
+ "* Lower impact of outliers.\n",
+ "\n",
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "J1gLVnjXO1x6"
+ },
+ "source": [
+ "from sklearn.linear_model import Ridge\n",
+ "ridge = Ridge(alpha = .1)\n",
+ "lr = LinearRegression()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0neeicWaO1yA"
+ },
+ "source": [
+ "# Covariate (design) matrix of the model:\n",
+ "X_new = [[0, 0], [0, 0], [1, 1]]\n",
+ "y_new = [0, .1, 1]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CnDB5Bd0O1yE"
+ },
+ "source": [
+ "X_new"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "UldOGKA6O1yJ"
+ },
+ "source": [
+ "y_new"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GRzwxGWpO1yO"
+ },
+ "source": [
+ "ridge = Ridge(alpha = .1)\n",
+ "ridge.fit(X_new, y_new)\n",
+ "ridge.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FXas7SfCO1yR"
+ },
+ "source": [
+ "# Fit the Ridge regression\n",
+ "ridge.fit(X, y)\n",
+ "\n",
+ "# Fit the ordinary least squares (OLS) linear regression\n",
+ "lr.fit(X, y)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gvQY7qzBO1yV"
+ },
+ "source": [
+ "ridge.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Wg1Z4h5tO1yY"
+ },
+ "source": [
+ "lr.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Puj1I_8CO1yd"
+ },
+ "source": [
+ "# Add some outliers to the data\n",
+ "outliers = y[950:] - 600\n",
+ "outliers"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Tw4Q3q6lO1yg"
+ },
+ "source": [
+ "import numpy as np\n",
+ "y_outlier = np.append(y[:950], outliers)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pq7ufBP_O1yk"
+ },
+ "source": [
+ "plt.scatter(X, y_outlier, s=5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Nx07u9aLO1yo"
+ },
+ "source": [
+ "lr = LinearRegression()\n",
+ "lr.fit(X, y_outlier)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5U_5Oc3TO1ys"
+ },
+ "source": [
+ "y_pred_outliers= lr.predict(X)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KsZ_jwBgO1yz"
+ },
+ "source": [
+ "plt.scatter(X, y_outlier,s=5,label='actual')\n",
+ "plt.scatter(X, y_pred_outliers,s=5,label='prediction with outliers')\n",
+ "plt.scatter(X, y_pred,s=5,c='k', label='prediction without outliers')\n",
+ "plt.legend()\n",
+ "plt.title('Linear Regression')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4ZJghaE-O1y4"
+ },
+ "source": [
+ "lr.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dMYaRMZxO1y8"
+ },
+ "source": [
+ "ridge = Ridge(alpha = 1000)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8swmMWrWO1zA"
+ },
+ "source": [
+ "ridge.fit(X, y_outlier)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FbrgmNIhO1zE"
+ },
+ "source": [
+ "y_pred_ridge = ridge.predict(X)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZLEEbFHYO1zI"
+ },
+ "source": [
+ "plt.scatter(X, y_outlier, s = 5,label = 'actual')\n",
+ "plt.scatter(X, y_pred_outliers, s = 5, c = 'r' ,label = 'LinearRegression with outliers')\n",
+ "plt.scatter(X, y_pred_ridge, s = 5, c = 'k', label = 'RidgeRegression with outliers')\n",
+ "plt.legend()\n",
+ "plt.title('Linear vs. Ridge Regression')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "KL_0cZilO1zO"
+ },
+ "source": [
+ "ridge.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
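+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a recap of the penalization described at the start of this section, here is a sketch of the objective that Ridge minimizes, in the form used by the scikit-learn documentation. The symbols are assumptions of this note: $w$ is the coefficient vector, $\\alpha \\ge 0$ is the regularization strength, and the intercept is not penalized.\n",
+ "\n",
+ "$$\\min_{w} \\; \\lVert Xw - y \\rVert_2^2 + \\alpha \\lVert w \\rVert_2^2$$\n",
+ "\n",
+ "With $\\alpha = 0$ this reduces to ordinary least squares; as $\\alpha$ grows, the coefficients $w_{i}$ are shrunk toward zero, but in general they are not set exactly to zero."
+ ]
+ },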
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5Q146_AyO1zS"
+ },
+ "source": [
+ "## Effect of $\\alpha$ on Ridge Regression\n",
+ "### Example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ydZvyvJ3O1zT"
+ },
+ "source": [
+ "from sklearn.datasets import make_regression\n",
+ "X, y, w = make_regression(n_samples = 10, \n",
+ " n_features = 10, \n",
+ " coef = True, \n",
+ " random_state = 1, \n",
+ " bias = 3.5)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "z187ZGCqO1zY"
+ },
+ "source": [
+ "w"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
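+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the effect of $\\alpha$ concrete, here is a minimal sketch that refits `Ridge` for a few values of `alpha` on data generated with the same `make_regression` call as above and prints the Euclidean norm of the fitted coefficients, which should not increase as `alpha` grows. The `_demo` names and the list of alphas are illustrative choices, not part of the original example."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.datasets import make_regression\n",
+ "from sklearn.linear_model import Ridge\n",
+ "\n",
+ "# Same synthetic data as above, stored under illustrative names\n",
+ "X_demo, y_demo, w_demo = make_regression(n_samples = 10, n_features = 10, coef = True, random_state = 1, bias = 3.5)\n",
+ "\n",
+ "for alpha in [0.01, 1, 100]:\n",
+ "    ridge_demo = Ridge(alpha = alpha).fit(X_demo, y_demo)\n",
+ "    # A larger alpha means a stronger penalty, hence a smaller coefficient norm\n",
+ "    print(alpha, np.linalg.norm(ridge_demo.coef_))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },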
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "h4UEPOD6O1zd"
+ },
+ "source": [
+ "# Lasso\n",
+ "* Reduces overfitting;\n",
+ "* Performs feature selection: the L1 penalty can set coefficients exactly to zero, and among highly correlated variables it tends to keep one and drop the others."
+ ]
+ },
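+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A sketch of the objective behind the bullets above, in the form given in the scikit-learn documentation for `Lasso`. The symbols are assumptions of this note: $w$ is the coefficient vector, $n_{\\text{samples}}$ is the number of rows of $X$, and $\\alpha \\ge 0$ is the regularization strength.\n",
+ "\n",
+ "$$\\min_{w} \\; \\frac{1}{2 n_{\\text{samples}}} \\lVert Xw - y \\rVert_2^2 + \\alpha \\lVert w \\rVert_1$$\n",
+ "\n",
+ "Because the penalty uses the absolute values of the coefficients rather than their squares, the minimizer can have coefficients that are exactly zero, which is what makes Lasso usable for feature selection."
+ ]
+ },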
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nv4t9GlkO1ze"
+ },
+ "source": [
+ "from sklearn.linear_model import Lasso\n",
+ "lasso = Lasso(alpha = .1)\n",
+ "lasso.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uLIbSdeSO1zj"
+ },
+ "source": [
+ "lasso.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_3AZohS2O1zq"
+ },
+ "source": [
+ "Note above that the second coefficient was estimated as 0, so the corresponding variable can be dropped from the model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dazegSrTO1zr"
+ },
+ "source": [
+ "# Elastic Net \n",
+ "* Combines the strengths of Ridge and LASSO;\n",
+ "* Removes variables with little predictive power (LASSO) or shrinks their coefficients (Ridge)."
+ ]
+ },
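+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A sketch of how the two penalties are combined, in the form given in the scikit-learn documentation for `ElasticNet`. The symbols are assumptions of this note: $w$ is the coefficient vector, $\\alpha \\ge 0$ is the overall regularization strength, and $\\rho \\in [0, 1]$ stands for the `l1_ratio` parameter.\n",
+ "\n",
+ "$$\\min_{w} \\; \\frac{1}{2 n_{\\text{samples}}} \\lVert Xw - y \\rVert_2^2 + \\alpha \\rho \\lVert w \\rVert_1 + \\frac{\\alpha (1 - \\rho)}{2} \\lVert w \\rVert_2^2$$\n",
+ "\n",
+ "Setting $\\rho = 1$ gives the Lasso penalty, and $\\rho = 0$ leaves only the Ridge-style squared penalty."
+ ]
+ },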
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SmfRQ4QKO1zs"
+ },
+ "source": [
+ "from sklearn.linear_model import ElasticNet\n",
+ "\n",
+ "# Instantiate the estimator\n",
+ "en = ElasticNet(alpha = .1)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IwWarH8BO1zv"
+ },
+ "source": [
+ "en.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "R2017DAXO1zz"
+ },
+ "source": [
+ "en.coef_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zz2uKTAJO1z6"
+ },
+ "source": [
+ "# Complete example: Ridge\n",
+ "* Adapted from [Ridge and Lasso Regression: A Complete Guide with Python Scikit-Learn](https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mLUs3Y4lO1z7"
+ },
+ "source": [
+ "# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version\n",
+ "from sklearn.datasets import load_boston\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n",
+ "\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.linear_model import LinearRegression\n",
+ "from sklearn.linear_model import Ridge\n",
+ "\n",
+ "from sklearn.metrics import mean_squared_error"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "IPVmVN83O1z_"
+ },
+ "source": [
+ "boston = load_boston()\n",
+ "df_Boston = pd.DataFrame(boston.data, columns = boston.feature_names)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Wyf2IO9fO10D"
+ },
+ "source": [
+ "X_boston = boston.data\n",
+ "y_boston = boston.target"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mddJC9UyO10K"
+ },
+ "source": [
+ "X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_boston, y_boston, test_size = 0.2, random_state = 20111974)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "7GoluU_cO10S"
+ },
+ "source": [
+ "lr = LinearRegression()\n",
+ "lr.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kA603LoWO10W"
+ },
+ "source": [
+ "# Larger alpha --> stronger constraint on the coefficients;\n",
+ "# smaller alpha --> weaker regularization, and Ridge approaches OLS\n",
+ "rr = Ridge(alpha = 0.01)\n",
+ "rr.fit(X_treinamento, y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-MvzKjBuO10a"
+ },
+ "source": [
+ "# MSE\n",
+ "rr_model=(mean_squared_error(y_true = y_treinamento, y_pred = rr.predict(X_treinamento)))\n",
+ "print(rr_model)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Hjg0XJPWO10g"
+ },
+ "source": [
+ "rr100 = Ridge(alpha=100)\n",
+ "rr100.fit(X_treinamento, y_treinamento)\n",
+ "train_score=lr.score(X_treinamento, y_treinamento)\n",
+ "test_score=lr.score(X_teste, y_teste)\n",
+ "Ridge_treinamento_score = rr.score(X_treinamento,y_treinamento)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6XpGG2_xO10j"
+ },
+ "source": [
+ "# MSE\n",
+ "rr100_model = (mean_squared_error(y_true = y_treinamento, y_pred = rr100.predict(X_treinamento)))\n",
+ "print(rr100_model)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "F9YF1Jc0O10m"
+ },
+ "source": [
+ "Ridge_teste_score = rr.score(X_teste, y_teste)\n",
+ "Ridge_treinamento_score100 = rr100.score(X_treinamento, y_treinamento)\n",
+ "Ridge_teste_score100 = rr100.score(X_teste, y_teste)\n",
+ "print(\"linear regression train score:\", train_score)\n",
+ "print(\"linear regression test score:\", test_score)\n",
+ "print(\"ridge regression train score low alpha:\", Ridge_treinamento_score)\n",
+ "print(\"ridge regression test score low alpha:\", Ridge_teste_score)\n",
+ "print(\"ridge regression train score high alpha:\", Ridge_treinamento_score100)\n",
+ "print(\"ridge regression test score high alpha:\", Ridge_teste_score100)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qoHzgC53O10q"
+ },
+ "source": [
+ "plt.plot(rr.coef_, \n",
+ " alpha = 0.7, \n",
+ " linestyle = 'none', \n",
+ " marker = '*', \n",
+ " markersize = 5, \n",
+ " color = 'red', \n",
+ " label = r'Ridge; $\\alpha = 0.01$', \n",
+ " zorder = 7) # zorder for ordering the markers\n",
+ "\n",
+ "plt.plot(rr100.coef_,alpha = 0.5, \n",
+ " linestyle = 'none', \n",
+ " marker = 'd', \n",
+ " markersize = 6, \n",
+ " color = 'blue', \n",
+ " label = r'Ridge; $\\alpha = 100$') # alpha here is for transparency\n",
+ "\n",
+ "plt.plot(lr.coef_, \n",
+ " alpha = 0.4, \n",
+ " linestyle = 'none', \n",
+ " marker = 'o', \n",
+ " markersize = 7, \n",
+ " color = 'green', \n",
+ " label = 'Linear Regression')\n",
+ "\n",
+ "plt.xlabel('Coefficient Index', fontsize = 16)\n",
+ "plt.ylabel('Coefficient Magnitude',fontsize = 16)\n",
+ "plt.legend(fontsize = 13, loc = 4)\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GCwp0QRBO10u"
+ },
+ "source": [
+ "from sklearn.metrics import mean_squared_error\n",
+ "# MSE of the low-alpha Ridge model (rr) on the test set\n",
+ "rr_test_mse = mean_squared_error(y_true = y_teste, y_pred = rr.predict(X_teste))\n",
+ "print(rr_test_mse)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tPw-VP63O10x"
+ },
+ "source": [
+ "# Complete example - Elastic Net"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dLu6v8HkO10y"
+ },
+ "source": [
+ "from sklearn.linear_model import ElasticNet\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "\n",
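+ "# Instantiate the estimator\n",
+ "# (the `normalize` argument was removed from ElasticNet in scikit-learn 1.2, so this cell needs an older version):\n",
+ "en = ElasticNet(normalize = True)\n",
+ "\n",
+ "# Hyperparameter search grid:\n",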
+ "d_hiperparametros = {'alpha': np.logspace(-5, 2, 8), \n",
+ " 'l1_ratio': [.2, .4, .6, .8]}\n",
+ "\n",
+ "search = GridSearchCV(estimator = en, \n",
+ " param_grid = d_hiperparametros, \n",
+ " scoring = 'neg_mean_squared_error', \n",
+ " n_jobs = 1,\n",
+ " refit = True, \n",
+ " cv = 10)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DzirI7FJO101"
+ },
+ "source": [
+ "search.fit(X, y)\n",
+ "search.best_params_"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jqPPkVP5O105"
+ },
+ "source": [
+ "en2 = ElasticNet(normalize = True, alpha = 0.001, l1_ratio = 0.6)\n",
+ "en2.fit(X, y)\n",
+ "\n",
+ "ml2 = (mean_squared_error(y_true = y, y_pred = en2.predict(X)))\n",
+ "print(ml2)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CH_iEuzhO109"
+ },
+ "source": [
+ "# Exercise 1 - Mall_Customers.csv\n",
+ "> The target variable of this dataframe is 'Annual Income'. Build a regression model using OLS, Ridge and LASSO and compare the results.\n",
+ "\n",
+ "* Try:\n",
+ " * Lasso(alpha = 0.01, max_iter = 10**6);\n",
+ " * Lasso(alpha = 0.0001, max_iter = 10**6);\n",
+ " * Ridge(alpha = 0.01);\n",
+ " * Ridge(alpha = 100);"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZfRDEaaRYxFQ"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "from sklearn import preprocessing\n",
+ "import matplotlib.pyplot as plt \n",
+ "plt.rc(\"font\", size=14)\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "import seaborn as sns\n",
+ "sns.set(style=\"white\")\n",
+ "sns.set(style=\"whitegrid\", color_codes=True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nulrLzUqYxFY"
+ },
+ "source": [
+ "## Data\n",
+ "\n",
+ "The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4LdrQCwxYxFY"
+ },
+ "source": [
+ "This dataset provides the customer information. It includes 41188 records and 21 fields."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qoT6zkoFYxFZ",
+ "outputId": "2a1561ef-28cd-445f-d8ec-a2dd8e8e8c20",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 50
+ }
+ },
+ "source": [
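+ "# Note: this file is semicolon-separated; without sep = ';' every row is read into a single column (see the output below)\n",
+ "df_bank = pd.read_csv('https://raw.githubusercontent.com/MathMachado/DataFrames/master/bank-full.csv', header = 0)\n",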
+ "df_bank = df_bank.dropna()\n",
+ "print(df_bank.shape)\n",
+ "print(list(df_bank.columns))"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "(45211, 1)\n",
+ "['age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"']\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZD23hMCeYxFc",
+ "outputId": "f3f4434d-428c-46fb-d15e-96cf2fa4ad8d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 195
+ }
+ },
+ "source": [
+ "df_bank.head()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>58;\"management\";\"married\";\"tertiary\";\"no\";2143...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>44;\"technician\";\"single\";\"secondary\";\"no\";29;\"...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>33;\"entrepreneur\";\"married\";\"secondary\";\"no\";2...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>47;\"blue-collar\";\"married\";\"unknown\";\"no\";1506...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>33;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " age;\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"\n",
+ "0 58;\"management\";\"married\";\"tertiary\";\"no\";2143... \n",
+ "1 44;\"technician\";\"single\";\"secondary\";\"no\";29;\"... \n",
+ "2 33;\"entrepreneur\";\"married\";\"secondary\";\"no\";2... \n",
+ "3 47;\"blue-collar\";\"married\";\"unknown\";\"no\";1506... \n",
+ "4 33;\"unknown\";\"single\";\"unknown\";\"no\";1;\"no\";\"n... "
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 185
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CtGbim_EYxFh"
+ },
+ "source": [
+ "#### Input variables"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0pJ7ai5ZYxFh"
+ },
+ "source": [
+ "1 - age (numeric)\n",
+ "\n",
+ "2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')\n",
+ "\n",
+ "3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)\n",
+ "\n",
+ "4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')\n",
+ "\n",
+ "5 - default: has credit in default? (categorical: 'no','yes','unknown')\n",
+ "\n",
+ "6 - housing: has housing loan? (categorical: 'no','yes','unknown')\n",
+ "\n",
+ "7 - loan: has personal loan? (categorical: 'no','yes','unknown')\n",
+ "\n",
+ "8 - contact: contact communication type (categorical: 'cellular','telephone')\n",
+ "\n",
+ "9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')\n",
+ "\n",
+ "10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')\n",
+ "\n",
+ "11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.\n",
+ "\n",
+ "12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n",
+ "\n",
+ "13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)\n",
+ "\n",
+ "14 - previous: number of contacts performed before this campaign and for this client (numeric)\n",
+ "\n",
+ "15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')\n",
+ "\n",
+ "16 - emp.var.rate: employment variation rate - (numeric)\n",
+ "\n",
+ "17 - cons.price.idx: consumer price index - (numeric)\n",
+ "\n",
+ "18 - cons.conf.idx: consumer confidence index - (numeric) \n",
+ "\n",
+ "19 - euribor3m: euribor 3 month rate - (numeric)\n",
+ "\n",
+ "20 - nr.employed: number of employees - (numeric)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YwsaBV_OYxFi"
+ },
+ "source": [
+ "#### Predict variable (desired target):\n",
+ "\n",
+ "y - has the client subscribed to a term deposit? (binary: '1','0')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2SsNWV_SYxFj"
+ },
+ "source": [
+ "The education column of the dataset has many categories, and we need to reduce them for better modelling. The education column has the following categories:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6TFbgh3vYxFk",
+ "outputId": "ce01e46e-d11a-4192-a4c7-ea57039f185f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 555
+ }
+ },
+ "source": [
+ "df_bank['education'].unique()"
+ ],
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "error",
+ "ename": "KeyError",
+ "evalue": "ignored",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2890\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2891\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2892\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
+ "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
+ "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
+ "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
+ "\u001b[0;31mKeyError\u001b[0m: 'education'",
+ "\nThe above exception was the direct cause of the following exception:\n",
+ "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf_bank\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'education'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+ "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2900\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2901\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2902\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2903\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2904\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2891\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2892\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2893\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2894\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2895\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;31mKeyError\u001b[0m: 'education'"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "luv7Bdf_YxFn"
+ },
+ "source": [
+ "Let us group \"basic.4y\", \"basic.9y\" and \"basic.6y\" together and call them \"Basic\"."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "gkOlUOs2YxFn"
+ },
+ "source": [
+ "df_bank['education']=np.where(df_bank['education'] =='basic.9y', 'Basic', df_bank['education'])\n",
+ "df_bank['education']=np.where(df_bank['education'] =='basic.6y', 'Basic', df_bank['education'])\n",
+ "df_bank['education']=np.where(df_bank['education'] =='basic.4y', 'Basic', df_bank['education'])"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "H-X1WMv2YxFq"
+ },
+ "source": [
+ "After grouping, these are the categories:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "r9LlgpkjYxFq"
+ },
+ "source": [
+ "df_bank['education'].unique()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fcnJy3KYYxFt"
+ },
+ "source": [
+ "### Data exploration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "qUrTMR8BYxFt"
+ },
+ "source": [
+ "df_bank['y'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "rpzHnzJKYxFx"
+ },
+ "source": [
+ "sns.countplot(x='y',data=df_bank, palette='hls')\n",
+ "plt.savefig('count_plot')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "C99nOe3mYxF0"
+ },
+ "source": [
+ "There are 36548 no's and 4640 yes's in the outcome variable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8nGaox_kYxF1"
+ },
+ "source": [
+ "Let's get a sense of the numbers across the two classes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sQvzA60bYxF1"
+ },
+ "source": [
+ "df_bank.groupby('y').mean(numeric_only=True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "u3xjoceKYxF3"
+ },
+ "source": [
+ "Observations:\n",
+ "\n",
+ "- The average age of customers who bought the term deposit is higher than that of the customers who didn't.\n",
+ "- pdays (days since the customer was last contacted) is understandably lower for the customers who bought it. The lower the pdays, the better the memory of the last call and hence the better the chances of a sale.\n",
+ "- Surprisingly, campaign (number of contacts or calls made during the current campaign) is lower for customers who bought the term deposit."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jvzGMePPYxF4"
+ },
+ "source": [
+ "We can calculate means grouped by other categorical variables, such as job, marital status and education, to get a more detailed sense of our data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "RqLVMjoxYxF5"
+ },
+ "source": [
+ "df_bank.groupby('job').mean(numeric_only=True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "GTUeRJAtYxF7"
+ },
+ "source": [
+ "df_bank.groupby('marital').mean(numeric_only=True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xsxdFumiYxF9"
+ },
+ "source": [
+ "df_bank.groupby('education').mean(numeric_only=True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3i1DCWV-YxGA"
+ },
+ "source": [
+ "Visualizations"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "OEArHQPbYxGB"
+ },
+ "source": [
+ "%matplotlib inline\n",
+ "pd.crosstab(df_bank.job,df_bank.y).plot(kind='bar')\n",
+ "plt.title('Purchase Frequency for Job Title')\n",
+ "plt.xlabel('Job')\n",
+ "plt.ylabel('Frequency of Purchase')\n",
+ "plt.savefig('purchase_fre_job')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PNwo5du_YxGD"
+ },
+ "source": [
+ "The frequency of purchase of the deposit depends a great deal on the job title. Thus, the job title can be a good predictor of the outcome variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "eM7CWfAZYxGE"
+ },
+ "source": [
+ "table=pd.crosstab(df_bank.marital,df_bank.y)\n",
+ "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n",
+ "plt.title('Stacked Bar Chart of Marital Status vs Purchase')\n",
+ "plt.xlabel('Marital Status')\n",
+ "plt.ylabel('Proportion of Customers')\n",
+ "plt.savefig('marital_vs_pur_stack')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LWBLh7toYxGG"
+ },
+ "source": [
+ "It is hard to see, but marital status does not seem to be a strong predictor of the outcome variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "vh_u4QphYxGH"
+ },
+ "source": [
+ "table=pd.crosstab(df_bank.education,df_bank.y)\n",
+ "table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)\n",
+ "plt.title('Stacked Bar Chart of Education vs Purchase')\n",
+ "plt.xlabel('Education')\n",
+ "plt.ylabel('Proportion of Customers')\n",
+ "plt.savefig('edu_vs_pur_stack')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "d9AgJroYYxGK"
+ },
+ "source": [
+ "Education seems a good predictor of the outcome variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dHI2LT-IYxGL"
+ },
+ "source": [
+ "pd.crosstab(df_bank.day_of_week,df_bank.y).plot(kind='bar')\n",
+ "plt.title('Purchase Frequency for Day of Week')\n",
+ "plt.xlabel('Day of Week')\n",
+ "plt.ylabel('Frequency of Purchase')\n",
+ "plt.savefig('pur_dayofweek_bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3A2jmS4MYxGR"
+ },
+ "source": [
+ "Day of week may not be a good predictor of the outcome variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "bzafDBHpYxGS"
+ },
+ "source": [
+ "pd.crosstab(df_bank.month,df_bank.y).plot(kind='bar')\n",
+ "plt.title('Purchase Frequency for Month')\n",
+ "plt.xlabel('Month')\n",
+ "plt.ylabel('Frequency of Purchase')\n",
+ "plt.savefig('pur_fre_month_bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "x5CBtquEYxGW"
+ },
+ "source": [
+ "Month might be a good predictor of the outcome variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "tgF_3SqWYxGY"
+ },
+ "source": [
+ "df_bank.age.hist()\n",
+ "plt.title('Histogram of Age')\n",
+ "plt.xlabel('Age')\n",
+ "plt.ylabel('Frequency')\n",
+ "plt.savefig('hist_age')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "y0FhKYDsYxGc"
+ },
+ "source": [
+ "Most of the bank's customers in this dataset are in the 30-40 age range."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5Nd3yV7DYxGd"
+ },
+ "source": [
+ "pd.crosstab(df_bank.poutcome,df_bank.y).plot(kind='bar')\n",
+ "plt.title('Purchase Frequency for Poutcome')\n",
+ "plt.xlabel('Poutcome')\n",
+ "plt.ylabel('Frequency of Purchase')\n",
+ "plt.savefig('pur_fre_pout_bar')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oRKUAGrjYxGh"
+ },
+ "source": [
+ "Poutcome seems to be a good predictor of the outcome variable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "63RLRI9uYxGi"
+ },
+ "source": [
+ "### Create dummy variables"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "V8S4WUKmYxGj"
+ },
+ "source": [
+ "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n",
+ "for var in cat_vars:\n",
+ "    # One-hot encode the column and append the dummy columns, accumulating them in df_bank\n",
+ "    cat_list = pd.get_dummies(df_bank[var], prefix=var)\n",
+ "    df_bank1=df_bank.join(cat_list)\n",
+ "    df_bank=df_bank1"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "uX3w9i9WYxGl"
+ },
+ "source": [
+ "cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n",
+ "df_bank_vars=df_bank.columns.values.tolist()\n",
+ "to_keep=[i for i in df_bank_vars if i not in cat_vars]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cMX_82xaYxGq"
+ },
+ "source": [
+ "df_bank_final=df_bank[to_keep]\n",
+ "df_bank_final.columns.values"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LkTjpxYoYxGr"
+ },
+ "source": [
+ "df_bank_final_vars=df_bank_final.columns.values.tolist()\n",
+ "y=['y']\n",
+ "X=[i for i in df_bank_final_vars if i not in y]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2QbKaRcsYxGt"
+ },
+ "source": [
+ "### Feature Selection"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "EkxjW1AQYxGu"
+ },
+ "source": [
+ "from sklearn import datasets\n",
+ "from sklearn.feature_selection import RFE\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "logreg = LogisticRegression()\n",
+ "\n",
+ "rfe = RFE(logreg, n_features_to_select=18)\n",
+ "rfe = rfe.fit(df_bank_final[X], df_bank_final[y].values.ravel())\n",
+ "print(rfe.support_)\n",
+ "print(rfe.ranking_)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2P9hd4jHYxGw"
+ },
+ "source": [
+ "The Recursive Feature Elimination (RFE) has helped us select the following features: \"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5PW8WZX_YxGx"
+ },
+ "source": [
+ "cols=[\"previous\", \"euribor3m\", \"job_blue-collar\", \"job_retired\", \"job_services\", \"job_student\", \"default_no\", \n",
+ " \"month_aug\", \"month_dec\", \"month_jul\", \"month_nov\", \"month_oct\", \"month_sep\", \"day_of_week_fri\", \"day_of_week_wed\", \n",
+ " \"poutcome_failure\", \"poutcome_nonexistent\", \"poutcome_success\"] \n",
+ "X=df_bank_final[cols]\n",
+ "y=df_bank_final['y']"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ix0mN9qxYxG0"
+ },
+ "source": [
+ "### Implementing the model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Hbx2bwtiYxG0"
+ },
+ "source": [
+ "import statsmodels.api as sm\n",
+ "logit_model=sm.Logit(y,X)\n",
+ "result=logit_model.fit()\n",
+ "print(result.summary())"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HR1ui-UcYxG2"
+ },
+ "source": [
+ "The p-values for most of the variables are very small; therefore, most of them are significant to the model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9GHhrsaeYxG3"
+ },
+ "source": [
+ "### Logistic Regression Model Fitting"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "MFQnH5MzYxG3"
+ },
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n",
+ "from sklearn import metrics\n",
+ "logreg = LogisticRegression()\n",
+ "logreg.fit(X_train, y_train)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YUa3QL7tYxG6"
+ },
+ "source": [
+ "#### Predicting the test set results and calculating the accuracy"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SD-y2e33YxG6"
+ },
+ "source": [
+ "y_pred = logreg.predict(X_test)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kkPWzos7YxG-"
+ },
+ "source": [
+ "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kwC3rt_6YxHA"
+ },
+ "source": [
+ "### Cross Validation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Muw50oqSYxHB"
+ },
+ "source": [
+ "from sklearn import model_selection\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)\n",
+ "modelCV = LogisticRegression()\n",
+ "scoring = 'accuracy'\n",
+ "results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)\n",
+ "print(\"10-fold cross validation average accuracy: %.3f\" % (results.mean()))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4y8XCTqoYxHE"
+ },
+ "source": [
+ "### Confusion Matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BCza9NkVYxHE"
+ },
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "cm = confusion_matrix(y_test, y_pred)  # separate name, so the imported function is not shadowed\n",
+ "print(cm)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "X9SapwS2YxHG"
+ },
+ "source": [
+ "The result is telling us that we have 10872+254 correct predictions and 1122+109 incorrect predictions."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6bEWvWScYxHG"
+ },
+ "source": [
+ "#### Accuracy"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NaH2nESwYxHH"
+ },
+ "source": [
+ "print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "C6oxlhbpYxHJ"
+ },
+ "source": [
+ "#### Compute precision, recall, F-measure and support\n",
+ "\n",
+ "The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.\n",
+ "\n",
+ "The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.\n",
+ "\n",
+ "The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.\n",
+ "\n",
+ "The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.\n",
+ "\n",
+ "The support is the number of occurrences of each class in y_test."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mhN5_p4yYxHK"
+ },
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "print(classification_report(y_test, y_pred))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xzSFVEnAYxHP"
+ },
+ "source": [
+ "#### Interpretation: \n",
+ "\n",
+ "Of the term deposits promoted in the test set, 88% were ones the customers liked (precision). Of the term deposits the customers preferred, 90% were promoted (recall)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NGXJ6g2nYxHQ"
+ },
+ "source": [
+ "### ROC Curve\n",
+ "\n",
+ "    from sklearn import metrics\n",
+ "    from ggplot import *\n",
+ "    prob = logreg.predict_proba(X_test)[:,1]\n",
+ "    fpr, sensitivity, _ = metrics.roc_curve(y_test, prob)\n",
+ "    df = pd.DataFrame(dict(fpr=fpr, sensitivity=sensitivity))\n",
+ "    ggplot(df, aes(x='fpr', y='sensitivity')) +\\\n",
+ "        geom_line() +\\\n",
+ "        geom_abline(linetype='dashed')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "u9QKDuS0YxHQ"
+ },
+ "source": [
+ "from sklearn.metrics import roc_auc_score\n",
+ "from sklearn.metrics import roc_curve\n",
+ "logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])\n",
+ "fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])\n",
+ "plt.figure()\n",
+ "plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)\n",
+ "plt.plot([0, 1], [0, 1],'r--')\n",
+ "plt.xlim([0.0, 1.0])\n",
+ "plt.ylim([0.0, 1.05])\n",
+ "plt.xlabel('False Positive Rate')\n",
+ "plt.ylabel('True Positive Rate')\n",
+ "plt.title('Receiver operating characteristic')\n",
+ "plt.legend(loc=\"lower right\")\n",
+ "plt.savefig('Log_ROC')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From e7cfb05bc262d20ebdbec1f3cf5aee7c0a4a464f Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Thu, 5 Nov 2020 16:03:09 -0300
Subject: [PATCH 20/21] Criado usando o Colaboratory
---
.../NB15__ML_AutoML_pycaret_testes.ipynb | 374 ++++++++++++++++++
1 file changed, 374 insertions(+)
create mode 100644 Notebooks/NB15__ML_AutoML_pycaret_testes.ipynb
diff --git a/Notebooks/NB15__ML_AutoML_pycaret_testes.ipynb b/Notebooks/NB15__ML_AutoML_pycaret_testes.ipynb
new file mode 100644
index 000000000..45178a675
--- /dev/null
+++ b/Notebooks/NB15__ML_AutoML_pycaret_testes.ipynb
@@ -0,0 +1,374 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Untitled8.ipynb",
+ "provenance": [],
+ "private_outputs": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FfhCoyP98gDt"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "226lzu3i8kRp"
+ },
+ "source": [
+ "url = 'https://raw.githubusercontent.com/MathMachado/DataFrames/master/Titanic_Original.csv'\n",
+ "df_titanic = pd.read_csv(url)\n",
+ "df_titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6mL0RI0V9JmP"
+ },
+ "source": [
+ "!pip install pycaret"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WL9nShOd86Fu"
+ },
+ "source": [
+ "from pycaret.classification import *"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YRtIVR7LC9nl"
+ },
+ "source": [
+ "https://www.kaggle.com/frtgnn/pycaret-introduction-classification-regression"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3-dLwhmi9jTA"
+ },
+ "source": [
+ "### Set up"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jooY5VUr9sqd"
+ },
+ "source": [
+ "# Normalize the column names (lowercase):\n",
+ "df_titanic.columns = [colunas.lower() for colunas in df_titanic.columns]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "erqtZNz9yZ2T"
+ },
+ "source": [
+ "df_titanic.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bf_IntG2ygtP"
+ },
+ "source": [
+ "### Handling the fare feature/variable"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "buRgX2rucrHT"
+ },
+ "source": [
+ "#fare_bins = ['Muito Baixo', 'Baixo', 'Medio', 'Alto', 'Muito Alto']\n",
+ "fare_bins = ['Baixo', 'Medio', 'Alto']\n",
+ "\n",
+ "df_titanic2 = df_titanic.copy()\n",
+ "\n",
+ "# Required transformations\n",
+ "\n",
+ "#df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = [0, .2, .4, .6, .8, 1], labels = fare_bins)\n",
+ "#df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = 5, labels = fare_bins)\n",
+ "df_titanic2['fare_bins'] = pd.qcut(df_titanic2['fare'], q = 3, labels = fare_bins)\n",
+ "\n",
+ "#df_titanic2.drop(columns = [], axis = 1, inplace = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CZ87ERxzjluM"
+ },
+ "source": [
+ "#pd.qcut(df_titanic2['fare'], q = [0, .2, .4, .6, .8, 1], labels = fare_bins)\n",
+ "pd.qcut(df_titanic2['fare'], q = 3, labels = fare_bins)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6AWpUmbE7p39"
+ },
+ "source": [
+ "df_titanic2['fare_bins'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JyoQGTrqlWzp"
+ },
+ "source": [
+ "df_titanic2['sex'].value_counts()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ZKgXAUMcvr63"
+ },
+ "source": [
+ "df_titanic2.drop(columns = ['name', 'ticket'], axis = 1, inplace = True)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "FJwl0RrZvfDn"
+ },
+ "source": [
+ "for colunas in df_titanic2.columns:\n",
+ " print ( colunas, df_titanic2[colunas].value_counts())"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "W0RtT6Xr9IVL"
+ },
+ "source": [
+ "clf = setup(data = df_titanic2,\n",
+ "            target = 'survived', \n",
+ "            numeric_imputation = 'mean', # handles missing values\n",
+ "            categorical_features = ['sex', 'embarked'], # list of categorical variables\n",
+ "            ignore_features = ['cabin', 'passengerid'], # variables to ignore ('name' and 'ticket' were already dropped above)\n",
+ "            session_id = 20111974, # seed, for reproducibility\n",
+ "            silent = False)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "mGGubn7k-GNi"
+ },
+ "source": [
+ "compare_models()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4YdMCHT92Tij"
+ },
+ "source": [
+ "| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |\n|---|---|---|---|---|---|---|---|---|---|\n",
+ "| catboost | CatBoost Classifier | 0.8218 | 0.8634 | 0.7857 | 0.8275 | 0.8154 | 0.5996 | 0.6150 | 1.026 |\n",
+ "| gbc | Gradient Boosting Classifier | 0.8187 | 0.8540 | 0.7867 | 0.8231 | 0.8129 | 0.5959 | 0.6084 | 0.111 |\n",
+ "| lightgbm | Light Gradient Boosting Machine | 0.8186 | 0.8683 | 0.7937 | 0.8195 | 0.8149 | 0.6009 | 0.6073 | 0.052 |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8rcs_jJFCjRW"
+ },
+ "source": [
+ "lgbm = create_model('lightgbm') "
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5BdvRHPdCq0E"
+ },
+ "source": [
+ "tuned_lightgbm = tune_model(lgbm)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "WwCW_pDYI1hy"
+ },
+ "source": [
+ "plot_model(estimator = tuned_lightgbm, plot = 'learning')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "LES2FO1zI4X8"
+ },
+ "source": [
+ "plot_model(estimator = tuned_lightgbm, plot = 'auc')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "xxGQX3jbI4bN"
+ },
+ "source": [
+ "plot_model(estimator = tuned_lightgbm, plot = 'confusion_matrix')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1O_9qDHgJJjw"
+ },
+ "source": [
+ "plot_model(estimator = tuned_lightgbm, plot = 'feature')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "W1xnpqD-46vh"
+ },
+ "source": [
+ "### Dashboard with all outputs"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PluFZQ8bI4hV"
+ },
+ "source": [
+ "evaluate_model(tuned_lightgbm)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JaffgUyy4bwz"
+ },
+ "source": [
+ "!pip install shap"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "Uez4Gik8JwET"
+ },
+ "source": [
+ "interpret_model(tuned_lightgbm)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "9U2SnEKA41nW"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "dfiBqgkXvdHJ"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
From e8d1d8a322f99b4f7b540aa01f4bd6e4b8c4f58e Mon Sep 17 00:00:00 2001
From: Celso-Omoto <72219725+Celso-Omoto@users.noreply.github.com>
Date: Thu, 19 Nov 2020 15:59:32 -0300
Subject: [PATCH 21/21] Criado usando o Colaboratory
---
...ndas__Resposta_Exercicios_Aluno_Fifa.ipynb | 58 ++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)
diff --git a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
index cb7b4c537..32df5a3e4 100644
--- a/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
+++ b/Notebooks/NB10_01__Pandas__Resposta_Exercicios_Aluno_Fifa.ipynb
@@ -4,8 +4,8 @@
"metadata": {
"colab": {
"name": "NB10_01__Pandas.ipynb",
- "provenance": [],
"private_outputs": true,
+ "provenance": [],
"include_colab_link": true
},
"kernelspec": {
@@ -1713,6 +1713,62 @@
"execution_count": null,
"outputs": []
},
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "P13kEu99zYBh"
+ },
+ "source": [
+ "df_string = df['LS'].str.split(r'\\+', n = 4, expand = True) # n is the maximum number of splits in the output.\n",
+ "#df_string.head()\n",
+ "df_string[0] = pd.to_numeric(df_string[0])\n",
+ "df_string[1] = pd.to_numeric(df_string[1])\n",
+ "df_string['LS2'] = df_string[0]+df_string[1]\n",
+ "df_string.head()\n",
+ "df_string.drop(columns= [0, 1], axis = 1, inplace = True)\n",
+ "df = pd.merge(df, df_string, how = 'left', on = 'ID')\n",
+ "df.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "ahXBrk1oyFcq"
+ },
+ "source": [
+ "# The original attempt failed because df is reassigned inside the function, which makes it local (UnboundLocalError); declaring it global fixes that.\n",
+ "def trata_sinal_str(s_nome_coluna,s_nome_coluna2):\n",
+ "    global df\n",
+ "    print(s_nome_coluna)\n",
+ "    df_string = df[s_nome_coluna].str.split(r'\\+', n = 4, expand = True) # n is the maximum number of splits in the output.\n",
+ "    df_string[0] = pd.to_numeric(df_string[0])\n",
+ "    df_string[1] = pd.to_numeric(df_string[1])\n",
+ "    df_string[s_nome_coluna2] = df_string[0]+df_string[1]\n",
+ "    df_string.drop(columns= [0, 1], axis = 1, inplace = True)\n",
+ "    df = pd.merge(df, df_string, how = 'left', on = 'ID')\n",
+ "    df.head()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6BE-bIcqyRR6"
+ },
+ "source": [
+ "#l_lista = ['LS','ST','RS','LW','LF','CF','RF','RW','LAM','CAM','RAM','LM','LCM','CM','RCM','RM','LWB','LDM','CDM','RDM','RWB','LB','LCB','CB','RCB','RB']\n",
+ "#l_lista = ['LS','ST','LW','LF','CF','RF','RW','LAM','CAM','RAM','LM','LCM','CM','RCM','RM','LWB','LDM','CDM','RDM','RWB','LB','LCB','CB','RCB','RB']\n",
+ "l_lista = ['LS','ST']\n",
+ "for nome_coluna in l_lista:\n",
+ "    # Add a '<column>2' column holding the sum of the two numbers split on '+'\n",
+ "    trata_sinal_str(nome_coluna,nome_coluna+'2')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
{
"cell_type": "markdown",
"metadata": {