Skip to content

aminraz/SQL_Basics_for_Data_Science

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 

Repository files navigation

Table of Contents

  1. Sample Database
  2. General Commands
  3. WEEK #1
    1. keywords
    2. installation
    3. commands
    4. articles
  4. WEEK #2
    1. Introduction
    2. Basics of Filtering with SQL
    3. Advanced Filtering: IN, OR, and NOT
    4. Using Wildcards in SQL
    5. ORDER BY
    6. Math Operations
    7. Aggregate Functions
    8. Grouping Data with SQL
  5. WEEK #3
    1. Using Subqueries
    2. Subquery Best Practices and Considerations
    3. Joins
    4. JOIN vs Subquery
  6. WEEK #4
    1. Working with Text Strings
    2. Date and Time Strings
    3. Case Statements
    4. Create View
    5. Data Governance and Profiling
    6. links

This file includes notes and summary of the course SQL for Data Science from University of California, Davis available in Coursera as well as additional information that I find useful to learn but it's missing in the course. You can use it as a supplementary material if you're taking the course or as an standalone summary of basics of SQL for data science.

You are encouraged to download the README.org and open it inside Emacs if you're familiar with org mode. It has the same content.

Sample Database

Download the sample database called chinook from this link.

General Commands

Some general commands you can use to inspect your sample database:

  • open the database

    .open chinook.db
    
  • inspect the tables

    .tables
    
  • .help: show help for dot commands

  • .database: show the connected databases in the session

  • attach a database

    attach database "address.db" as db_name;
    
  • .exit: exit the sqlite

  • showing the structure of a table

    .schema tab_name
    
  • .indexes: show the indexes of the current database or table

  • write query result into a file

    .output filename
    	select * from tab_name;
    

WEEK #1

keywords

  • data modeling
  • NoSQL
  • primary keys, foreign keys
  • ER diagrams: Chen notation, Crow's foot notation, UML class diagram notation

installation

  • for Linux: install sqlite3 from terminal

commands

select

simple select query

select col1, col2, col3 from my_table limit number_of_lines;

limit the number of rows

select col1, col2, col3 from my_table where col4 = 'value';

sum of a column

select sum(col1) from my_table;

create table

create table my_table
       (col1 char(20) null,
       col2 char(20) not null,
       col3 char(20) primary key);

insert into table

	/* It's a good practice to first
	identify columns that the values
	 are going to get there */
insert into my_table
       (col1, col2, )
       values
       (value1, value2,);		

create temporary table

	-- create a subset from a table
create temporary table my_tmp_table as
       (
       select *
       from my_table
       where col1 = 'value')

adding comments by – and **

articles

WEEK #2

Filtering, sorting, and calculating data with SQL

Introduction

Clauses and operators

  • WHERE
  • BETWEEN AND
  • IN
  • OR
  • NOT
  • LIKE
  • ORDER BY
  • GROUP BY

Wildcards

Math operators

  • AVERAGE
  • COUNT
  • MAX
  • MIN

Basics of Filtering with SQL

where clause operator

SELECT col1, col2 FROM my_table
       WHERE col operator value;

operator can be

  • =
  • <> (not equal !=)
  • <
  • =

  • <=
  • BETWEEN AND
  • IS NULL

BETWEEN AND

select city from customers where customerid between 4 and 10;	

Advanced Filtering: IN, OR, and NOT

IN

  • use parentheses for more than one condition with comma between them

    select country from customers where city in ('Paris', 'Rome', 'Oslo');
    

OR

WHERE X or Y output
  true or false X
  true or true X
  false or false nil
  false or true Y

IN or OR

  • IN is faster
  • OR: order is important
  • IN: order is not important
  • IN: making sub-queries

OR AND

WHERE X OR Y AND Z output
  T   F   F X

OR is executed before AND. By using parentheses we can force to check AND condition:

WHERE (X OR Y) AND Z output
  T   F   F nil

Using Wildcards in SQL

LIKE operator(predicate)

Used for only string data not numerical data.

  • %: '%string', 'st%g', 'st%' it does not match NULL value

  • _: is not supported in DB2 but most other system support it.

    WHERE val LIKE '_string'
    
  • []: not supported in SQLite

  • Wildcards are slower than filtering operators

ORDER BY

  • It should be the last clause in an statement

  • Different columns can be used for sorting

  • Column position can be used

    ORDER BY 1,5
    
  • DESC: descending

  • ASC: ascending

  • DESC and ASC should be repeated for each column

Math Operations

Creating new column in the output from math operations on other columns:

SELECT col1,col2, col1*col2 AS new_col
       FROM my_table;

newcol is as alias for the new column

Aggregate Functions

  • AVG()

  • COUNT()

  • MIN()

  • MAX()

  • SUM()

    SELECT AVG(col1) AS col1_avg FROM my_table; -- null rows are ignored

    SELECT count(*) AS total_rows FROM my_table; -- it counts null rows -- if you count a specific column, it ignores null rows

AS is not mandatory.

DISTINCT

It recognizes duplicates in a column

SELECT count(DISTINCT col1)
       FROM my_table

Grouping Data with SQL

GROUP BY

In the following example we are counting the number of cities each country has in our table customers:

select country, count(city) from customers group by country ;	

HAVING

It filters the result of group by like where. In the following example, the result is filtered to those countries with customers having a special email patter:

select country, count(city) from customers group by country having  email like '%com';	

We can use multiple columns for group by. If there are more columns in the group by, these groups need to be in the select part too. NULL is categorized separately.

WHERE is used before grouping and HAVING is used after it.

WEEK #3

Using Subqueries

They are used to create queries inside queries.

select country, company from customers
where customerid in
(select customerid from invoices where billingstate like 'a%');

Subquery Best Practices and Considerations

Format your code using poorsql.com

Joins

selecting from different tables

SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers ON orders.customer_id = customers.customer_id;

The dot (.) in orders.orderid and customers.customername is used to specify the table from which each column should be retrieved. This notation is often referred to as "table.column" notation, and it's necessary when you are selecting columns from multiple tables that have columns with the same name.

INNER join

SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers ON orders.customer_id = customers.customer_id;
  1. `SELECT orders.orderid, customers.customername`: This part of the SQL statement specifies the columns you want to retrieve in the result set. It's saying that you want to retrieve two columns:

    • `orders.orderid`: This is the `orderid` column from the `orders` table.
    • `customers.customername`: This is the `customername` column from the `customers` table.
  2. `FROM orders`: This part of the statement specifies the source table from which you want to retrieve data. In this case, it's the `orders` table.

  3. `INNER JOIN customers ON orders.customerid = customers.customerid`: This is where the actual join operation occurs. Let's break it down further:

    • `INNER JOIN`: This specifies that you want to perform an inner join between the `orders` table and the `customers` table. An inner join returns only the rows where there is a match in both tables.
    • `customers` is the name of the table you're joining with.
    • `ON orders.customerid = customers.customerid`: This part of the statement specifies the join condition. It tells the database how to match rows between the two tables. Specifically, it's saying that you want to join rows where the `customerid` column in the `orders` table is equal to the `customerid` column in the `customers` table. This condition establishes the relationship between the two tables based on the `customerid` column.

So, when you execute this SQL statement, SQLite will retrieve data from both the `orders` and `customers` tables and combine it into a single result set. The result will include pairs of `orderid` and `customername` where the `customerid` values in the `orders` and `customers` tables match.

For example, if you have the following data:

orders table:

orderid customerid
1 101
2 102
3 103

customers table:

customerid customername
101 Alice
102 Bob
104 Carol

The result of the SQL query will be:

orderid customername
1 Alice
2 Bob

As you can see, only the rows with matching `customerid` values (1 and 2) are included in the result, and it combines the relevant data from both tables.

LEFT JOIN (or LEFT OUTER JOIN)

  • A LEFT JOIN returns all rows from the left table and the matched rows

from the right table.

  • If there's no match in the right table, NULL

values are filled in for the columns from the right table.

Example:

SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;

RIGHT JOIN (or RIGHT OUTER JOIN)

  • A RIGHT JOIN is similar to a LEFT JOIN, but it returns all rows

from the right table and the matched rows from the left table.

  • If there's no match in the left table, NULL values are filled in for

the columns from the left table.

Example:

SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders ON customers.customer_id = orders.customer_id;

LEFT or RIGHT

In SQL joins, particularly when discussing LEFT JOIN and RIGHT JOIN, the terms "left" and "right" refer to the order of the tables being joined and the direction of the inclusion of rows in the result set. These terms help describe which table's data is preserved fully and which table's data may have missing values (NULLs) in the result set. Let's clarify the differences:

  1. LEFT JOIN (or LEFT OUTER JOIN):

    • The "left" table is the table listed on the left side of the JOIN clause.
    • All rows from the left table are included in the result set, regardless of whether there is a match in the right table.
    • If there is a match in the right table, the corresponding data is included in the result set.
    • If there is no match in the right table, NULL values are used for the columns from the right table.
  2. RIGHT JOIN (or RIGHT OUTER JOIN):

    • The "right" table is the table listed on the right side of the JOIN clause.
    • All rows from the right table are included in the result set, regardless of whether there is a match in the left table.
    • If there is a match in the left table, the corresponding data is included in the result set.
    • If there is no match in the left table, NULL values are used for the columns from the left table.

In summary, the primary difference between LEFT JOIN and RIGHT JOIN is the direction in which they include rows from the tables involved in the join. LEFT JOIN keeps all rows from the left table, while RIGHT JOIN keeps all rows from the right table. The choice of whether to use LEFT JOIN or RIGHT JOIN depends on your specific data and query requirements. Most often, LEFT JOIN is used because it retains all records from the "left" or main table and includes matching data from the "right" or related table when available.

CROSS join

A CROSS JOIN, also known as a Cartesian join, is a type of join operation in SQL where every row from one table is combined with every row from another table, resulting in a Cartesian product of the two tables. In other words, it creates all possible combinations of rows from the two tables, without any specific condition or criteria for matching rows. As a result, the size of the result set can grow rapidly, especially if the tables being joined are large.

Here's the basic syntax for a CROSS JOIN:

SELECT *
FROM table1
CROSS JOIN table2;

In this syntax:

  • `table1` and `table2` are the names of the two tables you want to combine.
  • `*` is used to select all columns from both tables in the Cartesian product.

Here's an example to illustrate a CROSS JOIN:

Assume you have two tables:

table1 (colors):

color
Red
Green
Blue

table2 (sizes):

size
Small
Medium
Large

If you perform a CROSS JOIN between these two tables:

SELECT *
FROM colors
CROSS JOIN sizes;

The result will be the Cartesian product of all color and size combinations:

color size
Red Small
Red Medium
Red Large
Green Small
Green Medium
Green Large
Blue Small
Blue Medium
Blue Large

As you can see, every color is combined with every size, resulting in a total of 9 rows in the result set.

CROSS JOINs are relatively rare in practice because they tend to generate large result sets and can be computationally expensive. They are typically used in specific scenarios where you need to generate all possible combinations, such as when creating test data or performing certain types of mathematical calculations. In most cases, other types of joins (e.g., INNER JOIN, LEFT JOIN) with appropriate conditions are more commonly used to retrieve meaningful data from related tables.

FULL UTER JOIN

Here's a brief overview of the key characteristics of a FULL OUTER JOIN:

Returns All Rows: Unlike INNER JOIN, which only returns matching rows, a FULL OUTER JOIN returns all rows from both the left and right tables.

NULL Values: When there is no match for a row in one of the tables, the columns from that table will contain NULL values in the result set.

Combines Data: A FULL OUTER JOIN combines data from two tables based on a specified join condition, creating a unified result set.

Useful for Comparisons: FULL OUTER JOINs are often used when you need to compare data from two tables comprehensively, such as identifying records that exist in one table but not in the other or finding matching records while also including non-matching ones.

Here's the basic syntax for a FULL OUTER JOIN:

SELECT *
FROM table1
FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;

multiple columns

In this scenario, we have three tables: `orders`, `customers`, and `products`. We want to retrieve a list of orders that includes the customer name and the product name for each order.

Assuming the following table structures:

orders:

orderid customerid productid
1 101 201
2 102 202
3 103 201
4 104 203

customers:

customerid customername
101 Alice
102 Bob
103 Carol
104 David

products:

productid productname
201 Laptop
202 Smartphone
203 Tablet

We can use an INNER JOIN to combine these tables to get the desired result:

SELECT orders.order_id, customers.customer_name, products.product_name
FROM (orders
INNER JOIN customers ON orders.customer_id = customers.customer_id)
INNER JOIN products ON orders.product_id = products.product_id;

In this query:

  1. The first INNER JOIN combines the `orders` and `customers` tables based on the `customerid` column, linking each order to its respective customer.

  2. The second INNER JOIN combines the result of the first join (which includes order and customer information) with the `products` table based on the `productid` column, linking each order to its corresponding product.

The result of this query will be a list of orders with the customer name and the product name for each order:

orderid customername productname
1 Alice Laptop
2 Bob Smartphone
3 Carol Laptop
4 David Tablet

This query demonstrates how you can use INNER JOINs to combine data from three tables to create a comprehensive result set that includes information from all three tables based on specified join conditions.

using alias

SELECT c.customer_name, o.order_date
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York';

SELECT o.order_id, c.customer_name
FROM orders AS o
INNER JOIN customers AS c ON o.customer_id = c.customer_id;

self-join

A self-join is a specific type of join operation in SQL where a table is joined with itself. In other words, it's a way to combine rows from the same table based on a related column. Self-joins are commonly used when you have hierarchical or recursive data stored within a single table, and you need to retrieve relationships or data from different rows within the same table.

Here's how a self-join works:

  1. Table with Self-Reference: In a self-join scenario, the table must contain a column or columns that establish a relationship between rows within the same table. This typically involves having a foreign key column that references the primary key column of the same table. This creates a self-referencing relationship.

  2. Join Condition: You specify a join condition that defines how the rows in the table should be matched with other rows in the same table. This condition typically involves comparing the values in the self-referencing columns.

  3. Result Set: The result of the self-join is a new table that combines data from the original table based on the specified join condition.

Here's a simple example to illustrate a self-join. Suppose you have an "employees" table with the following structure:

employeeid employeename managerid
1 Alice NULL
2 Bob 1
3 Carol 1
4 David 2
5 Eve 2

In this table:

  • `employeeid` is the primary key.
  • `employeename` stores the names of employees.
  • `managerid` is a foreign key that references the `employeeid` of the employee's manager. It establishes a self-referencing relationship.

You can use a self-join to retrieve the names of employees along with the names of their managers:

SELECT e.employee_name AS employee, m.employee_name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;

In this query:

  • `e` is an alias for the "employees" table, representing employees.
  • `m` is an alias for the same "employees" table, representing managers.
  • The self-join condition `e.managerid = m.employeeid` connects each employee with their respective manager.

The result of the self-join will be:

employee manager
Alice NULL
Bob Alice
Carol Alice
David Bob
Eve Bob

As you can see, the self-join allows you to retrieve the names of employees along with the names of their respective managers from the same table, creating a hierarchical relationship. Self-joins are particularly useful in scenarios involving organizational hierarchies, bill-of-materials structures, and other situations where data has a parent-child or recursive relationship within a single table.

UNIONs

In SQL, the UNION operator is used to combine the result sets of two or more SELECT statements into a single result set. It allows you to merge rows from different tables or even from the same table based on common columns or data types. The UNION operator removes duplicate rows by default, but you can also use UNION ALL if you want to include duplicate rows in the result.

  • Column Compatibility: The SELECT statements within a UNION must have the same number of columns, and the corresponding columns in each SELECT statement must have compatible data types. The column names from the first SELECT statement are used as the column names in the result.

  • Duplicate Removal: By default, UNION removes duplicate rows from the combined result set. If you want to keep duplicate rows, you can use UNION ALL instead.

    SELECT column1, column2, ... FROM table1 UNION SELECT column1, column2, ... FROM table2;

JOIN vs Subquery

In JOIN you can call different columns from different tables but in subquery, the table that is called inside the subquery is not accessible outside the subquery.

WEEK #4

Working with Text Strings

  • Concatenating

    select col1 || '(' || col2 || ')' from tab1;
    
  • Trimming

    select trim('      you the best     ') as trim_str;
    

Removes trailing spaces from both sides. There is RTRIM and LTRIM also.

  • Substring

    select substr(col1, num1, num2) from tab1;
    

num1 is the beginning of the trimming and num2 is the length of the trim. num1 starts with 1.

  • Case

    select upper(col1) from tab1;
    select lower(col1) from tab1;
    select ucase(col1) from tab1;
    

Date and Time Strings

strfmt

The `strftime` function in SQLite is used to format date and time values as strings. It allows you to specify a format string with various modifiers to control how the date and time should be displayed. The basic syntax of the `strftime` function is as follows:

strftime(format, timestamp)
  • `format`: This is a required argument and specifies the format you want the date and time to be displayed in. It is a string containing format codes and optional text.

  • `timestamp`: This is the timestamp or date and time value that you want to format.

Here are some common format codes and modifiers you can use with the `strftime` function in SQLite:

Date Format Codes:

  • `%Y`: Four-digit year (e.g., 2023).
  • `%m`: Month as a zero-padded decimal number (01-12).
  • `%d`: Day of the month as a zero-padded decimal number (01-31).
  • `%W`: Week number of the year as a decimal number (00-53).

Time Format Codes:

  • `%H`: Hour (00-23).
  • `%M`: Minute (00-59).
  • `%S`: Second (00-59).

Other Format Codes:

  • `%a`: Abbreviated weekday name (Sun, Mon, Tue, etc.).
  • `%A`: Full weekday name (Sunday, Monday, Tuesday, etc.).
  • `%b`: Abbreviated month name (Jan, Feb, Mar, etc.).
  • `%B`: Full month name (January, February, March, etc.).
  • `%c`: Preferred date and time representation (usually the same as `%a %b %d %H:%M:%S %Y`).
  • `%p`: AM or PM designation (AM/PM).

Modifiers:

  • `-`: A hyphen `-` before a modifier (e.g., `%-d`) removes leading zeros.
  • `_`: An underscore `_` before a modifier (e.g., `_%d`) pads with spaces instead of zeros.
  • `0`: A zero `0` before a modifier (e.g., `%0d`) pads with zeros (default behavior).

Examples of using the `strftime` function with modifiers and format codes:

-- Format a timestamp as "YYYY-MM-DD HH:MM:SS"
SELECT strftime('%Y-%m-%d %H:%M:%S', '2023-09-06 15:30:00');

-- Format a timestamp without leading zeros in the day and month
SELECT strftime('%Y-%-m-%-d %H:%M:%S', '2023-09-06 15:30:00');

-- Format a timestamp with abbreviated month and day names
SELECT strftime('%A, %B %d, %Y', '2023-09-06 15:30:00');

-- Format a timestamp as "hh:MM AM/PM"
SELECT strftime('%I:%M %p', '2023-09-06 15:30:00');

```

Now

To retrieve the current date:

select date('now');

Case Statements

In SQLite, the CASE statement is a conditional expression that allows you to perform conditional logic within SQL queries. It's useful for making decisions and returning different values or performing different actions based on specific conditions. The CASE statement can be used in both SELECT statements and UPDATE statements.

The basic syntax of the CASE statement in SQLite is as follows:

CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    ...
    ELSE else_result
END

Example:

SELECT
    name,
    CASE
	WHEN age >= 18 THEN 'Adult'
	ELSE 'Minor'
    END AS age_group
FROM people;

In this example, we use a CASE statement to categorize people as either 'Adult' or 'Minor' based on their age.

Create View

In SQLite, a view is a virtual table created from the result of a SELECT query. Unlike physical tables, views do not store data themselves; instead, they provide a way to access and manipulate data from one or more underlying tables or other views. Views are often used to simplify complex queries, encapsulate logic, and provide a consistent interface to users or applications.

To create a view in SQLite, you use the CREATE VIEW statement. Here's the basic syntax:

CREATE VIEW view_name AS
SELECT columns
FROM tables
WHERE conditions;

Once a view is created, you can query it just like a regular table using the SELECT statement:

SELECT * FROM top_customers;

drop the view

drop view my_view;

ETL

ETL stands for Extract, Transform, Load, and it is a concept related to data integration and database management, often associated with SQL databases. ETL is a process used to collect, clean, and move data from various sources into a destination database or data warehouse for analysis, reporting, or other purposes. Each phase of the ETL process serves a specific purpose.

Data Governance and Profiling

Data profiling in SQL is a process used to analyze and understand the characteristics, structure, and quality of a dataset or a database. It involves examining the data to gain insights into its content, distribution, completeness, accuracy, and potential issues. Data profiling is an essential step in data analysis, data cleansing, and data integration processes. Here are the key concepts related to data profiling in SQL:

  • Data Exploration: Data profiling begins with exploring the data to get a high-level understanding of its contents. Analysts or data professionals use SQL queries to retrieve sample data, count records, and examine the data types of columns.

  • Column Profiling: Data profiling often involves profiling individual columns within a dataset or database table. This includes: Determining data types (e.g., integer, string, date). Calculating statistics such as minimum, maximum, mean, median, and standard deviation. Identifying unique values and their frequencies. Detecting missing values and calculating the percentage of missing data. Checking for data patterns, formats, and distributions.

links

SQL puzzles

SQL Authority: SQL Puzzles

Recommended by many recruiters to practice SQL for a data science interview.

SQLZOO

About

Notes and resources for my journey of learning SQL from an online course.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published