This file includes notes and summary of the course SQL for Data Science from University of California, Davis available in Coursera as well as additional information that I find useful to learn but it's missing in the course. You can use it as a supplementary material if you're taking the course or as an standalone summary of basics of SQL for data science.
You are encouraged to download the README.org and open it inside Emacs if you're familiar with org mode. It has the same content.
Download the sample database called chinook from this link.
Some general commands you can use to inspect your sample database:
-
open the database
.open chinook.db -
inspect the tables
.tables -
.help: show help for dot commands
-
.database: show the connected databases in the session
-
attach a database
attach database "address.db" as db_name; -
.exit: exit the sqlite
-
showing the structure of a table
.schema tab_name -
.indexes: show the indexes of the current database or table
-
write query result into a file
.output filename select * from tab_name;
- data modeling
- NoSQL
- primary keys, foreign keys
- ER diagrams: Chen notation, Crow's foot notation, UML class diagram notation
- for Linux: install sqlite3 from terminal
simple select query
select col1, col2, col3 from my_table limit number_of_lines;
limit the number of rows
select col1, col2, col3 from my_table where col4 = 'value';
sum of a column
select sum(col1) from my_table;
create table my_table
(col1 char(20) null,
col2 char(20) not null,
col3 char(20) primary key);
/* It's a good practice to first
identify columns that the values
are going to get there */
insert into my_table
(col1, col2, )
values
(value1, value2,);
-- create a subset from a table
create temporary table my_tmp_table as
(
select *
from my_table
where col1 = 'value')
- AWS and SQL
- Sqlite toturials
- SQL vs. NoSQL – Know the Difference
- NoSQL keeps rising, but relational databases still dominate big data
Filtering, sorting, and calculating data with SQL
- WHERE
- BETWEEN AND
- IN
- OR
- NOT
- LIKE
- ORDER BY
- GROUP BY
- AVERAGE
- COUNT
- MAX
- MIN
SELECT col1, col2 FROM my_table
WHERE col operator value;
operator can be
- =
- <> (not equal !=)
- <
-
-
=
- <=
- BETWEEN AND
- IS NULL
select city from customers where customerid between 4 and 10;
-
use parentheses for more than one condition with comma between them
select country from customers where city in ('Paris', 'Rome', 'Oslo');
| WHERE | X | or | Y | output |
|---|---|---|---|---|
| Â | true | or | false | X |
| Â | true | or | true | X |
| Â | false | or | false | nil |
| Â | false | or | true | Y |
- IN is faster
- OR: order is important
- IN: order is not important
- IN: making sub-queries
| WHERE | X | OR | Y | AND | Z | output |
|---|---|---|---|---|---|---|
| Â | T | Â | F | Â | F | X |
OR is executed before AND. By using parentheses we can force to check AND condition:
| WHERE | (X | OR | Y) | AND | Z | output |
|---|---|---|---|---|---|---|
| Â | T | Â | F | Â | F | nil |
Used for only string data not numerical data.
-
%: '%string', 'st%g', 'st%' it does not match NULL value
-
_: is not supported in DB2 but most other system support it.
WHERE val LIKE '_string' -
[]: not supported in SQLite
-
Wildcards are slower than filtering operators
-
It should be the last clause in an statement
-
Different columns can be used for sorting
-
Column position can be used
ORDER BY 1,5 -
DESC: descending
-
ASC: ascending
-
DESC and ASC should be repeated for each column
Creating new column in the output from math operations on other columns:
SELECT col1,col2, col1*col2 AS new_col
FROM my_table;
newcol is as alias for the new column
-
AVG()
-
COUNT()
-
MIN()
-
MAX()
-
SUM()
SELECT AVG(col1) AS col1_avg FROM my_table; -- null rows are ignored
SELECT count(*) AS total_rows FROM my_table; -- it counts null rows -- if you count a specific column, it ignores null rows
AS is not mandatory.
It recognizes duplicates in a column
SELECT count(DISTINCT col1)
FROM my_table
In the following example we are counting the number of cities each country has in our table customers:
select country, count(city) from customers group by country ;
It filters the result of group by like where. In the following example, the result is filtered to those countries with customers having a special email patter:
select country, count(city) from customers group by country having email like '%com';
We can use multiple columns for group by. If there are more columns in the group by, these groups need to be in the select part too. NULL is categorized separately.
WHERE is used before grouping and HAVING is used after it.
They are used to create queries inside queries.
select country, company from customers
where customerid in
(select customerid from invoices where billingstate like 'a%');
Format your code using poorsql.com
SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers ON orders.customer_id = customers.customer_id;
The dot (.) in orders.orderid and customers.customername is used to specify the table from which each column should be retrieved. This notation is often referred to as "table.column" notation, and it's necessary when you are selecting columns from multiple tables that have columns with the same name.
SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers ON orders.customer_id = customers.customer_id;
-
`SELECT orders.orderid, customers.customername`: This part of the SQL statement specifies the columns you want to retrieve in the result set. It's saying that you want to retrieve two columns:
- `orders.orderid`: This is the `orderid` column from the `orders` table.
- `customers.customername`: This is the `customername` column from the `customers` table.
-
`FROM orders`: This part of the statement specifies the source table from which you want to retrieve data. In this case, it's the `orders` table.
-
`INNER JOIN customers ON orders.customerid = customers.customerid`: This is where the actual join operation occurs. Let's break it down further:
- `INNER JOIN`: This specifies that you want to perform an inner join between the `orders` table and the `customers` table. An inner join returns only the rows where there is a match in both tables.
- `customers` is the name of the table you're joining with.
- `ON orders.customerid = customers.customerid`: This part of the statement specifies the join condition. It tells the database how to match rows between the two tables. Specifically, it's saying that you want to join rows where the `customerid` column in the `orders` table is equal to the `customerid` column in the `customers` table. This condition establishes the relationship between the two tables based on the `customerid` column.
So, when you execute this SQL statement, SQLite will retrieve data from both the `orders` and `customers` tables and combine it into a single result set. The result will include pairs of `orderid` and `customername` where the `customerid` values in the `orders` and `customers` tables match.
For example, if you have the following data:
orders table:
| orderid | customerid |
|---|---|
| 1 | 101 |
| 2 | 102 |
| 3 | 103 |
customers table:
| customerid | customername |
|---|---|
| 101 | Alice |
| 102 | Bob |
| 104 | Carol |
The result of the SQL query will be:
| orderid | customername |
|---|---|
| 1 | Alice |
| 2 | Bob |
As you can see, only the rows with matching `customerid` values (1 and 2) are included in the result, and it combines the relevant data from both tables.
- A LEFT JOIN returns all rows from the left table and the matched rows
from the right table.
- If there's no match in the right table, NULL
values are filled in for the columns from the right table.
Example:
SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;
- A RIGHT JOIN is similar to a LEFT JOIN, but it returns all rows
from the right table and the matched rows from the left table.
- If there's no match in the left table, NULL values are filled in for
the columns from the left table.
Example:
SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders ON customers.customer_id = orders.customer_id;
In SQL joins, particularly when discussing LEFT JOIN and RIGHT JOIN, the terms "left" and "right" refer to the order of the tables being joined and the direction of the inclusion of rows in the result set. These terms help describe which table's data is preserved fully and which table's data may have missing values (NULLs) in the result set. Let's clarify the differences:
-
LEFT JOIN (or LEFT OUTER JOIN):
- The "left" table is the table listed on the left side of the JOIN clause.
- All rows from the left table are included in the result set, regardless of whether there is a match in the right table.
- If there is a match in the right table, the corresponding data is included in the result set.
- If there is no match in the right table, NULL values are used for the columns from the right table.
-
RIGHT JOIN (or RIGHT OUTER JOIN):
- The "right" table is the table listed on the right side of the JOIN clause.
- All rows from the right table are included in the result set, regardless of whether there is a match in the left table.
- If there is a match in the left table, the corresponding data is included in the result set.
- If there is no match in the left table, NULL values are used for the columns from the left table.
In summary, the primary difference between LEFT JOIN and RIGHT JOIN is the direction in which they include rows from the tables involved in the join. LEFT JOIN keeps all rows from the left table, while RIGHT JOIN keeps all rows from the right table. The choice of whether to use LEFT JOIN or RIGHT JOIN depends on your specific data and query requirements. Most often, LEFT JOIN is used because it retains all records from the "left" or main table and includes matching data from the "right" or related table when available.
A CROSS JOIN, also known as a Cartesian join, is a type of join operation in SQL where every row from one table is combined with every row from another table, resulting in a Cartesian product of the two tables. In other words, it creates all possible combinations of rows from the two tables, without any specific condition or criteria for matching rows. As a result, the size of the result set can grow rapidly, especially if the tables being joined are large.
Here's the basic syntax for a CROSS JOIN:
SELECT *
FROM table1
CROSS JOIN table2;
In this syntax:
- `table1` and `table2` are the names of the two tables you want to combine.
- `*` is used to select all columns from both tables in the Cartesian product.
Here's an example to illustrate a CROSS JOIN:
Assume you have two tables:
table1 (colors):
| color |
|---|
| Red |
| Green |
| Blue |
table2 (sizes):
| size |
|---|
| Small |
| Medium |
| Large |
If you perform a CROSS JOIN between these two tables:
SELECT *
FROM colors
CROSS JOIN sizes;
The result will be the Cartesian product of all color and size combinations:
| color | size |
|---|---|
| Red | Small |
| Red | Medium |
| Red | Large |
| Green | Small |
| Green | Medium |
| Green | Large |
| Blue | Small |
| Blue | Medium |
| Blue | Large |
As you can see, every color is combined with every size, resulting in a total of 9 rows in the result set.
CROSS JOINs are relatively rare in practice because they tend to generate large result sets and can be computationally expensive. They are typically used in specific scenarios where you need to generate all possible combinations, such as when creating test data or performing certain types of mathematical calculations. In most cases, other types of joins (e.g., INNER JOIN, LEFT JOIN) with appropriate conditions are more commonly used to retrieve meaningful data from related tables.
Here's a brief overview of the key characteristics of a FULL OUTER JOIN:
Returns All Rows: Unlike INNER JOIN, which only returns matching rows, a FULL OUTER JOIN returns all rows from both the left and right tables.
NULL Values: When there is no match for a row in one of the tables, the columns from that table will contain NULL values in the result set.
Combines Data: A FULL OUTER JOIN combines data from two tables based on a specified join condition, creating a unified result set.
Useful for Comparisons: FULL OUTER JOINs are often used when you need to compare data from two tables comprehensively, such as identifying records that exist in one table but not in the other or finding matching records while also including non-matching ones.
Here's the basic syntax for a FULL OUTER JOIN:
SELECT *
FROM table1
FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;
In this scenario, we have three tables: `orders`, `customers`, and `products`. We want to retrieve a list of orders that includes the customer name and the product name for each order.
Assuming the following table structures:
orders:
| orderid | customerid | productid |
|---|---|---|
| 1 | 101 | 201 |
| 2 | 102 | 202 |
| 3 | 103 | 201 |
| 4 | 104 | 203 |
customers:
| customerid | customername |
|---|---|
| 101 | Alice |
| 102 | Bob |
| 103 | Carol |
| 104 | David |
products:
| productid | productname |
|---|---|
| 201 | Laptop |
| 202 | Smartphone |
| 203 | Tablet |
We can use an INNER JOIN to combine these tables to get the desired result:
SELECT orders.order_id, customers.customer_name, products.product_name
FROM (orders
INNER JOIN customers ON orders.customer_id = customers.customer_id)
INNER JOIN products ON orders.product_id = products.product_id;
In this query:
-
The first INNER JOIN combines the `orders` and `customers` tables based on the `customerid` column, linking each order to its respective customer.
-
The second INNER JOIN combines the result of the first join (which includes order and customer information) with the `products` table based on the `productid` column, linking each order to its corresponding product.
The result of this query will be a list of orders with the customer name and the product name for each order:
| orderid | customername | productname |
|---|---|---|
| 1 | Alice | Laptop |
| 2 | Bob | Smartphone |
| 3 | Carol | Laptop |
| 4 | David | Tablet |
This query demonstrates how you can use INNER JOINs to combine data from three tables to create a comprehensive result set that includes information from all three tables based on specified join conditions.
SELECT c.customer_name, o.order_date
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York';
SELECT o.order_id, c.customer_name
FROM orders AS o
INNER JOIN customers AS c ON o.customer_id = c.customer_id;
A self-join is a specific type of join operation in SQL where a table is joined with itself. In other words, it's a way to combine rows from the same table based on a related column. Self-joins are commonly used when you have hierarchical or recursive data stored within a single table, and you need to retrieve relationships or data from different rows within the same table.
Here's how a self-join works:
-
Table with Self-Reference: In a self-join scenario, the table must contain a column or columns that establish a relationship between rows within the same table. This typically involves having a foreign key column that references the primary key column of the same table. This creates a self-referencing relationship.
-
Join Condition: You specify a join condition that defines how the rows in the table should be matched with other rows in the same table. This condition typically involves comparing the values in the self-referencing columns.
-
Result Set: The result of the self-join is a new table that combines data from the original table based on the specified join condition.
Here's a simple example to illustrate a self-join. Suppose you have an "employees" table with the following structure:
| employeeid | employeename | managerid |
|---|---|---|
| 1 | Alice | NULL |
| 2 | Bob | 1 |
| 3 | Carol | 1 |
| 4 | David | 2 |
| 5 | Eve | 2 |
In this table:
- `employeeid` is the primary key.
- `employeename` stores the names of employees.
- `managerid` is a foreign key that references the `employeeid` of the employee's manager. It establishes a self-referencing relationship.
You can use a self-join to retrieve the names of employees along with the names of their managers:
SELECT e.employee_name AS employee, m.employee_name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;
In this query:
- `e` is an alias for the "employees" table, representing employees.
- `m` is an alias for the same "employees" table, representing managers.
- The self-join condition `e.managerid = m.employeeid` connects each employee with their respective manager.
The result of the self-join will be:
| employee | manager |
|---|---|
| Alice | NULL |
| Bob | Alice |
| Carol | Alice |
| David | Bob |
| Eve | Bob |
As you can see, the self-join allows you to retrieve the names of employees along with the names of their respective managers from the same table, creating a hierarchical relationship. Self-joins are particularly useful in scenarios involving organizational hierarchies, bill-of-materials structures, and other situations where data has a parent-child or recursive relationship within a single table.
In SQL, the UNION operator is used to combine the result sets of two or more SELECT statements into a single result set. It allows you to merge rows from different tables or even from the same table based on common columns or data types. The UNION operator removes duplicate rows by default, but you can also use UNION ALL if you want to include duplicate rows in the result.
-
Column Compatibility: The SELECT statements within a UNION must have the same number of columns, and the corresponding columns in each SELECT statement must have compatible data types. The column names from the first SELECT statement are used as the column names in the result.
-
Duplicate Removal: By default, UNION removes duplicate rows from the combined result set. If you want to keep duplicate rows, you can use UNION ALL instead.
SELECT column1, column2, ... FROM table1 UNION SELECT column1, column2, ... FROM table2;
In JOIN you can call different columns from different tables but in subquery, the table that is called inside the subquery is not accessible outside the subquery.
-
Concatenating
select col1 || '(' || col2 || ')' from tab1; -
Trimming
select trim(' you the best ') as trim_str;
Removes trailing spaces from both sides. There is RTRIM and LTRIM also.
-
Substring
select substr(col1, num1, num2) from tab1;
num1 is the beginning of the trimming and num2 is the length of the trim. num1 starts with 1.
-
Case
select upper(col1) from tab1; select lower(col1) from tab1; select ucase(col1) from tab1;
The `strftime` function in SQLite is used to format date and time values as strings. It allows you to specify a format string with various modifiers to control how the date and time should be displayed. The basic syntax of the `strftime` function is as follows:
strftime(format, timestamp)
-
`format`: This is a required argument and specifies the format you want the date and time to be displayed in. It is a string containing format codes and optional text.
-
`timestamp`: This is the timestamp or date and time value that you want to format.
Here are some common format codes and modifiers you can use with the `strftime` function in SQLite:
Date Format Codes:
- `%Y`: Four-digit year (e.g., 2023).
- `%m`: Month as a zero-padded decimal number (01-12).
- `%d`: Day of the month as a zero-padded decimal number (01-31).
- `%W`: Week number of the year as a decimal number (00-53).
Time Format Codes:
- `%H`: Hour (00-23).
- `%M`: Minute (00-59).
- `%S`: Second (00-59).
Other Format Codes:
- `%a`: Abbreviated weekday name (Sun, Mon, Tue, etc.).
- `%A`: Full weekday name (Sunday, Monday, Tuesday, etc.).
- `%b`: Abbreviated month name (Jan, Feb, Mar, etc.).
- `%B`: Full month name (January, February, March, etc.).
- `%c`: Preferred date and time representation (usually the same as `%a %b %d %H:%M:%S %Y`).
- `%p`: AM or PM designation (AM/PM).
Modifiers:
- `-`: A hyphen `-` before a modifier (e.g., `%-d`) removes leading zeros.
- `_`: An underscore `_` before a modifier (e.g., `_%d`) pads with spaces instead of zeros.
- `0`: A zero `0` before a modifier (e.g., `%0d`) pads with zeros (default behavior).
Examples of using the `strftime` function with modifiers and format codes:
-- Format a timestamp as "YYYY-MM-DD HH:MM:SS"
SELECT strftime('%Y-%m-%d %H:%M:%S', '2023-09-06 15:30:00');
-- Format a timestamp without leading zeros in the day and month
SELECT strftime('%Y-%-m-%-d %H:%M:%S', '2023-09-06 15:30:00');
-- Format a timestamp with abbreviated month and day names
SELECT strftime('%A, %B %d, %Y', '2023-09-06 15:30:00');
-- Format a timestamp as "hh:MM AM/PM"
SELECT strftime('%I:%M %p', '2023-09-06 15:30:00');
```
To retrieve the current date:
select date('now');
In SQLite, the CASE statement is a conditional expression that allows you to perform conditional logic within SQL queries. It's useful for making decisions and returning different values or performing different actions based on specific conditions. The CASE statement can be used in both SELECT statements and UPDATE statements.
The basic syntax of the CASE statement in SQLite is as follows:
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
ELSE else_result
END
Example:
SELECT
name,
CASE
WHEN age >= 18 THEN 'Adult'
ELSE 'Minor'
END AS age_group
FROM people;
In this example, we use a CASE statement to categorize people as either 'Adult' or 'Minor' based on their age.
In SQLite, a view is a virtual table created from the result of a SELECT query. Unlike physical tables, views do not store data themselves; instead, they provide a way to access and manipulate data from one or more underlying tables or other views. Views are often used to simplify complex queries, encapsulate logic, and provide a consistent interface to users or applications.
To create a view in SQLite, you use the CREATE VIEW statement. Here's the basic syntax:
CREATE VIEW view_name AS
SELECT columns
FROM tables
WHERE conditions;
Once a view is created, you can query it just like a regular table using the SELECT statement:
SELECT * FROM top_customers;
drop view my_view;
ETL stands for Extract, Transform, Load, and it is a concept related to data integration and database management, often associated with SQL databases. ETL is a process used to collect, clean, and move data from various sources into a destination database or data warehouse for analysis, reporting, or other purposes. Each phase of the ETL process serves a specific purpose.
Data profiling in SQL is a process used to analyze and understand the characteristics, structure, and quality of a dataset or a database. It involves examining the data to gain insights into its content, distribution, completeness, accuracy, and potential issues. Data profiling is an essential step in data analysis, data cleansing, and data integration processes. Here are the key concepts related to data profiling in SQL:
-
Data Exploration: Data profiling begins with exploring the data to get a high-level understanding of its contents. Analysts or data professionals use SQL queries to retrieve sample data, count records, and examine the data types of columns.
-
Column Profiling: Data profiling often involves profiling individual columns within a dataset or database table. This includes: Determining data types (e.g., integer, string, date). Calculating statistics such as minimum, maximum, mean, median, and standard deviation. Identifying unique values and their frequencies. Detecting missing values and calculating the percentage of missing data. Checking for data patterns, formats, and distributions.
SQL puzzles
Recommended by many recruiters to practice SQL for a data science interview.