Skip to content

Conversation

@colinthebomb1
Copy link
Collaborator

Overview:

This PR implements a basic formatter with a basic test for converting from our custom AST into JSON that can be understood by mosql, which turns it into plaintext. AST -> JSON -> SQL TEXT (via mosql).

Code Changes:

  • In tests/test_query_parser.py added test_basic_format() to run a basic test of the formatter and normalize_sql() for comparing if the expected sql is equal to our output.
  • Implemented format() and it's clause specific formatters: format_select(), format_from() ...etc in core/query_parser.py, responsible for converting AST to JSON clause by clause.

@colinthebomb1
Copy link
Collaborator Author

@HazelYuAhiru Because the test case assumes your output will match its input, I want to make sure our AST matches. The changes I made are:

  • Implemented order by using new OrderByItem node
order_by_clause = OrderByNode([
        OrderByItemNode(dept_alias, SortOrder.DESC),
        OrderByItemNode(count_alias, SortOrder.DESC)  
])  
  • Changed all sets to lists
  • added dept alias for order by dept_alias = ColumnNode("dept_name")
  • added alias to count star FunctionNode count_star = FunctionNode("COUNT", {ColumnNode("*")}, 'emp_count')

@colinthebomb1 colinthebomb1 marked this pull request as draft November 21, 2025 00:14
@HazelYuAhiru
Copy link
Collaborator

I'm still reviewing the main formatter functions, but I was wondering that should we separate the formatter and parser code into different files with their own tests? Like having query_formatter.py and query_formatter_test.py to avoid future merge conflicts.

@colinthebomb1 colinthebomb1 marked this pull request as ready for review December 1, 2025 20:19
Copy link
Collaborator

@HazelYuAhiru HazelYuAhiru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a basic query formatter that converts custom AST (Abstract Syntax Tree) nodes into JSON format compatible with the mosql library, which then generates SQL text. The implementation provides a two-step conversion pipeline: AST → JSON → SQL.

Key Changes:

  • Added QueryFormatter class with format() method that orchestrates the AST-to-SQL conversion
  • Implemented clause-specific formatters (format_select, format_from, format_where, etc.) to handle different SQL clauses
  • Created comprehensive test case test_basic_format() that validates formatting of a complex query with JOIN, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, and OFFSET clauses

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 11 comments.

File Description
core/query_formatter.py New formatter implementation with AST-to-JSON conversion functions and support for all major SQL clauses including JOIN operations
tests/test_query_formatter.py New test file with test_basic_format() validating the formatter against a comprehensive multi-clause SQL query, plus normalize_sql() helper for SQL comparison
Comments suppressed due to low confidence (3)

core/query_formatter.py:8

  • Import of 'ColumnNode' is not used.
    Import of 'LiteralNode' is not used.
    Import of 'OperatorNode' is not used.
    Import of 'FunctionNode' is not used.
    Import of 'LimitNode' is not used.
    Import of 'OffsetNode' is not used.
    Import of 'SubqueryNode' is not used.
    Import of 'VarNode' is not used.
    Import of 'VarSetNode' is not used.
from core.ast.node import (
    QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode, 
    LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
    OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode,
    JoinNode
)

tests/test_query_formatter.py:7

  • Import of 'SubqueryNode' is not used.
    Import of 'VarNode' is not used.
    Import of 'VarSetNode' is not used.
from core.ast.node import (
    OrderByItemNode, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode, 
    LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
    OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
)

tests/test_query_formatter.py:9

  • Import of 'get_query' is not used.
from data.queries import get_query

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

from core.ast.node import (
OrderByItemNode, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused imports: Several node types are imported but never used in this test file: SubqueryNode, VarNode, VarSetNode. Consider removing unused imports.

Suggested change
OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
OrderByNode, LimitNode, OffsetNode, JoinNode

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as the above. We will use them once we have a test case with a subquery.

from core.ast.node import (
QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode,
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused imports: SubqueryNode, VarNode, and VarSetNode are imported but never used in the formatter implementation. Consider removing unused imports.

Suggested change
OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode,
OrderByNode, LimitNode, OffsetNode,

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment as in Yihong's PR. We will support Subquery in future PRs when we introduce a complex test case with a subquery.

Comment on lines +21 to +89
def test_basic_format():
# Construct expected AST
# Tables
emp_table = TableNode("employees", "e")
dept_table = TableNode("departments", "d")
# Columns
emp_name = ColumnNode("name", _parent_alias="e")
emp_salary = ColumnNode("salary", _parent_alias="e")
emp_age = ColumnNode("age", _parent_alias="e")
emp_dept_id = ColumnNode("department_id", _parent_alias="e")

dept_name = ColumnNode("name", _alias="dept_name", _parent_alias="d")
dept_id = ColumnNode("id", _parent_alias="d")

count_star = FunctionNode("COUNT", _alias="emp_count", _args=[ColumnNode("*")])

# SELECT clause
select_clause = SelectNode([emp_name, dept_name, count_star])
# FROM clause with JOIN
join_condition = OperatorNode(emp_dept_id, "=", dept_id)
join_node = JoinNode(emp_table, dept_table, JoinType.INNER, join_condition)
from_clause = FromNode([join_node])
# WHERE clause
salary_condition = OperatorNode(emp_salary, ">", LiteralNode(40000))
age_condition = OperatorNode(emp_age, "<", LiteralNode(60))
where_condition = OperatorNode(salary_condition, "AND", age_condition)
where_clause = WhereNode([where_condition])
# GROUP BY clause
group_by_clause = GroupByNode([dept_id, dept_name])
# HAVING clause
having_condition = OperatorNode(count_star, ">", LiteralNode(2))
having_clause = HavingNode([having_condition])
# ORDER BY clause
order_by_item1 = OrderByItemNode(dept_name, SortOrder.ASC)
order_by_item2 = OrderByItemNode(count_star, SortOrder.DESC)
order_by_clause = OrderByNode([order_by_item1, order_by_item2])
# LIMIT and OFFSET
limit_clause = LimitNode(10)
offset_clause = OffsetNode(5)
# Complete query
ast = QueryNode(
_select=select_clause,
_from=from_clause,
_where=where_clause,
_group_by=group_by_clause,
_having=having_clause,
_order_by=order_by_clause,
_limit=limit_clause,
_offset=offset_clause
)

# Construct expected query text
expected_sql = """
SELECT e.name, d.name AS dept_name, COUNT(*) AS emp_count
FROM employees AS e JOIN departments AS d ON e.department_id = d.id
WHERE e.salary > 40000 AND e.age < 60
GROUP BY d.id, d.name
HAVING COUNT(*) > 2
ORDER BY dept_name, emp_count DESC
LIMIT 10 OFFSET 5
"""
expected_sql = expected_sql.strip()
print(mosql.parse(expected_sql))
print(ast)

sql = formatter.format(ast)
sql = sql.strip()

assert normalize_sql(sql) == normalize_sql(expected_sql) No newline at end of file
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Limited test coverage: Consider adding test cases for edge cases such as:

  • SELECT with no FROM clause
  • WHERE with OR conditions
  • Different join types (LEFT, RIGHT, FULL)
  • Nested subqueries
  • Functions with multiple arguments
  • NULL comparisons (IS NULL, IS NOT NULL)

While the current test provides good basic coverage, these cases would ensure the formatter handles various SQL patterns correctly.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We will add them in the following PRs.

elif node.type == NodeType.FUNCTION:
# format: {'function_name': args}
func_name = node.name.lower()
args = [format_expression(arg) for arg in node.children]
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential edge case: The function formatting logic assumes functions have at least one argument. SQL functions with no arguments (e.g., NOW(), CURRENT_TIMESTAMP()) would result in {func_name: []}. Consider handling the zero-argument case explicitly, e.g., return {func_name: args} if args else {func_name: None} or verify mosql's expected format for zero-argument functions.

Suggested change
args = [format_expression(arg) for arg in node.children]
args = [format_expression(arg) for arg in node.children]
if not args:
return {func_name: None}

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We can fix it once we have a test case for it.

Comment on lines +86 to +89
sql = formatter.format(ast)
sql = sql.strip()

assert normalize_sql(sql) == normalize_sql(expected_sql) No newline at end of file
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Test could be more comprehensive: The test only validates the final SQL output but doesn't verify the intermediate JSON format produced by ast_to_json(). Consider adding an assertion to check the JSON structure matches mosql's expected format, which would help catch issues in the AST-to-JSON conversion step separately from the JSON-to-SQL step.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,265 @@
import mo_sql_parsing as mosql
from core.ast.node import QueryNode
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate import: QueryNode is imported on line 2 and again on line 4. Remove the duplicate import on line 2.

Suggested change
from core.ast.node import QueryNode

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, let's remove it.

LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
)
from core.ast.enums import NodeType, JoinType, SortOrder
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'NodeType' is not used.

Suggested change
from core.ast.enums import NodeType, JoinType, SortOrder
from core.ast.enums import JoinType, SortOrder

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Let's remove it.

Copy link
Contributor

@baiqiushi baiqiushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @colinthebomb1 for the PR! It looks great overall! Only a few minor changes are needed. Please take care of them. Thanks!

@@ -0,0 +1,265 @@
import mo_sql_parsing as mosql
from core.ast.node import QueryNode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, let's remove it.

from core.ast.node import (
QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment as in Yihong's PR. We will support Subquery in future PRs when we introduce a complex test case with a subquery.

elif node.type == NodeType.FUNCTION:
# format: {'function_name': args}
func_name = node.name.lower()
args = [format_expression(arg) for arg in node.children]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We can fix it once we have a test case for it.

from core.ast.node import (
OrderByItemNode, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as the above. We will use them once we have a test case with a subquery.

LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
)
from core.ast.enums import NodeType, JoinType, SortOrder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Let's remove it.

OrderByNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
)
from core.ast.enums import NodeType, JoinType, SortOrder
from data.queries import get_query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Let's remove it.

Comment on lines +21 to +89
def test_basic_format():
# Construct expected AST
# Tables
emp_table = TableNode("employees", "e")
dept_table = TableNode("departments", "d")
# Columns
emp_name = ColumnNode("name", _parent_alias="e")
emp_salary = ColumnNode("salary", _parent_alias="e")
emp_age = ColumnNode("age", _parent_alias="e")
emp_dept_id = ColumnNode("department_id", _parent_alias="e")

dept_name = ColumnNode("name", _alias="dept_name", _parent_alias="d")
dept_id = ColumnNode("id", _parent_alias="d")

count_star = FunctionNode("COUNT", _alias="emp_count", _args=[ColumnNode("*")])

# SELECT clause
select_clause = SelectNode([emp_name, dept_name, count_star])
# FROM clause with JOIN
join_condition = OperatorNode(emp_dept_id, "=", dept_id)
join_node = JoinNode(emp_table, dept_table, JoinType.INNER, join_condition)
from_clause = FromNode([join_node])
# WHERE clause
salary_condition = OperatorNode(emp_salary, ">", LiteralNode(40000))
age_condition = OperatorNode(emp_age, "<", LiteralNode(60))
where_condition = OperatorNode(salary_condition, "AND", age_condition)
where_clause = WhereNode([where_condition])
# GROUP BY clause
group_by_clause = GroupByNode([dept_id, dept_name])
# HAVING clause
having_condition = OperatorNode(count_star, ">", LiteralNode(2))
having_clause = HavingNode([having_condition])
# ORDER BY clause
order_by_item1 = OrderByItemNode(dept_name, SortOrder.ASC)
order_by_item2 = OrderByItemNode(count_star, SortOrder.DESC)
order_by_clause = OrderByNode([order_by_item1, order_by_item2])
# LIMIT and OFFSET
limit_clause = LimitNode(10)
offset_clause = OffsetNode(5)
# Complete query
ast = QueryNode(
_select=select_clause,
_from=from_clause,
_where=where_clause,
_group_by=group_by_clause,
_having=having_clause,
_order_by=order_by_clause,
_limit=limit_clause,
_offset=offset_clause
)

# Construct expected query text
expected_sql = """
SELECT e.name, d.name AS dept_name, COUNT(*) AS emp_count
FROM employees AS e JOIN departments AS d ON e.department_id = d.id
WHERE e.salary > 40000 AND e.age < 60
GROUP BY d.id, d.name
HAVING COUNT(*) > 2
ORDER BY dept_name, emp_count DESC
LIMIT 10 OFFSET 5
"""
expected_sql = expected_sql.strip()
print(mosql.parse(expected_sql))
print(ast)

sql = formatter.format(ast)
sql = sql.strip()

assert normalize_sql(sql) == normalize_sql(expected_sql) No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We will add them in the following PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants