
Conversation

@bio-boris
Contributor

No description provided.

Copilot AI review requested due to automatic review settings December 15, 2025 22:00
@github-advanced-security

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

Copilot AI left a comment

Pull request overview

This PR restructures the TableScanner application by introducing a layered architecture with separate service modules, configuration management, and a more comprehensive search API endpoint.

  • Introduces service layer with modules for MinIO, SQLite, and cache operations
  • Adds configuration management using pydantic-settings for environment variables
  • Replaces GET /search endpoint with a POST endpoint that accepts structured search requests with filtering, ordering, and pagination

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Adds pydantic-settings dependency for configuration management |
| app/config.py | New configuration module using pydantic-settings to manage KB_SERVICE_AUTH_TOKEN and CACHE_DIR settings |
| app/models.py | New Pydantic models defining SearchRequest schema with pangenome_id, table_name, and query parameters |
| app/services/__init__.py | New services module initialization exporting core service functions |
| app/services/cache_service.py | New cache service module with functions for managing local file caching |
| app/services/minio_service.py | New MinIO service stub for downloading binary files from object storage |
| app/services/sqlite_service.py | New SQLite service with conversion and querying functions including flexible filtering and ordering |
| app/main.py | Updates to store settings in app state for access throughout the application |
| app/routes.py | Refactors search endpoint to POST method with integrated workflow for downloading, caching, converting, and querying data |

if op not in allowed_ops:
    raise ValueError(f"Invalid operator: {op}")

where_clauses.append(f"{column} {op} ?")

Copilot AI Dec 15, 2025

The column names from filter_spec are directly interpolated into the SQL query without sanitization. This creates a SQL injection vulnerability. An attacker could pass malicious column names that break out of the query structure. Validate column names against the actual table schema or use a whitelist approach.

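For illustration, a minimal sketch of schema-based column validation, assuming the table name itself has already been checked against a known-good list (the validate_columns helper is hypothetical, not part of this PR):

```python
import sqlite3


def validate_columns(conn: sqlite3.Connection, table_name: str, columns: list[str]) -> None:
    """Hypothetical helper: reject any column not present in the table schema.

    PRAGMA table_info returns one row per column, with the column name in
    the second field. table_name must already be trusted, since PRAGMA
    arguments cannot be bound as query parameters.
    """
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    schema_columns = {row[1] for row in rows}
    unknown = [c for c in columns if c not in schema_columns]
    if unknown:
        raise ValueError(f"Unknown column(s): {', '.join(unknown)}")
```

Deriving the whitelist from PRAGMA table_info keeps it in sync with the actual table instead of relying on a hard-coded list.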
    filters=search_request.filters,
)

#TODO use a return model when we figure out what we want to return

Copilot AI Dec 15, 2025

The TODO comment is missing a space after the hash and a colon after 'TODO'. The standard format is '# TODO:', which keeps it consistent with other TODO comments in the codebase.

Suggested change
- #TODO use a return model when we figure out what we want to return
+ # TODO: use a return model when we figure out what we want to return

if direction not in ["ASC", "DESC"]:
    raise ValueError(f"Invalid direction: {direction}")

order_clauses.append(f"{column} {direction}")

Copilot AI Dec 15, 2025

The column names in order_spec are directly interpolated into the SQL query without sanitization. This creates a SQL injection vulnerability. An attacker could pass malicious column names. Validate column names against the actual table schema or use a whitelist approach.

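As a sketch, the same whitelist idea applied to the ORDER BY path (safe_order_clause is a hypothetical helper; schema_columns would come from PRAGMA table_info as in the previous comment):

```python
def safe_order_clause(schema_columns: set[str], order_by: list[dict]) -> str:
    """Hypothetical helper: build an ORDER BY fragment from known columns only."""
    clauses = []
    for spec in order_by:
        column = spec["column"]
        if column not in schema_columns:
            raise ValueError(f"Unknown column: {column}")
        direction = spec.get("direction", "ASC").upper()
        if direction not in ("ASC", "DESC"):
            raise ValueError(f"Invalid direction: {direction}")
        # Both column and direction are validated, so interpolation is safe here
        clauses.append(f"{column} {direction}")
    return ", ".join(clauses)
```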
Comment on lines +143 to +144
where_clauses.append(f"{column} {op} ?")
params.append(value)

Copilot AI Dec 15, 2025

The IN operator requires special handling with multiple values (e.g., a tuple), but the current implementation treats it like other operators with a single placeholder. When op is "IN" and value is a list, this will fail. The IN operator needs special parameter binding handling or should be removed from the allowed_ops list.

Suggested change
- where_clauses.append(f"{column} {op} ?")
- params.append(value)
+ if op == "IN":
+     if not isinstance(value, (list, tuple)) or not value:
+         raise ValueError(f"Value for IN operator must be a non-empty list or tuple, got: {value}")
+     placeholders = ", ".join(["?"] * len(value))
+     where_clauses.append(f"{column} IN ({placeholders})")
+     params.extend(value)
+ else:
+     where_clauses.append(f"{column} {op} ?")
+     params.append(value)

# Convert to SQLite if not already converted
if not is_cached(sqlite_file_path):
    convert_to_sqlite(cache_file_path, sqlite_file_path)

Copilot AI Dec 15, 2025

The convert_to_sqlite function raises NotImplementedError, but this is called without a try-except block. This will cause the endpoint to fail with an unhandled exception. Either implement the function or add proper error handling to return a meaningful HTTP error response.

Suggested change
- convert_to_sqlite(cache_file_path, sqlite_file_path)
+ try:
+     convert_to_sqlite(cache_file_path, sqlite_file_path)
+ except NotImplementedError:
+     raise HTTPException(
+         status_code=501,
+         detail="The convert_to_sqlite function is not implemented."
+     )

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 18 comments.


"""
headers = {"Authorization": auth_token}
response = requests.get(handle_url, headers=headers)
response.raise_for_status()

Copilot AI Dec 15, 2025

Poor error handling: The raise_for_status() call will raise a generic HTTPError without additional context. Consider catching this exception and raising a more informative error that includes details about what failed (e.g., which URL, what status code) to help with debugging and provide better user feedback.

Suggested change
- response.raise_for_status()
+ try:
+     response.raise_for_status()
+ except requests.HTTPError as e:
+     raise requests.HTTPError(
+         f"Failed to download from {handle_url} (status code: {response.status_code}): {response.text}"
+     ) from e

requests.HTTPError: If download fails
"""
headers = {"Authorization": auth_token}
response = requests.get(handle_url, headers=headers)

Copilot AI Dec 15, 2025

Missing timeout parameter: The requests.get call does not specify a timeout, which could cause the application to hang indefinitely if the remote server is unresponsive. Consider adding a timeout parameter (e.g., timeout=30) to prevent resource exhaustion.

Suggested change
- response = requests.get(handle_url, headers=headers)
+ response = requests.get(handle_url, headers=headers, timeout=30)

Comment on lines +21 to +22
cache_file_path = cache_dir / id / filename
sqlite_file_path = cache_dir / id / f"{Path(filename).stem}.db"

Copilot AI Dec 15, 2025

Path traversal vulnerability: The id and filename parameters are used directly in path construction without validation. An attacker could provide values like "../../../etc/passwd" to access or create files outside the intended cache directory. Consider sanitizing these inputs by removing path separators or validating against a safe pattern.

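A minimal sketch of the resolve-and-compare approach, assuming cache_dir is the trusted root (safe_cache_path is a hypothetical helper; Path.is_relative_to requires Python 3.9+):

```python
from pathlib import Path


def safe_cache_path(cache_dir: Path, id: str, filename: str) -> Path:
    """Hypothetical helper: reject id/filename values that escape cache_dir."""
    candidate = (cache_dir / id / filename).resolve()
    # resolve() collapses any ".." segments, so a traversal attempt
    # lands outside the cache root and fails this containment check
    if not candidate.is_relative_to(cache_dir.resolve()):
        raise ValueError(f"Unsafe path components: {id!r}, {filename!r}")
    return candidate
```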
convert_to_sqlite(cache_file_path, sqlite_file_path)

# Query the SQLite file with parameters
from app.utils.sqlite import get_table_data

Copilot AI Dec 15, 2025

The import for get_table_data should be moved to the top of the file with the other imports from app.utils.sqlite (line 14) for better code organization and consistency. Inline imports should be avoided unless there's a specific reason like circular imports.

Comment on lines +6 to +13
from pydantic import BaseModel


class SearchRequest(BaseModel):
    """Search request with query parameters."""
    pangenome_id: str
    table_name: str
    limit: Optional[int] = None

Copilot AI Dec 15, 2025

Missing validation for limit field: The limit field should have a minimum value (at least 1) and possibly a maximum value to prevent excessive memory usage or denial of service from extremely large result sets. Consider adding Field validators with appropriate constraints.

Suggested change
- from pydantic import BaseModel
- class SearchRequest(BaseModel):
-     """Search request with query parameters."""
-     pangenome_id: str
-     table_name: str
-     limit: Optional[int] = None
+ from pydantic import BaseModel, Field
+ class SearchRequest(BaseModel):
+     """Search request with query parameters."""
+     pangenome_id: str
+     table_name: str
+     limit: Optional[int] = Field(
+         default=None,
+         ge=1,
+         le=1000,
+         description="Maximum number of results to return (min 1, max 1000)"
+     )

Comment on lines +122 to +175
conn = sqlite3.connect(sqlite_file)
conn.row_factory = sqlite3.Row
cursor = conn.cursor()

# Build SELECT query
query = f"SELECT * FROM {table_name}"
params = []

# Add WHERE clause for filters
if filters:
    where_clauses = []
    for filter_spec in filters:
        column = filter_spec["column"]
        op = filter_spec["op"]
        value = filter_spec["value"]

        # Sanitize operator
        allowed_ops = ["=", "!=", "<", ">", "<=", ">=", "LIKE", "IN"]
        if op not in allowed_ops:
            raise ValueError(f"Invalid operator: {op}")

        where_clauses.append(f"{column} {op} ?")
        params.append(value)

    query += " WHERE " + " AND ".join(where_clauses)

# Add ORDER BY clause
if order_by:
    order_clauses = []
    for order_spec in order_by:
        column = order_spec["column"]
        direction = order_spec.get("direction", "ASC").upper()

        if direction not in ["ASC", "DESC"]:
            raise ValueError(f"Invalid direction: {direction}")

        order_clauses.append(f"{column} {direction}")

    query += " ORDER BY " + ", ".join(order_clauses)

# Add LIMIT clause
if limit is not None:
    query += f" LIMIT {int(limit)}"

# Execute query
cursor.execute(query, params)
rows = cursor.fetchall()

# Convert to list of dicts
results = [dict(row) for row in rows]

conn.close()

return results

Copilot AI Dec 15, 2025

Missing resource cleanup on error: If an exception occurs during query execution, the database connection is not properly closed. Consider using a try-finally block or a context manager (with statement) to ensure the connection is always closed, even when errors occur.

Suggested change
- conn = sqlite3.connect(sqlite_file)
- conn.row_factory = sqlite3.Row
- cursor = conn.cursor()
- # Build SELECT query
- query = f"SELECT * FROM {table_name}"
- params = []
- # Add WHERE clause for filters
- if filters:
-     where_clauses = []
-     for filter_spec in filters:
-         column = filter_spec["column"]
-         op = filter_spec["op"]
-         value = filter_spec["value"]
-         # Sanitize operator
-         allowed_ops = ["=", "!=", "<", ">", "<=", ">=", "LIKE", "IN"]
-         if op not in allowed_ops:
-             raise ValueError(f"Invalid operator: {op}")
-         where_clauses.append(f"{column} {op} ?")
-         params.append(value)
-     query += " WHERE " + " AND ".join(where_clauses)
- # Add ORDER BY clause
- if order_by:
-     order_clauses = []
-     for order_spec in order_by:
-         column = order_spec["column"]
-         direction = order_spec.get("direction", "ASC").upper()
-         if direction not in ["ASC", "DESC"]:
-             raise ValueError(f"Invalid direction: {direction}")
-         order_clauses.append(f"{column} {direction}")
-     query += " ORDER BY " + ", ".join(order_clauses)
- # Add LIMIT clause
- if limit is not None:
-     query += f" LIMIT {int(limit)}"
- # Execute query
- cursor.execute(query, params)
- rows = cursor.fetchall()
- # Convert to list of dicts
- results = [dict(row) for row in rows]
- conn.close()
- return results
+ with sqlite3.connect(sqlite_file) as conn:
+     conn.row_factory = sqlite3.Row
+     cursor = conn.cursor()
+     # Build SELECT query
+     query = f"SELECT * FROM {table_name}"
+     params = []
+     # Add WHERE clause for filters
+     if filters:
+         where_clauses = []
+         for filter_spec in filters:
+             column = filter_spec["column"]
+             op = filter_spec["op"]
+             value = filter_spec["value"]
+             # Sanitize operator
+             allowed_ops = ["=", "!=", "<", ">", "<=", ">=", "LIKE", "IN"]
+             if op not in allowed_ops:
+                 raise ValueError(f"Invalid operator: {op}")
+             where_clauses.append(f"{column} {op} ?")
+             params.append(value)
+         query += " WHERE " + " AND ".join(where_clauses)
+     # Add ORDER BY clause
+     if order_by:
+         order_clauses = []
+         for order_spec in order_by:
+             column = order_spec["column"]
+             direction = order_spec.get("direction", "ASC").upper()
+             if direction not in ["ASC", "DESC"]:
+                 raise ValueError(f"Invalid direction: {direction}")
+             order_clauses.append(f"{column} {direction}")
+         query += " ORDER BY " + ", ".join(order_clauses)
+     # Add LIMIT clause
+     if limit is not None:
+         query += f" LIMIT {int(limit)}"
+     # Execute query
+     cursor.execute(query, params)
+     rows = cursor.fetchall()
+     # Convert to list of dicts
+     results = [dict(row) for row in rows]
+     return results

value = filter_spec["value"]

# Sanitize operator
allowed_ops = ["=", "!=", "<", ">", "<=", ">=", "LIKE", "IN"]

Copilot AI Dec 15, 2025

The IN operator is listed as allowed but the current implementation doesn't handle it correctly. The IN operator expects a tuple of values (e.g., "column IN (?, ?, ?)"), but the current code only appends a single placeholder. This will cause SQL syntax errors when the IN operator is used.

Comment on lines +71 to +72
convert_to_sqlite(cache_file_path, sqlite_file_path)


Copilot AI Dec 15, 2025

The convert_to_sqlite function always raises NotImplementedError (as seen in sqlite.py line 48), which will cause the endpoint to fail when attempting to convert uncached files. This should be implemented or the endpoint should handle the NotImplementedError appropriately.

Suggested change
- convert_to_sqlite(cache_file_path, sqlite_file_path)
+ try:
+     convert_to_sqlite(cache_file_path, sqlite_file_path)
+ except NotImplementedError:
+     raise HTTPException(
+         status_code=501,
+         detail="Conversion to SQLite is not implemented yet."
+     )

Comment on lines +88 to +89
"cache_file": str(cache_file_path),
"sqlite_file": str(sqlite_file_path),

Copilot AI Dec 15, 2025

Potential information disclosure: The response exposes internal file system paths (cache_file and sqlite_file). This could reveal sensitive information about the server's directory structure to clients. Consider removing these fields or making them optional for debugging purposes only.

Suggested change
- "cache_file": str(cache_file_path),
- "sqlite_file": str(sqlite_file_path),

KBase Workspace utilities for retrieving object information.
"""

import requests

Copilot AI Dec 15, 2025

Import of 'requests' is not used.

Suggested change
- import requests
