update structure #3
Conversation
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.
Pull request overview
This PR restructures the TableScanner application by introducing a layered architecture with separate service modules, configuration management, and a more comprehensive search API endpoint.
- Introduces service layer with modules for MinIO, SQLite, and cache operations
- Adds configuration management using pydantic-settings for environment variables
- Replaces GET /search endpoint with a POST endpoint that accepts structured search requests with filtering, ordering, and pagination
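Based on the fields named in this review (pangenome_id, table_name, and limit, plus filter specs with column/op/value and order specs with column/direction), a request to the new POST /search endpoint might look like the following sketch. The concrete values are illustrative, not taken from the repository:

```json
{
  "pangenome_id": "pg_001",
  "table_name": "genes",
  "filters": [
    {"column": "contig", "op": "=", "value": "chr1"}
  ],
  "order_by": [
    {"column": "start", "direction": "ASC"}
  ],
  "limit": 100
}
```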
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Adds pydantic-settings dependency for configuration management |
| app/config.py | New configuration module using pydantic-settings to manage KB_SERVICE_AUTH_TOKEN and CACHE_DIR settings |
| app/models.py | New Pydantic models defining SearchRequest schema with pangenome_id, table_name, and query parameters |
| app/services/__init__.py | New services module initialization exporting core service functions |
| app/services/cache_service.py | New cache service module with functions for managing local file caching |
| app/services/minio_service.py | New MinIO service stub for downloading binary files from object storage |
| app/services/sqlite_service.py | New SQLite service with conversion and querying functions including flexible filtering and ordering |
| app/main.py | Updates to store settings in app state for access throughout the application |
| app/routes.py | Refactors search endpoint to POST method with integrated workflow for downloading, caching, converting, and querying data |
```python
if op not in allowed_ops:
    raise ValueError(f"Invalid operator: {op}")

where_clauses.append(f"{column} {op} ?")
```
Copilot (AI) · Dec 15, 2025
The column names from filter_spec are directly interpolated into the SQL query without sanitization. This creates a SQL injection vulnerability. An attacker could pass malicious column names that break out of the query structure. Validate column names against the actual table schema or use a whitelist approach.
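One way to implement the whitelist this comment suggests is to look up the table's real columns with `PRAGMA table_info` and reject anything else. A minimal sketch (the helper names are hypothetical, not from the PR; note that `table_name` itself must also come from a trusted list, since `PRAGMA` cannot take a `?` parameter binding):

```python
import sqlite3


def get_valid_columns(conn: sqlite3.Connection, table_name: str) -> set:
    """Return the real column names of a table via PRAGMA table_info.

    table_name must itself be validated against a trusted whitelist
    (e.g. names read from sqlite_master), because PRAGMA statements
    cannot use '?' placeholders.
    """
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return {row[1] for row in rows}  # row[1] is the column name


def validate_column(column: str, valid_columns: set) -> str:
    """Reject any column name not present in the table schema."""
    if column not in valid_columns:
        raise ValueError(f"Invalid column: {column}")
    return column
```

Validated names can then be interpolated into the WHERE and ORDER BY clauses safely, while values continue to go through `?` placeholders.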
```python
    filters=search_request.filters,
)

#TODO use a return model when we figure out what we want to return
```
Copilot (AI) · Dec 15, 2025
The TODO comment is missing a space after the hash and a colon after TODO. The standard format is '# TODO:' for consistency with other TODO comments in the codebase.
Suggested change:

```diff
-#TODO use a return model when we figure out what we want to return
+# TODO: use a return model when we figure out what we want to return
```
```python
if direction not in ["ASC", "DESC"]:
    raise ValueError(f"Invalid direction: {direction}")

order_clauses.append(f"{column} {direction}")
```
Copilot (AI) · Dec 15, 2025
The column names in order_spec are directly interpolated into the SQL query without sanitization. This creates a SQL injection vulnerability. An attacker could pass malicious column names. Validate column names against the actual table schema or use a whitelist approach.
```python
where_clauses.append(f"{column} {op} ?")
params.append(value)
```
Copilot (AI) · Dec 15, 2025
The IN operator requires special handling with multiple values (e.g., a tuple), but the current implementation treats it like other operators with a single placeholder. When op is "IN" and value is a list, this will fail. The IN operator needs special parameter binding handling or should be removed from the allowed_ops list.
Suggested change:

```diff
-where_clauses.append(f"{column} {op} ?")
-params.append(value)
+if op == "IN":
+    if not isinstance(value, (list, tuple)) or not value:
+        raise ValueError(f"Value for IN operator must be a non-empty list or tuple, got: {value}")
+    placeholders = ", ".join(["?"] * len(value))
+    where_clauses.append(f"{column} IN ({placeholders})")
+    params.extend(value)
+else:
+    where_clauses.append(f"{column} {op} ?")
+    params.append(value)
```
```python
# Convert to SQLite if not already converted
if not is_cached(sqlite_file_path):
    convert_to_sqlite(cache_file_path, sqlite_file_path)
```
Copilot (AI) · Dec 15, 2025
The convert_to_sqlite function raises NotImplementedError, but this is called without a try-except block. This will cause the endpoint to fail with an unhandled exception. Either implement the function or add proper error handling to return a meaningful HTTP error response.
Suggested change:

```diff
-convert_to_sqlite(cache_file_path, sqlite_file_path)
+try:
+    convert_to_sqlite(cache_file_path, sqlite_file_path)
+except NotImplementedError:
+    raise HTTPException(
+        status_code=501,
+        detail="The convert_to_sqlite function is not implemented."
+    )
```
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 18 comments.
| """ | ||
| headers = {"Authorization": auth_token} | ||
| response = requests.get(handle_url, headers=headers) | ||
| response.raise_for_status() |
Copilot (AI) · Dec 15, 2025
Poor error handling: The raise_for_status() call will raise a generic HTTPError without additional context. Consider catching this exception and raising a more informative error that includes details about what failed (e.g., which URL, what status code) to help with debugging and provide better user feedback.
Suggested change:

```diff
-response.raise_for_status()
+try:
+    response.raise_for_status()
+except requests.HTTPError as e:
+    raise requests.HTTPError(
+        f"Failed to download from {handle_url} (status code: {response.status_code}): {response.text}"
+    ) from e
```
```python
    requests.HTTPError: If download fails
"""
headers = {"Authorization": auth_token}
response = requests.get(handle_url, headers=headers)
```
Copilot (AI) · Dec 15, 2025
Missing timeout parameter: The requests.get call does not specify a timeout, which could cause the application to hang indefinitely if the remote server is unresponsive. Consider adding a timeout parameter (e.g., timeout=30) to prevent resource exhaustion.
Suggested change:

```diff
-response = requests.get(handle_url, headers=headers)
+response = requests.get(handle_url, headers=headers, timeout=30)
```
```python
cache_file_path = cache_dir / id / filename
sqlite_file_path = cache_dir / id / f"{Path(filename).stem}.db"
```
Copilot (AI) · Dec 15, 2025
Path traversal vulnerability: The id and filename parameters are used directly in path construction without validation. An attacker could provide values like "../../../etc/passwd" to access or create files outside the intended cache directory. Consider sanitizing these inputs by removing path separators or validating against a safe pattern.
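A sketch of the sanitization this comment asks for: resolve the joined path and verify it stays under the cache directory, so `..` components and absolute paths are rejected. The helper name is hypothetical, not from the PR; `Path.is_relative_to` requires Python 3.9+:

```python
from pathlib import Path


def safe_cache_path(cache_dir: Path, id: str, filename: str) -> Path:
    """Join id/filename under cache_dir, rejecting path traversal.

    Resolving the candidate path collapses any '..' or absolute
    components; if the result escapes cache_dir, refuse it.
    """
    base = cache_dir.resolve()
    candidate = (base / id / filename).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"Unsafe path components: {id!r}, {filename!r}")
    return candidate
```

The same check covers both the raw cache file and the derived SQLite file, since both are built from the same id/filename pair.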
```python
convert_to_sqlite(cache_file_path, sqlite_file_path)

# Query the SQLite file with parameters
from app.utils.sqlite import get_table_data
```
Copilot (AI) · Dec 15, 2025
The import for get_table_data should be moved to the top of the file with the other imports from app.utils.sqlite (line 14) for better code organization and consistency. Inline imports should be avoided unless there's a specific reason like circular imports.
```python
from pydantic import BaseModel


class SearchRequest(BaseModel):
    """Search request with query parameters."""
    pangenome_id: str
    table_name: str
    limit: Optional[int] = None
```
Copilot (AI) · Dec 15, 2025
Missing validation for limit field: The limit field should have a minimum value (at least 1) and possibly a maximum value to prevent excessive memory usage or denial of service from extremely large result sets. Consider adding Field validators with appropriate constraints.
Suggested change:

```diff
-from pydantic import BaseModel
+from pydantic import BaseModel, Field

 class SearchRequest(BaseModel):
     """Search request with query parameters."""
     pangenome_id: str
     table_name: str
-    limit: Optional[int] = None
+    limit: Optional[int] = Field(
+        default=None,
+        ge=1,
+        le=1000,
+        description="Maximum number of results to return (min 1, max 1000)"
+    )
```
```python
conn = sqlite3.connect(sqlite_file)
conn.row_factory = sqlite3.Row
cursor = conn.cursor()

# Build SELECT query
query = f"SELECT * FROM {table_name}"
params = []

# Add WHERE clause for filters
if filters:
    where_clauses = []
    for filter_spec in filters:
        column = filter_spec["column"]
        op = filter_spec["op"]
        value = filter_spec["value"]

        # Sanitize operator
        allowed_ops = ["=", "!=", "<", ">", "<=", ">=", "LIKE", "IN"]
        if op not in allowed_ops:
            raise ValueError(f"Invalid operator: {op}")

        where_clauses.append(f"{column} {op} ?")
        params.append(value)

    query += " WHERE " + " AND ".join(where_clauses)

# Add ORDER BY clause
if order_by:
    order_clauses = []
    for order_spec in order_by:
        column = order_spec["column"]
        direction = order_spec.get("direction", "ASC").upper()

        if direction not in ["ASC", "DESC"]:
            raise ValueError(f"Invalid direction: {direction}")

        order_clauses.append(f"{column} {direction}")

    query += " ORDER BY " + ", ".join(order_clauses)

# Add LIMIT clause
if limit is not None:
    query += f" LIMIT {int(limit)}"

# Execute query
cursor.execute(query, params)
rows = cursor.fetchall()

# Convert to list of dicts
results = [dict(row) for row in rows]

conn.close()

return results
```
Copilot (AI) · Dec 15, 2025
Missing resource cleanup on error: If an exception occurs during query execution, the database connection is not properly closed. Consider using a try-finally block or a context manager (with statement) to ensure the connection is always closed, even when errors occur.
Suggested change (wrap the whole body in a with statement so the connection is managed even on error):

```python
with sqlite3.connect(sqlite_file) as conn:
    conn.row_factory = sqlite3.Row
    cursor = conn.cursor()

    # Build SELECT query
    query = f"SELECT * FROM {table_name}"
    params = []

    # Add WHERE clause for filters
    if filters:
        where_clauses = []
        for filter_spec in filters:
            column = filter_spec["column"]
            op = filter_spec["op"]
            value = filter_spec["value"]

            # Sanitize operator
            allowed_ops = ["=", "!=", "<", ">", "<=", ">=", "LIKE", "IN"]
            if op not in allowed_ops:
                raise ValueError(f"Invalid operator: {op}")

            where_clauses.append(f"{column} {op} ?")
            params.append(value)

        query += " WHERE " + " AND ".join(where_clauses)

    # Add ORDER BY clause
    if order_by:
        order_clauses = []
        for order_spec in order_by:
            column = order_spec["column"]
            direction = order_spec.get("direction", "ASC").upper()

            if direction not in ["ASC", "DESC"]:
                raise ValueError(f"Invalid direction: {direction}")

            order_clauses.append(f"{column} {direction}")

        query += " ORDER BY " + ", ".join(order_clauses)

    # Add LIMIT clause
    if limit is not None:
        query += f" LIMIT {int(limit)}"

    # Execute query
    cursor.execute(query, params)
    rows = cursor.fetchall()

    # Convert to list of dicts
    results = [dict(row) for row in rows]

    return results
```
```python
value = filter_spec["value"]

# Sanitize operator
allowed_ops = ["=", "!=", "<", ">", "<=", ">=", "LIKE", "IN"]
```
Copilot (AI) · Dec 15, 2025
The IN operator is listed as allowed but the current implementation doesn't handle it correctly. The IN operator expects a tuple of values (e.g., "column IN (?, ?, ?)"), but the current code only appends a single placeholder. This will cause SQL syntax errors when the IN operator is used.
```python
convert_to_sqlite(cache_file_path, sqlite_file_path)
```
Copilot (AI) · Dec 15, 2025
The convert_to_sqlite function always raises NotImplementedError (as seen in sqlite.py line 48), which will cause the endpoint to fail when attempting to convert uncached files. This should be implemented or the endpoint should handle the NotImplementedError appropriately.
Suggested change:

```diff
-convert_to_sqlite(cache_file_path, sqlite_file_path)
+try:
+    convert_to_sqlite(cache_file_path, sqlite_file_path)
+except NotImplementedError:
+    raise HTTPException(
+        status_code=501,
+        detail="Conversion to SQLite is not implemented yet."
+    )
```
| "cache_file": str(cache_file_path), | ||
| "sqlite_file": str(sqlite_file_path), |
Copilot (AI) · Dec 15, 2025
Potential information disclosure: The response exposes internal file system paths (cache_file and sqlite_file). This could reveal sensitive information about the server's directory structure to clients. Consider removing these fields or making them optional for debugging purposes only.
| "cache_file": str(cache_file_path), | |
| "sqlite_file": str(sqlite_file_path), |
```python
KBase Workspace utilities for retrieving object information.
"""

import requests
```
Copilot (AI) · Dec 15, 2025
Import of 'requests' is not used.
Suggested change:

```diff
-import requests
```
No description provided.