Recording runtime logs in DB #363
Conversation
A custom log level from the Python library.
Major performance improvement: I forgot to lock the compiled regex, so it was being recompiled on every call.
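The fix described above can be sketched with `std::sync::LazyLock` (stable since Rust 1.80): the initializer runs exactly once, and every later use reuses the cached value. In the PR this would presumably hold a `regex::Regex`; a plain `String` stands in here so the sketch needs no external crates, and the counter exists only to demonstrate single initialization.

```rust
use std::sync::LazyLock;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times the expensive initializer actually runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

// In the real code this would be `LazyLock<regex::Regex>` wrapping
// `Regex::new(...)`; the pattern text below is a placeholder.
static PATTERN: LazyLock<String> = LazyLock::new(|| {
    INIT_CALLS.fetch_add(1, Ordering::SeqCst);
    String::from(r"(?P<timestamp>\S+ \S+) - (?P<level>\w+)")
});

// Hypothetical per-line parse entry point: every call reuses the same
// cached value instead of rebuilding it.
fn parse_line(_line: &str) -> usize {
    PATTERN.len()
}

fn main() {
    for line in ["a", "b", "c"] {
        parse_line(line);
    }
    // The initializer ran exactly once despite three calls.
    assert_eq!(INIT_CALLS.load(Ordering::SeqCst), 1);
    println!("initializer ran {} time(s)", INIT_CALLS.load(Ordering::SeqCst));
}
```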
@ppinchuk , this is ready for review.
Pull request overview
This PR adds functionality to parse and store runtime logs from the ordinance scraper into the database to enable post-processing analysis. The implementation focuses on capturing INFO, WARNING, and ERROR level logs while filtering out more verbose levels to manage database size.
Key changes:
- New Rust module for parsing runtime logs with regex-based extraction of timestamp, level, subject, and message fields
- Database schema extension with a new `logs` table linked to the bookkeeper table
- Integration of log recording into the scraped ordinance loading pipeline
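For reference, a hypothetical DDL sketch of the new `logs` table, inferred from the column list in the `INSERT` statement this PR adds (`bookkeeper_lnk, timestamp, level, subject, message`). The column types and the name of the bookkeeper table's primary key are assumptions:

```sql
CREATE TABLE logs (
    bookkeeper_lnk INTEGER REFERENCES bookkeeper(id),  -- assumed key name
    timestamp      TIMESTAMP,
    level          VARCHAR,
    subject        VARCHAR,
    message        VARCHAR
);
```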
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/compass/src/scraper/mod.rs | Integrates RuntimeLogs into ScrapedOrdinance struct, adds initialization and recording calls, updates test setup |
| crates/compass/src/scraper/log/mod.rs | New module implementing log parsing, database schema, and recording logic for runtime logs |
| crates/compass/src/scraper/log/loglevel.rs | New module defining LogLevel enum with serde deserialization support for Python runtime log levels |
| crates/compass/src/lib.rs | Improves error message clarity when opening ordinance data source |
| crates/compass/src/error.rs | Adds regex::Error to the Error enum for regex-related error handling |
| crates/compass/Cargo.toml | Adds chrono and regex dependencies to support timestamp parsing and log pattern matching |
| Cargo.toml | Adds workspace-level chrono and regex dependencies with appropriate features |
| Cargo.lock | Updates dependency graph to include chrono serde feature, regex, and their transitive dependencies |
```rust
fn parse(input: &str) -> Result<Self> {
    let records: Vec<LogRecord> = input
        .lines()
        .filter(|line| !line.trim().is_empty())
        .filter_map(|line| match LogRecord::parse(line) {
            Ok(record) => {
                trace!("Parsed log line: {}", line);
                Some(record)
            }
            Err(e) => {
                trace!("Failed to parse log line: {}. Error: {}", line, e);
                None
            }
        })
        .filter(|record| {
            (record.level == LogLevel::Info)
                || (record.level == LogLevel::Warning)
                || (record.level == LogLevel::Error)
        })
        .map(|record| {
            debug!("Keeping log record: {:?}", record);
            record
        })
        .collect();

    Ok(RuntimeLogs(records))
}
```
Copilot AI, Dec 13, 2025
The RuntimeLogs::parse method lacks test coverage. Consider adding tests for: parsing logs with different log levels (WARNING, ERROR), handling malformed log lines, verifying that only INFO, WARNING, and ERROR levels are kept after filtering, and testing empty input.
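A test along the lines Copilot suggests could look like the sketch below. It reimplements the filtering pipeline against minimal stand-in types (the names `LogLevel`, `LogRecord`, and the "LEVEL message" line format are assumptions for the sketch; the real code parses the Python log format with a regex), so the assertions on level filtering, malformed lines, and empty input stand alone:

```rust
// Minimal stand-ins for the PR's types.
#[derive(Debug, PartialEq, Clone, Copy)]
enum LogLevel { Trace, Debug, Info, Warning, Error }

#[derive(Debug)]
struct LogRecord { level: LogLevel, message: String }

// Parses lines of the assumed form "LEVEL message".
fn parse_record(line: &str) -> Option<LogRecord> {
    let (level, message) = line.split_once(' ')?;
    let level = match level {
        "TRACE" => LogLevel::Trace,
        "DEBUG" => LogLevel::Debug,
        "INFO" => LogLevel::Info,
        "WARNING" => LogLevel::Warning,
        "ERROR" => LogLevel::Error,
        _ => return None, // malformed line: skipped, as filter_map does
    };
    Some(LogRecord { level, message: message.to_string() })
}

// Mirrors the pipeline: skip blanks, drop unparseable lines, keep only
// INFO / WARNING / ERROR.
fn parse(input: &str) -> Vec<LogRecord> {
    input
        .lines()
        .filter(|l| !l.trim().is_empty())
        .filter_map(parse_record)
        .filter(|r| matches!(r.level, LogLevel::Info | LogLevel::Warning | LogLevel::Error))
        .collect()
}

fn main() {
    let input = "INFO ok\nDEBUG noisy\n\ngarbage\nERROR boom\nWARNING careful\n";
    let records = parse(input);
    // DEBUG, the blank line, and the malformed line are all dropped.
    assert_eq!(records.len(), 3);
    assert_eq!(records[0].level, LogLevel::Info);
    // Empty input yields an empty result.
    assert!(parse("").is_empty());
    println!("kept {} records", records.len());
}
```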
```rust
fn record(&self, conn: &duckdb::Transaction, bookkeeper_id: usize) -> Result<()> {
    trace!("Recording log record: {:?}", self);
    conn.execute(
        "INSERT INTO logs (bookkeeper_lnk, timestamp, level, subject, message) VALUES (?, ?, ?, ?, ?)",
        duckdb::params![
            bookkeeper_id,
            self.timestamp.format("%Y-%m-%d %H:%M:%S").to_string(),
            format!("{:?}", self.level),
            &self.subject,
            &self.message,
        ],
    )?;
    Ok(())
}
```
Copilot AI, Dec 13, 2025
The LogRecord::record database insertion method lacks test coverage. Consider adding a test that verifies the record is correctly inserted into the database with all fields properly formatted.
Thanks for this.
I'm still struggling a little bit to see the value add here. Costs per jurisdiction are already tracked in the jurisdiction.json (do the logs even capture the cost by jurisdiction?), and exceptions are tracked in structured (JSON) log files, broken out by jurisdiction. If we just want these things to be SQL query-able, I would strongly recommend that we parse those dedicated files and upload the info they contain to the database instead of relying on character matching inside of log files.
If I am wrong about the above, then I assume I am missing some other important analysis angle here. Can you please document an example of how you used (or are envisioning other people to use) this data to perform analysis that can't otherwise be done with the information in the database? This would be really helpful both for me and other folks who encounter this data later on.
I left a few other requests below.
P.S. Copilot had a few decent suggestions as well
P.P.S This can definitely wait until you are not swamped with other work 😃
```rust
//! The most verbose levels are ignored to minimize the impact on the
//! final database size. The outputs are archived, so any forensics
```
I'm quite worried about database size, actually. The ERROR messages can get quite big since they require a lot of context (trace, text chunk from file, etc.), and elm uses errors for control flow, so we get several error messages every time we run the decision tree, even if no "real" error was thrown...
Can you check how the size of the error messages compares to the other types in your current database? It would be nice to know whether I am worrying about nothing or this is a legitimate concern.
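One way to run that size check, sketched as a DuckDB query against the new table (column names taken from the `INSERT` statement in this PR; the byte counts are approximate since `LENGTH` counts characters):

```sql
SELECT level,
       COUNT(*)             AS n_records,
       SUM(LENGTH(message)) AS total_chars,
       AVG(LENGTH(message)) AS avg_chars
FROM logs
GROUP BY level
ORDER BY total_chars DESC;
```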
```rust
let path = root.as_ref().join("logs").join("all.log");
dbg!(&path);
let content = tokio::fs::read_to_string(path).await?;
```
We need to handle the case of all.log not existing. This is a redundant file that is not useful for 90% of users, and so in most cases it is not written at all. Writing to the database should not crash just because this file is missing.
I suggest either skipping the log parsing if the file is not found, or making log parsing an option on the command line that the user can enable, in which case the write to the database should crash if the file is not found. Open to other ideas as well.
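The skip-if-missing option could be handled by matching on `ErrorKind::NotFound` rather than pre-checking existence (which races with deletion). A dependency-free sketch using `std::fs` in place of the PR's `tokio::fs` (the function name is hypothetical):

```rust
use std::io::ErrorKind;
use std::path::Path;

// Returns Ok(Some(contents)) if the log file exists, Ok(None) if it is
// absent, and propagates any other I/O error unchanged.
fn read_optional_log(path: &Path) -> std::io::Result<Option<String>> {
    match std::fs::read_to_string(path) {
        Ok(content) => Ok(Some(content)),
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}

fn main() -> std::io::Result<()> {
    // Hypothetical path; absent on most systems, so this takes the None arm.
    let missing = Path::new("logs/all.log.does-not-exist");
    match read_optional_log(missing)? {
        Some(content) => println!("parsed {} bytes of logs", content.len()),
        None => println!("all.log not found; skipping log recording"),
    }
    Ok(())
}
```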
Parse the runtime logs and record them in the DB to support post-processing analysis.
The purpose here is to support statistics on cost per jurisdiction and to easily identify exceptions in large batches.