Conversation

Member

@castelao commented Dec 12, 2025

Parse the runtime logs and record them in the DB to support post-processing analysis.

The purpose here is to support statistics on cost per jurisdiction and to make it easy to identify exceptions in large batches.

@castelao self-assigned this Dec 12, 2025
@castelao added the enhancement (Update to logic or general code improvements), p-medium (Priority: medium), and topic-rust-duckdb (Issues/pull requests related to DuckDB) labels Dec 12, 2025
@castelao marked this pull request as ready for review December 13, 2025 16:13
@castelao requested a review from ppinchuk as a code owner December 13, 2025 16:13
Copilot AI review requested due to automatic review settings December 13, 2025 16:13
@castelao
Member Author

@ppinchuk, this is ready for review.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds functionality to parse and store runtime logs from the ordinance scraper into the database to enable post-processing analysis. The implementation focuses on capturing INFO, WARNING, and ERROR level logs while filtering out more verbose levels to manage database size.

Key changes:

  • New Rust module for parsing runtime logs with regex-based extraction of timestamp, level, subject, and message fields
  • Database schema extension with a new logs table linked to the bookkeeper table
  • Integration of log recording into the scraped ordinance loading pipeline
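
For orientation, the new logs table is only described indirectly above; the following is a hypothetical sketch of the shape it implies. Column names mirror the INSERT statement quoted further down in this review; the types and the exact link to the bookkeeper table are assumptions, not the PR's actual DDL.

// Hypothetical sketch only -- the PR's real schema may differ.
// Column names follow the INSERT in LogRecord::record below; types are guesses.
fn create_logs_table(conn: &duckdb::Connection) -> duckdb::Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS logs (
             bookkeeper_lnk INTEGER,  -- links each log record to a bookkeeper row
             timestamp      TIMESTAMP,
             level          VARCHAR,
             subject        VARCHAR,
             message        VARCHAR
         );",
    )
}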

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 12 comments.

Summary per file:

  • crates/compass/src/scraper/mod.rs: Integrates RuntimeLogs into the ScrapedOrdinance struct, adds initialization and recording calls, updates test setup
  • crates/compass/src/scraper/log/mod.rs: New module implementing log parsing, database schema, and recording logic for runtime logs
  • crates/compass/src/scraper/log/loglevel.rs: New module defining the LogLevel enum with serde deserialization support for Python runtime log levels
  • crates/compass/src/lib.rs: Improves error message clarity when opening the ordinance data source
  • crates/compass/src/error.rs: Adds regex::Error to the Error enum for regex-related error handling
  • crates/compass/Cargo.toml: Adds chrono and regex dependencies to support timestamp parsing and log pattern matching
  • Cargo.toml: Adds workspace-level chrono and regex dependencies with appropriate features
  • Cargo.lock: Updates the dependency graph to include the chrono serde feature, regex, and their transitive dependencies

Comment on lines +131 to +156
fn parse(input: &str) -> Result<Self> {
    let records: Vec<LogRecord> = input
        .lines()
        .filter(|line| !line.trim().is_empty())
        .filter_map(|line| match LogRecord::parse(line) {
            Ok(record) => {
                trace!("Parsed log line: {}", line);
                Some(record)
            }
            Err(e) => {
                trace!("Failed to parse log line: {}. Error: {}", line, e);
                None
            }
        })
        .filter(|record| {
            (record.level == LogLevel::Info)
                || (record.level == LogLevel::Warning)
                || (record.level == LogLevel::Error)
        })
        .map(|record| {
            debug!("Keeping log record: {:?}", record);
            record
        })
        .collect();
    Ok(RuntimeLogs(records))
}

Copilot AI Dec 13, 2025

The RuntimeLogs::parse method lacks test coverage. Consider adding tests for: parsing logs with different log levels (WARNING, ERROR), handling malformed log lines, verifying that only INFO, WARNING, and ERROR levels are kept after filtering, and testing empty input.
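
A minimal sketch of a starting point, as a #[cfg(test)] module next to the parser; the level-filtering cases would also need sample lines in the real log format, which is not shown in this diff, so only the format-independent cases appear here:

#[cfg(test)]
mod tests {
    use super::*;

    // Empty input: every blank line is filtered out before parsing,
    // so no records should be produced.
    #[test]
    fn parse_empty_input_yields_no_records() {
        let logs = RuntimeLogs::parse("").expect("empty input should parse");
        assert!(logs.0.is_empty());
    }

    // Lines the regex cannot parse are dropped by the filter_map rather than
    // turning into an error (this sample line is hypothetical).
    #[test]
    fn parse_skips_malformed_lines() {
        let logs = RuntimeLogs::parse("this is not a log line")
            .expect("malformed lines are skipped");
        assert!(logs.0.is_empty());
    }
}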

Comment on lines 77 to 91
fn record(&self, conn: &duckdb::Transaction, bookkeeper_id: usize) -> Result<()> {
    trace!("Recording log record: {:?}", self);
    conn.execute(
        "INSERT INTO logs (bookkeeper_lnk, timestamp, level, subject, message) VALUES (?, ?, ?, ?, ?)",
        duckdb::params![
            bookkeeper_id,
            self.timestamp.format("%Y-%m-%d %H:%M:%S").to_string(),
            format!("{:?}", self.level),
            &self.subject,
            &self.message,
        ],
    )?;
    Ok(())
}
}

Copilot AI Dec 13, 2025

The LogRecord::record database insertion method lacks test coverage. Consider adding a test that verifies the record is correctly inserted into the database with all fields properly formatted.
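
A rough sketch of such a test against an in-memory DuckDB connection, using a stand-in logs table; the column types and the LogRecord field types (a chrono::NaiveDateTime timestamp is assumed here) are guesses, not the PR's actual definitions:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn record_inserts_all_fields() -> Result<()> {
        // Stand-in table: column names match the INSERT above, types are placeholders.
        let mut conn = duckdb::Connection::open_in_memory()?;
        conn.execute_batch(
            "CREATE TABLE logs (
                 bookkeeper_lnk INTEGER,
                 timestamp VARCHAR,
                 level VARCHAR,
                 subject VARCHAR,
                 message VARCHAR
             );",
        )?;

        // LogRecord field types are assumed; adjust to the actual struct definition.
        let record = LogRecord {
            timestamp: chrono::NaiveDate::from_ymd_opt(2025, 12, 12)
                .unwrap()
                .and_hms_opt(10, 30, 0)
                .unwrap(),
            level: LogLevel::Info,
            subject: "compass.scraper".to_string(),
            message: "hello".to_string(),
        };

        let tx = conn.transaction()?;
        record.record(&tx, 1)?;
        let message: String = tx.query_row(
            "SELECT message FROM logs WHERE bookkeeper_lnk = 1",
            [],
            |row| row.get(0),
        )?;
        tx.commit()?;
        assert_eq!(message, "hello");
        Ok(())
    }
}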

Collaborator

@ppinchuk left a comment

Thanks for this.

I'm still struggling a little bit to see the value add here. Costs per jurisdiction are already tracked in the jurisdiction.json (do the logs even capture the cost by jurisdiction?), and exceptions are tracked in structured (JSON) log files, broken out by jurisdiction. If we just want these things to be SQL query-able, I would strongly recommend that we parse those dedicated files and upload the info they contain to the database instead of relying on character matching inside of log files.

If I am wrong about the above, then I assume I am missing some other important analysis angle here. Can you please document an example of how you used (or are envisioning other people to use) this data to perform analysis that can't otherwise be done with the information in the database? This would be really helpful both for me and other folks who encounter this data later on.

I left a few other requests below.

P.S. Copilot had a few decent suggestions as well

P.P.S This can definitely wait until you are not swamped with other work 😃

Comment on lines +6 to +7
//! The most verbose levels are ignored to minimize the impact on the
//! final database size. The outputs are archived, so any forensics
Collaborator

@ppinchuk Dec 13, 2025

I'm quite worried about database size, actually. The ERROR messages can get quite big since they require a lot of context (trace, text chunk from file, etc.), and elm uses errors for control flow, so we get several error messages every time we run the decision tree, even if no "real" error was thrown...

Can you check how the size of the error messages compares to the other message types in your current database? It would be nice to know if I am worrying about nothing or if this is a legitimate concern.
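
Not part of the PR, but a throwaway sketch of the kind of check that could answer this against the current database, assuming the logs table and columns this PR writes to:

// One-off check: how large are the stored messages per log level?
fn message_size_by_level(conn: &duckdb::Connection) -> duckdb::Result<()> {
    let mut stmt = conn.prepare(
        "SELECT level,
                COUNT(*) AS n,
                AVG(LENGTH(message)) AS avg_len,
                CAST(SUM(LENGTH(message)) AS BIGINT) AS total_len
         FROM logs
         GROUP BY level
         ORDER BY total_len DESC",
    )?;
    let mut rows = stmt.query([])?;
    while let Some(row) = rows.next()? {
        let level: String = row.get(0)?;
        let n: i64 = row.get(1)?;
        let avg_len: f64 = row.get(2)?;
        let total_len: i64 = row.get(3)?;
        println!("{level}: {n} rows, avg {avg_len:.0} chars, {total_len} chars total");
    }
    Ok(())
}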

Comment on lines 176 to 178
let path = root.as_ref().join("logs").join("all.log");
dbg!(&path);
let content = tokio::fs::read_to_string(path).await?;
Collaborator

We need to handle the case of all.log not existing. This is a redundant file that is not useful for 90% of users, and so in most cases it is not written at all. Writing to the database should not crash just because this file is missing.

I suggest either skipping the log parsing if the file is not found, or making log parsing an option on the command line that the user can enable, in which case the write to the database should crash if the file is not found. Open to other ideas as well.
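
A minimal sketch of the first option (skip when the file is absent), assuming the RuntimeLogs::parse function and logging macros used elsewhere in this PR; the surrounding code is paraphrased, not the actual function from the diff:

let path = root.as_ref().join("logs").join("all.log");
let runtime_logs = match tokio::fs::read_to_string(&path).await {
    Ok(content) => Some(RuntimeLogs::parse(&content)?),
    Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
        // all.log is optional; skip log import instead of failing the whole load.
        warn!("Runtime log file {:?} not found, skipping log import", path);
        None
    }
    Err(e) => return Err(e.into()),
};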
