Conversation

Member

@castelao commented Dec 12, 2025

Parse the runtime logs and record them in the DB to support post-processing analysis.

The purpose here is to support statistics on cost per jurisdiction and to make it easy to identify exceptions in large batches.

@castelao self-assigned this Dec 12, 2025
@castelao added the enhancement (Update to logic or general code improvements), p-medium (Priority: medium), and topic-rust-duckdb (Issues/pull requests related to DuckDB) labels Dec 12, 2025
@castelao marked this pull request as ready for review December 13, 2025 16:13
@castelao requested a review from ppinchuk as a code owner December 13, 2025 16:13
Copilot AI review requested due to automatic review settings December 13, 2025 16:13
@castelao
Member Author

@ppinchuk, this is ready for review.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds functionality to parse and store runtime logs from the ordinance scraper into the database to enable post-processing analysis. The implementation focuses on capturing INFO, WARNING, and ERROR level logs while filtering out more verbose levels to manage database size.

Key changes:

  • New Rust module for parsing runtime logs with regex-based extraction of timestamp, level, subject, and message fields
  • Database schema extension with a new logs table linked to the bookkeeper table
  • Integration of log recording into the scraped ordinance loading pipeline
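
For orientation, the new logs table is only described indirectly above; the following is a hypothetical sketch of the shape it implies. Column names mirror the INSERT statement quoted further down in this review; the types and the exact link to the bookkeeper table are assumptions, not the PR's actual DDL.

// Hypothetical sketch only -- the PR's real schema may differ.
// Column names follow the INSERT in LogRecord::record below; types are guesses.
fn create_logs_table(conn: &duckdb::Connection) -> duckdb::Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS logs (
             bookkeeper_lnk INTEGER,  -- links each log record to a bookkeeper row
             timestamp      TIMESTAMP,
             level          VARCHAR,
             subject        VARCHAR,
             message        VARCHAR
         );",
    )
}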

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 12 comments.

Summary per file:

  • crates/compass/src/scraper/mod.rs: Integrates RuntimeLogs into the ScrapedOrdinance struct, adds initialization and recording calls, updates test setup
  • crates/compass/src/scraper/log/mod.rs: New module implementing log parsing, database schema, and recording logic for runtime logs
  • crates/compass/src/scraper/log/loglevel.rs: New module defining the LogLevel enum with serde deserialization support for Python runtime log levels
  • crates/compass/src/lib.rs: Improves error message clarity when opening the ordinance data source
  • crates/compass/src/error.rs: Adds regex::Error to the Error enum for regex-related error handling
  • crates/compass/Cargo.toml: Adds chrono and regex dependencies to support timestamp parsing and log pattern matching
  • Cargo.toml: Adds workspace-level chrono and regex dependencies with appropriate features
  • Cargo.lock: Updates the dependency graph to include the chrono serde feature, regex, and their transitive dependencies

Comment on lines +131 to +156
fn parse(input: &str) -> Result<Self> {
    let records: Vec<LogRecord> = input
        .lines()
        .filter(|line| !line.trim().is_empty())
        .filter_map(|line| match LogRecord::parse(line) {
            Ok(record) => {
                trace!("Parsed log line: {}", line);
                Some(record)
            }
            Err(e) => {
                trace!("Failed to parse log line: {}. Error: {}", line, e);
                None
            }
        })
        .filter(|record| {
            (record.level == LogLevel::Info)
                || (record.level == LogLevel::Warning)
                || (record.level == LogLevel::Error)
        })
        .map(|record| {
            debug!("Keeping log record: {:?}", record);
            record
        })
        .collect();
    Ok(RuntimeLogs(records))
}

Copilot AI Dec 13, 2025

The RuntimeLogs::parse method lacks test coverage. Consider adding tests for: parsing logs with different log levels (WARNING, ERROR), handling malformed log lines, verifying that only INFO, WARNING, and ERROR levels are kept after filtering, and testing empty input.
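
A minimal sketch of a starting point, as a #[cfg(test)] module next to the parser; the level-filtering cases would also need sample lines in the real log format, which is not shown in this diff, so only the format-independent cases appear here:

#[cfg(test)]
mod tests {
    use super::*;

    // Empty input: every blank line is filtered out before parsing,
    // so no records should be produced.
    #[test]
    fn parse_empty_input_yields_no_records() {
        let logs = RuntimeLogs::parse("").expect("empty input should parse");
        assert!(logs.0.is_empty());
    }

    // Lines the regex cannot parse are dropped by the filter_map rather than
    // turning into an error (this sample line is hypothetical).
    #[test]
    fn parse_skips_malformed_lines() {
        let logs = RuntimeLogs::parse("this is not a log line")
            .expect("malformed lines are skipped");
        assert!(logs.0.is_empty());
    }
}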

Comment on lines 77 to 91
fn record(&self, conn: &duckdb::Transaction, bookkeeper_id: usize) -> Result<()> {
    trace!("Recording log record: {:?}", self);
    conn.execute(
        "INSERT INTO logs (bookkeeper_lnk, timestamp, level, subject, message) VALUES (?, ?, ?, ?, ?)",
        duckdb::params![
            bookkeeper_id,
            self.timestamp.format("%Y-%m-%d %H:%M:%S").to_string(),
            format!("{:?}", self.level),
            &self.subject,
            &self.message,
        ],
    )?;
    Ok(())
}
}

Copilot AI Dec 13, 2025

The LogRecord::record database insertion method lacks test coverage. Consider adding a test that verifies the record is correctly inserted into the database with all fields properly formatted.
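
A rough sketch of such a test against an in-memory DuckDB connection, using a stand-in logs table; the column types and the LogRecord field types (a chrono::NaiveDateTime timestamp is assumed here) are guesses, not the PR's actual definitions:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn record_inserts_all_fields() -> Result<()> {
        // Stand-in table: column names match the INSERT above, types are placeholders.
        let mut conn = duckdb::Connection::open_in_memory()?;
        conn.execute_batch(
            "CREATE TABLE logs (
                 bookkeeper_lnk INTEGER,
                 timestamp VARCHAR,
                 level VARCHAR,
                 subject VARCHAR,
                 message VARCHAR
             );",
        )?;

        // LogRecord field types are assumed; adjust to the actual struct definition.
        let record = LogRecord {
            timestamp: chrono::NaiveDate::from_ymd_opt(2025, 12, 12)
                .unwrap()
                .and_hms_opt(10, 30, 0)
                .unwrap(),
            level: LogLevel::Info,
            subject: "compass.scraper".to_string(),
            message: "hello".to_string(),
        };

        let tx = conn.transaction()?;
        record.record(&tx, 1)?;
        let message: String = tx.query_row(
            "SELECT message FROM logs WHERE bookkeeper_lnk = 1",
            [],
            |row| row.get(0),
        )?;
        tx.commit()?;
        assert_eq!(message, "hello");
        Ok(())
    }
}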

Collaborator

@ppinchuk left a comment

Thanks for this.

I'm still struggling a little bit to see the value add here. Costs per jurisdiction are already tracked in the jurisdiction.json (do the logs even capture the cost by jurisdiction?), and exceptions are tracked in structured (JSON) log files, broken out by jurisdiction. If we just want these things to be SQL query-able, I would strongly recommend that we parse those dedicated files and upload the info they contain to the database instead of relying on character matching inside of log files.

If I am wrong about the above, then I assume I am missing some other important analysis angle here. Can you please document an example of how you used (or are envisioning other people to use) this data to perform analysis that can't otherwise be done with the information in the database? This would be really helpful both for me and other folks who encounter this data later on.

I left a few other requests below.

P.S. Copilot had a few decent suggestions as well

P.P.S This can definitely wait until you are not swamped with other work 😃

Comment on lines +6 to +7
//! The most verbose levels are ignored to minimize the impact on the
//! final database size. The outputs are archived, so any forensics
Collaborator

@ppinchuk Dec 13, 2025

I'm quite worried about database size, actually. The ERROR messages can get quite big since they require a lot of context (trace, text chunk from file, etc.), and elm uses errors for control flow, so we get several error messages every time we run the decision tree, even if no "real" error was thrown...

Can you check how the size of the error messages compares to the other message types in your current database? It would be nice to know if I am worrying about nothing or if this is a legitimate concern.
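
Not part of the PR, but a throwaway sketch of the kind of check that could answer this against the current database, assuming the logs table and columns this PR writes to:

// One-off check: how large are the stored messages per log level?
fn message_size_by_level(conn: &duckdb::Connection) -> duckdb::Result<()> {
    let mut stmt = conn.prepare(
        "SELECT level,
                COUNT(*) AS n,
                AVG(LENGTH(message)) AS avg_len,
                CAST(SUM(LENGTH(message)) AS BIGINT) AS total_len
         FROM logs
         GROUP BY level
         ORDER BY total_len DESC",
    )?;
    let mut rows = stmt.query([])?;
    while let Some(row) = rows.next()? {
        let level: String = row.get(0)?;
        let n: i64 = row.get(1)?;
        let avg_len: f64 = row.get(2)?;
        let total_len: i64 = row.get(3)?;
        println!("{level}: {n} rows, avg {avg_len:.0} chars, {total_len} chars total");
    }
    Ok(())
}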

Comment on lines 176 to 178
let path = root.as_ref().join("logs").join("all.log");
dbg!(&path);
let content = tokio::fs::read_to_string(path).await?;
Collaborator

We need to handle the case of all.log not existing. This is a redundant file that is not useful for 90% of users, and so in most cases it is not written at all. Writing to the database should not crash just because this file is missing.

I suggest either skipping the log parsing if the file is not found, or making log parsing an option on the command line that the user can enable, in which case the write to the database should crash if the file is not found. Open to other ideas as well.
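
A minimal sketch of the first option (skip when the file is absent), assuming the RuntimeLogs::parse function and logging macros used elsewhere in this PR; the surrounding code is paraphrased, not the actual function from the diff:

let path = root.as_ref().join("logs").join("all.log");
let runtime_logs = match tokio::fs::read_to_string(&path).await {
    Ok(content) => Some(RuntimeLogs::parse(&content)?),
    Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
        // all.log is optional; skip log import instead of failing the whole load.
        warn!("Runtime log file {:?} not found, skipping log import", path);
        None
    }
    Err(e) => return Err(e.into()),
};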
