@cnlangzi
Owner

@cnlangzi cnlangzi commented Jan 11, 2026

Summary by Sourcery

Update Amazonbot configuration to use a dedicated Amazon-specific parser and IP address source page.

New Features:

  • Introduce an Amazon-specific parser to extract IPv4 prefixes from Amazonbot's documented IP address page.

Enhancements:

  • Switch the Amazonbot configuration to reference the new Amazon parser and updated IP address URL.

- Replace broken S3 URL with official Amazon IP list
- Update reference URL to developer.amazon.com
- Add parser_amazon.go to extract IPs from Amazon's HTML page
- Update amazonbot.yaml to use amazon parser with official URL
- Parser handles embedded JSON in HTML and falls back to IP extraction
@sourcery-ai
Contributor

sourcery-ai bot commented Jan 11, 2026

Reviewer's Guide

Updates Amazonbot configuration to use Amazon’s official IP address source and introduces a dedicated HTML/JSON parser to extract IPv4 prefixes from the Amazonbot IP listing page.

Sequence diagram for Amazonbot IP parsing with AmazonParser

sequenceDiagram
    participant BotUpdater
    participant HTTPClient
    participant AmazonParser

    BotUpdater->>HTTPClient: GET https://developer.amazon.com/amazonbot/ip-addresses/
    HTTPClient-->>BotUpdater: HTML_with_embedded_JSON
    BotUpdater->>AmazonParser: Parse(htmlReader)
    AmazonParser->>AmazonParser: io.ReadAll(r)
    AmazonParser->>AmazonParser: regexp.FindStringSubmatch for JSON_prefixes
    alt JSON_prefixes_found
        AmazonParser->>AmazonParser: json.Unmarshal(reconstructed_JSON)
        alt JSON_unmarshal_success
            AmazonParser-->>BotUpdater: []netip.Prefix_from_prefixes
        else JSON_unmarshal_failure
            AmazonParser->>AmazonParser: parseIPsFromText(raw_text)
            AmazonParser-->>BotUpdater: []netip.Prefix_from_IPs
        end
    else JSON_prefixes_not_found
        AmazonParser-->>BotUpdater: nil
    end
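The JSON branch of the sequence above can be sketched roughly as follows. This is a minimal illustration rather than the code in parser/parser_amazon.go: the regular expression is the first pattern quoted in the review, while the function name `extractPrefixes`, the `prefixList` type, and the `ip_prefix` JSON key are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"net/netip"
	"regexp"
)

// jsonPattern is the first pattern quoted in the review. Its single capture
// group holds the contents of the "prefixes" array.
var jsonPattern = regexp.MustCompile(`\{\s*"creationTime"\s*:\s*"[^"]+"\s*,\s*"prefixes"\s*:\s*\[([^\]]+)\]`)

// prefixList mirrors the reconstructed JSON document.
// The "ip_prefix" key is an assumption for this sketch.
type prefixList struct {
	Prefixes []struct {
		IPPrefix string `json:"ip_prefix"`
	} `json:"prefixes"`
}

// extractPrefixes (hypothetical name) pulls the embedded array out of the
// page, wraps it in a minimal JSON object, and parses each entry as a CIDR.
// The boolean reports whether the JSON path succeeded, so a caller can
// decide to fall back to plain-text extraction.
func extractPrefixes(html string) ([]netip.Prefix, bool) {
	m := jsonPattern.FindStringSubmatch(html)
	if len(m) < 2 {
		return nil, false
	}

	var list prefixList
	reconstructed := `{"prefixes":[` + m[1] + `]}`
	if err := json.Unmarshal([]byte(reconstructed), &list); err != nil {
		return nil, false
	}

	var out []netip.Prefix
	for _, p := range list.Prefixes {
		if pfx, err := netip.ParsePrefix(p.IPPrefix); err == nil {
			out = append(out, pfx)
		}
	}
	return out, true
}
```

Returning a boolean alongside the slice keeps "pattern not found" distinct from "found but empty", which is the distinction the review below asks for.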

Class diagram for new AmazonParser and parser registration

classDiagram
    class Parser {
        <<interface>>
        Name() string
        Parse(r io.Reader) []netip.Prefix
    }

    class AmazonParser {
        Name() string
        Parse(r io.Reader) []netip.Prefix
        parseIPsFromText(data string) []netip.Prefix
    }

    class ParserRegistry {
        RegisterParser(name string, parser Parser)
        GetParser(name string) Parser
    }

    Parser <|.. AmazonParser
    ParserRegistry ..> Parser : uses

    class PackageParserInit {
        init()
    }

    PackageParserInit ..> ParserRegistry : calls_RegisterParser
    PackageParserInit ..> AmazonParser : creates_instance
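In Go, the relationships in the class diagram might look something like the sketch below. The method and function names come from the diagram. The map-backed registry, the error return on `Parse` (suggested by the `return nil, nil` in the reviewed code), and the stub body are assumptions for illustration.

```go
package main

import (
	"io"
	"net/netip"
)

// Parser is the interface from the diagram. The error return is an
// assumption based on the `return nil, nil` seen in the reviewed code.
type Parser interface {
	Name() string
	Parse(r io.Reader) ([]netip.Prefix, error)
}

// parsers is a hypothetical map-backed registry keyed by parser name.
var parsers = map[string]Parser{}

// RegisterParser stores a parser under the given name.
func RegisterParser(name string, p Parser) {
	parsers[name] = p
}

// GetParser returns the parser registered under name, or nil if none exists.
func GetParser(name string) Parser {
	return parsers[name]
}

// AmazonParser is a stub standing in for the real implementation.
type AmazonParser struct{}

// Name returns the key used in the bot configuration.
func (AmazonParser) Name() string { return "amazon" }

// Parse only drains the reader here; the real parser extracts prefixes.
func (AmazonParser) Parse(r io.Reader) ([]netip.Prefix, error) {
	_, err := io.ReadAll(r)
	return nil, err
}

// init registers the parser at package load so it can be looked up by name.
func init() {
	RegisterParser("amazon", AmazonParser{})
}
```

Registering from `init` means a configuration entry only needs the string key to select the parser.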

File-Level Changes

Change: Update Amazonbot bot configuration to point to the new IP source and use a dedicated parser.
Details:
  • Change reference URL comment to Amazon’s Amazonbot documentation page.
  • Switch the parser identifier from the generic Google parser to a new Amazon-specific parser name.
  • Update the IP list URL to the Amazonbot IP addresses endpoint.
Files: bots/conf.d/amazonbot.yaml

Change: Add a new Amazon-specific parser that extracts IPv4 prefixes from the Amazonbot IP address HTML page.
Details:
  • Implement AmazonParser with Name and Parse methods and register it under the "amazon" parser key.
  • In Parse, read the full response body and use regex to extract the JSON block containing prefixes from an HTML page, reconstructing a minimal JSON structure.
  • Unmarshal the reconstructed JSON into a struct with ipv4 prefixes, parsing each as CIDR first and falling back to single IPs as /32 prefixes.
  • Add a fallback parseIPsFromText helper that scans raw HTML for IPv4 addresses and returns unique /32 prefixes when JSON parsing fails.
  • Register the parser in init so it can be looked up by the parser registry.
Files: parser/parser_amazon.go
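The "CIDR first, then single IP" step described above could be sketched as a small helper. The name `toPrefix` is hypothetical and not taken from the PR; only the `net/netip` calls are standard library.

```go
package main

import "net/netip"

// toPrefix (hypothetical helper) parses s as a CIDR prefix first. If that
// fails, it tries a bare address and widens it to a single-host prefix,
// which is /32 for IPv4.
func toPrefix(s string) (netip.Prefix, bool) {
	if p, err := netip.ParsePrefix(s); err == nil {
		return p, true
	}
	if a, err := netip.ParseAddr(s); err == nil {
		return netip.PrefixFrom(a, a.BitLen()), true
	}
	return netip.Prefix{}, false
}
```

For example, `toPrefix("192.0.2.7")` yields `192.0.2.7/32`, while a value that is neither a CIDR nor an address is reported as unparseable instead of being silently dropped.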

@codecov

codecov bot commented Jan 11, 2026

Codecov Report

❌ Patch coverage is 51.08696% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.85%. Comparing base (02d1225) to head (c77cb86).
⚠️ Report is 5 commits behind head on main.

Files with missing lines    Patch %   Lines
parser/parser_amazon.go     59.64%    16 Missing and 7 partials ⚠️
parser/http.go              0.00%     14 Missing ⚠️
parser/parser_ahrefs.go     61.90%    5 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #11      +/-   ##
==========================================
- Coverage   74.84%   71.85%   -2.99%     
==========================================
  Files          14       17       +3     
  Lines         640      732      +92     
==========================================
+ Hits          479      526      +47     
- Misses        117      152      +35     
- Partials       44       54      +10     

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In AmazonParser.Parse, returning (nil, nil) when neither JSON pattern matches makes it hard for callers to distinguish between “no data” and “parse failed”; consider returning a descriptive error in that case.
  • The current JSON extraction via HTML regexes (jsonPattern/jsonPattern2) is quite brittle; if possible, anchor the patterns more strictly to the surrounding markup or use an HTML parser to locate the JSON block to reduce false matches and breakage on minor page changes.
  • In parseIPsFromText, you silently skip invalid IPs; if this is expected, consider at least counting or logging how many were discarded so callers can detect unexpected format changes, or document that behavior clearly in a comment.
## Individual Comments

### Comment 1
<location> `parser/parser_amazon.go:25-32` </location>
<code_context>
+	}
+
+	// Extract JSON from HTML page - look for the code block with prefixes
+	jsonPattern := regexp.MustCompile(`\{\s*"creationTime"\s*:\s*"[^"]+"\s*,\s*"prefixes"\s*:\s*\[([^\]]+)\]`)
+	matches := jsonPattern.FindStringSubmatch(string(data))
+	if len(matches) < 2 {
+		// Try alternate pattern - the full JSON object
+		jsonPattern2 := regexp.MustCompile(`\{[^{}]*"prefixes"\s*:\s*\[([^\]]*)\][^{}]*\}`)
+		matches = jsonPattern2.FindStringSubmatch(string(data))
+		if len(matches) < 2 {
+			return nil, nil
+		}
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Consider falling back to text-based IP extraction when both JSON regexes fail instead of returning nil, nil.

Returning `nil, nil` here makes it impossible for callers to tell whether parsing failed or the page truly had no prefixes, and it misses the chance to reuse the existing `parseIPsFromText` fallback. Invoking that fallback instead would make this path more robust to changes in the JSON structure while still extracting IPs when they’re present in the page text.

```suggestion
	if len(matches) < 2 {
		// Try alternate pattern - the full JSON object
		jsonPattern2 := regexp.MustCompile(`\{[^{}]*"prefixes"\s*:\s*\[([^\]]*)\][^{}]*\}`)
		matches = jsonPattern2.FindStringSubmatch(string(data))
		if len(matches) < 2 {
			// Fall back to text-based IP extraction if JSON parsing fails
			return parseIPsFromText(data)
		}
	}
```
</issue_to_address>
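For reference, a text-based fallback of the kind this suggestion relies on could look like the sketch below. It follows the behavior in the reviewer's guide (scan for IPv4 addresses, return unique /32 prefixes), but the regular expression and the body are assumptions and may differ from the helper in the PR.

```go
package main

import (
	"net/netip"
	"regexp"
)

// ipv4Pattern loosely matches dotted quads. Range checking is left to
// netip.ParseAddr, which rejects out-of-range octets.
var ipv4Pattern = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)

// parseIPsFromText scans raw page text for IPv4 addresses and returns each
// distinct address once as a /32 prefix, in order of first appearance.
// Candidates that fail to parse are skipped.
func parseIPsFromText(data string) []netip.Prefix {
	seen := make(map[netip.Addr]struct{})
	var out []netip.Prefix
	for _, m := range ipv4Pattern.FindAllString(data, -1) {
		addr, err := netip.ParseAddr(m)
		if err != nil {
			continue
		}
		if _, dup := seen[addr]; dup {
			continue
		}
		seen[addr] = struct{}{}
		out = append(out, netip.PrefixFrom(addr, 32))
	}
	return out
}
```

Because this sketch takes a string, a call site holding the response body as bytes would need to convert it first.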


@github-actions

github-actions bot commented Jan 11, 2026

Benchmark Results

BenchmarkFindBotByUA_Hit_First             	 1000000	      1045 ns/op	      12 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_First-4           	 3657594	       631.0 ns/op	      15 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_First-8           	 2528448	       508.5 ns/op	      10 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Middle            	 1508474	       808.2 ns/op	      17 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Middle-4          	 4586226	       479.0 ns/op	       8 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Middle-8          	 4669262	       494.3 ns/op	       6 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Last              	 1000000	      1019 ns/op	       9 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Last-4            	--- FAIL: BenchmarkFindBotByUA_Hit_Last-4
BenchmarkFindBotByUA_Hit_Last-8            	 4381118	       438.5 ns/op	      12 B/op	       0 allocs/op
BenchmarkFindBotByUA_Miss                  	  508153	      2567 ns/op	      53 B/op	       0 allocs/op
BenchmarkFindBotByUA_Miss-4                	 1354501	       856.6 ns/op	      16 B/op	       0 allocs/op
BenchmarkFindBotByUA_Miss-8                	 1316079	       875.8 ns/op	      14 B/op	       0 allocs/op
BenchmarkFindBotByUA_CaseSensitive         	 1460466	       987.8 ns/op	      23 B/op	       0 allocs/op
BenchmarkFindBotByUA_CaseSensitive-4       	--- FAIL: BenchmarkFindBotByUA_CaseSensitive-4
BenchmarkFindBotByUA_CaseSensitive-8       	 3838663	       513.2 ns/op	      18 B/op	       0 allocs/op
BenchmarkValidate_KnownBot_IPHit           	 1527024	      1063 ns/op	      23 B/op	       0 allocs/op
BenchmarkValidate_KnownBot_IPHit-4         	 2579180	       396.5 ns/op	       8 B/op	       0 allocs/op
BenchmarkValidate_KnownBot_IPHit-8         	 3122292	       381.4 ns/op	      10 B/op	       0 allocs/op
BenchmarkValidate_Browser                  	  236498	      5817 ns/op	     167 B/op	       1 allocs/op
BenchmarkValidate_Browser-4                	  626152	      2042 ns/op	      68 B/op	       0 allocs/op
BenchmarkValidate_Browser-8                	  581090	      1996 ns/op	      26 B/op	       0 allocs/op
BenchmarkContainsWord                      	73572724	        19.37 ns/op	       0 B/op	       0 allocs/op
BenchmarkContainsWord-4                    	74038974	        16.36 ns/op	       0 B/op	       0 allocs/op
BenchmarkContainsWord-8                    	73932078	        16.24 ns/op	       0 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA                	 1000000	      1184 ns/op	      14 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA-4              	 2781932	       434.8 ns/op	      10 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA-8              	 2815316	       439.0 ns/op	      10 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA_IPMismatch     	  792002	      1767 ns/op	      54 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA_IPMismatch-4   	 2231196	       604.3 ns/op	      20 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA_IPMismatch-8   	 2010954	       583.5 ns/op	       9 B/op	       0 allocs/op
BenchmarkValidate_BrowserUA                	  290660	      4619 ns/op	     131 B/op	       1 allocs/op
BenchmarkValidate_BrowserUA-4              	  855226	      1498 ns/op	      40 B/op	       0 allocs/op
BenchmarkValidate_BrowserUA-8              	  849378	      1547 ns/op	      52 B/op	       0 allocs/op
BenchmarkValidate_UnknownBotUA             	 8341300	       172.0 ns/op	       5 B/op	       0 allocs/op
BenchmarkValidate_UnknownBotUA-4           	21643290	        57.28 ns/op	       2 B/op	       0 allocs/op
BenchmarkValidate_UnknownBotUA-8           	22185267	        57.49 ns/op	       1 B/op	       0 allocs/op
BenchmarkContainsIP                        	56004258	        26.72 ns/op	       1 B/op	       0 allocs/op
BenchmarkContainsIP-4                      	100000000	        11.93 ns/op	       0 B/op	       0 allocs/op
BenchmarkContainsIP-8                      	100000000	       946.9 ns/op	       1 B/op	       0 allocs/op
BenchmarkFindBotByUA                       	  814353	      1532 ns/op	       9 B/op	       0 allocs/op
BenchmarkFindBotByUA-4                     	 2116585	       577.3 ns/op	      13 B/op	       0 allocs/op
BenchmarkFindBotByUA-8                     	 2037597	       592.0 ns/op	      15 B/op	       0 allocs/op
BenchmarkClassifyUA                        	 1846064	       615.4 ns/op	      18 B/op	       0 allocs/op
BenchmarkClassifyUA-4                      	 4847744	       255.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkClassifyUA-8                      	 4815897	       249.3 ns/op	       0 B/op	       0 allocs/op
Benchmark_MixedTraffic                     	  485551	      2539 ns/op	      10 B/op	       0 allocs/op
Benchmark_MixedTraffic-4                   	 1274754	       923.9 ns/op	      22 B/op	       0 allocs/op
Benchmark_MixedTraffic-8                   	 1300017	       956.1 ns/op	      27 B/op	       0 allocs/op
BenchmarkReload                            	     819	   1653929 ns/op	  693635 B/op	    6642 allocs/op
BenchmarkReload-4                          	     969	   1246198 ns/op	  682546 B/op	    6533 allocs/op
BenchmarkReload-8                          	SIGQUIT: quit

- Fix AmazonParser fallback to parse IPs from text when JSON extraction fails
- Add unit tests for AmazonParser (JSON and fallback modes)
- Add integration test for AmazonBot (parsed 519 prefixes)
- Add AhrefsParser and integration test (parsed 9870 prefixes)
…ion test

- Create parser/http.go with shared fetchFromURL helper function
- Remove duplicate fetchFromURL from parser_google_test.go
- Move AmazonBot integration test to parser_amazon_test.go
- Update imports in parser_google_test.go
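
A shared fetch helper like the one mentioned in these commit notes might look roughly like this. Only the name `fetchFromURL` comes from the commit message. The signature, the 30-second timeout, and the status check are assumptions, since parser/http.go is not shown in this conversation.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchFromURL downloads the body at url. The signature and the timeout
// value are assumptions for this sketch.
func fetchFromURL(url string) ([]byte, error) {
	client := &http.Client{Timeout: 30 * time.Second}

	resp, err := client.Get(url)
	if err != nil {
		return nil, fmt.Errorf("fetch %s: %w", url, err)
	}
	defer resp.Body.Close()

	// Treat anything other than 200 as a failure so an error page is not
	// handed to a parser as if it were the IP list.
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch %s: unexpected status %s", url, resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```

Keeping the helper in one file lets the integration tests for each parser share the same timeout and status handling.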
@cnlangzi cnlangzi merged commit 022bce7 into main Jan 11, 2026
3 of 5 checks passed
@cnlangzi cnlangzi deleted the fix/amazonbot branch January 11, 2026 09:32