[RFC] Asset Library Pipeline - Entity Extraction to Icon Sourcing #27

@madjin

Overview

This RFC describes a pipeline that extracts entities from daily content and sources visual assets (icons/logos) for them. Collaboration is welcome to improve coverage and methodology.

Related PR: #26

Current Pipeline

Daily Facts → Entity Extraction (LLM) → Inventory → Asset Matching → Coverage Report
                                            ↓
                                    CoinGecko (tokens)
                                    Manual curation (others)

Scripts

Script                                        Purpose
scripts/etl/extract-entities.py               Extract entities via LLM
scripts/posters/fetch-icons.py                Fetch token icons from CoinGecko
scripts/posters/generate-asset-checklist.py   Generate coverage report

Current Coverage

Category    Coverage
Tokens      20% (19/96)
Platforms   17% (33/189)
Tech        11% (18/157)
Projects    14% (34/244)
Plugins     30% (53/175)
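
For reference, the percentages above are the matched count divided by the category total, rounded to a whole number. The snippet below is a minimal sketch of that arithmetic; it is not taken from generate-asset-checklist.py, and the names `COVERAGE`, `coverage_percent`, and `format_report` are hypothetical.

```python
# Counts copied from the coverage table: (entities with an asset, total entities).
COVERAGE = {
    "Tokens": (19, 96),
    "Platforms": (33, 189),
    "Tech": (18, 157),
    "Projects": (34, 244),
    "Plugins": (53, 175),
}


def coverage_percent(matched: int, total: int) -> int:
    """Return coverage as a whole-number percentage (0 for an empty category)."""
    if total == 0:
        return 0
    return round(100 * matched / total)


def format_report(coverage: dict) -> str:
    """Render one 'Category  NN% (matched/total)' line per category."""
    lines = []
    for category, (matched, total) in coverage.items():
        pct = coverage_percent(matched, total)
        lines.append(f"{category:<10} {pct:>3}% ({matched}/{total})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(COVERAGE))
```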

Strengths

  1. Automated extraction - LLM identifies entities from unstructured content
  2. Normalization - The --normalize-only flag dedupes the inventory without re-running extraction (saves API calls)
  3. CoinGecko integration - Reliable token icons with rate limiting
  4. Fuzzy matching - Containment matching reduces false negatives
  5. Pre-scan efficiency - Checks existing files before making API calls
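
To make points 4 and 5 concrete, here is a minimal sketch of containment matching combined with a pre-scan of the icons directory. It assumes "containment" means that the normalized entity name contains, or is contained in, a normalized icon file stem. The function names, the `min_len` guard, and the normalization rule are all assumptions for illustration and may differ from what the actual scripts do.

```python
import re
from pathlib import Path
from typing import Iterable, Optional


def normalize(name: str) -> str:
    """Lowercase a name and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def existing_icon_stems(icon_dir: Path) -> set:
    """Pre-scan: collect file stems already on disk so they need no API call."""
    if not icon_dir.is_dir():
        return set()
    return {p.stem for p in icon_dir.iterdir() if p.is_file()}


def containment_match(entity: str, icon_stems: Iterable[str],
                      min_len: int = 3) -> Optional[str]:
    """Return an icon stem matching the entity, preferring exact matches.

    min_len (an assumed guard) keeps very short names such as "ai" from
    matching inside unrelated longer names such as "openai".
    """
    target = normalize(entity)
    if not target:
        return None
    # Sort for deterministic results when icon_stems is a set.
    normalized = [(stem, normalize(stem)) for stem in sorted(icon_stems)]
    for stem, candidate in normalized:
        if candidate == target:
            return stem
    for stem, candidate in normalized:
        shorter = min(len(candidate), len(target))
        if shorter >= min_len and (candidate in target or target in candidate):
            return stem
    return None
```

With this sketch, `containment_match("Solana Labs", existing_icon_stems(Path("scripts/posters/assets/icons")))` would return `"solana"` if a file such as `solana.png` is already present, so no fetch would be needed for that entity.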

Weaknesses / Open Questions

  1. Low platform coverage - No reliable automated source for platform icons
  2. Manual curation - Plugins/projects need manual sourcing
  3. Entity noise - Extraction sometimes includes generic terms
  4. No OSINT automation - Finding official sources is still manual research
  5. No validation - No way to verify that an icon is authentic or up to date
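
As one possible stopgap for point 3, a post-extraction filter could drop obviously generic terms before they reach the inventory. This is a sketch of a different technique from prompt tuning, not something the pipeline currently does. The `GENERIC_TERMS` list and the `split_generic` helper are hypothetical, and the list would need to be curated against real extraction output.

```python
# Hypothetical stoplist; these terms are illustrative, not from the inventory.
GENERIC_TERMS = frozenset({
    "ai", "api", "app", "blockchain", "community",
    "platform", "plugin", "token",
})


def split_generic(entities):
    """Split names into (kept, dropped), dropping those on the stoplist.

    Comparison is case-insensitive and ignores surrounding or repeated
    whitespace. The original strings are returned unchanged.
    """
    kept, dropped = [], []
    for name in entities:
        key = " ".join(name.lower().split())
        if key in GENERIC_TERMS or len(key) < 2:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped
```

Returning the dropped names as well makes it easy to review what the filter removed and to adjust the stoplist.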

Ideas for Improvement

  • Better extraction prompts to reduce noise
  • GitHub API for project avatars/social images
  • Web scraping for official brand pages (og:image, favicons)
  • Community-sourced icon contributions
  • Image similarity detection to avoid duplicates
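
The og:image / favicon idea could look roughly like the following sketch, which uses only the Python standard library to pull icon candidates out of an already-downloaded HTML page. `IconCandidateParser` and `find_icon_candidates` are hypothetical names. Fetching the page, rate limiting, and checking that the result is really the official brand asset are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconCandidateParser(HTMLParser):
    """Collect og:image and favicon URLs found in a page's tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.candidates = []  # list of (kind, absolute_url) tuples

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("property", "").lower() == "og:image":
            if attr.get("content"):
                url = urljoin(self.base_url, attr["content"])
                self.candidates.append(("og:image", url))
        elif tag == "link" and "icon" in attr.get("rel", "").lower().split():
            if attr.get("href"):
                url = urljoin(self.base_url, attr["href"])
                self.candidates.append(("favicon", url))


def find_icon_candidates(html: str, base_url: str):
    """Return (kind, url) pairs in document order, resolved against base_url."""
    parser = IconCandidateParser(base_url)
    parser.feed(html)
    return parser.candidates
```

Relative URLs are resolved against the page URL, so a `href="favicon.ico"` on `https://example.org/about/` becomes `https://example.org/about/favicon.ico`. Candidates found this way would still need manual review, since og:image is often a social card rather than a logo.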

How to Contribute

  1. Improve coverage - Add CoinGecko ID mappings for missing tokens in fetch-icons.py
  2. Source research - Find reliable APIs/methods for platform/tech icons
  3. Pipeline feedback - Suggest improvements to extraction/matching logic
  4. Icon contributions - Submit PRs with properly sourced icons
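
For contributors adding token mappings, the sketch below shows the general shape of a symbol-to-CoinGecko-ID mapping and how an icon URL can be read from CoinGecko's `/coins/{id}` response, which includes an `image` object with `thumb`, `small`, and `large` URLs. The variable and function names here are hypothetical; check fetch-icons.py for the structure it actually uses before submitting a PR.

```python
from typing import Optional

# Hypothetical mapping from token symbol to CoinGecko coin ID.
# The real mapping in fetch-icons.py may be named or shaped differently.
COINGECKO_IDS = {
    "BTC": "bitcoin",
    "ETH": "ethereum",
    "SOL": "solana",
}

API_BASE = "https://api.coingecko.com/api/v3"


def coin_url(symbol: str) -> Optional[str]:
    """Build the /coins/{id} URL for a symbol, or None if it is unmapped."""
    coin_id = COINGECKO_IDS.get(symbol.upper())
    if coin_id is None:
        return None
    return f"{API_BASE}/coins/{coin_id}"


def icon_url(payload: dict, size: str = "large") -> Optional[str]:
    """Read an icon URL from a parsed /coins/{id} JSON response."""
    return payload.get("image", {}).get(size)


def missing_mappings(symbols) -> list:
    """List the symbols that still need a CoinGecko ID added to the mapping."""
    return [s for s in symbols if s.upper() not in COINGECKO_IDS]
```

Running something like `missing_mappings` over the token entries in the inventory is a quick way to find which symbols still need an ID.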

Files

  • scripts/posters/assets/entity-inventory.json - Current entity list (1143 entities)
  • scripts/posters/assets/asset-checklist.md - Coverage report
  • scripts/posters/assets/icons/ - Downloaded icons
