High-signal X/Twitter influencer list — a functional demo built on the open-source multi-agent framework CCCC. A ready-to-use whitelist of influential individual accounts (non-brand/non-official), curated for downstream ingestion. This bundle is a minimal, download-and-use set; the full production flow requires CCCC + RUBE MCP (Twitter tools) for data fetching.
Languages: English (this file) · 中文
- A rigorously filtered list of active, individual X.com authors (no brands/officials), targeting 5k–10k high-value entries.
- Current release (404 records):
  - JSONL: `data/release/influx-latest.jsonl`
  - Gzipped: `data/release/influx-latest.jsonl.gz`
  - Manifest: `data/release/manifest.json`
- Delivered as data only; you don’t need to run the pipeline to use it.
- Components included here (minimal open-source set):
  - Data: `data/release/` (latest JSONL + manifest)
  - Guard: `scripts/pipeline_guard.sh`
  - Schema: `schema/bigv.schema.json`
  - Rules: `lists/rules/brand_heuristics.yml`, `lists/rules/risk_terms.yml`
  - Sample prefetched JSONL: `data/prefetched.sample.jsonl` (for the local filter demo)
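The gzipped release can be read directly, without unpacking first. A minimal stdlib-only sketch (the `read_release` helper name is ours, not part of the repo):

```python
import gzip
import json

def read_release(path):
    """Load a JSONL release file (plain or .gz) into a list of dicts."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

`gzip.open(..., "rt")` yields text lines just like a plain file, so the same one-liner handles both formats.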
- Content ingestion/ranking: a high signal-to-noise whitelist reduces crawl and processing costs.
- Research/monitoring: supports trend tracking, community analysis, influence-network studies.
- Product bootstrapping: import a quality author set for recommendations, alerting, sentiment/market intel.
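As a sketch of the ingestion use case (the `handle` field comes from the release records; the tweet-stream shape and its `author_handle` key are hypothetical):

```python
def build_whitelist(authors):
    """Index release records by lowercased handle for O(1) membership tests."""
    return {a["handle"].lower() for a in authors}

def filter_stream(tweets, whitelist):
    """Keep only tweets whose (hypothetical) author_handle is whitelisted."""
    return [t for t in tweets if t.get("author_handle", "").lower() in whitelist]

# Toy data for illustration only
authors = [{"handle": "alice_ai"}, {"handle": "bob_ml"}]
tweets = [
    {"author_handle": "Alice_AI", "text": "new paper thread"},
    {"author_handle": "some_brand", "text": "promo"},
]
kept = filter_stream(tweets, build_whitelist(authors))
```

Lowercasing on both sides makes the lookup case-insensitive, which matters because X handles are case-preserving but case-insensitive.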
- Individuals only; brand/official/organization accounts excluded via heuristics.
- Thresholds: (Verified AND followers ≥30k) OR followers ≥50k.
- Recency: recent-activity fields (`metrics_30d*`) are kept; stale accounts are filtered upstream.
- Evidence: every record carries `sources.evidence` + `fetched_at` for auditability.
- Hard guards: `handle` and `id` are globally unique; placeholder/“000” follower counts, mock/test prefixes, and non-numeric IDs are rejected; strict schema validation is enforced.
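The hard guards can be approximated locally in a few lines. A simplified Python re-implementation for sanity-checking records (the authoritative enforcement is `scripts/pipeline_guard.sh`; field names follow the release schema, the sample values are invented):

```python
def guard_violations(rec, seen_handles, seen_ids):
    """Return a list of guard violations for one record; mutates the seen-sets."""
    errors = []
    handle = rec.get("handle", "").lower()
    rid = str(rec.get("id", ""))
    followers = str(rec.get("followers_count", ""))
    if handle in seen_handles:
        errors.append("duplicate handle")
    if rid in seen_ids:
        errors.append("duplicate id")
    if not rid.isdigit():
        errors.append("non-numeric id")
    if not followers or set(followers) <= {"0"}:  # empty, "0", or "000"-style placeholder
        errors.append('placeholder/"000" followers')
    if handle.startswith(("mock", "test")):
        errors.append("mock/test prefix")
    if not rec.get("meta", {}).get("sources", {}).get("evidence"):
        errors.append("missing evidence")
    seen_handles.add(handle)
    seen_ids.add(rid)
    return errors
```

Feeding every record through this with shared `seen_handles`/`seen_ids` sets gives a rough local preview of what the shell guard rejects.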
```bash
cp data/release/influx-latest.jsonl .
# or compressed
cp data/release/influx-latest.jsonl.gz . && gunzip influx-latest.jsonl.gz
```

```python
import json

with open("influx-latest.jsonl") as f:
    authors = [json.loads(line) for line in f]

# Example: English AI authors, score >= 60
ai_authors = [
    a for a in authors
    if "ai_core" in a.get("topic_tags", [])
    and a.get("lang_primary") == "en"
    and a.get("score", 0) >= 60
]
print(len(ai_authors))
```

- Full schema: `schema/bigv.schema.json`
- Key fields: `id` (author_id), `handle`, `name`, `verified`, `followers_count`, `lang_primary`, `topic_tags`, `metrics_30d*`, `meta.sources` (with evidence/fetched_at), `provenance_hash`.
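A dependency-free sketch that spot-checks the key fields listed above (the expected Python types here are our assumptions; the authoritative definition is `schema/bigv.schema.json`, which you would normally feed to a full JSON Schema validator):

```python
# Assumed types for a subset of the key fields (see schema/bigv.schema.json)
KEY_FIELDS = {
    "id": str,
    "handle": str,
    "name": str,
    "verified": bool,
    "followers_count": int,
    "lang_primary": str,
    "topic_tags": list,
}

def mistyped_fields(record):
    """Return the key fields that are missing or have an unexpected type."""
    return [
        field for field, expected in KEY_FIELDS.items()
        if not isinstance(record.get(field), expected)
    ]
```

An empty return list means the record's core fields look well-formed; anything returned names a field to investigate before ingesting.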
- The full flow requires CCCC + RUBE MCP (Twitter tools): fetch users → prefetched JSONL → run `influx-harvest x-lists|bulk --prefetched-users <file>` → enforce `scripts/pipeline_guard.sh` (dedup handle/id, evidence required, placeholder/“000” rejection, strict schema) → publish to `data/release/`.
- This repo ships the minimal set (data + guard + schema + rules + sample prefetched); MCP fetching is not included here.
- Apache-2.0 (covers code and released data).
- Built on the CCCC multi-agent framework; thanks to all contributors for curation, filtering, and QA.