Building with Bigdata.com
- Overview
- Key Features
- Installation
- Authentication Setup
- Core Workflows
- Core Functionalities
- Parameter Deep Dive
- Interactive Tutorial
- Examples
- Support and Resources
- License
Bigdata Research Tools is a Python library designed to automate and streamline research workflows using the Bigdata.com API. It provides high-level, plug-and-play functions for building customized research processes with minimal effort.
- ⚡ Concurrent Search: Execute multiple searches efficiently with built-in rate limiting.
- 🛡️ Thread-Safe Operations: Safe concurrent access for all workflows.
- 🧭 Guided Workflow Builder: Easily build guided research workflows; see ready-to-use examples in the Bigdata Cookbook Repository.
- 🎨 Interactive Visualizations: Create dashboards and charts for your results.
bigdata_research_tools/
├── workflows/   # High-level research workflows
├── search/      # Search utilities and query builders
├── visuals/     # Visualization and dashboard tools
├── labeler/     # AI-powered content labeling
├── llm/         # LLM integration (OpenAI, Bedrock)
└── prompts/     # Prompt templates for AI models
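For orientation, the packages above map onto the entry points used throughout this README. The imports below all appear verbatim in the examples that follow (a quick reference only, not an exhaustive list of the public API):

from bigdata_research_tools.client import bigdata_connection                    # API client helper
from bigdata_research_tools.workflows import ThematicScreener, NarrativeMiner   # high-level workflows
from bigdata_research_tools.workflows.risk_analyzer import RiskAnalyzer
from bigdata_research_tools.search.search import run_search                     # concurrent search
from bigdata_research_tools.search.query_builder import build_batched_query, create_date_ranges
from bigdata_research_tools.mindmap import MindMapGenerator, MindMap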
Install the library using pip:
pip install bigdata-research-tools

Install additional packages for specific features:
# For OpenAI integration
pip install bigdata-research-tools[openai]
# For Azure OpenAI integration
pip install bigdata-research-tools[azure]
# For AWS Bedrock integration
pip install bigdata-research-tools[bedrock]
# For all optional features
pip install bigdata-research-tools[azure,bedrock,openai]

Set up your credentials using environment variables:
export BIGDATA_API_KEY="your_api_key"
# or
export BIGDATA_USERNAME="your_username"
export BIGDATA_PASSWORD="your_password"

Create a .env file in your project directory:
BIGDATA_API_KEY="your_api_key"
# or
BIGDATA_USERNAME="your_username"
BIGDATA_PASSWORD="your_password"

Load the environment variables in your Python script:
from dotenv import load_dotenv
load_dotenv()

Bigdata Research Tools includes several end-to-end workflows built with the Bigdata API, such as:
- 📊 Thematic Screeners: Analyze company exposure to specific themes
- ⚠️ Risk Analyzer: Assess company risk exposure to various scenarios
- 📰 Narrative Miners: Track narrative evolution across news, transcripts, and filings
Beyond these workflows, core functionalities such as search, the LLM integrations, and the Labeler are the building blocks of many other workflows and use cases, including:
- Market Analysis
- Daily Digests
- Systematic Monitoring
- Report Generation
You can find these workflows and additional examples in the Cookbooks section of the Bigdata documentation: Cookbooks – Bigdata docs.
If you're running these workflows in a Notebook, you'll need to set up asyncio properly to avoid event loop conflicts:
import nest_asyncio

# Notebooks already run an asyncio event loop; patch it so the library's
# nested asyncio calls don't raise "event loop is already running"
nest_asyncio.apply()

The Thematic Screener analyzes company exposure to specific themes by generating sub-themes and assigning exposure scores. It returns structured tables with labeled text and a final basket of companies sorted by exposure score, along with a final motivation.
from bigdata_research_tools.workflows import ThematicScreener
from bigdata_research_tools.client import bigdata_connection
from bigdata_client.models.search import DocumentType
# Get companies from a watchlist
bigdata = bigdata_connection()
watchlist = bigdata.watchlists.get("watchlist_id")
companies = bigdata.knowledge_graph.get_entities(watchlist.items)
screener = ThematicScreener(
llm_model_config="openai::gpt-4o-mini",
main_theme="Electric Vehicles",
companies=companies,
start_date="2024-01-01",
end_date="2024-12-31",
document_type=DocumentType.TRANSCRIPTS,
fiscal_year=2024
)
results = screener.screen_companies(
export_path="thematic_screening.xlsx"
)

Parameters to initialize the ThematicScreener class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `llm_model_config` | `str` | ✅ | LLM model identifier |
| `main_theme` | `str` | ✅ | Main theme to analyze |
| `companies` | `List[Company]` | ✅ | List of companies to screen (see Company Objects) |
| `start_date` | `str` | ✅ | Start date (YYYY-MM-DD) |
| `end_date` | `str` | ✅ | End date (YYYY-MM-DD) |
| `document_type` | `DocumentType` | ✅ | Document scope (see Document Types) |
| `fiscal_year` | `int` | ✅ | Required for transcripts/filings. Set to None for news (see Fiscal Year Guide) |
| `sources` | `List[str]` | ❌ | Source filters |
| `rerank_threshold` | `float` | ❌ | Reranking threshold (0-1) (see Reranker Guide) |
| `focus` | `str` | ❌ | Additional focus description (see Focus Parameter Guide) |
Parameters to run the analysis end-to-end.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `document_limit` | `int` | `10` | Documents per query (see Document Limit Guide) |
| `batch_size` | `int` | `10` | Batch size for processing (see Batch Size Parameter Guide) |
| `frequency` | `str` | `"3M"` | Date range frequency (see Frequency Parameter Guide) |
| `word_range` | `Tuple[int, int]` | `(50, 100)` | Word range for motivations |
| `export_path` | `str` | `None` | Excel export path |
results = {
"df_labeled": DataFrame, # Labeled search results
"df_company": DataFrame, # Company-level theme scores
"df_industry": DataFrame, # Industry-level aggregations
"df_motivation": DataFrame, # Company motivations
"theme_tree": ThemeTree # Generated theme hierarchy
}

The Risk Analyzer assesses company exposure to risk scenarios, generating a detailed risk taxonomy and calculating exposure scores. It returns structured tables with labeled text and a final basket of companies sorted by risk exposure, along with a final motivation.
from bigdata_research_tools.client import bigdata_connection
from bigdata_research_tools.workflows.risk_analyzer import RiskAnalyzer
from bigdata_client.models.search import DocumentType
# Get companies from a watchlist
bigdata = bigdata_connection()
watchlist = bigdata.watchlists.get("watchlist_id")
companies = bigdata.knowledge_graph.get_entities(watchlist.items)
analyzer = RiskAnalyzer(
llm_model_config="openai::gpt-4o-mini",
main_theme="Supply Chain Disruption",
companies=companies,
start_date="2024-01-01",
end_date="2024-12-31",
document_type=DocumentType.NEWS,
keywords=["supply chain", "logistics"],
control_entities={"place": ["China", "Taiwan"]}
)
results = analyzer.screen_companies(
export_path="risk_analysis.xlsx"
)

Parameters to initialize the RiskAnalyzer class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `llm_model_config` | `str` | ✅ | LLM model identifier |
| `main_theme` | `str` | ✅ | Main risk theme |
| `companies` | `List[Company]` | ✅ | Companies to analyze (see Company Objects) |
| `start_date` | `str` | ✅ | Analysis start date |
| `end_date` | `str` | ✅ | Analysis end date |
| `document_type` | `DocumentType` | ✅ | Document scope (see Document Types) |
| `keywords` | `List[str]` | ❌ | Keyword filters |
| `control_entities` | `Dict[str, List[str]]` | ❌ | Entity co-mention filters (see Control Entities) |
| `fiscal_year` | `int` | ✅ | Required for transcripts/filings. Set to None for news (see Fiscal Year Guide) |
| `sources` | `List[str]` | ❌ | Source filters |
| `rerank_threshold` | `float` | ❌ | Reranking threshold (0-1) (see Reranker Guide) |
| `focus` | `str` | ❌ | Additional focus description (see Focus Parameter Guide) |
Parameters to run the analysis end-to-end.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `document_limit` | `int` | `10` | Documents per query (see Document Limit Guide) |
| `batch_size` | `int` | `10` | Batch size for processing (see Batch Size Parameter Guide) |
| `frequency` | `str` | `"3M"` | Date range frequency (see Frequency Parameter Guide) |
| `word_range` | `Tuple[int, int]` | `(50, 100)` | Word range for motivations |
| `export_path` | `str` | `None` | Excel export path |
results = {
"df_labeled": DataFrame, # Labeled search results
"df_company": DataFrame, # Company risk scores
"df_industry": DataFrame, # Industry risk aggregations
"df_motivation": DataFrame, # Risk motivations
"risk_tree": ThemeTree # Risk taxonomy tree
}

The Narrative Miner tracks how specific narratives evolve over time across different document types. It returns structured tables with labeled text.
from bigdata_research_tools.workflows import NarrativeMiner
from bigdata_client.models.search import DocumentType
narrative_miner = NarrativeMiner(
narrative_sentences=[
"Artificial Intelligence Development",
"Machine Learning Innovation",
"Data Privacy Concerns"
],
llm_model_config="openai::gpt-4o-mini",
start_date="2024-01-01",
end_date="2024-12-31",
fiscal_year=2024,
document_type=DocumentType.NEWS
)
results = narrative_miner.mine_narratives(
export_path="narrative_analysis.xlsx"
)

Parameters to initialize the NarrativeMiner class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `narrative_sentences` | `List[str]` | ✅ | List of narrative sentences to track |
| `start_date` | `str` | ✅ | Start date in YYYY-MM-DD format |
| `end_date` | `str` | ✅ | End date in YYYY-MM-DD format |
| `llm_model_config` | `str` | ✅ | LLM model in format "provider::model" |
| `document_type` | `DocumentType` | ✅ | Document scope (see Document Types) |
| `fiscal_year` | `int` | ✅ | Fiscal year for transcripts/filings. Set to None for news |
| `sources` | `List[str]` | ❌ | Filter by specific news sources |
| `rerank_threshold` | `float` | ❌ | Reranking threshold (0-1) (see Reranker Guide) |
Parameters to run the analysis end-to-end.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `document_limit` | `int` | `10` | Documents per query (see Document Limit Guide) |
| `batch_size` | `int` | `10` | Batch size for processing (see Batch Size Parameter Guide) |
| `frequency` | `str` | `"3M"` | Date range frequency (see Frequency Parameter Guide) |
| `export_path` | `str` | `None` | Excel export path |
results = {
"df_labeled": DataFrame # Labeled search results with narrative classifications
}

The MindMap Generator creates hierarchical tree structures that decompose complex themes into organized sub-themes, enabling structured research and analysis. It offers three generation modes: one-shot, refined, and dynamic evolution.
from bigdata_research_tools.mindmap import MindMapGenerator, MindMap
# Create a generator
generator = MindMapGenerator(
llm_model_config_base="openai::gpt-4o-mini",
llm_model_config_reasoning="openai::gpt-4o" # Optional: for refined generation
)
# Basic MindMap structure
mindmap = MindMap(
label="Climate Risk",
node=1,
summary="Climate-related financial risks affecting business operations",
children=[
MindMap(label="Physical Risks", node=2, summary="Direct climate impacts"),
MindMap(label="Transition Risks", node=3, summary="Policy and market changes")
]
)

generate_one_shot() creates a complete mind map in a single LLM call, optionally grounded in real-time search results.
# Simple one-shot generation (no search grounding)
mindmap, result = generator.generate_one_shot(
main_theme="AI in Healthcare",
focus="Focus on diagnostic applications and regulatory challenges",
allow_grounding=False,
map_type="theme" # or "risk" for risk analysis
)
# One-shot with search grounding
mindmap, result = generator.generate_one_shot(
main_theme="Supply Chain Disruptions",
focus="Post-pandemic resilience strategies",
allow_grounding=True,
date_range=("2024-01-01", "2024-12-31"),
map_type="risk"
)
print(f"Generated mindmap with {len(mindmap.get_terminal_labels())} terminal nodes")
# Result includes: mindmap_text, mindmap_df, mindmap_json, search_queries (if grounded)

How One-Shot Works:
- Without grounding: LLM generates mind map purely from its training knowledge
- With grounding: LLM proposes search queries → searches executed → LLM creates mind map using search results
- Use cases: Initial exploration, baseline analysis, quick prototyping
generate_refined() enhances an existing mind map by having the LLM propose targeted searches, then incorporating the results to expand and improve the structure.
# Start with an initial mindmap (from one-shot or manual creation)
initial_mindmap_json = mindmap.to_json()
# Refine using search-based enhancement
refined_mindmap, result = generator.generate_refined(
main_theme="Cybersecurity Threats",
focus="Enterprise security and incident response",
initial_mindmap=initial_mindmap_json,
map_type="risk",
date_range=("2024-06-01", "2024-12-31"),
chunk_limit=25, # Results per search query
output_dir="./refined_outputs",
filename="cybersecurity_refined.json"
)
print(f"Search queries used: {result['search_queries']}")
print(f"Refined mindmap has {len(refined_mindmap.get_terminal_labels())} terminal nodes")How Refined Generation Works:
- Analysis: LLM analyzes the initial mind map and identifies knowledge gaps
- Search Proposal: LLM proposes specific search queries to fill those gaps
- Search Execution: Queries are executed against Bigdata's news/documents database
- Enhancement: LLM incorporates search results to expand/refine the mind map
- Output: Enhanced mind map with real-world grounding and additional detail
Use cases:
- Adding depth and specificity to broad themes
generate_dynamic() creates mind maps that evolve over time intervals, showing how themes develop and change across different periods.
from bigdata_research_tools.search.query_builder import create_date_ranges
# Create time intervals
month_intervals = create_date_ranges("2024-01-01", "2024-06-30", "M")
month_names = ["Jan2024", "Feb2024", "Mar2024", "Apr2024", "May2024", "Jun2024"]
# Generate evolving mindmaps
mindmap_objects, results = generator.generate_dynamic(
main_theme="ESG Investment Trends",
focus="Institutional investor behavior and regulatory changes",
month_intervals=month_intervals,
month_names=month_names,
map_type="theme",
chunk_limit=20,
output_dir="./dynamic_evolution"
)
# Access evolution over time
for month, mindmap_obj in mindmap_objects.items():
    print(f"{month}: {len(mindmap_obj.get_terminal_labels())} terminal nodes")

How Dynamic Generation Works:
- Base Generation: Creates initial mind map for the overall theme
- Iterative Refinement: For each time interval:
  - Uses previous month's mind map as starting point
  - Searches for period-specific information
  - Refines mind map based on that period's context
  - Uses refined version as input for next iteration
- Evolution Tracking: Each step builds upon previous knowledge while incorporating new temporal context
Use cases:
- Tracking narrative evolution in financial markets
- Analyzing how risk factors emerge and develop over time
- Understanding seasonal or cyclical patterns in business themes
| Parameter | Type | Description |
|---|---|---|
| `main_theme` | `str` | Core topic to analyze |
| `focus` | `str` | Specific guidance for analysis direction |
| `allow_grounding` | `bool` | Enable search-based grounding (one-shot only) |
| `map_type` | `str` | "theme", "risk", or "risk_entity" |
| `date_range` | `tuple[str, str]` | Search date range (YYYY-MM-DD format) |
| `chunk_limit` | `int` | Results per search query |
| `output_dir` | `str` | Directory for saving results |
# Visualize the mindmap
mindmap.visualize(engine="graphviz") # or "plotly", "matplotlib"
# Export to different formats
df = mindmap.to_dataframe() # Pandas DataFrame
json_str = mindmap.to_json() # JSON string
mindmap.save_json("output.json") # Save to file
# Access terminal nodes (leaf nodes)
terminal_labels = mindmap.get_terminal_labels()
terminal_summaries = mindmap.get_terminal_summaries()

Bigdata Research Tools enables advanced query construction for the Bigdata Search API. The Query Builder combines Entity, Keyword, and Similarity Search, allowing users to control the query logic and optimize its efficiency with entity batching and control entities. It also supports different Document Types and specific Sources.
More information on Bigdata Search API's query filters can be found at Bigdata.com - Query Filters.
from bigdata_research_tools.search.query_builder import (
EntitiesToSearch,
build_batched_query
)
from bigdata_client.models.search import DocumentType
from bigdata_research_tools.client import bigdata_connection
bigdata = bigdata_connection()
company_names = ["Apple Inc", "Microsoft Corp", "Tesla Inc"]
companies = []
for name in company_names:
    results = bigdata.knowledge_graph.find_companies(name)
    if results:
        companies.append(next(iter(results)))
control_entities = {
"people": ["Tim Cook", "Satya Nadella"],
"concepts": ["artificial intelligence"]
}
entity_keys = [entity.id for entity in companies]
entities_config = EntitiesToSearch(companies=entity_keys)
control_entities_config = None
if control_entities:
    control_entities_config = EntitiesToSearch(**control_entities)
# Build queries
queries = build_batched_query(
sentences=["Technology innovation strategies"],
keywords=["innovation", "technology"],
entities=entities_config,
control_entities=control_entities_config,
batch_size=5,
fiscal_year=2024,
scope=DocumentType.TRANSCRIPTS,
custom_batches=None,
sources=None,
)

@dataclass
class EntitiesToSearch:
    people: Optional[List[str]] = None      # Person names
    companies: Optional[List[str]] = None   # Company names
    org: Optional[List[str]] = None         # Organization names
    product: Optional[List[str]] = None     # Product names
    place: Optional[List[str]] = None       # Place names
    topic: Optional[List[str]] = None       # Topic keywords
    concepts: Optional[List[str]] = None    # Concept terms

| Parameter | Type | Required | Description |
|---|---|---|---|
| `sentences` | `List[str]` | ✅ | Similarity search sentences |
| `keywords` | `List[str]` | ✅ | Keyword search terms |
| `entities` | `EntitiesToSearch` | ✅ | Entity configuration |
| `control_entities` | `EntitiesToSearch` | ❌ | Co-mention entities |
| `sources` | `List[str]` | ❌ | Source filters |
| `batch_size` | `int` | ✅ | Entities per batch |
| `fiscal_year` | `int` | ❌ | Fiscal year filter |
| `scope` | `DocumentType` | ✅ | Document scope |
| `custom_batches` | `List[EntitiesToSearch]` | ❌ | Custom entity batches |
Bigdata Research Tools supports high-performance concurrent search execution, handling client-side rate limiting under the hood. This is particularly useful when searching over a large number of elements (e.g. Companies, Sentences, Keywords).
from bigdata_research_tools.search.search import run_search
from bigdata_client.models.search import DocumentType, SortBy
from bigdata_research_tools.search.query_builder import create_date_ranges
date_ranges = create_date_ranges("2024-11-01", "2025-03-15", "M")
results = run_search(
queries,
date_ranges=date_ranges,
limit=50,
scope=DocumentType.ALL,
sortby=SortBy.RELEVANCE,
rerank_threshold=None,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `queries` | `List[QueryComponent]` | | List of search queries |
| `date_ranges` | `INPUT_DATE_RANGE` | | Date range specifications |
| `limit` | `int` | `10` | Results per query |
| `only_results` | `bool` | `True` | Return format control |
| `scope` | `DocumentType` | `ALL` | Document type filter |
| `sortby` | `SortBy` | `RELEVANCE` | Result sorting |
| `rerank_threshold` | `float` | `None` | Cross-encoder threshold |
Note: The function uses bigdata_connection() internally, so no explicit client parameter is needed.
from bigdata_research_tools.search.search import SearchManager, normalize_date_range
from bigdata_research_tools.search.query_builder import create_date_ranges
from bigdata_research_tools.client import bigdata_connection
bigdata = bigdata_connection()
manager = SearchManager(
rpm=500, # Requests per minute
bucket_size=100, # Token bucket capacity
bigdata=bigdata # Optional: uses default if None
)
date_ranges = create_date_ranges("2024-11-01", "2025-03-15", "M")
date_ranges = normalize_date_range(date_ranges)
date_ranges.sort(key=lambda x: x[0])
# Use the manager for concurrent searches
results = manager.concurrent_search(
queries=queries,
date_ranges=date_ranges,
limit=1000,
scope=DocumentType.ALL
)

The library supports multiple LLM providers.
NOTE: Most built-in prompts are optimized for OpenAI models, but you can expect them to be robust across LLM providers; some prompt fine-tuning to fit a specific LLM is still recommended.
# Using OpenAI models
llm_model_config = "openai::gpt-4o-mini" # Cost-effective
llm_model_config = "openai::gpt-5-mini" # High performance
# Set OpenAI credentials
import os
os.environ["OPENAI_API_KEY"] = "your_key"# Using Bedrock models
llm_model_config = "bedrock::anthropic.claude-3-sonnet-20240229-v1:0"
llm_model_config = "bedrock::anthropic.claude-3-haiku-20240307-v1:0"
# Set AWS credentials
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"NOTE: If you are logged in using AWS single sign on (SSO) no environment variables are required.
To use Azure OpenAI as a provider, the following environment variables must be set:
AZURE_OPENAI_ENDPOINT="CLIENT_AZURE_OPENAI_ENDPOINT"
OPENAI_API_VERSION="API_VERSION"
Two methods are supported for authentication:
- API key: the environment variable AZURE_OPENAI_API_KEY must be set.
- Other allowed Azure authentication methods (e.g. CLI authentication, Entra ID): these are resolved automatically using DefaultAzureCredential; in this case only the mandatory environment variables above must be set.
To use our workflows with these models, you need to:
- Have a deployed model in your Azure account
- Set the workflow model as azure::deployed_model (e.g. azure::gpt-4o-mini)
The following snippet shows how to authenticate with an API key.
# Using Azure models
llm_model_config = "azure::gpt-4o-mini"
# Set Azure credentials
import os
os.environ["AZURE_OPENAI_ENDPOINT"] = "CLIENT_AZURE_OPENAI_ENDPOINT"
os.environ["OPENAI_API_VERSION"] = "API_VERSION"
os.environ["AZURE_OPENAI_API_KEY"] = "your_key"If other authentication methods (Entra ID, CLI Authentication) are available the snippets becomes:
# Using Azure models
llm_model_config = "azure::gpt-4o-mini"
# Set Azure credentials
import os
os.environ["AZURE_OPENAI_ENDPOINT"] = "CLIENT_AZURE_OPENAI_ENDPOINT"
os.environ["OPENAI_API_VERSION"] = "API_VERSION"NOTE: Models deployed on Azure apply configurable safety filters to detect violent, harmful, or otherwise unsafe content. As documented in several discussions , these filters can occasionally produce false positives because they lack the context to interpret prompts or retrieved text accurately. While our prompts contain no harmful language, news or transcript content may include ambiguous terms that trigger these checks. To reduce the likelihood of false positives, the current workaround involves setting the safety threshold to its lowest level and disabling jailbreak-protection shields. Although we generally do not recommend this approach, it may be the only practical option under current constraints. Please review any changes with your IT department before editing or creating your OpenAI model endpoints on Azure, and do not hesitate to contact us if you have any questions.
The workflows in Bigdata Research Tools rely on a handful of key parameters. Here is a detailed explanation of how to use them in practice and what they mean.
Company objects are bigdata_client.models.entities.Company instances that represent companies in the Bigdata knowledge graph. Here's how to obtain them:
from bigdata_research_tools.client import bigdata_connection
# Connect to Bigdata API
bigdata = bigdata_connection()
# Get companies from a specific watchlist
watchlist_id = "a3915138-bba9-437e-a813-aa1620a822cc" # Example GRID watchlist
watchlist = bigdata.watchlists.get(watchlist_id)
companies = bigdata.knowledge_graph.get_entities(watchlist.items)
print(f"Found {len(companies)} companies in watchlist")
# Output: Found 7 companies in watchlist

# Search for specific companies by name
company_names = ["Apple Inc", "Microsoft Corp.", "Tesla Inc"]
companies = []
for name in company_names:
    # Find company in knowledge graph
    search_results = bigdata.knowledge_graph.autosuggest(name, limit=1)
    if search_results:
        companies.append(next(iter(search_results)))
        print(f"Found: {companies[-1].name} (ID: {companies[-1].id})")
# Output:
# Found: Apple Inc (ID: D8442A)
# Found: Microsoft Corp. (ID: 228D42)
# Found: Tesla Inc (ID: DD3BB1)

# Get all companies from a watchlist, then filter
all_companies = bigdata.knowledge_graph.get_entities(watchlist.items)
# Filter by sector or other criteria
tech_companies = [
    company for company in all_companies
    if hasattr(company, 'sector') and 'Technology' in company.sector
]

Each Company object has these key properties:
company = companies[0]
print(f"Name: {company.name}") # Apple Inc
print(f"ID: {company.id}") # D8442C
print(f"Ticker: {company.ticker}") # AAPL
print(f"Type: {type(company)}") # <class 'bigdata_client.models.entities.Company'>
# Additional properties may include:
# company.sector, company.industry, company.country, etc.

Control entities allow you to filter results based on co-mentions. You can define queries so that documents must mention both your target companies AND the control entities to be included in results. These can be Places, People, Products, Organizations, Concepts, Topics, or other Companies.
# Example: Find documents about Tesla that also mention China or Taiwan
tesla_company_search = bigdata.knowledge_graph.autosuggest("Tesla Inc.")
tesla_company = tesla_company_search[0]
analyzer = RiskAnalyzer(
llm_model_config="openai::gpt-4o-mini",
main_theme="Supply Chain Risk",
companies=[tesla_company],
start_date="2024-01-01",
end_date="2024-12-31",
document_type=DocumentType.NEWS,
control_entities={
"place": ["China", "Taiwan"], # Must also mention these places
"people": ["Elon Musk", "Tim Cook"],
"product": ["iPhone", "Model S", "Azure"]
}
)

control_entities = {
# Geographic filters
"place": ["United States", "China", "Taiwan", "Germany"],
# Organization filters
"org": ["U.S. Department of Commerce"],
# People filters
"people": ["Elon Musk", "Tim Cook", "Satya Nadella"],
# Topic/concept filters
"topic": ["regulation", "trade policy", "cybersecurity"],
# Concept filters
"concepts": ["Trade"],
# Product filters
"product": ["iPhone", "Model S", "Azure"]
}

- AND Logic: Documents must mention target companies AND control entities (illustrated in the sketch after this list)
- OR Logic: Within each control entity type, documents can mention ANY of the listed entities
- Performance: More control entities = fewer but more targeted results
- Optional: Control entities are completely optional - omit for broader analysis
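A minimal conceptual sketch of this co-mention logic, using plain Python sets (illustrative only; the variable names are assumptions and this is not the library's internal implementation):

# Conceptual sketch of the AND/OR co-mention logic described above (not the library's internals)
doc_entities = {"Tesla Inc", "China", "Elon Musk"}   # entities detected in one document

target_companies = {"Tesla Inc"}                     # your screening universe
control_places = {"China", "Taiwan"}                 # control entities of type "place"

# AND between targets and controls; OR (any match) within a control entity type
keep_document = bool(doc_entities & target_companies) and bool(doc_entities & control_places)
print(keep_document)  # True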
The document_type parameter allows you to direct your queries to specific content types. Options include:
from bigdata_client.models.search import DocumentType
# Available document types
DocumentType.NEWS # News articles
DocumentType.TRANSCRIPTS # Earnings call transcripts
DocumentType.FILINGS # SEC filings
DocumentType.ALL # All document types. fiscal_year must not be None

The fiscal_year parameter is required when working with transcripts or filings and determines which fiscal year's documents to analyze. It sets the FiscalYear filter in the Bigdata Search API, which leverages the Reporting Details of a transcript.
# For fiscal year 2024, the system will search for:
fiscal_year = 2024
# Transcripts: Earnings calls from fiscal year 2024
# - Q1 2024 earnings calls (typically Jan-Mar 2024 reports)
# - Q2 2024 earnings calls (typically Apr-Jun 2024 reports)
# - Q3 2024 earnings calls (typically Jul-Sep 2024 reports)
# - Q4 2024 earnings calls (typically Oct-Dec 2024 reports)
# Filings: SEC filings for fiscal year 2024
# - 10-K annual reports for fiscal year ending in 2024
# - 10-Q quarterly reports for quarters in fiscal year 2024
# - 8-K current reports filed during fiscal year 2024

# Analyze recent earnings calls
screener = ThematicScreener(
# ... other parameters ...
document_type=DocumentType.TRANSCRIPTS,
fiscal_year=2024, # Latest completed or current fiscal year
start_date="2024-01-01",
end_date="2024-12-31"
)
# Analyze historical filings
screener = ThematicScreener(
# ... other parameters ...
document_type=DocumentType.FILINGS,
fiscal_year=2023, # Previous fiscal year
start_date="2023-01-01",
end_date="2023-12-31"
)

- Calendar vs Fiscal Year: Companies may have different fiscal year end dates (e.g., Apple's fiscal year ends in September)
- Current Year: For the current fiscal year, only filed documents up to the current date will be available
# For NEWS documents, fiscal_year should be None or omitted
screener = ThematicScreener(
# ... other parameters ...
document_type=DocumentType.NEWS,
fiscal_year=None, # Not applicable for news
start_date="2024-01-01",
end_date="2024-12-31"
)

The ThematicScreener and RiskAnalyzer classes rely on LLM-generated taxonomy trees to conduct an in-depth analysis of company exposure. The focus parameter provides additional context and specificity to guide the AI's taxonomy tree generation, letting you stay involved in how the taxonomy is built.
- Refines Taxonomy Generation: Influences how sub-themes are created
- Guides Analysis Direction: Helps the AI understand what aspects to emphasize
- Improves Relevance: Integrates your expert knowledge and makes results more targeted to your specific research interest
# Basic theme - broad analysis
screener = ThematicScreener(
main_theme="Artificial Intelligence",
focus="", # No additional focus
# ... other parameters
)
# Generated sub-themes might include:
# - AI Development, AI Applications, AI Ethics, AI Investment, etc.
# Focused theme - specific analysis
screener = ThematicScreener(
main_theme="Artificial Intelligence",
focus="Focus on enterprise AI adoption, implementation challenges, and ROI measurement in large corporations",
# ... other parameters
)
# Generated sub-themes might include:
# - Enterprise AI Implementation, AI ROI Metrics, AI Integration Challenges,
# Corporate AI Strategy, AI Vendor Selection, etc.

- Be Specific: Include concrete aspects you want to explore
- Use Domain Language: Include relevant terminology from your field
- Set Context: Explain the business or research context
- Define Scope: Clarify what should be included or excluded
# Good focus examples:
focus = "Analyze cybersecurity investments, breach prevention strategies, and incident response capabilities specifically for financial services companies"
focus = "Examine renewable energy transition strategies including wind, solar, and battery storage investments, with emphasis on grid integration challenges"
focus = "Focus on AI-powered drug discovery, clinical trial optimization, and personalized medicine approaches in pharmaceutical companies"
# Less effective focus examples:
focus = "Look at technology" # Too vague
focus = "AI and stuff" # Unclear
focus = "" # No guidance provided# For transcripts - focus on management commentary
focus = "Focus on management's strategic outlook, guidance updates, and responses to analyst questions about market positioning"
# For news - focus on market reactions
focus = "Analyze market sentiment, analyst opinions, and competitive positioning as reported in financial media"
# For filings - focus on formal disclosures
focus = "Examine risk factor disclosures, business segment performance, and regulatory compliance discussions in official filings"Refines search relevance with cross-encoder reranking, ensuring that the search results closely resemble your sentences:
# Enable reranking with threshold
narrative_miner = NarrativeMiner(
narrative_sentences=sentences,
rerank_threshold=0.7, # Higher = more strict
# ... other parameters
)

The document_limit parameter (exposed as limit in run_search) determines the maximum number of documents retrieved by each query. It is a single int value that applies to every combination of (batched) query and date range, so the total retrieved volume scales with the number of query batches and date ranges (a rough estimate is sketched below).
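As a back-of-the-envelope illustration of that trade-off (the numbers and variable names below are assumptions for the example, not library parameters), the retrieved volume is bounded by batches × date ranges × limit:

# Illustrative upper-bound estimate only; actual counts depend on how many documents exist
import math

n_companies = 50          # e.g. size of your watchlist
batch_size = 10           # companies per batched query
n_date_ranges = 12        # e.g. monthly ranges over one year (frequency="M")
document_limit = 10       # documents per query and date range

n_query_batches = math.ceil(n_companies / batch_size)
max_documents = n_query_batches * n_date_ranges * document_limit
print(max_documents)      # 5 * 12 * 10 = 600 documents at most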
Searching over a long time frame with a fixed document limit implies a trade-off between speed and coverage. With the frequency parameter you can control the granularity of the temporal analysis and split your time sample into shorter intervals. Bigdata Research Tools will automatically create the date ranges and run the queries on each of them.
# Frequency options
"Y" # Yearly intervals
"6M" # Six-monthly intervals
"3M" # Quarterly intervals (default)
"M" # Monthly intervals
"W" # Weekly intervals
"D" # Daily intervals
# Usage example
results = screener.screen_companies(
frequency="M", # Monthly analysis
# ... other parameters
)

Running the analysis on a large portfolio requires balancing speed, cost, and coverage. batch_size sets the number of companies to include in a single query; grouping companies into batches and running a separate search for each batch optimizes performance.
# For large company universes
screener = ThematicScreener(
companies=large_company_list,
# ... other parameters
)
results = screener.screen_companies(
batch_size=25, # Larger batches for efficiency
document_limit=200,
# ... other parameters
)

import logging
# Enable detailed logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Library-specific logging
logging.getLogger("bigdata_research_tools").setLevel(logging.DEBUG)

The library includes an interactive Jupyter notebook tutorial that demonstrates all key functionality with practical, working examples. This is the best way to get started with the library.
The fastest way to get up and running with the tutorial is using uv (the modern Python package manager):
# Clone the repository (if you haven't already)
git clone https://github.com/bigdata-com/bigdata-research-tools.git
cd bigdata-research-tools/tutorial

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install tutorial dependencies
uv pip install -r requirements.txt
# Install the main package in development mode
uv pip install -e ../.

Create a .env file in the tutorial directory:
# Create .env file with your credentials
echo "BIGDATA_API_KEY=your_api_key_here" > .env
echo "OPENAI_API_KEY=your_openai_api_key_here" >> .env # Required to run the Advanced Workflows# Install Jupyter if not included in requirements
uv pip install jupyterlab
# Start Jupyter notebook
jupyter lab tutorial_notebook.ipynb

The interactive tutorial covers:
📚 Fundamentals
- Setting up authentication and connections
- Basic search functionality with search_by_companies()
- Custom query building with run_search()
🔍 Key Features Demonstrated
- Company-specific document searches
- Custom query construction and execution
- Result processing and analysis
- DataFrame export and manipulation
💡 Learning Outcomes
- Understand core library concepts
- See practical, working examples
- Get hands-on experience with real data
🚀 Next Steps

After completing the tutorial, you'll be ready to:
- Explore the advanced workflows (NarrativeMiner, ThematicScreener, RiskAnalyzer)
- Run the complete examples in the examples/ directory
- Build custom analysis workflows for your specific use cases and explore our Bigdata Cookbook, which features a collection of ready-to-use notebooks for a variety of finance-related guided workflows.
If you prefer not to use uv, you can also use traditional pip:
# Create virtual environment
python -m venv tutorial_env
source tutorial_env/bin/activate # On Windows: tutorial_env\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -e ..
# Launch notebook
jupyter notebook tutorial_notebook.ipynb

The library includes several complete examples in the examples/ directory.
File: examples/narrative_miner.py
What it does: Tracks AI-related narratives across transcripts
# Run the narrative miner example
cd examples
python narrative_miner.py

Expected output:
Environment variables loaded: True
INFO:bigdata_research_tools:Starting narrative mining...
INFO:bigdata_research_tools:Processing 15 narrative sentences...
INFO:bigdata_research_tools:Analysis complete. Results saved to narrative_miner_sample.xlsx
File: examples/thematic_screener.py
What it does: Analyzes companies' exposure to "Chip Manufacturers" theme
# Run the thematic screener example
python thematic_screener.py

Expected output:
Environment variables loaded: True
INFO:bigdata_research_tools:Generating theme tree for: Chip Manufacturers
INFO:bigdata_research_tools:Screening 50 companies...
INFO:bigdata_research_tools:Creating visualizations...
# Browser opens with interactive dashboard
File: examples/risk_analyzer.py
What it does: Assesses risk exposure to US import tariffs
# Run the risk analyzer example
python risk_analyzer.py

Expected output:
Environment variables loaded: True
INFO:bigdata_research_tools:Creating risk taxonomy...
INFO:bigdata_research_tools:Analyzing risk exposure...
INFO:bigdata_research_tools:Risk analysis complete. Results saved to risk_analyzer_results.xlsx
# Browser opens with risk dashboard
File: examples/query_builder.py
What it does: Demonstrates advanced query construction techniques
# Run the query builder example
python query_builder.py

Expected output:
INFO:__main__:======================================
INFO:__main__:TEST 1: Basic EntityConfig with Auto-batching
INFO:__main__:Generated 2 query components
INFO:__main__:Sample query structure: [QueryComponent(...)]
File: examples/portfolio_example.py
What it does: Shows different portfolio construction methods
# Run the portfolio constructor example
python portfolio_example.py

Expected output:
INFO:__main__:======================================
INFO:__main__:EXAMPLE 1: Basic Equal-Weighted Portfolio (Sector Balanced)
INFO:__main__:Portfolio Size: 20 companies
INFO:__main__:Sectors Represented: 5
File: examples/search_by_companies.py
What it does: Shows how to search for documents mentioning specific companies and topics
# Run the search by companies example
python search_by_companies.py

Expected output:
Environment variables loaded: True
INFO:__main__:Found: Apple Inc (ID: D8442C)
INFO:__main__:Found: Microsoft Corporation (ID: D4A6CC)
INFO:__main__:Found 24 relevant documents
INFO:__main__: Apple Inc: 15 documents
INFO:__main__: Microsoft Corporation: 9 documents
# Results exported to search_by_companies_results.xlsx
File: examples/run_search.py
What it does: Demonstrates custom query building and search execution
# Run the run_search example
python run_search.py

Expected output:
Environment variables loaded: True
INFO:__main__:Generated 4 search queries
INFO:__main__:Searching across 3 time periods
INFO:__main__:Found 32 documents total
INFO:__main__: Reuters: 12 documents
INFO:__main__: Bloomberg: 8 documents
# Results exported to run_search_results.xlsx
- Documentation: https://docs.bigdata.com
- API Reference: Check the docs/ directory for detailed API documentation
- Examples: See the examples/ directory for complete working examples
- Issues: Report issues through support@bigdata.com
This project uses ruff for linting and formatting and ty as a type checker. To ensure your code adheres to the project's style guidelines, run the following commands before committing your changes:
make type-check
make lint
make format

This software is licensed for use solely under the terms agreed upon in the applicable Master Agreement and Order Schedule between the parties. For trials, the applicable legal documents are the Mutual Non-Disclosure Agreement or, if applicable, the Trial Agreement. No other rights or licenses are granted by implication, estoppel, or otherwise. For further details, please refer to your specific Master Agreement and Order Schedule or contact us at legal@ravenpack.com.
RavenPack | Bigdata.com
All rights reserved © 2025