Building with Bigdata.com
- Overview
- Key Features
- Installation
- Authentication Setup
- Core Workflows
- Core Functionalities
- Parameter Deep Dive
- Interactive Tutorial
- Examples
- Support and Resources
- License
Bigdata Research Tools is a Python library designed to automate and streamline research workflows using the Bigdata.com API. It provides high-level, plug-and-play functions for building customized research processes with minimal effort.
- ⚡ Concurrent Search: Execute multiple searches efficiently with built-in rate limiting.
- 🛡️ Thread-Safe Operations: Safe concurrent access for all workflows.
- 🧭 Guided Workflow Builder: Easily build guided research workflows; see ready-to-use examples in the Bigdata Cookbook Repository.
- 🎨 Interactive Visualizations: Create dashboards and charts for your results.
bigdata_research_tools/
├── workflows/   # High-level research workflows
├── search/      # Search utilities and query builders
├── visuals/     # Visualization and dashboard tools
├── labeler/     # AI-powered content labeling
├── llm/         # LLM integration (OpenAI, Bedrock)
└── prompts/     # Prompt templates for AI models
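For orientation, the packages above map onto the entry points used throughout this README. The imports below all appear verbatim in the examples that follow (a quick reference only, not an exhaustive list of the public API):

from bigdata_research_tools.client import bigdata_connection                    # API client helper
from bigdata_research_tools.workflows import ThematicScreener, NarrativeMiner   # high-level workflows
from bigdata_research_tools.workflows.risk_analyzer import RiskAnalyzer
from bigdata_research_tools.search.search import run_search                     # concurrent search
from bigdata_research_tools.search.query_builder import build_batched_query, create_date_ranges
from bigdata_research_tools.mindmap import MindMapGenerator, MindMap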
Install the library using pip:
pip install bigdata-research-tools

Install additional packages for specific features:
# For OpenAI integration
pip install bigdata-research-tools[openai]
# For Azure OpenAI integration
pip install bigdata-research-tools[azure]
# For AWS Bedrock integration
pip install bigdata-research-tools[bedrock]
# For all optional features
pip install bigdata-research-tools[azure,bedrock,openai]

Set up your credentials using environment variables:
export BIGDATA_API_KEY="your_api_key"
# or
export BIGDATA_USERNAME="your_username"
export BIGDATA_PASSWORD="your_password"

Create a .env file in your project directory:
BIGDATA_API_KEY="your_api_key"
# or
BIGDATA_USERNAME="your_username"
BIGDATA_PASSWORD="your_password"

Load the environment variables in your Python script:
from dotenv import load_dotenv
load_dotenv()

Bigdata Research Tools includes several end-to-end workflows built with the Bigdata API, such as:
- 📊 Thematic Screeners: Analyze company exposure to specific themes
- ⚠️ Risk Analyzer: Assess company risk exposure to various scenarios
- 📰 Narrative Miners: Track narrative evolution across news, transcripts, and filings
Beyond these workflows, core functionalities such as search, the LLM integrations, and the Labeler are the building blocks of many other workflows and use cases, including:
- Market Analysis
- Daily Digests
- Systematic Monitoring
- Report Generation
You can find these workflows and additional examples in the Cookbooks section of the Bigdata documentation: Cookbooks – Bigdata docs.
If you're running these workflows in a Notebook, you'll need to set up asyncio properly to avoid event loop conflicts:
import nest_asyncio

# Notebooks already run an asyncio event loop; patch it so the library's
# nested asyncio calls don't raise "event loop is already running"
nest_asyncio.apply()

The Thematic Screener analyzes company exposure to specific themes by generating sub-themes and assigning exposure scores. It returns structured tables with labeled text and a final basket of companies sorted by exposure score, along with a final motivation.
from bigdata_research_tools.workflows import ThematicScreener
from bigdata_research_tools.client import bigdata_connection
from bigdata_client.models.search import DocumentType
# Get companies from a watchlist
bigdata = bigdata_connection()
watchlist = bigdata.watchlists.get("watchlist_id")
companies = bigdata.knowledge_graph.get_entities(watchlist.items)
screener = ThematicScreener(
llm_model_config="openai::gpt-4o-mini",
main_theme="Electric Vehicles",
companies=companies,
start_date="2024-01-01",
end_date="2024-12-31",
document_type=DocumentType.TRANSCRIPTS,
fiscal_year=2024
)
results = screener.screen_companies(
export_path="thematic_screening.xlsx"
)

Parameters to initialize the ThematicScreener class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `llm_model_config` | `str` | ✅ | LLM model identifier |
| `main_theme` | `str` | ✅ | Main theme to analyze |
| `companies` | `List[Company]` | ✅ | List of companies to screen (see Company Objects) |
| `start_date` | `str` | ✅ | Start date (YYYY-MM-DD) |
| `end_date` | `str` | ✅ | End date (YYYY-MM-DD) |
| `document_type` | `DocumentType` | ✅ | Document scope (see Document Types) |
| `fiscal_year` | `int` | ✅ | Required for transcripts/filings. Set to None for news (see Fiscal Year Guide) |
| `sources` | `List[str]` | ❌ | Source filters |
| `rerank_threshold` | `float` | ❌ | Reranking threshold (0-1) (see Reranker Guide) |
| `focus` | `str` | ❌ | Additional focus description (see Focus Parameter Guide) |
Parameters to run the analysis end-to-end.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `document_limit` | `int` | `10` | Documents per query (see Document Limit Guide) |
| `batch_size` | `int` | `10` | Batch size for processing (see Batch Size Parameter Guide) |
| `frequency` | `str` | `"3M"` | Date range frequency (see Frequency Parameter Guide) |
| `word_range` | `Tuple[int, int]` | `(50, 100)` | Word range for motivations |
| `export_path` | `str` | `None` | Excel export path |
results = {
"df_labeled": DataFrame, # Labeled search results
"df_company": DataFrame, # Company-level theme scores
"df_industry": DataFrame, # Industry-level aggregations
"df_motivation": DataFrame, # Company motivations
"theme_tree": ThemeTree # Generated theme hierarchy
}

The Risk Analyzer assesses company exposure to risk scenarios, generating a detailed risk taxonomy and calculating exposure scores. It returns structured tables with labeled text and a final basket of companies sorted by risk exposure, along with a final motivation.
from bigdata_research_tools.client import bigdata_connection
from bigdata_research_tools.workflows.risk_analyzer import RiskAnalyzer
from bigdata_client.models.search import DocumentType
# Get companies from a watchlist
bigdata = bigdata_connection()
watchlist = bigdata.watchlists.get("watchlist_id")
companies = bigdata.knowledge_graph.get_entities(watchlist.items)
analyzer = RiskAnalyzer(
llm_model_config="openai::gpt-4o-mini",
main_theme="Supply Chain Disruption",
companies=companies,
start_date="2024-01-01",
end_date="2024-12-31",
document_type=DocumentType.NEWS,
keywords=["supply chain", "logistics"],
control_entities={"place": ["China", "Taiwan"]}
)
results = analyzer.screen_companies(
export_path="risk_analysis.xlsx"
)

Parameters to initialize the RiskAnalyzer class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `llm_model_config` | `str` | ✅ | LLM model identifier |
| `main_theme` | `str` | ✅ | Main risk theme |
| `companies` | `List[Company]` | ✅ | Companies to analyze (see Company Objects) |
| `start_date` | `str` | ✅ | Analysis start date |
| `end_date` | `str` | ✅ | Analysis end date |
| `document_type` | `DocumentType` | ✅ | Document scope (see Document Types) |
| `keywords` | `List[str]` | ❌ | Keyword filters |
| `control_entities` | `Dict[str, List[str]]` | ❌ | Entity co-mention filters (see Control Entities) |
| `fiscal_year` | `int` | ✅ | Required for transcripts/filings. Set to None for news (see Fiscal Year Guide) |
| `sources` | `List[str]` | ❌ | Source filters |
| `rerank_threshold` | `float` | ❌ | Reranking threshold (0-1) (see Reranker Guide) |
| `focus` | `str` | ❌ | Additional focus description (see Focus Parameter Guide) |
Parameters to run the analysis end-to-end.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `document_limit` | `int` | `10` | Documents per query (see Document Limit Guide) |
| `batch_size` | `int` | `10` | Batch size for processing (see Batch Size Parameter Guide) |
| `frequency` | `str` | `"3M"` | Date range frequency (see Frequency Parameter Guide) |
| `word_range` | `Tuple[int, int]` | `(50, 100)` | Word range for motivations |
| `export_path` | `str` | `None` | Excel export path |
results = {
"df_labeled": DataFrame, # Labeled search results
"df_company": DataFrame, # Company risk scores
"df_industry": DataFrame, # Industry risk aggregations
"df_motivation": DataFrame, # Risk motivations
"risk_tree": ThemeTree # Risk taxonomy tree
}

The Narrative Miner tracks how specific narratives evolve over time across different document types. It returns structured tables with labeled text.
from bigdata_research_tools.workflows import NarrativeMiner
from bigdata_client.models.search import DocumentType
narrative_miner = NarrativeMiner(
narrative_sentences=[
"Artificial Intelligence Development",
"Machine Learning Innovation",
"Data Privacy Concerns"
],
llm_model_config="openai::gpt-4o-mini",
start_date="2024-01-01",
end_date="2024-12-31",
fiscal_year=2024,
document_type=DocumentType.NEWS
)
results = narrative_miner.mine_narratives(
export_path="narrative_analysis.xlsx"
)

Parameters to initialize the NarrativeMiner class.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `narrative_sentences` | `List[str]` | ✅ | List of narrative sentences to track |
| `start_date` | `str` | ✅ | Start date in YYYY-MM-DD format |
| `end_date` | `str` | ✅ | End date in YYYY-MM-DD format |
| `llm_model_config` | `str` | ✅ | LLM model in format "provider::model" |
| `document_type` | `DocumentType` | ✅ | Document scope (see Document Types) |
| `fiscal_year` | `int` | ✅ | Fiscal year for transcripts/filings. Set to None for news |
| `sources` | `List[str]` | ❌ | Filter by specific news sources |
| `rerank_threshold` | `float` | ❌ | Reranking threshold (0-1) (see Reranker Guide) |
Parameters to run the analysis end-to-end.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `document_limit` | `int` | `10` | Documents per query (see Document Limit Guide) |
| `batch_size` | `int` | `10` | Batch size for processing (see Batch Size Parameter Guide) |
| `frequency` | `str` | `"3M"` | Date range frequency (see Frequency Parameter Guide) |
| `export_path` | `str` | `None` | Excel export path |
results = {
"df_labeled": DataFrame # Labeled search results with narrative classifications
}

The MindMap Generator creates hierarchical tree structures that decompose complex themes into organized sub-themes, enabling structured research and analysis. It offers three generation modes: one-shot, refined, and dynamic evolution.
from bigdata_research_tools.mindmap import MindMapGenerator, MindMap
# Create a generator
generator = MindMapGenerator(
llm_model_config_base="openai::gpt-4o-mini",
llm_model_config_reasoning="openai::gpt-4o" # Optional: for refined generation
)
# Basic MindMap structure
mindmap = MindMap(
label="Climate Risk",
node=1,
summary="Climate-related financial risks affecting business operations",
children=[
MindMap(label="Physical Risks", node=2, summary="Direct climate impacts"),
MindMap(label="Transition Risks", node=3, summary="Policy and market changes")
]
)

generate_one_shot() creates a complete mind map in a single LLM call, optionally grounded in real-time search results.
# Simple one-shot generation (no search grounding)
mindmap, result = generator.generate_one_shot(
main_theme="AI in Healthcare",
focus="Focus on diagnostic applications and regulatory challenges",
allow_grounding=False,
map_type="theme" # or "risk" for risk analysis
)
# One-shot with search grounding
mindmap, result = generator.generate_one_shot(
main_theme="Supply Chain Disruptions",
focus="Post-pandemic resilience strategies",
allow_grounding=True,
date_range=("2024-01-01", "2024-12-31"),
map_type="risk"
)
print(f"Generated mindmap with {len(mindmap.get_terminal_labels())} terminal nodes")
# Result includes: mindmap_text, mindmap_df, mindmap_json, search_queries (if grounded)

How One-Shot Works:
- Without grounding: LLM generates mind map purely from its training knowledge
- With grounding: LLM proposes search queries → searches executed → LLM creates mind map using search results
- Use cases: Initial exploration, baseline analysis, quick prototyping
generate_refined() enhances an existing mind map by having the LLM propose targeted searches, then incorporating the results to expand and improve the structure.
# Start with an initial mindmap (from one-shot or manual creation)
initial_mindmap_json = mindmap.to_json()
# Refine using search-based enhancement
refined_mindmap, result = generator.generate_refined(
main_theme="Cybersecurity Threats",
focus="Enterprise security and incident response",
initial_mindmap=initial_mindmap_json,
map_type="risk",
date_range=("2024-06-01", "2024-12-31"),
chunk_limit=25, # Results per search query
output_dir="./refined_outputs",
filename="cybersecurity_refined.json"
)
print(f"Search queries used: {result['search_queries']}")
print(f"Refined mindmap has {len(refined_mindmap.get_terminal_labels())} terminal nodes")How Refined Generation Works:
- Analysis: LLM analyzes the initial mind map and identifies knowledge gaps
- Search Proposal: LLM proposes specific search queries to fill those gaps
- Search Execution: Queries are executed against Bigdata's news/documents database
- Enhancement: LLM incorporates search results to expand/refine the mind map
- Output: Enhanced mind map with real-world grounding and additional detail
Use cases:
- Adding depth and specificity to broad themes
generate_dynamic() creates mind maps that evolve over time intervals, showing how themes develop and change across different periods.
from bigdata_research_tools.search.query_builder import create_date_ranges
# Create time intervals
month_intervals = create_date_ranges("2024-01-01", "2024-06-30", "M")
month_names = ["Jan2024", "Feb2024", "Mar2024", "Apr2024", "May2024", "Jun2024"]
# Generate evolving mindmaps
mindmap_objects, results = generator.generate_dynamic(
main_theme="ESG Investment Trends",
focus="Institutional investor behavior and regulatory changes",
month_intervals=month_intervals,
month_names=month_names,
map_type="theme",
chunk_limit=20,
output_dir="./dynamic_evolution"
)
# Access evolution over time
for month, mindmap_obj in mindmap_objects.items():
    print(f"{month}: {len(mindmap_obj.get_terminal_labels())} terminal nodes")

How Dynamic Generation Works:
- Base Generation: Creates initial mind map for the overall theme
- Iterative Refinement: For each time interval:
  - Uses previous month's mind map as starting point
  - Searches for period-specific information
  - Refines mind map based on that period's context
  - Uses refined version as input for next iteration
- Evolution Tracking: Each step builds upon previous knowledge while incorporating new temporal context
Use cases:
- Tracking narrative evolution in financial markets
- Analyzing how risk factors emerge and develop over time
- Understanding seasonal or cyclical patterns in business themes
| Parameter | Type | Description |
|---|---|---|
| `main_theme` | `str` | Core topic to analyze |
| `focus` | `str` | Specific guidance for analysis direction |
| `allow_grounding` | `bool` | Enable search-based grounding (one-shot only) |
| `map_type` | `str` | "theme", "risk", or "risk_entity" |
| `date_range` | `tuple[str, str]` | Search date range (YYYY-MM-DD format) |
| `chunk_limit` | `int` | Results per search query |
| `output_dir` | `str` | Directory for saving results |
# Visualize the mindmap
mindmap.visualize(engine="graphviz") # or "plotly", "matplotlib"
# Export to different formats
df = mindmap.to_dataframe() # Pandas DataFrame
json_str = mindmap.to_json() # JSON string
mindmap.save_json("output.json") # Save to file
# Access terminal nodes (leaf nodes)
terminal_labels = mindmap.get_terminal_labels()
terminal_summaries = mindmap.get_terminal_summaries()

Bigdata Research Tools enables advanced query construction for the Bigdata Search API. The Query Builder combines Entity, Keyword, and Similarity Search, allowing users to control the query logic and optimize its efficiency with entity batching and control entities. It also supports different Document Types and specific Sources.
More information on Bigdata Search API's query filters can be found at Bigdata.com - Query Filters.
from bigdata_research_tools.search.query_builder import (
EntitiesToSearch,
build_batched_query
)
from bigdata_client.models.search import DocumentType
from bigdata_research_tools.client import bigdata_connection
bigdata = bigdata_connection()
company_names = ["Apple Inc", "Microsoft Corp", "Tesla Inc"]
companies = []
for name in company_names:
    results = bigdata.knowledge_graph.find_companies(name)
    if results:
        companies.append(next(iter(results)))
control_entities = {
"people": ["Tim Cook", "Satya Nadella"],
"concepts": ["artificial intelligence"]
}
entity_keys = [entity.id for entity in companies]
entities_config = EntitiesToSearch(companies=entity_keys)
control_entities_config = None
if control_entities:
    control_entities_config = EntitiesToSearch(**control_entities)
# Build queries
queries = build_batched_query(
sentences=["Technology innovation strategies"],
keywords=["innovation", "technology"],
entities=entities_config,
control_entities=control_entities_config,
batch_size=5,
fiscal_year=2024,
scope=DocumentType.TRANSCRIPTS,
custom_batches=None,
sources=None,
)

@dataclass
class EntitiesToSearch:
    people: Optional[List[str]] = None      # Person names
    companies: Optional[List[str]] = None   # Company names
    org: Optional[List[str]] = None         # Organization names
    product: Optional[List[str]] = None     # Product names
    place: Optional[List[str]] = None       # Place names
    topic: Optional[List[str]] = None       # Topic keywords
    concepts: Optional[List[str]] = None    # Concept terms

| Parameter | Type | Required | Description |
|---|---|---|---|
| `sentences` | `List[str]` | ✅ | Similarity search sentences |
| `keywords` | `List[str]` | ✅ | Keyword search terms |
| `entities` | `EntitiesToSearch` | ✅ | Entity configuration |
| `control_entities` | `EntitiesToSearch` | ❌ | Co-mention entities |
| `sources` | `List[str]` | ❌ | Source filters |
| `batch_size` | `int` | ✅ | Entities per batch |
| `fiscal_year` | `int` | ❌ | Fiscal year filter |
| `scope` | `DocumentType` | ✅ | Document scope |
| `custom_batches` | `List[EntitiesToSearch]` | ❌ | Custom entity batches |
Bigdata Research Tools supports high-performance concurrent search execution, handling client-side rate limiting under the hood. This is particularly useful when searching over a large number of elements (e.g. Companies, Sentences, Keywords).
from bigdata_research_tools.search.search import run_search
from bigdata_client.models.search import DocumentType, SortBy
from bigdata_research_tools.search.query_builder import create_date_ranges
date_ranges = create_date_ranges("2024-11-01", "2025-03-15", "M")
results = run_search(
queries,
date_ranges=date_ranges,
limit=50,
scope=DocumentType.ALL,
sortby=SortBy.RELEVANCE,
rerank_threshold=None,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `queries` | `List[QueryComponent]` | | List of search queries |
| `date_ranges` | `INPUT_DATE_RANGE` | | Date range specifications |
| `limit` | `int` | `10` | Results per query |
| `only_results` | `bool` | `True` | Return format control |
| `scope` | `DocumentType` | `ALL` | Document type filter |
| `sortby` | `SortBy` | `RELEVANCE` | Result sorting |
| `rerank_threshold` | `float` | `None` | Cross-encoder threshold |
Note: The function uses bigdata_connection() internally, so no explicit client parameter is needed.
from bigdata_research_tools.search.search import SearchManager, normalize_date_range
from bigdata_research_tools.search.query_builder import create_date_ranges
from bigdata_research_tools.client import bigdata_connection
bigdata = bigdata_connection()
manager = SearchManager(
rpm=500, # Requests per minute
bucket_size=100, # Token bucket capacity
bigdata=bigdata # Optional: uses default if None
)
date_ranges = create_date_ranges("2024-11-01", "2025-03-15", "M")
date_ranges = normalize_date_range(date_ranges)
date_ranges.sort(key=lambda x: x[0])
# Use the manager for concurrent searches
results = manager.concurrent_search(
queries=queries,
date_ranges=date_ranges,
limit=1000,
scope=DocumentType.ALL
)

The library supports multiple LLM providers.
NOTE: Most built-in prompts are optimized for OpenAI models, but you can expect them to be robust across LLM providers; some prompt fine-tuning to fit a specific LLM is still recommended.
# Using OpenAI models
llm_model_config = "openai::gpt-4o-mini" # Cost-effective
llm_model_config = "openai::gpt-5-mini" # High performance
# Set OpenAI credentials
import os
os.environ["OPENAI_API_KEY"] = "your_key"# Using Bedrock models
llm_model_config = "bedrock::anthropic.claude-3-sonnet-20240229-v1:0"
llm_model_config = "bedrock::anthropic.claude-3-haiku-20240307-v1:0"
# Set AWS credentials
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"NOTE: If you are logged in using AWS single sign on (SSO) no environment variables are required.
To use Azure OpenAI as a provider, the following environment variables must be set:
AZURE_OPENAI_ENDPOINT="CLIENT_AZURE_OPENAI_ENDPOINT"
OPENAI_API_VERSION="API_VERSION"
Two methods are supported for authentication:
- API key: the environment variable AZURE_OPENAI_API_KEY must be set.
- Other allowed Azure authentication methods (e.g. CLI authentication, Entra ID): these are resolved automatically using DefaultAzureCredential; in this case only the mandatory environment variables above must be set.
To use our workflows with these models, you need to:
- Have a deployed model in your Azure account
- Set the workflow model as azure::deployed_model (e.g. azure::gpt-4o-mini)
The following snippet shows how to authenticate with an API key.
# Using Azure models
llm_model_config = "azure::gpt-4o-mini"
# Set Azure credentials
import os
os.environ["AZURE_OPENAI_ENDPOINT"] = "CLIENT_AZURE_OPENAI_ENDPOINT"
os.environ["OPENAI_API_VERSION"] = "API_VERSION"
os.environ["AZURE_OPENAI_API_KEY"] = "your_key"If other authentication methods (Entra ID, CLI Authentication) are available the snippets becomes:
# Using Azure models
llm_model_config = "azure::gpt-4o-mini"
# Set Azure credentials
import os
os.environ["AZURE_OPENAI_ENDPOINT"] = "CLIENT_AZURE_OPENAI_ENDPOINT"
os.environ["OPENAI_API_VERSION"] = "API_VERSION"NOTE: Models deployed on Azure apply configurable safety filters to detect violent, harmful, or otherwise unsafe content. As documented in several discussions , these filters can occasionally produce false positives because they lack the context to interpret prompts or retrieved text accurately. While our prompts contain no harmful language, news or transcript content may include ambiguous terms that trigger these checks. To reduce the likelihood of false positives, the current workaround involves setting the safety threshold to its lowest level and disabling jailbreak-protection shields. Although we generally do not recommend this approach, it may be the only practical option under current constraints. Please review any changes with your IT department before editing or creating your OpenAI model endpoints on Azure, and do not hesitate to contact us if you have any questions.
The workflows in Bigdata Research Tools rely on a handful of key parameters. Here is a detailed explanation of how to use them in practice and what they mean.
Company objects are bigdata_client.models.entities.Company instances that represent companies in the Bigdata knowledge graph. Here's how to obtain them:
from bigdata_research_tools.client import bigdata_connection
# Connect to Bigdata API
bigdata = bigdata_connection()
# Get companies from a specific watchlist
watchlist_id = "a3915138-bba9-437e-a813-aa1620a822cc" # Example GRID watchlist
watchlist = bigdata.watchlists.get(watchlist_id)
companies = bigdata.knowledge_graph.get_entities(watchlist.items)
print(f"Found {len(companies)} companies in watchlist")
# Output: Found 7 companies in watchlist

# Search for specific companies by name
company_names = ["Apple Inc", "Microsoft Corp.", "Tesla Inc"]
companies = []
for name in company_names:
    # Find company in knowledge graph
    search_results = bigdata.knowledge_graph.autosuggest(name, limit=1)
    if search_results:
        companies.append(next(iter(search_results)))
        print(f"Found: {companies[-1].name} (ID: {companies[-1].id})")
# Output:
# Found: Apple Inc (ID: D8442A)
# Found: Microsoft Corp. (ID: 228D42)
# Found: Tesla Inc (ID: DD3BB1)

# Get all companies from a watchlist, then filter
all_companies = bigdata.knowledge_graph.get_entities(watchlist.items)
# Filter by sector or other criteria
tech_companies = [
    company for company in all_companies
    if hasattr(company, 'sector') and 'Technology' in company.sector
]

Each Company object has these key properties:
company = companies[0]
print(f"Name: {company.name}") # Apple Inc
print(f"ID: {company.id}") # D8442C
print(f"Ticker: {company.ticker}") # AAPL
print(f"Type: {type(company)}") # <class 'bigdata_client.models.entities.Company'>
# Additional properties may include:
# company.sector, company.industry, company.country, etc.

Control entities allow you to filter results based on co-mentions. You can define queries so that documents must mention both your target companies AND the control entities to be included in results. These can be Places, People, Products, Organizations, Concepts, Topics, or other Companies.
# Example: Find documents about Tesla that also mention China or Taiwan
tesla_company_search = bigdata.knowledge_graph.autosuggest("Tesla Inc.")
tesla_company = tesla_company_search[0]
analyzer = RiskAnalyzer(
llm_model_config="openai::gpt-4o-mini",
main_theme="Supply Chain Risk",
companies=[tesla_company],
start_date="2024-01-01",
end_date="2024-12-31",
document_type=DocumentType.NEWS,
control_entities={
"place": ["China", "Taiwan"], # Must also mention these places
"people": ["Elon Musk", "Tim Cook"],
"product": ["iPhone", "Model S", "Azure"]
}
)

control_entities = {
# Geographic filters
"place": ["United States", "China", "Taiwan", "Germany"],
# Organization filters
"org": ["U.S. Department of Commerce"],
# People filters
"people": ["Elon Musk", "Tim Cook", "Satya Nadella"],
# Topic/concept filters
"topic": ["regulation", "trade policy", "cybersecurity"],
# Concept filters
"concepts": ["Trade"],
# Product filters
"product": ["iPhone", "Model S", "Azure"]
}

- AND Logic: Documents must mention target companies AND control entities (illustrated in the sketch after this list)
- OR Logic: Within each control entity type, documents can mention ANY of the listed entities
- Performance: More control entities = fewer but more targeted results
- Optional: Control entities are completely optional - omit for broader analysis
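A minimal conceptual sketch of this co-mention logic, using plain Python sets (illustrative only; the variable names are assumptions and this is not the library's internal implementation):

# Conceptual sketch of the AND/OR co-mention logic described above (not the library's internals)
doc_entities = {"Tesla Inc", "China", "Elon Musk"}   # entities detected in one document

target_companies = {"Tesla Inc"}                     # your screening universe
control_places = {"China", "Taiwan"}                 # control entities of type "place"

# AND between targets and controls; OR (any match) within a control entity type
keep_document = bool(doc_entities & target_companies) and bool(doc_entities & control_places)
print(keep_document)  # True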
The document_type parameter allows you to direct your queries to specific content types. Options include:
from bigdata_client.models.search import DocumentType
# Available document types
DocumentType.NEWS # News articles
DocumentType.TRANSCRIPTS # Earnings call transcripts
DocumentType.FILINGS # SEC filings
DocumentType.ALL # All document types. fiscal_year must not be None

The fiscal_year parameter is required when working with transcripts or filings and determines which fiscal year's documents to analyze. It sets the FiscalYear filter in the Bigdata Search API, which leverages the Reporting Details of a transcript.
# For fiscal year 2024, the system will search for:
fiscal_year = 2024
# Transcripts: Earnings calls from fiscal year 2024
# - Q1 2024 earnings calls (typically Jan-Mar 2024 reports)
# - Q2 2024 earnings calls (typically Apr-Jun 2024 reports)
# - Q3 2024 earnings calls (typically Jul-Sep 2024 reports)
# - Q4 2024 earnings calls (typically Oct-Dec 2024 reports)
# Filings: SEC filings for fiscal year 2024
# - 10-K annual reports for fiscal year ending in 2024
# - 10-Q quarterly reports for quarters in fiscal year 2024
# - 8-K current reports filed during fiscal year 2024

# Analyze recent earnings calls
screener = ThematicScreener(
# ... other parameters ...
document_type=DocumentType.TRANSCRIPTS,
fiscal_year=2024, # Latest completed or current fiscal year
start_date="2024-01-01",
end_date="2024-12-31"
)
# Analyze historical filings
screener = ThematicScreener(
# ... other parameters ...
document_type=DocumentType.FILINGS,
fiscal_year=2023, # Previous fiscal year
start_date="2023-01-01",
end_date="2023-12-31"
)

- Calendar vs Fiscal Year: Companies may have different fiscal year end dates (e.g., Apple's fiscal year ends in September)
- Current Year: For the current fiscal year, only filed documents up to the current date will be available
# For NEWS documents, fiscal_year should be None or omitted
screener = ThematicScreener(
# ... other parameters ...
document_type=DocumentType.NEWS,
fiscal_year=None, # Not applicable for news
start_date="2024-01-01",
end_date="2024-12-31"
)

The ThematicScreener and RiskAnalyzer classes rely on LLM-generated taxonomy trees to conduct an in-depth analysis of company exposure. The focus parameter provides additional context and specificity to guide the AI's taxonomy tree generation, letting you stay involved in how the taxonomy is built.
- Refines Taxonomy Generation: Influences how sub-themes are created
- Guides Analysis Direction: Helps the AI understand what aspects to emphasize
- Improves Relevance: Integrates your expert knowledge and makes results more targeted to your specific research interest
# Basic theme - broad analysis
screener = ThematicScreener(
main_theme="Artificial Intelligence",
focus="", # No additional focus
# ... other parameters
)
# Generated sub-themes might include:
# - AI Development, AI Applications, AI Ethics, AI Investment, etc.
# Focused theme - specific analysis
screener = ThematicScreener(
main_theme="Artificial Intelligence",
focus="Focus on enterprise AI adoption, implementation challenges, and ROI measurement in large corporations",
# ... other parameters
)
# Generated sub-themes might include:
# - Enterprise AI Implementation, AI ROI Metrics, AI Integration Challenges,
# Corporate AI Strategy, AI Vendor Selection, etc.

- Be Specific: Include concrete aspects you want to explore
- Use Domain Language: Include relevant terminology from your field
- Set Context: Explain the business or research context
- Define Scope: Clarify what should be included or excluded
# Good focus examples:
focus = "Analyze cybersecurity investments, breach prevention strategies, and incident response capabilities specifically for financial services companies"
focus = "Examine renewable energy transition strategies including wind, solar, and battery storage investments, with emphasis on grid integration challenges"
focus = "Focus on AI-powered drug discovery, clinical trial optimization, and personalized medicine approaches in pharmaceutical companies"
# Less effective focus examples:
focus = "Look at technology" # Too vague
focus = "AI and stuff" # Unclear
focus = "" # No guidance provided# For transcripts - focus on management commentary
focus = "Focus on management's strategic outlook, guidance updates, and responses to analyst questions about market positioning"
# For news - focus on market reactions
focus = "Analyze market sentiment, analyst opinions, and competitive positioning as reported in financial media"
# For filings - focus on formal disclosures
focus = "Examine risk factor disclosures, business segment performance, and regulatory compliance discussions in official filings"Refines search relevance with cross-encoder reranking, ensuring that the search results closely resemble your sentences:
# Enable reranking with threshold
narrative_miner = NarrativeMiner(
narrative_sentences=sentences,
rerank_threshold=0.7, # Higher = more strict
# ... other parameters
)

The document_limit parameter (exposed as limit in run_search) determines the maximum number of documents retrieved by each query. It is a single int value that applies to every combination of (batched) query and date range, so the total retrieved volume scales with the number of query batches and date ranges (a rough estimate is sketched below).
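As a back-of-the-envelope illustration of that trade-off (the numbers and variable names below are assumptions for the example, not library parameters), the retrieved volume is bounded by batches × date ranges × limit:

# Illustrative upper-bound estimate only; actual counts depend on how many documents exist
import math

n_companies = 50          # e.g. size of your watchlist
batch_size = 10           # companies per batched query
n_date_ranges = 12        # e.g. monthly ranges over one year (frequency="M")
document_limit = 10       # documents per query and date range

n_query_batches = math.ceil(n_companies / batch_size)
max_documents = n_query_batches * n_date_ranges * document_limit
print(max_documents)      # 5 * 12 * 10 = 600 documents at most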
Searching over a long time frame with a fixed document limit implies a trade-off between speed and coverage. With the frequency parameter you can control the granularity of the temporal analysis and split your time sample into shorter intervals. Bigdata Research Tools will automatically create the date ranges and run the queries on each of them.
# Frequency options
"Y" # Yearly intervals
"6M" # Six-monthly intervals
"3M" # Quarterly intervals (default)
"M" # Monthly intervals
"W" # Weekly intervals
"D" # Daily intervals
# Usage example
results = screener.screen_companies(
frequency="M", # Monthly analysis
# ... other parameters
)

Running the analysis on a large portfolio requires balancing speed, cost, and coverage. batch_size sets the number of companies to include in a single query; grouping companies into batches and running a separate search for each batch optimizes performance.
# For large company universes
screener = ThematicScreener(
companies=large_company_list,
# ... other parameters
)
results = screener.screen_companies(
batch_size=25, # Larger batches for efficiency
document_limit=200,
# ... other parameters
)

import logging
# Enable detailed logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Library-specific logging
logging.getLogger("bigdata_research_tools").setLevel(logging.DEBUG)

The library includes an interactive Jupyter notebook tutorial that demonstrates all key functionality with practical, working examples. This is the best way to get started with the library.
The fastest way to get up and running with the tutorial is using uv (the modern Python package manager):
# Clone the repository (if you haven't already)
git clone https://github.com/bigdata-com/bigdata-research-tools.git
cd bigdata-research-tools/tutorial

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install tutorial dependencies
uv pip install -r requirements.txt
# Install the main package in development mode
uv pip install -e ../.

Create a .env file in the tutorial directory:
# Create .env file with your credentials
echo "BIGDATA_API_KEY=your_api_key_here" > .env
echo "OPENAI_API_KEY=your_openai_api_key_here" >> .env # Required to run the Advanced Workflows# Install Jupyter if not included in requirements
uv pip install jupyterlab
# Start Jupyter notebook
jupyter lab tutorial_notebook.ipynb

The interactive tutorial covers:
📚 Fundamentals
- Setting up authentication and connections
- Basic search functionality with search_by_companies()
- Custom query building with run_search()
🔍 Key Features Demonstrated
- Company-specific document searches
- Custom query construction and execution
- Result processing and analysis
- DataFrame export and manipulation
💡 Learning Outcomes
- Understand core library concepts
- See practical, working examples
- Get hands-on experience with real data
🚀 Next Steps

After completing the tutorial, you'll be ready to:
- Explore the advanced workflows (NarrativeMiner, ThematicScreener, RiskAnalyzer)
- Run the complete examples in the examples/ directory
- Build custom analysis workflows for your specific use cases and explore our Bigdata Cookbook, which features a collection of ready-to-use notebooks for a variety of finance-related guided workflows.
If you prefer not to use uv, you can also use traditional pip:
# Create virtual environment
python -m venv tutorial_env
source tutorial_env/bin/activate # On Windows: tutorial_env\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -e ..
# Launch notebook
jupyter notebook tutorial_notebook.ipynb

The library includes several complete examples in the examples/ directory.
File: examples/narrative_miner.py
What it does: Tracks AI-related narratives across transcripts
# Run the narrative miner example
cd examples
python narrative_miner.py

Expected output:
Environment variables loaded: True
INFO:bigdata_research_tools:Starting narrative mining...
INFO:bigdata_research_tools:Processing 15 narrative sentences...
INFO:bigdata_research_tools:Analysis complete. Results saved to narrative_miner_sample.xlsx
File: examples/thematic_screener.py
What it does: Analyzes companies' exposure to "Chip Manufacturers" theme
# Run the thematic screener example
python thematic_screener.py

Expected output:
Environment variables loaded: True
INFO:bigdata_research_tools:Generating theme tree for: Chip Manufacturers
INFO:bigdata_research_tools:Screening 50 companies...
INFO:bigdata_research_tools:Creating visualizations...
# Browser opens with interactive dashboard
File: examples/risk_analyzer.py
What it does: Assesses risk exposure to US import tariffs
# Run the risk analyzer example
python risk_analyzer.py

Expected output:
Environment variables loaded: True
INFO:bigdata_research_tools:Creating risk taxonomy...
INFO:bigdata_research_tools:Analyzing risk exposure...
INFO:bigdata_research_tools:Risk analysis complete. Results saved to risk_analyzer_results.xlsx
# Browser opens with risk dashboard
File: examples/query_builder.py
What it does: Demonstrates advanced query construction techniques
# Run the query builder example
python query_builder.py

Expected output:
INFO:__main__:======================================
INFO:__main__:TEST 1: Basic EntityConfig with Auto-batching
INFO:__main__:Generated 2 query components
INFO:__main__:Sample query structure: [QueryComponent(...)]
File: examples/portfolio_example.py
What it does: Shows different portfolio construction methods
# Run the portfolio constructor example
python portfolio_example.py

Expected output:
INFO:__main__:======================================
INFO:__main__:EXAMPLE 1: Basic Equal-Weighted Portfolio (Sector Balanced)
INFO:__main__:Portfolio Size: 20 companies
INFO:__main__:Sectors Represented: 5
File: examples/search_by_companies.py
What it does: Shows how to search for documents mentioning specific companies and topics
# Run the search by companies example
python search_by_companies.py

Expected output:
Environment variables loaded: True
INFO:__main__:Found: Apple Inc (ID: D8442C)
INFO:__main__:Found: Microsoft Corporation (ID: D4A6CC)
INFO:__main__:Found 24 relevant documents
INFO:__main__: Apple Inc: 15 documents
INFO:__main__: Microsoft Corporation: 9 documents
# Results exported to search_by_companies_results.xlsx
File: examples/run_search.py
What it does: Demonstrates custom query building and search execution
# Run the run_search example
python run_search.py

Expected output:
Environment variables loaded: True
INFO:__main__:Generated 4 search queries
INFO:__main__:Searching across 3 time periods
INFO:__main__:Found 32 documents total
INFO:__main__: Reuters: 12 documents
INFO:__main__: Bloomberg: 8 documents
# Results exported to run_search_results.xlsx
- Documentation: https://docs.bigdata.com
- API Reference: Check the docs/ directory for detailed API documentation
- Examples: See the examples/ directory for complete working examples
- Issues: Report issues through support@bigdata.com
This project uses ruff for linting and formatting and ty as a type checker. To ensure your code adheres to the project's style guidelines, run the following commands before committing your changes:
make type-check
make lint
make format

This software is licensed for use solely under the terms agreed upon in the applicable Master Agreement and Order Schedule between the parties. For trials, the applicable legal documents are the Mutual Non-Disclosure Agreement or, if applicable, the Trial Agreement. No other rights or licenses are granted by implication, estoppel, or otherwise. For further details, please refer to your specific Master Agreement and Order Schedule or contact us at legal@ravenpack.com.
RavenPack | Bigdata.com
All rights reserved © 2025