jbutcher21/aiclass

Senzing AI Mapping Workshop

This is a hands-on session where you will learn how to map data to Senzing using AI. Each participant should come prepared so we can move quickly and focus on solving your real-world mapping challenges.

Prerequisites

What to bring:

  • Laptop: each participant needs their own laptop (Mac/Windows).
  • AI account: a paid AI subscription (Claude, ChatGPT, GitHub Copilot, Cursor, Google Gemini, Amazon CodeWhisperer, Codeium, or another paid AI assistant). Let us know if you already use another provider and want to use it.
  • Local development environment with AI: you'll need a way to work with AI locally on your machine. Options include an IDE with an AI extension (VS Code + Claude Code/Copilot, JetBrains + AI plugin), an AI-native IDE (Cursor, Windsurf), or a command-line AI tool (Claude Code CLI). This local setup lets you access files directly, run code, execute the linter, and iterate on your mappings throughout the workshop.
  • Create a working folder for workshop files (e.g., ~/bootcamp) and pull this repository into it.
  • Your data file: bring a real dataset you want to map (CSV, JSON, etc.). Aim for a representative sample that’s safe to use in class. If you can’t share production data, bring a small, sanitized sample and put it in the ~/bootcamp directory.
  • Python 3: needed to run the mapping/validation code the AI will generate.
    • Verify: python3 --version (or python --version on Windows).
  • Senzing environment (for final validation): we will load your mapped JSON into Senzing.
    • Install Docker Desktop (Mac/Windows/Linux) and complete the first-run setup.
      • If you cannot install Docker, let us know in advance; we will provide alternatives during the session.
    • Verify Docker is running: docker --version and docker run hello-world
    • Ensure at least 4 GB RAM is allocated to Docker (Settings → Resources).
    • Pull the workshop container image ahead of time (will be available one week before class):
      • docker pull senzing/summit-bootcamp-2025
    • If you can, also run these two commands to get a local AI model (the first starts an Ollama container, the second pulls the model into it):
      • docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama ollama/ollama:latest
      • docker exec -it ollama ollama pull mistral:7b-instruct-q4_K_M
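
The checks above can be partly scripted. The following is a minimal sketch, not part of the workshop materials: the function name preflight and the Python 3.8 floor are assumptions made here (the list above only asks for "Python 3"), and it only confirms that the docker command is on your PATH, not that the Docker daemon is running.

```python
import shutil
import sys


def preflight(min_python=(3, 8)):
    """Return a dict of check name -> bool for the workshop prerequisites."""
    return {
        # The 3.8 floor is an assumption; the workshop only says "Python 3".
        "python_version_ok": sys.version_info >= min_python,
        # shutil.which only confirms the docker CLI is on PATH,
        # not that the Docker daemon is actually running.
        "docker_on_path": shutil.which("docker") is not None,
    }


for check, ok in preflight().items():
    print(f"{check}: {'ok' if ok else 'MISSING'}")
```

You still need docker run hello-world to confirm that Docker itself is up.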

Notes

  • We want you to solve a real problem. Bring a dataset and context so we can map to Senzing in a way that’s meaningful to your use case.
  • Keep sensitive data safe. Prefer samples or de-identified subsets when possible.

What’s Inside

Documents folder

The mapping documentation is maintained in the Senzing/mapper-ai repository.

Employee Data (input and expected outputs)

  • Path: employee_data/
  • Contents:
    • data/us-small-employee-raw.csv: sample input data
    • schema/us-small-employee-schema.csv: inferred schema (from file_analyzer)
    • byhand/*: code and Senzing JSONL generated by hand (current expected result)

Voter Data (input only)

  • Path: voter_data/
  • Contents:
    • data/: sample voter dataset
    • schema/: inferred schema produced for the voter dataset

Company Data (input only)

  • Path: company_data/
  • Contents:
    • data/: sample company dataset
    • schema/: inferred schema produced for the company dataset

Tools

  • File Analyzer (profile files to derive schema and stats):
    • Path: tools/file_analyzer.py
    • Purpose: analyze CSV/JSON/Parquet when a schema doesn’t exist; shows attribute name, inferred type, population %, uniqueness %, and top values.
    • Run: python3 tools/file_analyzer.py path/to/data.csv -o path/to/schema.csv
  • Senzing JSON Linter (schema correctness check):
    • Path: docs/lint_senzing_json.py (local) or fetch from mapper-ai
    • Purpose: validates structure of Senzing JSON/JSONL.
    • Run (file): python3 docs/lint_senzing_json.py path/to/output.jsonl
    • Run (directory): python3 docs/lint_senzing_json.py path/to/dir
  • Senzing JSON Analyzer (validate mapped JSONL before loading):
    • Path: tools/sz_json_analyzer.py
    • Purpose: validates/inspects Senzing JSON/JSONL; highlights mapped vs unmapped attributes, uniqueness/population, warnings, and errors.
    • Run: python3 tools/sz_json_analyzer.py path/to/output.jsonl -o path/to/report.csv
    • Docs: https://github.com/senzing-garage/sz-json-analyzer
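
To make the File Analyzer's statistics concrete, here is a small sketch of how population %, uniqueness %, and top values could be computed for a CSV. It is an illustration only and not the logic of tools/file_analyzer.py; the function name profile_csv and the sample columns are hypothetical, and uniqueness is taken here as distinct values divided by populated values.

```python
import csv
import io
from collections import Counter


def profile_csv(text, top_n=3):
    """Profile each column of CSV text: population %, uniqueness %, top values."""
    rows = list(csv.DictReader(io.StringIO(text)))
    total = len(rows)
    report = {}
    for column in (rows[0].keys() if rows else []):
        values = [(r[column] or "").strip() for r in rows]
        populated = [v for v in values if v]
        counts = Counter(populated)
        report[column] = {
            # Share of rows where the column has a non-empty value.
            "population_pct": round(100 * len(populated) / total, 1),
            # Share of populated values that are distinct.
            "uniqueness_pct": round(100 * len(counts) / len(populated), 1)
            if populated
            else 0.0,
            "top_values": counts.most_common(top_n),
        }
    return report


sample = "emp_id,last_name,state\n1,Smith,NV\n2,Jones,NV\n3,,CA\n4,Smith,\n"
for column, stats in profile_csv(sample).items():
    print(column, stats)
```

A column that is fully populated and fully unique (emp_id in this sample) is a likely record key, while a low-uniqueness column such as state is more likely a descriptive attribute.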

Step-by-Step Guide (Senzing Mapping Assistant)

Data Handling Guidance

  • Best practice: Use schema files, not raw data. Generate a schema with the File Analyzer and map from that. This uses fewer tokens, minimizes data exposure, and keeps your AI focused on the mapping logic.
  • If working locally with your IDE: You can have the AI map directly from raw data files, but it's still recommended to use the File Analyzer first when possible.
  • If the File Analyzer can't handle your file format: Either ask your AI to analyze the file and generate a schema, or write your own code to produce a schema document.
  • Never upload full production datasets to web-based AI. Use schema extracts, field lists, small sanitized samples, or analyzer summaries instead.
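
One way to produce a small sanitized sample is sketched below, assuming a CSV input. The helper names mask_shape and sanitized_sample and the sample columns are hypothetical. Shape masking only hides the characters in the columns you list; every other column passes through unchanged, so review the result before sharing it.

```python
import csv
import io


def mask_shape(value):
    """Replace letters with 'X' and digits with '9', keeping punctuation and length."""
    return "".join(
        "9" if ch.isdigit() else "X" if ch.isalpha() else ch
        for ch in value
    )


def sanitized_sample(text, sensitive_columns, max_rows=20):
    """Return CSV text with at most max_rows rows and sensitive columns masked."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        for column in sensitive_columns:
            row[column] = mask_shape(row[column] or "")
        writer.writerow(row)
    return out.getvalue()


raw = "emp_id,ssn,state\n1,123-45-6789,NV\n2,987-65-4321,CA\n"
print(sanitized_sample(raw, sensitive_columns=["ssn"], max_rows=1))
```

The masked value keeps its format (999-99-9999), which is usually enough for the AI to recognize the kind of identifier without seeing a real one.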

Tips for collaborating with an AI:

  • Ask it questions if you don't understand something. One of my favorites is: "What does the Senzing spec say about that?"
  • If it gives you options, ask it for the pros and cons.
  • Correct it when it gets something wrong. It will learn from you.
  • Keep it on track: AIs hallucinate. See: ChatGPT Common Issues And Solutions

Above all: Don't use it to replace your judgement or expertise. It's just your assistant. You are the decision maker.

Step 1: Create a project folder (if you haven't already)

  • Make a working directory for your data (e.g., ~/bootcamp/my-source).
  • Put your dataset into it (e.g., a data/ subfolder).
  • No dataset? Copy from the aiclass voter_data or company_data folder to your new working directory.

Step 2: Generate a schema (recommended approach)

  • Preferred: Use the File Analyzer to generate a schema from your data:
    • Run: python3 tools/file_analyzer.py path/to/data.csv -o path/to/schema.csv
    • Place the output schema (e.g., schema.csv) in your project (e.g., a schema/ subfolder).
    • Benefits: fewer tokens, less data exposure, better AI focus on mapping logic
  • If you already have an official schema or data dictionary: use that instead and skip this step.
  • If the File Analyzer can't handle your file format:
    • Option A: Ask your AI to analyze the file and generate a schema document
    • Option B: Write your own code to produce a schema
    • Option C (local IDE only): Have the AI map directly from the raw data file
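
For Option B, a schema for nested JSON Lines data could be derived along the following lines. This is a rough sketch with hypothetical function names (flatten_paths, schema_from_jsonl) and hypothetical sample records; it reports, for each dotted attribute path, the percentage of records that populate it at least once.

```python
import json
from collections import Counter


def flatten_paths(value, prefix=""):
    """Yield dotted attribute paths for every populated leaf in a JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            # "[]" marks that the attribute sits inside a list.
            yield from flatten_paths(child, f"{prefix}[]")
    elif value not in (None, ""):
        yield prefix


def schema_from_jsonl(lines):
    """Return {attribute path: % of records that populate it at least once}."""
    counts = Counter()
    total = 0
    for line in lines:
        if not line.strip():
            continue
        total += 1
        counts.update(set(flatten_paths(json.loads(line))))
    return {path: round(100 * n / total, 1) for path, n in sorted(counts.items())}


records = [
    '{"id": 1, "name": {"first": "Ann", "last": "Lee"}, "phones": [{"number": "555-0100"}]}',
    '{"id": 2, "name": {"first": "Bob", "last": ""}, "phones": []}',
]
for path, pct in schema_from_jsonl(records).items():
    print(f"{path}: {pct}%")
```

The resulting path list can be saved as your schema document and handed to the AI in place of the raw file.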

Step 3: Start your mapping session in your IDE

Recommended: Use your local IDE with an AI assistant (VS Code with Claude/Copilot, Cursor, Windsurf, JetBrains with AI plugin, etc.)

This approach gives you direct file access and lets you execute the linter, generate and test code, handle complex multi-file schemas, and iterate on mapper implementations.

  • Open your project folder in your local development environment
  • Fetch the RAG files into your workspace (clone the mapper-ai repo or download them):
    https://raw.githubusercontent.com/Senzing/mapper-ai/main/rag/senzing_mapping_assistant_prompt.md
    https://raw.githubusercontent.com/Senzing/mapper-ai/main/rag/senzing_mapping_examples.md
    https://raw.githubusercontent.com/Senzing/mapper-ai/main/rag/senzing_entity_specification.md
    https://raw.githubusercontent.com/Senzing/mapper-ai/main/rag/lint_senzing_json.py
    https://raw.githubusercontent.com/Senzing/mapper-ai/main/rag/identifier_crosswalk.json
    https://raw.githubusercontent.com/Senzing/mapper-ai/main/rag/usage_type_crosswalk.json
    
  • Configure your AI assistant to use these files as context/knowledge resources
  • Use senzing_mapping_assistant_prompt.md as your system prompt or opening instruction
  • Begin interactive work with your schema and data files
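
If you prefer to download the RAG files rather than clone the repository, a script along these lines would do it. The URLs are the ones listed above; the function names and the rag/ destination folder are assumptions made for this sketch. The download itself is left as a function you call when you are online.

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://raw.githubusercontent.com/Senzing/mapper-ai/main/rag/"
RAG_FILES = [
    "senzing_mapping_assistant_prompt.md",
    "senzing_mapping_examples.md",
    "senzing_entity_specification.md",
    "lint_senzing_json.py",
    "identifier_crosswalk.json",
    "usage_type_crosswalk.json",
]


def plan_downloads(dest_dir="rag"):
    """Return (url, local_path) pairs without touching the network."""
    dest = Path(dest_dir)
    return [(BASE_URL + name, dest / name) for name in RAG_FILES]


def fetch_rag_files(dest_dir="rag"):
    """Download each RAG file into dest_dir, creating the folder if needed."""
    for url, path in plan_downloads(dest_dir):
        path.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            path.write_bytes(response.read())
        print(f"saved {path}")


# Preview what would be fetched; call fetch_rag_files() to actually download.
for url, path in plan_downloads():
    print(url, "->", path)
```

Run fetch_rag_files() from your project folder so the files land next to your schema and data.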

Alternative: Web-based AI chat (if you cannot use a local IDE):

  • Open the Senzing Mapping Assistant GPT, where the mapping docs are preloaded
  • Or create a new project in your AI's web interface and upload the RAG files listed above
  • Note: web-based approaches lack local linter execution and may struggle with complex multi-file schemas

Step 4: Map your schema through to code

  • Provide your schema to the AI assistant and start the mapping process.
  • Collaborate with the assistant to analyze your schema, agree on mappings, produce example JSON/JSONL, and generate a transformer script to emit Senzing JSONL.
  • By the end of this step you should have code. Download it, run it to map your data, and then verify the output with the JSON analyzer in tools (tools/sz_json_analyzer.py).
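
For orientation, the transformer you end up with might look roughly like the sketch below. Everything specific in it is an assumption to verify with your AI assistant, the linter, and senzing_entity_specification.md: the source column names are hypothetical, and the record layout (DATA_SOURCE, RECORD_ID, and a FEATURES list) and the attribute names NAME_FIRST, NAME_LAST, and DATE_OF_BIRTH reflect one reading of the Senzing specification rather than output from the workshop tooling.

```python
import csv
import io
import json

# Hypothetical source columns -> Senzing attribute names (verify against the spec).
NAME_COLUMNS = {"first_name": "NAME_FIRST", "last_name": "NAME_LAST"}
OTHER_COLUMNS = {"dob": "DATE_OF_BIRTH"}


def clean(row, column):
    """Return the stripped value for a column, or '' when missing or empty."""
    return (row.get(column) or "").strip()


def to_senzing_record(row, data_source="EMPLOYEES", id_column="emp_id"):
    """Map one source row to a Senzing-style record, skipping empty values."""
    features = []
    # Name parts are kept together in a single feature object.
    name = {attr: clean(row, col) for col, attr in NAME_COLUMNS.items() if clean(row, col)}
    if name:
        features.append(name)
    for col, attr in OTHER_COLUMNS.items():
        if clean(row, col):
            features.append({attr: clean(row, col)})
    return {
        "DATA_SOURCE": data_source,
        "RECORD_ID": clean(row, id_column),
        "FEATURES": features,
    }


def transform(csv_text):
    """Return one JSON string per input row, ready to write as JSONL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(to_senzing_record(row)) for row in reader]


source = "emp_id,first_name,last_name,dob\n1,Ann,Lee,1990-01-31\n2,Bob,,\n"
for line in transform(source):
    print(line)
```

Empty source values are dropped rather than emitted as empty strings, which keeps the analyzer's population statistics meaningful.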

Step 5: Generate Senzing JSON output

Step 6: Load into Senzing

Note: This part depends on whether you are on Windows, Linux, or Mac, and on whether you have Docker and/or Python 3 installed. If you have trouble with any of this, raise your hand and we will help you.

Here is what you should type:

docker run --rm -it --user 0 -v .:/bootcamp senzing/summit-bootcamp-2025

root@89730121f88b:/# cd /bootcamp
root@89730121f88b:/bootcamp# sz_configtool 

(szcfg) addDataSource EMPLOYEES
(szcfg) addDataSource EMPLOYERS
(szcfg) save
(szcfg) quit

sz_file_loader -f employees/output/employee_senzing.jsonl 

sz_snapshot -o snap1

sz_explorer -s snap1.json 
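
The addDataSource lines above need to match the DATA_SOURCE values in your own output file. As a convenience, a short script could list them for you. This is a sketch with a hypothetical function name; it assumes every record carries a top-level DATA_SOURCE key, and it only prints the commands for you to type into sz_configtool.

```python
import json


def data_source_commands(jsonl_lines):
    """Return one sz_configtool addDataSource line per distinct DATA_SOURCE."""
    codes = set()
    for line in jsonl_lines:
        if line.strip():
            codes.add(json.loads(line)["DATA_SOURCE"])
    return [f"addDataSource {code}" for code in sorted(codes)]


sample = [
    '{"DATA_SOURCE": "EMPLOYEES", "RECORD_ID": "1"}',
    '{"DATA_SOURCE": "EMPLOYERS", "RECORD_ID": "A"}',
    '{"DATA_SOURCE": "EMPLOYEES", "RECORD_ID": "2"}',
]
print("\n".join(data_source_commands(sample)))
```

To use it on a real file, pass an open file handle instead of the sample list, since iterating a text file yields its lines.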
 
