This README provides step-by-step instructions to set up the transpiler project, including installing necessary dependencies, compiling assembly files, and executing compiled programs using QEMU.
Miniconda is used to manage dependencies within a Conda environment.
chmod +x ./getconda_vl.sh
./getconda_vl.sh
- Checks if Miniconda is already installed.
- Prompts the user to remove and reinstall if it exists.
- Updates the package list and installs wget.
- Downloads and installs Miniconda.
- Removes the Miniconda installer file.
- Instructs the user to restart the shell session for changes to take effect.
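The "already installed" check in the first step can be sketched as below. This is a minimal illustration, not the contents of getconda_vl.sh: the function name miniconda_status and the default location $HOME/miniconda3 are assumptions.

```shell
# Sketch only: MINICONDA_DIR is an assumed install location, not taken from getconda_vl.sh.
MINICONDA_DIR="${MINICONDA_DIR:-$HOME/miniconda3}"

# Prints "installed" if the directory holds an executable bin/conda, "missing" otherwise.
miniconda_status() {
    if [ -x "$1/bin/conda" ]; then
        echo "installed"
    else
        echo "missing"
    fi
}

# Example call (hypothetical):
#   miniconda_status "$MINICONDA_DIR"
```

A script could use the printed value to decide whether to prompt for removal before reinstalling.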
This script installs required dependencies for cross-compilation and QEMU execution.
chmod +x Shell_Scripts/getdependencies_vl.sh
Shell_Scripts/getdependencies_vl.sh <project_name>
- Initializes Conda and reloads the shell.
- Creates a Conda environment named crosscompilers with Python 3.9.
- Installs required Conda packages.
- Installs system dependencies via apt.
- Verifies installed packages.
This script installs GMP (GNU Multiple Precision Arithmetic Library) for both RISC-V and ARM.
chmod +x Shell_Scripts/install_gmp.sh
Shell_Scripts/install_gmp.sh
- Downloads and compiles GMP for RISC-V and ARM architectures if missing.
- Installs GMP for both architectures.
- Removes temporary installation files after completion.
The line below copies the necessary Euler files (.c and .txt) from the cloned project-euler-c repository into your local Euler directory ($EULER_DIR):
find "$HOME/transpiler_project/project-euler-c" -type f \( -name "*.c" -o -name "*.txt" \) -exec cp {} "$EULER_DIR/" \;
This script compiles all C and C++ files into both standard and verbose assembly output for both RISC-V and ARM architectures.
chmod +x Shell_Scripts/compile_assembly_vl.sh
Shell_Scripts/compile_assembly_vl.sh <source_directory>
- Prepares directories for storing compiled files.
- Ensures required cross-compilers (riscv64-linux-gnu-gcc and aarch64-linux-gnu-gcc) are installed.
- Compiles C files with gcc and C++ files with g++ for both RISC-V and ARM.
- Generates two types of assembly files:
  - Standard assembly (problemX.arm.s, problemX.risc.s)
  - Verbose assembly (problemX.arm.verbose.s, problemX.risc.verbose.s)
- Validates output files and ensures that no empty assembly files were generated.
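The cross-compiler check and the empty-file validation described above could look roughly like this. The function names missing_tools and empty_asm_files are hypothetical and are not taken from compile_assembly_vl.sh.

```shell
# Prints the name of each listed tool that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Prints every zero-byte .s file under the given directory.
# Returns non-zero if at least one empty file is found.
empty_asm_files() {
    found=$(find "$1" -type f -name '*.s' -size 0)
    if [ -n "$found" ]; then
        echo "$found"
        return 1
    fi
    return 0
}

# Example calls (hypothetical):
#   missing_tools riscv64-linux-gnu-gcc aarch64-linux-gnu-gcc || exit 1
#   empty_asm_files assembly_output || exit 1
```

Printing the offending names rather than only failing makes it easier to see which compiler to install or which source file produced no assembly.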
This script links the compiled assembly files into executable binaries for both RISC-V and ARM.
chmod +x Shell_Scripts/assemble_binary_vl.sh
Shell_Scripts/assemble_binary_vl.sh <source_directory>
- Iterates through all compiled .s assembly files.
- Determines whether the source file is C or C++ and selects the appropriate compiler (gcc or g++).
- Checks if gmp.h or math.h is required and links against -lgmp or -lm accordingly.
- Generates executable binaries (.out files) for both architectures.
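One way to pick the compiler and the linker flags from a source file is sketched below. It is an illustration under assumptions: the function names and the list of C++ extensions are hypothetical, and the real script may detect headers differently.

```shell
# Prints "g++" for C++ sources and "gcc" for everything else, based on the file extension.
# The printed value is meant as the suffix of a cross-compiler name such as riscv64-linux-gnu-gcc.
compiler_for() {
    case "$1" in
        *.cpp|*.cc|*.cxx) echo "g++" ;;
        *) echo "gcc" ;;
    esac
}

# Prints the linker flags implied by the headers a source file includes.
link_flags_for() {
    flags=""
    if grep -q -E '#[[:space:]]*include[[:space:]]*<gmp\.h>' "$1"; then
        flags="$flags -lgmp"
    fi
    if grep -q -E '#[[:space:]]*include[[:space:]]*<math\.h>' "$1"; then
        flags="$flags -lm"
    fi
    # Strip the leading space left by the concatenation above.
    echo "${flags# }"
}

# Example (hypothetical file names):
#   riscv64-linux-gnu-$(compiler_for problem1.c) problem1.risc.s -o problem1.risc.out $(link_flags_for problem1.c)
```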
This step verifies that all the necessary files have been created.
This script runs the compiled executables using QEMU.
chmod +x Shell_Scripts/qemu_execute_vl.sh
Shell_Scripts/qemu_execute_vl.sh <source_directory>
- Ensures names.txt and words.txt are available by downloading them if necessary.
- Iterates over all .out files and runs them using the appropriate QEMU emulator.
- Executes ARM binaries using qemu-aarch64 and RISC-V binaries using qemu-riscv64.
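Choosing the emulator per binary might be sketched as follows. The naming scheme (problemX.arm.out, problemX.risc.out) and the function name qemu_command_for are assumptions made for illustration. The -L sysroot paths are the ones used in the HumanEval commands later in this README.

```shell
# Prints the QEMU command prefix for a binary, chosen from its file name.
# Assumption: binaries are named like problemX.arm.out and problemX.risc.out.
# Returns non-zero for names that match neither pattern.
qemu_command_for() {
    case "$1" in
        *.arm.out)  echo "qemu-aarch64 -L /usr/aarch64-linux-gnu" ;;
        *.risc.out) echo "qemu-riscv64 -L /usr/riscv64-linux-gnu" ;;
        *)          return 1 ;;
    esac
}

# Example loop (hypothetical directory):
#   for bin in assembly_output/*.out; do
#       cmd=$(qemu_command_for "$bin") || continue
#       $cmd "$bin"
#   done
```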
This step runs parse.py to convert the compiled Euler files into JSON.
mkdir -p json_files # Ensure json_files directory exists
python parse.py "$EULER_DIR" "json_files/euler.json" || { echo "Error running parse.py"; exit 1; }
- Gathers all the C and assembly files and converts them into the form needed for Guess and Sketch.
echo "Cloning Unix Commands repository into temporary directory..."
git clone https://github.com/yadu007/Basic-Unix-Commands-Implementation.git "$TEMP_DIR" || { echo "Error cloning repository"; exit 1; }
# Remove .git to avoid submodule issues
rm -rf "$TEMP_DIR/.git"
# Ensure PROJECT_SOURCE exists
mkdir -p "$PROJECT_SOURCE"
# Copy contents instead of moving
cp -r "$TEMP_DIR/"* "$PROJECT_SOURCE/"
# Remove the temporary directory after copying
rm -rf "$TEMP_DIR"
- Ensures a local copy of the Unix command source code is available.
- Uses a temporary folder to avoid issues with cloning another repository and uploading it to git later.
chmod +x Shell_Scripts/install_gmp.sh
Shell_Scripts/install_gmp.sh
chmod +x Shell_Scripts/fix_missing_headers.sh
Shell_Scripts/fix_missing_headers.sh unix_commands/
- Ensures GMP (GNU Multiple Precision Arithmetic Library) is installed for RISC-V and ARM.
- Fixes missing headers to prevent compilation errors.
chmod +x Shell_Scripts/compile_assembly_vl.sh
Shell_Scripts/compile_assembly_vl.sh unix_commands/
- Compiles each C program into both RISC-V and ARM assembly.
- Generates assembly files in assembly_output/.
- Handles both standard and verbose assembly output.
chmod +x Shell_Scripts/assemble_binary_vl.sh
Shell_Scripts/assemble_binary_vl.sh unix_commands/
- Converts compiled assembly files into executable binaries for both architectures.
- Uses gcc or g++ depending on whether the original file was a C or C++ program.
- Links the required libraries (e.g., -lgmp when gmp.h is used, -lm when math.h is used).
chmod +x Shell_Scripts/test_unixcmds_vl.sh
Shell_Scripts/test_unixcmds_vl.sh unix_commands/
- Executes each compiled Unix command within QEMU.
- Runs both ARM and RISC-V binaries to ensure correctness.
- Displays output and verifies functionality.
mkdir -p json_files
python parse.py unix_commands/ json_files/unix_commands.json
- Extracts function structures from each compiled RISC-V and ARM assembly file.
- Matches them with their corresponding C source files.
- Outputs the results into a single JSON file (json_files/unix_commands.json).
- This JSON stores parsed Unix command assembly data separately from Euler problems.
This section details the process for compiling, executing, and generating structured JSONL data for the HumanEval dataset.
Each HumanEval problem consists of two separate C files:
- code.c - Contains the function implementation.
- test.c - Contains test cases and the main function.
We process these files in two ways:
- Combined (code.c + test.c): These files are merged, compiled, executed, and stored for later analysis.
- Standalone (code.c only): These are compiled separately, linked with test.c, executed for correctness, and stored for independent function analysis.
python combine_c_files.py
- Merges code.c and test.c into a single file for later analysis.
- Ensures that test cases are included for execution.
- The resulting combined file is compiled and executed.
- Automatically checks for missing standard C headers and adds them if necessary.
- Ensures compatibility across RISC-V, ARM, and x86 architectures.
- Headers that are conditionally added:
  - stdlib.h (if malloc or free is used)
  - string.h (if strcmp is used)
  - math.h (if ceil, floor, pow, sqrt, fabs, roundf, or round is used)
  - stdio.h (if printf is used)
- Compiles the merged files into assembly and object files.
- Executes the compiled programs for correctness verification.
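The conditional header rules listed above can be approximated with a few grep checks. The sketch below only reports which headers appear to be missing and does not edit the file. The function name missing_headers and the call-detection regex are assumptions, not the project's actual implementation.

```shell
# Prints, one per line, the standard headers a C file appears to need but does not include.
missing_headers() {
    file="$1"
    # Each entry is "header:alternation of functions that require it".
    for entry in \
        'stdlib.h:malloc|free' \
        'string.h:strcmp' \
        'math.h:ceil|floor|pow|sqrt|fabs|roundf|round' \
        'stdio.h:printf'
    do
        header=${entry%%:*}
        funcs=${entry#*:}
        # A call looks like "name(" preceded by line start or a non-identifier character,
        # so that fprintf( does not count as a use of printf(.
        if grep -q -E "(^|[^A-Za-z0-9_])($funcs)[[:space:]]*\(" "$file" &&
           ! grep -q -E "#[[:space:]]*include[[:space:]]*<$header>" "$file"; then
            echo "$header"
        fi
    done
}

# Example call (hypothetical path):
#   missing_headers eval/problem1/code.c
```

A text-based check like this can be fooled by calls inside comments or strings, so it is best treated as a heuristic.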
cd eval
# Generate assembly files
riscv64-linux-gnu-gcc -S problemX.c -o assembly_output/problemX.risc.s
aarch64-linux-gnu-gcc -S problemX.c -o assembly_output/problemX.arm.s
x86_64-linux-gnu-gcc -S problemX.c -o assembly_output/problemX.x86.s
# Generate object files
riscv64-linux-gnu-gcc -c problemX.c -o assembly_output/problemX.risc.o
aarch64-linux-gnu-gcc -c problemX.c -o assembly_output/problemX.arm.o
x86_64-linux-gnu-gcc -c problemX.c -o assembly_output/problemX.x86.o
# Link and execute
riscv64-linux-gnu-gcc assembly_output/problemX.risc.o -o assembly_output/problemX.risc
aarch64-linux-gnu-gcc assembly_output/problemX.arm.o -o assembly_output/problemX.arm
x86_64-linux-gnu-gcc assembly_output/problemX.x86.o -o assembly_output/problemX.x86
qemu-riscv64 -L /usr/riscv64-linux-gnu assembly_output/problemX.risc
qemu-aarch64 -L /usr/aarch64-linux-gnu assembly_output/problemX.arm
qemu-x86_64 -L /usr/x86_64-linux-gnu assembly_output/problemX.x86
- The combined files are executed.
- Results are stored for later analysis.
- Compiles code.c separately, links it with test.c, and executes it.
- If the test cases pass, code.c is stored for later analysis.
for dir in eval/*/; do
    if [ -f "$dir/code.c" ] && [ -f "$dir/test.c" ]; then
        problem_name=$(basename "$dir")
        echo "Compiling and linking test.c for $problem_name..."
        # Check if math functions are used in either file
        if grep -q -E "ceil|floor|pow|sqrt|fabs|roundf|round" "$dir/code.c" || grep -q -E "ceil|floor|pow|sqrt|fabs|roundf|round" "$dir/test.c"; then
            LINK_FLAG="-lm"
        else
            LINK_FLAG=""
        fi
        # Link test.c with code.o and include -lm if needed
        riscv64-linux-gnu-gcc "$dir/test.c" "eval/assembly_output/${problem_name}.risc.o" -o "eval/qemu_test_output/${problem_name}.risc" $LINK_FLAG
        aarch64-linux-gnu-gcc "$dir/test.c" "eval/assembly_output/${problem_name}.arm.o" -o "eval/qemu_test_output/${problem_name}.arm" $LINK_FLAG
        x86_64-linux-gnu-gcc "$dir/test.c" "eval/assembly_output/${problem_name}.x86.o" -o "eval/qemu_test_output/${problem_name}.x86" $LINK_FLAG
    fi
done
- Runs the compiled HumanEval test cases in QEMU for all three architectures.
- If execution is successful, the code.c file is stored for analysis.
for test_executable in eval/qemu_test_output/*; do
    problem_name=$(basename "$test_executable")
    echo "Executing ${problem_name} with QEMU..."
    if [[ "$problem_name" == *.risc ]]; then
        qemu-riscv64 -L /usr/riscv64-linux-gnu "$test_executable"
    elif [[ "$problem_name" == *.arm ]]; then
        qemu-aarch64 -L /usr/aarch64-linux-gnu "$test_executable"
    elif [[ "$problem_name" == *.x86 ]]; then
        qemu-x86_64 -L /usr/x86_64-linux-gnu "$test_executable"
    fi
    echo "Execution complete for $problem_name."
done
- Converts compiled assembly and source code into structured JSONL files.
- Creates two JSONL files:
  - Combined (eval_combined.jsonl): Represents the merged and executed code.c and test.c files.
  - Standalone (eval_standalone.jsonl): Represents code.c in isolation but verified for correctness.
# Combined (code.c + test.c)
python parse.py "$HOME/transpiler_project/eval" "jsonl_files/eval_combined.jsonl"
# Standalone (code.c only, but verified)
python parse.py "$HOME/transpiler_project/eval" "jsonl_files/eval_standalone.jsonl"
- Runs all steps in order, including:
  - Combining C files
  - Fixing missing headers
  - Compiling into assembly and object files
  - Verifying correctness through execution
  - Generating JSONL output
  - Cleaning up temporary files
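Running the steps in order and stopping at the first failure can be sketched as below. This is not the contents of fullsetup.sh: the helper name run_steps is hypothetical, and so are the step commands in the example call.

```shell
# Runs each argument as one command line, in order, and stops at the first failure.
run_steps() {
    for step in "$@"; do
        echo "Running: $step"
        if ! sh -c "$step"; then
            echo "Step failed: $step" >&2
            return 1
        fi
    done
    echo "All steps completed."
}

# Example call (step commands are placeholders):
#   run_steps "python combine_c_files.py" "Shell_Scripts/fix_missing_headers.sh eval/"
```

Stopping early avoids generating JSONL output from a partially compiled set of files.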
chmod +x fullsetup.sh
./fullsetup.sh
python hippo/main.py --sketch \
    --source_lang risc --target_lang arm \
    --predictions_folder unix_commands