This repository contains the source code implementation of the paper "Redundancy Elimination in Distributed Matrix Computation", which appeared at SIGMOD 2022.
ReMac is a distributed matrix computation library built on SystemDS 2.0.0. It automatically and adaptively eliminates redundancy in execution plans to improve performance.
- Prerequisite: Java 8.
- Download, set up, and start Hadoop 3.2.2 on your cluster. In particular, ReMac only needs HDFS, so you do not have to configure or deploy YARN. (`config/hadoop/` lists the configuration used in the paper.)
- Download, set up, and start Spark 3.0.1 on your cluster. (`config/spark/` lists the configuration used in the paper.) In `spark-defaults.conf`, you need to specify `spark.driver.memory`, `spark.executor.cores`, and `spark.executor.instances`, which are essential to the Optimizer of ReMac. In addition, we recommend adding `spark.serializer org.apache.spark.serializer.KryoSerializer` to employ the Kryo serialization library, as well as `spark.driver.extraJavaOptions "-Xss256M"` and `spark.executor.extraJavaOptions "-Xss256M"` in case of `java.lang.StackOverflowError`.
- Download the source code of SystemDS 2.0.0 as a zip file (not via `git clone`).
- Replace the original `src` of SystemDS with the `remac/src` in this repository.
- Follow the installation guide of SystemDS to build the project. The build artifact `SystemDS.jar` is in the `target` folder.
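The Spark settings listed above can be collected in `spark-defaults.conf`. Below is a minimal sketch that appends them to the file; the resource values (`20g`, `4`, `8`) are placeholder assumptions you should tune to your cluster, and it assumes `SPARK_HOME` points at your Spark installation:

```shell
# Sketch: append the recommended ReMac settings to spark-defaults.conf.
# SPARK_HOME and the resource values below are placeholders; adjust to your cluster.
CONF_DIR="${SPARK_HOME:-/tmp/spark}/conf"
mkdir -p "$CONF_DIR"
cat >> "$CONF_DIR/spark-defaults.conf" <<'EOF'
# Required by the ReMac Optimizer (example values)
spark.driver.memory       20g
spark.executor.cores      4
spark.executor.instances  8
# Recommended: Kryo serialization
spark.serializer org.apache.spark.serializer.KryoSerializer
# Recommended: larger thread stacks, in case of java.lang.StackOverflowError
spark.driver.extraJavaOptions "-Xss256M"
spark.executor.extraJavaOptions "-Xss256M"
EOF
```

Restart the Spark cluster after editing the file so the new defaults take effect.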
The scripts implementing the algorithms used in our experiments are in the `scripts` folder. The datasets used in our experiments are described in Section 6.1 of the paper. For a quick start, you can use `scripts/data.dml` to generate random datasets with the same metadata as mentioned in the paper.
The command to run ReMac is the same as that of SystemDS. In addition, ReMac provides the following options.
- The default matrix sparsity estimator is MNC. To use the metadata-based estimator, add `-metadata` to the command line.
- ReMac uses the dynamic programming-based method for adaptive elimination by default. To use the enumeration method, add `-optimizer force` to the command line. For example, the command to run the DFP algorithm on the criteo1 dataset with the metadata-based estimator and the enumeration method is

  `spark-submit SystemDS.jar -metadata -optimizer force -stats -f ./scripts/dfp.dml -nvargs name=criteo1`

  (Note: you need to add `spark-submit` to the environment variable `PATH`, and run this command from the `target` directory.)
- By default, ReMac employs the block-wise search for redundancy. To employ the tree-wise search, use the folder `remac-tree_search/src` to override `src` and rebuild the project.