top_bottom_from_apache_log.py - the script parses a standard Apache access log and returns the top/bottom records with the highest/lowest success/failure request ratio, counting only records with at least one failed request. The output fields: , the ratio, the request count.
To run:
spark-submit top_bottom_from_apache_log.py <your_access.log> top|bottom ['{"limit": <limit>, "select": ["<valid_field1>","<valid_field2>"]}']
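For reference, a minimal PySpark sketch of the ratio logic is shown below. It assumes the common Apache combined log format, groups by client host, and treats 2xx/3xx statuses as successes and 4xx/5xx as failures; the regex, the grouping field and the hard-coded limit of 10 results are illustrative assumptions, not taken from the script itself.

    import re
    import sys

    from pyspark.sql import SparkSession

    # common Apache combined log format: host ident user [time] "request" status size ...
    LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+')

    def parse(line):
        m = LOG_RE.match(line)
        if not m:
            return None
        host, status = m.group(1), int(m.group(2))
        ok = 1 if status < 400 else 0          # 2xx/3xx = success, 4xx/5xx = failure
        return (host, (ok, 1 - ok))

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("top_bottom_sketch").getOrCreate()
        lines = spark.sparkContext.textFile(sys.argv[1])
        mode = sys.argv[2] if len(sys.argv) > 2 else "top"

        per_host = (lines.map(parse)
                         .filter(lambda x: x is not None)
                         .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                         .filter(lambda kv: kv[1][1] > 0))       # at least one failed request

        # (host, success/failure ratio, total request count)
        scored = per_host.map(lambda kv: (kv[0],
                                          kv[1][0] / float(kv[1][1]),
                                          kv[1][0] + kv[1][1]))

        for row in scored.sortBy(lambda r: r[1], ascending=(mode == "bottom")).take(10):
            print(row)

        spark.stop()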
kafka-stream-find-word-example.py - this script reads a Kafka stream from the specified broker(s) every 5 seconds. If a message contains the word of interest, it is copied to a file together with the timestamp of when it was received. The output shows how many messages with the word have been received in each 30-second window (sliding every 30 seconds), plus the file with the messages containing the word.
To run:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1 kafka-stream-find-word-example.py server:port <word_of_interest>
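A minimal sketch of the DStream logic under the Kafka 0.8 direct API: 5-second batches, a filter for the word of interest, a file of matching messages with receive timestamps, and a 30-second count window sliding every 30 seconds. The topic name "test", the checkpoint directory and the output file path are illustrative assumptions.

    import sys
    from datetime import datetime

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    if __name__ == "__main__":
        brokers, word = sys.argv[1], sys.argv[2]

        sc = SparkContext(appName="kafka_find_word_sketch")
        ssc = StreamingContext(sc, 5)                       # 5-second batches
        ssc.checkpoint("/tmp/find-word-checkpoint")         # needed by countByWindow

        stream = KafkaUtils.createDirectStream(
            ssc, ["test"], {"metadata.broker.list": brokers})   # topic "test" is assumed

        # keep only the message values that contain the word of interest
        matches = stream.map(lambda kv: kv[1]).filter(lambda msg: word in msg)

        # append each matching message with a receive timestamp to a file on the driver
        def save(rdd):
            if not rdd.isEmpty():
                with open("/tmp/messages_with_word.txt", "a") as f:
                    for msg in rdd.collect():
                        f.write("%s\t%s\n" % (datetime.now().isoformat(), msg))

        matches.foreachRDD(save)

        # how many matching messages arrived in each 30-second window, sliding every 30 seconds
        matches.countByWindow(30, 30).pprint()

        ssc.start()
        ssc.awaitTermination()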
kafka-in-out-example.py - this script reads an unstructured Kafka stream from the specified broker(s) and writes a filtered and modified unstructured stream to . Filtering: only messages starting with '#' go to the output. Modification: if a message contains '"' characters, they are backslash-escaped in the output.
To run:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 kafka-in-out-example.py server:port <checkpoints_dir>
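A minimal Structured Streaming sketch of the filter-and-escape step. The input topic "input", the output topic "output" and the Kafka sink itself are assumptions (the README leaves the output target unspecified); only the '#' filter and the backslash-escaping of '"' come from the description above.

    import sys

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_replace

    if __name__ == "__main__":
        brokers, checkpoints = sys.argv[1], sys.argv[2]

        spark = SparkSession.builder.appName("kafka_in_out_sketch").getOrCreate()

        raw = (spark.readStream
                    .format("kafka")
                    .option("kafka.bootstrap.servers", brokers)
                    .option("subscribe", "input")                # assumed input topic
                    .load())

        # keep only messages starting with '#' and backslash any '"' characters
        out = (raw.selectExpr("CAST(value AS STRING) AS value")
                  .filter(col("value").startswith("#"))
                  .withColumn("value", regexp_replace(col("value"), '"', '\\\\"')))

        query = (out.writeStream
                    .format("kafka")                             # assumed sink
                    .option("kafka.bootstrap.servers", brokers)
                    .option("topic", "output")                   # assumed output topic
                    .option("checkpointLocation", checkpoints)
                    .start())
        query.awaitTermination()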
kafka-to-file.py - this script reads a JSON-structured Kafka stream from the specified broker(s) and writes predefined columns and rows to the specified <data_dir>.
To run:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 kafka-to-file.py server:port <data_dir> <checkpoints_dir>
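A minimal sketch of the same pattern, assuming a hypothetical topic "events", a two-field JSON schema and a Parquet file sink; the real topic, schema, column selection and row filter live in the script.

    import sys

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    if __name__ == "__main__":
        brokers, data_dir, checkpoints = sys.argv[1], sys.argv[2], sys.argv[3]

        spark = SparkSession.builder.appName("kafka_to_file_sketch").getOrCreate()

        # assumed two-field JSON payload
        schema = StructType([StructField("user", StringType()),
                             StructField("action", StringType())])

        events = (spark.readStream
                       .format("kafka")
                       .option("kafka.bootstrap.servers", brokers)
                       .option("subscribe", "events")            # assumed topic
                       .load()
                       .select(from_json(col("value").cast("string"), schema).alias("j"))
                       .select("j.user", "j.action")             # "predefined columns"
                       .where(col("action").isNotNull()))        # example row filter

        query = (events.writeStream
                       .format("parquet")                        # assumed file format
                       .option("path", data_dir)
                       .option("checkpointLocation", checkpoints)
                       .start())
        query.awaitTermination()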
files-to-cassandra-example.py - this script reads the specified <data_dir> and writes the content to the specified Cassandra table.
To run:
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.6 files-to-cassandra-example.py
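A minimal sketch of the files-to-Cassandra step using the connector's DataFrame API. The connection host, input path, file format, keyspace and table names below are placeholders; since the run command takes no arguments, the script presumably configures these in the code.

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = (SparkSession.builder
                             .appName("files_to_cassandra_sketch")
                             .config("spark.cassandra.connection.host", "127.0.0.1")
                             .getOrCreate())

        # read everything from the data directory (Parquet assumed here)
        df = spark.read.parquet("/path/to/data_dir")

        # append the rows to an existing keyspace.table
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="my_keyspace", table="my_table")
           .mode("append")
           .save())

        spark.stop()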