diff --git a/.gitignore b/.gitignore
index 2d86b63..a64d95f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,3 +3,4 @@ Hadoop.egg-info/
 __pycache__
 build/
 *.pyc
+html
diff --git a/README.md b/README.md
index 7065bbe..340e447 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,40 @@
 # Python-Hadoop
-This library was cloned from [here](https://github.com/matteobertozzi/Hadoop/tree/master/python-hadoop). There have been modifications to make it work with Python3.
+A pure Python [Hadoop SequenceFile](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/SequenceFile.html) Reader and Writer implementation with no Java dependency.
 
-### Original README
-Pure Python SequenceFile Reader and Writer implementation
-that allows you to read and write your Hadoop sequence files
-without using java.
+# Installation
 
-Author: Matteo Bertozzi
+From source
+```bash
+pip install .
+```
 
-Contributors:
+# API Documentation
 
- * Brian Bloniarz
- * Alex Roper
- * Jeremy G. Kahn
+```bash
+# install pdoc - https://pdoc3.github.io/pdoc/
+pdoc --html hadoop --skip-errors # skip errors due to broken dependency on pydoop (see below)
+open html/hadoop/index.html
+```
 
+# Reading and writing SequenceFiles
+
+[hadoop/io/SequenceFile](https://github.com/opaque-systems/sequencefile/blob/master/hadoop/io/SequenceFile.py) provides Reader, Writer and Metadata interfaces. See the examples below for more details.
+
+# HDFS Integration
+
+**Is currently broken as the underlying pydoop library is unmaintained**
+
+The goal of [hadoop.pydoop.reader.SequenceFileReader](https://github.com/opaque-systems/sequencefile/blob/master/hadoop/pydoop/reader.py) is to read sequence files from HDFS. This leveraged an extra dependency on [pydoop](http://crs4.github.io/pydoop/index.html).
+
+# Examples
+
+Several [examples](https://github.com/opaque-systems/sequencefile/tree/master/examples) are provided to demonstrate API usage.
+With the exception of the [SequenceFileReader example](https://github.com/opaque-systems/sequencefile/blob/master/examples/SequenceFileReader.py), all examples are self-contained.
+
+# Credits
+
+This source was originally cloned from the [matteobertozzi/Hadoop python-hadoop](https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/README) source tree and then modified for Python3 compatibility.
+
+# License
+
+[Apache License v2](https://github.com/opaque-systems/sequencefile/blob/master/LICENSE), a copy of which ships with this codebase.