This repository was archived by the owner on Nov 26, 2020. It is now read-only.
Open
18 changes: 9 additions & 9 deletions README.md
@@ -1,5 +1,5 @@
A Python wrapper for MADlib - an open source library for scalable in-database machine learning algorithms.
-You can visit [PyMADlib's webpage](http://pivotalsoftware.github.io/pymadlib/) for installation and usage tutorials.
+You can visit [PyMADlib's webpage](https://pivotalsoftware.github.io/pymadlib/) for installation and usage tutorials.

## Algorithms

@@ -11,14 +11,14 @@ PyMADlib currently has wrappers for the following algorithms in MADlib (version
1. K-Means
1. LDA

-Refer [MADlib User Docs](http://doc.madlib.net/v0.5/ ) for MADlib's user documentation. Please note that PyMADlib as of now is only compatible with MADlib v0.5. You can obtain MADlib v0.5 from [MADlib v0.5](https://github.com/madlib/madlib/archive/v0.5.tar.gz). We might add support to more recent versions of MADlib depending on adoption rate. Please email me if you have a strong case for an upgrade.
+Refer [MADlib User Docs](https://madlib.apache.org/docs/v0.5/ ) for MADlib's user documentation. Please note that PyMADlib as of now is only compatible with MADlib v0.5. You can obtain MADlib v0.5 from [MADlib v0.5](https://github.com/madlib/madlib/archive/v0.5.tar.gz). We might add support to more recent versions of MADlib depending on adoption rate. Please email me if you have a strong case for an upgrade.


## Dependencies

1. You'll need the python extension _**psycopg2**_ to use PyMADlib.
1. If you have matplotlib installed, you'll see Matplotlib visualizations for Linear Regression demo.
-1. If you have installed [networkx](http://networkx.github.com/download.html), you'll see a visualization of the k-means demo
+1. If you have installed [networkx](https://networkx.github.com/download.html), you'll see a visualization of the k-means demo
1. [PyROC](https://github.com/marcelcaraciolo/PyROC) is included in the source of this distribution with permission from its developer. You'll see a visualization of the ROC curves for Logistic Regression.
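
The optional dependencies listed above (matplotlib and networkx only enable extra visualizations) could be probed at runtime along these lines. This is a minimal sketch in Python 3, not PyMADlib's own code: `is_installed` and `available_visualizations` are hypothetical helper names, and the project itself targets Python 2.

```python
import importlib.util

# Optional visualization packages named in the Dependencies list.
OPTIONAL_PACKAGES = ("matplotlib", "networkx")


def is_installed(module_name):
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def available_visualizations(packages=OPTIONAL_PACKAGES):
    """Map each optional package name to whether it is importable."""
    return {name: is_installed(name) for name in packages}


if __name__ == "__main__":
    for name, present in available_visualizations().items():
        print("{}: {}".format(name, "available" if present else "missing"))
```

A demo script could consult the returned dictionary before attempting a plot, instead of failing on an `ImportError` halfway through.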


@@ -53,7 +53,7 @@ PyMADlib depends on `MADlib`, `psycopg2` and `Pandas`. It is easiest to work wit

## Build Environment Setup on Mac OS X 10.8

-* Download & install [Anaconda-1.9.0-MacOSX-x86_64.pkg] (http://repo.continuum.io/archive/Anaconda-1.9.0-MacOSX-x86_64.pkg)
+* Download & install [Anaconda-1.9.0-MacOSX-x86_64.pkg] (https://repo.continuum.io/archive/Anaconda-1.9.0-MacOSX-x86_64.pkg)

* Open a terminal and check if you have Anaconda Python & the package manager conda

@@ -62,7 +62,7 @@ PyMADlib depends on `MADlib`, `psycopg2` and `Pandas`. It is easiest to work wit
> vatsan-mac$ which conda
> /Users/vatsan/anaconda/bin/conda

-* If you haven't installed PostgreSQL on your Mac already, you'll have to download & install `PostGreSQL` for Mac. This is so that we get some required libraries to compile the SQL Engine: psycopg2. The easiest way to install `PostGreSQL` on Mac is via `http://postgresapp.com/`. Once you've downloaded and installed PostGreSQL on Mac, it should typically be found under `/Library/PostgreSQL`
+* If you haven't installed PostgreSQL on your Mac already, you'll have to download & install `PostGreSQL` for Mac. This is so that we get some required libraries to compile the SQL Engine: psycopg2. The easiest way to install `PostGreSQL` on Mac is via `https://postgresapp.com/`. Once you've downloaded and installed PostGreSQL on Mac, it should typically be found under `/Library/PostgreSQL`

> vatsan-mac$ ls /Library/PostgreSQL/9.2/
> Library include pg_env.sh uninstall-postgresql.app
@@ -98,7 +98,7 @@ If the above command did not error out, then installation was successful.

## Usage Tutorial

-Visit [PyMADlib Tutorial](http://nbviewer.ipython.org/gist/vatsan/dd88abb47c2fbd9e16bd) for a tutorial on using PyMADlib
+Visit [PyMADlib Tutorial](https://nbviewer.ipython.org/gist/vatsan/dd88abb47c2fbd9e16bd) for a tutorial on using PyMADlib
Also visit [PyMADlib IPython NB](https://gist.github.com/vatsan/dd88abb47c2fbd9e16bd) to download the IPython NB tutorial


@@ -137,9 +137,9 @@ Remember to close the Matplotlib windows that pop-up to continue with the rest o

PyMADlib packages publicly available datasets from the UCI machine learning repository and other sources.

-1. [Wine quality dataset from UCI Machine Learning repository](http://archive.ics.uci.edu/ml/datasets/Wine+Quality)
-1. [Auto MPG dataset from UCI ML repository from UCI Machine Learning repository](http://archive.ics.uci.edu/ml/datasets/Auto+MPG)
-1. [Wine quality dataset from UCI Machine Learning repository](http://archive.ics.uci.edu/ml/datasets/Wine+Quality)
+1. [Wine quality dataset from UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/Wine+Quality)
+1. [Auto MPG dataset from UCI ML repository from UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG)
+1. [Wine quality dataset from UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/Wine+Quality)
1. Obama-Romney second presidential debate (2012) transcripts


10 changes: 5 additions & 5 deletions README.txt
@@ -3,14 +3,14 @@ Python wrapper for MADlib
Srivatsan Ramanujam <vatsan.cs@utexas.edu>, 3 Jan 2013
This currently implements Linear regression, Logistic Regression,
SVM (regression & classification), K-Means and LDA algorithms of MADlib.
-Refer : http://doc.madlib.net/v0.5/ for MADlib's user documentation.
+Refer : https://madlib.apache.org/docs/v0.5/ for MADlib's user documentation.
================================================================================

Dependencies :
===============
You'll need the python extension : psycopg2 to use PyMADlib.
(i) If you have matplotlib installed, you'll see Matplotlib visualizations for Linear Regression demo.
-(ii) If you have installed networkx (http://networkx.github.com/download.html), you'll see a visualization of the k-means demo
+(ii) If you have installed networkx (https://networkx.github.com/download.html), you'll see a visualization of the k-means demo
(iii) PyROC (https://github.com/marcelcaraciolo/PyROC) is included in the source of this distribution with permission from its developer. You'll see a visualization of the ROC curves for Logistic Regression.

Configurations:
@@ -56,8 +56,8 @@ Datasets packaged with this installation :
=========================================
PyMADlib packages publicly available datasets from the UCI machine learning repository and other sources.

-1) Wine quality dataset from UCI Machine Learning repository : http://archive.ics.uci.edu/ml/datasets/Wine+Quality
-2) Auto MPG dataset from UCI ML repository : http://archive.ics.uci.edu/ml/datasets/Auto+MPG
+1) Wine quality dataset from UCI Machine Learning repository : https://archive.ics.uci.edu/ml/datasets/Wine+Quality
+2) Auto MPG dataset from UCI ML repository : https://archive.ics.uci.edu/ml/datasets/Auto+MPG
3) Obama-Romney second presidential debate (2012) transcripts for the LDA models.


@@ -71,6 +71,6 @@ with installing psycopg2.
Here are some blogs which discuss the issue and offer solutions:

http://hardlifeofapo.com/psycopg2-and-postgresql-9-1-on-snow-leopard/
-http://www.initd.org/psycopg/articles/2010/11/11/links-about-building-psycopg-mac-os-x/
+https://www.initd.org/psycopg/articles/2010/11/11/links-about-building-psycopg-mac-os-x/


2 changes: 1 addition & 1 deletion pymadlib/doc/PyMADlib Tutorial.ipynb
@@ -28,7 +28,7 @@
"1. K-Means \n",
"1. LDA \n",
"\n",
-"Refer [MADlib User Docs](http://doc.madlib.net/v0.5/ ) for MADlib's user documentation.\n",
+"Refer [MADlib User Docs](https://madlib.apache.org/docs/v0.5/ ) for MADlib's user documentation.\n",
"\n",
"We can employ it to push the heavy number crunching to MADlib, while allowing us to work with awesomeness of Python in the front end."
]
2 changes: 1 addition & 1 deletion pymadlib/example.py
@@ -86,7 +86,7 @@ def linearRegressionDemo(conn):
smat = scatter_matrix(predictions.get(['quality','prediction']), diagonal='kde')

# 1 b) Linear Regression with categorical variables
-# We'll use the auto_mpg dataset from UCI : http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
+# We'll use the auto_mpg dataset from UCI : https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
# make, fuel_type, fuel_system are all categorical variables, rest are real.
#Train Linear Regression Model on a mixture of Numeric and Categorical Variables
mdl_dict, mdl_params = lreg.train('public.auto_mpg_train',['1','height','width','length','highway_mpg','engine_size','make','fuel_type','fuel_system'],'price')
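
For context on the feature list in the `lreg.train` call above: MADlib's linear regression takes its independent variables as a single SQL array expression, and a literal `1` in that array serves as the intercept term. The sketch below shows how such a list might be turned into a query string. It is an assumption for illustration only: `build_linregr_sql` is a hypothetical helper, not PyMADlib's implementation, and it covers numeric columns only, since categorical columns such as `make` would need to be dummy coded first.

```python
def build_linregr_sql(table, features, label, schema="madlib"):
    """Build a MADlib-style linear regression query (illustrative only).

    table    : source table name, e.g. 'public.auto_mpg_train'
    features : numeric column names; '1' stands for the intercept term
    label    : dependent variable column
    NOTE: identifiers are interpolated as-is, so this is not safe for
    untrusted input.
    """
    if not features:
        raise ValueError("at least one feature is required")
    array_expr = "ARRAY[{}]".format(", ".join(features))
    return "SELECT ({}.linregr({}, {})).* FROM {}".format(
        schema, label, array_expr, table
    )


if __name__ == "__main__":
    print(build_linregr_sql("public.auto_mpg_train", ["1", "height", "width"], "price"))
```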
12 changes: 6 additions & 6 deletions pymadlib/pymadlib.py
@@ -7,7 +7,7 @@
3) SVM (regression & classification) and
4) K-Means &
5) PLDA
-Refer : http://doc.madlib.net/v0.5/ for MADlib's user documentation.
+Refer : https://madlib.apache.org/docs/v0.5/ for MADlib's user documentation.
'''
from utils import pivotCategoricalColumns, convertsColsToArray
import psycopg2
@@ -98,7 +98,7 @@ def predict(self, *args):
class LinearRegression(SupervisedLearning):
'''
Python Wrapper to invoke MADlib's Linear Regression Algorithm
-http://doc.madlib.net/v0.5/group__grp__linreg.html
+https://madlib.apache.org/docs/v0.5/group__grp__linreg.html
'''
def __init__(self,conn):
super(LinearRegression,self).__init__(conn)
@@ -184,7 +184,7 @@ def predict(self, predict_table_name, actual_label_col=''):
class LogisticRegression(SupervisedLearning):
'''
Python Wrapper to invoke MADlib's Logistic Regression Algorithm
-http://doc.madlib.net/v0.5/group__grp__logreg.html
+https://madlib.apache.org/docs/v0.5/group__grp__logreg.html
'''
def __init__(self,conn):
super(LogisticRegression,self).__init__(conn)
@@ -293,7 +293,7 @@ def predict(self, predict_table_name,actual_label_col='',threshold=0.5):
class SVM(SupervisedLearning):
'''
Python Wrapper to invoke MADlib's SVM Algorithm
-http://doc.madlib.net/v0.5/group__grp__kernmach.html
+https://madlib.apache.org/docs/v0.5/group__grp__kernmach.html
'''
def __init__(self,conn):
super(SVM,self).__init__(conn)
@@ -494,7 +494,7 @@ def predict_batch(self, predict_table, output_table, id_col, data_col):
class KMeans(object):
'''
Python Wrapper to invoke MADlib's KMeans Algorithm
-http://doc.madlib.net/v0.5/group__grp__kmeans.html
+https://madlib.apache.org/docs/v0.5/group__grp__kmeans.html
'''
def __init__(self,conn):
self.dbconn = conn
@@ -611,7 +611,7 @@ def generateClusters(
class PLDA(object):
'''
Python Wrapper to invoke MADlib's PLDA Algorithm
-http://doc.madlib.net/v0.5/group__grp__plda.html
+https://madlib.apache.org/docs/v0.5/group__grp__plda.html
'''
def __init__(self,conn):
self.dbconn = conn
2 changes: 1 addition & 1 deletion pymadlib/pyroc.py
@@ -351,7 +351,7 @@ def _calculate_counts(self,pos_data,neg_data):
if __name__ == '__main__':
print "PyRoC - ROC Curve Generator"
print "By Marcel Pinheiro Caraciolo (@marcelcaraciolo)"
-print "http://aimotion.bogspot.com\n"
+print "http://ww1.bogspot.com\n"
from optparse import OptionParser

parser = OptionParser()
4 changes: 2 additions & 2 deletions pymadlib/utils.py
@@ -184,7 +184,7 @@ def __getColNamesAndTypesList__(cols,col_types_dict, col_distinct_vals_dict):
'''
Return a list of column names and types, where any categorical column in the original table have
been 'binarized'. Dummy coding is used to convert categorical columns into dummy variables.
-Refer: http://en.wikipedia.org/wiki/Categorical_variable#Dummy_coding
+Refer: https://en.wikipedia.org/wiki/Categorical_variable#Dummy_coding

Inputs:
=======
@@ -278,7 +278,7 @@ def pivotCategoricalColumns(conn,table_name,cols,label='',col_distinct_vals_dict
Take a table_name and a set of columns (some of which may be categorical
and return a new table, where the categorical columns have been pivoted.
This method uses the "Dummy Coding" approach:
-http://en.wikipedia.org/wiki/Categorical_variable#Dummy_coding
+https://en.wikipedia.org/wiki/Categorical_variable#Dummy_coding

Inputs:
=======
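
The "Dummy Coding" approach named in the two docstrings above can be illustrated in plain Python. This is a sketch of the general technique (k-1 indicator columns, with one level held out as the reference), not the SQL that `pivotCategoricalColumns` generates. Whether PyMADlib drops a reference level is not visible in this diff, and the `dummy_code` name and the sample rows are made up for illustration.

```python
def dummy_code(rows, column, levels=None):
    """Replace a categorical column with k-1 indicator (0/1) columns.

    rows   : list of dicts, one per table row
    column : name of the categorical column to pivot
    levels : optional ordered list of distinct values; the first entry is
             the reference level and gets no indicator column
    """
    if levels is None:
        levels = sorted({row[column] for row in rows})
    non_reference = levels[1:]
    coded = []
    for row in rows:
        # Copy every column except the categorical one being pivoted.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in non_reference:
            new_row["{}_{}".format(column, level)] = int(row[column] == level)
        coded.append(new_row)
    return coded


if __name__ == "__main__":
    sample = [
        {"price": 100, "fuel_type": "gas"},
        {"price": 200, "fuel_type": "diesel"},
    ]
    print(dummy_code(sample, "fuel_type"))
```

With the two sample rows, `diesel` sorts first and becomes the reference level, so the only new column is `fuel_type_gas`.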
8 changes: 4 additions & 4 deletions setup.py
@@ -10,8 +10,8 @@
'./dist', 'EGG-INFO', '*.egg-info')


-# (c) 2005 Ian Bicking and contributors; written for Paste (http://pythonpaste.org)
-# Licensed under the MIT license: http://www.opensource.org/licenses/mit-license.php
+# (c) 2005 Ian Bicking and contributors; written for Paste (https://web.archive.org/web/http%3A//pythonpaste.org/)
+# Licensed under the MIT license: https://www.opensource.org/licenses/mit-license.php
# Note: you may want to copy this into your setup.py file verbatim, as
# you can't import this from another package, when you don't know if
# that package is installed yet.
@@ -98,12 +98,12 @@ def find_package_data(
version='1.0',
author='Srivatsan Ramanujam',
author_email='vatsan.cs@utexas.edu',
-url='http://vatsan.github.com/pymadlib',
+url='https://vatsan.github.com/pymadlib',
packages=find_packages(),
package_data=find_package_data(only_in_packages=False,show_ignored=True),
include_package_data=True,
license='LICENSE.txt',
-description='A Python wrapper for MADlib (http://madlib.net) - an open source library for scalable in-database machine learning algorithms',
+description='A Python wrapper for MADlib (https://madlib.apache.org/) - an open source library for scalable in-database machine learning algorithms',
long_description=open('README.txt').read(),
install_requires=[
"psycopg2 >= 2.4.5",
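
Most edits in this diff swap `http://` for `https://` on hosts that serve HTTPS. A rewrite of that kind could be scripted roughly as below. This is a sketch under assumptions: the host allowlist is hand-picked from URLs changed in this diff, `upgrade_urls` is a hypothetical name, and the cases where the host itself changed (for example `doc.madlib.net` to `madlib.apache.org/docs`) are not handled.

```python
import re

# Hosts assumed to serve HTTPS; picked from URLs changed in this diff.
HTTPS_HOSTS = frozenset([
    "archive.ics.uci.edu",
    "en.wikipedia.org",
    "postgresapp.com",
])

# Capture the host part of an http:// URL (stops at '/', ':' or whitespace).
HTTP_URL = re.compile(r"http://([A-Za-z0-9.-]+)")


def upgrade_urls(text, hosts=HTTPS_HOSTS):
    """Rewrite http:// to https:// for allowlisted hosts; leave others alone."""
    def swap(match):
        host = match.group(1)
        if host in hosts:
            return "https://" + host
        return match.group(0)
    return HTTP_URL.sub(swap, text)


if __name__ == "__main__":
    line = "Refer: http://en.wikipedia.org/wiki/Categorical_variable#Dummy_coding"
    print(upgrade_urls(line))
```

Keeping an explicit allowlist avoids rewriting links to hosts that do not answer over HTTPS, such as the `hardlifeofapo.com` link that this diff leaves on `http://`.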