A DataVil project.
FrameX is a lightweight dataset-fetching library for fast prototyping, tutorial creation, and experimentation. FrameX currently has over 80 datasets available.
Built on top of Polars.
To get started, install the library with:
pip install framex
Then load a dataset:
import framex as fx
iris = fx.load("iris")
which returns a polars DataFrame.
Therefore, you can use all the polars functions and methods on the returned DataFrame.
iris.head()
shape: (5, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f32          ┆ f32         ┆ f32          ┆ f32         ┆ str     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
To load a dataset lazily instead, pass lazy=True:
iris = fx.load("iris", lazy=True)
which returns a polars LazyFrame.
Both of these operations create local copies of the datasets by default (cache=True).
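When a tutorial needs several datasets at once, the same call can be wrapped in a small helper. The following is only a sketch: `load_many` and `fake_loader` are hypothetical names that are not part of FrameX, and the loader is passed in as an argument so the example runs without the library or a network connection. It assumes the loader accepts a `lazy` keyword in the way `fx.load` is shown to above.

```python
from typing import Any, Callable, Dict, Iterable


def load_many(
    names: Iterable[str],
    loader: Callable[..., Any],
    lazy: bool = False,
) -> Dict[str, Any]:
    """Load several datasets and return them keyed by name.

    `loader` is expected to behave like `fx.load(name, lazy=...)`.
    It is injected (not imported here) so the sketch runs offline.
    """
    return {name: loader(name, lazy=lazy) for name in names}


# Stand-in loader for demonstration only; it is NOT part of FrameX.
def fake_loader(name: str, lazy: bool = False) -> str:
    kind = "LazyFrame" if lazy else "DataFrame"
    return f"<{kind} {name}>"


frames = load_many(["iris", "mpg"], fake_loader, lazy=True)
print(frames)
```

With FrameX installed, calling `load_many(["iris", "mpg"], fx.load)` should give a dictionary of polars frames instead of the placeholder strings.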
To see the list of available datasets, run:
fx.available()
which returns a dictionary of both locally and remotely available datasets:
{'remote': ['iris', 'mpg', 'netflix', 'starbucks', 'titanic'], 'local': ['titanic']}
(Output shortened for clarity.)
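Because the result is a plain dictionary, ordinary Python is enough to answer questions such as "which datasets have not been cached yet?". A minimal sketch, using the shortened sample output from this section as literal data rather than a live call; the helper name `not_cached` is hypothetical.

```python
from typing import Dict, List

# Shortened sample output copied from the text (not a live fx.available() call).
available: Dict[str, List[str]] = {
    "remote": ["iris", "mpg", "netflix", "starbucks", "titanic"],
    "local": ["titanic"],
}


def not_cached(listing: Dict[str, List[str]]) -> List[str]:
    """Return remote dataset names that have no local copy, in remote order."""
    local = set(listing.get("local", []))
    return [name for name in listing.get("remote", []) if name not in local]


print(not_cached(available))  # ['iris', 'mpg', 'netflix', 'starbucks']
```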
To see only local or remote datasets, run:
fx.available("local")
fx.available("remote")
{'local': ['titanic']}
{'remote': ['iris', 'mpg', 'netflix', 'starbucks', 'titanic']}
To get information on a dataset, run:
fx.about("mpg") # basically the same as `fx.about("mpg", mode="print")`
which prints the dataset information as follows:
NAME : mpg
SOURCE : https://www.kaggle.com/datasets/uciml/autompg-dataset
LICENSE : CC0: Public Domain
ORIGIN : Kaggle
OG NAME : autompg-dataset
Or you can get the information as a single-row polars.DataFrame by running:
row = fx.about("mpg", mode="row")
print(row)
which prints the dataset information as a table:
shape: (1, 4)
┌──────┬─────────────────────────────────┬────────────────────┬────────┐
│ name ┆ source                          ┆ license            ┆ origin │
│ ---  ┆ ---                             ┆ ---                ┆ ---    │
│ str  ┆ str                             ┆ str                ┆ str    │
╞══════╪═════════════════════════════════╪════════════════════╪════════╡
│ mpg  ┆ https://www.kaggle.com/dataset… ┆ CC0: Public Domain ┆ Kaggle │
└──────┴─────────────────────────────────┴────────────────────┴────────┘
Alternatively, you can simply treat row as a polars DataFrame in your code.
In case you need the file links:
url_pokemon = fx.get_url("pokemon")
By default, the format is "feather".
Optionally, you can specify the format of the dataset:
url_pokemon_csv = fx.get_url("pokemon", format="csv")
The framex CLI has a slight overhead of around 400 milliseconds due to imports. However, operations still take less than a second unless bottlenecked by the download speed.
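To check the start-up overhead on your own machine, you can time a command from Python. This is only a rough sketch: `time_command` is a hypothetical helper, and the command timed here is a do-nothing Python process used as a stand-in so the example runs even where the `fx` CLI is not installed.

```python
import subprocess
import sys
import time
from typing import List


def time_command(cmd: List[str]) -> float:
    """Run `cmd` once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start


# Stand-in command: an interpreter that starts and exits immediately.
baseline = time_command([sys.executable, "-c", "pass"])
print(f"baseline interpreter start-up: {baseline * 1000:.0f} ms")
```

With the CLI installed, passing `["fx", "-h"]` to `time_command` and comparing against the baseline should give a rough idea of the import overhead; a single run is noisy, so repeating the measurement a few times is advisable.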
To see all the available commands, run:
fx -h
Get a single dataset (to the current directory):
fx get iris
or get multiple datasets:
fx get iris mpg titanic
which will download the dataset(s) to the current directory.
To get the datasets into the cache directory:
fx get iris mpg titanic --cache
or to a specific directory:
fx get iris mpg titanic --dir data
To get the names of the available datasets on the remote server:
fx list
This will list all available datasets on the remote server.
To get the names of the available datasets that include "dia":
fx list dia
Locally available datasets: (feather, parquet, csv, other)
Remote datasets:
diamonds
To get information on a dataset or datasets, run:
fx about mpg iris
To show a preview of a single dataset:
fx show iris
To describe (or summarize) a dataset:
fx describe iris
For more parameters:
fx get --help
Bring a dataset to the current directory from the cache:
fx bring iris
or bring multiple datasets:
fx bring iris mpg titanic
which will bring the dataset(s) to the current directory from the cache directory.
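The CLI commands above can also be assembled from a script, for example when preparing data for a tutorial. A minimal sketch, using only the `fx get` flags shown in this document (`--cache` and `--dir`); `build_get_command` is a hypothetical helper, and rejecting the combination of both flags is an assumption made here, since this text does not say how the CLI treats it.

```python
from typing import List, Optional, Sequence


def build_get_command(
    names: Sequence[str],
    cache: bool = False,
    directory: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `fx get` from the flags shown above."""
    if not names:
        raise ValueError("at least one dataset name is required")
    if cache and directory is not None:
        # Assumption: combining --cache and --dir is undocumented here, so refuse it.
        raise ValueError("choose either cache=True or directory, not both")
    cmd = ["fx", "get", *names]
    if cache:
        cmd.append("--cache")
    if directory is not None:
        cmd.extend(["--dir", directory])
    return cmd


print(build_get_command(["iris", "mpg", "titanic"], directory="data"))
# ['fx', 'get', 'iris', 'mpg', 'titanic', '--dir', 'data']
```

Once the `fx` CLI is installed and on your PATH, the resulting list can be handed to `subprocess.run(cmd, check=True)`.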
