Developers¶
Overview¶
Datakit is a lightweight framework for creating custom data science workflows.
It relies on the Cliff command-line toolkit to provide an extensible system of plugins to support custom workflows.
Core development¶
Here’s how to set up Datakit for local development.
Install Python 3.6 using pyenv and pyenv-virtualenv:
$ pyenv install 3.6.1
Clone datakit-core:
$ git clone git@github.com:associatedpress/datakit-core.git
Create and activate a virtual environment:
$ pyenv virtualenv datakit-core
$ pyenv activate datakit-core
$ cd datakit-core/
Install dependencies:
$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt
When you’re done making changes, check that they pass flake8 and the test suite, and use tox to test against other Python versions:
$ flake8 datakit tests
$ python setup.py test
$ tox
Creating plugins¶
Quickstart¶
To jump-start your next plugin, check out Cookiecutter and cookiecutter-datakit-plugin.
Overview¶
A typical plugin should apply the entry points strategy by defining Cliff command classes that are linked to a unique plugin name and action in the plugin’s setup.py. [1]
This allows Datakit plugins to be easily installed using standard Python package installation techniques.
For example, to install a plugin called datakit-data:
$ pip install datakit-data
The entry points strategy lets Datakit easily discover and invoke plugin commands:
$ datakit data push
$ datakit data pull
The Cliff documentation details how to use entry points in a plugin, and contains a demo app for a simple plugin.
You can also check out our Example Plugin below for the basics of wiring up a plugin.
Plugin structure¶
Datakit plugins should have the following structure:
plugin-name # root of git repo
├── plugin_name
│   ├── __init__.py
│   └── some_command.py
└── setup.py
To make a custom command discoverable by the datakit command-line tool, you must expose it in the plugin’s setup.py. See Example Plugin for details.
Plugin configurations¶
Datakit uses a home directory to stash plugin configuration.
By default, this directory is set to ~/.datakit. It can be customized by setting the DATAKIT_HOME environment variable.
Plugins should store configuration files under a directory that matches the name of the plugin’s repo or package.
For example, the datakit-data plugin would store configs in:
~/.datakit/plugins/datakit-data/config.json
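The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Datakit's own implementation: the helper names datakit_home and plugin_config_path are hypothetical, the config.json filename is taken from the example path above, and treating an empty DATAKIT_HOME as unset is an assumption.

```python
import os
from pathlib import Path


def datakit_home(environ=None):
    """Return the Datakit home directory (hypothetical helper).

    Uses DATAKIT_HOME when it is set and non-empty (an assumption),
    otherwise falls back to ~/.datakit.
    """
    environ = os.environ if environ is None else environ
    custom = environ.get("DATAKIT_HOME")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / ".datakit"


def plugin_config_path(plugin_name, environ=None):
    """Return the config file path for a plugin (hypothetical helper)."""
    return datakit_home(environ) / "plugins" / plugin_name / "config.json"


if __name__ == "__main__":
    # Prints the path resolved for the current environment.
    print(plugin_config_path("datakit-data"))
```

Passing a plain dictionary as environ makes the helpers easy to exercise without touching the real environment.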
Example Plugin¶
Let’s say we have a datakit-data plugin with the following file structure:
datakit-data
├── setup.py
└── datakit_data
    ├── __init__.py
    ├── push.py # Contains a Push class to push data to S3
    └── pull.py # Contains a Pull class to pull data down from S3
To expose the push and pull commands to datakit, configure the entry_points argument in setup.py as follows:
....
entry_points={
    'datakit.plugins': [
        'data push= datakit_data.push:Push',
        'data pull= datakit_data.pull:Pull',
    ]
}
....
After installing the plugin, Datakit can discover and invoke these new commands:
$ datakit data push
$ datakit data pull
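To make the mapping concrete, the sketch below splits an entry point string of the form used in setup.py into its command name, module, and class. It only illustrates how the string is structured; it is not how setuptools or Cliff actually load plugins, and parse_entry_point and EntryPoint are hypothetical names introduced for this example.

```python
from typing import NamedTuple


class EntryPoint(NamedTuple):
    command: str  # what the user types after "datakit", e.g. "data push"
    module: str   # importable module path, e.g. "datakit_data.push"
    attr: str     # class inside that module, e.g. "Push"


def parse_entry_point(spec):
    """Split 'name = module:attr' into its three parts (illustrative only)."""
    name, sep, target = spec.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in entry point: {spec!r}")
    module, sep, attr = target.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in entry point: {spec!r}")
    return EntryPoint(name.strip(), module.strip(), attr.strip())


if __name__ == "__main__":
    specs = (
        "data push= datakit_data.push:Push",
        "data pull= datakit_data.pull:Pull",
    )
    for spec in specs:
        ep = parse_entry_point(spec)
        print(f"datakit {ep.command} -> {ep.module}.{ep.attr}")
```

The text left of the equals sign is the command name a user types, and the text right of it names the module and class that implement the command.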
Plugin Mixins¶
Datakit provides the datakit.command_helpers.CommandHelpers mixin class to help build plugin commands.
This mixin contains basic configuration methods and attributes such as default locations for plugin-specific configuration files.
Testing¶
Datakit does not require you to use a particular testing framework. Because each plugin is technically a stand-alone Python package, you’re free to use whatever testing framework you prefer.
Datakit itself uses the pytest framework. We highly recommend it!
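As a rough illustration of what a plugin test might look like under pytest, the sketch below tests a pair of stand-in functions rather than a real Datakit command. Both save_config and load_config are hypothetical helpers invented for this example, as is the s3_bucket setting; tmp_path is pytest's built-in temporary-directory fixture.

```python
import json
from pathlib import Path


def save_config(path, settings):
    """Write plugin settings as JSON, creating parent directories (hypothetical)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings))


def load_config(path):
    """Read plugin settings back from JSON (hypothetical)."""
    return json.loads(Path(path).read_text())


def test_config_round_trip(tmp_path):
    # pytest supplies tmp_path as a fresh temporary directory for each test.
    config_file = tmp_path / "plugins" / "datakit-data" / "config.json"
    save_config(config_file, {"s3_bucket": "example-bucket"})
    assert load_config(config_file) == {"s3_bucket": "example-bucket"}
```

Because the test writes only under tmp_path, it never touches a real Datakit home directory.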
[1] As a convention, Datakit entry points should follow the plugin:command format. For example, data:push in the Example Plugin.