Developers

Overview

Datakit is a lightweight framework for creating custom data science workflows.

It relies on the Cliff command-line toolkit to provide an extensible system of plugins to support custom workflows.

Core development

Here’s how to set up Datakit for local development.

  1. Install Python 3.6 using pyenv and pyenv-virtualenv:

    $ pyenv install 3.6.1
    
  2. Clone datakit-core:

    $ git clone git@github.com:associatedpress/datakit-core.git
    
  3. Create and activate a virtual environment:

    $ pyenv virtualenv 3.6.1 datakit-core
    $ pyenv activate datakit-core
    $ cd datakit-core/
    
  4. Install dependencies:

    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt
    
  5. When you’re done making changes, check that they pass flake8 and the test suite, then run tox to test against other Python versions:

    $ flake8 datakit tests
    $ python setup.py test
    $ tox
    

Creating plugins

Quickstart

To jump-start your next plugin, check out Cookiecutter and cookiecutter-datakit-plugin.

Overview

A typical plugin should apply the entry points strategy by defining Cliff command classes that are linked to a unique plugin name and action in the plugin’s setup.py. [1]

This allows Datakit plugins to be easily installed using standard Python package installation techniques.

For example, to install a plugin called datakit-data:

$ pip install datakit-data

The entry points strategy lets Datakit easily discover and invoke plugin commands:

$ datakit data push
$ datakit data pull
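
As a rough illustration of what "discovery" means here, the Python standard library can list whatever is registered under an entry-point group. This is not how Datakit or Cliff implement discovery internally. It is a sketch that assumes Python 3.8 or later and the group name datakit.plugins used in the Example Plugin, and registered_commands is a hypothetical helper name.

```python
from importlib import metadata


def registered_commands(group="datakit.plugins"):
    """Return {entry point name: 'module:attr'} for one entry-point group.

    Hypothetical helper for illustration; not part of Datakit or Cliff.
    """
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selection API.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# Prints an empty dict when no Datakit plugins are installed.
print(registered_commands())
```

With a plugin such as datakit-data installed, the returned names would be the command names that follow datakit on the command line.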

The Cliff documentation details how to use entry points in a plugin, and contains a demo app for a simple plugin.

You can also check out our Example Plugin below for the basics of wiring up a plugin.

Plugin structure

Datakit plugins should have the following structure:

plugin-name # root of git repo
├── plugin_name
│   ├── __init__.py
│   └── some_command.py
└── setup.py

To make a custom command discoverable by the datakit command-line tool, you must expose it in the plugin’s setup.py. See Example Plugin for details.

Plugin configurations

Datakit uses a home directory to stash plugin configuration.

By default, this directory is set to ~/.datakit. It can be customized by setting the DATAKIT_HOME environment variable.

Plugins should store configuration files under a directory that matches the name of the plugin’s repo or package.

For example, the datakit-data plugin would store configs in:

~/.datakit/plugins/datakit-data/config.json
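
The lookup described above can be sketched in a few lines of Python. This is only an illustration, not Datakit's own code: the helper names datakit_home and plugin_config_path are hypothetical, and the plugins/<plugin-name>/config.json layout is taken from the example path above.

```python
import os
from pathlib import Path


def datakit_home():
    """Return the Datakit home directory (hypothetical helper).

    Uses DATAKIT_HOME when it is set and non-empty, otherwise ~/.datakit.
    """
    custom = os.environ.get("DATAKIT_HOME")
    return Path(custom) if custom else Path.home() / ".datakit"


def plugin_config_path(plugin_name, filename="config.json"):
    """Build the path to a plugin's config file (hypothetical helper)."""
    return datakit_home() / "plugins" / plugin_name / filename


print(plugin_config_path("datakit-data"))
```

Resolving the home directory on every call, rather than once at import time, means a change to DATAKIT_HOME takes effect immediately, which is convenient in tests.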

Example Plugin

Let’s say we have a datakit-data plugin with the following file structure:

datakit-data
├── setup.py
└── datakit_data
    ├── __init__.py
    ├── push.py # contains a Push class to push data to S3
    └── pull.py # contains a Pull class to pull down data from S3

To expose the push and pull commands to datakit, set the entry_points argument in setup.py as shown below:

....
entry_points={
    'datakit.plugins': [
        'data push = datakit_data.push:Push',
        'data pull = datakit_data.pull:Pull',
    ]
}
....
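
Each string in that list is a standard setuptools entry point of the form "name = module:attribute". The sketch below splits such a string by hand to show how the pieces map onto a command name and a command class. split_entry_point is a hypothetical helper written for illustration; it is not part of Datakit, Cliff, or setuptools.

```python
def split_entry_point(spec):
    """Split 'name = module:attr' into (name, module, attr).

    Hypothetical helper; whitespace around '=' is not significant.
    """
    name, sep, target = spec.partition("=")
    if not sep:
        raise ValueError("missing '=' in entry point: %r" % spec)
    module, sep, attr = target.strip().partition(":")
    if not sep:
        raise ValueError("missing ':' in entry point: %r" % spec)
    return name.strip(), module.strip(), attr.strip()


specs = [
    "data push = datakit_data.push:Push",
    "data pull = datakit_data.pull:Pull",
]
for spec in specs:
    command, module, cls = split_entry_point(spec)
    # Show which class backs each command-line invocation.
    print("datakit %s -> %s.%s" % (command, module, cls))
```

The left-hand side becomes the command typed after datakit, and the right-hand side names the module and the class that implements it.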

After installing the plugin, Datakit can discover and invoke these new commands:

$ datakit data push
$ datakit data pull

Plugin Mixins

Datakit provides the datakit.command_helpers.CommandHelpers mixin class to help build plugin commands.

This mixin contains basic configuration methods and attributes such as default locations for plugin-specific configuration files.

Testing

Datakit does not require you to use a particular testing framework. Because each plugin is technically a stand-alone Python package, you’re free to use whatever testing framework you prefer.

Datakit itself uses the pytest framework. We highly recommend it!
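
For readers new to pytest, here is a minimal sketch of what a plugin test module might look like. Everything in it is an assumption made for illustration: load_config is a hypothetical plugin helper, and s3_bucket is an invented config key, not something Datakit or datakit-data defines.

```python
import json
import tempfile
from pathlib import Path


def load_config(path):
    """Hypothetical plugin helper: read a JSON config, or {} if missing."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def test_load_config_missing_file():
    # A missing config file should yield an empty configuration.
    with tempfile.TemporaryDirectory() as tmp:
        assert load_config(Path(tmp) / "config.json") == {}


def test_load_config_reads_json():
    # An existing config file should be parsed as JSON.
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "config.json"
        config.write_text(json.dumps({"s3_bucket": "example-bucket"}))
        assert load_config(config) == {"s3_bucket": "example-bucket"}
```

Saved in a file whose name starts with test_, these functions are collected and run by the pytest command. In a real plugin the helper would live in the package and be imported by the test module.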

[1] As a convention, Datakit entry points should follow the plugin:command format, for example data:push in the Example Plugin.