DevOps and Data Science
Björn edited this page Mar 26, 2018
Problem: Workflows in data science are complicated and error-prone.
Idea: Software engineers had a similar problem and solved it with DevOps.
Solution: Create a DevOps pipeline for data science.
Requirements:
- simple:
  - the average user should only have to press a button to start batch processing
- flexible:
  - different batch systems must be supported: HTCondor and Spark
- Jupyter notebook:
  - all functions are accessible from Jupyter notebooks
  - project-specific adjustments are limited to custom kernels and scripts inside the project
- reproducibility:
  - the batch-processing step must be reproducible
  - diagrams and tables must be reproducible
Setting up a project:
- Project lead initializes the repository, configures the batch system and gives permissions to members
- Project members clone the repository
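A minimal sketch of these two steps with plain git; a local bare repository stands in for the SDIL-hosted remote, and the config file name `batch.cfg` and its contents are assumptions, not part of the design above:

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
# Project lead: initialize the shared repository and commit a batch-system
# configuration (file name and format are illustrative, not prescribed).
git init --bare project.git         # stand-in for the SDIL-hosted remote
git clone project.git lead
cd lead
printf 'backend: htcondor\n' > batch.cfg
git add batch.cfg
git -c user.email=lead@example.org -c user.name=lead commit -m "configure batch system"
git push origin HEAD
cd ..
# Project member: clone the already-configured repository.
git clone project.git member
```

After the clone, the member's working copy already contains the lead's batch configuration, so no per-member setup is needed.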
Configuring a project:
- done only by the project lead
- changes are automatically deployed to project members
Starting a batch job:
- project members send commits into the pipeline
- pipeline sets up batch job
- project members fetch results from the pipeline
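The member workflow above can be sketched with plain git; a local bare repository stands in for the SDIL-hosted pipeline remote, and the remote name `pipeline` and file `job.py` are assumptions:

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init --bare pipeline.git        # stand-in for the remote pipeline repo
git init work
cd work
git remote add pipeline ../pipeline.git
echo "print('batch step')" > job.py # the job definition lives in the project
git add job.py
git -c user.email=m@example.org -c user.name=member commit -m "define batch job"
git push pipeline HEAD              # pushing a commit starts the batch process
git fetch pipeline                  # results would later be fetched the same way
```

From the member's point of view, the whole batch cycle is just a push and a later fetch, which matches the "press a button" requirement.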
Proof of concept:
- set up a remote repository on SDIL that does all this (done)
First iteration:
- create scripts and tools that improve a project (search for a suitable project once this issue is solved)
Final iteration:
- create tools that can easily be applied to all data science projects
Option 1: remote repository (git_batch)
- create a remote repository on SDIL that schedules the batch job as defined in the project
- the batch process is started by pushing into the remote repository
- results are fetched by pulling from the remote repository
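Option 1 could be realized with a server-side post-receive hook; a minimal sketch follows, in which the hook body, file names, and the commented-out condor_submit call are assumptions about how the scheduling step might look:

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init --bare batch.git           # the git_batch remote repository
# Server-side hook: after every push, export the pushed revision into a
# scratch work tree and hand it to the batch system (the submit call is
# illustrative only and therefore commented out).
cat > batch.git/hooks/post-receive <<'EOF'
#!/bin/sh
workdir=$(mktemp -d)
GIT_WORK_TREE="$workdir" git checkout -f
echo "scheduled job from $workdir" >> schedule.log
# condor_submit "$workdir/job.submit"   # or spark-submit, per project config
EOF
chmod +x batch.git/hooks/post-receive
# Client side: any push into the remote now triggers the hook.
git clone batch.git proj
cd proj
echo "input" > data.txt
git add data.txt
git -c user.email=m@example.org -c user.name=member commit -m "job input"
git push origin HEAD
```

Each push leaves an entry in `schedule.log` inside the bare repository; in a real deployment that line would be replaced by the actual condor_submit or spark-submit call configured for the project.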
Option 2: full-blown DevOps
- use SDIL as a CI runner
- the batch process is started by an event in GitLab (e.g. tagging a commit or creating a branch)
- results are accessed via GitLab
- configuration of the batch system:
  - this requires the definition of conventions by someone with experience
  - for now this is left as an exercise for the user
- git integration in Jupyter notebooks
- pipeline steps that ensure consistent data:
  - this is done with the dirhash
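The page does not define the dirhash; one plausible sketch is to hash every file's contents and relative path in a deterministic order, so that two directories holding identical data yield identical hashes (the function name and hashing scheme are assumptions):

```shell
# Hash each file's contents and relative path in sorted order, then hash
# the combined list; identical directory contents give identical hashes.
dirhash() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) \
    | sha256sum | cut -d' ' -f1
}

# Example: two directories with the same data are recognized as consistent.
a=$(mktemp -d); b=$(mktemp -d)
echo "42" > "$a/data.csv"
echo "42" > "$b/data.csv"
h1=$(dirhash "$a"); h2=$(dirhash "$b")
[ "$h1" = "$h2" ] && echo "consistent"
```

A pipeline step could record the dirhash of its input and output directories, making silent data changes between runs detectable.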