DevOps and Data Science

Björn edited this page Mar 26, 2018 · 1 revision

Problem: Workflows in data science are complicated and error-prone.

Idea: Software engineers had a similar problem and solved it with DevOps.

Solution: Create a DevOps pipeline for data science.

Requirements

List of Requirements

  • simple:
    • the average user should only have to press a button to start batch processing
  • flexible:
    • different batch systems must be supported: HTCondor and Spark
  • Jupyter notebooks:
    • all functions are accessible from Jupyter notebooks
    • project-specific adjustments are limited to custom kernels and scripts inside the project
  • reproducibility:
    • the batch-processing step must be reproducible
    • diagrams and tables must be reproducible

Use Cases

Setting up a project:

  • Project lead initializes the repository, configures the batch system, and grants permissions to project members
  • Project members clone the repository

Configuring a project:

  • done only by the project lead
  • changes are automatically deployed to project members
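
Because the configuration lives in the repository, deploying a change is just a commit that members pull. A hypothetical config file the project lead might commit — the file name and every key are illustrative, not part of any existing tool:

```yaml
# batch.yml — hypothetical project configuration; all keys are illustrative
batch_system: htcondor   # or: spark
queue: standard
kernel: project-kernel   # custom Jupyter kernel shipped with the project
```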

Starting a batch job:

  • project members send commits into the pipeline
  • pipeline sets up batch job
  • project members fetch results from the pipeline

Roadmap

Proof of concept:

Set up a remote repository on SDIL that does all of this (done).

First iteration:

Create scripts and tools that improve a single project (search for a suitable project once this issue is solved).

Final iteration:

Create tools that can easily be applied to all data science projects.

Architecture

Option 1: remote repository (git_batch)

  • create a remote repository on SDIL that schedules the batch job as defined in the project
  • a batch process is started by pushing to the remote repository
  • results are fetched by pulling from the remote repository
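
The git_batch idea can be sketched locally with a bare repository whose post-receive hook stands in for the scheduler. Everything below is illustrative — the paths, the hook body, and the job log are assumptions; a real hook would submit the pushed commit to HTCondor or Spark instead of writing a file:

```shell
# Minimal sketch of Option 1: a bare "remote" whose post-receive hook
# plays the role of the batch scheduler. All paths are illustrative.
set -e
tmp=$(mktemp -d)

# The remote repository that would live on SDIL.
git init -q --bare "$tmp/batch.git"
cat > "$tmp/batch.git/hooks/post-receive" <<'EOF'
#!/bin/sh
# A real hook would call condor_submit or spark-submit here.
read oldrev newrev ref
echo "scheduled job for $newrev" > job.log
EOF
chmod +x "$tmp/batch.git/hooks/post-receive"

# A project member commits and pushes — this alone starts the "job".
git init -q "$tmp/project"
cd "$tmp/project"
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "run analysis"
git push -q "$tmp/batch.git" HEAD:refs/heads/master

cat "$tmp/batch.git/job.log"   # the hook ran: the scheduler was invoked
```

Fetching results would work the same way in reverse: the job commits its output to a results branch, and members pull it.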

Option 2: full blown devops

  • use SDIL as a CI runner
  • a batch process is started by an event in GitLab (e.g. tagging a commit or creating a branch)
  • results are accessed via GitLab
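
A minimal sketch of what the CI configuration might look like, assuming the tag-triggered variant; the job name, script, and submit command are all assumptions, not an existing SDIL setup:

```yaml
# .gitlab-ci.yml — illustrative sketch, not a working configuration
batch:
  script:
    - condor_submit job.sub   # or spark-submit, depending on the project
  only:
    - tags                    # run only when a commit is tagged
```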

Modules

  • configuration of the batch system
    • this requires conventions to be defined by someone with experience
    • for now this is left as an exercise for the user
  • Git integration in Jupyter notebooks
  • pipeline steps that ensure consistent data
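
One way the consistency step might work is checksum verification: record checksums of the input data once, and make every pipeline run re-verify them. The file layout below is purely illustrative:

```shell
# Sketch of a pipeline step that ensures consistent input data.
# File names are illustrative; a real project would cover its full data set.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir data
echo "a,b,c" > data/input.csv

sha256sum data/*.csv > data.sha256   # record expected checksums once
sha256sum -c --quiet data.sha256     # later runs fail if any file changed
echo "data consistent"
```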
