DevOps and Data Science
Björn edited this page Mar 26, 2018
Problem: Workflows in data science are complicated and error-prone.
Idea: Software engineers had a similar problem and solved it with DevOps.
Solution: Create a DevOps pipeline for data science.
Requirements:
- simple:
  - the average user should only have to press a button to start batch processing
- flexible:
  - different batch systems must be supported: HTCondor and Spark
- Jupyter notebook:
  - all functions are accessible from Jupyter notebooks
  - project-specific adjustments are limited to custom kernels and scripts inside the project
- reproducibility:
  - the batch-processing step must be reproducible
  - diagrams and tables must be reproducible
Setting up a project:
- Project lead initializes the repository, configures the batch system and gives permissions to members
- Project members clone the repository
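A minimal sketch of these two steps with plain git; a local bare repository stands in for the SDIL-hosted remote, and the config file name `batch.cfg` and its contents are assumptions, not part of the design above:

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
# Project lead: initialize the shared repository and commit a batch-system
# configuration (file name and format are illustrative, not prescribed).
git init --bare project.git         # stand-in for the SDIL-hosted remote
git clone project.git lead
cd lead
printf 'backend: htcondor\n' > batch.cfg
git add batch.cfg
git -c user.email=lead@example.org -c user.name=lead commit -m "configure batch system"
git push origin HEAD
cd ..
# Project member: clone the already-configured repository.
git clone project.git member
```

After the clone, the member's working copy already contains the lead's batch configuration, so no per-member setup is needed.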
Configuring a project:
- done only by the project lead
- changes are automatically deployed to project members
Starting a batch job:
- project members send commits into the pipeline
- pipeline sets up batch job
- project members fetch results from the pipeline
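The member workflow above can be sketched with plain git; a local bare repository stands in for the SDIL-hosted pipeline remote, and the remote name `pipeline` and file `job.py` are assumptions:

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init --bare pipeline.git        # stand-in for the remote pipeline repo
git init work
cd work
git remote add pipeline ../pipeline.git
echo "print('batch step')" > job.py # the job definition lives in the project
git add job.py
git -c user.email=m@example.org -c user.name=member commit -m "define batch job"
git push pipeline HEAD              # pushing a commit starts the batch process
git fetch pipeline                  # results would later be fetched the same way
```

From the member's point of view, the whole batch cycle is just a push and a later fetch, which matches the "press a button" requirement.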
Proof of concept:
- set up a remote repository on SDIL that does all this (done)
First iteration:
- create scripts and tools that improve a project (search for a suitable project once this issue is solved)
Final iteration:
- create tools that can easily be applied to all data science projects
Option 1: remote repository (git_batch)
- create a remote repository on SDIL that schedules the batch job as defined in the project
- the batch process is started by pushing into the remote repository
- results are fetched by pulling from the remote repository
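Option 1 could be realized with a server-side post-receive hook; a minimal sketch follows, in which the hook body, file names, and the commented-out condor_submit call are assumptions about how the scheduling step might look:

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init --bare batch.git           # the git_batch remote repository
# Server-side hook: after every push, export the pushed revision into a
# scratch work tree and hand it to the batch system (the submit call is
# illustrative only and therefore commented out).
cat > batch.git/hooks/post-receive <<'EOF'
#!/bin/sh
workdir=$(mktemp -d)
GIT_WORK_TREE="$workdir" git checkout -f
echo "scheduled job from $workdir" >> schedule.log
# condor_submit "$workdir/job.submit"   # or spark-submit, per project config
EOF
chmod +x batch.git/hooks/post-receive
# Client side: any push into the remote now triggers the hook.
git clone batch.git proj
cd proj
echo "input" > data.txt
git add data.txt
git -c user.email=m@example.org -c user.name=member commit -m "job input"
git push origin HEAD
```

Each push leaves an entry in `schedule.log` inside the bare repository; in a real deployment that line would be replaced by the actual condor_submit or spark-submit call configured for the project.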
Option 2: full-blown DevOps
- use SDIL as a CI runner
- the batch process is started by an event in GitLab (e.g. tagging a commit or creating a branch)
- results are accessed via GitLab
- configuration of the batch system:
  - this requires the definition of conventions by someone with experience
  - for now this is left as an exercise for the user
- git integration in Jupyter notebooks
- pipeline steps that ensure consistent data:
  - this is done with the dirhash
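The page does not define the dirhash; one plausible sketch is to hash every file's contents and relative path in a deterministic order, so that two directories holding identical data yield identical hashes (the function name and hashing scheme are assumptions):

```shell
# Hash each file's contents and relative path in sorted order, then hash
# the combined list; identical directory contents give identical hashes.
dirhash() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) \
    | sha256sum | cut -d' ' -f1
}

# Example: two directories with the same data are recognized as consistent.
a=$(mktemp -d); b=$(mktemp -d)
echo "42" > "$a/data.csv"
echo "42" > "$b/data.csv"
h1=$(dirhash "$a"); h2=$(dirhash "$b")
[ "$h1" = "$h2" ] && echo "consistent"
```

A pipeline step could record the dirhash of its input and output directories, making silent data changes between runs detectable.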