Exercise 4: Docker containers for processors

Task

  • Build and save a docker image 'kappamask' for use on Calvalus
  • Install the processor package with a wrapper script to start the container and a processing script to run inside the container
  • Create a request and generate a cloud mask
  • Inspect output

Material

  • script kappamask-process
  • script kappamask-km_calvalus.sh
  • saved docker image kappazeta_kappamask_v2.3.tar
  • auxiliary data weights-auxdata.tar.gz
  • velocity template kappamask-parameters.vm
  • generic docker start script common-calvalus-docker-run.sh
  • request template s2-kappamask-request.json
  • see /home/martin/training/kappamask/ on ehproduction02

Step 1: Docker image

If you have a docker daemon running on some machine, you can pull the docker image and save it into a tar file. This is not possible on ehproduction02 because internet access is restricted there.

docker pull kappazeta/kappamask:v2.3
docker save kappazeta/kappamask:v2.3 -o kappazeta_kappamask_v2.3.tar
docker rmi kappazeta/kappamask:v2.3

The saved image tar file is provided in the training material in case you cannot create it yourself.

There are other ways to create a docker image, e.g. building it from a Dockerfile, or starting from a base image, installing software in a running container, and turning that container into an image with docker commit. In all cases, saving the image results in a tar file as above.
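
For illustration, the two alternatives might look roughly as follows; the image name, tag, and base image are hypothetical placeholders, not part of the training material.

# build from a Dockerfile in the current directory (hypothetical name and tag)
docker build -t myorg/myprocessor:1.0 .
docker save myorg/myprocessor:1.0 -o myorg_myprocessor-1.0.tar

# or: install software interactively in a container and commit it as an image
docker run -it --name myprocessor-build ubuntu:22.04 /bin/bash
# ... install packages inside the container, then exit ...
docker commit myprocessor-build myorg/myprocessor:1.0
docker save myorg/myprocessor:1.0 -o myorg_myprocessor-1.0.tar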

Docker images should

  • mount external directories for all data they read and write
  • provide an entry point that processes an input provided as a parameter
  • (they must not write intermediates or results into container-local directories)
  • expect to be called with the user id of the user outside the container, not with root
  • (they shall not contain setuid programs)
  • avoid external network access, downloads, and uploads
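
As an illustration of these conventions, a docker run call of the kind the generic start script issues might look roughly like the sketch below. The mount point, entry point, script path, and input name are illustrative assumptions; the actual invocation is the one in common-calvalus-docker-run.sh.

# illustrative sketch only: run as the calling user, without network access,
# with the working directory mounted into the container
docker run --rm \
    --user "$(id -u):$(id -g)" \
    --network none \
    -v "$PWD:/workdir" \
    --entrypoint /bin/bash \
    kappazeta/kappamask:v2.3 \
    /workdir/km_calvalus.sh /workdir/scene.zip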

Step 2: Processor package

Look into the two scripts: one starts the docker container, the other is the script that runs inside the container. The second is more or less the script provided as part of the container, with a few adjustments for the name of the working directory and for the tag that reports the output file. We run this replacement script instead of the entry point of the container.

see also https://github.com/kappazeta/km_predict/blob/main/docker/km_predict/km_local.sh

less /home/martin/training/kappamask/kappamask-process
less /home/martin/training/kappamask/kappamask-km_calvalus.sh
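
To see how the pieces fit together, a heavily simplified sketch of such an inner script is shown below. The predictor call and the output tag line are assumptions made for illustration; the authoritative version is kappamask-km_calvalus.sh itself.

#!/bin/bash
# hypothetical sketch of a processing script running inside the container
set -e
input=$1                      # input product passed as parameter
workdir=$(pwd)                # working directory mounted into the container
cd "$workdir"
python3 km_predict.py -p "$input"          # hypothetical predictor call, see the real script
# report the result file to Calvalus via a tag line on stdout (tag name assumed)
echo CALVALUS_OUTPUT_PRODUCT "$workdir"/*.tif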

Copy all files of the processor into the processor package directory on HDFS:

cd /home/martin/training/kappamask
hdfs dfs -put -f kappamask* weights-auxdata.tar.gz common-calvalus-docker-run.sh /calvalus/home/<username>/software/kappamask-2.3/
hdfs dfs -ls /calvalus/home/<username>/software/kappamask-2.3

Step 3: Processing request

Write and submit a processing request.

Copy the template into the special-requests directory of your instance.

cd ~/training3-inst
cp /home/martin/training/kappamask/s2-kappamask-request.json special-requests/
# edit special-requests/s2-kappamask-request.json
{
    "productionType"    : "processing",
    "productionName"    : "",

    "inputPath"         : "/calvalus/eodata/S2_L1C/v5/2024/06/03/S2A_MSIL1C_20240603T100031_N0510_R122_T34VFL_20240603T120142.zip",

    "processorName"     : "kappamask",

    "outputDir"         : "",

    "queue"             : "general",
    "attempts"          : "1",
    "failurePercent"    : "0",
    "timeout"           : "3600",
    "executableMemory"  : "6144",

    "processorBundles"  : "",
    "calvalus"          : "calvalus-2.26",
    "snap"              : "snap-9.3cv"
}
  • Insert a name for your job into productionName
  • Insert an output directory starting with /calvalus/home/<username>/ into outputDir
  • Insert the path to your new processor bundle into processorBundles
  • You may change the queue to be used.
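
For orientation, the edited fields could look like this; the production name and output directory below are placeholders, and <username> stands for your own account as elsewhere in this exercise.

    "productionName"    : "kappamask test 2024-06-03",
    "outputDir"         : "/calvalus/home/<username>/kappamask-test",
    "processorBundles"  : "/calvalus/home/<username>/software/kappamask-2.3",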

Step 4: Request submission

Submit your request.

If you have logged in again after the previous exercise, set up your environment first:

. mytraining2

Then, submit the request with the Calvalus Hadoop Tool cht. You may use the -a option to submit the request asynchronously, without waiting for it to finish.

cht -a special-requests/s2-kappamask-request.json

You can monitor the status using the job id reported at submission:

cht --status job_xxxxx_yyy

In case of failure you can use the same commands as listed in exercise 1:

yarn application -list -appStates FAILED | grep <username>
yarn logs -applicationId application_<nnnnn>_<mmm> -log_files stderr,stdout | less

Try to find out what is wrong, correct it, and re-submit your request. Please ask if you do not succeed.

Step 5: Result inspection

After about half an hour the request should be processed. Download the result (e.g. with FileZilla) and open it in a viewer, e.g. QGIS. You can find the legend of the Kappamask classes at https://github.com/kappazeta/km_predict.
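
If you prefer the command line, one possible way to fetch the result is sketched below; the output directory and file name are placeholders for the values from your request.

# list the results and copy one from HDFS to the login node
hdfs dfs -ls /calvalus/home/<username>/<your-output-dir>
hdfs dfs -get /calvalus/home/<username>/<your-output-dir>/<result-file> .
# then transfer the local copy to your workstation, e.g. with scp or FileZilla, and open it in QGIS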