Exercise 4: Docker containers for processors
Task
- Build and save a docker image 'kappamask' for use on Calvalus
- Install the processor package with a wrapper script to start the container and a processing script to run inside the container
- Create a request and generate a cloud mask
- Inspect output
Material
- script
kappamask-process
- script
kappamask-km_calvalus.sh
- saved docker image
kappazeta_kappamask_v2.3.tar
- auxiliary data
weights-auxdata.tar.gz
- velocity template
kappamask-parameters.vm
- generic docker start script
common-calvalus-docker-run.sh
- request template
s2-kappamask-request.json
- see
/home/martin/training/kappamask/
on ehproduction02
Step 1: Docker image
If you have a docker daemon running on some machine, you can pull the docker image and save it to a tar file. This is not possible on ehproduction02 because its access to the internet is restricted.
docker pull kappazeta/kappamask:v2.3
docker save kappazeta/kappamask:v2.3 -o kappazeta_kappamask-2.3.tar
docker rmi kappazeta/kappamask:v2.3
The saved image is provided in the training material if you cannot create it yourself.
There are other ways to generate a docker image, e.g. from a Dockerfile, or from a base image plus installations done in a running container that is then converted into an image with docker commit. In all cases, saving that image results in a tar file of the image as above.
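Before copying a saved image to a machine without internet access, it can help to check that the tar file is readable and has the layout that docker save writes. The following is a minimal sketch, assuming the archive contains a top-level manifest.json; the function name check_image_tar is hypothetical and not part of the training material.

```shell
# Hypothetical helper: check that a file looks like an archive written by "docker save".
# Assumption: such archives contain a top-level manifest.json entry.
check_image_tar() {
    tarfile="$1"
    if [ ! -r "$tarfile" ]; then
        echo "cannot read $tarfile" >&2
        return 1
    fi
    # List the archive members and look for an exact match on manifest.json.
    if tar -tf "$tarfile" | grep -qx 'manifest.json'; then
        echo "$tarfile looks like a saved docker image"
    else
        echo "$tarfile has no manifest.json" >&2
        return 1
    fi
}
```

On a machine with a docker daemon, such a file can then be imported again with docker load -i followed by the file name.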
Docker images should
- mount external directories for all data they read and write (they must not write intermediates or results into container-local directories)
- provide an entry point to process an input that is provided as parameter
- expect to be called with the user id of the user outside of the container, not as root (they shall not contain setuid programs)
- avoid external network access, downloads, or uploads
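The rules above map onto standard docker run options. The sketch below only prints such a command so that it can be inspected; it is not the content of common-calvalus-docker-run.sh, and the function name docker_run_cmd, the working directory, and the script path in the usage line are hypothetical placeholders.

```shell
# Sketch: print a docker run command that follows the rules listed above.
# Arguments (all placeholders chosen by the caller):
#   $1 image name, $2 host working directory, $3 processing script, $4 input
docker_run_cmd() {
    image="$1"
    workdir="$2"
    script="$3"
    input="$4"
    # --user          run with the calling user's uid:gid instead of root
    # --network none  no external network access from inside the container
    # -v / -w         mount the host working directory and work inside it
    # --entrypoint    run our own script instead of the image's entry point
    echo docker run --rm \
        --user "$(id -u):$(id -g)" \
        --network none \
        -v "$workdir:$workdir" \
        -w "$workdir" \
        --entrypoint "$script" \
        "$image" "$input"
}
```

For example, docker_run_cmd kappazeta/kappamask:v2.3 /tmp/work /tmp/work/process.sh input.zip prints one docker run line with the current user id filled in.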
Step 2: Processor package
Look at the two scripts: one starts the docker container, the other is started by it and runs inside the docker container. The second one is more or less the script provided as part of the container, with a few adjustments to the name of the working directory and to the tag that reports the output file. We run our replacement script instead of the entry point of the container.
See also https://github.com/kappazeta/km_predict/blob/main/docker/km_predict/km_local.sh
less /home/martin/training/kappamask/kappamask-process
less /home/martin/training/kappamask/kappamask-km_calvalus.sh
Copy all files of the processor into the processor package directory on HDFS:
cd /home/martin/training/kappamask
hdfs dfs -put -f kappamask* weights-auxdata.tar.gz common-calvalus-docker-run.sh /calvalus/home/<username>/software/kappamask-2.3/
ls -l /calvalus/home/<username>/software/kappamask-2.3
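A quick check before running the hdfs dfs -put above can catch a missing file early. This is a minimal sketch that tests for the files named in the upload command; the function name check_package_files is hypothetical.

```shell
# Hypothetical pre-upload check: report any processor package file that is
# missing from the given local directory. The file names are those matched
# by the upload command (kappamask*, weights-auxdata.tar.gz, start script).
check_package_files() {
    dir="$1"
    missing=0
    for f in kappamask-process kappamask-km_calvalus.sh kappamask-parameters.vm \
             weights-auxdata.tar.gz common-calvalus-docker-run.sh; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # Exit status 0 if everything is present, 1 otherwise.
    return "$missing"
}
```

Calling check_package_files /home/martin/training/kappamask returns a non-zero status and lists the missing names if the package is incomplete.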
Step 3: Processing request
Write and submit a processing request.
Copy the template into the special-requests directory of your instance.
cd ~/training3-inst
cp /home/martin/training/kappamask/s2-kappamask-request.json special-requests/
# edit special-requests/s2-kappamask-request.json
{
"productionType" : "processing",
"productionName" : "",
"inputPath" : "/calvalus/eodata/S2_L1C/v5/2024/06/03/S2A_MSIL1C_20240603T100031_N0510_R122_T34VFL_20240603T120142.zip",
"processorName" : "kappamask",
"outputDir" : "",
"queue" : "general",
"attempts" : "1",
"failurePercent" : "0",
"timeout" : "3600",
"executableMemory" : "6144",
"processorBundles" : "",
"calvalus" : "calvalus-2.26",
"snap" : "snap-9.3cv"
}
- Insert a name for your job into productionName
- Insert an output dir starting with
/calvalus/home/<username>/
into outputDir
- Insert the path to your new processor bundle into processorBundles
- You may change the queue to be used.
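Instead of editing the file by hand, the three empty fields can also be filled with sed. This is a sketch under the assumption that the template contains the fields exactly as shown above (with an empty string as value); the function name fill_request and the example values are hypothetical.

```shell
# Hypothetical helper: fill the three empty fields of the request template
# and write the result to standard output.
# Usage: fill_request TEMPLATE NAME OUTPUT_DIR BUNDLE_PATH > request.json
# Assumption: the values contain neither "|" nor "&", which sed would interpret.
fill_request() {
    sed -e "s|\"productionName\" : \"\"|\"productionName\" : \"$2\"|" \
        -e "s|\"outputDir\" : \"\"|\"outputDir\" : \"$3\"|" \
        -e "s|\"processorBundles\" : \"\"|\"processorBundles\" : \"$4\"|" \
        "$1"
}
```

A call could look like fill_request /home/martin/training/kappamask/s2-kappamask-request.json my-kappamask-test "/calvalus/home/<username>/kappamask-out" "/calvalus/home/<username>/software/kappamask-2.3" with the output redirected to special-requests/s2-kappamask-request.json. The job name and output directory are made up for illustration.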
Step 4: Request submission
Submit your request.
If you have logged in again after the previous exercise, first source your environment:
. mytraining2
Then submit the request with the Calvalus Hadoop Tool cht. You may use -a with cht to submit the request asynchronously without waiting for it to finish.
cht -a special-requests/s2-kappamask-request.json
You can monitor the status using the job id reported at submission:
cht --status job_xxxxx_yyy
In case of failure you can use the same commands as listed in exercise 1:
yarn application -list -appStates FAILED | grep <username>
yarn logs -applicationId application_<nnnnn>_<mmm> -log_files stderr,stdout | less
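On Hadoop, a job id and the corresponding YARN application id normally share the same numeric part and differ only in the prefix. Assuming cht reports an id of the form job_xxxxx_yyy, the application id for yarn logs can be derived as sketched below; the function name job_to_app_id is hypothetical.

```shell
# Hypothetical helper: derive the YARN application id from a Hadoop job id
# by replacing the leading "job_" prefix with "application_".
job_to_app_id() {
    echo "$1" | sed 's/^job_/application_/'
}
```

The result can be passed directly to the log command, e.g. yarn logs -applicationId "$(job_to_app_id job_xxxxx_yyy)" -log_files stderr,stdout | less.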
Try to find out what is wrong, correct it, and re-submit your request. Please ask if you do not succeed.
Step 5: Result inspection
Processing may take about half an hour. Once the request has finished, download the result (e.g. with FileZilla) and open it in a viewer, e.g. in QGIS. You can find the legend of KappaMask at https://github.com/kappazeta/km_predict.