Hi folks!
Recently, I interviewed at Radiomics. Here is the task they asked me to solve.
Is anybody able to solve it?
The task is to build a data workflow to pre-select/clean DICOM images using Prefect, a workflow orchestration tool for Python that lets you create tasks that may or may not be interconnected (i.e., not just DAGs).
This data pipeline has a hypothetical downstream consumer, which uses the Prefect REST API to trigger execution of the defined workflow by passing various parameters. You don’t need to be concerned with how this API works, because you’re responsible for neither its use nor its definition; it is mentioned only to give you more information on the data structures needed for the consumer to trigger execution smoothly.

Figure: Simplified C4 model of the whole system that the Prefect orchestrator (in deep blue) is part of. Systems depicted in grey are external.
The problem at hand is to let project managers clean data without coding individual tasks themselves, filtering out DICOM files that do not match certain criteria evaluated on their DICOM tags, while allowing for separation of concerns when it comes to the lifecycle of such a workflow.
A criterion consists of the DICOM tag to evaluate, the operation for evaluation (i.e., EQUALS, GREATER, LESS, or IN), and the reference value for evaluation.
For the reference value, you can assume support for only three VR (Value Representation) types: Date (DA), Code String (CS), and Age String (AS). For more information on VRs, refer to DICOM PS3.5. Dates in criteria should be formatted in ISO 8601.
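To make the criterion definition above concrete, here is a minimal sketch of how a single criterion could be evaluated against a raw tag value. The names `Operation`, `Criterion`, `parse_value` and `parse_reference` are my own and not part of the task. It also assumes that AS references arrive in DICOM form (e.g. "040Y") and that ages can be compared through an approximate conversion to days.

```python
from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum
from typing import Any


class Operation(str, Enum):
    EQUALS = "EQUALS"
    GREATER = "GREATER"
    LESS = "LESS"
    IN = "IN"


# Assumption: approximate day counts are good enough to order Age Strings.
_AS_UNIT_DAYS = {"D": 1, "W": 7, "M": 30, "Y": 365}


def parse_value(raw: str, vr: str) -> Any:
    """Convert a raw DICOM string into a comparable Python value."""
    raw = raw.strip()
    if vr == "DA":  # DICOM stores dates as YYYYMMDD
        return datetime.strptime(raw, "%Y%m%d").date()
    if vr == "AS":  # Age String is nnnD, nnnW, nnnM or nnnY
        return int(raw[:3]) * _AS_UNIT_DAYS[raw[3]]
    if vr == "CS":
        return raw
    raise ValueError(f"unsupported VR: {vr}")


def parse_reference(raw: str, vr: str) -> Any:
    """Convert a criterion reference; dates arrive as ISO 8601 per the task."""
    if vr == "DA":
        return date.fromisoformat(raw)
    return parse_value(raw, vr)


@dataclass(frozen=True)
class Criterion:
    dicom_tag: tuple[str, str]
    operation: Operation
    reference: Any  # a string, or a list of strings for IN

    def matches(self, raw_value: str, vr: str) -> bool:
        value = parse_value(raw_value, vr)
        if self.operation is Operation.IN:
            return value in [parse_reference(r, vr) for r in self.reference]
        ref = parse_reference(self.reference, vr)
        if self.operation is Operation.EQUALS:
            return value == ref
        if self.operation is Operation.GREATER:
            return value > ref
        return value < ref  # Operation.LESS
```

For instance, `Criterion(("0008", "0020"), Operation.GREATER, "2022-01-01").matches("20220615", "DA")` evaluates a study date stored in DICOM form against an ISO 8601 reference.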
For example, if the criteria are that (1) the study date is later than 01/01/2022, and (2) the modality is either CT (Computed Tomography) or MR (Magnetic Resonance Imaging), the expected input for the Prefect REST API to trigger the execution of a flow would look like this (clipped):
{
  ...
  "parameters": {
    "input_s3_parameters": {
      "endpoint": "https://localhost:0000", // optional
      "bucket": "challenge-tech",
      "credentials": "key_id:access_key",
      "region": "eu-west"
    },
    "criteria": [
      {
        "dicom_tag": ["0008", "0020"],
        "operation": "GREATER",
        "reference": "2022-01-01"
      },
      {
        "dicom_tag": ["0008", "0060"],
        "operation": "IN",
        "reference": ["CT", "MR"]
      }
    ]
  }
  ...
}
👁️ HINT: programmatically, each parameter in a flow is a function parameter.
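Following the hint, the two keys under "parameters" would map onto two function parameters of the flow. Below is a rough sketch of what such an entry point could look like. The function name `clean_dicom_images`, the dataclasses and the return value are all hypothetical. The function is left undecorated so the snippet runs without Prefect installed; in a real solution it would presumably carry Prefect's `@flow` decorator.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class S3Parameters:
    bucket: str
    credentials: str  # "key_id:access_key", as in the payload
    region: str
    endpoint: Optional[str] = None  # marked optional in the payload


@dataclass(frozen=True)
class Criterion:
    dicom_tag: tuple[str, str]
    operation: str
    reference: Any  # a string, or a list of strings for IN


ALLOWED_OPERATIONS = {"EQUALS", "GREATER", "LESS", "IN"}


# With Prefect, this function would be decorated with @flow
# (from prefect import flow); it is kept plain here on purpose.
def clean_dicom_images(input_s3_parameters: dict, criteria: list[dict]) -> dict:
    """Hypothetical flow entry point: validate the parameters first."""
    s3 = S3Parameters(**input_s3_parameters)
    parsed = []
    for item in criteria:
        if item["operation"] not in ALLOWED_OPERATIONS:
            raise ValueError(f"unknown operation: {item['operation']}")
        group, element = item["dicom_tag"]
        parsed.append(Criterion((group, element), item["operation"], item["reference"]))
    # Listing the bucket, filtering and bookkeeping would be called from here.
    return {"bucket": s3.bucket, "criteria_count": len(parsed)}
```

Validating up front means a malformed request from the consumer fails before any object is touched.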
The data workflow designed for this challenge should make it possible to use both the database and the object store to run further/later computations, and only on the “objects” that have passed the cleaning stage correctly.
Beyond running computations, the workflow should also provide traceability for all DICOM instances (including those that have been discarded). For example, performing automatic segmentation on the DICOM series that are part of the cleaned data subset should be logged somewhere.
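One possible way to get this kind of traceability is an audit table that records a decision for every instance, kept or discarded. The sketch below uses SQLite from the standard library only to stay self-contained. The table name, the columns and the helper functions are my own assumptions, since the task does not prescribe a schema or a DBMS.

```python
import sqlite3
from typing import Optional


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) audit table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS instance_audit (
            object_key TEXT NOT NULL,
            series_instance_uid TEXT,
            status TEXT NOT NULL CHECK (status IN ('KEPT', 'DISCARDED')),
            reason TEXT,
            recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def record_decision(
    conn: sqlite3.Connection,
    object_key: str,
    series_uid: str,
    kept: bool,
    reason: Optional[str] = None,
) -> None:
    """Store one row per instance, including discarded ones."""
    conn.execute(
        "INSERT INTO instance_audit "
        "(object_key, series_instance_uid, status, reason) VALUES (?, ?, ?, ?)",
        (object_key, series_uid, "KEPT" if kept else "DISCARDED", reason),
    )
    conn.commit()


def kept_series(conn: sqlite3.Connection) -> list[str]:
    """Series that later computations are allowed to run on."""
    rows = conn.execute(
        "SELECT DISTINCT series_instance_uid FROM instance_audit "
        "WHERE status = 'KEPT' ORDER BY series_instance_uid"
    )
    return [row[0] for row in rows]
```

Later steps (such as the segmentation mentioned above) could then query `kept_series` and append their own rows to a similar log table.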
We provide you with a pre-populated S3 bucket that contains DICOM files from an open-source dataset, so that you can test the business logic in a more realistic environment. You can verify its content by running the following command (requires Docker):
$ AWS_ACCESS_KEY_ID={ID PROVIDED}
$ AWS_SECRET_ACCESS_KEY={SECRET PROVIDED}
$ docker run -it --rm \
    -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
    -e AWS_BUCKET_NAME=challenge-tech \
    --entrypoint bash \
    amazon/aws-cli:2.8.12 -c "aws s3 ls \$AWS_BUCKET_NAME --region eu-west"
2022-12-27 09:27:18 17855470 1.2.276.0.7230010.3.1.4.2323910823.3528.1597261298.712.dcm
2022-12-27 09:27:51 526218 1.3.6.1.4.1.32722.99.99.101845045862003320455022547334559155059.dcm
...
If you’re not familiar with Prefect, follow this link to get started and understand the basic concepts of its domain.
If you’re not familiar with DICOM, we recommend checking the DICOM standard’s definition of its real-world model. That might be too dense, though, so this link gives a relatively brief summary of DICOM, and you can find a short table of the most common DICOM tags in this link. For what you need to know, the most important component for our work is the DICOM Series, identified by a globally unique identifier, the Series Instance UID.
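Since the Series is the unit that matters, a solution would probably group instances by Series Instance UID (tag (0020,000E)) at some point. A tiny sketch follows, assuming the metadata has already been read into dicts; the keys `series_instance_uid` and `object_key` are hypothetical.

```python
from collections import defaultdict


def group_by_series(instances: list[dict]) -> dict[str, list[str]]:
    """Map each Series Instance UID to the object keys of its instances."""
    series: dict[str, list[str]] = defaultdict(list)
    for instance in instances:
        series[instance["series_instance_uid"]].append(instance["object_key"])
    return dict(series)
```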
For processing the DICOM files and interfacing with S3 (and a SQL DBMS if you want to try it out, but that’s optional for this challenge), we do not enforce any particular library; feel free to choose those that best fit your experience.
answer branch. We recommend pushing your changes frequently so that we can better assess your thought process. If you’re not able to finish within the time estimate provided, it’s also okay to provide us with a skeleton, and we might further discuss the technical details in a follow-up meeting.
In case you need some help or feedback from us, feel free to create a GitLab issue. We’ll get back to you.
About 2-3 hours, depending on your experience level. There is no countdown; this is just an estimate so you can plan your time.