Monitoring of KNMI Operational Systems: PRISMA

Providing ways to explore and maintain the structure of KNMI's data processing chains.

Imagine a big, complex factory with incoming and outgoing pipelines of various types and sizes. When you open the door you encounter an intricate network of tubes, valves, barrels and gears. Next, imagine you observe that the flow in one blue-and-yellow colored tube is decreasing. You might wonder: will this result in serious problems? What will be the impact on the flow in the outgoing pipelines? And where does the actual congestion happen? Or is there a ruptured pipe? Which employee is responsible, who should be warned?

Now, let's make it even more complicated. It's not liquids or gases flowing through those pipes, it's data. Enter KNMI. Raw data arrives at irregular intervals from satellite measurements, instruments at land stations, balloons and ships, and runs of numerical weather models, each with its own accuracy and fidelity. Seemingly endless processing chains (validating, gauging, feeding models, interpreting and combining data) turn this into a stream of usable information products for both internal and external users.

The main goals of the PRISMA application are to answer these same questions, but in the context of processing chains (including the underlying infrastructure):

Overview of chains
It is easy to imagine the trouble a new employee faces when all these chains, infrastructural dependencies and Dataset information have to be learned by heart because the knowledge isn’t recorded anywhere else. Therefore, PRISMA provides ways to explore (and maintain) the structure of the processing chains, the supporting infrastructure and information about the Datasets.

Root cause
Another way PRISMA lowers the cognitive load for the user, especially during stressful critical failures, is by tracing issues back to their root cause. Because PRISMA points to the root cause, the user can focus on informing the responsible maintainer.
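To make the idea of tracing back concrete, here is a minimal sketch in Python. It is not PRISMA's actual implementation: the node names, the `depends_on` mapping and the `find_root_causes` function are hypothetical, and the sketch assumes that a root cause is a failing node that has no failing dependency of its own.

```python
from collections import deque

def find_root_causes(depends_on, failing, start):
    """Walk upstream from `start` along failing dependencies and return
    the failing nodes that have no failing dependency themselves."""
    roots = set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Only failing dependencies can explain the problem at this node.
        failing_deps = [d for d in depends_on.get(node, []) if d in failing]
        if node in failing and not failing_deps:
            roots.add(node)
        for dep in failing_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return roots

# Hypothetical chain: each node maps to the nodes it depends on.
depends_on = {
    "forecast_product": ["model_run"],
    "model_run": ["ingest", "compute_node"],
    "ingest": ["satellite_feed"],
}
failing = {"forecast_product", "model_run", "ingest", "satellite_feed"}

root_causes = find_root_causes(depends_on, failing, "forecast_product")
print(root_causes)
```

In this made-up example the healthy `compute_node` is ignored and the search ends at `satellite_feed`, the only failing node with nothing failing beneath it.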

Impact
When issues occur, the person monitoring the data processing is responsible not only for alerting maintainers so they can solve the issues, but also for informing internal and external users about the expected unavailability of their products. PRISMA traverses the chains, starting at the root causes, to find all affected end product Datasets.
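The downstream traversal can be sketched in a similar way. Again, this is only an illustration under stated assumptions: the `feeds` mapping, the node names and the `affected_end_products` function are hypothetical, and an end product is assumed to be any node that feeds nothing further downstream.

```python
def affected_end_products(feeds, root_causes):
    """Follow the chains downstream from the root causes and return the
    reachable nodes that have no consumers of their own."""
    affected = set()
    stack = list(root_causes)
    seen = set(stack)
    while stack:
        node = stack.pop()
        downstream = feeds.get(node, [])
        if not downstream:
            # Assumption: a node without downstream consumers is an end product.
            affected.add(node)
        for nxt in downstream:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return affected

# Hypothetical chain: each node maps to the nodes that consume its output.
feeds = {
    "satellite_feed": ["ingest"],
    "ingest": ["model_run", "radar_composite"],
    "model_run": ["forecast_product"],
}

impacted = affected_end_products(feeds, {"satellite_feed"})
print(sorted(impacted))
```

With these example chains, a failure at `satellite_feed` reaches two end products, `forecast_product` and `radar_composite`, whose users would need to be informed.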

From a more technical point of view, PRISMA is a web application. It uses big data technology such as Splunk for log analysis and the graph database Neo4j to store the chain structure and states, offers an API for updating the structure automatically, and provides dashboards and graphs for interaction.

Dashboard: a categorized overview of all warnings and errors.
List view (on the left-hand side): showing an attention list, ordered by priority.
Graph: vertically showing infrastructure dependencies and horizontally showing process chains.