AI-Extension Application Hub - User Manual
Purpose
The AI-Extensions project supports the Earth Science and Services communities by expanding existing Earth Observation (EO) platform offerings with operationally mature AI/ML software capabilities. This is achieved through the AI-Extension Application Hub, a dedicated Cloud platform that integrates and operationalises EO and AI capabilities.
This User Manual provides guidelines and step-by-step instructions for developers and consumers to effectively leverage the AI-Extension ML-Lab.
Core Services
The ML-Lab service is built on JupyterHub, which manages and deploys the hub's web applications: MLflow, JupyterLab, Code Server, QGIS Remote Desktop, and STAC Browser. JupyterHub acts as the central platform for launching and managing these applications, providing a seamless and integrated experience for users.
- JupyterHub: JupyterHub is a dedicated multi-user server that brings the power of JupyterLab to collaborative environments. It allows organizations to effortlessly deploy and manage Jupyter Notebook servers for multiple users, enabling seamless collaboration and resource sharing in data science and research settings.
- MLflow: MLflow is an open-source platform for managing the ML lifecycle. It provides tools to track experiments, version models and their hyperparameters, package code into reproducible runs, deploy models, monitor performance, and collaborate among team members. MLflow thus enables users to organise and monitor their ML projects, supporting reproducibility and streamlined deployment workflows (a minimal tracking sketch is shown after this list).
- JupyterLab: JupyterLab is a powerful and flexible web-based Integrated Development Environment (IDE) for data analysis, model development, and interactive scientific computing. With JupyterLab, users can write and execute code, visualise data, and create rich, interactive notebooks, enabling efficient experimentation and exploration. Its flexible and extensible architecture provides a seamless interface for data science workflows, allowing AI-users to explore, analyse, and collaborate on data-driven projects.
- Code Server: Code Server enables users to run Visual Studio Code (VS Code), a lightweight and versatile source code editor that combines the simplicity of a text editor with powerful developer tools, providing an intuitive and customizable environment for coding across various programming languages. With Code Server, VS Code and all its functionalities are available directly from the Application Hub server.
- QGIS Remote Desktop: a remote desktop environment with QGIS, a widely-used free and open-source application for viewing, editing, and analysing geospatial data. It provides a versatile platform with tools for spatial analysis, geoprocessing, and geospatial data visualisation, supporting users' labelling and analysis tasks. This integration empowers AI-users to leverage QGIS's extensive capabilities directly from their web browser and make informed decisions based on geographic data.
- STAC Browser: The SpatioTemporal Asset Catalog (STAC) is a powerful standard for describing geospatial data, so it can be more easily worked with, indexed, discovered and shared. As part of the Application Hub, the system integrates the STAC Browser, which allows AI-users to discover and explore EO data and training datasets. The STAC Browser supports easy search and retrieval of relevant EO data through the STAC standard, enabling efficient data discovery and selection for ML tasks.
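To illustrate the MLflow tracking workflow described above, the following minimal sketch logs a parameter and a metric to an experiment. The tracking URI and experiment name are placeholders; in ML-Lab, the MLflow service URL is provided by the platform.

```python
import mlflow

# Hypothetical tracking endpoint; in ML-Lab, use the MLflow URL provided by the platform.
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("demo-experiment")

# Each run records parameters and metrics for later comparison in the MLflow UI.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_metric("accuracy", 0.93)
```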
User Personas
There are two types of users that benefit from the App Hub in this project:
- An ML Practitioner that we will call “Alice”: an expert in building and training ML models, selecting appropriate algorithms, analysing data, and using ML techniques to solve real-world problems.
- A Stakeholder/Consumer/User that we will call “Eric”: a stakeholder or user (e.g. a business owner, a customer, or a researcher) who benefits from, or relies upon, the insights or predictions generated by the ML models to inform his decision-making process.
User Scenarios
- User Scenario 1 - Alice does Exploratory Data Analysis (EDA): The purpose of Exploratory Data Analysis (EDA) is to analyse the data that will be used to train and evaluate Machine Learning models. The Notebook first shows the steps to access and visualise EO data (e.g. Sentinel-2 scenes) and their metadata using the STAC API (a search sketch is shown after this list of scenarios). It then loads a pre-arranged DataFrame containing labelled geospatial data with reflectance values and vegetation indices and uses it for the EDA.
- User Scenario 2 - Alice labels Earth Observation data: Labelling data is a crucial step in developing supervised Machine Learning (ML) models. It involves assigning relevant labels or categories to different features within the data, such as land cover class (e.g. vegetation, water bodies, urban area) or other physical characteristics of the Earth's surface. These labels can be binary (e.g. water or non-water) or multi-class (e.g. forest, grassland, urban).
- User Scenario 3 - Alice describes the labelled Earth Observation data: Labelled Earth Observation (EO) data can be described using the SpatioTemporal Asset Catalog (STAC) standard. This makes it possible to describe the labelled EO data with standardised sets of metadata that delineate its key properties, such as spatial and temporal extents, resolution, and other pertinent characteristics. Additionally, it enables the user to include details about the labelling process itself and to run parameter-driven queries on the catalog.
- User Scenario 4 - Alice discovers labelled Earth Observation data: This Scenario documents the process for discovering labelled EO data described with the STAC standard. One of the notable advantages of STAC-supported catalogs is the ability to filter search results using STAC metadata. This empowers the user to narrow down the search based on specific labelling criteria and pinpoint the datasets that align with their requirements.
- User Scenario 5 - Alice develops a new Machine Learning model: During the implementation of this Scenario, two Notebooks were developed:
- “Alice develops a new machine learning” Notebook: this is the main Notebook for Scenario 5, as it follows the requirements described in the Service Specification document.
- “Implementation of EuroSAT STAC dataset” Notebook: the objective of this Notebook was to generate a STAC Catalog, Collection, and Items for the EuroSAT dataset. The EuroSAT STAC dataset was then used as the input dataset for developing and testing the main “Alice develops a new machine learning” Notebook.
- User Scenario 6 - Alice starts a training job on a remote machine: This scenario involves executing a training job remotely on a different machine, whose advantages are twofold: on the one hand, the remote machine can be provisioned on-demand through a cloud provider, giving the user the flexibility to access additional resources such as CPUs and GPUs; on the other hand, the user can keep working without being hindered by the resource-intensive and time-consuming training process. Once the experiment begins on the remote machine, the user receives a notification containing a link to the experiment in MLflow, enabling active monitoring of the training progress while reviewing the results in real time.
- User Scenario 7 - Alice describes her trained machine learning model: This Scenario leverages the capabilities of STAC to provide a comprehensive and standardised description of a trained ML model. The user creates a STAC Item that encapsulates the relevant metadata: model name and version, a description of the model architecture and training process, specifications of input and output data formats, performance metrics and evaluation results, and details regarding model deployment and usage (a sketch of such an Item is shown after this list). This approach not only facilitates collaboration but also promotes interoperability within the geospatial and ML communities.
- User Scenario 8 - Alice reuses an existing pre-trained model: In this Scenario, Alice leverages transfer learning, a widely adopted deep learning technique in which a model pre-trained on a large dataset (such as ImageNet) is used as the starting point for training a new ML model on a smaller dataset. This allows the new ML model to achieve higher accuracy even when trained with limited data (see the sketch after this list).
- User Scenario 9 - Alice creates a training dataset: In this Scenario, Alice creates a deep learning dataset of image chips: smaller image patches (chips) are extracted from a large EO image and associated with corresponding land-cover labels, following a semantic segmentation approach (a chipping sketch is shown after this list).
- User Scenario 10 - Eric discovers a model and consumes it: In this Scenario, a geospatial data analyst such as Eric discovers an ML model created by practitioners such as Alice, using the STAC catalog’s search functionalities. Eric can use relevant keywords, such as the model name, geographic location, or application domain, to narrow down the search. The data catalog's search interface lets Eric explore the available ML models that match the search criteria and are described using STAC. Once found, the ML model can be run on other geospatial data, integrated into a larger workflow, or applied within a specific application context. The ML model is deployed as a processing service on an exploitation platform using OGC API Processes (an execution sketch is shown after this list). By following these steps, Eric can seamlessly integrate Alice's model into his existing infrastructure and leverage its capabilities within his geospatial analysis workflows.
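As a hedged illustration of the STAC-based discovery used in Scenarios 1 and 4, the sketch below queries a public STAC API for Sentinel-2 scenes with pystac-client. The endpoint, collection name, and filter values are examples, not the project's actual catalog.

```python
from pystac_client import Client

# Example public STAC API; replace with the platform's own catalog URL.
catalog = Client.open("https://earth-search.aws.element84.com/v1")

search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[13.0, 52.3, 13.7, 52.7],         # area of interest (lon/lat)
    datetime="2023-06-01/2023-06-30",
    query={"eo:cloud_cover": {"lt": 10}},  # filter on STAC metadata
    max_items=5,
)

for item in search.items():
    print(item.id, item.properties["eo:cloud_cover"])
```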
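For Scenarios 3 and 7, a STAC Item can be assembled with pystac. The sketch below describes a trained model; the identifiers, the "ml-model:"-prefixed properties, and the asset location are illustrative assumptions, not the project's actual schema.

```python
from datetime import datetime, timezone
import pystac

# Hypothetical model metadata; the "ml-model:" fields follow the STAC ml-model
# extension conventions and are assumptions for illustration.
item = pystac.Item(
    id="landcover-cnn-v1",
    geometry={"type": "Polygon", "coordinates": [[
        [13.0, 52.3], [13.7, 52.3], [13.7, 52.7], [13.0, 52.7], [13.0, 52.3]]]},
    bbox=[13.0, 52.3, 13.7, 52.7],
    datetime=datetime.now(timezone.utc),
    properties={
        "ml-model:architecture": "ResNet-18",
        "ml-model:learning_approach": "supervised",
    },
)

# Attach the trained weights as an asset (hypothetical storage location).
item.add_asset(
    "weights",
    pystac.Asset(
        href="s3://my-bucket/landcover-cnn-v1.pt",
        media_type="application/octet-stream",
        roles=["ml-model:checkpoint"],
    ),
)
print(item.to_dict())
```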
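The transfer-learning step of Scenario 8 can be sketched with torchvision: reuse ImageNet weights, freeze the backbone, and retrain only a new classification head. The choice of ResNet-18 and of 10 classes (as in EuroSAT) are assumptions for illustration.

```python
import torch
import torchvision

# Load a backbone pre-trained on ImageNet.
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier with a new head for the target classes.
num_classes = 10
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# Optimise only the parameters of the new head.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```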
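For Scenario 9, chip extraction amounts to a sliding window over the image and its label mask. The sketch below is a simplified NumPy version that discards edge pixels that do not fill a whole chip; the actual Notebook may additionally handle overlap and georeferencing.

```python
import numpy as np

def extract_chips(image: np.ndarray, labels: np.ndarray, chip_size: int = 256):
    """Split an image and its per-pixel label mask into aligned square chips."""
    chips, masks = [], []
    rows, cols = image.shape[:2]
    # Step through the image in non-overlapping chip-sized strides.
    for r in range(0, rows - chip_size + 1, chip_size):
        for c in range(0, cols - chip_size + 1, chip_size):
            chips.append(image[r:r + chip_size, c:c + chip_size])
            masks.append(labels[r:r + chip_size, c:c + chip_size])
    return np.stack(chips), np.stack(masks)
```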
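Finally, consuming a deployed model through OGC API Processes (Scenario 10) follows the standard's execution pattern: POST an inputs document to the process's execution endpoint and poll the returned job URL. The base URL, process identifier, and input names below are hypothetical.

```python
import requests

# Hypothetical exploitation-platform endpoint and process identifier.
base = "https://platform.example.com/ogc-api"
process_id = "landcover-inference"

# Request asynchronous execution, as defined by OGC API Processes Part 1.
response = requests.post(
    f"{base}/processes/{process_id}/execution",
    json={"inputs": {"stac_items": "https://catalog.example.com/items/example-item"}},
    headers={"Prefer": "respond-async"},
)
response.raise_for_status()

# The job status URL is returned in the Location header; poll it until completion.
job_url = response.headers["Location"]
print(requests.get(job_url).json()["status"])
```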
User Showcases
- Urban greenery: This showcase focuses on the development and application of AI approaches for urban greenery using EO data. The objective is to support the assessment and implementation of Natural-Based Solutions (NBS) to address urban challenges, specifically focusing on monitoring urban heat patterns and preventing flooding in urban areas.
- Informal settlement: This showcase focuses on the development and application of AI approaches for EO data in the context of urban management, specifically targeting the challenges posed by informal settlements. The objective is to provide objective geospatial data and analytics to support urban management and address the needs of customers in assessing and managing informal settlements.
- Geohazards - volcanoes: The Geohazards volcanoes showcase focuses on the development and application of AI approaches for EO data in the context of monitoring and assessing volcanic hazards. The proposed activities aim to address the need for a dynamic monitoring tool combining ML and Deep Learning (DL) algorithms with SAR data from the Sentinel-1 satellite. These activities will be conducted on the Geohazards Exploitation Platform (GEP) to leverage its capabilities and ensure efficient model development, training, and deployment. The activities are divided into several key phases.
Outreach Articles
A number of descriptive and informative articles, one for each User Scenario, have been written and published on Terradue’s Discuss platform. Each article describes one Jupyter Notebook or an Application Package CWL, presenting the User Scenario(s) in a user-friendly way, its implementation via the Notebook(s), the generated outputs, and links to relevant resources.
See the related section Articles for more information.