Real-time Monitoring of Crop Planting¶

Note

Before running the model, download the PASTIS dataset from the PASTIS official website and place it in the ./UTAE/data/ folder.

Model Training Command

# Semantic segmentation task
python train_semantic.py \
  --dataset_folder "./data/PASTIS" \
  --epochs 100 \
  --batch_size 2 \
  --num_workers 0 \
  --display_step 10
# Panoptic segmentation task
python train_panoptic.py \
  --dataset_folder "./data/PASTIS" \
  --epochs 100 \
  --batch_size 2 \
  --num_workers 0 \
  --warmup 5 \
  --l_shape 1 \
  --display_step 10

Model Evaluation Command

# Semantic segmentation task
wget -c https://paddle-org.bj.bcebos.com/paddlescience/models/utae/semantic.pdparams -P ./pretrained/
python test_semantic.py \
  --weight_file ./pretrained/semantic.pdparams \
  --dataset_folder "./data/PASTIS" \
  --device gpu \
  --num_workers 0
# Panoptic segmentation task
wget -c https://paddle-org.bj.bcebos.com/paddlescience/models/utae/panoptic.pdparams -P ./pretrained/
python test_panoptic.py \
  --weight_folder ./pretrained/panoptic.pdparams \
  --dataset_folder ./data/PASTIS \
  --batch_size 2 \
  --num_workers 0 \
  --device gpu

Pretrained Model	Metrics
Semantic segmentation task	OA (Overall Accuracy): 86.7%; mIoU (mean Intersection over Union): 72.6%
Panoptic segmentation task	SQ (Segmentation Quality): 83.8; RQ (Recognition Quality): 58.9; PQ (Panoptic Quality): 49.7
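The two semantic metrics in the table can be derived from a confusion matrix. The following is a minimal sketch, not the evaluation code of this example: the function names and the toy matrix are illustrative assumptions, and rows are assumed to index true classes while columns index predicted classes.

```python
def overall_accuracy(conf):
    """OA: correctly classified pixels divided by all pixels."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total


def mean_iou(conf):
    """mIoU: per-class TP / (TP + FP + FN), averaged over classes that occur."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # true class c, predicted otherwise
        fp = sum(conf[r][c] for r in range(n)) - tp  # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy 3-class confusion matrix (hypothetical values, 100 pixels in total).
toy_conf = [
    [50, 5, 5],
    [10, 20, 0],
    [0, 0, 10],
]
oa = overall_accuracy(toy_conf)
miou = mean_iou(toy_conf)
```

For the toy matrix, 80 of the 100 pixels lie on the diagonal, so OA is 0.8, while the per-class IoUs are 50/70, 20/35 and 10/15. This illustrates why mIoU is typically lower than OA when some classes are confused with others.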

Background Introduction¶

Efficient and accurate monitoring of crop planting distribution and growth status is a core requirement of modern smart agriculture and food security. Traditional manual surveys are time-consuming and labor-intensive, while methods based on single-date satellite images struggle with cloud cover and cannot capture how crops change over the growth cycle.

Satellite Image Time Series (SITS) offer a way to address this problem. By repeatedly acquiring multi-spectral images of the same area at different times, SITS data capture spectral and texture information across the whole crop cycle, from sowing and emergence through growth and maturity to harvest. However, SITS data are long, high-dimensional, and strongly correlated in space and time, so efficiently extracting features from them and performing accurate pixel-level classification (semantic segmentation) is a major technical challenge.

This project builds on the U-TAE (U-Net Temporal Attention Encoder) model, implemented with the PaddlePaddle deep learning framework. It aims to provide an end-to-end solution for semantic segmentation of the satellite image time series in the PASTIS dataset, enabling automated, high-precision identification and monitoring of multiple crop types. The approach can be applied to agricultural resource surveys, yield estimation, disaster assessment, and related fields.

Model Principle¶

This section only briefly introduces the principle of the U-TAE model. For the detailed theoretical derivation, refer to the paper: Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks

1. Overall Structure¶

UTAE (U-Net Temporal Attention Encoder) adopts an encoder-decoder architecture, specifically designed for semantic segmentation of satellite image time series:

Encoder: Uses a lightweight ResNet-18 to extract spatial features from each individual acquisition date.
Decoder: Integrates the U-TAE module, which uses a temporal attention mechanism to aggregate global context across acquisition dates.
Output: Generates a pixel-level category probability map with the same resolution as the input.

2. Temporal Attention¶

For a frame sequence of length \(T\), UTAE calculates inter-frame similarity weights for each frame in the decoding stage to achieve adaptive temporal information aggregation:

Query: Feature of current frame \(\mathbf{Q}\)
Key / Value: Features of all frames \(\mathbf{K}, \mathbf{V}\)

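The calculation proceeds as follows. First, the attention weights over the \(T\) frames are obtained by applying a softmax to the query-key similarities:

\[ \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_T) = \text{Softmax}(\mathbf{Q} \cdot \mathbf{K}^\top) \]

Then, these weights are used to compute a weighted sum of the features of all frames, which gives the aggregated features:

\[ \mathbf{F}_{\text{agg}} = \sum_{t=1}^{T} \alpha_t \mathbf{V}_t, \quad \text{where} \quad \alpha_t = \frac{\exp(\mathbf{Q} \cdot \mathbf{K}_t^\top)}{\sum_{s=1}^{T} \exp(\mathbf{Q} \cdot \mathbf{K}_s^\top)} \]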

This mechanism can automatically suppress low-quality frames such as clouds and shadows, and improve the clarity of crop boundaries.
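The weighted aggregation described in this section can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: a single query vector, a single head, no scaling and no learned projections, so it does not reproduce the LTAE2d layer used in the implementation. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(query, keys, values):
    """Aggregate T value vectors with weights softmax_t(query . keys[t])."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    aggregated = [
        sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim)
    ]
    return weights, aggregated


# Toy sequence with T = 3 dates. The third key is the least similar to the
# query, standing in for a low-quality (for example cloudy) acquisition.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [2.0, 1.0], [-3.0, 0.0]]
values = [[1.0, 1.0], [3.0, 1.0], [100.0, 100.0]]
weights, aggregated = temporal_attention(query, keys, values)
```

In this toy case the third date receives a weight below 1%, so its outlying value vector contributes little to the aggregate, which is the suppression effect described above.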

3. Global-Local Temporal Block (GLTB)¶

Each decoder layer contains two parallel branches:

Global Branch: Adopts Multi-Head Self-Attention mechanism to model long-range dependencies at the field level.
Local Branch: Uses \(3 \times 3\) depthwise separable convolution, focusing on the preservation of edge and detail information.

The outputs of the two branches are fused by element-wise addition, which preserves both global context and local texture details.

4. Real-time Inference Optimization¶

To achieve efficient real-time inference, the model adopts the following optimization strategies:

Lightweight Backbone: ResNet-18, with fewer than 12M parameters.
Inter-frame Shared Weights: Within a sequence, Key and Value are computed only once to avoid repeated computation.
Sliding Window Inference: Large images are split into tiles that are inferred one at a time, so that GPU memory usage stays constant regardless of image size.
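The tiling step of sliding window inference can be sketched as follows. This is an illustrative assumption about how tiles might be laid out, not code from this example: the helper names, the default tile size of 128 and the edge-clamping policy are all hypothetical choices.

```python
def tile_origins(length, tile, stride):
    """Start offsets of tiles covering [0, length); the last tile is clamped to the edge."""
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] != length - tile:
        origins.append(length - tile)
    return origins


def sliding_windows(height, width, tile=128, stride=128):
    """Return (top, left, bottom, right) boxes covering a height x width image."""
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in tile_origins(height, tile, stride)
        for left in tile_origins(width, tile, stride)
    ]


# A 300 x 256 image is covered by 3 x 2 tiles of 128 x 128 pixels.
boxes = sliding_windows(300, 256)
```

Because every tile has the same size, the memory needed per forward pass does not depend on the size of the full image. Clamping the last tile to the image edge makes it overlap its neighbour, so overlapping predictions would need to be merged, for example by averaging.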

Dataset Introduction¶

The PASTIS dataset consists of 2,433 multi-spectral image sequences, in which each image has shape \(10\times128\times128\). Each sequence contains between 38 and 61 observations acquired between September 2018 and November 2019, for a total of more than 2 billion pixels. The acquisition interval is irregular, averaging 5 days; this irregularity arises because the satellite data provider automatically filters out acquisitions with heavy cloud cover. The dataset covers more than 4,000 square kilometers, with images from four regions of France that have diverse climates and crop distributions. The dataset can be downloaded from the PASTIS official website.
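Because sequences contain different numbers of observations, batching them requires padding the shorter ones, and padded dates must then be excluded from temporal operations. The sketch below illustrates the idea on scalar observations. It is an assumption-laden toy, not the dataloader of this example: the function names are hypothetical, and the sketch tracks padding with an explicit mask rather than by comparing values with the pad value.

```python
def pad_sequences(sequences, pad_value=0.0):
    """Pad variable-length sequences to a common length and return a pad mask.

    mask[i][t] is True where position t of sequence i is padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([False] * len(seq) + [True] * n_pad)
    return padded, mask


def masked_temporal_mean(padded, mask):
    """Temporal average that ignores padded dates."""
    means = []
    for row, row_mask in zip(padded, mask):
        valid = [x for x, is_pad in zip(row, row_mask) if not is_pad]
        means.append(sum(valid) / len(valid))
    return means


# Each observation is reduced to one number here; in PASTIS it is a
# 10 x 128 x 128 array.
batch = [[2.0, 4.0, 6.0], [1.0, 3.0]]
padded, mask = pad_sequences(batch)
means = masked_temporal_mean(padded, mask)
```

Without the mask, the padded zero in the second sequence would pull its mean down from 2.0 to about 1.33, which is why temporal averaging and attention need to know which dates are padding.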

Model Implementation¶

Model Construction¶

This example implements the U-TAE (UTAE) model and wraps it for use with PaddleScience as follows:

examples/UTAE/src/backbones/utae.py
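import paddle
import paddle.nn as nn

# ConvBlock, DownConvBlock, UpConvBlock, Temporal_Aggregator and LTAE2d are
# helper layers defined elsewhere in this example's source code.


class UTAE(nn.Layer):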
    """
    U-TAE architecture for spatio-temporal encoding of satellite image time series.
    Args:
        input_dim (int): Number of channels in the input images.
        encoder_widths (List[int]): List giving the number of channels of the successive encoder_widths of the convolutional encoder.
        This argument also defines the number of encoder_widths (i.e. the number of downsampling steps +1)
        in the architecture.
        The number of channels are given from top to bottom, i.e. from the highest to the lowest resolution.
        decoder_widths (List[int], optional): Same as encoder_widths but for the decoder. The order in which the number of
        channels should be given is also from top to bottom. If this argument is not specified the decoder
        will have the same configuration as the encoder.
        out_conv (List[int]): Number of channels of the successive convolutions for the
        str_conv_k (int): Kernel size of the strided up and down convolutions.
        str_conv_s (int): Stride of the strided up and down convolutions.
        str_conv_p (int): Padding of the strided up and down convolutions.
        agg_mode (str): Aggregation mode for the skip connections. Can either be:
            - att_group (default) : Attention weighted temporal average, using the same
            channel grouping strategy as in the LTAE. The attention masks are bilinearly
            resampled to the resolution of the skipped feature maps.
            - att_mean : Attention weighted temporal average,
                using the average attention scores across heads for each date.
            - mean : Temporal average excluding padded dates.
        encoder_norm (str): Type of normalisation layer to use in the encoding branch. Can either be:
            - group : GroupNorm (default)
            - batch : BatchNorm
            - instance : InstanceNorm
        n_head (int): Number of heads in LTAE.
        d_model (int): Parameter of LTAE
        d_k (int): Key-Query space dimension
        encoder (bool): If true, the feature maps instead of the class scores are returned (default False)
        return_maps (bool): If true, the feature maps instead of the class scores are returned (default False)
        pad_value (float): Value used by the dataloader for temporal padding.
        padding_mode (str): Spatial padding strategy for convolutional layers (passed to nn.Conv2D).
    """

    def __init__(
        self,
        input_dim,
        encoder_widths=[64, 64, 64, 128],
        decoder_widths=[32, 32, 64, 128],
        out_conv=[32, 20],
        str_conv_k=4,
        str_conv_s=2,
        str_conv_p=1,
        agg_mode="att_group",
        encoder_norm="group",
        n_head=16,
        d_model=256,
        d_k=4,
        encoder=False,
        return_maps=False,
        pad_value=0,
        padding_mode="reflect",
    ):

        super(UTAE, self).__init__()
        self.n_stages = len(encoder_widths)
        self.return_maps = return_maps
        self.encoder_widths = encoder_widths
        self.decoder_widths = decoder_widths
        self.enc_dim = (
            decoder_widths[0] if decoder_widths is not None else encoder_widths[0]
        )
        self.stack_dim = (
            sum(decoder_widths) if decoder_widths is not None else sum(encoder_widths)
        )
        self.pad_value = pad_value
        self.encoder = encoder
        if encoder:
            self.return_maps = True

        if decoder_widths is not None:
            assert len(encoder_widths) == len(decoder_widths)
            assert encoder_widths[-1] == decoder_widths[-1]
        else:
            decoder_widths = encoder_widths

        self.in_conv = ConvBlock(
            nkernels=[input_dim] + [encoder_widths[0], encoder_widths[0]],
            pad_value=pad_value,
            norm=encoder_norm,
            padding_mode=padding_mode,
        )
        self.down_blocks = nn.LayerList(
            [
                DownConvBlock(
                    d_in=encoder_widths[i],
                    d_out=encoder_widths[i + 1],
                    k=str_conv_k,
                    s=str_conv_s,
                    p=str_conv_p,
                    pad_value=pad_value,
                    norm=encoder_norm,
                    padding_mode=padding_mode,
                )
                for i in range(self.n_stages - 1)
            ]
        )
        self.up_blocks = nn.LayerList(
            [
                UpConvBlock(
                    d_in=decoder_widths[i],
                    d_out=decoder_widths[i - 1],
                    d_skip=encoder_widths[i - 1],
                    k=str_conv_k,
                    s=str_conv_s,
                    p=str_conv_p,
                    norm="batch",
                    padding_mode=padding_mode,
                )
                for i in range(self.n_stages - 1, 0, -1)
            ]
        )
        self.temporal_encoder = LTAE2d(
            in_channels=encoder_widths[-1],
            d_model=d_model,
            n_head=n_head,
            mlp=[d_model, encoder_widths[-1]],
            return_att=True,
            d_k=d_k,
        )
        self.temporal_aggregator = Temporal_Aggregator(mode=agg_mode)
        self.out_conv = ConvBlock(
            nkernels=[decoder_widths[0]] + out_conv, padding_mode=padding_mode
        )

    def forward(self, input, batch_positions=None, return_att=False):
        # Create pad mask by comparing with pad_value
        # Use safe tensor comparison to avoid type issues
        pad_value_tensor = paddle.to_tensor(self.pad_value, dtype=input.dtype)
        comparison = paddle.equal(input, pad_value_tensor)

        # Sequentially reduce dimensions using all()
        mask_step1 = paddle.all(comparison, axis=-1)  # Reduce last dim
        mask_step2 = paddle.all(mask_step1, axis=-1)  # Reduce second-to-last dim
        pad_mask = paddle.all(mask_step2, axis=-1)  # Reduce third-to-last dim (BxT)
        out = self.in_conv.smart_forward(input)
        feature_maps = [out]
        # SPATIAL ENCODER
        for i in range(self.n_stages - 1):
            out = self.down_blocks[i].smart_forward(feature_maps[-1])
            feature_maps.append(out)
        # TEMPORAL ENCODER
        out, att = self.temporal_encoder(
            feature_maps[-1], batch_positions=batch_positions, pad_mask=pad_mask
        )
        # SPATIAL DECODER
        if self.return_maps:
            maps = [out]
        for i in range(self.n_stages - 1):
            skip = self.temporal_aggregator(
                feature_maps[-(i + 2)], pad_mask=pad_mask, attn_mask=att
            )
            out = self.up_blocks[i](out, skip)
            if self.return_maps:
                maps.append(out)

        if self.encoder:
            return out, maps
        else:
            out = self.out_conv(out)
            if return_att:
                return out, att
            if self.return_maps:
                return out, maps
            else:
                # Default case: return only the class scores
                return out
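The default strided-convolution settings in the listing (str_conv_k=4, str_conv_s=2, str_conv_p=1) halve the spatial resolution at each down block. The sketch below works through the arithmetic with the standard convolution output-size formula. It assumes that the strided convolution is the only operation that changes the spatial size and that the input convolution block preserves it; the helper names are hypothetical.

```python
def conv_out_size(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_resolutions(input_size, n_stages):
    """Resolution at each encoder stage; each stage after the first downsamples once."""
    sizes = [input_size]
    for _ in range(n_stages - 1):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes


encoder_widths = [64, 64, 64, 128]  # default from the listing above
resolutions = encoder_resolutions(128, len(encoder_widths))
```

Under these assumptions a 128 x 128 PASTIS patch is reduced to 64, 32 and then 16 pixels per side, so the temporal encoder would operate on 16 x 16 feature maps with 128 channels.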

Visualization Results¶

On the PASTIS dataset, this example reproduces the panoptic and semantic segmentation predictions visualized in the figure below:

(a) Original image (b) Annotation (Ground Truth) (c) Panoptic segmentation prediction (d) Semantic segmentation prediction

The figure above shows the farmland plot segmentation results on the PASTIS dataset, with different plots drawn in different colors. The green circle marks a case where a large plot is incorrectly identified as a single plot; the red circle marks many slender plots that are not correctly detected; the blue circle marks a case where panoptic segmentation outperforms semantic segmentation. The model performs well at detecting region boundaries, especially at recovering complex boundaries. However, slender, fragmented, or complex plots remain challenging and tend to cause lower confidence or missed detections.

References¶

U-TAE Original Paper: Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks
Source Code Implementation: https://github.com/VSainteuf/utae-paps
Dataset and Benchmark: https://github.com/VSainteuf/pastis-benchmark