Adaptive Sampling

Overview

Sampling is an in situ data reduction approach for the scalar datasets generated by scientific simulations. This generic, feature-driven data reduction method currently applies to regular-grid datasets and is available through Ascent as a VTK-m/VTK-h filter. The method analyzes the scalar data distribution and automatically assigns importance to individual scalar values. Since scientific datasets generally contain features that scientists are primarily interested in, this distribution-driven sampling approach identifies the important regions of the dataset based on their probability of occurrence: features typically span far fewer data points than the background (“non-interesting”) regions of the data. Exploiting this observation, the sampling approach assigns more importance to low-probability scalar values and less importance to frequently occurring ones.

The sampling method creates a point representation of the selected data points during the in situ processing phase. These points can later be restored to the original size/resolution of the dataset by an inverse operation, or the samples can be visualized directly as a preview of the dataset using visualization tools such as ParaView.
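As a rough illustration only (not the inverse operation described in the paper), the sketch below rebuilds a small regular-grid field from a scattered 5% point subset using nearest-neighbor interpolation from SciPy; the field, grid size, and sampling mask are all made up for the example:

# Illustrative sketch: rebuild a regular-grid field from a sampled
# point subset via nearest-neighbor interpolation (SciPy). This is
# only one simple possible "inverse operation".
import numpy as np
from scipy.interpolate import griddata

# Toy 64x64 grid field and a random 5% point subset standing in for
# the output of the sampling filter.
ny, nx = 64, 64
yy, xx = np.mgrid[0:ny, 0:nx]
field = np.sin(xx / 8.0) * np.cos(yy / 8.0)

keep = np.random.rand(field.size) < 0.05            # sampled point mask
pts = np.column_stack([xx.ravel()[keep], yy.ravel()[keep]])
vals = field.ravel()[keep]

# Restore to the original resolution for a quick preview.
restored = griddata(pts, vals, (xx, yy), method="nearest")
print("max abs error:", np.abs(restored - field).max())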

To perform the sampling, a histogram of the dataset is created first. The histogram serves as a representation of the data distribution and is used to identify and assign importance to the individual scalar values. The goal is to pick an equal number of samples from each bin; since bins have varying frequencies, this in turn assigns high importance to the scalars that fall in bins with low counts. More details of the approach can be found in this paper: “In Situ Data-Driven Adaptive Sampling for Large-scale Simulation Data Summarization”, Ayan Biswas, Soumya Dutta, Jesus Pulido, and James Ahrens, In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization (ISAV 2018), co-located with Supercomputing 2018.
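The core idea can be illustrated with a small, self-contained sketch (plain NumPy, not the VTK-m implementation; the function name adaptive_sample and all constants are illustrative). Each bin receives an acceptance probability chosen so that every bin contributes roughly the same number of samples, which preferentially keeps rare scalar values. Unlike the actual algorithm, this sketch does not redistribute the unused budget of sparse bins, so the retained fraction can end up below the requested rate:

# Minimal NumPy sketch of histogram-driven sampling (illustrative only).
import numpy as np

def adaptive_sample(values, sample_rate=0.1, bins=128, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    counts, edges = np.histogram(values, bins=bins)

    # Target number of samples per bin if the budget were split evenly.
    budget = int(sample_rate * values.size)
    target_per_bin = budget / max(np.count_nonzero(counts), 1)

    # Acceptance probability per bin: keep everything in sparse bins,
    # keep only a fraction of the points in densely populated bins.
    accept = np.ones(counts.size, dtype=float)
    nonzero = counts > 0
    accept[nonzero] = np.minimum(1.0, target_per_bin / counts[nonzero])

    # Map each value to its bin and draw a Bernoulli trial per point.
    bin_ids = np.clip(np.digitize(values, edges[1:-1]), 0, bins - 1)
    keep = rng.random(values.size) < accept[bin_ids]
    return np.flatnonzero(keep)          # indices of the retained points

# Example: a skewed field where the "feature" (large values) is rare.
field = np.concatenate([np.random.normal(0, 1, 95_000),
                        np.random.normal(8, 0.5, 5_000)])
kept = adaptive_sample(field, sample_rate=0.05, bins=64)
print(kept.size, "of", field.size, "points retained")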

Getting Started

The distribution-driven sampling approach is available through Ascent and is implemented as VTK-m and VTK-h filters. VTK-h is used to create the distributed histogram and make it available to all of the individual processors where the VTK-m filter is executed. Because of this dependence on the data histogram, the filter is named “HistSampling”. Since the lightweight in situ library Ascent can be connected to a simulation code, the sampling algorithm can be executed directly in situ on supercomputers.
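The distributed histogram itself is conceptually simple, as the sketch below suggests (mpi4py and NumPy, not the actual VTK-h code): each rank histograms its local block using shared bin edges, and a sum allreduce gives every rank the identical global histogram. In practice the global data range also has to be agreed on first, e.g., via a min/max reduction:

# Conceptual sketch (mpi4py/NumPy), not the VTK-h implementation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

local_values = np.random.rand(100_000)        # this rank's local block
edges = np.linspace(0.0, 1.0, 65)             # shared bin edges (64 bins)
local_counts, _ = np.histogram(local_values, bins=edges)

# Histograms are additive: summing per-rank bin counts yields the
# global histogram, which every rank then holds.
global_counts = np.empty_like(local_counts)
comm.Allreduce(local_counts, global_counts, op=MPI.SUM)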

This easy-to-use sampling filter expects only three parameters from the user:

  1. Scalar field name (“field”) - the name of the scalar field that is to be sampled.
  2. Rate of sampling (“sample_rate”) - the amount of data to be stored after sampling. If the total available storage bandwidth is m and the dataset size for the current time step is n, then this value can be set to m/n or less (see the worked example after this list). The default is 0.1, i.e., 10% of the original data will be stored.
  3. Number of bins for the histogram (“bins”) - number of bins for creating the data histogram. The default is 128.
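For example (with hypothetical sizes), if roughly 5 GB of storage is available per time step and the scalar field occupies 100 GB, the sample rate would be set to 5/100 = 0.05, so about 5% of the data points are retained.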

These parameters can be specified by assigning values in an ascent_actions.json file.

Use Case Examples

To demonstrate the use of the histogram-based sampling filter, here is an example ascent_actions.json file:

[
  {
    "action": "add_pipelines",
    "pipelines":
    {
      "pl1":
      {
        "f1":
        {
          "type": "histsampling",
          "params":
          {
            "field": "density",
            "sample_rate": "0.05",
            "bins": "64"
          }
        }
      }
    }
  },
  {
    "action": "execute"
  },

  {
    "action": "reset"
  }
]

In the use case above, we have selected parameters that are appropriate for down-sampling a scalar field named “density”.

  1. “field”: We choose the scalar field “density”, which is generated by the scientific simulation.
  2. “sample_rate”: The sampling rate is set to 0.05, i.e., 5% of the original data will be stored at each time step.
  3. “bins”: In this example we use 64 bins to create the histogram that is used for sampling the scalar field.
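The same actions can also be assembled programmatically through Ascent's Python interface. The following is a rough sketch rather than an excerpt from the Ascent documentation: it assumes Ascent's Python bindings are available, uses Conduit's built-in “braid” example mesh (with its “braid” vertex field) in place of a real simulation field such as “density”, and adds a relay extract (assuming HDF5 support) so the sampled output can be previewed, e.g., in ParaView:

# Rough sketch: drive the histsampling pipeline through Ascent's
# Python API, using Conduit's "braid" example mesh as a stand-in for
# simulation data.
import conduit
import conduit.blueprint
import ascent

# Stand-in for the simulation's published mesh; the vertex field on
# this example mesh is named "braid".
mesh = conduit.Node()
conduit.blueprint.mesh.examples.braid("uniform", 64, 64, 64, mesh)

actions = conduit.Node()

add_pl = actions.append()
add_pl["action"] = "add_pipelines"
add_pl["pipelines/pl1/f1/type"] = "histsampling"
add_pl["pipelines/pl1/f1/params/field"] = "braid"
add_pl["pipelines/pl1/f1/params/sample_rate"] = 0.05
add_pl["pipelines/pl1/f1/params/bins"] = 64

# Save the sampled result so it can be previewed (e.g., in ParaView).
add_ex = actions.append()
add_ex["action"] = "add_extracts"
add_ex["extracts/e1/type"] = "relay"
add_ex["extracts/e1/pipeline"] = "pl1"
add_ex["extracts/e1/params/path"] = "sampled_braid"
add_ex["extracts/e1/params/protocol"] = "blueprint/mesh/hdf5"

a = ascent.Ascent()
a.open()
a.publish(mesh)      # in a simulation, this is the live mesh
a.execute(actions)   # run the sampling pipeline in situ
a.close()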

Performance

To improve the overall performance of the sampling operation, this filter can take advantage of hardware accelerators. Furthermore, the construction of the distributed histogram is additive across processors (local bin counts are simply summed), which also helps improve scalability.

Developers

Ayan Biswas, Matt Larsen, Li-Ta Lo