Paper Reading and Journal Club Discussion

Why ML models underestimate extreme values, and how LatentNN may help neutrino experiments

A journal club report on how noisy input features can systematically compress machine-learning predictions, with a focus on possible applications to neutrino experiments.

Report: Zhen Liu · Date: 2026.05.11 · Article: Yuan-Sen Ting, arXiv:2512.23138v2

Motivation

The practical problem is not just accuracy. It is generalization from MC to real data.

For neutrino experiments, labeled real data are unavailable, while rare signals make purely data-driven supervision even harder.

Data-MC generalization gap

  • MC simulations are the primary source of labeled training samples.
  • A persistent data-MC discrepancy means models trained on MC may fail to generalize to real data.
  • True labels never exist in real data.
  • Signal events can be quite rare in neutrino experiments such as JUNO and NvDEx.

Why this journal club paper is relevant

The LatentNN paper is from astronomy, but it addresses a core statistical failure mode that can also matter for detector ML: noisy or mismodeled inputs can systematically compress predictions toward the mean.

JUNO Motivation

JUNO is an example of why input-level data-MC mismatch matters.

The atmospheric-neutrino task is used here as an example: PID and reconstruction depend on PMT waveform features, while training still relies on MC labels.

Animated spherical detector event display
JUNO PMT waveform data-MC asymmetry and pull maps
PMT-level data-MC differences in waveform-derived features.
  • ML is necessary: atmospheric-neutrino event reconstruction and neutrino/anti-neutrino identification are not possible with traditional methods.
  • The model input is detector response: PMT waveform features carry the event information.
  • The supervision comes from MC: the model is trained on simulated labeled events.
  • The risk is distribution mismatch: real PMT waveforms and MC PMT waveforms differ.

However

Data and MC waveforms are different, so a model trained only on MC can be biased when applied to real data.

Existing Methods

Common domain-generalization approaches do not cleanly solve these tasks.

Several common strategies are limited for JUNO atmospheric-neutrino tasks and the NvDEx signal-background problem.

Domain-adversarial training / domain adaptation

Mechanism: align MC and data feature distributions through a discriminator.

Limitation: requires a large amount of real data.

Self-supervised learning

Mechanism: pre-train on unlabeled samples, then fine-tune on a small labeled dataset or MC.

Limitation: real-data labels are unavailable; MC fine-tuning can reintroduce bias.

Unsupervised learning

Mechanism: iteratively learn patterns from unlabeled real data.

Limitation: requires a large amount of real data.

Simpler models such as BDTs

Mechanism: use hand-crafted low-level features and a simpler, more robust model class.

Limitation: performance is limited.

Takeaway

These limitations motivate looking for a different approach: not just aligning distributions, but explicitly modeling uncertainty in the inputs.

Bridge to the Paper

Attenuation bias is a systematic compression of predictions toward the mean.

The astronomy setting is different, but the statistical structure is useful: training data contain measurement errors, and the model underestimates extremes.

Main paper claims

  • Astronomical observations often have signal-to-noise ratios of order 10.
  • When training data contain measurement errors, models systematically compress predictions toward the mean.
  • High values are under-predicted.
  • Low values are over-predicted.
  • Models therefore fail to accurately predict extreme values.

LatentNN in one sentence

LatentNN treats unknown true input values as learnable latent variables and optimizes them together with the neural network parameters.

Mechanism

A simple regression example explains why the learned slope becomes too shallow.

The issue begins when the input x has measurement error but the target y is linked to the unknown true input.

Schematic showing true points, observed noisy points, and a shallower fitted line
Input errors stretch the data horizontally and lead to a shallower fitted relationship.

Mechanism

The true relationship is:

ytrue = f(xtrue)

But the observed input xobs = xtrue + noise carries measurement error, while y does not depend on that noise, so the data are stretched horizontally without a corresponding vertical stretch.

The training data look wider horizontally than the true signal. Any regression method treating xobs as exact will fit a slope that is systematically shallower than the true relationship.

The naive ML approach trains on observed pairs (xobs, yobs) as if xobs were the true value. Errors in x systematically weaken the relationship.

Attenuation Factor

For a linear relationship, the bias can be written explicitly.

The attenuation factor is the clean mathematical reason the learned slope is smaller than the true slope.

Schematic showing true points, observed noisy points, and a shallower fitted line
The same errors-in-variables geometry gives the attenuation factor its meaning.

Linear case

For y = βx, the ordinary least squares estimator has expected value:

E[β̂] = β · λβ, where λβ = 1 / (1 + (σx / σrange)²)

Here σrange is the intrinsic spread of the true input values over their range, σx is the input measurement uncertainty, and SNR = σrange / σx.

SNR = 10

λβ ≈ 0.99, roughly 1% bias.

SNR ≈ 3

λβ ≈ 0.90, roughly 10% bias.

SNR = 1

λβ = 0.50, a large bias.

Equation for the attenuation factor
The attenuation factor for a linear relationship with noisy input measurements.
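The three SNR cases above can be checked numerically. This is a minimal sketch, not code from the paper; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_true = 2.0        # true slope in y = beta * x_true
sigma_range = 1.0      # intrinsic spread of the true inputs
n = 200_000

results = {}
for snr in (10.0, 3.0, 1.0):
    sigma_x = sigma_range / snr                       # input noise at this SNR
    x_true = rng.normal(0.0, sigma_range, n)
    y = beta_true * x_true                            # y depends on the TRUE input
    x_obs = x_true + rng.normal(0.0, sigma_x, n)      # noisy observed input

    # OLS slope through the origin, treating x_obs as exact
    beta_hat = np.sum(x_obs * y) / np.sum(x_obs**2)
    results[snr] = beta_hat / beta_true               # measured attenuation

    lam = 1.0 / (1.0 + (sigma_x / sigma_range) ** 2)  # predicted attenuation factor
    print(f"SNR={snr:4.1f}: measured {results[snr]:.3f}, predicted {lam:.3f}")
```

With these settings the measured ratios land near 0.99, 0.90, and 0.50, matching λβ for the three SNR values quoted above.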

Neural Networks

MLPs show the same attenuation behavior as linear regression.

The paper tests the effect with a simple neural network and finds that flexibility alone does not remove the systematic compression.

Results showing MLP attenuation across different input signal-to-noise ratios
Standard MLP predictions are compressed toward the mean at low input SNR.

Key points

  • Neural networks introduce nonlinear models and high-dimensional inputs.
  • The paper tests the behavior with an MLP, the simplest neural-network architecture.
  • Predictions are systematically compressed toward the mean.
  • Model complexity does not resolve the issue.
  • Regularization does not resolve the issue.

LatentNN

The proposed fix is to learn the hidden true inputs, not just the mapping.

LatentNN adds one learnable latent input per sample and per input dimension, initialized at the observed value.

Loss structure

Losstotal = (y − f(xlatent))² / σy² + (xobs − xlatent)² / σx²
  • σy and σx must be specified for proper correction.
  • Latent values are initialized at the observed values.
  • For N samples and p input dimensions, the optimizer handles network parameters θ plus N × p latent variables.
  • The total parameter count is much larger than in standard neural network training.
LatentNN one-dimensional loss function
One-dimensional LatentNN objective with prediction and latent-input likelihood terms.
1. Observed inputs

Start with measured features xobs.

2. Latent inputs

Create learnable xlatent initialized at xobs.

3. Joint training

Optimize xlatent together with network parameters.

4. Prediction

Use latent features for regression, PID, or classification.
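The four steps above can be sketched in a one-dimensional linear toy model. For simplicity this sketch alternates closed-form updates of the latent inputs and the slope rather than using the paper's joint gradient optimization; the names (`a`, `x_lat`, etc.) are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = beta * x_true, observed at input SNR = sigma_range / sigma_x = 1
beta, sigma_range, sigma_x, sigma_y, n = 2.0, 1.0, 1.0, 0.5, 2000
x_true = rng.normal(0.0, sigma_range, n)
x_obs = x_true + rng.normal(0.0, sigma_x, n)     # step 1: observed inputs
y = beta * x_true + rng.normal(0.0, sigma_y, n)

# Naive fit treating x_obs as exact: attenuated to ~beta/2 at SNR = 1
a_naive = np.sum(x_obs * y) / np.sum(x_obs**2)

# Steps 2-3: latent inputs initialized at x_obs, then optimized together
# with the model parameter a by minimizing
#   (y - a*x_lat)^2 / sigma_y^2 + (x_lat - x_obs)^2 / sigma_x^2
a = a_naive
x_lat = x_obs.copy()
for _ in range(300):
    # optimal latents for fixed a (quadratic in x_lat, so closed form)
    x_lat = (a * y / sigma_y**2 + x_obs / sigma_x**2) / (a**2 / sigma_y**2 + 1 / sigma_x**2)
    # optimal slope for fixed latents (least squares through the origin)
    a = np.sum(x_lat * y) / np.sum(x_lat**2)

# Step 4: predict with latent features; the recovered slope is close to beta
print(f"naive slope {a_naive:.2f}, latent fit {a:.2f}, true {beta:.2f}")
```

The naive slope sits near 1.0 (half the true value, as λβ = 0.5 predicts), while the latent fit recovers a slope close to 2.0.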

Experiments: One-Dimensional Case

LatentNN removes most attenuation in the simplest noisy-input regression test.

The key comparison is between a standard MLP and LatentNN at low input SNR, followed by a sweep over SNR values.

One-dimensional LatentNN results at SNR equals 1
Standard MLP remains attenuated at SNR = 1, while LatentNN recovers a much less biased relationship.
LatentNN attenuation factor sweep across SNR values
LatentNN maintains λ near 1 across tested SNR values.

Interpretation

At SNR = 1, treating x as exact makes the MLP learn a compressed relationship. LatentNN uses the known input uncertainty to move the latent inputs toward a relationship that is closer to the true function.

Experiments: Multivariate Inputs

Redundant correlated inputs help LatentNN identify the hidden true features.

The paper tests p = 3, 10, and 30 input dimensions across SNR values from 1 to 10.

Multivariate LatentNN results for p equals 3, 10, and 30
Multivariate inputs with p = 3, 10, and 30 across SNR values from 1 to 10.

Multivariate interpretation

Increasing p generally helps because correlated features can support each other. The network can learn their covariance structure and use redundancy for more robust inference.

However, p = 3 can be harder than p = 1: the posterior over latent values becomes more complex, but three features may not provide enough redundancy to leverage.

For detector ML, this points to a practical design principle: keep informative correlated features, but avoid feeding every available dimension when the uncertainty model is not clear.

Stellar Spectra Example

The paper demonstrates LatentNN on synthetic spectra with metallicity labels.

This is the closest paper example to detector feature vectors: many input measurements, informative subsets, and measurement uncertainty.

Stellar spectra and LatentNN performance plots
Stellar spectra and attenuation-factor results for 3, 10, and 30 informative pixels.

Spectral setup

  • The input vector x consists of flux measurements at different wavelength pixels.
  • The output y is a stellar parameter, such as metallicity [M/H].
  • The sample uses synthetic spectra for stars with varying [M/H] values.

Relevance for neutrino ML

Like PMT features, spectra are high-dimensional input measurements. The paper suggests that restricting to informative dimensions keeps LatentNN tractable.

Model Generalization

The latent-variable formulation can be adapted to both regression and classification.

The same latent-input idea can be used for event reconstruction and for signal/background or PID tasks.

Regression tasks

For event reconstruction, use a continuous prediction loss while learning latent input features.

Regression loss with input covariance
Regression form with an input covariance term for correlated feature uncertainties.

Classification tasks

For PID or signal/background identification, replace the prediction term with cross-entropy while keeping the latent likelihood term.

Classification loss with input covariance
Classification form with cross-entropy plus the latent-input likelihood term.
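As an illustration of the classification form, the cross-entropy term can be combined with the Gaussian latent-input penalty. The linear-logistic model and all names here are placeholders standing in for the actual network, not the paper's implementation.

```python
import numpy as np

def latent_classification_loss(w, b, x_lat, x_obs, labels, sigma_x):
    """Binary cross-entropy on latent inputs plus latent-input likelihood.

    A linear-logistic model p = sigmoid(x_lat @ w + b) stands in for the
    network; x_lat would be optimized jointly with (w, b) during training.
    """
    p = 1.0 / (1.0 + np.exp(-(x_lat @ w + b)))
    eps = 1e-12                                  # guard against log(0)
    ce = -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    # Gaussian penalty keeping latents near the observed features
    latent_nll = np.mean(np.sum((x_lat - x_obs) ** 2 / sigma_x**2, axis=1))
    return ce + latent_nll

# When x_lat == x_obs the penalty vanishes and only cross-entropy remains
x = np.zeros((4, 2))
loss = latent_classification_loss(np.zeros(2), 0.0, x, x, np.array([0, 1, 0, 1]), 1.0)
print(loss)  # log(2) ≈ 0.693, since p = 0.5 for every sample
```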

Architecture extension

The same idea can be extended to CNNs, Transformers, and other differentiable architectures, as long as the model can optimize with respect to the latent input representation.

Limitations

The method is promising, but it shifts difficulty into uncertainty estimates and parameter count.

Main limitations

  1. σy and σx must be specified for proper correction.
  2. The number of parameters increases substantially.
  • Hyperparameter tuning is needed, especially for the weight decay parameter.
  • The paper reports that results are not highly sensitive to the weight decay value.
  • For N = 50K and p ≈ 7000 informative pixels, LatentNN introduces about 3.5 × 10⁸ latent parameters.
  • That is roughly 1.4 GB in float32; for comparison, a JUNO RTX 6000 Pro GPU has 96 GB of memory.
Multivariate LatentNN loss equation
Multivariate LatentNN loss, showing the prediction term and latent-input likelihood term.
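The latent-parameter count quoted in the limitations list follows directly from N × p; a quick check, using the example sizes from above (not fixed requirements):

```python
n_samples = 50_000      # N training samples
p_features = 7_000      # p informative input dimensions (pixels)

latent_params = n_samples * p_features    # one latent per sample per feature
gb_fp32 = latent_params * 4 / 1e9         # 4 bytes per float32 value

print(f"{latent_params:.1e} latent parameters, {gb_fp32:.1f} GB in float32")
```

This gives 3.5 × 10⁸ latent parameters, about 1.4 GB in float32, which is why the counts remain manageable on a 96 GB GPU.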

Additional paper constraints

The paper emphasizes that correction is most reliable when the input noise is not overwhelming: attenuation can be important for SNRx ≲ 10, while LatentNN correction is most reliable around SNRx ≳ 2. For very high-dimensional inputs, restricting to informative features is practical; dimensionality reduction is less straightforward because the noise model in the reduced space may be unclear.

Discussion

For JUNO-like tasks, the key proposal is to add latent variables to input features.

A direct path toward robust machine learning is to explicitly model input-feature uncertainty.

Proposed modification

  1. Add latent variables for each PMT feature and each PMT.
  2. Use a total loss that combines target prediction and input-feature latent likelihood:
Losstotal = (y − f(xlatent))² / σy² + (xobs − xlatent)² / σx²
  3. For data, first reconstruct latent features, then perform PID or reconstruction.

Key points for the next study

  • Simplify the inputs to reduce the input dimension.
  • Determine appropriate σx (and σy) values for each feature.
  • Choose the first validation target: reconstruction, PID, or NvDEx signal/background classification.
Distribution of PMT total charge in data and MC
Total-charge distributions provide a concrete feature-level view of the data-MC mismatch that motivates the latent-input proposal.