Paper Reading and Journal Club Discussion
Why ML models underestimate extreme values, and how LatentNN may help neutrino experiments
A journal club report on how noisy input features can systematically compress machine-learning predictions, with a focus on possible applications to neutrino experiments.
Motivation
The practical problem is not just accuracy. It is generalization from MC to real data.
For neutrino experiments, labeled real data are unavailable, while rare signals make purely data-driven supervision even harder.
Data-MC generalization gap
- MC simulations are the primary source of labeled training samples.
- A persistent data-MC discrepancy means models trained on MC may fail to generalize to real data.
- True labels never exist in real data.
- Signal events can be quite rare in neutrino experiments such as JUNO and NvDEx.
Why this journal club paper is relevant
The LatentNN paper is from astronomy, but it addresses a core statistical failure mode that can also matter for detector ML: noisy or mismodeled inputs can systematically compress predictions toward the mean.
JUNO Motivation
JUNO is an example of why input-level data-MC mismatch matters.
The atmospheric-neutrino task is used here as an example: PID and reconstruction depend on PMT waveform features, while training still relies on MC labels.
- ML is necessary: atmospheric-neutrino event reconstruction and neutrino/anti-neutrino identification are not possible with traditional methods.
- The model input is detector response: PMT waveform features carry the event information.
- The supervision comes from MC: the model is trained on simulated labeled events.
- The risk is distribution mismatch: real PMT waveforms and MC PMT waveforms differ.
However
Data and MC waveforms are different, so a model trained only on MC can be biased when applied to real data.
Existing Methods
Common domain-generalization approaches do not cleanly solve these tasks.
Several common strategies are limited for JUNO atmospheric-neutrino tasks and the NvDEx signal-background problem.
Domain-adversarial training / domain adaptation
Mechanism: align MC and data feature distributions through a discriminator.
Limitation: requires a large amount of real data.
Self-supervised learning
Mechanism: pre-train on unlabeled samples, then fine-tune on a small labeled dataset or MC.
Limitation: real-data labels are impossible; MC fine-tuning can reintroduce bias.
Unsupervised learning
Mechanism: iteratively learn patterns from unlabeled real data.
Limitation: requires a large amount of real data.
Simpler models such as BDTs
Mechanism: use hand-crafted low-level features and a simpler, more robust model class.
Limitation: performance is limited.
Takeaway
These limitations motivate looking for a different approach: not just aligning distributions, but explicitly modeling uncertainty in the inputs.
Bridge to the Paper
Attenuation bias is a systematic compression of predictions toward the mean.
The astronomy setting is different, but the statistical structure is useful: training data contain measurement errors, and the model underestimates extremes.
Main paper claims
- Astronomical observations often have signal-to-noise ratios of order 10.
- When training data contain measurement errors, models systematically compress predictions toward the mean.
- High values are under-predicted.
- Low values are over-predicted.
- Models therefore fail to accurately predict extreme values.
LatentNN in one sentence
LatentNN treats unknown true input values as learnable latent variables and optimizes them together with the neural network parameters.
Mechanism
A simple regression example explains why the learned slope becomes too shallow.
The issue arises when the input x carries measurement error while the target y depends on the unknown true input, not on the noisy observation.
The true relationship is:
ytrue = f(xtrue)
But x has measurement error, and the errors in x do not affect y, so the data are stretched horizontally without a corresponding vertical stretch.
The training data look wider horizontally than the true signal. Any regression method treating xobs as exact will fit a slope that is systematically shallower than the true relationship.
The naive ML approach trains on observed pairs (xobs, yobs) as if xobs were the true value. Errors in x systematically weaken the relationship.
Attenuation Factor
For a linear relationship, the bias can be written explicitly.
The attenuation factor is the clean mathematical reason the learned slope is smaller than the true slope.
Linear case
For y = βx, the ordinary least squares estimator has expected value:
E[β̂] = β · λβ, where λβ = 1 / (1 + (σx / σrange)²)
Here σrange is the intrinsic statistical spread of the signal over its range, σx is the input measurement uncertainty, and SNR = σrange / σx.
SNR = 10
λβ ≈ 0.99, roughly 1% bias.
SNR ≈ 3
λβ ≈ 0.90, roughly 10% bias.
SNR = 1
λβ = 0.50, a large bias.
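The three regimes above can be checked numerically. The following is a minimal sketch (my own, not from the paper): simulate y = βx with unit intrinsic spread, add Gaussian input noise at each SNR, and compare the fitted OLS slope against β · λβ.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n = 2.0, 100_000
x_true = rng.normal(0.0, 1.0, n)       # intrinsic spread sigma_range = 1
y = beta * x_true                      # keep the target noiseless for clarity

results = {}
for snr in (1, 3, 10):
    sigma_x = 1.0 / snr                # SNR = sigma_range / sigma_x
    x_obs = x_true + rng.normal(0.0, sigma_x, n)
    # OLS slope of y on the noisy input
    slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
    lam = 1.0 / (1.0 + sigma_x**2)     # attenuation factor with sigma_range = 1
    results[snr] = slope
    print(f"SNR={snr:2d}: fitted slope {slope:.3f} vs beta*lambda = {beta * lam:.3f}")
```

The fitted slopes track β · λβ closely: roughly 1% shrinkage at SNR = 10, 10% at SNR = 3, and 50% at SNR = 1.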
Neural Networks
MLPs show the same attenuation behavior as linear regression.
The paper tests the effect with a simple neural network and finds that flexibility alone does not remove the systematic compression.
Key points
- Neural networks introduce nonlinear models and high-dimensional inputs.
- The paper tests the behavior with an MLP, the simplest neural-network architecture.
- Predictions are systematically compressed toward the mean.
- Model complexity does not resolve the issue.
- Regularization does not resolve the issue.
LatentNN
The proposed fix is to learn the hidden true inputs, not just the mapping.
LatentNN adds one learnable latent input per sample and per input dimension, initialized at the observed value.
Loss structure
Losstotal = (prediction loss weighted by 1 / σy²) + (latent likelihood weighted by 1 / σx²)
- σy and σx must be specified for proper correction.
- Latent values are initialized at the observed values.
- For N samples and p input dimensions, the optimizer handles network parameters θ plus N × p latent variables.
- The total parameter count is much larger than in standard neural network training.
1. Start with measured features xobs.
2. Create learnable xlatent initialized at xobs.
3. Optimize xlatent together with the network parameters.
4. Use the latent features for regression, PID, or classification.
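The recipe above can be sketched on the one-dimensional linear toy problem. This is a sketch of the idea, not the paper's implementation: a linear "network" replaces the MLP so that each update step has a closed form (for a linear model, the joint optimum over the slope and the per-sample latents coincides with errors-in-variables, i.e. Deming, regression), but the loss has exactly the two-term structure above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 2000, 2.0
sigma_x, sigma_y = 1.0, 0.5            # assumed known, as LatentNN requires

x_true = rng.normal(0.0, 1.0, n)
x_obs = x_true + rng.normal(0.0, sigma_x, n)
y = beta * x_true + rng.normal(0.0, sigma_y, n)

# Naive fit: treat x_obs as exact -> attenuated slope (about beta/2 here)
w_naive = (x_obs @ y) / (x_obs @ x_obs)

# LatentNN-style fit: one learnable latent input z_i per sample, init at x_obs.
# Loss = sum((w*z - y)^2) / sigma_y^2 + sum((z - x_obs)^2) / sigma_x^2
w, z = w_naive, x_obs.copy()
for _ in range(500):
    # exact coordinate updates (gradient steps on an MLP play this role in the paper)
    z = (w * y * sigma_x**2 + x_obs * sigma_y**2) / (w**2 * sigma_x**2 + sigma_y**2)
    w = (z @ y) / (z @ z)

print(f"naive slope:  {w_naive:.2f}")   # compressed toward zero
print(f"latent slope: {w:.2f}")         # much closer to the true beta
```

In the full method the closed-form updates are replaced by gradient steps on a network fθ together with the N × p latent inputs.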
Experiments: One-Dimensional Case
LatentNN removes most attenuation in the simplest noisy-input regression test.
The key comparison is between a standard MLP and LatentNN at low input SNR, followed by a sweep over SNR values.
Interpretation
At SNR = 1, treating x as exact makes the MLP learn a compressed relationship. LatentNN uses the known input uncertainty to move the latent inputs toward a relationship that is closer to the true function.
Experiments: Multivariate Inputs
Redundant correlated inputs help LatentNN identify the hidden true features.
The paper tests p = 3, 10, and 30 input dimensions across SNR values from 1 to 10.
Multivariate interpretation
Increasing p generally helps because correlated features can support each other. The network can learn their covariance structure and use redundancy for more robust inference.
However, p = 3 can be harder than p = 1: the posterior over the latent values becomes more complex, while three features may not yet provide enough redundancy to exploit.
For detector ML, this points to a practical design principle: keep informative correlated features, but avoid feeding every available dimension when the uncertainty model is not clear.
Stellar Spectra Example
The paper demonstrates LatentNN on synthetic spectra with metallicity labels.
This is the closest paper example to detector feature vectors: many input measurements, informative subsets, and measurement uncertainty.
Spectral setup
- The input vector x consists of flux measurements at different wavelength pixels.
- The output y is a stellar parameter, such as metallicity [M/H].
- The sample uses synthetic spectra for stars with varying [M/H] values.
Relevance for neutrino ML
Like PMT features, spectra are high-dimensional input measurements. The paper suggests that selecting the informative dimensions keeps LatentNN tractable.
Model Generalization
The latent-variable formulation can be adapted to both regression and classification.
The same latent-input idea can be used for event reconstruction and for signal/background or PID tasks.
Regression tasks
For event reconstruction, use a continuous prediction loss while learning latent input features.
Classification tasks
For PID or signal/background identification, replace the prediction term with cross-entropy while keeping the latent likelihood term.
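As a hedged sketch of this swap (my own, not from the paper), the same joint optimization can be written for logistic regression: the prediction term becomes cross-entropy in the latent inputs, while the Gaussian latent-likelihood term still anchors each latent to its observed value. The toy setup and learning rates are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma_x = 1000, 1.0

x_true = rng.normal(0.0, 1.5, n)
x_obs = x_true + rng.normal(0.0, sigma_x, n)
labels = (x_true > 0).astype(float)   # class set by the TRUE input, not the noisy one

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# One learnable latent z_i per sample, initialized at the observation.
w, b, z = 0.1, 0.0, x_obs.copy()
lr = 0.05
for _ in range(1500):
    err = sigmoid(w * z + b) - labels          # d(cross-entropy)/d(logit)
    w -= lr * (err @ z) / n                    # classifier weight
    b -= lr * err.mean()                       # classifier bias
    # cross-entropy pull on z plus Gaussian latent-likelihood anchor to x_obs
    z -= lr * (err * w + (z - x_obs) / sigma_x**2)

acc = float(((sigmoid(w * x_obs + b) > 0.5) == (labels > 0.5)).mean())
print(f"accuracy on observed inputs: {acc:.2f}")
```

Only the prediction term changed; the latent anchor and the joint optimization are identical to the regression case.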
Architecture extension
The same idea can be extended to CNNs, Transformers, and other differentiable architectures, as long as the model can optimize with respect to the latent input representation.
Limitations
The method is promising, but it shifts difficulty into uncertainty estimates and parameter count.
Main limitations
- σy and σx must be specified for proper correction.
- The number of parameters increases substantially.
- Hyperparameter tuning is needed, especially for the weight-decay parameter, although the paper reports that results are not highly sensitive to its exact value.
- For N = 50K samples and p ≈ 7000 informative pixels, LatentNN introduces about 3.5 × 10⁸ latent parameters.
- That is roughly 1.4 GB in float32; for comparison, a JUNO RTX 6000 Pro GPU has 96 GB of memory.
Additional paper constraints
The paper emphasizes that correction is most reliable when the input noise is not overwhelming: attenuation can be important for SNRx ≲ 10, while LatentNN correction is most reliable around SNRx ≳ 2. For very high-dimensional inputs, restricting to informative features is practical; dimensionality reduction is less straightforward because the noise model in the reduced space may be unclear.
Discussion
For JUNO-like tasks, the key proposal is to add latent variables to input features.
A direct path toward robust machine learning is to explicitly model input-feature uncertainty.
Proposed modification
- Add latent variables for each PMT feature and each PMT.
- Use a total loss that combines the target-prediction term and the input-feature latent-likelihood term, each weighted by its uncertainty.
- For data, first reconstruct latent features, then perform PID or reconstruction.
Key points for the next study
- Simplify the input to reduce input dimension.
- Determine a good sigma value for each feature.
- Choose the first validation target: reconstruction, PID, or NvDEx signal/background classification.