How Headline Testing works
Parse.ly’s Headline Testing uses a real-time optimization technique based on a machine learning method called multi-armed bandit testing. Specifically, it uses the Thompson sampling algorithm, which dynamically allocates more traffic to better-performing variants over time. This means that if one headline performs well early on, more users will see it, while poorly performing variants are shown less frequently.
You can create tests with up to 10 headline variants per piece of content. One of these is the original (the control), and the others are alternatives to evaluate. As users visit the page, the tool automatically displays different variants to different visitors and collects data on how often each one is seen (impressions) and how often each is clicked. Based on this data, the algorithm adjusts which headlines are prioritized, sending more traffic to those that perform well.
Not every headline test ends with a statistically confident winner, but that doesn’t mean the test failed. Valuable insights can still emerge, especially from the help chance metric, which helps you identify headlines that are likely better than your control even when the system isn’t confident enough to declare a single clear winner.
This approach stands in contrast to traditional A/B testing, where traffic is split evenly between all variants for the entire duration of the test. In a bandit-based system like Parse.ly’s, the testing process begins with equal distribution, but quickly shifts traffic toward better-performing options as more data becomes available.
Details on the Headline Testing process
- First, a separate JavaScript file must be installed on your website. This script is intentionally lightweight and kept apart from the main Parse.ly script so that it loads very quickly, before any page content is displayed.
- The script comes preloaded with the list of active headline experiments. The system attempts to locate the control headline attached to each article link on the page.
- If the control headline is found, the script replaces it with the assigned variant headline for that specific visitor. A local storage key (vipexp-local-state) holds which experiment and variant was shown to a user.
- The first time the variant headline is displayed to a visitor, it is counted as an impression.
- If the visitor then clicks the headline, that first click is recorded as a success for the variant.
- The selected headline variant is stored in the visitor’s cookies. This ensures consistency so that the same visitor will continue to see the same headline variant on subsequent visits.
- The system uses a multi-armed bandit approach, specifically Thompson sampling with a beta distribution, to determine which headline should be shown more frequently. For each variant, the algorithm models the probability of success based on observed data (see the sketch after this list):
  - Alpha (α) represents the number of unique clicks on a headline variant under test.
  - Beta (β) represents the number of visitors the headline has been displayed to (impressions) without clicks.
- Over time, the algorithm dynamically adjusts the probability of showing each variant, favoring headlines that perform better while exploring other options to continuously learn.
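To make the selection step concrete, here is a minimal TypeScript sketch of Thompson sampling over the alpha/beta counts described above. It is illustrative only, not Parse.ly's actual script: the Variant shape and function names are invented, and the +1 terms assume a uniform Beta(1, 1) prior, which the article does not specify.

```typescript
// Illustrative sketch only; not Parse.ly's production code.
interface Variant {
  headline: string;
  clicks: number;       // alpha: unique clicks on this variant
  noClickViews: number; // beta: impressions that did not lead to a click
}

// Gamma(k, 1) for a positive integer k, drawn as a sum of k Exp(1) samples.
function gammaInt(k: number): number {
  let sum = 0;
  for (let i = 0; i < k; i++) sum -= Math.log(1 - Math.random());
  return sum;
}

// Beta(a, b) for positive integers a and b, as a ratio of Gamma draws.
function betaInt(a: number, b: number): number {
  const x = gammaInt(a);
  const y = gammaInt(b);
  return x / (x + y);
}

// Thompson sampling: draw a plausible click-through rate for each variant
// from Beta(alpha + 1, beta + 1) and show the variant with the highest draw.
// The +1 terms are an assumed uniform prior so untested variants still get traffic.
function pickVariant(variants: Variant[]): Variant {
  let best = variants[0];
  let bestDraw = -Infinity;
  for (const v of variants) {
    const draw = betaInt(v.clicks + 1, v.noClickViews + 1);
    if (draw > bestDraw) {
      bestDraw = draw;
      best = v;
    }
  }
  return best;
}
```

Early in a test, every variant's Beta distribution is wide, so the draws (and therefore the traffic) are spread roughly evenly; as clicks and impressions accumulate, the distribution for a strong variant concentrates around a higher click-through rate and it wins more of the draws.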
This entire process happens automatically and quickly, enabling real-time optimization without requiring any manual analysis or adjustments.
A/B testing vs. multi-armed bandit (MAB)
Parse.ly’s Headline Testing uses a multi-armed bandit (MAB) algorithm rather than traditional A/B testing. Here’s how the two approaches differ, and why MAB is better suited for headline optimization:
| Aspect | A/B testing | Multi-armed bandit (MAB) |
|---|---|---|
| Traffic allocation | Fixed (e.g., 50/50) | Dynamic based on variant performance |
| Optimization timing | After the test concludes | Ongoing during the test |
| Handling poor variants | Continues showing all variants equally | Gradually reduces traffic to underperforming variants |
Multi-armed bandit algorithms do more than just test headline variants; they optimize traffic distribution and adapt in real time.
As the system collects performance data, it dynamically adjusts how traffic is allocated, sending more visitors to the better-performing headlines as soon as there is enough evidence to support the shift. This allows the best headline to reach a larger share of the audience more quickly, while still giving weaker variants just enough exposure for the algorithm to keep learning.
The multi-armed bandit approach is especially useful in environments where content has a short lifecycle or where fast optimization is required.
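To make the contrast with a fixed split concrete, the short simulation below shows how a Thompson-sampling bandit shifts impressions toward the stronger headline as evidence accumulates. It is a sketch with assumed click-through rates of 4% and 5%, not real Parse.ly data, and it repeats the small integer Beta sampler from the earlier sketch so that it runs on its own.

```typescript
// Simulation sketch only: assumed click-through rates, not real data.
function gammaInt(k: number): number {
  // Gamma(k, 1) for a positive integer k, as a sum of k Exp(1) draws.
  let sum = 0;
  for (let i = 0; i < k; i++) sum -= Math.log(1 - Math.random());
  return sum;
}

function betaInt(a: number, b: number): number {
  // Beta(a, b) for positive integers a and b, as a ratio of Gamma draws.
  const x = gammaInt(a);
  const y = gammaInt(b);
  return x / (x + y);
}

const trueCtr = [0.04, 0.05]; // assumed true rates: control 4%, variant 5%
const clicks = [0, 0];
const impressions = [0, 0];

for (let visit = 1; visit <= 5000; visit++) {
  // Draw a plausible CTR for each headline and show the one with the higher draw.
  const draws = trueCtr.map((_, i) =>
    betaInt(clicks[i] + 1, impressions[i] - clicks[i] + 1)
  );
  const shown = draws[1] > draws[0] ? 1 : 0;
  impressions[shown] += 1;
  if (Math.random() < trueCtr[shown]) clicks[shown] += 1;

  // Report the share of impressions each headline has received so far.
  if (visit % 1000 === 0) {
    const share = impressions.map((n) => ((100 * n) / visit).toFixed(1));
    console.log(`After ${visit} visits: control ${share[0]}%, variant ${share[1]}%`);
  }
}
```

A traditional A/B test would hold both shares at roughly 50% for the entire run; in the simulation, the variant's share of impressions grows as its posterior pulls ahead, although with rates this close together the shift is gradual.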
Exporting Results
You can now export the results of finished headline tests for further analysis or reporting.
How to Export:
- Go to any finished headline test.
- Click the “Export” button in the top-right corner of the Evolution chart.
- Choose your preferred format:
  - CSV for spreadsheet-friendly output
  - JSON for structured data and automation workflows
These exports include the available test metrics, such as CTR, views, and variant performance.
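As a rough illustration of working with an export, the sketch below loads a JSON export and computes each variant's CTR and its improvement over the control in percentage points. The file name and field names (headline, clicks, impressions) are hypothetical placeholders, and the control is assumed to be the first row; the real export schema may differ.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical shape; check your actual export, since field names may differ.
interface ExportedVariant {
  headline: string;
  clicks: number;
  impressions: number;
}

const rows: ExportedVariant[] = JSON.parse(
  readFileSync("headline-test-export.json", "utf8")
);

// Click-through rate for a single variant.
const ctr = (v: ExportedVariant) =>
  v.impressions > 0 ? v.clicks / v.impressions : 0;

// Assume the first row is the control headline.
const controlCtr = ctr(rows[0]);

for (const row of rows) {
  const improvementPp = (ctr(row) - controlCtr) * 100;
  console.log(
    `${row.headline}: CTR ${(ctr(row) * 100).toFixed(2)}%, ` +
      `${improvementPp.toFixed(2)} pp vs. control`
  );
}
```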
Our Algorithms and Confidence Grades
Parse.ly uses Bayesian modeling to estimate the future performance of each headline variant. These estimates are translated into confidence grades, which help you make data-driven decisions more easily.
Each headline variant receives a win chance — the probability that it will continue to perform best in the future — and a help chance — the probability that it will perform better than the control.
Confidence Grades Explained (see the example after this list):
- High: Win chance of 95% or more. Strong result; safe to implement.
- Fair: Win chance between 70% and 95%. Worth considering with editorial judgment.
- Low: Win chance between 55% and 70%. Use cautiously; help chance becomes relevant.
- Insufficient: Win chance below 55%. Results are statistically inconclusive.
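For illustration, the thresholds above can be restated as a small mapping function; the function itself is just a rephrasing of the published grade boundaries, not part of Parse.ly's product.

```typescript
type ConfidenceGrade = "High" | "Fair" | "Low" | "Insufficient";

// Map a win chance (expressed as a fraction between 0 and 1) to the
// confidence grades listed above.
function confidenceGrade(winChance: number): ConfidenceGrade {
  if (winChance >= 0.95) return "High";
  if (winChance >= 0.7) return "Fair";
  if (winChance >= 0.55) return "Low";
  return "Insufficient";
}

console.log(confidenceGrade(0.97)); // "High": safe to implement
console.log(confidenceGrade(0.62)); // "Low": also look at the help chance
```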
Outcomes of a Test:
- Winner: Identified with sufficient confidence.
- No clear winner: Results inconclusive, but data may still be useful.
- Not enough data: Traffic volume too low to evaluate properly.
To improve inconclusive tests, consider running them longer or reducing the number of variants.
Results Dashboard Enhancements
Parse.ly has also improved the results dashboard with clearer metrics and helpful visuals:
- Winner badge: Clearly marks the top headline and its confidence grade.
- Popover insights: Detailed breakdowns of win chance and help chance.
- Performance metrics:
  - Confidence grade
  - Total clicks
  - Total impressions
  - Click-through rate (CTR)
  - Improvement over control (CTR difference as percentage points)
- Evolution chart: Visualizes how each variant performed over time.
Understanding “pp” (Percentage Points)
The improvement metric in Headline Testing is expressed in percentage points (pp), not as a relative percentage. This distinction matters:
- Percent expresses a relative change.
- Percentage point (pp) expresses an absolute difference between two percentages.
For example, if the control headline has a CTR of 4% and a variant has a CTR of 5%, the improvement is 1 percentage point (pp), translating to a 25% relative increase. Using percentage points helps avoid confusion and keeps comparisons clear and accurate.
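In code form, the same arithmetic looks like this (the variable names are illustrative):

```typescript
const controlCtr = 0.04; // control headline: 4% CTR
const variantCtr = 0.05; // variant headline: 5% CTR

// Absolute difference between the two rates, in percentage points (pp).
const improvementPp = (variantCtr - controlCtr) * 100; // 1 pp

// Relative change of the variant over the control, in percent.
const relativeLiftPct = ((variantCtr - controlCtr) / controlCtr) * 100; // 25%

console.log(`${improvementPp.toFixed(0)} pp absolute, ${relativeLiftPct.toFixed(0)}% relative`);
```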
Understanding the difference ensures better decision-making when interpreting small changes in click-through rates.
Best Practices
- Use your most preferred headline as the control.
- Focus traffic on fewer variants and longer experiments.
- Let win chance guide obvious choices; use help chance for nuanced decisions.
- Weigh CTR improvement against editorial quality — a minor gain may not warrant change.
Headline Testing is meant to support editorial judgment, not override it. These new features make it easier to blend data and intuition, driving better decisions.
Last updated: November 06, 2025