The 81st Annual AAPOR Conference brought together researchers from across the industry to learn from one another, strengthen connections, and explore the role of polling and survey research in informed decision-making. The SSRS team has gathered key takeaways from select sessions to continue the conversation beyond the event.
Hybrid samples that blend probability-based and opt-in sample sources can help correct for the selection biases known to be present in opt-in samples. The SSRS Encipher® Hybrid product accomplishes this via advanced calibration techniques.
However, in recent years, there has been increasing concern over the contamination of opt-in samples by “bogus respondents”, particularly bots and click farms that produce fraudulent data at scale. Unlike selection bias, this problem cannot be corrected by calibration; it is necessary to either prevent fraudulent respondents from entering the survey on the front end, or identify and remove them on the back end.
In this presentation, we shared a new back-end methodology that leverages hybrid samples to help mitigate the problem of fraudulent opt-in data. The methodology relies on isolation forests, an unsupervised machine learning algorithm. Isolation forests can identify observations whose response patterns across all available survey questions differ sharply from the norm and that therefore may not have been generated by legitimate respondents.
The critical aspect of this methodology is that the isolation forest is trained on the probabilistic portion of a hybrid sample. This is because probability samples, including those from probability panels like the SSRS Opinion Panel, are largely “secure by design” from large-scale fraud. Accordingly, having a parallel probability sample can help us to identify response patterns that are highly unlikely to come from legitimate respondents, without discarding responses that are rare but legitimate.
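To make the idea concrete, here is a minimal sketch of an isolation forest that is fit on a probability sample and then used to score opt-in respondents. It is an illustration only, not the SSRS implementation: the four 0-10 survey items, the simulated respondents, the "bogus" answer pattern, and the 95th-percentile cutoff are all assumptions made for the example, and a production analysis would more likely use an established library implementation.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Split rows on a random non-constant question at a random cut point."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    varying = [j for j in range(len(rows[0]))
               if min(r[j] for r in rows) != max(r[j] for r in rows)]
    if not varying:
        return {"size": len(rows)}
    j = rng.choice(varying)
    cut = rng.uniform(min(r[j] for r in rows), max(r[j] for r in rows))
    left = [r for r in rows if r[j] < cut]
    right = [r for r in rows if r[j] >= cut]
    return {"feature": j, "cut": cut,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(row, node, depth=0):
    """Number of splits needed to isolate row, adjusted for unsplit leaf size."""
    if "feature" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["feature"]] < node["cut"] else node["right"]
    return path_length(row, child, depth + 1)

def fit_forest(rows, n_trees=100, sample_size=64, seed=0):
    """Grow n_trees isolation trees, each on a random subsample of rows."""
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(rows, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size

def anomaly_score(row, forest):
    """Score in (0, 1); higher means the row is isolated in fewer splits."""
    trees, size = forest
    mean_path = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(size))

def simulate_respondent(rng):
    """Hypothetical legitimate respondent: four 0-10 answers clustered mid-scale."""
    return [min(10, max(0, round(rng.gauss(5, 1.5)))) for _ in range(4)]

rng = random.Random(42)
probability_sample = [simulate_respondent(rng) for _ in range(300)]  # simulated
optin_sample = [simulate_respondent(rng) for _ in range(50)]         # simulated
optin_sample.append([0, 10, 0, 10])  # hypothetical bogus response pattern

# Key step: the forest is fit on the probability sample only.
forest = fit_forest(probability_sample)

# Assumed cutoff: the 95th percentile of the probability sample's own scores.
reference_scores = sorted(anomaly_score(r, forest) for r in probability_sample)
threshold = reference_scores[int(0.95 * len(reference_scores))]
flagged = [r for r in optin_sample if anomaly_score(r, forest) > threshold]
print(f"Flagged {len(flagged)} of {len(optin_sample)} opt-in cases for review")
```

Because the trees are grown only on the probability sample, a response pattern scores as anomalous relative to what legitimate respondents look like, rather than relative to an opt-in sample that may itself contain many fraudulent cases.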
In this presentation, we reported on a pilot study that blended a probabilistic SSRS Opinion Panel sample with samples from several opt-in vendors.
- Applying an isolation forest trained on the SSRS Opinion Panel substantially reduced failure rates on various data quality checks in the opt-in samples.
- Equally important, the isolation forest reduced the variation in data quality failure rates between the individual opt-in sample sources.
- Crucially, this was not the case when the isolation forest was trained on the opt-in sample itself, highlighting the importance of a parallel probability sample to this methodology.
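The two outcome measures in the findings above, the failure rate within each opt-in source and the variation in that rate between sources, can be computed along the following lines. This is a sketch under stated assumptions: the record fields (`source`, `failed_check`, `flagged`), the vendor names, and every value in the toy data are invented for illustration and are not results from the pilot study. The range between the highest and lowest source is used here as one simple measure of variation.

```python
from collections import defaultdict

def failure_rates(records):
    """Share of respondents in each source that failed a data quality check."""
    totals, fails = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["source"]] += 1
        fails[rec["source"]] += rec["failed_check"]  # True counts as 1
    return {source: fails[source] / totals[source] for source in totals}

def spread(rates):
    """Range of failure rates across sources (one simple measure of variation)."""
    return max(rates.values()) - min(rates.values())

def make_records(source, rows):
    """Build toy records from (failed_check, flagged) pairs."""
    return [{"source": source, "failed_check": failed, "flagged": flagged}
            for failed, flagged in rows]

# Invented toy data: NOT results from the pilot study.
records = (
    make_records("vendor_a", [(True, True), (True, True), (True, False),
                              (False, False), (False, False)])
    + make_records("vendor_b", [(True, True), (False, False), (False, False),
                                (False, False), (False, True)])
)

before = failure_rates(records)
after = failure_rates([r for r in records if not r["flagged"]])

print("Before removal:", before, "spread:", round(spread(before), 3))
print("After removal: ", after, "spread:", round(spread(after), 3))
```

Running the same comparison twice, once with flags from a forest trained on the probability sample and once with flags from a forest trained on the opt-in sample itself, is one way to set up the contrast described in the last bullet.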
Additional information about the isolation forest methodology can be found here. Based on these findings, we have incorporated this methodology into the Encipher® Hybrid solution, which allows us to leverage the strengths of probability sampling to control known data quality risks in opt-in samples.