A Way Out of the Ethico-Legal Maze of Social Media Data Scraping

People generally assume that once they publish something online, the content becomes “publicly available” and can be downloaded and re-used by others, such as researchers and data scientists. The reality is far more complicated. For us, finding a way to comply with data protection obligations and to respect the tenets of research ethics became an exploration of largely uncharted territory.

Within DECEPTICON, we are gathering examples of the many manipulative designs that populate online services (i.e., dark patterns) and that are publicly condemned on Twitter and Reddit. Our aim is to build a labelled dataset of such pervasive practices from crowdsourced knowledge and, possibly, to develop supervised machine learning models that flag dark patterns at scale.

Initially, we were convinced that we only needed to address a few data protection concerns, which seemed entirely feasible. However, we found that there is a plethora of legal obligations to comply with, as well as additional research ethics principles to consider. Finding creative answers to these issues was a long and tiresome, albeit formative, experience that we briefly share in these pages, in the conviction that it can help other academic and industrial researchers who collect and analyse internet data…