DELab and the Next Generation Internet: Identifying and mapping key tech topics using text-mining methods

By Dr Kristóf Gyódi, DELab, University of Warsaw

Our research team at the Digital Economy Lab at the University of Warsaw has been contributing to the European Commission’s Next Generation Internet (NGI) initiative by participating in two consecutive NGI projects: NGI Engineroom (2017-2019) and NGI Forward (2019-2021). This has been quite a journey, with many exciting results and project outputs that we hope will remain useful and relevant. Let us briefly summarise what we have been up to!

Our main aim has been to support the NGI initiative by providing data science tools to map and analyse the developments of the tech world. More precisely, we have focused on three goals:

  • To develop text-mining methodologies to extract insights on issues relevant to NGI
  • To prepare case studies highlighting key conclusions from the data-driven research
  • To publish the results in forms facilitating further use and research

The main parts of our work are summarised in the graph below, which provides an overview of the text-mining methods, the types of data sources and the forms of dissemination.

The scheme of our output

During NGI Engineroom and the first half of NGI Forward, we concentrated on identifying emerging issues that gain importance over time: we refer to this methodology as trend analysis. We compiled a dataset of news articles from 14 major English-language technology websites from the US, EU and Australia, covering January 2016 to April 2021. We examined changes in these documents to highlight technologies and social and regulatory issues gaining relevance over time. Next, we dived deeper into selected topics with a combination of methods, establishing, for example, which terms are frequently mentioned together or whether the coverage of a topic is mostly positive or negative. We repeated this analysis multiple times over the years, summarising the results with interactive visualisations (e.g., check the results in 2019). During the first wave of the COVID-19 pandemic, we also applied this methodology to keep track of technology solutions and discussions on fighting the coronavirus (head here for the presentation).
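To give a feel for the idea behind trend analysis, here is a minimal Python sketch. It is not the code used in the projects: the toy headlines, the function names and the choice of measure (the change in the share of documents mentioning a term between an early and a late period) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Toy corpus of (year, headline) pairs. Invented for illustration only,
# not taken from the project dataset.
ARTICLES = [
    (2016, "New smartphone launch brings faster chips"),
    (2016, "Smartphone sales grow while cloud services expand"),
    (2017, "Cloud providers cut prices for startups"),
    (2018, "Privacy regulation reshapes cloud markets"),
    (2019, "Regulators target facial recognition over privacy abuses"),
    (2019, "Facial recognition bans spread amid privacy concerns"),
]


def tokenize(text):
    """Lowercase a text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def document_share(articles, years):
    """Share of documents published in `years` that mention each term."""
    docs = [tokenize(text) for year, text in articles if year in years]
    counts = Counter(term for doc in docs for term in doc)
    return {term: n / len(docs) for term, n in counts.items()}


def share_change(articles, early_years, late_years):
    """Late-period share minus early-period share, for every term seen."""
    early = document_share(articles, early_years)
    late = document_share(articles, late_years)
    terms = set(early) | set(late)
    return {t: late.get(t, 0.0) - early.get(t, 0.0) for t in terms}


def trending_terms(articles, early_years, late_years, top_n=3):
    """Terms with the largest increase in share, ties broken alphabetically."""
    change = share_change(articles, early_years, late_years)
    ranked = sorted(change.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


print(trending_terms(ARTICLES, {2016, 2017}, {2018, 2019}))
# → ['privacy', 'facial', 'recognition']
```

On a real corpus one would at least filter stop words and require a minimum number of mentions, so that rare words do not dominate the ranking.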

In the second half of NGI Forward, we shifted our attention from identifying trends to gaining a deeper understanding of selected issues with topic mapping. The main aim of topic mapping was to identify documents covering the same tech, policy and social issues, enabling further research. Based on various techniques (mostly t-SNE, an algorithm that can be used to represent high-dimensional data in two dimensions), we prepared a visual tool that organises articles in space and highlights the topics they cover (see the figure below). We also modified our data collection method: instead of collecting all articles published by a news portal, we obtained articles on particular topics that were shared on social media platforms (Twitter, Reddit and Hacker News). Moreover, we reached beyond English-language media and harvested documents in German, Polish, Portuguese and Spanish, enabling the analysis of not only various European regions, but Latin America as well. As a result, we were able to drastically increase the variety of sources and harvest insights from a wider pool of articles, blog posts and opinion pieces. Using the interactive visualisations of the collected documents, we prepared detailed case studies on key tech issues, such as the decline of Internet freedom or algorithmic bias, featuring both challenges and solutions. Based on the multi-language analysis, we also covered local perspectives on global challenges, such as combating the spread of fake news in Latin America or debates in Poland over the European copyright directive.
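The core step of topic mapping, placing documents on a two-dimensional map with t-SNE, can be sketched in a few lines with scikit-learn. This is only an illustration under stated assumptions: the post does not say how the documents were turned into vectors, so the TF-IDF step, the toy documents, the function name `map_documents` and the parameter values are our own choices for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy documents, invented for illustration only.
DOCS = [
    "Government orders internet shutdown during protests",
    "Censorship and internet shutdowns limit access to information",
    "New law expands state control over online information",
    "Facial recognition algorithm shows bias against minorities",
    "Researchers audit hiring algorithm for gender bias",
    "Biased training data leads to unfair algorithmic decisions",
    "Fact checkers fight fake news on messaging apps",
    "Platforms label fake news ahead of elections",
]


def map_documents(docs, perplexity=3.0, seed=0):
    """Return one (x, y) coordinate per document.

    Assumption: documents are represented as TF-IDF vectors before t-SNE.
    """
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    tsne = TSNE(
        n_components=2,         # two dimensions, so the result can be plotted
        perplexity=perplexity,  # must be smaller than the number of documents
        init="random",
        random_state=seed,      # fixed seed for a reproducible layout
    )
    # Convert the sparse TF-IDF matrix to a dense array before embedding.
    return tsne.fit_transform(tfidf.toarray())


coords = map_documents(DOCS)
for (x, y), doc in zip(coords, DOCS):
    print(f"{x:8.2f} {y:8.2f}  {doc}")
```

With only eight short texts the layout is not meaningful; the point is the shape of the pipeline. On a large corpus, documents about similar issues tend to end up close together, which is what makes the map useful for browsing.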

You can find all our resources collected here, including datasets, code, tutorials and reports. We believe the results demonstrate that these tools can usefully support research and the policy-making process. Do get in touch if you have suggestions on the further use of our methods and data!

Map of articles related to access to the Internet, control over information and ICT infrastructure, principles of social justice in the tech industry and the Internet’s ethical challenges