Mapping the tech world with text-mining: Part 1.

Introduction

As part of the NGI Forward project, DELab UW is supporting the European Commission’s Next Generation Internet initiative with identifying emerging technologies and social issues related to the Internet. Our team has been experimenting with various natural language processing methods to discover trends and hidden patterns in different types of online media. You may find our tools and presentations at https://fwd.delabapps.eu.

This series of blog posts presents the results of our latest analysis of technology news. We have two main goals:

  1. to discover the most important topics in news discussing emerging technologies and social issues,
  2. to map the relationship between these topics.

Our text mining exercises are based on a technology news data set consisting of 213 000 tech media articles. The data was collected over a period of 40 months (between 2016-01-01 and 2019-04-30) and includes the plain text of the articles. As the figure shows, the publishers are based in the US, the UK, Belgium and Australia. More information on the data set is available in a Zenodo repository.

In this first installment, we focus on a widely used text-mining method: Latent Dirichlet Allocation (LDA). LDA gained its popularity due to its ease of use, flexibility and interpretable results. First, we briefly explain the basics of the algorithm for all non-technical readers. In the second part of the post, we show what LDA can achieve with a sufficiently large data set.

LDA

Text data is high-dimensional. In its most basic – but comprehensible to computers – form, it is often represented as a bag-of-words (BOW) matrix, where each row represents a document and each column a word, and each entry counts how often that word occurs in that document. Linear algebra methods can then transform these matrices to uncover the hidden (latent and lower-dimensional) structure in them.

Topic modeling assumes that documents, such as news articles, contain various distinguishable topics. As an example, a news article covering the Cambridge Analytica scandal may contain the following topics: social media, politics and tech regulations, in the following proportions: 60% social media, 30% politics and 10% tech regulations. Another assumption is that each topic has a characteristic vocabulary, e.g. the social media topic is described by words such as Facebook, Twitter etc.

LDA was proposed by Blei et al. (2003) and is grounded in Bayesian statistics. The method’s name captures its key foundations. Latent comes from the assumption that documents contain latent topics that we do not know a priori. Allocation indicates that we allocate words to topics, and topics to documents. Dirichlet refers to the Dirichlet distribution, a distribution over probability vectors: as the conjugate prior of the multinomial distribution, it describes the joint distribution of the proportions of any number of outcomes. As an example, a Dirichlet distribution can describe the proportions of species observed on a safari (Downey, 2013). In LDA, it describes the distribution of topics in documents, and the distribution of words in topics.
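To make this concrete, here is a small sketch (with made-up parameters, not values from the study) of how draws from a Dirichlet distribution behave like per-document topic proportions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Symmetric Dirichlet prior over 3 hypothetical topics
# (say: social media, politics, tech regulations).
# alpha < 1 favours documents dominated by one or two topics.
alpha = np.array([0.5, 0.5, 0.5])

# Draw a topic mixture for each of 4 documents; every row sums to 1,
# so each draw is a valid probability vector over topics.
doc_topic = rng.dirichlet(alpha, size=4)
print(doc_topic.round(2))
```

A row such as `[0.6, 0.3, 0.1]` corresponds to the 60/30/10 article mixture described above.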

The basic mechanism behind topic modeling methods is simple: assuming that documents can be described by a limited number of topics, we try to recreate our texts from a combination of topics that consist of characteristic words. More precisely, we aim to recreate our BOW document-word matrix as the combination of two matrices: one containing the Dirichlet-distributed topics in documents (the topic-document matrix), and one containing the words in topics (the word-topic matrix). The final matrices are commonly constructed by a process called Gibbs sampling. The idea behind Gibbs sampling is to update the two matrices word by word: change the topic allocation of a selected word in a document, and evaluate whether this change improves the decomposition of the document. Repeating these steps over all documents yields the final matrices, which provide the best description of the sample.
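As an illustration only (a toy collapsed Gibbs sampler on a made-up word-id corpus; this is not the study's implementation, and production work would use a library such as gensim or MALLET), the resampling loop described above can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids from a vocabulary of size V
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 0], [4, 5, 3, 5]]
V, K = 6, 2              # vocabulary size, number of topics
alpha, beta = 0.1, 0.01  # symmetric Dirichlet priors

ndk = np.zeros((len(docs), K))  # topic counts per document
nkw = np.zeros((K, V))          # word counts per topic
nk = np.zeros(K)                # total words per topic
z = [[rng.integers(K) for _ in d] for d in docs]  # random initial topic assignments

for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):            # Gibbs sweeps over every word in every document
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]         # remove the word's current topic assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # full conditional: p(k) ∝ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k         # resample the topic and restore the counts
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Normalised count matrices: the topic-document and word-topic matrices
theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
print(theta.round(2))
```

After the sweeps, `theta` holds the per-document topic proportions and `phi` the per-topic word distributions, i.e. the two matrices whose product approximates the BOW matrix.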

For more details on topic modelling, we recommend these two excellent posts. For the full technical description of this study, head to our full report.

Results

The most important parameter of topic modelling is the number of topics. The main objective is to obtain a satisfactory level of topic separation, i.e. a situation in which topics are neither lumped together into catch-all issues nor fragmented into overly narrow ones. To achieve that, we experimented with different LDA hyperparameter settings. With 20 topics, the resulting topics were balanced and separable.

We therefore identified 20 major topics, presented in the visualisation below. Each circle represents a topic (its size reflects the topic’s prevalence in the documents), with distances determined by the similarity of vocabularies: topics sharing the same words are closer to each other. In the right panel, the bars represent the individual terms that are characteristic of the currently selected topic on the left. Each pair of overlapping bars represents both the corpus-wide frequency of a given term and its topic-specific frequency. We obtained gradually decreasing topic sizes: the largest topic has a share of 19%, the 5th 8%, and the 10th 5%.

After studying these most relevant terms, we labeled each topic with the closest umbrella term. Upon closer examination, we reduced the number of topics to 18 (topics 5 & 16 were merged into the joint category Space tech, while topics 10 & 19 were merged into a topic on Online streaming). In the following sections we provide brief descriptions of the identified topics.


AI & robots

AI & robots constitutes the largest topic, containing around 19% of all tokens, and is characterized by machine learning jargon (e.g. train, data) as well as popular ML applications (robots, autonomous cars).


Social media crisis

The social media topic is similarly prevalent and covers contentious aspects of modern social media platforms (facebook, twitter), such as the right to privacy, content moderation, user bans or election meddling with the use of microtargeting (i.a. privacy, ban, election, content, remove).


Business news

A large share of tech articles cover business news, especially on major platforms (uber, amazon), services such as cloud computing (aws), and emerging technologies such as IoT or blockchain. The topic words also suggest a strong focus on the financial results of tech companies (revenue, billion, sale, growth).


Smartphones

Topic 4 covers articles about the $522B smartphone market. The two major manufacturers, Samsung and Apple, top the keyword list with an equal number of appearances. Articles focus on the features, parameters and additional services provided by the devices (camera, display, alexa etc.).


Space

Space exploration excitement is common in the tech press. Topic 5 contains reports about NASA, future Mars and Moon missions, as well as companies working on space technologies, such as SpaceX.


Privacy

Topic 6 revolves around the Cambridge Analytica privacy scandal and gathers all mentions of this keyword in the corpus. The involvement of Cambridge Analytica in the Leave campaign during the Brexit referendum is a major focus, as suggested by the high position of keywords such as eu and uk. Unsurprisingly, GDPR is also often mentioned in articles dealing with the aftermath of the CA controversy.


Cybersecurity

Topic 7 pertains to cybersecurity issues. It explores malware and system vulnerabilities targeting both traditional computer systems and novel decentralized technologies based on blockchain.


5G

The much-anticipated fifth-generation wireless network has huge potential to transform all areas with an ICT component. Topic 8 deals with the global competition over delivering 5G tech to the market (huawei, ericsson). It also captures the debate about 5G’s impact on net neutrality: 5G’s key feature of enabling signal ‘segmentation’ has sparked debate over whether it can be treated like previous generations of mobile communications under net neutrality laws.


Cross platforms

The focus of Topic 9 is on operating systems, both mobile (ios, android) and desktop (windows, macos), as well as dedicated services such as browsers (chrome, mozilla) and app stores (appstore).



Media

Topic 10 revolves around the most important media platforms: streaming and social media. The global video streaming market was valued at around USD 37B in 2018; music streaming adds another 9B and accounts for nearly half of the music industry’s revenue. In particular, this topic focuses on major streaming platforms (youtube, netflix, spotify), social media (facebook, instagram, snapchat), the rising popularity of podcasts, and the business strategies of streaming services (subscriptions, ads).


Microsoft

During its 40-year history, Microsoft has made more than 200 acquisitions. Some of them were considered successful (e.g. LinkedIn, Skype), while others were less so… (Nokia). Topic 11 collects articles describing Microsoft’s completed, planned and failed acquisitions of recent years (github, skype, dropbox, slack).


Autonomous vehicles

Autonomous transportation is a vital point of public debate. Policy makers must consider whether to apply subsidies or taxes to equalize the public and private costs and benefits of this technology. AV technology offers the possibility of significant benefits to social welfare: saving lives; reducing crashes, congestion, fuel consumption, and pollution; increasing mobility for the disabled; and ultimately improving land use (RAND, 2016). Topic 12 addresses the technology’s shortcomings (batteries) as well as positive externalities such as lower emissions (epa, emissions).


Tesla

LDA modelling identified Tesla and other Elon Musk projects as a separate topic. Besides Tesla’s development of electric and autonomous vehicles, the topic also includes words related to other mobility solutions (e.g. Lime).


CPU and other hardware

Topics 14 & 15 focus on hardware. Topic 14 covers the CPU innovation race between Intel and AMD, as well as the Broadcom-Qualcomm acquisition saga, blocked by Donald Trump due to national security concerns. Topic 15 includes news regarding various standards (usb-c), storage devices (ssd) etc.


Startups

Topic 17 concentrates on startup ecosystems and crowdfunding. Articles discuss major startup competitions such as Startup Battlefield or Startup Alley, and crowdfunding services such as Patreon.


Wearables

We observe a surge in the adoption of wearables, such as fitness trackers, smartwatches, and augmented and virtual reality headsets. This trend raises important policy questions. On the one hand, wearables offer tremendous potential when it comes to monitoring health; on the other, this potential might be overshadowed by concerns about user privacy and access to personal data. Articles in topic 18 discuss news from the wearable devices world regarding new devices, novel features etc. (fitbit, heart rate).


Gaming

Topic 20 deals with the gaming industry. It covers, inter alia, popular games (pokemon), gaming platforms (nintendo), game consoles (switch) and game expos (e3).


Conclusions

We have provided a bird’s eye view of the technology world using topic modelling. Topic modelling serves as a solid basis for exploring broad topics, such as the social media crisis, AI or business news. At this stage, we were able to identify the major umbrella topics that ignite public debate.

In the next post, we will introduce another machine learning method: t-SNE. With the help of this algorithm, we will create a two-dimensional map of the news, where articles covering the same topic will be neighbours. We will also show how t-SNE can be combined with LDA.