
Predicting future trends from past novelty: Trend estimation on social media

Significant developments in internet technologies are being made across a wide range of fields – breakthroughs that will undoubtedly have a profound impact on society and be highly disruptive in nature. Given the importance of these technological developments, insights into emerging technology trends can be of great value for governments and policy-makers, enabling them to respond to the complex dilemmas that emerge around new technologies.

Today, trends can emerge from many different platforms. The growth of social media, for instance, enables citizens to generate and share information in unprecedented ways, and sociocultural trends on social media have thus become an important part of knowledge discovery. Social media can provide policy-makers with insights into which internet-related technologies and social issues are frequently discussed, and which trends are emerging in those discussions. Such insights can help identify new technologies and issues and inform appropriate policy or regulatory responses.

 

A new approach to social media trends

Accurate trend estimation on social media is, however, a matter of debate in the research community. Standard approaches often suffer from methodological issues: by focusing solely on spiky behavior, they equate trend detection with the detection of sudden events such as natural catastrophes and epidemics.

To remedy these issues, our team at DATALAB, Aarhus University in Denmark, developed a new approach to trend estimation that combines domain knowledge of social media with advances in information theory and dynamical systems. In particular, trend reservoirs – signals that display trend potential – are identified by the relationship between their novel and resonant behavior, and by their minimal persistence.

The model estimates Novelty as a reliable difference from the past (how much does the content diverge from what came before?) and Resonance as the degree to which future information conforms to the Novelty (to what degree does the novel content ‘stick’?). Using these calculations of Novelty and Resonance, trends are then characterized by a strong Novelty-Resonance association and by long-range memory in the information stream. Results show that these two ‘signatures’ capture different properties of trend reservoirs – information stickiness and multi-scale correlations, respectively – and that both have discriminatory power, i.e. they can actually detect trend reservoirs.
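The post does not spell out the underlying formulas, but the description matches the windowed relative-entropy measures of novelty, transience, and resonance known from the computational humanities literature (e.g. Barron et al.). The sketch below assumes that formulation, plus a topic-mixture representation of posts and a window size w; none of these specifics are confirmed by the text.

    import numpy as np

    def kl(p, q, eps=1e-12):
        """Kullback-Leibler divergence between two probability distributions."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def novelty_resonance(docs, w):
        """Windowed novelty/transience/resonance over a time-ordered stream.

        docs: array of shape (T, K); each row is a probability distribution
              (e.g. a topic mixture) for one time step.
        w:    window size in time steps (an assumed free parameter).
        """
        T = len(docs)
        novelty = np.full(T, np.nan)
        transience = np.full(T, np.nan)
        for j in range(w, T - w):
            # Novelty: how much step j diverges from the preceding window
            novelty[j] = np.mean([kl(docs[j], docs[j - d]) for d in range(1, w + 1)])
            # Transience: how much step j diverges from the following window
            transience[j] = np.mean([kl(docs[j], docs[j + d]) for d in range(1, w + 1)])
        # Resonance: novel content that 'sticks' instead of washing out
        resonance = novelty - transience
        return novelty, transience, resonance

The first signature, the Novelty-Resonance association, can then be estimated as, for example, the slope of a linear fit of resonance against novelty; the second, long-range memory, by a multi-scale measure such as the Hurst exponent of the novelty series.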


Case study: AI discussions on Reddit

To exemplify its application, we apply the model to the social media site Reddit to discover innovative discussions related to artificial intelligence (AI).

Reddit hosts discussions of text posts and web links across hundreds of topic-based communities called “subreddits”, which often target specialized expert audiences. Topically defined discussion is thus an important part of Reddit’s appeal – unlike Twitter’s focus on information dissemination – and the specialized audiences make Reddit a promising source for topical discussions on, for example, internet technology.

The most trending subreddits are discovered by applying the model to a sample of subreddits whose descriptions have the highest overlap with a seed list of AI-related terms. The top 10 most relevant subreddits in terms of content matching can be found in Table 1 in two categories: Leaders and Prospects. Leaders are subreddits with a substantial number of posts (more than 2560), while Prospects can be small but rank highest on content matching.

[Table 1: Top 10 AI-related subreddits by content matching, divided into Leaders and Prospects]
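The post does not specify how the overlap between descriptions and the seed list is computed; the sketch below illustrates one simple option, a seed-term hit rate, with a hypothetical seed list standing in for the actual one.

    # Hypothetical seed list; the actual list used in the study is not given here.
    SEED_TERMS = {"artificial intelligence", "machine learning", "neural network",
                  "deep learning", "natural language processing", "computer vision"}

    def match_score(description: str, seeds=SEED_TERMS) -> float:
        """Crude content matching: the fraction of seed terms that occur
        in a subreddit's description."""
        text = description.lower()
        return sum(term in text for term in seeds) / len(seeds)

    def rank_subreddits(descriptions: dict, top_n: int = 10):
        """Rank {subreddit: description} pairs by match score, best first."""
        ranked = sorted(descriptions.items(),
                        key=lambda kv: match_score(kv[1]), reverse=True)
        return ranked[:top_n]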




We can then use the trend estimation model to classify subreddits as maximally trending versus just trending or not trending. In Table 1, the subreddits in red are classified as maximally trending while those in blue are not. In the Leaders category, a subreddit has to be trending on both signatures, i.e. have a strong Novelty-Resonance association and display long-range memory, to qualify as maximally trending. Prospects can qualify as maximally trending with only one of the two signatures.
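This decision rule is simple enough to state directly in code; the boolean signature flags are assumed to come from the trend model, e.g. by thresholding the association strength and the memory estimate.

    def classify(category: str, strong_nr_association: bool,
                 long_range_memory: bool) -> str:
        """Decision rule from the text: Leaders need both signatures to be
        maximally trending; Prospects qualify with either one."""
        signatures = (strong_nr_association, long_range_memory)
        if category == "Leader":
            hit = all(signatures)
        elif category == "Prospect":
            hit = any(signatures)
        else:
            raise ValueError(f"unknown category: {category}")
        return "maximally trending" if hit else "not maximally trending"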

Leaders are relevant because their many posts hold potentially trending content, while Prospects can be used to discover singular new trends. A recommender engine can be trained with this classifier to identify trending subreddits within any given subject. Such classifications can be extremely useful for decision support, indicating which subreddits to follow for a continuous stream of information on trends in, for example, AI.

After the classification, the content of the most trending subreddits can be explored. To do this, we train a neural embedding model to query the highest-ranking words and their associated words, providing insights into, for example, the contexts in which the technologies are discussed.
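The post does not name the embedding model; a word2vec-style model, for instance via gensim, fits the description. Everything below (the posts variable, the hyperparameters, the query word) is illustrative rather than taken from the study.

    from gensim.models import Word2Vec

    # `posts` is assumed to be a list of tokenized posts from one trending
    # subreddit, e.g. [["tensorflow", "tutorial", "gpu"], ...]
    model = Word2Vec(sentences=posts, vector_size=100, window=5,
                     min_count=5, workers=4)

    # Query a high-ranking word for its nearest neighbours in embedding space
    for word, similarity in model.wv.most_similar("tensorflow", topn=10):
        print(f"{word}\t{similarity:.2f}")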




[Figure 1: Concept graph of words used in r/MachineLearning and their associations]

This method of content exploration produces concept graphs, such as the one above, of words used in the subreddit and their associations. Figure 1 is a concept graph from the trending subreddit r/MachineLearning – we can call the graph TOOL-DIVERSIFICATION. Because data science and machine learning are complicated fields, many classes of tools are necessary to develop state-of-the-art deep learning models, and the concept graph shows three clusters of interest (a sketch of how such a graph can be assembled follows the list):

  • The upper right corner shows important tools related to “Hardware and Cloud” technologies (NVIDIA, GPU, TPU, AWS, server, Ubuntu, Gluon), all characteristic of GPU-accelerated high-performance computing; 
  • The cluster in the center left of the graph is dominated by the most important deep learning “Software Libraries” in Python (TensorFlow, PyTorch, Keras, Theano) and related languages and frameworks (JavaScript, Java, Caffe, MATLAB); 
  • In the lower part, the graph displays “Classes of Problems” (supervised, unsupervised, reinforcement learning). 
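One way such a concept graph could be assembled from the trained embedding model is sketched below; the neighbour count and similarity threshold are hypothetical, and the actual construction behind Figure 1 may differ.

    import networkx as nx

    def concept_graph(model, seed_words, topn=10, min_sim=0.5):
        """Build a word-association graph from a trained embedding model:
        nodes are words, edges connect words with similar embeddings."""
        G = nx.Graph()
        for seed in seed_words:
            if seed not in model.wv:
                continue
            for neighbour, sim in model.wv.most_similar(seed, topn=topn):
                if sim >= min_sim:
                    G.add_edge(seed, neighbour, weight=sim)
        return G

    # Clusters such as "Hardware and Cloud" then correspond to densely
    # connected regions, recoverable with e.g. a modularity partition:
    # communities = nx.algorithms.community.greedy_modularity_communities(G)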

Two further observations can be made. Firstly, “Tutorial” is highly interconnected with all clusters, supporting the observation that tutorials have become one of the primary means of assimilating the diverse tools in the machine learning community. Secondly, software libraries, packages, and frameworks occupy a central position in the graph. They all signify bundles of preexisting code that minimize the amount of programming and hardware understanding required of machine learning enthusiasts.

These observations indicate that the subreddit does not consist solely of professional machine learning developers, but rather constitutes a community of machine learning enthusiasts with a do-it-yourself approach to the field.

This is just one example of the content that can be extracted once the trend estimation model has identified the most trending subreddits within the topic under investigation. The approach to estimating trend reservoirs generalizes to other data sources, such as Twitter, and other data types, such as images.