
Trial lecture

NLP methods for detecting disinformation in social media

by Lars Bungum

Trondheim, May 2021

Overview

  1. Problem Definition
  2. Historical Overview
  3. NLP and OSN Disinformation
  4. Auto-Detection of "virulence"
  5. Environment to mitigate "susceptibility"
  6. Education to make the environment less "conducive"
  7. Conclusion

Disinformation Definition

  • Lack of a clear definition (Choras et al., 2020)
  • A common denominator in definitions of "fake news" and "disinformation" (the terms are often used interchangeably) is that the information is verifiably false, as in the European Commission's definition:

Disinformation is 'verifiably false or misleading information created, presented and disseminated for economic gain or to intentionally deceive the public'.

The Dis- and Misinformation Triangle

  • Rubin (2019) referred to dis/misinformation in digital news as the "fake news problem" and built a conceptual model on top of an epidemiological one

[Figure: the disinformation triangle]

Disinformation Dimensions

Disinformation can be analyzed at different levels, such as:

  • types (forms)
  • provenance
  • motives

False Information Typology

Zannettou and Sirivianos (2018) created a typology of "false information"

  • Types: Fabricated, Propaganda, Conspiracy Theories, Hoaxes, Biased and One-sided, Rumors, Clickbait, and Satire
  • Provenance: Bots, Criminal/Terrorist Organizations, Activist or Political Organizations, Governments, Hidden Paid Posters and State-Sponsored Trolls, Journalists, True Believers and Conspiracy Theorists, and Trolls
  • Motives: Malicious Intent, Influence, Sow Discord, Profit, Passion, and Fun

Historical Overview

  • Ancient Egypt and Rome
  • Gutenberg Press
  • Third Reich

The internet, and especially OSNs, makes dissemination much easier.

Interdisciplinarity

Web of Science Treemap:

[Figure: Web of Science treemap]

Publication Growth

[Figure: number of publications]

Ethical Considerations

Mjaaland (2020):

Monitoring users’ conversations violates the users’ privacy, but is useful for detecting fake news and their source.

NLP and OSN Disinformation

  • The same tools can be used for both good and bad (Grover's creators were very careful about releasing its huge LMs)
  • A detection algorithm can be manipulated (consider SEO and spam-filter evasion)

Cat and Mouse Game

[Image: Tom and Jerry]

Adversarial Machine Learning

  • Research into how machine learning algorithms can be exploited by malicious users, e.g.:
  1. Fact distortion (e.g., exaggeration)
  2. Subject-object reversal (dog bites man / man bites dog; see the toy sketch after this list)
  3. Cause confounding (creating causal relationships not present in the original text)
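
As a toy illustration of attack 2 (not from the lecture): a naive subject-object reversal over simple subject-verb-object sentences. A real attack would rely on a syntactic parser rather than word order.

```python
# Toy sketch of a subject-object reversal perturbation. Assumes plain
# three-word subject-verb-object sentences; real attacks would use a parser.
def reverse_svo(sentence: str) -> str:
    """Naively swap subject and object in a 3-word SVO sentence."""
    subject, verb, obj = sentence.rstrip(".").split()
    return f"{obj} {verb} {subject}."

print(reverse_svo("dog bites man"))  # -> "man bites dog."
```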

Current Frontiers

  • Deepfakes (manipulated speech and video)

Deepfake Challenge

[Image: the famous Obama deepfake]

NLP-Augmented Disinformation Triangle

[Figure: the NLP-augmented disinformation triangle]

How is NLP Relevant?

  • Auto-detection: Classification and regression problems
  • Environment: Graph-theoretic and time-series analysis of spread
  • Education: Descriptive linguistics can elucidate the characteristics of the language of disinformation, thus informing and educating the public

Auto-Detection

  • Aimed at identifying disinformation text
  • Problem framed as classification
  • Algorithms range from Artificial Neural Networks (ANNs) to classic algorithms such as Support Vector Machines (SVMs) and Logistic Regression (a minimal sketch follows below)
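
A minimal sketch of this classification framing, assuming scikit-learn; the texts and labels are invented toy data, not a real dataset:

```python
# Minimal sketch: disinformation detection framed as binary text
# classification with TF-IDF features and Logistic Regression (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Miracle cure suppressed by doctors!",          # toy "fake" example
         "City council approves new budget on Tuesday."] # toy "real" example
labels = [1, 0]  # 1 = disinformation, 0 = legitimate

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["Scientists hide shocking truth!"]))
```

Swapping LogisticRegression for sklearn.svm.LinearSVC gives the SVM variant of the same pipeline.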

Fake News?!

[Figure: GLTR mockup]

GLTR's Three Tests

  1. The probability of the word
  2. The absolute rank of the word
  3. The entropy of the predicted distribution

Hypothesis: humans pick higher-rank (i.e., less probable) words even in low-entropy contexts.
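
A sketch of computing these three per-token statistics with GPT-2, assuming the Hugging Face transformers and torch libraries; this is an illustrative re-implementation in the spirit of GLTR, not GLTR's own code:

```python
# Per-token probability, rank, and predictive entropy under GPT-2,
# mirroring GLTR's three tests.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The quick brown fox jumps over the lazy dog.",
                return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, seq_len, vocab_size)

for pos in range(ids.size(1) - 1):  # score each *next* token
    dist = torch.softmax(logits[0, pos], dim=-1)
    tok = ids[0, pos + 1]
    prob = dist[tok].item()                                       # test 1: probability
    rank = int((dist > dist[tok]).sum()) + 1                      # test 2: absolute rank
    entropy = float(-(dist * dist.clamp_min(1e-12).log()).sum())  # test 3: entropy
    print(f"{tokenizer.decode(int(tok))!r}: p={prob:.4f} rank={rank} H={entropy:.2f}")
```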

GPT-2 Output Detector

GPT-2 Output Detector Dataset

  • A dataset of GPT-2 outputs was created to train detection classifiers

Grover

  • Successively trained with adversarial machine learning, the generator trying to beat the classifier. (No longer openly available.)

Fakebox

Recent Advances in NN LMs

  • RNNs are able to retain context through recurrence
  • LSTMs mitigate vanishing gradients
  • Bi-directional LSTMs include context from both sides
  • Attention in encoder-decoder setups (some words contribute more than others)
  • Transformers (addressing parallelization), with self-attention extended to multi-head attention; see the sketch after this list
  • BERT, RoBERTa, DistilBERT, etc.
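
A minimal sketch of scaled dot-product self-attention, the operation Transformers replicate across multiple heads (PyTorch assumed; the projection names Wq/Wk/Wv are illustrative):

```python
# Scaled dot-product self-attention: every token attends to every other
# token, so the whole sequence can be processed in parallel.
import torch

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / K.size(-1) ** 0.5     # pairwise token similarities
    weights = torch.softmax(scores, dim=-1)  # one attention distribution per token
    return weights @ V                       # weighted mix of value vectors

d_model, d_k, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # torch.Size([5, 8])
```

Multi-head attention runs several such projections in parallel and concatenates the results.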

Other Supervised Approaches

  • Traditional feature-based classifiers on annotated datasets

Datasets

  • Datasets vary in granularity (true/false vs. almost/partially true, etc.), which impacts the nature of the ML task
  • Rubin (2015) defined some criteria for a good dataset, e.g., both true and false instances, verifiability, homogeneity in lengths, a predefined timeframe, language, and culture

Feature Engineering

  • Extracting corpus-based linguistic properties is what is often referred to as "NLP" methods (a sketch of a few such features follows after this list)
  1. Quantitative (counting POS, stop-words, modifiers)
  2. Informality (typographical errors)
  3. Complexity (characters per word/sentence)
  4. Uncertainty (terms that indicate certainty/tendency, e.g., "always")
  5. Non-immediacy (pronouns in 1st/2nd person)
  6. Diversity (unique words)
  7. Sentiment
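
A sketch of a few of these hand-crafted features in plain Python; the word lists are tiny, invented stand-ins for real lexicons:

```python
# Toy extraction of a handful of the feature types listed above.
import re

CERTAINTY = {"always", "never", "certainly"}  # stand-in lexicon
IMMEDIACY = {"i", "we", "you"}                # 1st/2nd-person pronouns

def extract_features(text: str) -> dict:
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "n_words": len(words),                                        # quantitative
        "chars_per_word": sum(map(len, words)) / max(len(words), 1),  # complexity
        "certainty_terms": sum(w in CERTAINTY for w in words),        # (un)certainty
        "pronouns_1st_2nd": sum(w in IMMEDIACY for w in words),       # non-immediacy
        "diversity": len(set(words)) / max(len(words), 1),            # unique-word ratio
    }

print(extract_features("You always know the truth. We never lie!"))
```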

Semi-Supervised

  • Paka et al. (2021) leverage unlabeled data to train a neural attention model that learns more about the language of COVID-19, their object of study
  • Guacho et al. (2018) use tensor decomposition to propagate labels across a document collection (a simplified analogue is sketched after this list)
  • Generally, multi-view learning (distinct feature sets), data augmentation and transfer learning (e.g., fine-tuning language models) can be leveraged for text classification
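
A simplified analogue of the label-propagation idea, not Guacho et al.'s tensor method: embed documents in a low-dimensional space, then spread a few known labels to unlabeled neighbors (scikit-learn assumed; the toy texts and labels are invented):

```python
# Semi-supervised label spreading over low-dimensional document embeddings.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

texts = ["miracle cure hidden from you", "shocking secret cure revealed",
         "council approves new budget", "parliament passes budget bill"]
labels = [1, -1, 0, -1]  # -1 marks unlabeled documents

X = TruncatedSVD(n_components=2).fit_transform(
    TfidfVectorizer().fit_transform(texts))
model = LabelSpreading(kernel="knn", n_neighbors=2).fit(X, labels)
print(model.transduction_)  # inferred labels for all four documents
```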

Unsupervised

  • Homophily theory (people with similar interests are likely to connect) and peer acceptance (Koggalahewa et al., 2020)
  • Q-learning: social attributes (retweets, etc.) as states, with actions as movements between these states
  • Other clustering and visualization algorithms, such as t-SNE, show separation (Lingam et al., 2019); see the sketch after this list
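
A sketch of the t-SNE idea on invented account features; the feature values are random stand-ins, not real OSN data:

```python
# Project (synthetic) account-feature vectors to 2D with t-SNE and inspect
# whether bot-like and human-like accounts fall into separate clusters.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
bots = rng.normal(loc=3.0, size=(50, 8))    # e.g., retweet rate, follower ratio, ...
humans = rng.normal(loc=0.0, size=(50, 8))
X = np.vstack([bots, humans])

coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(coords.shape)  # (100, 2): points to plot and inspect for separation
```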

Network Analysis

  • Several aspects of OSN behavior can be modeled as networks
  • Three dimensions of fake news: "what" (news, comments), "who" (social dimension), and "when" (temporal dimension)
  • Knowledge, stance, and interaction networks can be modeled and embedded in classifiers (Shu et al., 2019); a toy graph sketch follows this list
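
A toy sketch of modeling reshares as a graph and deriving simple spread features for a classifier, assuming the networkx library; the edges are invented:

```python
# Model a reshare cascade as a directed graph and compute spread features.
import networkx as nx

G = nx.DiGraph()  # edge u -> v: user u reshared the story from user v
G.add_edges_from([("a", "src"), ("b", "src"), ("c", "a"), ("d", "c")])

depth = nx.dag_longest_path_length(G)  # how far the cascade travels
print("spreaders:", G.number_of_nodes() - 1, "cascade depth:", depth)
```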

Fake News Characteristics

  • Zhou and Zafarani (2019):
  1. Disseminates further
  2. Engages a larger number of spreaders, who often prove to be:
  • more fully absorbed in the news, and
  • more closely linked within the network.

Descriptive Linguistics

  • Characterizing the linguistic properties of disinformation in terms of, e.g., syntax and vocabulary

Asubiaro and Rubin (2018)

  • False political news tends to:
  • In stories: have fewer words, fewer but lengthier paragraphs; they also contain more slang, swear, and affective words.
  • In headlines: contain more words, punctuation marks, demonstratives, emotiveness and fewer verifiable facts.

Summary

  • Disinformation is present wherever there is information
  • NLP methods can be applied to describe, to detect (classify), and to model network behavior

References

  • de Oliveira, N.R.; Pisa, P.S.; Lopez, M.A.; de Medeiros, D.S.V.; Mattos, D.M.F. Identifying Fake News on Social Networks Based on Natural Language Processing: Trends and Challenges. Information 2021, 12(1), 38.

  • Guacho, G.B.; Abdali, S.; Shah, N.; Papalexakis, E.E. Semi-supervised Content-Based Detection of Misinformation via Tensor Embeddings. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2018, pp. 322–325.

  • Lingam, G.; Rout, R.R.; Somayajulu, D.V.L.N. Adaptive deep Q-learning model for detecting social bots and influential users in online social networks. Applied Intelligence 2019, 49, 3947–3964.

  • Rubin, V.L.; Chen, Y.; Conroy, N.J. Deception detection for news: Three types of fakes. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community; American Society for Information Science: Silver Spring, MD, USA, 6–10 November 2015; p. 83.

  • Shu, K.; Bernard, H.R.; Liu, H. Studying fake news via network analysis: detection and mitigation. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining; Springer, 2019; pp. 43–65.

  • Zhou and Zafarani. Network-based Fake News Detection: A Pattern-driven Approach. ACM SIGKDD Explorations Newsletter, November 2019.

  • European Commission: https://digital-strategy.ec.europa.eu/en/policies/online-disinformation