
Trial lecture

NLP methods for detecting disinformation in social media

by Lars Bungum

Trondheim, May 2021

Overview

  1. Problem Definition
  2. Historical Overview
  3. NLP and OSN Disinformation
  4. Auto-Detection of "virulence"
  5. Environment to mitigate "susceptibility"
  6. Education to make the environment less "conducive"
  7. Conclusion

Disinformation Definition

  • Lack of a clear definition (Choras et al., 2020)
  • A common denominator in definitions of "fake news" and "disinformation" (the terms are often used interchangeably) is that the information is verifiably false, as in the European Commission's definition:

Disinformation is 'verifiably false or misleading information created, presented and disseminated for economic gain or to intentionally deceive the public'.

The Dis- and Misinformation Triangle

  • Rubin (2019) referred to dis/misinformation in digital news as the "fake news problem" and built a conceptual model on top of an epidemiological one

[Figure: the disinformation triangle]

Disinformation Dimensions

Disinformation can be analyzed at different levels, such as:

  • types (forms)
  • provenance
  • motives

False Information Typology

Zannettou and Sirivianos (2018) created a typology of "false information"

  • Types: Fabricated, Propaganda, Conspiracy Theories, Hoaxes, Biased and One-sided, Rumors, Clickbait, and Satire
  • Provenance: Bots, Criminal/Terrorist Organizations, Activist or Political Organizations, Governments, Hidden Paid Posters and State-Sponsored Trolls, Journalists, True Believers and Conspiracy Theorists, and Trolls
  • Motives: Malicious Intent, Influence, Sow Discord, Profit, Passion, and Fun

Historical Overview

  • Ancient Egypt and Rome
  • Gutenberg Press
  • Third Reich

The internet, and especially OSNs, makes dissemination much easier.

Interdisciplinarity

Web of Science Treemap:

[Figure: Web of Science treemap]

Publication Growth

[Figure: number of publications]

Ethical Considerations

Mjaaland (2020):

Monitoring users’ conversations violates the users’ privacy, but is useful for detecting fake news and their source.

NLP and OSN Disinformation

  • The same tools can be used for both good and bad (Grover's creators were very careful about releasing its huge LMs)
  • A detection algorithm can be manipulated (consider SEO and spam-filter evasion)

Cat and Mouse Game

[Image: Tom and Jerry]

Adversarial Machine Learning

  • Research into how machine learning algorithms can be exploited by malicious users, e.g.:
  1. Fact distortion (e.g., exaggeration)
  2. Subject-object reversal (dog bites man / man bites dog; see the toy sketch after this list)
  3. Cause confounding (creating causal relationships not present in the original text)
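
As a toy illustration of attack 2 (not from the lecture): a naive subject-object reversal over simple subject-verb-object sentences. A real attack would rely on a syntactic parser rather than word order.

```python
# Toy sketch of a subject-object reversal perturbation. Assumes plain
# three-word subject-verb-object sentences; real attacks would use a parser.
def reverse_svo(sentence: str) -> str:
    """Naively swap subject and object in a 3-word SVO sentence."""
    subject, verb, obj = sentence.rstrip(".").split()
    return f"{obj} {verb} {subject}."

print(reverse_svo("dog bites man"))  # -> "man bites dog."
```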

Current Frontiers

  • Deepfakes (manipulated speech and video)

Deepfake Challenge

[Image: the famous Obama deepfake]

NLP-Augmented Disinformation Triangle

[Figure: the NLP-augmented disinformation triangle]

How is NLP Relevant?

  • Auto-detection: Classification and regression problems
  • Environment: Graph-theoretic and time-series analysis of spread
  • Education: Descriptive linguistics can elucidate the characteristics of the language of disinformation, thus informing and educating the public

Auto-Detection

  • Aimed at identifying disinformation text
  • Problem framed as classification
  • Algorithms range from Artificial Neural Networks (ANNs) to classic algorithms such as Support Vector Machines (SVMs) and Logistic Regression (a minimal sketch follows below)
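
A minimal sketch of this classification framing, assuming scikit-learn; the texts and labels are invented toy data, not a real dataset:

```python
# Minimal sketch: disinformation detection framed as binary text
# classification with TF-IDF features and Logistic Regression (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Miracle cure suppressed by doctors!",          # toy "fake" example
         "City council approves new budget on Tuesday."] # toy "real" example
labels = [1, 0]  # 1 = disinformation, 0 = legitimate

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["Scientists hide shocking truth!"]))
```

Swapping LogisticRegression for sklearn.svm.LinearSVC gives the SVM variant of the same pipeline.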

Fake News?!

[Figure: GLTR mockup]

GLTR's Three Tests

  1. The probability of the word
  2. The absolute rank of the word
  3. The entropy of the predicted distribution

Hypothesis: humans pick higher-rank (i.e., less probable) words even in low-entropy contexts.
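
A sketch of computing these three per-token statistics with GPT-2, assuming the Hugging Face transformers and torch libraries; this is an illustrative re-implementation in the spirit of GLTR, not GLTR's own code:

```python
# Per-token probability, rank, and predictive entropy under GPT-2,
# mirroring GLTR's three tests.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The quick brown fox jumps over the lazy dog.",
                return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, seq_len, vocab_size)

for pos in range(ids.size(1) - 1):  # score each *next* token
    dist = torch.softmax(logits[0, pos], dim=-1)
    tok = ids[0, pos + 1]
    prob = dist[tok].item()                                       # test 1: probability
    rank = int((dist > dist[tok]).sum()) + 1                      # test 2: absolute rank
    entropy = float(-(dist * dist.clamp_min(1e-12).log()).sum())  # test 3: entropy
    print(f"{tokenizer.decode(int(tok))!r}: p={prob:.4f} rank={rank} H={entropy:.2f}")
```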

GPT-2 Output Detector

GPT-2 Output Detector Dataset

  • A dataset of GPT-2 outputs was created to train detection classifiers

Grover

  • Successively trained with adversarial machine learning, the generator trying to beat the classifier. (No longer openly available.)

Fakebox

Recent Advances in NN LMs

  • RNNs are able to retain context through recurrence
  • LSTMs mitigate vanishing gradients
  • Bi-directional LSTMs include context from both sides
  • Attention in encoder-decoder setups (some words contribute more than others)
  • Transformers (addressing parallelization), with self-attention extended to multi-head attention; see the sketch after this list
  • BERT, RoBERTa, DistilBERT, etc.
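
A minimal sketch of scaled dot-product self-attention, the operation Transformers replicate across multiple heads (PyTorch assumed; the projection names Wq/Wk/Wv are illustrative):

```python
# Scaled dot-product self-attention: every token attends to every other
# token, so the whole sequence can be processed in parallel.
import torch

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / K.size(-1) ** 0.5     # pairwise token similarities
    weights = torch.softmax(scores, dim=-1)  # one attention distribution per token
    return weights @ V                       # weighted mix of value vectors

d_model, d_k, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # torch.Size([5, 8])
```

Multi-head attention runs several such projections in parallel and concatenates the results.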

Other Supervised Approaches

  • Traditional feature-based classifiers on annotated datasets

Datasets

  • Datasets vary in granularity (true/false vs. almost/partially true, etc.), which impacts the nature of the ML task
  • Rubin (2015) defined some criteria for a good dataset, e.g., both true and false instances, verifiability, homogeneity in lengths, a predefined timeframe, language, and culture

Feature Engineering

  • Extracting corpus-based linguistic properties is what is often referred to as "NLP" methods (a sketch of a few such features follows after this list)
  1. Quantitative (counting POS, stop-words, modifiers)
  2. Informality (typographical errors)
  3. Complexity (characters per word/sentence)
  4. Uncertainty (terms that indicate certainty/tendency, e.g., "always")
  5. Non-immediacy (pronouns in 1st/2nd person)
  6. Diversity (unique words)
  7. Sentiment
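
A sketch of a few of these hand-crafted features in plain Python; the word lists are tiny, invented stand-ins for real lexicons:

```python
# Toy extraction of a handful of the feature types listed above.
import re

CERTAINTY = {"always", "never", "certainly"}  # stand-in lexicon
IMMEDIACY = {"i", "we", "you"}                # 1st/2nd-person pronouns

def extract_features(text: str) -> dict:
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "n_words": len(words),                                        # quantitative
        "chars_per_word": sum(map(len, words)) / max(len(words), 1),  # complexity
        "certainty_terms": sum(w in CERTAINTY for w in words),        # (un)certainty
        "pronouns_1st_2nd": sum(w in IMMEDIACY for w in words),       # non-immediacy
        "diversity": len(set(words)) / max(len(words), 1),            # unique-word ratio
    }

print(extract_features("You always know the truth. We never lie!"))
```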

Semi-Supervised

  • Paka et al. (2021) leverage unlabeled data to train a neural attention model that learns more about the language of COVID-19, their object of study
  • Guacho et al. (2018) use tensor decomposition to propagate labels across a document collection (a simplified analogue is sketched after this list)
  • Generally, multi-view learning (distinct feature sets), data augmentation and transfer learning (e.g., fine-tuning language models) can be leveraged for text classification
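
A simplified analogue of the label-propagation idea, not Guacho et al.'s tensor method: embed documents in a low-dimensional space, then spread a few known labels to unlabeled neighbors (scikit-learn assumed; the toy texts and labels are invented):

```python
# Semi-supervised label spreading over low-dimensional document embeddings.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

texts = ["miracle cure hidden from you", "shocking secret cure revealed",
         "council approves new budget", "parliament passes budget bill"]
labels = [1, -1, 0, -1]  # -1 marks unlabeled documents

X = TruncatedSVD(n_components=2).fit_transform(
    TfidfVectorizer().fit_transform(texts))
model = LabelSpreading(kernel="knn", n_neighbors=2).fit(X, labels)
print(model.transduction_)  # inferred labels for all four documents
```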

Unsupervised

  • Homophily theory (people with similar interests are likely to connect) and peer acceptance (Koggalahewa et al., 2020)
  • Q-learning: social attributes (retweets, etc.) as states, with actions as movements between these states
  • Other clustering and visualization algorithms, such as t-SNE, show separation (Lingam et al., 2019); see the sketch after this list
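
A sketch of the t-SNE idea on invented account features; the feature values are random stand-ins, not real OSN data:

```python
# Project (synthetic) account-feature vectors to 2D with t-SNE and inspect
# whether bot-like and human-like accounts fall into separate clusters.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
bots = rng.normal(loc=3.0, size=(50, 8))    # e.g., retweet rate, follower ratio, ...
humans = rng.normal(loc=0.0, size=(50, 8))
X = np.vstack([bots, humans])

coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(coords.shape)  # (100, 2): points to plot and inspect for separation
```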

Network Analysis

  • Several aspects of OSN behavior can be modeled as networks
  • Three dimensions of fake news: "what" (news, comments), "who" (social dimension), and "when" (temporal dimension)
  • Knowledge, stance, and interaction networks can be modeled and embedded in classifiers (Shu et al., 2019); a toy graph sketch follows this list
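
A toy sketch of modeling reshares as a graph and deriving simple spread features for a classifier, assuming the networkx library; the edges are invented:

```python
# Model a reshare cascade as a directed graph and compute spread features.
import networkx as nx

G = nx.DiGraph()  # edge u -> v: user u reshared the story from user v
G.add_edges_from([("a", "src"), ("b", "src"), ("c", "a"), ("d", "c")])

depth = nx.dag_longest_path_length(G)  # how far the cascade travels
print("spreaders:", G.number_of_nodes() - 1, "cascade depth:", depth)
```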

Fake News Characteristics

  • Zhou and Zafarani (2019):
  1. Disseminates further
  2. Engages a larger number of spreaders, who often prove to be:
  • more fully absorbed in the news, and
  • more closely linked within the network.

Descriptive Linguistics

  • Characterizing the linguistic properties of disinformation in terms of, e.g., syntax and vocabulary

Asubiaro and Rubin (2018)

  • False political news tends to:
  • In stories: have fewer words, fewer but lengthier paragraphs; they also contain more slang, swear, and affective words.
  • In headlines: contain more words, punctuation marks, demonstratives, emotiveness and fewer verifiable facts.

Summary

  • Disinformation is present wherever there is information
  • NLP methods can be applied to describe, to detect (classify), and to model network behavior

References

  • de Oliveira, N.R.; Pisa, P.S.; Lopez, M.A.; de Medeiros, D.S.V.; Mattos, D.M.F. Identifying Fake News on Social Networks Based on Natural Language Processing: Trends and Challenges. Information 2021, 12(1), 38.

  • Guacho, G.B.; Abdali, S.; Shah, N.; Papalexakis, E.E. Semi-supervised Content-Based Detection of Misinformation via Tensor Embeddings. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2018, pp. 322–325.

  • Lingam, G.; Rout, R.R.; Somayajulu, D.V.L.N. Adaptive deep Q-learning model for detecting social bots and influential users in online social networks. Applied Intelligence 2019, 49, 3947–3964.

  • Rubin, V.L.; Chen, Y.; Conroy, N.J. Deception detection for news: Three types of fakes. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community; American Society for Information Science: Silver Spring, MD, USA, 6–10 November 2015; p. 83.

  • Shu, K.; Bernard, H.R.; Liu, H. Studying fake news via network analysis: detection and mitigation. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining; Springer, 2019; pp. 43–65.

  • Zhou and Zafarani. Network-based Fake News Detection: A Pattern-driven Approach. ACM SIGKDD Explorations Newsletter, November 2019.

  • European Commission: https://digital-strategy.ec.europa.eu/en/policies/online-disinformation