
Fake News Detector.

LSTM text classifier

Open source · Python & TensorFlow

Summary

Trains an LSTM on a labeled news corpus to classify articles as real or fake. Covers text cleaning, tokenization, model training, and evaluation — with a full write-up in the case study below.

Case study

In this project, I train a Long Short-Term Memory (LSTM) network to detect fake news in a given news corpus. Media companies could use such a model to classify circulating articles as real or fake automatically, enabling faster verification at scale without humans manually reviewing thousands of articles.

Key objectives

  • Apply Python libraries to import and visualize the dataset.
  • Perform exploratory data analysis and plot word clouds.
  • Perform text data cleaning such as removing punctuation and stop words.
  • Understand the concept of a tokenizer.
  • Tokenize and pad the text corpus so sequences can be fed into the deep learning model.
  • Understand the theory and intuition behind recurrent neural networks and LSTM.
  • Build and train the deep learning model.
  • Assess the performance of the trained model.

Problem statement and business case

We live in a world of misinformation and fake news. The goal of this project is to detect fake news using recurrent neural networks.

Natural Language Processing (NLP) converts text into numerical representations. Those numbers train ML models to make predictions.

ML-based fake news detectors are crucial for companies and media to automatically predict whether circulating news is fake or not.

In this case study, I analyze thousands of news text snippets to determine whether each article is fake, walking through the full implementation below.

More broadly: distinguishing real from fake articles without automated tools is slow and error-prone. ML can help organizations preserve trust and act quickly at scale.

Architecture overview

High-level flow: raw headline/text → NLP model → binary authenticity label.

Theory behind recurrent neural networks (RNN) and LSTM

Recurrent neural networks (RNN): what are they?

Feedforward neural networks (vanilla networks) map a fixed-size input, like an image, to a fixed-size output, such as class probabilities. A drawback is that they do not model time dependency or memory across steps.

An RNN is designed to take the temporal dimension into account by maintaining an internal state with a feedback loop — each step can depend on what came before.

RNN architecture

RNNs contain a temporal loop: the hidden layer not only produces an output, but also feeds forward in time. Time is an explicit dimension. That memory of previous time steps is why RNNs fit sequences of text well.

In an RNN, the hidden layer output contributes to the final prediction and feeds back into itself, enabling the network to retain context from earlier positions in the sequence — essential when word order changes meaning.

What makes RNNs unique?

Unlike feedforward networks, RNNs work with variable-length sequences. CNNs and vanilla feedforward nets focus on fixed-size inputs and outputs; RNNs trade that constraint for flexibility on sequential data.

The vanishing gradient problem

In standard RNNs, backpropagation through time can shrink gradients exponentially. Long-range dependencies become hard to learn; earlier timesteps stop receiving useful updates as depth or sequence length grows.
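In symbols (using h_t for the hidden state at step t), the gradient flowing back from step t to an earlier step k is a product of per-step Jacobians; when those factors have norm below 1, the product shrinks exponentially with the gap t − k:

```latex
\frac{\partial \mathcal{L}}{\partial h_k}
  = \frac{\partial \mathcal{L}}{\partial h_t}
    \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
```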

Solution: Long Short-Term Memory (LSTM)

LSTMs introduce gates that regulate information flow, mitigating vanishing gradients and preserving salient context over longer spans — a strong fit for fake-news classification where distant context matters.

LSTM components

  1. Input gate: controls what new information is written into the cell state.
  2. Forget gate: decides what to discard from the previous cell state.
  3. Output gate: decides what to emit as hidden state at each step.

These gates let the model remember or forget selectively, which is why text-heavy classification tasks often prefer LSTM (or related sequence models) over plain RNNs.
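For reference, the standard LSTM formulation makes the three gates explicit (σ is the sigmoid, ⊙ the elementwise product; x_t is the input and h_{t−1} the previous hidden state):

```latex
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad \text{(input gate)}
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad \text{(forget gate)}
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad \text{(output gate)}
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)
```

The additive update to the cell state c_t is what lets gradients flow over long spans without vanishing.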

Implementation and code breakdown

1. Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

2. Load and explore the dataset

data = pd.read_csv("news.csv")
data = data[['title', 'text', 'label']]
data['label'] = data['label'].map({'REAL': 0, 'FAKE': 1})

We load the dataset, focusing on the relevant columns (`title`, `text`, and `label`). Labels are mapped to binary values: `0` for real news and `1` for fake news.

3. Exploratory data analysis (EDA)

data['label'].value_counts().plot(kind='bar', color=['blue', 'orange'])
plt.title("Distribution of Real and Fake News")
plt.xlabel("Label (0: Real, 1: Fake)")
plt.ylabel("Count")
plt.show()

We check the distribution of real vs. fake news articles to understand if the dataset is balanced. Visualizing this distribution helps reveal any potential biases in the dataset.

4. Text preprocessing

Convert text to lowercase

data['text'] = data['text'].str.lower()

Standardizes all text to lowercase, treating words like "News" and "news" as the same word.
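The key objectives also mention removing punctuation and stop words. A minimal sketch of that cleaning step, using a tiny illustrative stop-word subset (in practice you would use a fuller list, e.g. NLTK's):

```python
import string

# Tiny illustrative subset; a real pipeline would use a fuller stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def clean_text(text: str) -> str:
    # Strip punctuation characters, then lowercase and drop stop words.
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean_text("The News, in short: markets are up!"))
# "news short markets up"
```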

Tokenization and padding

tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(data['text'])
sequences = tokenizer.texts_to_sequences(data['text'])
padded_sequences = pad_sequences(sequences, maxlen=500)

  • Tokenization: Converts words into numbers, creating a "vocabulary" that assigns a unique integer to each word.
  • Padding: Standardizes all sequences to a fixed length (500), essential for batch processing in the LSTM model.
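To make the padding step concrete, here is a pure-Python sketch of what `pad_sequences` does with its defaults (zero-padding at the front, and pre-truncating sequences longer than `maxlen`):

```python
# Pure-Python sketch of pad_sequences' default behavior
# (padding='pre', truncating='pre').
def pad(seq, maxlen, value=0):
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq  # left-pad with zeros

print(pad([4, 7, 2], 5))            # [0, 0, 4, 7, 2]
print(pad([1, 2, 3, 4, 5, 6], 5))   # [2, 3, 4, 5, 6]
```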

5. Splitting the dataset

X_train, X_test, y_train, y_test = train_test_split(padded_sequences, data['label'], test_size=0.2, random_state=42)

Splitting the data into training and testing sets allows us to evaluate the model on unseen data, ensuring it generalizes well.

6. Building the LSTM model

model = Sequential([
    Embedding(input_dim=5000, output_dim=64, input_length=500),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Layer breakdown:

  • Embedding layer: Converts each token into a dense vector that captures semantic meaning.
  • LSTM layer: Processes sequences and retains context over time.
  • Dense layer: Outputs a probability value, predicting whether the article is fake or real.

The model uses:

  • Adam optimizer: Adjusts the learning rate during training.
  • Binary cross-entropy loss: Measures model performance in binary classification.
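For reference, binary cross-entropy over N examples with true labels y_i ∈ {0, 1} and predicted probabilities ŷ_i is:

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]
```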

7. Training the model

history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.2)

Training the model over 5 epochs, with a batch size of 64, and validating on 20% of the training data to monitor performance.

8. Model evaluation

y_pred = (model.predict(X_test) > 0.5).astype("int32")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Evaluation metrics:

  • Accuracy: Measures overall prediction correctness.
  • Confusion matrix: Shows true/false positives and negatives.
  • Classification report: Provides precision, recall, and F1-score for each class (real and fake).
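To show how these metrics relate to the confusion-matrix counts, here is a toy illustration computed by hand on ten hypothetical predictions (1 = fake, 0 = real):

```python
# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Confusion-matrix counts.
tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))        # true positives
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))        # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / len(y_true)   # 0.8
precision = tp / (tp + fp)           # 0.8 — of articles flagged fake, how many were
recall = tp / (tp + fn)              # 0.8 — of fake articles, how many were caught
```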

9. Visualizing model performance

plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

This visualization shows how the model's accuracy changes over epochs — helping identify overfitting or underfitting.

Conclusion

By implementing an LSTM network, we created a model that detects fake news with reliable accuracy. This project demonstrates how deep learning can tackle real-world problems like misinformation and contribute to better information verification methods in media. The LSTM architecture, with its memory and ability to retain context, proved ideal for this text-heavy task, showcasing the power of RNNs in NLP.

For deployment, this model could be integrated into web platforms, allowing users to input news articles and receive real-time authenticity predictions.

Details

Organization

Personal

Year

2024

Category

MACHINE-LEARNING

View on GitHub

Tech Stack

Python · TensorFlow · Keras · LSTM · NLP · Tokenization

Impact

Open source · Python & TensorFlow
