Workshop Schedule

September 30, 2026 · Tilburg University

All times are approximate and may shift slightly on the day.

Time · Module · Topics
[FILL] · Welcome and Setup · Introductions, environment check, repo clone, install dependencies
[FILL] · Module 1: Bulk Paper Retrieval · API keys, eScience Center download package, running your first bulk download
[FILL] · Module 2: PDF to Text · Tool tour (PyMuPDF, pdfminer, GROBID), extraction, basic cleaning
[FILL] · ☕ Coffee Break
[FILL] · Module 3: Sentence Extraction · Sentence segmentation with spaCy, preprocessing for model input
[FILL] · Module 4: HuggingFace Intro · Navigating the Hub, model cards, sentence vs. token models, pipeline API
[FILL] · 🍽️ Lunch Break
[FILL] · Module 5: Causal Extraction and Knowledge Graph · SocioCausaNet inference, harmonizing constructs, graph construction, querying the nomological map
[FILL] · Wrap-up and Q&A · Open questions, next steps, how to apply this to your own research

Module Details

Module 1: Bulk Paper Retrieval via API

What you’ll have: A working API setup and a folder of downloaded papers with metadata.

  • Setting up API keys and configuring your environment
  • Introduction to the eScience Center bulk download package
  • Exploring features: search filters, metadata retrieval, download options
  • Running your first bulk download and organizing the output
from paper_downloader import download_papers

# fetch up to 50 papers matching the query, saving PDFs and metadata to ./papers
results = download_papers(
    query="workplace stress wellbeing",
    max_results=50,
    output_dir="./papers"
)

Module 2: PDF to Text Conversion

What you’ll have: Plain text files extracted from your PDFs, cleaned and ready for sentence segmentation.

  • Guided tour of PDF-to-text tools: PyMuPDF (fitz), pdfminer.six, pdfplumber, GROBID
  • What each tool handles well and where each one struggles
  • Running extraction on your downloaded corpus
  • Basic cleaning: removing headers, footers, reference sections, broken line breaks
import fitz  # PyMuPDF

def extract_text(pdf_path):
    # concatenate the plain text of every page into one string
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    doc.close()
    return text
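
The cleaning step is worth a sketch of its own. The heuristics below, cutting everything from a References heading onward and dropping very short lines that are typically running headers or page numbers, are illustrative starting points rather than the workshop's exact cleaning pipeline:

import re

def basic_clean(text):
    # cut everything from a References/Bibliography heading onward
    text = re.split(r"\n\s*(?:references|bibliography)\s*\n", text,
                    flags=re.IGNORECASE)[0]
    # drop very short lines, which are usually running headers or page numbers
    kept = [line for line in text.split("\n") if len(line.strip()) > 3]
    return "\n".join(kept)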

Module 3: Sentence Extraction and Preprocessing

What you’ll have: A clean list of sentences extracted from your paper corpus, ready to feed into SocioCausaNet.

  • Why sentence segmentation matters for causal extraction
  • Using spaCy to split paper text into sentences reliably
  • Handling tricky cases: abbreviations, inline citations, broken sentences from PDF extraction
  • Light preprocessing: stripping noise, normalizing whitespace, filtering short/long sentences
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_sentences(text):
    # spaCy's statistical sentence splitter copes with abbreviations and
    # inline citations far better than naive splitting on periods
    doc = nlp(text)
    sentences = [
        sent.text.strip()
        for sent in doc.sents
        if len(sent.text.strip()) > 30  # drop fragments too short to carry a claim
    ]
    return sentences
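
PDF extraction often hard-wraps sentences mid-line, so it pays to reflow the text before handing it to spaCy. A minimal sketch of that preprocessing, assuming raw_text holds the output of extract_text() from Module 2 and using an illustrative length cap:

import re

MAX_LEN = 600  # illustrative cap: very long "sentences" are usually extraction noise

def reflow(text):
    # rejoin words hyphen-split across lines ("well-\nbeing" -> "wellbeing");
    # a heuristic, since genuine line-end hyphens are removed too
    text = re.sub(r"-\n", "", text)
    # collapse remaining line breaks and whitespace runs into single spaces
    return re.sub(r"\s+", " ", text).strip()

sentences = [
    s for s in extract_sentences(reflow(raw_text))
    if len(s) <= MAX_LEN
]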

Module 4: Introduction to HuggingFace

What you’ll have: A working inference call on SocioCausaNet using the pipeline API.

  • Navigating the HuggingFace Hub: searching, filtering by task, reading model cards
  • What “text classification” and “token classification” mean in practice
  • Sentence-level models: what they produce and when they are useful
  • Token-level models: how they label text word by word
  • Using the pipeline API to run a model in a few lines of code
from transformers import pipeline

# loads the model straight from the Hub; weights download on first use
model = pipeline(
    "text-classification",
    model="rasoultilburg/SocioCausaNet"
)

sentence = "Increased workload leads to higher levels of burnout."
result = model(sentence)
print(result)  # a label and confidence score for the whole sentence
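
For contrast with the sentence-level call above, the same checkpoint can be queried token by token, which is how Module 5 uses it. Each returned entry carries an entity_group label, the matched text span, and a confidence score:

token_model = pipeline(
    "token-classification",
    model="rasoultilburg/SocioCausaNet",
    aggregation_strategy="simple"  # merge B-/I- pieces into whole spans
)

for ent in token_model(sentence):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))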

Module 5: Causal Extraction and Knowledge Graph

What you’ll have: A visual causal knowledge graph built entirely from published scientific claims in your corpus.

  • Running rasoultilburg/SocioCausaNet on your sentence dataset
  • Understanding the BIO-tagged output: B-Cause/I-Cause, B-Effect/I-Effect spans
  • Harmonizing social science constructs: grouping synonymous terms into shared nodes
  • Building the knowledge graph: concepts as nodes, causal relations as directed edges
  • Adding provenance: linking each edge back to the source paper and sentence
  • Visualizing the graph for exploration and presentation
  • Querying the graph: influential constructs, causal chains, research gaps
from transformers import pipeline
import networkx as nx

model = pipeline(
    "token-classification",
    model="rasoultilburg/SocioCausaNet",
    aggregation_strategy="simple"  # merges B-/I- tokens into whole Cause/Effect spans
)

def parse_bio_spans(entities):
    """Extract cause and effect text spans from BIO-tagged entities."""
    causes, effects = [], []
    for ent in entities:
        if "Cause" in ent["entity_group"]:
            causes.append(ent["word"])
        elif "Effect" in ent["entity_group"]:
            effects.append(ent["word"])
    return causes, effects

# concepts as nodes, causal relations as directed edges
G = nx.DiGraph()

sentences = [
    "Increased workload leads to higher levels of burnout.",
    "Social support reduces workplace stress."
]

for sent in sentences:
    entities = model(sent)
    causes, effects = parse_bio_spans(entities)
    for cause in causes:
        for effect in effects:
            G.add_edge(cause, effect, sentence=sent)

# list each node and what it causes; node labels are the raw extracted spans,
# so an exact string like "workload" may not appear verbatim
for node in G.nodes:
    print(node, "->", list(G.successors(node)))
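
Harmonization and querying are sketched below. The CANONICAL map is a hand-made stand-in for the real harmonization step (grouping synonymous constructs is the substantive methodological work), and the two networkx queries illustrate "influential constructs" and "causal chains":

# illustrative synonym map; in practice you would curate this from your corpus
CANONICAL = {
    "increased workload": "workload",
    "workplace stress": "stress",
    "higher levels of burnout": "burnout",
}

def harmonize(term):
    # collapse surface variants onto one shared node label
    key = term.lower().strip()
    return CANONICAL.get(key, key)

H = nx.DiGraph()
for cause, effect, data in G.edges(data=True):
    # provenance travels with the edge: keep the source sentence
    # (and the paper ID, once you track one per sentence)
    H.add_edge(harmonize(cause), harmonize(effect), sentence=data["sentence"])

# influential constructs: nodes with the most outgoing causal claims
print(sorted(H.out_degree(), key=lambda x: x[1], reverse=True)[:5])

# causal chains: everything reachable downstream of "workload"
if "workload" in H:
    print(nx.descendants(H, "workload"))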

Preparation Before the Workshop

Please do the following before arriving so we can start on time:

Important: Required setup
  1. Clone the workshop repository and move into it:

    git clone https://github.com/rasoultilburg/sociocausanet-workshop.git
    cd sociocausanet-workshop
  2. Create a virtual environment, activate it, and install the dependencies:

    python -m venv .venv
    source .venv/bin/activate  # on Windows: .venv\Scripts\activate
    pip install -r requirements.txt
  3. Make sure Jupyter or VS Code with the Jupyter extension is working on your machine.

Tip: Optional but helpful
  • Create a free account at huggingface.co so you can access models directly from the Hub during the workshop.
  • Skim the HuggingFace model card for rasoultilburg/SocioCausaNet so the model is not completely new to you on the day.