Workshop Schedule

September 30, 2026 · Tilburg University

All times are approximate and may shift slightly on the day.

Time · Module · Topics
[FILL] · Welcome and Setup · Introductions, environment check, repo clone, install dependencies
[FILL] · Module 1: Bulk Paper Retrieval · API keys, eScience Center download package, running your first bulk download
[FILL] · Module 2: PDF to Text · Tool tour (PyMuPDF, pdfminer, GROBID), extraction, basic cleaning
[FILL] · ☕ Coffee Break
[FILL] · Module 3: Sentence Extraction · Sentence segmentation with spaCy, preprocessing for model input
[FILL] · Module 4: HuggingFace Intro · Navigating the Hub, model cards, sentence vs. token models, pipeline API
[FILL] · 🍽️ Lunch Break
[FILL] · Module 5: Causal Extraction and Knowledge Graph · SocioCausaNet inference, harmonizing constructs, graph construction, querying the nomological map
[FILL] · Wrap-up and Q&A · Open questions, next steps, how to apply this to your own research

Module Details

Module 1: Bulk Paper Retrieval via API

What you’ll have: A working API setup and a folder of downloaded papers with metadata.

  • Setting up API keys and configuring your environment
  • Introduction to the eScience Center bulk download package
  • Exploring features: search filters, metadata retrieval, download options
  • Running your first bulk download and organizing the output
from paper_downloader import download_papers

# fetch up to 50 papers matching the query, saving PDFs and metadata to ./papers
results = download_papers(
    query="workplace stress wellbeing",
    max_results=50,
    output_dir="./papers"
)

Module 2: PDF to Text Conversion

What you’ll have: Plain text files extracted from your PDFs, cleaned and ready for sentence segmentation.

  • Guided tour of PDF-to-text tools: PyMuPDF (fitz), pdfminer.six, pdfplumber, GROBID
  • What each tool handles well and where each one struggles
  • Running extraction on your downloaded corpus
  • Basic cleaning: removing headers, footers, reference sections, broken line breaks
import fitz  # PyMuPDF

def extract_text(pdf_path):
    # concatenate the plain text of every page into one string
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    doc.close()
    return text
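
The cleaning step is worth a sketch of its own. The heuristics below, cutting everything from a References heading onward and dropping very short lines that are typically running headers or page numbers, are illustrative starting points rather than the workshop's exact cleaning pipeline:

import re

def basic_clean(text):
    # cut everything from a References/Bibliography heading onward
    text = re.split(r"\n\s*(?:references|bibliography)\s*\n", text,
                    flags=re.IGNORECASE)[0]
    # drop very short lines, which are usually running headers or page numbers
    kept = [line for line in text.split("\n") if len(line.strip()) > 3]
    return "\n".join(kept)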

Module 3: Sentence Extraction and Preprocessing

What you’ll have: A clean list of sentences extracted from your paper corpus, ready to feed into SocioCausaNet.

  • Why sentence segmentation matters for causal extraction
  • Using spaCy to split paper text into sentences reliably
  • Handling tricky cases: abbreviations, inline citations, broken sentences from PDF extraction
  • Light preprocessing: stripping noise, normalizing whitespace, filtering short/long sentences
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_sentences(text):
    # spaCy's statistical sentence splitter copes with abbreviations and
    # inline citations far better than naive splitting on periods
    doc = nlp(text)
    sentences = [
        sent.text.strip()
        for sent in doc.sents
        if len(sent.text.strip()) > 30  # drop fragments too short to carry a claim
    ]
    return sentences
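
PDF extraction often hard-wraps sentences mid-line, so it pays to reflow the text before handing it to spaCy. A minimal sketch of that preprocessing, assuming raw_text holds the output of extract_text() from Module 2 and using an illustrative length cap:

import re

MAX_LEN = 600  # illustrative cap: very long "sentences" are usually extraction noise

def reflow(text):
    # rejoin words hyphen-split across lines ("well-\nbeing" -> "wellbeing");
    # a heuristic, since genuine line-end hyphens are removed too
    text = re.sub(r"-\n", "", text)
    # collapse remaining line breaks and whitespace runs into single spaces
    return re.sub(r"\s+", " ", text).strip()

sentences = [
    s for s in extract_sentences(reflow(raw_text))
    if len(s) <= MAX_LEN
]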

Module 4: Introduction to HuggingFace

What you’ll have: A working inference call on SocioCausaNet using the pipeline API.

  • Navigating the HuggingFace Hub: searching, filtering by task, reading model cards
  • What “text classification” and “token classification” mean in practice
  • Sentence-level models: what they produce and when they are useful
  • Token-level models: how they label text word by word
  • Using the pipeline API to run a model in a few lines of code
from transformers import pipeline

# loads the model straight from the Hub; weights download on first use
model = pipeline(
    "text-classification",
    model="rasoultilburg/SocioCausaNet"
)

sentence = "Increased workload leads to higher levels of burnout."
result = model(sentence)
print(result)  # a label and confidence score for the whole sentence
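
For contrast with the sentence-level call above, the same checkpoint can be queried token by token, which is how Module 5 uses it. Each returned entry carries an entity_group label, the matched text span, and a confidence score:

token_model = pipeline(
    "token-classification",
    model="rasoultilburg/SocioCausaNet",
    aggregation_strategy="simple"  # merge B-/I- pieces into whole spans
)

for ent in token_model(sentence):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))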

Module 5: Causal Extraction and Knowledge Graph

What you’ll have: A visual causal knowledge graph built entirely from published scientific claims in your corpus.

  • Running rasoultilburg/SocioCausaNet on your sentence dataset
  • Understanding the BIO-tagged output: B-Cause/I-Cause, B-Effect/I-Effect spans
  • Harmonizing social science constructs: grouping synonymous terms into shared nodes
  • Building the knowledge graph: concepts as nodes, causal relations as directed edges
  • Adding provenance: linking each edge back to the source paper and sentence
  • Visualizing the graph for exploration and presentation
  • Querying the graph: influential constructs, causal chains, research gaps
from transformers import pipeline
import networkx as nx

model = pipeline(
    "token-classification",
    model="rasoultilburg/SocioCausaNet",
    aggregation_strategy="simple"  # merges B-/I- tokens into whole Cause/Effect spans
)

def parse_bio_spans(entities):
    """Extract cause and effect text spans from BIO-tagged entities."""
    causes, effects = [], []
    for ent in entities:
        if "Cause" in ent["entity_group"]:
            causes.append(ent["word"])
        elif "Effect" in ent["entity_group"]:
            effects.append(ent["word"])
    return causes, effects

# concepts as nodes, causal relations as directed edges
G = nx.DiGraph()

sentences = [
    "Increased workload leads to higher levels of burnout.",
    "Social support reduces workplace stress."
]

for sent in sentences:
    entities = model(sent)
    causes, effects = parse_bio_spans(entities)
    for cause in causes:
        for effect in effects:
            G.add_edge(cause, effect, sentence=sent)

# list each node and what it causes; node labels are the raw extracted spans,
# so an exact string like "workload" may not appear verbatim
for node in G.nodes:
    print(node, "->", list(G.successors(node)))
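
Harmonization and querying are sketched below. The CANONICAL map is a hand-made stand-in for the real harmonization step (grouping synonymous constructs is the substantive methodological work), and the two networkx queries illustrate "influential constructs" and "causal chains":

# illustrative synonym map; in practice you would curate this from your corpus
CANONICAL = {
    "increased workload": "workload",
    "workplace stress": "stress",
    "higher levels of burnout": "burnout",
}

def harmonize(term):
    # collapse surface variants onto one shared node label
    key = term.lower().strip()
    return CANONICAL.get(key, key)

H = nx.DiGraph()
for cause, effect, data in G.edges(data=True):
    # provenance travels with the edge: keep the source sentence
    # (and the paper ID, once you track one per sentence)
    H.add_edge(harmonize(cause), harmonize(effect), sentence=data["sentence"])

# influential constructs: nodes with the most outgoing causal claims
print(sorted(H.out_degree(), key=lambda x: x[1], reverse=True)[:5])

# causal chains: everything reachable downstream of "workload"
if "workload" in H:
    print(nx.descendants(H, "workload"))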

Preparation Before the Workshop

Please do the following before arriving so we can start on time:

Important: Required setup
  1. Clone the workshop repository and move into it:

    git clone https://github.com/rasoultilburg/sociocausanet-workshop.git
    cd sociocausanet-workshop
  2. Create a virtual environment, activate it, and install the dependencies:

    python -m venv .venv
    source .venv/bin/activate  # on Windows: .venv\Scripts\activate
    pip install -r requirements.txt
  3. Make sure Jupyter or VS Code with the Jupyter extension is working on your machine.

Tip: Optional but helpful
  • Create a free account at huggingface.co so you can access models directly from the Hub during the workshop.
  • Skim the HuggingFace model card for rasoultilburg/SocioCausaNet so the model is not completely new to you on the day.