Workshop Schedule
September 30, 2026 · Tilburg University
All times are approximate and may shift slightly on the day.
| Time | Module | Topics |
|---|---|---|
| [FILL] | Welcome and Setup | Introductions, environment check, repo clone, install dependencies |
| [FILL] | Module 1: Bulk Paper Retrieval | API keys, eScience Center download package, running your first bulk download |
| [FILL] | Module 2: PDF to Text | Tool tour (PyMuPDF, pdfminer, GROBID), extraction, basic cleaning |
| [FILL] | ☕ Coffee Break | |
| [FILL] | Module 3: Sentence Extraction | Sentence segmentation with spaCy, preprocessing for model input |
| [FILL] | Module 4: HuggingFace Intro | Navigating the Hub, model cards, sentence vs. token models, pipeline API |
| [FILL] | 🍽️ Lunch Break | |
| [FILL] | Module 5: Causal Extraction and Knowledge Graph | SocioCausaNet inference, harmonizing constructs, graph construction, querying the nomological map |
| [FILL] | Wrap-up and Q&A | Open questions, next steps, how to apply this to your own research |
Module Details
Module 1: Bulk Paper Retrieval via API
What you’ll have: A working API setup and a folder of downloaded papers with metadata.
- Setting up API keys and configuring your environment (see the sketch after this list)
- Introduction to the eScience Center bulk download package
- Exploring features: search filters, metadata retrieval, download options
- Running your first bulk download and organizing the output
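How the key is actually supplied depends on the download package you use; a common, package-agnostic pattern is to keep it in an environment variable rather than hard-coding it in a notebook. A minimal sketch (the variable name `PAPER_API_KEY` is just an example, not something the package defines):

```python
import os

# Read the key from an environment variable so it never ends up in the notebook.
# The variable name is illustrative; use whatever your download package expects.
api_key = os.environ.get("PAPER_API_KEY")
if api_key is None:
    raise RuntimeError("Set the PAPER_API_KEY environment variable before downloading.")
```

With the key available, a first bulk download is a single call: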

```python
from paper_downloader import download_papers

# Download up to 50 papers matching the query into ./papers.
results = download_papers(
    query="workplace stress wellbeing",
    max_results=50,
    output_dir="./papers"
)
```

Module 2: PDF to Text Conversion
What you’ll have: Plain text files extracted from your PDFs, cleaned and ready for sentence segmentation.
- Guided tour of PDF-to-text tools: PyMuPDF (fitz), pdfminer.six, pdfplumber, GROBID
- What each tool handles well and where each one struggles
- Running extraction on your downloaded corpus
- Basic cleaning: removing headers, footers, reference sections, broken line breaks (a minimal cleaning sketch follows the extraction code below)

```python
import fitz  # PyMuPDF

def extract_text(pdf_path):
    # Read the PDF and concatenate the plain text of every page.
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text
```
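The basic-cleaning step from the list above is not part of `extract_text`; a minimal, heuristic sketch (the regular expressions here are assumptions and will need tuning per journal and corpus):

```python
import re

def basic_clean(text):
    # Heuristic: drop everything after a "References" or "Bibliography" heading.
    text = re.split(r"\n(?:References|Bibliography)\s*\n", text, maxsplit=1)[0]
    # Re-join words hyphenated across line breaks ("well-\nbeing" -> "wellbeing").
    text = re.sub(r"-\n(?=\w)", "", text)
    # Collapse remaining line breaks and runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Running headers and footers usually need corpus-specific rules (for example, dropping lines that repeat on every page); this is one reason GROBID, which produces structured XML rather than raw text, is worth a look for messier PDFs.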
Module 3: Sentence Extraction and Preprocessing
What you’ll have: A clean list of sentences extracted from your paper corpus, ready to feed into SocioCausaNet.
- Why sentence segmentation matters for causal extraction
- Using spaCy to split paper text into sentences reliably
- Handling tricky cases: abbreviations, inline citations, broken sentences from PDF extraction
- Light preprocessing: stripping noise, normalizing whitespace, filtering short/long sentences
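For the tricky cases above, a light pre-pass before segmentation helps; the sketch below strips simple numeric citation markers and normalizes whitespace (both patterns are assumptions to adapt to your corpus, and parenthetical author-year citations are harder to remove reliably):

```python
import re

def preprocess(text):
    # Drop simple numeric citation markers such as "[12]" or "[3, 14]".
    text = re.sub(r"\[\d+(?:\s*,\s*\d+)*\]", "", text)
    # Normalize whitespace left over from PDF extraction.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Segmentation itself then comes down to a few lines of spaCy: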

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_sentences(text):
    # Segment into sentences and keep only those longer than 30 characters.
    doc = nlp(text)
    sentences = [
        sent.text.strip()
        for sent in doc.sents
        if len(sent.text.strip()) > 30
    ]
    return sentences
```

Module 4: Introduction to HuggingFace
What you’ll have: A working inference call on SocioCausaNet using the pipeline API.
- Navigating the HuggingFace Hub: searching, filtering by task, reading model cards
- What “text classification” and “token classification” mean in practice
- Sentence-level models: what they produce and when they are useful
- Token-level models: how they label text word by word
- Using the pipeline API to run a model in a few lines of code

```python
from transformers import pipeline

# Load the model from the Hub; the first call downloads the weights.
model = pipeline(
    "text-classification",
    model="rasoultilburg/SocioCausaNet"
)

sentence = "Increased workload leads to higher levels of burnout."
result = model(sentence)
print(result)
```

Module 5: Causal Extraction and Knowledge Graph
What you’ll have: A visual causal knowledge graph built entirely from published scientific claims in your corpus.
- Running rasoultilburg/SocioCausaNet on your sentence dataset
- Understanding the BIO-tagged output: B-Cause/I-Cause, B-Effect/I-Effect spans
- Harmonizing social science constructs: grouping synonymous terms into shared nodes (see the sketch after this list)
- Building the knowledge graph: concepts as nodes, causal relations as directed edges
- Adding provenance: linking each edge back to the source paper and sentence
- Graphical presentation and visualization (a small drawing sketch follows the code below)
- Querying the graph: influential constructs, causal chains, research gaps
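The main example below adds raw cause and effect strings straight to the graph; harmonization is the extra step of mapping different surface forms onto one construct node. A minimal sketch, assuming a hand-curated synonym map (the entries here are purely illustrative):

```python
# Illustrative mapping from surface forms to canonical construct names.
CONSTRUCT_MAP = {
    "increased workload": "workload",
    "work overload": "workload",
    "emotional exhaustion": "burnout",
    "job burnout": "burnout",
}

def harmonize(term):
    # Fall back to the normalized term itself if no mapping exists.
    term = term.lower().strip()
    return CONSTRUCT_MAP.get(term, term)
```

In the loop below, you would apply harmonize() to each cause and effect string before calling add_edge, so that synonymous terms collapse into a single node.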

```python
from transformers import pipeline
import networkx as nx

model = pipeline(
    "token-classification",
    model="rasoultilburg/SocioCausaNet",
    aggregation_strategy="simple"
)

def parse_bio_spans(entities):
    """Extract cause and effect text spans from BIO-tagged entities."""
    causes, effects = [], []
    for ent in entities:
        if "Cause" in ent["entity_group"]:
            causes.append(ent["word"])
        elif "Effect" in ent["entity_group"]:
            effects.append(ent["word"])
    return causes, effects

G = nx.DiGraph()
sentences = [
    "Increased workload leads to higher levels of burnout.",
    "Social support reduces workplace stress."
]

for sent in sentences:
    entities = model(sent)
    causes, effects = parse_bio_spans(entities)
    for cause in causes:
        for effect in effects:
            # Store the source sentence on the edge as provenance.
            G.add_edge(cause, effect, sentence=sent)

# Query: what does 'workload' cause?
print(list(G.successors("workload")))
```
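For the graphical presentation step, the graph built above can be drawn directly with networkx and matplotlib. A minimal sketch that continues from the block above (it assumes matplotlib is installed; larger graphs are usually easier to explore interactively, for example in Gephi or pyvis):

```python
import matplotlib.pyplot as plt

pos = nx.spring_layout(G, seed=42)  # deterministic layout for reproducibility
nx.draw_networkx_nodes(G, pos, node_color="lightsteelblue", node_size=1200)
nx.draw_networkx_labels(G, pos, font_size=8)
nx.draw_networkx_edges(G, pos, arrows=True)
plt.axis("off")
plt.show()
```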
Preparation Before the Workshop
Please do the following before arriving so we can start on time:
- Clone the workshop repository:

  ```bash
  git clone https://github.com/rasoultilburg/sociocausanet-workshop.git
  ```

- Create a virtual environment and install dependencies:
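  For example, with the standard library `venv` module (any environment manager works; `.venv` is just a conventional folder name):

  ```bash
  python -m venv .venv
  source .venv/bin/activate    # on Windows: .venv\Scripts\activate
  ```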
  ```bash
  pip install -r requirements.txt
  ```

- Make sure Jupyter or VS Code with the Jupyter extension is working on your machine.
- Create a free account at huggingface.co so you can access models directly from the Hub during the workshop.
- Skim the HuggingFace model card for rasoultilburg/SocioCausaNet so the model is not completely new to you on the day.