projects
software, datasets, and research tools.
software
CESSC
causal and non-causal sentence classification dataset and model
CESSC provides a curated dataset and a fine-tuned BERT-based model for binary classification of causal and non-causal sentences within social science texts. The work is connected to the paper (Norouzi et al., 2025).
- Published article
- Fine-tuned BERT model on Hugging Face
- Dataset on Hugging Face
- Dataset on GitHub
- Source code
The release includes 1,000 manually annotated sentences, supplementary machine-labeled sentences, scripts for model fine-tuning and evaluation, and benchmark results.
SocioCausaNet
Multi-task BERT model for joint causal extraction from text
SocioCausaNet is a fine-tuned BERT-based multi-task model that jointly extracts causal relationships from text. It performs three tasks simultaneously: classifying whether sentences contain causal claims, identifying cause and effect spans via BIO tagging, and linking cause-effect pairs with typed relations. The model handles complex patterns including one-to-many and many-to-many cause-effect structures.
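To make the BIO tagging step concrete, the following minimal sketch collapses token-level tags into labeled spans. It is not SocioCausaNet's own code: the function name `bio_to_spans`, the label names `CAUSE` and `EFFECT`, and the example sentence are assumptions chosen for illustration, and the model's actual tag set may differ.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse token-level BIO tags into (label, text) spans.

    Hypothetical helper for illustration; label names are assumed.
    """
    spans: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the span collected so far, if any, and reset the buffer.
        nonlocal current_label, current_tokens
        if current_tokens:
            spans.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open span of the same label.
            current_tokens.append(token)
        else:
            # "O" tags and stray I- tags close the open span.
            flush()
    flush()
    return spans


# Assumed example: one cause linked to two effects (a one-to-many pattern).
example_tokens = ["Poverty", "increases", "crime", "and", "school", "dropout"]
example_tags = ["B-CAUSE", "O", "B-EFFECT", "O", "B-EFFECT", "I-EFFECT"]
example_spans = bio_to_spans(example_tokens, example_tags)
```

Here `example_spans` holds one `CAUSE` span and two `EFFECT` spans. Deciding which cause pairs with which effect, and with what relation type, is the separate linking task described above and is not covered by this sketch.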
- Hugging Face model
- Hugging Face dataset — annotated causal sentence and cause–effect span dataset
- GitHub repository
- Streamlit web app (PDF causal relation miner)
The model is used in production by the MetaCheck tool on ScienceVerse for evaluating randomization and causal claims in scientific reports. Training data includes expert-annotated sentences, and the model supports multiple prediction strategies with adjustable confidence thresholds.
Social Science Construct Harmonization
Benchmarking ML models for merging heterogeneous social science concepts
Social Science Construct Harmonization evaluates machine learning models for concept integration — merging heterogeneous social science terms like “Insomnia” and “Sleeping Disorders” into standardized, unified constructs. The project uses a factorial experimental design testing five vector representation models against three harmonization strategies, with an evaluation framework assessing performance, reliability, and bias.
The companion webapp is a fully client-side tool for clustering conceptual terms and mapping them to the ELSST thesaurus hierarchy. It offers three techniques:
- Pairwise and HDBSCAN (with UMAP dimensionality reduction): with default best-calibrated hyperparameters, these methods cluster related constructs by semantic similarity directly in the embedding space (powered by All-MPNet-Base-v2).
- ELSST Lookup: based on the ELSST thesaurus, this mode standardizes a user’s input query and maps it to the most similar concept within the taxonomy, with an optional two-stage cross-encoder re-ranking pipeline using ms-marco-MiniLM-L-6-v2 for more accurate matching.
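As a rough illustration of clustering by semantic similarity in an embedding space, the sketch below groups terms whose cosine similarity meets a threshold. The webapp's actual pairwise algorithm and hyperparameters are not documented here, so the single-link grouping, the `threshold` value, the function names, and the three-dimensional toy vectors are all assumptions. Real embeddings would come from a sentence encoder such as All-MPNet-Base-v2.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_clusters(
    embeddings: Dict[str, List[float]], threshold: float = 0.8
) -> List[List[str]]:
    """Group terms connected by at least one pair above the threshold.

    Assumed single-link strategy, implemented with union-find.
    """
    terms = list(embeddings)
    parent = {term: term for term in terms}

    def find(term: str) -> str:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(b)] = find(a)  # merge the two clusters

    groups: Dict[str, List[str]] = {}
    for term in terms:
        groups.setdefault(find(term), []).append(term)
    return list(groups.values())


# Toy vectors (made up for illustration, not real model output).
toy_embeddings = {
    "Insomnia": [0.9, 0.1, 0.0],
    "Sleeping Disorders": [0.8, 0.2, 0.1],
    "Income": [0.0, 0.1, 0.9],
}
toy_clusters = pairwise_clusters(toy_embeddings, threshold=0.8)
```

With these toy vectors, "Insomnia" and "Sleeping Disorders" fall into one cluster and "Income" stays on its own. A density-based method such as HDBSCAN replaces the fixed threshold with clusters inferred from local density, which is one reason to offer both.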
Paper Hunter
Human-in-the-loop bulk paper downloader using DOIs, forked from the eScience project
Paper Hunter is a human-in-the-loop tool for bulk-downloading academic papers by DOI, created as a fork in collaboration with the eScience project. It provides a straightforward browser-based interface where researchers paste a list of DOIs and receive direct download links, with smart handling for both open-access and paywalled content.
research
Research Tools in Progress
DAG generation and Meta-Science RAG prototypes
Several research tools are currently in development:
- DAG Generator: automated generation of directed acyclic graphs from causal texts.
- Meta-Science RAG: a retrieval-augmented generation pipeline for supporting theory-building in the social sciences.
These entries are placeholders for ongoing work and will be expanded as the tools become ready to share.
fun
Audio Compress
browser-based tool to compress and split audio files
A browser-based tool that compresses audio files, splits them, or does both. It supports MP3, M4A, OGG, FLAC, WAV, and WebM formats. Nothing is uploaded to a server; everything runs client-side via the Web Audio API.
Sub Translator
SRT subtitle translation tool that runs in the browser
SRT subtitle translator that runs entirely client-side. Upload .srt files, choose source/target languages, and batch-translate via DeepSeek, Google Gemini, or OpenRouter. Side-by-side preview with individual or bulk .zip download.
Live app: rasoulnorouzi.github.io/sub_translator
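Translating an .srt file means separating each cue's text from its index and timing line, translating only the text, and writing the cues back unchanged otherwise. The sketch below shows that round trip in Python. It is not the app's own code (the app runs in the browser): the names `Cue`, `parse_srt`, and `render_srt` are hypothetical, and the call to a translation provider is left out.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start: str
    end: str
    text: str


# Standard SRT timing line, e.g. "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})"
)


def parse_srt(content: str) -> List[Cue]:
    """Split SRT text into cues, skipping malformed blocks."""
    # Drop a leading byte-order mark and normalize Windows line endings.
    content = content.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues: List[Cue] = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content):
        lines = block.split("\n")
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue
        match = TIMING.match(lines[1].strip())
        if match is None:
            continue
        text = "\n".join(lines[2:])
        cues.append(Cue(int(lines[0]), match.group(1), match.group(2), text))
    return cues


def render_srt(cues: List[Cue]) -> str:
    """Serialize cues back to SRT text."""
    blocks = [f"{c.index}\n{c.start} --> {c.end}\n{c.text}" for c in cues]
    return "\n\n".join(blocks) + "\n"


sample = (
    "1\n00:00:01,000 --> 00:00:02,500\nHello there.\n\n"
    "2\n00:00:03,000 --> 00:00:04,000\nHow are you?\n"
)
cues = parse_srt(sample)

# Placeholder for the translation step: a real tool would send cue.text
# to a translation provider; here the text is only upper-cased.
translated = [cue._replace(text=cue.text.upper()) for cue in cues]
```

Keeping indices and timestamps outside the translated payload means the translation step cannot corrupt them, and cue texts can be batched into a single request where the provider allows it.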