ModelScope Lite
A dataset health report that flags risks before you train.
By: Kommireddy Venkata Sai Ashrit
Semester: Spring 2026
Problem
When you get a new CSV dataset, you often discover issues only after training fails: imbalanced labels, silent missing-value patterns, duplicated rows, or leaky features that effectively encode the answer. The bottleneck is that students and early practitioners don't know what to check first, and raw profiling tools dump charts and tables without clearly saying what is risky or what to do next.
Solution
A small web app where a user uploads a CSV and immediately gets a Dataset Health Report that calls out the top issues in plain English and recommends next steps. The app runs a quick checklist (missing values, duplicates, label distribution, suspicious correlations, basic outlier flags), then generates a concise summary of what is likely to break training and the 3–5 most important fixes to make before modeling.
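The checklist stage could be a single Pandas pass over the uploaded frame. This is a minimal sketch, not the final design: the function name, the 0.95 correlation cutoff, and the 3×IQR outlier rule are all assumptions chosen for illustration.

```python
import pandas as pd

def run_checklist(df, label_col=None):
    """Run the quick profiling checklist and return raw stats per check."""
    stats = {}
    # Missing values: fraction missing per column, worst first
    stats["missing"] = df.isna().mean().sort_values(ascending=False).to_dict()
    # Exact duplicate rows
    stats["duplicate_rows"] = int(df.duplicated().sum())
    if label_col is not None:
        # Label distribution as class proportions
        stats["label_distribution"] = df[label_col].value_counts(normalize=True).to_dict()
        if pd.api.types.is_numeric_dtype(df[label_col]):
            # Suspicious correlations: numeric features almost perfectly
            # correlated with the label are likely leaks (threshold assumed)
            corr = df.select_dtypes("number").corr()[label_col].drop(label_col).abs()
            stats["suspicious_corr"] = corr[corr > 0.95].to_dict()
    # Basic outlier flags: values beyond 3 IQRs, per numeric column
    stats["outliers"] = {}
    for col in df.select_dtypes("number"):
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 3 * iqr) | (df[col] > q3 + 3 * iqr)
        stats["outliers"][col] = int(mask.sum())
    return stats
```

Returning plain dicts of aggregates (rather than whole DataFrames) keeps the downstream LLM prompt small and avoids sending raw data to the model.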
User flow
- Upload a CSV
- See risk flags and a health score
- Click an issue to view the supporting stats
- Get an actionable cleanup checklist
- Ask follow-up questions like "Is this dataset safe to train on?" or "Why might my validation score be too good to be true?"
LLM components
- LLM-as-interpreter — converts computed dataset statistics into a structured report with top risks, modeling impact, and recommended actions
- Q&A over profile results — answers user questions about the dataset profile
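The LLM-as-interpreter step could work by serializing the computed stats into a prompt that asks for a structured report. A minimal sketch, assuming a chat-style API; the instruction wording, JSON schema, and client call shown in the comment are assumptions:

```python
import json

REPORT_INSTRUCTIONS = (
    "You are a dataset reviewer. Given the profiling stats below, return JSON "
    "with keys: top_risks (list of strings), modeling_impact (string), "
    "and recommended_actions (3-5 strings)."
)

def build_report_prompt(stats):
    """Turn computed dataset statistics into the interpreter prompt.
    The LLM only sees aggregate stats, never the raw rows."""
    return REPORT_INSTRUCTIONS + "\n\nStats:\n" + json.dumps(stats, indent=2, default=str)

# The actual call would go through the chosen provider, e.g. (assumed usage):
# client.messages.create(model=..., max_tokens=1024,
#     messages=[{"role": "user", "content": build_report_prompt(stats)}])
```

The Q&A component can reuse the same serialized stats as context, appending the user's question instead of the fixed report instructions.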
Tools
- Hosting: Replit
- Vibe coding: Cursor or Claude Code
- LLM: ChatGPT / Claude API
- Profiling: Python + Pandas
- UI: Streamlit or FastAPI
- Optional: n8n (saving reports, shareable links)