ModelScope Lite
A dataset health report that flags risks before you train.
By: Kommireddy Venkata Sai Ashrit
Semester: Spring 2026
Problem
When you get a new CSV dataset, you often discover issues only after training fails: imbalanced labels, silent missing-value patterns, duplicated rows, or leaky features that effectively encode the answer. The bottleneck is that students and early practitioners don't know what to check first, and raw profiling tools dump charts and tables without clearly saying what is risky or what to do next.
Solution
A small web app where a user uploads a CSV and immediately gets a Dataset Health Report that calls out the top issues in plain English and recommends next steps. The app runs a quick checklist (missing values, duplicates, label distribution, suspicious correlations, basic outlier flags), then generates a concise summary of what is likely to break training and the 3–5 most important fixes to make before modeling.
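The checklist stage could be a single Pandas pass over the uploaded frame. This is a minimal sketch, not the final design: the function name, the 0.95 correlation cutoff, and the 3×IQR outlier rule are all assumptions chosen for illustration.

```python
import pandas as pd

def run_checklist(df, label_col=None):
    """Run the quick profiling checklist and return raw stats per check."""
    stats = {}
    # Missing values: fraction missing per column, worst first
    stats["missing"] = df.isna().mean().sort_values(ascending=False).to_dict()
    # Exact duplicate rows
    stats["duplicate_rows"] = int(df.duplicated().sum())
    if label_col is not None:
        # Label distribution as class proportions
        stats["label_distribution"] = df[label_col].value_counts(normalize=True).to_dict()
        if pd.api.types.is_numeric_dtype(df[label_col]):
            # Suspicious correlations: numeric features almost perfectly
            # correlated with the label are likely leaks (threshold assumed)
            corr = df.select_dtypes("number").corr()[label_col].drop(label_col).abs()
            stats["suspicious_corr"] = corr[corr > 0.95].to_dict()
    # Basic outlier flags: values beyond 3 IQRs, per numeric column
    stats["outliers"] = {}
    for col in df.select_dtypes("number"):
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 3 * iqr) | (df[col] > q3 + 3 * iqr)
        stats["outliers"][col] = int(mask.sum())
    return stats
```

Returning plain dicts of aggregates (rather than whole DataFrames) keeps the downstream LLM prompt small and avoids sending raw data to the model.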
User flow
- Upload a CSV
- See risk flags and a health score
- Click an issue to view the supporting stats
- Get an actionable cleanup checklist
- Ask follow-up questions like "Is this dataset safe to train on?" or "Why might my validation score be too good to be true?"
LLM components
- LLM-as-interpreter — converts computed dataset statistics into a structured report with top risks, modeling impact, and recommended actions
- Q&A over profile results — answers user questions about the dataset profile
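The LLM-as-interpreter step could work by serializing the computed stats into a prompt that asks for a structured report. A minimal sketch, assuming a chat-style API; the instruction wording, JSON schema, and client call shown in the comment are assumptions:

```python
import json

REPORT_INSTRUCTIONS = (
    "You are a dataset reviewer. Given the profiling stats below, return JSON "
    "with keys: top_risks (list of strings), modeling_impact (string), "
    "and recommended_actions (3-5 strings)."
)

def build_report_prompt(stats):
    """Turn computed dataset statistics into the interpreter prompt.
    The LLM only sees aggregate stats, never the raw rows."""
    return REPORT_INSTRUCTIONS + "\n\nStats:\n" + json.dumps(stats, indent=2, default=str)

# The actual call would go through the chosen provider, e.g. (assumed usage):
# client.messages.create(model=..., max_tokens=1024,
#     messages=[{"role": "user", "content": build_report_prompt(stats)}])
```

The Q&A component can reuse the same serialized stats as context, appending the user's question instead of the fixed report instructions.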
Tools
- Hosting: Replit
- Vibe coding: Cursor or Claude Code
- LLM: ChatGPT / Claude API
- Profiling: Python + Pandas
- UI: Streamlit or FastAPI
- Optional: n8n (saving reports, shareable links)