AMS 691.01
data analysis

ModelScope Lite

A dataset health report that flags risks before you train.

data-quality, ML, profiling, CSV, education

By

Kommireddy Venkata Sai Ashrit

Semester

Spring 2026

When you get a new CSV dataset, you often discover its problems only after training fails or the results look wrong: imbalanced labels, silent missing-value patterns, duplicated rows, or leaky features that effectively give away the answer. The bottleneck is that students and early practitioners don't know what to check first, and raw profiling tools dump charts and tables without saying clearly what is risky and what to do next.

ModelScope Lite is a small web app where a user uploads a CSV and immediately gets a Dataset Health Report that calls out the top issues in plain English and recommends next steps. The app runs a quick checklist (missing values, duplicates, label distribution, suspicious correlations, basic outlier flags), then generates a concise summary of what is likely to break training and the 3–5 most important fixes to make before modeling.
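The checklist above could be sketched with Pandas roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the function name `profile_dataset`, the returned keys, the 0.95 correlation cutoff, and the 1.5 × IQR outlier rule are all hypothetical choices, and the correlation check assumes the label column is numeric (for example, a 0/1 target).

```python
import pandas as pd


def profile_dataset(df: pd.DataFrame, label: str) -> dict:
    """Run the quick checklist and return plain statistics (no LLM involved)."""
    features = df.drop(columns=[label])
    numeric = features.select_dtypes(include="number")

    # Missing values: fraction of empty cells per column.
    missing_fraction = df.isna().mean().to_dict()

    # Duplicates: rows that are identical to an earlier row.
    duplicate_rows = int(df.duplicated().sum())

    # Label distribution: share of each class.
    label_share = df[label].value_counts(normalize=True).to_dict()

    # Suspicious correlations: numeric features that track the label too closely.
    # Assumption: the label is numeric; 0.95 is an arbitrary illustrative cutoff.
    leaky_features = []
    if pd.api.types.is_numeric_dtype(df[label]):
        corr = numeric.corrwith(df[label]).abs()
        leaky_features = corr[corr > 0.95].index.tolist()

    # Basic outlier flags: values outside 1.5 * IQR, counted per numeric column.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    outlier_counts = outside.sum().astype(int).to_dict()

    return {
        "missing_fraction": missing_fraction,
        "duplicate_rows": duplicate_rows,
        "label_share": label_share,
        "leaky_features": leaky_features,
        "outlier_counts": outlier_counts,
    }
```

Returning a plain dictionary keeps the statistics separate from their interpretation, so the same numbers can back the risk flags, the supporting-stats view, and the generated summary.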

  • Upload a CSV
  • See risk flags and a health score
  • Click an issue to view the supporting stats
  • Get an actionable cleanup checklist
  • Ask follow-up questions like 'Is this dataset safe to train on?' or 'Why might my validation score be too good to be true?'
  • LLM-as-interpreter — converts computed dataset statistics into a structured report with top risks, modeling impact, and recommended actions
  • Q&A over profile results — answers user questions about the dataset profile
  • Hosting: Replit
  • Vibe coding: Cursor or Claude Code
  • LLM: ChatGPT / Claude API
  • Profiling: Python + Pandas
  • UI/backend: Streamlit or FastAPI
  • Optional: n8n (saving reports, shareable links)
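The "LLM-as-interpreter" item above (computed statistics in, structured report out) might look something like the sketch below. It deliberately stops short of calling any provider: it only builds the prompt text and validates the reply, so it works with whichever LLM API is chosen. The function names, the prompt wording, and the assumption that the model replies with a JSON object are hypothetical; the three report keys mirror the "top risks, modeling impact, and recommended actions" named in the list.

```python
import json

# Sections of the structured report, taken from the feature description.
REPORT_KEYS = ("top_risks", "modeling_impact", "recommended_actions")


def build_report_prompt(profile: dict) -> str:
    """Turn computed statistics into a prompt that asks for a structured report."""
    # default=str is a fallback for values json cannot serialize directly.
    stats = json.dumps(profile, indent=2, default=str)
    return (
        "You are reviewing a dataset profile before model training.\n"
        "Use only the statistics below; do not invent numbers.\n"
        f"Reply with a JSON object with exactly these keys: {', '.join(REPORT_KEYS)}.\n"
        "Each value is a list of short plain-English strings.\n\n"
        f"Statistics:\n{stats}"
    )


def parse_report(raw_reply: str) -> dict:
    """Check the model's reply so a malformed report is rejected early."""
    report = json.loads(raw_reply)
    missing = [key for key in REPORT_KEYS if key not in report]
    if missing:
        raise ValueError(f"report is missing keys: {missing}")
    return {key: list(report[key]) for key in REPORT_KEYS}
```

Passing only the computed statistics, rather than the raw CSV, keeps the prompt small and ties every claim in the report to a number the user can inspect. The same statistics could also be supplied as context for the follow-up Q&A.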