High school students compete to predict which colleges “pay off” the most
In the information era, data is quickly becoming currency. Eight of the 10 fastest-growing careers stem from data science, and navigating complex data with ease is increasingly critical for everyday decision-making, from healthcare to personal finance. Regardless of career or life path, every high school graduate will benefit from a strong grounding in the basics of data. The technical skills, iterative problem-solving, and mathematical foundations of data science will empower students to take on the digital age with confidence.
In collaboration with Data Science 4 Everyone at The University of Chicago, North Carolina State University Data Science Academy, and the CourseKata platform, Skew the Script – a nonprofit education initiative led by a collaborative of AP Statistics teachers – organized a national Data Science challenge for high school students, following their AP examinations.
This past school year, approximately 2,500 students from around the country entered the competition to evaluate, through predictive modeling, a critical life question for many graduates: which colleges or universities will give the greatest return on financial investment, or “pay off” the most? The prompt challenged students to build the best possible model for predicting student loan default rates at different colleges across the country. An additional 4,000 students were waitlisted due to server capacity constraints; we hope to add capacity for future years’ competitions so that more students can take part.
Do colleges “pay off”?
Lily Larsen, whose final model placed third in the challenge, was motivated to participate because she hopes to pursue biostatistics in college, and saw the challenge as an exciting intersection between coding and statistics. “It was something that piqued my interest initially; it was just an enjoyable experience overall.” Similarly, Prahit Yaugand, whose model won first place, wanted the opportunity to practice using the programming language R.
Leveraging data from the U.S. Department of Education’s College Scorecard and IPEDS portals, students filtered, analyzed, modeled, visualized, and communicated complex data on U.S. colleges and universities from as recently as last school year. The mission: conquer the college debt question, maximizing financial return and minimizing the likelihood of defaulting on student loans.
Lily’s teacher at Essex High School, Stacey Anthony, thought the prompt was timely and highly relevant to high school students. “We see it on the news all the time – the issues with student loan debt and what’s been going on at the national level with loan forgiveness.” She believes students have to weigh all the data and make the right choices for themselves.
The Department’s College Scorecard data is large and complex. Students were tasked with navigating a portion of the data, focusing on 26 different variables across 4,435 colleges, and were encouraged to choose carefully which information to include in, and which to exclude from, their final submissions. Most students had minimal programming experience, having recently completed an AP Statistics course. Others had minimal statistics experience, having recently completed an AP Computer Science course. And they had two weeks, at most, to find a good answer.
Despite these challenges, over 75% of student participants completed all components of the project. The students’ work and engagement demonstrate that even highly technical data science skills are accessible to high school students. Furthermore, when those skills are used to analyze problems that are genuinely relevant to students, they’ll engage deeply with the investigation process.
For Prahit, figuring out how to effectively combine different models was the most rewarding and challenging aspect. “I had so many ideas, including using a subset method and using different degrees, so I had to find a way to combine all my ideas into one central final model.” And mistakes also led to deeper understanding – Lily made a mistake in one of the notebooks she was working on, but going back through her work to see where she went wrong ended up giving her a much better understanding of that notebook.
Using Jupyter Notebooks and R, an open-source statistical programming language, students carried out complex data analysis that drew on digital technology, advanced algebra, and statistical techniques. To succeed in the challenge, students needed to master linear regression, polynomial regression, and the basics of machine learning (artificial intelligence), along with the intuition of calculus.
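To give a feel for what comparing a linear and a polynomial regression involves, here is a minimal sketch. It is written in Python with NumPy rather than the R the students used, and it runs on made-up synthetic data, not the College Scorecard variables. The function names (`r_squared`, `fit_and_score`) and the 300/100 train/test split are assumptions chosen for illustration; this is not any student's model or the challenge's scoring code. It fits each model on one set of points and then checks how much of the variation it explains on points it has not "seen."

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Share of the variation in y_true explained by y_pred: 1 - SSE / SST."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree on training data, score it on test data."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    return r_squared(y_test, np.polyval(coeffs, x_test))


# Synthetic data (an assumption for illustration): a curved relationship plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=400)
y = 0.5 * (x - 5) ** 2 + rng.normal(0, 1.0, size=400)

# Hold out the last 100 points so each model is scored on data it was not fit to.
x_train, x_test = x[:300], x[300:]
y_train, y_test = y[:300], y[300:]

linear_r2 = fit_and_score(x_train, y_train, x_test, y_test, degree=1)
quadratic_r2 = fit_and_score(x_train, y_train, x_test, y_test, degree=2)

print(f"test R^2, linear model:    {linear_r2:.2f}")
print(f"test R^2, quadratic model: {quadratic_r2:.2f}")
```

On data like this, the straight line misses the curve and the quadratic captures most of it, so the quadratic's test R^2 comes out much higher. Scoring on held-out points matters because a very flexible model can look better on the data it was fit to without predicting new cases any better.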
Prahit’s teacher, Bellamy Liu, says that using real-world data and thinking about how data science is used in students’ own lives is the right approach to data science education. And throughout the challenge, students often collaborated with each other on those real-world questions, which Stacey Anthony was thrilled to see with her students. “Being a little more skeptical, asking questions, figuring out where numbers come from, and how data is collected would help us all become better consumers of data in the media,” Stacey said.
From a field of students representing over 35 states and 75 schools, with a range of academic backgrounds (including course-completers in AP Statistics, AP Computer Science, and AP Calculus), we are pleased to announce our National Data Science Champions:
Note: All students submitted their work anonymously for evaluation. Student names listed as ‘anonymous’ chose to remain anonymous or have not yet submitted permission forms. This post will be updated as permission forms continue to come in.
The top prediction model was submitted by Prahit Yaugand of Mission San Jose High School. Prahit’s model had a final test R^2 value of approximately 0.78. This means that, when predicting student loan default rates for new colleges (schools that the model hadn’t “seen” previously), the model explained roughly 78% of the variation in default rates. This is quite impressive! Here is Prahit’s final model:
We’d also like to congratulate all the runners-up of the challenge, who submitted highly predictive and accurate models for the same task:
Elliott Salpekar, Ithaca High School
Junyoung Sim, Ithaca High School
Kenneth Tsay, Buckingham Browne & Nichols School
Leo Ren, Buckingham Browne & Nichols School
Lei Cao, Ithaca High School
Nikol Miojevic, Ithaca High School
Ashley Park, South High School
Jaya Kolluri, Winsor School
Gunnika Singh, Fort Zumwalt West High School
Xander Black, Ithaca High School
Michael Perelstein, Ithaca High School
Ira Geller, Baltimore Polytechnic Institute
Cosima Billotte Bermudez, Baltimore Polytechnic Institute
Abigail Hartman, Baltimore Polytechnic Institute
Dhilen Mistry, Westlake High School
Diya Muni, Freedom High School
Jamie Wong, Mills High School
Anonymous, John D. O'Bryant School of Math & Science
Sam Him Yuan, John D. O'Bryant School of Math & Science
Alexandros Lambrou, Ithaca High School
Curan Palmer, Georgetown Day School
Chengwu Meng, Mills High School
On behalf of the national organizers, we congratulate each student team that participated in the challenge, choosing to dedicate two weeks at the end of the school year, when time and perseverance are the hardest to find. Each and every one of you gained valuable skills that will stay with you for the next several years and long into your career, in addition to, we hope, learning how to navigate the college landscape. Thank you for your efforts. We know this is just an early preview of what you will accomplish.