COGS 109 Final Project


Introduction

This course includes a group project. It gives students practice formulating a question and hypothesis about modeling and data analysis, deciding what data and analysis are needed to test that hypothesis, locating and analyzing real-world data, reaching a conclusion, and communicating the results in a scientific report format. The focus is on the data, modeling, and analysis: the reasoning behind model selection, where it can take us in the future, and how it applies to scientific and engineering work.

Group formation

The project will start in week 2 (with the paper review), but we will begin forming groups immediately since time is short in the summer session. We will provide a team signup sheet where you can tell us who your chosen team members are or ask to be assigned to a team. If you have not found a group by Thursday of week 1, we will assign you to one. If you are assigned to a team and want to change groups after meeting it, do so as early as possible; we will provide a procedure (a secondary Google form for changing groups).

Group size

Groups should ideally consist of 4-5 students, though we have had single-member groups and groups as large as 7. The extremes pose challenges. With too few students, it is a lot of work per person. With too many, it is difficult for everyone to be sufficiently involved. If you decide on fewer than 4 students, we suggest you plan early and choose an attainable question, dataset, and analysis. If you decide on a larger team, you need to be very organized about who is doing which task and ensure that everyone has a significant role. You may need to choose a more complex problem or put more into the analysis.

Project structure

The entire project will be turned in to a group GitHub repository that we will create for you and give you access to. You will practice version control there, keep your data there, and turn in all the pieces there in the form of Jupyter notebooks and a final video.

The project will consist of multiple checkpoints to help you break down the task into manageable pieces, culminating in a final report, a video presentation that you record, and a group review form for giving feedback on each other's participation. It might look intimidating, but each element actually helps you draft your final report. By the time you get to the second checkpoint, you will have a rough draft complete and will just need to refine it in the last week of the class. We will also be helping with the last assignment to reduce the time demand, so you can put the most effort into the project.

The pieces are as follows:

Project component | Group/Indiv. | Submission type | Due date
Paper review | Group | Google form | Saturday 7/15 at 11:59pm
Project Proposal/Data checkpoint | Group | Jupyter notebook to GitHub | Tues 7/25 at 11:59pm
Checkpoint 2: EDA | Group | Jupyter notebook to GitHub | Tues 8/1 at 11:59pm
Final Report | Group | Jupyter notebook to GitHub | Friday 8/4 at 11:59pm
Video presentation report summary (5-10min) | Group | GitHub (method of choice) | Friday 8/4 at 11:59pm
Group review | Individual | Google form | Friday 8/4 at 11:59pm
Video reviews (depending on timing, either min 3 required or all extra credit, 0.5% each for up to 6% bonus overall) | Individual | Google form | Saturday 8/5 at 11:59pm

 

Previous quarter examples

We will provide some examples of similar projects, along with the old final project, which was an individual project but still provides guidance on the modeling and data analysis path.

 

Description of project parts

1. Previous project review (Due Sat 7/15/2023 at 11:59pm)

Here you will review 2 papers as a team and submit one Google form that asks a few questions about each paper. The questions are designed to orient you to the project and give you a sense of what you are going to do. They will also get you to think about: 1) What will the final report look like in terms of structure? 2) What are some of the questions you can ask? 3) How will you go about answering those questions through the modeling and data analysis structure?

We recommend that each of you individually look over the papers and questions and make notes, then come together as a group to put together the final submission. However, if it works better for you to do it all as a team, that is up to you.

You will be surprised at what you will be able to formulate and refine. While it helps to see what has been done, don't be afraid to think outside the box. This is what good scientists and engineers do in order to take human knowledge and achievement in new directions!

Requirements:

  • Review 2 papers in the Google form as a group
  • Submit 1 form per group, filled out completely
  • The survey will ask you to list your group's team members; please be sure to do that correctly, with names listed as they appear in Canvas

 

2. Project proposal and checkpoint 1: Data (Sat. 7/15/23 at 11:59pm)

In this part of the project you will complete the following, using the format available in the projects GitHub repo for COGS109:

  • Question - Write a draft of your question
  • Hypothesis - Given your question, boil it down to a scientific hypothesis and null hypothesis that will be tested in your study
  • Literature review - Provide a background review of the literature (a draft you will add to as the session progresses)
  • Data - Locate at least some of the data you will probably use and identify where you might find more if needed. Perform initial wrangling and cleaning so that the data is usable, and describe the variables as well as how the dataset will support your analysis/model development
  • Schedule - Create a rough schedule
  • Tasks - Plan out tasks as specifically as possible, as far ahead as you can predict. If some people in the group are great writers, they can take the lead on the writing, and if some are more experienced in coding or data science, they can take the lead on that. In both cases everyone should contribute for experience, perhaps with guidance
  • References - You will need to review the existing literature on your topic and include the most relevant publications as references in APA or MLA format. There will be a literature review section in your final report, and we will discuss how that works.

It is understood that this early on all of this is a draft, and we have a short time to put this together, but the more effort that goes into this part, the more you will cruise through the rest. The idea is to get far enough to have a sense of whether your question makes sense and is answerable with data, or needs refinement.

We will look at your submissions and provide feedback to help you refine the question and your plans for how you will work with and analyze the data (not tear apart what you submit). Asking the right question can mean the difference between a smooth project and a project with many necessary significant changes (such as a totally different dataset).

Given summer session timing we have not done a great deal of review of modeling and data analysis technique yet (at least not the modeling part) and we have kept this in mind in planning the project. You will be adding further specifics as you go, some aspects will change, and this is why we have an exploratory data analysis checkpoint later.

Details of the data checkpoint portion:


For this checkpoint you should have located one or more datasets associated with your project, accessed them, and performed most or all of your data wrangling and cleaning in order to get the data into a usable form for your analysis. You will write up a short description of the data, where it comes from, what it represents, and how it is structured, and you will include your code, well commented.

You are not necessarily expected to have performed much analysis at this point, though you should have done some basic visualization or otherwise shown that the data will be useful for your question/hypothesis. You may decide at this time that you need more data or different data. To complete the checkpoint, you should have a good idea of where to get more data if you need it. Perhaps you already have it but have not completed the secondary cleaning and wrangling, and you have the framework from the first dataset.

The main idea is: get your data, get it wrangled so you can operate on it, and take a quick look at it. Plots are not required, and certainly no significant analysis. But try to have a sense of whether this is the dataset you will stick with. If you need more data, you should have put significant work into getting it, even if it is too late to fully incorporate it into your checkpoint.

Data description requirements: state that the data is structured (or that you can make it structured), what it contains, and any characteristics needed for your data science project.

  • Data description including what the data is, how it supports answering your question, motivation as to why this is the dataset you will be using
  • Data wrangling and code showing you have gotten the data into Python/Jupyter and have it in a usable/accessible form: you have removed missing values and NaNs, the data is rectangular, etc.
  • Next steps – a brief statement of what you plan to do next to perform EDA on the data, but you do not have to have actually done this yet
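
As a rough illustration of the wrangling requirement above, here is a minimal sketch that loads a small table, drops rows with missing values or NaNs, and checks that the result is rectangular. The column names and values are hypothetical, and it uses only the Python standard library so that it runs anywhere; in a notebook you would more likely do the same thing with a data-frame library.

```python
import csv
import io
import math

# Hypothetical raw data: a small CSV with one blank cell and one NaN entry.
RAW_CSV = """subject,age,reaction_time_ms
s01,21,350.5
s02,,410.0
s03,24,NaN
s04,30,298.2
"""

# Hypothetical names of the columns that must hold finite numbers.
NUMERIC_COLUMNS = ("age", "reaction_time_ms")


def load_clean_rows(text):
    """Parse CSV text; keep rows where every numeric field is present and finite."""
    clean = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            values = {col: float(row[col]) for col in NUMERIC_COLUMNS}
        except (TypeError, ValueError):
            continue  # blank or non-numeric cell: drop the row
        if any(math.isnan(v) or math.isinf(v) for v in values.values()):
            continue  # explicit NaN/inf marker: drop the row
        clean.append({"subject": row["subject"], **values})
    return clean


def is_rectangular(rows):
    """True if every row has exactly the same set of columns."""
    return len({tuple(sorted(r)) for r in rows}) <= 1


rows = load_clean_rows(RAW_CSV)
print(len(rows), "usable rows; rectangular:", is_rectangular(rows))
```

Dropping incomplete rows is only one possible policy. Whatever you choose (dropping, imputing, or flagging), describe it and the reason for it in your commented code.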

Requirements:

  • Fill out the project checkpoint included in the GitHub repository on the main page for the course, in the project directory
  • Turn it in to your group GitHub repo by the deadline
  • Data wrangled/cleaned with commented code and description of steps
  • Data description filled in as described above
  • Some descriptive representation of the data - pre-visualization, demonstrate the data is input and processed, ready for use

3. Checkpoint 2: EDA (Tues. 8/1 at 11:59pm)

For this checkpoint, now that you have your data, the goal is to work towards rejecting or failing to reject the null hypothesis and gaining insight about your data science question. Start by exploring the basic statistics of your data: central tendency and variability. If the data has a normal distribution you can use standard (parametric) statistics; otherwise you can explore with nonparametric statistics. Then, given those insights, produce various visualizations, tables, or other ways to 'look at' the data. Finally, as you probably have an idea of what type of modeling you would like to do, execute at least a good portion of the modeling: regression, curve fits, or, if you are doing machine learning, as much of that as possible. It is understood that you may expand this for the final report, which is why we call it 'exploratory' instead of 'super final done from every angle' or similar. Keep in mind that the more you explore your data, the better the picture you will have of the final statements you can make about it.

Overall, consider this a first rough draft of the report; some sections, such as the results and conclusion, might not be fully written yet. You want to explore as far as possible and have a good idea of the last bits of modeling you need to do in order to make and support a statement regarding your question and hypothesis.
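
To make the first steps concrete, here is a minimal sketch of descriptive statistics followed by a simple linear regression. The variables and numbers are invented purely for illustration, and the least-squares fit is written out by hand with the standard library; your own project may call for different summaries or a different model.

```python
import statistics

# Hypothetical measurements, invented purely for illustration.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [52.0, 55.0, 61.0, 64.0, 68.0]


def describe(values):
    """Central tendency and variability for one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


summary = describe(exam_score)
slope, intercept = fit_line(hours_studied, exam_score)
print(summary)
print(f"score ~ {slope:.2f} * hours + {intercept:.2f}")
```

A large gap between the mean and the median is one quick hint that a variable is skewed rather than normally distributed, which is worth checking before choosing between parametric and nonparametric methods.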

 

Requirements:

  • Fill out the project checkpoint included in the GitHub repository on the main page for the course, in the project directory (ProjectCheckpoint_EDACP_109_groupXXX.ipynb)
  • Turn it in to your group GitHub repo by the deadline
  • Data should be explored through basic descriptive statistics, visualizations, hypothesis testing (inferential analysis)
  • You should have completed basic modeling of the data. The above analysis will give insight as to where to take the data, and it's possible that it might take you in an unplanned direction that is not complete by the checkpoint. That's ok, but you should do as much modeling as is reasonable given the timing.
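
For the hypothesis testing requirement, one nonparametric option is a permutation test, sketched below on the difference between two group means. The two samples are hypothetical, and the number of resamples and the 0.05 threshold are assumptions chosen for illustration, not course requirements.

```python
import random
import statistics

# Hypothetical samples for two groups, invented for illustration.
group_a = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]
group_b = [10.2, 10.9, 10.5, 11.1, 10.4, 10.8]


def permutation_test(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the estimated p-value under the null hypothesis that the
    group labels are exchangeable.
    """
    rng = random.Random(seed)  # fixed seed so the notebook is reproducible
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


p_value = permutation_test(group_a, group_b)
print(f"p = {p_value:.4f}; reject null at 0.05: {p_value < 0.05}")
```

The test makes no normality assumption, which is why it fits the nonparametric case; if your data does look normal, a standard parametric test is an equally reasonable choice.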

4. Final report (Sat. 8/5 at 5pm)

Requirements:

  • Fill out the final project notebook included in the GitHub repository on the main page for the course, in the project directory (FinalProject-GroupXXX_109.ipynb)
  • Turn it in to your group GitHub repo by the deadline
  • To the earlier checkpoints you should add your final analysis and modeling, then draw a conclusion about your question. What were your results and what was the outcome of your modeling and data analysis process? Where would you go from here to take it further? What is the next question perhaps?

4B. Final report video portion (Sat. 8/4 at 5pm)

Video instructions: record a 5-8 minute video introducing your project, data, results, and conclusion. A set of slides is encouraged but not a hard requirement; present in whatever way is most effective for your project. It should be planned out, however, so outline it first.

Requirements:

  • Length ~5-8 min, must be less than 10 min
  • Briefly cover the project, question, data, methods, results, and conclusion
  • Slides are useful but not required. Try to avoid jumping around your report: plan and outline the presentation so it is repeatable, and practice a few times
  • Share it as a file, as a link, in your repo, or on YouTube. It's up to you; just make sure we can access it
  • Everyone needs to be involved, but everyone does NOT need to speak or be physically present in the video. We do not require video of your face, but we should see your slides, images, report, or whatever form you are presenting in
  • You have leeway to be creative. Time is limited, but do your best to present as professionally as possible

4C. Group member review (Sat. 8/5 at 5pm)

You will review each other's participation in the project. We will provide a Google form; please also include a summary in your report. These reviews affect each other's grades, and we will consider the comments when deciding each individual's final project grade.

4D. Extra credit video reviews (Mon 8/7 at 5pm)

Watch video presentation summaries of the other group projects and then submit a Google form survey answering various questions. Each form will not take long, and you receive 0.5% per video review, up to 6%. You'll gain additional experience by learning about the techniques other groups used and the challenges they faced.
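
As a quick check of the arithmetic, the sketch below computes the bonus for a given number of reviews. It assumes every review counts toward extra credit; the function and constant names are hypothetical.

```python
BONUS_PER_REVIEW = 0.5  # percentage points per video review (from the text above)
MAX_BONUS = 6.0         # overall cap in percentage points (from the text above)


def review_bonus(num_reviews):
    """Extra credit in percentage points for a given number of video reviews."""
    if num_reviews < 0:
        raise ValueError("num_reviews must be non-negative")
    return min(BONUS_PER_REVIEW * num_reviews, MAX_BONUS)


for n in (3, 12, 15):
    print(n, "reviews ->", review_bonus(n), "% bonus")
```

Under that assumption the cap is reached at 12 reviews, so further reviews add experience but no additional credit.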

 

Ideas for open datasets:

Here you will find several links to open datasets from a variety of sources. You can also locate other public datasets yourself, or use your own datasets if they are stripped of personally identifiable information (PII) and you have the rights to use them.

More info to come...